{ "cells": [ { "cell_type": "markdown", "id": "d182cc95", "metadata": {}, "source": [ "# TCGA-BRCA - PAM50 Cancer Subtype Classification\n", "\n", "**Cohort Summary:**\n", "\n", "| Stage | Methylation | mRNA | miRNA | Clinical | PAM50 |\n", "| --- | --- | --- | --- | --- | --- |\n", "| **Raw** (features x samples) | 20,107 x 885 | 18,321 x 1,212 | 503 x 1,189 | 1,098 x 18 | 1,087 x 1 |\n", "| **Final aligned** (samples x features) | 769 x 20,106 | 769 x 16,758 | 769 x 354 | 769 x 17 | 769 x 1 |\n", "| **After feature selection** | 769 x 400 | 769 x 200 | 769 x 100 | 769 x 17 | 769 x 1 |\n", "\n", "**Target Definition:** PAM50 Subtype (5-class): LumA (n=419), LumB (n=140), Basal (n=130), Her2 (n=46), and Normal (n=34).\n", "\n", "\n", "### Dataset Source\n", "\n", "- **Omics Data**: [FireBrowse BRCA](http://firebrowse.org/?cohort=BRCA)\n", "- **PAM50 Data**: [TCGAbiolinks](http://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9c0bda23", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pathlib import Path\n", "root = Path(\"/home/vicente/Github/BioNeuralNet/TCGA_BRCA_DATA\")\n", "\n", "mirna_raw = pd.read_csv(root/\"BRCA.miRseq_RPKM_log2.txt\", sep=\"\\t\",index_col=0,low_memory=False) \n", "rna_raw = pd.read_csv(root / \"BRCA.uncv2.mRNAseq_RSEM_normalized_log2.txt\", sep=\"\\t\",index_col=0,low_memory=False)\n", "meth_raw = pd.read_csv(root/\"BRCA.meth.by_mean.data.txt\", sep='\\t',index_col=0,low_memory=False)\n", "clinical_raw = pd.read_csv(root / \"BRCA.clin.merged.picked.txt\",sep=\"\\t\", index_col=0, low_memory=False)\n", "\n", "# display all shapes and first few rows of each dataset\n", "display(mirna_raw.iloc[:3,:5])\n", "display(mirna_raw.shape)\n", "\n", "display(rna_raw.iloc[:3,:5])\n", "display(rna_raw.shape)\n", "\n", "display(meth_raw.iloc[:3,:5])\n", "display(meth_raw.shape)\n", "\n", "display(clinical_raw.iloc[:3,:5])\n", "display(clinical_raw.shape)" ] }, { "cell_type": "markdown", "id": "aacae339", "metadata": {}, "source": [ "## TCGA-BioLink: Pam50\n", "\n", "This section demonstrates how to use the `TCGAbiolinks` R package to access and download molecular subtype data. It begins by ensuring `TCGAbiolinks` is installed, then loads the package. It retrieves PAM50 molecular subtype labels using `TCGAquery_subtype()` and writes them to a CSV file." ] }, { "cell_type": "markdown", "id": "a445601f", "metadata": {}, "source": [ "```R\n", " # Install TCGAbiolinks\n", " if (!requireNamespace(\"TCGAbiolinks\", quietly = TRUE)) {\n", " if (!requireNamespace(\"BiocManager\", quietly = TRUE))\n", " install.packages(\"BiocManager\")\n", " BiocManager::install(\"TCGAbiolinks\")\n", " }\n", "\n", " # Load the library\n", " library(TCGAbiolinks)\n", "\n", " # Download PAM50 subtype labels\n", " pam50_df <- TCGAquery_subtype(tumor = \"BRCA\")[ , c(\"patient\", \"BRCA_Subtype_PAM50\")]\n", " write.csv(pam50_df, file = \"BRCA_PAM50_labels.csv\", row.names = FALSE, quote = FALSE)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "5a6eb71e", "metadata": {}, "outputs": [], "source": [ "pam50 = pd.read_csv(root /\"BRCA_PAM50_labels.csv\",index_col=0)\n", "\n", "display(pam50.iloc[:5,:5])\n", "display(pam50.shape)" ] }, { "cell_type": "markdown", "id": "fd066d50", "metadata": {}, "source": [ "## Data Processing Summary\n", "\n", "1. **Transpose Data:** All raw data (miRNA, RNA, etc.) is flipped so rows represent patients and columns represent features.\n", "2. **Standardize Patient IDs:** Patient IDs in all tables are cleaned to the 12-character TCGA format (e.g., `TCGA-AB-1234`) for matching.\n", "3. **Handle Duplicates:** Duplicate patient rows are averaged in the omics data. The first entry is kept for duplicate patients in the clinical data.\n", "4. **Find Common Patients:** The script identifies the list of patients that exist in *all* datasets.\n", "5. **Sparse Feature Filter**: Remove features where missingness exceeds the imputation limit (20%). \n", "7. **Final Data Aligment:** All data tables are filtered down to *only* this common list of patients.\n", "6. **Impute Missing Values**: Remaining missing data (NaNs) are estimated and filled using mean imputation.\n", "8. **Extract Target:** The `BRCA_Subtype_PAM50` column is pulled from the `pam50` dataframe from the last cell. We assigned each label a numerical value." ] }, { "cell_type": "code", "execution_count": null, "id": "719b6e2f", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.utils import m_transform, impute_knn\n", "from bioneuralnet.utils import data_stats, sparse_filter\n", "\n", "mirna = mirna_raw.T\n", "rna = rna_raw.T\n", "meth = meth_raw.T\n", "clinical = clinical_raw.T\n", "\n", "print(f\"miRNA (samples, features): {mirna.shape}\")\n", "print(f\"RNA (samples, features): {rna.shape}\")\n", "print(f\"Methylation (samples, features): {meth.shape}\")\n", "print(f\"Clinical (samples, features): {clinical.shape}\")\n", "\n", "def trim_barcode(idx):\n", " return idx.to_series().str.slice(0, 12)\n", "\n", "# Standardized patient IDs across all files\n", "meth.index = trim_barcode(meth.index)\n", "rna.index = trim_barcode(rna.index)\n", "mirna.index = trim_barcode(mirna.index)\n", "clinical.index = clinical.index.str.upper()\n", "clinical.index.name = \"Patient_ID\"\n", "\n", "# Convert all data to numeric, coercing errors to NaN\n", "meth = meth.apply(pd.to_numeric, errors='coerce')\n", "rna = rna.apply(pd.to_numeric, errors='coerce')\n", "mirna = mirna.apply(pd.to_numeric, errors='coerce')\n", "\n", "# For any duplicate columns in the omics data -> average their values\n", "meth = meth.groupby(meth.index).mean()\n", "rna = rna.groupby(rna.index).mean()\n", "mirna = mirna.groupby(mirna.index).mean()\n", "\n", "# Drop unwanted columns\n", "clinical.drop(columns=[\"Composite Element REF\"], errors=\"ignore\", inplace=True)\n", "meth.drop(columns=[\"Composite Element REF\"], errors=\"ignore\", inplace=True)\n", "\n", "# For any duplicate rows (patients) in the clinical data -> keep the first occurrence\n", "clinical = clinical[~clinical.index.duplicated(keep='first')]\n", "\n", "# Convert beta values to M-values using bioneuralnet utility with small epsilon to avoid log(0)\n", "meth_m = m_transform(meth, eps=1e-7) \n", "\n", "print(f\"\\nMethylation shape: {meth_m.shape}\")\n", "print(f\"RNA shape: {rna.shape}\")\n", "print(f\"miRNA shape: {mirna.shape}\")\n", "print(f\"Clinical shape: {clinical.shape}\")\n", "\n", "# Standardize column names\n", "for df in [meth_m, rna, mirna]:\n", " df.columns = df.columns.str.replace(r\"\\?\", \"unknown_\", regex=True)\n", " df.columns = df.columns.str.replace(r\"\\|\", \"_\", regex=True)\n", " df.columns = df.columns.str.replace(\"-\", \"_\", regex=False)\n", " df.columns = df.columns.str.replace(r\"_+\", \"_\", regex=True)\n", " df.columns = df.columns.str.strip(\"_\")\n", "\n", "# Patients common across all data files\n", "common_patients = sorted(list(set(meth_m.index) & set(rna.index) & set(mirna.index) & set(clinical.index)))\n", "\n", "# Subset to only common patients\n", "X_meth = meth_m.loc[common_patients].copy()\n", "X_rna = rna.loc[common_patients].copy()\n", "X_mirna = mirna.loc[common_patients].copy()\n", "clinical = clinical.loc[common_patients].copy()\n", "\n", "# Calling data_stats on subset to prevent issues downstream\n", "data_stats(X_mirna, \"miRNA\")\n", "data_stats(X_rna, \"RNA\")\n", "data_stats(X_meth, \"Methylation\")\n", "\n", "# apply the sparse_filter and missing_fraction parameter\n", "X_mirna = sparse_filter(X_mirna, missing_fraction=0.20)\n", "X_rna = sparse_filter(X_rna, missing_fraction=0.20)\n", "X_meth = sparse_filter(X_meth, missing_fraction=0.20)\n", "\n", "targets = pam50['BRCA_Subtype_PAM50'] \n", "mapping_brca = {\n", " 'LumA': 0, \n", " 'Her2': 1, \n", " 'LumB': 2, \n", " 'Basal': 3, \n", " 'Normal': 4\n", "}\n", "Y_label = targets.map(mapping_brca).to_frame(name=\"target\")\n", "\n", "# Secondary aligment after sparse filter\n", "final_patients = sorted(\n", " set(X_meth.index) & \n", " set(X_rna.index) & \n", " set(X_mirna.index) &\n", " set(clinical.index) &\n", " set(Y_label.index)\n", ")\n", "\n", "# subsetting based on secondary alignment\n", "X_meth = X_meth.loc[final_patients]\n", "X_rna = X_rna.loc[final_patients]\n", "X_mirna = X_mirna.loc[final_patients]\n", "clinical = clinical.loc[final_patients]\n", "Y_label = Y_label.loc[final_patients]\n", "\n", "# impute the remaining missing values with knn\n", "X_meth = impute_knn(X_meth, n_neighbors=5)\n", "X_rna = impute_knn(X_rna, n_neighbors=5)\n", "X_mirna = impute_knn(X_mirna, n_neighbors=5)" ] }, { "cell_type": "code", "execution_count": null, "id": "4f35bd67", "metadata": {}, "outputs": [], "source": [ "# Inspect the first 3 rows and 5 colums.\n", "\n", "display(X_meth.iloc[:3,:5])\n", "display(X_meth.shape)\n", "\n", "display(X_rna.iloc[:3,:5])\n", "display(X_rna.shape)\n", "\n", "display(X_mirna.iloc[:3,:5])\n", "display(X_mirna.shape)\n", "\n", "display(clinical.iloc[:3,:5])\n", "display(clinical.shape)\n", "\n", "display(Y_label.value_counts())" ] }, { "cell_type": "markdown", "id": "ec1c688b", "metadata": {}, "source": [ "## Feature Selection\n", "\n", "Unsupervised feature selection was performed across all three omic modalities using Laplacian Score filtering." ] }, { "cell_type": "code", "execution_count": null, "id": "fa70dcca", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.utils import laplacian_score\n", "\n", "meth_lap = laplacian_score(X_meth, n_keep=400)\n", "rna_lap = laplacian_score(X_rna, n_keep=200)\n", "mirna_lap = laplacian_score(X_mirna, n_keep=100)" ] }, { "cell_type": "code", "execution_count": null, "id": "fefc4dc3", "metadata": {}, "outputs": [], "source": [ "# We inspected each of the following categorical featues and built a ordinal mapping\n", "\n", "print(clinical[\"pathologic_stage\"].value_counts())\n", "print(clinical[\"pathology_T_stage\"].value_counts())\n", "print(clinical[\"pathology_N_stage\"].value_counts())\n", "print(clinical[\"pathology_M_stage\"].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "id": "0fa581dd", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from bioneuralnet.utils import preprocess_clinical\n", "\n", "columns_to_drop = [\n", " 'Composite Element REF',\n", " 'tumor_tissue_site',\n", " 'gender',\n", " 'date_of_initial_pathologic_diagnosis',\n", " 'vital_status',\n", " 'days_to_death',\n", " 'days_to_last_followup',\n", " 'days_to_last_known_alive',\n", " 'radiation_therapy'\n", "]\n", "\n", "mappings = {\n", " 'pathologic_stage': {\n", " 'stage i': 1.0, 'stage ia': 1.1, 'stage ib': 1.2,\n", " 'stage ii': 2.0, 'stage iia': 2.1, 'stage iib': 2.2,\n", " 'stage iii': 3.0, 'stage iiia': 3.1, 'stage iiib': 3.2, 'stage iiic': 3.3,\n", " 'stage iv': 4.0, \n", " 'stage x': np.nan, 'nan': np.nan\n", " },\n", " 'pathology_T_stage': {\n", " 't1': 1.0, 't1a': 1.1, 't1b': 1.2, 't1c': 1.3,\n", " 't2': 2.0, 't2a': 2.1, 't2b': 2.2,\n", " 't3': 3.0, 't3a': 3.1,\n", " 't4': 4.0, 't4a': 4.1, 't4b': 4.2, 't4c': 4.3, 't4d': 4.4,\n", " 'tx': np.nan, 'nan': np.nan\n", " },\n", " 'pathology_N_stage': {\n", " 'n0': 0.0, 'n0 (i-)': 0.0, \n", " 'n0 (mol+)': 0.1, 'n0 (i+)': 0.2,\n", " 'n1mi': 0.5,\n", " 'n1': 1.0, 'n1a': 1.1, 'n1b': 1.2, 'n1c': 1.3,\n", " 'n2': 2.0, 'n2a': 2.1, 'n2b': 2.2,\n", " 'n3': 3.0, 'n3a': 3.1, 'n3b': 3.2, 'n3c': 3.3,\n", " 'nx': np.nan, 'nan': np.nan\n", " },\n", " 'pathology_M_stage': {\n", " 'm0': 0.0, 'cm0 (i+)': 0.5, 'm1': 1.0,\n", " 'mx': np.nan, 'nan': np.nan\n", " }\n", "}\n", "continuous_cols = ['years_to_birth', 'number_of_lymph_nodes']\n", "\n", "clinical_numeric = preprocess_clinical(\n", " X=clinical,\n", " drop_columns=columns_to_drop,\n", " continuous_columns=continuous_cols,\n", " ordinal_mappings=mappings)\n", "\n", "\n", "print(f\"\\nNaN count: {clinical_numeric.isna().sum().sum()}\")\n", "print(f\"\\nInf count: {np.isinf(clinical_numeric.values).sum()}\")\n", "print(f\"\\nColumn dtypes:\\n{clinical_numeric.dtypes}\")" ] }, { "cell_type": "markdown", "id": "c2168980", "metadata": {}, "source": [ "## Data Availability\n", "\n", "To facilitate rapid experimentation and reproduction of our results, the fully processed and feature-selected dataset used in this example has been made available directly within the package.\n", "\n", "Users can load this dataset, bypassing all preceding data acquisition, preprocessing, and feature selection steps." ] }, { "cell_type": "code", "execution_count": null, "id": "2d0340f6", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from bioneuralnet.datasets import DatasetLoader\n", "\n", "tgca_brca = DatasetLoader(\"brca\")\n", "print(tgca_brca.shape)\n", "\n", "X_methylation = tgca_brca[\"methylation\"]\n", "X_rna = tgca_brca[\"rna\"]\n", "X_mirna = tgca_brca[\"mirna\"]\n", "clinical_numeric = tgca_brca[\"clinical\"]\n", "Y_labels = tgca_brca[\"target\"]\n", "\n", "display(X_methylation.iloc[:3,:5])\n", "display(X_rna.iloc[:3,:5])\n", "display(X_mirna.iloc[:3,:5])\n", "display(clinical_numeric.iloc[:3,:5])\n", "display(Y_labels.iloc[:3,:5])" ] }, { "cell_type": "markdown", "id": "1d8d776c", "metadata": {}, "source": [ "## Reproducibility and Seeding\n", "\n", "This utility function propagates the seed to all sources of randomness, including `random`, `numpy`, and `torch`." ] }, { "cell_type": "code", "execution_count": null, "id": "a5c6c6c8", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.utils import set_seed\n", "\n", "SEED = 8183\n", "set_seed(SEED)" ] }, { "cell_type": "code", "execution_count": null, "id": "4e4ed4b6", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.network import threshold_network\n", "\n", "omics_brca = pd.concat([X_methylation, X_rna, X_mirna], axis=1)\n", "\n", "threshold_63_22 = threshold_network(omics_brca,b=6.3, k=22)\n", "threshold_67_15 = threshold_network(omics_brca,b=6.7, k=15)" ] }, { "cell_type": "code", "execution_count": null, "id": "042cfe73", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.network import NetworkAnalyzer\n", "\n", "thre_22_analysis = NetworkAnalyzer(threshold_63_22, source_omics=[X_methylation, X_rna, X_mirna])\n", "thre_22_analysis.basic_statistics(0.0001)\n", "thre_22_analysis.hub_analysis(0.0001)\n", "thre_22_analysis.find_strongest_edges(5)\n", "\n", "thre_15_analysis = NetworkAnalyzer(threshold_67_15, source_omics=[X_methylation, X_rna, X_mirna])\n", "thre_15_analysis.basic_statistics(0.00001)\n", "thre_15_analysis.hub_analysis(0.00001)\n", "thre_15_analysis.find_strongest_edges(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "cd4ac913", "metadata": {}, "outputs": [], "source": [ "# A targeted subset of clinical variables is selected. This can be modified as needed for testing/experiments\n", "\n", "clinical_numeric = clinical_numeric.copy()[[\n", " 'histological_type_infiltrating lobular carcinoma',\n", " 'histological_type_infiltrating ductal carcinoma',\n", " 'histological_type_metaplastic carcinoma',\n", " 'race_black or african american',\n", " 'race_white',\n", " 'years_to_birth',\n", " 'pathologic_stage',\n", " 'number_of_lymph_nodes',\n", " 'pathology_N_stage'\n", "]]" ] }, { "cell_type": "markdown", "id": "6f8dc453", "metadata": {}, "source": [ "## Classification using DPMON: Training and Evaluation" ] }, { "cell_type": "code", "execution_count": null, "id": "bdbf3ace", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.downstream_task import DPMON\n", "\n", "output_dir_base = Path(\"/home/vicente/Github/BioNeuralNet/dpmon_gat_results/brca\")\n", "\n", "current_output_dir = output_dir_base / \"threshold_63_22\"\n", "current_output_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "dpmon_params_base = {\n", " \"adjacency_matrix\": threshold_63_22,\n", " \"omics_list\": omics_brca,\n", " \"phenotype_data\": Y_labels,\n", " \"phenotype_col\": \"target\",\n", " \"clinical_data\": clinical_numeric,\n", " \"tune_trials\" : 20,\n", " \"model\": 'GAT',\n", " \"tune\": True, \n", " \"cv\": True, \n", " \"n_folds\": 5,\n", " \"repeat_num\": 5,\n", " \"gpu\": True,\n", " \"seed\": SEED,\n", " \"output_dir\": current_output_dir\n", "}\n", "\n", "gat_dpmon = DPMON(**dpmon_params_base)\n", "gat_predictions, gat_metrics, gat_embeddings = gat_dpmon.run()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e65fdb29", "metadata": {}, "outputs": [], "source": [ "output_dir_base = Path(\"/home/vicente/Github/BioNeuralNet/dpmon_gcn_results/brca\")\n", "\n", "current_output_dir = output_dir_base / \"threshold_63_22\"\n", "current_output_dir.mkdir(parents=True, exist_ok=True)\n", "\n", "dpmon_params_base = {\n", " \"adjacency_matrix\": threshold_63_22,\n", " \"omics_list\": omics_brca,\n", " \"phenotype_data\": Y_labels,\n", " \"phenotype_col\": \"target\",\n", " \"clinical_data\": clinical_numeric,\n", " \"tune_trials\" : 20,\n", " \"model\": 'GCN',\n", " \"tune\": True, \n", " \"cv\": True, \n", " \"n_folds\": 5,\n", " \"repeat_num\": 5,\n", " \"gpu\": True,\n", " \"seed\": SEED,\n", " \"output_dir\": current_output_dir\n", "}\n", "\n", "gcn_dpmon = DPMON(**dpmon_params_base)\n", "gcn_predictions, gcn_metrics, gcn_embeddings = gcn_dpmon.run()\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4ef5b92e", "metadata": {}, "outputs": [], "source": [ "from bioneuralnet.downstream_task import DPMON\n", "\n", "output_dir_base = Path(\"/home/vicente/Github/BioNeuralNet/dpmon_gat_multiple_results/brca\")\n", "all_results_gat = []\n", "\n", "# To test several graphs sequentially. We can loop over the graphs dynamically like this:\n", "\n", "comparison_runs = [\n", " {\"name\": \"threshold_63_22\", \"graph\": threshold_63_22, \"omics\": omics_brca},\n", " {\"name\": \"threshold_67_15\", \"graph\": threshold_67_15, \"omics\": omics_brca},\n", "]\n", "\n", "for run_config in comparison_runs:\n", " graph_name = run_config[\"name\"]\n", " A_full = run_config[\"graph\"]\n", " omics = run_config[\"omics\"]\n", " \n", " current_output_dir = output_dir_base / graph_name\n", " current_output_dir.mkdir(parents=True, exist_ok=True)\n", "\n", " dpmon_params_base = {\n", " \"adjacency_matrix\": A_full,\n", " \"omics_list\": omics,\n", " \"phenotype_data\": Y_labels,\n", " \"phenotype_col\": \"target\",\n", " \"clinical_data\": clinical_numeric,\n", " \"tune_trials\" : 20,\n", " \"model\": 'GAT',\n", " \"tune\": True, \n", " \"cv\": True, \n", " \"n_folds\": 5,\n", " \"repeat_num\":5,\n", " \"gpu\": True,\n", " \"cuda\": 0,\n", " \"seed\": SEED,\n", " \"output_dir\": current_output_dir\n", " }\n", " \n", " dpmon_tunned = DPMON(**dpmon_params_base)\n", " predictions_df, metrics, embeddings = dpmon_tunned.run()\n", "\n", " all_results_gat.append({\n", " \"graph\": graph_name,\n", " \"predictions\": predictions_df,\n", " \"metrics\": metrics,\n", " \"embeddings\": embeddings\n", " })\n", "\n", "# Once all the experiments are done, we can review the results for each graph configuration:\n", "for res in all_results_gat:\n", " graph_name = res[\"graph\"]\n", " graph_metrics = res[\"metrics\"]\n", " \n", " acc_row = graph_metrics.loc[graph_metrics['Metric'] == 'Accuracy'].iloc[0]\n", " f1_macro_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Macro'].iloc[0]\n", " f1_weighted_row = graph_metrics.loc[graph_metrics['Metric'] == 'F1 Weighted'].iloc[0]\n", " recall_row = graph_metrics.loc[graph_metrics['Metric'] == 'Recall'].iloc[0]\n", " precission_row = graph_metrics.loc[graph_metrics['Metric'] == 'Precision'].iloc[0]\n", " auc_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUC'].iloc[0]\n", " aupr_row = graph_metrics.loc[graph_metrics['Metric'] == 'AUPR'].iloc[0]\n", " \n", " acc_avg, acc_std = acc_row['Average'], acc_row['StdDev']\n", " f1_macro_avg, f1_macro_std = f1_macro_row['Average'], f1_macro_row['StdDev']\n", " f1_weighted_avg, f1_weighted_std = f1_weighted_row['Average'], f1_weighted_row['StdDev']\n", " recall_avg, recall_std = recall_row['Average'], recall_row['StdDev']\n", " precision_avg, precision_std = precission_row['Average'], precission_row['StdDev']\n", " auc_avg, auc_std = auc_row['Average'], auc_row['StdDev']\n", " aupr_avg, aupr_std = aupr_row['Average'], aupr_row['StdDev']\n", "\n", " print(f\"\\nResults for: {graph_name}\")\n", " print(f\"Accuracy (Avg +/- Std): {acc_avg:.4f} +/- {acc_std:.4f}\")\n", " print(f\"F1 Macro (Avg +/- Std): {f1_macro_avg:.4f} +/- {f1_macro_std:.4f}\")\n", " print(f\"F1 Weighted (Avg +/- Std): {f1_weighted_avg:.4f} +/- {f1_weighted_std:.4f}\")\n", " print(f\"Recall: {recall_avg:.4f} +/- {recall_std:.4f}\")\n", " print(f\"Precision: {precision_avg:.4f} +/- {precision_std:.4f}\")\n", " print(f\"AUC: {auc_avg:.4f} +/- {auc_std:.4f}\")\n", " print(f\"AUPR: {aupr_avg:.4f} +/- {aupr_std:.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7178390e", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold, ParameterSampler, RepeatedStratifiedKFold\n", "from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score, average_precision_score, precision_score\n", "from sklearn.preprocessing import StandardScaler, label_binarize\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.base import clone\n", "from scipy.stats import loguniform, randint\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from xgboost import XGBClassifier\n", "from sklearn.svm import SVC\n", "\n", "SEED = 8183\n", "y = Y_labels[\"target\"]\n", "X = pd.concat([omics_brca,clinical_numeric],axis=1)\n", "\n", "print(f\"Successfully created X vector with shape: {X.shape}\")\n", "print(f\"Successfully created y vector with shape: {y.shape}\")\n", "\n", "pipe_lr = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', LogisticRegression(solver='lbfgs', max_iter=1000, penalty=None, random_state=SEED))\n", "])\n", "\n", "pipe_mlp = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', MLPClassifier(max_iter=500, early_stopping=True, n_iter_no_change=10, random_state=SEED))\n", "])\n", "\n", "pipe_xgb = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', XGBClassifier(eval_metric='logloss', tree_method='hist', max_bin=128, random_state=SEED))\n", "])\n", "pipe_rf = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', RandomForestClassifier(random_state=SEED))\n", "])\n", "\n", "pipe_svm = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', SVC(probability=True, random_state=SEED))\n", "])\n", "\n", "pipe_dt = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('model', DecisionTreeClassifier(random_state=SEED))\n", "])\n", "\n", "params_lr = {'model__penalty': ['l2'], 'model__C': loguniform(1e-4, 1e2)}\n", "\n", "params_mlp = {\n", " 'model__hidden_layer_sizes': [(100,), (100, 50), (50, 50)],\n", " 'model__activation': ['relu', 'tanh'],\n", " 'model__alpha': loguniform(1e-5, 1e-1),\n", " 'model__learning_rate_init': loguniform(1e-4, 1e-2)\n", "}\n", "params_xgb = {\n", " 'model__n_estimators': randint(50, 200),\n", " 'model__learning_rate': loguniform(0.01, 0.3),\n", " 'model__max_depth': randint(3, 7),\n", " 'model__subsample': [0.8, 1.0], \n", " 'model__colsample_bytree': [0.8, 1.0]\n", "}\n", "params_rf = {\n", " 'model__n_estimators': randint(100, 300),\n", " 'model__max_depth': [10, 20, 30, None],\n", " 'model__min_samples_split': randint(2, 10),\n", " 'model__min_samples_leaf': randint(1, 5),\n", " 'model__max_features': ['sqrt', 'log2']\n", "}\n", "params_svm = {\n", " 'model__C': loguniform(1e-2, 1e3),\n", " 'model__kernel': ['rbf', 'linear'],\n", " 'model__gamma': ['scale', 'auto']\n", "}\n", "\n", "params_dt = {\n", " 'model__max_depth': randint(3, 15),\n", " 'model__min_samples_split': randint(2, 20),\n", " 'model__criterion': ['gini', 'entropy']\n", "}\n", "\n", "models_to_tune = {\n", " \"LogisticRegression\": (pipe_lr, params_lr),\n", " \"SVM\": (pipe_svm, params_svm),\n", " \"MLP\": (pipe_mlp, params_mlp),\n", " \"XGBoost\": (pipe_xgb, params_xgb),\n", " \"RandomForest\": (pipe_rf, params_rf),\n", " \"DecisionTree\": (pipe_dt, params_dt),\n", "}\n", "\n", "all_results = {\n", " \"LogisticRegression\": {\"acc\": [], \"f1_w\": [], \"f1_m\": [], \"recall\": [], \"auc\": [], \"precision\": [], \"aupr\": []},\n", " \"MLP\": {\"acc\": [], \"f1_w\": [], \"f1_m\": [], \"recall\": [], \"auc\": [], \"precision\": [], \"aupr\": []},\n", " \"XGBoost\": {\"acc\": [], \"f1_w\": [], \"f1_m\": [], \"recall\": [], \"auc\": [], \"precision\": [], \"aupr\": []},\n", " \"RandomForest\": {\"acc\": [], \"f1_w\": [], \"f1_m\": [], \"recall\": [], \"auc\": [], \"precision\": [], \"aupr\": []},\n", " \"SVM\": {\"acc\": [], \"f1_w\": [], \"f1_m\": [], \"recall\": [], \"auc\": [], \"precision\": [], \"aupr\": []},\n", " \"DecisionTree\": {\"acc\": [], \"f1_w\": [], \"f1_m\": [], \"recall\": [], \"auc\": [], \"precision\": [], \"aupr\": []},\n", "}\n", "\n", "\n", "\n", "for model_name, (pipeline, param_dist) in models_to_tune.items():\n", " print(f\"Evaluating model with nested CV: {model_name}\")\n", " \n", " outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=SEED)\n", "\n", " # Inner fold loop is for finding the best hyperparameters per fold.\n", " for fold_idx, (train_idx, test_idx) in enumerate(outer_cv.split(X, y), start=1):\n", " X_train_outer, X_test_outer = X.iloc[train_idx], X.iloc[test_idx]\n", " y_train_outer, y_test_outer = y.iloc[train_idx], y.iloc[test_idx]\n", " inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)\n", " \n", " best_score_fold = -np.inf\n", " best_params_fold = None\n", "\n", " # A fixed seed = same hyperparamters each fold.\n", " # No seed = different hyperparameters each fold. \n", " # This adds more randomness, and may yield better generalization.\n", " param_sampler = list(ParameterSampler(param_dist, n_iter=20))\n", " \n", " for params in param_sampler:\n", " inner_scores = []\n", " \n", " for inner_train_idx, inner_val_idx in inner_cv.split(X_train_outer, y_train_outer):\n", " X_train_inner = X_train_outer.iloc[inner_train_idx]\n", " X_val_inner = X_train_outer.iloc[inner_val_idx]\n", " y_train_inner = y_train_outer.iloc[inner_train_idx]\n", " y_val_inner = y_train_outer.iloc[inner_val_idx]\n", " \n", " inner_pipeline = clone(pipeline)\n", " inner_pipeline.set_params(**params)\n", " inner_pipeline.fit(X_train_inner, y_train_inner)\n", " \n", " y_val_pred = inner_pipeline.predict(X_val_inner)\n", " score = f1_score(y_val_inner, y_val_pred, average='macro', zero_division=0)\n", " inner_scores.append(score)\n", " \n", " mean_score = np.mean(inner_scores)\n", " if mean_score > best_score_fold:\n", " best_score_fold = mean_score\n", " best_params_fold = params\n", " \n", " print(f\"Outer fold {fold_idx}: best params (inner CV F1-M={best_score_fold:.4f})\")\n", " print(f\"{best_params_fold}\")\n", "\n", " final_pipeline = clone(pipeline)\n", " final_pipeline.set_params(**best_params_fold)\n", " final_pipeline.fit(X_train_outer, y_train_outer)\n", " \n", " preds = final_pipeline.predict(X_test_outer)\n", " \n", " if hasattr(final_pipeline, \"predict_proba\"):\n", " proba = final_pipeline.predict_proba(X_test_outer)\n", " else:\n", " proba = None\n", " \n", " acc = accuracy_score(y_test_outer, preds)\n", " f1_w = f1_score(y_test_outer, preds, average='weighted', zero_division=0)\n", " f1_m = f1_score(y_test_outer, preds, average='macro', zero_division=0)\n", " recall = recall_score(y_test_outer, preds, average='macro', zero_division=0)\n", " precision = precision_score(y_test_outer, preds, average='macro', zero_division=0)\n", " \n", " auc = np.nan\n", " aupr = np.nan\n", " \n", " if proba is not None:\n", " try:\n", " if len(np.unique(y)) == 2:\n", " auc = roc_auc_score(y_test_outer, proba[:, 1])\n", " aupr = average_precision_score(y_test_outer, proba[:, 1])\n", " else:\n", " auc = roc_auc_score(y_test_outer, proba, multi_class='ovr', average='macro')\n", " y_test_bin = label_binarize(y_test_outer, classes=np.unique(y))\n", " aupr = average_precision_score(y_test_bin, proba, average='macro')\n", " except Exception:\n", " auc = np.nan\n", " aupr = np.nan\n", "\n", " print(f\"Fold {fold_idx} results: Acc={acc:.4f}, F1-W={f1_w:.4f}, \"\n", " f\"F1-M={f1_m:.4f}, Recall={recall:.4f}, Precision={precision:.4f}, AUC={auc:.4f}, AUPR={aupr:.4f}\")\n", " \n", " all_results[model_name][\"acc\"].append(acc)\n", " all_results[model_name][\"f1_w\"].append(f1_w)\n", " all_results[model_name][\"f1_m\"].append(f1_m)\n", " all_results[model_name][\"recall\"].append(recall)\n", " all_results[model_name][\"precision\"].append(precision)\n", " all_results[model_name][\"auc\"].append(auc)\n", " all_results[model_name][\"aupr\"].append(aupr)\n", "\n", "print(\"\\nFINAL BASELINE RESULTS\\n\")\n", "for model_name, metrics in all_results.items():\n", " avg_acc = np.mean(metrics[\"acc\"])\n", " std_acc = np.std(metrics[\"acc\"])\n", " avg_f1_w = np.mean(metrics[\"f1_w\"])\n", " std_f1_w = np.std(metrics[\"f1_w\"])\n", " avg_f1_m = np.mean(metrics[\"f1_m\"])\n", " std_f1_m = np.std(metrics[\"f1_m\"])\n", " avg_recall = np.mean(metrics[\"recall\"])\n", " std_recall = np.std(metrics[\"recall\"])\n", " avg_prec = np.mean(metrics[\"precision\"])\n", " std_prec = np.std(metrics[\"precision\"])\n", " avg_auc = np.nanmean(metrics[\"auc\"])\n", " std_auc = np.nanstd(metrics[\"auc\"])\n", " avg_aupr = np.nanmean(metrics[\"aupr\"])\n", " std_aupr = np.nanstd(metrics[\"aupr\"])\n", " \n", " print(f\"\\n{model_name}:\")\n", " print(f\"Accuracy: {avg_acc:.4f} +/- {std_acc:.4f}\")\n", " print(f\"Precision: {avg_prec:.4f} +/- {std_prec:.4f}\")\n", " print(f\"F1 Weighted: {avg_f1_w:.4f} +/- {std_f1_w:.4f}\")\n", " print(f\"F1 Macro: {avg_f1_m:.4f} +/- {std_f1_m:.4f}\")\n", " print(f\"Recall: {avg_recall:.4f} +/- {std_recall:.4f}\")\n", " print(f\"AUC: {avg_auc:.4f} +/- {std_auc:.4f}\")\n", " print(f\"AUPR: {avg_aupr:.4f} +/- {std_aupr:.4f}\")\n" ] } ], "metadata": { "kernelspec": { "display_name": ".enviroment", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 5 }