Datasets Guide¶
BioNeuralNet ships with several built-in multi-omics datasets that can be loaded via convenience functions or through the bioneuralnet.datasets.DatasetLoader class.
Each dataset is loaded as a collection of pandas.DataFrame objects:
Keys are table names (e.g.,
"rna","mirna","clinical","target").Values are the corresponding data tables.
In BioNeuralNet, rows represent subjects/patients and columns represent omics features or related variables.
Feature Selection Summary¶
To address high dimensionality and isolate the most informative variables, unsupervised feature selection was performed across all cohorts using Laplacian Score filtering. This methylationod evaluates each feature based on its ability to preserve the local manifold structure of the data, emphasizing features that vary minimally between closely related samples. Lower Laplacian Scores indicate higher feature importance.
The following feature counts were retained per modality across BRCA, LGG, and KIPAN cohorts:
Modality |
Features Retained |
Cohorts Applied |
|---|---|---|
DNA methylation |
400 |
BRCA, LGG, KIPAN |
mRNA |
200 |
BRCA, LGG, KIPAN |
miRNA |
100 |
BRCA, LGG, KIPAN |
For a full list of available feature selection methylationods, see the Preprocessing Utilities documentation.
Quick Usage¶
You can use either the convenience loader functions or the lower-level DatasetLoader:
from bioneuralnet.datasets import (
load_example,
load_monet,
load_brca,
load_lgg,
load_kipan,
)
brca = load_brca()
print(brca.keys())
# dict_keys(['mirna', 'target', 'clinical', 'rna', 'methylation'])
from bioneuralnet.datasets import DatasetLoader
loader = DatasetLoader("kipan")
print(loader.shape)
API Summary¶
Each function returns a dict[str, pandas.DataFrame] mapping table names to loaded DataFrames:
load_example()keys:"X1","X2","Y","clinical"load_monet()keys:"gene","mirna","phenotype","rppa","clinical"load_brca()keys:"mirna","target","clinical","rna","methylation"load_lgg()keys:"mirna","target","clinical","rna","methylation"load_kipan()keys:"mirna","target","clinical","rna","methylation"
Valid dataset_name values for DatasetLoader (case-insensitive): "example", "monet", "brca", "lgg", "kipan".
Built-in Datasets¶
example¶
Synthetic dataset for testing and demonstration.
Table |
Shape |
Description |
|---|---|---|
|
(358, 500) |
Gene expression features |
|
(358, 100) |
miRNA features |
|
(358, 1) |
Continuous phenotype |
|
(358, 6) |
Clinical covariates |
from bioneuralnet.datasets import load_example
example = load_example()
monet¶
MONET multi-omics dataset.
Table |
Shape |
Description |
|---|---|---|
|
(107, 5039) |
Gene expression |
|
(107, 789) |
miRNA expression |
|
(106, 1) |
Phenotype labels |
|
(107, 175) |
Protein expression |
|
(107, 5) |
Clinical covariates |
from bioneuralnet.datasets import load_monet
monet = load_monet()
brca¶
TCGA Breast Invasive Carcinoma (BRCA). Target: PAM50 subtype (5-class): LumA (n=419), LumB (n=140), Basal (n=130), Her2 (n=46), Normal (n=34).
Stage |
methylation |
mRNA |
miRNA |
Clinical |
|---|---|---|---|---|
Raw (features x samples) |
20,107 x 885 |
18,321 x 1,212 |
503 x 1,189 |
1,098 x 18 |
Final aligned (samples x features) |
769 x 20,106 |
769 x 16,757 |
769 x 354 |
769 x 17 |
After Laplacian Score selection |
769 x 400 |
769 x 200 |
769 x 100 |
769 x 17 |
from bioneuralnet.datasets import load_brca
brca = load_brca()
lgg¶
TCGA Brain Lower Grade Glioma (LGG). Target: binary vital status, Alive (n=386) vs. Deceased (n=125).
Stage |
methylation |
mRNA |
miRNA |
Clinical |
|---|---|---|---|---|
Raw (features x samples) |
20,115 x 685 |
18,328 x 701 |
548 x 531 |
14 x 1,110 |
Final aligned (samples x features) |
511 x 20,114 |
511 x 18,328 |
511 x 548 |
511 x 13 |
After Laplacian Score selection |
511 x 400 |
511 x 200 |
511 x 100 |
511 x 13 |
from bioneuralnet.datasets import load_lgg
lgg = load_lgg()
kipan¶
TCGA Pan-Kidney cohort (KIPAN: KICH + KIRC + KIRP). Target: binary cancer stage, Early (Stages I/II, n=417) vs. Late (Stages III/IV, n=216).
Stage |
methylation |
mRNA |
miRNA |
Clinical |
|---|---|---|---|---|
Raw (features x samples) |
20,117 x 867 |
18,272 x 1,020 |
472 x 1,005 |
20 x 941 |
Final aligned (samples x features) |
658 x 20,116 |
658 x 18,272 |
658 x 472 |
658 x 19 |
After Laplacian Score selection |
633 x 400 |
633 x 200 |
633 x 100 |
633 x 19 |
from bioneuralnet.datasets import load_kipan
kipan = load_kipan()
Feature Selection¶
To reduce the dimensionality of the high-feature omics datasets, unsupervised feature selection was performed using Laplacian Score filtering. The Laplacian Score for the \(r\)-th feature is:
Where:
\(L_r\) is the Laplacian Score for feature \(r\). Lower scores indicate higher importance.
\(x_{ri}\) and \(x_{rj}\) are the standardized values of feature \(r\) for samples \(i\) and \(j\). All feature vectors undergo Z-score normalization prior to scoring.
\(W_{ij}\) is the edge weight between samples \(i\) and \(j\) in a symmetric k-nearest neighbors affinity graph. If samples \(i\) and \(j\) are neighbors, \(W_{ij} = 1\); otherwise \(W_{ij} = 0\).
\(\text{Var}(x_r)\) is the variance of feature \(r\), weighted by the degree matrix. This denominator ensures scale-invariant normalization, reflecting local spatial variance relative to global feature variance.
By filtering for the lowest Laplacian Scores, the optimal subsets of features were retained per cohort to maximize computational efficiency while preserving biological signals.