Datasets Guide¶

BioNeuralNet ships with several built-in multi-omics datasets that can be loaded via convenience functions or through the bioneuralnet.datasets.DatasetLoader class.

Each dataset is loaded as a collection of pandas.DataFrame objects:

Keys are table names (e.g., "rna", "mirna", "clinical", "target").
Values are the corresponding data tables.

In BioNeuralNet, rows represent subjects/patients and columns represent omics features or related variables.

Feature Selection Summary¶

To address high dimensionality and isolate the most informative variables, unsupervised feature selection was performed across all cohorts using Laplacian Score filtering. This methylationod evaluates each feature based on its ability to preserve the local manifold structure of the data, emphasizing features that vary minimally between closely related samples. Lower Laplacian Scores indicate higher feature importance.

The following feature counts were retained per modality across BRCA, LGG, and KIPAN cohorts:

Modality	Features Retained	Cohorts Applied
DNA methylation	400	BRCA, LGG, KIPAN
mRNA	200	BRCA, LGG, KIPAN
miRNA	100	BRCA, LGG, KIPAN

For a full list of available feature selection methylationods, see the Preprocessing Utilities documentation.

Quick Usage¶

You can use either the convenience loader functions or the lower-level DatasetLoader:

from bioneuralnet.datasets import (
    load_example,
    load_monet,
    load_brca,
    load_lgg,
    load_kipan,
)

brca = load_brca()
print(brca.keys())
# dict_keys(['mirna', 'target', 'clinical', 'rna', 'methylation'])

from bioneuralnet.datasets import DatasetLoader

loader = DatasetLoader("kipan")
print(loader.shape)

API Summary¶

Each function returns a dict[str, pandas.DataFrame] mapping table names to loaded DataFrames:

load_example() keys: "X1", "X2", "Y", "clinical"
load_monet() keys: "gene", "mirna", "phenotype", "rppa", "clinical"
load_brca() keys: "mirna", "target", "clinical", "rna", "methylation"
load_lgg() keys: "mirna", "target", "clinical", "rna", "methylation"
load_kipan() keys: "mirna", "target", "clinical", "rna", "methylation"

Valid dataset_name values for DatasetLoader (case-insensitive): "example", "monet", "brca", "lgg", "kipan".

Built-in Datasets¶

example¶

Synthetic dataset for testing and demonstration.

Table	Shape	Description
`X1`	(358, 500)	Gene expression features
`X2`	(358, 100)	miRNA features
`Y`	(358, 1)	Continuous phenotype
`clinical`	(358, 6)	Clinical covariates

from bioneuralnet.datasets import load_example
example = load_example()

monet¶

MONET multi-omics dataset.

Table	Shape	Description
`gene`	(107, 5039)	Gene expression
`mirna`	(107, 789)	miRNA expression
`phenotype`	(106, 1)	Phenotype labels
`rppa`	(107, 175)	Protein expression
`clinical`	(107, 5)	Clinical covariates

from bioneuralnet.datasets import load_monet
monet = load_monet()

brca¶

TCGA Breast Invasive Carcinoma (BRCA). Target: PAM50 subtype (5-class): LumA (n=419), LumB (n=140), Basal (n=130), Her2 (n=46), Normal (n=34).

Stage	methylation	mRNA	miRNA	Clinical
Raw (features x samples)	20,107 x 885	18,321 x 1,212	503 x 1,189	1,098 x 18
Final aligned (samples x features)	769 x 20,106	769 x 16,757	769 x 354	769 x 17
After Laplacian Score selection	769 x 400	769 x 200	769 x 100	769 x 17

from bioneuralnet.datasets import load_brca
brca = load_brca()

lgg¶

TCGA Brain Lower Grade Glioma (LGG). Target: binary vital status, Alive (n=386) vs. Deceased (n=125).

Stage	methylation	mRNA	miRNA	Clinical
Raw (features x samples)	20,115 x 685	18,328 x 701	548 x 531	14 x 1,110
Final aligned (samples x features)	511 x 20,114	511 x 18,328	511 x 548	511 x 13
After Laplacian Score selection	511 x 400	511 x 200	511 x 100	511 x 13

from bioneuralnet.datasets import load_lgg
lgg = load_lgg()

kipan¶

TCGA Pan-Kidney cohort (KIPAN: KICH + KIRC + KIRP). Target: binary cancer stage, Early (Stages I/II, n=417) vs. Late (Stages III/IV, n=216).

Stage	methylation	mRNA	miRNA	Clinical
Raw (features x samples)	20,117 x 867	18,272 x 1,020	472 x 1,005	20 x 941
Final aligned (samples x features)	658 x 20,116	658 x 18,272	658 x 472	658 x 19
After Laplacian Score selection	633 x 400	633 x 200	633 x 100	633 x 19

from bioneuralnet.datasets import load_kipan
kipan = load_kipan()

Feature Selection¶

To reduce the dimensionality of the high-feature omics datasets, unsupervised feature selection was performed using Laplacian Score filtering. The Laplacian Score for the \(r\)-th feature is:

\[L_r = \frac{\sum_{ij} (x_{ri} - x_{rj})^2 W_{ij}}{\text{Var}(x_r)}\]

Where:

\(L_r\) is the Laplacian Score for feature \(r\). Lower scores indicate higher importance.
\(x_{ri}\) and \(x_{rj}\) are the standardized values of feature \(r\) for samples \(i\) and \(j\). All feature vectors undergo Z-score normalization prior to scoring.
\(W_{ij}\) is the edge weight between samples \(i\) and \(j\) in a symmetric k-nearest neighbors affinity graph. If samples \(i\) and \(j\) are neighbors, \(W_{ij} = 1\); otherwise \(W_{ij} = 0\).
\(\text{Var}(x_r)\) is the variance of feature \(r\), weighted by the degree matrix. This denominator ensures scale-invariant normalization, reflecting local spatial variance relative to global feature variance.

By filtering for the lowest Laplacian Scores, the optimal subsets of features were retained per cohort to maximize computational efficiency while preserving biological signals.