Data Decision Framework¶

Building a reliable pipeline with BioNeuralNet requires careful, sequential decisions - each stage shapes the one that follows. This framework provides a structured recipe to guide you from raw multi-omics data to a configured downstream model, with concrete parameter recommendations grounded in empirical results [Hussein2024] [AbdelHafiz2022].

Note

For complete end-to-end code implementations on the TCGA cohorts, see the Notebooks. The BRCA, LGG, KIPAN, and ROSMAP notebooks each follow these stages from raw data alignment through downstream applications.

Stage 0: Load, Align, and Inspect Your Data¶

Before making any modeling decisions, standardize patient identifiers across all omics layers, align them to a common set of samples, and run a data quality assessment. This step informs every choice that follows.

0.1 - Align and standardize

import pandas as pd
from bioneuralnet.utils import m_transform, data_stats, sparse_filter

# Transpose to (samples, features) convention
mirna = mirna_raw.T
rna = rna_raw.T
meth = meth_raw.T

# Standardize patient barcodes (e.g. TCGA: keep first 12 characters)
def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

meth.index = trim_barcode(meth.index)
rna.index = trim_barcode(rna.index)
mirna.index = trim_barcode(mirna.index)
clinical.index = clinical.index.str.upper()

# Coerce to numeric; average any duplicate rows
for df in [meth, rna, mirna]:
    df = df.apply(pd.to_numeric, errors='coerce')
    df = df.groupby(df.index).mean()

# Standardize column names
for df in [meth, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")

# Intersect to common patients
common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(clinical.index))
X_meth, X_rna, X_mirna = meth.loc[common], rna.loc[common], mirna.loc[common]
Y_clinical = clinical.loc[common]

0.2 - Special case: methylation Beta values

If your methylation data consists of Beta values (bounded strictly between 0 and 1), convert them to M-values before any downstream analysis. The data_stats function will flag this automatically (see recommendation output below). M-values follow an approximately normal distribution and improve neural network stability.

# Only if data_stats flags Beta-value bounds:
X_meth = m_transform(X_meth, eps=1e-7)

0.3 - Inspect each omics layer

from bioneuralnet.utils import data_stats

data_stats(X_mirna, "miRNA")
data_stats(X_rna, "RNA")
data_stats(X_meth, "Methylation")

data_stats reports variance, sparsity, expression range, and missingness, and emits actionable recommendations. A representative output looks like:

=== miRNA Statistics Overview ===
--- Variance Summary ---
Variance Mean                   : 1.6516
Variance Median                 : 1.2756
Variance Min                    : 0.2437
Variance Max                    : 10.6028
Number Of Low Variance Features : 0

--- Zero Summary ---
Zero Mean                       : 0.0000
Number Of High Zero Features    : 0

--- Expression Summary ---
Expression Min                  : -2.7145
Expression Max                  : 18.2380

=== miRNA NaN Report ===
Global NaN: 13.72%
138 features are missing in >20.0% of samples.
54 samples are missing >20.0% of their features.

--- miRNA Recommendations ---
NORMALIZATION: Data distribution looks unbounded with low exact zeros.
Appears properly transformed.

=== Methylation Recommendations ===
NORMALIZATION: Values are strictly bounded between 0 and 1.
If these are Methylation Beta values, highly consider applying
`m_transform(df)` to convert them to M-values for neural network stability.

Decision rules from the output:

What `data_stats` reports	Interpretation	Action
`Global NaN > 10%`	Significant missingness	Run `sparse_filter` (Stage 0.4)
`Number Of Low Variance Features > 0`	Degenerate features present	Dropped automatically by `clean_inf_nan`
`Expression Min >= 0` and `Expression Max <= 1`	Likely Beta-value methylation data	Apply `m_transform` before proceeding
`Number Of High Zero Features > 0`	Possible raw count data (RNA-seq, miRNA)	Consider log2 transformation
`Global NaN ~= 0` and values unbounded	Data is pre-processed; safe to proceed	Continue to Stage 1

0.4 - Filter highly sparse features and samples

from bioneuralnet.utils import sparse_filter

# Drop features/samples missing in more than 20% of entries
X_mirna = sparse_filter(X_mirna, missing_fraction=0.20)
X_rna = sparse_filter(X_rna,   missing_fraction=0.20)
X_meth = sparse_filter(X_meth,  missing_fraction=0.20)

# Re-intersect after filtering (sparse_filter may drop patients)
final_patients = sorted(
  set(X_meth.index) & set(X_rna.index) &
  set(X_mirna.index) & set(Y_clinical.index)
)
X_meth, X_rna, X_mirna = (
  X_meth.loc[final_patients],
  X_rna.loc[final_patients],
  X_mirna.loc[final_patients]
)
Y_clinical = Y_clinical.loc[final_patients]

# Impute any remaining missing values with column means
for df in [X_meth, X_rna, X_mirna]:
  df.fillna(df.mean(), inplace=True)

Tip

The missing_fraction threshold of 0.20 is a reasonable starting point. For noisier data (e.g., single-cell or early-platform arrays) tighten to 0.10; for well-controlled bulk RNA-seq, 0.30 is acceptable.

Stage 1: Feature Selection¶

Important

Use unsupervised feature selection only prior to any train/test split. This avoids label leakage and ensures cross-validation estimates are unbiased.

Step 1.1 - Do you need feature selection?

Condition	Recommendation
n_features <= 500	Skip; proceed to network construction
500 < n_features <= 5,000	Apply one unsupervised selector below
n_features > 5,000	Feature selection is required

Step 1.2 - How many features to keep (n_keep)?

A practical upper bound is n_samples / 2 to avoid underdetermined systems.

n_samples	Recommended n_keep
< 100	20 - 50
100 - 300	50 - 150
300 - 700	150 - 400
> 700	400 - 1,000

Step 1.3 - Which selector to use?

Selector	Best for	Avoid when	Function
Laplacian Score (recommended default)	Preserving local manifold structure; heterogeneous cohorts; proven safe from leakage	Very large feature sets (>10k) where k-NN is costly	`laplacian_score`
Variance Threshold	Dense, low-dropout data; fast pre-filter	High-sparsity data (zeros inflate variance)	`variance_threshold`
MAD Filter	Skewed distributions; sparse count data; proteomics	Gaussian-like data (variance is sufficient)	`mad_filter`
PCA Loadings	Capturing global variance directions; correlated features	n_samples < n_features (PCA unstable)	`pca_loadings`
Correlation Filter (unsupervised)	Reducing redundancy among co-expressed features	When inter-feature correlation is expected by design	`correlation_filter(X, y=None)`

Step 1.4 - Worked example

For typical high-dimensional omics (e.g., 18,000 RNA features, 511 samples):

from bioneuralnet.utils import variance_threshold, laplacian_score

# Step 1: Fast pre-filter for very high-dimensional data
X_rna_prefiltered = variance_threshold(X_rna, k=2000)

# Step 2: Laplacian Score on the reduced set
# n_keep ~ n_samples / 3; k_neighbors = 5 for dense data
X_rna_selected = laplacian_score(X_rna_prefiltered, n_keep=170, k_neighbors=5)

# For miRNA (548 features, 511 samples) - Laplacian Score directly
X_mirna_selected = laplacian_score(X_mirna, n_keep=170, k_neighbors=5)

# For methylation (20,000+ features) - two-step is required
X_meth_prefiltered = variance_threshold(X_meth, k=2000)
X_meth_selected = laplacian_score(X_meth_prefiltered, n_keep=170, k_neighbors=5)

Key parameters:

k_neighbors: use 5 for dense data; 10-15 for sparse or noisy data. Values above 20 rarely improve results.
n_components (PCA Loadings only): set to min(50, n_samples, n_features).

Stage 2: Network Construction¶

After feature selection, construct the multi-omics network. Build a small set of candidate networks using the same k value, inspect their topology with NetworkAnalyzer, and select the one whose density falls in the target range.

Step 2.1 - Candidate network types

Method	Signal captured	Typical use case	Function
Pearson correlation	Linear co-expression	Gene expression, proteomics	`correlation_network`
Spearman correlation	Monotonic, rank-based relationships	Metabolomics, ordinal data	`correlation_network(method='spearman')`
Soft threshold	Scale-free topology; strong edges emphasized	WGCNA-style analysis	`threshold_network`
Cosine similarity	Directional similarity; magnitude-invariant	Mixed or high-dimensional omics	`similarity_network(metric='cosine')`
Gaussian k-NN	Smooth geometric neighborhoods	Metabolomics; continuous measurements	`gaussian_knn_network`

Step 2.2 - Worked example: build and compare candidates

Start with a common k=15 and vary the construction method or power parameter.

from bioneuralnet.network import (
  correlation_network, threshold_network,
  similarity_network, gaussian_knn_network,
  NetworkAnalyzer
)
import pandas as pd

# Combine selected omics into one feature matrix
omics_combined = pd.concat([X_rna_selected, X_mirna_selected, X_meth_selected], axis=1)

# Build candidate networks
net_threshold_62 = threshold_network(omics_combined, b=6.2, k=15)
net_threshold_75 = threshold_network(omics_combined, b=7.5, k=15)
net_correlation = correlation_network(omics_combined, k=15, method='pearson')
net_similarity = similarity_network(omics_combined, k=15, metric='cosine')
net_gaussian = gaussian_knn_network(omics_combined, k=15)

Step 2.3 - Key parameters

Soft threshold (``threshold_network``)

b (power parameter): start at 6.0-6.5. Increase toward 7.5-9.0 for a sparser, more scale-free network; decrease toward 4.0 if too many edges are lost.

Correlation and similarity networks

k: start with 15. Increase if the network is too sparse after the density check.

Gaussian k-NN

k: 10-20 for typical omics. Leave sigma=None for the data-adaptive estimate.

Stage 3: Network Quality Assessment¶

Inspect each candidate network with NetworkAnalyzer before passing it to a model. You are looking for density in the target range, no extreme hub dominance, and meaningful cross-omics edges.

Step 3.1 - Basic statistics and density check

from bioneuralnet.network import NetworkAnalyzer

analyzer = NetworkAnalyzer(net_threshold_75)
stats = analyzer.basic_statistics(threshold=0.1)

Example output:

============================================================
BASIC NETWORK STATISTICS (threshold > 0.1)
============================================================
Nodes:          430
Edges:        6,214
Density:     0.0674
Avg Degree:  28.90
Max Degree:  312
Isolated Nodes: 0 (0.0%)

Tip

These density ranges are starting points. Network behavior varies by cohort size, omics type, and construction method. Use NetworkAnalyzer diagnostics to validate before proceeding.

Measured density	Interpretation	Action
> 0.7	Dense / Complete - noise risk high	Increase `b` or lower `k`; rebuild and re-check
0.1 - 0.7	Moderately dense - best performance zone	Proceed to downstream task
0.01 - 0.1	Sparse - efficient but may miss links	Check isolated nodes; increase `k`
< 0.01	Too sparse - network is near-empty	Reduce `b`, increase `k`, or switch construction method

Step 3.2 - Hub analysis

Very high-degree hub nodes can dominate GNN message passing. Check the top hubs:

hub_df = analyzer.hub_analysis(threshold=0.1, top_n=10)

If the top hub’s degree is more than 10x the average degree, raise b or reduce k to trim outlier connections.

Step 3.3 - Edge weight distribution

Use the percentile output to select a binarization threshold for topology metrics, and to understand whether the network has biologically graded signal:

weights = analyzer.edge_weight_analysis()

Total edges (weight > 0): 12,428
Mean: 0.312   Median: 0.289   Std: 0.091
Percentiles:
  25th: 0.241   50th: 0.289   75th: 0.378   90th: 0.431
Edges at different biological thresholds:
  > 0.1 : 11,904 edges
  > 0.3 :  6,214 edges
  > 0.5 :  1,832 edges

Use the 25th-50th percentile range as your threshold argument to basic_statistics and hub_analysis.

Step 3.4 - Cross-omics connectivity

For multi-omics networks, verify that between-layer interactions are present and the network is not dominated by within-layer correlations:

cross = analyzer.cross_omics_analysis(threshold=0.1)

Omics Pair           |      Edges | Max Possible |    Density
------------------------------------------------------------
rna (within)         |      4,102 |       14,535 |   0.282
mirna (within)       |        312 |       14,365 |   0.022
rna-mirna            |      1,800 |       29,070 |   0.062

A healthy multi-omics network shows cross-omics density > 0.01. If cross-omics density is near zero, the network is dominated by within-layer correlations and may not benefit from multi-omics construction. In this case, revisit feature selection or try a different construction method.

Step 3.5 - Strongest edges (sanity check)

Inspect the top interactions against known biology:

top_edges = analyzer.find_strongest_edges(top_n=10)

Step 3.6 - Selection rule

After inspecting candidates, select the network with density closest to the 0.1-0.7 range, no extreme hub dominance, and meaningful cross-omics edges. In practice, moderately dense networks have consistently yielded the best prediction accuracy [Hussein2024].

Stage 4: Downstream Task Configuration¶

At this point you have a cleaned feature matrix and a network in the validated density range. Configure the downstream task based on your goal.

Disease Prediction (DPMON)¶

The DPMON pipeline trains a GNN to generate node embeddings, reduces their dimensionality, integrates them with the raw omics features, and feeds the result to a prediction network. The entire pipeline is optimized end-to-end.

GNN Architecture

Based on systematic comparison across 15 networks in two COPD patient cohorts [Hussein2024]:

Network type	Best architecture	Notes
Moderately dense	GAT (default recommendation)	Attention mechanism handles edge weight variation best
Complete / Dense	GCN or GAT (comparable)	GCN sufficient; attention adds little at high density
Sparse	GAT	Attention differentiates the few meaningful edges

Number of GNN Layers

Network type	Recommended layers	Rationale
Complete / Dense	2 - 3	Over-smoothing risk is high
Moderately dense or Sparse	4 - 5	Sufficient depth to propagate information across wider neighborhoods

Embedding Dimension

Safe range: 8 - 128. Performance gains plateau beyond 128.
Small cohorts (n < 200): stay at 8 - 32.
Large cohorts (n > 500): 64 - 128.

Dimensionality Reduction of Embeddings

Autoencoder reduction consistently yielded the strongest results across network types and densities.

Integration Method

GNN-derived scalar weights are applied via element-wise multiplication with the original omics features. This is the default and only integration method in DPMON.

Validated Reference Configurations

Validated on COPDGene multi-omics data predicting COPD GOLD stage (6 classes). Accuracy reported as mean +/- std over 500-1,000 runs [Hussein2024].

Current Smokers

Network	GNN	Layers	Emb. dim	Integration	Accuracy
1 Moderately Dense	GAT	4-5	64-128	Scaling	0.5621 +/- 0.0042
3 Moderately Dense	GAT	4-5	64-128	Scaling	0.5394 +/- 0.0009
1 Complete	GCN	2-3	64-128	Scaling	0.5378 +/- 0.0071

Former Smokers

Network	GNN	Layers	Emb. dim	Integration	Accuracy
1 Moderately Dense	GAT	4-5	64-128	Scaling	0.5513 +/- 0.0065
2 Moderately Dense	GAT	4-5	64-128	Scaling	0.5491 +/- 0.0021

Tip

Default DPMON recipe: Moderately dense network -> GAT -> 4 layers -> embedding dim 64 -> autoencoder reduction -> scaling integration -> 2-layer prediction NN. This configuration outperforms Logistic Regression, Random Forest, SAINT, and Node2Vec by approximately 8-10%.

Subgraph Detection¶

For identifying biologically significant subnetworks correlated with a phenotype, the hybrid Correlated Louvain + Correlated PageRank approach [AbdelHafiz2022] is recommended over hierarchical clustering.

Correlated Louvain parameter (kL)

kL controls the balance between modularity and phenotype correlation:

kL	Effect
0.8	Larger subgraphs; lower correlation
0.4-0.6	Balanced; default starting range
0.0-0.2	Smaller, highly correlated subgraphs

Start with kL=0.6. Use the subgraph with the highest absolute Pearson correlation to the phenotype as your result.

Correlated PageRank parameters

alpha (teleportation probability): 0.04
epsilon (tolerance): 1e-6

Hybrid approach (recommended)

Run Correlated Louvain (kL=0.6) on the full network
Select the top subgraph by rho
Seed Correlated PageRank weighted by node-level correlation contribution (alpha=0.04, epsilon=1e-6)
Repeat until subgraph size stabilizes

The hybrid method achieved rho = 0.41 on the COPD protein-metabolite network vs. rho = 0.33 for hierarchical clustering [AbdelHafiz2022].

Summary: Quick-Reference Recipe Table¶

Scenario	Feature Selection	Network	GNN / Task	Key params
Large cohort (n>500), high-dim, disease prediction	Variance -> Laplacian Score, n_keep=400	Threshold or Correlation, moderate density	GAT, 3-4 layers, emb 64-128	Scaling integration, autoencoder
Small cohort (n<200), disease prediction	Variance or MAD, n_keep=50-100	Sparse to moderate; avoid complete	GAT, 2-3 layers, emb 8-32	Scaling integration, 2-layer NN
Sparse / high-dropout data (metabolomics, scRNA)	MAD or Laplacian Score	Gaussian k-NN	GAT, 4-5 layers	Impute first; inspect cross-omics density
Subgraph / pathway detection	Not required (operate on full network)	Existing SmCCNet / SGTCCA network	Hybrid Louvain + PageRank	kL=0.6, alpha=0.04, epsilon=1e-6
Unknown / exploratory	Laplacian Score, n_keep = n_samples/3	Pearson correlation, target density 0.1-0.7	GAT, 4 layers, emb 64	Scaling, autoencoder; tune layers first

References¶

[Hussein2024] (1,2,3,4)

Hussein, S. et al. “Learning from Multi-Omics Networks to Enhance Disease Prediction: An Optimized Network Embedding and Fusion Approach.” In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 2024, pp. 4371-4378. DOI: 10.1109/BIBM62325.2024.10822233.

[AbdelHafiz2022] (1,2,3)

Abdel-Hafiz, M., Najafi, M., et al. “Significant Subgraph Detection in Multi-omics Networks for Disease Pathway Identification.” Frontiers in Big Data, 5, 894632 (2022). DOI: 10.3389/fdata.2022.894632.