Data Decision Framework

Building a reliable pipeline with BioNeuralNet requires careful, sequential decisions - each stage shapes the one that follows. This framework provides a structured recipe to guide you from raw multi-omics data to a configured downstream model, with concrete parameter recommendations grounded in empirical results [Hussein2024] [AbdelHafiz2022].

BioNeuralNet Data Decision Framework Flowchart

Figure: Step-by-step decision flowchart for configuring a BioNeuralNet pipeline. Details for each stage are given in the sections below. See Flow Chart Full Size

Note

For complete end-to-end code implementations on the TCGA cohorts, see the Notebooks. The BRCA, LGG, KIPAN, and ROSMAP notebooks each follow these stages from raw data alignment through downstream applications.

Stage 0: Load, Align, and Inspect Your Data

Before making any modeling decisions, standardize patient identifiers across all omics layers, align them to a common set of samples, and run a data quality assessment. This step informs every choice that follows.

0.1 - Align and standardize

import pandas as pd
from bioneuralnet.utils import m_transform, data_stats, sparse_filter

# Transpose to (samples, features) convention
mirna = mirna_raw.T
rna = rna_raw.T
meth = meth_raw.T

# Standardize patient barcodes (e.g. TCGA: keep first 12 characters)
def trim_barcode(idx):
    return idx.to_series().str.slice(0, 12)

meth.index = trim_barcode(meth.index)
rna.index = trim_barcode(rna.index)
mirna.index = trim_barcode(mirna.index)
clinical.index = clinical.index.str.upper()

# Coerce to numeric; average any duplicate rows
for df in [meth, rna, mirna]:
    df = df.apply(pd.to_numeric, errors='coerce')
    df = df.groupby(df.index).mean()

# Standardize column names
for df in [meth, rna, mirna]:
    df.columns = df.columns.str.replace(r"\?", "unknown_", regex=True)
    df.columns = df.columns.str.replace(r"\|", "_", regex=True)
    df.columns = df.columns.str.replace("-", "_", regex=False)
    df.columns = df.columns.str.replace(r"_+", "_", regex=True)
    df.columns = df.columns.str.strip("_")

# Intersect to common patients
common = sorted(set(meth.index) & set(rna.index) & set(mirna.index) & set(clinical.index))
X_meth, X_rna, X_mirna = meth.loc[common], rna.loc[common], mirna.loc[common]
Y_clinical = clinical.loc[common]

0.2 - Special case: methylation Beta values

If your methylation data consists of Beta values (bounded strictly between 0 and 1), convert them to M-values before any downstream analysis. The data_stats function will flag this automatically (see recommendation output below). M-values follow an approximately normal distribution and improve neural network stability.

# Only if data_stats flags Beta-value bounds:
X_meth = m_transform(X_meth, eps=1e-7)

0.3 - Inspect each omics layer

from bioneuralnet.utils import data_stats

data_stats(X_mirna, "miRNA")
data_stats(X_rna, "RNA")
data_stats(X_meth, "Methylation")

data_stats reports variance, sparsity, expression range, and missingness, and emits actionable recommendations. A representative output looks like:

=== miRNA Statistics Overview ===
--- Variance Summary ---
Variance Mean                   : 1.6516
Variance Median                 : 1.2756
Variance Min                    : 0.2437
Variance Max                    : 10.6028
Number Of Low Variance Features : 0

--- Zero Summary ---
Zero Mean                       : 0.0000
Number Of High Zero Features    : 0

--- Expression Summary ---
Expression Min                  : -2.7145
Expression Max                  : 18.2380

=== miRNA NaN Report ===
Global NaN: 13.72%
138 features are missing in >20.0% of samples.
54 samples are missing >20.0% of their features.

--- miRNA Recommendations ---
NORMALIZATION: Data distribution looks unbounded with low exact zeros.
Appears properly transformed.

=== Methylation Recommendations ===
NORMALIZATION: Values are strictly bounded between 0 and 1.
If these are Methylation Beta values, highly consider applying
`m_transform(df)` to convert them to M-values for neural network stability.

Decision rules from the output:

What data_stats reports

Interpretation

Action

Global NaN > 10%

Significant missingness

Run sparse_filter (Stage 0.4)

Number Of Low Variance Features > 0

Degenerate features present

Dropped automatically by clean_inf_nan

Expression Min >= 0 and Expression Max <= 1

Likely Beta-value methylation data

Apply m_transform before proceeding

Number Of High Zero Features > 0

Possible raw count data (RNA-seq, miRNA)

Consider log2 transformation

Global NaN ~= 0 and values unbounded

Data is pre-processed; safe to proceed

Continue to Stage 1

0.4 - Filter highly sparse features and samples

from bioneuralnet.utils import sparse_filter

# Drop features/samples missing in more than 20% of entries
X_mirna = sparse_filter(X_mirna, missing_fraction=0.20)
X_rna = sparse_filter(X_rna,   missing_fraction=0.20)
X_meth = sparse_filter(X_meth,  missing_fraction=0.20)

# Re-intersect after filtering (sparse_filter may drop patients)
final_patients = sorted(
  set(X_meth.index) & set(X_rna.index) &
  set(X_mirna.index) & set(Y_clinical.index)
)
X_meth, X_rna, X_mirna = (
  X_meth.loc[final_patients],
  X_rna.loc[final_patients],
  X_mirna.loc[final_patients]
)
Y_clinical = Y_clinical.loc[final_patients]

# Impute any remaining missing values with column means
for df in [X_meth, X_rna, X_mirna]:
  df.fillna(df.mean(), inplace=True)

Tip

The missing_fraction threshold of 0.20 is a reasonable starting point. For noisier data (e.g., single-cell or early-platform arrays) tighten to 0.10; for well-controlled bulk RNA-seq, 0.30 is acceptable.

Stage 1: Feature Selection

Important

Use unsupervised feature selection only prior to any train/test split. This avoids label leakage and ensures cross-validation estimates are unbiased.

Step 1.1 - Do you need feature selection?

Condition

Recommendation

n_features <= 500

Skip; proceed to network construction

500 < n_features <= 5,000

Apply one unsupervised selector below

n_features > 5,000

Feature selection is required

Step 1.2 - How many features to keep (n_keep)?

A practical upper bound is n_samples / 2 to avoid underdetermined systems.

n_samples

Recommended n_keep

< 100

20 - 50

100 - 300

50 - 150

300 - 700

150 - 400

> 700

400 - 1,000

Step 1.3 - Which selector to use?

Selector

Best for

Avoid when

Function

Laplacian Score (recommended default)

Preserving local manifold structure; heterogeneous cohorts; proven safe from leakage

Very large feature sets (>10k) where k-NN is costly

laplacian_score

Variance Threshold

Dense, low-dropout data; fast pre-filter

High-sparsity data (zeros inflate variance)

variance_threshold

MAD Filter

Skewed distributions; sparse count data; proteomics

Gaussian-like data (variance is sufficient)

mad_filter

PCA Loadings

Capturing global variance directions; correlated features

n_samples < n_features (PCA unstable)

pca_loadings

Correlation Filter (unsupervised)

Reducing redundancy among co-expressed features

When inter-feature correlation is expected by design

correlation_filter(X, y=None)

Step 1.4 - Worked example

For typical high-dimensional omics (e.g., 18,000 RNA features, 511 samples):

from bioneuralnet.utils import variance_threshold, laplacian_score

# Step 1: Fast pre-filter for very high-dimensional data
X_rna_prefiltered = variance_threshold(X_rna, k=2000)

# Step 2: Laplacian Score on the reduced set
# n_keep ~ n_samples / 3; k_neighbors = 5 for dense data
X_rna_selected = laplacian_score(X_rna_prefiltered, n_keep=170, k_neighbors=5)

# For miRNA (548 features, 511 samples) - Laplacian Score directly
X_mirna_selected = laplacian_score(X_mirna, n_keep=170, k_neighbors=5)

# For methylation (20,000+ features) - two-step is required
X_meth_prefiltered = variance_threshold(X_meth, k=2000)
X_meth_selected = laplacian_score(X_meth_prefiltered, n_keep=170, k_neighbors=5)

Key parameters:

  • k_neighbors: use 5 for dense data; 10-15 for sparse or noisy data. Values above 20 rarely improve results.

  • n_components (PCA Loadings only): set to min(50, n_samples, n_features).

Stage 2: Network Construction

After feature selection, construct the multi-omics network. Build a small set of candidate networks using the same k value, inspect their topology with NetworkAnalyzer, and select the one whose density falls in the target range.

Step 2.1 - Candidate network types

Method

Signal captured

Typical use case

Function

Pearson correlation

Linear co-expression

Gene expression, proteomics

correlation_network

Spearman correlation

Monotonic, rank-based relationships

Metabolomics, ordinal data

correlation_network(method='spearman')

Soft threshold

Scale-free topology; strong edges emphasized

WGCNA-style analysis

threshold_network

Cosine similarity

Directional similarity; magnitude-invariant

Mixed or high-dimensional omics

similarity_network(metric='cosine')

Gaussian k-NN

Smooth geometric neighborhoods

Metabolomics; continuous measurements

gaussian_knn_network

Step 2.2 - Worked example: build and compare candidates

Start with a common k=15 and vary the construction method or power parameter.

from bioneuralnet.network import (
  correlation_network, threshold_network,
  similarity_network, gaussian_knn_network,
  NetworkAnalyzer
)
import pandas as pd

# Combine selected omics into one feature matrix
omics_combined = pd.concat([X_rna_selected, X_mirna_selected, X_meth_selected], axis=1)

# Build candidate networks
net_threshold_62 = threshold_network(omics_combined, b=6.2, k=15)
net_threshold_75 = threshold_network(omics_combined, b=7.5, k=15)
net_correlation = correlation_network(omics_combined, k=15, method='pearson')
net_similarity = similarity_network(omics_combined, k=15, metric='cosine')
net_gaussian = gaussian_knn_network(omics_combined, k=15)

Step 2.3 - Key parameters

Soft threshold (``threshold_network``)

  • b (power parameter): start at 6.0-6.5. Increase toward 7.5-9.0 for a sparser, more scale-free network; decrease toward 4.0 if too many edges are lost.

Correlation and similarity networks

  • k: start with 15. Increase if the network is too sparse after the density check.

Gaussian k-NN

  • k: 10-20 for typical omics. Leave sigma=None for the data-adaptive estimate.

Stage 3: Network Quality Assessment

Inspect each candidate network with NetworkAnalyzer before passing it to a model. You are looking for density in the target range, no extreme hub dominance, and meaningful cross-omics edges.

Step 3.1 - Basic statistics and density check

from bioneuralnet.network import NetworkAnalyzer

analyzer = NetworkAnalyzer(net_threshold_75)
stats = analyzer.basic_statistics(threshold=0.1)

Example output:

============================================================
BASIC NETWORK STATISTICS (threshold > 0.1)
============================================================
Nodes:          430
Edges:        6,214
Density:     0.0674
Avg Degree:  28.90
Max Degree:  312
Isolated Nodes: 0 (0.0%)

Tip

These density ranges are starting points. Network behavior varies by cohort size, omics type, and construction method. Use NetworkAnalyzer diagnostics to validate before proceeding.

Measured density

Interpretation

Action

> 0.7

Dense / Complete - noise risk high

Increase b or lower k; rebuild and re-check

0.1 - 0.7

Moderately dense - best performance zone

Proceed to downstream task

0.01 - 0.1

Sparse - efficient but may miss links

Check isolated nodes; increase k

< 0.01

Too sparse - network is near-empty

Reduce b, increase k, or switch construction method

Step 3.2 - Hub analysis

Very high-degree hub nodes can dominate GNN message passing. Check the top hubs:

hub_df = analyzer.hub_analysis(threshold=0.1, top_n=10)

If the top hub’s degree is more than 10x the average degree, raise b or reduce k to trim outlier connections.

Step 3.3 - Edge weight distribution

Use the percentile output to select a binarization threshold for topology metrics, and to understand whether the network has biologically graded signal:

weights = analyzer.edge_weight_analysis()
Total edges (weight > 0): 12,428
Mean: 0.312   Median: 0.289   Std: 0.091
Percentiles:
  25th: 0.241   50th: 0.289   75th: 0.378   90th: 0.431
Edges at different biological thresholds:
  > 0.1 : 11,904 edges
  > 0.3 :  6,214 edges
  > 0.5 :  1,832 edges

Use the 25th-50th percentile range as your threshold argument to basic_statistics and hub_analysis.

Step 3.4 - Cross-omics connectivity

For multi-omics networks, verify that between-layer interactions are present and the network is not dominated by within-layer correlations:

cross = analyzer.cross_omics_analysis(threshold=0.1)
Omics Pair           |      Edges | Max Possible |    Density
------------------------------------------------------------
rna (within)         |      4,102 |       14,535 |   0.282
mirna (within)       |        312 |       14,365 |   0.022
rna-mirna            |      1,800 |       29,070 |   0.062

A healthy multi-omics network shows cross-omics density > 0.01. If cross-omics density is near zero, the network is dominated by within-layer correlations and may not benefit from multi-omics construction. In this case, revisit feature selection or try a different construction method.

Step 3.5 - Strongest edges (sanity check)

Inspect the top interactions against known biology:

top_edges = analyzer.find_strongest_edges(top_n=10)

Step 3.6 - Selection rule

After inspecting candidates, select the network with density closest to the 0.1-0.7 range, no extreme hub dominance, and meaningful cross-omics edges. In practice, moderately dense networks have consistently yielded the best prediction accuracy [Hussein2024].

Stage 4: Downstream Task Configuration

At this point you have a cleaned feature matrix and a network in the validated density range. Configure the downstream task based on your goal.

Disease Prediction (DPMON)

The DPMON pipeline trains a GNN to generate node embeddings, reduces their dimensionality, integrates them with the raw omics features, and feeds the result to a prediction network. The entire pipeline is optimized end-to-end.

GNN Architecture

Based on systematic comparison across 15 networks in two COPD patient cohorts [Hussein2024]:

Network type

Best architecture

Notes

Moderately dense

GAT (default recommendation)

Attention mechanism handles edge weight variation best

Complete / Dense

GCN or GAT (comparable)

GCN sufficient; attention adds little at high density

Sparse

GAT

Attention differentiates the few meaningful edges

Number of GNN Layers

Network type

Recommended layers

Rationale

Complete / Dense

2 - 3

Over-smoothing risk is high

Moderately dense or Sparse

4 - 5

Sufficient depth to propagate information across wider neighborhoods

Embedding Dimension

  • Safe range: 8 - 128. Performance gains plateau beyond 128.

  • Small cohorts (n < 200): stay at 8 - 32.

  • Large cohorts (n > 500): 64 - 128.

Dimensionality Reduction of Embeddings

Autoencoder reduction consistently yielded the strongest results across network types and densities.

Integration Method

GNN-derived scalar weights are applied via element-wise multiplication with the original omics features. This is the default and only integration method in DPMON.

Validated Reference Configurations

Validated on COPDGene multi-omics data predicting COPD GOLD stage (6 classes). Accuracy reported as mean +/- std over 500-1,000 runs [Hussein2024].

Current Smokers

Network

GNN

Layers

Emb. dim

Integration

Accuracy

1 Moderately Dense

GAT

4-5

64-128

Scaling

0.5621 +/- 0.0042

3 Moderately Dense

GAT

4-5

64-128

Scaling

0.5394 +/- 0.0009

1 Complete

GCN

2-3

64-128

Scaling

0.5378 +/- 0.0071

Former Smokers

Network

GNN

Layers

Emb. dim

Integration

Accuracy

1 Moderately Dense

GAT

4-5

64-128

Scaling

0.5513 +/- 0.0065

2 Moderately Dense

GAT

4-5

64-128

Scaling

0.5491 +/- 0.0021

Tip

Default DPMON recipe: Moderately dense network -> GAT -> 4 layers -> embedding dim 64 -> autoencoder reduction -> scaling integration -> 2-layer prediction NN. This configuration outperforms Logistic Regression, Random Forest, SAINT, and Node2Vec by approximately 8-10%.

Subgraph Detection

For identifying biologically significant subnetworks correlated with a phenotype, the hybrid Correlated Louvain + Correlated PageRank approach [AbdelHafiz2022] is recommended over hierarchical clustering.

Correlated Louvain parameter (kL)

kL controls the balance between modularity and phenotype correlation:

kL

Effect

0.8

Larger subgraphs; lower correlation

0.4-0.6

Balanced; default starting range

0.0-0.2

Smaller, highly correlated subgraphs

Start with kL=0.6. Use the subgraph with the highest absolute Pearson correlation to the phenotype as your result.

Correlated PageRank parameters

  • alpha (teleportation probability): 0.04

  • epsilon (tolerance): 1e-6

Hybrid approach (recommended)

  1. Run Correlated Louvain (kL=0.6) on the full network

  2. Select the top subgraph by rho

  3. Seed Correlated PageRank weighted by node-level correlation contribution (alpha=0.04, epsilon=1e-6)

  4. Repeat until subgraph size stabilizes

The hybrid method achieved rho = 0.41 on the COPD protein-metabolite network vs. rho = 0.33 for hierarchical clustering [AbdelHafiz2022].

Summary: Quick-Reference Recipe Table

Scenario

Feature Selection

Network

GNN / Task

Key params

Large cohort (n>500), high-dim, disease prediction

Variance -> Laplacian Score, n_keep=400

Threshold or Correlation, moderate density

GAT, 3-4 layers, emb 64-128

Scaling integration, autoencoder

Small cohort (n<200), disease prediction

Variance or MAD, n_keep=50-100

Sparse to moderate; avoid complete

GAT, 2-3 layers, emb 8-32

Scaling integration, 2-layer NN

Sparse / high-dropout data (metabolomics, scRNA)

MAD or Laplacian Score

Gaussian k-NN

GAT, 4-5 layers

Impute first; inspect cross-omics density

Subgraph / pathway detection

Not required (operate on full network)

Existing SmCCNet / SGTCCA network

Hybrid Louvain + PageRank

kL=0.6, alpha=0.04, epsilon=1e-6

Unknown / exploratory

Laplacian Score, n_keep = n_samples/3

Pearson correlation, target density 0.1-0.7

GAT, 4 layers, emb 64

Scaling, autoencoder; tune layers first

References

[Hussein2024] (1,2,3,4)

Hussein, S. et al. “Learning from Multi-Omics Networks to Enhance Disease Prediction: An Optimized Network Embedding and Fusion Approach.” In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 2024, pp. 4371-4378. DOI: 10.1109/BIBM62325.2024.10822233.

[AbdelHafiz2022] (1,2,3)

Abdel-Hafiz, M., Najafi, M., et al. “Significant Subgraph Detection in Multi-omics Networks for Disease Pathway Identification.” Frontiers in Big Data, 5, 894632 (2022). DOI: 10.3389/fdata.2022.894632.