bioneuralnet

BioNeuralNet: Graph Neural Network-based Multi-Omics Network Data Analysis.

BioNeuralNet is a modular framework tailored for end-to-end network-based multi-omics data analysis. It leverages Graph Neural Networks (GNNs) to transform complex molecular networks into biologically meaningful low-dimensional representations, enabling diverse downstream analytical tasks.

Key Features:

Network Construction: Modules to construct networks from raw tabular data using similarity, correlation, neighborhood-based, or phenotype-driven strategies (e.g., SmCCNet).
Network Embedding: Generate low-dimensional representations using advanced Graph Neural Networks, including GCN, GAT, GraphSAGE, and GIN.
Subgraph Detection: Identify biologically meaningful modules using supervised and unsupervised community detection methods like Correlated Louvain and PageRank.
Downstream Tasks: Execute specialized pipelines such as DPMON (Disease Prediction using Multi-Omics Networks) and Subject Representation for patient-level analysis.
Data Handling: Streamline data ingestion, feature selection (ANOVA, Random Forest), and preprocessing.
Reproducibility: Built-in logging, configuration, and seeding utilities to ensure reproducible research.

Functions

`auto_pysmccnet`(X, Y[, AdjustedCovar, ...])	Automated SmCCNet workflow with GPU acceleration.
`get_logger`(name)	Retrieves a global logger configured to write to 'bioneuralnet.log'.
`load_brca`()	Load the Breast Invasive Carcinoma (BRCA) dataset.
`load_example`()	Load the synthetic Example dataset.
`load_kipan`()	Load the Pan-kidney (KIPAN) dataset.
`load_lgg`()	Load the Brain Lower Grade Glioma dataset.
`load_monet`()	Load the synthetic MONET dataset.
`set_seed`(seed_value)	Sets seeds for maximum reproducibility across Python, NumPy, and PyTorch.

Classes

`CorrelatedLouvain`(G, B, Y[, k_L, weight, ...])	Correlated Louvain community detection.
`CorrelatedPageRank`(graph, omics_data, ...[, ...])	Correlated PageRank clustering on a multi-omics network.
`DPMON`(adjacency_matrix, omics_list, ...[, ...])	DPMON (Disease Prediction using Multi-Omics Networks) end-to-end pipeline for multi-omics disease prediction.
`DatasetLoader`(dataset_name)	Load a pre-packaged multi-omics dataset from the package.
`GNNEmbedding`(adjacency_matrix, omics_data, ...)	GNNEmbedding Class for Generating Graph Neural Network (GNN) Based Embeddings.
`HybridLouvain`(G, B, Y[, k_L, teleport_prob, ...])	Hybrid Louvain-PageRank for significant subgraph detection.
`SubjectRepresentation`(omics_data, embeddings)	SubjectRepresentation Class for Integrating Network Embeddings into Omics Data.

class bioneuralnet.CorrelatedLouvain(G: Graph, B: DataFrame, Y: Series | DataFrame, k_L: float = 0.2, weight: str = 'weight', max_passes: int = 50, min_delta: float = 1e-06, seed: int | None = None)

Bases: Louvain

Correlated Louvain community detection.

Inherits from Louvain.

Parameters:

G (nx.Graph) – The input graph for community detection.
B (pd.DataFrame) – Omics data (n_samples x n_features). Column names must match nodes.
Y (Union[pd.Series, pd.DataFrame]) – Phenotype vector aligned with rows of B.
k_L (float) – Weight on modularity in combined objective (Eq. 9).
weight (str) – Edge attribute name for weights.
max_passes (int) – Maximum number of passes for Phase 1 optimization.
min_delta (float) – Convergence tolerance for objective gain.
seed (Optional[int]) – Random seed for reproducibility.

property communities: Dict[int, List[Any]]

Retrieves the communities grouped by community ID.

Convenient for iterating over sets of nodes belonging to the same community.

Returns: A dictionary mapping community IDs to lists of nodes.
Return type: Dict[int, List[Any]]

get_combined_quality() → float

Access the calculated combined quality score.

Returns: The Q* score.
Return type: float

get_top_communities(n: int = 1) → List[Tuple[int, float, List[Any]]]

Retrieve the top communities based on absolute correlation.

Parameters: n (int) – Number of top communities to return.
Returns: Community data sorted by rho.
Return type: List[Tuple[int, float, List[Any]]]

property history: List[Dict[str, Any]]

Retrieves the history of the algorithm’s execution levels.

Provides insight into the convergence process and reduction of graph size.

Returns: A list of dictionaries containing stats for each level.
Return type: List[Dict[str, Any]]

property modularity: float

Retrieves the final modularity score of the computed partition.

Requires that the run() method has been executed previously.

Returns: The modularity score.
Return type: float
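For orientation, the standard Newman modularity of a partition can be computed directly from a weighted adjacency mapping. The sketch below is a generic pure-Python illustration, not the library's implementation: the `build_adjacency` and `modularity_of` helpers and the `adj` layout are assumptions made for this example, and the graph is assumed to be undirected with no self-loops.

```python
def build_adjacency(edges):
    """Turn (u, v, weight) triples into a symmetric adjacency mapping."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def modularity_of(adj, partition):
    """Newman modularity: sum over communities of (w_in / m) - (deg / 2m)^2."""
    # Every undirected edge appears twice in adj, so this sum equals 2m.
    two_m = sum(w for nbrs in adj.values() for w in nbrs.values())
    intra = {}   # twice the internal edge weight of each community
    degree = {}  # total weighted degree of each community
    for u, nbrs in adj.items():
        cu = partition[u]
        for v, w in nbrs.items():
            degree[cu] = degree.get(cu, 0.0) + w
            if partition[v] == cu:
                intra[cu] = intra.get(cu, 0.0) + w
    return sum(intra.get(c, 0.0) / two_m - (degree[c] / two_m) ** 2 for c in degree)


# Hypothetical toy graph: two triangles joined by a single bridge edge (c-d).
edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
         ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),
         ("c", "d", 1.0)]
adj = build_adjacency(edges)
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity_of(adj, partition)
print(round(q, 4))  # 5/14, about 0.3571
```

Putting every node in one community gives a modularity of zero, which is a quick sanity check for any implementation of this formula.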

property partition: Dict[Any, int]

Retrieves the final partition of the graph.

Requires that the run() method has been executed previously.

Returns: A dictionary mapping nodes to community IDs.
Return type: Dict[Any, int]
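The `partition` and `communities` properties describe the same result from two directions: node to community ID, and community ID to node list. A minimal sketch of that relationship follows; `group_by_community` and the node names are hypothetical and only illustrate the data shapes, not how the library builds them.

```python
from collections import defaultdict


def group_by_community(partition):
    """Invert a {node: community_id} mapping into {community_id: [nodes]}."""
    communities = defaultdict(list)
    for node, community_id in partition.items():
        communities[community_id].append(node)
    return dict(communities)


# Hypothetical partition in the shape documented for `partition`.
partition = {"geneA": 0, "geneB": 0, "mirX": 1, "geneC": 1, "geneD": 2}
communities = group_by_community(partition)

# Iterate over communities from largest to smallest.
largest_first = sorted(communities.items(), key=lambda kv: len(kv[1]), reverse=True)
for community_id, nodes in largest_first:
    print(community_id, nodes)
```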

run() → Dict[Any, int]

Execute the Correlated Louvain algorithm.

Returns: Mapping of original nodes to community IDs.
Return type: Dict[Any, int]

class bioneuralnet.CorrelatedPageRank(graph: Graph, omics_data: DataFrame, phenotype_data: DataFrame | Series, teleport_prob: float = 0.1, k_P: float = 0.5, max_iter: int = 100, tol: float = 1e-06, min_cluster: int = 2, seed: int | None = None)

Bases: object

Correlated PageRank clustering on a multi-omics network.

Parameters:

graph (nx.Graph) – Weighted undirected NetworkX graph.
omics_data (pd.DataFrame) – Omics matrix (n_samples x n_features), columns = node ids.
phenotype_data (Union[pd.DataFrame, pd.Series]) – Phenotype vector aligned with rows of omics_data.
teleport_prob (float) – Teleportation probability (alpha). Default 0.10.
k_P (float) – Weight on conductance in combined objective (Eq. 5).
max_iter (int) – Max iterations for PageRank power iteration.
tol (float) – Convergence tolerance for PageRank.
min_cluster (int) – Minimum cluster size for sweep cut consideration.
seed (Optional[int]) – Random seed for reproducibility.
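To make the roles of `teleport_prob`, `max_iter`, and `tol` concrete, here is a generic personalized PageRank power iteration in pure Python. It is a sketch under stated assumptions, not the library's code: `personalized_pagerank`, `build_adjacency`, and the toy graph are hypothetical, every node is assumed to have at least one neighbor, and convergence is measured as the L1 change between iterations.

```python
def build_adjacency(edges):
    """Turn (u, v, weight) triples into a symmetric adjacency mapping."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def personalized_pagerank(adj, personalization, teleport_prob=0.1,
                          max_iter=100, tol=1e-6):
    """Power iteration for p = alpha * s + (1 - alpha) * P^T p."""
    nodes = list(adj)
    total = sum(personalization.values())
    # Normalize the teleport vector; nodes outside the seed set get zero weight.
    s = {u: personalization.get(u, 0.0) / total for u in nodes}
    p = dict(s)
    for _ in range(max_iter):
        nxt = {u: teleport_prob * s[u] for u in nodes}
        for u in nodes:
            strength = sum(adj[u].values())
            for v, w in adj[u].items():
                # Each node spreads its mass to neighbors in proportion to edge weight.
                nxt[v] += (1.0 - teleport_prob) * p[u] * w / strength
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy graph: two triangles joined by a bridge, seeded at node "a".
edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
         ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),
         ("c", "d", 1.0)]
adj = build_adjacency(edges)
scores = personalized_pagerank(adj, {"a": 1.0}, teleport_prob=0.1)
```

A larger `teleport_prob` keeps more probability mass near the seed nodes, which tends to produce smaller, more local clusters.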

generate_weighted_personalization(nodes: List[Any], alpha_max: float | None = None) → Dict[Any, float]

Build personalization vector based on each node’s correlation contribution.

Parameters:

nodes (List[Any]) – Seed node list.
alpha_max (Optional[float]) – Maximum teleportation weight.

Returns:

Personalization mapping {node: weight}.

Return type:

Dict[Any, float]

phen_omics_corr(nodes: List[Any]) → Tuple[float, float]

Compute Pearson(PC1(omics[:, nodes]), phenotype).

Parameters: nodes (List[Any]) – List of node identifiers.
Returns: (correlation, p_value). Returns (0.0, 1.0) on failure.
Return type: Tuple[float, float]
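The quantity Pearson(PC1(omics[:, nodes]), phenotype) can be illustrated without any dependencies. The sketch below projects samples onto the first principal component using power iteration and then correlates the scores with the phenotype. The helpers `pearson` and `first_pc_scores` and the toy data are assumptions for this example; the p-value part of the documented return tuple is omitted, and the sign of PC1 is arbitrary, so only the magnitude of the correlation is meaningful here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def first_pc_scores(rows, n_iter=200):
    """Project samples (rows) onto the leading principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    v = [1.0 / math.sqrt(p)] * p  # initial direction
    for _ in range(n_iter):
        # One power-iteration step on X^T X, computed as X^T (X v).
        scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


# Hypothetical data: 4 samples x 2 selected features, plus a phenotype vector.
omics = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
phenotype = [0.1, 0.2, 0.3, 0.4]
rho = pearson(first_pc_scores(omics), phenotype)
print(abs(rho) > 0.99)
```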

run(seed_nodes: List[Any]) → Dict[str, Any]

Execute Correlated PageRank clustering.

Parameters: seed_nodes (List[Any]) – Nodes to use as the teleport set.
Returns: Cluster performance and node list.
Return type: Dict

sweep_cut(pr_scores: Dict[Any, float]) → Dict[str, Any]

Identify the best cluster via sweep cut on PageRank scores.

Parameters: pr_scores (Dict[Any, float]) – Mapping of nodes to PageRank scores.
Returns: Best cluster details including nodes, conductance, and composite score.
Return type: Dict
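A sweep cut ranks nodes by score and evaluates each prefix of the ranking as a candidate cluster. The sketch below is a simplified, conductance-only version: it does not include the phenotype correlation term of the documented composite score, it ranks by raw score rather than degree-normalized score, and `sweep_cut_sketch`, `conductance`, and the toy inputs are hypothetical names introduced for illustration.

```python
def conductance(adj, cluster):
    """Cut weight divided by the smaller of the two side volumes."""
    cluster = set(cluster)
    cut = sum(w for u in cluster for v, w in adj[u].items() if v not in cluster)
    vol_in = sum(sum(adj[u].values()) for u in cluster)
    vol_total = sum(sum(nbrs.values()) for nbrs in adj.values())
    denom = min(vol_in, vol_total - vol_in)
    return cut / denom if denom > 0 else float("inf")


def sweep_cut_sketch(adj, pr_scores, min_cluster=2):
    """Return the score-ranked prefix with the lowest conductance."""
    order = sorted(pr_scores, key=pr_scores.get, reverse=True)
    best_nodes, best_phi = [], float("inf")
    # Prefixes smaller than min_cluster and the full node set are skipped.
    for k in range(min_cluster, len(order)):
        phi = conductance(adj, order[:k])
        if phi < best_phi:
            best_nodes, best_phi = order[:k], phi
    return {"nodes": best_nodes, "conductance": best_phi}


# Hypothetical toy graph: two triangles joined by the bridge edge c-d.
adj = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 1.0},
    "d": {"c": 1.0, "e": 1.0, "f": 1.0},
    "e": {"d": 1.0, "f": 1.0},
    "f": {"d": 1.0, "e": 1.0},
}
pr_scores = {"a": 0.30, "b": 0.25, "c": 0.20, "d": 0.10, "e": 0.08, "f": 0.07}
result = sweep_cut_sketch(adj, pr_scores)
print(result["nodes"])  # ['a', 'b', 'c']
```

The best prefix here is the first triangle: cutting only the bridge edge gives a conductance of 1/7, lower than any other prefix.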

class bioneuralnet.DPMON(adjacency_matrix: DataFrame, omics_list: List[DataFrame], phenotype_data: DataFrame, clinical_data: DataFrame | None = None, correlation_mode: str = 'abs_pearson', model: str = 'GAT', phenotype_col: str = 'phenotype', gnn_hidden_dim: int = 16, gnn_layer_num: int = 4, gnn_dropout: float = 0.1, gnn_activation: str = 'relu', dim_reduction: str = 'ae', ae_architecture: str = 'original', ae_encoding_dim: int = 8, nn_hidden_dim1: int = 16, nn_hidden_dim2: int = 8, num_epochs: int = 100, repeat_num: int = 1, n_folds: int = 5, lr: float = 0.1, weight_decay: float = 0.0001, gat_heads: int = 1, tune: bool = False, tune_trials: int = 20, gpu: bool = False, cv: bool = False, cuda: int = 0, seed: int = 1804, seed_trials: bool = False, output_dir: str | None = None)

Bases: object

DPMON (Disease Prediction using Multi-Omics Networks) end-to-end pipeline for multi-omics disease prediction.

Instead of node-level MSE regression, DPMON aggregates node embeddings with patient-level omics data and feeds them to a downstream classification head (e.g., a softmax layer with CrossEntropyLoss) for sample-level disease prediction. This end-to-end setup leverages both local (node-level) and global (patient-level) network information.

adjacency_matrix

Adjacency matrix of the feature-level network; index/columns are feature names.

Type: pd.DataFrame

omics_list

List of omics data matrices or a single merged omics DataFrame (samples x features).

Type: List[pd.DataFrame] | pd.DataFrame

phenotype_data

Phenotype labels used for supervision.

Type: pd.DataFrame | pd.Series

clinical_data

Optional clinical covariates (samples x clinical features); may be None.

Type: Optional[pd.DataFrame]

phenotype_col

Column name in phenotype_data that stores the target labels.

Type: str

model

GNN backbone; one of {“GCN”, “GAT”, “SAGE”, “GIN”}.

Type: str

gnn_hidden_dim

Hidden dimension size of GNN layers.

Type: int

gnn_layer_num

Number of stacked GNN layers.

Type: int

gnn_dropout

Dropout rate applied within the GNN.

Type: float

gnn_activation

Non-linear activation used in GNN layers (e.g., “relu”).

Type: str

dim_reduction

Dimensionality reduction strategy for omics input (e.g., “ae” for autoencoder).

Type: str

ae_encoding_dim

Encoding dimension of the autoencoder bottleneck if dim_reduction="ae".

Type: int

nn_hidden_dim1

Hidden dimension of the first fully connected layer in the downstream classifier.

Type: int

nn_hidden_dim2

Hidden dimension of the second fully connected layer in the downstream classifier.

Type: int

num_epochs

Number of training epochs per run.

Type: int

repeat_num

Number of repeated training runs (for repeated train/test splits or repeated CV).

Type: int

n_folds

Number of folds to use when cv=True.

Type: int

lr

Learning rate for the optimizer.

Type: float

weight_decay

L2 weight decay (regularization) coefficient.

Type: float

tune

If True, perform hyperparameter tuning before final training.

Type: bool

tune_trials

Number of trials to perform if tune=True.

Type: int

gpu

If True, use GPU if available.

Type: bool

cv

If True, use K-fold cross-validation; otherwise use repeated train/test splits.

Type: bool

cuda

CUDA device index to use when gpu=True.

Type: int

seed

Random seed for reproducibility.

Type: int

seed_trials

If True, use a fixed seed for hyperparameter sampling to ensure reproducibility across trials.

Type: bool

output_dir

Directory where logs, checkpoints, and results are written.

Type: Path

run() → Tuple[pd.DataFrame, object, torch.Tensor | None]

Execute the DPMON pipeline.

This method aligns the graph and omics features, optionally performs hyperparameter tuning, and then trains and evaluates the chosen GNN model using either K-fold cross-validation (cv=True) or repeated train/test splits (cv=False). It returns prediction outputs, a metrics/config object, and optionally the learned embeddings.

Returns:

A tuple (predictions_df, metrics, embeddings) where:

predictions_df (pd.DataFrame): If cv=False, per-sample predictions with actual vs. predicted labels; if cv=True, aggregated CV performance or fold-level results, depending on the backend.
metrics (object): Dictionary or configuration object containing evaluation metrics and, when tuning is enabled, information about the selected hyperparameters.
embeddings (torch.Tensor | None): Learned embedding tensor (e.g., node or sample embeddings) if produced by the training routine, otherwise None.

Return type:

Tuple[pd.DataFrame, object, torch.Tensor | None]
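A minimal usage sketch for the pipeline follows, using only the constructor arguments and the `run()` return tuple documented above. The wrapper function `run_dpmon` is hypothetical, the inputs are assumed to be prepared pandas objects whose feature names match the adjacency matrix, and the import is deferred so the snippet can be loaded without the package installed.

```python
def run_dpmon(adjacency_matrix, omics_list, phenotype_data,
              clinical_data=None, seed=1804):
    """Train and evaluate DPMON with 5-fold cross-validation (illustrative wrapper)."""
    # Deferred import: bioneuralnet is only needed when the function is called.
    from bioneuralnet import DPMON

    dpmon = DPMON(
        adjacency_matrix=adjacency_matrix,  # feature-level network
        omics_list=omics_list,              # list of samples x features DataFrames
        phenotype_data=phenotype_data,      # labels in the column named by phenotype_col
        clinical_data=clinical_data,
        phenotype_col="phenotype",
        model="GAT",
        cv=True,
        n_folds=5,
        tune=False,
        gpu=False,
        seed=seed,
    )
    predictions_df, metrics, embeddings = dpmon.run()
    return predictions_df, metrics, embeddings
```

Setting `cv=False` switches to repeated train/test splits, in which case `repeat_num` controls the number of runs.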

class bioneuralnet.DatasetLoader(dataset_name: str)

Bases: object

Load a pre-packaged multi-omics dataset from the package.

Options for ‘dataset_name’:

“example”: Synthetic example.
“monet”: Synthetic example.
“brca”: Breast invasive carcinoma.
“lgg”: Brain Lower Grade Glioma.
“kipan”: Pan-kidney carcinoma.

Parameters:

dataset_name (str) – Normalized dataset name.
base_dir (Path) – Directory where the dataset folders live.
data (dict[str, pd.DataFrame]) – Mapping from table name to loaded DataFrame.

property shape: dict[str, tuple[int, int]]

Dictionary mapping each table name to its (n_rows, n_cols) shape.

class bioneuralnet.GNNEmbedding(adjacency_matrix: DataFrame, omics_data: DataFrame, phenotype_data: Series | DataFrame, clinical_data: DataFrame | None = None, phenotype_col: str = 'phenotype', model_type: str = 'GAT', hidden_dim: int = 64, layer_num: int = 4, dropout: bool | float = True, num_epochs: int = 100, lr: float = 0.001, weight_decay: float = 0.0001, gpu: bool = False, activation: str = 'relu', seed: int | None = None, tune: bool | None = False, output_dir: str | Path | None = None)

Bases: object

GNNEmbedding Class for Generating Graph Neural Network (GNN) Based Embeddings.

adjacency_matrix: pd.DataFrame

omics_data: pd.DataFrame

phenotype_data: pd.DataFrame

clinical_data: Optional[pd.DataFrame]

phenotype_col: str

model_type: str

hidden_dim: int

layer_num: int

dropout: Union[bool, float] (if bool, True maps to 0.5, False to 0.0)

num_epochs: int

lr: float

weight_decay: float

gpu: bool

seed: Optional[int]

tune: Optional[bool]
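The bool-or-float convention for `dropout` described in the attribute list can be written as a small normalization step. The helper `resolve_dropout` below is hypothetical and only mirrors the documented mapping (True to 0.5, False to 0.0); the range check is an added assumption, not documented behavior.

```python
def resolve_dropout(dropout):
    """Map a bool-or-float dropout setting to a numeric rate."""
    # bool must be tested first, because bool is a subclass of int in Python.
    if isinstance(dropout, bool):
        return 0.5 if dropout else 0.0
    rate = float(dropout)
    # Assumed validation: a dropout probability must lie in [0, 1).
    if not 0.0 <= rate < 1.0:
        raise ValueError(f"dropout must be in [0, 1), got {rate}")
    return rate


print(resolve_dropout(True), resolve_dropout(False), resolve_dropout(0.3))
```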

embed(as_df: bool = False) → torch.Tensor | DataFrame

Generates node embeddings. If tuning is enabled, runs hyperparameter tuning and uses the best configuration.

fit() → None

Trains the GNN model using the provided data.

run_gnn_embedding_tuning(num_samples=20)

Run hyperparameter tuning with Ray Tune.

class bioneuralnet.HybridLouvain(G: Graph | DataFrame, B: DataFrame, Y: DataFrame | Series, k_L: float = 0.8, teleport_prob: float = 0.05, k_P: float = 0.7, max_iter: int = 10, min_nodes: int = 3, weight: str = 'weight', seed: int | None = None)

Bases: object

Hybrid Louvain-PageRank for significant subgraph detection.

Iteratively refines a multi-omics network by alternating:

Correlated Louvain to find the most phenotype-associated community
Correlated PageRank to refine that community via sweep cut

The graph shrinks each iteration. The best subgraph by rho is tracked across all iterations and returned.

Parameters:

G (Union[nx.Graph, pd.DataFrame]) – Weighted undirected graph or adjacency matrix DataFrame.
B (pd.DataFrame) – Omics data (n_samples x n_features).
Y (Union[pd.DataFrame, pd.Series]) – Phenotype vector.
k_L (float) – Weight on modularity for Correlated Louvain.
teleport_prob (float) – Teleportation probability for PageRank (alpha).
k_P (float) – Weight on conductance for PageRank sweep cut.
max_iter (int) – Maximum Hybrid iterations.
min_nodes (int) – Stop if graph shrinks below this size.
weight (str) – Edge attribute name for weights.
seed (Optional[int]) – Random seed.

property best_subgraph: Tuple[List[Any], float, int]

Retrieves the nodes and performance metrics of the best subgraph found.

Returns: (nodes, rho, iteration_index).
Return type: Tuple[List[Any], float, int]

property iterations: List[Dict[str, Any]]

Provides access to per-iteration details from the most recent run.

Returns: A list of result dictionaries for each iteration.
Return type: List[Dict[str, Any]]

run(as_dfs: bool = False) → Dict[str, Any] | List[DataFrame]

Execute the Hybrid Louvain-PageRank algorithm.

Returns:

A dictionary with the following keys:

best_nodes: nodes of the highest-rho subgraph
best_correlation: float
best_iteration: int
iterations: full per-iteration metadata
all_subgraphs: {iteration_index: [nodes]}

Return type:

Dict
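A usage sketch for the hybrid search follows, built only from the documented constructor arguments and result keys. The wrapper `find_best_subgraph` is hypothetical, and it assumes that `run()` with the default `as_dfs=False` returns the dictionary form described above. The import is deferred so the snippet loads without the package installed.

```python
def find_best_subgraph(G, B, Y, seed=0):
    """Run HybridLouvain and return (nodes, correlation, iteration) of the best subgraph."""
    # Deferred import: bioneuralnet is only needed when the function is called.
    from bioneuralnet import HybridLouvain

    hybrid = HybridLouvain(
        G,                   # nx.Graph or adjacency-matrix DataFrame
        B,                   # omics data, n_samples x n_features
        Y,                   # phenotype vector
        k_L=0.8,             # weight on modularity in Correlated Louvain
        teleport_prob=0.05,  # PageRank teleportation probability
        k_P=0.7,             # weight on conductance in the sweep cut
        max_iter=10,
        min_nodes=3,
        seed=seed,
    )
    result = hybrid.run()  # assumed to be the dict form when as_dfs=False
    return (result["best_nodes"],
            result["best_correlation"],
            result["best_iteration"])
```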

class bioneuralnet.SubjectRepresentation(omics_data: DataFrame, embeddings: DataFrame, phenotype_data: DataFrame | None = None, phenotype_col: str = 'phenotype', reduce_method: str = 'AE', seed: int | None = None, tune: bool | None = False, output_dir: str | Path | None = None)

Bases: object

SubjectRepresentation Class for Integrating Network Embeddings into Omics Data.

This class integrates network-derived embeddings with raw omics data to create enriched subject-level profiles. It supports dimensionality reduction of embeddings (via Autoencoders or other methods) and subsequent fusion with original omics features.

omics_data

DataFrame of omics features (columns).

Type: pd.DataFrame

embeddings

DataFrame with embeddings (indexed by feature names).

Type: pd.DataFrame

phenotype_data

Optional DataFrame with phenotype labels.

Type: Optional[pd.DataFrame]

phenotype_col

Name of the phenotype column.

Type: str

reduce_method

Method used for dimensionality reduction (e.g., “AE”).

Type: str

seed

Random seed for reproducibility.

Type: Optional[int]

tune

Whether to run hyperparameter tuning.

Type: bool

output_dir

Directory where results are written.

Type: Path

run() → DataFrame

Executes the Subject Representation workflow.

If tuning is enabled, runs hyperparameter tuning and uses the best config to reduce embeddings. Otherwise, uses the default reduction method.

Returns: Enhanced omics data as a DataFrame.
Return type: pd.DataFrame
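The embedding and subject-representation classes are designed to be chained: node embeddings from `GNNEmbedding` feed `SubjectRepresentation`. The sketch below uses only documented signatures; the wrapper `build_subject_representation` is hypothetical, and it assumes that `fit()` is called before `embed()` and that `embed(as_df=True)` returns a DataFrame indexed by feature names, as `SubjectRepresentation` expects.

```python
def build_subject_representation(adjacency_matrix, omics_data, phenotype_data, seed=0):
    """Generate GNN node embeddings and fuse them with omics data (illustrative wrapper)."""
    # Deferred import: bioneuralnet is only needed when the function is called.
    from bioneuralnet import GNNEmbedding, SubjectRepresentation

    gnn = GNNEmbedding(
        adjacency_matrix=adjacency_matrix,
        omics_data=omics_data,
        phenotype_data=phenotype_data,  # a DataFrame, so it is valid for both classes
        phenotype_col="phenotype",
        model_type="GAT",
        hidden_dim=64,
        layer_num=4,
        dropout=0.5,
        num_epochs=100,
        seed=seed,
    )
    gnn.fit()                           # assumed to be required before embed()
    embeddings = gnn.embed(as_df=True)  # DataFrame form of the node embeddings

    subject_rep = SubjectRepresentation(
        omics_data=omics_data,
        embeddings=embeddings,
        phenotype_data=phenotype_data,
        phenotype_col="phenotype",
        reduce_method="AE",
        seed=seed,
    )
    return subject_rep.run()            # enhanced omics data as a DataFrame
```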

bioneuralnet.auto_pysmccnet(X: List[DataFrame | ndarray], Y: DataFrame | ndarray, AdjustedCovar: DataFrame | None = None, preprocess: bool = False, Kfold: int = 5, subSampNum: int = 100, DataType: List[str] | None = None, BetweenShrinkage: float = 2.0, ScalingPen: List[float] = [0.1, 0.1], saving_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bioneuralnet/checkouts/latest/docs/source', tuneLength: int = 5, tuneRangeCCA: List[float] = [0.1, 0.5], tuneRangePLS: List[float] = [0.5, 0.9], EvalMethod: str = 'accuracy', ncomp_pls: int = 3, seed: int = 123, CutHeight: float = 0.9999999999, min_size: int = 10, max_size: int = 100, summarization: str = 'NetSHy', precomputed_fold_data: dict | None = None, device: torch.device | None = 'cpu', dtype: torch.dtype = torch.float64, rename: bool = True) → dict

Automated SmCCNet workflow with GPU acceleration.

Runs the complete SmCCNet pipeline supporting both CCA (continuous phenotype) and PLS (binary phenotype) modes. The workflow includes optional preprocessing, cross-validation for penalty tuning, subsampling for stability selection, and final network construction.

Parameters:

X (List[pd.DataFrame | np.ndarray]) – Input data matrices (omics layers) for integration.
Y (pd.DataFrame | np.ndarray) – Phenotype vector; numeric for CCA or binary (0/1) for PLS.
AdjustedCovar (pd.DataFrame | None) – Optional covariates to regress out from X before analysis.
preprocess (bool) – If True, center and scale data; if False, use raw input.
Kfold (int) – Number of cross-validation folds for penalty parameter tuning.
subSampNum (int) – Number of subsampling iterations for stability selection.
DataType (List[str] | None) – Names for each omics layer in X; defaults to generic names if None.
BetweenShrinkage (float) – Shrinkage factor for between-omics scaling weights.
ScalingPen (List[float]) – Penalty terms used for determining scaling factors.
saving_dir (str) – Directory path for saving output results.
tuneLength (int) – Number of candidate penalty parameters to test per omics layer.
tuneRangeCCA (List[float]) – Min and max penalty values for CCA (continuous phenotype).
tuneRangePLS (List[float]) – Min and max penalty values for PLS (binary phenotype).
EvalMethod (str) – Metric for PLS evaluation; one of ‘accuracy’, ‘auc’, ‘precision’, ‘recall’, or ‘f1’.
ncomp_pls (int) – Number of latent components to use for PLS models.
CutHeight (float) – Height threshold for hierarchical tree cutting in module extraction.
min_size (int) – Minimum number of nodes to retain a network module.
max_size (int) – Maximum module size; larger modules are pruned down.
summarization (str) – Network summarization method. Currently only ‘NetSHy’ is supported.
seed (int) – Random seed for reproducibility.
precomputed_fold_data (dict | None) – Precomputed CV folds to bypass internal fold generation.
device (torch.device | None) – PyTorch device; if None, automatically selects GPU if available.
dtype (torch.dtype) – PyTorch data type for computations.
rename (bool) – If True, prefix datatype to column names; if False, use original column names.

Returns:

Dictionary containing results for ‘CCA’ or ‘PLS’ including adjacency matrices, processed data, and CV results.

Return type:

dict
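A usage sketch that builds a network from the bundled synthetic dataset follows. It relies only on the documented parameters and on the dataset keys listed for `load_example()` ('X1', 'X2', 'Y'). The wrapper `build_example_network` and the layer names passed to `DataType` are hypothetical, and `saving_dir` is passed explicitly so that outputs land in a known location. The import is deferred so the snippet loads without the package installed.

```python
def build_example_network(output_dir, seed=123):
    """Run the automated SmCCNet workflow on the synthetic example data."""
    # Deferred import: bioneuralnet is only needed when the function is called.
    from bioneuralnet import auto_pysmccnet, load_example

    data = load_example()  # documented keys include 'X1', 'X2', 'Y', 'clinical'
    return auto_pysmccnet(
        X=[data["X1"], data["X2"]],   # one matrix per omics layer
        Y=data["Y"],                  # phenotype vector
        DataType=["omics1", "omics2"],  # hypothetical layer names
        preprocess=True,              # center and scale before analysis
        Kfold=5,
        subSampNum=100,
        saving_dir=str(output_dir),
        seed=seed,
    )
```

The returned dictionary is documented to contain adjacency matrices, processed data, and CV results; its exact keys are not listed in this reference, so inspect the result before indexing into it.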

bioneuralnet.get_logger(name: str) → Logger

Retrieves a global logger configured to write to ‘bioneuralnet.log’.

Parameters: name (str) – Name of the logger.
Returns: Configured logger instance.
Return type: logging.Logger

bioneuralnet.load_brca() → dict

Load the Breast Invasive Carcinoma (BRCA) dataset.

Returns: Keys include ‘mirna’, ‘target’, ‘clinical’, ‘rna’, ‘methylation’.
Return type: dict

bioneuralnet.load_example() → dict

Load the synthetic Example dataset.

Returns: Keys include ‘X1’, ‘X2’, ‘Y’, ‘clinical’.
Return type: dict

bioneuralnet.load_kipan() → dict

Load the Pan-kidney (KIPAN) dataset.

Returns: Keys include ‘mirna’, ‘target’, ‘clinical’, ‘rna’, ‘methylation’.
Return type: dict

bioneuralnet.load_lgg() → dict

Load the Brain Lower Grade Glioma dataset.

Returns: Keys include ‘mirna’, ‘target’, ‘clinical’, ‘rna’, ‘methylation’.
Return type: dict

bioneuralnet.load_monet() → dict

Load the synthetic MONET dataset.

Returns: Keys include ‘gene’, ‘mirna’, ‘phenotype’, ‘rppa’, ‘clinical’.
Return type: dict

bioneuralnet.set_seed(seed_value: int) → None

Sets seeds for maximum reproducibility across Python, NumPy, and PyTorch.

This function sets global random seeds and configures PyTorch/cuDNN to use deterministic algorithms, so that repeated runs in the same software and hardware environment produce the same numerical results.

Parameters: seed_value (int) – The integer value to use as the random seed.
Returns: None
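For readers who want to see what seeding across Python, NumPy, and PyTorch typically involves, here is a generic sketch. It is not the body of `set_seed`; the function name `seed_everything` is hypothetical, and the NumPy and PyTorch steps are skipped when those packages are not installed.

```python
import random


def seed_everything(seed_value):
    """Seed the common random number generators used in a PyTorch workflow."""
    random.seed(seed_value)
    try:
        import numpy as np
        np.random.seed(seed_value)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)  # silently ignored without CUDA
        # Prefer deterministic cuDNN kernels over auto-tuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(123)
first = [random.random() for _ in range(3)]
seed_everything(123)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Re-seeding before each experiment, rather than once per session, keeps results independent of whatever random draws happened earlier in the process.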

Modules

`clustering`	Network Clustering and Subgraph Detection.
`datasets`	Built-in datasets for BioNeuralNet.
`downstream_task`	Downstream task pipelines for BioNeuralNet.
`external_tools`	External Tools Module
`metrics`	Metrics and visualization tools for BioNeuralNet.
`network`	Network Construction and Analysis.
`network_embedding`	Network embedding modules for BioNeuralNet.
`utils`	Utility Module