bioneuralnet.downstream_task

Downstream task pipelines for BioNeuralNet.

This module implements high-level workflows for analyzing patient data using network-derived insights. It includes DPMON (Disease Prediction using Multi-Omics Networks), an end-to-end pipeline that leverages GNNs (GCN, GAT, SAGE, GIN) to learn feature importance weights for supervised phenotype prediction. Additionally, it provides SubjectRepresentation, a class for fusing learned network embeddings with raw omics data via dimensionality reduction (AutoEncoder or PCA) to generate enriched patient profiles.

Classes

DPMON(adjacency_matrix, omics_list, ...[, ...])

DPMON (Disease Prediction using Multi-Omics Networks): an end-to-end pipeline for multi-omics disease prediction.

SubjectRepresentation(omics_data, embeddings)

SubjectRepresentation Class for Integrating Network Embeddings into Omics Data.

class bioneuralnet.downstream_task.DPMON(adjacency_matrix: DataFrame, omics_list: List[DataFrame], phenotype_data: DataFrame, clinical_data: DataFrame | None = None, correlation_mode: str = 'abs_pearson', model: str = 'GAT', phenotype_col: str = 'phenotype', gnn_hidden_dim: int = 16, gnn_layer_num: int = 4, gnn_dropout: float = 0.1, gnn_activation: str = 'relu', dim_reduction: str = 'ae', ae_architecture: str = 'original', ae_encoding_dim: int = 8, nn_hidden_dim1: int = 16, nn_hidden_dim2: int = 8, num_epochs: int = 100, repeat_num: int = 1, n_folds: int = 5, lr: float = 0.1, weight_decay: float = 0.0001, gat_heads: int = 1, tune: bool = False, tune_trials: int = 20, gpu: bool = False, cv: bool = False, cuda: int = 0, seed: int = 1804, seed_trials: bool = False, output_dir: str | None = None)[source]

Bases: object

DPMON (Disease Prediction using Multi-Omics Networks): an end-to-end pipeline for multi-omics disease prediction.

Instead of node-level MSE regression, DPMON aggregates node embeddings with patient-level omics data and feeds them to a downstream classification head (e.g., a softmax layer with CrossEntropyLoss) for sample-level disease prediction. This end-to-end setup leverages both local (node-level) and global (patient-level) network information.
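The fusion described above can be illustrated with a small NumPy sketch. It is not DPMON's implementation: the GNN is replaced by a random embedding matrix, the reduction to one weight per feature uses a plain mean (standing in for the autoencoder), and the names node_embeddings, feature_weights, and W are hypothetical. It only shows how node-level embeddings might reweight patient-level omics before a softmax classifier scored with cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, emb_dim, n_classes = 6, 4, 3, 2

# Patient-level omics matrix (samples x features); toy values.
omics = rng.normal(size=(n_samples, n_features))
# Stand-in for GNN output: one embedding per feature node (features x emb_dim).
node_embeddings = rng.normal(size=(n_features, emb_dim))
# Toy class labels, one per sample.
labels = np.array([0, 1, 0, 1, 1, 0])

# Assumption: collapse each node embedding to a single scalar weight.
# A mean is used here; an autoencoder or PCA could fill the same role.
feature_weights = node_embeddings.mean(axis=1)  # shape: (n_features,)

# Fuse: scale each omics column by its node-derived weight.
weighted_omics = omics * feature_weights  # broadcasts over samples

# Classification head: one linear layer followed by softmax.
W = rng.normal(scale=0.1, size=(n_features, n_classes))
logits = weighted_omics @ W
logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Cross-entropy: mean negative log-probability of the true class.
cross_entropy = -np.log(probs[np.arange(n_samples), labels]).mean()
predicted = probs.argmax(axis=1)
```

In an end-to-end setup, the gradient of cross_entropy would flow back through both the classifier and the GNN, so the feature weights are learned from the sample-level labels rather than fixed in advance.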

adjacency_matrix

Adjacency matrix of the feature-level network; index/columns are feature names.

Type:

pd.DataFrame

omics_list

List of omics data matrices or a single merged omics DataFrame (samples x features).

Type:

List[pd.DataFrame] | pd.DataFrame

phenotype_data

Phenotype labels used for supervision.

Type:

pd.DataFrame | pd.Series

clinical_data

Optional clinical covariates (samples x clinical features); may be None.

Type:

Optional[pd.DataFrame]

phenotype_col

Column name in phenotype_data that stores the target labels.

Type:

str

model

GNN backbone; one of {“GCN”, “GAT”, “SAGE”, “GIN”}.

Type:

str

gnn_hidden_dim

Hidden dimension size of GNN layers.

Type:

int

gnn_layer_num

Number of stacked GNN layers.

Type:

int

gnn_dropout

Dropout rate applied within the GNN.

Type:

float

gnn_activation

Non-linear activation used in GNN layers (e.g., “relu”).

Type:

str

dim_reduction

Dimensionality reduction strategy for omics input (e.g., “ae” for autoencoder).

Type:

str

ae_encoding_dim

Encoding dimension of the autoencoder bottleneck if dim_reduction="ae".

Type:

int

nn_hidden_dim1

Hidden dimension of the first fully connected layer in the downstream classifier.

Type:

int

nn_hidden_dim2

Hidden dimension of the second fully connected layer in the downstream classifier.

Type:

int

num_epochs

Number of training epochs per run.

Type:

int

repeat_num

Number of repeated training runs (for repeated train/test splits or repeated CV).

Type:

int

n_folds

Number of folds to use when cv=True.

Type:

int

lr

Learning rate for the optimizer.

Type:

float

weight_decay

L2 weight decay (regularization) coefficient.

Type:

float

tune

If True, perform hyperparameter tuning before final training.

Type:

bool

tune_trials

Number of trials to perform if tune=True.

Type:

int

gpu

If True, use GPU if available.

Type:

bool

cv

If True, use K-fold cross-validation; otherwise use repeated train/test splits.

Type:

bool

cuda

CUDA device index to use when gpu=True.

Type:

int

seed

Random seed for reproducibility.

Type:

int

seed_trials

If True, use a fixed seed for hyperparameter sampling to ensure reproducibility across trials.

Type:

bool

output_dir

Directory where logs, checkpoints, and results are written.

Type:

Path

run() Tuple[pd.DataFrame, object, torch.Tensor | None][source]

Execute the DPMON pipeline.

This method aligns the graph and omics features, optionally performs hyperparameter tuning, and then trains and evaluates the chosen GNN model using either K-fold cross-validation (cv=True) or repeated train/test splits (cv=False). It returns prediction outputs, a metrics/config object, and optionally the learned embeddings.
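The two preparatory ideas in this description, feature alignment and the choice of evaluation mode, can be sketched as follows. How DPMON performs these steps internally is not specified here; the sketch assumes alignment by feature-name intersection, unstratified splits, and a hypothetical 80/20 hold-out. The helper names kfold_splits and repeated_splits are illustrative only.

```python
import numpy as np
import pandas as pd

# --- Feature alignment (assumed: intersect by feature name) ---
adjacency = pd.DataFrame(
    [[0.0, 0.8, 0.1], [0.8, 0.0, 0.5], [0.1, 0.5, 0.0]],
    index=["g1", "g2", "g3"],
    columns=["g1", "g2", "g3"],
)
omics = pd.DataFrame(
    {"g3": [0.2, 0.4], "g1": [1.0, 0.9], "g4": [5.0, 6.0]},
    index=["s1", "s2"],
)
# Keep only features present in both, in the adjacency matrix's order.
shared = [f for f in adjacency.index if f in omics.columns]
adjacency_aligned = adjacency.loc[shared, shared]
omics_aligned = omics[shared]


# --- Evaluation modes ---
def kfold_splits(n_samples, n_folds, seed):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


def repeated_splits(n_samples, repeat_num, seed, test_fraction=0.2):
    """Yield repeat_num independent random (train_idx, test_idx) pairs."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    for _ in range(repeat_num):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]


cv_splits = list(kfold_splits(n_samples=10, n_folds=5, seed=1804))
holdout_splits = list(repeated_splits(n_samples=10, repeat_num=3, seed=1804))
```

With K-fold splitting every sample appears in exactly one test fold, whereas repeated hold-out draws an independent test set on each repeat, so some samples may be tested several times and others never.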

Returns:

A tuple (predictions_df, metrics, embeddings) where:

predictions_df (pd.DataFrame): If cv=False, per-sample predictions with actual vs. predicted labels; if cv=True, aggregated CV performance or fold-level results, depending on the backend.

metrics (object): Dictionary or configuration object containing evaluation metrics and, when tuning is enabled, information about the selected hyperparameters.

embeddings (torch.Tensor | None): Learned embedding tensor (e.g., node or sample embeddings) if produced by the training routine, otherwise None.

Return type:

Tuple[pd.DataFrame, object, torch.Tensor | None]
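A minimal usage sketch, using only the constructor arguments listed in the signature above. The toy data, the helper names make_toy_inputs and run_dpmon, and the choice of an absolute-Pearson adjacency matrix are assumptions made for illustration; whether a dataset this small trains meaningfully has not been checked.

```python
import numpy as np
import pandas as pd


def make_toy_inputs(n_samples=20, n_features=6, seed=1804):
    """Build toy inputs whose feature and sample names line up."""
    rng = np.random.default_rng(seed)
    features = [f"feat_{i}" for i in range(n_features)]
    samples = [f"sample_{i}" for i in range(n_samples)]

    # Omics matrix: samples x features.
    omics = pd.DataFrame(
        rng.normal(size=(n_samples, n_features)), index=samples, columns=features
    )
    # Feature-level network: absolute Pearson correlation, zeroed diagonal.
    corr = np.abs(np.corrcoef(omics.to_numpy(), rowvar=False))
    np.fill_diagonal(corr, 0.0)
    adjacency = pd.DataFrame(corr, index=features, columns=features)
    # Binary phenotype labels in the default "phenotype" column.
    phenotype = pd.DataFrame(
        {"phenotype": rng.integers(0, 2, size=n_samples)}, index=samples
    )
    return adjacency, omics, phenotype


def run_dpmon(adjacency, omics, phenotype):
    """Train DPMON with repeated train/test splits (cv=False) on the CPU."""
    # Imported lazily so the toy inputs can be built without the package.
    from bioneuralnet.downstream_task import DPMON

    dpmon = DPMON(
        adjacency_matrix=adjacency,
        omics_list=[omics],
        phenotype_data=phenotype,
        phenotype_col="phenotype",
        model="GAT",
        num_epochs=100,
        repeat_num=1,
        cv=False,
        tune=False,
        gpu=False,
        seed=1804,
    )
    predictions_df, metrics, embeddings = dpmon.run()
    return predictions_df, metrics, embeddings


adjacency, omics, phenotype = make_toy_inputs()
```

Calling run_dpmon(adjacency, omics, phenotype) requires BioNeuralNet and its dependencies to be installed. Setting cv=True together with n_folds switches the evaluation to K-fold cross-validation.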

class bioneuralnet.downstream_task.SubjectRepresentation(omics_data: DataFrame, embeddings: DataFrame, phenotype_data: DataFrame | None = None, phenotype_col: str = 'phenotype', reduce_method: str = 'AE', seed: int | None = None, tune: bool | None = False, output_dir: str | Path | None = None)[source]

Bases: object

SubjectRepresentation Class for Integrating Network Embeddings into Omics Data.

This class integrates network-derived embeddings with raw omics data to create enriched subject-level profiles. It supports dimensionality reduction of the embeddings (via an autoencoder or PCA) and subsequent fusion with the original omics features.
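The reduce-then-fuse idea can be sketched in NumPy with PCA as the reduction step. This is an illustration under stated assumptions, not the class's implementation: the embeddings are random, the reduction keeps only the first principal component, and the fusion rule (scaling each omics column by its feature's score) is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, emb_dim = 5, 4, 8

# Raw omics (samples x features) and per-feature embeddings (features x emb_dim).
omics = rng.normal(size=(n_samples, n_features))
embeddings = rng.normal(size=(n_features, emb_dim))

# PCA via SVD: center the embeddings, then project onto the first principal axis.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_pc_scores = centered @ vt[0]  # one scalar per feature

# Hypothetical fusion: scale each omics column by its feature's score.
enriched = omics * first_pc_scores  # same shape as omics
```

The result keeps the samples x features layout of the input, so it can replace the raw omics matrix in any downstream model.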

omics_data

DataFrame of omics data, with features as columns.

Type:

pd.DataFrame

embeddings

DataFrame with embeddings (indexed by feature names).

Type:

pd.DataFrame

phenotype_data

Optional DataFrame with phenotype labels.

Type:

Optional[pd.DataFrame]

phenotype_col

Name of the phenotype column.

Type:

str

reduce_method

Method used for dimensionality reduction (e.g., “AE”).

Type:

str

seed

Random seed for reproducibility.

Type:

Optional[int]

tune

Whether to run hyperparameter tuning.

Type:

bool

output_dir

Directory where results are written.

Type:

Path

run() DataFrame[source]

Executes the Subject Representation workflow.

If tuning is enabled, runs hyperparameter tuning and uses the best config to reduce embeddings. Otherwise, uses the default reduction method.

Returns:

Enhanced omics data as a DataFrame.

Return type:

pd.DataFrame
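A minimal usage sketch based on the constructor signature above. The toy data and the helper name build_subject_representation are assumptions for illustration; the embedding width of 8 is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
features = [f"feat_{i}" for i in range(5)]
samples = [f"sample_{i}" for i in range(10)]

# Omics data with features as columns.
omics = pd.DataFrame(
    rng.normal(size=(len(samples), len(features))), index=samples, columns=features
)
# Embeddings indexed by feature name, one row per omics column.
embeddings = pd.DataFrame(rng.normal(size=(len(features), 8)), index=features)


def build_subject_representation(omics, embeddings):
    """Fuse embeddings into the omics data using the autoencoder reduction."""
    # Imported lazily so the toy inputs can be built without the package.
    from bioneuralnet.downstream_task import SubjectRepresentation

    rep = SubjectRepresentation(
        omics_data=omics,
        embeddings=embeddings,
        reduce_method="AE",
        seed=0,
        tune=False,
    )
    return rep.run()  # enhanced omics data as a DataFrame
```

Calling build_subject_representation(omics, embeddings) requires BioNeuralNet to be installed. Passing tune=True instead runs hyperparameter tuning first and reduces the embeddings with the best configuration found.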

Modules

dpmon

DPMON: Optimized Network Embedding and Fusion for Disease Prediction.

subject_representation