bioneuralnet.network.pysmccnet
Sparse Multiple Canonical Correlation Network (SmCCNet 2.0).
This module implements the SmCCNet pipeline for multi-omics network inference using PyTorch with optional CUDA acceleration. It integrates multiple omics data types to construct sparse biological networks associated with a phenotype of interest.
Developed in collaboration with the Kechris Lab at CU Anschutz.
References
Liu et al. (2024), “SmCCNet 2.0: A Comprehensive Tool for Multi-omics Network Inference with Shiny Visualization,” BMC Bioinformatics.
Shi et al. (2019), “Unsupervised Discovery of Phenotype-Relevant Multi-omics Networks,” Bioinformatics.
Vu et al. (2023), “NetSHy: Network Summarization via a Hybrid Approach Leveraging Topological Properties,” Bioinformatics.
Notes
Sparse Canonical Correlation Analysis (SCCA). The core optimization finds sparse weight vectors that maximize the cross-covariance between two data matrices under L1/L2 constraints:
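In the penalized form that is standard in the sparse CCA literature, this objective can be written as follows. The symbols \(u\) and \(v\) for the two weight vectors, and the use of inequality rather than equality constraints, are assumptions here and may differ from the module's internal formulation.

\[
\max_{u,\,v} \; u^\top X^\top Y v
\quad \text{subject to} \quad
\|u\|_2^2 \le 1,\; \|v\|_2^2 \le 1,\; \|u\|_1 \le c_1,\; \|v\|_1 \le c_2
\]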
Where \(X\) and \(Y\) are standardized data matrices and \(c_1, c_2\) are the L1 penalty parameters that control sparsity.
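A minimal NumPy sketch of one way to solve a problem of this kind, using alternating soft-thresholding updates in the style of the penalized matrix decomposition. It is not the module's PyTorch implementation. The helper names are hypothetical, and `c1`, `c2` are treated here as absolute L1 bounds (at least 1); how the module maps its penalty range onto such bounds is not shown on this page.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise soft-thresholding: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_normalize(a):
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a

def sparse_direction(a, c):
    """Soft-threshold `a` and rescale to unit L2 norm so that the L1 norm is <= c."""
    u = l2_normalize(a)
    if np.abs(u).sum() <= c:
        return u  # constraint already satisfied without thresholding
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(50):  # binary search on the threshold
        mid = (lo + hi) / 2.0
        u = l2_normalize(soft_threshold(a, mid))
        if np.abs(u).sum() > c:
            lo = mid
        else:
            hi = mid
    return l2_normalize(soft_threshold(a, hi))

def sparse_cca(X, Y, c1, c2, n_iter=100):
    """Rank-1 sparse CCA sketch: alternate sparse updates of u and v."""
    K = X.T @ Y  # cross-covariance up to a constant factor
    v = np.linalg.svd(K, full_matrices=False)[2][0]  # leading right singular vector
    for _ in range(n_iter):
        u = sparse_direction(K @ v, c1)
        v = sparse_direction(K.T @ u, c2)
    return u, v

# Synthetic demo: the first column of X and of Y share a latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=50)
X = rng.normal(size=(50, 8))
Y = rng.normal(size=(50, 6))
X[:, 0] += 3 * z
Y[:, 0] += 3 * z
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
u, v = sparse_cca(X, Y, c1=1.5, c2=1.5)
```

Smaller bounds force more entries of `u` and `v` to exactly zero, which is what makes the resulting weights usable for feature selection.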
Multi-omics SmCCA Objective. Extends SCCA to jointly maximize pairwise correlations across multiple omics layers, weighted by scaling factors:
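One common way to write such an objective is sketched here. Treating the phenotype \(Y\) as an additional block with its own scaling factors \(w_{iY}\), the layer count \(T\), and the per-layer bounds \(c_i\) are assumptions for illustration, not details confirmed by this page.

\[
\max_{u_1, \dots, u_T} \;
\sum_{i < j} w_{ij}\, u_i^\top X_i^\top X_j u_j
\; + \; \sum_{i=1}^{T} w_{iY}\, u_i^\top X_i^\top Y
\quad \text{subject to} \quad
\|u_i\|_2^2 \le 1,\; \|u_i\|_1 \le c_i
\]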
Where \(w_{ij}\) balances the contribution of each omics pair and \(u_i\) is the sparse weight vector for omics layer \(i\).
Sparse PLS-DA (Binary Phenotype). For binary or categorical phenotypes, sparse partial least squares discriminant analysis extracts latent factors via soft-thresholded direction estimation:
Latent factors are then weighted by logistic regression coefficients to produce phenotype-discriminative canonical weights.
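The soft-thresholded direction step can be illustrated with a short NumPy sketch. This is an assumption-laden illustration, not the module's code: the function name, the `keep` parametrization of the threshold (a fraction of the largest absolute covariance), and the deflation scheme are all hypothetical, and the logistic-regression weighting of the latent factors is omitted.

```python
import numpy as np

def soft_threshold(a, lam):
    """Elementwise soft-thresholding: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_pls_factors(X, y, ncomp=2, keep=0.5):
    """Extract `ncomp` sparse PLS directions and latent factors.

    `keep` in (0, 1] is a hypothetical knob: the threshold is set to
    (1 - keep) * max|X^T y|, so smaller values give sparser directions.
    """
    X_res = X - X.mean(axis=0)  # centered working copy
    yc = y - y.mean()           # centered 0/1 phenotype
    W, T = [], []
    for _ in range(ncomp):
        z = X_res.T @ yc                      # covariance of each feature with y
        lam = (1.0 - keep) * np.abs(z).max()
        w = soft_threshold(z, lam)
        w = w / np.linalg.norm(w)             # unit-norm sparse direction
        t = X_res @ w                         # latent factor (scores)
        # Deflate: remove the part of X explained by this factor.
        X_res = X_res - np.outer(t, t @ X_res) / (t @ t)
        W.append(w)
        T.append(t)
    return np.column_stack(W), np.column_stack(T)

# Synthetic demo: only feature 0 separates the two classes.
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 10))
X[y == 1, 0] += 3.0
W, T = sparse_pls_factors(X, y, ncomp=2, keep=0.5)
```

In a full pipeline the columns of `T` would then be passed to a classifier, and its coefficients used to combine the columns of `W` into a single weight vector per feature.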
Network Adjacency Construction. The global similarity matrix is built by averaging outer products of absolute weight vectors across subsampling iterations:
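Written out, the averaging step described here corresponds to the following. The symbol \(K\) for the number of subsampling iterations and the unnormalized intermediate \(A\) are notation introduced for this sketch.

\[
A = \frac{1}{K} \sum_{k=1}^{K} \left| w^{(k)} \right| \left| w^{(k)} \right|^\top,
\qquad
\bar{A} = \frac{A}{\max_{i,j} A_{ij}}
\]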
Where \(w^{(k)}\) is the weight vector from the \(k\)-th subsample and \(\bar{A}\) is normalized to a maximum of 1.0.
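A small NumPy sketch of this construction, assuming the weight vectors from all subsamples have already been mapped back to the full feature space (zeros for features not drawn in a given subsample). The function name is hypothetical.

```python
import numpy as np

def average_adjacency(weights):
    """Average |w||w|^T over subsamples, then scale so the largest entry is 1.0.

    `weights` has shape (n_subsamples, n_features).
    """
    W = np.abs(np.asarray(weights, dtype=float))
    A = W.T @ W / W.shape[0]  # W.T @ W equals the sum of the outer products
    peak = A.max()
    return A / peak if peak > 0 else A

# Two subsamples over three features.
weights = [[0.6, 0.8, 0.0],
           [-1.0, 0.0, 0.0]]
A_bar = average_adjacency(weights)
```

Taking absolute values first means that features with opposite-signed weights still count as connected, so the matrix measures co-selection strength rather than direction.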
Network Module Extraction. Hierarchical clustering (complete linkage) on \(1 - \bar{A}\) with a user-defined cut height partitions features into modules. Modules are pruned to a target size range by iteratively removing the lowest-degree node, then summarized via the NetSHy hybrid approach (Laplacian-weighted PCA).
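The clustering and pruning steps can be sketched with SciPy as follows. This is an illustration under stated assumptions rather than the module's implementation: the function names are hypothetical, "degree" is taken to mean weighted degree within the module, and the NetSHy summarization step is not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def extract_modules(A_bar, cut_height, min_size=2):
    """Complete-linkage clustering on 1 - A_bar, cut at `cut_height`."""
    D = 1.0 - A_bar
    D = (D + D.T) / 2.0        # enforce exact symmetry
    np.fill_diagonal(D, 0.0)   # a feature is at distance 0 from itself
    Z = linkage(squareform(D, checks=False), method="complete")
    labels = fcluster(Z, t=cut_height, criterion="distance")
    modules = [np.flatnonzero(labels == k).tolist() for k in np.unique(labels)]
    return [m for m in modules if len(m) >= min_size]

def prune_module(A_bar, nodes, max_size):
    """Drop the lowest weighted-degree node until the module fits `max_size`."""
    nodes = list(nodes)
    while len(nodes) > max_size:
        sub = A_bar[np.ix_(nodes, nodes)].copy()
        np.fill_diagonal(sub, 0.0)  # ignore self-similarity
        degree = sub.sum(axis=1)
        nodes.pop(int(np.argmin(degree)))
    return nodes

# Toy adjacency with two blocks of three features each.
A_bar = np.zeros((6, 6))
A_bar[:3, :3] = 0.9
A_bar[3:, 3:] = 0.9
A_bar[0, 2] = A_bar[2, 0] = 0.8
A_bar[1, 2] = A_bar[2, 1] = 0.8
np.fill_diagonal(A_bar, 1.0)

modules = extract_modules(A_bar, cut_height=0.5)
pruned = prune_module(A_bar, modules[0], max_size=2)
```

With a cut height very close to 1 (the default shown in the signature is 0.9999999999), only features that never share nonzero weight end up in separate modules.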
- Algorithm:
The automated pipeline proceeds through five sequential phases:
1. Preprocessing: Optional covariate regression, centering, and scaling of each omics layer.
2. Scaling Factor Determination: Pairwise canonical correlations between omics layers are computed and shrunk by a user-defined factor to balance between-omics and omics-phenotype contributions.
3. Cross-Validation: K-fold CV over a penalty grid selects the sparsity parameters. In CCA mode, the selected penalties minimize the ratio of prediction error to test canonical correlation; in PLS mode, they maximize the chosen classification metric (accuracy by default).
4. Subsampling: The selected penalties are applied across repeated feature subsamples to construct a stable global adjacency matrix.
5. Module Extraction: Hierarchical clustering and degree-based pruning produce final subnetworks, each summarized by NetSHy scores and their phenotype correlations.
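The CCA-mode selection rule in the cross-validation phase can be sketched as follows. The grid construction (evenly spaced values combined across layers) and the function names are assumptions for illustration; the per-candidate prediction errors and test canonical correlations are taken as given inputs.

```python
import itertools
import numpy as np

def penalty_grid(tune_range, tune_length, n_layers):
    """All combinations of `tune_length` evenly spaced penalties per layer."""
    values = np.linspace(tune_range[0], tune_range[1], tune_length)
    return list(itertools.product(values, repeat=n_layers))

def select_cca_penalty(grid, pred_error, test_cc):
    """Pick the candidate minimizing prediction error / |test canonical correlation|."""
    score = np.asarray(pred_error, dtype=float) / np.abs(np.asarray(test_cc, dtype=float))
    return grid[int(np.argmin(score))], score

# Two omics layers, five candidate penalties each: 25 combinations.
grid = penalty_grid([0.1, 0.5], tune_length=5, n_layers=2)

# Placeholder CV summaries, one entry per grid point (fold-averaged values).
pred_error = np.full(len(grid), 0.2)
pred_error[7] = 0.05
test_cc = np.full(len(grid), 0.5)

best, score = select_cca_penalty(grid, pred_error, test_cc)
```

Dividing by the test canonical correlation penalizes candidates that generalize with a small error only because the correlation itself is close to zero.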
Functions
auto_pysmccnet: Automated SmCCNet workflow with GPU acceleration.
- bioneuralnet.network.pysmccnet.auto_pysmccnet(X: List[DataFrame | ndarray], Y: DataFrame | ndarray, AdjustedCovar: DataFrame | None = None, preprocess: bool = False, Kfold: int = 5, subSampNum: int = 100, DataType: List[str] | None = None, BetweenShrinkage: float = 2.0, ScalingPen: List[float] = [0.1, 0.1], saving_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bioneuralnet/checkouts/latest/docs/source', tuneLength: int = 5, tuneRangeCCA: List[float] = [0.1, 0.5], tuneRangePLS: List[float] = [0.5, 0.9], EvalMethod: str = 'accuracy', ncomp_pls: int = 3, seed: int = 123, CutHeight: float = 0.9999999999, min_size: int = 10, max_size: int = 100, summarization: str = 'NetSHy', precomputed_fold_data: dict | None = None, device: torch.device | None = 'cpu', dtype: torch.dtype = torch.float64, rename: bool = True) → dict
Automated SmCCNet workflow with GPU acceleration.
Runs the complete SmCCNet pipeline supporting both CCA (continuous phenotype) and PLS (binary phenotype) modes. The workflow includes optional preprocessing, cross-validation for penalty tuning, subsampling for stability selection, and final network construction.
- Parameters:
X (List[pd.DataFrame | np.ndarray]) – Input data matrices (omics layers) for integration.
Y (pd.DataFrame | np.ndarray) – Phenotype vector; numeric for CCA or binary (0/1) for PLS.
AdjustedCovar (pd.DataFrame | None) – Optional covariates to regress out from X before analysis.
preprocess (bool) – If True, center and scale data; if False, use raw input.
Kfold (int) – Number of cross-validation folds for penalty parameter tuning.
subSampNum (int) – Number of subsampling iterations for stability selection.
DataType (List[str] | None) – Names for each omics layer in X; defaults to generic names if None.
BetweenShrinkage (float) – Shrinkage factor for between-omics scaling weights.
ScalingPen (List[float]) – Penalty terms used for determining scaling factors.
saving_dir (str) – Directory path for saving output results.
tuneLength (int) – Number of candidate penalty parameters to test per omics layer.
tuneRangeCCA (List[float]) – Min and max penalty values for CCA (continuous phenotype).
tuneRangePLS (List[float]) – Min and max penalty values for PLS (binary phenotype).
EvalMethod (str) – Metric for PLS evaluation; one of ‘accuracy’, ‘auc’, ‘precision’, ‘recall’, or ‘f1’.
ncomp_pls (int) – Number of latent components to use for PLS models.
CutHeight (float) – Height threshold for hierarchical tree cutting in module extraction.
min_size (int) – Minimum number of nodes to retain a network module.
max_size (int) – Maximum module size; larger modules are pruned down.
summarization (str) – Network summarization method. Currently only ‘NetSHy’ is supported.
seed (int) – Random seed for reproducibility.
precomputed_fold_data (dict | None) – Precomputed CV folds to bypass internal fold generation.
device (torch.device | None) – PyTorch device; defaults to 'cpu'. If None, a GPU is selected automatically when available.
dtype (torch.dtype) – PyTorch data type for computations.
rename (bool) – If True, prefix datatype to column names; if False, use original column names.
- Returns:
Dictionary containing results for ‘CCA’ or ‘PLS’ including adjacency matrices, processed data, and CV results.
- Return type:
dict
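A hedged usage sketch based only on the signature and parameter descriptions above. The synthetic data, layer names, and reduced settings are illustrative assumptions; passing the phenotype as an (n, 1) array is also an assumption, and the keys of the returned dictionary are not documented here, so the example only lists them. The call is skipped when the package is not installed.

```python
import tempfile
import numpy as np

# Synthetic two-layer dataset with a shared latent signal (illustrative only).
rng = np.random.default_rng(123)
n = 60
latent = rng.normal(size=n)
genes = rng.normal(size=(n, 40))
proteins = rng.normal(size=(n, 30))
genes[:, :5] += latent[:, None]
proteins[:, :5] += latent[:, None]
phenotype = (latent + 0.5 * rng.normal(size=n)).reshape(-1, 1)  # continuous, so CCA mode

# Keyword names are taken from the documented signature; values are assumptions.
kwargs = dict(
    X=[genes, proteins],
    Y=phenotype,
    DataType=["genes", "proteins"],
    preprocess=True,
    Kfold=3,
    subSampNum=20,
    tuneLength=3,
    tuneRangeCCA=[0.1, 0.5],
    min_size=5,
    max_size=20,
    seed=123,
    saving_dir=tempfile.mkdtemp(),  # avoid relying on the default output path
    device="cpu",
)

try:
    from bioneuralnet.network.pysmccnet import auto_pysmccnet
except ImportError:
    auto_pysmccnet = None  # package not installed; skip the call

if auto_pysmccnet is not None:
    result = auto_pysmccnet(**kwargs)
    print(sorted(result.keys()))
```

Setting `saving_dir` explicitly is a deliberate choice here, since the default shown in the signature is a path from the documentation build machine.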
Modules
Main SmCCNet pipeline.