bioneuralnet.network.pysmccnet

Sparse Multiple Canonical Correlation Network (SmCCNet 2.0).

This module implements the SmCCNet pipeline for multi-omics network inference using PyTorch with optional CUDA acceleration. It integrates multiple omics data types to construct sparse biological networks associated with a phenotype of interest.

Developed in collaboration with the Kechris Lab at CU Anschutz.

References

Liu et al. (2024), “SmCCNet 2.0: A Comprehensive Tool for Multi-omics Network Inference with Shiny Visualization,” BMC Bioinformatics.

Shi et al. (2019), “Unsupervised Discovery of Phenotype-Relevant Multi-omics Networks,” Bioinformatics.

Vu et al. (2023), “NetSHy: Network Summarization via a Hybrid Approach Leveraging Topological Properties,” Bioinformatics.

Notes

Sparse Canonical Correlation Analysis (SCCA)

The core optimization finds sparse weight vectors that maximize cross-covariance between data matrices under L1/L2 constraints:

\[
\begin{aligned}
\max_{u,v} \quad & u^T X^T Y v \\
\text{s.t.} \quad & \|u\|_2 \leq 1, \; \|v\|_2 \leq 1, \; \|u\|_1 \leq c_1, \; \|v\|_1 \leq c_2
\end{aligned}
\]

Where \(X, Y\) are standardized data matrices and \(c_1, c_2\) control sparsity via L1 penalty parameters.
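To make the constrained problem above concrete, here is a minimal NumPy sketch of one common way to solve it: alternating soft-thresholded updates with a bisection search for the threshold that satisfies the L1 bound. This is an illustration only and is not taken from the module's PyTorch code; the helper names (`soft_threshold`, `l1_constrained_unit_vector`, `scca_rank1`) are hypothetical, and the sketch assumes \(c_1, c_2 \geq 1\) and synthetic data.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l1_constrained_unit_vector(a, c, n_bisect=60):
    """Scale a to unit L2 norm; if its L1 norm exceeds c, soft-threshold it
    first, using bisection to find a threshold that meets the L1 bound."""
    u = a / np.linalg.norm(a)
    if np.abs(u).sum() <= c:
        return u
    lo, hi = 0.0, np.max(np.abs(a))
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        s = soft_threshold(a, mid)
        # Equivalent to ||s||_1 / ||s||_2 <= c, written to avoid dividing by zero.
        if np.abs(s).sum() <= c * np.linalg.norm(s):
            hi = mid
        else:
            lo = mid
    s = soft_threshold(a, hi)
    return s / np.linalg.norm(s)

def scca_rank1(X, Y, c1, c2, n_iter=50):
    """Alternate updates of u and v on the cross-product matrix X^T Y."""
    K = X.T @ Y
    v = np.linalg.svd(K, full_matrices=False)[2][0]  # first right singular vector
    for _ in range(n_iter):
        u = l1_constrained_unit_vector(K @ v, c1)
        v = l1_constrained_unit_vector(K.T @ u, c2)
    return u, v

# Synthetic, standardized data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
Y = X[:, :2] @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(40, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

u, v = scca_rank1(X, Y, c1=1.5, c2=1.5)
print("nonzero entries in u:", int(np.count_nonzero(u)))
print("nonzero entries in v:", int(np.count_nonzero(v)))
```

Smaller values of `c1` and `c2` push more entries of `u` and `v` to exactly zero, which is what makes the resulting network sparse.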

Multi-omics SmCCA Objective

Extends SCCA to jointly maximize pairwise correlations across multiple omics layers with scaling factors:

\[ \max \; \sum_{i < j} w_{ij} \cdot u_i^T X_i^T X_j u_j \]

Where \(w_{ij}\) balances the contribution of each omics pair and \(u_i\) is the sparse weight vector for omics layer \(i\).
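As a small worked illustration of the objective above, the following sketch evaluates the weighted pairwise sum for fixed weight vectors. The function name `smcca_objective` and the dictionary layout for the scaling factors are assumptions made for this example, not part of the module's API.

```python
from itertools import combinations
import numpy as np

def smcca_objective(layers, weights, scaling):
    """Sum of w_ij * u_i^T X_i^T X_j u_j over all pairs i < j.

    layers  : list of (n_samples, p_i) arrays
    weights : list of length-p_i weight vectors u_i
    scaling : dict mapping (i, j) with i < j to the scaling factor w_ij
    """
    # Project each layer onto its weight vector once: X_i u_i.
    scores = [X @ u for X, u in zip(layers, weights)]
    total = 0.0
    for i, j in combinations(range(len(layers)), 2):
        total += scaling[(i, j)] * float(scores[i] @ scores[j])
    return total

# Tiny hand-checkable example with three layers and two samples.
layers = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[2.0], [0.0]]),
    np.array([[1.0], [1.0]]),
]
weights = [np.array([1.0, 0.0]), np.array([1.0]), np.array([1.0])]
scaling = {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0}

value = smcca_objective(layers, weights, scaling)
print(value)  # 1.0*2 + 0.5*1 + 2.0*2 = 6.5
```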

Sparse PLS-DA (Binary Phenotype)

For binary or categorical phenotypes, sparse partial least squares discriminant analysis extracts latent factors via soft-thresholded direction estimation:

\[
\begin{aligned}
w_{\text{new}} &= S\bigl(X^T y, \; \eta \cdot \max|X^T y|\bigr) \\
S(z, \lambda) &= \operatorname{sign}(z) \cdot \max(|z| - \lambda, \; 0)
\end{aligned}
\]

Latent factors are then weighted by logistic regression coefficients to produce phenotype-discriminative canonical weights.
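The soft-thresholded direction estimate can be written in a few lines of NumPy. This is a minimal sketch of the two formulas above only; the function names are hypothetical, and the subsequent logistic-regression weighting is not shown.

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0), applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_direction(X, y, eta):
    """w_new = S(X^T y, eta * max|X^T y|) for eta in [0, 1)."""
    z = X.T @ y
    return soft_threshold(z, eta * np.max(np.abs(z)))

# One sample, three features, so X^T y is simply [4, -2, 1].
X = np.array([[4.0, -2.0, 1.0]])
y = np.array([1.0])

w_new = sparse_direction(X, y, eta=0.5)
print(w_new)  # threshold is 0.5 * 4 = 2, so only the first entry survives (value 2)
```

Larger values of `eta` raise the threshold relative to the strongest feature and therefore zero out more entries.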

Network Adjacency Construction

The global similarity matrix is built by averaging outer products of absolute weight vectors across subsampling iterations:

\[ \bar{A} = \frac{1}{K} \sum_{k=1}^{K} |w^{(k)}| \cdot |w^{(k)}|^T \]

Where \(w^{(k)}\) is the weight vector from the \(k\)-th subsample and \(\bar{A}\) is normalized to a maximum of 1.0.
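The averaging step can be sketched as follows. The function name `average_adjacency` is hypothetical, and the sketch assumes each subsample's weight vector has already been expanded to the full feature length, with zeros for features that were not sampled.

```python
import numpy as np

def average_adjacency(weight_vectors):
    """Average the outer products |w||w|^T over K subsamples, then rescale
    so that the largest entry equals 1.0."""
    W = np.abs(np.asarray(weight_vectors, dtype=float))  # shape (K, p)
    A = W.T @ W / W.shape[0]  # equals (1/K) * sum_k |w_k| |w_k|^T
    max_val = A.max()
    return A / max_val if max_val > 0 else A

# Two subsamples over three features (illustrative values).
A_bar = average_adjacency([[1.0, 0.0, -2.0],
                           [0.0, 1.0, 0.0]])
print(A_bar)
```

In this toy case features 0 and 2 are selected together in the first subsample, so they share a nonzero edge, while feature 1 is connected to neither.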

Network Module Extraction

Hierarchical clustering (complete linkage) on \(1 - \bar{A}\) with a user-defined cut height partitions features into modules. Modules are pruned to a target size range by iteratively removing the lowest-degree node, then summarized via the NetSHy hybrid approach (Laplacian-weighted PCA).
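The degree-based pruning step can be illustrated with a short sketch. It covers pruning only, not the clustering or the NetSHy summarization. The function name `prune_module` is hypothetical, and defining a node's degree as the sum of its edge weights within the module (excluding the diagonal) is an assumption made for this example.

```python
import numpy as np

def prune_module(A, nodes, max_size):
    """Repeatedly drop the node with the smallest weighted degree inside the
    module until at most max_size nodes remain."""
    nodes = list(nodes)
    while len(nodes) > max_size:
        sub = A[np.ix_(nodes, nodes)]
        degree = sub.sum(axis=1) - np.diag(sub)  # ignore self-similarity
        nodes.pop(int(np.argmin(degree)))
    return nodes

# Four-node similarity matrix in which node 3 is only weakly connected.
A = np.array([
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
])

kept = prune_module(A, nodes=[0, 1, 2, 3], max_size=3)
print(kept)  # node 3 has the lowest degree (0.4) and is removed first
```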

Algorithm:

The automated pipeline proceeds through five sequential phases:

  1. Preprocessing: Optional covariate regression, centering, and scaling of each omics layer.

  2. Scaling Factor Determination: Pairwise canonical correlations between omics layers are computed and shrunk by a user-defined factor to balance between-omics and omics-phenotype contributions.

  3. Cross-Validation: K-fold CV over a penalty grid selects the sparsity parameters that minimize the ratio of prediction error to test canonical correlation (CCA) or maximize the chosen classification metric (PLS; accuracy by default).

  4. Subsampling: The selected penalties are applied across repeated feature subsamples to construct a stable global adjacency matrix.

  5. Module Extraction: Hierarchical clustering and degree-based pruning produce final subnetworks, each summarized by NetSHy scores and their phenotype correlations.
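The CCA selection rule in phase 3 can be sketched as below. This is a hedged illustration, not the module's code: the function names are hypothetical, evenly spaced candidates between the range endpoints are an assumption, taking the absolute value of the test correlation is an assumption, and the cross-validation numbers are invented purely to show the mechanics.

```python
from itertools import product
import numpy as np

def penalty_grid(tune_range, tune_length, n_layers):
    """All combinations of tune_length evenly spaced penalties per omics layer."""
    candidates = np.linspace(tune_range[0], tune_range[1], tune_length)
    return list(product(candidates, repeat=n_layers))

def select_cca_penalty(cv_rows):
    """Pick the penalty combination with the smallest
    prediction error / |test canonical correlation| ratio."""
    best = min(cv_rows, key=lambda r: r["pred_error"] / abs(r["test_cc"]))
    return best["penalty"]

grid = penalty_grid([0.1, 0.5], tune_length=5, n_layers=2)
print(len(grid))  # 5 candidates per layer, 2 layers

# Made-up CV summaries for three penalty combinations (not real results).
cv_rows = [
    {"penalty": (0.1, 0.1), "pred_error": 0.30, "test_cc": 0.40},  # ratio 0.75
    {"penalty": (0.2, 0.3), "pred_error": 0.20, "test_cc": 0.50},  # ratio 0.40
    {"penalty": (0.5, 0.5), "pred_error": 0.25, "test_cc": 0.45},  # ratio ~0.56
]
best_penalty = select_cca_penalty(cv_rows)
print(best_penalty)
```

Dividing by the test correlation favors penalties that both generalize well and retain a strong held-out signal, rather than ones that only shrink the error.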

Functions

auto_pysmccnet(X, Y[, AdjustedCovar, ...])

Automated SmCCNet workflow with GPU acceleration.

bioneuralnet.network.pysmccnet.auto_pysmccnet(X: List[DataFrame | ndarray], Y: DataFrame | ndarray, AdjustedCovar: DataFrame | None = None, preprocess: bool = False, Kfold: int = 5, subSampNum: int = 100, DataType: List[str] | None = None, BetweenShrinkage: float = 2.0, ScalingPen: List[float] = [0.1, 0.1], saving_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bioneuralnet/checkouts/latest/docs/source', tuneLength: int = 5, tuneRangeCCA: List[float] = [0.1, 0.5], tuneRangePLS: List[float] = [0.5, 0.9], EvalMethod: str = 'accuracy', ncomp_pls: int = 3, seed: int = 123, CutHeight: float = 0.9999999999, min_size: int = 10, max_size: int = 100, summarization: str = 'NetSHy', precomputed_fold_data: dict | None = None, device: torch.device | None = 'cpu', dtype: torch.dtype = torch.float64, rename: bool = True) -> dict

Automated SmCCNet workflow with GPU acceleration.

Runs the complete SmCCNet pipeline supporting both CCA (continuous phenotype) and PLS (binary phenotype) modes. The workflow includes optional preprocessing, cross-validation for penalty tuning, subsampling for stability selection, and final network construction.

Parameters:
  • X (List[pd.DataFrame | np.ndarray]) – Input data matrices (omics layers) for integration.

  • Y (pd.DataFrame | np.ndarray) – Phenotype vector; numeric for CCA or binary (0/1) for PLS.

  • AdjustedCovar (pd.DataFrame | None) – Optional covariates to regress out from X before analysis.

  • preprocess (bool) – If True, center and scale data; if False, use raw input.

  • Kfold (int) – Number of cross-validation folds for penalty parameter tuning.

  • subSampNum (int) – Number of subsampling iterations for stability selection.

  • DataType (List[str] | None) – Names for each omics layer in X; defaults to generic names if None.

  • BetweenShrinkage (float) – Shrinkage factor for between-omics scaling weights.

  • ScalingPen (List[float]) – Penalty terms used for determining scaling factors.

  • saving_dir (str) – Directory path for saving output results.

  • tuneLength (int) – Number of candidate penalty parameters to test per omics layer.

  • tuneRangeCCA (List[float]) – Min and max penalty values for CCA (continuous phenotype).

  • tuneRangePLS (List[float]) – Min and max penalty values for PLS (binary phenotype).

  • EvalMethod (str) – Metric for PLS evaluation; one of ‘accuracy’, ‘auc’, ‘precision’, ‘recall’, or ‘f1’.

  • ncomp_pls (int) – Number of latent components to use for PLS models.

  • CutHeight (float) – Height threshold for hierarchical tree cutting in module extraction.

  • min_size (int) – Minimum number of nodes to retain a network module.

  • max_size (int) – Maximum module size; larger modules are pruned down.

  • summarization (str) – Network summarization method. Currently only ‘NetSHy’ is supported.

  • seed (int) – Random seed for reproducibility.

  • precomputed_fold_data (dict | None) – Precomputed CV folds to bypass internal fold generation.

  • device (torch.device | None) – PyTorch device; defaults to ‘cpu’. If None, a GPU is selected automatically when available.

  • dtype (torch.dtype) – PyTorch data type for computations.

  • rename (bool) – If True, prefix datatype to column names; if False, use original column names.

Returns:

Dictionary containing results for ‘CCA’ or ‘PLS’ including adjacency matrices, processed data, and CV results.

Return type:

dict
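A call on synthetic data might look like the sketch below. The import path and keyword names follow the signature documented above; everything else is an assumption made for illustration, including the helper names `build_inputs` and `run_example`, the layer names, the data shapes, and the reduced `Kfold` and `subSampNum` values. The structure of the returned dictionary is not assumed here, so the example only lists its keys.

```python
import tempfile
import numpy as np
import pandas as pd

def build_inputs(n_samples=60, seed=123):
    """Create two synthetic omics layers and a continuous phenotype."""
    rng = np.random.default_rng(seed)
    genes = pd.DataFrame(rng.normal(size=(n_samples, 30)),
                         columns=[f"g{i}" for i in range(30)])
    mirna = pd.DataFrame(rng.normal(size=(n_samples, 20)),
                         columns=[f"m{i}" for i in range(20)])
    # Phenotype driven by one feature from each layer plus noise.
    signal = genes["g0"].to_numpy() + mirna["m0"].to_numpy()
    pheno = pd.DataFrame({"pheno": signal + rng.normal(scale=0.5, size=n_samples)})
    return [genes, mirna], pheno

def run_example():
    """Run the automated workflow; requires bioneuralnet to be installed."""
    from bioneuralnet.network.pysmccnet import auto_pysmccnet

    X, Y = build_inputs()
    result = auto_pysmccnet(
        X, Y,
        DataType=["gene", "mirna"],       # hypothetical layer names
        preprocess=True,                  # center and scale each layer
        Kfold=3,
        subSampNum=20,                    # reduced for a quick trial run
        saving_dir=tempfile.mkdtemp(),    # write outputs to a temp directory
        seed=123,
        device="cpu",
    )
    print(sorted(result.keys()))
    return result

X, Y = build_inputs()
print([layer.shape for layer in X], Y.shape)
```

Calling `run_example()` executes the full pipeline. Passing `saving_dir` explicitly keeps output files out of the current working directory.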

Modules

analysis

core

math_helpers

pipeline

Main SmCCNet pipeline.

wrappers