Source code for bioneuralnet.clustering.correlated_pagerank
r"""
Correlated PageRank Clustering.
This module implements a personalized PageRank algorithm combined with a
phenotype-aware sweep cut to detect significant subgraphs.
References:
Abdel-Hafiz et al. (2022), "Significant Subgraph Detection in
Multi-omics Networks for Disease Pathway Identification,"
Frontiers in Big Data.
Algorithm:
The personalized PageRank vector :math:`pr_{\alpha}(s)` is the stationary
distribution of a random walk with restart, i.e. the solution of:

.. math::

    pr_{\alpha}(s) = \alpha s + (1 - \alpha) \, pr_{\alpha}(s) W

Where:
* :math:`\alpha`: Teleportation (restart) probability.
* :math:`s`: Personalization vector (seed weights).
* :math:`W`: Transition matrix.

.. important::

    ``networkx.pagerank`` takes an ``alpha`` parameter that is the
    **damping factor** (the probability of following a link), not the
    teleportation probability. Therefore,
    :math:`\alpha_{\text{nx}} = 1 - \alpha_{\text{theoretical}}`.

Notes:
**Sweep Cut Optimization**
Nodes are sorted by PageRank-per-degree in descending order. For each
prefix set :math:`S_i` of this ordering, the algorithm evaluates the
**Hybrid Conductance** and keeps the prefix that minimizes it:

.. math::

    \Phi_{\text{hybrid}} = k_P \Phi + (1 - k_P) \rho

Where:
* :math:`\Phi`: Standard conductance (:math:`\text{cut}(S_i) / \min(\text{vol}(S_i), \text{vol}(V \setminus S_i))`).
* :math:`\rho`: Correlation term, the negative absolute Pearson correlation of the prefix set with the phenotype (:math:`-|\rho(S_i)|`).
* :math:`k_P`: Trade-off weight (Default: ~0.5).
**Personalization Vector (Seed Weighting)**
Teleportation probabilities for seeds are weighted by each seed's marginal
contribution to the correlation:

.. math::

    \alpha_i = \frac{\rho_i}{\max_{j \in \text{seeds}} \rho_j} \cdot \alpha_{\max}

Where :math:`\rho_i = |\rho(S)| - |\rho(S \setminus \{i\})|`.
Negative contributions (:math:`\rho_i < 0`) are clamped to 0.
"""
from typing import Any, Dict, List, Optional, Tuple, Union
import networkx as nx
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from ..utils import get_logger
logger = get_logger(__name__)
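

# ---------------------------------------------------------------------------
# Illustrative sketch (an aid for readers, not this module's implementation).
# The three helpers below restate the formulas from the module docstring on
# plain Python data so they can be checked by hand. Every `_sketch_*` name,
# the dict-of-dicts graph layout, the `abs_corr` callback, and the fallback
# for all-zero seed contributions are hypothetical and chosen for this example.
# ---------------------------------------------------------------------------


def _sketch_nx_damping(teleport_alpha: float) -> float:
    # Convert the docstring's teleportation probability into the damping
    # factor expected by the `alpha` argument of `networkx.pagerank`.
    if not 0.0 < teleport_alpha < 1.0:
        raise ValueError("teleport_alpha must lie strictly between 0 and 1")
    return 1.0 - teleport_alpha


def _sketch_seed_teleport_weights(marginal_corr: dict, alpha_max: float) -> dict:
    # marginal_corr maps seed -> rho_i = |rho(S)| - |rho(S without i)|.
    # Negative contributions are clamped to 0 before scaling by alpha_max.
    clamped = {seed: max(rho, 0.0) for seed, rho in marginal_corr.items()}
    top = max(clamped.values(), default=0.0)
    if top == 0.0:
        # Assumption: if no seed contributes positively, weight all seeds equally.
        return {seed: alpha_max for seed in clamped}
    return {seed: alpha_max * rho / top for seed, rho in clamped.items()}


def _sketch_hybrid_sweep(adj: dict, scores: dict, abs_corr, k_p: float = 0.5):
    # adj: node -> {neighbour: weight} for an undirected graph without self-loops.
    # scores: node -> PageRank value.
    # abs_corr: callable taking a set of nodes and returning |rho| in [0, 1].
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    total_vol = sum(degree.values())
    # Sweep order: PageRank per degree, descending; zero-degree nodes are skipped.
    order = sorted(
        (u for u in scores if degree.get(u, 0.0) > 0.0),
        key=lambda u: scores[u] / degree[u],
        reverse=True,
    )
    best_set, best_value = set(), float("inf")
    prefix, vol, cut = set(), 0.0, 0.0
    for u in order:
        # Edges from u into the prefix stop being cut edges;
        # the remaining edges of u become cut edges.
        inside = sum(w for v, w in adj[u].items() if v in prefix)
        cut += degree[u] - 2.0 * inside
        vol += degree[u]
        prefix.add(u)
        denom = min(vol, total_vol - vol)
        if denom <= 0.0:
            continue  # conductance is undefined when one side has no volume
        # Hybrid conductance: k_P * Phi + (1 - k_P) * (-|rho|).
        value = k_p * (cut / denom) + (1.0 - k_p) * (-abs_corr(set(prefix)))
        if value < best_value:
            best_set, best_value = set(prefix), value
    return best_set, best_value


# Example with toy values: a teleportation probability of 0.15 corresponds to
# a damping factor of 0.85, which is what `networkx.pagerank` would receive as
# `alpha`. The sweep returns the prefix with the lowest hybrid conductance
# together with that value.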