Utils

The utils module provides helper functions for data preprocessing, feature selection, statistical data exploration, network pruning, and reproducibility.

from bioneuralnet.utils import (
    set_seed, data_stats, sparse_filter,
    laplacian_score, variance_threshold, mad_filter,
    pca_loadings, correlation_filter, importance_rf, top_anova_f_features,
    m_transform, impute_simple, impute_knn, normalize,
    clean_inf_nan, clean_internal, preprocess_clinical,
    prune_network, prune_network_by_quantile,
    network_remove_low_variance, network_remove_high_zero_fraction,
)

Reproducibility

  • bioneuralnet.utils.reproducibility.set_seed(): Sets global random seeds for Python, NumPy, and PyTorch (including all CUDA GPU operations). Configures torch.backends.cudnn for deterministic algorithms.

    Parameters: seed_value (int).

    from bioneuralnet.utils import set_seed
    set_seed(123)
    

Data Diagnostics

  • bioneuralnet.utils.data.data_stats(): Combines variance, zero fraction, expression, and NaN summaries for an omics DataFrame and emits actionable recommendations. Correlation summary is skipped by default.

    Parameters: df, name (str label for logging), compute_correlation (bool, default False).

    from bioneuralnet.utils import data_stats
    data_stats(X_mirna, "miRNA")
    data_stats(X_meth,  "Methylation", compute_correlation=False)
    
  • bioneuralnet.utils.data.sparse_filter(): Drops columns (features) whose missing fraction exceeds missing_fraction, then drops rows (samples) whose missing fraction exceeds the same threshold.

    Parameters: df, missing_fraction (float, default 0.20).

    from bioneuralnet.utils import sparse_filter
    X_mirna = sparse_filter(X_mirna, missing_fraction=0.20)
    
  • bioneuralnet.utils.data.nan_summary(): Logs global and per-feature/per-sample NaN rates. Returns the global missing percentage as a float.

    Parameters: df, name, missing_threshold (float, default 20.0).

  • bioneuralnet.utils.data.variance_summary(): Returns a dict of mean, median, min, max, and std of column variances. Optionally counts features below var_threshold.

    Parameters: df, var_threshold (Optional[float]).

  • bioneuralnet.utils.data.zero_summary(): Returns a dict of statistics for the fraction of zeros per column. Optionally counts features above zero_threshold.

    Parameters: df, zero_threshold (Optional[float]).

  • bioneuralnet.utils.data.expression_summary(): Returns a dict of mean, median, min, max, and std of feature means.

    Parameters: df.

  • bioneuralnet.utils.data.correlation_summary(): Returns a dict of statistics for each feature’s maximum pairwise absolute correlation. Fills diagonal with 0 before computing max.

    Parameters: df.

Feature Selection

  • bioneuralnet.utils.feature_selection.laplacian_score(): Selects the top n_keep features by Laplacian Score computed on a symmetric k-NN affinity graph built from standardized data. Lower scores indicate higher importance.

    Parameters: df, n_keep (int), k_neighbors (int, default 5).

    from bioneuralnet.utils import laplacian_score
    X_selected = laplacian_score(X, n_keep=200, k_neighbors=5)
    
  • bioneuralnet.utils.feature_selection.variance_threshold(): Selects the top-k features by variance after applying clean_inf_nan.

    Parameters: df, k (int, default 1000), ddof (int, default 0).

    from bioneuralnet.utils import variance_threshold
    X_prefiltered = variance_threshold(X, k=2000)
    
  • bioneuralnet.utils.feature_selection.mad_filter(): Selects the top n_keep features by Median Absolute Deviation computed across samples.

    Parameters: df, n_keep (int).

    from bioneuralnet.utils import mad_filter
    X_selected = mad_filter(X, n_keep=200)
    
  • bioneuralnet.utils.feature_selection.pca_loadings(): Selects features with the highest absolute PCA loading magnitudes, weighted by explained variance ratio. Scales data with StandardScaler before PCA.

    Parameters: df, n_keep (int), n_components (int, default 50), seed (int, default 1883).

    from bioneuralnet.utils import pca_loadings
    X_selected = pca_loadings(X, n_keep=200, n_components=50)
    
  • bioneuralnet.utils.feature_selection.correlation_filter(): In unsupervised mode (y=None), ranks features by mean absolute inter-feature correlation and selects the top top_k. In supervised mode (y provided), ranks by absolute Pearson correlation with the target.

    Parameters: X, y (pd.Series or None), top_k (int, default 1000).

    from bioneuralnet.utils import correlation_filter
    X_selected = correlation_filter(X, top_k=500)
    X_supervised = correlation_filter(X, y=y, top_k=500)
    
  • bioneuralnet.utils.feature_selection.importance_rf(): Fits a RandomForestClassifier (when y.nunique() <= 10) or RandomForestRegressor and selects the top top_k features by feature_importances_. Drops zero-variance columns before fitting.

    Parameters: X, y (pd.Series), top_k (int, default 1000), seed (int, default 119).

    from bioneuralnet.utils import importance_rf
    X_selected = importance_rf(X, y, top_k=200)
    
  • bioneuralnet.utils.feature_selection.top_anova_f_features(): Computes ANOVA F-statistics (f_classif or f_regression) with Benjamini-Hochberg FDR correction. Returns up to max_features features ordered by F-statistic, significant features first. Pads with the strongest non-significant features if needed.

    Parameters: X, y (pd.Series), max_features (int), alpha (float, default 0.05), task ("classification" or "regression").

    from bioneuralnet.utils import top_anova_f_features
    X_selected = top_anova_f_features(X, y, max_features=200, alpha=0.05, task="classification")
    

Preprocessing Utilities

  • bioneuralnet.utils.preprocess.m_transform(): Converts Beta values to M-values via log2(clip(B, eps, 1-eps) / (1 - clip(B, eps, 1-eps))). Non-numeric columns are coerced to numeric before transformation.

    Parameters: df, eps (float, default 1e-6).

    from bioneuralnet.utils import m_transform
    X_meth = m_transform(X_meth, eps=1e-7)
    
  • bioneuralnet.utils.preprocess.impute_simple(): Fills NaN values using mean, median, or zero strategy via DataFrame.fillna.

    Parameters: df, method (str, default "mean").

    from bioneuralnet.utils import impute_simple
    X_imputed = impute_simple(X, method="mean")
    
  • bioneuralnet.utils.preprocess.impute_knn(): KNN imputation via sklearn.impute.KNNImputer. Raises ValueError on non-numeric columns. Returns df unchanged if no NaNs are present.

    Parameters: df, n_neighbors (int, default 5).

    from bioneuralnet.utils import impute_knn
    X_imputed = impute_knn(X, n_neighbors=5)
    
  • bioneuralnet.utils.preprocess.normalize(): Scales data using StandardScaler ("standard"), MinMaxScaler ("minmax"), or log2(x + 1) ("log2").

    Parameters: df, method (str, default "standard").

    from bioneuralnet.utils import normalize
    X_norm = normalize(X, method="standard")
    
  • bioneuralnet.utils.preprocess.clean_inf_nan(): Replaces inf/-inf with NaN, imputes NaNs with column median, and drops zero-variance columns.

    Parameters: df.

    from bioneuralnet.utils import clean_inf_nan
    X_clean = clean_inf_nan(X)
    
  • bioneuralnet.utils.preprocess.clean_internal(): Drops columns with NaN fraction above nan_threshold, removes zero-variance columns, and imputes remaining NaNs with column median via SimpleImputer.

    Parameters: df, nan_threshold (float, default 0.5).

    from bioneuralnet.utils import clean_internal
    X_clean = clean_internal(X, nan_threshold=0.5)
    
  • bioneuralnet.utils.preprocess.preprocess_clinical(): Drops specified columns, maps ordinal variables to numeric ranks, coerces continuous columns, one-hot encodes nominal categoricals (with dummy_na=True), optionally scales numeric columns with RobustScaler, and removes zero-variance columns. Returns float32 DataFrame.

    Parameters: X, scale (bool, default False), drop_columns (list or None), ordinal_mappings (dict or None), continuous_columns (list or None), impute (bool, default False).