Utils
The utils module provides helper functions for data preprocessing, feature selection,
statistical data exploration, network pruning, and reproducibility.
    from bioneuralnet.utils import (
        set_seed, data_stats, sparse_filter,
        laplacian_score, variance_threshold, mad_filter,
        pca_loadings, correlation_filter, importance_rf, top_anova_f_features,
        m_transform, impute_simple, impute_knn, normalize,
        clean_inf_nan, clean_internal, preprocess_clinical,
        prune_network, prune_network_by_quantile,
        network_remove_low_variance, network_remove_high_zero_fraction,
    )
Reproducibility
bioneuralnet.utils.reproducibility.set_seed(): Sets global random seeds for Python, NumPy, and PyTorch (including all CUDA GPU operations). Configures torch.backends.cudnn for deterministic algorithms.
Parameters: seed_value (int).

    from bioneuralnet.utils import set_seed
    set_seed(123)
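As a rough illustration of what seeding every random number generator involves, the sketch below seeds Python and NumPy, and PyTorch only if it happens to be installed. The name seed_everything_sketch is hypothetical, and the cuDNN flags shown are an assumption about what "deterministic algorithms" means here; this is not the library's implementation.

```python
import random

import numpy as np


def seed_everything_sketch(seed_value: int) -> None:
    """Hypothetical stand-in for set_seed; not the library's code."""
    random.seed(seed_value)      # Python's built-in RNG
    np.random.seed(seed_value)   # NumPy's legacy global RNG
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed: nothing more to seed
    torch.manual_seed(seed_value)           # CPU generator
    torch.cuda.manual_seed_all(seed_value)  # all CUDA devices; ignored without a GPU
    # Assumed cuDNN settings for deterministic behaviour
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# Re-seeding with the same value reproduces the same draws.
seed_everything_sketch(123)
first_draw = (random.random(), np.random.rand(3))
seed_everything_sketch(123)
second_draw = (random.random(), np.random.rand(3))
```

Calling the seeding function once at the top of a script, before any data splitting or model construction, is the usual pattern.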
Data Diagnostics
bioneuralnet.utils.data.data_stats(): Combines variance, zero fraction, expression, and NaN summaries for an omics DataFrame and emits actionable recommendations. The correlation summary is skipped by default.
Parameters: df, name (str label for logging), compute_correlation (bool, default False).

    from bioneuralnet.utils import data_stats
    data_stats(X_mirna, "miRNA")
    data_stats(X_meth, "Methylation", compute_correlation=False)
bioneuralnet.utils.data.sparse_filter(): Drops columns (features) whose missing fraction exceeds missing_fraction, then drops rows (samples) whose missing fraction exceeds the same threshold.
Parameters: df, missing_fraction (float, default 0.20).

    from bioneuralnet.utils import sparse_filter
    X_mirna = sparse_filter(X_mirna, missing_fraction=0.20)
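The two-stage filtering described for sparse_filter can be sketched in plain pandas. The helper name sparse_filter_sketch and the toy data are hypothetical, and the sketch assumes that row missingness is measured on the columns that survive the first step; the library may differ in such details.

```python
import numpy as np
import pandas as pd


def sparse_filter_sketch(df: pd.DataFrame, missing_fraction: float = 0.20) -> pd.DataFrame:
    # Step 1: keep columns whose NaN fraction does not exceed the threshold.
    col_missing = df.isna().mean(axis=0)
    kept = df.loc[:, col_missing <= missing_fraction]
    # Step 2: on the surviving columns, keep rows under the same threshold.
    row_missing = kept.isna().mean(axis=1)
    return kept.loc[row_missing <= missing_fraction]


toy = pd.DataFrame(
    {
        "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
        "f2": [np.nan, np.nan, np.nan, 4.0, 5.0],  # 60% missing: column dropped
        "f3": [1.0, 2.0, 3.0, 4.0, np.nan],        # 20% missing: column kept
        "f4": [1.0, 2.0, 3.0, 4.0, np.nan],        # 20% missing: column kept
        "f5": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)
# Sample s5 is missing 2 of the 4 remaining features (50%), so it is dropped.
filtered = sparse_filter_sketch(toy, missing_fraction=0.25)
```

Filtering columns first means a handful of very sparse features cannot cause otherwise complete samples to be discarded.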
bioneuralnet.utils.data.nan_summary(): Logs global and per-feature/per-sample NaN rates. Returns the global missing percentage as a float.
Parameters: df, name, missing_threshold (float, default 20.0).

bioneuralnet.utils.data.variance_summary(): Returns a dict of mean, median, min, max, and std of column variances. Optionally counts features below var_threshold.
Parameters: df, var_threshold (Optional[float]).

bioneuralnet.utils.data.zero_summary(): Returns a dict of statistics for the fraction of zeros per column. Optionally counts features above zero_threshold.
Parameters: df, zero_threshold (Optional[float]).

bioneuralnet.utils.data.expression_summary(): Returns a dict of mean, median, min, max, and std of feature means.
Parameters: df.

bioneuralnet.utils.data.correlation_summary(): Returns a dict of statistics for each feature's maximum pairwise absolute correlation. Fills the diagonal with 0 before computing the maximum.
Parameters: df.
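To make the quantity behind correlation_summary concrete, the following sketch computes each feature's maximum absolute correlation with any other feature, zeroing the diagonal first so that self-correlation is ignored. The helper name and the keys of the resulting dict are hypothetical; the source does not list the exact statistics returned.

```python
import numpy as np
import pandas as pd


def max_abs_correlation_sketch(df: pd.DataFrame) -> pd.Series:
    corr = df.corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)  # ignore each feature's correlation with itself
    return pd.Series(corr.max(axis=1), index=df.columns)


toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
        "c": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with "a" and "b"
    }
)
max_corr = max_abs_correlation_sketch(toy)
# Illustrative summary keys only; the library's dict may use other names.
summary = {"mean": float(max_corr.mean()), "max": float(max_corr.max())}
```

Without the diagonal fill, every feature would report a maximum of 1.0 and the summary would carry no information.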
Feature Selection
bioneuralnet.utils.feature_selection.laplacian_score(): Selects the top n_keep features by Laplacian Score, computed on a symmetric k-NN affinity graph built from standardized data. Lower scores indicate higher importance.
Parameters: df, n_keep (int), k_neighbors (int, default 5).

    from bioneuralnet.utils import laplacian_score
    X_selected = laplacian_score(X, n_keep=200, k_neighbors=5)
bioneuralnet.utils.feature_selection.variance_threshold(): Selects the top-k features by variance after applying clean_inf_nan.
Parameters: df, k (int, default 1000), ddof (int, default 0).

    from bioneuralnet.utils import variance_threshold
    X_prefiltered = variance_threshold(X, k=2000)
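The ranking step of variance_threshold amounts to sorting columns by variance and keeping the first k. The sketch below shows only that step under a hypothetical name; it omits the clean_inf_nan pass that the function is described as applying first.

```python
import pandas as pd


def top_k_variance_sketch(df: pd.DataFrame, k: int = 1000, ddof: int = 0) -> pd.DataFrame:
    variances = df.var(axis=0, ddof=ddof)
    # nlargest returns the k highest-variance columns in descending order
    top = variances.nlargest(min(k, df.shape[1])).index
    return df.loc[:, top]


toy = pd.DataFrame(
    {
        "low": [1.0, 1.1, 0.9, 1.0],
        "mid": [0.0, 1.0, 2.0, 3.0],
        "high": [0.0, 10.0, 20.0, 30.0],
    }
)
top_two = top_k_variance_sketch(toy, k=2)
```

Because variance depends on scale, this kind of filter is usually applied within a single omics layer rather than across layers measured in different units.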
bioneuralnet.utils.feature_selection.mad_filter(): Selects the top n_keep features by Median Absolute Deviation computed across samples.
Parameters: df, n_keep (int).

    from bioneuralnet.utils import mad_filter
    X_selected = mad_filter(X, n_keep=200)
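For comparison with variance-based ranking, here is a minimal sketch of ranking by Median Absolute Deviation (the median of absolute deviations from each column's median). The helper name and toy data are hypothetical.

```python
import pandas as pd


def mad_top_features_sketch(df: pd.DataFrame, n_keep: int) -> pd.DataFrame:
    median = df.median(axis=0)
    mad = (df - median).abs().median(axis=0)  # per-feature MAD across samples
    top = mad.nlargest(min(n_keep, df.shape[1])).index
    return df.loc[:, top]


toy = pd.DataFrame(
    {
        "outlier": [0.0, 0.0, 0.0, 0.0, 100.0],  # huge variance, but MAD is 0
        "spread": [1.0, 2.0, 3.0, 4.0, 5.0],     # MAD is 1
        "tight": [3.0, 3.0, 3.0, 3.1, 2.9],      # MAD is 0
    }
)
selected = mad_top_features_sketch(toy, n_keep=1)
```

The toy data shows the practical difference: a single extreme sample dominates the variance of "outlier", while MAD ranks the consistently varying "spread" feature first.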
bioneuralnet.utils.feature_selection.pca_loadings(): Selects features with the highest absolute PCA loading magnitudes, weighted by explained variance ratio. Scales data with StandardScaler before PCA.
Parameters: df, n_keep (int), n_components (int, default 50), seed (int, default 1883).

    from bioneuralnet.utils import pca_loadings
    X_selected = pca_loadings(X, n_keep=200, n_components=50)
bioneuralnet.utils.feature_selection.correlation_filter(): In unsupervised mode (y=None), ranks features by mean absolute inter-feature correlation and selects the top_k highest-ranked features. In supervised mode (y provided), ranks by absolute Pearson correlation with the target.
Parameters: X, y (pd.Series or None), top_k (int, default 1000).

    from bioneuralnet.utils import correlation_filter
    X_selected = correlation_filter(X, top_k=500)
    X_supervised = correlation_filter(X, y=y, top_k=500)
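The two ranking modes of correlation_filter can be sketched as follows. The helper name is hypothetical, and averaging over the other features only (diagonal excluded) is an assumption; including the diagonal would add the same constant to every feature and leave the ranking unchanged.

```python
import numpy as np
import pandas as pd


def correlation_rank_sketch(X: pd.DataFrame, y=None, top_k: int = 1000) -> pd.DataFrame:
    if y is None:
        # Unsupervised: mean absolute correlation with the other features
        corr = X.corr().abs().to_numpy().copy()
        np.fill_diagonal(corr, 0.0)
        scores = pd.Series(corr.sum(axis=1) / max(X.shape[1] - 1, 1), index=X.columns)
    else:
        # Supervised: absolute Pearson correlation with the target
        scores = X.corrwith(y).abs()
    top = scores.nlargest(min(top_k, X.shape[1])).index
    return X.loc[:, top]


y = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = pd.DataFrame(
    {
        "signal": [0.1, 0.9, 2.2, 2.8, 4.1, 5.0],   # tracks y closely
        "anti": [5.0, 4.0, 3.0, 2.0, 1.0, 0.0],     # perfectly anti-correlated with y
        "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0], # weakly related to everything
    }
)
unsupervised = correlation_rank_sketch(X, top_k=2)
supervised = correlation_rank_sketch(X, y=y, top_k=2)
```

Taking the absolute value means strongly anti-correlated features rank as highly as positively correlated ones in both modes.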
bioneuralnet.utils.feature_selection.importance_rf(): Fits a RandomForestClassifier (when y.nunique() <= 10) or a RandomForestRegressor and selects the top_k features with the highest feature_importances_. Drops zero-variance columns before fitting.
Parameters: X, y (pd.Series), top_k (int, default 1000), seed (int, default 119).

    from bioneuralnet.utils import importance_rf
    X_selected = importance_rf(X, y, top_k=200)
bioneuralnet.utils.feature_selection.top_anova_f_features(): Computes ANOVA F-statistics (f_classif or f_regression) with Benjamini-Hochberg FDR correction. Returns up to max_features features ordered by F-statistic, significant features first. Pads with the strongest non-significant features if needed.
Parameters: X, y (pd.Series), max_features (int), alpha (float, default 0.05), task ("classification" or "regression").

    from bioneuralnet.utils import top_anova_f_features
    X_selected = top_anova_f_features(X, y, max_features=200, alpha=0.05, task="classification")
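The Benjamini-Hochberg step and the "significant first, then pad" ordering can be illustrated with NumPy alone. Both helper names and all numbers below are hypothetical; the sketch takes F-statistics and p-values as given rather than computing them, and it is not the library's code.

```python
import numpy as np


def benjamini_hochberg_sketch(p_values) -> np.ndarray:
    """Return BH-adjusted p-values in the original feature order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def order_features_sketch(f_stats, p_adj, max_features: int, alpha: float = 0.05) -> np.ndarray:
    """Significant features first (by descending F), then the strongest remaining ones."""
    f_stats = np.asarray(f_stats, dtype=float)
    significant = np.asarray(p_adj) <= alpha
    # lexsort uses the last key as the primary key
    order = np.lexsort((-f_stats, ~significant))
    return order[:max_features]


p_adj = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.005])
picked = order_features_sketch(
    f_stats=[12.0, 3.0, 9.0, 1.5],
    p_adj=[0.02, 0.30, 0.04, 0.60],
    max_features=3,
)
```

In the second call only two features pass alpha=0.05, so the third slot is filled by the non-significant feature with the largest F-statistic.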
Preprocessing Utilities
bioneuralnet.utils.preprocess.m_transform(): Converts Beta values to M-values via log2(clip(B, eps, 1-eps) / (1 - clip(B, eps, 1-eps))). Non-numeric columns are coerced to numeric before transformation.
Parameters: df, eps (float, default 1e-6).

    from bioneuralnet.utils import m_transform
    X_meth = m_transform(X_meth, eps=1e-7)
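The Beta-to-M formula above translates directly into a few lines of pandas and NumPy. The helper name beta_to_m_sketch and the probe names are hypothetical; the point of the example is that clipping keeps Beta values of exactly 0 or 1 from producing infinite M-values.

```python
import numpy as np
import pandas as pd


def beta_to_m_sketch(df: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    beta = df.apply(pd.to_numeric, errors="coerce")  # coerce non-numeric entries to NaN
    clipped = beta.clip(lower=eps, upper=1.0 - eps)  # keep values strictly inside (0, 1)
    return np.log2(clipped / (1.0 - clipped))


betas = pd.DataFrame({"cg_a": [0.5, 0.8], "cg_b": [0.0, 1.0]})
m_values = beta_to_m_sketch(betas)
# cg_a: 0.5 maps to 0 and 0.8 maps to about 2 (log2 of 4)
# cg_b: the boundary values map to large but finite negative and positive M-values
```

A smaller eps widens the range of M-values that boundary Beta values can reach, which is the practical effect of the eps parameter.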
bioneuralnet.utils.preprocess.impute_simple(): Fills NaN values using the mean, median, or zero strategy via DataFrame.fillna.
Parameters: df, method (str, default "mean").

    from bioneuralnet.utils import impute_simple
    X_imputed = impute_simple(X, method="mean")
bioneuralnet.utils.preprocess.impute_knn(): KNN imputation via sklearn.impute.KNNImputer. Raises ValueError on non-numeric columns. Returns df unchanged if no NaNs are present.
Parameters: df, n_neighbors (int, default 5).

    from bioneuralnet.utils import impute_knn
    X_imputed = impute_knn(X, n_neighbors=5)
bioneuralnet.utils.preprocess.normalize(): Scales data using StandardScaler ("standard"), MinMaxScaler ("minmax"), or log2(x + 1) ("log2").
Parameters: df, method (str, default "standard").

    from bioneuralnet.utils import normalize
    X_norm = normalize(X, method="standard")
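For readers who want to see what the three methods compute, here is a pandas-only sketch of the same arithmetic. The helper name is hypothetical, and unlike scikit-learn's scalers this version does not guard against constant columns, which would produce NaN through division by zero.

```python
import numpy as np
import pandas as pd


def normalize_sketch(df: pd.DataFrame, method: str = "standard") -> pd.DataFrame:
    if method == "standard":
        # z-score with population std (ddof=0), the convention StandardScaler uses
        return (df - df.mean()) / df.std(ddof=0)
    if method == "minmax":
        # rescale each column to the [0, 1] range
        return (df - df.min()) / (df.max() - df.min())
    if method == "log2":
        return np.log2(df + 1.0)
    raise ValueError(f"Unknown method: {method!r}")


toy = pd.DataFrame({"x": [0.0, 1.0, 3.0], "y": [1.0, 3.0, 7.0]})
standardized = normalize_sketch(toy, "standard")  # each column: mean 0, std 1
rescaled = normalize_sketch(toy, "minmax")        # each column: min 0, max 1
logged = normalize_sketch(toy, "log2")            # x -> [0, 1, 2], y -> [1, 2, 3]
```

The "log2" option is a per-value transform, while "standard" and "minmax" depend on column statistics and so should be fitted on training data only when a held-out set is involved.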
bioneuralnet.utils.preprocess.clean_inf_nan(): Replaces inf/-inf with NaN, imputes NaNs with the column median, and drops zero-variance columns.
Parameters: df.

    from bioneuralnet.utils import clean_inf_nan
    X_clean = clean_inf_nan(X)
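The three cleaning steps listed for clean_inf_nan can be sketched in order as below. The helper name and toy values are hypothetical, and the order of operations is taken from the description above rather than from the library's source.

```python
import numpy as np
import pandas as pd


def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.replace([np.inf, -np.inf], np.nan)  # 1. infinities become NaN
    cleaned = cleaned.fillna(cleaned.median())       # 2. median imputation per column
    variances = cleaned.var(axis=0, ddof=0)
    return cleaned.loc[:, variances > 0]             # 3. drop constant columns


toy = pd.DataFrame(
    {
        "a": [1.0, np.inf, 3.0, 5.0],   # inf -> NaN -> median of [1, 3, 5] = 3
        "b": [2.0, 2.0, np.nan, 2.0],   # constant after imputation: dropped
        "c": [0.0, 1.0, -np.inf, 2.0],  # -inf -> NaN -> median of [0, 1, 2] = 1
    }
)
cleaned = clean_inf_nan_sketch(toy)
```

Converting infinities to NaN before computing medians matters: an infinite value left in place would otherwise distort or invalidate the column statistics used for imputation.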
bioneuralnet.utils.preprocess.clean_internal(): Drops columns with a NaN fraction above nan_threshold, removes zero-variance columns, and imputes remaining NaNs with the column median via SimpleImputer.
Parameters: df, nan_threshold (float, default 0.5).

    from bioneuralnet.utils import clean_internal
    X_clean = clean_internal(X, nan_threshold=0.5)
bioneuralnet.utils.preprocess.preprocess_clinical(): Drops specified columns, maps ordinal variables to numeric ranks, coerces continuous columns, one-hot encodes nominal categoricals (with dummy_na=True), optionally scales numeric columns with RobustScaler, and removes zero-variance columns. Returns a float32 DataFrame.
Parameters: X, scale (bool, default False), drop_columns (list or None), ordinal_mappings (dict or None), continuous_columns (list or None), impute (bool, default False).
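Several of the steps named for preprocess_clinical have direct pandas equivalents, sketched below on hypothetical data. The shape of ordinal_mappings (column name mapped to a category-to-rank dict) is an assumption, and the sketch leaves out column dropping, scaling, imputation, and zero-variance removal.

```python
import pandas as pd

clinical = pd.DataFrame(
    {
        "stage": ["I", "III", "II", None],  # ordinal
        "sex": ["F", "M", None, "F"],       # nominal
        "age": ["61", "47", "n/a", "55"],   # continuous, stored as text
    }
)
# Assumed mapping shape: {column: {category: rank}}
ordinal_mappings = {"stage": {"I": 1, "II": 2, "III": 3}}

encoded = clinical.copy()
for column, mapping in ordinal_mappings.items():
    encoded[column] = encoded[column].map(mapping)  # unmapped or missing values become NaN
encoded["age"] = pd.to_numeric(encoded["age"], errors="coerce")  # "n/a" becomes NaN
# dummy_na=True adds an explicit indicator column for missing categories
encoded = pd.get_dummies(encoded, columns=["sex"], dummy_na=True)
encoded = encoded.astype("float32")
```

With dummy_na=True, a missing nominal value is kept as its own sex_nan indicator instead of being silently encoded as all zeros.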