bioneuralnet.utils
Utility Module
This module provides a collection of helper functions for data preprocessing, feature selection, statistical data exploration, graph network pruning, and reproducibility.
Functions
- clean_inf_nan – Clean a numeric DataFrame by handling infinite values, imputing NaNs, and dropping zero-variance columns.
- clean_internal – Clean a numeric DataFrame by dropping sparse and constant columns and imputing remaining NaNs.
- correlation_filter – Select top-k features by correlation in supervised or unsupervised mode.
- correlation_summary – Computes summary statistics on the maximum pairwise (absolute) correlation observed for each feature.
- Prints a comprehensive set of key statistics for an omics DataFrame.
- expression_summary – Computes summary statistics for the mean expression (average value) of all features.
- get_logger – Retrieves a global logger configured to write to 'bioneuralnet.log'.
- importance_rf – Select top-k features using RandomForest feature importances.
- impute_knn – Imputes missing values (NaNs) using the K-Nearest Neighbors (KNN) approach.
- impute_simple – Imputes missing values (NaNs) in the DataFrame using a specified strategy.
- laplacian_score – Unsupervised feature selection via the Laplacian Score to address dimensionality.
- m_transform – Converts methylation Beta-values to M-values using log2 transformation.
- mad_filter – Select the top features by Median Absolute Deviation (MAD).
- nan_summary – Logs a report on the missing data (NaNs) in the DataFrame.
- network_remove_high_zero_fraction – Remove nodes from an adjacency matrix with a high fraction of zero entries.
- network_remove_low_variance – Remove nodes from an adjacency matrix whose connectivity pattern has very low variance.
- normalize – Scales or transforms feature data using common normalization techniques.
- pca_loadings – Select features with the highest absolute loadings across the top principal components.
- preprocess_clinical – Preprocess clinical data by cleaning, mapping ordinals, encoding nominals, and scaling.
- prune_network – Prune a weighted network by thresholding edge weights and removing isolated nodes.
- prune_network_by_quantile – Prune a weighted network using a quantile-based edge-weight threshold.
- set_seed – Sets seeds for maximum reproducibility across Python, NumPy, and PyTorch.
- Drops features (columns) and then samples (rows) that exceed the maximum missing data fraction.
- top_anova_f_features – Select top features using ANOVA F-test with FDR correction.
- variance_summary – Computes key summary statistics for the feature (column) variances within an omics DataFrame.
- variance_threshold – Select the top-k features with the highest variance after cleaning.
- zero_summary – Computes statistics on the fraction of zero values present in each feature (column).
- bioneuralnet.utils.clean_inf_nan(df: DataFrame) → DataFrame
Clean a numeric DataFrame by handling infinite values, imputing NaNs, and dropping zero-variance columns.
Infinite values are replaced with NaN, all NaNs are imputed using the column median, and any features with zero variance are removed.
- Parameters:
df (pd.DataFrame) – Input DataFrame containing numeric columns, potentially with inf and NaN values.
- Returns:
Cleaned DataFrame with finite values, no NaNs, and only columns with non-zero variance.
- Return type:
pd.DataFrame
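The three documented steps can be sketched with plain pandas and NumPy. This is an illustrative re-creation, not the library source; the helper name `clean_inf_nan_sketch` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-creation of the documented steps, not the library code."""
    out = df.replace([np.inf, -np.inf], np.nan)  # 1. infinite values become NaN
    out = out.fillna(out.median())               # 2. impute NaNs with column medians
    return out.loc[:, out.var() > 0]             # 3. drop zero-variance columns


demo = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],   # contains an infinite value
    "b": [2.0, 2.0, 2.0],      # constant column
    "c": [np.nan, 4.0, 6.0],   # contains a missing value
})
cleaned = clean_inf_nan_sketch(demo)
```

In this toy input, column `b` is constant and is dropped, while the infinite entry in `a` and the missing entry in `c` are replaced by their column medians.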
- bioneuralnet.utils.clean_internal(df: DataFrame, nan_threshold: float = 0.5) → DataFrame
Clean a numeric DataFrame by dropping sparse and constant columns and imputing remaining NaNs.
Columns with a fraction of missing values above nan_threshold are dropped, columns with zero variance are removed, and any remaining NaNs are imputed using the column median.
- Parameters:
df (pd.DataFrame) – Input numeric DataFrame to be cleaned.
nan_threshold (float) – Maximum allowed fraction of NaNs per column before the column is dropped.
- Returns:
Cleaned DataFrame with dense, non-constant columns and no remaining NaN values.
- Return type:
pd.DataFrame
- bioneuralnet.utils.correlation_filter(X: DataFrame, y: Series | None = None, top_k: int = 1000) → DataFrame
Select top-k features by correlation in supervised or unsupervised mode.
In supervised mode (y provided), features are ranked by their absolute correlation with the target. In unsupervised mode, features are ranked by their mean absolute correlation with all other features to reduce redundancy, and selection is applied after basic cleaning.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix with samples as rows and features as columns.
y (pd.Series | None) – Optional target vector; if provided, supervised selection is used, otherwise unsupervised redundancy-based selection.
top_k (int) – Number of features to select, capped at the number of available numeric features.
- Returns:
Numeric subset of X containing the selected features ordered by correlation-based ranking.
- Return type:
pd.DataFrame
- bioneuralnet.utils.correlation_summary(df: DataFrame) → dict
Computes summary statistics on the maximum pairwise (absolute) correlation observed for each feature.
This helps identify features that are highly redundant or collinear.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
- Returns:
A dictionary containing the mean, median, min, max, and std of the max absolute correlations.
- Return type:
dict
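A minimal sketch of the quantity being summarized: for each feature, the largest absolute correlation with any other feature. The helper name and the dictionary keys below are assumptions for illustration and may differ from what the library returns.

```python
import numpy as np
import pandas as pd


def max_abs_correlation(df: pd.DataFrame) -> pd.Series:
    """For each feature, the largest absolute correlation with any other feature."""
    corr = df.corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)  # ignore each feature's correlation with itself
    return pd.Series(corr.max(axis=0), index=df.columns)


demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with x
    "z": [1.0, -1.0, 1.0, -1.0],
})
max_corr = max_abs_correlation(demo)

# Key names are illustrative only.
summary = {
    "mean": max_corr.mean(),
    "median": max_corr.median(),
    "min": max_corr.min(),
    "max": max_corr.max(),
    "std": max_corr.std(),
}
```

A maximum close to 1 for a feature, as for `x` and `y` here, signals that at least one other feature carries nearly the same information.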
- bioneuralnet.utils.expression_summary(df: DataFrame) → dict
Computes summary statistics for the mean expression (average value) of all features.
Provides insight into the overall magnitude and central tendency of the data values.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
- Returns:
A dictionary containing the mean, median, min, max, and standard deviation of the feature means.
- Return type:
dict
- bioneuralnet.utils.get_logger(name: str) → Logger
Retrieves a global logger configured to write to ‘bioneuralnet.log’.
- Parameters:
name (str) – Name of the logger.
- Returns:
Configured logger instance.
- Return type:
logging.Logger
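For readers unfamiliar with file-backed loggers, the pattern can be sketched with the standard `logging` module. The helper `get_file_logger`, the log format, and the temporary path are hypothetical; the library's own handler setup may differ.

```python
import logging
import os
import tempfile


def get_file_logger(name: str, log_path: str) -> logging.Logger:
    """Hypothetical helper: return a named logger that appends to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Write to a temporary file here so the example leaves no file in the working directory.
log_path = os.path.join(tempfile.mkdtemp(), "example.log")
logger = get_file_logger("demo", log_path)
logger.info("preprocessing started")
```

Calling the helper twice with the same name returns the same logger object, which is why the handler check is needed.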
- bioneuralnet.utils.importance_rf(X: DataFrame, y: Series, top_k: int = 1000, seed: int = 119) → DataFrame
Select top-k features using RandomForest feature importances.
Non-numeric columns are rejected, the remaining data are cleaned and zero-variance features removed, a RandomForest classifier or regressor is fitted depending on y, and the top_k most important features are selected based on feature_importances_.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix with samples as rows and features as columns; all columns must be numeric.
y (pd.Series | pd.DataFrame) – Target values as a Series or single-column DataFrame used to determine classification vs regression.
top_k (int) – Maximum number of most important features to keep according to the RandomForest model.
seed (int) – Random seed for initializing the RandomForest estimator.
- Returns:
Cleaned numeric subset of X restricted to the top-k most important features by RandomForest importance.
- Return type:
pd.DataFrame
- bioneuralnet.utils.impute_knn(df: DataFrame, n_neighbors: int = 5) → DataFrame
Imputes missing values (NaNs) using the K-Nearest Neighbors (KNN) approach.
KNN imputation replaces missing values with the average value from the ‘n_neighbors’ most similar samples. NOTE: Input data should be scaled/normalized prior to imputation.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing missing values.
n_neighbors (int) – The number of nearest neighbors to consider.
- Returns:
The DataFrame with missing values filled using KNN.
- Return type:
pd.DataFrame
- bioneuralnet.utils.impute_simple(df: DataFrame, method: str = 'mean') → DataFrame
Imputes missing values (NaNs) in the DataFrame using a specified strategy.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing missing values.
method (str) – The imputation strategy to use. Must be ‘mean’, ‘median’, or ‘zero’.
- Returns:
The DataFrame with missing values filled.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the specified imputation method is not recognized.
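The three strategies and the error case can be sketched as follows. This is an assumed re-creation using `DataFrame.fillna`, with a hypothetical helper name; it is not the library's implementation.

```python
import numpy as np
import pandas as pd


def impute_simple_sketch(df: pd.DataFrame, method: str = "mean") -> pd.DataFrame:
    """Hypothetical re-creation of the documented strategies."""
    if method == "mean":
        return df.fillna(df.mean())    # per-column mean
    if method == "median":
        return df.fillna(df.median())  # per-column median
    if method == "zero":
        return df.fillna(0)
    raise ValueError(f"Unrecognized imputation method: {method!r}")


demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, 8.0]})
mean_filled = impute_simple_sketch(demo, "mean")      # NaN replaced by 4.0
median_filled = impute_simple_sketch(demo, "median")  # NaN replaced by 3.0
```

The mean and median differ here because the value 8.0 pulls the mean upward, which is one reason to prefer the median for skewed omics measurements.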
- bioneuralnet.utils.laplacian_score(df: DataFrame, n_keep: int, k_neighbors: int = 5) → DataFrame
Unsupervised feature selection via the Laplacian Score to address dimensionality.
Evaluates a feature’s ability to preserve the local manifold structure of the data. The score is computed as the sum of squared differences between connected samples weighted by the global network (W_ij), divided by the feature’s variance. Lower scores indicate higher importance (smoothness on the graph).
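The score described above can be sketched in NumPy for a given sample graph. The sketch assumes the weight matrix W is already available; how the library builds it from `k_neighbors` is not shown here, and the function name is hypothetical.

```python
import numpy as np


def laplacian_scores(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Score each feature (column of X) against a sample-by-sample weight matrix W.

    score_r = sum_ij W_ij * (x_ir - x_jr)^2 / Var(x_r); lower means smoother.
    Features with zero variance would need to be removed beforehand.
    """
    diffs = X[:, None, :] - X[None, :, :]              # shape (n, n, p)
    numerator = np.einsum("ij,ijp->p", W, diffs ** 2)  # weighted squared differences
    return numerator / X.var(axis=0)


# Four samples, two features; the graph connects samples 0-1 and 2-3.
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0

scores = laplacian_scores(X, W)
```

The first feature is constant within each connected pair and gets a score of 0, while the second changes across every edge and gets a score of 16, so the first would be ranked as more important.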
- bioneuralnet.utils.m_transform(df: DataFrame, eps: float = 1e-06) → DataFrame
Converts methylation Beta-values to M-values using log2 transformation.
M-values are typically closer to normally distributed than Beta-values, which improves statistical analysis; the transformation maps the constrained [0, 1] Beta scale to an unbounded log2 scale.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing Beta-values (0 to 1).
eps (float) – A small epsilon value used to clip Beta-values away from 0 and 1, preventing logarithm errors.
- Returns:
A new DataFrame containing the log2-transformed M-values.
- Return type:
pd.DataFrame
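A short sketch of the conversion, assuming the standard definition M = log2(Beta / (1 - Beta)) with clipping to [eps, 1 - eps]. The helper name is hypothetical and the library's exact clipping behavior may differ.

```python
import numpy as np
import pandas as pd


def beta_to_m(df: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    """Hypothetical helper: logit-style log2 transform of Beta-values."""
    clipped = df.clip(lower=eps, upper=1 - eps)  # keep values away from 0 and 1
    return np.log2(clipped / (1 - clipped))


betas = pd.DataFrame({"cpg1": [0.5, 0.8, 0.0]})
m_values = beta_to_m(betas)
```

A Beta-value of 0.5 maps to an M-value of 0, 0.8 maps to roughly 2, and the exact zero stays finite because of the clipping.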
- bioneuralnet.utils.mad_filter(df: DataFrame, n_keep: int) → DataFrame
Select the top features by Median Absolute Deviation (MAD).
The Median Absolute Deviation is calculated across samples for each feature, and the features with the highest MAD scores are retained.
- Parameters:
df (pd.DataFrame) – Input DataFrame from which to select high-MAD numeric features.
n_keep (int) – Maximum number of top features to keep in the output.
- Returns:
DataFrame containing only the top n_keep features ranked by MAD.
- Return type:
pd.DataFrame
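The ranking can be sketched as below, taking MAD as the median of absolute deviations from each column's median. The helper name is hypothetical, and any cleaning the library performs before ranking is omitted.

```python
import pandas as pd


def mad_filter_sketch(df: pd.DataFrame, n_keep: int) -> pd.DataFrame:
    """Hypothetical re-creation: keep the n_keep columns with the highest MAD."""
    numeric = df.select_dtypes(include="number")
    mad = (numeric - numeric.median()).abs().median()  # per-column MAD
    top = mad.sort_values(ascending=False).index[:n_keep]
    return numeric[top]


demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],         # MAD = 1
    "b": [10, 10, 10, 10, 11],    # MAD = 0
    "c": [0, 10, 20, 30, 40],     # MAD = 10
})
selected = mad_filter_sketch(demo, n_keep=2)
```

Column `b` has a single outlying value, which would give it a non-zero variance but leaves its MAD at 0, so it is the one filtered out.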
- bioneuralnet.utils.nan_summary(df: DataFrame, name: str = 'Dataset', missing_threshold: float = 20.0) → float
Logs a report on the missing data (NaNs) in the DataFrame.
- Parameters:
- Returns:
The global percentage of missing values (NaNs) in the DataFrame.
- Return type:
float
- bioneuralnet.utils.network_remove_high_zero_fraction(network: DataFrame, threshold: float = 0.95) → DataFrame
Remove nodes from an adjacency matrix with a high fraction of zero entries.
For each node, the fraction of zero entries in its corresponding column is computed, nodes whose zero fraction is greater than or equal to the threshold are removed, and the matrix is reduced to the remaining indices.
- Parameters:
network (pd.DataFrame) – Square adjacency matrix with identical row and column labels.
threshold (float) – Zero-fraction cutoff; nodes whose fraction of zeros is greater than or equal to this value are dropped.
- Returns:
Filtered adjacency matrix restricted to nodes with sufficiently non-zero connectivity.
- Return type:
pd.DataFrame
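A minimal sketch of the filtering rule, assuming the zero fraction is computed per column and that a node is dropped when its fraction is at or above the threshold. The helper name is hypothetical.

```python
import pandas as pd


def drop_sparse_nodes(network: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Hypothetical re-creation: drop nodes whose column is mostly zeros."""
    zero_fraction = (network == 0).mean(axis=0)       # fraction of zeros per column
    keep = zero_fraction[zero_fraction < threshold].index
    return network.loc[keep, keep]                    # keep the matrix square


labels = ["A", "B", "C"]
network = pd.DataFrame(
    [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0]],   # node C has no edges at all
    index=labels,
    columns=labels,
)
filtered = drop_sparse_nodes(network, threshold=0.95)
```

Node `C` has a zero fraction of 1.0 and is removed, while `A` and `B` (zero fraction 2/3 each) are kept.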
- bioneuralnet.utils.network_remove_low_variance(network: DataFrame, threshold: float = 1e-06) → DataFrame
Remove nodes from an adjacency matrix whose connectivity pattern has very low variance.
Column-wise variances are computed across the adjacency matrix, and any row/column pair whose variance is at or below the given threshold is removed, preserving a square node-by-node structure.
- Parameters:
network (pd.DataFrame) – Square adjacency matrix with identical row and column labels.
threshold (float) – Variance cutoff for a node’s connectivity profile; nodes with variance at or below this value are dropped.
- Returns:
Filtered adjacency matrix restricted to nodes with variance greater than the specified threshold.
- Return type:
pd.DataFrame
- bioneuralnet.utils.normalize(df: DataFrame, method: str = 'standard') → DataFrame
Scales or transforms feature data using common normalization techniques.
- Parameters:
df (pd.DataFrame) – The input DataFrame.
method (str) – The scaling strategy. Must be ‘standard’ (Z-score), ‘minmax’, or ‘log2’.
- Returns:
The DataFrame with features normalized according to the specified method.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the specified normalization method is not recognized.
- bioneuralnet.utils.pca_loadings(df: DataFrame, n_keep: int, n_components: int = 50, seed: int = 1883) → DataFrame
Select features with the highest absolute loadings across the top principal components.
The input data is scaled and PCA is applied. Feature importance is determined by weighting each principal component’s loadings by its explained variance ratio and taking the maximum across all selected components.
- Parameters:
- Returns:
DataFrame containing only the top n_keep features with the highest PCA loadings.
- Return type:
pd.DataFrame
- bioneuralnet.utils.preprocess_clinical(X: DataFrame, scale: bool = False, drop_columns: list | None = None, ordinal_mappings: dict | None = None, continuous_columns: list | None = None, impute: float = False) → DataFrame
Preprocess clinical data by cleaning, mapping ordinals, encoding nominals, and scaling.
This function provides a generalized pipeline for standardizing clinical datasets. It removes specified non-informative columns, maps ordinal variables to numeric ranks, safely coerces continuous variables, and applies one-hot encoding to nominal categories while tracking missing records. Optionally, it handles median imputation and scaling.
- Parameters:
X (pd.DataFrame) – The raw clinical feature matrix with patients as rows and variables as columns.
scale (bool) – If True, applies RobustScaler to the numeric columns.
drop_columns (list | None) – List of column names to drop prior to processing.
ordinal_mappings (dict | None) – Nested dictionary mapping string categories to numeric ranks.
continuous_columns (list | None) – List of strictly continuous column names to coerce to numeric.
impute (bool) – If True, applies median imputation to missing numeric and ordinal values.
- Returns:
Processed clinical feature matrix containing valid numeric types with zero-variance columns removed.
- Return type:
pd.DataFrame
- bioneuralnet.utils.prune_network(adjacency_matrix: DataFrame, weight_threshold: float = 0.0) → DataFrame
Prune a weighted network by thresholding edge weights and removing isolated nodes.
Edges with weights below weight_threshold are removed from the input adjacency matrix, then all nodes with no remaining connections are dropped, and basic before/after graph statistics are logged.
- Parameters:
adjacency_matrix (pd.DataFrame) – Weighted adjacency matrix with nodes as both rows and columns.
weight_threshold (float) – Minimum edge weight to retain; edges with smaller weights are pruned.
- Returns:
Pruned adjacency matrix containing only edges at or above the threshold and nodes with at least one connection.
- Return type:
pd.DataFrame
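The two pruning steps can be sketched as follows. This is an assumed re-creation with a hypothetical helper name: it treats zero as "no edge", keeps edges at or above the threshold, and omits the before/after logging.

```python
import pandas as pd


def prune_network_sketch(adj: pd.DataFrame, weight_threshold: float = 0.0) -> pd.DataFrame:
    """Hypothetical re-creation: threshold edges, then drop isolated nodes."""
    pruned = adj.where(adj >= weight_threshold, 0.0)  # weak edges become zero
    has_edge = (pruned != 0).any(axis=0) | (pruned != 0).any(axis=1)
    keep = has_edge[has_edge].index                   # nodes with a remaining edge
    return pruned.loc[keep, keep]


labels = ["A", "B", "C"]
adj = pd.DataFrame(
    [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.1],
     [0.0, 0.1, 0.0]],
    index=labels,
    columns=labels,
)
pruned = prune_network_sketch(adj, weight_threshold=0.5)
```

With a threshold of 0.5 the B-C edge (0.1) is removed, which leaves `C` isolated, so only `A` and `B` remain.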
- bioneuralnet.utils.prune_network_by_quantile(adjacency_matrix: DataFrame, quantile: float = 0.5) → DataFrame
Prune a weighted network using a quantile-based edge-weight threshold.
A global weight threshold is computed as the given quantile of all edge weights, edges below this threshold are removed, and isolated nodes are dropped from the resulting adjacency matrix.
- Parameters:
adjacency_matrix (pd.DataFrame) – Weighted adjacency matrix with nodes as both rows and columns.
quantile (float) – Quantile in [0, 1] used to determine the global weight cutoff for pruning.
- Returns:
Adjacency matrix with low-weight edges and isolated nodes removed based on the quantile threshold.
- Return type:
pd.DataFrame
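How a quantile translates into a weight cutoff can be sketched as below. The sketch assumes a symmetric matrix and counts only non-zero off-diagonal entries as edges; whether the library includes zeros or both triangles in the quantile is not stated here, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def quantile_cutoff(adj: pd.DataFrame, quantile: float = 0.5) -> float:
    """Hypothetical helper: global weight cutoff at the given quantile of edge weights."""
    weights = adj.to_numpy()
    upper = weights[np.triu_indices_from(weights, k=1)]  # each undirected edge once
    edges = upper[upper != 0]                            # assumption: zero means no edge
    return float(np.quantile(edges, quantile))


labels = ["A", "B", "C"]
adj = pd.DataFrame(
    [[0.0, 0.2, 0.4],
     [0.2, 0.0, 0.9],
     [0.4, 0.9, 0.0]],
    index=labels,
    columns=labels,
)
cutoff = quantile_cutoff(adj, quantile=0.5)
```

For edge weights 0.2, 0.4, and 0.9, the 0.5 quantile is 0.4, so only the 0.2 edge falls below the cutoff.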
- bioneuralnet.utils.set_seed(seed_value: int) → None
Sets seeds for maximum reproducibility across Python, NumPy, and PyTorch.
This function sets global random seeds and configures PyTorch/cuDNN to use deterministic algorithms, so that repeated runs in the same software and hardware environment produce the same numerical results.
- Parameters:
seed_value (int) – The integer value to use as the random seed.
- Returns:
None
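A typical seeding routine of this kind can be sketched as below. The helper name is hypothetical and the exact set of switches the library flips is an assumption; PyTorch is treated as optional so the sketch also runs without it.

```python
import random

import numpy as np


def set_seed_sketch(seed_value: int) -> None:
    """Hypothetical re-creation: seed Python, NumPy, and (if installed) PyTorch."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to seed
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)      # ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True   # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False      # disable autotuned kernel selection


set_seed_sketch(42)
first = (random.random(), np.random.rand())
set_seed_sketch(42)
second = (random.random(), np.random.rand())
```

Re-seeding with the same value reproduces the same draws, so `first` and `second` are identical.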
- bioneuralnet.utils.top_anova_f_features(X: DataFrame, y: Series, max_features: int, alpha: float = 0.05, task: str = 'classification') → DataFrame
Select top features using ANOVA F-test with FDR correction.
Numeric features are cleaned, ANOVA F-statistics and p-values are computed against y using f_classif or f_regression, p-values are adjusted with Benjamini-Hochberg FDR, and up to max_features indices are chosen by prioritizing significant features and padding with the strongest remaining ones if needed.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix with samples as rows and features as columns.
y (pd.Series) – Target vector; categorical for classification or continuous for regression, aligned to the rows of X.
max_features (int) – Maximum number of features to return after significance-based ranking and padding.
alpha (float) – Significance threshold applied to FDR-adjusted p-values to define significant features.
task (str) – Task type, either “classification” (uses f_classif) or “regression” (uses f_regression).
- Returns:
Numeric subset of X with up to max_features columns ordered by F-statistic with significant features first and padded by the strongest remaining features if necessary.
- Return type:
pd.DataFrame
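The Benjamini-Hochberg adjustment mentioned above can be sketched in NumPy. This shows only the p-value correction step, under the standard step-up definition; the function name is hypothetical and the library may delegate this to another package.

```python
import numpy as np


def benjamini_hochberg(p_values) -> np.ndarray:
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                           # indices of ascending p-values
    scaled = p[order] * n / np.arange(1, n + 1)     # p_(i) * n / i
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)
significant = adjusted <= 0.05
```

With alpha = 0.05, only the first feature stays significant after adjustment (0.04), since the second and third are both adjusted to about 0.053.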
- bioneuralnet.utils.variance_summary(df: DataFrame, var_threshold: float | None = None) → dict
Computes key summary statistics for the feature (column) variances within an omics DataFrame.
This is useful for assessing feature distribution and identifying low-variance features prior to modeling.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame (samples as rows, features as columns).
var_threshold (Optional[float]) – A threshold used to count features falling below this variance level.
- Returns:
A dictionary containing the mean, median, min, max, and standard deviation of the column variances. If a threshold is provided, it also includes ‘Number Of Low Variance Features’.
- Return type:
dict
- bioneuralnet.utils.variance_threshold(df: DataFrame, k: int = 1000, ddof: int = 0) → DataFrame
Select the top-k features with the highest variance after cleaning.
The input is first cleaned with clean_inf_nan, then numeric columns are ranked by variance and the top k features are selected (or all if fewer than k are available).
- Parameters:
- Returns:
Numeric DataFrame containing only the top-k highest-variance features after cleaning.
- Return type:
pd.DataFrame
- bioneuralnet.utils.zero_summary(df: DataFrame, zero_threshold: float | None = None) → dict
Computes statistics on the fraction of zero values present in each feature (column).
This helps identify feature sparsity, which is common in omics data (e.g., RNA-seq FPKM).
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
zero_threshold (Optional[float]) – A threshold used to count features whose zero-fraction exceeds this value.
- Returns:
A dictionary containing the mean, median, min, max, and standard deviation of the zero fractions. If a threshold is provided, it also includes ‘Number Of High Zero Features’.
- Return type:
dict
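The per-feature zero fraction behind these statistics can be sketched in a few lines of pandas. The variable names and the demo threshold of 0.5 are illustrative assumptions.

```python
import pandas as pd

demo = pd.DataFrame({
    "g1": [0, 0, 0, 5],   # 75% zeros
    "g2": [1, 2, 0, 4],   # 25% zeros
    "g3": [1, 1, 1, 1],   # no zeros
})

zero_fraction = (demo == 0).mean()            # fraction of zeros per column
zero_threshold = 0.5                          # illustrative cutoff
n_high_zero = int((zero_fraction > zero_threshold).sum())
```

Here only `g1` exceeds the cutoff, so one feature would be counted as highly sparse.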
Modules