bioneuralnet.utils.preprocess

Functions

`clean_inf_nan`(df)	Clean a numeric DataFrame by handling infinite values, imputing NaNs, and dropping zero-variance columns.
`clean_internal`(df[, nan_threshold])	Clean a numeric DataFrame by dropping sparse and constant columns and imputing remaining NaNs.
`get_logger`(name)	Retrieves a global logger configured to write to 'bioneuralnet.log'.
`impute_knn`(df[, n_neighbors])	Imputes missing values (NaNs) using the K-Nearest Neighbors (KNN) approach.
`impute_simple`(df[, method])	Imputes missing values (NaNs) in the DataFrame using a specified strategy.
`m_transform`(df[, eps])	Converts methylation Beta-values to M-values using log2 transformation.
`network_remove_high_zero_fraction`(network[, ...])	Remove nodes from an adjacency matrix with a high fraction of zero entries.
`network_remove_low_variance`(network[, threshold])	Remove nodes from an adjacency matrix whose connectivity pattern has very low variance.
`normalize`(df[, method])	Scales or transforms feature data using common normalization techniques.
`preprocess_clinical`(X[, scale, ...])	Preprocess clinical data by cleaning, mapping ordinals, encoding nominals, and scaling.
`prune_network`(adjacency_matrix[, ...])	Prune a weighted network by thresholding edge weights and removing isolated nodes.
`prune_network_by_quantile`(adjacency_matrix[, quantile])	Prune a weighted network using a quantile-based edge-weight threshold.

Classes

`KNNImputer`(*[, missing_values, n_neighbors, ...])	Imputation for completing missing values using k-Nearest Neighbors.
`MinMaxScaler`([feature_range, copy, clip])	Transform features by scaling each feature to a given range.
`RobustScaler`(*[, with_centering, ...])	Scale features using statistics that are robust to outliers.
`SimpleImputer`(*[, missing_values, strategy, ...])	Univariate imputer for completing missing values with simple strategies.
`StandardScaler`(*[, copy, with_mean, with_std])	Standardize features by removing the mean and scaling to unit variance.

bioneuralnet.utils.preprocess.clean_inf_nan(df: DataFrame) → DataFrame

Clean a numeric DataFrame by handling infinite values, imputing NaNs, and dropping zero-variance columns.

Infinite values are replaced with NaN, all NaNs are imputed using the column median, and any features with zero variance are removed.

Parameters:

df (pd.DataFrame) – Input DataFrame containing numeric columns, potentially with inf and NaN values.

Returns:

Cleaned DataFrame with finite values, no NaNs, and only columns with non-zero variance.

Return type:

pd.DataFrame
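
The three steps described for clean_inf_nan can be sketched in plain pandas as follows. This is an illustrative reconstruction, not the library's source: the helper name clean_inf_nan_sketch and the sample data are hypothetical, and the use of the sample variance (pandas' default) is an assumption.

```python
import numpy as np
import pandas as pd


def clean_inf_nan_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in that mirrors the documented behaviour."""
    out = df.replace([np.inf, -np.inf], np.nan)  # treat +/-inf as missing
    out = out.fillna(out.median())               # impute with column medians
    return out.loc[:, out.var() > 0]             # keep non-constant columns


demo = pd.DataFrame({
    "a": [1.0, np.inf, 3.0],   # inf becomes the median of the rest
    "b": [2.0, 2.0, np.nan],   # constant after imputation, so dropped
    "c": [0.5, np.nan, 1.5],
})
cleaned = clean_inf_nan_sketch(demo)
print(cleaned)
```

Because imputation runs before the variance check in this sketch, a column such as "b" that only becomes constant after filling is still removed.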

bioneuralnet.utils.preprocess.clean_internal(df: DataFrame, nan_threshold: float = 0.5) → DataFrame

Clean a numeric DataFrame by dropping sparse and constant columns and imputing remaining NaNs.

Columns with a fraction of missing values above nan_threshold are dropped, columns with zero variance are removed, and any remaining NaNs are imputed using the column median.

Parameters:

df (pd.DataFrame) – Input numeric DataFrame to be cleaned.
nan_threshold (float) – Maximum allowed fraction of NaNs per column before the column is dropped.

Returns:

Cleaned DataFrame with dense, non-constant columns and no remaining NaN values.

Return type:

pd.DataFrame
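
A minimal sketch of the order of operations described for clean_internal, assuming the steps run exactly in the order listed (sparse columns first, then constant columns, then median imputation). The helper name and sample data are hypothetical and the code is not taken from the library.

```python
import numpy as np
import pandas as pd


def clean_internal_sketch(df: pd.DataFrame, nan_threshold: float = 0.5) -> pd.DataFrame:
    """Hypothetical stand-in that mirrors the documented behaviour."""
    dense = df.loc[:, df.isna().mean() <= nan_threshold]  # drop sparse columns
    varying = dense.loc[:, dense.var() > 0]               # drop constant columns
    return varying.fillna(varying.median())               # impute what is left


demo = pd.DataFrame({
    "sparse": [np.nan, np.nan, np.nan, 1.0],  # 75% missing, above 0.5
    "const": [3.0, 3.0, 3.0, 3.0],            # zero variance
    "ok": [1.0, np.nan, 3.0, 5.0],            # 25% missing, median is 3.0
})
cleaned = clean_internal_sketch(demo)
print(cleaned)
```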

bioneuralnet.utils.preprocess.impute_knn(df: DataFrame, n_neighbors: int = 5) → DataFrame

Imputes missing values (NaNs) using the K-Nearest Neighbors (KNN) approach.

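KNN imputation replaces each missing value with the average of that feature over the n_neighbors most similar samples. NOTE: Input data should be scaled/normalized prior to imputation, because neighbor distances are otherwise dominated by the features with the largest ranges.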

Parameters:

df (pd.DataFrame) – The input DataFrame containing missing values.
n_neighbors (int) – The number of nearest neighbors to consider.

Returns:

The DataFrame with missing values filled using KNN.

Return type:

pd.DataFrame
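
The idea can be illustrated with scikit-learn's KNNImputer, which this module also exposes. Whether impute_knn wraps it in exactly this way is an assumption; the helper name and sample data below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_knn_sketch(df: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Hypothetical wrapper: fill NaNs from the nearest samples."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    filled = imputer.fit_transform(df)
    # Restore the labels that scikit-learn drops when returning an array.
    return pd.DataFrame(filled, index=df.index, columns=df.columns)


demo = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 5.0],
    "y": [0.0, np.nan, 0.2, 1.0],
})
filled = impute_knn_sketch(demo, n_neighbors=2)
print(filled)
```

With two neighbors, the missing "y" in the second row is averaged from the first and third rows, which are closest in "x"; the distant fourth row does not contribute.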

bioneuralnet.utils.preprocess.impute_simple(df: DataFrame, method: str = 'mean') → DataFrame

Imputes missing values (NaNs) in the DataFrame using a specified strategy.

Parameters:

df (pd.DataFrame) – The input DataFrame containing missing values.
method (str) – The imputation strategy to use. Must be ‘mean’, ‘median’, or ‘zero’.

Returns:

The DataFrame with missing values filled.

Return type:

pd.DataFrame

Raises:

ValueError – If the specified imputation method is not recognized.
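
The three strategies and the error case can be sketched as below. This is a hedged illustration in plain pandas rather than the library's code; the helper name, the error message, and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_simple_sketch(df: pd.DataFrame, method: str = "mean") -> pd.DataFrame:
    """Hypothetical stand-in for the documented strategies."""
    if method == "mean":
        return df.fillna(df.mean())
    if method == "median":
        return df.fillna(df.median())
    if method == "zero":
        return df.fillna(0)
    raise ValueError(f"Unrecognized imputation method: {method!r}")


demo = pd.DataFrame({"x": [1.0, np.nan, 2.0, 9.0]})
by_mean = impute_simple_sketch(demo, "mean")      # fills with 4.0
by_median = impute_simple_sketch(demo, "median")  # fills with 2.0
by_zero = impute_simple_sketch(demo, "zero")      # fills with 0.0
```

The skewed sample shows why the choice matters: the outlier 9.0 pulls the mean well above the median.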

bioneuralnet.utils.preprocess.m_transform(df: DataFrame, eps: float = 1e-06) → DataFrame

Converts methylation Beta-values to M-values using log2 transformation.

M-values are computed as log2(Beta / (1 - Beta)). They map the constrained [0, 1] Beta scale to an unbounded scale on which the data are typically closer to normally distributed and more homoscedastic, which suits many statistical analyses better.

Parameters:

df (pd.DataFrame) – The input DataFrame containing Beta-values (0 to 1).
eps (float) – A small epsilon value used to clip Beta-values away from 0 and 1, preventing logarithm errors.

Returns:

A new DataFrame containing the log2-transformed M-values.

Return type:

pd.DataFrame
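
A minimal sketch of the Beta-to-M conversion using the standard formula M = log2(Beta / (1 - Beta)). Clipping to [eps, 1 - eps] is an assumption about how eps is applied; the helper name and the probe labels are hypothetical.

```python
import numpy as np
import pandas as pd


def m_transform_sketch(df: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    """Hypothetical stand-in: logit (base 2) of clipped Beta-values."""
    clipped = df.clip(lower=eps, upper=1 - eps)  # keep values off 0 and 1
    return np.log2(clipped / (1 - clipped))


beta = pd.DataFrame({"cg_probe": [0.5, 0.8, 0.0]})
m_values = m_transform_sketch(beta)
print(m_values)
```

A Beta-value of 0.5 maps to M = 0 and 0.8 maps to M = 2, while the exact zero is clipped to eps and yields a large negative but finite M-value instead of negative infinity.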

bioneuralnet.utils.preprocess.network_remove_high_zero_fraction(network: DataFrame, threshold: float = 0.95) → DataFrame

Remove nodes from an adjacency matrix with a high fraction of zero entries.

For each node, the fraction of zero entries in its corresponding column is computed, nodes whose zero fraction is greater than or equal to the threshold are removed, and the matrix is reduced to the remaining indices.

Parameters:

network (pd.DataFrame) – Square adjacency matrix with identical row and column labels.
threshold (float) – Zero-fraction cutoff per node; nodes whose fraction of zeros is greater than or equal to this value are dropped.

Returns:

Filtered adjacency matrix restricted to nodes with sufficiently non-zero connectivity.

Return type:

pd.DataFrame
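
The filter can be sketched as follows, assuming zeros are counted over each full column, diagonal included. The helper name and the toy network are hypothetical, and this is not the library's code.

```python
import pandas as pd


def remove_high_zero_fraction_sketch(network: pd.DataFrame,
                                     threshold: float = 0.95) -> pd.DataFrame:
    """Hypothetical stand-in: drop nodes whose column is mostly zeros."""
    zero_fraction = (network == 0).mean(axis=0)          # per-node zero share
    keep = zero_fraction[zero_fraction < threshold].index
    return network.loc[keep, keep]                       # stay square


nodes = ["A", "B", "C"]
network = pd.DataFrame(
    [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.0, 0.0]],   # C has no connections at all
    index=nodes, columns=nodes,
)
filtered = remove_high_zero_fraction_sketch(network)
print(filtered)
```

Selecting the same labels on both axes is what keeps the result a valid node-by-node matrix.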

bioneuralnet.utils.preprocess.network_remove_low_variance(network: DataFrame, threshold: float = 1e-06) → DataFrame

Remove nodes from an adjacency matrix whose connectivity pattern has very low variance.

Column-wise variances are computed across the adjacency matrix, and any row/column pair whose variance is at or below the given threshold is removed, preserving a square node-by-node structure.

Parameters:

network (pd.DataFrame) – Square adjacency matrix with identical row and column labels.
threshold (float) – Variance cutoff for a node’s connectivity profile; nodes whose variance is at or below this value are dropped.

Returns:

Filtered adjacency matrix restricted to nodes with variance greater than the specified threshold.

Return type:

pd.DataFrame
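
A short sketch of the variance filter, assuming pandas' default sample variance. The helper name and the toy network are hypothetical, and this is not the library's code.

```python
import pandas as pd


def remove_low_variance_sketch(network: pd.DataFrame,
                               threshold: float = 1e-6) -> pd.DataFrame:
    """Hypothetical stand-in: drop nodes with near-constant columns."""
    variances = network.var(axis=0)                 # column-wise variance
    keep = variances[variances > threshold].index   # strictly above cutoff
    return network.loc[keep, keep]                  # stay square


nodes = ["A", "B", "C"]
network = pd.DataFrame(
    [[0.0, 0.8, 0.5],
     [0.8, 0.0, 0.5],
     [0.5, 0.5, 0.5]],   # C connects to everything with the same weight
    index=nodes, columns=nodes,
)
filtered = remove_low_variance_sketch(network)
print(filtered)
```

Node C is removed even though it is well connected: its uniform weights carry no information that distinguishes one neighbor from another.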

bioneuralnet.utils.preprocess.normalize(df: DataFrame, method: str = 'standard') → DataFrame

Scales or transforms feature data using common normalization techniques.

Parameters:

df (pd.DataFrame) – The input DataFrame.
method (str) – The scaling strategy. Must be ‘standard’ (Z-score), ‘minmax’, or ‘log2’.

Returns:

The DataFrame with features normalized according to the specified method.

Return type:

pd.DataFrame

Raises:

ValueError – If the specified normalization method is not recognized.
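
The three methods can be sketched as below. Two details are assumptions and not confirmed by this page: the Z-score uses the population standard deviation (as scikit-learn's StandardScaler does), and the 'log2' branch adds a pseudocount of 1 before taking the logarithm. The helper name and sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def normalize_sketch(df: pd.DataFrame, method: str = "standard") -> pd.DataFrame:
    """Hypothetical stand-in for the documented scaling strategies."""
    if method == "standard":
        return (df - df.mean()) / df.std(ddof=0)         # zero mean, unit variance
    if method == "minmax":
        return (df - df.min()) / (df.max() - df.min())   # rescale to [0, 1]
    if method == "log2":
        return np.log2(df + 1)                           # assumed pseudocount of 1
    raise ValueError(f"Unrecognized normalization method: {method!r}")


demo = pd.DataFrame({"x": [1.0, 3.0, 7.0]})
z = normalize_sketch(demo, "standard")
mm = normalize_sketch(demo, "minmax")
lg = normalize_sketch(demo, "log2")
```

Note that the 'standard' and 'minmax' branches of this sketch divide by zero for constant columns, so such columns should be removed beforehand.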

bioneuralnet.utils.preprocess.preprocess_clinical(X: DataFrame, scale: bool = False, drop_columns: list | None = None, ordinal_mappings: dict | None = None, continuous_columns: list | None = None, impute: float = False) → DataFrame

Preprocess clinical data by cleaning, mapping ordinals, encoding nominals, and scaling.

This function provides a generalized pipeline for standardizing clinical datasets. It removes specified non-informative columns, maps ordinal variables to numeric ranks, safely coerces continuous variables, and applies one-hot encoding to nominal categories while tracking missing records. Optionally, it handles median imputation and scaling.

Parameters:

X (pd.DataFrame) – The raw clinical feature matrix with patients as rows and variables as columns.
scale (bool) – If True, applies RobustScaler to the numeric columns.
drop_columns (list | None) – List of column names to drop prior to processing.
ordinal_mappings (dict | None) – Nested dictionary mapping string categories to numeric ranks.
continuous_columns (list | None) – List of strictly continuous column names to coerce to numeric.
impute (bool) – If True, applies median imputation to missing numeric and ordinal values.

Returns:

Processed clinical feature matrix containing valid numeric types with zero-variance columns removed.

Return type:

pd.DataFrame
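
The pipeline stages can be illustrated on a toy table with plain pandas. This sketch does not call preprocess_clinical and is not its implementation. The column names, the {column: {category: rank}} layout for ordinal_mappings, and the use of a separate indicator column to track missing nominal values are all assumptions made for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3"],   # non-informative identifier
    "stage": ["I", "III", "II"],        # ordinal
    "age": ["61", "unknown", "47"],     # continuous, stored as text
    "sex": ["F", "M", None],            # nominal with a missing record
})
ordinal_mappings = {"stage": {"I": 1, "II": 2, "III": 3}}  # assumed layout

X = raw.drop(columns=["patient_id"])                  # drop_columns step
for column, mapping in ordinal_mappings.items():      # ordinal -> rank
    X[column] = X[column].map(mapping)
X["age"] = pd.to_numeric(X["age"], errors="coerce")   # bad entries -> NaN
X = pd.get_dummies(X, columns=["sex"], dummy_na=True, dtype=float)  # one-hot
X = X.fillna(X.median())                              # median imputation
print(X)
```

Coercing with errors="coerce" turns the unparseable age into NaN, which the median step then fills with 54.0.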

bioneuralnet.utils.preprocess.prune_network(adjacency_matrix: DataFrame, weight_threshold: float = 0.0) → DataFrame

Prune a weighted network by thresholding edge weights and removing isolated nodes.

Edges with weights below weight_threshold are removed from the input adjacency matrix, then all nodes with no remaining connections are dropped, and basic before/after graph statistics are logged.

Parameters:

adjacency_matrix (pd.DataFrame) – Weighted adjacency matrix with nodes as both rows and columns.
weight_threshold (float) – Minimum edge weight to retain; edges with smaller weights are pruned.

Returns:

Pruned adjacency matrix containing only edges with weights at or above the threshold and nodes with at least one connection.

Return type:

pd.DataFrame
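
The two pruning steps can be sketched directly on the adjacency matrix. This is an illustration under stated assumptions, not the library's code: a zero entry is taken to mean "no edge", the logging step is omitted, and the helper name and toy network are hypothetical.

```python
import pandas as pd


def prune_network_sketch(adjacency_matrix: pd.DataFrame,
                         weight_threshold: float = 0.0) -> pd.DataFrame:
    """Hypothetical stand-in: threshold edges, then drop isolated nodes."""
    # Zero out every edge whose weight is below the threshold.
    pruned = adjacency_matrix.where(adjacency_matrix >= weight_threshold, 0.0)
    # A node is kept if any entry in its row or column is still non-zero.
    nonzero = pruned != 0
    connected = nonzero.any(axis=0) | nonzero.any(axis=1)
    keep = connected[connected].index
    return pruned.loc[keep, keep]


nodes = ["A", "B", "C"]
adjacency = pd.DataFrame(
    [[0.0, 0.9, 0.0],
     [0.9, 0.0, 0.2],
     [0.0, 0.2, 0.0]],
    index=nodes, columns=nodes,
)
pruned = prune_network_sketch(adjacency, weight_threshold=0.5)
print(pruned)
```

Removing the weak B-C edge leaves C without neighbors, so the second step drops it and only the A-B edge survives.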

bioneuralnet.utils.preprocess.prune_network_by_quantile(adjacency_matrix: DataFrame, quantile: float = 0.5) → DataFrame

Prune a weighted network using a quantile-based edge-weight threshold.

A global weight threshold is computed as the given quantile of all edge weights, edges below this threshold are removed, and isolated nodes are dropped from the resulting adjacency matrix.

Parameters:

adjacency_matrix (pd.DataFrame) – Weighted adjacency matrix with nodes as both rows and columns.
quantile (float) – Quantile in [0, 1] used to determine the global weight cutoff for pruning.

Returns:

Adjacency matrix with low-weight edges and isolated nodes removed based on the quantile threshold.

Return type:

pd.DataFrame
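
A minimal sketch of quantile-based pruning. It assumes that "all edge weights" means the non-zero entries of the matrix and that edges equal to the cutoff are kept; neither detail is confirmed by this page. The helper name and toy network are hypothetical.

```python
import numpy as np
import pandas as pd


def prune_by_quantile_sketch(adjacency_matrix: pd.DataFrame,
                             quantile: float = 0.5) -> pd.DataFrame:
    """Hypothetical stand-in: global quantile cutoff, then drop isolates."""
    values = adjacency_matrix.to_numpy()
    cutoff = np.quantile(values[values != 0], quantile)  # existing edges only
    pruned = adjacency_matrix.where(adjacency_matrix >= cutoff, 0.0)
    nonzero = pruned != 0
    connected = nonzero.any(axis=0) | nonzero.any(axis=1)
    keep = connected[connected].index
    return pruned.loc[keep, keep]


nodes = ["A", "B", "C", "D"]
adjacency = pd.DataFrame(
    [[0.0, 0.9, 0.0, 0.0],
     [0.9, 0.0, 0.6, 0.0],
     [0.0, 0.6, 0.0, 0.1],
     [0.0, 0.0, 0.1, 0.0]],
    index=nodes, columns=nodes,
)
pruned = prune_by_quantile_sketch(adjacency, quantile=0.5)
print(pruned)
```

Here the median edge weight is 0.6, so the C-D edge is removed and D is dropped as isolated. The sketch does not handle a matrix with no edges, where the quantile is undefined.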