bioneuralnet.utils.data

Functions

correlation_summary(df)

Computes summary statistics on the maximum pairwise (absolute) correlation observed for each feature.

data_stats(df[, name, compute_correlation])

Prints a comprehensive set of key statistics for an omics DataFrame.

expression_summary(df)

Computes summary statistics for the mean expression (average value) of all features.

get_logger(name)

Retrieves a global logger configured to write to 'bioneuralnet.log'.

nan_summary(df[, name, missing_threshold])

Logs a report on the missing data (NaNs) in the DataFrame.

sparse_filter(df[, missing_fraction])

Drops features (columns) and then samples (rows) that exceed the maximum missing data fraction.

variance_summary(df[, var_threshold])

Computes key summary statistics for the feature (column) variances within an omics DataFrame.

zero_summary(df[, zero_threshold])

Computes statistics on the fraction of zero values present in each feature (column).

bioneuralnet.utils.data.correlation_summary(df: DataFrame) → dict

Computes summary statistics on the maximum pairwise (absolute) correlation observed for each feature.

This helps identify features that are highly redundant or collinear.

Parameters:

df (pd.DataFrame) – The input omics DataFrame.

Returns:

A dictionary containing the mean, median, min, max, and standard deviation of the maximum absolute correlations.

Return type:

dict
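The computation described above can be sketched with plain pandas and NumPy. This is an illustration under stated assumptions, not the library's code: the helper name `max_abs_correlation_summary`, the use of Pearson correlation, and the dictionary keys are all hypothetical.

```python
import numpy as np
import pandas as pd


def max_abs_correlation_summary(df: pd.DataFrame) -> dict:
    """Illustrative sketch only; not bioneuralnet's implementation."""
    # Absolute Pearson correlation between every pair of columns (assumed method).
    corr = df.corr().abs().to_numpy(copy=True)
    # Mask the diagonal so a feature's correlation with itself is ignored.
    np.fill_diagonal(corr, np.nan)
    # Strongest correlation each feature has with any other feature.
    max_corr = pd.Series(np.nanmax(corr, axis=1), index=df.columns)
    # Key names are placeholders; the real function may label them differently.
    return {
        "mean": float(max_corr.mean()),
        "median": float(max_corr.median()),
        "min": float(max_corr.min()),
        "max": float(max_corr.max()),
        "std": float(max_corr.std()),
    }


df = pd.DataFrame(
    {
        "gene_a": [1.0, 2.0, 3.0, 4.0],
        "gene_b": [2.0, 4.0, 6.0, 8.0],   # perfectly collinear with gene_a
        "gene_c": [1.0, -1.0, 1.0, -1.0],
    }
)
summary = max_abs_correlation_summary(df)
print(summary)
```

A maximum close to 1.0, as produced here by `gene_a` and `gene_b`, is the kind of signal that points to redundant features.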

bioneuralnet.utils.data.data_stats(df: DataFrame, name: str = 'Data', compute_correlation: bool = False) → None

Prints a comprehensive set of key statistics for an omics DataFrame.

Combines variance, zero fraction, expression, correlation, and missingness summaries for rapid data quality assessment. Recommends data cleaning steps if high missingness is found.

Parameters:
  • df (pd.DataFrame) – The input omics DataFrame.

  • name (str) – A descriptive name for the dataset (e.g., ‘X_rna_final’) for clear output labeling.

  • compute_correlation (bool) – Whether to compute pairwise correlations. Defaults to False.

Returns:

Nothing. The statistics are logged directly to the console.

Return type:

None

bioneuralnet.utils.data.expression_summary(df: DataFrame) → dict

Computes summary statistics for the mean expression (average value) of all features.

Provides insight into the overall magnitude and central tendency of the data values.

Parameters:

df (pd.DataFrame) – The input omics DataFrame.

Returns:

A dictionary containing the mean, median, min, max, and standard deviation of the feature means.

Return type:

dict
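As a rough sketch of what "summary statistics of the feature means" amounts to, the following uses pandas only. The helper name `feature_mean_summary` and the dictionary keys are assumptions made for illustration; the library's own output may be labeled differently.

```python
import pandas as pd


def feature_mean_summary(df: pd.DataFrame) -> dict:
    """Illustrative sketch only; not bioneuralnet's implementation."""
    # One mean per feature (column), averaged over samples (rows).
    feature_means = df.mean(axis=0)
    # Key names are placeholders chosen for this example.
    return {
        "mean": float(feature_means.mean()),
        "median": float(feature_means.median()),
        "min": float(feature_means.min()),
        "max": float(feature_means.max()),
        "std": float(feature_means.std()),
    }


df = pd.DataFrame(
    {
        "gene_a": [0.0, 2.0, 4.0],     # mean 2.0
        "gene_b": [10.0, 10.0, 10.0],  # mean 10.0
    }
)
expr = feature_mean_summary(df)
print(expr)
```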

bioneuralnet.utils.data.nan_summary(df: DataFrame, name: str = 'Dataset', missing_threshold: float = 20.0) → float

Logs a report on the missing data (NaNs) in the DataFrame.

Parameters:
  • df (pd.DataFrame) – The input omics DataFrame.

  • name (str) – A descriptive name for the dataset for clear output labeling.

  • missing_threshold (float) – Percentage threshold (0-100) to trigger a warning for highly missing data.

Returns:

The global percentage of missing values (NaNs) in the DataFrame.

Return type:

float
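The returned value is a percentage, so it is on a 0-100 scale like `missing_threshold`. A minimal pandas sketch of a global missing percentage is shown below. Exactly how the library compares values against the threshold (globally or per feature) is not stated here, so the comparison in this example is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "gene_a": [1.0, np.nan, 3.0, 4.0],
        "gene_b": [np.nan, np.nan, 1.0, 2.0],
    }
)

missing_threshold = 20.0  # percent, mirroring the documented default

# Global percentage: missing cells divided by all cells, times 100.
global_missing_pct = float(df.isna().to_numpy().mean() * 100)

# Per-feature percentages show where the NaNs are concentrated.
per_feature_pct = df.isna().mean(axis=0) * 100

# Assumed check: compare the global percentage against the threshold.
exceeds_threshold = global_missing_pct > missing_threshold
print(f"{global_missing_pct:.1f}% missing; above threshold: {exceeds_threshold}")
```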

bioneuralnet.utils.data.sparse_filter(df: DataFrame, missing_fraction: float = 0.2) → DataFrame

Drops features (columns) and then samples (rows) that exceed the maximum missing data fraction.

Parameters:
  • df (pd.DataFrame) – The input omics DataFrame.

  • missing_fraction (float) – The maximum allowed fraction of missing values (0.0 to 1.0).

Returns:

The filtered DataFrame with highly missing features and samples removed.

Return type:

pd.DataFrame
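The two-stage filtering described above (columns first, then rows) could look roughly like the following. This is a sketch under assumptions, not the library's code: the helper name `drop_sparse` is hypothetical, and it assumes that row missingness is recomputed on the columns that survive the first stage and that "exceed" means strictly greater than `missing_fraction`.

```python
import numpy as np
import pandas as pd


def drop_sparse(df: pd.DataFrame, missing_fraction: float = 0.2) -> pd.DataFrame:
    """Illustrative sketch only; not bioneuralnet's implementation."""
    # Stage 1: drop features whose fraction of NaNs exceeds the limit.
    col_missing = df.isna().mean(axis=0)
    kept = df.loc[:, col_missing <= missing_fraction]
    # Stage 2 (assumed): recompute missingness per sample on the remaining
    # features, then drop samples that exceed the same limit.
    row_missing = kept.isna().mean(axis=1)
    return kept.loc[row_missing <= missing_fraction]


df = pd.DataFrame(
    {
        "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "f2": [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0],  # 50% missing
        "f3": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],        # about 17% missing
        "f4": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "f5": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)
filtered = drop_sparse(df, missing_fraction=0.2)
print(filtered.shape)
```

In this toy input, `f2` is removed in the first stage. Sample `s2` is then missing one of the four remaining features (25%), so it is removed in the second stage. Note that `missing_fraction` is a fraction between 0.0 and 1.0, unlike the percentage used by `nan_summary`.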

bioneuralnet.utils.data.variance_summary(df: DataFrame, var_threshold: float | None = None) → dict

Computes key summary statistics for the feature (column) variances within an omics DataFrame.

This is useful for assessing feature distribution and identifying low-variance features prior to modeling.

Parameters:
  • df (pd.DataFrame) – The input omics DataFrame (samples as rows, features as columns).

  • var_threshold (Optional[float]) – Variance threshold; features whose variance falls below it are counted.

Returns:

A dictionary containing the mean, median, min, max, and standard deviation of the column variances.

If var_threshold is provided, the dictionary also includes ‘Number Of Low Variance Features’, the count of features whose variance falls below the threshold.

Return type:

dict
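A pandas-only sketch of this kind of variance summary is shown below. It is an approximation for illustration: the helper name `column_variance_summary` and the lowercase statistic keys are hypothetical, and pandas' default sample variance (`ddof=1`) is assumed. Only the ‘Number Of Low Variance Features’ key is taken from the description above.

```python
from typing import Optional

import pandas as pd


def column_variance_summary(
    df: pd.DataFrame, var_threshold: Optional[float] = None
) -> dict:
    """Illustrative sketch only; not bioneuralnet's implementation."""
    # Sample variance of each feature (pandas default, ddof=1; an assumption).
    variances = df.var(axis=0)
    # The lowercase keys are placeholders chosen for this example.
    summary = {
        "mean": float(variances.mean()),
        "median": float(variances.median()),
        "min": float(variances.min()),
        "max": float(variances.max()),
        "std": float(variances.std()),
    }
    if var_threshold is not None:
        # Count features whose variance falls below the threshold.
        summary["Number Of Low Variance Features"] = int(
            (variances < var_threshold).sum()
        )
    return summary


df = pd.DataFrame(
    {
        "flat": [5.0, 5.0, 5.0, 5.0],    # variance 0
        "mild": [1.0, 2.0, 3.0, 4.0],    # variance about 1.67
        "wide": [0.0, 10.0, 20.0, 30.0], # variance about 166.67
    }
)
var_stats = column_variance_summary(df, var_threshold=1.0)
var_stats_no_threshold = column_variance_summary(df)
print(var_stats)
```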

bioneuralnet.utils.data.zero_summary(df: DataFrame, zero_threshold: float | None = None) → dict

Computes statistics on the fraction of zero values present in each feature (column).

This helps identify feature sparsity, which is common in omics data (e.g., RNA-seq FPKM).

Parameters:
  • df (pd.DataFrame) – The input omics DataFrame.

  • zero_threshold (Optional[float]) – A threshold used to count features whose zero-fraction exceeds this value.

Returns:

A dictionary containing the mean, median, min, max, and standard deviation of the zero fractions.

If zero_threshold is provided, the dictionary also includes ‘Number Of High Zero Features’, the count of features whose zero fraction exceeds the threshold.

Return type:

dict
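The zero-fraction statistics can be approximated with pandas as follows. This is a sketch for illustration rather than the library's code: the helper name `zero_fraction_summary` and the lowercase statistic keys are hypothetical, while the ‘Number Of High Zero Features’ key comes from the description above.

```python
from typing import Optional

import pandas as pd


def zero_fraction_summary(
    df: pd.DataFrame, zero_threshold: Optional[float] = None
) -> dict:
    """Illustrative sketch only; not bioneuralnet's implementation."""
    # Fraction of samples in which each feature is exactly zero.
    zero_fraction = (df == 0).mean(axis=0)
    # The lowercase keys are placeholders chosen for this example.
    summary = {
        "mean": float(zero_fraction.mean()),
        "median": float(zero_fraction.median()),
        "min": float(zero_fraction.min()),
        "max": float(zero_fraction.max()),
        "std": float(zero_fraction.std()),
    }
    if zero_threshold is not None:
        # Count features whose zero fraction exceeds the threshold.
        summary["Number Of High Zero Features"] = int(
            (zero_fraction > zero_threshold).sum()
        )
    return summary


df = pd.DataFrame(
    {
        "g1": [0.0, 0.0, 0.0, 1.0],  # 75% zeros
        "g2": [0.0, 1.0, 2.0, 3.0],  # 25% zeros
        "g3": [1.0, 2.0, 3.0, 4.0],  # no zeros
    }
)
zero_stats = zero_fraction_summary(df, zero_threshold=0.5)
print(zero_stats)
```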