bioneuralnet.utils.data
Functions
correlation_summary | Computes summary statistics on the maximum pairwise (absolute) correlation observed for each feature.
data_stats | Prints a comprehensive set of key statistics for an omics DataFrame.
expression_summary | Computes summary statistics for the mean expression (average value) of all features.
Retrieves a global logger configured to write to 'bioneuralnet.log'.
nan_summary | Logs a report on the missing data (NaNs) in the DataFrame.
sparse_filter | Drops features (columns) and then samples (rows) that exceed the maximum missing data fraction.
variance_summary | Computes key summary statistics for the feature (column) variances within an omics DataFrame.
zero_summary | Computes statistics on the fraction of zero values present in each feature (column).
- bioneuralnet.utils.data.correlation_summary(df: DataFrame) → dict
Computes summary statistics on the maximum pairwise (absolute) correlation observed for each feature.
This helps identify features that are highly redundant or collinear.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
- Returns:
A dictionary containing the mean, median, min, max, and std of the max absolute correlations.
- Return type:
dict
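The computation described above can be approximated with plain pandas. This is a minimal sketch, not the library's implementation: the function name `max_abs_correlation_summary` and the dictionary keys are hypothetical, and it assumes Pearson correlation with each feature's self-correlation excluded.

```python
import numpy as np
import pandas as pd


def max_abs_correlation_summary(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: summarize each feature's largest |correlation|."""
    corr = df.corr().abs()
    # Hide the diagonal so a feature's correlation with itself (1.0) is ignored.
    diagonal = np.eye(len(corr), dtype=bool)
    max_corr = corr.mask(diagonal).max(axis=1)
    return {
        "mean": float(max_corr.mean()),
        "median": float(max_corr.median()),
        "min": float(max_corr.min()),
        "max": float(max_corr.max()),
        "std": float(max_corr.std()),
    }


df = pd.DataFrame({
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],    # perfectly collinear with g1
    "g3": [1.0, -1.0, 1.0, -1.0],  # only weakly related to g1 and g2
})
summary = max_abs_correlation_summary(df)
```

A maximum close to 1.0, as for `g1` and `g2` here, signals at least one pair of nearly redundant features.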
- bioneuralnet.utils.data.data_stats(df: DataFrame, name: str = 'Data', compute_correlation: bool = False) → None
Prints a comprehensive set of key statistics for an omics DataFrame.
Combines variance, zero fraction, expression, correlation, and missingness summaries for rapid data quality assessment. Recommends data cleaning steps if high missingness is found.
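- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
name (str) – A label for the dataset; defaults to 'Data'.
compute_correlation (bool) – Whether to include the correlation summary; defaults to False.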
- Returns:
Logs the statistics directly to the console.
- Return type:
None
- bioneuralnet.utils.data.expression_summary(df: DataFrame) → dict
Computes summary statistics for the mean expression (average value) of all features.
Provides insight into the overall magnitude and central tendency of the data values.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
- Returns:
A dictionary containing the mean, median, min, max, and standard deviation of the feature means.
- Return type:
dict
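As an illustration of what such a summary contains, the sketch below computes the per-feature means and then summarizes them. The function name `mean_expression_summary` and the dictionary keys are hypothetical; the library's actual key names may differ.

```python
import pandas as pd


def mean_expression_summary(df: pd.DataFrame) -> dict:
    """Hypothetical sketch: summarize the per-feature (column) means."""
    feature_means = df.mean(axis=0)  # one mean per feature
    return {
        "mean": float(feature_means.mean()),
        "median": float(feature_means.median()),
        "min": float(feature_means.min()),
        "max": float(feature_means.max()),
        "std": float(feature_means.std()),
    }


df = pd.DataFrame({
    "g1": [1.0, 3.0],    # mean 2.0
    "g2": [10.0, 20.0],  # mean 15.0
    "g3": [0.0, 0.0],    # mean 0.0
})
summary = mean_expression_summary(df)
```

A large gap between the minimum and maximum feature means, as in this toy frame, often suggests that the data would benefit from scaling or a log transform before modeling.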
- bioneuralnet.utils.data.nan_summary(df: DataFrame, name: str = 'Dataset', missing_threshold: float = 20.0) → float
Logs a report on the missing data (NaNs) in the DataFrame.
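- Parameters:
df (pd.DataFrame) – The input DataFrame.
name (str) – A label for the dataset; defaults to 'Dataset'.
missing_threshold (float) – The missing-data threshold; defaults to 20.0.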
- Returns:
The global percentage of missing values (NaNs) in the DataFrame.
- Return type:
float
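The returned value, a global percentage of missing cells, can be reproduced with a few lines of pandas. This is a sketch under assumptions: `global_nan_percentage` is a hypothetical name, and the logging and thresholding behavior of the real function is not modeled here.

```python
import numpy as np
import pandas as pd


def global_nan_percentage(df: pd.DataFrame) -> float:
    """Hypothetical sketch: percentage of all cells that are NaN."""
    if df.size == 0:
        return 0.0  # avoid dividing by zero on an empty frame
    n_missing = int(df.isna().sum().sum())
    return 100.0 * n_missing / df.size


df = pd.DataFrame({
    "g1": [1.0, np.nan, 3.0, 4.0, 5.0],
    "g2": [np.nan, np.nan, 3.0, 4.0, 5.0],
})
pct_missing = global_nan_percentage(df)

# Per-feature percentages help locate where the missing values concentrate.
pct_by_feature = df.isna().mean(axis=0) * 100.0
```

Here 3 of the 10 cells are missing, so `pct_missing` is 30.0.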
- bioneuralnet.utils.data.sparse_filter(df: DataFrame, missing_fraction: float = 0.2) → DataFrame
Drops features (columns) and then samples (rows) that exceed the maximum missing data fraction.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
missing_fraction (float) – The maximum allowed fraction of missing values (0.0 to 1.0).
- Returns:
The filtered DataFrame with highly missing features and samples removed.
- Return type:
pd.DataFrame
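The two-stage filtering described above can be sketched as follows. This is an illustration rather than the library's code: `drop_sparse` is a hypothetical name, and it assumes that "exceed" means strictly greater than `missing_fraction` and that row missingness is measured on the columns that survive the first stage.

```python
import numpy as np
import pandas as pd


def drop_sparse(df: pd.DataFrame, missing_fraction: float = 0.2) -> pd.DataFrame:
    """Hypothetical sketch: drop sparse columns first, then sparse rows."""
    # Stage 1: keep features whose NaN fraction does not exceed the limit.
    col_missing = df.isna().mean(axis=0)
    kept = df.loc[:, col_missing <= missing_fraction]
    # Stage 2: keep samples whose NaN fraction, over the remaining
    # features, does not exceed the limit.
    row_missing = kept.isna().mean(axis=1)
    return kept.loc[row_missing <= missing_fraction]


df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing: dropped
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],        # 20% missing: kept
    "d": [5.0, 4.0, 3.0, 2.0, 1.0],
})
filtered = drop_sparse(df, missing_fraction=0.2)
```

The order matters: removing column `b` first changes which rows count as sparse. In this example, row 1 is dropped because one of its three remaining values is missing.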
- bioneuralnet.utils.data.variance_summary(df: DataFrame, var_threshold: float | None = None) → dict
Computes key summary statistics for the feature (column) variances within an omics DataFrame.
This is useful for assessing feature distribution and identifying low-variance features prior to modeling.
- Parameters:
df (pd.DataFrame) – The input omics DataFrame (samples as rows, features as columns).
var_threshold (Optional[float]) – A threshold used to count features falling below this variance level.
- Returns:
A dictionary containing the mean, median, min, max, and standard deviation of the column variances. If a threshold is provided, it also includes ‘Number Of Low Variance Features’.
- Return type:
dict
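A comparable summary can be built directly from `DataFrame.var`. In this sketch the function name `column_variance_summary` and the statistic keys are hypothetical; only the key 'Number Of Low Variance Features' is taken from the description above. It also assumes the sample variance (pandas' default `ddof=1`) and that "below" means strictly less than the threshold.

```python
from typing import Optional

import pandas as pd


def column_variance_summary(df: pd.DataFrame,
                            var_threshold: Optional[float] = None) -> dict:
    """Hypothetical sketch: summarize per-feature variances."""
    variances = df.var(axis=0)  # sample variance (ddof=1) per column
    summary = {
        "mean": float(variances.mean()),
        "median": float(variances.median()),
        "min": float(variances.min()),
        "max": float(variances.max()),
        "std": float(variances.std()),
    }
    if var_threshold is not None:
        # Count the features whose variance falls below the threshold.
        summary["Number Of Low Variance Features"] = int(
            (variances < var_threshold).sum()
        )
    return summary


df = pd.DataFrame({
    "g1": [1.0, 1.0, 1.0, 1.0],    # constant feature, variance 0
    "g2": [0.0, 2.0, 0.0, 2.0],
    "g3": [0.0, 10.0, 0.0, 10.0],
})
with_threshold = column_variance_summary(df, var_threshold=0.5)
without_threshold = column_variance_summary(df)
```

Only the constant feature `g1` falls below the threshold of 0.5, so the count is 1.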
- bioneuralnet.utils.data.zero_summary(df: DataFrame, zero_threshold: float | None = None) → dict
Computes statistics on the fraction of zero values present in each feature (column).
This helps identify feature sparsity, which is common in omics data (e.g., RNA-seq FPKM).
- Parameters:
df (pd.DataFrame) – The input omics DataFrame.
zero_threshold (Optional[float]) – A threshold used to count features whose zero-fraction exceeds this value.
- Returns:
A dictionary containing the mean, median, min, max, and standard deviation of the zero fractions. If a threshold is provided, it also includes ‘Number Of High Zero Features’.
- Return type:
dict
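The zero-fraction statistics can be sketched in the same style. The function name `zero_fraction_summary` and the statistic keys are hypothetical; only 'Number Of High Zero Features' comes from the description above. The sketch assumes that "exceeds" means strictly greater than the threshold.

```python
from typing import Optional

import pandas as pd


def zero_fraction_summary(df: pd.DataFrame,
                          zero_threshold: Optional[float] = None) -> dict:
    """Hypothetical sketch: summarize the share of zeros in each feature."""
    zero_fraction = (df == 0).mean(axis=0)  # fraction of zeros per column
    summary = {
        "mean": float(zero_fraction.mean()),
        "median": float(zero_fraction.median()),
        "min": float(zero_fraction.min()),
        "max": float(zero_fraction.max()),
        "std": float(zero_fraction.std()),
    }
    if zero_threshold is not None:
        # Count the features whose zero fraction exceeds the threshold.
        summary["Number Of High Zero Features"] = int(
            (zero_fraction > zero_threshold).sum()
        )
    return summary


df = pd.DataFrame({
    "g1": [0.0, 0.0, 0.0, 5.0],  # 75% zeros
    "g2": [1.0, 2.0, 3.0, 4.0],  # no zeros
    "g3": [0.0, 1.0, 0.0, 2.0],  # 50% zeros
})
summary = zero_fraction_summary(df, zero_threshold=0.6)
```

With a threshold of 0.6, only `g1` counts as a high-zero feature. Note that NaN values compare unequal to zero, so they are not counted as zeros here.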