bioneuralnet.utils.feature_selection
Functions
clean_inf_nan | Clean a numeric DataFrame by handling infinite values, imputing NaNs, and dropping zero-variance columns.
correlation_filter | Select top-k features by correlation in supervised or unsupervised mode.
f_classif | Compute the ANOVA F-value for the provided sample.
f_regression | Univariate linear regression tests returning F-statistic and p-values.
get_logger | Retrieves a global logger configured to write to 'bioneuralnet.log'.
importance_rf | Select top-k features using RandomForest feature importances.
kneighbors_graph | Compute the (weighted) graph of k-Neighbors for points in X.
laplacian_score | Unsupervised feature selection via the Laplacian Score to address dimensionality.
mad_filter | Select the top features by Median Absolute Deviation (MAD).
multipletests | Test results and p-value correction for multiple tests.
pca_loadings | Select features with the highest absolute loadings across the top principal components.
top_anova_f_features | Select top features using ANOVA F-test with FDR correction.
variance_threshold | Select the top-k features with the highest variance after cleaning.
Classes
PCA | Principal component analysis (PCA).
RandomForestClassifier | A random forest classifier.
RandomForestRegressor | A random forest regressor.
StandardScaler | Standardize features by removing the mean and scaling to unit variance.
- bioneuralnet.utils.feature_selection.correlation_filter(X: DataFrame, y: Series | None = None, top_k: int = 1000) → DataFrame
Select top-k features by correlation in supervised or unsupervised mode.
In supervised mode (y provided), features are ranked by their absolute correlation with the target. In unsupervised mode, features are ranked by their mean absolute correlation with all other features to reduce redundancy, and selection is applied after basic cleaning.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix with samples as rows and features as columns.
y (pd.Series | None) – Optional target vector; if provided, supervised selection is used, otherwise unsupervised redundancy-based selection.
top_k (int) – Number of features to select, capped at the number of available numeric features.
- Returns:
Numeric subset of X containing the selected features ordered by correlation-based ranking.
- Return type:
pd.DataFrame
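The two ranking modes described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name `correlation_rank_sketch` is hypothetical, cleaning is omitted, and the unsupervised branch assumes that the features with the lowest mean absolute correlation are kept, since the text does not state the direction of the ranking.

```python
import pandas as pd


def correlation_rank_sketch(X, y=None, top_k=1000):
    """Hypothetical stand-in for correlation-based selection (not library code)."""
    X = X.select_dtypes(include="number")
    k = min(top_k, X.shape[1])  # cap at the number of numeric features
    if y is not None:
        # Supervised: rank by absolute correlation with the target.
        scores = X.corrwith(y).abs()
        ranked = scores.sort_values(ascending=False)
    else:
        # Unsupervised: mean absolute correlation with all *other* features.
        corr = X.corr().abs()
        scores = (corr.sum(axis=0) - 1.0) / (corr.shape[0] - 1)
        # Assumption: least redundant (lowest mean correlation) first.
        ranked = scores.sort_values(ascending=True)
    return X[ranked.index[:k]]


demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 1, 4, 3, 5],
    "c": [5, 1, 4, 2, 3],
})
target = pd.Series([1, 2, 3, 4, 5])

supervised = correlation_rank_sketch(demo, target, top_k=2)
unsupervised = correlation_rank_sketch(demo, top_k=1)
```

In the demo, `a` matches the target exactly and `b` correlates at 0.8, so the supervised call keeps those two; `c` is the least correlated with the other columns, so the unsupervised call keeps it.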
- bioneuralnet.utils.feature_selection.importance_rf(X: DataFrame, y: Series, top_k: int = 1000, seed: int = 119) → DataFrame
Select top-k features using RandomForest feature importances.
Non-numeric columns are rejected, the remaining data are cleaned and zero-variance features removed, a RandomForest classifier or regressor is fitted depending on y, and the top_k most important features are selected based on feature_importances_.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix with samples as rows and features as columns; all columns must be numeric.
y (pd.Series | pd.DataFrame) – Target values as a Series or single-column DataFrame used to determine classification vs regression.
top_k (int) – Maximum number of most important features to keep according to the RandomForest model.
seed (int) – Random seed for initializing the RandomForest estimator.
- Returns:
Cleaned numeric subset of X restricted to the top-k most important features by RandomForest importance.
- Return type:
pd.DataFrame
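A rough sketch of the steps described above, using scikit-learn directly. The helper name `rf_importance_sketch` and its `classify` flag are hypothetical: the text says the estimator is chosen "depending on y" without giving the rule, so this sketch makes the choice explicit. The forest size of 100 trees is also an assumption, and NaN/inf cleaning is left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def rf_importance_sketch(X, y, top_k=1000, seed=119, classify=True):
    """Hypothetical stand-in for RandomForest-based selection (not library code)."""
    non_numeric = X.select_dtypes(exclude="number").columns
    if len(non_numeric) > 0:
        raise ValueError(f"Non-numeric columns: {list(non_numeric)}")
    # Drop zero-variance features before fitting.
    X = X.loc[:, X.var() > 0]
    # Assumption: the caller states the task instead of it being inferred from y.
    Model = RandomForestClassifier if classify else RandomForestRegressor
    model = Model(n_estimators=100, random_state=seed).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    keep = importances.sort_values(ascending=False).index[:top_k]
    return X[keep]


rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(80, 4)), columns=list("abcd"))
demo["const"] = 1.0                     # zero-variance column, should be dropped
labels = (demo["a"] > 0).astype(int)    # the label depends only on column "a"

top = rf_importance_sketch(demo, labels, top_k=2)
```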
- bioneuralnet.utils.feature_selection.laplacian_score(df: DataFrame, n_keep: int, k_neighbors: int = 5) → DataFrame
Unsupervised feature selection via the Laplacian Score to address dimensionality.
Evaluates a feature’s ability to preserve the local manifold structure of the data. The score is computed as the sum of squared differences between connected samples weighted by the global network (W_ij), divided by the feature’s variance. Lower scores indicate higher importance (smoothness on the graph).
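The score described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `laplacian_score_sketch` is hypothetical, the graph is a symmetrized binary k-nearest-neighbor graph built on the raw feature values, and every column is assumed to be numeric with non-zero variance. The library may weight edges or preprocess the data differently.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import kneighbors_graph


def laplacian_score_sketch(df, n_keep, k_neighbors=5):
    """Hypothetical stand-in for Laplacian Score selection (not library code)."""
    F = df.to_numpy(dtype=float)
    # Assumption: binary kNN connectivity, symmetrized so that W_ij == W_ji.
    W = kneighbors_graph(
        F, n_neighbors=k_neighbors, mode="connectivity", include_self=False
    ).toarray()
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian D - W
    # For symmetric W: sum_ij W_ij (f_i - f_j)^2 == 2 * f^T L f, per feature.
    numerator = 2.0 * np.einsum("ij,ij->j", F, L @ F)
    scores = pd.Series(numerator / F.var(axis=0), index=df.columns)
    # Lower score means smoother on the graph, so keep the smallest.
    keep = scores.sort_values(ascending=True).index[:n_keep]
    return df[keep]


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cluster": np.repeat([0.0, 10.0], 20),  # constant within each cluster
    "noise1": rng.normal(size=40),
    "noise2": rng.normal(size=40),
})

top = laplacian_score_sketch(demo, n_keep=1)
```

In the demo, neighbors always fall within the same cluster, so the `cluster` column does not vary along any edge and receives the lowest score.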
- bioneuralnet.utils.feature_selection.mad_filter(df: DataFrame, n_keep: int) → DataFrame
Select the top features by Median Absolute Deviation (MAD).
The Median Absolute Deviation is calculated across samples for each feature, and the features with the highest MAD scores are retained.
- Parameters:
df (pd.DataFrame) – Input DataFrame from which to select high-MAD numeric features.
n_keep (int) – Maximum number of top features to keep in the output.
- Returns:
DataFrame containing only the top n_keep features ranked by MAD.
- Return type:
pd.DataFrame
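As a small illustration of MAD-based ranking, the sketch below computes the median absolute deviation per column with pandas. The helper name `mad_sketch` is hypothetical and no scaling constant is applied to the MAD; whether the library applies one is not stated here.

```python
import pandas as pd


def mad_sketch(df, n_keep):
    """Hypothetical stand-in for MAD-based selection (not library code)."""
    num = df.select_dtypes(include="number")
    # MAD per feature: median of |x - median(x)| across samples.
    mad = (num - num.median(axis=0)).abs().median(axis=0)
    keep = mad.sort_values(ascending=False).index[:n_keep]
    return num[keep]


demo = pd.DataFrame({
    "flat": [1, 1, 1, 1, 10],     # one outlier, MAD = 0
    "wide": [0, 5, 10, 15, 20],   # MAD = 5
    "mid": [1, 2, 3, 4, 5],       # MAD = 1
})

top = mad_sketch(demo, n_keep=2)
```

The `flat` column shows why MAD is a robust choice: a single outlier gives it a large variance, but its MAD is zero, so it ranks last.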
- bioneuralnet.utils.feature_selection.pca_loadings(df: DataFrame, n_keep: int, n_components: int = 50, seed: int = 1883) → DataFrame
Select features with the highest absolute loadings across the top principal components.
The input data is scaled and PCA is applied. Feature importance is determined by weighting each principal component’s loadings by its explained variance ratio and taking the maximum across all selected components.
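- Parameters:
df (pd.DataFrame) – Input DataFrame from which to select features with high PCA loadings.
n_keep (int) – Maximum number of top features to keep in the output.
n_components (int) – Number of top principal components to consider.
seed (int) – Random seed for reproducibility.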
- Returns:
DataFrame containing only the top n_keep features with the highest PCA loadings.
- Return type:
pd.DataFrame
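The weighting scheme described above (absolute loadings scaled by explained variance ratio, then the maximum per feature) can be sketched with scikit-learn. The helper name `pca_loading_sketch` is hypothetical, and capping `n_components` at the data dimensions is an assumption made so the sketch runs on small inputs.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_loading_sketch(df, n_keep, n_components=50, seed=1883):
    """Hypothetical stand-in for PCA-loading selection (not library code)."""
    num = df.select_dtypes(include="number")
    Z = StandardScaler().fit_transform(num)
    # Assumption: cap the component count at what the data can support.
    n_comp = min(n_components, *Z.shape)
    pca = PCA(n_components=n_comp, random_state=seed).fit(Z)
    # Weight each component's absolute loadings by its explained variance ratio.
    weighted = np.abs(pca.components_) * pca.explained_variance_ratio_[:, None]
    # A feature's score is its largest weighted loading over all components.
    scores = pd.Series(weighted.max(axis=0), index=num.columns)
    keep = scores.sort_values(ascending=False).index[:n_keep]
    return num[keep]


rng = np.random.default_rng(0)
base = rng.normal(size=100)
demo = pd.DataFrame({
    "x0": base,
    "x1": base + 0.1 * rng.normal(size=100),
    "x2": base + 0.1 * rng.normal(size=100),
    "noise": rng.normal(size=100),
})

top = pca_loading_sketch(demo, n_keep=3)
```

Here the three correlated columns share a dominant first component, so their weighted loadings exceed that of the independent `noise` column.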
- bioneuralnet.utils.feature_selection.top_anova_f_features(X: DataFrame, y: Series, max_features: int, alpha: float = 0.05, task: str = 'classification') → DataFrame
Select top features using ANOVA F-test with FDR correction.
Numeric features are cleaned, ANOVA F-statistics and p-values are computed against y using f_classif or f_regression, p-values are adjusted with Benjamini-Hochberg FDR, and up to max_features indices are chosen by prioritizing significant features and padding with the strongest remaining ones if needed.
- Parameters:
X (pd.DataFrame) – Numeric feature matrix with samples as rows and features as columns.
y (pd.Series) – Target vector; categorical for classification or continuous for regression, aligned to the rows of X.
max_features (int) – Maximum number of features to return after significance-based ranking and padding.
alpha (float) – Significance threshold applied to FDR-adjusted p-values to define significant features.
task (str) – Task type, either “classification” (uses f_classif) or “regression” (uses f_regression).
- Returns:
Numeric subset of X with up to max_features columns ordered by F-statistic with significant features first and padded by the strongest remaining features if necessary.
- Return type:
pd.DataFrame
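The select-then-pad logic described above might look like the sketch below. The helper names `bh_adjust` and `anova_fdr_sketch` are hypothetical. To keep the example dependency-light, the Benjamini-Hochberg adjustment is written out by hand in NumPy; a library routine such as statsmodels' `multipletests` with `method="fdr_bh"` could be used instead. Cleaning is omitted, so all columns are assumed to be numeric and non-constant.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif, f_regression


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hand-written for this sketch)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def anova_fdr_sketch(X, y, max_features, alpha=0.05, task="classification"):
    """Hypothetical stand-in for F-test selection with FDR (not library code)."""
    num = X.select_dtypes(include="number")
    test = f_classif if task == "classification" else f_regression
    f_stat, pvals = test(num.to_numpy(), np.asarray(y))
    p_adj = bh_adjust(pvals)
    by_strength = np.argsort(-f_stat)  # strongest F-statistic first
    significant = [i for i in by_strength if p_adj[i] < alpha]
    remaining = [i for i in by_strength if p_adj[i] >= alpha]
    # Significant features first, padded with the strongest of the rest.
    chosen = (significant + remaining)[:max_features]
    return num.iloc[:, chosen]


rng = np.random.default_rng(0)
labels = pd.Series(np.repeat([0, 1], 20))
demo = pd.DataFrame({
    "noise1": rng.normal(size=40),
    "signal": labels * 5.0 + rng.normal(size=40),  # separates the two classes
    "noise2": rng.normal(size=40),
})

top = anova_fdr_sketch(demo, labels, max_features=2)
```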
- bioneuralnet.utils.feature_selection.variance_threshold(df: DataFrame, k: int = 1000, ddof: int = 0) → DataFrame
Select the top-k features with the highest variance after cleaning.
The input is first cleaned with clean_inf_nan, then numeric columns are ranked by variance and the top k features are selected (or all if fewer than k are available).
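- Parameters:
df (pd.DataFrame) – Input DataFrame; it is cleaned with clean_inf_nan before features are ranked.
k (int) – Number of highest-variance features to keep.
ddof (int) – Delta degrees of freedom used when computing the variance.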
- Returns:
Numeric DataFrame containing only the top-k highest-variance features after cleaning.
- Return type:
pd.DataFrame
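A minimal sketch of the clean-then-rank procedure described above. The helper name `variance_topk_sketch` is hypothetical, and the inline cleaning only approximates what clean_inf_nan is described as doing: infinities become NaN, NaNs are imputed with the column median (the imputation method is an assumption), and zero-variance columns are dropped.

```python
import numpy as np
import pandas as pd


def variance_topk_sketch(df, k=1000, ddof=0):
    """Hypothetical stand-in for variance-based selection (not library code)."""
    num = df.select_dtypes(include="number")
    # Rough cleaning: inf -> NaN, then median imputation (assumed method).
    num = num.replace([np.inf, -np.inf], np.nan)
    num = num.fillna(num.median())
    variances = num.var(axis=0, ddof=ddof)
    variances = variances[variances > 0]  # drop zero-variance columns
    keep = variances.sort_values(ascending=False).index[:k]
    return num[keep]


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.0, 10.0, 20.0, 30.0],
    "c": [5.0, 5.0, 5.0, 5.0],        # constant, removed
    "d": [1.0, np.inf, 3.0, np.nan],  # cleaned to [1, 2, 3, 2]
})

top2 = variance_topk_sketch(demo, k=2)
all_kept = variance_topk_sketch(demo, k=10)
```

With `k` larger than the number of usable columns, all non-constant columns are returned, ordered from highest to lowest variance.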