A Python implementation of feature selection algorithms using k-Nearest Neighbor classification. This project implements three different search strategies for finding optimal feature subsets: Forward Selection, Backward Elimination, and Simulated Annealing.
- Three feature selection algorithms:
  - Forward Selection
  - Backward Elimination
  - Simulated Annealing
- k-Nearest Neighbor classification
- Multiple data normalization options
- Leave-one-out cross-validation (see the sketch after this list)
- Support for custom datasets
- Built-in test datasets including Titanic dataset
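
The classifier and the leave-one-out evaluation loop fit together roughly as follows. This is a minimal sketch, assuming a `loo_accuracy` helper that is illustrative rather than the project's actual API:

```python
import numpy as np

def loo_accuracy(data, labels, feature_subset, k=3):
    """Leave-one-out accuracy of kNN restricted to feature_subset.

    Illustrative sketch; the project's real function names may differ.
    """
    X = data[:, feature_subset]            # keep only the selected features
    n = len(labels)
    correct = 0
    for i in range(n):
        # Euclidean distance from the held-out instance to all others
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                  # exclude the held-out point itself
        neighbors = np.argsort(dists)[:k]
        # Majority vote among the k nearest neighbors (binary 0/1 labels)
        prediction = 1 if labels[neighbors].mean() > 0.5 else 0
        correct += int(prediction == labels[i])
    return correct / n
```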
- Python 3.x
- NumPy
- pathlib (Python standard library)
- argparse (Python standard library)
- Clone this repository
- Ensure you have the required dependencies installed:

```
pip install numpy
```
The program can be run from the command line with various arguments:
```
python main.py [options]
```
- `--customdata`, `-d`: Path to a custom dataset file
- `--testdata`: Choose from provided test datasets [`bigdata`, `smalldata`, `titanic`]
- `--search`, `-s`: Select feature search method [`forward`, `backward`, `simulated-annealing`]
- `--debug`: Enable debug logging (default: `False`)
- `--NN`, `-k`: Set k value for k-nearest neighbor (default: `3`)
- `--normalization`, `-norm`: Choose normalization method [`min-max`, `std-normal`, `numpy`, `none`]
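
As a rough sketch, the flags above could be wired up with `argparse` like this. The `forward` and `titanic` defaults come from the "default settings" example below; the other defaults (including the normalization one) are assumptions about the actual parser:

```python
import argparse

# Illustrative parser mirroring the documented flags; defaults for --search
# and --testdata follow the "default settings" example; others are guesses.
parser = argparse.ArgumentParser(description="Feature selection with kNN")
parser.add_argument("--customdata", "-d", help="Path to a custom dataset file")
parser.add_argument("--testdata", choices=["bigdata", "smalldata", "titanic"],
                    default="titanic", help="Built-in test dataset")
parser.add_argument("--search", "-s", default="forward",
                    choices=["forward", "backward", "simulated-annealing"],
                    help="Feature search method")
parser.add_argument("--debug", action="store_true", help="Enable debug logging")
parser.add_argument("--NN", "-k", type=int, default=3,
                    help="k value for k-nearest neighbor")
parser.add_argument("--normalization", "-norm", default="min-max",
                    choices=["min-max", "std-normal", "numpy", "none"],
                    help="Normalization method")
args = parser.parse_args()
```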
```
# Run with default settings (forward selection on titanic dataset)
python main.py

# Run backward elimination on small dataset with k=5
python main.py --search backward --testdata smalldata --NN 5

# Use custom dataset with simulated annealing
python main.py -d path/to/dataset.txt -s simulated-annealing

# Run with different normalization method
python main.py --normalization std-normal
```
Input data is parsed using NumPy's `loadtxt` function.

Input data should be formatted as a text file with:
- First column: binary labels (`0` or `1`)
- Subsequent columns: feature values
- Space-separated values
- One instance per line

Your input dataset should be a `.txt` file and should look something like this:
```
1 0.1 0.2 0.3
0 0.4 0.5 0.6
1 0.7 0.8 0.9
```
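
Given that layout, the file splits cleanly into labels and features after a single `loadtxt` call. A minimal sketch (the path is a placeholder):

```python
import numpy as np

# First column holds the binary class label; remaining columns are features.
raw = np.loadtxt("path/to/dataset.txt")   # placeholder path
labels = raw[:, 0]
features = raw[:, 1:]
```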
- Forward Selection [`forward`]: Starts with no features and iteratively adds the most beneficial features
- Backward Elimination [`backward`]: Starts with all features and iteratively removes the least beneficial features
- Simulated Annealing [`simulated-annealing`]: Uses a probabilistic approach to search the feature space, potentially escaping local optima
- `min-max`: Scales features to the range [0, 1]
- `std-normal`: Standardizes features to a mean of 0 and standard deviation of 1
- `numpy`: Uses NumPy's default normalization
- `none`: No normalization applied
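
For reference, the first two options correspond to these standard transforms. A minimal sketch; the guard for constant columns is an assumption, not necessarily what the project does:

```python
import numpy as np

def min_max(features):
    # Scale each feature column to [0, 1]; constant columns map to 0.
    lo, hi = features.min(axis=0), features.max(axis=0)
    return (features - lo) / np.where(hi > lo, hi - lo, 1.0)

def std_normal(features):
    # Standardize each column to mean 0 and standard deviation 1.
    std = features.std(axis=0)
    return (features - features.mean(axis=0)) / np.where(std > 0, std, 1.0)
```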
MIT License
Equal contributions to this project came from Lindsay Adams.
This project was developed as part of CS-170 at UCR.