This repository documents efforts to predict Spanish bilingualism from EEG recordings. The project combines exploratory data analysis, feature engineering, and machine learning to develop a predictive model.
The dataset consists of event-related potentials (ERPs) extracted from EEG recordings of 40 participants during a language processing study. Participants were exposed to carefully constructed word-pair stimuli across four languages: English, Spanish, French, and German. The study aimed to elicit distinct ERP patterns, such as the N400, which is linked to semantic processing, in order to assess language proficiency.
Key objectives:
- Primary Goal: Predict Spanish bilingualism using ERP features.
- Bonus Objectives: Extend predictions to German or French bilingualism, or to bilingualism in general.
- Model Trust: Emphasize explainability, performance evaluation, and feature relevance.
Preprocessing:
- StandardScaler: Standardized features to zero mean and unit variance, which is essential for models sensitive to feature scaling.
- SMOTE: Addressed class imbalance by oversampling the minority class (Spanish bilinguals) with synthetic examples.
- PCA (Principal Component Analysis): Reduced the high-dimensional EEG data to a smaller set of principal components that retain most of the variance.
- Recursive Feature Elimination (RFE): Iteratively removed the least important features to optimize model performance.
Feature engineering:
- Segmented ERPs into a pre-stimulus window and ERP component time windows (e.g., N400, P200).
- Extracted the minimum and maximum amplitudes within these windows to capture meaningful variations in brainwave responses.
- Focused on the relationships between primes and targets (e.g., translations, repetitions, unrelated).
- Expected significant ERP differences in bilingual participants for semantically related pairs.
Models evaluated:
- Random Forest: Provided an interpretable baseline but could not fully exploit the temporal and spatial structure of the data.
- 1D Convolutional Neural Network (CNN): Better at capturing temporal and spatial features but had lower recall and F1 scores for bilingual predictions.
- XGBoost: Chosen as the final model for its ability to handle nonlinear patterns effectively and its robustness to incomplete data.
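The window-based minimum and maximum features described in the lists above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the project's actual extraction code: the window boundaries in `WINDOWS_MS`, the function name `window_features`, and the toy signal are all assumptions. Only the `component_channel_stat` naming pattern (as in `n400_P4_max`) is taken from the feature names reported in this README.

```python
from typing import Dict, List, Tuple

# Hypothetical component windows in milliseconds relative to stimulus onset.
# These bounds are illustrative assumptions, not the project's values.
WINDOWS_MS: Dict[str, Tuple[int, int]] = {
    "prestim": (-100, 0),
    "p200": (150, 250),
    "n400": (300, 500),
}


def window_features(
    erp: Dict[str, List[float]],
    times_ms: List[int],
    windows: Dict[str, Tuple[int, int]] = WINDOWS_MS,
) -> Dict[str, float]:
    """Return min/max amplitude per (window, channel) as a flat feature dict."""
    features: Dict[str, float] = {}
    for channel, samples in erp.items():
        if len(samples) != len(times_ms):
            raise ValueError(f"channel {channel!r} length does not match times_ms")
        for name, (start, stop) in windows.items():
            # Keep samples whose timestamp falls in the half-open window [start, stop).
            segment = [v for t, v in zip(times_ms, samples) if start <= t < stop]
            if not segment:
                continue  # window not covered by this recording
            features[f"{name}_{channel}_min"] = min(segment)
            features[f"{name}_{channel}_max"] = max(segment)
    return features


# Toy example: one channel sampled every 50 ms from -100 ms to 500 ms.
times = list(range(-100, 550, 50))
toy_erp = {
    "P4": [0.1, -0.2, 0.0, 0.5, 1.0, 2.0, 1.5, 0.3, -1.0, -3.0, -2.5, -0.5, 0.2]
}
feats = window_features(toy_erp, times)
print(feats["n400_P4_min"], feats["n400_P4_max"])  # -3.0 -0.5
```

Keeping the output as a flat dictionary keyed by feature name makes it straightforward to assemble one row per participant or trial before scaling and model fitting.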
Final Model Performance:
- Accuracy: 80.24%
- Precision (Spanish Speaking): 0.78
- Recall (Spanish Speaking): 0.84
- F1-Score (Spanish Speaking): 0.81
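As a quick sanity check on the figures above, the F1 score is the harmonic mean of precision and recall, so the reported 0.81 should follow from 0.78 and 0.84. The sketch below shows that calculation. The function names are illustrative, and the confusion-matrix counts in the second half are made up purely to show how the metrics are derived from counts.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Consistency check on the precision and recall reported in this README.
reported_f1 = f1_score(0.78, 0.84)
print(round(reported_f1, 2))  # 0.81

# Made-up counts (not project results), only to show the call shape.
toy_precision, toy_recall = precision_recall(tp=8, fp=2, fn=2)
toy_f1 = f1_score(toy_precision, toy_recall)
```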
Hyperparameter tuning:
- Used Randomized Search CV to optimize hyperparameters such as `colsample_bytree`, `learning_rate`, `max_depth`, `min_child_weight`, `n_estimators`, `reg_alpha`, `reg_lambda`, and `subsample`.
Feature importance:
- Key features included specific ERP components such as `n400_P4_max`, `p600_P4_min`, and `n170_Fz_max`.
- These features were strongly associated with the model's bilingualism predictions, which is consistent with the role of these ERP components in semantic and syntactic processing.
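The randomized hyperparameter search mentioned above can be illustrated with a small, dependency-free sketch. The eight hyperparameter names are the ones listed in this README, but every sampling range, the `random_search` helper, and the `toy_score` function are assumptions made for illustration. In practice the score would be a cross-validated metric from a fitted model (for example via scikit-learn's `RandomizedSearchCV`), not a closed-form toy function.

```python
import random
from typing import Callable, Dict, Optional, Tuple

# Hypothetical search space: names come from the README, ranges are assumptions.
SEARCH_SPACE: Dict[str, Callable[[random.Random], float]] = {
    "colsample_bytree": lambda rng: rng.uniform(0.5, 1.0),
    "learning_rate": lambda rng: 10 ** rng.uniform(-2, -0.5),  # log-uniform
    "max_depth": lambda rng: rng.randint(3, 10),
    "min_child_weight": lambda rng: rng.randint(1, 10),
    "n_estimators": lambda rng: rng.randrange(100, 1001, 50),
    "reg_alpha": lambda rng: 10 ** rng.uniform(-3, 1),  # log-uniform
    "reg_lambda": lambda rng: 10 ** rng.uniform(-3, 1),  # log-uniform
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
}


def random_search(
    score_fn: Callable[[Dict[str, float]], float],
    n_iter: int = 20,
    seed: int = 0,
) -> Tuple[Optional[Dict[str, float]], float]:
    """Sample n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params: Optional[Dict[str, float]] = None
    best_score = float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently of the others.
        params = {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, float]) -> float:
    """Stand-in for a cross-validated score; higher is better."""
    return -abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)


best_params, best_score = random_search(toy_score, n_iter=50, seed=0)
print(best_params["max_depth"], round(best_score, 3))
```

Sampling `learning_rate`, `reg_alpha`, and `reg_lambda` on a log scale is a common choice when a parameter spans several orders of magnitude, though the ranges actually used in the project are not stated here.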
This project demonstrated the potential of EEG data for predicting language proficiency, particularly Spanish bilingualism. By combining ERP component features with machine learning techniques, we developed a model that reached 80.24% accuracy and whose predictions can be traced back to specific ERP features.