Arabic is spoken in many countries, but each country has its own dialect. The aim of this task is to build a model that predicts the dialect of a given text.
- Data fetching script
- Data pre-processing
- Applying machine learning (ML)
- Applying deep learning (DL)
- Deployment script
#Evaluating Models
The models are evaluated using four metrics: accuracy, precision, recall, and F-score, reported for each class.
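As a rough illustration of these metrics, the sketch below computes accuracy and per-class precision, recall, and F-score with scikit-learn. The dialect labels and predictions are made-up placeholders, not results from this project.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted dialect labels, for illustration only.
y_true = ["EG", "EG", "SA", "SA", "LB", "LB"]
y_pred = ["EG", "SA", "SA", "SA", "LB", "EG"]
labels = ["EG", "LB", "SA"]

# Overall accuracy across all classes.
accuracy = accuracy_score(y_true, y_pred)

# Precision, recall, and F-score for each class, in the order given by `labels`.
precision, recall, f_score, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

per_class = {
    label: {"precision": p, "recall": r, "f_score": f}
    for label, p, r, f in zip(labels, precision, recall, f_score)
}

print(f"accuracy: {accuracy:.3f}")
for label, scores in per_class.items():
    print(label, {name: round(float(value), 3) for name, value in scores.items()})
```

`sklearn.metrics.classification_report` prints the same per-class figures as a formatted table, if a quick summary is enough.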
You must have scikit-learn, Keras, pandas, KerasTuner, and Flask (for the API) installed.
- Data fetching script.ipynb - This includes code to retrieve the text using IDs.
- Data pre-processing script.ipynb - This includes steps to clean the text data, such as removing URLs, some special characters, Arabic stop words, and emoji.
- split dataset.ipynb - This includes code to split the dataset into a training set and a testing set using a stratified method. The training set is used to optimize and train the models, and the testing set is used to evaluate them.
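The cleaning and splitting steps described above could look roughly like the sketch below. It is not the code from the notebooks: the regular expressions, the three-word stop-word list, the `clean_text` helper, the 80/20 ratio, and the placeholder data are all assumptions made for illustration.

```python
import re

from sklearn.model_selection import train_test_split

# Assumed patterns; the notebooks may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPECIAL_RE = re.compile(r"[^\w\s]")  # anything that is not a letter, digit, underscore, or space

# Tiny illustrative subset of Arabic stop words (hypothetical list).
STOP_WORDS = {"في", "من", "على"}


def clean_text(text):
    """Remove URLs, emoji, special characters, and stop words from one text."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


# Placeholder dataset: 20 texts, 10 per dialect label.
texts = [f"tweet {i}" for i in range(20)]
labels = ["EG"] * 10 + ["SA"] * 10

# Stratified split: each label keeps the same proportion in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```

Passing `stratify=labels` is what keeps the dialect distribution of the testing set close to that of the training set, which matters when some dialects have far fewer examples than others.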
#Training and Evaluating Models
This part includes several files.
First approach: Six ML models are used: support vector machine (SVM), logistic regression (LR), naive Bayes (NB), k-nearest neighbors (KNN), decision tree (DT), and random forest (RF).
- ML with Count Vector and Optimization Method.ipynb - This includes:
  - CountVectorizer as the feature extraction method.
  - Grid search with cross-validation to optimize the models.
  - Results of applying the ML models to the training set and testing set.
- ML with TF_IDF.ipynb.ipynb - This includes:
  - TfidfVectorizer as the feature extraction method.
  - Results of applying the ML models to the training set and testing set.
- ML with Word2Vec.ipynb - This includes:
  - Word2Vec, used to build word vectors as the feature extraction method.
  - Results of applying the ML models to the training set and testing set.
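To show how a vectorizer, a classifier, and grid search with cross-validation fit together, here is a minimal sketch using logistic regression, one of the six models listed above. The toy corpus, the `build_search` helper, the parameter grid, the two-fold setting, and the macro F-score objective are assumptions for illustration and are not taken from the notebooks. The Word2Vec variant is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy placeholder corpus with two classes, standing in for the dialect data.
train_texts = [
    "alpha one", "alpha two", "alpha three", "alpha four",
    "beta one", "beta two", "beta three", "beta four",
]
train_labels = ["A"] * 4 + ["B"] * 4


def build_search(vectorizer):
    """Wrap a vectorizer and a classifier in a pipeline tuned by grid search."""
    pipeline = Pipeline([
        ("features", vectorizer),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Hypothetical grid: only the regularization strength is tuned here.
    param_grid = {"clf__C": [0.1, 1.0, 10.0]}
    return GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")


# Same search, two feature extraction methods.
count_search = build_search(CountVectorizer()).fit(train_texts, train_labels)
tfidf_search = build_search(TfidfVectorizer()).fit(train_texts, train_labels)

print("CountVectorizer best params:", count_search.best_params_)
print("TfidfVectorizer best params:", tfidf_search.best_params_)
```

Putting the vectorizer inside the pipeline means it is refit on each cross-validation fold, so vocabulary from the held-out fold does not leak into training.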
Files
- KerasTuner is used to optimize the LSTM and GRU models: https://keras.io/keras_tuner/
- Twitter-CBOW and Twitter-SkipGram word embeddings with a vector size of 300 are used: https://github.com/bakrianoo/aravec
- There are four files covering the LSTM and GRU models.
- app.py - This contains Flask APIs that receive text through the GUI or API calls, compute the predicted value based on our model, and return it.
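A prediction endpoint of this kind might be sketched as follows. This is not the contents of app.py: the `/predict` route, the `text` and `dialect` JSON fields, and the `predict_dialect` stub are hypothetical names chosen for illustration, and the stub returns a fixed value instead of calling a trained model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_dialect(text):
    """Hypothetical stand-in for the trained model's prediction call."""
    return "unknown"


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"text": "..."}.
    payload = request.get_json(silent=True)
    text = payload.get("text") if isinstance(payload, dict) else None
    if not text:
        return jsonify({"error": "field 'text' is required"}), 400
    return jsonify({"dialect": predict_dialect(text)})


if __name__ == "__main__":
    app.run()
```

With the server running, a client would send a POST request with a JSON body to `/predict` and read the `dialect` field of the response.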