# HodHod_Preprocess

🥇 Welcome to HodHod_Preprocess! This repository offers scripts for end-to-end pre-processing of the HodHod dataset. HodHod is the first large-scale, cleaned Persian dataset available in text format. You can find the dataset here.
## Features

- **Normalization and Filtering:** Cleans and filters text data for improved quality.
- **Deduplication:** Removes redundant documents to create a unique dataset.
- **Ease of Use:** Provides clear instructions and scripts for pre-processing.
## Requirements

- Python 3.11+
- All packages listed in `requirements.txt`
## Installation

- Create a virtual environment:

  ```shell
  virtualenv <env_name>
  source <env_name>/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
## Usage

The pre-processing involves the following steps:
- **Normalization and Filtering**

  This step cleans and filters the text data to enhance its quality.

  ```python
  from preprocess.preprocess_document import Preprocessor

  preprocessor = Preprocessor()
  preprocessor.preprocess_files('crawl', filtering=True)
  ```

  - Replace `'crawl'` with the subdirectory containing your data.
  - Set `filtering=True` to remove low-quality documents.
  - The normalized and filtered documents will be stored in the `./result/normalized` directory.
- **Deduplication**

  This step removes redundant documents to create a unique dataset.

  ```python
  from preprocess.deduplication import Deduplication

  deduplication = Deduplication()
  deduplication.preprocess_files('crawl')
  ```

  - Replace `'crawl'` with the subdirectory of your data folder.
  - The deduplicated data will be saved in the `./result/deduplicated` directory.
  - Logs for each step will be available in `./result/logs`.
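To give a feel for what the normalization and filtering step above does, here is a minimal, self-contained sketch. The character mappings and the length threshold are illustrative assumptions, not the actual rules implemented in `preprocess.preprocess_document`:

```python
# Illustrative sketch of Persian text normalization and filtering.
# The mappings and the min_words threshold below are assumptions for
# exposition; the real rules live in Preprocessor.

ARABIC_TO_PERSIAN = {
    "\u064a": "\u06cc",  # Arabic yeh  -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf  -> Persian kaf
    "\u0629": "\u0647",  # teh marbuta -> heh
}

def normalize(text: str) -> str:
    """Unify Arabic variants of Persian letters and collapse whitespace."""
    for src, dst in ARABIC_TO_PERSIAN.items():
        text = text.replace(src, dst)
    return " ".join(text.split())

def keep(text: str, min_words: int = 5) -> bool:
    """Filter out documents that are too short to be useful."""
    return len(text.split()) >= min_words
```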
## Project Structure

```
.
├── data
│   ├── book                            # Example data folder
│   │   ├── History                     # Example data subfolder
│   │   │   └── A Modest Proposal.json  # Example data file
│   │   └── ...
│   ├── social_media                    # Example data folder
│   └── ...
├── preprocess                          # Code directory
└── ...
```
**Data Format:** Each data file should be a plain-text file, or a JSON/JSONL file containing a `"text"` field.
## Deduplication Details

The deduplication consists of four steps:

1. MinHash generation
2. Duplicate-pairs generation (stored in `./result/lsh`)
3. Duplicate-graph construction & search for connected components
4. Deletion of the redundant documents

More information about deduplication can be found here.
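The four steps above can be sketched end-to-end as follows. Brute-force pair comparison stands in for LSH banding, and every name, shingle size, and threshold here is an illustrative assumption — the actual implementation lives in `preprocess.deduplication`:

```python
# Sketch of MinHash-based deduplication: signatures, duplicate pairs,
# connected components, and selection of documents to drop.
import hashlib
from itertools import combinations

def shingles(text, n=3):
    """Word n-grams used as a document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text, num_perm=64):
    """Step 1: one minimum per seeded hash approximates a random permutation."""
    return [min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles(text))
            for seed in range(num_perm)]

def similarity(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def duplicate_pairs(docs, threshold=0.8):
    """Step 2 (brute force here; the real pipeline uses LSH banding)."""
    sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
    return [(a, b) for a, b in combinations(sigs, 2)
            if similarity(sigs[a], sigs[b]) >= threshold]

def connected_components(pairs):
    """Step 3: union-find over duplicate pairs -> {doc_id: component root}."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    return {x: find(x) for x in parent}

def docs_to_drop(pairs):
    """Step 4: keep one representative per component, drop the rest."""
    comp = connected_components(pairs)
    roots = set(comp.values())
    return {doc for doc in comp if doc not in roots}
```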
## License

GNU Lesser General Public License v2.1
See LICENSE
## Acknowledgements

- Developed by the Tehran University NLP Lab
Contributors: