# HodHod_Preprocess

🥇 Welcome to HodHod_Preprocess! This repository offers scripts for end-to-end pre-processing of the HodHod dataset. HodHod is the first large-scale, cleaned Persian dataset available in text format. You can find the dataset here.
## Features

- **Normalization and Filtering:** Cleans and filters text data for improved quality.
- **Deduplication:** Removes redundant documents to create a unique dataset.
- **Ease of Use:** Provides clear instructions and scripts for pre-processing.
## Requirements

- Python 3.11+
- All packages listed in `requirements.txt`
## Installation

- Create a virtual environment:

  ```shell
  virtualenv <env_name>
  source <env_name>/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
## Usage

The pre-processing involves the following steps:
- **Normalization and Filtering**

  This step cleans and filters the text data to enhance its quality.

  ```python
  from preprocess.preprocess_document import Preprocessor

  preprocessor = Preprocessor()
  preprocessor.preprocess_files('crawl', filtering=True)
  ```

  - Replace `'crawl'` with the subdirectory containing your data.
  - Set `filtering=True` to remove low-quality documents.
  - The normalized and filtered documents will be stored in the `./result/normalized` directory.
- **Deduplication**

  This step removes redundant documents to create a unique dataset.

  ```python
  from preprocess.deduplication import Deduplication

  deduplication = Deduplication()
  deduplication.preprocess_files('crawl')
  ```

  - Replace `'crawl'` with the subdirectory of your data folder.
  - The deduplicated data will be saved in the `./result/deduplicated` directory.
  - Logs for each step will be available in `./result/logs`.
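To give a feel for what the normalization and filtering step above does, here is a minimal, self-contained sketch. The character mappings and the length threshold are illustrative assumptions, not the actual rules implemented in `preprocess.preprocess_document`:

```python
# Illustrative sketch of Persian text normalization and filtering.
# The mappings and the min_words threshold below are assumptions for
# exposition; the real rules live in Preprocessor.

ARABIC_TO_PERSIAN = {
    "\u064a": "\u06cc",  # Arabic yeh  -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf  -> Persian kaf
    "\u0629": "\u0647",  # teh marbuta -> heh
}

def normalize(text: str) -> str:
    """Unify Arabic variants of Persian letters and collapse whitespace."""
    for src, dst in ARABIC_TO_PERSIAN.items():
        text = text.replace(src, dst)
    return " ".join(text.split())

def keep(text: str, min_words: int = 5) -> bool:
    """Filter out documents that are too short to be useful."""
    return len(text.split()) >= min_words
```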
## Project Structure

```
.
├── data
│   ├── book                            # Example data folder
│   │   ├── History                     # Example data subfolder
│   │   │   └── A Modest Proposal.json  # Example data file
│   │   └── ...
│   ├── social_media                    # Example data folder
│   └── ...
├── preprocess                          # Code directory
└── ...
```
**Data Format:** Each data file should be a plain-text file, or a JSON/JSONL file containing a `"text"` field.
## Deduplication Details

The deduplication consists of four steps:

1. MinHash generation
2. Duplicate-pairs generation (stored in `./result/lsh`)
3. Duplicate-graph construction & search for connected components
4. Deletion of the redundant documents

More information about deduplication can be found here.
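The four steps above can be sketched end-to-end as follows. Brute-force pair comparison stands in for LSH banding, and every name, shingle size, and threshold here is an illustrative assumption — the actual implementation lives in `preprocess.deduplication`:

```python
# Sketch of MinHash-based deduplication: signatures, duplicate pairs,
# connected components, and selection of documents to drop.
import hashlib
from itertools import combinations

def shingles(text, n=3):
    """Word n-grams used as a document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text, num_perm=64):
    """Step 1: one minimum per seeded hash approximates a random permutation."""
    return [min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingles(text))
            for seed in range(num_perm)]

def similarity(a, b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def duplicate_pairs(docs, threshold=0.8):
    """Step 2 (brute force here; the real pipeline uses LSH banding)."""
    sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
    return [(a, b) for a, b in combinations(sigs, 2)
            if similarity(sigs[a], sigs[b]) >= threshold]

def connected_components(pairs):
    """Step 3: union-find over duplicate pairs -> {doc_id: component root}."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    return {x: find(x) for x in parent}

def docs_to_drop(pairs):
    """Step 4: keep one representative per component, drop the rest."""
    comp = connected_components(pairs)
    roots = set(comp.values())
    return {doc for doc in comp if doc not in roots}
```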
## License

GNU Lesser General Public License v2.1
See LICENSE
## Acknowledgements

- Developed by the Tehran University NLP Lab
Contributors: