This repository offers tools for embedding texts in multiple languages with an efficient workflow. It uses the `transformers` library by Hugging Face and the `make` tool to manage large datasets. `make` ensures robust and incremental processing, allowing you to handle new data, resume interrupted tasks, and run processes across different machines, all while avoiding redundant work.
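To make the core step concrete, here is a minimal sketch of multilingual text embedding with `transformers`. The model name is a placeholder, not the pipeline's configured model, and the mean pooling shown is one common strategy, not necessarily the one this repository uses:

```python
# Minimal sketch of an embedding step, NOT this pipeline's actual code.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "intfloat/multilingual-e5-base"  # placeholder; configure your own
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

texts = [
    "Ein historischer Zeitungsartikel.",  # German
    "Un article de journal historique.",  # French
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per text.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. torch.Size([2, 768])
```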
- Efficient Storage Management: Minimal local storage is required, as the data for each newspaper year is downloaded on the fly and truncated after its results have been uploaded.
- Parallel Processing: Processes run in parallel to optimize throughput.
- Selective Processing: Only the necessary processing steps are executed; outputs that already exist on S3 are not reprocessed.
- S3 Integration: Integration with S3 for storing results and resuming processing. The system ensures that interruptions cannot lead to overwritten files or partial uploads. It is also possible to run everything locally without S3.
- Custom Embedding Options: Flexible configuration via regular environment variables or make variables, including the ability to specify model versions and to filter text data.
- Batch processing of texts is not yet implemented; it will be added in a future release.
- Installing a specialized xFormers implementation for sparse-attention inference is not yet implemented; it should be added in a future release for faster inference.
- Local Storage: Temporary disk space used during processing; fast scratch storage can be used here.
- S3 Storage: Permanent storage for final results; processing can be resumed from it.
To manage dependencies, local file stamps are used:

- Input stamps (`.stamp`): indicate the status of input files on S3.
- Output stamps (no extension, or `.done`): indicate the completion status of output files on S3.

`make` uses these stamps to determine which files need to be processed and which can be skipped. The helper script `lib/sync_s3_filestamps.py` manages the stamps by syncing them with S3.
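As an illustration of the idea only (the actual logic lives in `lib/sync_s3_filestamps.py`), a stamp sync could look roughly like the sketch below, assuming `boto3` and the `SE_*` credentials from `.env`:

```python
# Illustrative sketch of stamp syncing; NOT the actual implementation.
import os
from pathlib import Path

import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["SE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["SE_SECRET_KEY"],
    endpoint_url=os.environ["SE_HOST_URL"],
)

def sync_stamps(bucket: str, prefix: str, build_dir: str) -> None:
    """Create an empty local .stamp file for every object under `prefix`,
    so that make can compare timestamps without downloading the data."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            stamp = Path(build_dir) / bucket / (obj["Key"] + ".stamp")
            stamp.parent.mkdir(parents=True, exist_ok=True)
            stamp.touch()
            # Align the stamp's mtime with the object's LastModified time
            # so make sees the true modification time of the S3 file.
            mtime = obj["LastModified"].timestamp()
            os.utime(stamp, (mtime, mtime))
```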
The processing follows a structured organization:
- S3 Buckets: Data is organized by processing steps and, in some cases, versions.
- Build Directory: A local mirror of the S3 storage, structured similarly for consistency.
```
# Example directory structure
BUILD_DIR/BUCKET/NEWSPAPER/<NEWSPAPER-YEAR>.jsonl.bz2
BUILD_DIR/BUCKET/PROCESSING_TYPE/VERSION/NEWSPAPER/<NEWSPAPER-YEAR>.jsonl.bz2
```
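For example, a hypothetical helper that maps a newspaper and year to its mirrored local path (the bucket and newspaper names below are placeholders):

```python
from pathlib import Path

# Hypothetical helper illustrating the naming convention above.
def rebuilt_path(build_dir: str, bucket: str, newspaper: str, year: int) -> Path:
    return Path(build_dir) / bucket / newspaper / f"{newspaper}-{year}.jsonl.bz2"

print(rebuilt_path("build", "rebuilt-data", "gazette", 1918))
# build/rebuilt-data/gazette/gazette-1918.jsonl.bz2
```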
- Clone the Repository:

  ```sh
  git clone [email protected]:impresso/impresso-text-embedder.git
  cd impresso-text-embedder
  ```

- Configure S3 Credentials: Copy the `dotenv.sample` file to `.env` and modify `.env` to include your S3 credentials (a connectivity check is sketched after these steps):

  ```
  SE_ACCESS_KEY=<your-access-key>
  SE_SECRET_KEY=<your-secret-key>
  SE_HOST_URL=<your host name>
  ```

- Install Dependencies: Ensure `make`, `python3`, and `pip3` are installed. For GPU support in PyTorch, go to https://pytorch.org/get-started/locally/ and get the installation command for your platform. Then run:

  ```sh
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
  pip3 install -r requirements.txt  # or pipenv install
  ```

- Set Up Environment and make Variables:

  ```sh
  cp dotenv.sample .env                      # edit .env with your S3 credentials
  cp local.config.sample.mk local.config.mk  # edit local.config.mk with your local settings
  ```

- Set Up Directories and Model: Create the necessary directories and the list of newspapers to process, and download the Hugging Face model:

  ```sh
  make setup
  make help
  ```

- Process a Single Newspaper: You can specify the list of newspapers to process via the `NEWSPAPER_LIST_FILE` variable. The default list is generated automatically from the S3 bucket:

  ```sh
  make newspaper
  ```

- Parallel Processing of Each Newspaper: To process newspapers in parallel, use:

  ```sh
  make each
  ```
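Before launching long runs, you may want to sanity-check that the `SE_*` credentials in `.env` actually reach your S3 endpoint. A minimal check, assuming `boto3` and `python-dotenv` are installed and `SE_HOST_URL` holds a full endpoint URL:

```python
import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # reads SE_ACCESS_KEY, SE_SECRET_KEY, SE_HOST_URL from .env
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["SE_ACCESS_KEY"],
    aws_secret_access_key=os.environ["SE_SECRET_KEY"],
    endpoint_url=os.environ["SE_HOST_URL"],
)
# Listing buckets is a cheap way to confirm the credentials work.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```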
The data flow is summarized in the following diagram:

```mermaid
flowchart LR
    %% Nodes
    %% External Entities
    subgraph cluster_local ["Local Machine"]
        style cluster_local fill:#FFF3E0,stroke:#FF6F00,stroke-width:1px
        %% Processes
        F{{"Text Embedding Processor"}}
        %% Data Stores
        B[("Local Rebuilt Data")]
        E[("Text Embedding Model")]
        C[("Processed Output Data")]
    end

    subgraph cluster_s3 ["S3 Storage"]
        style cluster_s3 fill:#E0F7FA,stroke:#0097A7,stroke-width:1px
        A[/"Rebuilt Data"/]
        D[/"Processed Data"/]
    end

    %% Data Flows
    A -->|Sync Input| B
    B -->|Data| F
    E -->|Model| F
    F -->|Output| C
    C -->|Upload Output| D
    D -->|Sync Output| C
```
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages, and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719; the second project (2023-2027) is funded by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2018-2024 The Impresso team.
Contributors to this program include: Simon Clematide
This program is provided as open source under the GNU Affero General Public License v3 or later.