
Running the pipeline on cloud or a big data platform #104

Open
zllai opened this issue Feb 21, 2024 · 1 comment

zllai commented Feb 21, 2024

Dear RedPajama team,

I apologize if this is not the right place to ask questions, but I am curious about several aspects of your project and couldn't find a better way to reach out.

I'm a PhD student at CUHK and recently came across your amazing project. I was impressed by the size of the dataset and the fact that it only takes a few Python scripts to prepare data at such a large volume.

I wonder how many CPUs you used and how much time the processing took. Have you explored using big data platforms like Spark, Flink, or Hadoop to facilitate distributed data processing and storage, or did you consider using cloud services to reduce the cost? When developing the pipeline, how did you manage its evolving versions and evaluate the quality of the dataset that each version generates?

My current research is on designing a low-cost platform for LLM data preparation on the cloud. Your insights would greatly assist researchers like myself.

Best,
Bruce

@mauriceweber
Collaborator

Hi @zllai, thanks for your questions!

> I wonder how many CPUs you used and how much time the processing took.

We used 16 AWS nodes with 64 CPU cores and 500 GB of RAM for the largest part of the pipeline -- the total processing took around 2 months with that setup.

> Have you explored using big data platforms like Spark, Flink, or Hadoop to facilitate distributed data processing and storage, or did you consider using cloud services to reduce the cost?

We have not explored that, but using a framework for scheduling jobs on the different nodes would definitely be useful.
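To make the idea concrete, here is a minimal sketch (not the actual RedPajama code) of how per-shard processing jobs could be fanned out over a Spark cluster. The shard listing and `process_shard` below are hypothetical placeholders for the real per-shard work.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-processing-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical listing of input shards; in practice this would be the set of
# Common Crawl files assigned to this run.
shard_paths = [f"s3://my-bucket/cc-shards/part-{i:05d}.jsonl.gz" for i in range(1024)]

def process_shard(path: str) -> dict:
    # Placeholder for the real per-shard work (download, filter, dedup, write).
    # Returning small stats lets the driver aggregate progress cheaply.
    return {"path": path, "kept_docs": 0}

# One task per shard; Spark schedules the tasks across the worker nodes.
stats = sc.parallelize(shard_paths, numSlices=len(shard_paths)).map(process_shard).collect()
print(f"processed {len(stats)} shards")
```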

> When developing the pipeline, how did you manage its evolving versions and evaluate the quality of the dataset that each version generates?

We used the same version of the pipeline for the entire run, so there was no risk of having conflicting versions when processing different parts of Common Crawl. Typically you would include unit tests to ensure (at least to some degree) the consistency of different versions of the pipeline.
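As an illustration of that last point, a small regression test over fixed samples can pin the behaviour of a filtering rule across pipeline versions. `passes_quality_filter` below is a hypothetical stand-in, not a rule from the RedPajama pipeline.

```python
import pytest

def passes_quality_filter(text: str, min_words: int = 5) -> bool:
    # Hypothetical stand-in for a real pipeline rule (e.g. a word-count filter).
    return len(text.split()) >= min_words

@pytest.mark.parametrize(
    "text,expected",
    [
        ("too short", False),
        ("this document easily has more than five words in it", True),
    ],
)
def test_quality_filter_is_stable(text, expected):
    # If a new pipeline version changes the rule's behaviour on these samples,
    # the test fails and the change has to be made deliberately.
    assert passes_quality_filter(text) == expected
```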

For your research, I would also recommend looking into other tools such as datatrove by HuggingFace and the Dolma toolkit by Allen AI.
