
Running the pipeline on cloud or a big data platform #104

Open
zllai opened this issue Feb 21, 2024 · 1 comment

zllai commented Feb 21, 2024

Dear RedPajama team,

I apologize if this is not the right place to ask questions, but I am curious about several aspects of your project and couldn't find a better way to reach out.

I'm a PhD student at CUHK and recently came across your amazing project. I was impressed by the size of the dataset and the fact that it only takes a few Python scripts to prepare data at such a large volume.

I wonder how many CPUs you used and how much time the processing took. Have you explored using big data platforms like Spark, Flink, or Hadoop to facilitate distributed data processing and storage, or did you consider using cloud services to reduce the cost? When developing the pipeline, how did you manage its evolving versions and evaluate the quality of the dataset that each version generates?

My current research is on designing a low-cost platform for LLM data preparation on the cloud. Your insights would greatly assist researchers like myself.

Best,
Bruce

@mauriceweber
Collaborator

Hi @zllai, thanks for your questions!

> I wonder how many CPUs you used and how much time the processing took.

We used 16 AWS nodes with 64 CPU cores and 500 GB of RAM for the largest part of the pipeline -- the total processing took around 2 months with that setup.

> Have you explored using big data platforms like Spark, Flink, or Hadoop to facilitate distributed data processing and storage, or did you consider using cloud services to reduce the cost?

We have not explored that, but using a framework for scheduling jobs on the different nodes would definitely be useful.
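To make the idea concrete, here is a minimal sketch (not the actual RedPajama code) of how per-shard processing jobs could be fanned out over a Spark cluster. The shard listing and `process_shard` below are hypothetical placeholders for the real per-shard work.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-processing-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical listing of input shards; in practice this would be the set of
# Common Crawl files assigned to this run.
shard_paths = [f"s3://my-bucket/cc-shards/part-{i:05d}.jsonl.gz" for i in range(1024)]

def process_shard(path: str) -> dict:
    # Placeholder for the real per-shard work (download, filter, dedup, write).
    # Returning small stats lets the driver aggregate progress cheaply.
    return {"path": path, "kept_docs": 0}

# One task per shard; Spark schedules the tasks across the worker nodes.
stats = sc.parallelize(shard_paths, numSlices=len(shard_paths)).map(process_shard).collect()
print(f"processed {len(stats)} shards")
```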

> When developing the pipeline, how did you manage its evolving versions and evaluate the quality of the dataset that each version generates?

We used the same version of the pipeline for the entire run, so there was no risk of having conflicting versions when processing different parts of Common Crawl. Typically you would include unit tests to ensure (at least to some degree) the consistency of different versions of the pipeline.
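As an illustration of that last point, a small regression test over fixed samples can pin the behaviour of a filtering rule across pipeline versions. `passes_quality_filter` below is a hypothetical stand-in, not a rule from the RedPajama pipeline.

```python
import pytest

def passes_quality_filter(text: str, min_words: int = 5) -> bool:
    # Hypothetical stand-in for a real pipeline rule (e.g. a word-count filter).
    return len(text.split()) >= min_words

@pytest.mark.parametrize(
    "text,expected",
    [
        ("too short", False),
        ("this document easily has more than five words in it", True),
    ],
)
def test_quality_filter_is_stable(text, expected):
    # If a new pipeline version changes the rule's behaviour on these samples,
    # the test fails and the change has to be made deliberately.
    assert passes_quality_filter(text) == expected
```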

For your research, I would also recommend looking into other tools such as datatrove by HuggingFace and the Dolma toolkit by Allen AI.
