Implement file-list-batch style catalog import #308

Closed · 3 tasks · Fixed by #334
delucchi-cmu opened this issue on May 14, 2024 · 2 comments

Assignee: delucchi-cmu
Labels: enhancement (New feature or request)

@delucchi-cmu (Contributor):

Feature request

PLACEHOLDER.

There are a lot of details I'm glossing over. I'll write up more later.

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
delucchi-cmu added the enhancement label on May 14, 2024
delucchi-cmu self-assigned this on May 14, 2024
nevencaplar moved this to Todo in HATS / LSDB on May 16, 2024
@troyraen (Collaborator):

If I'm interpreting the title correctly, I think the feature request is:

Add docs and code to the file readers module showing how to pass lists of input files to the reader and have the reader concatenate data from multiple files, as needed, to yield chunks with at least x rows. This should reduce the number of files in the intermediate dataset in cases where the input files are small and numerous.

For reference, a recent import of the ZTF lightcurves produced an intermediate dataset with 4.4 million files. The import took several days to run, and multiple things went wrong at different stages, including obscure but crucial problems with the compute nodes. The large number of files made it practically impossible to verify what was actually on disk at any given time, which was especially painful after some of the intermediate files were deleted during the reducing step; I ended up having to start over completely.
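
To make that concrete, here is a minimal sketch of the kind of batching reader described above. The names and the pandas/parquet assumptions are mine, not the actual hats-import file_readers API: each batch is a list of small input files, and the reader concatenates them until a minimum row count is reached before yielding a chunk.

```python
import pandas as pd


def read_file_batches(input_file_lists, min_rows=100_000):
    """Hypothetical reader: each element of ``input_file_lists`` is a list of
    small input files. Files in a batch are concatenated until at least
    ``min_rows`` rows have accumulated, then yielded as a single chunk, so the
    splitting stage writes far fewer intermediate files."""
    for file_list in input_file_lists:
        buffered = []
        buffered_rows = 0
        for input_file in file_list:
            frame = pd.read_parquet(input_file)  # assumption: parquet inputs
            buffered.append(frame)
            buffered_rows += len(frame)
            if buffered_rows >= min_rows:
                yield pd.concat(buffered, ignore_index=True)
                buffered, buffered_rows = [], 0
        if buffered:
            # Flush whatever is left over at the end of this batch.
            yield pd.concat(buffered, ignore_index=True)
```

Each yielded chunk would then flow through the existing mapping/splitting stages as if it had come from a single larger input file, so the intermediate dataset holds far fewer files.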

@troyraen (Collaborator):

As I recall, @delucchi-cmu recommended sizing the lists so that there are 50-100 lists per worker. A single list of input files per worker is not recommended, because it prevents the pipeline from skipping previously completed input files when resuming the splitting step.
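
For concreteness, a rough sketch of that sizing rule (a hypothetical helper, not project code): split the full file list into about 50-100 batches per worker, so a resumed run can skip completed batches without making individual batches so large that one failure loses much progress.

```python
import math


def batch_input_files(input_files, num_workers, batches_per_worker=75):
    """Hypothetical helper: split ``input_files`` into roughly 50-100 batches
    per worker (75 here) so a resumed run can skip already-completed batches."""
    target_batches = max(1, num_workers * batches_per_worker)
    batch_size = max(1, math.ceil(len(input_files) / target_batches))
    return [
        input_files[i : i + batch_size]
        for i in range(0, len(input_files), batch_size)
    ]


# E.g., if the 4.4 million ZTF files above were imported on 100 workers
# (a hypothetical worker count), batch_size = ceil(4_400_000 / 7500) = 587,
# giving roughly 7,500 batches.
batches = batch_input_files(
    [f"file_{i}.csv" for i in range(10)], num_workers=2, batches_per_worker=2
)
# -> [['file_0.csv', 'file_1.csv', 'file_2.csv'], ..., ['file_9.csv']]
```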
