Implement file-list-batch style catalog import #308

Closed · 3 tasks · Fixed by #334
delucchi-cmu opened this issue on May 14, 2024 · 2 comments

Assignee: delucchi-cmu
Labels: enhancement (New feature or request)

@delucchi-cmu (Contributor):

Feature request

PLACEHOLDER.

There are a lot of details I'm glossing over. I'll write up more later.

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
delucchi-cmu added the enhancement label on May 14, 2024
delucchi-cmu self-assigned this on May 14, 2024
nevencaplar moved this to Todo in HATS / LSDB on May 16, 2024
@troyraen (Collaborator):

If I'm interpreting the title correctly, I think the feature request is:

Add docs and code to the file readers module showing how to pass lists of input files to the reader and have the reader concatenate data from multiple files, as needed, to yield chunks with at least x rows. This should reduce the number of files in the intermediate dataset in cases where the input files are small and numerous.

For reference, a recent import of the ZTF lightcurves produced an intermediate dataset with 4.4 million files. The import took several days to run, and multiple things went wrong at different stages, including obscure but crucial problems with the compute nodes. The large number of files made it practically impossible to verify what was actually on disk at any given time, which was especially painful after some of the intermediate files were deleted during the reducing step; I ended up having to start over completely.
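
To make that concrete, here is a minimal sketch of the kind of batching reader described above. The names and the pandas/parquet assumptions are mine, not the actual hats-import file_readers API: each batch is a list of small input files, and the reader concatenates them until a minimum row count is reached before yielding a chunk.

```python
import pandas as pd


def read_file_batches(input_file_lists, min_rows=100_000):
    """Hypothetical reader: each element of ``input_file_lists`` is a list of
    small input files. Files in a batch are concatenated until at least
    ``min_rows`` rows have accumulated, then yielded as a single chunk, so the
    splitting stage writes far fewer intermediate files."""
    for file_list in input_file_lists:
        buffered = []
        buffered_rows = 0
        for input_file in file_list:
            frame = pd.read_parquet(input_file)  # assumption: parquet inputs
            buffered.append(frame)
            buffered_rows += len(frame)
            if buffered_rows >= min_rows:
                yield pd.concat(buffered, ignore_index=True)
                buffered, buffered_rows = [], 0
        if buffered:
            # Flush whatever is left over at the end of this batch.
            yield pd.concat(buffered, ignore_index=True)
```

Each yielded chunk would then flow through the existing mapping/splitting stages as if it had come from a single larger input file, so the intermediate dataset holds far fewer files.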

@troyraen (Collaborator):

As I recall, @delucchi-cmu recommended sizing the lists so that there are 50-100 lists per worker. A single list of input files per worker is not recommended, because it prevents the pipeline from skipping previously completed input files when resuming the splitting step.
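
For concreteness, a rough sketch of that sizing rule (a hypothetical helper, not project code): split the full file list into about 50-100 batches per worker, so a resumed run can skip completed batches without making individual batches so large that one failure loses much progress.

```python
import math


def batch_input_files(input_files, num_workers, batches_per_worker=75):
    """Hypothetical helper: split ``input_files`` into roughly 50-100 batches
    per worker (75 here) so a resumed run can skip already-completed batches."""
    target_batches = max(1, num_workers * batches_per_worker)
    batch_size = max(1, math.ceil(len(input_files) / target_batches))
    return [
        input_files[i : i + batch_size]
        for i in range(0, len(input_files), batch_size)
    ]


# E.g., if the 4.4 million ZTF files above were imported on 100 workers
# (a hypothetical worker count), batch_size = ceil(4_400_000 / 7500) = 587,
# giving roughly 7,500 batches.
batches = batch_input_files(
    [f"file_{i}.csv" for i in range(10)], num_workers=2, batches_per_worker=2
)
# -> [['file_0.csv', 'file_1.csv', 'file_2.csv'], ..., ['file_9.csv']]
```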
