Managing file I/O and Overwrites #4

Open
iguinn opened this issue Mar 2, 2021 · 3 comments

@iguinn
Collaborator

iguinn commented Mar 2, 2021

This is a continuation of a conversation from pull request legend-exp/pygama#153.
Summary:
@sweigart made the overwrite option act as expected for raw_to_dsp. However, ultimately we want to make a few more changes:

  1. Be able to overwrite only specific fields in an HDF5 table. According to @jasondet, "proper overwrite (at the file as well as dataset level) is now implemented but untested in LH5Store in the refactor branch on my fork", so once this is tested and pulled into the main branch we can use it in our processing.
  2. raw_to_dsp will ultimately not handle file I/O, but will instead act as a table_in -> table_out function. According to @mmatteo, the I/O will instead be handled by the dataflow manager (see the sketch at the end of this comment).

If I missed anything important in this summary please add on to it!
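
To make item 2 concrete, here is a rough sketch of what raw_to_dsp could look like as a pure table_in -> table_out function, with all file I/O pushed out to the caller. This is not the current pygama API: `build_processing_chain` and the table objects below are stand-ins, and the signatures are assumptions.

```python
# Illustrative only: names and signatures are assumptions, not the real API.

def raw_to_dsp(table_in, dsp_config):
    """Run the DSP transforms defined in dsp_config on the rows currently
    held in table_in and return the resulting output table. No file I/O
    happens in here; the caller owns reading and writing."""
    proc_chain, table_out = build_processing_chain(table_in, dsp_config)
    proc_chain.execute()  # process whatever rows table_in currently holds
    return table_out
```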

@iguinn
Collaborator Author

iguinn commented Mar 2, 2021

About handling I/O in the dataflow manager: right now we do not read/write an entire file all at once, but in chunks of ~3000 waveforms at a time. As a result, it's not clear to me how raw_to_dsp as a table_in -> table_out function will work. The current pseudocode for raw_to_dsp is:

 Make input table based on contents of input file (but don't read it yet!)
 Make processing chain and output table from JSON config file
 for chunk in file
     read chunk from input file into input table
     execute the processing chain
     write chunk from output table to output file

If we want the dataflow manager to handle the I/O steps, it will have to run that full loop. That also means it will have to interact with the processing chain, not just the input and output tables, so under this proposal raw_to_dsp would also have to return the processing chain (a rough sketch follows).
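
As a sketch of that arrangement (the LH5Store method names, the `read_n_rows` helper, and the raw_to_dsp return value are all assumptions used for illustration, not the actual API):

```python
# Illustrative sketch only: the dataflow manager owns the chunked I/O loop,
# while raw_to_dsp builds the buffers and the processing chain and hands the
# chain back to the caller. Names and signatures below are assumptions.
store = LH5Store()
chunk = 3000  # waveforms per read, as in the current raw_to_dsp

# raw_to_dsp does no file I/O itself; it sets up the in-memory input/output
# buffers (based on the raw file's structure) and returns the chain
table_in, table_out, proc_chain = raw_to_dsp(raw_file, raw_group, dsp_config)

n_rows_tot = store.read_n_rows(raw_group, raw_file)  # assumed helper
for start in range(0, n_rows_tot, chunk):
    # the manager fills the input buffer from the raw file...
    store.read_object(raw_group, raw_file, start_row=start,
                      n_rows=chunk, obj_buf=table_in)
    # ...runs the DSP transforms on that chunk...
    proc_chain.execute()
    # ...and appends the output buffer to the dsp file
    store.write_object(table_out, dsp_group, dsp_file)
```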

@jasondet
Contributor

jasondet commented Apr 8, 2021

These should both be handled with the refactor. Let's keep this open then come back to it then.

@gipert gipert transferred this issue from legend-exp/pygama May 24, 2023
@jasondet
Contributor

In lh5.store.write(), for append or overwrite mode we need to check whether an object being written is going to be a new element of a struct (or a new column of a table, etc.) and update the corresponding attribute.
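
For illustration, here is a minimal sketch of that attribute bookkeeping in plain h5py, assuming the LH5 convention that a struct/table group carries a `datatype` attribute of the form `table{field_a,field_b,...}`. The helper name is hypothetical and the parsing is simplified.

```python
import h5py

def register_new_field(lh5_file, group_path, field):
    """Hypothetical helper: if `field` was just written as a new member of
    the struct/table group at `group_path`, add it to the group's
    'datatype' attribute (e.g. "table{energy,timestamp}")."""
    with h5py.File(lh5_file, "r+") as f:
        grp = f[group_path]
        dt = grp.attrs.get("datatype", "table{}")
        if isinstance(dt, bytes):        # attribute may come back as bytes
            dt = dt.decode()
        prefix, _, body = dt.partition("{")
        names = [n for n in body.rstrip("}").split(",") if n]
        if field not in names:
            names.append(field)
            grp.attrs["datatype"] = prefix + "{" + ",".join(names) + "}"
```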
