Add indexing on (new) split car files #122
Comments
@linuskendall @gagliardetto which indexes do we want here?
Add a new subset-offset-size index: it looks up which subset a CID is in, and then just the offset and size within that subset. This index is generated from the split CAR files.
This might be a good time to bump the version of the config file so that only the new version works.
See #145 as well for the config file format changes.
We are waiting for the new combined reader here, which should hopefully render this unnecessary.
This story has two basic parts:
other indexes
Add a merge tool for the split CAR files. It takes a directory of splits and combines them into a full CAR file, and it should support streaming to stdout.
The indexing command needs to be able to read from stdin.
Alternatively, the indexing command can take a list of files and index them sequentially. That might be an easier approach, and it would also let us generate the cid-to-offset index below in the same pass.
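For the merge-and-stream option, a minimal sketch could look like the following. It assumes each split is a standalone CARv1 file (a varint-prefixed header followed by varint-prefixed sections) and that the merged file can reuse the first split's header; both are assumptions about the split format, not something confirmed in this issue. The names `merge_splits`, `read_uvarint` and `encode_uvarint` are hypothetical.

```python
from typing import BinaryIO, Iterable


def read_uvarint(stream: BinaryIO) -> int:
    """Read one unsigned LEB128 varint from the stream."""
    value = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        value |= (byte[0] & 0x7F) << shift
        if byte[0] & 0x80 == 0:
            return value
        shift += 7


def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def merge_splits(splits: Iterable[BinaryIO], out: BinaryIO,
                 chunk_size: int = 1 << 20) -> None:
    """Stream several split CAR files into `out` as one CAR file.

    Assumption: the header of the first split is reused for the merged
    file, and the headers of all later splits are dropped.
    """
    for position, split in enumerate(splits):
        header_len = read_uvarint(split)
        header = split.read(header_len)
        if len(header) != header_len:
            raise EOFError("split ended inside its CAR header")
        if position == 0:
            out.write(encode_uvarint(header_len))
            out.write(header)
        # Everything after the header is a run of sections; copy it as-is.
        while True:
            chunk = split.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

Passing `sys.stdout.buffer` as `out` would stream the merged CAR to stdout, so the indexing command could consume it from stdin without the full file ever being written to disk.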
cid-to-offset
The new split car files can be indexed individually if we expand the index to include which subset the CID is in.
This means we might want to create a new index instead, which would do CID-subset-offset-length mapping. In this case we would need to read a list of CAR files and then create the indexes.
Basically the logic would be:
Do this in a full pass, i.e. read the CAR files sequentially as they are split. Technically it doesn't matter what order the CAR files are read in, since each offset is relative to the start of its own CAR file.
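A rough sketch of such a pass, assuming each split is a standalone CARv1 file, is shown below. The in-memory dict, the function names, and the choice to record the offset of the section's length prefix and the length of the whole section (prefix included) are illustrative assumptions, not the project's actual index format.

```python
import io
from typing import BinaryIO, Dict, Optional, Sequence, Tuple


def read_uvarint(stream: BinaryIO) -> Tuple[Optional[int], int]:
    """Read an unsigned LEB128 varint; return (value, bytes_consumed).

    Returns (None, 0) when the stream is already at its end.
    """
    value = 0
    shift = 0
    consumed = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if consumed == 0:
                return None, 0
            raise EOFError("stream ended inside a varint")
        consumed += 1
        value |= (byte[0] & 0x7F) << shift
        if byte[0] & 0x80 == 0:
            return value, consumed
        shift += 7


def read_cid(payload: bytes) -> bytes:
    """Return the raw CID bytes at the start of a section payload."""
    if payload[:2] == b"\x12\x20":
        return payload[:34]  # CIDv0: a bare sha2-256 multihash
    stream = io.BytesIO(payload)
    read_uvarint(stream)  # CID version
    read_uvarint(stream)  # content codec
    read_uvarint(stream)  # multihash function code
    digest_len, _ = read_uvarint(stream)
    return payload[:stream.tell() + digest_len]


def index_splits(splits: Sequence[BinaryIO]) -> Dict[bytes, Tuple[int, int, int]]:
    """Map each CID to (subset, offset, length) across all splits."""
    index: Dict[bytes, Tuple[int, int, int]] = {}
    for subset, stream in enumerate(splits):
        header_len, prefix_len = read_uvarint(stream)
        if header_len is None:
            raise ValueError("split %d is empty" % subset)
        stream.read(header_len)
        # Offsets restart for every split: they are relative to that file.
        offset = prefix_len + header_len
        while True:
            section_len, prefix_len = read_uvarint(stream)
            if section_len is None:
                break
            payload = stream.read(section_len)
            if len(payload) != section_len:
                raise EOFError("split %d ended inside a section" % subset)
            length = prefix_len + section_len
            index[read_cid(payload)] = (subset, offset, length)
            offset += length
    return index
```

Because every entry carries its own subset number, the order in which the splits are visited only affects which number each file is assigned, not the offsets recorded for it.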