Add indexing on (new) split car files #122

Open
gagliardetto opened this issue Jul 3, 2024 · 7 comments
@gagliardetto
Collaborator

gagliardetto commented Jul 3, 2024

This story has two basic parts:

  1. Work out how to create the slot-to-cid, sig-to-cid, sig-exists and gsfa indexes from the split CAR files.
  2. Work out how to create the cid-to-offset index for the split CAR files.

other indexes

Add a merge tool that merges the split CAR files: it takes a directory of splits and combines them into a full CAR file. It should support streaming to stdout.

faithful-cli merge <*.car> - | faithful-cli index - 

The indexing command needs to be able to read stdin.
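A minimal sketch of how an indexing command could accept stdin, assuming the common CLI convention that `-` means "read from stdin". The `openInput` helper and the byte-counting placeholder are hypothetical and not part of faithful-cli:

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// openInput is a hypothetical helper: it returns a reader for path and
// treats "-" as stdin (an assumed convention, not confirmed faithful-cli behavior).
func openInput(path string) (io.ReadCloser, error) {
	if path == "-" {
		// NopCloser lets callers call Close() without closing the process's stdin.
		return io.NopCloser(os.Stdin), nil
	}
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	return f, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: index <car-file | ->")
		os.Exit(2)
	}
	in, err := openInput(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer in.Close()

	// Placeholder for the real indexing pass: only count the streamed bytes.
	n, err := io.Copy(io.Discard, in)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes\n", n)
}
```

Since stdin is not seekable, an indexer fed this way would have to track offsets by counting the bytes it consumes rather than by seeking.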

Alternatively, the indexing command can take a list of files and index them sequentially. That might be an easier approach, and it would also let us generate the cid-to-offset index below in the same pass.

cid-to-offset

The new split car files can be indexed individually if we expand the index to include which subset the CID is in.

This means we might want to create a new index instead, which would do CID-subset-offset-length mapping. In this case we would need to read a list of CAR files and then create the indexes.

Basically the logic would be:

  1. Start reading the first split, making note of the subset that this split is for.
  2. When creating the offset index entry for each CID, count only the offset within the current split, and also store the subset identifier in the offset index entry.
  3. Once the first split is done, continue to the next split, and so on.

Do this in a full pass, i.e. read the CAR files sequentially in the order they were split. Technically the order in which the CAR files are read doesn't matter, since each offset is relative to the start of its own CAR file.
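The steps above could be sketched roughly as follows. This does not parse real CAR files: `Section`, `Split`, `Entry` and `BuildIndex` are hypothetical names, and each split is assumed to be already decoded into a header size plus a list of section sizes.

```go
package main

import "fmt"

// Section is a hypothetical stand-in for one CAR section (CID plus block data).
type Section struct {
	CID  string // CID in string form, for illustration only
	Size uint64 // total bytes the section occupies in the file
}

// Split is a hypothetical description of one split CAR file.
type Split struct {
	Subset     uint32    // identifier of the subset this split is for
	HeaderSize uint64    // bytes taken by the CAR header at the start of the file
	Sections   []Section // sections in file order
}

// Entry is what the index stores per CID.
type Entry struct {
	Subset uint32
	Offset uint64 // relative to the start of the split, not the whole epoch
	Size   uint64
}

// BuildIndex walks the splits one after another and records, for every CID,
// the subset it lives in plus its offset and size within that split.
func BuildIndex(splits []Split) map[string]Entry {
	index := make(map[string]Entry)
	for _, s := range splits {
		offset := s.HeaderSize // offsets restart for every split
		for _, sec := range s.Sections {
			index[sec.CID] = Entry{Subset: s.Subset, Offset: offset, Size: sec.Size}
			offset += sec.Size
		}
	}
	return index
}

func main() {
	splits := []Split{
		{Subset: 0, HeaderSize: 10, Sections: []Section{{"cidA", 100}, {"cidB", 50}}},
		{Subset: 1, HeaderSize: 10, Sections: []Section{{"cidC", 70}}},
	}
	idx := BuildIndex(splits)
	for _, cid := range []string{"cidA", "cidB", "cidC"} {
		fmt.Printf("%s -> %+v\n", cid, idx[cid])
	}
}
```

Because the offset counter is reset at the start of each split, the result is the same regardless of the order in which the splits are passed in.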

@linuskendall
Contributor

linuskendall commented Jul 10, 2024

a

@anjor
Contributor

anjor commented Jul 16, 2024

@linuskendall @gagliardetto which indexes do we want here?

@linuskendall
Contributor

Add a new subset-offset-size index, which looks up which subset a CID is in and then just the offset and size within that subset. This index is generated from the split CAR files.
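For illustration, one possible fixed-width encoding of such an index value. The field widths (4-byte subset, 8-byte offset, 4-byte size), the little-endian byte order and the names `SubsetOffsetSize`, `Encode` and `Decode` are all assumptions made for this sketch, not the format the project uses:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// entrySize is the assumed on-disk width: 4 (subset) + 8 (offset) + 4 (size).
const entrySize = 4 + 8 + 4

// SubsetOffsetSize is a hypothetical index value for one CID.
type SubsetOffsetSize struct {
	Subset uint32 // which split CAR file the CID is in
	Offset uint64 // byte offset within that split
	Size   uint32 // length of the section in bytes
}

// Encode serializes the entry into a fixed-width little-endian record.
func (e SubsetOffsetSize) Encode() []byte {
	buf := make([]byte, entrySize)
	binary.LittleEndian.PutUint32(buf[0:4], e.Subset)
	binary.LittleEndian.PutUint64(buf[4:12], e.Offset)
	binary.LittleEndian.PutUint32(buf[12:16], e.Size)
	return buf
}

// Decode parses a record produced by Encode.
func Decode(buf []byte) (SubsetOffsetSize, error) {
	if len(buf) != entrySize {
		return SubsetOffsetSize{}, errors.New("unexpected entry length")
	}
	return SubsetOffsetSize{
		Subset: binary.LittleEndian.Uint32(buf[0:4]),
		Offset: binary.LittleEndian.Uint64(buf[4:12]),
		Size:   binary.LittleEndian.Uint32(buf[12:16]),
	}, nil
}

func main() {
	in := SubsetOffsetSize{Subset: 3, Offset: 123456, Size: 789}
	out, err := Decode(in.Encode())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", out)
}
```

A lookup would then resolve the subset to a split CAR file first and read `Size` bytes at `Offset` within that file.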

@linuskendall
Contributor

This might be a good time to bump the version of the config file so that only the new version works.

@anjor
Contributor

anjor commented Aug 19, 2024

#146

@linuskendall
Contributor

See #145 as well for the config file format changes.

@anjor
Contributor

anjor commented Nov 6, 2024

We are waiting for the new combined reader here, which should hopefully render this unnecessary.

3 participants