Select specific TSV files for import #7

Open
ItsMeMarc opened this issue Jul 4, 2024 · 3 comments

Comments

@ItsMeMarc

Hello,

As you already mentioned in your description, the database becomes very large. To keep the database smaller, it would be nice to be able to select which TSV files should be imported. In my case, for example, only the files title.basics.tsv.gz, title.akas.tsv.gz, and title.episode.tsv.gz are of interest (at least for now). Perhaps there is the possibility to implement this as a parameter.
Thank you.

Best regards,
Marc

@jojje
Owner

jojje commented Jul 4, 2024

That's a good idea. I can see that as useful if running on small rented VMs in the cloud or similar.

The tricky bit is how to surface that in the CLI in a way that is self-explanatory.
The challenge is not technical. It comes down to users and how much they can be assumed to know already about the Amazon/IMDb datasets: the implicit relations between the different files, and how those relate to the specific data (projections) they want to extract.

For instance, can we assume:

  1. All users are already familiar with the TSV files, and know what each contains?
  2. They know the implicit relations between the partial bits of information those various files contain?
  3. They are able to figure out which specific files they need for their specific task?

Just loading all those files, as is currently done, sidesteps those problems completely, because it offers a consistent dataset with everything a user could possibly want to extract. Cherry-picking opens a can of worms from a user's perspective.

If we at least assume the user has read the readme for this project, then they have a mental model of how things relate, and should be able to figure out from the diagram which tables they need in order to get the data they're after. It would then follow that an option for specifying a subset of table names is the preferable approach. The program would then just fetch the corresponding TSV files and create the subset of relations that data subset allows for.

What are your thoughts on a solution along that line?
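
For illustration only, the table-subset idea could look roughly like the sketch below. The `--only` flag, the table names, and the table-to-file mapping are all assumptions made up for this example, not the project's actual CLI or schema; only the TSV file names come from the IMDb dataset listing.

```python
import argparse

# Hypothetical mapping from table name to IMDb TSV file. The table names
# are illustrative and may not match the project's real schema.
TABLE_FILES = {
    "titles": "title.basics.tsv.gz",
    "akas": "title.akas.tsv.gz",
    "episodes": "title.episode.tsv.gz",
    "ratings": "title.ratings.tsv.gz",
    "crew": "title.principals.tsv.gz",
    "people": "name.basics.tsv.gz",
}


def parse_tables(value):
    """Turn a string like 'titles,akas' into a validated list of table names."""
    names = [n.strip() for n in value.split(",") if n.strip()]
    unknown = [n for n in names if n not in TABLE_FILES]
    if unknown:
        raise argparse.ArgumentTypeError("unknown table(s): " + ", ".join(unknown))
    return names


def files_for(tables):
    """Return the TSV files needed for the requested tables, in request order."""
    return [TABLE_FILES[t] for t in tables]


def build_parser():
    parser = argparse.ArgumentParser(description="Import a subset of IMDb tables")
    # '--only' is a hypothetical flag name; the default keeps today's
    # behaviour of importing everything.
    parser.add_argument(
        "--only",
        type=parse_tables,
        default=list(TABLE_FILES),
        help="comma-separated table names to import (default: all)",
    )
    return parser
```

With this shape, `--only titles,akas,episodes` would resolve to the three files from the original request, and an unknown table name is rejected by argparse before anything is downloaded.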

@ItsMeMarc
Author

I also think that this should be done as intuitively as possible.
In my opinion, importing only the title.basics.tsv.gz file makes no sense, as the links between series and episodes are then missing. Therefore, the title.episode.tsv.gz file should also always be imported. In addition, it also makes sense to take the title.akas.tsv.gz file into account so that you really have all the titles in the database. This could be the minimal option.

And then I see two additional options: Ratings and Crew/People

When I think about it, I would suggest a total of four options for the import process:

  1. Complete (default)
  2. Titles only (see above)
  3. Titles with ratings
  4. Titles with crew/people

If you give users these four options to choose from, they don't even need to know the dependencies.
What do you think?
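
A minimal sketch of how the four options above might map to file sets, assuming a hypothetical `--preset` flag. The preset names and the grouping of files into "ratings" and "crew/people" are guesses for illustration; the thread does not define them, and "complete" here simply means every file listed in the sketch.

```python
import argparse

# File groups. Which files belong to "crew/people" is an assumption.
TITLE_FILES = ("title.basics.tsv.gz", "title.akas.tsv.gz", "title.episode.tsv.gz")
RATING_FILES = ("title.ratings.tsv.gz",)
PEOPLE_FILES = ("title.principals.tsv.gz", "name.basics.tsv.gz")

# Hypothetical preset names for the four proposed options.
PRESETS = {
    "complete": TITLE_FILES + RATING_FILES + PEOPLE_FILES,
    "titles": TITLE_FILES,
    "titles-ratings": TITLE_FILES + RATING_FILES,
    "titles-people": TITLE_FILES + PEOPLE_FILES,
}


def build_parser():
    parser = argparse.ArgumentParser(description="Import IMDb data by preset")
    parser.add_argument(
        "--preset",
        choices=sorted(PRESETS),
        default="complete",  # option 1: complete import stays the default
        help="which group of TSV files to import",
    )
    return parser


def select_files(argv):
    """Return the tuple of TSV files for the preset named on the command line."""
    args = build_parser().parse_args(argv)
    return PRESETS[args.preset]
```

Because every preset is built on top of `TITLE_FILES`, the series/episode links and alternative titles are always present, so users never have to reason about the file dependencies themselves.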

@jojje
Owner

jojje commented Jul 17, 2024

Yes, presenting a sane set for people to choose from would seem the most intuitive. I'll see what I can do when I find some time to work on this. Thanks Marc.
