Add mirror(s) for OpenLibrary bulk data dumps #9954

Open
tfmorris opened this issue Oct 16, 2024 · 6 comments
Labels
Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Response Issues which require feedback from lead Needs: Staff Decision Issues that are blocked on a staff member's decision Priority: 3 Issues that we can consider at our leisure. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@tfmorris
Contributor

Proposal

The OpenLibrary data is a public resource that represents decades of investment by many volunteers and must be preserved. Data dumps are currently archived only on archive.org, which is housed in a single physical data center with no redundancy, backup, or disaster recovery.

Please establish one or more reliable mirrors for this data, with the usual disaster-resilience diversity in geography, power source, network connection, organizational administration, etc., to ensure its preservation.

Justification

Breakdown

Requirements Checklist

  • [ ]

Related files

Stakeholders


Instructions for Contributors

Please run these commands to ensure your repository is up to date before creating a new branch to work on this issue, and each time after pushing code to GitHub, because the pre-commit bot may add commits to your PRs upstream.

@tfmorris tfmorris added Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] labels Oct 16, 2024
@raresboza

Would have been really useful now. Any known mirrors?

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Oct 23, 2024
@andrewhwanpark

Please, would love this. I have a site that depends on Open Library's API and data. Would love to get a mirror so I can host my own API and DB.

@sarim2000

hey can you guys download dumps?

@mekarpeles
Member

All our dumps are here: https://openlibrary.org/developers/dumps
One option is for us to create torrents for each dump. Another option is to mirror on S3: https://aws.amazon.com/opendata/

@mekarpeles mekarpeles added Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Needs: Staff Decision Issues that are blocked on a staff member's decision Priority: 3 Issues that we can consider at our leisure. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead Needs: Response Issues which require feedback from lead labels Oct 29, 2024
@tfmorris
Contributor Author

I've seen too many poorly seeded torrents, so I have a strong preference for Amazon's Open Data Sets. That would also allow folks to easily use cloud resources to process the dumps, and it would provide more reliable bandwidth (the Oct. 2 editions dump is currently estimated to take 31 hrs to download, even though I was able to download the works dump, which is 1/3 the size, in 7 1/2 minutes).
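To put the bandwidth gap in numbers, here is a back-of-the-envelope sketch using only the figures quoted above (7 1/2 minutes for the works dump, an editions dump about 3x that size, and a 31 hr estimate). The function name `slowdown_factor` is made up for illustration, and the 3x size ratio is the approximate one from the comment, not a measured value.

```python
def slowdown_factor(fast_minutes: float, size_ratio: float, slow_minutes: float) -> float:
    """How many times slower the slow download is than expected.

    The expectation assumes the larger file would download at the same
    throughput as the fast one, i.e. in fast_minutes * size_ratio.
    """
    expected_minutes = fast_minutes * size_ratio
    return slow_minutes / expected_minutes


# Figures taken from the comment above; the 3x size ratio is approximate.
works_minutes = 7.5
editions_minutes = 31 * 60
factor = slowdown_factor(works_minutes, 3, editions_minutes)
print(f"editions dump is downloading about {factor:.0f}x slower than expected")
```

At the works dump's rate, the editions dump would be expected to take about 22.5 minutes, so the 31 hr estimate works out to roughly an 83x drop in effective throughput.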

@github-actions github-actions bot added the Needs: Response Issues which require feedback from lead label Oct 29, 2024
@Freso
Contributor

Freso commented Nov 3, 2024

What is the size of this? If it’s just the <60GB dumps from the “Dumps” header of https://openlibrary.org/developers/dumps, I could probably convince dotsrc to mirror the data (they currently mirror MetaBrainz’s datadumps too). I’m sure there are other open source mirroring projects that would be willing to host copies as well.

For mirror sites to be able to properly mirror though, it would likely be helpful if there was something rsync’able to mirror against. Right now it appears as if the files are just served over regular HTTP linked from regular HTML, which I don’t think rsync is terribly fond of (but I might be mistaken).
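In the absence of an rsync endpoint, a mirror could in principle fall back on plain HTTP metadata to decide what to re-fetch. The following is only a rough sketch of that idea, not how any existing mirror works: it assumes the server answers `HEAD` requests with `Content-Length` and `Last-Modified` headers (not verified for the dump URLs), and the helper names `remote_metadata` and `needs_download` are hypothetical.

```python
import os
import urllib.request
from email.utils import parsedate_to_datetime


def remote_metadata(url):
    """Issue a HEAD request and return (size_in_bytes, mtime_epoch).

    Either value is None if the server does not send the header.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        size = response.headers.get("Content-Length")
        modified = response.headers.get("Last-Modified")
    return (
        int(size) if size is not None else None,
        parsedate_to_datetime(modified).timestamp() if modified else None,
    )


def needs_download(local_path, remote_size, remote_mtime):
    """Decide whether the local copy is missing or looks stale."""
    if not os.path.exists(local_path):
        return True
    stat = os.stat(local_path)
    # A size mismatch means a partial or outdated copy.
    if remote_size is not None and stat.st_size != remote_size:
        return True
    # A local copy older than the remote file is treated as stale.
    if remote_mtime is not None and stat.st_mtime < remote_mtime:
        return True
    return False
```

This compares only size and modification time, so it cannot detect corruption the way a checksum would, and it still needs a separate way to discover the list of files to mirror.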

6 participants