ZIM for bm_all_maxi has different sizes between 1.13 and 1.14 #2070

Closed
audiodude opened this issue Jul 24, 2024 · 6 comments

@audiodude
Member

The ZIM that was scraped in July 2024 by 1.14 for bm_all_maxi is about half the size of the one for June, scraped by 1.13:

wikipedia_bm_all_maxi_2024-06.zim         2024-06-12 00:05   41M   
wikipedia_bm_all_maxi_2024-07.zim         2024-07-22 05:40   23M
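
As a rough check of the "about half the size" figure, the two sizes from the listing can be compared directly. This is only a sketch: `parse_size` is a hypothetical helper (not part of mwoffliner), and it assumes the `M` suffix in the listing means mebibytes.

```python
# Sizes are copied from the directory listing above.
# Assumption: the K/M/G suffixes are binary units (1M = 1024 * 1024 bytes).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> int:
    """Convert a listing size such as '41M' to a byte count (hypothetical helper)."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(float(text[:-1]) * UNITS[suffix])
    return int(text)  # no suffix: already a byte count


june = parse_size("41M")  # wikipedia_bm_all_maxi_2024-06.zim, scraped by 1.13
july = parse_size("23M")  # wikipedia_bm_all_maxi_2024-07.zim, scraped by 1.14

ratio = july / june
print(f"July/June size ratio: {ratio:.2f}")  # prints 0.56
```

A ratio near 0.56 is consistent with the July ZIM being roughly half the size of the June one.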

We've started looking at the ZIMs and there is definitely a disparity in image resolution. Many of the images in the July ZIM have much smaller dimensions.

This could have been caused by clearing the image cache between runs. If 1.14 didn't find the image in the cache, it may have resorted to either:

  1. Downloading it again at a lower resolution
  2. Downloading it and transcoding it to a WebP at a different resolution

@Jaifroid
Collaborator

Jaifroid commented Aug 8, 2024

It seems clear this is the same issue as #2071. Perhaps close this and generalize the title of that?

@audiodude
Member Author

It's actually the opposite problem: the version scraped with 1.14 is smaller, about half the size.

@audiodude
Member Author

The first step in analyzing this would be an "apples to apples" comparison: scrape the wiki as it is now with both 1.13 and 1.14.

@audiodude
Member Author

Here are the results of scraping the current wiki with 1.13 and 1.14:

14M	output/wikipedia_bm_all_maxi_2024-08.113.zim
22M	output/wikipedia_bm_all_maxi_2024-08.114.zim

It is clear that major structural changes between June and July caused the most recent scrapes to be smaller: both are well below the 41M of the June ZIM.
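
To put numbers on that comparison, the August sizes can be expressed as a fraction of the June ZIM. This is a sketch using only the sizes reported in this thread; `fraction_of` is a hypothetical helper, and the values are the rounded megabyte figures quoted above, not exact byte counts.

```python
# Rounded sizes in megabytes, as reported in this thread.
JUNE_MB = 41  # wikipedia_bm_all_maxi_2024-06.zim, scraped by 1.13
august_mb = {"1.13": 14, "1.14": 22}  # current wiki, scraped with each version


def fraction_of(baseline: float, value: float) -> float:
    """Return value as a fraction of baseline (hypothetical helper)."""
    return value / baseline


for version, size in august_mb.items():
    # Prints "1.13: 34% of the June ZIM" and "1.14: 54% of the June ZIM".
    print(f"{version}: {fraction_of(JUNE_MB, size):.0%} of the June ZIM")

# How the two scraper versions compare on the same (current) wiki content.
version_ratio = august_mb["1.14"] / august_mb["1.13"]
print(f"1.14 output is {version_ratio:.2f}x the 1.13 output")  # 1.57x
```

On the current wiki, both scrapes come out well under the June size, and the 1.14 output is the larger of the two.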

@audiodude
Member Author

In the end, it turns out this is in fact the same issue as #2071. Closing as duplicate.

@audiodude closed this as not planned (duplicate) on Aug 26, 2024