
ZIM for ab_all_maxi has different sizes between 1.13 and 1.14 #2071

Open
audiodude opened this issue Jul 24, 2024 · 16 comments

@audiodude
Member

The ZIM scraped in July 2024 by 1.14 has a different size than the one scraped in June 2024 by 1.13:

wikipedia_ab_all_maxi_2024-06.zim             2024-06-17 02:21   26M   
wikipedia_ab_all_maxi_2024-07.zim             2024-07-22 20:56   36M

Oddly, this is the opposite problem as #2070. We don't yet know what the issue might be.

@audiodude audiodude added this to the 1.14.0 milestone Jul 24, 2024
@audiodude
Member Author

Here is a .tsv for every entry in the ZIMs. It has the format:

path, june size, july size

comparison.zip

Doing some analysis in pandas, we see that there are 681 webps that are larger in July, out of 2969 total webps:


The mean size difference is +10,661 bytes for those webps that are larger.

However, the total difference, including webps that are smaller in July, is only 6.77 MB:

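The filtering and aggregation described above can be sketched roughly like this in pandas (a toy DataFrame stands in here for the real comparison.tsv; the column names follow the TSV header above):

```python
import pandas as pd

# Toy stand-in for the real comparison data; in practice:
# df = pd.read_csv("comparison.tsv", sep="\t")
df = pd.DataFrame({
    "path": ["A/foo.webp", "A/bar.webp", "A/page.html"],
    "june": [1000, 3000, 500],
    "july": [1500, 2500, 500],
})

# Restrict to webp images only.
webps = df[df["path"].str.endswith(".webp")]

# Entries that grew between June and July.
larger = webps[webps["july"] > webps["june"]]
mean_growth = (larger["july"] - larger["june"]).mean()

# Net difference across all webps, including those that shrank.
net_diff = (webps["july"] - webps["june"]).sum()

print(len(larger), "of", len(webps), "webps larger in July;",
      "mean growth", mean_growth, "bytes; net difference", net_diff, "bytes")
```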

@audiodude
Member Author

Clearly I'm doing something wrong, because the total sums of July sizes and June sizes are only 107 MB and 110 MB:


I tried to iterate over all entries in the ZIM using the _get_entry_by_id hack:

import csv

from libzim.reader import Archive

june = Archive("zims/wikipedia_ab_all_maxi_2024-06.zim")
july = Archive("zims/wikipedia_ab_all_maxi_2024-07.zim")

# Map each entry path to [june_size, july_size]; None marks an entry
# that is absent from one of the two ZIMs.
path_to_sizes = {}

for i in range(june.all_entry_count):
  entry = june._get_entry_by_id(i)
  path_to_sizes[entry.path] = [entry.get_item().size]

for i in range(july.all_entry_count):
  entry = july._get_entry_by_id(i)
  if entry.path in path_to_sizes:
    path_to_sizes[entry.path].append(entry.get_item().size)
  else:
    path_to_sizes[entry.path] = [None, entry.get_item().size]

# Pad entries that only exist in the June ZIM.
for sizes in path_to_sizes.values():
  if len(sizes) == 1:
    sizes.append(None)

with open('comparison.tsv', 'w', newline='') as csvfile:
  csvwriter = csv.writer(csvfile, delimiter='\t')
  csvwriter.writerow(('path', 'june', 'july'))
  for key in sorted(path_to_sizes):
    csvwriter.writerow((key, *path_to_sizes[key]))

@rgaudin
Member

rgaudin commented Jul 26, 2024

Your tsv is not filtered on WEBP files; it contains all entries, including compressed ones (text) and indexes.

@rgaudin
Member

rgaudin commented Jul 26, 2024

sum(june._get_entry_by_id(i).get_item().size for i in range(0, june.all_entry_count) if june._get_entry_by_id(i).get_item().mimetype == "image/webp")
> 13594704  # 12.96 MiB
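The one-liner above looks each entry up twice per iteration (once for the size, once for the mimetype). The same per-mimetype sum can be wrapped in a small helper doing one lookup per entry; a sketch, assuming the same `libzim.reader.Archive` API used in the snippets above:

```python
from collections import defaultdict

def sizes_by_mimetype(archive):
    """Total uncompressed item size per mimetype, one lookup per entry."""
    totals = defaultdict(int)
    for i in range(archive.all_entry_count):
        item = archive._get_entry_by_id(i).get_item()
        totals[item.mimetype] += item.size
    return dict(totals)

# Usage (with the June archive from earlier):
# print(sizes_by_mimetype(june).get("image/webp"))
```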

@rgaudin
Member

rgaudin commented Jul 26, 2024

July WEBP are 24129838 / 23.01 MiB

@audiodude
Member Author

Your tsv is not filtered on WEBP files; it contains all entries, including compressed ones (text) and indexes.

The pandas code limits it to webp:

webps = df[df['path'].str.endswith('.webp')]

@audiodude
Member Author

audiodude commented Jul 27, 2024

The larger question is why is my total size 118 MB?

Edit: I misread the original ZIM sizes as 26/36 GB instead of MB. So actually, uncompressed, 110/118 MB makes sense.

@audiodude
Member Author

Here is my Jupyter notebook with analysis: https://github.com/audiodude/zim-investigation/blob/main/compare.ipynb

@audiodude
Member Author

Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much less:

30244	zims/wikipedia_ab_all_maxi_2024-08.113.zim
34728	zims/wikipedia_ab_all_maxi_2024-08.114.zim

@audiodude
Member Author

Dumping some of the webps from the respective ZIMs, we see that the 1.14 ones are much bigger:

$ du zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/*
16	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/1924WOlympicPoster.jpg.webp
16	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/200412_-_Plaqueminier_et_ses_kakis.jpg.webp
12	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg.webp
20	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Carmen_habanera_original.jpg.webp
20	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Carmen_-_illustration_by_Luc_for_Journal_Amusant_1911.jpg.webp
8	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Christos_Acheiropoietos.jpg.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Hovenia_dulcis.jpg.webp
12	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Lashkendar_temple_ruins.JPG.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Paliurus_fg01.jpg.webp
20	zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim/Tsebelda_iconostasis.jpg.webp

$ du zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/*
112	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/1924WOlympicPoster.jpg.webp
16	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/200412_-_Plaqueminier_et_ses_kakis.jpg.webp
120	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Ambara_church_ruins_in_Abkhazia%2C_1899.jpg.webp
112	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Carmen_habanera_original.jpg.webp
144	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Carmen_-_illustration_by_Luc_for_Journal_Amusant_1911.jpg.webp
112	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Christos_Acheiropoietos.jpg.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Hovenia_dulcis.jpg.webp
168	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Lashkendar_temple_ruins.JPG.webp
28	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Paliurus_fg01.jpg.webp
140	zim_file_dump/wikipedia_ab_all_maxi_2024-08.114.zim/Tsebelda_iconostasis.jpg.webp

Confirmed manually that the 1.14 images have much bigger dimensions.
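The manual dimension check can also be done programmatically; a sketch using Pillow over the dumped directories (directory names as in the `du` output above):

```python
from pathlib import Path

from PIL import Image

def webp_dimensions(dump_dir):
    """Map each dumped .webp filename to its (width, height) in pixels."""
    dims = {}
    for path in sorted(Path(dump_dir).glob("*.webp")):
        with Image.open(path) as img:
            dims[path.name] = img.size
    return dims

# Usage:
# webp_dimensions("zim_file_dump/wikipedia_ab_all_maxi_2024-08.113.zim")
```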

@audiodude
Member Author

Confirmed this is due to larger images.

@kelson42
Collaborator

kelson42 commented Aug 6, 2024

@audiodude What does that mean concretely in terms of resolution and quality? Are they all impacted in the same manner?

@audiodude
Member Author

@kelson42 The resolutions are much bigger: the images have larger widths and heights in pixels. I didn't do any systematic analysis of the degree to which that is the case, but it's most likely due to #1925

@kelson42
Collaborator

kelson42 commented Aug 7, 2024

@audiodude I'm not against downscaling images, but:

  • To do that properly, we need to understand how this should be done. We should probably find the piece of code in the MCS responsible for that and copy it
  • We should also clarify whether, on top of that, there is a problem with the quality of the webp pictures we produce (which we suspect)
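For illustration only (this is not the MCS logic, which would still need to be located and copied): a minimal downscaling sketch with Pillow, using a hypothetical `max_width` cap and preserving the aspect ratio.

```python
from io import BytesIO

from PIL import Image

def downscale_webp(data, max_width):
    """Re-encode image bytes as WEBP, capping the width at max_width."""
    img = Image.open(BytesIO(data))
    if img.width > max_width:
        ratio = max_width / img.width
        img = img.resize((max_width, round(img.height * ratio)), Image.LANCZOS)
    out = BytesIO()
    img.save(out, format="WEBP")
    return out.getvalue()
```

Quality of the WEBP encoding (the second bullet) is a separate knob: Pillow's `save` accepts a `quality` argument that could be tuned independently of the dimensions.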

@Jaifroid
Collaborator

Jaifroid commented Aug 7, 2024

Doing a more "apples to apples" comparison of the wiki scraped right now with 1.13 versus 1.14, the discrepancy is much less:

30244	zims/wikipedia_ab_all_maxi_2024-08.113.zim
34728	zims/wikipedia_ab_all_maxi_2024-08.114.zim

I noticed this with https://download.kiwix.org/zim/wikivoyage/wikivoyage_en_all_maxi_2024-08.zim, which is scraped with 1.13 from the new endpoint: it has larger images, at least in terms of display dimensions, but the ZIM size hardly increases compared to ZIMs scraped from the old endpoint.

I actually rather like the larger display size for images at least in that Wikivoyage version (which I've just released as a packaged app). If we could hit that sweet-spot in terms of display-size vs compression, it would be a good solution IMHO. What is 1.13 doing right here?
