Websites manipulating already rewritten URLs needing fuzzy rules are not working #345

Open
benoit74 opened this issue Jul 2, 2024 · 1 comment
Labels
bug Something isn't working question Further information is requested

Collaborator

benoit74 commented Jul 2, 2024

The scenario has been encountered on https://ir.voanews.com, see openzim/zim-requests#833 (comment)

The scenario is as follows:

  • we want to rewrite image URLs with fuzzy rules so that they can adapt to various screen sizes (the image resolution is embedded in the URL)
  • the original page HTML contains a "default" image URL in the <img src=...>
  • this image URL is therefore statically rewritten (in Python) and fuzzified before the HTML is pushed to the ZIM
  • when the HTML is loaded, some JS code manipulates the "default" image URL (now a fuzzified relative path to a ZIM entry) to select the proper resolution
  • the URL then goes through dynamic rewriting (in JS); the JS code detects that the URL has already been rewritten and does not rewrite it again; the fuzzy rule is therefore not applied to the modified URL and the item is not found in the ZIM
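The failure described in the list above can be reproduced with a small Python sketch. Everything in it is an assumption made for illustration: the regex in `FUZZY_RULE` is a hypothetical stand-in for the real voanews rule, `looks_already_rewritten` is a guess at the kind of check the dynamic rewriter performs, and the fixed `../../../` prefix simply matches the depth of the example page. Only the `.fuzzy.replayweb.page` hostname and the image URLs come from this issue.

```python
import re

# Hypothetical fuzzy rule (an assumption, not the real rule set): collapse every
# resolution variant of an image onto a single "_high" entry on a fuzzy hostname.
FUZZY_RULE = (
    re.compile(r"^gdb\.voanews\.com/([^_/]+)_.*\.(\w+)$"),
    r"gdb.voanews.fuzzy.replayweb.page/\1_high.\2",
)


def apply_fuzzy_rule(host_and_path: str) -> str:
    """Apply the hypothetical fuzzy rule to a scheme-less 'host/path' string."""
    pattern, replacement = FUZZY_RULE
    return pattern.sub(replacement, host_and_path)


def looks_already_rewritten(url: str) -> bool:
    """Hypothetical heuristic: anything that is not absolute was rewritten already."""
    return re.match(r"^(https?:)?//", url) is None


def dynamic_rewrite(url: str, prefix: str = "../../../") -> str:
    """Mimic the dynamic rewriter: skip URLs that look already rewritten."""
    if looks_already_rewritten(url):
        return url  # the fuzzy rule never sees the JS-modified URL
    host_and_path = re.sub(r"^(https?:)?//", "", url)
    return prefix + apply_fuzzy_rule(host_and_path)


IMAGE_ID = "01000000-0aff-0242-ce72-08dc9778f46b"

# The only entry stored in the ZIM for this image (under the hypothetical rule).
zim_entries = {apply_fuzzy_rule(f"gdb.voanews.com/{IMAGE_ID}_w250_r1_s.jpg")}

# URL produced by the page's JS after it swapped the resolution.
js_url = f"../../../gdb.voanews.fuzzy.replayweb.page/{IMAGE_ID}_w1023_r1_s.jpg"

rewritten = dynamic_rewrite(js_url)
found = rewritten.removeprefix("../../../") in zim_entries
print(rewritten, "found in ZIM:", found)
```

Feeding the original absolute URL with `_w1023_` into `dynamic_rewrite` would hit the stored entry, which is why the problem only shows up when the JS starts from the already rewritten path.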

Not dynamically rewriting a URL that has already been statically rewritten is intentional: rewriting it a second time would need at least some special handling to avoid problems, and a second rewrite is usually not needed.

Developing special handling for already rewritten URLs is not possible (yet) because we would need to reverse the whole rewriting logic. Reversing the part that manipulates the path and query string is probably feasible (but complex), but we might also need to reverse the fuzzy rule, and this is not possible yet because fuzzification is a one-way "reduction" operation in most cases.
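To illustrate why a fuzzy rule generally cannot be reversed, here is a minimal sketch. The pattern and replacement are hypothetical (not the real voanews rule); they only serve to show that several distinct inputs collapse onto one output, so the original resolution cannot be recovered from the fuzzified path alone.

```python
import re

# Hypothetical fuzzy rule, assumed for illustration only.
PATTERN = re.compile(r"^gdb\.voanews\.com/([^_/]+)_.*\.(\w+)$")
REPLACEMENT = r"gdb.voanews.fuzzy.replayweb.page/\1_high.\2"


def fuzzify(host_and_path: str) -> str:
    """Reduce a scheme-less 'host/path' string with the hypothetical rule."""
    return PATTERN.sub(REPLACEMENT, host_and_path)


IMAGE_ID = "01000000-0aff-0242-ce72-08dc9778f46b"

# Two different resolutions of the same image.
variants = [f"gdb.voanews.com/{IMAGE_ID}_w{width}_r1_s.jpg" for width in (250, 1023)]

# Both variants map to the same fuzzified entry: the width is discarded,
# so no function of the output alone can tell which variant was requested.
fuzzified = {fuzzify(variant) for variant in variants}
print(len(variants), "inputs ->", len(fuzzified), "output")
```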

Since the URL has been manipulated by the JS anyway, maybe just reversing the hostname change induced by the fuzzification would be enough in most cases; at least it would be enough in the current situation.

Example:

  • HTML page URL: https://ir.voanews.com/a/iran-elections-opposition-dissidents-figures-boycott-call/7681344.html
  • Original image URL: https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jpg
  • Rewritten (and hence fuzzified) URL: ../../../gdb.voanews.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_w250_r1_s.jp
  • URL after JS manipulation to fetch proper resolution: ../../../gdb.voanews.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg
  • URL rewritten today: unchanged, since we correctly detect that the URL has already been rewritten
  • URL we would like to get: ./../../gdb.voanews.fuzzy.replayweb.page/01000000-0aff-0242-ce72-08dc9778f46b_high.jpg
  • Reversed original image URL we need to build for this to work: https://gdb.voanews.com/01000000-0aff-0242-ce72-08dc9778f46b_w1023_r1_s.jpg (in this specific case, reversing the path manipulation and the hostname change would be enough; this is definitely not true for all fuzzy rules / website manipulations, however)
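For this specific example, the "reverse the path manipulation and the hostname change, then apply the fuzzy rule again" idea could look like the sketch below. It rests on several assumptions: `ORIGINAL_HOSTS` is a hypothetical lookup table (the fuzzy hostname alone does not tell us that the original ended in `.com`), the fuzzy pattern is a hypothetical stand-in for the real rule, and the relative prefix is simply carried over unchanged. The original scheme is not recoverable from the rewritten path either, so the sketch works on scheme-less `host/path` strings.

```python
import re

# Hypothetical table mapping a fuzzy hostname back to the original hostname.
ORIGINAL_HOSTS = {"gdb.voanews.fuzzy.replayweb.page": "gdb.voanews.com"}

# Hypothetical fuzzy rule, assumed for illustration only.
FUZZY_PATTERN = re.compile(r"^gdb\.voanews\.com/([^_/]+)_.*\.(\w+)$")
FUZZY_REPLACEMENT = r"gdb.voanews.fuzzy.replayweb.page/\1_high.\2"

# Leading "./" and "../" segments, followed by the ZIM entry path.
RELATIVE_PREFIX = re.compile(r"^((?:\.\.?/)+)(.*)$")


def reverse_hostname(rewritten_url: str) -> tuple[str, str]:
    """Return (relative prefix, original 'host/path') for a rewritten URL."""
    match = RELATIVE_PREFIX.match(rewritten_url)
    if match is None:
        raise ValueError(f"not a relative rewritten URL: {rewritten_url}")
    prefix, entry_path = match.groups()
    host, _, path = entry_path.partition("/")
    # Unknown hosts are left untouched: they were not changed by a fuzzy rule.
    return prefix, f"{ORIGINAL_HOSTS.get(host, host)}/{path}"


def refuzzify(rewritten_url: str) -> str:
    """Undo the hostname change, then apply the fuzzy rule once more."""
    prefix, host_and_path = reverse_hostname(rewritten_url)
    return prefix + FUZZY_PATTERN.sub(FUZZY_REPLACEMENT, host_and_path)


IMAGE_ID = "01000000-0aff-0242-ce72-08dc9778f46b"
js_url = f"../../../gdb.voanews.fuzzy.replayweb.page/{IMAGE_ID}_w1023_r1_s.jpg"

print(reverse_hostname(js_url)[1])  # reversed original host and path
print(refuzzify(js_url))            # path expected to exist in the ZIM
```

A lookup table is used here rather than string surgery on the hostname because the fuzzy hostname drops information; whether such a table can be derived from the rule definitions is an open question.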
@benoit74 benoit74 added bug Something isn't working question Further information is requested labels Jul 2, 2024
@benoit74 benoit74 transferred this issue from openzim/zim-requests Jul 2, 2024
@benoit74 benoit74 added this to the later milestone Jul 2, 2024
Collaborator Author

benoit74 commented Jul 2, 2024

Whoa, zim-requests? I was probably too tired when I wrote this one ^^
