Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zimcheck fails to locate existing entries where a path segment contains a trailing dot #438

Open
IMayBeABitShy opened this issue Dec 31, 2024 · 3 comments

Comments

@IMayBeABitShy
Copy link

So, this is probably an edge case, but if a ZIM contains the path test/a/b./c/ and another entry links to it via <a href="../../test/a/b./c/>...</a>, then zimcheck fails with the message [ERROR] Invalid internal links found:. The exact message (see below) hints that the path was not properly resolved.

General informations

  • zimcheck 3.1.3
  • Debian 12
  • test ZIM file created using python3.11.2 and python-libzim 3.4.0

Steps to reproduce

First, create a ZIM with the following minimal example (a quick modification of the python-libzim example):

import pathlib

from libzim.writer import Creator, Item, StringProvider, FileProvider, Hint


class MyItem(Item):
    def __init__(self, title, path, content = "", fpath = None):
        super().__init__()
        self.path = path
        self.title = title
        self.content = content
        self.fpath = fpath

    def get_path(self):
        return self.path

    def get_title(self):
        return self.title

    def get_mimetype(self):
        return "text/html"

    def get_contentprovider(self):
        if self.fpath is not None:
            return FileProvider(self.fpath)
        return StringProvider(self.content)

    def get_hints(self):
        return {Hint.FRONT_ARTICLE: True}


content = """<html><head><meta charset="UTF-8"><title>Web Page Title</title></head>
<body><h1>Welcome to this ZIM</h1><p>Link to test file is <a href="../../test/a/b./c/">here</a>.</p></body></html>"""
content2 = """<html><head><meta charset="UTF-8"><title>File with dotpath</title></head>
<body><h1>This file has a dot in its filename</h1><p>This file is merely an example.</p></body></html>"""

item = MyItem("Hello Kiwix", "root/home/", content)
item2 = MyItem("Test", "test/a/b./c/", content2)

with Creator("test.zim").config_indexing(True, "eng") as creator:
    creator.set_mainpath("root/home/")
    creator.add_item(item)
    creator.add_item(item2)
    illustration = pathlib.Path("icon.png").read_bytes()
    creator.add_illustration(48, illustration)
    for name, value in {
        "creator": "python-libzim",
        "description": "Created in python",
        "name": "my-zim",
        "publisher": "You",
        "title": "Test ZIM",
        "language": "eng",
        "date": "2024-06-30"
    }.items():

        creator.add_metadata(name.title(), value)

Then, run zimcheck test.zim. This results in the output:

[INFO] Checking zim file test.zim
[INFO] Zimcheck version is 3.1.3
[INFO] Verifying ZIM-archive structure integrity...
[INFO] Avoiding redundant checksum test (already performed by the integrity check).
[INFO] Searching for metadata entries...
[INFO] Searching for Favicon...
[INFO] Searching for main page...
[INFO] Verifying Articles' content...
[INFO] Searching for redundant articles...
  Verifying Similar Articles for redundancies...
[INFO] Checking for redirect loops...
[ERROR] Invalid internal links found:
  The following links:
- ../../test/a/b./c/
(test/a/bc/) were not found in article root/home/
[INFO] Overall Test Status: Fail
[INFO] Total time taken by zimcheck: <3 seconds.

As you can see, the link mentioned above gets wrongly converted. When serving the ZIM using kiwix-serve, the links work perfectly.

Potential cause

After some initial search, I've found a couple of questions on stackoverflow discussing other occurences where paths with such dots were handled weirdly. The responses commented that such paths are handled differently on windows due to some windows naming convention (NOTE: tests performed on debian). Perhaps some library handles ZIM paths similary despite not being windows paths?

@ThisIsFineTM
Copy link

ThisIsFineTM commented Jan 1, 2025

I may have found the cause:

Call stack:
src/zimcheck/checks.cpp: ArticleChecker::check_item -> check_internal_links
src/zimcheck/checks.cpp: ArticleChecker::check_internal_links -> normalize_link
src/tools.cpp: normalize_link

In the call to normalize_link, there is a block that checks looks for "./" and skips over those characters. Given the error shows the "./" being skipped over in the before and after, I believe this is the most likely cause.

- ../../test/a/b./c/
(test/a/bc/) 

@IMayBeABitShy
Copy link
Author

Thank you for looking into this @ThisIsFineTM.

The linked code segment contains a comment // We must simply skip this part. This indicates that this may be intended behavior, but provides no ellaboration on why this is intended. Perhaps this code was intended to resolve ./ at the beginning of a relative path (e.g. resolve ./b/c inside /a to /a/b/c) but fails to ensure that this character sequence is only skipped if it's on the beginning of the path?

I've just run another test, this time also adding an article test/a/b../c/ (notice the double .. following the b). This also fails, but with the error message ../../test/a/b../c/ is out of bounds. Article: root/home/. kiwix-serve+firefox as well as kiwix-desktop handle these URLs correctly. I've also confirmed the expected behavior of these URLs with the relative URL resolution function in python and javascript.

Proposed fix: Adjust lines 423 and 435 in the code snippet linked by @ThisIsFineTM to ensure dots (both single and double occurences) only get resolved as relative path segments if they are the only characters in the current path segment (e.g. resolve them in ./a, ../a, b/../a, b/./a but not in a./, a../, /a/b./c, a/b../c and so on).

@ThisIsFineTM
Copy link

It's both just edge cases with the manual path handling. I'm going to be putting an issue together for changing out the manual path handling with std::filesystem if we can get the maintainers' approval for using c++17 features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants