Slugify and truncate the default collection name #106

Gallaecio · 2024-12-16T17:47:12Z

Fixes #104

zyte_spider_templates/_incremental/manager.py

kmike · 2024-12-16T18:31:29Z

zyte_spider_templates/_incremental/manager.py

+    if name := crawler.settings.get("INCREMENTAL_CRAWL_COLLECTION_NAME"):
+        return name
+    name = get_spider_name(crawler).rstrip("_")[:_MAX_LENGTH] + INCREMENTAL_SUFFIX
+    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


This one may be problematic for another reason: all spiders whose names differ only in characters outside `[a-zA-Z0-9_]` (for example, Unicode names of the same length) would silently get the same collection name and unintentionally reuse the fingerprint DB.
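
To make the concern concrete, here is a minimal sketch that applies the same steps as the diff to two different spider names. The values of `_MAX_LENGTH` and `INCREMENTAL_SUFFIX` are placeholders (the diff does not show them), and `collection_name` is a hypothetical standalone helper that takes the spider name directly instead of a crawler.

```python
import re

# Placeholder values; the real constants are not shown in the diff.
_MAX_LENGTH = 50
INCREMENTAL_SUFFIX = "_incremental"


def collection_name(spider_name: str) -> str:
    # Same steps as the diff: strip trailing underscores, truncate, append the
    # suffix, then replace every character outside [a-zA-Z0-9_] with "_".
    name = spider_name.rstrip("_")[:_MAX_LENGTH] + INCREMENTAL_SUFFIX
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


# Two different spiders whose names are three non-ASCII characters each.
first = collection_name("日本語")
second = collection_name("한국어")
print(first == second)  # both names collapse to "___" plus the suffix
```

Any two names that differ only in replaced characters end up sharing one collection under this scheme.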

Gallaecio · 2024-12-17

If we do not care about human readability of the collection names, we could replace characters with their UTF-8 hexadecimal (e.g. å → C3A5) or some other Unicode character ID.
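
A rough sketch of the hexadecimal idea, assuming the set of valid characters is the one from the regex in the diff. `hex_escape` is a hypothetical name, not something from the PR.

```python
import re


def hex_escape(spider_name: str) -> str:
    """Replace each character outside [a-zA-Z0-9_] with its UTF-8 bytes in hex."""
    return re.sub(
        r"[^a-zA-Z0-9_]",
        lambda match: match.group().encode("utf-8").hex().upper(),
        spider_name,
    )


print(hex_escape("\u00e5"))  # the precomposed "å" from the example above
```

Two caveats: the output grows by up to eight characters per replaced character, which interacts with any length limit, and a spider literally named `C3A5` would still collide with one named `å` unless an escape marker is added.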

Otherwise, I think we are back to slugify or at least Unidecode, with the caveat that the results may change over time since we have no control over those libraries. We could pin them in zyte-spider-templates-project, but eventually some user may find this problematic.

We could also use the numeric spider ID, since collections are project-specific. We can extract it easily from the middle of the job ID. To keep backward compatibility with 0.11, we could do this only for spiders whose names are invalid collection names.
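
A sketch of the spider ID idea, assuming job IDs have the three-part `project/spider/job` form and that valid collection names match the character set from the regex in the diff. The function names, the `spider_` prefix, and the suffix value are all hypothetical, and how the job ID is obtained at runtime is left out.

```python
import re
from typing import Optional

# Assumed from the regex in the diff: letters, digits and underscores only.
_VALID_NAME = re.compile(r"[a-zA-Z0-9_]+")


def spider_id_from_job_id(job_id: str) -> Optional[str]:
    """Return the middle part of a "project/spider/job" ID, or None if malformed."""
    parts = job_id.split("/")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return None
    return parts[1]


def default_collection_name(
    spider_name: str, job_id: str, suffix: str = "_incremental"
) -> str:
    # Names that are already valid keep a name-based collection, so existing
    # spiders would not switch collections.
    if _VALID_NAME.fullmatch(spider_name):
        return spider_name + suffix
    spider_id = spider_id_from_job_id(job_id)
    if spider_id is None:
        raise ValueError(f"Cannot derive a spider ID from job ID {job_id!r}")
    return f"spider_{spider_id}{suffix}"
```

For example, `default_collection_name("日本語", "123/45/6")` would use the ID `45`, while `default_collection_name("books", "123/45/6")` would keep the name-based default.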

I also wonder what the restrictions on spider names are. The solution could also be to bring the formats closer together (e.g. add more validation for virtual spider names).

kmike · 2024-12-17

> Otherwise, I think we are back to slugify or at least Unidecode

Hm, using text-unidecode could be pretty stable. Unlike unidecode or python-slugify, it hasn't had a release in many, many years :)
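
A sketch of the transliterate-then-sanitize approach under discussion. To stay dependency-free, the default `to_ascii` here is a standard-library stand-in based on NFKD normalization, not text-unidecode; with that package installed, `text_unidecode.unidecode` could be passed as `to_ascii` instead. The constants are placeholders and `collection_name` is a hypothetical helper.

```python
import re
import unicodedata
from typing import Callable

# Placeholder values; the real constants are not shown in the diff.
_MAX_LENGTH = 50
INCREMENTAL_SUFFIX = "_incremental"


def strip_accents(text: str) -> str:
    """Stand-in transliteration: decompose characters and drop non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def collection_name(
    spider_name: str, to_ascii: Callable[[str], str] = strip_accents
) -> str:
    # Transliterate first, so that fewer characters need replacing with "_".
    ascii_name = to_ascii(spider_name)
    safe = re.sub(r"[^a-zA-Z0-9_]", "_", ascii_name)
    return safe.rstrip("_")[:_MAX_LENGTH] + INCREMENTAL_SUFFIX
```

The stand-in simply drops characters it cannot decompose (CJK text, for instance), which is one reason a real transliteration table is attractive here.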

Gallaecio · 2024-12-17

Oh, nice! And its maintainer feels familiar :)

PyExplorer · 2024-12-17

What do you think about just disabling the default name creation and requiring an explicit collection name?

Gallaecio · 2024-12-17

Good point. I think it is worth considering. While it is a slightly worse initial user experience, it is the most reliable approach: we would not need to worry about any long-term issues with the default name generation.

We could make it a backward-compatible change by making it required only in the UI with some JSON schema customization, and logging a deprecation error (because warnings are not easily visible in the UI) to encourage users to set it on spiders created with 0.11.
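
The runtime half of that proposal could look roughly like the sketch below. It uses a plain mapping in place of crawler settings, the helper name and message wording are hypothetical, and the JSON schema part is not covered.

```python
import logging
from typing import Mapping

logger = logging.getLogger(__name__)


def resolve_collection_name(settings: Mapping[str, str], fallback: str) -> str:
    """Prefer the explicit setting; otherwise log an error and use the fallback."""
    if name := settings.get("INCREMENTAL_CRAWL_COLLECTION_NAME"):
        return name
    # Logged at ERROR level rather than as a warning so that it stands out.
    logger.error(
        "INCREMENTAL_CRAWL_COLLECTION_NAME is not set; using the generated "
        "default %r. Relying on the default is deprecated, please set the "
        "collection name explicitly.",
        fallback,
    )
    return fallback
```

Spiders created before the change would keep working through `fallback`, while new ones would be asked for a name up front in the UI.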

Slugify and truncate the default collection name

2e53b75

Gallaecio requested review from kmike, wRAR and PyExplorer December 16, 2024 17:47

kmike reviewed Dec 16, 2024

Use a more basic, more reliable slugifying implementation

f4b1daa

kmike reviewed Dec 16, 2024

Use text-unidecode to derive the collection name from a (virtual) spider name

37441e1
