Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a request fingerprinter that accounts for dependencies #172

Merged
merged 20 commits into from
Jan 11, 2024

Conversation

Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Nov 6, 2023

Pending decisions:

@Gallaecio Gallaecio requested review from kmike, wRAR and BurnzZ November 6, 2023 14:13
Copy link

codecov bot commented Nov 6, 2023

Codecov Report

Merging #172 (55326a4) into master (0eef7ae) will increase coverage by 1.71%.
The diff coverage is 99.09%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #172      +/-   ##
==========================================
+ Coverage   85.11%   86.83%   +1.71%     
==========================================
  Files          14       15       +1     
  Lines         786      896     +110     
==========================================
+ Hits          669      778     +109     
- Misses        117      118       +1     
Files Coverage Δ
scrapy_poet/__init__.py 100.00% <100.00%> (ø)
scrapy_poet/_request_fingerprinter.py 100.00% <100.00%> (ø)
scrapy_poet/utils/testing.py 79.86% <94.73%> (+2.00%) ⬆️

Copy link
Contributor

@BurnzZ BurnzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking earlier today if there are other approaches aside from this but it seems the Request Fingerprinter route is the best way to go 👍

tests/__init__.py Outdated Show resolved Hide resolved
tests/test_request_fingerprinter.py Outdated Show resolved Hide resolved
tests/test_request_fingerprinter.py Show resolved Hide resolved
scrapy_poet/_request_fingerprinter.py Show resolved Hide resolved
tests/__init__.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member Author

I’m adding a test, test_dep_resolution, that shows how fingerprinting does not resolve dependencies, so callbacks that effectively request the same dependencies could end up with a different fingerprint.

I assume this is OK given how messy dependency resolution is due to ItemProvider, and how unlikely such cases are (having 2 callbacks on the same spider that request the same deps indirectly). However, if we ever get rid of ItemProvider, it might be worth considering.

@Gallaecio
Copy link
Member Author

One more thing: page params.

I have just realized that we should probably alter the fingerprint based on their presence in Request.meta, but web-poet says it is an arbitrary dict, so it might not be serializable and thus fingerprintable.

  • Should we try to take page params into account any way for request fingerprinting?
  • Should we log a warning if it is not serializable?
  • How should we fingerprint requests with unserializable parameters?
    • Same fingerprint as requests without parameters
    • Their own fingerprint, but common to all requests with unserializable parameters (fixed salt). i.e. identical requests with different unserializable parameters are considered the same nonetheless.
    • Their own fingerprint, unique for each unserializable parameter set (random salt). i.e. identical requests with the same unserializable parameters are considered different nonetheless.

@kmike kmike merged commit b5c82fd into scrapinghub:master Jan 11, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants