Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop the Set-Cookie header #132

Merged
merged 8 commits into from
Nov 24, 2023
Merged

Conversation

Gallaecio
Copy link
Contributor

No description provided.

@codecov
Copy link

codecov bot commented Sep 26, 2023

Codecov Report

Merging #132 (ad1444a) into main (b892453) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head ad1444a differs from pull request most recent head f9fb461. Consider uploading reports for the commit f9fb461 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #132   +/-   ##
=======================================
  Coverage   98.83%   98.84%           
=======================================
  Files          10       10           
  Lines         687      691    +4     
=======================================
+ Hits          679      683    +4     
  Misses          8        8           
Files Coverage Δ
scrapy_zyte_api/responses.py 99.00% <100.00%> (+0.04%) ⬆️

# Cookies should be fetched from experimental.responseCookies only, the
# Set-Cookie header may define cookies from the main HTTP response that
# were overriden later on, e.g. during browser rendering.
"set-cookie",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should remove headers when httpResponseBody = True.

@Gallaecio Gallaecio marked this pull request as draft September 28, 2023 10:52
@Gallaecio Gallaecio requested a review from wRAR October 4, 2023 07:21
@Gallaecio Gallaecio marked this pull request as ready for review October 4, 2023 07:25
Copy link

@proway2 proway2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

if input_headers:
headers_to_remove = copy(cls.REMOVE_HEADERS)
if response_cookies or "httpResponseBody" not in api_response:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it handle an edge case where only httpResponseHeaders are requested, without httpResponseBody?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it does not, did not even cross my mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm… This is more involved than I initially thought.

Since httpResponseHeaders can be combined with any other output, including extraction (current and future extraction keys), filtering only by “httpResponseHeaders and no other output” is not feasible.

So I think we need to choose:

  • Keep the logic as is, remove Set-Cookie even if httpResponseHeaders is the only output, as long as response cookies are requested as well.
  • Keep the Set-Cookie header whenever httpResponseHeaders is present, even if browserHtml or other outputs are also requested.
  • Keep the Set-Cookie header whenever httpResponseHeaders is present as long as no other known output is requested, and accept that there will be times between Zyte API adding support for a new extraction type and us supporting it here where Set-Cookie would be deleted for those new scenarios. We could expose the list of extractions as a setting as a workaround, if we want to go that far.

Copy link
Member

@kmike kmike Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so, we need to remove the header if we know that the browser rendering was not used internally. And it's not straightforward to know if it was used or not: e.g. it could be the case httpResponseBody + browserHtml could be supported at the same time. Extraction may also be picking browser vs no browser transparently in the future.

Is it the case that ideally, we'd like to keep the Set-Cookie header always (for informational purposes), but ensure it's not processed by the cookie middleware when responseCookies are present? Or are there any issues with that? Can we address this directly somehow, instead of removing Set-Cookie header conditionally?

Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received? If there is no use case, and addressing the point above is complicated, then can we drop Set-Cookie unconditionally when responseCookies are present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received?

@proway2 Thoughts?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

httpResponseBody and browserHtml are mutually exclusive, so there's no way to use these two together at the moment. The problem is that when httpResponseHeaders is used together with browserHtml it's obviously a mistake. There might be the very same cookie in two different places and with different values, therefore the same cookie is going to be sent twice which may lead to errors that are really hard to find and debug. We've seen this with Lowes, when zipcode is automatically populated based on IP and after that a proper location is set and another cookie with completely different zipcode is populated as well and both end up in Scrapy. I'd say that there's no reason to use httpResponseHeaders with browserHtml.
When httpResonseHeaders are used together with httpResponseBody then it's a different story, in this case headers contain valid data as no rendering is used and no other headers are created by the engine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still does not explain why, when using httpResponseBody, you would read the cookies from Set-Cookie instead of responseCookies. It’s accommodating that scenario, i.e. not dropping Set-Cookie even though responseCookies have been received, what makes the implementation tricky.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what stated on our website for httpResponseHeaders:

The Set-Cookie header value, when present, contains the header value received from the main HTTP response. These cookies could have changed later on, e.g. during browser rendering. Usually you will want to ignore this header in favor of responseCookies, which provides the final cookies.

I'm not sure what Usually exactly means here, I'd read cookies from responseCookies. But I guess responseCookies may be set to False but httpResponseHeaders may be set to True, it may be the case when a user needs something from headers (e.g. some key) but doesn't actually need a cookie.

I'd say that if browserHtml is used, remove httpResponseHeaders at all. If httpResponseBody is used and cookies are enabled and responseCookies is True - read cookies from responseCookies, but if responseCookies is False - read cookies from httpResponseHeaders. In any case any other header from httpResponseHeaders must remain untouched.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that when httpResponseHeaders is used together with browserHtml it's obviously a mistake.
I'd say that if browserHtml is used, remove httpResponseHeaders at all.

I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.

Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.

Based on the discussion in Slack though it's allowed to use httpResponseHeaders+browserHtml it's strongly discouraged. I'm not sure if there's a use case for this.

Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?

Yes

@wRAR wRAR self-requested a review October 30, 2023 14:57
@Gallaecio Gallaecio merged commit 493a48c into scrapy-plugins:main Nov 24, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants