Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop the Set-Cookie header #132

Merged
merged 8 commits into from
Nov 24, 2023
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 20 additions & 7 deletions scrapy_zyte_api/responses.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
from base64 import b64decode
from copy import copy
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple, Union

Expand All @@ -22,7 +23,7 @@ class ZyteAPIMixin:
# Zyte API already decompresses the HTTP Response Body. Scrapy's
# HttpCompressionMiddleware will error out when it attempts to
# decompress an already decompressed body based on this header.
"content-encoding"
"content-encoding",
}

def __init__(self, *args, raw_api_response: Optional[Dict] = None, **kwargs):
Expand Down Expand Up @@ -90,18 +91,30 @@ def _prepare_headers(cls, api_response: Dict[str, Any]):
input_headers: Optional[List[Dict[str, str]]] = api_response.get(
"httpResponseHeaders"
)
response_cookies: Optional[List[Dict[str, str]]] = api_response.get(
"experimental", {}
).get("responseCookies")
if input_headers:
headers_to_remove = copy(cls.REMOVE_HEADERS)
if response_cookies or "httpResponseBody" not in api_response:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it handle an edge case where only httpResponseHeaders are requested, without httpResponseBody?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it does not, did not even cross my mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm… This is more involved than I initially thought.

Since httpResponseHeaders can be combined with any other output, including extraction (current and future extraction keys), filtering only by “httpResponseHeaders and no other output” is not feasible.

So I think we need to choose:

  • Keep the logic as is, remove Set-Cookie even if httpResponseHeaders is the only output, as long as response cookies are requested as well.
  • Keep the Set-Cookie header whenever httpResponseHeaders is present, even if browserHtml or other outputs are also requested.
  • Keep the Set-Cookie header whenever httpResponseHeaders is present as long as no other known output is requested, and accept that there will be times between Zyte API adding support for a new extraction type and us supporting it here where Set-Cookie would be deleted for those new scenarios. We could expose the list of extractions as a setting as a workaround, if we want to go that far.

Copy link
Member

@kmike kmike Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, so, we need to remove the header if we know that the browser rendering was not used internally. And it's not straightforward to know if it was used or not: e.g. it could be the case httpResponseBody + browserHtml could be supported at the same time. Extraction may also be picking browser vs no browser transparently in the future.

Is it the case that ideally, we'd like to keep the Set-Cookie header always (for informational purposes), but ensure it's not processed by the cookie middleware when responseCookies are present? Or are there any issues with that? Can we address this directly somehow, instead of removing Set-Cookie header conditionally?

Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received? If there is no use case, and addressing the point above is complicated, then can we drop Set-Cookie unconditionally when responseCookies are present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what's the actual use case of keeping Set-Cookie header when responseCookies are received?

@proway2 Thoughts?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

httpResponseBody and browserHtml are mutually exclusive, so there's no way to use these two together at the moment. The problem is that when httpResponseHeaders is used together with browserHtml it's obviously a mistake. There might be the very same cookie in two different places and with different values, therefore the same cookie is going to be sent twice which may lead to errors that are really hard to find and debug. We've seen this with Lowes, when zipcode is automatically populated based on IP and after that a proper location is set and another cookie with completely different zipcode is populated as well and both end up in Scrapy. I'd say that there's no reason to use httpResponseHeaders with browserHtml.
When httpResonseHeaders are used together with httpResponseBody then it's a different story, in this case headers contain valid data as no rendering is used and no other headers are created by the engine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still does not explain why, when using httpResponseBody, you would read the cookies from Set-Cookie instead of responseCookies. It’s accommodating that scenario, i.e. not dropping Set-Cookie even though responseCookies have been received, what makes the implementation tricky.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what stated on our website for httpResponseHeaders:

The Set-Cookie header value, when present, contains the header value received from the main HTTP response. These cookies could have changed later on, e.g. during browser rendering. Usually you will want to ignore this header in favor of responseCookies, which provides the final cookies.

I'm not sure what Usually exactly means here, I'd read cookies from responseCookies. But I guess responseCookies may be set to False but httpResponseHeaders may be set to True, it may be the case when a user needs something from headers (e.g. some key) but doesn't actually need a cookie.

I'd say that if browserHtml is used, remove httpResponseHeaders at all. If httpResponseBody is used and cookies are enabled and responseCookies is True - read cookies from responseCookies, but if responseCookies is False - read cookies from httpResponseHeaders. In any case any other header from httpResponseHeaders must remain untouched.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that when httpResponseHeaders is used together with browserHtml it's obviously a mistake.
I'd say that if browserHtml is used, remove httpResponseHeaders at all.

I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.

Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this; if it's supported by the API, and requested explicitly by the user, then it might not be a mistake.

Based on the discussion in Slack though it's allowed to use httpResponseHeaders+browserHtml it's strongly discouraged. I'm not sure if there's a use case for this.

Overall, it seems we're all on the same page - we should prioritize responseCookies, and there is no issue with removing Set-Cookie from httpResponseHeaders if responseCookies are present, right?

Yes

# Only the experimental.responseCookies response field should
# be used in general to get cookies, since the Set-Cookie
# header may define cookies from the main HTTP response that
# were overridden later on, e.g. during browser rendering.
# However, for HTTP requests (i.e. with httpResponseBody), it
# is OK to keep the Set-Cookie header if
# experimental.responseCookies was not received, as it should
# be identical to what experimental.responseCookies would
# contain.
headers_to_remove.add("set-cookie")
result = {
h["name"]: [h["value"]]
for h in input_headers
if h["name"].lower() not in cls.REMOVE_HEADERS
if h["name"].lower() not in headers_to_remove
}
input_cookies: Optional[List[Dict[str, str]]] = api_response.get(
"experimental", {}
).get("responseCookies")
if input_cookies:
if response_cookies:
result["Set-Cookie"] = []
for cookie in input_cookies:
for cookie in response_cookies:
result["Set-Cookie"].append(
cls._response_cookie_to_header_value(cookie)
)
Expand Down
95 changes: 94 additions & 1 deletion tests/test_responses.py
Original file line number Diff line number Diff line change
Expand Up @@ -235,23 +235,116 @@ def test_response_headers_removal(api_response, cls):
"""
additional_headers = [
{"name": "Content-Encoding", "value": "gzip"},
{"name": "Set-Cookie", "value": "a=b"},
{"name": "X-Some-Other-Value", "value": "123"},
]
raw_response = api_response()
raw_response["httpResponseHeaders"] = additional_headers

response = cls.from_api_response(raw_response)

assert response.headers == {
expected_headers = {
b"X-Some-Other-Value": [b"123"],
**OUTPUT_COOKIE_HEADERS,
}
assert response.headers == expected_headers
assert (
response.raw_api_response["httpResponseHeaders"]
== raw_response["httpResponseHeaders"]
)


INPUT_COOKIES_SIMPLE = [{"name": "c", "value": "d"}]


@pytest.mark.parametrize(
"fields,cls,keep",
[
# For HTTP requests, whether the response Set-Cookie header is kept or
# not depends on whether experimental.responseCookies is received.
*(
(
{
"httpResponseBody": b64encode(PAGE_CONTENT.encode("utf-8")),
"httpResponseHeaders": [
{"name": "Content-Type", "value": "text/html"},
{"name": "Content-Length", "value": str(len(PAGE_CONTENT))},
],
**cookie_fields, # type: ignore[dict-item]
},
ZyteAPIResponse,
keep,
)
for cookie_fields, keep in (
# No response cookies, so Set-Cookie is kept.
(
{},
True,
),
# Response cookies, so Set-Cookie is not kept.
(
{
"experimental": {
"responseCookies": INPUT_COOKIES_SIMPLE,
},
},
False,
),
)
),
# For non-HTTP requests, the response Set-Cookie header is always
# dropped.
*(
(
{
"browserHtml": PAGE_CONTENT,
"httpResponseHeaders": [
{"name": "Content-Type", "value": "text/html"},
{"name": "Content-Length", "value": str(len(PAGE_CONTENT))},
],
**cookie_fields, # type: ignore[dict-item]
},
ZyteAPITextResponse,
False,
)
for cookie_fields in (
{},
{
"experimental": {
"responseCookies": INPUT_COOKIES_SIMPLE,
},
},
)
),
],
)
def test_response_cookie_header(fields, cls, keep):
"""Test the logic to keep or not the Set-Cookie header in response
headers."""
expected_headers = {
**{
header["name"].encode(): [header["value"].encode()]
for header in fields["httpResponseHeaders"]
},
}
if keep:
expected_headers[b"Set-Cookie"] = [b"a=b"]
elif "experimental" in fields:
expected_headers[b"Set-Cookie"] = [b"c=d"]

fields["url"] = "https://example.com"
fields["statusCode"] = 200
fields["httpResponseHeaders"].append({"name": "Set-Cookie", "value": "a=b"})

response = cls.from_api_response(fields)

assert response.headers == expected_headers
assert (
response.raw_api_response["httpResponseHeaders"]
== fields["httpResponseHeaders"]
)


def test__process_response_no_body():
"""The _process_response() function should handle missing 'browserHtml' or
'httpResponseBody'.
Expand Down