-
Notifications
You must be signed in to change notification settings - Fork 340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add transform_request_function
for enqueue_links
#923
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very opinion based, but we should consider returning some other special value for skipping a request instead of None. Seeing e.g. return Request.SKIP
would be more readable than return None
.
Moreover, an empty transform function would mean "skip all requests", which I'm not too fond of - I'd expect that to just let any request go through without any modification.
I like the use of some special flag, in terms of readability. But I'm not sure how convenient it would be in regular use. Also the current behavior is closer to TS, if I'm not mistaken about the if (transformRequestFunction) {
requestOptions = requestOptions
.map((request) => transformRequestFunction(request))
.filter((r) => !!r) as RequestOptions[];
} |
src/crawlee/_request.py
Outdated
@@ -108,6 +108,23 @@ def __eq__(self, other: object) -> bool: | |||
user_data_adapter = TypeAdapter(UserData) | |||
|
|||
|
|||
class RequestOptions(TypedDict, total=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both this and BaseRequestData
? If we could pick one and drop the other, I'd be pretty happy. If we could drop both, I'd be super super happy 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, please no more Request
-whatever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorta in favor of RequestOptions
here - it is simpler to handle than Request
and the dict structure invites people to mutate it, if you know what I mean. On the other hand, shoving the full Request
in would let us use validation on assignment, which may be more user friendly than getting an error only after the options are used to construct a Request
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I preferred RequestOptions
because it better matches the required functionality and behavior of the TS version. And it also allows not to think about cases where, for example, unique_key
was changed after Request
was created....
And yes, a dictionary is a basic data type that all users understand.
I imagine something like this - it would be cleaner to put the special flags in a separate enum and not def transform(request_options: RequestOptions) -> RequestOptions | Literal[Request.SKIP, Request.UNCHANGED]:
if 'page' in request_options["url"]:
return {*request_options, "label": "page"}
if 'campaign' in request_options["url"]:
return Request.SKIP
return Request.UNCHANGED
Yeah, it does filter out results that convert to |
What about returning I would rather not force users to import anything to achieve this behaviour. |
I think that you can't really tell what
I get that, that's why I suggest to put it on |
I don't think so, at least not in the JS version. But we can have both, I agree its more readable with some kind of symbol. Docs could use that with a comment saying |
src/crawlee/_types.py
Outdated
@@ -353,6 +354,8 @@ def __call__( | |||
- `BeautifulSoupCrawler` supports CSS selectors. | |||
label: Label for the newly created `Request` objects, used for request routing. | |||
user_data: User data to be provided to the newly created `Request` objects. | |||
transform_request_function: A function that processes request options before creating a request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transform_request_function: A function that processes request options before creating a request. | |
transform_request_callback: A function that processes request options before creating a request. |
Isn't callback a better name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dunno, callback kinda sounds like something that just performs some random side effect. This function actually directly changes the result of enqueue_links
I have no issue with using string literals in combination with the "new" def transform(request_options: RequestOptions) -> RequestOptions | Literal['skip', 'unchanged']:
... |
From point of view of a person that sees this function for the first time and is not aware of the JS implementation I would expect this transform_request function to accept Request object, mutate it in place and return nothing. I think we should be more explicit with the naming of functions. This name implies side effects on request. I would suggest to make a choice of either doing side effects and not returning anything or creating copy of object and returning it. |
That's why I wouldn't accept functions that return |
Let me disagree, anything that is called |
@@ -108,27 +108,57 @@ def __eq__(self, other: object) -> bool: | |||
user_data_adapter = TypeAdapter(UserData) | |||
|
|||
|
|||
class BaseRequestData(BaseModel): | |||
"""Data needed to create a new crawling request.""" | |||
class RequestOptions(TypedDict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but can't we now use RequestOptions
to type the kwargs of Request.from_url
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't, because these variables from RequestOptions
are used in the from_url
method.
@@ -140,20 +140,31 @@ async def enqueue_links( | |||
selector: str = 'a', | |||
label: str | None = None, | |||
user_data: dict[str, Any] | None = None, | |||
transform_request_function: Callable[[RequestOptions], RequestOptions | Literal['skip', 'unchanged']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should extract this type into some shared module. Users who want to provide a correct return type annotation for their transform_request_function
should not need to copy all this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
class BaseRequestData(BaseModel): | ||
"""Data needed to create a new crawling request.""" | ||
class RequestOptions(TypedDict): | ||
"""Options that can be used to customize request creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ou, so we are getting rid of the BaseRequestData
, ok, that sounds good and makes sense. Thanks.
Co-authored-by: Vlada Dusek <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
Co-authored-by: Vlada Dusek <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job!
@@ -44,6 +44,8 @@ | |||
|
|||
HttpPayload: TypeAlias = bytes | |||
|
|||
RequestTransformAction: TypeAlias = Literal['skip', 'unchanged'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure whether this is necessary 🙂
Description
transform_request_function
forenqueue_links
Issues
context.enqueue_links
Method #894