Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add transform_request_function for enqueue_links #923

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

Mantisus
Copy link
Collaborator

Description

  • Add transform_request_function for enqueue_links
  • Add example in introduction section
  • Add tests

Issues

@Mantisus Mantisus self-assigned this Jan 20, 2025
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very opinion based, but we should consider returning some other special value for skipping a request instead of None. Seeing e.g. return Request.SKIP would be more readable than return None.

Moreover, an empty transform function would mean "skip all requests", which I'm not too fond of - I'd expect that to just let any request go through without any modification.

docs/introduction/code/03_transform_request.py Outdated Show resolved Hide resolved
@Mantisus
Copy link
Collaborator Author

Very opinion based, but we should consider returning some other special value for skipping a request instead of None. Seeing e.g. return Request.SKIP would be more readable than return None.

Moreover, an empty transform function would mean "skip all requests", which I'm not too fond of - I'd expect that to just let any request go through without any modification.

I like the use of some special flag, in terms of readability. But I'm not sure how convenient it would be in regular use.

Also the current behavior is closer to TS, if I'm not mistaken about the filter operation

if (transformRequestFunction) {
    requestOptions = requestOptions
        .map((request) => transformRequestFunction(request))
        .filter((r) => !!r) as RequestOptions[];
}

@Mantisus Mantisus requested a review from vdusek January 21, 2025 08:28
@@ -108,6 +108,23 @@ def __eq__(self, other: object) -> bool:
user_data_adapter = TypeAdapter(UserData)


class RequestOptions(TypedDict, total=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both this and BaseRequestData? If we could pick one and drop the other, I'd be pretty happy. If we could drop both, I'd be super super happy 😁

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, please no more Request-whatever.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorta in favor of RequestOptions here - it is simpler to handle than Request and the dict structure invites people to mutate it, if you know what I mean. On the other hand, shoving the full Request in would let us use validation on assignment, which may be more user friendly than getting an error only after the options are used to construct a Request.

Copy link
Collaborator Author

@Mantisus Mantisus Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred RequestOptions because it better matches the required functionality and behavior of the TS version. And it also allows not to think about cases where, for example, unique_key was changed after Request was created....

And yes, a dictionary is a basic data type that all users understand.

@janbuchar
Copy link
Collaborator

janbuchar commented Jan 21, 2025

Very opinion based, but we should consider returning some other special value for skipping a request instead of None. Seeing e.g. return Request.SKIP would be more readable than return None.

I like the use of some special flag, in terms of readability. But I'm not sure how convenient it would be in regular use.

I imagine something like this - it would be cleaner to put the special flags in a separate enum and not Request, but you usually import that anyway. Thoughts?

def transform(request_options: RequestOptions) -> RequestOptions | Literal[Request.SKIP, Request.UNCHANGED]:
    if 'page' in request_options["url"]:
        return {*request_options, "label": "page"}
    if 'campaign' in request_options["url"]:
        return Request.SKIP
    return Request.UNCHANGED

Also the current behavior is closer to TS, if I'm not mistaken about the filter operation

if (transformRequestFunction) {
    requestOptions = requestOptions
        .map((request) => transformRequestFunction(request))
        .filter((r) => !!r) as RequestOptions[];
}

Yeah, it does filter out results that convert to false. I'd argue that this behavior is counter-intuitive, even though changing it in the JS version would be breaking in subtle ways. I would be happy to sacrifice parity between versions here, but I'm open to the opinions of others - @Pijukatel @vdusek @B4nan

@B4nan
Copy link
Member

B4nan commented Jan 21, 2025

What about returning false explicitly?

I would rather not force users to import anything to achieve this behaviour.

@janbuchar
Copy link
Collaborator

What about returning false explicitly?

I think that you can't really tell what false means in this context either, but at least it's not the implicit return value...

I would rather not force users to import anything to achieve this behaviour.

I get that, that's why I suggest to put it on Request which you're pretty likely to import anyways. We could also use string literals, but that's not commonplace in Python.

@B4nan
Copy link
Member

B4nan commented Jan 21, 2025

I get that, that's why I suggest to put it on Request which you're pretty likely to import anyways.

I don't think so, at least not in the JS version.

But we can have both, I agree its more readable with some kind of symbol. Docs could use that with a comment saying false also works.

@@ -353,6 +354,8 @@ def __call__(
- `BeautifulSoupCrawler` supports CSS selectors.
label: Label for the newly created `Request` objects, used for request routing.
user_data: User data to be provided to the newly created `Request` objects.
transform_request_function: A function that processes request options before creating a request.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
transform_request_function: A function that processes request options before creating a request.
transform_request_callback: A function that processes request options before creating a request.

Isn't callback a better name?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno, callback kinda sounds like something that just performs some random side effect. This function actually directly changes the result of enqueue_links

@vdusek
Copy link
Collaborator

vdusek commented Jan 22, 2025

I have no issue with using string literals in combination with the "new" Literal type. The type checker will handle typos and other problems, and users won't need to import anything more.

def transform(request_options: RequestOptions) -> RequestOptions | Literal['skip', 'unchanged']:
    ...

@Pijukatel
Copy link
Contributor

Very opinion based, but we should consider returning some other special value for skipping a request instead of None. Seeing e.g. return Request.SKIP would be more readable than return None.

I like the use of some special flag, in terms of readability. But I'm not sure how convenient it would be in regular use.

I imagine something like this - it would be cleaner to put the special flags in a separate enum and not Request, but you usually import that anyway. Thoughts?

def transform(request_options: RequestOptions) -> RequestOptions | Literal[Request.SKIP, Request.UNCHANGED]:
    if 'page' in request_options["url"]:
        return {*request_options, "label": "page"}
    if 'campaign' in request_options["url"]:
        return Request.SKIP
    return Request.UNCHANGED

Also the current behavior is closer to TS, if I'm not mistaken about the filter operation

if (transformRequestFunction) {
    requestOptions = requestOptions
        .map((request) => transformRequestFunction(request))
        .filter((r) => !!r) as RequestOptions[];
}

Yeah, it does filter out results that convert to false. I'd argue that this behavior is counter-intuitive, even though changing it in the JS version would be breaking in subtle ways. I would be happy to sacrifice parity between versions here, but I'm open to the opinions of others - @Pijukatel @vdusek @B4nan

From point of view of a person that sees this function for the first time and is not aware of the JS implementation I would expect this transform_request function to accept Request object, mutate it in place and return nothing. I think we should be more explicit with the naming of functions. This name implies side effects on request.
What we really do is more close to some name like "get_transformed_request_options" ...

I would suggest to make a choice of either doing side effects and not returning anything or creating copy of object and returning it.

@janbuchar
Copy link
Collaborator

From point of view of a person that sees this function for the first time and is not aware of the JS implementation I would expect this transform_request function to accept Request object, mutate it in place and return nothing. I think we should be more explicit with the naming of functions. This name implies side effects on request. What we really do is more close to some name like "get_transformed_request_options" ...

That's why I wouldn't accept functions that return None - it forces the caller to pass a function that returns either the modified options or one of the two other possible conclusions. Empty functions won't pass the type check.

@B4nan
Copy link
Member

B4nan commented Jan 22, 2025

What we really do is more close to some name like "get_transformed_request_options" ...

Let me disagree, anything that is called get* doesn't imply it can be used to mutate things. I would personally keep the JS name.

@@ -108,27 +108,57 @@ def __eq__(self, other: object) -> bool:
user_data_adapter = TypeAdapter(UserData)


class BaseRequestData(BaseModel):
"""Data needed to create a new crawling request."""
class RequestOptions(TypedDict):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but can't we now use RequestOptions to type the kwargs of Request.from_url method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't, because these variables from RequestOptions are used in the from_url method.

@@ -140,20 +140,31 @@ async def enqueue_links(
selector: str = 'a',
label: str | None = None,
user_data: dict[str, Any] | None = None,
transform_request_function: Callable[[RequestOptions], RequestOptions | Literal['skip', 'unchanged']]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should extract this type into some shared module. Users who want to provide a correct return type annotation for their transform_request_function should not need to copy all this.

Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

docs/introduction/03_adding_more_urls.mdx Outdated Show resolved Hide resolved
Comment on lines -111 to +112
class BaseRequestData(BaseModel):
"""Data needed to create a new crawling request."""
class RequestOptions(TypedDict):
"""Options that can be used to customize request creation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ou, so we are getting rid of the BaseRequestData, ok, that sounds good and makes sense. Thanks.

src/crawlee/_request.py Outdated Show resolved Hide resolved
src/crawlee/_types.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

@@ -44,6 +44,8 @@

HttpPayload: TypeAlias = bytes

RequestTransformAction: TypeAlias = Literal['skip', 'unchanged']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure whether this is necessary 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Support for Specifying Headers in context.enqueue_links Method
5 participants