-
This happens with a higher initial concurrency and is expected behavior: we never ignore requests that were already started (in fact it's usually not possible to cancel them in the first place), as you can see yourself in the logs (the message starting with …). This is described in several places in the docs, e.g. https://crawlee.dev/docs/introduction/adding-urls#limit-your-crawls-with-maxrequestspercrawl
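In other words, maxRequestsPerCrawl is a soft limit: once it is reached, no new requests are started, but requests already in flight are allowed to finish, so with high concurrency the final count can overshoot the limit. Below is a minimal sketch of one way to keep the processed count close to the limit by capping concurrency; the start URL and handler body are illustrative, not taken from the original report.

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Soft limit: no new requests are started once it is reached,
    // but requests already in progress still run to completion.
    maxRequestsPerCrawl: 5,
    // Assumption: capping concurrency bounds how far the crawl can
    // overshoot the limit (at most roughly one extra in-flight request here).
    maxConcurrency: 1,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Enqueue links discovered on the page for further crawling.
        await enqueueLinks();
    },
});

// Placeholder start URL.
await crawler.run(['https://crawlee.dev']);
```

With the default autoscaled concurrency, several requests may already be running at the moment the limit is hit, which is how a crawl can end up well past the configured number of pages.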
This is not very helpful; reproductions need to be complete and specific.
-
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Issue description
The number of pages actually scraped exceeds the configured limit, by more than 2x in my case (13 pages scraped vs. a limit of 5).
Code sample
Package version
@builder.io/[email protected] /Users/zheng/Workbench/github/gpt-crawler
└── [email protected] -> ./node_modules/.pnpm/[email protected][email protected]/node_modules/crawlee
Node.js version
v18
Operating system
macOS
Apify platform
I have tested this on the next release
No response
Other context
No response