Issue: The crawler is crawling too slowly; look for solutions to increase performance #19
Comments
@jacqueline-chan is going into the code to confirm async calls. The suspected memory leak may be because the Apify default was allotting too much RAM (60 GB when we only had 40 GB); it had to be set manually to 30 GB on prod. This only happens on production, not when running the application locally. Jacqueline to investigate. @RaiyanRahman will also take a look and advise if he thinks of anything. We still have a memory mystery to solve, but should know more after next week.
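For context on the fix, a minimal sketch of pinning the memory the Apify autoscaled pool assumes is available, via the `APIFY_MEMORY_MBYTES` environment variable; the SDK version (v1/v2-style API), crawler options, and seed URL below are assumptions, not the repo's actual setup.

```js
// Hedged sketch: cap the memory Apify believes it can use so the autoscaled
// pool does not assume the 60 GB default on a 40 GB host.
process.env.APIFY_MEMORY_MBYTES = '30720'; // ~30 GB, set before the SDK initializes

const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('start-urls', [
        { url: 'https://example.com' }, // placeholder seed URL
    ]);

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        maxConcurrency: 10, // explicit upper bound instead of relying on autoscaling alone
        handlePageFunction: async ({ request, page }) => {
            console.log('Crawled', request.url, await page.title());
        },
    });

    await crawler.run();
});
```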
We want to know how long one request or ten requests will take, so we need some benchmarking. @amygaoo will code this up for us.
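The benchmarking could take roughly the shape below (plain Puppeteer rather than the repo's crawl code; the URLs, timeout, and wait condition are placeholders):

```js
// Hedged benchmarking sketch: time one request and a batch of ten to get a
// rough per-request baseline for comparison against the full crawler.
const puppeteer = require('puppeteer');

async function timeCrawl(urls) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const start = Date.now();
    for (const url of urls) {
        await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    }
    const elapsedMs = Date.now() - start;
    await browser.close();
    return { elapsedMs, perRequestMs: Math.round(elapsedMs / urls.length) };
}

(async () => {
    console.log('1 request :', await timeCrawl(['https://example.com']));
    console.log('10 requests:', await timeCrawl(Array(10).fill('https://example.com')));
})();
```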
@jacqueline-chan will run Cheerio.
@jacqueline-chan got Cheerio working, but the URLs it's crawling are not the URLs she expects it to crawl, so she is looking into that issue. It is much faster than Puppeteer (it renders HTML only). She will troubleshoot with @RaiyanRahman to help with crawler selection.
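A sketch of constraining the Cheerio crawl to the expected URLs, assuming the Apify SDK v1/v2 API; the domain and pseudo-URL pattern are illustrative only:

```js
// Hedged sketch: CheerioCrawler that only enqueues links matching one domain,
// so the crawl does not wander off to unexpected URLs.
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/' }); // placeholder seed

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        handlePageFunction: async ({ request, $ }) => {
            console.log('Crawled', request.url);
            await Apify.utils.enqueueLinks({
                $,
                requestQueue,
                baseUrl: request.loadedUrl,
                pseudoUrls: ['https://example.com/[.*]'], // restrict to the expected domain
            });
        },
    });

    await crawler.run();
});
```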
@jacqueline-chan restarted with max concurrency but saw no speed benefit. Nat suggested we get to the crux of the Puppeteer issue, so she is writing tests to determine what is going on. @jacqueline-chan and @RaiyanRahman did get Cheerio working, but then it stopped; we need to run tests for both. Two streams are being pursued: (1) tests/configs for the Puppeteer crawler (@RaiyanRahman) and (2) tests for Cheerio (@jacqueline-chan). @jacqueline-chan will share the written tests with @RaiyanRahman and @AlAndr04, as they should be applicable to both crawlers. @AlAndr04 will also look at why the Puppeteer crawler is not managing resources and concurrency as it should.
@RaiyanRahman & @jacqueline-chan
This third part is completed and is now in testing. While discussing with @RaiyanRahman, we determined that solving task#3 will consequently solve the issue that task#1 was meant to fix, so task#1 would be redundant. Manually enqueuing links has caused an issue where the crawler periodically stops crawling and needs to be manually restarted after crawling a couple thousand links; we will need to debug and fix this bug. Planning to do a large crawl with Cheerio this weekend or first thing Monday.
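For reference, the manual enqueueing being debugged is roughly of this shape (a sketch against the Apify SDK v1/v2 PuppeteerCrawler; the selector, on-domain filter, and seed URL are illustrative, not the repo's exact code):

```js
// Hedged sketch of manual link enqueueing: collect hrefs from the page and add
// them to the request queue one by one, without stopping the crawl on a bad URL.
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/' }); // placeholder seed

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const hrefs = await page.$$eval('a[href]', (links) => links.map((a) => a.href));
            for (const url of hrefs) {
                if (!url.startsWith('https://example.com')) continue; // stay on-domain
                try {
                    await requestQueue.addRequest({ url }); // de-duplicated by uniqueKey
                } catch (err) {
                    console.warn('Failed to enqueue', url, err.message);
                }
            }
        },
    });

    await crawler.run();
});
```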
@RaiyanRahman could not make today's meeting. Get a site you know will fail; see https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
@RaiyanRahman please join us in debugging the manual enqueuing issue
Suggestion from Nat: use the simplest Puppeteer crawler to determine failures/edge cases.
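Something like this stripped-down script (plain Puppeteer, URLs taken from the problem sites mentioned in this thread) could reproduce the failures in isolation:

```js
// Hedged sketch: the simplest possible Puppeteer "crawler" for isolating sites
// that fail, time out, or never settle; no Apify, no queue, no saving.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    for (const url of ['https://www.tubantia.nl/', 'https://derstandard.at/']) {
        const page = await browser.newPage();
        try {
            await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
            console.log('OK  ', url);
        } catch (err) {
            console.log('FAIL', url, err.message);
        } finally {
            await page.close();
        }
    }
    await browser.close();
})();
```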
@RaiyanRahman refactored selective loading and tested the solution. Will this work for pop-ups? https://www.tubantia.nl/ is an example. Delay/speed may be related to the process of saving files: we can't run asynchronously because we save files synchronously. Does our database solve this problem? Notes: the "I don't care about Cookies" extension doesn't work on Chrome, but the bigger issue is that we're using Apify's Puppeteer integration, which makes it difficult to add extensions (https://chrome.google.com/webstore/detail/i-dont-care-about-cookies/fihnjjcciajhdojfnbdddfaoknhalnja?hl=en). A lot of people are asking Puppeteer and Apify to implement this, and there is a 'stealth' function that has been developed and could be looked into; @RaiyanRahman might be able to look into this a little further. The for loop might be slowing us down too.
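For reference, selective loading with plain Puppeteer is typically done with request interception along these lines; Raiyan's refactor may block a different set of resource types:

```js
// Hedged sketch: block heavy resources (images, styles, fonts, media) so each
// page loads faster; the target URL is just an example from this thread.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.setRequestInterception(true);
    page.on('request', (req) => {
        if (['image', 'stylesheet', 'font', 'media'].includes(req.resourceType())) {
            req.abort();      // skip resources the crawl does not need
        } else {
            req.continue();
        }
    });

    await page.goto('https://www.tubantia.nl/', { waitUntil: 'domcontentloaded' });
    console.log(await page.title());
    await browser.close();
})();
```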
Priority one: try to solve the pop-up problem, using @RaiyanRahman's code first (new branch already pushed). Also priority one: determine whether stealth mode will work in Puppeteer (as a substitute for the "I don't care about cookies" extension).
After discussion with @RaiyanRahman: stealth mode and blocking requests for extra resources do not work. We will need to manually click each accept button for now; @RaiyanRahman will help look for a way to automate that. Both of us are going to try to write scripts to manually accept the consent forms. I will make a list of the problematic links so far for us to test.
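A rough sketch of the "manually accept the consent form" script we discussed; the button texts are examples, and many consent dialogs live inside iframes or shadow DOM, so this will not cover every site:

```js
// Hedged sketch: look for a visible button/link whose text looks like an
// accept/consent action and click it. Returns the matched text, or null.
async function clickConsentButton(page) {
    return page.evaluate(() => {
        const labels = ['accept', 'agree', 'i agree', 'akzeptieren', 'akkoord', 'zustimmen'];
        const candidates = Array.from(
            document.querySelectorAll('button, a, input[type="button"], input[type="submit"]')
        );
        for (const el of candidates) {
            const text = (el.innerText || el.value || '').trim().toLowerCase();
            if (text && labels.some((label) => text.includes(label))) {
                el.click();
                return text;
            }
        }
        return null;
    });
}

module.exports = { clickConsentButton };
```

Usage would be something like `await clickConsentButton(page)` right after `page.goto(...)`, before any scraping.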
@RaiyanRahman - looking into other extensions that might resolve the issue of pop-ups. Extensions work with specific websites, but not generally. @Natkeeran advises that we develop a list of 10-15 sites with this problem and assess the problem for each: How do we identify the pop-up? How do we close it? (Put the answers into a .csv.) Based on this, we can decide how to address it in code: are there any features common to all of these sites? @jacqueline-chan and @RaiyanRahman will split up the target links and put them into a Sheets doc or something similar (and link it in this ticket).
Jacqueline and Raiyan split the links into categories:
- URLs with pop-ups
- URLs that exit immediately (problem unknown)
- URLs that sometimes work
- URLs that take a long time to load (but still ultimately crawl)
- URL fixes
- URLs that require sign-in / human verification
@RaiyanRahman - inconsistent behaviour when trying to close pop-ups (solutions seem to work sometimes but not others), e.g. https://derstandard.at/. @jacqueline-chan has to get the database up to get this data; if she can't get it working, she will have to restart and run the crawl for two days. She did look into one of the links with a pop-up: it should have been really easy to click it away and click accept, but for some reason the script still cannot find the button to click, and it can't search for the name of the button. To do: @jacqueline-chan is trying to retrieve the database from the preliminary crawl.
@jacqueline-chan @RaiyanRahman Some URLs that only get one hit are hit much more frequently when tested on their own (without any other domain in the queue). Therefore @RaiyanRahman would like to explore batching as an option for now, and also look into the possibility of using Puppeteer/Playwright directly, without Apify, to give us more control over the queue mechanism.
@jacqueline-chan - yesterday she was able to retrieve the database! She will make a CSV/Google Sheet and copy it into the ticket. @RaiyanRahman suggests we try running batches of websites to see if this improves behaviour. @jacqueline-chan will start by running small batches manually, prior to any code development, to see if this makes a difference. If it works, write code; if it doesn't, drop Apify and write our own queuing mechanism.
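The manual batching could later be codified along these lines; `runCrawl` is a placeholder for whatever crawl entry point the repo actually exposes, not an existing function:

```js
// Hedged sketch: split the seed domains into small batches and crawl them
// sequentially with a fresh queue/browser per batch, instead of one giant queue.
function chunk(items, size) {
    const batches = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

async function crawlInBatches(domains, batchSize, runCrawl) {
    for (const batch of chunk(domains, batchSize)) {
        console.log('Starting batch:', batch.join(', '));
        await runCrawl(batch); // placeholder: one independent crawl per batch
    }
}

module.exports = { crawlInBatches, chunk };
```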
CSV for the database: the way I determine that a link most likely has a pop-up issue is if it starts with https and only has a few hits. Go to this site and click on "download CSV".
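That heuristic could be applied to the exported CSV with a small script like the one below; the file name and column order (URL first, hit count second) are assumptions about the export, not its actual layout:

```js
// Hedged sketch: flag URLs that start with https and have only a few hits,
// which the notes above suggest are the likely pop-up cases.
const fs = require('fs');

const rows = fs.readFileSync('crawl-hits.csv', 'utf8').trim().split('\n').slice(1);
const suspects = rows
    .map((line) => line.split(','))
    .filter(([url, hits]) => url && url.startsWith('https') && Number(hits) <= 3)
    .map(([url]) => url);

console.log(suspects.join('\n'));
```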
Imported into the sheet here: https://docs.google.com/spreadsheets/d/1DJfiLT7XGL0XXttp8q0BKRZkn6swWAQ74gdlcaYG3CI/edit#gid=241833616
@jacqueline-chan to:
Overall, we believe that we should explore running multiple instances on separate VMs concurrently, including Raiyan's improvements to the batching process.
bug fix: if the crawl needs to run a subset of its crawl urls, the fu…
email when the crawl.js stops
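A possible shape for the restart/notify wrapper implied by the item above; `notifyByEmail` is a placeholder hook, not an existing function, and the restart delay is arbitrary:

```js
// Hedged sketch: watchdog that restarts crawl.js whenever it exits and leaves
// a hook for the "email when the crawl stops" notification.
const { spawn } = require('child_process');

function notifyByEmail(message) {
    // Placeholder: wire up to a real mailer (e.g. nodemailer) or a mail relay.
    console.log('[notify]', message);
}

function startCrawl() {
    const child = spawn('node', ['crawl.js'], { stdio: 'inherit' });
    child.on('exit', (code) => {
        notifyByEmail(`crawl.js exited with code ${code}; restarting in 60s`);
        setTimeout(startCrawl, 60 * 1000);
    });
}

startCrawl();
```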
@RaiyanRahman is seeking an optimal point for batch size and page crawls per batch. The priority is an exhaustive crawl of individual domains if at all possible. @jacqueline-chan has been pruning branches and working on Compute Canada instances and documentation (tasks above).
@RaiyanRahman - Looping through domains individually is the best approach. To do: make the crawl more robust for subsequent crawls. Raiyan has found a sample implementation in the documentation that he is working on.
@RaiyanRahman has completed a refactor of the queuing system, which runs locally (for the most part), but installing it on the Graham cloud raised specific issues:
Strategy for finding out more:
Equal importance was given to working domains. Most domains didn't work because of a memory issue: the call stack gets too big and takes things down. Tested alongside a local instance, the behaviour was actually the same. The NY Times was tested on a separate machine. In 48 hours:
Next steps:
Time notes: 2 days = almost 15,000 links crawled, with over 120,000 links left in the queue and JSON files created numbering in the mid-10,000s. We tried a couple of different naming conventions. Raiyan has implemented a timestamp solution to stop JSON files being overwritten; this has reduced the number of JSON files that are missed, and we anticipate it means we won't have JSON overwrite problems. For the remaining JSON issues, we'll need to implement a check after the data goes to the post-processor to see which found URLs did not result in JSON files. The crawler still runs out of stack memory and then stops, and needs to be restarted; Raiyan is working on a mechanism for automating the restart. Raiyan will meet with the dev team next week on the log/memory issue, and will spend time understanding Compute Canada resources so we know how much space is available for storing data.
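For the record, the timestamped-filename idea and the post-processor check could look roughly like this; the output directory, sanitization, and matching rule are assumptions, not the repo's actual convention:

```js
// Hedged sketch: a timestamp in the file name prevents two results for the same
// URL from overwriting each other; findMissing reports crawled URLs that never
// produced a JSON file.
const fs = require('fs');
const path = require('path');

function sanitize(url) {
    return url.replace(/[^a-z0-9]/gi, '_').slice(0, 100);
}

function saveResult(outputDir, url, data) {
    const fileName = `${sanitize(url)}_${Date.now()}.json`;
    fs.writeFileSync(path.join(outputDir, fileName), JSON.stringify({ url, ...data }));
    return fileName;
}

function findMissing(outputDir, crawledUrls) {
    const saved = fs.readdirSync(outputDir).join('\n');
    return crawledUrls.filter((url) => !saved.includes(sanitize(url)));
}

module.exports = { saveResult, findMissing };
```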
Compute Canada: there are lots of hidden folders with space associated with them; we should check these folders before setting up an instance. 12,000 links after 24 hours: 8.5 links per minute on average, and faster (15 links per minute) when there are no issues.
Needs investigation. Some leads: