
Issue: The crawler is crawling too slowly; look for solutions to increase performance #19

Open
jacqueline-chan opened this issue Dec 10, 2020 · 29 comments

@jacqueline-chan
Contributor

Needs investigation. Some leads:

  • Kubernetes?
  • Double-check whether something is holding up the async calls in the crawler
  • More servers?
  • A different way of submitting batches/multithreading
@kstapelfeldt
Member

@jacqueline-chan is going into the code to confirm the async calls.

The memory leak may have occurred because the Apify default allotted too much RAM (60 GB when we only had 40 GB); it had to be set manually to 30 GB on prod. This only happens in production, not when running the application locally. Jacqueline will investigate, and @RaiyanRahman will also take a look and advise if anything comes to mind. We still have a memory mystery to solve, but should know more after next week.
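
A hedged sketch of how the cap might be applied (the entry-point name and the concurrency ceiling are illustrative, not the project's actual config): the Apify SDK reads the total memory it may use from the `APIFY_MEMORY_MBYTES` environment variable, and the crawler's concurrency can also be bounded to keep RAM usage predictable.

```js
// Hedged sketch: cap memory before starting the crawler, e.g. in the shell
// or a systemd unit (30000 MB ≈ 30 GB):
//   export APIFY_MEMORY_MBYTES=30000
//   node main.js            // "main.js" is a placeholder entry point

// Alternatively, bound concurrency directly when constructing the crawler:
const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    maxConcurrency: 20, // illustrative ceiling to keep memory use predictable
    handlePageFunction: async ({ request, page }) => {
      // ... existing page handling ...
    },
  });
  await crawler.run();
});
```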

@jacqueline-chan
Contributor Author

We want to know how long one request or ten requests will take. We need some benchmarking; @amygaoo will code this up for us.
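
A minimal timing sketch for that benchmark (the URLs and structure here are placeholders, not the benchmark @amygaoo is writing):

```js
// Hedged sketch: time single and repeated page loads with plain Puppeteer.
const puppeteer = require('puppeteer');

async function timeRequests(urls) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const start = Date.now();
  for (const url of urls) {
    const t0 = Date.now();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    console.log(`${url}: ${Date.now() - t0} ms`);
  }
  console.log(`total for ${urls.length} request(s): ${Date.now() - start} ms`);
  await browser.close();
}

(async () => {
  // Placeholder URLs: one request, then ten.
  await timeRequests(['https://example.com/']);
  await timeRequests(Array.from({ length: 10 }, (_, i) => `https://example.com/page-${i}`));
})();
```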

@kstapelfeldt
Member

@jacqueline-chan will run Cheerio
@RaiyanRahman will explore how we might manage multiple concurrent puppeteer instances
@amygaoo will also explore how we might manage multiple concurrent puppeteer instances

@kstapelfeldt
Member

kstapelfeldt commented Jan 5, 2021

@jacqueline-chan got Cheerio working, but the URLs it's crawling are not the URLs she expects it to crawl, so she is looking into that issue. It is much faster than Puppeteer (it only parses the returned HTML). She will troubleshoot with @RaiyanRahman to help with crawler selection.
@jacqueline-chan will restart Puppeteer with max and min concurrency set. Min concurrency should be set to 50.
@RaiyanRahman read up on how to add new instances of Puppeteer and manage them. There are a couple of different ways to do it. We currently use the Apify SDK, which manages Puppeteer for us, but we could use Puppeteer directly and manage it ourselves; this needs testing. Also, if rendering is very slow we can selectively render certain elements. He will look into selective rendering that blocks media loading (images and video); see the sketch below.
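
A hedged sketch of those two knobs, assuming the Apify SDK's `PuppeteerCrawler` (the exact option names depend on the SDK version in use):

```js
// Hedged sketch: min/max concurrency plus selective rendering that blocks media.
const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    minConcurrency: 50,   // from the note above
    maxConcurrency: 100,  // illustrative ceiling
    gotoFunction: async ({ page, request }) => {
      // Block images and video before navigation so pages render faster.
      await Apify.utils.puppeteer.blockRequests(page, {
        urlPatterns: ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.mp4', '.webm', '.avi'],
      });
      return page.goto(request.url, { waitUntil: 'domcontentloaded' });
    },
    handlePageFunction: async ({ request, page }) => {
      // ... existing page handling ...
    },
  });
  await crawler.run();
});
```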

@kstapelfeldt
Member

kstapelfeldt commented Jan 14, 2021

@jacqueline-chan restarted with max concurrency but saw no speed benefit. Nat suggested we get to the crux of the Puppeteer issue, so she is writing tests to determine what is going on.

@jacqueline-chan and @RaiyanRahman did get Cheerio working, but then it stopped. We need to run tests for both.

Two streams are being pursued: (1) tests/configs for the Puppeteer crawler (@RaiyanRahman) and (2) tests for Cheerio (@jacqueline-chan).

@jacqueline-chan will share the written tests with @RaiyanRahman and @AlAndr04, as they should apply to both crawlers. @AlAndr04 will also look at why the Puppeteer crawler is not managing resources and concurrency as it should.

@kstapelfeldt
Member

@RaiyanRahman & @jacqueline-chan
Two issues remain: the crawl is too slow, and we are not getting links back (only single links).

  1. Hard-code https:// in Puppeteer & Cheerio and restart the crawl to see if this resolves the problem of bringing back only single links (see the sketch below).
  2. Fold in the selective rendering code for Puppeteer (after it's complete); we still need to be able to block video more effectively.
  3. Look into manual queueing: do we need to modify it in order to resolve the issue?
    Also: run the tests written by @jacqueline-chan on Puppeteer/Cheerio to test speed and bring back stats.
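
A hedged sketch of items 1 and 3 (assuming the Apify request queue is still in use; the scope list and helper name are illustrative):

```js
// Hedged sketch: force https:// on scope URLs and enqueue them manually.
const Apify = require('apify');

function toHttps(url) {
  // Hard-code the https scheme regardless of what the scope file says.
  return url.replace(/^(https?:\/\/)?/i, 'https://');
}

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  const scopeUrls = ['example.com', 'http://example.org/news']; // placeholder scope
  for (const url of scopeUrls) {
    await requestQueue.addRequest({ url: toHttps(url) });
  }
  // ... pass requestQueue to the Puppeteer or Cheerio crawler as usual ...
});
```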

@jacqueline-chan
Contributor Author

jacqueline-chan commented Jan 23, 2021

  3. Look into manual queueing: do we need to modify it in order to resolve the issue?
    Also: run the tests written by @jacqueline-chan on Puppeteer/Cheerio to test speed and bring back stats.

This third part is completed and is now in testing. While discussing with @RaiyanRahman, we determined that solving task #3 will also solve the issue task #1 was meant to fix, so task #1 is redundant.

Manually enqueuing links has introduced an issue where the crawler periodically stops crawling and needs to be manually restarted after a couple of thousand links. This will need more debugging to fix.

Planning to do a large crawl with Cheerio this weekend or first thing Monday.

@kstapelfeldt
Member

@RaiyanRahman could not make today's meeting.
@jacqueline-chan implemented number three. When manual queuing is used, the crawler does not function as it does with self-derived links, so there remains an implementation issue. How could we troubleshoot this? Nat suggests two ways forward: (1) contact the developers/GitHub repo maintainers/community, or (2) determine how to mark the manually queued requests so they are identical to the internally queued ones. This is an issue with Cheerio.

Get a site you know will fail.
Write tests that focus on the queue.

https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

@jacqueline-chan
Contributor Author

@RaiyanRahman, please join us in debugging the manual enqueuing issue.

@jacqueline-chan
Contributor Author

jacqueline-chan commented Jan 28, 2021

Suggestions from Nat:

use the simplest puppeteer crawler to determine fails/edge cases

  1. how do we handle pop ups / privacy issues (Accept or close) (Use user agent). Ways to recognize pop ups for special cases per site.
    -- avoid javascript ?? pop up? (have users accept). Have a library/extension. Touch base with Raiyan.
  2. paywalls -- probably have to use api
  3. save asynchronously
  4. give batches

jacqueline-chan added a commit that referenced this issue Jan 29, 2021
@kstapelfeldt
Member

kstapelfeldt commented Feb 4, 2021

@RaiyanRahman refactored selective loading and tested the solution. Will this work for pop-ups? Check out https://www.tubantia.nl/ as an example.

The delay/slowness may be related to the process of saving files: we can't run asynchronously because we save files synchronously. Does our database solve this problem?

Notes: "I don't care about Cookies" doesn't work on Chrome, and the bigger issue is that we're using Apify's Puppeteer integration (which makes it difficult to add extensions): https://chrome.google.com/webstore/detail/i-dont-care-about-cookies/fihnjjcciajhdojfnbdddfaoknhalnja?hl=en. A lot of people are asking Puppeteer and Apify to implement this, and there is a 'stealth' function that has been developed and could be looked into. @RaiyanRahman might be able to look into this a little further.

The for loop might be slowing us down too.

  • Still focusing on puppeteer first

@jacqueline-chan:

Priority one: try to solve the pop-up problem, using @RaiyanRahman's code first (a new branch has already been pushed).
Priority two: the speed problem; move all processing to a function that runs later so the crawler doesn't wait on it (see the sketch below).

@RaiyanRahman

Priority one: will stealth mode work in Puppeteer (in place of the "I don't care about cookies" extension)?
Priority two: based on collaboration with Jacqueline, do we need to handle pop-ups/cookies elsewhere in the code?
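
A hedged sketch of the deferred-processing idea in Priority two above (paths, names, and the flush interval are illustrative): the page handler only buffers results, and a separate loop writes them to disk so the crawler never waits on file I/O.

```js
// Hedged sketch: decouple saving from crawling.
const fs = require('fs').promises;
const path = require('path');

const pending = []; // results waiting to be written (illustrative structure)

// Called from the crawler's page handler instead of awaiting a file write.
function queueResult(result) {
  pending.push(result);
}

// A separate timer drains the buffer without blocking page handling.
// Assumes a ./results directory already exists.
setInterval(async () => {
  const batch = pending.splice(0, pending.length);
  await Promise.all(batch.map((r) =>
    fs.writeFile(path.join('results', `${r.id}.json`), JSON.stringify(r))
  ));
}, 5000);

module.exports = { queueResult };
```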

@jacqueline-chan
Contributor Author

As discussed with @RaiyanRahman, stealth mode and blocking requests for extra resources do not work.

We will need to manually click on each accept button for now; @RaiyanRahman will help look for a way to automate that.

Both of us are going to try to write scripts to manually accept the consent forms (see the sketch below). I will make a list of the problematic links so far for us to test.
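
A hedged sketch of what such a script could try (the selectors are guesses; the real ones will come from the list of problematic links):

```js
// Hedged sketch: try a list of likely consent-button selectors on each page.
async function tryAcceptConsent(page) {
  const candidates = [
    '#onetrust-accept-btn-handler',  // illustrative; varies per site
    'button[aria-label="Accept"]',
    'button[title="Accept"]',
    '.cookie-accept',
  ];
  for (const selector of candidates) {
    const button = await page.$(selector);
    if (button) {
      await button.click();
      return selector; // report which selector worked, for the site list
    }
  }
  return null; // nothing matched; log the URL as problematic
}

module.exports = { tryAcceptConsent };
```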

@kstapelfeldt
Member

@RaiyanRahman is looking into other extensions that might resolve the issue of pop-ups. Extensions work with specific websites, but not generally. @Natkeeran advises that we develop a list of 10-15 sites with this problem and assess the problem for each: How do we identify the pop-up? How do we close it? (Put the answers into a .csv.) Based on this, we can decide how to address it in code: are there any features common to all of these sites?

@jacqueline-chan and @RaiyanRahman will split up the target links and put them into a Sheets doc or something similar (and link it in this ticket).

@kstapelfeldt
Member

kstapelfeldt commented Feb 18, 2021

@RaiyanRahman: inconsistent behaviour when trying to close pop-ups (solutions seem to work sometimes but not others), e.g. https://derstandard.at/.

@jacqueline-chan has to get the database up to retrieve this data; if she can't get it working she will have to restart and run the crawl for two days. She did look into one of the links with a pop-up: it should have been really easy to click away and accept, but for some reason the button still cannot be found, and searching for the name of the button doesn't work.
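
One possible explanation for the missing button (an assumption, not something confirmed here) is that many consent dialogs render inside a cross-origin iframe, where a selector query on the main frame finds nothing. A sketch that checks every frame:

```js
// Hedged sketch: look for the consent button in every frame, not just the main page.
async function findInAnyFrame(page, selector) {
  for (const frame of page.frames()) {
    const handle = await frame.$(selector).catch(() => null);
    if (handle) return { frame, handle };
  }
  return null;
}

// Usage (illustrative selector):
//   const hit = await findInAnyFrame(page, 'button[title="Accept"]');
//   if (hit) await hit.handle.click();
```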

Todo

@jacqueline-chan: trying to retrieve the database from the preliminary crawl
@jacqueline-chan and @RaiyanRahman: trying to resolve the URL issues
@jacqueline-chan and @RaiyanRahman: we need a Google Sheets spreadsheet with all the problematic URLs to make collaboration and documentation easier

@jacqueline-chan
Contributor Author

@jacqueline-chan @RaiyanRahman Some URLs that get only one hit in the full crawl are hit much more frequently when tested on their own (without any other domain in the queue). @RaiyanRahman would therefore like to explore batching as an option for now, and also look into the possibility of using Puppeteer/Playwright directly, without Apify, to give us more control over the queue mechanism (see the sketch below).
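
A hedged sketch of what an Apify-free, per-domain queue could look like (everything here is illustrative, including the domain list and page limit):

```js
// Hedged sketch: a minimal per-domain queue using plain Puppeteer (no Apify),
// so each domain can be crawled in isolation as in the experiment above.
const puppeteer = require('puppeteer');

async function crawlDomain(browser, startUrl, maxPages = 100) {
  const queue = [startUrl];
  const seen = new Set(queue);
  const page = await browser.newPage();
  while (queue.length > 0 && seen.size <= maxPages) {
    const url = queue.shift();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
      const links = await page.$$eval('a[href]', (as) => as.map((a) => a.href));
      for (const link of links) {
        if (link.startsWith(startUrl) && !seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      }
    } catch (err) {
      console.warn(`failed: ${url} (${err.message})`);
    }
  }
  await page.close();
  return seen.size;
}

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  for (const domain of ['https://example.com/']) { // placeholder scope
    console.log(domain, 'pages discovered:', await crawlDomain(browser, domain));
  }
  await browser.close();
})();
```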

@kstapelfeldt
Member

@jacqueline-chan: yesterday she was able to retrieve the database! She will make a CSV/Google Sheet and copy it into the ticket.
@jacqueline-chan and @RaiyanRahman are trying to resolve the URL issues. @Natkeeran reviewed a website that @jacqueline-chan was debugging and found that it deliberately hides its pop-up (to prevent crawling); this will be difficult. @RaiyanRahman found his site to be very inconsistent: 50% of the time a button could not be found.

@RaiyanRahman suggests we try running batches of websites to see if this improves behaviour. @jacqueline-chan will try running small batches manually first, to see if this makes a difference prior to any code development. If this works, write the code. If it doesn't, drop Apify and write our own queuing mechanism.

@jacqueline-chan
Contributor Author

CSV for the database. The way I determine that a link most likely has a pop-up issue is if it starts with https and only has a few hits.

Go to this site and click on "download csv":

http://199.241.167.146/

@kstapelfeldt
Member

Imported into the sheet here: https://docs.google.com/spreadsheets/d/1DJfiLT7XGL0XXttp8q0BKRZkn6swWAQ74gdlcaYG3CI/edit#gid=241833616

@kstapelfeldt
Member

kstapelfeldt commented Mar 4, 2021

@RaiyanRahman

  1. Trying to automate batching in Apify.
  2. Look for the content of the <a> tag instead of its title.

@jacqueline-chan to:

  1. Look and see if there are other manual queues already written that we might use (to swap out Apify).

  2. Send a notification if Apify crashes (find the code and push it).

  3. Output a list of problematic sites. We need some way to spit out a file of URLs that don't get crawled, as we're going to have some in the end.

  4. Bug fix: take in the full scope, but only run the batch matching the batch id (see the sketch below).

  5. Run NY Times as a batch, and the Twitter crawler.
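
A hedged sketch of item 4 (the batch size and CLI handling are illustrative):

```js
// Hedged sketch: take the full scope, but only enqueue the batch selected by
// an id passed on the command line, e.g. `node main.js 3`.
const BATCH_SIZE = 50; // illustrative
const batchId = Number(process.argv[2] || 0);

function selectBatch(scopeUrls, id, size = BATCH_SIZE) {
  return scopeUrls.slice(id * size, (id + 1) * size);
}

// const urlsToEnqueue = selectBatch(fullScope, batchId);
// ...enqueue `urlsToEnqueue` into the crawler as usual...
```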

@kstapelfeldt
Member

kstapelfeldt commented Mar 11, 2021

  • @RaiyanRahman did work on the batching system for Puppeteer using Apify, but still needs to mark batches as done. He starts by splitting the scope into smaller batches and runs them in succession: once one is done, he begins the next. He has more than 15 hours left and believes this can be done in that time.

Overall, we believe we should explore running multiple instances on separate VMs concurrently, incorporating Raiyan's improvements to the batching process.

jacqueline-chan added a commit that referenced this issue Mar 11, 2021
bug fix: if the crawl needs to run a subset of its crawl urls, the fu…
jacqueline-chan added a commit that referenced this issue Mar 11, 2021
@kstapelfeldt
Member

@RaiyanRahman is looking for an optimal balance between the number of batches and the number of page crawls per batch. The priority is an exhaustive crawl of individual domains if at all possible.

@jacqueline-chan has been pruning branches and working on the Compute Canada instances and documentation (tasks above).

@kstapelfeldt
Member

@RaiyanRahman: looping through domains individually is the best approach. @todo: make the crawl more robust for subsequent crawls. Raiyan has found a sample implementation in the documentation that he is working on.

@kstapelfeldt
Member

@RaiyanRahman has completed a refactor of the queuing system, which runs locally (for the most part), but installing it on the Graham cloud raised specific issues:

  1. After two days of running, an infinite loop
  2. Some domains only have a single page crawled, and subsequent links are not added to the queue
  3. JSON is not returned for some crawled links
  4. Two specific domains did not have a results folder created

@kstapelfeldt
Member

kstapelfeldt commented May 27, 2021

Strategy for finding out more:

  1. Re-run crawl with only working domains to see if infinite loop problem persists.
  2. Install two more versions of the code base
  3. Run one new machine with a single problematic URL
  4. Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic

@kstapelfeldt
Member

kstapelfeldt commented Jun 3, 2021

  1. Re-run crawl with only working domains to see if infinite loop problem persists.
  2. Install two more versions of the code base
  3. Run one new machine with a single problematic URL
  4. Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic
    ran the whole scope.

Equal importance was given to working domains. Most domains didn't work because of a memory issue: the call stack gets too big and takes things down. This was tested alongside a local instance, and the behaviour was actually the same. The NY Times was tested on a separate machine.

In 48 hours:

  • 40,421 crawled links
  • 9,507 JSON files
  • 14 URLs per minute (a low average)

Next steps:

@kstapelfeldt
Member

  • Raiyan implemented a different naming convention (UUID plus millisecond timestamp), pushed the changes, and updated the Graham cloud. It seems better, but there is still a mismatch between the number of JSON files produced and the number of links crawled. The logs show when it happens, but we are still figuring out how to avoid the issue (a sketch of the naming scheme is below).
  • Next step: keep working to find out when the infinite-loop behaviour happens (monitor the crawl?). Raiyan has a couple of ideas.
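
A hedged sketch of that naming convention (the exact format Raiyan used may differ; this assumes the `uuid` npm package is installed):

```js
// Hedged sketch: collision-proof JSON filenames from a millisecond timestamp + UUID.
const { v4: uuidv4 } = require('uuid');

function resultFileName() {
  return `${Date.now()}-${uuidv4()}.json`;
}

// e.g. "1623945600123-9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d.json"
module.exports = { resultFileName };
```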

@kstapelfeldt
Member

kstapelfeldt commented Jun 17, 2021

Time notes: in 2 days the crawl covered almost 15,000 links, still had over 120,000 links left in the queue, and produced JSON files numbering in the mid-10,000s. We tried a couple of different naming conventions.

Raiyan has implemented a timestamp solution to stop JSON files being overwritten; this has reduced the number of missed JSON files, and we anticipate it means we won't have JSON overwrite problems. For the remaining JSON issues going forward, we'll need to implement a check after the data goes to the post-processor to see which found URLs did not result in JSON files (see the sketch below).
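
A hedged sketch of that check (file names, paths, and the assumption that each JSON result records its `url` are all illustrative):

```js
// Hedged sketch: report crawled URLs that never produced a JSON result.
const fs = require('fs');
const path = require('path');

// Placeholder inputs: a text file of crawled URLs and a results/ folder of JSON.
const crawled = new Set(
  fs.readFileSync('crawled-urls.txt', 'utf8').split('\n').filter(Boolean)
);
const produced = new Set(
  fs.readdirSync('results')
    .filter((f) => f.endsWith('.json'))
    .map((f) => JSON.parse(fs.readFileSync(path.join('results', f), 'utf8')).url)
);

const missing = [...crawled].filter((url) => !produced.has(url));
fs.writeFileSync('missing-json.txt', missing.join('\n'));
console.log(`${missing.length} crawled URLs have no JSON file`);
```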

The crawler still runs out of stack memory and then stops, and needs to be restarted. Raiyan is working on a mechanism for automating the restart process, and will meet with the dev team next week about the log/memory issue.

Raiyan will spend time getting to know the Compute Canada resources so we understand how much space is available for storing data.

@kstapelfeldt
Member

Compute Canada: there are lots of hidden folders with storage space associated with them. We should check these folders before setting up an instance.

12,000 links after 24 hours: 8.5 links per minute on average, and faster (15 links per minute) when there is no issue.
Did some refactoring per Nat's suggestion.
The crawl runs to the 24-hour mark and then has trouble opening new pages in the browser; after 24 hours this happens frequently.
Added timestamps, which made the debug file a lot easier to work with.
If we prevent the new-page timeout issue, this will speed things up a lot and take care of a lot of our issues.
The stealth crawler might cause problems.
JSON files are now all being created!! A script counts all pages crawled and the numbers match.

  • Experiment to see which Puppeteer configuration flag might be causing the new-page timeout issue.
  • Consider stopping and restarting the crawl with a separate script scheduled to fire when the new-page timeout issue appears (see the sketch below).
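
A hedged sketch of such a restart script (the entry point and delay are illustrative; a real version might also watch the crawler's log for the timeout message):

```js
// Hedged sketch: keep restarting the crawler process whenever it exits.
const { spawn } = require('child_process');

function startCrawler() {
  const child = spawn('node', ['main.js'], { stdio: 'inherit' }); // placeholder entry point
  child.on('exit', (code) => {
    console.log(`crawler exited with code ${code}; restarting in 30s`);
    setTimeout(startCrawler, 30000);
  });
}

startCrawler();
```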
