
Issue: The crawler is crawling too slowly; look for solutions to increase performance #19

Open
jacqueline-chan opened this issue Dec 10, 2020 · 29 comments

@jacqueline-chan
Contributor

Needs investigation. Some leads:

  • Kubernetes?
  • Double-check whether something is holding up the async calls in the crawler
  • More servers?
  • A different way of submitting batches/multithreading
@kstapelfeldt
Member

@jacqueline-chan is going into the code to confirm the async calls.

The memory leak may have occurred because the Apify default allotted too much RAM (60 GB when we only had 40 GB); it had to be set manually to 30 GB on prod. This only happens in production, not when running the application locally. Jacqueline will investigate, and @RaiyanRahman will also take a look and advise if anything comes to mind. We still have a memory mystery to solve, but should know more after next week.
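
A hedged sketch of how the cap might be applied (the entry-point name and the concurrency ceiling are illustrative, not the project's actual config): the Apify SDK reads the total memory it may use from the `APIFY_MEMORY_MBYTES` environment variable, and the crawler's concurrency can also be bounded to keep RAM usage predictable.

```js
// Hedged sketch: cap memory before starting the crawler, e.g. in the shell
// or a systemd unit (30000 MB ≈ 30 GB):
//   export APIFY_MEMORY_MBYTES=30000
//   node main.js            // "main.js" is a placeholder entry point

// Alternatively, bound concurrency directly when constructing the crawler:
const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    maxConcurrency: 20, // illustrative ceiling to keep memory use predictable
    handlePageFunction: async ({ request, page }) => {
      // ... existing page handling ...
    },
  });
  await crawler.run();
});
```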

@jacqueline-chan
Contributor Author

We want to know how long one request or ten requests will take. We need some benchmarking; @amygaoo will code this up for us.
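
A minimal timing sketch for that benchmark (the URLs and structure here are placeholders, not the benchmark @amygaoo is writing):

```js
// Hedged sketch: time single and repeated page loads with plain Puppeteer.
const puppeteer = require('puppeteer');

async function timeRequests(urls) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const start = Date.now();
  for (const url of urls) {
    const t0 = Date.now();
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    console.log(`${url}: ${Date.now() - t0} ms`);
  }
  console.log(`total for ${urls.length} request(s): ${Date.now() - start} ms`);
  await browser.close();
}

(async () => {
  // Placeholder URLs: one request, then ten.
  await timeRequests(['https://example.com/']);
  await timeRequests(Array.from({ length: 10 }, (_, i) => `https://example.com/page-${i}`));
})();
```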

@kstapelfeldt
Member

@jacqueline-chan will run Cheerio
@RaiyanRahman will explore how we might manage multiple concurrent puppeteer instances
@amygaoo will also explore how we might manage multiple concurrent puppeteer instances

@kstapelfeldt
Member

kstapelfeldt commented Jan 5, 2021

@jacqueline-chan got Cheerio working, but the URLs it's crawling are not the URLs she expects it to crawl, so she is looking into that issue. It is much faster than Puppeteer (it only parses the returned HTML). She will troubleshoot with @RaiyanRahman to help with crawler selection.
@jacqueline-chan will restart Puppeteer with max and min concurrency set. Min concurrency should be set to 50.
@RaiyanRahman read up on how to add new instances of Puppeteer and manage them. There are a couple of different ways to do it. We currently use the Apify SDK, which manages Puppeteer for us, but we could use Puppeteer directly and manage it ourselves; this needs testing. Also, if rendering is very slow we can selectively render certain elements. He will look into selective rendering that blocks media loading (images and video); see the sketch below.
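
A hedged sketch of those two knobs, assuming the Apify SDK's `PuppeteerCrawler` (the exact option names depend on the SDK version in use):

```js
// Hedged sketch: min/max concurrency plus selective rendering that blocks media.
const Apify = require('apify');

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    minConcurrency: 50,   // from the note above
    maxConcurrency: 100,  // illustrative ceiling
    gotoFunction: async ({ page, request }) => {
      // Block images and video before navigation so pages render faster.
      await Apify.utils.puppeteer.blockRequests(page, {
        urlPatterns: ['.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.mp4', '.webm', '.avi'],
      });
      return page.goto(request.url, { waitUntil: 'domcontentloaded' });
    },
    handlePageFunction: async ({ request, page }) => {
      // ... existing page handling ...
    },
  });
  await crawler.run();
});
```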

@kstapelfeldt
Member

kstapelfeldt commented Jan 14, 2021

@jacqueline-chan restarted with max concurrency but saw no speed benefit. Nat suggested we get to the crux of the Puppeteer issue, so she is writing tests to determine what is going on.

@jacqueline-chan and @RaiyanRahman did get Cheerio working, but then it stopped. We need to run tests for both.

Two streams are being pursued: (1) tests/configs for the Puppeteer crawler (@RaiyanRahman) and (2) tests for Cheerio (@jacqueline-chan).

@jacqueline-chan will share the written tests with @RaiyanRahman and @AlAndr04, as they should apply to both crawlers. @AlAndr04 will also look at why the Puppeteer crawler is not managing resources and concurrency as it should.

@kstapelfeldt
Member

@RaiyanRahman & @jacqueline-chan
Two issues remain: the crawl is too slow, and we are not getting links back (only single links).

  1. Hard-code https:// in Puppeteer & Cheerio and restart the crawl to see if this resolves the problem of bringing back only single links (see the sketch below).
  2. Fold in the selective rendering code for Puppeteer (after it's complete); we still need to be able to block video more effectively.
  3. Look into manual queueing: do we need to modify it in order to resolve the issue?
    Also: run the tests written by @jacqueline-chan on Puppeteer/Cheerio to test speed and bring back stats.
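
A hedged sketch of items 1 and 3 (assuming the Apify request queue is still in use; the scope list and helper name are illustrative):

```js
// Hedged sketch: force https:// on scope URLs and enqueue them manually.
const Apify = require('apify');

function toHttps(url) {
  // Hard-code the https scheme regardless of what the scope file says.
  return url.replace(/^(https?:\/\/)?/i, 'https://');
}

Apify.main(async () => {
  const requestQueue = await Apify.openRequestQueue();
  const scopeUrls = ['example.com', 'http://example.org/news']; // placeholder scope
  for (const url of scopeUrls) {
    await requestQueue.addRequest({ url: toHttps(url) });
  }
  // ... pass requestQueue to the Puppeteer or Cheerio crawler as usual ...
});
```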

@jacqueline-chan
Contributor Author

jacqueline-chan commented Jan 23, 2021

  3. Look into manual queueing: do we need to modify it in order to resolve the issue?
    Also: run the tests written by @jacqueline-chan on Puppeteer/Cheerio to test speed and bring back stats.

This third part is completed and is now in testing. While discussing with @RaiyanRahman, we determined that solving task #3 will also solve the issue task #1 was meant to fix, so task #1 is redundant.

Manually enqueuing links has introduced an issue where the crawler periodically stops crawling and needs to be manually restarted after a couple of thousand links. This will need more debugging to fix.

Planning to do a large crawl with Cheerio this weekend or first thing Monday.

@kstapelfeldt
Member

@RaiyanRahman could not make today's meeting.
@jacqueline-chan implemented number three. When manual queuing is used, the crawler does not function as it does with self-derived links, so there remains an implementation issue. How could we troubleshoot this? Nat suggests two ways forward: (1) contact the developers/GitHub repo maintainers/community, or (2) determine how to mark the manually queued requests so they are identical to the internally queued ones. This is an issue with Cheerio.

Get a site you know will fail.
Write tests that focus on the queue.

https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md

@jacqueline-chan
Contributor Author

@RaiyanRahman, please join us in debugging the manual enqueuing issue.

@jacqueline-chan
Contributor Author

jacqueline-chan commented Jan 28, 2021

Suggestions from Nat:

use the simplest puppeteer crawler to determine fails/edge cases

  1. how do we handle pop ups / privacy issues (Accept or close) (Use user agent). Ways to recognize pop ups for special cases per site.
    -- avoid javascript ?? pop up? (have users accept). Have a library/extension. Touch base with Raiyan.
  2. paywalls -- probably have to use api
  3. save asynchronously
  4. give batches

jacqueline-chan added a commit that referenced this issue Jan 29, 2021
@kstapelfeldt
Member

kstapelfeldt commented Feb 4, 2021

@RaiyanRahman refactored selective loading and tested the solution. Will this work for pop-ups? Check out https://www.tubantia.nl/ as an example.

The delay/slowness may be related to the process of saving files: we can't run asynchronously because we save files synchronously. Does our database solve this problem?

Notes: "I don't care about Cookies" doesn't work on Chrome, and the bigger issue is that we're using Apify's Puppeteer integration (which makes it difficult to add extensions): https://chrome.google.com/webstore/detail/i-dont-care-about-cookies/fihnjjcciajhdojfnbdddfaoknhalnja?hl=en. A lot of people are asking Puppeteer and Apify to implement this, and there is a 'stealth' function that has been developed and could be looked into. @RaiyanRahman might be able to look into this a little further.

The for loop might be slowing us down too.

  • Still focusing on puppeteer first

@jacqueline-chan:

Priority one: try to solve the pop-up problem, using @RaiyanRahman's code first (a new branch has already been pushed).
Priority two: the speed problem; move all processing to a function that runs later so the crawler doesn't wait on it (see the sketch below).

@RaiyanRahman

Priority one: will stealth mode work in Puppeteer (in place of the "I don't care about cookies" extension)?
Priority two: based on collaboration with Jacqueline, do we need to handle pop-ups/cookies elsewhere in the code?
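
A hedged sketch of the deferred-processing idea in Priority two above (paths, names, and the flush interval are illustrative): the page handler only buffers results, and a separate loop writes them to disk so the crawler never waits on file I/O.

```js
// Hedged sketch: decouple saving from crawling.
const fs = require('fs').promises;
const path = require('path');

const pending = []; // results waiting to be written (illustrative structure)

// Called from the crawler's page handler instead of awaiting a file write.
function queueResult(result) {
  pending.push(result);
}

// A separate timer drains the buffer without blocking page handling.
// Assumes a ./results directory already exists.
setInterval(async () => {
  const batch = pending.splice(0, pending.length);
  await Promise.all(batch.map((r) =>
    fs.writeFile(path.join('results', `${r.id}.json`), JSON.stringify(r))
  ));
}, 5000);

module.exports = { queueResult };
```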

@jacqueline-chan
Contributor Author

As discussed with @RaiyanRahman, stealth mode and blocking requests for extra resources do not work.

We will need to manually click on each accept button for now; @RaiyanRahman will help look for a way to automate that.

Both of us are going to try to write scripts to manually accept the consent forms (see the sketch below). I will make a list of the problematic links so far for us to test.
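
A hedged sketch of what such a script could try (the selectors are guesses; the real ones will come from the list of problematic links):

```js
// Hedged sketch: try a list of likely consent-button selectors on each page.
async function tryAcceptConsent(page) {
  const candidates = [
    '#onetrust-accept-btn-handler',  // illustrative; varies per site
    'button[aria-label="Accept"]',
    'button[title="Accept"]',
    '.cookie-accept',
  ];
  for (const selector of candidates) {
    const button = await page.$(selector);
    if (button) {
      await button.click();
      return selector; // report which selector worked, for the site list
    }
  }
  return null; // nothing matched; log the URL as problematic
}

module.exports = { tryAcceptConsent };
```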

@kstapelfeldt
Member

@RaiyanRahman is looking into other extensions that might resolve the issue of pop-ups. Extensions work with specific websites, but not generally. @Natkeeran advises that we develop a list of 10-15 sites with this problem and assess the problem for each: How do we identify the pop-up? How do we close it? (Put the answers into a .csv.) Based on this, we can decide how to address it in code: are there any features common to all of these sites?

@jacqueline-chan and @RaiyanRahman will split up the target links and put them into a Sheets doc or something similar (and link it in this ticket).

@kstapelfeldt
Member

kstapelfeldt commented Feb 18, 2021

@RaiyanRahman: inconsistent behaviour when trying to close pop-ups (solutions seem to work sometimes but not others), e.g. https://derstandard.at/.

@jacqueline-chan has to get the database up to retrieve this data; if she can't get it working she will have to restart and run the crawl for two days. She did look into one of the links with a pop-up: it should have been really easy to click away and accept, but for some reason the button still cannot be found, and searching for the name of the button doesn't work.
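
One possible explanation for the missing button (an assumption, not something confirmed here) is that many consent dialogs render inside a cross-origin iframe, where a selector query on the main frame finds nothing. A sketch that checks every frame:

```js
// Hedged sketch: look for the consent button in every frame, not just the main page.
async function findInAnyFrame(page, selector) {
  for (const frame of page.frames()) {
    const handle = await frame.$(selector).catch(() => null);
    if (handle) return { frame, handle };
  }
  return null;
}

// Usage (illustrative selector):
//   const hit = await findInAnyFrame(page, 'button[title="Accept"]');
//   if (hit) await hit.handle.click();
```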

Todo

@jacqueline-chan: trying to retrieve the database from the preliminary crawl
@jacqueline-chan and @RaiyanRahman: trying to resolve the URL issues
@jacqueline-chan and @RaiyanRahman: we need a Google Sheets spreadsheet with all the problematic URLs to make collaboration and documentation easier

@jacqueline-chan
Contributor Author

@jacqueline-chan @RaiyanRahman Some URLs that get only one hit in the full crawl are hit much more frequently when tested on their own (without any other domain in the queue). @RaiyanRahman would therefore like to explore batching as an option for now, and also look into the possibility of using Puppeteer/Playwright directly, without Apify, to give us more control over the queue mechanism (see the sketch below).
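
A hedged sketch of what an Apify-free, per-domain queue could look like (everything here is illustrative, including the domain list and page limit):

```js
// Hedged sketch: a minimal per-domain queue using plain Puppeteer (no Apify),
// so each domain can be crawled in isolation as in the experiment above.
const puppeteer = require('puppeteer');

async function crawlDomain(browser, startUrl, maxPages = 100) {
  const queue = [startUrl];
  const seen = new Set(queue);
  const page = await browser.newPage();
  while (queue.length > 0 && seen.size <= maxPages) {
    const url = queue.shift();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
      const links = await page.$$eval('a[href]', (as) => as.map((a) => a.href));
      for (const link of links) {
        if (link.startsWith(startUrl) && !seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      }
    } catch (err) {
      console.warn(`failed: ${url} (${err.message})`);
    }
  }
  await page.close();
  return seen.size;
}

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  for (const domain of ['https://example.com/']) { // placeholder scope
    console.log(domain, 'pages discovered:', await crawlDomain(browser, domain));
  }
  await browser.close();
})();
```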

@kstapelfeldt
Member

@jacqueline-chan: yesterday she was able to retrieve the database! She will make a CSV/Google Sheet and copy it into the ticket.
@jacqueline-chan and @RaiyanRahman are trying to resolve the URL issues. @Natkeeran reviewed a website that @jacqueline-chan was debugging and found that it deliberately hides its pop-up (to prevent crawling); this will be difficult. @RaiyanRahman found his site to be very inconsistent: 50% of the time a button could not be found.

@RaiyanRahman suggests we try running batches of websites to see if this improves behaviour. @jacqueline-chan will try running small batches manually first, to see if this makes a difference prior to any code development. If this works, write the code. If it doesn't, drop Apify and write our own queuing mechanism.

@jacqueline-chan
Contributor Author

CSV for the database. The way I determine that a link most likely has a pop-up issue is if it starts with https and only has a few hits.

Go to this site and click on "download csv":

http://199.241.167.146/

@kstapelfeldt
Member

Imported into the sheet here: https://docs.google.com/spreadsheets/d/1DJfiLT7XGL0XXttp8q0BKRZkn6swWAQ74gdlcaYG3CI/edit#gid=241833616

@kstapelfeldt
Member

kstapelfeldt commented Mar 4, 2021

@RaiyanRahman

  1. Trying to automate batching in Apify.
  2. Look for the content of the <a> tag instead of its title.

@jacqueline-chan to:

  1. Look and see if there are other manual queues already written that we might use (to swap out Apify).

  2. Send a notification if Apify crashes (find the code and push it).

  3. Output a list of problematic sites. We need some way to spit out a file of URLs that don't get crawled, as we're going to have some in the end.

  4. Bug fix: take in the full scope, but only run the batch matching the batch id (see the sketch below).

  5. Run NY Times as a batch, and the Twitter crawler.
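
A hedged sketch of item 4 (the batch size and CLI handling are illustrative):

```js
// Hedged sketch: take the full scope, but only enqueue the batch selected by
// an id passed on the command line, e.g. `node main.js 3`.
const BATCH_SIZE = 50; // illustrative
const batchId = Number(process.argv[2] || 0);

function selectBatch(scopeUrls, id, size = BATCH_SIZE) {
  return scopeUrls.slice(id * size, (id + 1) * size);
}

// const urlsToEnqueue = selectBatch(fullScope, batchId);
// ...enqueue `urlsToEnqueue` into the crawler as usual...
```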

@kstapelfeldt
Member

kstapelfeldt commented Mar 11, 2021

  • @RaiyanRahman did work on the batching system for Puppeteer using Apify, but still needs to mark batches as done. He starts by splitting the scope into smaller batches and runs them in succession: once one is done, he begins the next. He has more than 15 hours left and believes this can be done in that time.

Overall, we believe we should explore running multiple instances on separate VMs concurrently, incorporating Raiyan's improvements to the batching process.

jacqueline-chan added a commit that referenced this issue Mar 11, 2021
bug fix: if the crawl needs to run a subset of its crawl urls, the fu…
jacqueline-chan added a commit that referenced this issue Mar 11, 2021
@kstapelfeldt
Member

@RaiyanRahman is looking for an optimal balance between the number of batches and the number of page crawls per batch. The priority is an exhaustive crawl of individual domains if at all possible.

@jacqueline-chan has been pruning branches and working on the Compute Canada instances and documentation (tasks above).

@kstapelfeldt
Member

@RaiyanRahman: looping through domains individually is the best approach. @todo: make the crawl more robust for subsequent crawls. Raiyan has found a sample implementation in the documentation that he is working on.

@kstapelfeldt
Member

@RaiyanRahman has completed a refactor of the queuing system, which runs locally (for the most part), but installing it on the Graham cloud raised specific issues:

  1. After two days of running, an infinite loop
  2. Some domains only have a single page crawled, and subsequent links are not added to the queue
  3. JSON is not returned for some crawled links
  4. Two specific domains did not have a results folder created

@kstapelfeldt
Member

kstapelfeldt commented May 27, 2021

Strategy for finding out more:

  1. Re-run crawl with only working domains to see if infinite loop problem persists.
  2. Install two more versions of the code base
  3. Run one new machine with a single problematic URL
  4. Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic

@kstapelfeldt
Member

kstapelfeldt commented Jun 3, 2021

  1. Re-run crawl with only working domains to see if infinite loop problem persists.
  2. Install two more versions of the code base
  3. Run one new machine with a single problematic URL
  4. Run the second new machine using a new suite of URLs from the scope, and record what works and what is problematic
    ran the whole scope.

Equal importance was given to working domains. Most domains didn't work because of a memory issue: the call stack gets too big and takes things down. This was tested alongside a local instance, and the behaviour was actually the same. The NY Times was tested on a separate machine.

In 48 hours:

  • 40,421 crawled links
  • 9,507 JSON files
  • 14 URLs per minute (a low average)

Next steps:

@kstapelfeldt
Member

  • Raiyan implemented a different naming convention (UUID plus millisecond timestamp), pushed the changes, and updated the Graham cloud. It seems better, but there is still a mismatch between the number of JSON files produced and the number of links crawled. The logs show when it happens, but we are still figuring out how to avoid the issue (a sketch of the naming scheme is below).
  • Next step: keep working to find out when the infinite-loop behaviour happens (monitor the crawl?). Raiyan has a couple of ideas.
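
A hedged sketch of that naming convention (the exact format Raiyan used may differ; this assumes the `uuid` npm package is installed):

```js
// Hedged sketch: collision-proof JSON filenames from a millisecond timestamp + UUID.
const { v4: uuidv4 } = require('uuid');

function resultFileName() {
  return `${Date.now()}-${uuidv4()}.json`;
}

// e.g. "1623945600123-9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d.json"
module.exports = { resultFileName };
```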

@kstapelfeldt
Member

kstapelfeldt commented Jun 17, 2021

Time notes: in 2 days the crawl covered almost 15,000 links, still had over 120,000 links left in the queue, and produced JSON files numbering in the mid-10,000s. We tried a couple of different naming conventions.

Raiyan has implemented a timestamp solution to stop JSON files being overwritten; this has reduced the number of missed JSON files, and we anticipate it means we won't have JSON overwrite problems. For the remaining JSON issues going forward, we'll need to implement a check after the data goes to the post-processor to see which found URLs did not result in JSON files (see the sketch below).
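
A hedged sketch of that check (file names, paths, and the assumption that each JSON result records its `url` are all illustrative):

```js
// Hedged sketch: report crawled URLs that never produced a JSON result.
const fs = require('fs');
const path = require('path');

// Placeholder inputs: a text file of crawled URLs and a results/ folder of JSON.
const crawled = new Set(
  fs.readFileSync('crawled-urls.txt', 'utf8').split('\n').filter(Boolean)
);
const produced = new Set(
  fs.readdirSync('results')
    .filter((f) => f.endsWith('.json'))
    .map((f) => JSON.parse(fs.readFileSync(path.join('results', f), 'utf8')).url)
);

const missing = [...crawled].filter((url) => !produced.has(url));
fs.writeFileSync('missing-json.txt', missing.join('\n'));
console.log(`${missing.length} crawled URLs have no JSON file`);
```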

The crawler still runs out of stack memory and then stops, and needs to be restarted. Raiyan is working on a mechanism for automating the restart process, and will meet with the dev team next week about the log/memory issue.

Raiyan will spend time getting to know the Compute Canada resources so we understand how much space is available for storing data.

@kstapelfeldt
Member

Compute Canada: there are lots of hidden folders with storage space associated with them. We should check these folders before setting up an instance.

12,000 links after 24 hours: 8.5 links per minute on average, and faster (15 links per minute) when there is no issue.
Did some refactoring per Nat's suggestion.
The crawl runs to the 24-hour mark and then has trouble opening new pages in the browser; after 24 hours this happens frequently.
Added timestamps, which made the debug file a lot easier to work with.
If we prevent the new-page timeout issue, this will speed things up a lot and take care of a lot of our issues.
The stealth crawler might cause problems.
JSON files are now all being created!! A script counts all pages crawled and the numbers match.

  • Experiment to see which Puppeteer configuration flag might be causing the new-page timeout issue.
  • Consider stopping and restarting the crawl with a separate script scheduled to fire when the new-page timeout issue appears (see the sketch below).
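
A hedged sketch of such a restart script (the entry point and delay are illustrative; a real version might also watch the crawler's log for the timeout message):

```js
// Hedged sketch: keep restarting the crawler process whenever it exits.
const { spawn } = require('child_process');

function startCrawler() {
  const child = spawn('node', ['main.js'], { stdio: 'inherit' }); // placeholder entry point
  child.on('exit', (code) => {
    console.log(`crawler exited with code ${code}; restarting in 30s`);
    setTimeout(startCrawler, 30000);
  });
}

startCrawler();
```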
