This repository contains cloud crawler functions used by scrapeulous.com.
If you want to add your own crawler function to be used within the crawling infrastructure of scrapeulous, please contact us.
You can test all crawling functions locally with the test_runner program contained in this repository.
For example, execute the Google scraper with:
node test_runner.js google_scraper.js '["keyword 1"]'
or run the Amazon crawler with:
node test_runner.js amazon.js '["Notebook"]'
or the reverse image crawler with:
node test_runner.js reverse_image_google_url.js '["https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Mohamed_Atta.jpg/220px-Mohamed_Atta.jpg", "https://aldianews.com/sites/default/files/styles/article_image/public/articles/ISISAmenaza.jpg?itok=u7Nhc41a"]'
or
node test_runner.js reverse_image_bing_url.js '["https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Mohamed_Atta.jpg/220px-Mohamed_Atta.jpg"]'
or
node test_runner.js reverse_image_bing.js '["AC_I161709.jpg"]'
or you can run the social scraper:
node test_runner.js social.js '["http://www.flinders.edu.au/", "http://www.latrobe.edu.au/", "http://www.griffith.edu.au/", "http://www.murdoch.edu.au/", "https://www.qut.edu.au/"]'
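All of these commands pass a worker file plus a JSON array of items to process. As a rough, hypothetical sketch (not the actual test_runner.js implementation; the export shape and result handling are assumptions), a local runner essentially parses that array and calls the worker's crawl() function once per item:

```js
// Conceptual sketch only, NOT the real test_runner.js.
// Assumes the worker file exports a class with an async crawl(item) method;
// the real runner also has to set up this.page (browser) or this.Got/this.Cheerio
// on the worker instance before calling crawl().
const path = require('path');

async function main() {
  const [workerFile, itemsJson] = process.argv.slice(2);
  const items = JSON.parse(itemsJson); // e.g. '["keyword 1"]'
  const WorkerClass = require(path.resolve(workerFile)); // assumed export shape
  const worker = new WorkerClass();

  const results = {};
  for (const item of items) {
    results[item] = await worker.crawl(item); // one crawl() call per item
  }
  console.log(JSON.stringify(results, null, 2));
}

main().catch(console.error);
```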
The following crawler functions are included:
- Scraping of product metadata on Amazon
- Extracting the SERP from Google
- Extracting the SERP from Bing
- A simple HTTP crawler making plain requests
- Leads: extracting phone numbers and email addresses from any URL with raw HTTP requests
- Extracting LinkedIn profile data from any LinkedIn profile
- Extracting Amazon Warehouse deals
- Extracting Amazon product data
You can add two types of Cloud Crawler functions:
- For crawling with the Chrome browser controlled via puppeteer, use the `BrowserWorker` base class
- For scraping with the HTTP library got and parsing with cheerio, use the `HttpWorker` base class
The function prototype for a `BrowserWorker` looks like this:
/**
*
* The BrowserWorker class contains your scraping/crawling logic.
*
* Each BrowserWorker class must declare a crawl() function, which is executed on a distributed unique machine
* with dedicated CPU, memory and browser instance. A unique IP is not guaranteed,
* but it is the norm.
*
* Scraping workers time out after 200 seconds, so the function
* should return before this hard limit.
*
* Each Worker has a `page` property: a puppeteer-like page object. See here:
* https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#class-page
*/
class Worker extends BrowserWorker {
/**
*
* Implement your crawling logic here. You have access to `this.page` here
* with a fully loaded browser according to configuration.
*
* @param item: The item that this crawl function makes progress with
*/
async crawl(item) {
}
}
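For illustration, a minimal `BrowserWorker` that just loads a search result page for each keyword and returns the page title could look like the sketch below. The target URL, waiting strategy and returned fields are assumptions for the example, not part of the base class contract:

```js
// Hypothetical example worker: load the Google results page for a keyword
// and return the document title via the puppeteer-like `this.page` object.
class Worker extends BrowserWorker {
  async crawl(item) {
    // `item` is one element of the JSON array passed on the command line,
    // e.g. "keyword 1"
    const url = 'https://www.google.com/search?q=' + encodeURIComponent(item);
    await this.page.goto(url, { waitUntil: 'networkidle2' });
    return {
      keyword: item,
      title: await this.page.title(), // any extracted data can be returned here
    };
  }
}
```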
The function prototype for `HttpWorker` instances looks similar:
/**
*
* The HttpWorker class contains your scraping/crawling logic.
*
* Each HttpWorker class must declare a crawl() function, which is executed on a distributed unique machine
* with dedicated CPU and memory. A unique IP is not guaranteed,
* but it is the norm.
*
* Scraping workers time out after 200 seconds, so the function
* should return before this hard limit.
*
* The class has access to the `this.Got` HTTP library and `this.Cheerio` for parsing HTML documents.
* https://github.com/sindresorhus/got
*/
class Worker extends HttpWorker {
/**
*
* Implement your crawling logic here. You have access to `this.Got` here
* with a powerful http client library.
*
* @param item: The item that this crawl function makes progress with
*/
async crawl(item) {
}
}
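As a comparable sketch, an `HttpWorker` can fetch a URL with `this.Got` and parse the response with `this.Cheerio`; the returned fields below are just an example:

```js
// Hypothetical example worker: plain HTTP request plus HTML parsing,
// using the `this.Got` and `this.Cheerio` handles mentioned above.
class Worker extends HttpWorker {
  async crawl(item) {
    // `item` is expected to be a URL here, e.g. "http://www.flinders.edu.au/"
    const response = await this.Got(item);
    const $ = this.Cheerio.load(response.body);
    return {
      url: item,
      status: response.statusCode,
      title: $('title').text().trim(),
    };
  }
}
```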