
Gunaydin!


Your good mornings! "Gunaydin" means "Good Morning" in Turkish 🇹🇷.

Every day, I wake up in the morning, go to work, and the first thing I do is go through a list of websites I usually keep an eye on to check if there's something new. Even worse, sometimes I forget to check some of them, and so I miss information.

One way to automate the process is to scrape these websites from time to time and get a list of the latest links (news, products, etc.).

The downside is that I have to explicitly define how to scrape each website, that is, how to find the content in its HTML page. To do so, add a new document to the Template collection. That's all; the rest of the logic is the same for all websites.

Index

  • Demo
  • Features
  • Installation
  • How It Works
  • Support
  • Contribute
  • License

Demo

Check Demo

Features

  • Scrapes a given list of pages from time to time.
  • Uses request for static pages and nightmare for dynamic pages.
  • Queues async jobs (scraping and saving to the database) using async.
  • Scrapes proxies, rotates between them, and randomly assigns user agents.
  • Logs and tracks events, especially jobs (how many succeeded, how many failed and why, etc.).
  • Logs errors and exceptions using Winston; logs are shipped to Loggly via winston-loggly-bulk.
  • Uses MongoDB, Mongoose, and MongoLab (mLab) for storing and querying data.

Installation

Running Locally

Make sure you have Node.js and npm installed.

  1. Clone or Download the repository

    $ git clone https://github.com/OmarElGabry/gunaydin.git
    $ cd gunaydin
    
  2. Install Dependencies

    $ npm install
    
  3. Start the application

    $ npm start
    

Your app should now be running on localhost:3000.

How It Works

Setup Configurations

Everything from setting up user authentication to the database is explained in chat.io; I almost copied and pasted the code from there.

User & Pages (Model)

Every user document contains all information about that user. It has an array of pages.

Each page is what gets scraped. A page has a list of links, a title, a URL, etc., and a reference to its template (see below). The links might be products, news, etc., depending on the page.
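
For illustration, here is a minimal sketch of how these documents might be modeled in Mongoose; the field names are assumptions, not the actual schema in this repository:

    // Hypothetical User and Page schemas (field names are illustrative)
    const mongoose = require('mongoose');

    const pageSchema = new mongoose.Schema({
      title: String,
      url: String,
      // Reference to the template that says how to scrape this page
      template: { type: mongoose.Schema.Types.ObjectId, ref: 'Template' },
      // Latest links (products, news, etc.) extracted from the page
      links: [{ title: String, url: String }]
    });

    const userSchema = new mongoose.Schema({
      username: String,
      // Every user embeds the array of pages they keep an eye on
      pages: [pageSchema]
    });

    module.exports = mongoose.model('User', userSchema);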

Template (Model)

A template represents webpages that share the same layout. Pages with the same layout can be grouped under one 'Template', which defines one specific way to scrape them.

Thinking about a template? Open an issue, and I'll be happy to add it to the list.
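
To give a concrete picture, here is a hypothetical Template schema based on the description above; the selector fields are an assumption about how "one specific way to scrape" could be encoded:

    // Hypothetical Template schema: CSS selectors shared by all pages
    // with this layout (field names are illustrative)
    const mongoose = require('mongoose');

    const templateSchema = new mongoose.Schema({
      name: String,
      // 'static' pages go to the request scraper, 'dynamic' ones to nightmare
      type: { type: String, enum: ['static', 'dynamic'], default: 'static' },
      // Selector matching each link element on the page
      linkSelector: String,
      // Attribute on the matched element holding the link URL
      urlAttribute: { type: String, default: 'href' }
    });

    module.exports = mongoose.model('Template', templateSchema);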

Shards (aka Cycles)

Users are split into logical shards. Every time interval, say one hour, a shard is picked, all users' pages in that shard are scraped, and their listings are updated in the database.
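
A minimal sketch of the idea; the shard count, the interval, a per-user shard field, and the require paths are all assumptions:

    // Every interval, process only the users in the current shard
    const SHARDS = 24;                   // assumed number of logical shards
    const INTERVAL = 60 * 60 * 1000;     // one hour

    const User = require('./models/user');       // hypothetical path
    const { scrapeQueue } = require('./queue');  // hypothetical path

    function currentShard() {
      // Rotate through the shards, one per interval
      return Math.floor(Date.now() / INTERVAL) % SHARDS;
    }

    setInterval(async () => {
      const users = await User.find({ shard: currentShard() });
      for (const user of users) {
        // Hand every page in this shard to the scraping queue
        user.pages.forEach((page) => scrapeQueue.push(page));
      }
    }, INTERVAL);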

Queue (Service)

A queue is a list of the async jobs to be processed by the workers. The jobs might be scraping or saving to database. Accordingly, the workers might be scrapers or database workers.

A queue limits the maximum number of simultaneous operations and handles failed jobs by re-pushing them to the queue (up to a maximum of, say, 3 times).

There is a generic Queue class, where the Queue Factory instantiates different queues with different workers and max concurrent jobs.
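
A minimal sketch of how such a queue could be built on the async library the project uses (async v3 style; the concurrency and retry numbers are assumptions):

    const async = require('async');

    // worker: an async function processing one job (scraping, saving, ...)
    function createQueue(worker, concurrency, maxRetries = 3) {
      const attempts = new Map(); // job -> number of failures so far
      const queue = async.queue((job, done) => {
        worker(job).then(() => done()).catch(done);
      }, concurrency);

      // On failure, re-push the job up to maxRetries times
      queue.error((err, job) => {
        const n = (attempts.get(job) || 0) + 1;
        attempts.set(job, n);
        if (n < maxRetries) queue.push(job);
      });

      return queue;
    }

    // The factory then instantiates queues with different workers and limits:
    // const scrapeQueue = createQueue(scrapePage, 5);
    // const dbQueue = createQueue(saveToDatabase, 10);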

Scrapers (Service)

There are three scrapers: static, dynamic, and a dedicated one for proxies (also dynamic). All scrapers inherit from the generic class Scraper, which provides useful methods to extract data, rotate proxies, randomly assign user agents, and so on.

All scrapers are also workers and inherit from the Worker interface.
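
As an illustration, a trimmed-down sketch of a Scraper base class and the static scraper: it uses request as the README says, plus cheerio for parsing, which is an assumption (the repository may extract data differently):

    const request = require('request');
    const cheerio = require('cheerio'); // assumption: not named in the README

    class Scraper {
      constructor(proxies = [], userAgents = []) {
        this.proxies = proxies;
        this.userAgents = userAgents;
        this.proxyIndex = -1;
      }
      // Rotate proxies round-robin
      nextProxy() {
        if (!this.proxies.length) return undefined;
        this.proxyIndex = (this.proxyIndex + 1) % this.proxies.length;
        return this.proxies[this.proxyIndex];
      }
      // Pick a user agent at random
      randomUserAgent() {
        return this.userAgents[Math.floor(Math.random() * this.userAgents.length)];
      }
      // Extract links from raw HTML using the page's template selectors
      extract(html, template) {
        const $ = cheerio.load(html);
        return $(template.linkSelector).map((i, el) => ({
          title: $(el).text().trim(),
          url: $(el).attr(template.urlAttribute)
        })).get();
      }
    }

    class StaticScraper extends Scraper {
      scrape(page, template) {
        return new Promise((resolve, reject) => {
          request({
            url: page.url,
            proxy: this.nextProxy(),
            headers: { 'User-Agent': this.randomUserAgent() }
          }, (err, res, body) => {
            if (err) return reject(err);
            resolve(this.extract(body, template));
          });
        });
      }
    }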

Stats (Service)

It keeps track of all events, especially jobs, and persists them to the database every few hours.
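
A minimal sketch of what this tracking might look like; the counters, the 'Stat' model, and the flush interval are all assumptions:

    // Count job events in memory, then persist a snapshot periodically
    class Stats {
      constructor() { this.reset(); }
      reset() {
        this.succeeded = 0;
        this.failed = 0;
        this.failures = []; // why each job failed
      }
      jobSucceeded() { this.succeeded++; }
      jobFailed(reason) {
        this.failed++;
        this.failures.push({ reason, at: new Date() });
      }
      // Persist the snapshot via a hypothetical Mongoose 'Stat' model
      async flush(Stat) {
        await Stat.create({
          succeeded: this.succeeded,
          failed: this.failed,
          failures: this.failures,
          date: new Date()
        });
        this.reset();
      }
    }

    const stats = new Stats();
    // e.g. flush every 6 hours:
    // setInterval(() => stats.flush(Stat), 6 * 60 * 60 * 1000);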

Support

I've written this script in my free time alongside my work. If you find it useful, please support the project by spreading the word.

Contribute

Contribute by creating new issues or sending pull requests on GitHub, or send an email to [email protected]

License

Built under the MIT license.
