The Scraper project demonstrates the practical implementation of several design patterns, including Factory, Builder, and Singleton. It features a web scraper that periodically scrapes specified web pages on a schedule defined by a cron expression. The scraper is highly modular and extensible, making it easy to adapt to different web scraping needs.
- Factory Design Pattern: Used to create different types of scrapers based on the target website.
- Builder Design Pattern: Facilitates the construction of complex scrape URL paths.
- Singleton Design Pattern: Ensures that the database connection is a single instance, which manages the storage tasks.
- Scheduler: Configured with cron expressions to scrape web pages at specified intervals.
- Extensible and Modular: Easy to add new scrapers and extend functionality.
```
scraper/
├── src
│   ├── PathBuilderFactory
│   │   ├── index.ts
│   │   └── products
│   │       ├── ListAm.ts
│   │       └── MobileCentre.ts
│   ├── Scheduler
│   │   └── index.ts
│   ├── ScrapableFactory
│   │   ├── index.ts
│   │   └── products
│   │       ├── ListAm.ts
│   │       └── MobileCentre.ts
│   ├── ScrapeValidatorFactory
│   │   ├── index.ts
│   │   └── products
│   │       ├── ListAm.ts
│   │       └── MobileCentre.ts
│   ├── ScrapersFactory
│   │   ├── index.ts
│   │   └── products
│   │       ├── Https.ts
│   │       └── Puppetter.ts
│   ├── __tests__
│   │   └── pathBuilder.test.ts
│   ├── configs
│   │   ├── constants.ts
│   │   └── types.ts
│   ├── db
│   │   ├── MongoCRUD.ts
│   │   └── MongoDBConnection.ts
│   ├── index.ts
│   └── utils
│       ├── getDateTime.ts
│       ├── getRandomInterval.ts
│       ├── request.ts
│       └── sleep.ts
├── tsconfig.json
├── README.MD
├── eslint.config.mjs
├── jest.config.cjs
├── package-lock.json
└── package.json

15 directories, 29 files
```
- Node.js
- TypeScript
- npm
- Clone the repository:

  ```sh
  git clone https://github.com/surenpoghosian/scraper.git
  cd scraper
  ```

- Install dependencies:

  ```sh
  npm install
  ```

- Configure the scraper using the PathBuilderFactory to set the target URLs.

- Build and run the project:

  ```sh
  npm run build
  npm start
  ```

The scheduler will start and scrape the specified web pages based on the configured cron expressions.
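The Scheduler lives in `src/Scheduler/index.ts` and its internals are not shown here. As a rough illustration of cron-driven scheduling, here is a minimal sketch assuming the widely used `node-cron` package, which may or may not be what this project's Scheduler actually wraps:

```typescript
import cron from 'node-cron';

// Illustrative only: node-cron is an assumption, not necessarily the
// library this project uses. '0 * * * *' fires at the top of every hour.
cron.schedule('0 * * * *', () => {
  console.log('scrape triggered at', new Date().toISOString());
  // kick off a scrape run here
});
```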
Here's how to run the scraper:
```typescript
import { v4 as uuid } from 'uuid';
import ScrapersFactory from './ScrapersFactory';
import ScrapableFactory from './ScrapableFactory';
import PathBuilderFactory from './PathBuilderFactory';
import ScrapeValidatorFactory from './ScrapeValidatorFactory';
import {
  ListAmCategory,
  ListAmGeolocation,
  PathBuilderVariant,
  ScrapeValidatorVariant,
  ScrapeableVariant,
  ScraperType,
  ScrapeType,
} from './configs/types';
import { ListAmBaseURL } from './configs/constants';
import { sleep } from './utils/sleep';
import getRandomInterval from './utils/getRandomInterval';

class Main {
  run = async () => {
    const scraperFactory = ScrapersFactory;
    const scrapableFactory = ScrapableFactory;
    const pathBuilderFactory = PathBuilderFactory;
    const scrapeValidatorFactory = ScrapeValidatorFactory;

    // Assemble the scraping engine, validator, and scrapable product.
    const scraper = scraperFactory.createScraper(ScraperType.PUPPETTER);
    const validator = scrapeValidatorFactory.createScrapeValidator(ScrapeValidatorVariant.LISTAM);
    const scrapable = scrapableFactory.createScrapable(ScrapeableVariant.LISTAM, scraper, validator);

    // Build paths for the list.am "room for rent" category.
    const pathBuilder = pathBuilderFactory.createPathBuilder(PathBuilderVariant.LISTAM);
    pathBuilder.init('', ListAmCategory.ROOM_FOR_A_RENT);

    const scrapeId = uuid();

    // Walk the paginated listing indefinitely, one page per iteration.
    for (let i = 1; true; i++) {
      pathBuilder.reset();
      pathBuilder.addPageNumber(i);
      pathBuilder.addGeolocation(ListAmGeolocation.YEREVAN);
      const finalPath = pathBuilder.build();
      console.log(`${ListAmBaseURL}${finalPath}`);

      await scrapable.scrape(scrapeId, finalPath, ScrapeType.LIST);

      // Random delay between requests to avoid hammering the site.
      const sleepInterval = getRandomInterval(4000, 10000);
      await sleep(sleepInterval);
    }
  };
}

export default Main;
```
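The class above only defines the run loop; something still has to instantiate and invoke it. A minimal bootstrap might look like this (a sketch; where the project actually does this is not shown above):

```typescript
// Hypothetical bootstrap, assuming Main is the default export shown above.
const main = new Main();
main.run().catch((err) => {
  console.error('scrape run failed:', err);
  process.exit(1);
});
```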
The `ScrapersFactory` is responsible for creating instances of different types of scrapers based on the scraping engine you need.
Example:

```typescript
const scraper = ScrapersFactory.createScraper(ScraperType.PUPPETTER);
```
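The factory's internals are not shown in this README; conceptually it is a switch over an enum that returns a concrete product. A minimal sketch, assuming a `Scraper` interface with a `fetch` method (both are illustrative names, not the project's actual API):

```typescript
// Illustrative sketch only; the real products live in
// src/ScrapersFactory/products/ (Https.ts and Puppetter.ts).
enum ScraperType {
  HTTPS = 'HTTPS',
  PUPPETTER = 'PUPPETTER',
}

interface Scraper {
  fetch(url: string): Promise<string>; // returns raw HTML
}

class HttpsScraper implements Scraper {
  async fetch(url: string): Promise<string> {
    const res = await fetch(url); // Node 18+ global fetch
    return res.text();
  }
}

class PuppetterScraper implements Scraper {
  async fetch(url: string): Promise<string> {
    // A real implementation would drive a headless browser here.
    throw new Error('not implemented in this sketch');
  }
}

class ScrapersFactory {
  static createScraper(type: ScraperType): Scraper {
    switch (type) {
      case ScraperType.HTTPS:
        return new HttpsScraper();
      case ScraperType.PUPPETTER:
        return new PuppetterScraper();
      default:
        throw new Error(`Unknown scraper type: ${type}`);
    }
  }
}
```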
The `PathBuilderFactory` is used to construct complex scrape URL paths.
Example:

```typescript
const pathBuilder = PathBuilderFactory.createPathBuilder(PathBuilderVariant.LISTAM);
pathBuilder.init('', ListAmCategory.ROOM_FOR_A_RENT);
pathBuilder.addPageNumber(1);
pathBuilder.addGeolocation(ListAmGeolocation.YEREVAN);
const finalPath = pathBuilder.build();
console.log(finalPath);
```
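A builder like this typically keeps a base path set by `init` and appends one segment per `add*` call. A minimal sketch of the idea (the segment formats below are illustrative assumptions, not list.am's real URL scheme):

```typescript
// Illustrative sketch; the real builder is src/PathBuilderFactory/products/ListAm.ts.
class ListAmPathBuilder {
  private base = '';
  private segments: string[] = [];

  // prefix and category form the fixed part of the path
  init(prefix: string, category: string): void {
    this.base = `${prefix}/category/${category}`;
  }

  // clears per-page segments but keeps the base set by init()
  reset(): void {
    this.segments = [];
  }

  addPageNumber(page: number): void {
    this.segments.push(`/${page}`);
  }

  addGeolocation(geo: string): void {
    this.segments.push(`?n=${geo}`);
  }

  build(): string {
    return this.base + this.segments.join('');
  }
}
```

Under this reading, the `reset()` call at the top of the main loop above makes sense: it discards the previous page's segments while preserving the base path that `init()` set once.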
The `MongoDBConnection` class is implemented as a singleton to ensure that there is only one instance managing the MongoDB connection.
Example:

```typescript
const mongoConnection = await MongoDBConnection.getInstance();
const db = mongoConnection.getDb();
// Use db as needed
```
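A singleton like this usually caches its instance in a static field and forbids direct construction. A minimal sketch using the official `mongodb` driver (the connection URI and database name are placeholder assumptions; the real class is `src/db/MongoDBConnection.ts`):

```typescript
import { Db, MongoClient } from 'mongodb';

class MongoDBConnection {
  private static instance: MongoDBConnection | null = null;
  private db!: Db;

  private constructor() {} // prevents `new MongoDBConnection()` outside the class

  static async getInstance(): Promise<MongoDBConnection> {
    if (!MongoDBConnection.instance) {
      const connection = new MongoDBConnection();
      const client = new MongoClient(process.env.MONGO_URI ?? 'mongodb://localhost:27017');
      await client.connect();
      connection.db = client.db('scraper'); // db name is an assumption
      MongoDBConnection.instance = connection;
    }
    return MongoDBConnection.instance;
  }

  getDb(): Db {
    return this.db;
  }
}
```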
- Create a new scrapable product in the `ScrapableFactory/products/` directory that extends `Scrapable` (a skeleton sketch follows this list).
- Implement the required methods.
- Register the new scrapable in the `ScrapableFactory`.
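As an illustration, the skeleton of such a product might look like the following. `Scrapable`'s real definition is not shown in this README, so the local abstract class below only mirrors the `scrape(scrapeId, path, type)` signature used in the Main example, and the marketplace name is hypothetical:

```typescript
// ScrapeType and Scrapable are redeclared here only to keep the sketch
// self-contained; in the project they come from configs/types and the
// ScrapableFactory respectively.
enum ScrapeType {
  LIST = 'LIST',
}

abstract class Scrapable {
  abstract scrape(scrapeId: string, path: string, type: ScrapeType): Promise<void>;
}

class MyMarketplace extends Scrapable {
  async scrape(scrapeId: string, path: string, type: ScrapeType): Promise<void> {
    // 1. fetch the page through the injected scraper engine
    // 2. validate the result with the injected validator
    // 3. persist parsed items (e.g. via MongoCRUD)
  }
}
```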
Contributions are welcome! Please submit a pull request or open an issue to discuss your ideas.
This project is licensed under the MIT License. See the LICENSE file for details.
Happy Scraping!