
Web Crawler

This app is built as a CLI. Its main goal is to fetch and process data from a specified website and all of its subpages: pages linked from the main page, subpages of those subpages, and so on.


Introduction

This app uses asynchronous requests to fetch all subpages recursively. It then stores the data in a specified CSV or JSON file. There is also an option to print the structure of the page as a tree. For each scraped page, the data contains the link, the title, the number of internal links, the number of external links, and the number of times the URL was referenced by other pages. For simplicity, links that start with http are counted as external links.
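The internal/external split described above can be sketched in a few lines. This is an illustrative implementation of the stated heuristic (links starting with http count as external), not the project's actual code:

```python
def count_links(hrefs):
    """Split a page's hrefs into internal and external counts.

    Per the simplification above, any link that starts with 'http'
    is treated as external; everything else (relative paths,
    anchors, etc.) is treated as internal.
    """
    external = sum(1 for href in hrefs if href.startswith("http"))
    return {"internal": len(hrefs) - external, "external": external}
```

Note that this heuristic also counts absolute links back to the crawled site itself as external, which is the trade-off the simplification accepts.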

Documentation

crawl.py --page 'url' --format 'csv/json' --output 'output'

This script fetches subpages from the given 'url'. Results are saved in 'csv/json' format to 'output', where each row represents one page, with the following columns/keys:
• link
• title
• number of internal links
• number of external links
• number of times the URL was referenced by other pages
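Writing one row per page with those columns could look like the following sketch. The column names here are illustrative; the actual script may use different headers:

```python
import csv

# Illustrative column names matching the fields listed above
# (hypothetical; the real script's headers may differ).
COLUMNS = ["link", "title", "internal_links", "external_links", "references"]

def write_rows(path, rows):
    """Write crawl results to a CSV file, one row per scraped page."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```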

print_tree.py --page 'url'

This script prints the structure of the page as a tree in the following format:

Main page (5)
  subpage1 (2)
    subpage1_1 (0)
    subpage1_2 (0)
  subpage2 (1)
    subpage2_1 (0)
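The format above can be reproduced with a short recursive renderer. In this sketch each node is a hypothetical (title, children) pair, and the number in parentheses is the total count of subpages beneath the node, matching the example output:

```python
def count_subpages(node):
    """Total number of pages below this node (children plus all
    deeper descendants)."""
    _, children = node
    return len(children) + sum(count_subpages(c) for c in children)

def render_tree(node, depth=0):
    """Render the tree as indented lines, two spaces per level,
    in the 'title (count)' format shown above."""
    title, children = node
    lines = ["  " * depth + f"{title} ({count_subpages(node)})"]
    for child in children:
        lines.extend(render_tree(child, depth + 1))
    return lines
```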

There are also two additional arguments:

--allow_redirects, when set, redirects are also fetched; by default this is set to False

--max_depth, the maximum depth of subpages; by default it is set to 2

--timeout, you can speed up scraping by lowering the timeout; the default is 3 seconds

Notice: If the script reaches the timeout, a 'Timeout error.' message is displayed in the console.
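As a rough illustration of the timeout behaviour, here is a minimal asyncio sketch. The `simulated_request` coroutine is a hypothetical stand-in for the real asynchronous HTTP request; only the timeout handling mirrors the behaviour described above:

```python
import asyncio

async def simulated_request(url, delay):
    # Hypothetical stand-in for a real asynchronous HTTP request;
    # `delay` simulates network latency.
    await asyncio.sleep(delay)
    return f"<html>content of {url}</html>"

async def fetch(url, delay, timeout=3.0):
    """Return the page body, or None after printing 'Timeout error.'
    when the request exceeds `timeout` seconds."""
    try:
        return await asyncio.wait_for(simulated_request(url, delay), timeout=timeout)
    except asyncio.TimeoutError:
        print("Timeout error.")
        return None
```

A request that times out is simply skipped, so a lower timeout trades completeness for speed.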

Example usage with all arguments:

crawl.py --page 'url' --format 'csv/json' --output 'output' --max_depth 3 --allow_redirects --timeout 10

License

MIT

Copyright (c) 2022, Tim Schopinski
