This app is built as a CLI tool. Its main goal is to fetch and process data from a specified website and all of its subpages: pages linked from the main page, the subpages of those subpages, and so on.
The app uses asynchronous requests to fetch all subpages recursively and then stores the data in a specified CSV or JSON file. There is also an option to print the structure of the page as a tree. For each scraped page the data contains: link, title, number of internal links, number of external links, and number of times the URL was referenced by other pages. For simplicity, links that start with http are counted as external links.
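A minimal sketch of how such a recursive asynchronous crawl could look, assuming aiohttp and BeautifulSoup (the actual script may use different libraries and structure); the internal/external split follows the rule above, where only links starting with http count as external:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def crawl(session, url, depth, max_depth, results):
    # Stop at the maximum depth or if this page was already visited.
    if depth > max_depth or url in results:
        return
    try:
        async with session.get(url, allow_redirects=False) as resp:
            html = await resp.text()
    except asyncio.TimeoutError:
        print("Timeout error.")
        return
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    # Per the rule above: links starting with "http" count as external.
    external = [h for h in hrefs if h.startswith("http")]
    internal = [h for h in hrefs if not h.startswith("http")]
    results[url] = {
        "title": soup.title.string if soup.title else "",
        "internal": len(internal),
        "external": len(external),
    }
    # Fetch internal subpages concurrently and recurse.
    await asyncio.gather(
        *(crawl(session, url.rstrip("/") + "/" + h.lstrip("/"),
                depth + 1, max_depth, results)
          for h in internal)
    )

async def main(start_url, max_depth=2, timeout=3):
    timeout_cfg = aiohttp.ClientTimeout(total=timeout)
    async with aiohttp.ClientSession(timeout=timeout_cfg) as session:
        results = {}
        await crawl(session, start_url, 0, max_depth, results)
        return results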
This script fetches subpages starting from the given 'url'. Results are saved in 'csv' or 'json' format in 'output',
where each row represents one page, with the following columns/keys (a sketch of the export step follows the list):
• link
• title
• number of internal links
• number of external links
• number of times the URL was referenced by other pages*
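As a rough illustration of that export step, assuming results is a mapping from link to the per-page counts (the field names here are illustrative, not necessarily those used by the real script):

import csv
import json

def save(results, output, fmt):
    # One row per scraped page, matching the columns listed above.
    rows = [
        {
            "link": link,
            "title": page["title"],
            "internal_links": page["internal"],
            "external_links": page["external"],
            "referenced_by": page.get("referenced_by", 0),
        }
        for link, page in results.items()
    ]
    if fmt == "json":
        with open(output, "w", encoding="utf-8") as f:
            json.dump(rows, f, indent=2)
    else:
        with open(output, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)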
This script prints the structure of the page as a tree in the following format:
Main page (5)
    subpage1 (2)
        subpage1_1 (0)
        subpage1_2 (0)
    subpage2 (1)
        subpage2_1 (0)
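A possible way to produce that output, assuming the crawler builds a mapping from each page to its direct subpages (a hypothetical structure; the number in parentheses is read here as the total count of pages below the node, which matches the example above):

def count_descendants(node, children):
    # Total number of pages anywhere below this node.
    subs = children.get(node, [])
    return len(subs) + sum(count_descendants(c, children) for c in subs)

def print_tree(node, children, indent=0):
    print("    " * indent + f"{node} ({count_descendants(node, children)})")
    for child in children.get(node, []):
        print_tree(child, children, indent + 1)

# Reproduces the example above:
print_tree("Main page", {
    "Main page": ["subpage1", "subpage2"],
    "subpage1": ["subpage1_1", "subpage1_2"],
    "subpage2": ["subpage2_1"],
})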
There are also three additional arguments:
--allow_redirects, if set, redirects are also followed; by default this is set to False
--max_depth, maximum depth of subpages; by default this is set to 2
--timeout, request timeout in seconds; a lower value can speed up the scraping; the default is 3 s
Notice: if a request reaches the timeout, a 'Timeout error.' message is displayed in the console.
Example usage with all arguments:
crawl.py --page 'url' --format 'csv/json' --output 'output' --max_depth 3 --allow_redirects --timeout 10
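A sketch of how such a command line could be parsed with argparse (the actual crawl.py may define its interface differently):

import argparse

parser = argparse.ArgumentParser(
    description="Recursively crawl a website and export per-page statistics.")
parser.add_argument("--page", required=True, help="URL of the main page to crawl")
parser.add_argument("--format", choices=["csv", "json"], default="csv",
                    help="output format")
parser.add_argument("--output", help="path of the output file")
parser.add_argument("--max_depth", type=int, default=2,
                    help="maximum depth of subpages")
parser.add_argument("--allow_redirects", action="store_true",
                    help="also follow redirects (off by default)")
parser.add_argument("--timeout", type=float, default=3,
                    help="request timeout in seconds")
args = parser.parse_args()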
Copyright (c) 2022, Tim Schopinski