This app is built as a CLI tool. Its main goal is to fetch and process data from a specified website and all of its subpages: pages linked from the main page, the subpages of those subpages, and so on.
The app uses asynchronous requests to fetch all subpages recursively and then stores the data in a specified CSV or JSON file. There is also an option to print the structure of the page as a tree. For each scraped page the data contains: link, title, number of internal links, number of external links, and number of times the URL was referenced by other pages. For simplicity, links that start with http are counted as external links.
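A minimal sketch of how such a recursive asynchronous crawl could look, assuming aiohttp and BeautifulSoup (the actual script may use different libraries and structure); the internal/external split follows the rule above, where only links starting with http count as external:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def crawl(session, url, depth, max_depth, results):
    # Stop at the maximum depth or if this page was already visited.
    if depth > max_depth or url in results:
        return
    try:
        async with session.get(url, allow_redirects=False) as resp:
            html = await resp.text()
    except asyncio.TimeoutError:
        print("Timeout error.")
        return
    soup = BeautifulSoup(html, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    # Per the rule above: links starting with "http" count as external.
    external = [h for h in hrefs if h.startswith("http")]
    internal = [h for h in hrefs if not h.startswith("http")]
    results[url] = {
        "title": soup.title.string if soup.title else "",
        "internal": len(internal),
        "external": len(external),
    }
    # Fetch internal subpages concurrently and recurse.
    await asyncio.gather(
        *(crawl(session, url.rstrip("/") + "/" + h.lstrip("/"),
                depth + 1, max_depth, results)
          for h in internal)
    )

async def main(start_url, max_depth=2, timeout=3):
    timeout_cfg = aiohttp.ClientTimeout(total=timeout)
    async with aiohttp.ClientSession(timeout=timeout_cfg) as session:
        results = {}
        await crawl(session, start_url, 0, max_depth, results)
        return results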
This script fetches subpages starting from the given 'url'. Results are saved in 'csv' or 'json' format in 'output',
where each row represents one page, with the following columns/keys (a sketch of the export step follows the list):
• link
• title
• number of internal links
• number of external links
• number of times the URL was referenced by other pages*
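As a rough illustration of that export step, assuming results is a mapping from link to the per-page counts (the field names here are illustrative, not necessarily those used by the real script):

import csv
import json

def save(results, output, fmt):
    # One row per scraped page, matching the columns listed above.
    rows = [
        {
            "link": link,
            "title": page["title"],
            "internal_links": page["internal"],
            "external_links": page["external"],
            "referenced_by": page.get("referenced_by", 0),
        }
        for link, page in results.items()
    ]
    if fmt == "json":
        with open(output, "w", encoding="utf-8") as f:
            json.dump(rows, f, indent=2)
    else:
        with open(output, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)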
This script prints the structure of the page as a tree in the following format:
Main page (5)
    subpage1 (2)
        subpage1_1 (0)
        subpage1_2 (0)
    subpage2 (1)
        subpage2_1 (0)
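A possible way to produce that output, assuming the crawler builds a mapping from each page to its direct subpages (a hypothetical structure; the number in parentheses is read here as the total count of pages below the node, which matches the example above):

def count_descendants(node, children):
    # Total number of pages anywhere below this node.
    subs = children.get(node, [])
    return len(subs) + sum(count_descendants(c, children) for c in subs)

def print_tree(node, children, indent=0):
    print("    " * indent + f"{node} ({count_descendants(node, children)})")
    for child in children.get(node, []):
        print_tree(child, children, indent + 1)

# Reproduces the example above:
print_tree("Main page", {
    "Main page": ["subpage1", "subpage2"],
    "subpage1": ["subpage1_1", "subpage1_2"],
    "subpage2": ["subpage2_1"],
})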
There are also three additional arguments:
--allow_redirects, if set, redirects are also followed; by default this is set to False
--max_depth, maximum depth of subpages; by default this is set to 2
--timeout, request timeout in seconds; a lower value can speed up the scraping; the default is 3 s
Notice: if a request reaches the timeout, a 'Timeout error.' message is displayed in the console.
Example usage with all arguments:
crawl.py --page 'url' --format 'csv/json' --output 'output' --max_depth 3 --allow_redirects --timeout 10
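A sketch of how such a command line could be parsed with argparse (the actual crawl.py may define its interface differently):

import argparse

parser = argparse.ArgumentParser(
    description="Recursively crawl a website and export per-page statistics.")
parser.add_argument("--page", required=True, help="URL of the main page to crawl")
parser.add_argument("--format", choices=["csv", "json"], default="csv",
                    help="output format")
parser.add_argument("--output", help="path of the output file")
parser.add_argument("--max_depth", type=int, default=2,
                    help="maximum depth of subpages")
parser.add_argument("--allow_redirects", action="store_true",
                    help="also follow redirects (off by default)")
parser.add_argument("--timeout", type=float, default=3,
                    help="request timeout in seconds")
args = parser.parse_args()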
Copyright (c) 2022, Tim Schopinski