MultiCrawl is a framework for running web measurements with different crawling setups across multiple machines, enabling near real-time website crawling with Firefox and Chrome. MultiCrawl also automates interactions with consent banners on websites and recognizes tracking requests. All measurement data is pushed to BigQuery for analysis.
Supported Browsers: Chrome, Firefox
Collectable Data Types:
- Cookies
- LocalStorage
- Requests
- Responses
- DNS Responses
- Callstacks
- JavaScript calls
Before diving into the installation process, ensure you have the prerequisites ready:
- PostgreSQL database
- Authentication JSON for Google Cloud API
- Sites to visit (e.g., Tranco list)
- A VM setup (e.g., Ubuntu 20.04)
- Initialize your PostgreSQL database using the `/resources/posgres.sql` script.
- Update the PostgreSQL connection string in the `/DBOps.py` file (a connection sketch follows this list).
- Save your Google Cloud API's authentication JSON as `google.json` in `/resources` (Guide).
- Import your site list into the `sites` table of PostgreSQL (see the import sketch below).
- Use `/Commander_extract_Subpages.py` to extract subpages from your imported list.
- Prepare your BigQuery dataset with the tables `requests`, `responses`, `cookies`, and `localstorage`. For column definitions, refer to `resources/bigquery.md` (a table-creation sketch follows this list).
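The repository does not document the exact format of the connection string, so the following is only a minimal sketch assuming a psycopg2-style DSN; the real variable name and layout in `/DBOps.py` may differ.

```python
# Hypothetical sketch of the connection setup expected in /DBOps.py.
# Host, database, user, and password below are placeholders.
import psycopg2

POSTGRES_DSN = "host=localhost port=5432 dbname=multicrawl user=crawler password=secret"

def get_connection():
    """Open a new connection to the measurement database."""
    return psycopg2.connect(POSTGRES_DSN)
```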
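For the site import, a minimal sketch with `psycopg2` could look like the following, assuming the Tranco CSV has `rank,domain` rows and the `sites` table exposes matching columns; check `/resources/posgres.sql` for the actual schema.

```python
# Hypothetical import of a Tranco CSV into the `sites` table.
# The column names `rank` and `site` are assumptions, not the verified schema.
import csv
import psycopg2

with psycopg2.connect("host=localhost dbname=multicrawl user=crawler password=secret") as conn:
    with conn.cursor() as cur, open("tranco.csv", newline="") as f:
        for rank, domain in csv.reader(f):
            # Whether the table stores bare domains or full URLs is also an assumption.
            cur.execute(
                "INSERT INTO sites (rank, site) VALUES (%s, %s)",
                (int(rank), "https://" + domain),
            )
```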
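The authoritative column definitions live in `resources/bigquery.md`; the snippet below only sketches how one of the four tables could be created with the official `google-cloud-bigquery` client and the `google.json` key. The dataset name and the columns shown are placeholders.

```python
# Hypothetical creation of one BigQuery table; repeat for requests, responses,
# cookies, and localstorage with the columns listed in resources/bigquery.md.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("resources/google.json")
dataset = client.create_dataset("multicrawl", exists_ok=True)  # dataset name is a placeholder

schema = [  # placeholder columns; take the real ones from resources/bigquery.md
    bigquery.SchemaField("visit_id", "INTEGER"),
    bigquery.SchemaField("url", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
]
client.create_table(bigquery.Table(dataset.table("requests"), schema=schema), exists_ok=True)
```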
- Set up an Ubuntu 20.04 VM.
- Install the required packages from `/req-pip.txt` and `/req-conda.txt`.
- Execute `install.sh` to install OpenWPM.
- Configure a VPN connection on your VM (if needed).
- Name your VMs according to the `getMode()` function in `/setup.py` (a sketch of the presumed mechanism follows this list).
- Adjust the crawling preferences in the `getConfig()` function in `/setup.py` (see the configuration sketch below).
- Execute `restart.sh` on every VM to initiate the measurement.
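The mode names that `getMode()` recognizes are defined in `/setup.py` itself and are not documented here; the sketch below only illustrates the presumed mechanism of deriving a crawl mode from the VM hostname, which is why the VM names have to match.

```python
# Hypothetical shape of getMode(); the real mode names and matching rules
# live in /setup.py, not in this sketch.
import socket

def getMode() -> str:
    """Derive the crawl mode from the VM hostname (e.g., 'crawler-chrome-1')."""
    hostname = socket.gethostname().lower()
    for mode in ("chrome", "firefox"):  # placeholder mode names
        if mode in hostname:
            return mode
    raise ValueError(f"Hostname {hostname!r} does not match any known mode")
```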
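Likewise, the concrete preference keys in `getConfig()` are repository-specific; as a hedged illustration, a configuration function for such a crawler might expose options along these lines (all key names are invented for this sketch).

```python
# Hypothetical shape of getConfig(); every key below is an illustrative
# placeholder, not one of the repository's actual options.
def getConfig(mode: str) -> dict:
    """Return crawling preferences for the given mode."""
    return {
        "browser": mode,          # 'chrome' or 'firefox'
        "headless": True,         # run browsers without a display
        "num_browsers": 4,        # parallel browser instances per VM
        "accept_consent": True,   # interact with consent banners
    }
```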
This repository incorporates files from OpenWPM and uses OpenWPM (v0.20) for its Firefox operations.