MultiCrawl is a framework for running web measurements with different crawling setups across multiple machines, enabling near real-time website crawling with Firefox and Chrome. MultiCrawl also automates interactions with consent banners on websites and recognizes tracking requests. All measurement data is pushed to BigQuery for analysis.
Supported Browsers: Chrome, Firefox
Collectable Data Types:
- Cookies
- LocalStorage
- Requests
- Responses
- DNS Responses
- Callstacks
- JavaScript calls
Before diving into the installation process, ensure you have the prerequisites ready:
- PostgreSQL database
- Authentication JSON for Google Cloud API
- Sites to visit (e.g., Tranco list)
- A VM setup (e.g., Ubuntu 20.04)
- Initialize your PostgreSQL database using the `/resources/posgres.sql` script.
- Update the PostgreSQL connection string in the `/DBOps.py` file (a connection sketch follows this list).
- Save your Google Cloud API's authentication JSON as `google.json` in `/resources` (Guide).
- Import your site list into the `sites` table of PostgreSQL (see the import sketch below).
- Use `/Commander_extract_Subpages.py` to extract subpages from your imported list.
- Prepare your BigQuery dataset with the tables `requests`, `responses`, `cookies`, and `localstorage`. For column definitions, refer to `resources/bigquery.md` (a table-creation sketch follows this list).
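The repository does not document the exact format of the connection string, so the following is only a minimal sketch assuming a psycopg2-style DSN; the real variable name and layout in `/DBOps.py` may differ.

```python
# Hypothetical sketch of the connection setup expected in /DBOps.py.
# Host, database, user, and password below are placeholders.
import psycopg2

POSTGRES_DSN = "host=localhost port=5432 dbname=multicrawl user=crawler password=secret"

def get_connection():
    """Open a new connection to the measurement database."""
    return psycopg2.connect(POSTGRES_DSN)
```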
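For the site import, a minimal sketch with `psycopg2` could look like the following, assuming the Tranco CSV has `rank,domain` rows and the `sites` table exposes matching columns; check `/resources/posgres.sql` for the actual schema.

```python
# Hypothetical import of a Tranco CSV into the `sites` table.
# The column names `rank` and `site` are assumptions, not the verified schema.
import csv
import psycopg2

with psycopg2.connect("host=localhost dbname=multicrawl user=crawler password=secret") as conn:
    with conn.cursor() as cur, open("tranco.csv", newline="") as f:
        for rank, domain in csv.reader(f):
            # Whether the table stores bare domains or full URLs is also an assumption.
            cur.execute(
                "INSERT INTO sites (rank, site) VALUES (%s, %s)",
                (int(rank), "https://" + domain),
            )
```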
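The authoritative column definitions live in `resources/bigquery.md`; the snippet below only sketches how one of the four tables could be created with the official `google-cloud-bigquery` client and the `google.json` key. The dataset name and the columns shown are placeholders.

```python
# Hypothetical creation of one BigQuery table; repeat for requests, responses,
# cookies, and localstorage with the columns listed in resources/bigquery.md.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("resources/google.json")
dataset = client.create_dataset("multicrawl", exists_ok=True)  # dataset name is a placeholder

schema = [  # placeholder columns; take the real ones from resources/bigquery.md
    bigquery.SchemaField("visit_id", "INTEGER"),
    bigquery.SchemaField("url", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
]
client.create_table(bigquery.Table(dataset.table("requests"), schema=schema), exists_ok=True)
```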
- Set up an Ubuntu 20.04 VM.
- Install the required packages from `/req-pip.txt` and `/req-conda.txt`.
- Execute `install.sh` to install OpenWPM.
- Configure a VPN connection on your VM (if needed).
- Name your VMs according to the `getMode()` function in `/setup.py` (a sketch of the presumed mechanism follows this list).
- Adjust the crawling preferences in the `getConfig()` function in `/setup.py` (see the configuration sketch below).
- Execute `restart.sh` on every VM to initiate the measurement.
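The mode names that `getMode()` recognizes are defined in `/setup.py` itself and are not documented here; the sketch below only illustrates the presumed mechanism of deriving a crawl mode from the VM hostname, which is why the VM names have to match.

```python
# Hypothetical shape of getMode(); the real mode names and matching rules
# live in /setup.py, not in this sketch.
import socket

def getMode() -> str:
    """Derive the crawl mode from the VM hostname (e.g., 'crawler-chrome-1')."""
    hostname = socket.gethostname().lower()
    for mode in ("chrome", "firefox"):  # placeholder mode names
        if mode in hostname:
            return mode
    raise ValueError(f"Hostname {hostname!r} does not match any known mode")
```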
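Likewise, the concrete preference keys in `getConfig()` are repository-specific; as a hedged illustration, a configuration function for such a crawler might expose options along these lines (all key names are invented for this sketch).

```python
# Hypothetical shape of getConfig(); every key below is an illustrative
# placeholder, not one of the repository's actual options.
def getConfig(mode: str) -> dict:
    """Return crawling preferences for the given mode."""
    return {
        "browser": mode,          # 'chrome' or 'firefox'
        "headless": True,         # run browsers without a display
        "num_browsers": 4,        # parallel browser instances per VM
        "accept_consent": True,   # interact with consent banners
    }
```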
This repository incorporates files from OpenWPM and uses OpenWPM (v0.20) for its Firefox operations.