mwscrape downloads rendered articles from MediaWiki sites via the web API and stores them in CouchDB for further offline processing.
mwscrape depends on the following:
- CouchDB (1.3.0 or newer)
- Python 3 (3.6 or newer)
Consult your operating system documentation and these projects’ websites for installation instructions.
For example, on Ubuntu 18.04, the following command installs required packages:
sudo apt-get install python3 python3-venv
To install CouchDB, first enable the Apache CouchDB package repository:
echo "deb https://apache.bintray.com/couchdb-deb bionic main" | sudo tee -a /etc/apt/sources.list
Then install the repository key:
curl -L https://couchdb.apache.org/repo/bintray-pubkey.asc | sudo apt-key add -
And finally update the repository cache and install the package:
sudo apt-get update && sudo apt-get install couchdb
Alternatively, run CouchDB with Docker:
docker run --detach --rm --name couchdb \
-v $(pwd)/.couchdb:/opt/couchdb/data \
-p 5984:5984 \
couchdb:2
See the CouchDB Docker image documentation for more details.
Note that starting with CouchDB 3.0, an admin user must be set up. See the CouchDB documentation.
With Docker:
docker run --detach --rm --name couchdb \
-e COUCHDB_USER=admin \
-e COUCHDB_PASSWORD=secret \
-v $(pwd)/.couchdb:/opt/couchdb/data \
-p 5984:5984 \
couchdb:3
By default, CouchDB uses snappy for file compression. Change the file_compression configuration parameter in the couchdb config section to deflate_6 (the maximum is deflate_9). This reduces database disk space usage significantly.
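The setting can also be changed over HTTP instead of editing the config file. The following is a minimal sketch, assuming CouchDB 2.3 or newer, where the configuration API is exposed under `/_node/_local/_config` (CouchDB 1.x uses `/_config` directly). The helper name `compression_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def compression_request(couch_url, value="deflate_6", node="_local"):
    """Build (but do not send) a PUT request that sets couchdb/file_compression."""
    # Assumption: CouchDB 2.3+ style config endpoint; adjust for older servers.
    url = f"{couch_url.rstrip('/')}/_node/{node}/_config/couchdb/file_compression"
    # The config API expects the new value as a JSON string in the request body.
    body = json.dumps(value).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = compression_request("http://localhost:5984")
print(req.get_method(), req.full_url)
```

Pass the request to `urllib.request.urlopen` once the server is running. On a server with an admin user, credentials have to be supplied as well.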
Create a new Python virtual environment:
python3 -m venv env-mwscrape
Activate it:
source env-mwscrape/bin/activate
Install mwscrape from source:
pip install https://github.com/itkach/mwscrape/tarball/master
usage: mwscrape [-h] [--site-path SITE_PATH] [--site-ext SITE_EXT] [-c COUCH]
[--db DB] [--titles TITLES [TITLES ...]] [--start START]
[--changes-since CHANGES_SINCE] [--recent-days RECENT_DAYS]
[--recent] [--timeout TIMEOUT] [-S] [-r [SESSION ID]]
[--sessions-db-name SESSIONS_DB_NAME] [--desc]
[--delete-not-found] [--speed {0,1,2,3,4,5}]
[site]
positional arguments:
site MediaWiki site to scrape (host name), e.g.
en.wikipedia.org
optional arguments:
-h, --help show this help message and exit
--site-path SITE_PATH
MediaWiki site API path. Default: /w/
--site-ext SITE_EXT MediaWiki site API script extension. Default: .php
-c COUCH, --couch COUCH
CouchDB server URL. Default: http://localhost:5984
--db DB CouchDB database name. If not specified, the name will
be derived from Mediawiki host name.
--titles TITLES [TITLES ...]
Download article pages with these names (titles). If a
name starts with @ it is interpreted as the name of a file
containing titles, one per line, UTF-8 encoded.
--start START Download all article pages beginning with this name
--changes-since CHANGES_SINCE
Download all article pages that changed since the specified
time. Timestamp format is yyyymmddhhmmss. See
https://www.mediawiki.org/wiki/Timestamp. Hours,
minutes and seconds can be omitted.
--recent-days RECENT_DAYS
Number of days to look back for recent changes
--recent Download recently changed articles only
--timeout TIMEOUT Network communications timeout. Default: 30.0s
-S, --siteinfo-only Fetch or update siteinfo, then exit
-r [SESSION ID], --resume [SESSION ID]
Resume previous scrape session. This relies on stats
saved in mwscrape database.
--sessions-db-name SESSIONS_DB_NAME
Name of database where session info is stored.
Default: mwscrape
--desc Request all pages in descending order
--delete-not-found Remove non-existing pages from the database
--speed {0,1,2,3,4,5}
Scrape speed
--delay
Pause before requesting rendered article for
this many seconds.
Some sites limit request rate so that even
single-threaded, request-at-a-time scrapes
are too fast and additional delay needs
to be introduced
--namespace ID of MediaWiki namespace to scrape.
--user-agent HTTP user agent string.
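Two of the options above take values in a specific format: `--changes-since` expects a yyyymmddhhmmss timestamp, and `--titles` accepts `@file` where the file lists one UTF-8 encoded title per line. The sketch below shows one way to produce both from Python. The helper names are hypothetical, and it assumes the timestamp should be expressed in UTC, as MediaWiki timestamps are.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def changes_since(days_back, now=None):
    """Return a yyyymmddhhmmss timestamp for `days_back` days before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(days=days_back)).strftime("%Y%m%d%H%M%S")


def write_titles_file(path, titles):
    """Write titles one per line, UTF-8 encoded, and return the @-prefixed argument."""
    Path(path).write_text("\n".join(titles) + "\n", encoding="utf-8")
    return f"@{path}"


# A fixed reference time keeps the example reproducible.
fixed_now = datetime(2020, 3, 15, 12, 0, 0, tzinfo=timezone.utc)
command = ["mwscrape", "en.wiktionary.org", "--changes-since", changes_since(7, fixed_now)]
print(" ".join(command))
# prints: mwscrape en.wiktionary.org --changes-since 20200308120000
```

The value returned by `write_titles_file` can be passed directly after `--titles` in the same way.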
The following examples are for use with CouchDB < 3.0 running in admin party mode.
To get English Wiktionary:
mwscrape en.wiktionary.org
To get the same, but work through the list of titles in reverse order:
mwscrape en.wiktionary.org --desc
Some sites expose the MediaWiki API at a path different from Wikipedia’s default. Specify it with --site-path:
mwscrape lurkmore.to --site-path=/
For CouchDB with admin user admin and password secret, specify the credentials as part of the CouchDB URL:
mwscrape -c http://admin:secret@localhost:5984 en.wiktionary.org
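Passwords containing characters such as `@`, `:` or `/` would break a URL assembled by hand. The sketch below percent-encodes the credentials before embedding them. The helper name is hypothetical, and it assumes the CouchDB client used by mwscrape decodes percent-encoded credentials, which is worth verifying with a simple password first.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def couch_url_with_credentials(base_url, user, password):
    """Return base_url with percent-encoded user:password placed before the host."""
    parts = urlsplit(base_url)
    # safe="" makes quote() encode every reserved character, including "/" and "@".
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(couch_url_with_credentials("http://localhost:5984", "admin", "secret"))
# prints: http://admin:secret@localhost:5984
```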
mwscrape compares page revisions reported by the MediaWiki API with the revisions of previously scraped pages in CouchDB and requests parsed page data only if a new revision is available.
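The comparison can be pictured as follows. This is an illustrative sketch only, not mwscrape's actual code: the function name and the plain title-to-revision dictionaries are assumptions made for the example.

```python
def pages_to_fetch(remote_revisions, stored_revisions):
    """Return titles whose remote revision is missing locally or newer than the stored one."""
    return [
        title
        for title, remote_rev in remote_revisions.items()
        # Pages never scraped before default to -1, so any remote revision is newer.
        if stored_revisions.get(title, -1) < remote_rev
    ]


remote = {"A": 105, "B": 200, "C": 7}   # revisions reported by the MediaWiki API
stored = {"A": 105, "B": 180}           # revisions already saved in CouchDB
print(pages_to_fetch(remote, stored))
# prints: ['B', 'C']
```

Page A is skipped because its stored revision is current, which is what keeps repeated scrapes of the same site cheap.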
mwscrape also creates a CouchDB design document w with show function html to allow viewing the article HTML returned by the MediaWiki API and navigating to the HTML of other collected articles.
For example, to view the rendered HTML for article A in database simple-wikipedia-org, open the following address in a web browser (assuming CouchDB is running on localhost):
http://127.0.0.1:5984/simple-wikipedia-org/_design/w/_show/html/A
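Titles with spaces or other reserved characters need to be percent-encoded in that address. A minimal sketch for building it, assuming the article title is used as the document ID (as in the example above) and using a hypothetical helper name:

```python
from urllib.parse import quote


def show_url(couch_url, db, title):
    """Build the URL of the w/html show function for one article."""
    # safe="" also encodes "/" so that titles containing slashes stay one path segment.
    return f"{couch_url.rstrip('/')}/{db}/_design/w/_show/html/{quote(title, safe='')}"


print(show_url("http://127.0.0.1:5984", "simple-wikipedia-org", "A"))
# prints: http://127.0.0.1:5984/simple-wikipedia-org/_design/w/_show/html/A
```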
If databases are combined via replication, articles with the same title will be stored as conflicts. The mwresolvec script is provided to merge conflicting versions (combine aliases, select the highest MediaWiki article revision, discard other revisions). Usage:
mwresolvec [-h] [-s START] [-b BATCH_SIZE] [-w WORKERS] [-v] couch_url
positional arguments:
couch_url
optional arguments:
-h, --help show this help message and exit
-s START, --start START
-b BATCH_SIZE, --batch-size BATCH_SIZE
-w WORKERS, --workers WORKERS
-v, --verbose
Example:
mwresolvec http://localhost:5984/en-m-wikipedia-org
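The merge rule described above (combine aliases, keep the highest revision, discard the rest) can be sketched as below. This illustrates the rule only and is not the mwresolvec implementation. The document fields `revid` and `aliases` are hypothetical names chosen for the example.

```python
def merge_conflicts(versions):
    """Merge conflicting versions of one article into a single document."""
    # Keep the version with the highest MediaWiki revision number.
    winner = max(versions, key=lambda doc: doc["revid"])
    # Combine aliases from every version, dropping duplicates.
    aliases = sorted({alias for doc in versions for alias in doc.get("aliases", [])})
    return {**winner, "aliases": aliases}


versions = [
    {"_id": "Colour", "revid": 41, "aliases": ["Color"], "html": "old"},
    {"_id": "Colour", "revid": 57, "aliases": ["Colours"], "html": "new"},
]
print(merge_conflicts(versions))
```

The result keeps the content of revision 57 and carries both aliases, so redirects collected by either database survive the merge.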