extract-articles

Code and scripts to extract data and populate the (example) database.

High-level overview

flowchart TD
    A(sites.yml) -- collect-urls.py --> C(urls.csv);
    B(your own script) --> C;
    C -- process-urls.py --> D(SQLite database);

Generating list of URLs

Description of the YAML file here ...

$ python collect-urls.py --help

Usage: collect-urls.py [OPTIONS]

Options:
  --sites TEXT              The YML file describing the sites. This file is
                            only read.  [required]
  --out-file TEXT           The generated output CSV file.  [required]
  --pages-per-site INTEGER  Sample only so many pages per site for testing
                            purposes [default: collect all pages].
  --help                    Show this message and exit.

Example:

$ python collect-urls.py --sites sites.yml --out-file urls.csv

Structure of the generated CSV file (example):

name,url,language
...,...,...
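
The overview shows that urls.csv can also come from your own script instead of collect-urls.py. A minimal sketch of such a script follows, assuming only the three-column header shown above. The function name `write_urls_csv` and the example rows (site name, URLs, language codes) are hypothetical placeholders, not values from this repository.

```python
import csv

# Hypothetical example rows; the site name, URLs, and language codes are placeholders.
ROWS = [
    {"name": "example-site", "url": "https://example.com/articles/1", "language": "en"},
    {"name": "example-site", "url": "https://example.com/articles/2", "language": "en"},
]


def write_urls_csv(path, rows):
    """Write rows to a CSV file with the header name,url,language."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "url", "language"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_urls_csv("urls.csv", ROWS)
```

Using `csv.DictWriter` rather than manual string joining means URLs that contain commas are quoted correctly.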

Processing list of URLs

This step reads a CSV file containing site names, URLs, and languages, then uses it to fetch each page's HTML, post-process it, and save the result to the database. Currently, downloading and post-processing happen in a single pass. Later, this could be split into two steps: first fetch all the HTML, then post-process it.

$ python process-urls.py --help

Usage: process-urls.py [OPTIONS]

Options:
  --csv-file TEXT  The input CSV file containing list of URLs to process.
                   [required]
  --db-file TEXT   The SQLite database file.  [required]
  --help           Show this message and exit.

Example:

$ python process-urls.py --csv-file urls.csv --db-file articles.db
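
The fetch, post-process, and save flow described in this section can be sketched roughly as follows. This is not the code in process-urls.py, juicer.py, or db.py: the `articles` table and its columns, the helper names `fetch_html`, `extract_text`, and `process_urls`, and the very simple text extraction are all assumptions made for illustration, using only the Python standard library.

```python
import csv
import sqlite3
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Post-processing stand-in (assumption): reduce HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_html(url):
    """Download one page and decode it with the charset the server declares."""
    with urllib.request.urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def process_urls(csv_file, db_file, fetch=fetch_html):
    """Read name,url,language rows, fetch each URL, and store the extracted text."""
    conn = sqlite3.connect(db_file)
    try:
        # Hypothetical schema; the real database layout may differ.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS articles ("
            "url TEXT PRIMARY KEY, site TEXT, language TEXT, text TEXT)"
        )
        with open(csv_file, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                html = fetch(row["url"])
                conn.execute(
                    "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
                    (row["url"], row["name"], row["language"], extract_text(html)),
                )
        conn.commit()
    finally:
        conn.close()
```

Passing the fetch function as a parameter keeps downloading separate from post-processing, which is one way the two-step split mentioned above could later be introduced.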
