4CAT Extension - easy(ier) adding of new datasources/processors that can be maintained separately from 4CAT base code (#451)

* domain only

* fix reference

* try and collect links with selenium

* update column_filter to find multiple matches

* fix up the normal url_scraper datasource

* ensure all selenium links are strings for join

* change output of url_scraper to ndjson with map_items

* missed key/index change

* update web archive to use json and map to 4CAT

* fix no text found

* and none on scraped_links

* check key first

* fix up web_archive error reporting

* handle None type for error

* record web archive "bad request"

* add wait after redirect movement

* increase waittime for redirects

* add processor for trackers

* dict to list for addition

* allow both newline- and comma-separated links

* attempt to scrape iframes as separate pages

* Fixes for selenium scraper to work with config database

* installation of packages, geckodriver, and firefox if selenium enabled

* update install instructions

* fix merge error

* fix dropped function

* have to be kidding me

* add note; setup requires docker... need to think about IF this will ever
be installed without Docker

* separate selenium class into wrapper and Search class so the wrapper can be used in processors!

* add screenshots; add firefox extension support

* update selenium definitions

* regex for extracting urls from strings

* screenshots processor; extract urls from text and takes screenshots

* Allow producing zip files from data sources

* import time

* pick better default

* test screenshot datasource

* validate all params

* fix enable extension

* haha break out of while loop

* count my items

* whoops, len() is important here

* must be getting tired...

* remove redundant logging

* Eager loading for screenshots, viewport options, etc

* Whoops, wrong folder

* Fix label shortening

* Just 'queue' instead of 'search queue'

* Yeah, make it headless

* README -> DESCRIPTION

* h1 -> h2

* Actually just have no header

* Use proper filename for downloaded files

* Configure whether to offer pseudonymisation etc

* Tweak descriptions

* fix log missing data

* add columns to post_topic_matrix

* fix breadcrumb bug

* Add top topics column

* Fix selenium config install parameter (Docker uses this/manual would
need to run install_selenium, well, manually)

* this processor is slow; I thought it was broken long before it updated!

* refactor detect_trackers as conversion processor not filter

* add geckodriver executable to docker install

* Auto-configure webdrivers if available in PATH

* update screenshots to act as image-downloader and benefit from processors

* fix is_compatible_with

* Delete helper-scripts/migrate/migrate-1.30-1.31.py

* fix embeddings is_compatible_with

* fix up UI options for hashing and private

* abstract was moved to lib

* various fixes to selenium based datasources

* processors not compatible with image datasets

* update firefox extension handling

* screenshots datasource fix get_options

* rename screenshots processor to be detected as image dataset

* add monthly and weekly frequencies to wayback machine datasource

* wayback ds: fix failure when no attempts yield results; add frequency options; add daily

* add scroll down page to allow lazy loading for entire page screenshots

* screenshots: adjust pause time so it can be used to force a wait for images to load

I have not found or come up with a way to wait for all images to load; document.readyState == 'complete' does not work this way on certain sites, including the Wayback Machine.
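
A minimal sketch of the scroll-and-pause approach described above (illustrative only; the helper name, pause values and driver handling are assumptions, not the actual 4CAT code):

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_steps=20):
    """Scroll down step by step, pausing so lazily loaded images get a chance to render."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_steps):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # readyState == 'complete' is not reliable for lazy loading, so just wait a fixed time
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page height stopped growing; assume everything has loaded
        last_height = new_height
```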

* hash URLs to create filenames

* remove log

* add setting to toggle display advanced options

* add progress bars

* web archive fix query validation

* count subpages in progress

* remove overwritten function

* move http response to own column

* special filenames

* add timestamps to all screenshots

* restart selenium on failure

* new build has selenium

* process urls after start (keep original query parameters)

* undo default firefox

* quick max

* rename SeleniumScraper to SeleniumSearch

todo: build SeleniumProcessor!

* max number screenshots configurable

* method to get url with error handling

* use get_with_error_handling

* d'oh, screenshot processor needs to quit selenium

* update log to contain URL

* Update scrolling to use Page down key if necessary

* improve logs

* update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way.

Also, could I get categories from the metadata? That's... ugh.

* no category, no processor

* str errors

* screenshots: dismiss alerts when checking ready state is complete

* set screenshot timeout to 30 seconds

* update gensim package

* screenshots: move processor interrupt into attempts loop

* if alert disappears before we can dismiss it...

* selenium specific logger

* do not switch window when no alert found on dismiss

* extract wait for page to load to selenium class

* improve descriptions of screenshot options

* remove unused line

* treat timeouts differently from other errors

these are more likely due to an issue with the website in question

* debug if requested

* increase pause time

* restart browser w/ PID

* increase max_workers for selenium

this is set per individual worker class, not for all selenium classes... so you can really crank them up if desired

* quick fix restart by pid

* avoid bad urls

* missing bracket & attempt to fix-missing dependencies in Docker install

* Allow dynamic form options in processors

* Allow 'requires' on data source options as well

* Handle list values with requires

* basic processor for apple store; setup checks for additional requirements

* fix is_4cat_class

* show preview when no map_item

* add google store datasource

* Docker setup.py use extensions

* Wider support for file upload in processors

* Log file uploads in DMI service manager

* add map_item methods and record more data per item

need additional item data since map_item is a staticmethod
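
For illustration (a schematic sketch, field names invented, not the actual store datasources): because `map_item()` is a static method it only sees the stored item, so anything the mapping needs has to be written into each item when it is collected.

```python
# inside an app-store Search class (schematic example)
@staticmethod
def map_item(item):
    # map_item() has no access to the processor instance or its query parameters,
    # so values like the queried country must already be stored with the item
    return {
        "id": item.get("id", ""),
        "title": item.get("title", ""),
        "country": item.get("query_country", ""),  # recorded at collection time
    }
```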

* update from master; merge conflicts

* fix docker build context (ignore data files)

* fix option requirements

* apple store fix: list still tries to get query

* apple & google stores fix up item mapping

* missed merge error

* minor fix

* remove unused import

* fix datasources w/ files frontend error

* fix error w/ datasources having file option

* better way to name docker volumes

* update two other docker compose files

* fix docker-compose ymls

* minor bug: fix and add warning; fix no results fail

* update apple field names to better match interface

* update google store fieldnames and order

* sneak in jinja logger if needed

* fix fourcat.js handling checkboxes for dynamic settings

* add new endpoint for app details to apple store

* apple_store map new beta app data

* add default lang/country

* not all apps have advisories

* revert so button works

* add chart positions to beta map items

* basic scheduler

To-do
- fix up and add options to scheduler view (e.g. delete/change)
- add scheduler view to navigator
- tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view)
- more testing...

* update scheduler view, add functions to update job interval

* revert .env

* working scheduler!

* basic scheduler view w/ datasets

* fix postgres tag

* update job status in scheduled_jobs table

* fix timestamp; end_date needed for last run check; add dataset label

* improve scheduler view

* remove dataset from scheduled_jobs table on delete

* scheduler view order by last creation

* scheduler views: separate scheduler list from scheduled dataset list

* additional update from master fixes

* apple_store map_items fix missing locales

* add back depth for pagination

* correct route

* modify pagination to accept args

* pagination fun

* pagination: I hate testing on live servers...

* ok ok need the pagination route

* pagination: add route_args

* fix up scheduler header

* improve app store descriptions

* add azure store

* fix azure links

* azure_store: add category search

* azure fix type of config update timestamp

OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly

* basic aws store

* check if selenium available; get correct app_id

* aws: implement pagination

* add logging; wait for elements to load after next page; attempt to rework filter option collection

* apple_store: handle invalid param error

* fix filter_options

* aws: fix filter option collection!

* more merge

* move new datasources and processors to extensions and modify setup.py and module loader to use the new locations

* migrate.py to run extension "fourcat_install.py" files

* formatting

* remove extensions; add gitignore

* excise scheduler merge

* some additional cleanup from app_studies branch

* allow nested datasources folders; ignore files in extensions main folder

* allow extension install scripts to run pip if migrate.py has not

* Remove unused URL functions we could use ural for

* Take care of git commit hash tracking for extension processors

* Get rid of unused path.versionfile config setting

* Add extensions README

* Squashed commit of the following:

commit cd356f7
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:36:18 2024 +0200

    UI setting for 4CAT install ad in login

commit 0945d8c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:32:55 2024 +0200

    UI setting for anonymisation controls

    Todo: make per-datasource

commit 1a2562c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:27 2024 +0200

    Debug panel for HTTP headers in control panel

commit 203314e
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:17 2024 +0200

    Preview for HTML datasets

commit 48c20c2
Author: Desktop Sal <[email protected]>
Date:   Wed Sep 11 13:54:23 2024 +0200

    Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

commit 657ffd7
Author: Dale Wahl <[email protected]>
Date:   Fri Sep 6 16:29:19 2024 +0200

    fix nltk where it matters

commit 2ef5c80
Author: Stijn Peeters <[email protected]>
Date:   Tue Sep 3 12:05:14 2024 +0200

    Actually check progress in text annotator

commit 693960f
Author: Stijn Peeters <[email protected]>
Date:   Mon Sep 2 18:03:18 2024 +0200

    Add processor for stormtrooper DMI service

commit 6ae964a
Author: Stijn Peeters <[email protected]>
Date:   Fri Aug 30 17:31:37 2024 +0200

    Fix reference to old stopwords list in neologisms preset

* Fix Github links for extensions

* Fix commit detection in extensions

* Fix extension detection in module loader

* Follow symlinks when loading extensions

Probably not uncommon to have a checked-out repo somewhere and then symlink it into the extensions dir

* Make queue message on create page more generic

* Markdown in datasource option tooltips

* Remove Spacy model from requirements

* Add software_source to database SQL

---------

Co-authored-by: Stijn Peeters <[email protected]>
Co-authored-by: Stijn Peeters <[email protected]>
3 people authored Sep 16, 2024
1 parent cd356f7 commit a4bddae
Showing 42 changed files with 437 additions and 179 deletions.
1 change: 1 addition & 0 deletions .dockerignore
@@ -2,3 +2,4 @@ data/
.github/
.ipynb_checkpoints/
.gitignore
.idea/
3 changes: 1 addition & 2 deletions .env
@@ -30,7 +30,7 @@ TELEGRAM_PORT=443
# Docker Volume Names
DOCKER_DB_VOL=4cat_4cat_db
DOCKER_DATA_VOL=4cat_4cat_data
DOCKER_CONFIG_VOL=4cat_4cat_share
DOCKER_CONFIG_VOL=4cat_4cat_config
DOCKER_LOGS_VOL=4cat_4cat_logs

# Gunicorn settings
@@ -39,4 +39,3 @@ workers=4
threads=4
worker_class=gthread
log_level=debug

2 changes: 1 addition & 1 deletion .zenodo.json
@@ -3,7 +3,7 @@
"license": "MPL-2.0",
"title": "4CAT Capture and Analysis Toolkit",
"upload_type": "software",
"version": "v1.45",
"version": "v1.46",
"keywords": [
"webmining",
"scraping",
2 changes: 1 addition & 1 deletion VERSION
@@ -1,4 +1,4 @@
1.45
1.46

This file should not be modified. It is used by 4CAT to determine whether it
needs to run migration scripts to e.g. update the database structure to a more
1 change: 1 addition & 0 deletions backend/database.sql
@@ -56,6 +56,7 @@ CREATE TABLE IF NOT EXISTS datasets (
is_private boolean DEFAULT TRUE,
software_version text,
software_file text DEFAULT '',
software_source text DEFAULT '',
annotation_fields text DEFAULT ''
);

8 changes: 5 additions & 3 deletions backend/lib/processor.py
@@ -164,7 +164,7 @@ def work(self):

# start log file
self.dataset.update_status("Processing data")
self.dataset.update_version(get_software_commit())
self.dataset.update_version(get_software_commit(self))

# get parameters
# if possible, fill defaults where parameters are not provided
@@ -628,7 +628,7 @@ def write_csv_items_and_finish(self, data):
self.dataset.update_status("Finished")
self.dataset.finish(len(data))

def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZIP_STORED):
def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZIP_STORED, finish=True):
"""
Archive a bunch of files into a zip archive and finish processing
@@ -639,6 +639,7 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI
files added to the archive will be used.
:param int compression: Type of compression to use. By default, files
are not compressed, to speed up unarchiving.
:param bool finish: Finish the dataset/job afterwards or not?
"""
is_folder = False
if issubclass(type(files), PurePath):
@@ -665,7 +666,8 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI
if num_items is None:
num_items = done

self.dataset.finish(num_items)
if finish:
self.dataset.finish(num_items)

def create_standalone(self):
"""
31 changes: 24 additions & 7 deletions backend/lib/search.py
@@ -1,16 +1,16 @@
import hashlib
import zipfile
import secrets
import shutil
import random
import json
import math
import csv
import os

from pathlib import Path
from abc import ABC, abstractmethod

from common.config_manager import config
from common.lib.dataset import DataSet
from backend.lib.processor import BasicProcessor
from common.lib.helpers import strip_tags, dict_search_and_update, remove_nuls, HashCache
from common.lib.exceptions import WorkerInterruptedException, ProcessorInterruptedException, MapItemException
@@ -71,18 +71,19 @@ def process(self):
items = self.import_from_file(query_parameters.get("file"))
else:
items = self.search(query_parameters)

except WorkerInterruptedException:
raise ProcessorInterruptedException("Interrupted while collecting data, trying again later.")

# Write items to file and update the DataBase status to finished
num_items = 0
if items:
self.dataset.update_status("Writing collected data to dataset file")
if results_file.suffix == ".ndjson":
num_items = self.items_to_ndjson(items, results_file)
elif results_file.suffix == ".csv":
if self.extension == "csv":
num_items = self.items_to_csv(items, results_file)
elif self.extension == "ndjson":
num_items = self.items_to_ndjson(items, results_file)
elif self.extension == "zip":
num_items = self.items_to_archive(items, results_file)
else:
raise NotImplementedError("Datasource query cannot be saved as %s file" % results_file.suffix)

@@ -361,6 +362,22 @@ def items_to_ndjson(self, items, filepath):

return processed

def items_to_archive(self, items, filepath):
"""
Save retrieved items as an archive
Assumes that items is an iterable with one item, a Path object
referring to a folder containing files to be archived. The folder will
be removed afterwards.
:param items:
:param filepath: Where to store the archive
:return int: Number of items
"""
num_items = len(os.listdir(items))
self.write_archive_and_finish(items, None, zipfile.ZIP_STORED, False)
return num_items


class SearchWithScope(Search, ABC):
"""
@@ -404,7 +421,7 @@ def search(self, query):
# proportion of items matches
# first, get amount of items for all threads in which matching
# items occur and that are long enough
thread_ids = tuple([post["thread_id"] for post in items])
thread_ids = tuple([item["thread_id"] for item in items])
self.dataset.update_status("Retrieving thread metadata for %i threads" % len(thread_ids))
try:
min_length = int(query.get("scope_length", 30))
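
Together with the `self.extension == "zip"` branch added to `process()` above, the new `items_to_archive()` lets a data source hand back a folder of files instead of rows. A hedged sketch of what such a data source could look like (class name, type and details are hypothetical, not code from this commit):

```python
class SearchScreenshots(Search):  # hypothetical zip-producing datasource
    type = "screenshots-search"
    extension = "zip"  # process() will route the result through items_to_archive()

    def search(self, query):
        staging_area = self.dataset.get_staging_area()  # temporary folder for result files
        # ... write one file per captured page into staging_area ...
        return staging_area  # a Path to a folder; it is zipped and then removed
```
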
11 changes: 11 additions & 0 deletions backend/lib/worker.py
@@ -133,6 +133,17 @@ def run(self):
location = "->".join(frames)
self.log.error("Worker %s raised exception %s and will abort: %s at %s" % (self.type, e.__class__.__name__, str(e), location))

# Clean up after work successfully completed or terminates
self.clean_up()

def clean_up(self):
"""
Clean up after a processor runs successfully or results in error.
Workers should override this method to implement any procedures
to run to clean up a worker; by default this does nothing.
"""
pass

def abort(self):
"""
Called when the application shuts down
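
The new `clean_up()` hook above is meant to be overridden. As an illustration only (not code from this commit), a Selenium-backed worker might use it to make sure the browser is shut down whether a job finishes or fails:

```python
class SeleniumBackedWorker(BasicWorker):  # illustrative subclass, not part of this commit
    def clean_up(self):
        # run() calls clean_up() after both successful and failed jobs,
        # so the browser gets closed either way
        if getattr(self, "driver", None):
            try:
                self.driver.quit()  # end the browser and its webdriver process
            except Exception:
                pass  # browser may already be gone; nothing left to clean up
```
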
4 changes: 2 additions & 2 deletions common/config_manager.py
@@ -44,9 +44,9 @@ def with_db(self, db=None):
# Replace w/ db if provided else only initialise if not already
self.db = db if db else Database(logger=None, dbname=self.get("DB_NAME"), user=self.get("DB_USER"),
password=self.get("DB_PASSWORD"), host=self.get("DB_HOST"),
port=self.get("DB_PORT"), appname="config-reader") if not db else db
port=self.get("DB_PORT"), appname="config-reader")
else:
# self.db already initialized
# self.db already initialized and no db provided
pass

def load_user_settings(self):
28 changes: 15 additions & 13 deletions common/lib/config_definition.py
@@ -165,20 +165,10 @@
"help": "Can view worker status",
"tooltip": "Controls whether users can view worker status via the Control Panel"
},
# The following two options should be set to ensure that every analysis step can
# The following option should be set to ensure that every analysis step can
# be traced to a specific version of 4CAT. This allows for reproducible
# research. You can however leave them empty with no ill effect. The version ID
# should be a commit hash, which will be combined with the Github URL to offer
# links to the exact version of 4CAT code that produced an analysis result.
# If no version file is available, the output of "git show" in PATH_ROOT will be used
# to determine the version, if possible.
"path.versionfile": {
"type": UserInput.OPTION_TEXT,
"default": ".git-checked-out",
"help": "Version file",
"tooltip": "Path to file containing GitHub commit hash. File containing a commit ID (everything after the first whitespace found is ignored)",
"global": True
},
# research. The output of "git show" in PATH_ROOT will be used to determine
# the version of a processor file, if possible.
"4cat.github_url": {
"type": UserInput.OPTION_TEXT,
"default": "https://github.com/digitalmethodsinitiative/4cat",
@@ -516,6 +506,18 @@
"tooltip": "If a dataset is a JSON file but it can be mapped to a CSV file, show the CSV in the preview instead"
"of the underlying JSON."
},
"ui.offer_hashing": {
"type": UserInput.OPTION_TOGGLE,
"default": True,
"help": "Offer pseudonymisation",
"tooltip": "Add a checkbox to the 'create dataset' forum to allow users to toggle pseudonymisation."
},
"ui.offer_private": {
"type": UserInput.OPTION_TOGGLE,
"default": True,
"help": "Offer create as private",
"tooltip": "Add a checkbox to the 'create dataset' forum to allow users to make a dataset private."
},
"ui.option_email": {
"type": UserInput.OPTION_CHOICE,
"options": {
23 changes: 17 additions & 6 deletions common/lib/dataset.py
@@ -114,6 +114,9 @@ def __init__(self, parameters=None, key=None, job=None, data=None, db=None, pare
self.parameters = json.loads(self.data["parameters"])
self.is_new = False
else:
self.data = {"type": type} # get_own_processor needs this
own_processor = self.get_own_processor()
version = get_software_commit(own_processor)
self.data = {
"key": self.key,
"query": self.get_label(parameters, default=type),
@@ -125,7 +128,8 @@
"timestamp": int(time.time()),
"is_finished": False,
"is_private": is_private,
"software_version": get_software_commit(),
"software_version": version[0],
"software_source": version[1],
"software_file": "",
"num_rows": 0,
"progress": 0.0,
@@ -139,7 +143,6 @@

# Find desired extension from processor if not explicitly set
if extension is None:
own_processor = self.get_own_processor()
if own_processor:
extension = own_processor.get_extension(parent_dataset=DataSet(key=parent, db=db) if parent else None)
# Still no extension, default to 'csv'
@@ -865,10 +868,12 @@ def get_label(self, parameters=None, default="Query"):
elif parameters.get("subject_match") and parameters["subject_match"] != "empty":
return parameters["subject_match"]
elif parameters.get("query"):
label = parameters["query"] if len(parameters["query"]) < 30 else parameters["query"][:25] + "..."
label = parameters["query"]
# Some legacy datasets have lists as query data
if isinstance(label, list):
label = ", ".join(label)

label = label if len(label) < 30 else label[:25] + "..."
label = label.strip().replace("\n", ", ")
return label
elif parameters.get("country_flag") and parameters["country_flag"] != "all":
@@ -1116,7 +1121,8 @@ def update_version(self, version):
processor_path = ""

updated = self.db.update("datasets", where={"key": self.data["key"]}, data={
"software_version": version,
"software_version": version[0],
"software_source": version[1],
"software_file": processor_path
})

@@ -1151,10 +1157,15 @@ def get_version_url(self, file):
:param file: File to link within the repository
:return: URL, or an empty string
"""
if not self.data["software_version"] or not config.get("4cat.github_url"):
if not self.data["software_source"]:
return ""

return config.get("4cat.github_url") + "/blob/" + self.data["software_version"] + self.data.get("software_file", "")
filepath = self.data.get("software_file", "")
if filepath.startswith("/extensions/"):
# go to root of extension
filepath = "/" + "/".join(filepath.split("/")[3:])

return self.data["software_source"] + "/blob/" + self.data["software_version"] + filepath

def top_parent(self):
"""