4CAT Extension - easy(ier) adding of new datasources/processors that can be maintained separately from 4CAT base code (#451)

* domain only

* fix reference

* try and collect links with selenium

* update column_filter to find multiple matches

* fix up the normal url_scraper datasource

* ensure all selenium links are strings for join

* change output of url_scraper to ndjson with map_items

* missed key/index change

* update web archive to use json and map to 4CAT

* fix no text found

* and none on scraped_links

* check key first

* fix up web_archive error reporting

* handle None type for error

* record web archive "bad request"

* add wait after redirect movement

* increase waittime for redirects

* add processor for trackers

* dict to list for addition

* allow both newline- and comma-separated links

* attempt to scrape iframes as separate pages

* Fixes for selenium scraper to work with config database

* installation of packages, geckodriver, and firefox if selenium enabled

* update install instructions

* fix merge error

* fix dropped function

* have to be kidding me

* add note; setup requires docker... need to think about IF this will ever
be installed without Docker

* separate selenium class into wrapper and Search class so the wrapper can be used in processors!

* add screenshots; add firefox extension support

* update selenium definitions

* regex for extracting urls from strings

* screenshots processor; extract urls from text and takes screenshots

* Allow producing zip files from data sources

* import time

* pick better default

* test screenshot datasource

* validate all params

* fix enable extension

* haha break out of while loop

* count my items

* whoops, len() is important here

* must be getting tired...

* remove redundant logging

* Eager loading for screenshots, viewport options, etc

* Whoops, wrong folder

* Fix label shortening

* Just 'queue' instead of 'search queue'

* Yeah, make it headless

* README -> DESCRIPTION

* h1 -> h2

* Actually just have no header

* Use proper filename for downloaded files

* Configure whether to offer pseudonymisation etc

* Tweak descriptions

* fix log missing data

* add columns to post_topic_matrix

* fix breadcrumb bug

* Add top topics column

* Fix selenium config install parameter (Docker uses this/manual would
need to run install_selenium, well, manually)

* this processor is slow; I thought it was broken long before it updated!

* refactor detect_trackers as conversion processor not filter

* add geckodriver executable to docker install

* Auto-configure webdrivers if available in PATH

* update screenshots to act as image-downloader and benefit from processors

* fix is_compatible_with

* Delete helper-scripts/migrate/migrate-1.30-1.31.py

* fix embeddings is_compatible_with

* fix up UI options for hashing and private

* abstract was moved to lib

* various fixes to selenium based datasources

* processors not compatible with image datasets

* update firefox extension handling

* screenshots datasource fix get_options

* rename screenshots processor to be detected as image dataset

* add monthly and weekly frequencies to wayback machine datasource

* wayback ds: fix failure when no attempts yield results; add frequency options; add daily

* add scroll down page to allow lazy loading for entire page screenshots

* screenshots: adjust pause time so it can be used to force a wait for images to load

I have not found or come up with a way to wait for all images to load; document.readyState == 'complete' does not work this way on certain sites, including the Wayback Machine.
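
A minimal sketch of the scroll-and-pause approach described above (illustrative only; the helper name, pause values and driver handling are assumptions, not the actual 4CAT code):

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_steps=20):
    """Scroll down step by step, pausing so lazily loaded images get a chance to render."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_steps):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # readyState == 'complete' is not reliable for lazy loading, so just wait a fixed time
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page height stopped growing; assume everything has loaded
        last_height = new_height
```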

* hash URLs to create filenames

* remove log

* add setting to toggle display advanced options

* add progress bars

* web archive fix query validation

* count subpages in progress

* remove overwritten function

* move http response to own column

* special filenames

* add timestamps to all screenshots

* restart selenium on failure

* new build has selenium

* process urls after start (keep original query parameters)

* undo default firefox

* quick max

* rename SeleniumScraper to SeleniumSearch

todo: build SeleniumProcessor!

* max number screenshots configurable

* method to get url with error handling

* use get_with_error_handling

* d'oh, screenshot processor needs to quit selenium

* update log to contain URL

* Update scrolling to use Page down key if necessary

* improve logs

* update image_category_wall as screenshot datasource does not have category column; this is not ideal and ought to be solved in another way.

Also, could I get categories from the metadata? That's... ugh.

* no category, no processor

* str errors

* screenshots: dismiss alerts when checking ready state is complete

* set screenshot timeout to 30 seconds

* update gensim package

* screenshots: move processor interrupt into attempts loop

* if alert disappears before we can dismiss it...

* selenium specific logger

* do not switch window when no alert found on dismiss

* extract wait for page to load to selenium class

* improve descriptions of screenshot options

* remove unused line

* treat timeouts differently from other errors

these are more likely due to an issue with the website in question

* debug if requested

* increase pause time

* restart browser w/ PID

* increase max_workers for selenium

this is set per individual worker class, not for all selenium classes... so you can really crank them up if desired

* quick fix restart by pid

* avoid bad urls

* missing bracket & attempt to fix-missing dependencies in Docker install

* Allow dynamic form options in processors

* Allow 'requires' on data source options as well

* Handle list values with requires

* basic processor for apple store; setup checks for additional requirements

* fix is_4cat_class

* show preview when no map_item

* add google store datasource

* Docker setup.py use extensions

* Wider support for file upload in processors

* Log file uploads in DMI service manager

* add map_item methods and record more data per item

need additional item data since map_item is a staticmethod
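
For illustration (a schematic sketch, field names invented, not the actual store datasources): because `map_item()` is a static method it only sees the stored item, so anything the mapping needs has to be written into each item when it is collected.

```python
# inside an app-store Search class (schematic example)
@staticmethod
def map_item(item):
    # map_item() has no access to the processor instance or its query parameters,
    # so values like the queried country must already be stored with the item
    return {
        "id": item.get("id", ""),
        "title": item.get("title", ""),
        "country": item.get("query_country", ""),  # recorded at collection time
    }
```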

* update from master; merge conflicts

* fix docker build context (ignore data files)

* fix option requirements

* apple store fix: list still tries to get query

* apple & google stores fix up item mapping

* missed merge error

* minor fix

* remove unused import

* fix datasources w/ files frontend error

* fix error w/ datasources having file option

* better way to name docker volumes

* update two other docker compose files

* fix docker-compose ymls

* minor bug: fix and add warning; fix no results fail

* update apple field names to better match interface

* update google store fieldnames and order

* sneak in jinja logger if needed

* fix fourcat.js handling checkboxes for dynamic settings

* add new endpoint for app details to apple store

* apple_store map new beta app data

* add default lang/country

* not all apps have advisories

* revert so button works

* add chart positions to beta map items

* basic scheduler

To-do
- fix up and add options to scheduler view (e.g. delete/change)
- add scheduler view to navigator
- tie jobs to datasets? (either in scheduler view or, perhaps, filter dataset view)
- more testing...

* update scheduler view, add functions to update job interval

* revert .env

* working scheduler!

* basic scheduler view w/ datasets

* fix postgres tag

* update job status in scheduled_jobs table

* fix timestamp; end_date needed for last run check; add dataset label

* improve scheduler view

* remove dataset from scheduled_jobs table on delete

* scheduler view order by last creation

* scheduler views: separate scheduler list from scheduled dataset list

* additional update from master fixes

* apple_store map_items fix missing locales

* add back depth for pagination

* correct route

* modify pagination to accept args

* pagination fun

* pagination: I hate testing on live servers...

* ok ok need the pagination route

* pagination: add route_args

* fix up scheduler header

* improve app store descriptions

* add azure store

* fix azure links

* azure_store: add category search

* azure fix type of config update timestamp

OPTION_DATE does not appear correctly in settings and causes it to be written incorrectly

* basic aws store

* check if selenium available; get correct app_id

* aws: implement pagination

* add logging; wait for elements to load after next page; attempt to rework filter option collection

* apple_store: handle invalid param error

* fix filter_options

* aws: fix filter option collection!

* more merge

* move new datasources and processors to extensions and modify setup.py and module loader to use the new locations

* migrate.py to run extension "fourcat_install.py" files

* formatting

* remove extensions; add gitignore

* excise scheduler merge

* some additional cleanup from app_studies branch

* allow nested datasources folders; ignore files in extensions main folder

* allow extension install scripts to run pip if migrate.py has not

* Remove unused URL functions we could use ural for

* Take care of git commit hash tracking for extension processors

* Get rid of unused path.versionfile config setting

* Add extensions README

* Squashed commit of the following:

commit cd356f7
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:36:18 2024 +0200

    UI setting for 4CAT install ad in login

commit 0945d8c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 17:32:55 2024 +0200

    UI setting for anonymisation controls

    Todo: make per-datasource

commit 1a2562c
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:27 2024 +0200

    Debug panel for HTTP headers in control panel

commit 203314e
Author: Stijn Peeters <[email protected]>
Date:   Sat Sep 14 15:53:17 2024 +0200

    Preview for HTML datasets

commit 48c20c2
Author: Desktop Sal <[email protected]>
Date:   Wed Sep 11 13:54:23 2024 +0200

    Remove spacy processors (linguistic extractor, get nouns, get entities) and remove dependencies

commit 657ffd7
Author: Dale Wahl <[email protected]>
Date:   Fri Sep 6 16:29:19 2024 +0200

    fix nltk where it matters

commit 2ef5c80
Author: Stijn Peeters <[email protected]>
Date:   Tue Sep 3 12:05:14 2024 +0200

    Actually check progress in text annotator

commit 693960f
Author: Stijn Peeters <[email protected]>
Date:   Mon Sep 2 18:03:18 2024 +0200

    Add processor for stormtrooper DMI service

commit 6ae964a
Author: Stijn Peeters <[email protected]>
Date:   Fri Aug 30 17:31:37 2024 +0200

    Fix reference to old stopwords list in neologisms preset

* Fix Github links for extensions

* Fix commit detection in extensions

* Fix extension detection in module loader

* Follow symlinks when loading extensions

Probably not uncommon to have a checked-out repo somewhere and then symlink it into the extensions dir

* Make queue message on create page more generic

* Markdown in datasource option tooltips

* Remove Spacy model from requirements

* Add software_source to database SQL

---------

Co-authored-by: Stijn Peeters <[email protected]>
Co-authored-by: Stijn Peeters <[email protected]>
3 people authored Sep 16, 2024
1 parent cd356f7 commit a4bddae
Showing 42 changed files with 437 additions and 179 deletions.
1 change: 1 addition & 0 deletions .dockerignore
@@ -2,3 +2,4 @@ data/
.github/
.ipynb_checkpoints/
.gitignore
.idea/
3 changes: 1 addition & 2 deletions .env
@@ -30,7 +30,7 @@ TELEGRAM_PORT=443
# Docker Volume Names
DOCKER_DB_VOL=4cat_4cat_db
DOCKER_DATA_VOL=4cat_4cat_data
DOCKER_CONFIG_VOL=4cat_4cat_share
DOCKER_CONFIG_VOL=4cat_4cat_config
DOCKER_LOGS_VOL=4cat_4cat_logs

# Gunicorn settings
@@ -39,4 +39,3 @@ workers=4
threads=4
worker_class=gthread
log_level=debug

2 changes: 1 addition & 1 deletion .zenodo.json
@@ -3,7 +3,7 @@
"license": "MPL-2.0",
"title": "4CAT Capture and Analysis Toolkit",
"upload_type": "software",
"version": "v1.45",
"version": "v1.46",
"keywords": [
"webmining",
"scraping",
2 changes: 1 addition & 1 deletion VERSION
@@ -1,4 +1,4 @@
1.45
1.46

This file should not be modified. It is used by 4CAT to determine whether it
needs to run migration scripts to e.g. update the database structure to a more
1 change: 1 addition & 0 deletions backend/database.sql
@@ -56,6 +56,7 @@ CREATE TABLE IF NOT EXISTS datasets (
is_private boolean DEFAULT TRUE,
software_version text,
software_file text DEFAULT '',
software_source text DEFAULT '',
annotation_fields text DEFAULT ''
);

8 changes: 5 additions & 3 deletions backend/lib/processor.py
@@ -164,7 +164,7 @@ def work(self):

# start log file
self.dataset.update_status("Processing data")
self.dataset.update_version(get_software_commit())
self.dataset.update_version(get_software_commit(self))

# get parameters
# if possible, fill defaults where parameters are not provided
@@ -628,7 +628,7 @@ def write_csv_items_and_finish(self, data):
self.dataset.update_status("Finished")
self.dataset.finish(len(data))

def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZIP_STORED):
def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZIP_STORED, finish=True):
"""
Archive a bunch of files into a zip archive and finish processing
@@ -639,6 +639,7 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI
files added to the archive will be used.
:param int compression: Type of compression to use. By default, files
are not compressed, to speed up unarchiving.
:param bool finish: Finish the dataset/job afterwards or not?
"""
is_folder = False
if issubclass(type(files), PurePath):
@@ -665,7 +666,8 @@ def write_archive_and_finish(self, files, num_items=None, compression=zipfile.ZI
if num_items is None:
num_items = done

self.dataset.finish(num_items)
if finish:
self.dataset.finish(num_items)

def create_standalone(self):
"""
31 changes: 24 additions & 7 deletions backend/lib/search.py
@@ -1,16 +1,16 @@
import hashlib
import zipfile
import secrets
import shutil
import random
import json
import math
import csv
import os

from pathlib import Path
from abc import ABC, abstractmethod

from common.config_manager import config
from common.lib.dataset import DataSet
from backend.lib.processor import BasicProcessor
from common.lib.helpers import strip_tags, dict_search_and_update, remove_nuls, HashCache
from common.lib.exceptions import WorkerInterruptedException, ProcessorInterruptedException, MapItemException
@@ -71,18 +71,19 @@ def process(self):
items = self.import_from_file(query_parameters.get("file"))
else:
items = self.search(query_parameters)

except WorkerInterruptedException:
raise ProcessorInterruptedException("Interrupted while collecting data, trying again later.")

# Write items to file and update the DataBase status to finished
num_items = 0
if items:
self.dataset.update_status("Writing collected data to dataset file")
if results_file.suffix == ".ndjson":
num_items = self.items_to_ndjson(items, results_file)
elif results_file.suffix == ".csv":
if self.extension == "csv":
num_items = self.items_to_csv(items, results_file)
elif self.extension == "ndjson":
num_items = self.items_to_ndjson(items, results_file)
elif self.extension == "zip":
num_items = self.items_to_archive(items, results_file)
else:
raise NotImplementedError("Datasource query cannot be saved as %s file" % results_file.suffix)

@@ -361,6 +362,22 @@ def items_to_ndjson(self, items, filepath):

return processed

def items_to_archive(self, items, filepath):
"""
Save retrieved items as an archive
Assumes that items is an iterable with one item, a Path object
referring to a folder containing files to be archived. The folder will
be removed afterwards.
:param items:
:param filepath: Where to store the archive
:return int: Number of items
"""
num_items = len(os.listdir(items))
self.write_archive_and_finish(items, None, zipfile.ZIP_STORED, False)
return num_items


class SearchWithScope(Search, ABC):
"""
@@ -404,7 +421,7 @@ def search(self, query):
# proportion of items matches
# first, get amount of items for all threads in which matching
# items occur and that are long enough
thread_ids = tuple([post["thread_id"] for post in items])
thread_ids = tuple([item["thread_id"] for item in items])
self.dataset.update_status("Retrieving thread metadata for %i threads" % len(thread_ids))
try:
min_length = int(query.get("scope_length", 30))
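
Together with the `self.extension == "zip"` branch added to `process()` above, the new `items_to_archive()` lets a data source hand back a folder of files instead of rows. A hedged sketch of what such a data source could look like (class name, type and details are hypothetical, not code from this commit):

```python
class SearchScreenshots(Search):  # hypothetical zip-producing datasource
    type = "screenshots-search"
    extension = "zip"  # process() will route the result through items_to_archive()

    def search(self, query):
        staging_area = self.dataset.get_staging_area()  # temporary folder for result files
        # ... write one file per captured page into staging_area ...
        return staging_area  # a Path to a folder; it is zipped and then removed
```
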
11 changes: 11 additions & 0 deletions backend/lib/worker.py
@@ -133,6 +133,17 @@ def run(self):
location = "->".join(frames)
self.log.error("Worker %s raised exception %s and will abort: %s at %s" % (self.type, e.__class__.__name__, str(e), location))

# Clean up after work successfully completed or terminates
self.clean_up()

def clean_up(self):
"""
Clean up after a processor runs successfully or results in error.
Workers should override this method to implement any procedures
to run to clean up a worker; by default this does nothing.
"""
pass

def abort(self):
"""
Called when the application shuts down
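
The new `clean_up()` hook above is meant to be overridden. As an illustration only (not code from this commit), a Selenium-backed worker might use it to make sure the browser is shut down whether a job finishes or fails:

```python
class SeleniumBackedWorker(BasicWorker):  # illustrative subclass, not part of this commit
    def clean_up(self):
        # run() calls clean_up() after both successful and failed jobs,
        # so the browser gets closed either way
        if getattr(self, "driver", None):
            try:
                self.driver.quit()  # end the browser and its webdriver process
            except Exception:
                pass  # browser may already be gone; nothing left to clean up
```
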
4 changes: 2 additions & 2 deletions common/config_manager.py
@@ -44,9 +44,9 @@ def with_db(self, db=None):
# Replace w/ db if provided else only initialise if not already
self.db = db if db else Database(logger=None, dbname=self.get("DB_NAME"), user=self.get("DB_USER"),
password=self.get("DB_PASSWORD"), host=self.get("DB_HOST"),
port=self.get("DB_PORT"), appname="config-reader") if not db else db
port=self.get("DB_PORT"), appname="config-reader")
else:
# self.db already initialized
# self.db already initialized and no db provided
pass

def load_user_settings(self):
28 changes: 15 additions & 13 deletions common/lib/config_definition.py
@@ -165,20 +165,10 @@
"help": "Can view worker status",
"tooltip": "Controls whether users can view worker status via the Control Panel"
},
# The following two options should be set to ensure that every analysis step can
# The following option should be set to ensure that every analysis step can
# be traced to a specific version of 4CAT. This allows for reproducible
# research. You can however leave them empty with no ill effect. The version ID
# should be a commit hash, which will be combined with the Github URL to offer
# links to the exact version of 4CAT code that produced an analysis result.
# If no version file is available, the output of "git show" in PATH_ROOT will be used
# to determine the version, if possible.
"path.versionfile": {
"type": UserInput.OPTION_TEXT,
"default": ".git-checked-out",
"help": "Version file",
"tooltip": "Path to file containing GitHub commit hash. File containing a commit ID (everything after the first whitespace found is ignored)",
"global": True
},
# research. The output of "git show" in PATH_ROOT will be used to determine
# the version of a processor file, if possible.
"4cat.github_url": {
"type": UserInput.OPTION_TEXT,
"default": "https://github.com/digitalmethodsinitiative/4cat",
@@ -516,6 +506,18 @@
"tooltip": "If a dataset is a JSON file but it can be mapped to a CSV file, show the CSV in the preview instead"
"of the underlying JSON."
},
"ui.offer_hashing": {
"type": UserInput.OPTION_TOGGLE,
"default": True,
"help": "Offer pseudonymisation",
"tooltip": "Add a checkbox to the 'create dataset' forum to allow users to toggle pseudonymisation."
},
"ui.offer_private": {
"type": UserInput.OPTION_TOGGLE,
"default": True,
"help": "Offer create as private",
"tooltip": "Add a checkbox to the 'create dataset' forum to allow users to make a dataset private."
},
"ui.option_email": {
"type": UserInput.OPTION_CHOICE,
"options": {
23 changes: 17 additions & 6 deletions common/lib/dataset.py
@@ -114,6 +114,9 @@ def __init__(self, parameters=None, key=None, job=None, data=None, db=None, pare
self.parameters = json.loads(self.data["parameters"])
self.is_new = False
else:
self.data = {"type": type} # get_own_processor needs this
own_processor = self.get_own_processor()
version = get_software_commit(own_processor)
self.data = {
"key": self.key,
"query": self.get_label(parameters, default=type),
@@ -125,7 +128,8 @@
"timestamp": int(time.time()),
"is_finished": False,
"is_private": is_private,
"software_version": get_software_commit(),
"software_version": version[0],
"software_source": version[1],
"software_file": "",
"num_rows": 0,
"progress": 0.0,
@@ -139,7 +143,6 @@

# Find desired extension from processor if not explicitly set
if extension is None:
own_processor = self.get_own_processor()
if own_processor:
extension = own_processor.get_extension(parent_dataset=DataSet(key=parent, db=db) if parent else None)
# Still no extension, default to 'csv'
@@ -865,10 +868,12 @@ def get_label(self, parameters=None, default="Query"):
elif parameters.get("subject_match") and parameters["subject_match"] != "empty":
return parameters["subject_match"]
elif parameters.get("query"):
label = parameters["query"] if len(parameters["query"]) < 30 else parameters["query"][:25] + "..."
label = parameters["query"]
# Some legacy datasets have lists as query data
if isinstance(label, list):
label = ", ".join(label)

label = label if len(label) < 30 else label[:25] + "..."
label = label.strip().replace("\n", ", ")
return label
elif parameters.get("country_flag") and parameters["country_flag"] != "all":
@@ -1116,7 +1121,8 @@ def update_version(self, version):
processor_path = ""

updated = self.db.update("datasets", where={"key": self.data["key"]}, data={
"software_version": version,
"software_version": version[0],
"software_source": version[1],
"software_file": processor_path
})

@@ -1151,10 +1157,15 @@ def get_version_url(self, file):
:param file: File to link within the repository
:return: URL, or an empty string
"""
if not self.data["software_version"] or not config.get("4cat.github_url"):
if not self.data["software_source"]:
return ""

return config.get("4cat.github_url") + "/blob/" + self.data["software_version"] + self.data.get("software_file", "")
filepath = self.data.get("software_file", "")
if filepath.startswith("/extensions/"):
# go to root of extension
filepath = "/" + "/".join(filepath.split("/")[3:])

return self.data["software_source"] + "/blob/" + self.data["software_version"] + filepath

def top_parent(self):
"""