Merge branch 'release/5.10.0'
lukavdplas committed Aug 8, 2024
2 parents 2cbe71e + a3b7e69 commit 671ba3f
Showing 63 changed files with 1,295 additions and 1,295 deletions.
42 changes: 36 additions & 6 deletions .github/workflows/backend-test.yml
@@ -1,4 +1,4 @@
# This workflow will run backend tests on the Python version defined in the Dockerfiles
# This workflow will run backend tests on the Python version defined in the backend/Dockerfile

name: Backend unit tests

@@ -13,15 +13,45 @@ on:
- 'hotfix/**'
- 'release/**'
- 'dependabot/**'
paths-ignore:
- 'frontend/**'
- '**.md'
paths:
- 'backend/**'
- '.github/workflows/backend*'
- 'docker-compose.yaml'

jobs:
backend-test:
name: Test Backend
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push Elasticsearch image
uses: docker/build-push-action@v6
with:
context: .
file: DockerfileElastic
push: true
tags: ghcr.io/uudigitalhumanitieslab/ianalyzer-elastic:latest
cache-from: type=registry,ref=ghcr.io/uudigitalhumanitieslab/ianalyzer-elastic:latest
cache-to: type=inline
- name: Build and push Backend
uses: docker/build-push-action@v6
with:
context: backend/.
push: true
tags: ghcr.io/uudigitalhumanitieslab/ianalyzer-backend:latest
cache-from: type=registry,ref=ghcr.io/uudigitalhumanitieslab/ianalyzer-backend:latest
cache-to: type=inline
- name: Run backend tests
run: sudo mkdir -p /ci-data && sudo docker-compose --env-file .env-ci run backend pytest
run: |
sudo mkdir -p /ci-data
docker compose pull elasticsearch
docker compose pull backend
docker compose --env-file .env-ci run --rm backend pytest
31 changes: 25 additions & 6 deletions .github/workflows/frontend-test.yml
@@ -13,15 +13,34 @@ on:
- 'hotfix/**'
- 'release/**'
- 'dependabot/**'
paths-ignore:
- 'backend/**'
- '**.md'
paths:
- 'frontend/**'
- '.github/workflows/frontend*'
- 'docker-compose.yaml'

jobs:
frontend-test:
name: Test Frontend
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run frontend tests
run: sudo docker-compose --env-file .env-ci run frontend yarn test
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build frontend image, using cache from Github registry
uses: docker/build-push-action@v6
with:
context: frontend/.
push: true
tags: ghcr.io/uudigitalhumanitieslab/ianalyzer-frontend:latest
cache-from: type=registry,ref=ghcr.io/uudigitalhumanitieslab/ianalyzer-frontend:latest
cache-to: type=inline
- name: Run frontend unit tests
run: |
docker compose pull frontend
docker compose --env-file .env-ci run --rm frontend yarn test
25 changes: 0 additions & 25 deletions .github/workflows/release.yml

This file was deleted.

10 changes: 10 additions & 0 deletions .vscode/launch.json
@@ -61,6 +61,16 @@
}
},
{
"name": "Python: Debug Tests",
"type": "debugpy",
"request": "launch",
"program": "${file}",
"purpose": [
"debug-test"
],
"console": "internalConsole",
"justMyCode": false
}, {
"name": "celery",
"type": "debugpy",
"request": "launch",
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -35,5 +35,5 @@ keywords:
- elasticsearch
- natural language processing
license: MIT
version: 5.9.0
date-released: '2024-07-05'
version: 5.10.0
date-released: '2024-08-08'
1 change: 0 additions & 1 deletion backend/Dockerfile
@@ -7,7 +7,6 @@ RUN apt-get -y update && apt-get -y upgrade
RUN apt-get install -y pkg-config libxml2-dev libxmlsec1-dev libxmlsec1-openssl default-libmysqlclient-dev

RUN pip install --upgrade pip
RUN pip install pip-tools
# make a directory in the container
WORKDIR /backend
# copy requirements from the host system to the directory in the container
13 changes: 11 additions & 2 deletions backend/addcorpus/constants.py
@@ -49,9 +49,18 @@ class VisualizationType(Enum):
'scan',
'tab-scan',
'p',
'tags',
'context',
'tab',
]
'''
Field names that cannot be used because they are also query parameters in frontend routes.
Field names that cannot be used because they interfere with other functionality.
Using them would make routing ambiguous.
This is usually because they are also query parameters in frontend routes, and using them
would make routing ambiguous.
`query` is also forbidden because it is a reserved column in CSV downloads. Likewise,
`context` is forbidden because it's used in download requests.
`scan` and `tab-scan` are added because they interfere with element IDs in the DOM.
'''
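
A minimal sketch of how a list like this might be enforced when a field is defined. The list below is abbreviated, and the names `FORBIDDEN_FIELD_NAMES` and `validate_field_name` are assumptions for illustration; they are not taken from this commit.

```python
# Hypothetical, abbreviated list: the real constant has more entries.
FORBIDDEN_FIELD_NAMES = [
    'query',
    'scan',
    'tab-scan',
    'p',
    'tags',
    'context',
    'tab',
]


def validate_field_name(name: str) -> None:
    """Raise ValueError if `name` collides with a reserved name."""
    if name in FORBIDDEN_FIELD_NAMES:
        raise ValueError(f'{name!r} is a reserved name and cannot be used for a field')


# Adjacent string literals without a comma are concatenated by Python,
# so a missing comma silently merges two entries into one.
merged = ['tab-scan' 'p']
assert merged == ['tab-scanp']
```

Each entry therefore needs its own trailing comma; otherwise neither of the two merged names is actually rejected.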
4 changes: 1 addition & 3 deletions backend/addcorpus/reader.py
@@ -36,8 +36,6 @@ class NewReader(CSVReader):
for f in corpus.configuration.fields.all()]

def sources(self, *args, **kwargs):
return (
(fn, {}) for fn in glob.glob(f'{self.data_directory}/**/*.csv', recursive=True)
)
return glob.glob(f'{self.data_directory}/**/*.csv', recursive=True)

return NewReader()
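
The new `sources` returns plain file paths instead of `(filename, metadata)` tuples. As a rough, standalone illustration of the recursive glob pattern it relies on, the sketch below builds a temporary directory tree and collects the CSV files. The helper name `csv_sources` is hypothetical, and the sort is added only to make the output deterministic.

```python
import glob
import os
import tempfile


def csv_sources(data_directory: str) -> list:
    """Return every .csv file under data_directory, including subdirectories."""
    # With recursive=True, '**' matches zero or more nested directories.
    return sorted(glob.glob(f'{data_directory}/**/*.csv', recursive=True))


# Build a throwaway directory tree to try it out.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'nested'))
    for relative in ('a.csv', os.path.join('nested', 'b.csv'), 'notes.txt'):
        open(os.path.join(tmp, relative), 'w').close()
    found = [os.path.relpath(path, tmp) for path in csv_sources(tmp)]
```

Files directly in the data directory are matched as well, because `**` may match zero directories.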
29 changes: 14 additions & 15 deletions backend/corpora/dbnl/dbnl.py
@@ -2,6 +2,7 @@
import os
import re
from tqdm import tqdm
from ianalyzer_readers.xml_tag import Tag, CurrentTag, TransformTag

from django.conf import settings
from addcorpus.python_corpora.corpus import XMLCorpusDefinition, FieldDefinition
@@ -25,8 +26,8 @@ class DBNL(XMLCorpusDefinition):
languages = ['nl', 'dum', 'fr', 'la', 'fy', 'lat', 'en', 'nds', 'de', 'af']
category = 'book'

tag_toplevel = 'TEI.2'
tag_entry = { 'name': 'div', 'attrs': {'type': 'chapter'} }
tag_toplevel = Tag('TEI.2')
tag_entry = Tag('div', type='chapter')

document_context = {
'context_fields': ['title_id'],
@@ -261,18 +262,18 @@ def _xml_files(self):
Pass(
Backup(
XML( # get the language on chapter-level if available
CurrentTag(),
attribute='lang',
transform=lambda value: [value] if value else None,
),
XML( # look for section-level codes
{'name': 'div', 'attrs': {'type': 'section'}},
Tag('div', type='section'),
attribute='lang',
multiple=True,
),
XML( # look in the top-level metadata
'language',
Tag('language'),
toplevel=True,
recursive=True,
multiple=True,
attribute='id'
),
@@ -298,17 +299,17 @@ def _xml_files(self):
extractor=Pass(
Backup(
XML( # get the language on chapter-level if available
CurrentTag(),
attribute='lang',
),
XML( # look for section-level code
{'name': 'div', 'attrs': {'type': 'section'}},
Tag('div', type='section'),
attribute='lang'
),
XML( #otherwise, get the (first) language for the book
'language',
Tag('language'),
attribute='id',
toplevel=True,
recursive=True,
),
transform=utils.single_language_code,
),
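
Judging by the comments in the hunks above, the language is looked up at chapter level first, then at section level, and finally in the book's top-level metadata. The sketch below shows that fallback idea in plain Python. It is not the `ianalyzer_readers` implementation: the function `backup`, the dict-based documents and the key names are all hypothetical, and the assumption that the first non-empty result wins is inferred from those comments.

```python
from typing import Callable, Optional

Extractor = Callable[[dict], Optional[str]]


def backup(*extractors: Extractor) -> Extractor:
    """Combine extractors; the result returns the first non-empty value."""
    def extract(document: dict) -> Optional[str]:
        for extractor in extractors:
            value = extractor(document)
            if value:
                return value
        return None
    return extract


# Hypothetical stand-ins for the chapter, section and book level lookups.
language = backup(
    lambda doc: doc.get('chapter_lang'),
    lambda doc: doc.get('section_lang'),
    lambda doc: doc.get('book_lang'),
)
```

Calling `language({'section_lang': 'fr', 'book_lang': 'nl'})` falls through the missing chapter-level value and picks the section-level one.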
@@ -322,13 +323,11 @@ def _xml_files(self):
display_name='Chapter',
extractor=Backup(
XML(
tag='head',
recursive=True,
Tag('head'),
flatten=True,
),
XML(
tag=utils.LINE_TAG,
recursive=True,
Tag(utils.LINE_TAG),
flatten=True,
)
),
@@ -359,11 +358,11 @@ def _xml_files(self):
search_field_core=True,
csv_core=True,
extractor=XML(
tag=utils.LINE_TAG,
recursive=True,
Tag(utils.LINE_TAG),
TransformTag(utils.pad_content),
multiple=True,
flatten=True,
transform_soup_func=utils.pad_content,
transform=lambda lines: '\n'.join(lines).strip() if lines else None,
),
es_mapping=main_content_mapping(token_counts=True),
visualizations=['wordcloud'],
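
The `transform` added to the content extractor joins the extracted lines into one string. Written as a named function (the name `join_lines` is hypothetical), the same expression looks like this; it assumes the extractor passes either a list of strings or an empty value.

```python
from typing import List, Optional


def join_lines(lines: Optional[List[str]]) -> Optional[str]:
    """Join extracted lines with newlines; return None when nothing was extracted."""
    # The conditional applies to the whole join-and-strip expression.
    return '\n'.join(lines).strip() if lines else None
```

Because `strip()` runs on the joined string, only leading and trailing whitespace of the whole text is removed, not whitespace at the ends of individual lines.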
14 changes: 8 additions & 6 deletions backend/corpora/dbnl/tests/test_dbnl_extraction.py
@@ -145,12 +145,12 @@ def test_append_to_tag(xml, tag, padding, original_output, new_output):
'content': '\n'.join([
'Register der Liedekens.',
'A.',
'ACh gesalfde van den Heer. Pag. 30 ',
'Als Saul, en david den vyant in\'t velt. 41 ',
'Als ick de Son verhoogen sie. 184 ',
'Als hem de Son begeeft. 189 ',
'Als ick den Herfst aenschou. 194 ',
'Als in koelt, de nacht komt overkleeden 208 ',
'ACh gesalfde van den Heer. Pag. 30',
'Als Saul, en david den vyant in\'t velt. 41',
'Als ick de Son verhoogen sie. 184',
'Als hem de Son begeeft. 189',
'Als ick den Herfst aenschou. 194',
'Als in koelt, de nacht komt overkleeden 208',
'Als van der meer op Eng\'le-vleug\'len vloog. 232',
])
}, { # metadata-only book
@@ -194,6 +194,8 @@ def test_dbnl_extraction(dbnl_corpus):
for actual, expected in zip(docs, expected_docs):
# assert that actual is a superset of expected
for key in expected:
if expected[key] != actual[key]:
print(key)
assert expected[key] == actual[key]
assert expected.items() <= actual.items()
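
The test asserts that each extracted document is a superset of the expected one. A small standalone sketch of that check follows, with a helper that lists the offending keys instead of printing them one by one. The function name `mismatched_keys` and the sample documents are made up for illustration.

```python
def mismatched_keys(expected: dict, actual: dict) -> list:
    """List the keys whose expected value is missing or different in actual."""
    return [key for key in expected if key not in actual or expected[key] != actual[key]]


# Made-up documents, only to demonstrate the comparison.
expected = {'title': 'Example', 'language': 'nl'}
actual = {'title': 'Example', 'language': 'nl', 'year': 1650}

# Item views compare like sets: `<=` checks that every (key, value) pair of
# `expected` also appears in `actual`.
is_subset = expected.items() <= actual.items()
```

Reporting all mismatched keys at once can make a failing comparison easier to read than stopping at the first failed assertion.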

3 changes: 2 additions & 1 deletion backend/corpora/dbnl/utils.py
@@ -183,7 +183,8 @@ def append_to_tag(soup, tag, padding):
def pad_content(node):
pad_cells = lambda n: append_to_tag(n, 'cell', ' ')
pad_linebreaks = lambda n: append_to_tag(n, 'lb', '\n')
return pad_cells(pad_linebreaks(node))
pad_cells(pad_linebreaks(node))
return [node]

def standardize_language_code(code):
if code:
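
To illustrate what `pad_content` does and why it now returns a list, here is a rough sketch that swaps in the standard library's `xml.etree.ElementTree` for the `soup` objects used in the source. The body of `append_to_tag` is an assumption (the commit does not show it), and the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def append_to_tag(node: ET.Element, tag: str, padding: str) -> ET.Element:
    """Append `padding` to the text of every `tag` element under `node`."""
    for element in node.iter(tag):
        element.text = (element.text or '') + padding
    return node


def pad_content(node: ET.Element) -> list:
    """Pad table cells with a space and line breaks with a newline."""
    append_to_tag(node, 'lb', '\n')
    append_to_tag(node, 'cell', ' ')
    # Mirror the new return shape: a list of nodes rather than a single node.
    return [node]


row = ET.fromstring('<row><cell>ACh gesalfde</cell><cell>30</cell><lb/></row>')
(padded,) = pad_content(row)
text = ''.join(padded.itertext())
```

The padding keeps adjacent cells from running together when the node is flattened to text.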
10 changes: 4 additions & 6 deletions backend/corpora/dutchannualreports/dutchannualreports.py
@@ -4,6 +4,7 @@
import os.path as op
import logging
from datetime import datetime
from ianalyzer_readers.xml_tag import Tag

from django.conf import settings

@@ -20,7 +21,6 @@
class DutchAnnualReports(XMLCorpusDefinition):
""" Alto XML corpus of Dutch annual reports. """

# Data overrides from .common.Corpus (fields at bottom of class)
title = "Dutch Annual Reports"
description = "Annual reports of Dutch financial and non-financial institutes"
min_date = datetime(year=1957, month=1, day=1)
@@ -38,9 +38,8 @@ class DutchAnnualReports(XMLCorpusDefinition):

mimetype = 'application/pdf'

# Data overrides from .common.XMLCorpus
tag_toplevel = 'alto'
tag_entry = 'Page'
tag_toplevel = Tag('alto')
tag_entry = Tag('Page')

# New data members
non_xml_msg = 'Skipping non-XML file {}'
@@ -187,9 +186,8 @@ def sources(self, start=min_date, end=max_date):
description='Text content of the page.',
results_overview=True,
extractor=XML(
tag='String',
Tag('String'),
attribute='CONTENT',
recursive=True,
multiple=True,
transform=lambda x: ' '.join(x),
),