# Writing a corpus definition in Python

This document is a guide to writing a Python corpus definition.

The steps of adding a new Python corpus are usually the following:

- Create a new Python class in the I-analyzer repository which describes the corpus.
- Include the corpus in your local Django settings and include (local) source data.
- Load the corpus into your local database.
- Create and populate a local Elasticsearch index for the corpus.
- Workshop the corpus definition and add unit tests.
- Make a pull request.
- Create and populate a production Elasticsearch index on the production cluster using your test branch. (We use a dedicated I-analyzer instance for indexing.)
- Include the corpus definition in the next release and deploy it in production.
- Verify that everything works as expected and adjust the corpus permissions in the production admin interface, so users can see it.

## Corpus definition

Start by adding a new Python module `<corpusname>.py` to the `backend/corpora` directory, and include it in the `CORPORA` setting of your Django settings. (Use `settings_local.py` to set this for your own development server only.)
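
For your own development setup, this could look like the sketch below in `settings_local.py`. The dictionary format shown here (corpus name mapped to the path of the definition module) and the `MYCORPUS_DATA` setting name are assumptions; verify the exact format of the `CORPORA` setting against the project settings documentation.

```python
# settings_local.py — a hedged sketch; the exact format of CORPORA
# (corpus name -> path of the definition module) is an assumption.
CORPORA = {
    'mycorpus': 'corpora/mycorpus/mycorpus.py',
}

# Location of the source data for this corpus (see "Using project settings" below).
MYCORPUS_DATA = '/home/me/data/mycorpus'
```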

The actual definition is a class that you define in this module. It should subclass the `CorpusDefinition` class. This class includes some default values for attributes and default behaviour.

It also inherits the `Reader` class from ianalyzer_readers, which provides very minimal functionality for reading source files. Most corpus definitions also inherit from a more specific `Reader` that provides functionality for the type of source data, e.g. `XMLReader`, `CSVReader`, etc. For convenience, you can use the classes `XMLCorpusDefinition`, `CSVCorpusDefinition`, etc., defined in `corpus.py`. See the documentation of ianalyzer_readers for the available `Reader` classes and the API for each of them.

Your definition module should now look something like this:

```python
from addcorpus.corpus import CSVCorpusDefinition

class MyCorpus(CSVCorpusDefinition):
    pass
```

This class will describe all metadata for the corpus, but various attributes and methods need to be filled in before the corpus is ready for use.

Many of these values will be hard-coded in the definition class, but some will need to be imported from the project settings, because they need to be configurable. (For example, the location of source data.) More on the use of settings below.

## Attributes and methods

### Required attributes

The following attributes are required for a corpus to function.

| Attribute | Type | Description |
|-----------|------|-------------|
| `title` | `str` | Title to be used in the interface. |
| `description` | `str` | Short description; appears as a subtitle in the interface. |
| `min_date` | `datetime.date` | The minimum date for the data in the corpus. This is shown as metadata in the corpus overview. It is not used to restrict the data. |
| `max_date` | `datetime.date` | The maximum date for the data, analogous to `min_date`. |
| `category` | `str` | The type of data in the corpus. See the options for categories. |
| `languages` | `List[str]` | A list of IETF tags of the languages used in your corpus. Corpus languages are intended as a way for users to select interesting datasets, so only include languages for which your corpus contains a meaningful amount of data. The list should go from most to least frequent. You can also include `''` for "unknown". |
| `es_index` | `str` | The name of the Elasticsearch index. In development, the corpus name will do. On a production cluster, you may need to use a particular prefix. If the name starts with `test-`, the index may be deleted when running unit tests; do this for test corpora, but not elsewhere. |
| `data_directory` | `Optional[str]` | Path to the directory containing source files. Always get this from the settings. You can also set this to `None`, usually because you are getting source data from an API instead of a local directory. |
| `fields` | `List[Field]` | The fields for the corpus. See defining fields. |
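
To make this concrete, the sketch below fills in the required attributes for a hypothetical CSV corpus. All values are illustrative: the category string is a placeholder (check the documented category options), and `MYCORPUS_DATA` is an assumed setting name (see using project settings below).

```python
from datetime import date

from django.conf import settings
from addcorpus.corpus import CSVCorpusDefinition


class MyCorpus(CSVCorpusDefinition):
    '''Hypothetical corpus, shown only to illustrate the required attributes.'''
    title = 'My Corpus'
    description = 'Example collection of nineteenth-century documents.'
    min_date = date(1800, 1, 1)
    max_date = date(1899, 12, 31)
    category = 'book'  # placeholder; pick one of the documented category options
    languages = ['en', 'fr']  # most frequent language first
    es_index = 'mycorpus'
    data_directory = settings.MYCORPUS_DATA  # assumed setting name
    fields = []  # see "Defining fields" below
```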

### Required methods

The corpus class must define a method `sources(self, **kwargs)`. See the API documentation of ianalyzer_readers. When you run the indexing command, I-analyzer can provide two named arguments, `start` and `end`, which give a minimum and maximum date to select source files.
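
For illustration, a `sources` implementation for a directory of CSV files might look roughly like this. The use of `glob` and yielding plain file paths are assumptions about one common pattern; check the ianalyzer_readers API documentation for the return formats your Reader accepts (for example, paths with an accompanying metadata dictionary).

```python
import glob
import os

from addcorpus.corpus import CSVCorpusDefinition


class MyCorpus(CSVCorpusDefinition):
    # ... attributes as above ...

    def sources(self, start=None, end=None, **kwargs):
        # Yield the path of each source file in the data directory.
        # start / end could be used here to skip files outside the date range.
        for path in sorted(glob.glob(os.path.join(self.data_directory, '*.csv'))):
            yield path
```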

### Optional attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| `image` | `str` | The filename of the image used for the corpus in the interface. (See below.) |
| `es_alias` | `str` | An alias that you want to assign to the index in Elasticsearch. |
| `es_settings` | `Dict` | Customises the settings of the Elasticsearch index. Can be generated using `es_settings.py`. |
| `scan_image_type` | `str` | The MIME type of media attachments to documents, if these are included. |
| `allow_image_download` | `bool` | If the corpus has media attachments, this controls whether they can be downloaded from the interface. |
| `document_context` | `Dict` | Defines how to group documents into a "context". For example, if each document is a page, you can configure this setting so users can view a book. See the docstring of this attribute for details. |
| `default_sort` | `Dict` | Defines the default method to sort search results if the user has not entered a query. See the docstring of this attribute for details. |
| `language_field` | `str` | If the corpus contains documents in multiple languages, this can specify the name of the field that stores the IETF tag for each document. |
| `description_page` | `str` | Name of the file with a general description of the corpus. See below. |
| `citation_page` | `str` | Name of the file with citation guidelines. See below. |
| `wordmodels_page` | `str` | Name of the file documenting word models. See below. |
| `license_page` | `str` | Name of the file containing a licence for the data. See below. |
| `terms_of_service_page` | `str` | Name of the file containing terms of service. See below. |
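
For example, a corpus with a cover image, a description page and per-document language information could add the following (all filenames and field names here are placeholders):

```python
from addcorpus.corpus import CSVCorpusDefinition


class MyCorpus(CSVCorpusDefinition):
    # ... required attributes as above ...
    image = 'mycorpus.jpg'
    description_page = 'mycorpus.md'
    es_alias = 'mycorpus-alias'
    language_field = 'language'  # field that stores the IETF tag per document
```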

## Documentation files and corpus image

If you include a corpus image or documentation pages, these need to be included as separate files.

Each file should be located in a specific subdirectory of the directory that contains your definition module. Specifically:

| Filename attribute | Subdirectory |
|--------------------|--------------|
| `image` | `images` |
| `description_page` | `description` |
| `citation_page` | `citation` |
| `wordmodels_page` | `wm` |
| `license_page` | `license` |
| `terms_of_service_page` | `terms_of_service` |

This means that if your corpus includes `image = 'mycorpus.jpg'`, your directory should be structured like this:

```
mycorpus/
├ mycorpus.py
└ images/
  └ mycorpus.jpg
```

The image can be any image file. Documentation pages must be markdown files. See corpus documentation for more information about writing documentation.

## Defining fields

The `fields` property lists the configuration for each field in the corpus. Each of these defines how that field should be extracted from the source file, how it should be stored in Elasticsearch, and how it should appear in the interface. See `corpus.py` for the class definition.

Note that unlike with `CorpusDefinition`, fields are not defined as classes but as objects. Rather than creating a custom subclass of `FieldDefinition`, you can just call the `FieldDefinition()` constructor with appropriate parameters.

See the docstring of `FieldDefinition` for a comprehensive overview of all parameters.
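
As an illustration, a single field definition might look like the sketch below. The parameter names used here (`name`, `display_name`, `description`, `es_mapping`, `extractor`) and the import locations are assumptions based on typical usage; check the `FieldDefinition` docstring and `corpus.py` for the actual parameters.

```python
from addcorpus.corpus import FieldDefinition  # assumed import location
from ianalyzer_readers import extract

# Hedged sketch of one field; parameter names are assumptions, so verify
# them against the FieldDefinition docstring.
date_field = FieldDefinition(
    name='date',
    display_name='Date',
    description='Publication date of the document.',
    es_mapping={'type': 'date'},
    extractor=extract.CSV('date'),  # read the 'date' column from the CSV source
)
```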

### Extracting values

The `extractor` attribute of a field should define how it extracts its data from source files. This value should be an instance of `Extractor`, which is defined in the ianalyzer_readers package. See the API documentation of ianalyzer_readers for a list of available extractors and their parameters.

These extractors are typically sufficient for new corpora; if they are not, you can create a custom `Extractor` subclass for your corpus, or expand the ianalyzer_readers package.
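
If you do write a custom extractor, it could look roughly like the sketch below. The `_apply` override point follows the pattern of the built-in extractors but is an assumption; verify it against the `Extractor` API documentation before relying on it.

```python
from ianalyzer_readers.extract import Extractor


class FirstParagraph(Extractor):
    '''Hypothetical extractor that keeps only the first paragraph of a text.

    The _apply override point is an assumption; check the ianalyzer_readers
    documentation for the actual Extractor interface.
    '''

    def __init__(self, text_extractor, *nargs, **kwargs):
        self.text_extractor = text_extractor
        super().__init__(*nargs, **kwargs)

    def _apply(self, *nargs, **kwargs):
        text = self.text_extractor.apply(*nargs, **kwargs)
        return text.split('\n\n')[0] if text else None
```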

## Using project settings

Several of the attributes in a corpus definition need to be configurable per environment. This is done by including these values in the project settings.

Please use the following naming convention when you add settings for your corpus.

```python
CORPUSNAME_DATA = '/MyData/CorpusData'  # the directory where the XML / HTML or other source files are located
CORPUSNAME_ES_INDEX = 'dutchbanking'  # the name that Elasticsearch gives to the index
CORPUSNAME_SCAN_IMAGE_TYPE = 'image/png'  # MIME type of document media
# etc...
```

These can be retrieved in the corpus definition, for example:

```python
from datetime import datetime

from django.conf import settings
from addcorpus.corpus import XMLCorpusDefinition


class Times(XMLCorpusDefinition):
    title = "Times"
    description = "Newspaper archive, 1785-2010"
    min_date = datetime(year=1785, month=1, day=1)
    max_date = datetime(year=2010, month=12, day=31)
    data_directory = settings.TIMES_DATA
    es_index = getattr(settings, 'TIMES_ES_INDEX', 'times')
    # ...
```

Note that for a property like the Elasticsearch index, we define a default value but make it possible to override this in the settings file, while the data directory is required.

## Unit testing

It is strongly recommended that you include unit tests for your corpus. A minimal test is to try to load your corpus into the database and check that it does not run into validation errors. In addition, it is recommended that you include some tests that check the output of the data extraction.

The rechtspraak corpus includes good examples of such tests.
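
As a rough starting point, a data-extraction test could look like the sketch below. It assumes a small sample file is checked into the repository next to the test, and uses the `documents()` method that the Reader classes provide; the module path, filenames and expected values are all placeholders.

```python
import os

from corpora.mycorpus.mycorpus import MyCorpus  # hypothetical module path

HERE = os.path.dirname(__file__)


def test_mycorpus_extraction():
    corpus = MyCorpus()
    # Point the corpus at a small sample dataset included with the tests.
    corpus.data_directory = os.path.join(HERE, 'data')

    docs = list(corpus.documents())

    # The expected values are placeholders for your own sample data.
    assert len(docs) > 0
    assert docs[0]['date'] == '1850-01-01'
```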