alto2txt2fixture is a standalone tool that converts alto2txt XML output and other related datasets into JSON (and, where feasible, CSV) data with corresponding relational IDs, to ease general use and ingestion into a relational database.
+
We target the JSON structure needed for importing into lwmdb: a database built with the Django Python web framework, which uses the Django database fixture structure.
+
Installation and simple use
+
We provide a command line interface to process alto2txt XML files stored locally (or mounted via Azure blobfuse), and for additional public data we automate the download process.
+
Installation
+
We recommend downloading a copy of the repository or using git clone. From a local copy, use poetry to install dependencies:
+
$ cd alto2txt2fixture
$ poetry install
+
+
If you would like to test, render documentation and/or contribute to the code, include the dev dependencies in a local install:
+
$ poetry install --with dev
+
+
Simple use
+
To process newspaper metadata with a local copy of alto2txt XML results, it's easiest to have that data in the same folder as your alto2txt2fixture checkout and poetry-installed folder. Once arranged, you should be able to begin the JSON conversion with:
+
$ poetry run a2t2f-news
+
+
To generate related data in JSON and CSV form, assuming you have an internet connection and access to a living-with-machines Azure account, the following will download related data into JSON and CSV files. The JSON results should be consistent with lwmdb tables for ease of import.
def parse_args(argv: list[str] | None = None) -> Namespace:
    """Manage command line arguments for `run()`

    This constructs an `ArgumentParser` instance to manage
    configuring calls of `run()` to manage `newspaper`
    `XML` to `JSON` conversion.

    Arguments:
        argv:
            If `None` treat as equivalent of `['--help']`,
            if a `list` of `str` pass those options to `ArgumentParser`

    Returns:
        A `Namespace` `dict`-like configuration for `run()`
    """
    argv = None if not argv else argv
    parser = ArgumentParser(
        prog="a2t2f-news",
        description="Process alto2txt XML into Django JSON Fixture files",
        epilog=(
            "Note: this is still in beta mode and contributions welcome\n\n" + __doc__
        ),
        formatter_class=RawTextHelpFormatter,
    )
    parser.add_argument(
        "-c",
        "--collections",
        nargs="+",
        help="<Optional> Set collections",
        required=False,
    )
    parser.add_argument(
        "-m",
        "--mountpoint",
        type=str,
        help="<Optional> Mountpoint",
        required=False,
    )
    parser.add_argument(
        "-o",
        "--output",
        type=str,
        help="<Optional> Set an output directory",
        required=False,
    )
    parser.add_argument(
        "-t",
        "--test-config",
        default=False,
        help="Only print the configuration",
        action=BooleanOptionalAction,
    )
    parser.add_argument(
        "-f",
        "--show-fixture-tables",
        default=True,
        help="Print included fixture table configurations",
        action=BooleanOptionalAction,
    )
    parser.add_argument(
        "--export-fixture-tables",
        default=True,
        help="Experimental: export fixture tables prior to data processing",
        action=BooleanOptionalAction,
    )
    parser.add_argument(
        "--data-provider-field",
        type=str,
        default=DATA_PROVIDER_INDEX,
        help="Key for indexing DataProvider records",
    )
    return parser.parse_args(argv)
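As a minimal sketch of how these options surface (assuming the package layout exposes `parse_args` from `alto2txt2fixture.__main__`; the import path is an assumption), the parser can be exercised directly:

```python
from alto2txt2fixture.__main__ import parse_args  # import path is an assumption

# Parse a hypothetical invocation equivalent to:
#   a2t2f-news -c hmd lwm -o ./output --test-config
args = parse_args(["-c", "hmd", "lwm", "-o", "./output", "--test-config"])

print(args.collections)  # ['hmd', 'lwm']
print(args.output)       # './output'
print(args.test_config)  # True
```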
run

run(local_args: list[str] | None = None) -> None
Manage running newspaper XML to JSON conversion.

First parse_args is called for command line arguments including:

- collections
- output
- mountpoint

If any of these arguments are specified, they will be used, otherwise they
will default to the values in the settings module.

The show_setup function is then called to display the configurations
being used.

The route function is then called to route the alto2txt files into
subdirectories with structured files.

The parse function is then called to parse the resulting JSON files.

Finally, the clear_cache function is called to clear the cache
(pending the user's confirmation).

Parameters:

| Name         | Type                | Description                      | Default |
|--------------|---------------------|----------------------------------|---------|
| `local_args` | `list[str] \| None` | Options passed to `parse_args()` | `None`  |

Returns:

| Type   | Description |
|--------|-------------|
| `None` | None        |

Source code in alto2txt2fixture/__main__.py
def run(local_args: list[str] | None = None) -> None:
    """Manage running newspaper `XML` to `JSON` conversion.

    First `parse_args` is called for command line arguments including:

    - `collections`
    - `output`
    - `mountpoint`

    If any of these arguments are specified, they will be used, otherwise they
    will default to the values in the `settings` module.

    The `show_setup` function is then called to display the configurations
    being used.

    The `route` function is then called to route the alto2txt files into
    subdirectories with structured files.

    The `parse` function is then called to parse the resulting JSON files.

    Finally, the `clear_cache` function is called to clear the cache
    (pending the user's confirmation).

    Arguments:
        local_args: Options passed to `parse_args()`

    Returns:
        None
    """
    args: Namespace = parse_args(argv=local_args)

    if args.collections:
        COLLECTIONS = [x.lower() for x in args.collections]
    else:
        COLLECTIONS = settings.COLLECTIONS

    if args.output:
        OUTPUT = args.output.rstrip("/")
    else:
        OUTPUT = settings.OUTPUT

    if args.mountpoint:
        MOUNTPOINT = args.mountpoint.rstrip("/")
    else:
        MOUNTPOINT = settings.MOUNTPOINT

    show_setup(
        COLLECTIONS=COLLECTIONS,
        OUTPUT=OUTPUT,
        CACHE_HOME=settings.CACHE_HOME,
        MOUNTPOINT=MOUNTPOINT,
        JISC_PAPERS_CSV=settings.JISC_PAPERS_CSV,
        REPORT_DIR=settings.REPORT_DIR,
        MAX_ELEMENTS_PER_FILE=settings.MAX_ELEMENTS_PER_FILE,
    )

    if args.show_fixture_tables:
        # Show a table of fixtures used, defaults to DataProvider Table
        show_fixture_tables(settings, data_provider_index=args.data_provider_field)

    if args.export_fixture_tables:
        export_fixtures(
            fixture_tables=settings.FIXTURE_TABLES,
            path=OUTPUT,
            formats=settings.FIXTURE_TABLES_FORMATS,
        )

    if not args.test_config:
        # Routing alto2txt into subdirectories with structured files
        route(
            COLLECTIONS,
            settings.CACHE_HOME,
            MOUNTPOINT,
            settings.JISC_PAPERS_CSV,
            settings.REPORT_DIR,
        )

        # Parsing the resulting JSON files
        parse(
            COLLECTIONS,
            settings.CACHE_HOME,
            OUTPUT,
            settings.MAX_ELEMENTS_PER_FILE,
        )

        clear_cache(settings.CACHE_HOME)
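A brief sketch of driving `run()` programmatically rather than via the `a2t2f-news` entry point (the import path is an assumption; the options mirror the flags defined in `parse_args` above):

```python
from alto2txt2fixture.__main__ import run  # import path is an assumption

# Equivalent to: poetry run a2t2f-news -c hmd -o ./fixture-output --no-show-fixture-tables
run(["-c", "hmd", "-o", "./fixture-output", "--no-show-fixture-tables"])
```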
+
def file_rename_table(
    paths_dict: dict[os.PathLike, os.PathLike],
    compress_format: ArchiveFormatEnum = COMPRESSION_TYPE_DEFAULT,
    title: str = FILE_RENAME_TABLE_TITLE_DEFAULT,
    prefix: str = "",
    renumber: bool = True,
) -> Table:
    """Create a `rich.Table` of rename configuration.

    Args:
        paths_dict: dict[os.PathLike, os.PathLike],
            Original and renumbered `paths` `dict`
        compress_format:
            Which `ArchiveFormatEnum` for compression
        title:
            Title of returned `Table`
        prefix:
            `str` to add in front of every new path
        renumber:
            Whether an `int` in each path will be renumbered.

    """
    table: Table = Table(title=title)
    table.add_column("Current File Name", justify="right", style="cyan")
    table.add_column("New File Name", style="magenta")

    def final_file_name(name: os.PathLike) -> str:
        return (
            prefix
            + str(Path(name).name)
            + (f".{compress_format}" if compress_format else "")
        )

    for old_path, new_path in paths_dict.items():
        name: str = final_file_name(new_path if renumber else old_path)
        table.add_row(Path(old_path).name, name)
    return table
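For instance, a hedged sketch of rendering such a table for two renamed files (the import path is an assumption; the file names are invented for illustration):

```python
from pathlib import Path
from rich.console import Console

from alto2txt2fixture.utils import file_rename_table  # import path is an assumption

paths_dict = {
    Path("plaintext_fixture-1.json"): Path("plaintext_fixture-000001.json"),
    Path("plaintext_fixture-2.json"): Path("plaintext_fixture-000002.json"),
}

# Renders a two-column rich.Table of current vs new (renumbered) file names,
# with the default compression suffix appended to each new name.
Console().print(file_rename_table(paths_dict))
```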
+
Generate a rich Table from a function's signature and help attributes.

Parameters:

| Name         | Type             | Description                                                                                                                                                                                              | Default    |
|--------------|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| `func`       | `Callable`       | Function whose args and type hints will be converted to a table.                                                                                                                                         | *required* |
| `values`     | `dict`           | dict of variables covered in func signature. local() often suffices.                                                                                                                                     | *required* |
| `title`      | `str`            | str for table title.                                                                                                                                                                                      | `''`       |
| `extra_dict` | `dict[str, Any]` | A dict of additional rows to add to the table. For each key, value pair: if the value is a tuple, it will be expanded to match the Type, Value, and Notes columns; else the Type will be inferred and Notes left blank. | `{}`       |

Example

>>> def test_func(
...     var_a: Annotated[str, typer.Option(help="Example")] = "Default"
... ) -> None:
...     test_func_table: Table = func_table(test_func, values=vars())
...     console.print(test_func_table)
>>> if is_platform_win:
...     pytest.skip('fails on certain Windows root paths: issue #56')
>>> test_func()
        test_func config
┏━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ Variable ┃ Type ┃ Value   ┃ Notes   ┃
┡━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ var_a    │ str  │ Default │ Example │
└──────────┴──────┴─────────┴─────────┘
+
@cli.command()
def plaintext(
    path: Annotated[Path, typer.Argument(help="Path to raw plaintext files")],
    save_path: Annotated[
        Path, typer.Option(help="Path to save json export files")
    ] = Path(DEFAULT_PLAINTEXT_FIXTURE_OUTPUT),
    data_provider_code: Annotated[
        str, typer.Option(help="Data provider code use existing config")
    ] = "",
    extract_path: Annotated[
        Path, typer.Option(help="Folder to extract compressed raw plaintext to")
    ] = Path(DEFAULT_EXTRACTED_SUBDIR),
    initial_pk: Annotated[
        int, typer.Option(help="First primary key to increment json export from")
    ] = DEFAULT_INITIAL_PK,
    records_per_json: Annotated[
        int, typer.Option(help="Max records per json fixture")
    ] = DEFAULT_MAX_PLAINTEXT_PER_FIXTURE_FILE,
    digit_padding: Annotated[
        int, typer.Option(help="Padding '0's for indexing json fixture filenames")
    ] = FILE_NAME_0_PADDING_DEFAULT,
    compress: Annotated[bool, typer.Option(help="Compress json fixtures")] = False,
    compress_path: Annotated[
        Path, typer.Option(help="Folder to compress json fixtueres to")
    ] = Path(COMPRESSED_PATH_DEFAULT),
    compress_format: Annotated[
        ArchiveFormatEnum,
        typer.Option(case_sensitive=False, help="Compression format"),
    ] = COMPRESSION_TYPE_DEFAULT,
) -> None:
    """Create a PlainTextFixture and save to `save_path`."""
    plaintext_fixture = PlainTextFixture(
        path=path,
        data_provider_code=data_provider_code,
        extract_subdir=extract_path,
        export_directory=save_path,
        initial_pk=initial_pk,
        max_plaintext_per_fixture_file=records_per_json,
        json_0_file_name_padding=digit_padding,
        json_export_compression_format=compress_format,
        json_export_compression_subdir=compress_path,
    )
    plaintext_fixture.info()
    while (
        not plaintext_fixture.compressed_files
        and not plaintext_fixture.plaintext_provided_uncompressed
    ):
        try_another_compressed_txt_source: bool = Confirm.ask(
            f"No .txt files available from extract path: "
            f"{plaintext_fixture.trunc_extract_path_str}\n"
            f"Would you like to extract fixtures from a different path?",
            default="n",
        )
        if try_another_compressed_txt_source:
            new_extract_path: str = Prompt.ask("Please enter a new extract path")
            plaintext_fixture.path = Path(new_extract_path)
        else:
            return
        plaintext_fixture.info()
    plaintext_fixture.extract_compressed()
    plaintext_fixture.export_to_json_fixtures()
    if compress:
        plaintext_fixture.compress_json_exports()
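The same workflow can be sketched without the CLI wrapper, constructing a `PlainTextFixture` directly with the keyword arguments the command passes through (the import path, folder name and provider code below are illustrative assumptions):

```python
from pathlib import Path

from alto2txt2fixture.plaintext import PlainTextFixture  # import path is an assumption

fixture = PlainTextFixture(
    path=Path("./plaintext-archives"),  # folder of raw or compressed plaintext
    data_provider_code="bl_lwm",        # hypothetical provider code
    initial_pk=1,
    max_plaintext_per_fixture_file=100,
)
fixture.info()                     # print a summary of the configured fixture
fixture.extract_compressed()       # unpack any compressed sources
fixture.export_to_json_fixtures()  # write lwmdb-compatible JSON fixture files
```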
+
rename
rename(
    path: Annotated[Path, typer.Argument(help="Path to files to manage")],
    folder: Annotated[
        Path, typer.Option(help="Path under `path` for new files")
    ] = Path(),
    renumber: Annotated[
        bool, typer.Option(help="Show changes without applying")
    ] = False,
    regex: Annotated[str, typer.Option(help="Regex to filter files")] = "*.txt",
    padding: Annotated[
        int, typer.Option(help="Digits to pad file name")
    ] = FILE_NAME_0_PADDING_DEFAULT,
    prefix: Annotated[str, typer.Option(help="Prefix for new file names")] = "",
    dry_run: Annotated[
        bool, typer.Option(help="Show changes without applying")
    ] = True,
    compress: Annotated[
        bool, typer.Option(help="Whether to compress files")
    ] = False,
    compress_format: Annotated[
        ArchiveFormatEnum,
        typer.Option(case_sensitive=False, help="Compression format"),
    ] = COMPRESSION_TYPE_DEFAULT,
    compress_suffix: Annotated[
        str, typer.Option(help="Compressed file name suffix")
    ] = "",
    compress_folder: Annotated[
        Path, typer.Option(help="Optional folder to differ from renaming")
    ] = COMPRESSED_PATH_DEFAULT,
    delete_uncompressed: Annotated[
        bool, typer.Option(help="Delete unneeded files after compression")
    ] = False,
    log_level: Annotated[
        int, typer.Option(help="Set logging level for debugging")
    ] = WARNING,
    force: Annotated[
        bool, typer.Option("--force", help="Force run without prompt")
    ] = False,
) -> None
+
Note: it is possible for the example test to fail at different screen sizes. Try
increasing the window or screen width of the terminal used to check before
raising an issue.
def show_fixture_tables(
    run_settings: dotdict = settings,
    print_in_call: bool = True,
    data_provider_index: str = DATA_PROVIDER_INDEX,
) -> list[Table]:
    """Print fixture tables specified in ``settings.fixture_tables`` in `rich.Table` format.

    Arguments:
        run_settings: `alto2txt2fixture` run configuration
        print_in_call: whether to print to console (will use ``console`` variable if so)
        data_provider_index: key to index `dataprovider` from ``NEWSPAPER_COLLECTION_METADATA``

    Returns:
        A `list` of `rich.Table` renders from configurations in ``run_settings.FIXTURE_TABLES``

    Example:
        ```pycon
        >>> fixture_tables: list[Table] = show_fixture_tables(
        ...     settings,
        ...     print_in_call=False)
        >>> len(fixture_tables)
        1
        >>> fixture_tables[0].title
        'dataprovider'
        >>> [column.header for column in fixture_tables[0].columns]
        ['pk', 'name', 'code', 'legacy_code', 'collection', 'source_note']
        >>> fixture_tables = show_fixture_tables(settings)
        <BLANKLINE>
        ...dataprovider...Heritage...│ bl_hmd...│ hmd...

        ```

    Note:
        It is possible for the example test to fail in different screen sizes. Try
        increasing the window or screen width of terminal used to check before
        raising an issue.
    """
    if run_settings.FIXTURE_TABLES:
        if "dataprovider" in run_settings.FIXTURE_TABLES:
            check_newspaper_collection_configuration(
                run_settings.COLLECTIONS,
                run_settings.FIXTURE_TABLES["dataprovider"],
                data_provider_index=data_provider_index,
            )
        console_tables: list[Table] = list(
            gen_fixture_tables(run_settings.FIXTURE_TABLES)
        )
        if print_in_call:
            for console_table in console_tables:
                console.print(console_table)
        return console_tables
    else:
        return []
+
def correct_dict(o: dict) -> list:
    """Returns a list with corrected data from a provided dictionary."""
    return [(k, v[0], v[1]) for k, v in o.items() if not v[0].startswith("Q")] + [
        (k, v[1], v[0]) for k, v in o.items() if v[0].startswith("Q")
    ]
+
def get_list(x):
    """Get a list from a string, which contains <SEP> as separator. If no
    string is encountered, the function returns an empty list."""
    return x.split("<SEP>") if isinstance(x, str) else []
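A quick illustration of the two helpers defined above (the input values are made up for the example):

```python
# Splits a "<SEP>"-delimited string into a list; non-strings yield an empty list
print(get_list("The Times<SEP>The Sun"))  # ['The Times', 'The Sun']
print(get_list(None))                     # []

# correct_dict reorders (key, value, value) tuples so Wikidata-style "Q…" codes
# come last in each tuple; entries whose first value starts with "Q" are swapped.
print(correct_dict({"a": ("Q123", "London"), "b": ("Leeds", "Q987")}))
# [('b', 'Leeds', 'Q987'), ('a', 'London', 'Q123')]
```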
+
def get_outpaths_dict(names: Sequence[str], module_name: str) -> TableOutputConfigType:
    """Return a `dict` of `csv` and `json` paths for each `module_name` table.

    The `csv` and `json` paths

    Args:
        names: iterable of names of each `module_name`'s component. Main target is `csv` and `json` table names
        module_name: name of module each name is part of, that is added as a prefix

    Returns:
        A ``TableOutputConfigType``: a `dict` of table ``names`` and output
        `csv` and `json` filenames.

    Example:
        ```pycon
        >>> pprint(get_outpaths_dict(MITCHELLS_TABELS, "mitchells"))
        {'Entry': {'csv': 'mitchells.Entry.csv', 'json': 'mitchells.Entry.json'},
         'Issue': {'csv': 'mitchells.Issue.csv', 'json': 'mitchells.Issue.json'},
         'PoliticalLeaning': {'csv': 'mitchells.PoliticalLeaning.csv',
                              'json': 'mitchells.PoliticalLeaning.json'},
         'Price': {'csv': 'mitchells.Price.csv', 'json': 'mitchells.Price.json'}}

        ```
    """
    return {
        name: OutputPathDict(
            csv=f"{module_name}.{name}.csv",
            json=f"{module_name}.{name}.json",
        )
        for name in names
    }
+
Takes an input_sub_path, a publication_code, and an (optional)
abbreviation for any newspaper to locate the title in the
jisc_papers DataFrame. jisc_papers is usually loaded via the
setup_jisc_papers function.

Parameters:

| Name               | Type           | Description                                        | Default    |
|--------------------|----------------|----------------------------------------------------|------------|
| `title`            | `str`          | target newspaper title                             | *required* |
| `issue_date`       | `str`          | target newspaper issue_date                        | *required* |
| `jisc_papers`      | `pd.DataFrame` | DataFrame of jisc_papers to match                  | *required* |
| `input_sub_path`   | `str`          | path of files to narrow down query input_sub_path  | *required* |
| `publication_code` | `str`          | unique codes to match newspaper records            | *required* |
| `abbr`             | `str \| None`  | an optional abbreviation of the newspaper title    | `None`     |

Returns:

| Type  | Description                                              |
|-------|----------------------------------------------------------|
| `str` | Matched title str or abbr.                               |
| `str` | A string estimating the JISC equivalent newspaper title  |
def get_jisc_title(
    title: str,
    issue_date: str,
    jisc_papers: pd.DataFrame,
    input_sub_path: str,
    publication_code: str,
    abbr: str | None = None,
) -> str:
    """
    Match a newspaper ``title`` with ``jisc_papers`` records.

    Takes an ``input_sub_path``, a ``publication_code``, and an (optional)
    abbreviation for any newspaper to locate the ``title`` in the
    ``jisc_papers`` `DataFrame`. ``jisc_papers`` is usually loaded via the
    ``setup_jisc_papers`` function.

    Args:
        title: target newspaper title
        issue_date: target newspaper issue_date
        jisc_papers: `DataFrame` of `jisc_papers` to match
        input_sub_path: path of files to narrow down query input_sub_path
        publication_code: unique codes to match newspaper records
        abbr: an optional abbreviation of the newspaper title

    Returns:
        Matched ``title`` `str` or ``abbr``.


    Returns:
        A string estimating the JISC equivalent newspaper title
    """

    # First option, search the input_sub_path for a valid-looking publication_code
    g = PUBLICATION_CODE.findall(input_sub_path)

    if len(g) == 1:
        publication_code = g[0]
        # Let's see if we can find title:
        title = (
            jisc_papers[
                jisc_papers.publication_code == publication_code
            ].title.to_list()[0]
            if jisc_papers[
                jisc_papers.publication_code == publication_code
            ].title.count()
            == 1
            else title
        )
        return title

    # Second option, look through JISC papers for best match (on publication_code if we have it, but abbr more importantly if we have it)
    if abbr:
        _publication_code = publication_code
        publication_code = abbr

    if jisc_papers.abbr[jisc_papers.abbr == publication_code].count():
        date = datetime.strptime(issue_date, "%Y-%m-%d")
        mask = (
            (jisc_papers.abbr == publication_code)
            & (date >= jisc_papers.start_date)
            & (date <= jisc_papers.end_date)
        )
        filtered = jisc_papers.loc[mask]
        if filtered.publication_code.count() == 1:
            publication_code = filtered.publication_code.to_list()[0]
            title = filtered.title.to_list()[0]
            return title

    # Last option: let's find all the possible titles in the jisc_papers for the abbreviation, and if it's just one unique title, let's pick it!
    if abbr:
        test = list({x for x in jisc_papers[jisc_papers.abbr == abbr].title})
        if len(test) == 1:
            return test[0]
        else:
            mask1 = (jisc_papers.abbr == publication_code) & (
                jisc_papers.publication_code == _publication_code
            )
            test1 = jisc_papers.loc[mask1]
            test1 = list({x for x in jisc_papers[jisc_papers.abbr == abbr].title})
            if len(test) == 1:
                return test1[0]

    # Fallback: if abbreviation is set, we'll return that:
    if abbr:
        # For these exceptions, see issue comment:
        # https://github.com/alan-turing-institute/Living-with-Machines/issues/2453#issuecomment-1050652587
        if abbr == "IPJL":
            return "Ipswich Journal"
        elif abbr == "BHCH":
            return "Bath Chronicle"
        elif abbr == "LSIR":
            return "Leeds Intelligencer"
        elif abbr == "AGER":
            return "Lancaster Gazetter, And General Advertiser For Lancashire West"

        return abbr

    raise RuntimeError(f"Title {title} could not be found.")
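A minimal sketch of calling `get_jisc_title` with a tiny, made-up `jisc_papers` frame (the column names mirror those the function queries above; the row content and path are invented purely for illustration):

```python
import pandas as pd

# Columns follow those referenced in get_jisc_title: publication_code, title,
# abbr, start_date and end_date. The values below are illustrative only.
jisc_papers = pd.DataFrame(
    {
        "publication_code": ["0000038"],
        "title": ["The Example Gazette"],
        "abbr": ["EXGZ"],
        "start_date": [pd.Timestamp("1800-01-01")],
        "end_date": [pd.Timestamp("1900-12-31")],
    }
)

title = get_jisc_title(
    title="Unknown Title",
    issue_date="1850-06-01",
    jisc_papers=jisc_papers,
    input_sub_path="EXGZ/issues",  # no numeric publication code in the path
    publication_code="",
    abbr="EXGZ",
)
print(title)  # 'The Example Gazette'
```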
+
Generates fixtures for a specified model using a list of files.

This function takes a list of files and generates fixtures for a specified
model. The fixtures can be used to populate a database or perform other
data-related operations.

Parameters:

| Name        | Type   | Description                                                                                                                                                                                                               | Default |
|-------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| `filelist`  | `list` | A list of files to process and generate fixtures from.                                                                                                                                                                       | `[]`    |
| `model`     | `str`  | The name of the model for which fixtures are generated.                                                                                                                                                                      | `''`    |
| `translate` | `dict` | A nested dictionary representing the translation mapping for fields. The translated fields will be used as keys, and their corresponding primary keys (obtained from the provided files) will be used as values in the generated fixtures. | `{}`    |
| `rename`    | `dict` | A nested dictionary representing the field renaming mapping, e.g. `{'part1': {'part2': 'new_field_name'}}`. The fields specified in the dictionary will be renamed to the provided new field names in the generated fixtures. | `{}`    |
| `uniq_keys` | `list` | A list of fields that need to be considered for uniqueness in the fixtures. If specified, the fixtures will yield only unique items based on the combination of these fields.                                               | `[]`    |
def fixtures(
    filelist: list = [],
    model: str = "",
    translate: dict = {},
    rename: dict = {},
    uniq_keys: list = [],
) -> Generator[FixtureDict, None, None]:
    """
    Generates fixtures for a specified model using a list of files.

    This function takes a list of files and generates fixtures for a specified
    model. The fixtures can be used to populate a database or perform other
    data-related operations.

    Args:
        filelist: A list of files to process and generate fixtures from.
        model: The name of the model for which fixtures are generated.
        translate: A nested dictionary representing the translation mapping
            for fields. The structure of the translator follows the format:
            ```python
            {
                'part1': {
                    'part2': {
                        'translated_field': 'pk'
                    }
                }
            }
            ```
            The translated fields will be used as keys, and their
            corresponding primary keys (obtained from the provided files) will
            be used as values in the generated fixtures.
        rename: A nested dictionary representing the field renaming
            mapping. The structure of the dictionary follows the format:
            ```python
            {
                'part1': {
                    'part2': 'new_field_name'
                }
            }
            ```
            The fields specified in the dictionary will be renamed to the
            provided new field names in the generated fixtures.
        uniq_keys: A list of fields that need to be considered for
            uniqueness in the fixtures. If specified, the fixtures will yield
            only unique items based on the combination of these fields.

    Yields:
        `FixtureDict` from ``model``, ``pk`` and `dict` of ``fields``.

    Returns:
        This function generates fixtures but does not return any value.
    """

    filelist = sorted(filelist, key=lambda x: str(x).split("/")[:-1])
    count = len(filelist)

    # Process JSONL
    if [x for x in filelist if ".jsonl" in x.name]:
        pk = 0
        # In the future, we might want to show progress here (tqdm or suchlike)
        for file in filelist:
            for line in file.read_text().splitlines():
                pk += 1
                line = json.loads(line)
                yield FixtureDict(
                    pk=pk,
                    model=model,
                    fields=dict(**get_fields(line, translate=translate, rename=rename)),
                )

        return
    else:
        # Process JSON
        pks = [x for x in range(1, count + 1)]

        if len(uniq_keys):
            uniq_files = list(uniq(filelist, uniq_keys))
            count = len(uniq_files)
            zipped = zip(uniq_files, pks)
        else:
            zipped = zip(filelist, pks)

        for x in tqdm(
            zipped, total=count, desc=f"{model} ({count:,} objs)", leave=False
        ):
            yield FixtureDict(
                pk=x[1],
                model=model,
                fields=dict(**get_fields(x[0], translate=translate, rename=rename)),
            )

        return
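As a hedged sketch of how `fixtures` is typically consumed (the model label and `uniq_keys` mirror the calls made in `parse` below; the cache path and glob pattern are illustrative assumptions):

```python
from pathlib import Path

# Generate DataProvider fixtures from cached JSON files, de-duplicated on "name".
data_provider_files = list(Path("./cache/hmd").glob("**/data-provider/*.json"))

data_provider_fixtures = list(
    fixtures(
        model="newspapers.dataprovider",
        filelist=data_provider_files,
        uniq_keys=["name"],
    )
)
# Each yielded item is a FixtureDict: {"pk": ..., "model": ..., "fields": {...}}
print(len(data_provider_fixtures))
```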
+
Retrieves fields from a file and performs modifications and checks.

This function takes a file (in various formats: Path, str, or dict)
and processes its fields. It retrieves the fields from the file and
performs modifications, translations, and checks on the fields.

Parameters:

| Name         | Type                     | Description                                                                                                                                                                                                          | Default    |
|--------------|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
| `file`       | `Union[Path, str, dict]` | The file from which the fields are retrieved.                                                                                                                                                                        | *required* |
| `translate`  | `dict`                   | A nested dictionary representing the translation mapping for fields. The translated fields will be used to replace the original fields in the retrieved fields.                                                     | `{}`       |
| `rename`     | `dict`                   | A nested dictionary representing the field renaming mapping, e.g. `{'part1': {'part2': 'new_field_name'}}`. The fields specified in the dictionary will be renamed to the provided new field names in the retrieved fields. | `{}`       |
| `allow_null` | `bool`                   | Determines whether to allow None values for relational fields. If set to True, relational fields with missing values will be assigned None. If set to False, an error will be raised.                               | `False`    |

Returns:

| Type   | Description                                                                                          |
|--------|------------------------------------------------------------------------------------------------------|
| `dict` | A dictionary representing the retrieved fields from the file, with modifications and checks applied.  |

Raises:

| Type           | Description                                                                                  |
|----------------|----------------------------------------------------------------------------------------------|
| `RuntimeError` | If the file type is unsupported or if an error occurs during field retrieval or processing.  |
def get_fields(
    file: Union[Path, str, dict],
    translate: dict = {},
    rename: dict = {},
    allow_null: bool = False,
) -> dict:
    """
    Retrieves fields from a file and performs modifications and checks.

    This function takes a file (in various formats: `Path`, `str`, or `dict`)
    and processes its fields. It retrieves the fields from the file and
    performs modifications, translations, and checks on the fields.

    Args:
        file: The file from which the fields are retrieved.
        translate: A nested dictionary representing the translation mapping
            for fields. The structure of the translator follows the format:
            ```python
            {
                'part1': {
                    'part2': {
                        'translated_field': 'pk'
                    }
                }
            }
            ```
            The translated fields will be used to replace the original fields
            in the retrieved fields.
        rename: A nested dictionary representing the field renaming
            mapping. The structure of the dictionary follows the format:
            ```python
            {
                'part1': {
                    'part2': 'new_field_name'
                }
            }
            ```
            The fields specified in the dictionary will be renamed to the
            provided new field names in the retrieved fields.
        allow_null: Determines whether to allow ``None`` values for
            relational fields. If set to ``True``, relational fields with
            missing values will be assigned ``None``. If set to ``False``, an
            error will be raised.

    Returns:
        A dictionary representing the retrieved fields from the file,
        with modifications and checks applied.

    Raises:
        RuntimeError: If the file type is unsupported or if an error occurs
            during field retrieval or processing.
    """
    if isinstance(file, Path):
        try:
            fields = json.loads(file.read_text())
        except Exception as e:
            raise RuntimeError(f"Cannot interpret JSON ({e}): {file}")
    elif isinstance(file, str):
        if "\n" in file:
            raise RuntimeError("File has multiple lines.")
        try:
            fields = json.loads(file)
        except json.decoder.JSONDecodeError as e:
            raise RuntimeError(f"Cannot interpret JSON ({e}): {file}")
    elif isinstance(file, dict):
        fields = file
    else:
        raise RuntimeError(f"Cannot process type {type(file)}.")

    # Fix relational fields for any file
    for key in [key for key in fields.keys() if "__" in key]:
        parts = key.split("__")

        try:
            before = fields[key]
            if before:
                before = before.replace("---", "/")
                loc = translate.get(parts[0], {}).get(parts[1], {})
                fields[key] = loc.get(before)
                if fields[key] is None:
                    raise RuntimeError(
                        f"Cannot translate fields.{key} from {before}: {loc}"
                    )

        except AttributeError:
            if allow_null:
                fields[key] = None
            else:
                print(
                    "Content had relational fields, but something went wrong in parsing the data:"
                )
                print("file", file)
                print("fields", fields)
                print("KEY:", key)
                raise RuntimeError()

        new_name = rename.get(parts[0], {}).get(parts[1], None)
        if new_name:
            fields[new_name] = fields[key]
            del fields[key]

    fields["created_at"] = NOW_str
    fields["updated_at"] = NOW_str

    try:
        fields["item_type"] = str(fields["item_type"]).upper()
    except KeyError:
        pass

    try:
        if fields["ocr_quality_mean"] == "":
            fields["ocr_quality_mean"] = 0
    except KeyError:
        pass

    try:
        if fields["ocr_quality_sd"] == "":
            fields["ocr_quality_sd"] = 0
    except KeyError:
        pass

    return fields
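A small sketch of the translate/rename mechanics on a plain `dict` input (the field names and code value are invented; the double-underscore key marks a relational field, as in the loop above):

```python
# A record with one relational field, publication__publication_code.
record = {"title": "Example Issue", "publication__publication_code": "0000038"}

# translate maps the raw code to the primary key of an already-built fixture;
# rename turns the relational key into the lwmdb-style *_id field name.
translate = {"publication": {"publication_code": {"0000038": 1}}}
rename = {"publication": {"publication_code": "newspaper_id"}}

fields = get_fields(record, translate=translate, rename=rename)
print(fields["newspaper_id"])  # 1  (plus created_at/updated_at timestamps added)
```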
+
get_key_from

get_key_from(item: Path, x: str) -> str

Retrieves a specific key from a file and returns its value.

This function reads a file and extracts the value of a specified
key. If the key is not found or an error occurs while processing
the file, a warning is printed, and an empty string is returned.
def get_key_from(item: Path, x: str) -> str:
    """
    Retrieves a specific key from a file and returns its value.

    This function reads a file and extracts the value of a specified
    key. If the key is not found or an error occurs while processing
    the file, a warning is printed, and an empty string is returned.

    Args:
        item: The file from which the key is extracted.
        x: The key to be retrieved from the file.

    Returns:
        The value of the specified key from the file.
    """
    result = json.loads(item.read_text()).get(x, None)
    if not result:
        print(f"[WARN] Could not find key {x} in {item}")
        result = ""
    return result
+
Parses files from collections and generates fixtures for various models.

This function processes files from the specified collections and generates
fixtures for different models, such as newspapers.dataprovider,
newspapers.ingest, newspapers.digitisation, newspapers.newspaper,
newspapers.issue, and newspapers.item.

It performs various steps, such as file listing, fixture generation,
translation mapping, renaming fields, and saving fixtures to files.

Parameters:

| Name                    | Type   | Description                                                                       | Default    |
|-------------------------|--------|-----------------------------------------------------------------------------------|------------|
| `collections`           | `list` | A list of collections from which files are processed and fixtures are generated. | *required* |
| `cache_home`            | `str`  | The directory path where the collections are located.                            | *required* |
| `output`                | `str`  | The directory path where the fixtures will be saved.                             | *required* |
| `max_elements_per_file` | `int`  | The maximum number of elements per file when saving fixtures.                    | *required* |

Returns:

| Type   | Description                                                      |
|--------|------------------------------------------------------------------|
| `None` | This function generates fixtures but does not return any value.  |
def parse(
    collections: list, cache_home: str, output: str, max_elements_per_file: int
) -> None:
    """
    Parses files from collections and generates fixtures for various models.

    This function processes files from the specified collections and generates
    fixtures for different models, such as `newspapers.dataprovider`,
    `newspapers.ingest`, `newspapers.digitisation`, `newspapers.newspaper`,
    `newspapers.issue`, and `newspapers.item`.

    It performs various steps, such as file listing, fixture generation,
    translation mapping, renaming fields, and saving fixtures to files.

    Args:
        collections: A list of collections from which files are
            processed and fixtures are generated.
        cache_home: The directory path where the collections are located.
        output: The directory path where the fixtures will be saved.
        max_elements_per_file: The maximum number of elements per file
            when saving fixtures.

    Returns:
        This function generates fixtures but does not return any value.
    """
    global CACHE_HOME
    global OUTPUT
    global MAX_ELEMENTS_PER_FILE

    CACHE_HOME = cache_home
    OUTPUT = output
    MAX_ELEMENTS_PER_FILE = max_elements_per_file

    # Set up output directory
    reset_fixture_dir(OUTPUT)

    # Get file lists
    print("\nGetting file lists...")

    def issues_in_x(x):
        return "issues" in str(x.parent).split("/")

    def newspapers_in_x(x):
        return not any(
            [
                condition
                for y in str(x.parent).split("/")
                for condition in [
                    "issues" in y,
                    "ingest" in y,
                    "digitisation" in y,
                    "data-provider" in y,
                ]
            ]
        )

    all_json = [
        x for y in collections for x in (Path(CACHE_HOME) / y).glob("**/*.json")
    ]
    all_jsonl = [
        x for y in collections for x in (Path(CACHE_HOME) / y).glob("**/*.jsonl")
    ]
    print(f"--> {len(all_json):,} JSON files altogether")
    print(f"--> {len(all_jsonl):,} JSONL files altogether")

    print("\nSetting up fixtures...")

    # Process data providers
    def data_provider_in_x(x):
        return "data-provider" in str(x.parent).split("/")

    data_provider_json = list(
        fixtures(
            model="newspapers.dataprovider",
            filelist=[x for x in all_json if data_provider_in_x(x)],
            uniq_keys=["name"],
        )
    )
    print(f"--> {len(data_provider_json):,} DataProvider fixtures")

    # Process ingest
    def ingest_in_x(x):
        return "ingest" in str(x.parent).split("/")

    ingest_json = list(
        fixtures(
            model="newspapers.ingest",
            filelist=[x for x in all_json if ingest_in_x(x)],
            uniq_keys=["lwm_tool_name", "lwm_tool_version"],
        )
    )
    print(f"--> {len(ingest_json):,} Ingest fixtures")

    # Process digitisation
    def digitisation_in_x(x):
        return "digitisation" in str(x.parent).split("/")

    digitisation_json = list(
        fixtures(
            model="newspapers.digitisation",
            filelist=[x for x in all_json if digitisation_in_x(x)],
            uniq_keys=["software"],
        )
    )
    print(f"--> {len(digitisation_json):,} Digitisation fixtures")

    # Process newspapers
    newspaper_json = list(
        fixtures(
            model="newspapers.newspaper",
            filelist=[file for file in all_json if newspapers_in_x(file)],
        )
    )
    print(f"--> {len(newspaper_json):,} Newspaper fixtures")

    # Process issue
    translate = get_translator(
        [
            TranslatorTuple(
                "publication__publication_code", "publication_code", newspaper_json
            )
        ]
    )
    rename = {"publication": {"publication_code": "newspaper_id"}}

    issue_json = list(
        fixtures(
            model="newspapers.issue",
            filelist=[file for file in all_json if issues_in_x(file)],
            translate=translate,
            rename=rename,
        )
    )
    print(f"--> {len(issue_json):,} Issue fixtures")

    # Create translator/clear up memory before processing items
    translate = get_translator(
        [
            ("issue__issue_identifier", "issue_code", issue_json),
            ("digitisation__software", "software", digitisation_json),
            ("data_provider__name", "name", data_provider_json),
            (
                "ingest__lwm_tool_identifier",
                ["lwm_tool_name", "lwm_tool_version"],
                ingest_json,
            ),
        ]
    )

    rename = {
        "issue": {"issue_identifier": "issue_id"},
        "digitisation": {"software": "digitisation_id"},
        "data_provider": {"name": "data_provider_id"},
        "ingest": {"lwm_tool_identifier": "ingest_id"},
    }

    save_fixture(newspaper_json, "Newspaper")
    save_fixture(issue_json, "Issue")

    del newspaper_json
    del issue_json
    gc.collect()

    print("\nSaving...")

    save_fixture(digitisation_json, "Digitisation")
    save_fixture(ingest_json, "Ingest")
    save_fixture(data_provider_json, "DataProvider")

    # Process items
    item_json = fixtures(
        model="newspapers.item",
        filelist=all_jsonl,
        translate=translate,
        rename=rename,
    )
    save_fixture(item_json, "Item")

    return
+
reset_fixture_dir

reset_fixture_dir(output: str | Path) -> None

Resets the fixture directory by removing all JSON files inside it.

This function takes a directory path (output) as input and removes all
JSON files within the directory.

Prior to removal, it prompts the user for confirmation to proceed. If the
user confirms, the function clears the fixture directory by deleting the
JSON files.

Parameters:

| Name     | Type          | Description                                               | Default    |
|----------|---------------|-----------------------------------------------------------|------------|
| `output` | `str \| Path` | The directory path of the fixture directory to be reset.  | *required* |

Raises:

| Type           | Description                                            |
|----------------|----------------------------------------------------------|
| `RuntimeError` | If the output directory is not specified as a string.   |
def reset_fixture_dir(output: str | Path) -> None:
    """
    Resets the fixture directory by removing all JSON files inside it.

    This function takes a directory path (``output``) as input and removes all
    JSON files within the directory.

    Prior to removal, it prompts the user for confirmation to proceed. If the
    user confirms, the function clears the fixture directory by deleting the
    JSON files.

    Args:
        output: The directory path of the fixture directory to be reset.

    Raises:
        RuntimeError: If the ``output`` directory is not specified as a string.
    """

    if not isinstance(output, str):
        raise RuntimeError("`output` directory needs to be specified as a string.")

    output = Path(output)

    y = input(
        f"This command will automatically empty the fixture directory ({output.absolute()}). "
        "Do you want to proceed? [y/N]"
    )

    if not y.lower() == "y":
        output.mkdir(parents=True, exist_ok=True)
        return

    print("\nClearing up the fixture directory")

    # Ensure directory exists
    output.mkdir(parents=True, exist_ok=True)

    # Drop all JSON files
    [x.unlink() for x in Path(output).glob("*.json")]

    return
+
Generates unique items from a list of files based on specified keys.

This function takes a list of files and yields unique items based on a
combination of keys. The keys are extracted from each file using the
get_key_from function, and duplicate items are ignored.

Parameters:

| Name       | Type   | Description                                                                                                                | Default    |
|------------|--------|------------------------------------------------------------------------------------------------------------------------------|------------|
| `filelist` | `list` | A list of files from which unique items are generated.                                                                    | *required* |
| `keys`     | `list` | A list of keys used for uniqueness. Each key specifies a field to be used for uniqueness checking in the generated items. | `[]`       |
def uniq(filelist: list, keys: list = []) -> Generator[Any, None, None]:
    """
    Generates unique items from a list of files based on specified keys.

    This function takes a list of files and yields unique items based on a
    combination of keys. The keys are extracted from each file using the
    ``get_key_from`` function, and duplicate items are ignored.

    Args:
        filelist: A list of files from which unique items are
            generated.
        keys: A list of keys used for uniqueness. Each key specifies
            a field to be used for uniqueness checking in the generated
            items.

    Yields:
        A unique item from `filelist`.
    """

    seen = set()
    for item in filelist:
        key = "-".join([get_key_from(item, x) for x in keys])

        if key not in seen:
            seen.add(key)
            yield item
        else:
            # Drop it if duplicate
            pass
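A short sketch of the de-duplication behaviour, using small JSON files written to a temporary directory (the record contents are purely illustrative):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    # Two files share the same "name"; only the first of those should be yielded.
    records = [{"name": "bl_hmd"}, {"name": "bl_hmd"}, {"name": "bl_lwm"}]
    paths = []
    for i, record in enumerate(records):
        p = Path(tmp) / f"{i}.json"
        p.write_text(json.dumps(record))
        paths.append(p)

    unique_paths = list(uniq(paths, keys=["name"]))
    print(len(unique_paths))  # 2
```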
+
the fulltext app has a fulltext model class specified in
lwmdb.fulltext.models.fulltext. A SQL table is generated from
that fulltext class, and the json fixture structure generated
from this class is where records will be stored.

| Name                  | Type       | Description                                            |
|-----------------------|------------|----------------------------------------------------------|
| `extract_subdir`      | `PathLike` | Folder to extract self.compressed_files to.            |
| `plaintext_extension` | `str`      | What file extension to use to filter plaintext files.  |
def __str__(self) -> str:
    """Return class name with count and `DataProvider` if available."""
    return (
        f"{type(self).__name__} "
        f"for {len(self)} "
        f"{self._data_provider_code_quoted_with_trailing_space}files"
    )
+
The Archive class represents a zip archive of XML files. The class is used
to extract information from a ZIP archive, and it contains several methods
to process the data contained in the archive.

open(Archive) context manager

Archive can be opened with a context manager, which creates a meta
object, with timings for the object. When closed, it will save the
meta JSON to the correct paths.

Attributes:

| Name            | Type                                | Description                                                                  |
|-----------------|-------------------------------------|--------------------------------------------------------------------------------|
| `path`          | `Path`                              | The path to the zip archive.                                                 |
| `collection`    | `str`                               | The collection of the XML files in the archive. Default is "".              |
| `report`        | `Path`                              | The file path of the report file for the archive.                           |
| `report_id`     | `str`                               | The report ID for the archive. If not provided, a random UUID is generated. |
| `report_parent` | `Path`                              | The parent directory of the report file for the archive.                    |
| `jisc_papers`   | `pd.DataFrame`                      | A DataFrame of JISC papers.                                                  |
| `size`          | `str \| float`                      | The size of the archive, in human-readable format.                          |
| `size_raw`      | `str \| float`                      | The raw size of the archive, in bytes.                                       |
| `roots`         | `Generator[ET.Element, None, None]` | The root elements of the XML documents contained in the archive.            |
def __len__(self):
    """The number of files inside the zip archive."""
    return len(self.filelist)
+
get_documents

get_documents() -> Generator[Document, None, None]

A generator that yields instances of the Document class for each XML
file in the ZIP archive.

It uses the tqdm library to display a progress bar in the terminal
while it is running.

If the contents of the ZIP file are not empty, the method creates an
instance of the Document class by passing the root element of the XML
file, the collection name, meta information about the archive, and the
JISC papers data frame (if provided) to the constructor of the
Document class. The instance of the Document class is then
returned by the generator.
def get_documents(self) -> Generator[Document, None, None]:
    """
    A generator that yields instances of the Document class for each XML
    file in the ZIP archive.

    It uses the `tqdm` library to display a progress bar in the terminal
    while it is running.

    If the contents of the ZIP file are not empty, the method creates an
    instance of the ``Document`` class by passing the root element of the XML
    file, the collection name, meta information about the archive, and the
    JISC papers data frame (if provided) to the constructor of the
    ``Document`` class. The instance of the ``Document`` class is then
    returned by the generator.

    Yields:
        ``Document`` class instance for each unzipped `XML` file.
    """
    for xml_file in tqdm(
        self.filelist,
        desc=f"{Path(self.zip_file.filename).stem} ({self.meta.size})",
        leave=False,
        colour="green",
    ):
        with self.zip_file.open(xml_file) as f:
            xml = f.read()
            if xml:
                yield Document(
                    root=ET.fromstring(xml),
                    collection=self.collection,
                    meta=self.meta,
                    jisc_papers=self.jisc_papers,
                )
+
get_roots

get_roots() -> Generator[ET.Element, None, None]

Yields the root elements of the XML documents contained in the archive.
def get_roots(self) -> Generator[ET.Element, None, None]:
    """
    Yields the root elements of the XML documents contained in the archive.
    """
    for xml_file in tqdm(self.filelist, leave=False, colour="blue"):
        with self.zip_file.open(xml_file) as f:
            xml = f.read()
            if xml:
                yield ET.fromstring(xml)
+
Cache

Cache()

The Cache class provides a blueprint for creating and managing cache data.
The class has several methods that help in getting the cache path,
converting the data to a dictionary, and writing the cache data to a file.

It is inherited by many other classes in this document.
def as_dict(self) -> dict:
    """
    Converts the cache data to a dictionary and returns it.
    """
    return {}
+
get_cache_path

get_cache_path() -> Path

Returns the cache path, which is used to store the cache data.
The path is normally constructed using some of the object's
properties (collection, kind, and id) but can be changed when
inherited.
def get_cache_path(self) -> Path:
    """
    Returns the cache path, which is used to store the cache data.
    The path is normally constructed using some of the object's
    properties (collection, kind, and id) but can be changed when
    inherited.
    """
    return Path(f"{CACHE_HOME}/{self.collection}/{self.kind}/{self.id}.json")
+
Writes the cache data to a file at the specified cache path. The cache
data is first converted to a dictionary using the as_dict method. If
the cache path already exists, the function returns True.

def write_to_cache(self, json_indent: int = JSON_INDENT) -> Optional[bool]:
    """
    Writes the cache data to a file at the specified cache path. The cache
    data is first converted to a dictionary using the as_dict method. If
    the cache path already exists, the function returns True.
    """

    path = self.get_cache_path()

    try:
        if path.exists():
            return True
    except AttributeError:
        error(
            f"Error occurred when getting cache path for "
            f"{self.kind}: {path}. It was not of expected "
            f"type Path but of type {type(path)}:",
        )

    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w+") as f:
        f.write(json.dumps(self.as_dict(), indent=json_indent))

    return
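A minimal sketch of the pattern the Cache subclasses follow, assuming the interface described above (`as_dict`, `get_cache_path`, `write_to_cache`); the class and field names here are invented for illustration:

```python
class ExampleRecord(Cache):
    """Hypothetical Cache subclass caching a single record as JSON."""

    kind = "example"

    def __init__(self, collection: str, id: str, note: str):
        self.collection = collection
        self.id = id
        self.note = note

    def as_dict(self) -> dict:
        # Payload written by the inherited write_to_cache()
        return {"id": self.id, "note": self.note}


record = ExampleRecord(collection="hmd", id="0001", note="demo")
# Writes {CACHE_HOME}/hmd/example/0001.json unless it already exists
record.write_to_cache()
```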
+
A Collection represents a group of newspaper archives from any passed
alto2txt metadata output.

A Collection is initialised with a name and an optional pandas DataFrame
of JISC papers. The archives property returns an iterable of the
Archive objects within the collection.

The DataProvider class extends the Cache class and represents a newspaper
data provider. The class has several properties and methods that allow
creation of a data provider object and the manipulation of its data.

Attributes:

| Name         | Type  | Description                                           |
|--------------|-------|-------------------------------------------------------|
| `collection` | `str` | A string representing publication collection          |
| `kind`       | `str` | Indication of object type, defaults to data-provider  |
The Digitisation class extends the Cache class and represents a newspaper
digitisation. The class has several properties and methods that allow
creation of a digitisation object and the manipulation of its data.

Attributes:

| Name         | Type         | Description                                                  |
|--------------|--------------|--------------------------------------------------------------|
| `root`       | `ET.Element` | An xml element that represents the root of the publication   |
| `collection` | `str`        | A string that represents the collection of the publication   |

def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the digitisation
    object.

    Returns:
        Dictionary representation of the Digitising object
    """
    dic = {
        x.tag: x.text or ""
        for x in self.root.findall("./process/*")
        if x.tag
        in [
            "xml_flavour",
            "software",
            "mets_namespace",
            "alto_namespace",
        ]
    }
    if not dic.get("software"):
        return {}

    return dic
+
Document

Document(*args, **kwargs)

The Document class is a representation of a document that contains
information about a publication, newspaper, item, digitisation, and
ingest. This class holds all the relevant information about a document in
a structured manner and provides properties that can be used to access
different aspects of the document.

Attributes:

| Name          | Type                   | Description                                                             |
|---------------|------------------------|---------------------------------------------------------------------------|
| `collection`  | `str \| None`          | A string that represents the collection of the publication             |
| `root`        | `ET.Element \| None`   | An XML element that represents the root of the publication             |
| `zip_file`    | `str \| None`          | A path to a valid zip file                                              |
| `jisc_papers` | `pd.DataFrame \| None` | A pandas DataFrame object that holds information about the JISC papers |
The Ingest class extends the Cache class and represents a newspaper ingest.
The class has several properties and methods that allow the creation of an
ingest object and the manipulation of its data.

Attributes:

| Name         | Type         | Description                                                  |
|--------------|--------------|--------------------------------------------------------------|
| `root`       | `ET.Element` | An xml element that represents the root of the publication   |
| `collection` | `str`        | A string that represents the collection of the publication   |

def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the ingest
    object.

    Returns:
        Dictionary representation of the Ingest object
    """
    return {
        f"lwm_tool_{x.tag}": x.text or ""
        for x in self.root.findall("./process/lwm_tool/*")
    }
+
The Issue class extends the Cache class and represents a newspaper issue.
The class has several properties and methods that allow the creation of an
issue object and the manipulation of its data.

Attributes:

| Name   | Type | Description                                                 |
|--------|------|-------------------------------------------------------------|
| `root` |      | An xml element that represents the root of the publication  |

def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the issue object.

    Returns:
        Path to the cache file for the issue object
    """

    json_file = f"/{self.newspaper.publication_code}/issues/{self.issue_code}.json"

    return Path(
        f"{CACHE_HOME}/{self.collection}/"
        + "/".join(self.newspaper.number_paths)
        + json_file
    )
+
The Newspaper class extends the Cache class and represents a newspaper
item, i.e. an article. The class has several properties and methods that
allow the creation of an article object and the manipulation of its data.

Attributes:

| Name           | Type         | Description                                                  |
|----------------|--------------|--------------------------------------------------------------|
| `root`         | `ET.Element` | An xml element that represents the root of the publication   |
| `issue_code`   | `str`        | A string that represents the issue code                      |
| `digitisation` | `dict`       | TODO                                                         |
| `ingest`       | `dict`       | TODO                                                         |
| `collection`   | `str`        | A string that represents the collection of the publication   |

def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the item (article) object.

    Returns:
        Path to the cache file for the article object
    """
    return Path(
        f"{CACHE_HOME}/{self.collection}/"
        + "/".join(self.newspaper.number_paths)
        + f"/{self.newspaper.publication_code}/items.jsonl"
    )
+
write_to_cache

write_to_cache(json_indent=JSON_INDENT) -> None

Special cache-write function that appends rather than writes at the
end of the process.

def write_to_cache(self, json_indent=JSON_INDENT) -> None:
    """
    Special cache-write function that appends rather than writes at the
    end of the process.

    Returns:
        None.
    """
    path = self.get_cache_path()

    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "a+") as f:
        f.write(json.dumps(self.as_dict(), indent=json_indent) + "\n")

    return
+
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the newspaper object.

    Returns:
        Path to the cache file for the newspaper object
    """
    json_file = f"/{self.publication_code}/{self.publication_code}.json"

    return Path(
        f"{CACHE_HOME}/{self.collection}/" + "/".join(self.number_paths) + json_file
    )

def publication_code_from_input_sub_path(self) -> str | None:
    """
    A method that returns the publication code from the input sub-path of
    the publication process.

    Returns:
        The code of the publication
    """

    g = PUBLICATION_CODE.findall(self.input_sub_path)
    if len(g) == 1:
        return g[0]
    return None
+
This function is responsible for setting up the path for the alto2txt
mountpoint, setting up the JISC papers and routing the collections for
processing.

def route(
    collections: list,
    cache_home: str,
    mountpoint: str,
    jisc_papers_path: str,
    report_dir: str,
) -> None:
    """
    This function is responsible for setting up the path for the alto2txt
    mountpoint, setting up the JISC papers and routing the collections for
    processing.

    Args:
        collections: List of collection names
        cache_home: Directory path for the cache
        mountpoint: Directory path for the alto2txt mountpoint
        jisc_papers_path: Path to the JISC papers
        report_dir: Path to the report directory

    Returns:
        None
    """

    global CACHE_HOME
    global MNT
    global REPORT_DIR

    CACHE_HOME = cache_home
    REPORT_DIR = report_dir

    MNT = Path(mountpoint) if isinstance(mountpoint, str) else mountpoint
    if not MNT.exists():
        error(
            f"The mountpoint provided for alto2txt does not exist. "
            f"Either create a local copy or blobfuse it to "
            f"`{MNT.absolute()}`."
        )

    jisc_papers = setup_jisc_papers(path=jisc_papers_path)

    for collection_name in collections:
        collection = Collection(name=collection_name, jisc_papers=jisc_papers)

        if collection.empty:
            error(
                f"It looks like {collection_name} is empty in the "
                f"alto2txt mountpoint: `{collection.dir.absolute()}`."
            )

        for archive in collection.archives:
            with archive as _:
                [
                    (
                        doc.item.write_to_cache(),
                        doc.newspaper.write_to_cache(),
                        doc.issue.write_to_cache(),
                        doc.data_provider.write_to_cache(),
                        doc.ingest.write_to_cache(),
                        doc.digitisation.write_to_cache(),
                    )
                    for doc in archive.documents
                ]

    return
+
Fields within the fields portion of a FixtureDict to fit lwmdb.

Attributes:

| Name          | Type                                   | Description                                                                                                                                                                                     |
|---------------|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `name`        | `str`                                  | The name of the collection data source. For lwmdb this should be less than 600 characters.                                                                                                     |
| `code`        | `str \| NEWSPAPER_OCR_FORMATS`         | A short slug-like, url-compatible (replace spaces with -) str to uniquely identify a data provider in urls, api calls etc. Designed to fit NEWSPAPER_OCR_FORMATS and any future slug-like codes. |
| `legacy_code` | `LEGACY_NEWSPAPER_OCR_FORMATS \| None` | Either blank or a legacy slug-like, url-compatible (replace spaces with -) str originally used by alto2txt, following LEGACY_NEWSPAPER_OCR_FORMATS (see also NEWSPAPER_OCR_FORMATS).           |

No pk is included. By not specifying one, django should generate new ones during
import.
PlaintextFixtureFieldsDict

Bases: TypedDict

A typed dict for Plaintext Fixtures to match the lwmdb.Fulltext model

Attributes:

| Name              | Type          | Description                                                                                                                                              |
|-------------------|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `text`            | `str`         | Plaintext, potentially quite large newspaper articles. May have unusual or unreadable sequences of characters due to issues with Optical Character Recognition quality. |
| `path`            | `str`         | Path of provided plaintext file. If compressed_path is None, this is the original relative Path of the plaintext file.                                  |
| `compressed_path` | `str \| None` | The path of a compressed data source, the extraction of which provides access to plaintext files.                                                       |
TranslatorTuple

Bases: NamedTuple

A named tuple of fields for translation.

Attributes:

| Name     | Type          | Description                                                                                                                                                                                                                                                                                    |
|----------|---------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `start`  | `str`         | A string representing the starting field name.                                                                                                                                                                                                                                                |
| `finish` | `str \| list` | A string or list specifying the field(s) to be translated. If it is a string, the translated field will be a direct mapping of the specified field in each item of the input list. If it is a list, the translated field will be a hyphen-separated concatenation of the specified fields in each item of the input list. |
| `lst`    | `list[dict]`  | A list of dictionaries representing the items to be translated. Each dictionary should contain the necessary fields for translation, with the field names specified in the start parameter.                                                                                                  |
def check_newspaper_collection_configuration(
    collections: Iterable[str] = settings.COLLECTIONS,
    newspaper_collections: Iterable[FixtureDict] = NEWSPAPER_COLLECTION_METADATA,
    data_provider_index: str = DATA_PROVIDER_INDEX,
) -> set[str]:
    """Check the names in `collections` match the names in `newspaper_collections`.

    Arguments:
        collections:
            Names of newspaper collections, defaults to ``settings.COLLECTIONS``
        newspaper_collections:
            Newspaper collections in a list of `FixtureDict` format. Defaults
            to ``settings.FIXTURE_TABLE['dataprovider']``
        data_provider_index:
            `dict` `fields` `key` used to check matching `collections` name

    Returns:
        A set of ``collections`` without a matching `newspaper_collections` record.

    Example:
        ```pycon
        >>> check_newspaper_collection_configuration()
        set()
        >>> unmatched: set[str] = check_newspaper_collection_configuration(
        ...     ["cat", "dog"])
        <BLANKLINE>
        ...Warning: 2 `collections` not in `newspaper_collections`: ...
        >>> unmatched == {'dog', 'cat'}
        True

        ```

    !!! note

        Set orders are random so checking `unmatched == {'dog', 'cat'}` to
        ensure correctness irrespective of order in the example above.

    """
    newspaper_collection_names: tuple[str, ...] = tuple(
        dict_from_list_fixture_fields(
            newspaper_collections, field_name=data_provider_index
        ).keys()
    )
    collection_diff: set[str] = set(collections) - set(newspaper_collection_names)
    if collection_diff:
        warning(
            f"{len(collection_diff)} `collections` "
            f"not in `newspaper_collections`: {collection_diff}"
        )
    return collection_diff
+
clear_cache

clear_cache(dir: str | Path) -> None

Clears the cache directory by removing all .json files in it.
def clear_cache(dir: str | Path) -> None:
    """
    Clears the cache directory by removing all `.json` files in it.

    Args:
        dir: The path of the directory to be cleared.
    """

    dir = get_path_from(dir)

    y = input(
        f"Do you want to erase the cache path now that the "
        f"files have been generated ({dir.absolute()})? [y/N]"
    )

    if y.lower() == "y":
        info("Clearing up the cache directory")
        for x in dir.glob("*.json"):
            x.unlink()
+
Compress exported fixtures files using make_archive.

Parameters:

| Name          | Type                       | Description                                                                                                                                                                                 | Default              |
|---------------|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| `path`        | `PathLike`                 | Path to file to compress                                                                                                                                                                    | *required*           |
| `output_path` | `PathLike \| str`          | Compressed file name (without extension specified from format).                                                                                                                            | `settings.OUTPUT`    |
| `format`      | `str \| ArchiveFormatEnum` | A str of one of the registered compression formats. By default Python provides zip, tar, gztar, bztar, and xztar. See ArchiveFormatEnum variable for options checked.                      | `ZIP_FILE_EXTENSION` |
| `suffix`      | `str`                      | str to add to compressed file name saved. For example: if path = plaintext_fixture-1.json and suffix = _compressed, then the saved file might be called plaintext_fixture-1_compressed.json.zip |                      |
def create_lookup(lst: list = [], on: list = []) -> dict:
+    """
+    Create a lookup dictionary from a list of dictionaries.
+
+    Args:
+        lst: A list of dictionaries that should be used to generate the lookup.
+        on: A list of keys from the dictionaries in the list that should be used as the keys in the lookup.
+
+    Returns:
+        The generated lookup dictionary.
+    """
+    return {get_key(x, on): x["pk"] for x in lst}
+
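For instance, given two made-up fixture-style records, a lookup keyed on code would come out as:

```pycon
>>> records = [
...     {"pk": 1, "fields": {"code": "hmd", "name": "Heritage Made Digital"}},
...     {"pk": 2, "fields": {"code": "lwm", "name": "Living with Machines"}},
... ]
>>> create_lookup(records, on=["code"])
{'hmd': 1, 'lwm': 2}
```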
Saves fixtures generated by a generator to separate CSV files.
+
This function takes an Iterable or Generator of fixtures and saves them to
+separate CSV files. The fixtures are saved in batches, where each batch
+is determined by the max_elements_per_file parameter.
def fixtures_dict2csv(
+    fixtures: Iterable[FixtureDict] | Generator[FixtureDict, None, None],
+    prefix: str = "",
+    output_path: PathLike | str = settings.OUTPUT,
+    index: bool = False,
+    max_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,
+    file_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,
+) -> None:
+    """Saves fixtures generated by a generator to separate `CSV` files.
+
+    This function takes an `Iterable` or `Generator` of fixtures and saves them
+    to separate `CSV` files. The fixtures are saved in batches, where each batch
+    is determined by the ``max_elements_per_file`` parameter.
+
+    Args:
+        fixtures:
+            An `Iterable` or `Generator` of the fixtures to be saved.
+        prefix:
+            A string prefix to be added to the file names of the
+            saved fixtures.
+        output_path:
+            Path to folder fixtures are saved to
+        index:
+            Whether to include the `DataFrame` index column in the saved `CSV`
+        max_elements_per_file:
+            Maximum number of records saved in each file
+        file_name_0_padding:
+            Zeros to prefix the number of each fixture file name.
+
+    Returns:
+        This function saves fixtures to files and does not return a value.
+
+    Example:
+        ```pycon
+        >>> tmp_path: Path = getfixture('tmp_path')
+        >>> from pandas import read_csv
+        >>> fixtures_dict2csv(NEWSPAPER_COLLECTION_METADATA,
+        ...                   prefix='test', output_path=tmp_path)
+        >>> imported_fixture = read_csv(tmp_path / 'test-000001.csv')
+        >>> imported_fixture.iloc[1]['pk']
+        2
+        >>> imported_fixture.iloc[1][DATA_PROVIDER_INDEX]
+        'hmd'
+
+        ```
+
+    """
+    internal_counter: int = 1
+    counter: int = 1
+    lst: list = []
+    file_name: str
+    df: DataFrame
+    Path(output_path).mkdir(parents=True, exist_ok=True)
+    for item in fixtures:
+        lst.append(fixture_fields(item, as_dict=True))
+        internal_counter += 1
+        if internal_counter > max_elements_per_file:
+            df = DataFrame.from_records(lst)
+
+            file_name = f"{prefix}-{str(counter).zfill(file_name_0_padding)}.csv"
+            df.to_csv(Path(output_path) / file_name, index=index)
+            # Save up some memory
+            del lst
+            gc.collect()
+
+            # Re-instantiate
+            lst = []
+            internal_counter = 1
+            counter += 1
+    else:
+        # Write whatever remains after the iterable is exhausted
+        df = DataFrame.from_records(lst)
+        file_name = f"{prefix}-{str(counter).zfill(file_name_0_padding)}.csv"
+        df.to_csv(Path(output_path) / file_name, index=index)
+
def gen_fixture_tables(
+    fixture_tables: dict[str, list[FixtureDict]] = {},
+    include_fixture_pk_column: bool = True,
+) -> Generator[Table, None, None]:
+    """Generator of `rich.Table` instances from `FixtureDict` configuration tables.
+
+    Args:
+        fixture_tables: `dict` where `key` is for `Table` title and `value` is a `FixtureDict`
+        include_fixture_pk_column: whether to include the `pk` field from `FixtureDict`
+
+    Example:
+        ```pycon
+        >>> table_name: str = "data_provider"
+        >>> tables = tuple(
+        ...     gen_fixture_tables(
+        ...         {table_name: NEWSPAPER_COLLECTION_METADATA}
+        ...     ))
+        >>> len(tables)
+        1
+        >>> assert tables[0].title == table_name
+        >>> [column.header for column in tables[0].columns]
+        ['pk', 'name', 'code', 'legacy_code', 'collection', 'source_note']
+
+        ```
+    """
+    for name, fixture_records in fixture_tables.items():
+        fixture_table: Table = Table(title=name)
+        for i, fixture_dict in enumerate(fixture_records):
+            if i == 0:
+                [
+                    fixture_table.add_column(name)
+                    for name in fixture_fields(fixture_dict, include_fixture_pk_column)
+                ]
+            row_values: tuple[str, ...] = tuple(
+                str(x) for x in (fixture_dict["pk"], *fixture_dict["fields"].values())
+            )
+            fixture_table.add_row(*row_values)
+        yield fixture_table
+
+
+
+
+
+
+
+
+
+
+
+
+ get_chunked_zipfiles
+
+
+
+
get_chunked_zipfiles(path:Path)->list
+
+
+
+
+
This function takes in a Path object path and returns a list of lists
+of zipfiles sorted and chunked according to certain conditions defined
+in the settings object (see settings.CHUNK_THRESHOLD).
+
Note: the function will also skip zip files of a certain file size, which
+can be specified in the settings object (see settings.SKIP_FILE_SIZE).
+
+
+
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
path
+
+ Path
+
+
+
+
The input path where the zipfiles are located
+
+
+
+ required
+
+
+
+
+
+
+
+
Returns:
+
+
+
+
Type
+
Description
+
+
+
+
+
+ list
+
+
+
+
A list of lists of zipfiles, each inner list represents a chunk of
+zipfiles.
def get_chunked_zipfiles(path: Path) -> list:
+    """This function takes in a `Path` object `path` and returns a list of lists
+    of `zipfiles` sorted and chunked according to certain conditions defined
+    in the `settings` object (see `settings.CHUNK_THRESHOLD`).
+
+    Note: the function will also skip zip files of a certain file size, which
+    can be specified in the `settings` object (see `settings.SKIP_FILE_SIZE`).
+
+    Args:
+        path: The input path where the zipfiles are located
+
+    Returns:
+        A list of lists of `zipfiles`, each inner list represents a chunk of
+        zipfiles.
+    """
+
+    zipfiles = sorted(
+        path.glob("*.zip"),
+        key=lambda x: x.stat().st_size,
+        reverse=settings.START_WITH_LARGEST,
+    )
+
+    zipfiles = [x for x in zipfiles if x.stat().st_size <= settings.SKIP_FILE_SIZE]
+
+    if len(zipfiles) > settings.CHUNK_THRESHOLD:
+        chunks = array_split(zipfiles, len(zipfiles) / settings.CHUNK_THRESHOLD)
+    else:
+        chunks = [zipfiles]
+
+    return chunks
+
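To illustrate just the chunking step: with ten hypothetical zip files and an assumed CHUNK_THRESHOLD of 4, numpy's array_split yields two roughly even chunks:

```pycon
>>> from numpy import array_split
>>> fake_zipfiles = [f"newspaper-{i}.zip" for i in range(10)]
>>> chunks = array_split(fake_zipfiles, len(fake_zipfiles) / 4)
>>> [len(chunk) for chunk in chunks]
[5, 5]
```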
+
+
+
+
+
+
+
+
+
+
+
+ get_key
+
+
+
+
get_key(x:dict=dict(),on:list=[])->str
+
+
+
+
+
Get a string key from a dictionary using values from specified keys.
+
+
+
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
x
+
+ dict
+
+
+
+
A dictionary from which the key is generated.
+
+
+
+ dict()
+
+
+
+
on
+
+ list
+
+
+
+
A list of keys from the dictionary that should be used to
+generate the key.
def get_key(x: dict = dict(), on: list = []) -> str:
+    """
+    Get a string key from a dictionary using values from specified keys.
+
+    Args:
+        x: A dictionary from which the key is generated.
+        on: A list of keys from the dictionary that should be used to
+            generate the key.
+
+    Returns:
+        The generated string key.
+    """
+
+    return f"{'-'.join([str(x['fields'][y]) for y in on])}"
+
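A quick, made-up example of the key format this produces:

```pycon
>>> get_key({"fields": {"publication_code": "0002647", "year": 1891}},
...         on=["publication_code", "year"])
'0002647-1891'
```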
def get_lockfile(collection: str, kind: NewspaperElements, dic: dict) -> Path:
+    """
+    Provides the path to any given lockfile, which controls whether any
+    existing files should be overwritten or not.
+
+    Args:
+        collection: Collection folder name
+        kind: Either `newspaper` or `issue` or `item`
+        dic: A dictionary with required information for either `kind` passed
+
+    Returns:
+        Path to the resulting lockfile
+    """
+
+    p: Path
+    base = Path(f"cache-lockfiles/{collection}")
+
+    if kind == "newspaper":
+        p = base / f"newspapers/{dic['publication_code']}"
+    elif kind == "issue":
+        p = base / f"issues/{dic['publication__publication_code']}/{dic['issue_code']}"
+    elif kind == "item":
+        try:
+            if dic.get("issue_code"):
+                p = base / f"items/{dic['issue_code']}/{dic['item_code']}"
+            elif dic.get("issue__issue_identifier"):
+                p = base / f"items/{dic['issue__issue_identifier']}/{dic['item_code']}"
+        except KeyError:
+            error("An unknown error occurred (in get_lockfile)")
+    else:
+        p = base / "lockfile"
+
+    p.parent.mkdir(parents=True, exist_ok=True) if settings.WRITE_LOCKFILES else None
+
+    return p
+
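For orientation, the lockfile paths this produces follow the layout below (the collection and code values are placeholders):

```python
# kind="newspaper" -> cache-lockfiles/<collection>/newspapers/<publication_code>
# kind="issue"     -> cache-lockfiles/<collection>/issues/<publication_code>/<issue_code>
# kind="item"      -> cache-lockfiles/<collection>/items/<issue_code>/<item_code>
# any other kind   -> cache-lockfiles/<collection>/lockfile
```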
def get_now(as_str: bool = False) -> datetime.datetime | str:
+    """
+    Return `datetime.now()` as either a string or `datetime` object.
+
+    Args:
+        as_str: Whether to return `now` `time` as a `str` or not, default: `False`
+
+    Returns:
+        `datetime.now()` in `pytz.UTC` time zone as a string if `as_str`, else
+        as a `datetime.datetime` object.
+    """
+    now = datetime.datetime.now(tz=pytz.UTC)
+
+    if as_str:
+        return str(now)
+    else:
+        assert isinstance(now, datetime.datetime)
+        return now
+
+
+
+
+
+
+
+
+
+
+
+
+ get_path_from
+
+
+
+
get_path_from(p:str|Path)->Path
+
+
+
+
+
Converts an input value into a Path object if it's not already one.
+
+
+
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
p
+
+ str | Path
+
+
+
+
The input value, which can be a string or a Path object.
def get_path_from(p: str | Path) -> Path:
+    """
+    Converts an input value into a Path object if it's not already one.
+
+    Args:
+        p: The input value, which can be a string or a Path object.
+
+    Returns:
+        The input value as a Path object.
+    """
+    if isinstance(p, str):
+        p = Path(p)
+
+    if not isinstance(p, Path):
+        raise RuntimeError(f"Unable to handle type: {type(p)}")
+
+    return p
+
def get_size_from_path(p: str | Path, raw: bool = False) -> str | float:
+    """
+    Returns a nice string for any given file size.
+
+    Args:
+        p: Path to read the size from
+        raw: Whether to return the file size as total number of bytes or
+            a human-readable MB/GB amount
+
+    Returns:
+        Return `str` followed by `MB` or `GB` for size if not `raw` otherwise `float`.
+    """
+
+    p = get_path_from(p)
+
+    bytes = p.stat().st_size
+
+    if raw:
+        return bytes
+
+    rel_size: float | int | str = round(bytes / 1000 / 1000 / 1000, 1)
+
+    assert not isinstance(rel_size, str)
+
+    if rel_size < 0.5:
+        rel_size = round(bytes / 1000 / 1000, 1)
+        rel_size = f"{rel_size}MB"
+    else:
+        rel_size = f"{rel_size}GB"
+
+    return rel_size
+
+
+
+
+
+
+
+
+
+
+
+
+ glob_filter
+
+
+
+
glob_filter(p:str)->list
+
+
+
+
+
Return an ordered glob, filtering out any pesky, unwanted .DS_Store files from macOS.
+
+
+
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
p
+
+ str
+
+
+
+
Path to a directory to filter
+
+
+
+ required
+
+
+
+
+
+
+
+
Returns:
+
+
+
+
Type
+
Description
+
+
+
+
+
+ list
+
+
+
+
Sorted list of files contained in the provided path without the ones whose names start with a .
def glob_filter(p: str) -> list:
+    """
+    Return an ordered glob, filtering out any pesky, unwanted `.DS_Store` files from macOS.
+
+    Args:
+        p: Path to a directory to filter
+
+    Returns:
+        Sorted list of files contained in the provided path without the ones
+        whose names start with a `.`
+    """
+    return sorted([x for x in get_path_from(p).glob("*") if not x.name.startswith(".")])
+
def glob_path_rename_by_0_padding(
+    path: PathLike,
+    output_path: PathLike | None = None,
+    glob_regex_str: str = "*",
+    padding: int | None = 0,
+    match_int_regex: str = PADDING_0_REGEX_DEFAULT,
+    index: int = -1,
+) -> dict[PathLike, PathLike]:
+    """Return an `OrderedDict` of replacement 0-padded file names from `path`.
+
+    Params:
+        path:
+            `PathLike` to source files to rename.
+
+        output_path:
+            `PathLike` to save renamed files to.
+
+        glob_regex_str:
+            `str` to match files to rename within `path`.
+
+        padding:
+            How many digits (0s) to pad `match_int` with.
+
+        match_int_regex:
+            Regular expression for matching numbers in `s` to pad.
+            Only rename parts of `Path(file_path).name`; else
+            replace across `Path(file_path).parents` as well.
+
+        index:
+            Which index of number in `s` to pad with 0s.
+            Like numbering a `list`, 0 indicates the first match
+            and -1 indicates the last match.
+
+    Example:
+        ```pycon
+        >>> tmp_path: Path = getfixture('tmp_path')
+        >>> for i in range(4):
+        ...     (tmp_path / f'test_file-{i}.txt').touch(exist_ok=True)
+        >>> pprint(sorted(tmp_path.iterdir()))
+        [...Path('...test_file-0.txt'),
+         ...Path('...test_file-1.txt'),
+         ...Path('...test_file-2.txt'),
+         ...Path('...test_file-3.txt')]
+        >>> pprint(glob_path_rename_by_0_padding(tmp_path))
+        {...Path('...test_file-0.txt'): ...Path('...test_file-00.txt'),
+         ...Path('...test_file-1.txt'): ...Path('...test_file-01.txt'),
+         ...Path('...test_file-2.txt'): ...Path('...test_file-02.txt'),
+         ...Path('...test_file-3.txt'): ...Path('...test_file-03.txt')}
+
+        ```
+
+    """
+    try:
+        assert Path(path).exists()
+    except AssertionError:
+        raise ValueError(f'path does not exist: "{Path(path)}"')
+    paths_tuple: tuple[PathLike, ...] = path_globs_to_tuple(path, glob_regex_str)
+    try:
+        assert paths_tuple
+    except AssertionError:
+        raise FileNotFoundError(
+            f"No files found matching 'glob_regex_str': "
+            f"'{glob_regex_str}' in: '{path}'"
+        )
+    paths_to_index: tuple[tuple[str, int], ...] = tuple(
+        int_from_str(str(matched_path), index=index, regex=match_int_regex)
+        for matched_path in paths_tuple
+    )
+    max_index: int = max(index[1] for index in paths_to_index)
+    max_index_digits: int = len(str(max_index))
+    if not padding or padding < max_index_digits:
+        padding = max_index_digits + 1
+    new_names_dict: dict[PathLike, PathLike] = {}
+    if output_path:
+        if not Path(output_path).is_absolute():
+            output_path = Path(path) / output_path
+        logger.debug(f"Specified '{output_path}' for saving file copies")
+    for i, old_path in enumerate(paths_tuple):
+        match_str, match_int = paths_to_index[i]
+        new_names_dict[old_path] = rename_by_0_padding(
+            old_path, match_str=str(match_str), match_int=match_int, padding=padding
+        )
+        if output_path:
+            new_names_dict[old_path] = (
+                Path(output_path) / Path(new_names_dict[old_path]).name
+            )
+    return new_names_dict
+
def int_from_str(
+    s: str,
+    index: int = -1,
+    regex: str = PADDING_0_REGEX_DEFAULT,
+) -> tuple[str, int]:
+    """Return the number matched by `regex` in `s` at position `index`, as a (`str`, `int`) pair.
+
+    Params:
+        s:
+            `str` to match via `regex`.
+
+        index:
+            Which index of number in `s` to pad with 0s.
+            Like numbering a `list`, 0 indicates the first match
+            and -1 indicates the last match.
+
+        regex:
+            Regular expression for matching numbers in `s` to pad.
+
+    Example:
+        ```pycon
+        >>> int_from_str('a/path/to/fixture-03-05.txt')
+        ('05', 5)
+        >>> int_from_str('a/path/to/fixture-03-05.txt', index=0)
+        ('03', 3)
+
+        ```
+    """
+    matches: list[str] = [match for match in findall(regex, s) if match]
+    match_str: str = matches[index]
+    return match_str, int(match_str)
+
def list_json_files(
+    p: str | Path,
+    drill: bool = False,
+    exclude_names: list = [],
+    include_names: list = [],
+) -> Generator[Path, None, None] | list[Path]:
+    """
+    List `json` files under the path specified in ``p``.
+
+    Args:
+        p: The path to search for `json` files
+        drill: A flag indicating whether to drill down the subdirectories
+            or not. Default is ``False``
+        exclude_names: A list of file names to exclude from the search
+            result. Default is an empty list
+        include_names: A list of file names to include in search result.
+            If provided, the ``exclude_names`` argument will be ignored.
+            Default is an empty list
+
+    Returns:
+        A list of `Path` objects pointing to the found `json` files
+    """
+
+    q: str = "**/*.json" if drill else "*.json"
+    files = get_path_from(p).glob(q)
+
+    if exclude_names:
+        files = list({x for x in files if x.name not in exclude_names})
+    elif include_names:
+        files = list({x for x in files if x.name in include_names})
+
+    return sorted(files)
+
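A typical call that recurses into subdirectories while skipping one file might look like this (the path and file name are illustrative):

```python
fixture_files = list_json_files(
    "./output/fixtures",                 # folder to search (illustrative path)
    drill=True,                          # also look in subdirectories ("**/*.json")
    exclude_names=["full_report.json"],  # skip this file name wherever it appears
)
```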
def load_json(p: str | Path, crash: bool = False) -> dict | list:
+    """
+    Easier access to reading `json` files.
+
+    Args:
+        p: Path to read `json` from
+        crash: Whether the program should crash if there is a `json` decode
+            error, default: ``False``
+
+    Returns:
+        The decoded `json` contents from the path, but an empty dictionary
+        if the file cannot be decoded and ``crash`` is set to ``False``
+    """
+
+    p = get_path_from(p)
+
+    try:
+        return json.loads(p.read_text())
+    except json.JSONDecodeError:
+        msg = f"Error: {p.read_text()}"
+        error(msg, crash=crash)
+
+    return {}
+
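A small sketch of the fallback behaviour when a file does not contain valid JSON (the file name is made up):

```python
from pathlib import Path

broken = Path("broken.json")
broken.write_text("{not valid json")

# With crash=False (the default) the decode error is logged and {} is returned
assert load_json(broken, crash=False) == {}
```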
def load_multiple_json(
+    p: str | Path,
+    drill: bool = False,
+    filter_na: bool = True,
+    crash: bool = False,
+) -> list:
+    """
+    Load multiple `json` files and return a list of their content.
+
+    Args:
+        p: The path to search for `json` files
+        drill: A flag indicating whether to drill down the subdirectories
+            or not. Default is `False`
+        filter_na: A flag indicating whether to filter out the content that
+            is `None`. Default is `True`.
+        crash: A flag indicating whether to raise an exception when an
+            error occurs while loading a `json` file. Default is `False`.
+
+    Returns:
+        A `list` of the content of the loaded `json` files.
+    """
+
+    files = list_json_files(p, drill=drill)
+
+    content = [load_json(x, crash=crash) for x in files]
+
+    return [x for x in content if x] if filter_na else content
+
+
+
+
+
+
+
+
+
+
+
+
+ lock
+
+
+
+
lock(lockfile:Path)->None
+
+
+
+
+
Writes a '.' to a lockfile, after making sure the parent directory exists.
def lock(lockfile: Path) -> None:
+    """
+    Writes a '.' to a lockfile, after making sure the parent directory exists.
+
+    Args:
+        lockfile: The path to the lock file to be created
+
+    Returns:
+        None
+    """
+    lockfile.parent.mkdir(parents=True, exist_ok=True)
+
+    lockfile.write_text("")
+
+    return
+
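Together with get_lockfile, a typical guard looks roughly like this (the collection name and publication code are placeholders):

```python
lockfile = get_lockfile("hmd", "newspaper", {"publication_code": "0002647"})

if not lockfile.exists():
    # ... generate and write the fixture for this newspaper ...
    lock(lockfile)  # mark it as done so a later run can skip it
```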
Saves fixtures generated by a generator to separate JSON files.
+
This function takes a generator and saves the generated fixtures to
+separate JSON files. The fixtures are saved in batches, where each batch
+is determined by the max_elements_per_file parameter.
+
+
+
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
generator
+
+ Sequence | Generator
+
+
+
+
A generator that yields the fixtures to be saved.
+
+
+
+ []
+
+
+
+
prefix
+
+ str
+
+
+
+
A string prefix to be added to the file names of the
+saved fixtures.
+
+
+
+ ''
+
+
+
+
output_path
+
+ PathLike | str
+
+
+
+
Path to folder fixtures are saved to
+
+
+
+ settings.OUTPUT
+
+
+
+
max_elements_per_file
+
+ int
+
+
+
+
Maximum JSON records saved in each file
+
+
+
+ settings.MAX_ELEMENTS_PER_FILE
+
+
+
+
add_created
+
+ bool
+
+
+
+
Whether to add created_at and updated_at timestamps
+
+
+
+ True
+
+
+
+
json_indent
+
+ int
+
+
+
+
Number of indent spaces per line in saved JSON
+
+
+
+ JSON_INDENT
+
+
+
+
file_name_0_padding
+
+ int
+
+
+
+
Zeros to prefix the number of each fixture file name.
+
+
+
+ FILE_NAME_0_PADDING_DEFAULT
+
+
+
+
+
+
+
+
Returns:
+
+
+
+
Type
+
Description
+
+
+
+
+
+ None
+
+
+
+
This function saves the fixtures to files but does not return any value.
def save_fixture(
+    generator: Sequence | Generator = [],
+    prefix: str = "",
+    output_path: PathLike | str = settings.OUTPUT,
+    max_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,
+    add_created: bool = True,
+    json_indent: int = JSON_INDENT,
+    file_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,
+) -> None:
+    """Saves fixtures generated by a generator to separate JSON files.
+
+    This function takes a generator and saves the generated fixtures to
+    separate JSON files. The fixtures are saved in batches, where each batch
+    is determined by the ``max_elements_per_file`` parameter.
+
+    Args:
+        generator:
+            A generator that yields the fixtures to be saved.
+        prefix:
+            A string prefix to be added to the file names of the
+            saved fixtures.
+        output_path:
+            Path to folder fixtures are saved to
+        max_elements_per_file:
+            Maximum `JSON` records saved in each file
+        add_created:
+            Whether to add `created_at` and `updated_at` `timestamps`
+        json_indent:
+            Number of indent spaces per line in saved `JSON`
+        file_name_0_padding:
+            Zeros to prefix the number of each fixture file name.
+
+    Returns:
+        This function saves the fixtures to files but does not return
+        any value.
+
+    Example:
+        ```pycon
+        >>> tmp_path: Path = getfixture('tmp_path')
+        >>> save_fixture(NEWSPAPER_COLLECTION_METADATA,
+        ...              prefix='test', output_path=tmp_path)
+        >>> imported_fixture = load_json(tmp_path / 'test-000001.json')
+        >>> imported_fixture[1]['pk']
+        2
+        >>> imported_fixture[1]['fields'][DATA_PROVIDER_INDEX]
+        'hmd'
+        >>> 'created_at' in imported_fixture[1]['fields']
+        True
+
+        ```
+
+    """
+    internal_counter = 1
+    counter = 1
+    lst = []
+    file_name: str
+    Path(output_path).mkdir(parents=True, exist_ok=True)
+    for item in generator:
+        lst.append(item)
+        internal_counter += 1
+        if internal_counter > max_elements_per_file:
+            file_name = f"{prefix}-{str(counter).zfill(file_name_0_padding)}.json"
+            write_json(
+                p=Path(f"{output_path}/{file_name}"),
+                o=lst,
+                add_created=add_created,
+                json_indent=json_indent,
+            )
+
+            # Save up some memory
+            del lst
+            gc.collect()
+
+            # Re-instantiate
+            lst = []
+            internal_counter = 1
+            counter += 1
+    else:
+        # Write whatever remains after the generator is exhausted
+        file_name = f"{prefix}-{str(counter).zfill(file_name_0_padding)}.json"
+        write_json(
+            p=Path(f"{output_path}/{file_name}"),
+            o=lst,
+            add_created=add_created,
+            json_indent=json_indent,
+        )
+
+    return
+
Easier access to writing json files. Checks whether parent exists.
+
+
+
+
Parameters:
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
p
+
+ str | Path
+
+
+
+
Path to write json to
+
+
+
+ required
+
+
+
+
o
+
+ dict
+
+
+
+
Object to write to json file
+
+
+
+ required
+
+
+
+
add_created
+
+ bool
+
+
+
+
If set to True, created_at and updated_at will be added
+to the dictionary's fields. If created_at and updated_at
+already exist in the fields, they will be forcefully updated.
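A minimal, illustrative call (the path and record below are made up) would be:

```python
write_json(
    p="output/fixtures/dataprovider-000001.json",  # where to save (parent directory is checked)
    o=[{"pk": 1, "model": "newspapers.dataprovider", "fields": {"code": "hmd"}}],
    add_created=True,  # stamp created_at / updated_at onto each record's fields
)
```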
The program should run automatically with the following command:
+
$ poetry run a2t2f-news
+
+
Alternatively, if you want to add optional parameters and don’t want to use the standard poetry script, you can use the (somewhat convoluted) poetry run alto2txt2fixture/run.py and provide any optional parameters. You can see a list of all the “Optional parameters” below. For example, if you want to only include the hmd collection:
+
$ poetry run alto2txt2fixture/run.py --collections hmd
+
+
Alternative: Run the script without poetry
+
If you find yourself in trouble with poetry, the program should run perfectly fine on its own, assuming the dependencies are installed. The same command, then, would be:
+
$ python alto2txt2fixture/run.py --collections hmd
+
+
+
Note
+
See [tool.poetry.dependencies] in pyproject.toml for the dependencies that would need to be installed for alto2txt2fixture to work outside a Python poetry environment.
+
+
Optional parameters
+
The program has a number of optional parameters that you can choose to include or not. The table below describes each parameter, how to pass it to the program, and what its default value is.
+
+
+
+
Flag
+
Description
+
Default value
+
+
+
+
+
-c, --collections
+
Which collections to process in the mounted alto2txt directory
+
hmd, lwm, jisc, bna
+
+
+
-o, --output
+
Into which directory should the processed files be put?
+
./output/fixtures/
+
+
+
-m, --mountpoint
+
Where are the alto2txt directories mounted?
+
./input/alto2txt/
+
+
+
-t, --test-config
+
Print the config table but do not run
+
False
+
+
+
+
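For example, to process only the hmd and lwm collections into a custom output directory while checking the configuration first (the paths here are illustrative):

$ poetry run a2t2f-news --collections hmd lwm --output ./output/fixtures --test-config

Dropping --test-config then runs the conversion with the same settings.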
Successfully running the program: An example
+
\ No newline at end of file
diff --git a/search/search_index.js b/search/search_index.js
new file mode 100644
index 0000000..1960953
--- /dev/null
+++ b/search/search_index.js
@@ -0,0 +1 @@
+var __index = {"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"alto2txt2fixture","text":"
alto2txt2fixture is a standalone tool to convert alto2txtXML output and other related datasets into JSON (and where feasible CSV) data with corresponding relational IDs to ease general use and ingestion into a relational database.
We target the the JSON produced for importing into lwmdb: a database built using the Djangopython webframework database fixture structure.
"},{"location":"index.html#installation-and-simple-use","title":"Installation and simple use","text":"
We provide a command line interface to process alto2txtXML files stored locally (or mounted via azureblobfuse), and for additional public data we automate a means of downloading those automatically.
To processing newspaper metadata with a local copy of alto2txtXML results, it's easiest to have that data in the same folder as your alto2txt2fixture checkout and poetry installed folder. One arranged, you should be able to begin the JSON converstion with
$ poetry run a2t2f-news\n
To generate related data in JSON and CSV form, assuming you have an internet collection and access to a living-with-machinesazure account, the following will download related data into JSON and CSV files. The JSON results should be consistent with lwmdb tables for ease of import.
$ poetry run a2t2f-adj\n
"},{"location":"running.html","title":"Running the Program","text":""},{"location":"running.html#using-poetry-to-run","title":"Using poetry to run","text":"
The program should run automatically with the following command:
$ poetry run a2t2f-news\n
Alternatively, if you want to add optional parameters and don\u2019t want to use the standard poetry script to run, you can use the (somewhat convoluted) poetry run alto2txt2fixture/run.py and provide any optional parameters. You can see a list of all the \u201cOptional parameters\u201d below. For example, if you want to only include the hmd collection:
$ poetry run alto2txt2fixture/run.py --collections hmd\n
"},{"location":"running.html#alternative-run-the-script-without-poetry","title":"Alternative: Run the script without poetry","text":"
If you find yourself in trouble with poetry, the program should run perfectly fine on its own, assuming the dependencies are installed. The same command, then, would be:
See the list under [tool.poetry.dependencies] in pyproject.toml for a list of dependencies that would need to be installed for alto2txt2fixture to work outside a python poetry environment.
The program has a number of optional parameters that you can choose to include or not. The table below describes each parameter, how to pass it to the program, and what its defaults are.
Flag Description Default value -c, --collections Which collections to process in the mounted alto2txt directory hmd, lwm, jisc, bna-o, --output Into which directory should the processed files be put? ./output/fixtures/-m, --mountpoint Where is the alto2txt directories mounted? ./input/alto2txt/-t, --test-config Print the config table but do not run False"},{"location":"running.html#successfully-running-the-program-an-example","title":"Successfully running the program: An example","text":""},{"location":"understanding-results.html","title":"Understanding the Results","text":""},{"location":"understanding-results.html#the-resulting-file-structure","title":"The resulting file structure","text":"
The examples below follow standard settings
If you choose other settings for when you run the program, your output directory may look different from the information on this page.
Reports are automatically generated with a unique hash as the overarching folder structure. Inside the reports directory, you\u2019ll find a JSON file for each alto2txt directory (organised by NLP identifier).
The report structure, thus, looks like this:
The JSON file has some good troubleshooting information. You\u2019ll find that the contents are structured as a Python dictionary (or JavaScript Object). Here is an example:
Here is an explanation of each of the keys in the dictionary:
Key Explanation Data type path The input path for the zip file that is being converted. stringbytes The size of the input zip file represented in bytes. integersize The size of the input zip file represented in a human-readable string. stringcontents #TODO #3 integerstart Date and time when processing started (see also end below). datestringnewspaper_paths #TODO #3 list (string) publication_codes A list of the NLPs that are contained in the input zip file. list (string) issue_paths A list of all the issue paths that are contained in the cache directory. list (string) item_paths A list of all the item paths that are contained in the cache directory. list (string) end Date and time when processing ended (see also start above). datestringseconds Seconds that the script spent interpreting the zip file (should be added to the microseconds below). integermicroseconds Microseconds that the script spent interpreting the zip file (should be added to the seconds above). integer"},{"location":"understanding-results.html#fixtures","title":"Fixtures","text":"
The most important output of the script is contained in the fixtures directory. This directory contains JSON files for all the different columns in the corresponding Django metadata database (i.e. DataProvider, Digitisation, Ingest, Issue, Newspaper, and Item). The numbering at the end of each file indicates the order of the files as they are divided into a maximum of 2e6 elements*:
Each JSON file contains a Python-like list (JavaScript Array) of dictionaries (JavaScript Objects), which have a primary key (pk), the related database model (in the example below the Django newspapers app\u2019s newspaper table), and a nested dictionary/Object which contains all the values for the database\u2019s table entry:
* The maximum elements per file can be adjusted in the settings.py file\u2019s settings object\u2019s MAX_ELEMENTS_PER_FILE value.
This constructs an ArgumentParser instance to manage configurating calls of run() to manage newspaperXML to JSON converstion.
Parameters:
Name Type Description Default argvlist[str] | None
If None treat as equivalent of ['--help], if alistofstrpass those options toArgumentParser`
None
Returns:
Type Description Namespace
A Namespacedict-like configuration for run()
Source code in alto2txt2fixture/__main__.py
def parse_args(argv: list[str] | None = None) -> Namespace:\n\"\"\"Manage command line arguments for `run()`\n This constructs an `ArgumentParser` instance to manage\n configurating calls of `run()` to manage `newspaper`\n `XML` to `JSON` converstion.\n Arguments:\n argv:\n If `None` treat as equivalent of ['--help`],\n if a `list` of `str` pass those options to `ArgumentParser`\n Returns:\n A `Namespace` `dict`-like configuration for `run()`\n \"\"\"\nargv = None if not argv else argv\nparser = ArgumentParser(\nprog=\"a2t2f-news\",\ndescription=\"Process alto2txt XML into and Django JSON Fixture files\",\nepilog=(\n\"Note: this is still in beta mode and contributions welcome\\n\\n\" + __doc__\n),\nformatter_class=RawTextHelpFormatter,\n)\nparser.add_argument(\n\"-c\",\n\"--collections\",\nnargs=\"+\",\nhelp=\"<Optional> Set collections\",\nrequired=False,\n)\nparser.add_argument(\n\"-m\",\n\"--mountpoint\",\ntype=str,\nhelp=\"<Optional> Mountpoint\",\nrequired=False,\n)\nparser.add_argument(\n\"-o\",\n\"--output\",\ntype=str,\nhelp=\"<Optional> Set an output directory\",\nrequired=False,\n)\nparser.add_argument(\n\"-t\",\n\"--test-config\",\ndefault=False,\nhelp=\"Only print the configuration\",\naction=BooleanOptionalAction,\n)\nparser.add_argument(\n\"-f\",\n\"--show-fixture-tables\",\ndefault=True,\nhelp=\"Print included fixture table configurations\",\naction=BooleanOptionalAction,\n)\nparser.add_argument(\n\"--export-fixture-tables\",\ndefault=True,\nhelp=\"Experimental: export fixture tables prior to data processing\",\naction=BooleanOptionalAction,\n)\nparser.add_argument(\n\"--data-provider-field\",\ntype=str,\ndefault=DATA_PROVIDER_INDEX,\nhelp=\"Key for indexing DataProvider records\",\n)\nreturn parser.parse_args(argv)\n
First parse_args is called for command line arguments including:
collections
output
mountpoint
If any of these arguments are specified, they will be used, otherwise they will default to the values in the settings module.
The show_setup function is then called to display the configurations being used.
The route function is then called to route the alto2txt files into subdirectories with structured files.
The parse function is then called to parse the resulting JSON files.
Finally, the clear_cache function is called to clear the cache (pending the user's confirmation).
Parameters:
Name Type Description Default local_argslist[str] | None
Options passed to parse_args()
None
Returns:
Type Description None
None
Source code in alto2txt2fixture/__main__.py
def run(local_args: list[str] | None = None) -> None:\n\"\"\"Manage running newspaper `XML` to `JSON` conversion.\n First `parse_args` is called for command line arguments including:\n - `collections`\n - `output`\n - `mountpoint`\n If any of these arguments are specified, they will be used, otherwise they\n will default to the values in the `settings` module.\n The `show_setup` function is then called to display the configurations\n being used.\n The `route` function is then called to route the alto2txt files into\n subdirectories with structured files.\n The `parse` function is then called to parse the resulting JSON files.\n Finally, the `clear_cache` function is called to clear the cache\n (pending the user's confirmation).\n Arguments:\n local_args: Options passed to `parse_args()`\n Returns:\n None\n \"\"\"\nargs: Namespace = parse_args(argv=local_args)\nif args.collections:\nCOLLECTIONS = [x.lower() for x in args.collections]\nelse:\nCOLLECTIONS = settings.COLLECTIONS\nif args.output:\nOUTPUT = args.output.rstrip(\"/\")\nelse:\nOUTPUT = settings.OUTPUT\nif args.mountpoint:\nMOUNTPOINT = args.mountpoint.rstrip(\"/\")\nelse:\nMOUNTPOINT = settings.MOUNTPOINT\nshow_setup(\nCOLLECTIONS=COLLECTIONS,\nOUTPUT=OUTPUT,\nCACHE_HOME=settings.CACHE_HOME,\nMOUNTPOINT=MOUNTPOINT,\nJISC_PAPERS_CSV=settings.JISC_PAPERS_CSV,\nREPORT_DIR=settings.REPORT_DIR,\nMAX_ELEMENTS_PER_FILE=settings.MAX_ELEMENTS_PER_FILE,\n)\nif args.show_fixture_tables:\n# Show a table of fixtures used, defaults to DataProvider Table\nshow_fixture_tables(settings, data_provider_index=args.data_provider_field)\nif args.export_fixture_tables:\nexport_fixtures(\nfixture_tables=settings.FIXTURE_TABLES,\npath=OUTPUT,\nformats=settings.FIXTURE_TABLES_FORMATS,\n)\nif not args.test_config:\n# Routing alto2txt into subdirectories with structured files\nroute(\nCOLLECTIONS,\nsettings.CACHE_HOME,\nMOUNTPOINT,\nsettings.JISC_PAPERS_CSV,\nsettings.REPORT_DIR,\n)\n# Parsing the resulting JSON files\nparse(\nCOLLECTIONS,\nsettings.CACHE_HOME,\nOUTPUT,\nsettings.MAX_ELEMENTS_PER_FILE,\n)\nclear_cache(settings.CACHE_HOME)\n
Name Type Description Default paths_dictdict[os.PathLike, os.PathLike]
dict[os.PathLike, os.PathLike], Original and renumbered pathsdict
required compress_formatArchiveFormatEnum
Which ArchiveFormatEnum for compression
COMPRESSION_TYPE_DEFAULTtitlestr
Title of returned Table
FILE_RENAME_TABLE_TITLE_DEFAULTprefixstr
str to add in front of every new path
''renumberbool
Whether an int in each path will be renumbered.
True Source code in alto2txt2fixture/cli.py
def file_rename_table(\npaths_dict: dict[os.PathLike, os.PathLike],\ncompress_format: ArchiveFormatEnum = COMPRESSION_TYPE_DEFAULT,\ntitle: str = FILE_RENAME_TABLE_TITLE_DEFAULT,\nprefix: str = \"\",\nrenumber: bool = True,\n) -> Table:\n\"\"\"Create a `rich.Table` of rename configuration.\n Args:\n paths_dict: dict[os.PathLike, os.PathLike],\n Original and renumbered `paths` `dict`\n compress_format:\n Which `ArchiveFormatEnum` for compression\n title:\n Title of returned `Table`\n prefix:\n `str` to add in front of every new path\n renumber:\n Whether an `int` in each path will be renumbered.\n \"\"\"\ntable: Table = Table(title=title)\ntable.add_column(\"Current File Name\", justify=\"right\", style=\"cyan\")\ntable.add_column(\"New File Name\", style=\"magenta\")\ndef final_file_name(name: os.PathLike) -> str:\nreturn (\nprefix\n+ str(Path(name).name)\n+ (f\".{compress_format}\" if compress_format else \"\")\n)\nfor old_path, new_path in paths_dict.items():\nname: str = final_file_name(new_path if renumber else old_path)\ntable.add_row(Path(old_path).name, name)\nreturn table\n
Geneate richTable from func signature and help attr.
Parameters:
Name Type Description Default funcCallable
Function whose args and type hints will be converted to a table.
required valuesdict
dict of variables covered in func signature. local() often suffices.
required titlestr
str for table title.
''extra_dictdict[str, Any]
A dict of additional rows to add to the table. For each key, value pair: if the value is a tuple, it will be expanded to match the Type, Value, and Notes columns; else the Type will be inferred and Notes left blank.
{} Example
>>> def test_func(\n... var_a: Annotated[str, typer.Option(help=\"Example\")] = \"Default\"\n... ) -> None:\n... test_func_table: Table = func_table(test_func, values=vars())\n... console.print(test_func_table)\n>>> if is_platform_win:\n... pytest.skip('fails on certain Windows root paths: issue #56')\n>>> test_func()\n test_func config\n\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n\u2503 Variable \u2503 Type \u2503 Value \u2503 Notes \u2503\n\u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529\n\u2502 var_a \u2502 str \u2502 Default \u2502 Example \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Source code in alto2txt2fixture/cli.py
def func_table(\nfunc: Callable, values: dict, title: str = \"\", extra_dict: dict[str, Any] = {}\n) -> Table:\n\"\"\"Geneate `rich` `Table` from `func` signature and `help` attr.\n Args:\n func:\n Function whose `args` and `type` hints will be converted\n to a table.\n values:\n `dict` of variables covered in `func` signature.\n `local()` often suffices.\n title:\n `str` for table title.\n extra_dict:\n A `dict` of additional rows to add to the table. For each\n `key`, `value` pair: if the `value` is a `tuple`, it will\n be expanded to match the `Type`, `Value`, and `Notes`\n columns; else the `Type` will be inferred and `Notes`\n left blank.\n Example:\n ```pycon\n >>> def test_func(\n ... var_a: Annotated[str, typer.Option(help=\"Example\")] = \"Default\"\n ... ) -> None:\n ... test_func_table: Table = func_table(test_func, values=vars())\n ... console.print(test_func_table)\n >>> if is_platform_win:\n ... pytest.skip('fails on certain Windows root paths: issue #56')\n >>> test_func()\n test_func config\n \u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n \u2503 Variable \u2503 Type \u2503 Value \u2503 Notes \u2503\n \u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529\n \u2502 var_a \u2502 str \u2502 Default \u2502 Example \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n ```\n \"\"\"\ntitle = title if title else f\"{func.__name__} config\"\nfunc_signature: dict = get_type_hints(func, include_extras=True)\ntable: Table = Table(title=title)\ntable.add_column(\"Variable\", justify=\"right\", style=\"cyan\")\ntable.add_column(\"Type\", style=\"yellow\")\ntable.add_column(\"Value\", style=\"magenta\")\ntable.add_column(\"Notes\")\nfor var, info in func_signature.items():\ntry:\nvar_type, annotation = get_args(info)\nvalue: Any = values[var]\nif value in (\"\", \"\"):\nvalue = \"''\"\ntable.add_row(str(var), var_type.__name__, str(value), annotation.help)\nexcept ValueError:\ncontinue\nfor key, val in extra_dict.items():\nif isinstance(val, tuple):\ntable.add_row(key, *val)\nelse:\ntable.add_row(key, type(val).__name__, str(val))\nreturn table\n
plaintext(\npath: Annotated[Path, typer.Argument(help=\"Path to raw plaintext files\")],\nsave_path: Annotated[\nPath, typer.Option(help=\"Path to save json export files\")\n] = Path(DEFAULT_PLAINTEXT_FIXTURE_OUTPUT),\ndata_provider_code: Annotated[\nstr, typer.Option(help=\"Data provider code use existing config\")\n] = \"\",\nextract_path: Annotated[\nPath, typer.Option(help=\"Folder to extract compressed raw plaintext to\")\n] = Path(DEFAULT_EXTRACTED_SUBDIR),\ninitial_pk: Annotated[\nint,\ntyper.Option(help=\"First primary key to increment json export from\"),\n] = DEFAULT_INITIAL_PK,\nrecords_per_json: Annotated[\nint, typer.Option(help=\"Max records per json fixture\")\n] = DEFAULT_MAX_PLAINTEXT_PER_FIXTURE_FILE,\ndigit_padding: Annotated[\nint,\ntyper.Option(help=\"Padding '0's for indexing json fixture filenames\"),\n] = FILE_NAME_0_PADDING_DEFAULT,\ncompress: Annotated[\nbool, typer.Option(help=\"Compress json fixtures\")\n] = False,\ncompress_path: Annotated[\nPath, typer.Option(help=\"Folder to compress json fixtueres to\")\n] = Path(COMPRESSED_PATH_DEFAULT),\ncompress_format: Annotated[\nArchiveFormatEnum,\ntyper.Option(case_sensitive=False, help=\"Compression format\"),\n] = COMPRESSION_TYPE_DEFAULT,\n) -> None\n
Create a PlainTextFixture and save to save_path.
Source code in alto2txt2fixture/cli.py
@cli.command()\ndef plaintext(\npath: Annotated[Path, typer.Argument(help=\"Path to raw plaintext files\")],\nsave_path: Annotated[\nPath, typer.Option(help=\"Path to save json export files\")\n] = Path(DEFAULT_PLAINTEXT_FIXTURE_OUTPUT),\ndata_provider_code: Annotated[\nstr, typer.Option(help=\"Data provider code use existing config\")\n] = \"\",\nextract_path: Annotated[\nPath, typer.Option(help=\"Folder to extract compressed raw plaintext to\")\n] = Path(DEFAULT_EXTRACTED_SUBDIR),\ninitial_pk: Annotated[\nint, typer.Option(help=\"First primary key to increment json export from\")\n] = DEFAULT_INITIAL_PK,\nrecords_per_json: Annotated[\nint, typer.Option(help=\"Max records per json fixture\")\n] = DEFAULT_MAX_PLAINTEXT_PER_FIXTURE_FILE,\ndigit_padding: Annotated[\nint, typer.Option(help=\"Padding '0's for indexing json fixture filenames\")\n] = FILE_NAME_0_PADDING_DEFAULT,\ncompress: Annotated[bool, typer.Option(help=\"Compress json fixtures\")] = False,\ncompress_path: Annotated[\nPath, typer.Option(help=\"Folder to compress json fixtueres to\")\n] = Path(COMPRESSED_PATH_DEFAULT),\ncompress_format: Annotated[\nArchiveFormatEnum,\ntyper.Option(case_sensitive=False, help=\"Compression format\"),\n] = COMPRESSION_TYPE_DEFAULT,\n) -> None:\n\"\"\"Create a PlainTextFixture and save to `save_path`.\"\"\"\nplaintext_fixture = PlainTextFixture(\npath=path,\ndata_provider_code=data_provider_code,\nextract_subdir=extract_path,\nexport_directory=save_path,\ninitial_pk=initial_pk,\nmax_plaintext_per_fixture_file=records_per_json,\njson_0_file_name_padding=digit_padding,\njson_export_compression_format=compress_format,\njson_export_compression_subdir=compress_path,\n)\nplaintext_fixture.info()\nwhile (\nnot plaintext_fixture.compressed_files\nand not plaintext_fixture.plaintext_provided_uncompressed\n):\ntry_another_compressed_txt_source: bool = Confirm.ask(\nf\"No .txt files available from extract path: \"\nf\"{plaintext_fixture.trunc_extract_path_str}\\n\"\nf\"Would you like to extract fixtures from a different path?\",\ndefault=\"n\",\n)\nif try_another_compressed_txt_source:\nnew_extract_path: str = Prompt.ask(\"Please enter a new extract path\")\nplaintext_fixture.path = Path(new_extract_path)\nelse:\nreturn\nplaintext_fixture.info()\nplaintext_fixture.extract_compressed()\nplaintext_fixture.export_to_json_fixtures()\nif compress:\nplaintext_fixture.compress_json_exports()\n
It is possible for the example test to fail in different screen sizes. Try increasing the window or screen width of terminal used to check before raising an issue.
Source code in alto2txt2fixture/cli.py
def show_fixture_tables(\nrun_settings: dotdict = settings,\nprint_in_call: bool = True,\ndata_provider_index: str = DATA_PROVIDER_INDEX,\n) -> list[Table]:\n\"\"\"Print fixture tables specified in ``settings.fixture_tables`` in `rich.Table` format.\n Arguments:\n run_settings: `alto2txt2fixture` run configuration\n print_in_call: whether to print to console (will use ``console`` variable if so)\n data_provider_index: key to index `dataprovider` from ``NEWSPAPER_COLLECTION_METADATA``\n Returns:\n A `list` of `rich.Table` renders from configurations in ``run_settings.FIXTURE_TABLES``\n Example:\n ```pycon\n >>> fixture_tables: list[Table] = show_fixture_tables(\n ... settings,\n ... print_in_call=False)\n >>> len(fixture_tables)\n 1\n >>> fixture_tables[0].title\n 'dataprovider'\n >>> [column.header for column in fixture_tables[0].columns]\n ['pk', 'name', 'code', 'legacy_code', 'collection', 'source_note']\n >>> fixture_tables = show_fixture_tables(settings)\n <BLANKLINE>\n ...dataprovider...Heritage...\u2502 bl_hmd...\u2502 hmd...\n ```\n Note:\n It is possible for the example test to fail in different screen sizes. Try\n increasing the window or screen width of terminal used to check before\n raising an issue.\n \"\"\"\nif run_settings.FIXTURE_TABLES:\nif \"dataprovider\" in run_settings.FIXTURE_TABLES:\ncheck_newspaper_collection_configuration(\nrun_settings.COLLECTIONS,\nrun_settings.FIXTURE_TABLES[\"dataprovider\"],\ndata_provider_index=data_provider_index,\n)\nconsole_tables: list[Table] = list(\ngen_fixture_tables(run_settings.FIXTURE_TABLES)\n)\nif print_in_call:\nfor console_table in console_tables:\nconsole.print(console_table)\nreturn console_tables\nelse:\nreturn []\n
Returns a list with corrected data from a provided dictionary.
Source code in alto2txt2fixture/create_adjacent_tables.py
def correct_dict(o: dict) -> list:\n\"\"\"Returns a list with corrected data from a provided dictionary.\"\"\"\nreturn [(k, v[0], v[1]) for k, v in o.items() if not v[0].startswith(\"Q\")] + [\n(k, v[1], v[0]) for k, v in o.items() if v[0].startswith(\"Q\")\n]\n
Source code in alto2txt2fixture/create_adjacent_tables.py
def download_data(\nfiles_dict: RemoteDataFilesType = {},\noverwrite: bool = OVERWRITE,\nexclude: list[str] = [],\n) -> None:\n\"\"\"Download files in ``files_dict``, overwrite if specified.\n Args:\n files_dict: `dict` of related files to download\n overwrite: `bool` to overwrite ``LOCAL_CACHE`` files or not\n exclude: `list` of files to exclude from ``files_dict``\n Example:\n ```pycon\n >>> from os import chdir\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> set_path: Path = chdir(tmp_path)\n >>> download_data(exclude=[\"mitchells\", \"Newspaper-1\", \"linking\"])\n Excluding mitchells...\n Excluding Newspaper-1...\n Excluding linking...\n Downloading cache...dict_admin_counties.json\n 100% ... 37/37 bytes\n Downloading cache...dict_countries.json\n 100% ... 33.2/33.2 kB\n Downloading cache...dict_historic_counties.json\n 100% ... 41.4/41.4 kB\n Downloading cache...nlp_loc_wikidata_concat.csv\n 100% ... 59.8/59.8 kB\n Downloading cache...wikidata_gazetteer_selected_columns.csv\n 100% ... 47.8/47.8 MB\n ```\n \"\"\"\nif not files_dict:\nfiles_dict = deepcopy(FILES)\nfor data_source in exclude:\nif data_source in files_dict:\nprint(f\"Excluding {data_source}...\")\nfiles_dict.pop(data_source, 0)\nelse:\nlogger.warning(\nf'\"{data_source}\" not an option to exclude from {files_dict}'\n)\n# Describe whether local file exists\nfor k in files_dict.keys():\nfiles_dict[k][\"exists\"] = files_dict[k][\"local\"].exists()\nfiles_to_download = [\n(v[\"remote\"], v[\"local\"], v[\"exists\"])\nfor v in files_dict.values()\nif \"exists\" in v and not v[\"exists\"] or overwrite\n]\nfor url, out, exists in files_to_download:\nrmtree(Path(out), ignore_errors=True) if exists else None\nprint(f\"Downloading {out}\")\nPath(out).parent.mkdir(parents=True, exist_ok=True)\nassert isinstance(url, str)\nwith urlopen(url) as response, open(out, \"wb\") as out_file:\ntotal: int = int(response.info()[\"Content-length\"])\nwith Progress(\n\"[progress.percentage]{task.percentage:>3.0f}%\",\nBarColumn(), # removed bar_width=None to avoid too long when resized\nDownloadColumn(),\n) as progress:\ndownload_task = progress.add_task(\"Download\", total=total)\nfor chunk in response:\nout_file.write(chunk)\nprogress.update(download_task, advance=len(chunk))\n
Get a list from a string, which contains as separator. If no string is encountered, the function returns an empty list. Source code in alto2txt2fixture/create_adjacent_tables.py
def get_list(x):\n\"\"\"Get a list from a string, which contains <SEP> as separator. If no\n string is encountered, the function returns an empty list.\"\"\"\nreturn x.split(\"<SEP>\") if isinstance(x, str) else []\n
Source code in alto2txt2fixture/create_adjacent_tables.py
def get_outpaths_dict(names: Sequence[str], module_name: str) -> TableOutputConfigType:\n\"\"\"Return a `dict` of `csv` and `json` paths for each `module_name` table.\n The `csv` and `json` paths\n Args:\n names: iterable of names of each `module_name`'s component. Main target is `csv` and `json` table names\n module_name: name of module each name is part of, that is added as a prefix\n Returns:\n A ``TableOutputConfigType``: a `dict` of table ``names`` and output\n `csv` and `json` filenames.\n Example:\n ```pycon\n >>> pprint(get_outpaths_dict(MITCHELLS_TABELS, \"mitchells\"))\n {'Entry': {'csv': 'mitchells.Entry.csv', 'json': 'mitchells.Entry.json'},\n 'Issue': {'csv': 'mitchells.Issue.csv', 'json': 'mitchells.Issue.json'},\n 'PoliticalLeaning': {'csv': 'mitchells.PoliticalLeaning.csv',\n 'json': 'mitchells.PoliticalLeaning.json'},\n 'Price': {'csv': 'mitchells.Price.csv', 'json': 'mitchells.Price.json'}}\n ```\n \"\"\"\nreturn {\nname: OutputPathDict(\ncsv=f\"{module_name}.{name}.csv\",\njson=f\"{module_name}.{name}.json\",\n)\nfor name in names\n}\n
Takes an input_sub_path, a publication_code, and an (optional) abbreviation for any newspaper to locate the title in the jisc_papersDataFrame. jisc_papers is usually loaded via the setup_jisc_papers function.
Parameters:
Name Type Description Default titlestr
target newspaper title
required issue_datestr
target newspaper issue_date
required jisc_paperspd.DataFrame
DataFrame of jisc_papers to match
required input_sub_pathstr
path of files to narrow down query input_sub_path
required publication_codestr
unique codes to match newspaper records
required abbrstr | None
an optional abbreviation of the newspaper title
None
Returns:
Type Description str
Matched titlestr or abbr.
Returns:
Type Description str
A string estimating the JISC equivalent newspaper title
Source code in alto2txt2fixture/jisc.py
def get_jisc_title(\ntitle: str,\nissue_date: str,\njisc_papers: pd.DataFrame,\ninput_sub_path: str,\npublication_code: str,\nabbr: str | None = None,\n) -> str:\n\"\"\"\n Match a newspaper ``title`` with ``jisc_papers`` records.\n Takes an ``input_sub_path``, a ``publication_code``, and an (optional)\n abbreviation for any newspaper to locate the ``title`` in the\n ``jisc_papers`` `DataFrame`. ``jisc_papers`` is usually loaded via the\n ``setup_jisc_papers`` function.\n Args:\n title: target newspaper title\n issue_date: target newspaper issue_date\n jisc_papers: `DataFrame` of `jisc_papers` to match\n input_sub_path: path of files to narrow down query input_sub_path\n publication_code: unique codes to match newspaper records\n abbr: an optional abbreviation of the newspaper title\n Returns:\n Matched ``title`` `str` or ``abbr``.\n Returns:\n A string estimating the JISC equivalent newspaper title\n \"\"\"\n# First option, search the input_sub_path for a valid-looking publication_code\ng = PUBLICATION_CODE.findall(input_sub_path)\nif len(g) == 1:\npublication_code = g[0]\n# Let's see if we can find title:\ntitle = (\njisc_papers[\njisc_papers.publication_code == publication_code\n].title.to_list()[0]\nif jisc_papers[\njisc_papers.publication_code == publication_code\n].title.count()\n== 1\nelse title\n)\nreturn title\n# Second option, look through JISC papers for best match (on publication_code if we have it, but abbr more importantly if we have it)\nif abbr:\n_publication_code = publication_code\npublication_code = abbr\nif jisc_papers.abbr[jisc_papers.abbr == publication_code].count():\ndate = datetime.strptime(issue_date, \"%Y-%m-%d\")\nmask = (\n(jisc_papers.abbr == publication_code)\n& (date >= jisc_papers.start_date)\n& (date <= jisc_papers.end_date)\n)\nfiltered = jisc_papers.loc[mask]\nif filtered.publication_code.count() == 1:\npublication_code = filtered.publication_code.to_list()[0]\ntitle = filtered.title.to_list()[0]\nreturn title\n# Last option: let's find all the possible titles in the jisc_papers for the abbreviation, and if it's just one unique title, let's pick it!\nif abbr:\ntest = list({x for x in jisc_papers[jisc_papers.abbr == abbr].title})\nif len(test) == 1:\nreturn test[0]\nelse:\nmask1 = (jisc_papers.abbr == publication_code) & (\njisc_papers.publication_code == _publication_code\n)\ntest1 = jisc_papers.loc[mask1]\ntest1 = list({x for x in jisc_papers[jisc_papers.abbr == abbr].title})\nif len(test) == 1:\nreturn test1[0]\n# Fallback: if abbreviation is set, we'll return that:\nif abbr:\n# For these exceptions, see issue comment:\n# https://github.com/alan-turing-institute/Living-with-Machines/issues/2453#issuecomment-1050652587\nif abbr == \"IPJL\":\nreturn \"Ipswich Journal\"\nelif abbr == \"BHCH\":\nreturn \"Bath Chronicle\"\nelif abbr == \"LSIR\":\nreturn \"Leeds Intelligencer\"\nelif abbr == \"AGER\":\nreturn \"Lancaster Gazetter, And General Advertiser For Lancashire West\"\nreturn abbr\nraise RuntimeError(f\"Title {title} could not be found.\")\n
fixtures(\nfilelist: list = [],\nmodel: str = \"\",\ntranslate: dict = {},\nrename: dict = {},\nuniq_keys: list = [],\n) -> Generator[FixtureDict, None, None]\n
Generates fixtures for a specified model using a list of files.
This function takes a list of files and generates fixtures for a specified model. The fixtures can be used to populate a database or perform other data-related operations.
Parameters:
Name Type Description Default filelistlist
A list of files to process and generate fixtures from.
[]modelstr
The name of the model for which fixtures are generated. translate: A nested dictionary representing the translation mapping for fields. The structure of the translator follows the format:
The translated fields will be used as keys, and their corresponding primary keys (obtained from the provided files) will be used as values in the generated fixtures. ''renamedict
A nested dictionary representing the field renaming mapping. The structure of the dictionary follows the format:
{\n'part1': {\n'part2': 'new_field_name'\n}\n}\n
The fields specified in the dictionary will be renamed to the provided new field names in the generated fixtures. {}uniq_keyslist
A list of fields that need to be considered for uniqueness in the fixtures. If specified, the fixtures will yield only unique items based on the combination of these fields.
[]
Yields:
Type Description FixtureDict
FixtureDict from model, pk and dict of fields.
Returns:
Type Description Generator[FixtureDict, None, None]
This function generates fixtures but does not return any value.
Source code in alto2txt2fixture/parser.py
def fixtures(\nfilelist: list = [],\nmodel: str = \"\",\ntranslate: dict = {},\nrename: dict = {},\nuniq_keys: list = [],\n) -> Generator[FixtureDict, None, None]:\n\"\"\"\n Generates fixtures for a specified model using a list of files.\n This function takes a list of files and generates fixtures for a specified\n model. The fixtures can be used to populate a database or perform other\n data-related operations.\n Args:\n filelist: A list of files to process and generate fixtures from.\n model: The name of the model for which fixtures are generated.\n translate: A nested dictionary representing the translation mapping\n for fields. The structure of the translator follows the format:\n ```python\n {\n 'part1': {\n 'part2': {\n 'translated_field': 'pk'\n }\n }\n }\n ```\n The translated fields will be used as keys, and their\n corresponding primary keys (obtained from the provided files) will\n be used as values in the generated fixtures.\n rename: A nested dictionary representing the field renaming\n mapping. The structure of the dictionary follows the format:\n ```python\n {\n 'part1': {\n 'part2': 'new_field_name'\n }\n }\n ```\n The fields specified in the dictionary will be renamed to the\n provided new field names in the generated fixtures.\n uniq_keys: A list of fields that need to be considered for\n uniqueness in the fixtures. If specified, the fixtures will yield\n only unique items based on the combination of these fields.\n Yields:\n `FixtureDict` from ``model``, ``pk`` and `dict` of ``fields``.\n Returns:\n This function generates fixtures but does not return any value.\n \"\"\"\nfilelist = sorted(filelist, key=lambda x: str(x).split(\"/\")[:-1])\ncount = len(filelist)\n# Process JSONL\nif [x for x in filelist if \".jsonl\" in x.name]:\npk = 0\n# In the future, we might want to show progress here (tqdm or suchlike)\nfor file in filelist:\nfor line in file.read_text().splitlines():\npk += 1\nline = json.loads(line)\nyield FixtureDict(\npk=pk,\nmodel=model,\nfields=dict(**get_fields(line, translate=translate, rename=rename)),\n)\nreturn\nelse:\n# Process JSON\npks = [x for x in range(1, count + 1)]\nif len(uniq_keys):\nuniq_files = list(uniq(filelist, uniq_keys))\ncount = len(uniq_files)\nzipped = zip(uniq_files, pks)\nelse:\nzipped = zip(filelist, pks)\nfor x in tqdm(\nzipped, total=count, desc=f\"{model} ({count:,} objs)\", leave=False\n):\nyield FixtureDict(\npk=x[1],\nmodel=model,\nfields=dict(**get_fields(x[0], translate=translate, rename=rename)),\n)\nreturn\n
Retrieves fields from a file and performs modifications and checks.
This function takes a file (in various formats: Path, str, or dict) and processes its fields. It retrieves the fields from the file and performs modifications, translations, and checks on the fields.
Parameters:
file (Union[Path, str, dict]): The file from which the fields are retrieved. Required.
translate (dict): A nested dictionary representing the translation mapping for fields. The structure of the translator follows the format:

    {
        'part1': {
            'part2': {
                'translated_field': 'pk'
            }
        }
    }

The translated fields will be used to replace the original fields in the retrieved fields. Default: {}
rename (dict): A nested dictionary representing the field renaming mapping. The structure of the dictionary follows the format:

    {
        'part1': {
            'part2': 'new_field_name'
        }
    }

The fields specified in the dictionary will be renamed to the provided new field names in the retrieved fields. Default: {}
allow_null (bool): Determines whether to allow None values for relational fields. If set to True, relational fields with missing values will be assigned None. If set to False, an error will be raised. Default: False
Returns:
dict: A dictionary representing the retrieved fields from the file, with modifications and checks applied.
Raises:
RuntimeError: If the file type is unsupported or if an error occurs during field retrieval or processing.
Source code in alto2txt2fixture/parser.py
def get_fields(\nfile: Union[Path, str, dict],\ntranslate: dict = {},\nrename: dict = {},\nallow_null: bool = False,\n) -> dict:\n\"\"\"\n Retrieves fields from a file and performs modifications and checks.\n This function takes a file (in various formats: `Path`, `str`, or `dict`)\n and processes its fields. It retrieves the fields from the file and\n performs modifications, translations, and checks on the fields.\n Args:\n file: The file from which the fields are retrieved.\n translate: A nested dictionary representing the translation mapping\n for fields. The structure of the translator follows the format:\n ```python\n {\n 'part1': {\n 'part2': {\n 'translated_field': 'pk'\n }\n }\n }\n ```\n The translated fields will be used to replace the original fields\n in the retrieved fields.\n rename: A nested dictionary representing the field renaming\n mapping. The structure of the dictionary follows the format:\n ```python\n {\n 'part1': {\n 'part2': 'new_field_name'\n }\n }\n ```\n The fields specified in the dictionary will be renamed to the\n provided new field names in the retrieved fields.\n allow_null: Determines whether to allow ``None`` values for\n relational fields. If set to ``True``, relational fields with\n missing values will be assigned ``None``. If set to ``False``, an\n error will be raised.\n Returns:\n A dictionary representing the retrieved fields from the file,\n with modifications and checks applied.\n Raises:\n RuntimeError: If the file type is unsupported or if an error occurs\n during field retrieval or processing.\n \"\"\"\nif isinstance(file, Path):\ntry:\nfields = json.loads(file.read_text())\nexcept Exception as e:\nraise RuntimeError(f\"Cannot interpret JSON ({e}): {file}\")\nelif isinstance(file, str):\nif \"\\n\" in file:\nraise RuntimeError(\"File has multiple lines.\")\ntry:\nfields = json.loads(file)\nexcept json.decoder.JSONDecodeError as e:\nraise RuntimeError(f\"Cannot interpret JSON ({e}): {file}\")\nelif isinstance(file, dict):\nfields = file\nelse:\nraise RuntimeError(f\"Cannot process type {type(file)}.\")\n# Fix relational fields for any file\nfor key in [key for key in fields.keys() if \"__\" in key]:\nparts = key.split(\"__\")\ntry:\nbefore = fields[key]\nif before:\nbefore = before.replace(\"---\", \"/\")\nloc = translate.get(parts[0], {}).get(parts[1], {})\nfields[key] = loc.get(before)\nif fields[key] is None:\nraise RuntimeError(\nf\"Cannot translate fields.{key} from {before}: {loc}\"\n)\nexcept AttributeError:\nif allow_null:\nfields[key] = None\nelse:\nprint(\n\"Content had relational fields, but something went wrong in parsing the data:\"\n)\nprint(\"file\", file)\nprint(\"fields\", fields)\nprint(\"KEY:\", key)\nraise RuntimeError()\nnew_name = rename.get(parts[0], {}).get(parts[1], None)\nif new_name:\nfields[new_name] = fields[key]\ndel fields[key]\nfields[\"created_at\"] = NOW_str\nfields[\"updated_at\"] = NOW_str\ntry:\nfields[\"item_type\"] = str(fields[\"item_type\"]).upper()\nexcept KeyError:\npass\ntry:\nif fields[\"ocr_quality_mean\"] == \"\":\nfields[\"ocr_quality_mean\"] = 0\nexcept KeyError:\npass\ntry:\nif fields[\"ocr_quality_sd\"] == \"\":\nfields[\"ocr_quality_sd\"] = 0\nexcept KeyError:\npass\nreturn fields\n
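A small sketch of the translate and rename behaviour (the codes and primary keys below are made up for illustration): a dict can be passed directly, relational fields containing __ are mapped to primary keys via translate, and then renamed via rename:

from alto2txt2fixture.parser import get_fields

# Hypothetical record with one relational field and an item_type.
record = {"publication__publication_code": "0003040", "item_type": "article"}

fields = get_fields(
    record,
    translate={"publication": {"publication_code": {"0003040": 12}}},
    rename={"publication": {"publication_code": "newspaper_id"}},
)
# fields now holds newspaper_id=12 and item_type='ARTICLE', plus the
# created_at/updated_at timestamps added by get_fields.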
Retrieves a specific key from a file and returns its value.
This function reads a file and extracts the value of a specified key. If the key is not found or an error occurs while processing the file, a warning is printed, and an empty string is returned.
Parameters:
item (Path): The file from which the key is extracted. Required.
x (str): The key to be retrieved from the file. Required.
Returns:
str: The value of the specified key from the file.
Source code in alto2txt2fixture/parser.py
def get_key_from(item: Path, x: str) -> str:
    """
    Retrieves a specific key from a file and returns its value.

    This function reads a file and extracts the value of a specified
    key. If the key is not found or an error occurs while processing
    the file, a warning is printed, and an empty string is returned.

    Args:
        item: The file from which the key is extracted.
        x: The key to be retrieved from the file.

    Returns:
        The value of the specified key from the file.
    """
    result = json.loads(item.read_text()).get(x, None)
    if not result:
        print(f"[WARN] Could not find key {x} in {item}")
        result = ""
    return result
def get_translator(\nfields: list[TranslatorTuple] = [TranslatorTuple(\"\", \"\", [])]\n) -> dict:\n\"\"\"\n Converts a list of fields into a nested dictionary representing a\n translator.\n Args:\n fields: A list of tuples representing fields to be translated.\n Returns:\n A nested dictionary representing the translator. The structure of\n the dictionary follows the format:\n ```python\n {\n 'part1': {\n 'part2': {\n 'translated_field': 'pk'\n }\n }\n }\n ```\n Example:\n ```pycon\n >>> fields = [\n ... TranslatorTuple(\n ... start='start__field1',\n ... finish='field1',\n ... lst=[{\n ... 'fields': {'field1': 'translation1'},\n ... 'pk': 1}],\n ... )]\n >>> get_translator(fields)\n {'start': {'field1': {'translation1': 1}}}\n ```\n \"\"\"\n_ = dict()\nfor field in fields:\nstart, finish, lst = field\npart1, part2 = start.split(\"__\")\nif part1 not in _:\n_[part1] = {}\nif part2 not in _[part1]:\n_[part1][part2] = {}\nif isinstance(finish, str):\n_[part1][part2] = {o[\"fields\"][finish]: o[\"pk\"] for o in lst}\nelif isinstance(finish, list):\n_[part1][part2] = {\n\"-\".join([o[\"fields\"][x] for x in finish]): o[\"pk\"] for o in lst\n}\nreturn _\n
Parses files from collections and generates fixtures for various models.
This function processes files from the specified collections and generates fixtures for different models, such as newspapers.dataprovider, newspapers.ingest, newspapers.digitisation, newspapers.newspaper, newspapers.issue, and newspapers.item.
It performs various steps, such as file listing, fixture generation, translation mapping, renaming fields, and saving fixtures to files.
Parameters:
collections (list): A list of collections from which files are processed and fixtures are generated. Required.
cache_home (str): The directory path where the collections are located. Required.
output (str): The directory path where the fixtures will be saved. Required.
max_elements_per_file (int): The maximum number of elements per file when saving fixtures. Required.
Returns:
None: This function generates fixtures but does not return any value.
Source code in alto2txt2fixture/parser.py
def parse(\ncollections: list, cache_home: str, output: str, max_elements_per_file: int\n) -> None:\n\"\"\"\n Parses files from collections and generates fixtures for various models.\n This function processes files from the specified collections and generates\n fixtures for different models, such as `newspapers.dataprovider`,\n `newspapers.ingest`, `newspapers.digitisation`, `newspapers.newspaper`,\n `newspapers.issue`, and `newspapers.item`.\n It performs various steps, such as file listing, fixture generation,\n translation mapping, renaming fields, and saving fixtures to files.\n Args:\n collections: A list of collections from which files are\n processed and fixtures are generated.\n cache_home: The directory path where the collections are located.\n output: The directory path where the fixtures will be saved.\n max_elements_per_file: The maximum number of elements per file\n when saving fixtures.\n Returns:\n This function generates fixtures but does not return any value.\n \"\"\"\nglobal CACHE_HOME\nglobal OUTPUT\nglobal MAX_ELEMENTS_PER_FILE\nCACHE_HOME = cache_home\nOUTPUT = output\nMAX_ELEMENTS_PER_FILE = max_elements_per_file\n# Set up output directory\nreset_fixture_dir(OUTPUT)\n# Get file lists\nprint(\"\\nGetting file lists...\")\ndef issues_in_x(x):\nreturn \"issues\" in str(x.parent).split(\"/\")\ndef newspapers_in_x(x):\nreturn not any(\n[\ncondition\nfor y in str(x.parent).split(\"/\")\nfor condition in [\n\"issues\" in y,\n\"ingest\" in y,\n\"digitisation\" in y,\n\"data-provider\" in y,\n]\n]\n)\nall_json = [\nx for y in collections for x in (Path(CACHE_HOME) / y).glob(\"**/*.json\")\n]\nall_jsonl = [\nx for y in collections for x in (Path(CACHE_HOME) / y).glob(\"**/*.jsonl\")\n]\nprint(f\"--> {len(all_json):,} JSON files altogether\")\nprint(f\"--> {len(all_jsonl):,} JSONL files altogether\")\nprint(\"\\nSetting up fixtures...\")\n# Process data providers\ndef data_provider_in_x(x):\nreturn \"data-provider\" in str(x.parent).split(\"/\")\ndata_provider_json = list(\nfixtures(\nmodel=\"newspapers.dataprovider\",\nfilelist=[x for x in all_json if data_provider_in_x(x)],\nuniq_keys=[\"name\"],\n)\n)\nprint(f\"--> {len(data_provider_json):,} DataProvider fixtures\")\n# Process ingest\ndef ingest_in_x(x):\nreturn \"ingest\" in str(x.parent).split(\"/\")\ningest_json = list(\nfixtures(\nmodel=\"newspapers.ingest\",\nfilelist=[x for x in all_json if ingest_in_x(x)],\nuniq_keys=[\"lwm_tool_name\", \"lwm_tool_version\"],\n)\n)\nprint(f\"--> {len(ingest_json):,} Ingest fixtures\")\n# Process digitisation\ndef digitisation_in_x(x):\nreturn \"digitisation\" in str(x.parent).split(\"/\")\ndigitisation_json = list(\nfixtures(\nmodel=\"newspapers.digitisation\",\nfilelist=[x for x in all_json if digitisation_in_x(x)],\nuniq_keys=[\"software\"],\n)\n)\nprint(f\"--> {len(digitisation_json):,} Digitisation fixtures\")\n# Process newspapers\nnewspaper_json = list(\nfixtures(\nmodel=\"newspapers.newspaper\",\nfilelist=[file for file in all_json if newspapers_in_x(file)],\n)\n)\nprint(f\"--> {len(newspaper_json):,} Newspaper fixtures\")\n# Process issue\ntranslate = get_translator(\n[\nTranslatorTuple(\n\"publication__publication_code\", \"publication_code\", newspaper_json\n)\n]\n)\nrename = {\"publication\": {\"publication_code\": \"newspaper_id\"}}\nissue_json = list(\nfixtures(\nmodel=\"newspapers.issue\",\nfilelist=[file for file in all_json if issues_in_x(file)],\ntranslate=translate,\nrename=rename,\n)\n)\nprint(f\"--> {len(issue_json):,} Issue fixtures\")\n# Create translator/clear 
up memory before processing items\ntranslate = get_translator(\n[\n(\"issue__issue_identifier\", \"issue_code\", issue_json),\n(\"digitisation__software\", \"software\", digitisation_json),\n(\"data_provider__name\", \"name\", data_provider_json),\n(\n\"ingest__lwm_tool_identifier\",\n[\"lwm_tool_name\", \"lwm_tool_version\"],\ningest_json,\n),\n]\n)\nrename = {\n\"issue\": {\"issue_identifier\": \"issue_id\"},\n\"digitisation\": {\"software\": \"digitisation_id\"},\n\"data_provider\": {\"name\": \"data_provider_id\"},\n\"ingest\": {\"lwm_tool_identifier\": \"ingest_id\"},\n}\nsave_fixture(newspaper_json, \"Newspaper\")\nsave_fixture(issue_json, \"Issue\")\ndel newspaper_json\ndel issue_json\ngc.collect()\nprint(\"\\nSaving...\")\nsave_fixture(digitisation_json, \"Digitisation\")\nsave_fixture(ingest_json, \"Ingest\")\nsave_fixture(data_provider_json, \"DataProvider\")\n# Process items\nitem_json = fixtures(\nmodel=\"newspapers.item\",\nfilelist=all_jsonl,\ntranslate=translate,\nrename=rename,\n)\nsave_fixture(item_json, \"Item\")\nreturn\n
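A direct call might look like the sketch below; the paths and the per-file cap are placeholders, and in normal use these values come from the settings module via the command line interface:

from alto2txt2fixture.parser import parse

parse(
    collections=["hmd"],            # collections to process
    cache_home="./cache",           # where the cached JSON/JSONL files live
    output="./output/fixtures",     # where fixture files are written
    max_elements_per_file=2_000_000,
)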
Resets the fixture directory by removing all JSON files inside it.
This function takes a directory path (output) as input and removes all JSON files within the directory.
Prior to removal, it prompts the user for confirmation to proceed. If the user confirms, the function clears the fixture directory by deleting the JSON files.
Parameters:
output (str | Path): The directory path of the fixture directory to be reset. Required.
Raises:
RuntimeError: If the output directory is not specified as a string.
Source code in alto2txt2fixture/parser.py
def reset_fixture_dir(output: str | Path) -> None:\n\"\"\"\n Resets the fixture directory by removing all JSON files inside it.\n This function takes a directory path (``output``) as input and removes all\n JSON files within the directory.\n Prior to removal, it prompts the user for confirmation to proceed. If the\n user confirms, the function clears the fixture directory by deleting the\n JSON files.\n Args:\n output: The directory path of the fixture directory to be reset.\n Raises:\n RuntimeError: If the ``output`` directory is not specified as a string.\n \"\"\"\nif not isinstance(output, str):\nraise RuntimeError(\"`output` directory needs to be specified as a string.\")\noutput = Path(output)\ny = input(\nf\"This command will automatically empty the fixture directory ({output.absolute()}). \"\n\"Do you want to proceed? [y/N]\"\n)\nif not y.lower() == \"y\":\noutput.mkdir(parents=True, exist_ok=True)\nreturn\nprint(\"\\nClearing up the fixture directory\")\n# Ensure directory exists\noutput.mkdir(parents=True, exist_ok=True)\n# Drop all JSON files\n[x.unlink() for x in Path(output).glob(\"*.json\")]\nreturn\n
uniq(filelist: list, keys: list = []) -> Generator[Any, None, None]
Generates unique items from a list of files based on specified keys.
This function takes a list of files and yields unique items based on a combination of keys. The keys are extracted from each file using the get_key_from function, and duplicate items are ignored.
Parameters:
filelist (list): A list of files from which unique items are generated. Required.
keys (list): A list of keys used for uniqueness. Each key specifies a field to be used for uniqueness checking in the generated items. Default: []
Yields:
Any: A unique item from filelist.
Source code in alto2txt2fixture/parser.py
def uniq(filelist: list, keys: list = []) -> Generator[Any, None, None]:
    """
    Generates unique items from a list of files based on specified keys.

    This function takes a list of files and yields unique items based on a
    combination of keys. The keys are extracted from each file using the
    ``get_key_from`` function, and duplicate items are ignored.

    Args:
        filelist: A list of files from which unique items are generated.
        keys: A list of keys used for uniqueness. Each key specifies
            a field to be used for uniqueness checking in the generated
            items.

    Yields:
        A unique item from `filelist`.
    """
    seen = set()
    for item in filelist:
        key = "-".join([get_key_from(item, x) for x in keys])
        if key not in seen:
            seen.add(key)
            yield item
        else:
            # Drop it if duplicate
            pass
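For example (the paths are illustrative), ingest metadata files can be deduplicated on the tool name and version fields, mirroring how parse uses uniq_keys:

from pathlib import Path

from alto2txt2fixture.parser import uniq

ingest_files = sorted(Path("./cache/hmd/ingest").glob("**/*.json"))

# Only the first file per distinct (lwm_tool_name, lwm_tool_version) pair is yielded.
unique_ingest_files = list(uniq(ingest_files, keys=["lwm_tool_name", "lwm_tool_version"]))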
The fulltext app has a Fulltext model class specified in lwmdb.fulltext.models.Fulltext. A SQL table is generated from that Fulltext class, and the JSON fixture structure generated from this class is where records will be stored.
extract_subdir (PathLike): Folder to extract self.compressed_files to.
plaintext_extension (str): What file extension to use to filter plaintext files.
Return class name with count and DataProvider if available.
Source code in alto2txt2fixture/plaintext.py
def __str__(self) -> str:
    """Return class name with count and `DataProvider` if available."""
    return (
        f"{type(self).__name__} "
        f"for {len(self)} "
        f"{self._data_provider_code_quoted_with_trailing_space}files"
    )
Parameters:
output_path (PathLike | None): Path to save compressed json files to. Uses self.json_export_compression_subdir if None is passed. Default: None
format (ArchiveFormatEnum | None): What compression format to use from ArchiveFormatEnum. Uses self.json_export_compression_format if None is passed. Default: None
Note: Neither output_path nor format overwrite the related attributes of self.
Example
>>> if is_platform_win:
...     pytest.skip('decompression fails on Windows: issue #55')
>>> plaintext_bl_lwm = getfixture('bl_lwm_plaintext_json_export')
<BLANKLINE>
...
>>> compressed_paths: Path = plaintext_bl_lwm.compress_json_exports(
...     format='tar')
<BLANKLINE>
...Compressing...'...01.json' to...'tar'...in:...
>>> compressed_paths
(...Path('.../plaintext_fixture-000001.json.tar'),)
Source code in alto2txt2fixture/plaintext.py
def compress_json_exports(\nself,\noutput_path: PathLike | None = None,\nformat: ArchiveFormatEnum | None = None,\n) -> tuple[Path, ...]:\n\"\"\"Compress `self._exported_json_paths` to `format`.\n Args:\n output_path:\n `Path` to save compressed `json` files to. Uses\n `self.json_export_compression_subdir` if `None` is passed.\n format:\n What compression format to use from `ArchiveFormatEnum`. Uses\n `self.json_export_compression_format` if `None` is passed.\n Note:\n Neither `output_path` nor `format` overwrite the related attributes\n of `self`.\n Returns: The the `output_path` passed to save compressed `json`.\n Example:\n ```pycon\n >>> if is_platform_win:\n ... pytest.skip('decompression fails on Windows: issue #55')\n >>> plaintext_bl_lwm = getfixture('bl_lwm_plaintext_json_export')\n <BLANKLINE>\n ...\n >>> compressed_paths: Path = plaintext_bl_lwm.compress_json_exports(\n ... format='tar')\n <BLANKLINE>\n ...Compressing...'...01.json' to...'tar'...in:...\n >>> compressed_paths\n (...Path('.../plaintext_fixture-000001.json.tar'),)\n ```\n \"\"\"\noutput_path = (\nPath(self.json_export_compression_subdir)\nif not output_path\nelse Path(output_path)\n)\nformat = self.json_export_compression_format if not format else format\ncompressed_paths: list[Path] = []\nfor json_path in self.exported_json_paths:\ncompressed_paths.append(\ncompress_fixture(json_path, output_path=output_path, format=format)\n)\nreturn tuple(compressed_paths)\n
The Archive class represents a zip archive of XML files. The class is used to extract information from a ZIP archive, and it contains several methods to process the data contained in the archive.
open(Archive) context manager
Archive can be opened with a context manager, which creates a meta object, with timings for the object. When closed, it will save the meta JSON to the correct paths.
Attributes:
path (Path): The path to the zip archive.
collection (str): The collection of the XML files in the archive. Default is "".
report (Path): The file path of the report file for the archive.
report_id (str): The report ID for the archive. If not provided, a random UUID is generated.
report_parent (Path): The parent directory of the report file for the archive.
jisc_papers (pd.DataFrame): A DataFrame of JISC papers.
size (str | float): The size of the archive, in human-readable format.
size_raw (str | float): The raw size of the archive, in bytes.
roots (Generator[ET.Element, None, None]): The root elements of the XML documents contained in the archive.
meta (dotdict): Metadata about the archive, such as its path, size, and number of contents.
A generator that yields instances of the Document class for each XML file in the ZIP archive.
It uses the tqdm library to display a progress bar in the terminal while it is running.
If the contents of the ZIP file are not empty, the method creates an instance of the Document class by passing the root element of the XML file, the collection name, meta information about the archive, and the JISC papers data frame (if provided) to the constructor of the Document class. The instance of the Document class is then returned by the generator.
Yields:
Document: Document class instance for each unzipped XML file.
Source code in alto2txt2fixture/router.py
def get_documents(self) -> Generator[Document, None, None]:\n\"\"\"\n A generator that yields instances of the Document class for each XML\n file in the ZIP archive.\n It uses the `tqdm` library to display a progress bar in the terminal\n while it is running.\n If the contents of the ZIP file are not empty, the method creates an\n instance of the ``Document`` class by passing the root element of the XML\n file, the collection name, meta information about the archive, and the\n JISC papers data frame (if provided) to the constructor of the\n ``Document`` class. The instance of the ``Document`` class is then\n returned by the generator.\n Yields:\n ``Document`` class instance for each unzipped `XML` file.\n \"\"\"\nfor xml_file in tqdm(\nself.filelist,\ndesc=f\"{Path(self.zip_file.filename).stem} ({self.meta.size})\",\nleave=False,\ncolour=\"green\",\n):\nwith self.zip_file.open(xml_file) as f:\nxml = f.read()\nif xml:\nyield Document(\nroot=ET.fromstring(xml),\ncollection=self.collection,\nmeta=self.meta,\njisc_papers=self.jisc_papers,\n)\n
Yields the root elements of the XML documents contained in the archive.
Source code in alto2txt2fixture/router.py
def get_roots(self) -> Generator[ET.Element, None, None]:
    """
    Yields the root elements of the XML documents contained in the archive.
    """
    for xml_file in tqdm(self.filelist, leave=False, colour="blue"):
        with self.zip_file.open(xml_file) as f:
            xml = f.read()
            if xml:
                yield ET.fromstring(xml)
The Cache class provides a blueprint for creating and managing cache data. The class has several methods that help in getting the cache path, converting the data to a dictionary, and writing the cache data to a file.
It is inherited by many other classes in this document.
Initializes the Cache class object.
Source code in alto2txt2fixture/router.py
def __init__(self):
    """
    Initializes the Cache class object.
    """
    pass
Returns the cache path, which is used to store the cache data. The path is normally constructed using some of the object's properties (collection, kind, and id) but can be changed when inherited.
Source code in alto2txt2fixture/router.py
def get_cache_path(self) -> Path:
    """
    Returns the cache path, which is used to store the cache data.

    The path is normally constructed using some of the object's
    properties (collection, kind, and id) but can be changed when
    inherited.
    """
    return Path(f"{CACHE_HOME}/{self.collection}/{self.kind}/{self.id}.json")
write_to_cache(json_indent: int = JSON_INDENT) -> Optional[bool]
Writes the cache data to a file at the specified cache path. The cache data is first converted to a dictionary using the as_dict method. If the cache path already exists, the function returns True.
Source code in alto2txt2fixture/router.py
def write_to_cache(self, json_indent: int = JSON_INDENT) -> Optional[bool]:
    """
    Writes the cache data to a file at the specified cache path. The cache
    data is first converted to a dictionary using the as_dict method. If
    the cache path already exists, the function returns True.
    """
    path = self.get_cache_path()
    try:
        if path.exists():
            return True
    except AttributeError:
        error(
            f"Error occurred when getting cache path for "
            f"{self.kind}: {path}. It was not of expected "
            f"type Path but of type {type(path)}:",
        )
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w+") as f:
        f.write(json.dumps(self.as_dict(), indent=json_indent))
    return
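A minimal sketch of how a subclass satisfies the Cache contract (ReportCache and its fields are hypothetical, not part of the package): get_cache_path builds the destination from collection, kind and id, and write_to_cache serialises whatever as_dict returns:

from alto2txt2fixture.router import Cache

class ReportCache(Cache):
    """Hypothetical subclass illustrating the Cache contract."""

    kind = "report"

    def __init__(self, collection: str, id: str, data: dict):
        self.collection = collection
        self.id = id
        self.data = data

    def as_dict(self) -> dict:
        # write_to_cache() serialises this dictionary to JSON.
        return self.data

# ReportCache("hmd", "0001", {"status": "ok"}).write_to_cache() would write
# {CACHE_HOME}/hmd/report/0001.json unless that file already exists.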
A Collection represents a group of newspaper archives from any passed alto2txt metadata output.
A Collection is initialised with a name and an optional pandas DataFrame of JISC papers. The archives property returns an iterable of the Archive objects within the collection.
The DataProvider class extends the Cache class and represents a newspaper data provider. The class has several properties and methods that allow creation of a data provider object and the manipulation of its data.
Attributes:
collection (str): A string representing the publication collection
kind (str): Indication of object type, defaults to data-provider
providers_meta_data (list[FixtureDict]): Structured dict of metadata for known collection sources
collection_type (str): Related data sources and potential linkage source
index_field (str): Field name for querying existing records
Example
>>> from pprint import pprint
>>> hmd = DataProvider("hmd")
>>> hmd.pk
2
>>> pprint(hmd.as_dict())
{'code': 'bl_hmd',
 'collection': 'newspapers',
 'legacy_code': 'hmd',
 'name': 'Heritage Made Digital',
 'source_note': 'British Library-funded digitised newspapers provided by the '
                'British Newspaper Archive'}
The Digitisation class extends the Cache class and represents a newspaper digitisation. The class has several properties and methods that allow the creation of a digitisation object and the manipulation of its data.
Attributes:
root (ET.Element): An xml element that represents the root of the publication
collection (str): A string that represents the collection of the publication
Constructor method.
Source code in alto2txt2fixture/router.py
def __init__(self, root: ET.Element, collection: str = ""):
    """Constructor method."""
    if not isinstance(root, ET.Element):
        raise RuntimeError(f"Expected root to be xml.etree.Element: {type(root)}")
    self.root: ET.Element = root
    self.collection: str = collection
A method that returns a dictionary representation of the digitisation object.
Returns:
dict: Dictionary representation of the Digitisation object
Source code in alto2txt2fixture/router.py
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the digitisation
    object.

    Returns:
        Dictionary representation of the Digitisation object
    """
    dic = {
        x.tag: x.text or ""
        for x in self.root.findall("./process/*")
        if x.tag
        in [
            "xml_flavour",
            "software",
            "mets_namespace",
            "alto_namespace",
        ]
    }
    if not dic.get("software"):
        return {}
    return dic
The Document class is a representation of a document that contains information about a publication, newspaper, item, digitisation, and ingest. This class holds all the relevant information about a document in a structured manner and provides properties that can be used to access different aspects of the document.
Attributes:
collection (str | None): A string that represents the collection of the publication
root (ET.Element | None): An XML element that represents the root of the publication
zip_file (str | None): A path to a valid zip file
jisc_papers (pd.DataFrame | None): A pandas DataFrame object that holds information about the JISC papers
meta (dotdict | None): TODO
Constructor method.
Source code in alto2txt2fixture/router.py
def __init__(self, *args, **kwargs):\n\"\"\"Constructor method.\"\"\"\nself.collection: str | None = kwargs.get(\"collection\")\nif not self.collection or not isinstance(self.collection, str):\nraise RuntimeError(\"A valid collection must be passed\")\nself.root: ET.Element | None = kwargs.get(\"root\")\nif not self.root or not isinstance(self.root, ET.Element):\nraise RuntimeError(\"A valid XML root must be passed\")\nself.zip_file: str | None = kwargs.get(\"zip_file\")\nif self.zip_file and not isinstance(self.zip_file, str):\nraise RuntimeError(\"A valid zip file must be passed\")\nself.jisc_papers: pd.DataFrame | None = kwargs.get(\"jisc_papers\")\nif not isinstance(self.jisc_papers, pd.DataFrame):\nraise RuntimeError(\n\"A valid DataFrame containing JISC papers must be passed\"\n)\nself.meta: dotdict | None = kwargs.get(\"meta\")\nself._publication_elem = None\nself._input_sub_path = None\nself._ingest = None\nself._digitisation = None\nself._item = None\nself._issue = None\nself._newspaper = None\nself._data_provider = None\n
The Ingest class extends the Cache class and represents a newspaper ingest. The class has several properties and methods that allow the creation of an ingest object and the manipulation of its data.
Attributes:
root (ET.Element): An xml element that represents the root of the publication
collection (str): A string that represents the collection of the publication
Constructor method.
Source code in alto2txt2fixture/router.py
def __init__(self, root: ET.Element, collection: str = ""):
    """Constructor method."""
    if not isinstance(root, ET.Element):
        raise RuntimeError(f"Expected root to be xml.etree.Element: {type(root)}")
    self.root: ET.Element = root
    self.collection: str = collection
A method that returns a dictionary representation of the ingest object.
Returns:
dict: Dictionary representation of the Ingest object
Source code in alto2txt2fixture/router.py
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the ingest
    object.

    Returns:
        Dictionary representation of the Ingest object
    """
    return {
        f"lwm_tool_{x.tag}": x.text or ""
        for x in self.root.findall("./process/lwm_tool/*")
    }
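The resulting dictionary prefixes each tag found under ./process/lwm_tool with lwm_tool_, so a typical record might look like the following (the values are made up for illustration):

{"lwm_tool_name": "alto2txt", "lwm_tool_version": "v1.0.0"}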
The Issue class extends the Cache class and represents a newspaper issue. The class has several properties and methods that allow the creation of an issue object and the manipulation of its data.
Attributes:
root: An xml element that represents the root of the publication
newspaper (Newspaper | None): The parent newspaper
collection (str): A string that represents the collection of the publication
A method that returns a dictionary representation of the issue object.
Returns:
dict: Dictionary representation of the Issue object
Source code in alto2txt2fixture/router.py
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the issue
    object.

    Returns:
        Dictionary representation of the Issue object
    """
    if not self._issue:
        self._issue = dict(
            issue_code=self.issue_code,
            issue_date=self.issue_date,
            publication__publication_code=self.newspaper.publication_code,
            input_sub_path=self.input_sub_path,
        )
    return self._issue
Returns the path to the cache file for the issue object.
Returns:
Path: Path to the cache file for the issue object
Source code in alto2txt2fixture/router.py
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the issue object.

    Returns:
        Path to the cache file for the issue object
    """
    json_file = f"/{self.newspaper.publication_code}/issues/{self.issue_code}.json"
    return Path(
        f"{CACHE_HOME}/{self.collection}/"
        + "/".join(self.newspaper.number_paths)
        + json_file
    )
The Item class extends the Cache class and represents a newspaper item, i.e. an article. The class has several properties and methods that allow the creation of an article object and the manipulation of its data.
Attributes:
root (ET.Element): An xml element that represents the root of the publication
issue_code (str): A string that represents the issue code
digitisation (dict): TODO
ingest (dict): TODO
collection (str): A string that represents the collection of the publication
newspaper (Newspaper | None): The parent newspaper
meta (dotdict): TODO
Constructor method.
Source code in alto2txt2fixture/router.py
def __init__(\nself,\nroot: ET.Element,\nissue_code: str = \"\",\ndigitisation: dict = {},\ningest: dict = {},\ncollection: str = \"\",\nnewspaper: Optional[Newspaper] = None,\nmeta: dotdict = dotdict(),\n):\n\"\"\"Constructor method.\"\"\"\nif not isinstance(root, ET.Element):\nraise RuntimeError(f\"Expected root to be xml.etree.Element: {type(root)}\")\nif not isinstance(newspaper, Newspaper):\nraise RuntimeError(\"Expected newspaper to be of type router.Newspaper\")\nself.root: ET.Element = root\nself.issue_code: str = issue_code\nself.digitisation: dict = digitisation\nself.ingest: dict = ingest\nself.collection: str = collection\nself.newspaper: Newspaper | None = newspaper\nself.meta: dotdict = meta\nself._item_elem = None\nself._item_code = None\nself._item = None\npath: str = str(self.get_cache_path())\nif not self.meta.item_paths:\nself.meta.item_paths = [path]\nelif path not in self.meta.item_paths:\nself.meta.item_paths.append(path)\n
Returns the path to the cache file for the item (article) object.
Returns:
Path: Path to the cache file for the article object
Source code in alto2txt2fixture/router.py
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the item (article) object.

    Returns:
        Path to the cache file for the article object
    """
    return Path(
        f"{CACHE_HOME}/{self.collection}/"
        + "/".join(self.newspaper.number_paths)
        + f"/{self.newspaper.publication_code}/items.jsonl"
    )
Special cache-write function that appends rather than writes at the end of the process.
Returns:
None.
Source code in alto2txt2fixture/router.py
def write_to_cache(self, json_indent=JSON_INDENT) -> None:
    """
    Special cache-write function that appends rather than writes at the
    end of the process.

    Returns:
        None.
    """
    path = self.get_cache_path()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a+") as f:
        f.write(json.dumps(self.as_dict(), indent=json_indent) + "\n")
    return
A method that returns a dictionary representation of the newspaper object.
Returns:
dict: Dictionary representation of the Newspaper object
Source code in alto2txt2fixture/router.py
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the newspaper
    object.

    Returns:
        Dictionary representation of the Newspaper object
    """
    if not self._newspaper:
        self._newspaper = dict(
            **dict(publication_code=self.publication_code, title=self.title),
            **{
                x.tag: x.text or ""
                for x in self.publication.findall("*")
                if x.tag in ["location"]
            },
        )
    return self._newspaper
Returns the path to the cache file for the newspaper object.
Returns:
Path: Path to the cache file for the newspaper object
Source code in alto2txt2fixture/router.py
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the newspaper object.

    Returns:
        Path to the cache file for the newspaper object
    """
    json_file = f"/{self.publication_code}/{self.publication_code}.json"
    return Path(
        f"{CACHE_HOME}/{self.collection}/" + "/".join(self.number_paths) + json_file
    )
A method that returns the publication code from the input sub-path of the publication process.
Returns:
str | None: The code of the publication
Source code in alto2txt2fixture/router.py
def publication_code_from_input_sub_path(self) -> str | None:
    """
    A method that returns the publication code from the input sub-path of
    the publication process.

    Returns:
        The code of the publication
    """
    g = PUBLICATION_CODE.findall(self.input_sub_path)
    if len(g) == 1:
        return g[0]
    return None
This function is responsible for setting up the path for the alto2txt mountpoint, setting up the JISC papers and routing the collections for processing.
Parameters:
collections (list): List of collection names. Required.
cache_home (str): Directory path for the cache. Required.
mountpoint (str): Directory path for the alto2txt mountpoint. Required.
jisc_papers_path (str): Path to the JISC papers. Required.
report_dir (str): Path to the report directory. Required.
Returns:
None
Source code in alto2txt2fixture/router.py
def route(\ncollections: list,\ncache_home: str,\nmountpoint: str,\njisc_papers_path: str,\nreport_dir: str,\n) -> None:\n\"\"\"\n This function is responsible for setting up the path for the alto2txt\n mountpoint, setting up the JISC papers and routing the collections for\n processing.\n Args:\n collections: List of collection names\n cache_home: Directory path for the cache\n mountpoint: Directory path for the alto2txt mountpoint\n jisc_papers_path: Path to the JISC papers\n report_dir: Path to the report directory\n Returns:\n None\n \"\"\"\nglobal CACHE_HOME\nglobal MNT\nglobal REPORT_DIR\nCACHE_HOME = cache_home\nREPORT_DIR = report_dir\nMNT = Path(mountpoint) if isinstance(mountpoint, str) else mountpoint\nif not MNT.exists():\nerror(\nf\"The mountpoint provided for alto2txt does not exist. \"\nf\"Either create a local copy or blobfuse it to \"\nf\"`{MNT.absolute()}`.\"\n)\njisc_papers = setup_jisc_papers(path=jisc_papers_path)\nfor collection_name in collections:\ncollection = Collection(name=collection_name, jisc_papers=jisc_papers)\nif collection.empty:\nerror(\nf\"It looks like {collection_name} is empty in the \"\nf\"alto2txt mountpoint: `{collection.dir.absolute()}`.\"\n)\nfor archive in collection.archives:\nwith archive as _:\n[\n(\ndoc.item.write_to_cache(),\ndoc.newspaper.write_to_cache(),\ndoc.issue.write_to_cache(),\ndoc.data_provider.write_to_cache(),\ndoc.ingest.write_to_cache(),\ndoc.digitisation.write_to_cache(),\n)\nfor doc in archive.documents\n]\nreturn\n
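A sketch of a direct call (all paths are placeholders; route is normally driven by the command line interface and the settings module):

from alto2txt2fixture.router import route

route(
    collections=["hmd"],
    cache_home="./cache",
    mountpoint="./mnt/alto2txt",           # local copy or blobfuse mount of alto2txt output
    jisc_papers_path="./jisc_papers.csv",  # placeholder path to JISC papers metadata
    report_dir="./reports",
)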
Fields within the fields portion of a FixtureDict to fit lwmdb.
Attributes:
name (str): The name of the collection data source. For lwmdb this should be less than 600 characters.
code (str | NEWSPAPER_OCR_FORMATS): A short slug-like, url-compatible (replace spaces with -) str to uniquely identify a data provider in urls, api calls etc. Designed to fit NEWSPAPER_OCR_FORMATS and any future slug-like codes.
legacy_code (LEGACY_NEWSPAPER_OCR_FORMATS | None): Either blank or a legacy slug-like, url-compatible (replace spaces with -) str originally used by alto2txt, following LEGACY_NEWSPAPER_OCR_FORMATS rather than NEWSPAPER_OCR_FORMATS.
A typed dict for Plaintext Fixtures to match the lwmdb Fulltext model
Attributes:
text (str): Plaintext, potentially quite large newspaper articles. May have unusual or unreadable sequences of characters due to issues with Optical Character Recognition quality.
path (str): Path of provided plaintext file. If compressed_path is None, this is the original relative Path of the plaintext file.
compressed_path (str | None): The path of a compressed data source, the extraction of which provides access to plaintext files.
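A hypothetical fields dict matching this structure (the values are illustrative only):

plaintext_fields = {
    "text": "RAW OCR TEXT OF A NEWSPAPER ARTICLE...",
    "path": "0003040/1894/0905/art0001.txt",
    "compressed_path": "0003040/1894/0905.zip",
}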
finish (str | list): A string or list specifying the field(s) to be translated. If it is a string, the translated field will be a direct mapping of the specified field in each item of the input list. If it is a list, the translated field will be a hyphen-separated concatenation of the specified fields in each item of the input list.
lst (list[dict]): A list of dictionaries representing the items to be translated. Each dictionary should contain the necessary fields for translation, with the field names specified in the start parameter.
data_provider_index (str): dict fields key used to check matching collections name. Default: DATA_PROVIDER_INDEX
Returns:
set[str]: A set of collections without a matching newspaper_collections record.
Example
>>> check_newspaper_collection_configuration()
set()
>>> unmatched: set[str] = check_newspaper_collection_configuration(
...     ["cat", "dog"])
<BLANKLINE>
...Warning: 2 `collections` not in `newspaper_collections`: ...
>>> unmatched == {'dog', 'cat'}
True
Note
Set orders are random, so the example above checks unmatched == {'dog', 'cat'} to ensure correctness irrespective of order.
Source code in alto2txt2fixture/utils.py
def check_newspaper_collection_configuration(\ncollections: Iterable[str] = settings.COLLECTIONS,\nnewspaper_collections: Iterable[FixtureDict] = NEWSPAPER_COLLECTION_METADATA,\ndata_provider_index: str = DATA_PROVIDER_INDEX,\n) -> set[str]:\n\"\"\"Check the names in `collections` match the names in `newspaper_collections`.\n Arguments:\n collections:\n Names of newspaper collections, defaults to ``settings.COLLECTIONS``\n newspaper_collections:\n Newspaper collections in a list of `FixtureDict` format. Defaults\n to ``settings.FIXTURE_TABLE['dataprovider]``\n data_provider_index:\n `dict` `fields` `key` used to check matchiching `collections` name\n Returns:\n A set of ``collections`` without a matching `newspaper_collections` record.\n Example:\n ```pycon\n >>> check_newspaper_collection_configuration()\n set()\n >>> unmatched: set[str] = check_newspaper_collection_configuration(\n ... [\"cat\", \"dog\"])\n <BLANKLINE>\n ...Warning: 2 `collections` not in `newspaper_collections`: ...\n >>> unmatched == {'dog', 'cat'}\n True\n ```\n !!! note\n Set orders are random so checking `unmatched == {'dog, 'cat'}` to\n ensure correctness irrespective of order in the example above.\n \"\"\"\nnewspaper_collection_names: tuple[str, ...] = tuple(\ndict_from_list_fixture_fields(\nnewspaper_collections, field_name=data_provider_index\n).keys()\n)\ncollection_diff: set[str] = set(collections) - set(newspaper_collection_names)\nif collection_diff:\nwarning(\nf\"{len(collection_diff)} `collections` \"\nf\"not in `newspaper_collections`: {collection_diff}\"\n)\nreturn collection_diff\n
Clears the cache directory by removing all .json files in it.
Parameters:
dir (str | Path): The path of the directory to be cleared. Required.
Source code in alto2txt2fixture/utils.py
def clear_cache(dir: str | Path) -> None:
    """
    Clears the cache directory by removing all `.json` files in it.

    Args:
        dir: The path of the directory to be cleared.
    """
    dir = get_path_from(dir)
    y = input(
        f"Do you want to erase the cache path now that the "
        f"files have been generated ({dir.absolute()})? [y/N]"
    )
    if y.lower() == "y":
        info("Clearing up the cache directory")
        for x in dir.glob("*.json"):
            x.unlink()
Compress exported fixture files using make_archive.
Parameters:
path (PathLike): Path to file to compress. Required.
output_path (PathLike | str): Compressed file name (without extension specified from format). Default: settings.OUTPUT
format (str | ArchiveFormatEnum): A str of one of the registered compression formats. By default Python provides zip, tar, gztar, bztar, and xztar. See the ArchiveFormatEnum variable for the options checked. Default: ZIP_FILE_EXTENSION
suffix (str): str to add to the compressed file name saved. For example: if path = plaintext_fixture-1.json and suffix = _compressed, then the saved file might be called plaintext_fixture-1_compressed.json.zip
create_lookup(lst: list = [], on: list = []) -> dict
Create a lookup dictionary from a list of dictionaries.
Parameters:
lst (list): A list of dictionaries that should be used to generate the lookup. Default: []
on (list): A list of keys from the dictionaries in the list that should be used as the keys in the lookup. Default: []
Returns:
dict: The generated lookup dictionary.
Source code in alto2txt2fixture/utils.py
def create_lookup(lst: list = [], on: list = []) -> dict:
    """
    Create a lookup dictionary from a list of dictionaries.

    Args:
        lst: A list of dictionaries that should be used to generate the lookup.
        on: A list of keys from the dictionaries in the list that should be used as the keys in the lookup.

    Returns:
        The generated lookup dictionary.
    """
    return {get_key(x, on): x["pk"] for x in lst}
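For instance, with a couple of made-up fixture records keyed on publication_code:

from alto2txt2fixture.utils import create_lookup

records = [
    {"pk": 1, "fields": {"publication_code": "0003040"}},
    {"pk": 2, "fields": {"publication_code": "0002647"}},
]
lookup = create_lookup(records, on=["publication_code"])
# {'0003040': 1, '0002647': 2}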
def dict_from_list_fixture_fields(\nfixture_list: Iterable[FixtureDict] = NEWSPAPER_COLLECTION_METADATA,\nfield_name: str = DATA_PROVIDER_INDEX,\n) -> dict[str, FixtureDict]:\n\"\"\"Create a `dict` from ``fixture_list`` with ``attr_name`` as `key`.\n Args:\n fixture_list: `list` of `FixtureDict` with ``attr_name`` key `fields`.\n field_name: key for values within ``fixture_list`` `fields`.\n Returns:\n A `dict` where extracted `field_name` is key for related `FixtureDict` values.\n Example:\n ```pycon\n >>> fixture_dict: dict[str, FixtureDict] = dict_from_list_fixture_fields()\n >>> fixture_dict['hmd']['pk']\n 2\n >>> fixture_dict['hmd']['fields'][DATA_PROVIDER_INDEX]\n 'hmd'\n >>> fixture_dict['hmd']['fields']['code']\n 'bl_hmd'\n ```\n \"\"\"\nreturn {record[\"fields\"][field_name]: record for record in fixture_list}\n
Saves fixtures generated by a generator to separate CSV files.
This function takes an Iterable or Generator of fixtures and saves them to separate CSV files. The fixtures are saved in batches, where each batch is determined by the max_elements_per_file parameter.
Parameters:
fixtures (Iterable[FixtureDict] | Generator[FixtureDict, None, None]): An Iterable or Generator of the fixtures to be saved. Required.
prefix (str): A string prefix to be added to the file names of the saved fixtures.
def fixtures_dict2csv(\nfixtures: Iterable[FixtureDict] | Generator[FixtureDict, None, None],\nprefix: str = \"\",\noutput_path: PathLike | str = settings.OUTPUT,\nindex: bool = False,\nmax_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,\nfile_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,\n) -> None:\n\"\"\"Saves fixtures generated by a generator to separate separate `CSV` files.\n This function takes an `Iterable` or `Generator` of fixtures and saves to\n separate `CSV` files. The fixtures are saved in batches, where each batch\n is determined by the ``max_elements_per_file`` parameter.\n Args:\n fixtures:\n An `Iterable` or `Generator` of the fixtures to be saved.\n prefix:\n A string prefix to be added to the file names of the\n saved fixtures.\n output_path:\n Path to folder fixtures are saved to\n max_elements_per_file:\n Maximum `JSON` records saved in each file\n file_name_0_padding:\n Zeros to prefix the number of each fixture file name.\n Returns:\n This function saves fixtures to files and does not return a value.\n Example:\n ```pycon\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> from pandas import read_csv\n >>> fixtures_dict2csv(NEWSPAPER_COLLECTION_METADATA,\n ... prefix='test', output_path=tmp_path)\n >>> imported_fixture = read_csv(tmp_path / 'test-000001.csv')\n >>> imported_fixture.iloc[1]['pk']\n 2\n >>> imported_fixture.iloc[1][DATA_PROVIDER_INDEX]\n 'hmd'\n ```\n \"\"\"\ninternal_counter: int = 1\ncounter: int = 1\nlst: list = []\nfile_name: str\ndf: DataFrame\nPath(output_path).mkdir(parents=True, exist_ok=True)\nfor item in fixtures:\nlst.append(fixture_fields(item, as_dict=True))\ninternal_counter += 1\nif internal_counter > max_elements_per_file:\ndf = DataFrame.from_records(lst)\nfile_name = f\"{prefix}-{str(counter).zfill(file_name_0_padding)}.csv\"\ndf.to_csv(Path(output_path) / file_name, index=index)\n# Save up some memory\ndel lst\ngc.collect()\n# Re-instantiate\nlst = []\ninternal_counter = 1\ncounter += 1\nelse:\ndf = DataFrame.from_records(lst)\nfile_name = f\"{prefix}-{str(counter).zfill(file_name_0_padding)}.csv\"\ndf.to_csv(Path(output_path) / file_name, index=index)\n
def free_hd_space_in_GB(\ndisk_usage_tuple: DiskUsageTuple | None = None, path: PathLike | None = None\n) -> float:\n\"\"\"Return remaing hard drive space estimate in gigabytes.\n Args:\n disk_usage_tuple:\n A `NamedTuple` normally returned from `disk_usage()` or `None`.\n path:\n A `path` to pass to `disk_usage` if `disk_usage_tuple` is `None`.\n Returns:\n A `float` from dividing the `disk_usage_tuple.free` value by `BYTES_PER_GIGABYTE`\n Example:\n ```pycon\n >>> space_in_gb = free_hd_space_in_GB()\n >>> space_in_gb > 1 # Hopefully true wherever run...\n True\n ```\n \"\"\"\nif not disk_usage_tuple:\nif not path:\npath = Path(getcwd())\ndisk_usage_tuple = disk_usage(path=path)\nassert disk_usage_tuple\nreturn disk_usage_tuple.free / BYTES_PER_GIGABYTE\n
def gen_fixture_tables(\nfixture_tables: dict[str, list[FixtureDict]] = {},\ninclude_fixture_pk_column: bool = True,\n) -> Generator[Table, None, None]:\n\"\"\"Generator of `rich.Table` instances from `FixtureDict` configuration tables.\n Args:\n fixture_tables: `dict` where `key` is for `Table` title and `value` is a `FixtureDict`\n include_fixture_pk_column: whether to include the `pk` field from `FixtureDict`\n Example:\n ```pycon\n >>> table_name: str = \"data_provider\"\n >>> tables = tuple(\n ... gen_fixture_tables(\n ... {table_name: NEWSPAPER_COLLECTION_METADATA}\n ... ))\n >>> len(tables)\n 1\n >>> assert tables[0].title == table_name\n >>> [column.header for column in tables[0].columns]\n ['pk', 'name', 'code', 'legacy_code', 'collection', 'source_note']\n ```\n \"\"\"\nfor name, fixture_records in fixture_tables.items():\nfixture_table: Table = Table(title=name)\nfor i, fixture_dict in enumerate(fixture_records):\nif i == 0:\n[\nfixture_table.add_column(name)\nfor name in fixture_fields(fixture_dict, include_fixture_pk_column)\n]\nrow_values: tuple[str, ...] = tuple(\nstr(x) for x in (fixture_dict[\"pk\"], *fixture_dict[\"fields\"].values())\n)\nfixture_table.add_row(*row_values)\nyield fixture_table\n
This function takes in a Path object path and returns a list of lists of zipfiles sorted and chunked according to certain conditions defined in the settings object (see settings.CHUNK_THRESHOLD).
Note: the function will also skip zip files of a certain file size, which can be specified in the settings object (see settings.SKIP_FILE_SIZE).
Parameters:
path (Path): The input path where the zipfiles are located. Required.
Returns:
list: A list of lists of zipfiles; each inner list represents a chunk of zipfiles.
Source code in alto2txt2fixture/utils.py
def get_chunked_zipfiles(path: Path) -> list:\n\"\"\"This function takes in a `Path` object `path` and returns a list of lists\n of `zipfiles` sorted and chunked according to certain conditions defined\n in the `settings` object (see `settings.CHUNK_THRESHOLD`).\n Note: the function will also skip zip files of a certain file size, which\n can be specified in the `settings` object (see `settings.SKIP_FILE_SIZE`).\n Args:\n path: The input path where the zipfiles are located\n Returns:\n A list of lists of `zipfiles`, each inner list represents a chunk of\n zipfiles.\n \"\"\"\nzipfiles = sorted(\npath.glob(\"*.zip\"),\nkey=lambda x: x.stat().st_size,\nreverse=settings.START_WITH_LARGEST,\n)\nzipfiles = [x for x in zipfiles if x.stat().st_size <= settings.SKIP_FILE_SIZE]\nif len(zipfiles) > settings.CHUNK_THRESHOLD:\nchunks = array_split(zipfiles, len(zipfiles) / settings.CHUNK_THRESHOLD)\nelse:\nchunks = [zipfiles]\nreturn chunks\n
Get a string key from a dictionary using values from specified keys.
Parameters:
x (dict): A dictionary from which the key is generated. Default: dict()
on (list): A list of keys from the dictionary that should be used to generate the key. Default: []
Returns:
str: The generated string key.
Source code in alto2txt2fixture/utils.py
def get_key(x: dict = dict(), on: list = []) -> str:
    """
    Get a string key from a dictionary using values from specified keys.

    Args:
        x: A dictionary from which the key is generated.
        on: A list of keys from the dictionary that should be used to
            generate the key.

    Returns:
        The generated string key.
    """
    return f"{'-'.join([str(x['fields'][y]) for y in on])}"
Provides the path to any given lockfile, which controls whether any existing files should be overwritten or not.
Parameters:
collection (str): Collection folder name. Required.
kind (NewspaperElements): Either newspaper, issue or item. Required.
dic (dict): A dictionary with required information for either kind passed. Required.
Returns:
Path: Path to the resulting lockfile
Source code in alto2txt2fixture/utils.py
def get_lockfile(collection: str, kind: NewspaperElements, dic: dict) -> Path:\n\"\"\"\n Provides the path to any given lockfile, which controls whether any\n existing files should be overwritten or not.\n Args:\n collection: Collection folder name\n kind: Either `newspaper` or `issue` or `item`\n dic: A dictionary with required information for either `kind` passed\n Returns:\n Path to the resulting lockfile\n \"\"\"\np: Path\nbase = Path(f\"cache-lockfiles/{collection}\")\nif kind == \"newspaper\":\np = base / f\"newspapers/{dic['publication_code']}\"\nelif kind == \"issue\":\np = base / f\"issues/{dic['publication__publication_code']}/{dic['issue_code']}\"\nelif kind == \"item\":\ntry:\nif dic.get(\"issue_code\"):\np = base / f\"items/{dic['issue_code']}/{dic['item_code']}\"\nelif dic.get(\"issue__issue_identifier\"):\np = base / f\"items/{dic['issue__issue_identifier']}/{dic['item_code']}\"\nexcept KeyError:\nerror(\"An unknown error occurred (in get_lockfile)\")\nelse:\np = base / \"lockfile\"\np.parent.mkdir(parents=True, exist_ok=True) if settings.WRITE_LOCKFILES else None\nreturn p\n
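For example (the publication code is made up), a newspaper lockfile path is built from the collection and the record's publication_code; the parent directory is only created when settings.WRITE_LOCKFILES is enabled:

from alto2txt2fixture.utils import get_lockfile

lockfile = get_lockfile(
    collection="hmd",
    kind="newspaper",
    dic={"publication_code": "0002647"},
)
# Path('cache-lockfiles/hmd/newspapers/0002647')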
Return datetime.now() as either a string or datetime object.
Parameters:
as_str (bool): Whether to return the time as a str or not. Default: False
Returns:
datetime.datetime | str: datetime.now() in the pytz.UTC time zone as a string if as_str, else as a datetime.datetime object.
Source code in alto2txt2fixture/utils.py
def get_now(as_str: bool = False) -> datetime.datetime | str:
    """
    Return `datetime.now()` as either a string or `datetime` object.

    Args:
        as_str: Whether to return `now` `time` as a `str` or not, default: `False`

    Returns:
        `datetime.now()` in `pytz.UTC` time zone as a string if `as_str`, else
        as a `datetime.datetime` object.
    """
    now = datetime.datetime.now(tz=pytz.UTC)
    if as_str:
        return str(now)
    else:
        assert isinstance(now, datetime.datetime)
        return now
Converts an input value into a Path object if it's not already one.
Parameters:
p (str | Path): The input value, which can be a string or a Path object. Required.
Returns:
Path: The input value as a Path object.
Source code in alto2txt2fixture/utils.py
def get_path_from(p: str | Path) -> Path:
    """
    Converts an input value into a Path object if it's not already one.

    Args:
        p: The input value, which can be a string or a Path object.

    Returns:
        The input value as a Path object.
    """
    if isinstance(p, str):
        p = Path(p)
    if not isinstance(p, Path):
        raise RuntimeError(f"Unable to handle type: {type(p)}")
    return p
Whether to return the file size as the total number of bytes or a human-readable MB/GB amount. Default: False
Returns:
str | float: A str followed by MB or GB for the size if not raw, otherwise a float.
Source code in alto2txt2fixture/utils.py
def get_size_from_path(p: str | Path, raw: bool = False) -> str | float:\n\"\"\"\n Returns a nice string for any given file size.\n Args:\n p: Path to read the size from\n raw: Whether to return the file size as total number of bytes or\n a human-readable MB/GB amount\n Returns:\n Return `str` followed by `MB` or `GB` for size if not `raw` otherwise `float`.\n \"\"\"\np = get_path_from(p)\nbytes = p.stat().st_size\nif raw:\nreturn bytes\nrel_size: float | int | str = round(bytes / 1000 / 1000 / 1000, 1)\nassert not isinstance(rel_size, str)\nif rel_size < 0.5:\nrel_size = round(bytes / 1000 / 1000, 1)\nrel_size = f\"{rel_size}MB\"\nelse:\nrel_size = f\"{rel_size}GB\"\nreturn rel_size\n
Return an ordered glob, filtering out any unwanted .DS_Store files from macOS.
Parameters:
p (str): Path to a directory to filter. Required.
Returns:
list: Sorted list of files contained in the provided path, without the ones whose names start with a .
Source code in alto2txt2fixture/utils.py
def glob_filter(p: str) -> list:
    """
    Return ordered glob, filtered out any pesky, unwanted .DS_Store from macOS.

    Args:
        p: Path to a directory to filter

    Returns:
        Sorted list of files contained in the provided path without the ones
        whose names start with a `.`
    """
    return sorted([x for x in get_path_from(p).glob("*") if not x.name.startswith(".")])
Return an OrderedDict of replacement 0-padded file names from path.
Parameters:
path (PathLike): PathLike to source files to rename. Required.
output_path (PathLike | None): PathLike to save renamed files to. Default: None
glob_regex_str (str): str to match files to rename within path. Default: '*'
padding (int | None): How many digits (0s) to pad match_int with. Default: 0
match_int_regex (str): Regular expression for matching numbers in s to pad. Only rename parts of Path(file_path).name; else replace across Path(file_path).parents as well. Default: PADDING_0_REGEX_DEFAULT
index (int): Which index of number in s to pad with 0s. Like numbering a list, 0 indicates the first match and -1 indicates the last match. Default: -1
Example
>>> tmp_path: Path = getfixture('tmp_path')
>>> for i in range(4):
...     (tmp_path / f'test_file-{i}.txt').touch(exist_ok=True)
>>> pprint(sorted(tmp_path.iterdir()))
[...Path('...test_file-0.txt'),
 ...Path('...test_file-1.txt'),
 ...Path('...test_file-2.txt'),
 ...Path('...test_file-3.txt')]
>>> pprint(glob_path_rename_by_0_padding(tmp_path))
{...Path('...test_file-0.txt'): ...Path('...test_file-00.txt'),
 ...Path('...test_file-1.txt'): ...Path('...test_file-01.txt'),
 ...Path('...test_file-2.txt'): ...Path('...test_file-02.txt'),
 ...Path('...test_file-3.txt'): ...Path('...test_file-03.txt')}
Source code in alto2txt2fixture/utils.py
def glob_path_rename_by_0_padding(\npath: PathLike,\noutput_path: PathLike | None = None,\nglob_regex_str: str = \"*\",\npadding: int | None = 0,\nmatch_int_regex: str = PADDING_0_REGEX_DEFAULT,\nindex: int = -1,\n) -> dict[PathLike, PathLike]:\n\"\"\"Return an `OrderedDict` of replacement 0-padded file names from `path`.\n Params:\n path:\n `PathLike` to source files to rename.\n output_path:\n `PathLike` to save renamed files to.\n glob_regex_str:\n `str` to match files to rename within `path`.\n padding:\n How many digits (0s) to pad `match_int` with.\n match_int_regex:\n Regular expression for matching numbers in `s` to pad.\n Only rename parts of `Path(file_path).name`; else\n replace across `Path(file_path).parents` as well.\n index:\n Which index of number in `s` to pad with 0s.\n Like numbering a `list`, 0 indicates the first match\n and -1 indicates the last match.\n Example:\n ```pycon\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> for i in range(4):\n ... (tmp_path / f'test_file-{i}.txt').touch(exist_ok=True)\n >>> pprint(sorted(tmp_path.iterdir()))\n [...Path('...test_file-0.txt'),\n ...Path('...test_file-1.txt'),\n ...Path('...test_file-2.txt'),\n ...Path('...test_file-3.txt')]\n >>> pprint(glob_path_rename_by_0_padding(tmp_path))\n {...Path('...test_file-0.txt'): ...Path('...test_file-00.txt'),\n ...Path('...test_file-1.txt'): ...Path('...test_file-01.txt'),\n ...Path('...test_file-2.txt'): ...Path('...test_file-02.txt'),\n ...Path('...test_file-3.txt'): ...Path('...test_file-03.txt')}\n ```\n \"\"\"\ntry:\nassert Path(path).exists()\nexcept AssertionError:\nraise ValueError(f'path does not exist: \"{Path(path)}\"')\npaths_tuple: tuple[PathLike, ...] = path_globs_to_tuple(path, glob_regex_str)\ntry:\nassert paths_tuple\nexcept AssertionError:\nraise FileNotFoundError(\nf\"No files found matching 'glob_regex_str': \"\nf\"'{glob_regex_str}' in: '{path}'\"\n)\npaths_to_index: tuple[tuple[str, int], ...] = tuple(\nint_from_str(str(matched_path), index=index, regex=match_int_regex)\nfor matched_path in paths_tuple\n)\nmax_index: int = max(index[1] for index in paths_to_index)\nmax_index_digits: int = len(str(max_index))\nif not padding or padding < max_index_digits:\npadding = max_index_digits + 1\nnew_names_dict: dict[PathLike, PathLike] = {}\nif output_path:\nif not Path(output_path).is_absolute():\noutput_path = Path(path) / output_path\nlogger.debug(f\"Specified '{output_path}' for saving file copies\")\nfor i, old_path in enumerate(paths_tuple):\nmatch_str, match_int = paths_to_index[i]\nnew_names_dict[old_path] = rename_by_0_padding(\nold_path, match_str=str(match_str), match_int=match_int, padding=padding\n)\nif output_path:\nnew_names_dict[old_path] = (\nPath(output_path) / Path(new_names_dict[old_path]).name\n)\nreturn new_names_dict\n
def int_from_str(
    s: str,
    index: int = -1,
    regex: str = PADDING_0_REGEX_DEFAULT,
) -> tuple[str, int]:
    """Return matched (or None) `regex` from `s` by index `index`.

    Params:
        s:
            `str` to match and via `regex`.
        index:
            Which index of number in `s` to pad with 0s.
            Like numbering a `list`, 0 indicates the first match
            and -1 indicates the last match.
        regex:
            Regular expression for matching numbers in `s` to pad.

    Example:
        ```pycon
        >>> int_from_str('a/path/to/fixture-03-05.txt')
        ('05', 5)
        >>> int_from_str('a/path/to/fixture-03-05.txt', index=0)
        ('03', 3)
        ```
    """
    matches: list[str] = [match for match in findall(regex, s) if match]
    match_str: str = matches[index]
    return match_str, int(match_str)
list_json_files(
    p: str | Path,
    drill: bool = False,
    exclude_names: list = [],
    include_names: list = [],
) -> Generator[Path, None, None] | list[Path]
List json files under the path specified in p.
Parameters:
    p (str | Path): The path to search for json files. Required.
    drill (bool): A flag indicating whether to drill down the subdirectories or not. Default: False.
    exclude_names (list): A list of file names to exclude from the search result. Default: [].
    include_names (list): A list of file names to include in the search result. If provided, the exclude_names argument will be ignored. Default: [].

Returns:
    Generator[Path, None, None] | list[Path]: A list of Path objects pointing to the found json files.
Source code in alto2txt2fixture/utils.py
def list_json_files(
    p: str | Path,
    drill: bool = False,
    exclude_names: list = [],
    include_names: list = [],
) -> Generator[Path, None, None] | list[Path]:
    """
    List `json` files under the path specified in ``p``.

    Args:
        p: The path to search for `json` files
        drill: A flag indicating whether to drill down the subdirectories
            or not. Default is ``False``
        exclude_names: A list of file names to exclude from the search
            result. Default is an empty list
        include_names: A list of file names to include in search result.
            If provided, the ``exclude_names`` argument will be ignored.
            Default is an empty list

    Returns:
        A list of `Path` objects pointing to the found `json` files
    """
    q: str = "**/*.json" if drill else "*.json"
    files = get_path_from(p).glob(q)

    if exclude_names:
        files = list({x for x in files if x.name not in exclude_names})
    elif include_names:
        files = list({x for x in files if x.name in include_names})

    return sorted(files)
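A minimal sketch, assuming list_json_files is imported from alto2txt2fixture.utils; the folder layout is illustrative:

```python
from pathlib import Path
from tempfile import mkdtemp

from alto2txt2fixture.utils import list_json_files

root = Path(mkdtemp())
(root / "nested").mkdir()
(root / "a.json").touch()
(root / "nested" / "b.json").touch()

print(list_json_files(root))                   # only a.json at the top level
print(list_json_files(root, drill=True))       # a.json and nested/b.json
print(list_json_files(root, drill=True, exclude_names=["b.json"]))  # a.json only
```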
    crash (bool): Whether the program should crash if there is a json decode error. Default: False.

Returns:
    dict | list: The decoded json contents from the path, but an empty dictionary if the file cannot be decoded and crash is set to False.
Source code in alto2txt2fixture/utils.py
def load_json(p: str | Path, crash: bool = False) -> dict | list:
    """
    Easier access to reading `json` files.

    Args:
        p: Path to read `json` from
        crash: Whether the program should crash if there is a `json` decode
            error, default: ``False``

    Returns:
        The decoded `json` contents from the path, but an empty dictionary
        if the file cannot be decoded and ``crash`` is set to ``False``
    """
    p = get_path_from(p)

    try:
        return json.loads(p.read_text())
    except json.JSONDecodeError:
        msg = f"Error: {p.read_text()}"
        error(msg, crash=crash)

    return {}
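A minimal sketch, assuming load_json is imported from alto2txt2fixture.utils; the file contents are illustrative:

```python
import json
from pathlib import Path
from tempfile import mkdtemp

from alto2txt2fixture.utils import load_json

folder = Path(mkdtemp())
good = folder / "newspaper.json"
good.write_text(json.dumps({"publication_code": "0002194"}))
bad = folder / "broken.json"
bad.write_text("{not valid json")

print(load_json(good))   # {'publication_code': '0002194'}
print(load_json(bad))    # decode error is reported and {} returned (crash=False)
```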
Load multiple json files and return a list of their content.
Parameters:
    p (str | Path): The path to search for json files. Required.
    drill (bool): A flag indicating whether to drill down the subdirectories or not. Default: False.
    filter_na (bool): A flag indicating whether to filter out the content that is None. Default: True.
    crash (bool): A flag indicating whether to raise an exception when an error occurs while loading a json file. Default: False.

Returns:
    list: A list of the content of the loaded json files.
Source code in alto2txt2fixture/utils.py
def load_multiple_json(
    p: str | Path,
    drill: bool = False,
    filter_na: bool = True,
    crash: bool = False,
) -> list:
    """
    Load multiple `json` files and return a list of their content.

    Args:
        p: The path to search for `json` files
        drill: A flag indicating whether to drill down the subdirectories
            or not. Default is `False`
        filter_na: A flag indicating whether to filter out the content that
            is `None`. Default is `True`.
        crash: A flag indicating whether to raise an exception when an
            error occurs while loading a `json` file. Default is `False`.

    Returns:
        A `list` of the content of the loaded `json` files.
    """
    files = list_json_files(p, drill=drill)
    content = [load_json(x, crash=crash) for x in files]
    return [x for x in content if x] if filter_na else content
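A minimal sketch, assuming load_multiple_json is imported from alto2txt2fixture.utils; the files are illustrative:

```python
import json
from pathlib import Path
from tempfile import mkdtemp

from alto2txt2fixture.utils import load_multiple_json

folder = Path(mkdtemp())
(folder / "first.json").write_text(json.dumps({"pk": 1}))
(folder / "second.json").write_text(json.dumps({"pk": 2}))

print(load_multiple_json(folder))   # [{'pk': 1}, {'pk': 2}]
```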
Writes a '.' to a lockfile, after making sure the parent directory exists.
Parameters:
    lockfile (Path): The path to the lock file to be created. Required.

Returns:
    None
Source code in alto2txt2fixture/utils.py
def lock(lockfile: Path) -> None:
    """
    Writes a '.' to a lockfile, after making sure the parent directory exists.

    Args:
        lockfile: The path to the lock file to be created

    Returns:
        None
    """
    lockfile.parent.mkdir(parents=True, exist_ok=True)
    lockfile.write_text("")
    return
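A minimal sketch, assuming lock is imported from alto2txt2fixture.utils; the lockfile path is illustrative:

```python
from pathlib import Path
from tempfile import mkdtemp

from alto2txt2fixture.utils import lock

lockfile = Path(mkdtemp()) / "cache" / "newspapers.lockfile"
lock(lockfile)            # creates the parent folder and the (empty) lock file
print(lockfile.exists())  # True
```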
def rename_by_0_padding(\nfile_path: PathLike,\nmatch_str: str | None = None,\nmatch_int: int | None = None,\npadding: int = FILE_NAME_0_PADDING_DEFAULT,\nreplace_count: int = 1,\nexclude_parents: bool = True,\nreverse_int_match: bool = False,\n) -> Path:\n\"\"\"Return `file_path` with `0` `padding` `Path` change.\n Params:\n file_path:\n `PathLike` to rename.\n match_str:\n `str` to match and replace with padded `match_int`\n match_int:\n `int` to pad and replace `match_str`\n padding:\n How many digits (0s) to pad `match_int` with.\n exclude_parents:\n Only rename parts of `Path(file_path).name`; else\n replace across `Path(file_path).parents` as well.\n reverse_int_match:\n Whether to match from the end of the `file_path`.\n Example:\n ```pycon\n >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',\n ... match_str='05', match_int=5)\n <BLANKLINE>\n ...Path('a/path/to/3/fixture-03-000005.txt')...\n >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',\n ... match_str='03')\n <BLANKLINE>\n ...Path('a/path/to/3/fixture-000003-05.txt')...\n >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',\n ... match_str='05', padding=0)\n <BLANKLINE>\n ...Path('a/path/to/3/fixture-03-5.txt')...\n >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',\n ... match_int=3)\n <BLANKLINE>\n ...Path('a/path/to/3/fixture-0000003-05.txt')...\n >>> rename_by_0_padding('a/path/to/3/f-03-05-0003.txt',\n ... match_int=3, padding=2,\n ... exclude_parents=False)\n <BLANKLINE>\n ...Path('a/path/to/03/f-03-05-0003.txt')...\n >>> rename_by_0_padding('a/path/to/3/f-03-05-0003.txt',\n ... match_int=3, padding=2,\n ... exclude_parents=False,\n ... replace_count=3, )\n <BLANKLINE>\n ...Path('a/path/to/03/f-003-05-00003.txt')...\n ```\n \"\"\"\nif match_int is None and match_str in (None, \"\"):\nraise ValueError(f\"At least `match_int` or `match_str` required; both None.\")\nelif match_str and not match_int:\nmatch_int = int(match_str)\nelif match_int is not None and not match_str:\nassert str(match_int) in str(file_path)\nmatch_str = int_from_str(\nstr(file_path),\nindex=-1 if reverse_int_match else 0,\n)[0]\nassert match_int is not None and match_str is not None\nif exclude_parents:\nreturn Path(file_path).parent / Path(file_path).name.replace(\nmatch_str, str(match_int).zfill(padding), replace_count\n)\nelse:\nreturn Path(\nstr(file_path).replace(\nmatch_str, str(match_int).zfill(padding), replace_count\n)\n)\n
save_fixture(
    generator: Sequence | Generator = [],
    prefix: str = "",
    output_path: PathLike | str = settings.OUTPUT,
    max_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,
    add_created: bool = True,
    json_indent: int = JSON_INDENT,
    file_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,
) -> None
Saves fixtures generated by a generator to separate JSON files.
This function takes a generator and saves the generated fixtures to separate JSON files. The fixtures are saved in batches, where each batch is determined by the max_elements_per_file parameter.
Parameters:
    generator (Sequence | Generator): A generator that yields the fixtures to be saved. Default: [].
    prefix (str): A string prefix to be added to the file names of the saved fixtures. Default: ''.
    output_path (PathLike | str): Path to folder fixtures are saved to. Default: settings.OUTPUT.
    max_elements_per_file (int): Maximum JSON records saved in each file. Default: settings.MAX_ELEMENTS_PER_FILE.
    add_created (bool): Whether to add created_at and updated_at timestamps. Default: True.
    json_indent (int): Number of indent spaces per line in saved JSON. Default: JSON_INDENT.
    file_name_0_padding (int): Zeros to prefix the number of each fixture file name. Default: FILE_NAME_0_PADDING_DEFAULT.

Returns:
    None: This function saves the fixtures to files but does not return any value.
def save_fixture(\ngenerator: Sequence | Generator = [],\nprefix: str = \"\",\noutput_path: PathLike | str = settings.OUTPUT,\nmax_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,\nadd_created: bool = True,\njson_indent: int = JSON_INDENT,\nfile_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,\n) -> None:\n\"\"\"Saves fixtures generated by a generator to separate JSON files.\n This function takes a generator and saves the generated fixtures to\n separate JSON files. The fixtures are saved in batches, where each batch\n is determined by the ``max_elements_per_file`` parameter.\n Args:\n generator:\n A generator that yields the fixtures to be saved.\n prefix:\n A string prefix to be added to the file names of the\n saved fixtures.\n output_path:\n Path to folder fixtures are saved to\n max_elements_per_file:\n Maximum `JSON` records saved in each file\n add_created:\n Whether to add `created_at` and `updated_at` `timestamps`\n json_indent:\n Number of indent spaces per line in saved `JSON`\n file_name_0_padding:\n Zeros to prefix the number of each fixture file name.\n Returns:\n This function saves the fixtures to files but does not return\n any value.\n Example:\n ```pycon\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> save_fixture(NEWSPAPER_COLLECTION_METADATA,\n ... prefix='test', output_path=tmp_path)\n >>> imported_fixture = load_json(tmp_path / 'test-000001.json')\n >>> imported_fixture[1]['pk']\n 2\n >>> imported_fixture[1]['fields'][DATA_PROVIDER_INDEX]\n 'hmd'\n >>> 'created_at' in imported_fixture[1]['fields']\n True\n ```\n \"\"\"\ninternal_counter = 1\ncounter = 1\nlst = []\nfile_name: str\nPath(output_path).mkdir(parents=True, exist_ok=True)\nfor item in generator:\nlst.append(item)\ninternal_counter += 1\nif internal_counter > max_elements_per_file:\nfile_name = f\"{prefix}-{str(counter).zfill(file_name_0_padding)}.json\"\nwrite_json(\np=Path(f\"{output_path}/file_name\"),\no=lst,\nadd_created=add_created,\njson_indent=json_indent,\n)\n# Save up some memory\ndel lst\ngc.collect()\n# Re-instantiate\nlst = []\ninternal_counter = 1\ncounter += 1\nelse:\nfile_name = f\"{prefix}-{str(counter).zfill(file_name_0_padding)}.json\"\nwrite_json(\np=Path(f\"{output_path}/{file_name}\"),\no=lst,\nadd_created=add_created,\njson_indent=json_indent,\n)\nreturn\n
def truncate_path_str(\npath: PathLike,\nmax_length: int = MAX_TRUNCATE_PATH_STR_LEN,\nfolder_filler_str: str = INTERMEDIATE_PATH_TRUNCATION_STR,\nhead_parts: int = TRUNC_HEADS_PATH_DEFAULT,\ntail_parts: int = TRUNC_TAILS_PATH_DEFAULT,\npath_sep: str = sep,\n_force_type: Type[Path] | Type[PureWindowsPath] = Path,\n) -> str:\n\"\"\"If `len(text) > max_length` return `text` followed by `trail_str`.\n Args:\n path:\n `PathLike` object to truncate\n max_length:\n maximum length of `path` to allow, anything belond truncated\n folder_filler_str:\n what to fill intermediate path names with\n head_parts:\n how many parts of `path` from the root to keep.\n These must be `int` >= 0\n tail_parts:\n how many parts from the `path` tail the root to keep.\n These must be `int` >= 0\n path_sep:\n what `str` to replace `path` parts with if over `max_length`\n Returns:\n `text` truncated to `max_length` (if longer than `max_length`),\n with with `folder_filler_str` for intermediate folder names\n Note:\n For errors running on windows see:\n [#56](https://github.com/Living-with-machines/alto2txt2fixture/issues/56)\n Example:\n ```pycon\n >>> love_shadows: Path = (\n ... Path('Standing') / 'in' / 'the' / 'shadows'/ 'of' / 'love.')\n >>> truncate_path_str(love_shadows)\n 'Standing...love.'\n >>> truncate_path_str(love_shadows, max_length=100)\n 'Standing...in...the...shadows...of...love.'\n >>> truncate_path_str(love_shadows, folder_filler_str=\"*\")\n 'Standing...*...*...*...*...love.'\n >>> root_love_shadows: Path = Path(sep) / love_shadows\n >>> truncate_path_str(root_love_shadows, folder_filler_str=\"*\")\n <BLANKLINE>\n ...\n '...Standing...*...*...*...*...love.'\n >>> if is_platform_win:\n ... pytest.skip('fails on certain Windows root paths: issue #56')\n >>> truncate_path_str(root_love_shadows,\n ... folder_filler_str=\"*\", tail_parts=3)\n <BLANKLINE>\n ...\n '...Standing...*...*...shadows...of...love.'...\n ```\n \"\"\"\npath = _force_type(normpath(path))\nif len(str(path)) > max_length:\ntry:\nassert not (head_parts < 0 or tail_parts < 0)\nexcept AssertionError:\nlogger.error(\nf\"Both index params for `truncate_path_str` must be >=0: \"\nf\"(head_parts={head_parts}, tail_parts={tail_parts})\"\n)\nreturn str(path)\noriginal_path_parts: tuple[str, ...] = path.parts\nhead_index_fix: int = 0\nif path.is_absolute() or path.drive:\nhead_index_fix += 1\nfor part in original_path_parts[head_parts + head_index_fix :]:\nif not part:\nhead_index_fix += 1\nelse:\nbreak\nlogger.debug(\nf\"Adding {head_index_fix} to `head_parts`: {head_parts} \"\nf\"to truncate: '{path}'\"\n)\nhead_parts += head_index_fix\ntry:\nassert head_parts + tail_parts < len(str(original_path_parts))\nexcept AssertionError:\nlogger.error(\nf\"Returning untruncated. Params \"\nf\"(head_parts={head_parts}, tail_parts={tail_parts}) \"\nf\"not valid to truncate: '{path}'\"\n)\nreturn str(path)\ntail_index: int = len(original_path_parts) - tail_parts\nreplaced_path_parts: tuple[str, ...] = tuple(\npart if (i < head_parts or i >= tail_index) else folder_filler_str\nfor i, part in enumerate(original_path_parts)\n)\nreplaced_start_str: str = \"\".join(replaced_path_parts[:head_parts])\nreplaced_end_str: str = path_sep.join(\npath for path in replaced_path_parts[head_parts:]\n)\nreturn path_sep.join((replaced_start_str, replaced_end_str))\nelse:\nreturn str(path)\n
def valid_compression_files(files: Sequence[PathLike]) -> list[PathLike]:
    """Return a `tuple` of valid compression paths in `files`.

    Args:
        files:
            `Sequence` of files to filter compression types from.

    Returns:
        A list of files that could be decompressed.

    Example:
        ```pycon
        >>> valid_compression_files([
        ...     'cat.tar.bz2', 'dog.tar.bz3', 'fish.tgz', 'bird.zip',
        ...     'giraffe.txt', 'frog'
        ... ])
        ['cat.tar.bz2', 'fish.tgz', 'bird.zip']
        ```
    """
    return [
        file
        for file in files
        if "".join(Path(file).suffixes) in VALID_COMPRESSION_FORMATS
    ]
Easier access to writing json files. Checks whether parent exists.
Parameters:
    p (str | Path): Path to write json to. Required.
    o (dict): Object to write to json file. Required.
    add_created (bool): If set to True will add created_at and updated_at to the dictionary's fields. If created_at and updated_at already exist in the fields, they will be forcefully updated.
def write_json(\np: str | Path, o: dict, add_created: bool = True, json_indent: int = JSON_INDENT\n) -> None:\n\"\"\"\n Easier access to writing `json` files. Checks whether parent exists.\n Args:\n p: Path to write `json` to\n o: Object to write to `json` file\n add_created:\n If set to True will add `created_at` and `updated_at`\n to the dictionary's fields. If `created_at` and `updated_at`\n already exist in the fields, they will be forcefully updated.\n json_indent:\n What indetation format to write out `JSON` file in\n Returns:\n None\n Example:\n ```pycon\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> path: Path = tmp_path / 'test-write-json-example.json'\n >>> write_json(p=path,\n ... o=NEWSPAPER_COLLECTION_METADATA,\n ... add_created=True)\n >>> imported_fixture = load_json(path)\n >>> imported_fixture[1]['pk']\n 2\n >>> imported_fixture[1]['fields'][DATA_PROVIDER_INDEX]\n 'hmd'\n ```\n `\n \"\"\"\np = get_path_from(p)\nif not (isinstance(o, dict) or isinstance(o, list)):\nraise RuntimeError(f\"Unable to handle data of type: {type(o)}\")\ndef _append_created_fields(o: dict):\n\"\"\"Add `created_at` and `updated_at` fields to a `dict` with `FixtureDict` values.\"\"\"\nreturn dict(\n**{k: v for k, v in o.items() if not k == \"fields\"},\nfields=dict(\n**{\nk: v\nfor k, v in o[\"fields\"].items()\nif not k == \"created_at\" and not k == \"updated_at\"\n},\n**{\"created_at\": NOW_str, \"updated_at\": NOW_str},\n),\n)\ntry:\nif add_created and isinstance(o, dict):\no = _append_created_fields(o)\nelif add_created and isinstance(o, list):\no = [_append_created_fields(x) for x in o]\nexcept KeyError:\nerror(\"An unknown error occurred (in write_json)\")\np.parent.mkdir(parents=True, exist_ok=True)\np.write_text(json.dumps(o, indent=json_indent))\nreturn\n
The installation process should be fairly easy to take care of, using poetry:
$ poetry install
However, this is only the first step in the process. As the script works through the alto2txt collections, you will either need to choose the slower option, mounting them to your computer (using blobfuse), or the faster option, downloading the required zip files from the Azure storage to your local hard drive. In the two following sections, both of those options are described.
"},{"location":"tutorial/first-steps.html#connecting-alto2txt-to-the-program","title":"Connecting alto2txt to the program","text":""},{"location":"tutorial/first-steps.html#downloading-local-copies-of-alto2txt-on-your-computer","title":"Downloading local copies of alto2txt on your computer","text":"
This option will take up a lot of hard drive space
As of the time of writing, downloading all of alto2txt's metadata takes up about 185GB on your local drive.
You do not have to download all of the collections or all of the zip files for each collection, as long as you are aware that the resulting fixtures will be limited in scope.
"},{"location":"tutorial/first-steps.html#step-1-log-in-to-azure-using-microsoft-azure-storage-explorer","title":"Step 1: Log in to Azure using Microsoft Azure Storage Explorer","text":"
Microsoft Azure Storage Explorer (MASE) is a great and free tool for downloading content off Azure. Your first step is to download and install this product on your local computer.
Once you have opened MASE, you will need to sign into the appropriate Azure account.
"},{"location":"tutorial/first-steps.html#step-2-download-the-alto2txt-blob-container-to-your-hard-drive","title":"Step 2: Download the alto2txt blob container to your hard drive","text":"
On your left-hand side, you should see a menu where you can navigate to the correct "blob container": Living with Machines > Storage Accounts > alto2txt > Blob Containers:
You will want to replicate the same structure as the Blob Container itself in a folder on your hard drive:
Once you have the structure set up, you are ready to download all of the files needed. For each of the blob containers, make sure that you download the metadata directory only onto your computer:
Select all of the files and press the download button:
Make sure you save all the zip files inside the correct local folder:
The "Activities" bar will now show you the progress and speed:
"},{"location":"tutorial/first-steps.html#mounting-alto2txt-on-your-computer","title":"Mounting alto2txt on your computer","text":"
This option will only work on a Linux or UNIX computer
If you have a mac, your only option is the one below.
Follow the instructions for installing BlobFuse and the instructions for how to prepare your drive for mounting.
"},{"location":"tutorial/first-steps.html#step-2-set-up-sas-tokens","title":"Step 2: Set up SAS tokens","text":"
Follow the instructions for setting up access to your Azure storage account.
"},{"location":"tutorial/first-steps.html#step-3-mount-your-blobs","title":"Step 3: Mount your blobs","text":"
TODO #3: Write this section.
Note that you can also search on the internet for ideas on how to create local scripts to facilitate easier connection next time.
"}]}
\ No newline at end of file
diff --git a/search/search_index.json b/search/search_index.json
new file mode 100644
index 0000000..12055a1
--- /dev/null
+++ b/search/search_index.json
@@ -0,0 +1 @@
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"index.html","title":"alto2txt2fixture","text":"
alto2txt2fixture is a standalone tool to convert alto2txtXML output and other related datasets into JSON (and where feasible CSV) data with corresponding relational IDs to ease general use and ingestion into a relational database.
We target the the JSON produced for importing into lwmdb: a database built using the Djangopython webframework database fixture structure.
"},{"location":"index.html#installation-and-simple-use","title":"Installation and simple use","text":"
We provide a command line interface to process alto2txtXML files stored locally (or mounted via azureblobfuse), and for additional public data we automate a means of downloading those automatically.
To processing newspaper metadata with a local copy of alto2txtXML results, it's easiest to have that data in the same folder as your alto2txt2fixture checkout and poetry installed folder. One arranged, you should be able to begin the JSON converstion with
$ poetry run a2t2f-news\n
To generate related data in JSON and CSV form, assuming you have an internet collection and access to a living-with-machinesazure account, the following will download related data into JSON and CSV files. The JSON results should be consistent with lwmdb tables for ease of import.
$ poetry run a2t2f-adj\n
"},{"location":"running.html","title":"Running the Program","text":""},{"location":"running.html#using-poetry-to-run","title":"Using poetry to run","text":"
The program should run automatically with the following command:
$ poetry run a2t2f-news
Alternatively, if you want to add optional parameters and don't want to use the standard poetry script to run, you can use the (somewhat convoluted) poetry run alto2txt2fixture/run.py and provide any optional parameters. You can see a list of all the "Optional parameters" below. For example, if you want to only include the hmd collection:
$ poetry run alto2txt2fixture/run.py --collections hmd
"},{"location":"running.html#alternative-run-the-script-without-poetry","title":"Alternative: Run the script without poetry","text":"
If you find yourself in trouble with poetry, the program should run perfectly fine on its own, assuming the dependencies are installed. The same command, then, would be:
See the list under [tool.poetry.dependencies] in pyproject.toml for a list of dependencies that would need to be installed for alto2txt2fixture to work outside a python poetry environment.
The program has a number of optional parameters that you can choose to include or not. The table below describes each parameter, how to pass it to the program, and what its defaults are.
| Flag | Description | Default value |
|------|-------------|---------------|
| -c, --collections | Which collections to process in the mounted alto2txt directory | hmd, lwm, jisc, bna |
| -o, --output | Into which directory should the processed files be put? | ./output/fixtures/ |
| -m, --mountpoint | Where are the alto2txt directories mounted? | ./input/alto2txt/ |
| -t, --test-config | Print the config table but do not run | False |

Successfully running the program: An example

Understanding the Results

The resulting file structure
The examples below follow standard settings
If you choose other settings for when you run the program, your output directory may look different from the information on this page.
Reports are automatically generated with a unique hash as the overarching folder structure. Inside the reports directory, you'll find a JSON file for each alto2txt directory (organised by NLP identifier).
The report structure, thus, looks like this:
The JSON file has some good troubleshooting information. You'll find that the contents are structured as a Python dictionary (or JavaScript Object). Here is an example:
Here is an explanation of each of the keys in the dictionary:
| Key | Explanation | Data type |
|-----|-------------|-----------|
| path | The input path for the zip file that is being converted. | string |
| bytes | The size of the input zip file represented in bytes. | integer |
| size | The size of the input zip file represented in a human-readable string. | string |
| contents | #TODO #3 | integer |
| start | Date and time when processing started (see also end below). | datestring |
| newspaper_paths | #TODO #3 | list (string) |
| publication_codes | A list of the NLPs that are contained in the input zip file. | list (string) |
| issue_paths | A list of all the issue paths that are contained in the cache directory. | list (string) |
| item_paths | A list of all the item paths that are contained in the cache directory. | list (string) |
| end | Date and time when processing ended (see also start above). | datestring |
| seconds | Seconds that the script spent interpreting the zip file (should be added to the microseconds below). | integer |
| microseconds | Microseconds that the script spent interpreting the zip file (should be added to the seconds above). | integer |

Fixtures
The most important output of the script is contained in the fixtures directory. This directory contains JSON files for all the different columns in the corresponding Django metadata database (i.e. DataProvider, Digitisation, Ingest, Issue, Newspaper, and Item). The numbering at the end of each file indicates the order of the files as they are divided into a maximum of 2e6 elements*:
Each JSON file contains a Python-like list (JavaScript Array) of dictionaries (JavaScript Objects), which have a primary key (pk), the related database model (in the example below the Django newspapers app's newspaper table), and a nested dictionary/Object which contains all the values for the database's table entry:
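For illustration, a hypothetical fixture entry of that shape; the pk, model and fields keys follow the description above, while the field names and values are invented for the example:

```python
[
    {
        "pk": 1,
        "model": "newspapers.newspaper",
        "fields": {
            "publication_code": "0002194",
            "title": "An Example Newspaper Title",
            "created_at": "2023-08-01T00:00:00+00:00",
            "updated_at": "2023-08-01T00:00:00+00:00",
        },
    },
]
```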
* The maximum elements per file can be adjusted in the settings.py file's settings object's MAX_ELEMENTS_PER_FILE value.
This constructs an ArgumentParser instance to configure calls of run() for newspaper XML to JSON conversion.

Parameters:
    argv (list[str] | None): If None, treat as equivalent of ['--help']; if a list of str, pass those options to ArgumentParser. Default: None.

Returns:
    Namespace: A Namespace dict-like configuration for run()
Source code in alto2txt2fixture/__main__.py
def parse_args(argv: list[str] | None = None) -> Namespace:\n\"\"\"Manage command line arguments for `run()`\n This constructs an `ArgumentParser` instance to manage\n configurating calls of `run()` to manage `newspaper`\n `XML` to `JSON` converstion.\n Arguments:\n argv:\n If `None` treat as equivalent of ['--help`],\n if a `list` of `str` pass those options to `ArgumentParser`\n Returns:\n A `Namespace` `dict`-like configuration for `run()`\n \"\"\"\nargv = None if not argv else argv\nparser = ArgumentParser(\nprog=\"a2t2f-news\",\ndescription=\"Process alto2txt XML into and Django JSON Fixture files\",\nepilog=(\n\"Note: this is still in beta mode and contributions welcome\\n\\n\" + __doc__\n),\nformatter_class=RawTextHelpFormatter,\n)\nparser.add_argument(\n\"-c\",\n\"--collections\",\nnargs=\"+\",\nhelp=\"<Optional> Set collections\",\nrequired=False,\n)\nparser.add_argument(\n\"-m\",\n\"--mountpoint\",\ntype=str,\nhelp=\"<Optional> Mountpoint\",\nrequired=False,\n)\nparser.add_argument(\n\"-o\",\n\"--output\",\ntype=str,\nhelp=\"<Optional> Set an output directory\",\nrequired=False,\n)\nparser.add_argument(\n\"-t\",\n\"--test-config\",\ndefault=False,\nhelp=\"Only print the configuration\",\naction=BooleanOptionalAction,\n)\nparser.add_argument(\n\"-f\",\n\"--show-fixture-tables\",\ndefault=True,\nhelp=\"Print included fixture table configurations\",\naction=BooleanOptionalAction,\n)\nparser.add_argument(\n\"--export-fixture-tables\",\ndefault=True,\nhelp=\"Experimental: export fixture tables prior to data processing\",\naction=BooleanOptionalAction,\n)\nparser.add_argument(\n\"--data-provider-field\",\ntype=str,\ndefault=DATA_PROVIDER_INDEX,\nhelp=\"Key for indexing DataProvider records\",\n)\nreturn parser.parse_args(argv)\n
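A short sketch of inspecting the parsed configuration, assuming parse_args is imported from alto2txt2fixture.__main__ (the module named above):

```python
from alto2txt2fixture.__main__ import parse_args

args = parse_args(["--collections", "hmd", "--test-config"])
print(args.collections)   # ['hmd']
print(args.test_config)   # True
print(args.output)        # None, so run() falls back to the settings module
```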
First parse_args is called for command line arguments including:
collections
output
mountpoint
If any of these arguments are specified, they will be used, otherwise they will default to the values in the settings module.
The show_setup function is then called to display the configurations being used.
The route function is then called to route the alto2txt files into subdirectories with structured files.
The parse function is then called to parse the resulting JSON files.
Finally, the clear_cache function is called to clear the cache (pending the user's confirmation).
Parameters:
    local_args (list[str] | None): Options passed to parse_args(). Default: None.

Returns:
    None
Source code in alto2txt2fixture/__main__.py
def run(local_args: list[str] | None = None) -> None:\n\"\"\"Manage running newspaper `XML` to `JSON` conversion.\n First `parse_args` is called for command line arguments including:\n - `collections`\n - `output`\n - `mountpoint`\n If any of these arguments are specified, they will be used, otherwise they\n will default to the values in the `settings` module.\n The `show_setup` function is then called to display the configurations\n being used.\n The `route` function is then called to route the alto2txt files into\n subdirectories with structured files.\n The `parse` function is then called to parse the resulting JSON files.\n Finally, the `clear_cache` function is called to clear the cache\n (pending the user's confirmation).\n Arguments:\n local_args: Options passed to `parse_args()`\n Returns:\n None\n \"\"\"\nargs: Namespace = parse_args(argv=local_args)\nif args.collections:\nCOLLECTIONS = [x.lower() for x in args.collections]\nelse:\nCOLLECTIONS = settings.COLLECTIONS\nif args.output:\nOUTPUT = args.output.rstrip(\"/\")\nelse:\nOUTPUT = settings.OUTPUT\nif args.mountpoint:\nMOUNTPOINT = args.mountpoint.rstrip(\"/\")\nelse:\nMOUNTPOINT = settings.MOUNTPOINT\nshow_setup(\nCOLLECTIONS=COLLECTIONS,\nOUTPUT=OUTPUT,\nCACHE_HOME=settings.CACHE_HOME,\nMOUNTPOINT=MOUNTPOINT,\nJISC_PAPERS_CSV=settings.JISC_PAPERS_CSV,\nREPORT_DIR=settings.REPORT_DIR,\nMAX_ELEMENTS_PER_FILE=settings.MAX_ELEMENTS_PER_FILE,\n)\nif args.show_fixture_tables:\n# Show a table of fixtures used, defaults to DataProvider Table\nshow_fixture_tables(settings, data_provider_index=args.data_provider_field)\nif args.export_fixture_tables:\nexport_fixtures(\nfixture_tables=settings.FIXTURE_TABLES,\npath=OUTPUT,\nformats=settings.FIXTURE_TABLES_FORMATS,\n)\nif not args.test_config:\n# Routing alto2txt into subdirectories with structured files\nroute(\nCOLLECTIONS,\nsettings.CACHE_HOME,\nMOUNTPOINT,\nsettings.JISC_PAPERS_CSV,\nsettings.REPORT_DIR,\n)\n# Parsing the resulting JSON files\nparse(\nCOLLECTIONS,\nsettings.CACHE_HOME,\nOUTPUT,\nsettings.MAX_ELEMENTS_PER_FILE,\n)\nclear_cache(settings.CACHE_HOME)\n
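A hedged sketch of a dry run from Python, assuming run is imported from alto2txt2fixture.__main__: with --test-config (and fixture-table export switched off) the configuration tables are shown, but routing, parsing and cache clearing are skipped.

```python
from alto2txt2fixture.__main__ import run

# Only print the configuration; skip exporting, routing, parsing and cache clearing.
run(["--collections", "hmd", "--test-config", "--no-export-fixture-tables"])
```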
Parameters:
    paths_dict (dict[os.PathLike, os.PathLike]): Original and renumbered paths dict. Required.
    compress_format (ArchiveFormatEnum): Which ArchiveFormatEnum for compression. Default: COMPRESSION_TYPE_DEFAULT.
    title (str): Title of returned Table. Default: FILE_RENAME_TABLE_TITLE_DEFAULT.
    prefix (str): str to add in front of every new path. Default: ''.
    renumber (bool): Whether an int in each path will be renumbered. Default: True.

Source code in alto2txt2fixture/cli.py
def file_rename_table(\npaths_dict: dict[os.PathLike, os.PathLike],\ncompress_format: ArchiveFormatEnum = COMPRESSION_TYPE_DEFAULT,\ntitle: str = FILE_RENAME_TABLE_TITLE_DEFAULT,\nprefix: str = \"\",\nrenumber: bool = True,\n) -> Table:\n\"\"\"Create a `rich.Table` of rename configuration.\n Args:\n paths_dict: dict[os.PathLike, os.PathLike],\n Original and renumbered `paths` `dict`\n compress_format:\n Which `ArchiveFormatEnum` for compression\n title:\n Title of returned `Table`\n prefix:\n `str` to add in front of every new path\n renumber:\n Whether an `int` in each path will be renumbered.\n \"\"\"\ntable: Table = Table(title=title)\ntable.add_column(\"Current File Name\", justify=\"right\", style=\"cyan\")\ntable.add_column(\"New File Name\", style=\"magenta\")\ndef final_file_name(name: os.PathLike) -> str:\nreturn (\nprefix\n+ str(Path(name).name)\n+ (f\".{compress_format}\" if compress_format else \"\")\n)\nfor old_path, new_path in paths_dict.items():\nname: str = final_file_name(new_path if renumber else old_path)\ntable.add_row(Path(old_path).name, name)\nreturn table\n
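A hypothetical sketch of rendering such a table, assuming file_rename_table is imported from alto2txt2fixture.cli and printed with rich; the paths are invented for the example:

```python
from pathlib import Path

from rich.console import Console

from alto2txt2fixture.cli import file_rename_table

paths_dict = {
    Path("plaintext/issue-1.txt"): Path("plaintext/issue-01.txt"),
    Path("plaintext/issue-2.txt"): Path("plaintext/issue-02.txt"),
}
Console().print(file_rename_table(paths_dict, prefix="renamed-"))
```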
Generate a rich Table from a func signature and help attr.

Parameters:
    func (Callable): Function whose args and type hints will be converted to a table. Required.
    values (dict): dict of variables covered in func signature. locals() often suffices. Required.
    title (str): str for table title. Default: ''.
    extra_dict (dict[str, Any]): A dict of additional rows to add to the table. For each key, value pair: if the value is a tuple, it will be expanded to match the Type, Value, and Notes columns; else the Type will be inferred and Notes left blank. Default: {}.

Example:
>>> def test_func(\n... var_a: Annotated[str, typer.Option(help=\"Example\")] = \"Default\"\n... ) -> None:\n... test_func_table: Table = func_table(test_func, values=vars())\n... console.print(test_func_table)\n>>> if is_platform_win:\n... pytest.skip('fails on certain Windows root paths: issue #56')\n>>> test_func()\n test_func config\n\u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n\u2503 Variable \u2503 Type \u2503 Value \u2503 Notes \u2503\n\u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529\n\u2502 var_a \u2502 str \u2502 Default \u2502 Example \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
Source code in alto2txt2fixture/cli.py
def func_table(\nfunc: Callable, values: dict, title: str = \"\", extra_dict: dict[str, Any] = {}\n) -> Table:\n\"\"\"Geneate `rich` `Table` from `func` signature and `help` attr.\n Args:\n func:\n Function whose `args` and `type` hints will be converted\n to a table.\n values:\n `dict` of variables covered in `func` signature.\n `local()` often suffices.\n title:\n `str` for table title.\n extra_dict:\n A `dict` of additional rows to add to the table. For each\n `key`, `value` pair: if the `value` is a `tuple`, it will\n be expanded to match the `Type`, `Value`, and `Notes`\n columns; else the `Type` will be inferred and `Notes`\n left blank.\n Example:\n ```pycon\n >>> def test_func(\n ... var_a: Annotated[str, typer.Option(help=\"Example\")] = \"Default\"\n ... ) -> None:\n ... test_func_table: Table = func_table(test_func, values=vars())\n ... console.print(test_func_table)\n >>> if is_platform_win:\n ... pytest.skip('fails on certain Windows root paths: issue #56')\n >>> test_func()\n test_func config\n \u250f\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2533\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2513\n \u2503 Variable \u2503 Type \u2503 Value \u2503 Notes \u2503\n \u2521\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2547\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2501\u2529\n \u2502 var_a \u2502 str \u2502 Default \u2502 Example \u2502\n \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n ```\n \"\"\"\ntitle = title if title else f\"{func.__name__} config\"\nfunc_signature: dict = get_type_hints(func, include_extras=True)\ntable: Table = Table(title=title)\ntable.add_column(\"Variable\", justify=\"right\", style=\"cyan\")\ntable.add_column(\"Type\", style=\"yellow\")\ntable.add_column(\"Value\", style=\"magenta\")\ntable.add_column(\"Notes\")\nfor var, info in func_signature.items():\ntry:\nvar_type, annotation = get_args(info)\nvalue: Any = values[var]\nif value in (\"\", \"\"):\nvalue = \"''\"\ntable.add_row(str(var), var_type.__name__, str(value), annotation.help)\nexcept ValueError:\ncontinue\nfor key, val in extra_dict.items():\nif isinstance(val, tuple):\ntable.add_row(key, *val)\nelse:\ntable.add_row(key, type(val).__name__, str(val))\nreturn table\n
plaintext(\npath: Annotated[Path, typer.Argument(help=\"Path to raw plaintext files\")],\nsave_path: Annotated[\nPath, typer.Option(help=\"Path to save json export files\")\n] = Path(DEFAULT_PLAINTEXT_FIXTURE_OUTPUT),\ndata_provider_code: Annotated[\nstr, typer.Option(help=\"Data provider code use existing config\")\n] = \"\",\nextract_path: Annotated[\nPath, typer.Option(help=\"Folder to extract compressed raw plaintext to\")\n] = Path(DEFAULT_EXTRACTED_SUBDIR),\ninitial_pk: Annotated[\nint,\ntyper.Option(help=\"First primary key to increment json export from\"),\n] = DEFAULT_INITIAL_PK,\nrecords_per_json: Annotated[\nint, typer.Option(help=\"Max records per json fixture\")\n] = DEFAULT_MAX_PLAINTEXT_PER_FIXTURE_FILE,\ndigit_padding: Annotated[\nint,\ntyper.Option(help=\"Padding '0's for indexing json fixture filenames\"),\n] = FILE_NAME_0_PADDING_DEFAULT,\ncompress: Annotated[\nbool, typer.Option(help=\"Compress json fixtures\")\n] = False,\ncompress_path: Annotated[\nPath, typer.Option(help=\"Folder to compress json fixtueres to\")\n] = Path(COMPRESSED_PATH_DEFAULT),\ncompress_format: Annotated[\nArchiveFormatEnum,\ntyper.Option(case_sensitive=False, help=\"Compression format\"),\n] = COMPRESSION_TYPE_DEFAULT,\n) -> None\n
Create a PlainTextFixture and save to save_path.
Source code in alto2txt2fixture/cli.py
@cli.command()\ndef plaintext(\npath: Annotated[Path, typer.Argument(help=\"Path to raw plaintext files\")],\nsave_path: Annotated[\nPath, typer.Option(help=\"Path to save json export files\")\n] = Path(DEFAULT_PLAINTEXT_FIXTURE_OUTPUT),\ndata_provider_code: Annotated[\nstr, typer.Option(help=\"Data provider code use existing config\")\n] = \"\",\nextract_path: Annotated[\nPath, typer.Option(help=\"Folder to extract compressed raw plaintext to\")\n] = Path(DEFAULT_EXTRACTED_SUBDIR),\ninitial_pk: Annotated[\nint, typer.Option(help=\"First primary key to increment json export from\")\n] = DEFAULT_INITIAL_PK,\nrecords_per_json: Annotated[\nint, typer.Option(help=\"Max records per json fixture\")\n] = DEFAULT_MAX_PLAINTEXT_PER_FIXTURE_FILE,\ndigit_padding: Annotated[\nint, typer.Option(help=\"Padding '0's for indexing json fixture filenames\")\n] = FILE_NAME_0_PADDING_DEFAULT,\ncompress: Annotated[bool, typer.Option(help=\"Compress json fixtures\")] = False,\ncompress_path: Annotated[\nPath, typer.Option(help=\"Folder to compress json fixtueres to\")\n] = Path(COMPRESSED_PATH_DEFAULT),\ncompress_format: Annotated[\nArchiveFormatEnum,\ntyper.Option(case_sensitive=False, help=\"Compression format\"),\n] = COMPRESSION_TYPE_DEFAULT,\n) -> None:\n\"\"\"Create a PlainTextFixture and save to `save_path`.\"\"\"\nplaintext_fixture = PlainTextFixture(\npath=path,\ndata_provider_code=data_provider_code,\nextract_subdir=extract_path,\nexport_directory=save_path,\ninitial_pk=initial_pk,\nmax_plaintext_per_fixture_file=records_per_json,\njson_0_file_name_padding=digit_padding,\njson_export_compression_format=compress_format,\njson_export_compression_subdir=compress_path,\n)\nplaintext_fixture.info()\nwhile (\nnot plaintext_fixture.compressed_files\nand not plaintext_fixture.plaintext_provided_uncompressed\n):\ntry_another_compressed_txt_source: bool = Confirm.ask(\nf\"No .txt files available from extract path: \"\nf\"{plaintext_fixture.trunc_extract_path_str}\\n\"\nf\"Would you like to extract fixtures from a different path?\",\ndefault=\"n\",\n)\nif try_another_compressed_txt_source:\nnew_extract_path: str = Prompt.ask(\"Please enter a new extract path\")\nplaintext_fixture.path = Path(new_extract_path)\nelse:\nreturn\nplaintext_fixture.info()\nplaintext_fixture.extract_compressed()\nplaintext_fixture.export_to_json_fixtures()\nif compress:\nplaintext_fixture.compress_json_exports()\n
It is possible for the example test to fail in different screen sizes. Try increasing the window or screen width of terminal used to check before raising an issue.
Source code in alto2txt2fixture/cli.py
def show_fixture_tables(\nrun_settings: dotdict = settings,\nprint_in_call: bool = True,\ndata_provider_index: str = DATA_PROVIDER_INDEX,\n) -> list[Table]:\n\"\"\"Print fixture tables specified in ``settings.fixture_tables`` in `rich.Table` format.\n Arguments:\n run_settings: `alto2txt2fixture` run configuration\n print_in_call: whether to print to console (will use ``console`` variable if so)\n data_provider_index: key to index `dataprovider` from ``NEWSPAPER_COLLECTION_METADATA``\n Returns:\n A `list` of `rich.Table` renders from configurations in ``run_settings.FIXTURE_TABLES``\n Example:\n ```pycon\n >>> fixture_tables: list[Table] = show_fixture_tables(\n ... settings,\n ... print_in_call=False)\n >>> len(fixture_tables)\n 1\n >>> fixture_tables[0].title\n 'dataprovider'\n >>> [column.header for column in fixture_tables[0].columns]\n ['pk', 'name', 'code', 'legacy_code', 'collection', 'source_note']\n >>> fixture_tables = show_fixture_tables(settings)\n <BLANKLINE>\n ...dataprovider...Heritage...\u2502 bl_hmd...\u2502 hmd...\n ```\n Note:\n It is possible for the example test to fail in different screen sizes. Try\n increasing the window or screen width of terminal used to check before\n raising an issue.\n \"\"\"\nif run_settings.FIXTURE_TABLES:\nif \"dataprovider\" in run_settings.FIXTURE_TABLES:\ncheck_newspaper_collection_configuration(\nrun_settings.COLLECTIONS,\nrun_settings.FIXTURE_TABLES[\"dataprovider\"],\ndata_provider_index=data_provider_index,\n)\nconsole_tables: list[Table] = list(\ngen_fixture_tables(run_settings.FIXTURE_TABLES)\n)\nif print_in_call:\nfor console_table in console_tables:\nconsole.print(console_table)\nreturn console_tables\nelse:\nreturn []\n
Returns a list with corrected data from a provided dictionary.
Source code in alto2txt2fixture/create_adjacent_tables.py
def correct_dict(o: dict) -> list:
    """Returns a list with corrected data from a provided dictionary."""
    return [(k, v[0], v[1]) for k, v in o.items() if not v[0].startswith("Q")] + [
        (k, v[1], v[0]) for k, v in o.items() if v[0].startswith("Q")
    ]
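A small sketch of the behaviour, assuming correct_dict is imported from alto2txt2fixture.create_adjacent_tables: pairs whose first value is a Wikidata-style "Q" identifier have their two values swapped in the output.

```python
from alto2txt2fixture.create_adjacent_tables import correct_dict

example = {
    "london": ("Q84", "London"),
    "oxford": ("Oxford", "Q34217"),
}
print(correct_dict(example))
# [('oxford', 'Oxford', 'Q34217'), ('london', 'London', 'Q84')]
```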
Source code in alto2txt2fixture/create_adjacent_tables.py
def download_data(\nfiles_dict: RemoteDataFilesType = {},\noverwrite: bool = OVERWRITE,\nexclude: list[str] = [],\n) -> None:\n\"\"\"Download files in ``files_dict``, overwrite if specified.\n Args:\n files_dict: `dict` of related files to download\n overwrite: `bool` to overwrite ``LOCAL_CACHE`` files or not\n exclude: `list` of files to exclude from ``files_dict``\n Example:\n ```pycon\n >>> from os import chdir\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> set_path: Path = chdir(tmp_path)\n >>> download_data(exclude=[\"mitchells\", \"Newspaper-1\", \"linking\"])\n Excluding mitchells...\n Excluding Newspaper-1...\n Excluding linking...\n Downloading cache...dict_admin_counties.json\n 100% ... 37/37 bytes\n Downloading cache...dict_countries.json\n 100% ... 33.2/33.2 kB\n Downloading cache...dict_historic_counties.json\n 100% ... 41.4/41.4 kB\n Downloading cache...nlp_loc_wikidata_concat.csv\n 100% ... 59.8/59.8 kB\n Downloading cache...wikidata_gazetteer_selected_columns.csv\n 100% ... 47.8/47.8 MB\n ```\n \"\"\"\nif not files_dict:\nfiles_dict = deepcopy(FILES)\nfor data_source in exclude:\nif data_source in files_dict:\nprint(f\"Excluding {data_source}...\")\nfiles_dict.pop(data_source, 0)\nelse:\nlogger.warning(\nf'\"{data_source}\" not an option to exclude from {files_dict}'\n)\n# Describe whether local file exists\nfor k in files_dict.keys():\nfiles_dict[k][\"exists\"] = files_dict[k][\"local\"].exists()\nfiles_to_download = [\n(v[\"remote\"], v[\"local\"], v[\"exists\"])\nfor v in files_dict.values()\nif \"exists\" in v and not v[\"exists\"] or overwrite\n]\nfor url, out, exists in files_to_download:\nrmtree(Path(out), ignore_errors=True) if exists else None\nprint(f\"Downloading {out}\")\nPath(out).parent.mkdir(parents=True, exist_ok=True)\nassert isinstance(url, str)\nwith urlopen(url) as response, open(out, \"wb\") as out_file:\ntotal: int = int(response.info()[\"Content-length\"])\nwith Progress(\n\"[progress.percentage]{task.percentage:>3.0f}%\",\nBarColumn(), # removed bar_width=None to avoid too long when resized\nDownloadColumn(),\n) as progress:\ndownload_task = progress.add_task(\"Download\", total=total)\nfor chunk in response:\nout_file.write(chunk)\nprogress.update(download_task, advance=len(chunk))\n
Get a list from a string which contains <SEP> as separator. If no string is encountered, the function returns an empty list.

Source code in alto2txt2fixture/create_adjacent_tables.py
def get_list(x):
    """Get a list from a string, which contains <SEP> as separator. If no
    string is encountered, the function returns an empty list."""
    return x.split("<SEP>") if isinstance(x, str) else []
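A minimal sketch, assuming get_list is imported from alto2txt2fixture.create_adjacent_tables; the values are illustrative:

```python
from alto2txt2fixture.create_adjacent_tables import get_list

print(get_list("Yorkshire<SEP>Lancashire"))  # ['Yorkshire', 'Lancashire']
print(get_list(None))                        # [] for any non-string input
```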
Source code in alto2txt2fixture/create_adjacent_tables.py
def get_outpaths_dict(names: Sequence[str], module_name: str) -> TableOutputConfigType:\n\"\"\"Return a `dict` of `csv` and `json` paths for each `module_name` table.\n The `csv` and `json` paths\n Args:\n names: iterable of names of each `module_name`'s component. Main target is `csv` and `json` table names\n module_name: name of module each name is part of, that is added as a prefix\n Returns:\n A ``TableOutputConfigType``: a `dict` of table ``names`` and output\n `csv` and `json` filenames.\n Example:\n ```pycon\n >>> pprint(get_outpaths_dict(MITCHELLS_TABELS, \"mitchells\"))\n {'Entry': {'csv': 'mitchells.Entry.csv', 'json': 'mitchells.Entry.json'},\n 'Issue': {'csv': 'mitchells.Issue.csv', 'json': 'mitchells.Issue.json'},\n 'PoliticalLeaning': {'csv': 'mitchells.PoliticalLeaning.csv',\n 'json': 'mitchells.PoliticalLeaning.json'},\n 'Price': {'csv': 'mitchells.Price.csv', 'json': 'mitchells.Price.json'}}\n ```\n \"\"\"\nreturn {\nname: OutputPathDict(\ncsv=f\"{module_name}.{name}.csv\",\njson=f\"{module_name}.{name}.json\",\n)\nfor name in names\n}\n
Takes an input_sub_path, a publication_code, and an (optional) abbreviation for any newspaper to locate the title in the jisc_papers DataFrame. jisc_papers is usually loaded via the setup_jisc_papers function.

Parameters:
    title (str): target newspaper title. Required.
    issue_date (str): target newspaper issue_date. Required.
    jisc_papers (pd.DataFrame): DataFrame of jisc_papers to match. Required.
    input_sub_path (str): path of files used to narrow down the query. Required.
    publication_code (str): unique codes to match newspaper records. Required.
    abbr (str | None): an optional abbreviation of the newspaper title. Default: None.

Returns:
    str: The matched title str or abbr: a string estimating the JISC equivalent newspaper title.
Source code in alto2txt2fixture/jisc.py
def get_jisc_title(\ntitle: str,\nissue_date: str,\njisc_papers: pd.DataFrame,\ninput_sub_path: str,\npublication_code: str,\nabbr: str | None = None,\n) -> str:\n\"\"\"\n Match a newspaper ``title`` with ``jisc_papers`` records.\n Takes an ``input_sub_path``, a ``publication_code``, and an (optional)\n abbreviation for any newspaper to locate the ``title`` in the\n ``jisc_papers`` `DataFrame`. ``jisc_papers`` is usually loaded via the\n ``setup_jisc_papers`` function.\n Args:\n title: target newspaper title\n issue_date: target newspaper issue_date\n jisc_papers: `DataFrame` of `jisc_papers` to match\n input_sub_path: path of files to narrow down query input_sub_path\n publication_code: unique codes to match newspaper records\n abbr: an optional abbreviation of the newspaper title\n Returns:\n Matched ``title`` `str` or ``abbr``.\n Returns:\n A string estimating the JISC equivalent newspaper title\n \"\"\"\n# First option, search the input_sub_path for a valid-looking publication_code\ng = PUBLICATION_CODE.findall(input_sub_path)\nif len(g) == 1:\npublication_code = g[0]\n# Let's see if we can find title:\ntitle = (\njisc_papers[\njisc_papers.publication_code == publication_code\n].title.to_list()[0]\nif jisc_papers[\njisc_papers.publication_code == publication_code\n].title.count()\n== 1\nelse title\n)\nreturn title\n# Second option, look through JISC papers for best match (on publication_code if we have it, but abbr more importantly if we have it)\nif abbr:\n_publication_code = publication_code\npublication_code = abbr\nif jisc_papers.abbr[jisc_papers.abbr == publication_code].count():\ndate = datetime.strptime(issue_date, \"%Y-%m-%d\")\nmask = (\n(jisc_papers.abbr == publication_code)\n& (date >= jisc_papers.start_date)\n& (date <= jisc_papers.end_date)\n)\nfiltered = jisc_papers.loc[mask]\nif filtered.publication_code.count() == 1:\npublication_code = filtered.publication_code.to_list()[0]\ntitle = filtered.title.to_list()[0]\nreturn title\n# Last option: let's find all the possible titles in the jisc_papers for the abbreviation, and if it's just one unique title, let's pick it!\nif abbr:\ntest = list({x for x in jisc_papers[jisc_papers.abbr == abbr].title})\nif len(test) == 1:\nreturn test[0]\nelse:\nmask1 = (jisc_papers.abbr == publication_code) & (\njisc_papers.publication_code == _publication_code\n)\ntest1 = jisc_papers.loc[mask1]\ntest1 = list({x for x in jisc_papers[jisc_papers.abbr == abbr].title})\nif len(test) == 1:\nreturn test1[0]\n# Fallback: if abbreviation is set, we'll return that:\nif abbr:\n# For these exceptions, see issue comment:\n# https://github.com/alan-turing-institute/Living-with-Machines/issues/2453#issuecomment-1050652587\nif abbr == \"IPJL\":\nreturn \"Ipswich Journal\"\nelif abbr == \"BHCH\":\nreturn \"Bath Chronicle\"\nelif abbr == \"LSIR\":\nreturn \"Leeds Intelligencer\"\nelif abbr == \"AGER\":\nreturn \"Lancaster Gazetter, And General Advertiser For Lancashire West\"\nreturn abbr\nraise RuntimeError(f\"Title {title} could not be found.\")\n
fixtures(
    filelist: list = [],
    model: str = "",
    translate: dict = {},
    rename: dict = {},
    uniq_keys: list = [],
) -> Generator[FixtureDict, None, None]
Generates fixtures for a specified model using a list of files.
This function takes a list of files and generates fixtures for a specified model. The fixtures can be used to populate a database or perform other data-related operations.
Parameters:
    filelist (list): A list of files to process and generate fixtures from. Default: [].
    model (str): The name of the model for which fixtures are generated. Default: ''.
    translate (dict): A nested dictionary representing the translation mapping for fields. The structure of the translator follows the format {'part1': {'part2': {'translated_field': 'pk'}}}. The translated fields will be used as keys, and their corresponding primary keys (obtained from the provided files) will be used as values in the generated fixtures. Default: {}.
    rename (dict): A nested dictionary representing the field renaming mapping. The structure of the dictionary follows the format {'part1': {'part2': 'new_field_name'}}. The fields specified in the dictionary will be renamed to the provided new field names in the generated fixtures. Default: {}.
    uniq_keys (list): A list of fields that need to be considered for uniqueness in the fixtures. If specified, the fixtures will yield only unique items based on the combination of these fields. Default: [].

Yields:
    FixtureDict: FixtureDict from model, pk and dict of fields.

Returns:
    Generator[FixtureDict, None, None]: This function generates fixtures but does not return any value.
Source code in alto2txt2fixture/parser.py
def fixtures(\nfilelist: list = [],\nmodel: str = \"\",\ntranslate: dict = {},\nrename: dict = {},\nuniq_keys: list = [],\n) -> Generator[FixtureDict, None, None]:\n\"\"\"\n Generates fixtures for a specified model using a list of files.\n This function takes a list of files and generates fixtures for a specified\n model. The fixtures can be used to populate a database or perform other\n data-related operations.\n Args:\n filelist: A list of files to process and generate fixtures from.\n model: The name of the model for which fixtures are generated.\n translate: A nested dictionary representing the translation mapping\n for fields. The structure of the translator follows the format:\n ```python\n {\n 'part1': {\n 'part2': {\n 'translated_field': 'pk'\n }\n }\n }\n ```\n The translated fields will be used as keys, and their\n corresponding primary keys (obtained from the provided files) will\n be used as values in the generated fixtures.\n rename: A nested dictionary representing the field renaming\n mapping. The structure of the dictionary follows the format:\n ```python\n {\n 'part1': {\n 'part2': 'new_field_name'\n }\n }\n ```\n The fields specified in the dictionary will be renamed to the\n provided new field names in the generated fixtures.\n uniq_keys: A list of fields that need to be considered for\n uniqueness in the fixtures. If specified, the fixtures will yield\n only unique items based on the combination of these fields.\n Yields:\n `FixtureDict` from ``model``, ``pk`` and `dict` of ``fields``.\n Returns:\n This function generates fixtures but does not return any value.\n \"\"\"\nfilelist = sorted(filelist, key=lambda x: str(x).split(\"/\")[:-1])\ncount = len(filelist)\n# Process JSONL\nif [x for x in filelist if \".jsonl\" in x.name]:\npk = 0\n# In the future, we might want to show progress here (tqdm or suchlike)\nfor file in filelist:\nfor line in file.read_text().splitlines():\npk += 1\nline = json.loads(line)\nyield FixtureDict(\npk=pk,\nmodel=model,\nfields=dict(**get_fields(line, translate=translate, rename=rename)),\n)\nreturn\nelse:\n# Process JSON\npks = [x for x in range(1, count + 1)]\nif len(uniq_keys):\nuniq_files = list(uniq(filelist, uniq_keys))\ncount = len(uniq_files)\nzipped = zip(uniq_files, pks)\nelse:\nzipped = zip(filelist, pks)\nfor x in tqdm(\nzipped, total=count, desc=f\"{model} ({count:,} objs)\", leave=False\n):\nyield FixtureDict(\npk=x[1],\nmodel=model,\nfields=dict(**get_fields(x[0], translate=translate, rename=rename)),\n)\nreturn\n
Retrieves fields from a file and performs modifications and checks.
This function takes a file (in various formats: Path, str, or dict) and processes its fields. It retrieves the fields from the file and performs modifications, translations, and checks on the fields.
Parameters:

file (Union[Path, str, dict], required): The file from which the fields are retrieved.
translate (dict, default {}): A nested dictionary representing the translation mapping for fields. The structure of the translator follows the format:

    {
        'part1': {
            'part2': {
                'translated_field': 'pk'
            }
        }
    }

    The translated fields will be used to replace the original fields in the retrieved fields.
rename (dict, default {}): A nested dictionary representing the field renaming mapping. The structure of the dictionary follows the format:

    {
        'part1': {
            'part2': 'new_field_name'
        }
    }

    The fields specified in the dictionary will be renamed to the provided new field names in the retrieved fields.
allow_null (bool, default False): Determines whether to allow None values for relational fields. If set to True, relational fields with missing values will be assigned None. If set to False, an error will be raised.

Returns:
    dict: A dictionary representing the retrieved fields from the file, with modifications and checks applied.

Raises:
    RuntimeError: If the file type is unsupported or if an error occurs during field retrieval or processing.
Source code in alto2txt2fixture/parser.py
def get_fields(\nfile: Union[Path, str, dict],\ntranslate: dict = {},\nrename: dict = {},\nallow_null: bool = False,\n) -> dict:\n\"\"\"\n Retrieves fields from a file and performs modifications and checks.\n This function takes a file (in various formats: `Path`, `str`, or `dict`)\n and processes its fields. It retrieves the fields from the file and\n performs modifications, translations, and checks on the fields.\n Args:\n file: The file from which the fields are retrieved.\n translate: A nested dictionary representing the translation mapping\n for fields. The structure of the translator follows the format:\n ```python\n {\n 'part1': {\n 'part2': {\n 'translated_field': 'pk'\n }\n }\n }\n ```\n The translated fields will be used to replace the original fields\n in the retrieved fields.\n rename: A nested dictionary representing the field renaming\n mapping. The structure of the dictionary follows the format:\n ```python\n {\n 'part1': {\n 'part2': 'new_field_name'\n }\n }\n ```\n The fields specified in the dictionary will be renamed to the\n provided new field names in the retrieved fields.\n allow_null: Determines whether to allow ``None`` values for\n relational fields. If set to ``True``, relational fields with\n missing values will be assigned ``None``. If set to ``False``, an\n error will be raised.\n Returns:\n A dictionary representing the retrieved fields from the file,\n with modifications and checks applied.\n Raises:\n RuntimeError: If the file type is unsupported or if an error occurs\n during field retrieval or processing.\n \"\"\"\nif isinstance(file, Path):\ntry:\nfields = json.loads(file.read_text())\nexcept Exception as e:\nraise RuntimeError(f\"Cannot interpret JSON ({e}): {file}\")\nelif isinstance(file, str):\nif \"\\n\" in file:\nraise RuntimeError(\"File has multiple lines.\")\ntry:\nfields = json.loads(file)\nexcept json.decoder.JSONDecodeError as e:\nraise RuntimeError(f\"Cannot interpret JSON ({e}): {file}\")\nelif isinstance(file, dict):\nfields = file\nelse:\nraise RuntimeError(f\"Cannot process type {type(file)}.\")\n# Fix relational fields for any file\nfor key in [key for key in fields.keys() if \"__\" in key]:\nparts = key.split(\"__\")\ntry:\nbefore = fields[key]\nif before:\nbefore = before.replace(\"---\", \"/\")\nloc = translate.get(parts[0], {}).get(parts[1], {})\nfields[key] = loc.get(before)\nif fields[key] is None:\nraise RuntimeError(\nf\"Cannot translate fields.{key} from {before}: {loc}\"\n)\nexcept AttributeError:\nif allow_null:\nfields[key] = None\nelse:\nprint(\n\"Content had relational fields, but something went wrong in parsing the data:\"\n)\nprint(\"file\", file)\nprint(\"fields\", fields)\nprint(\"KEY:\", key)\nraise RuntimeError()\nnew_name = rename.get(parts[0], {}).get(parts[1], None)\nif new_name:\nfields[new_name] = fields[key]\ndel fields[key]\nfields[\"created_at\"] = NOW_str\nfields[\"updated_at\"] = NOW_str\ntry:\nfields[\"item_type\"] = str(fields[\"item_type\"]).upper()\nexcept KeyError:\npass\ntry:\nif fields[\"ocr_quality_mean\"] == \"\":\nfields[\"ocr_quality_mean\"] = 0\nexcept KeyError:\npass\ntry:\nif fields[\"ocr_quality_sd\"] == \"\":\nfields[\"ocr_quality_sd\"] = 0\nexcept KeyError:\npass\nreturn fields\n
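A small, hedged example of calling `get_fields` with an in-memory `dict` (the field names and codes below are invented for illustration); passing a `dict` skips the JSON reading step:

```python
# Illustrative only: the field values are invented, and the `translate`/`rename`
# mappings mirror the nested {'part1': {'part2': ...}} structure described above.
from alto2txt2fixture.parser import get_fields

record = {"title": "Example Gazette", "publication__publication_code": "0001234"}

translate = {"publication": {"publication_code": {"0001234": 1}}}
rename = {"publication": {"publication_code": "newspaper_id"}}

fields = get_fields(record, translate=translate, rename=rename)
# The relational field is now translated to a primary key and renamed,
# and `created_at`/`updated_at` timestamps have been added.
print(fields["newspaper_id"])  # -> 1
```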
Retrieves a specific key from a file and returns its value.
This function reads a file and extracts the value of a specified key. If the key is not found or an error occurs while processing the file, a warning is printed, and an empty string is returned.
Parameters:

item (Path, required): The file from which the key is extracted.
x (str, required): The key to be retrieved from the file.

Returns:
    str: The value of the specified key from the file.
Source code in alto2txt2fixture/parser.py
```python
def get_key_from(item: Path, x: str) -> str:
    """
    Retrieves a specific key from a file and returns its value.

    This function reads a file and extracts the value of a specified
    key. If the key is not found or an error occurs while processing
    the file, a warning is printed, and an empty string is returned.

    Args:
        item: The file from which the key is extracted.
        x: The key to be retrieved from the file.

    Returns:
        The value of the specified key from the file.
    """
    result = json.loads(item.read_text()).get(x, None)
    if not result:
        print(f"[WARN] Could not find key {x} in {item}")
        result = ""
    return result
```
def get_translator(\nfields: list[TranslatorTuple] = [TranslatorTuple(\"\", \"\", [])]\n) -> dict:\n\"\"\"\n Converts a list of fields into a nested dictionary representing a\n translator.\n Args:\n fields: A list of tuples representing fields to be translated.\n Returns:\n A nested dictionary representing the translator. The structure of\n the dictionary follows the format:\n ```python\n {\n 'part1': {\n 'part2': {\n 'translated_field': 'pk'\n }\n }\n }\n ```\n Example:\n ```pycon\n >>> fields = [\n ... TranslatorTuple(\n ... start='start__field1',\n ... finish='field1',\n ... lst=[{\n ... 'fields': {'field1': 'translation1'},\n ... 'pk': 1}],\n ... )]\n >>> get_translator(fields)\n {'start': {'field1': {'translation1': 1}}}\n ```\n \"\"\"\n_ = dict()\nfor field in fields:\nstart, finish, lst = field\npart1, part2 = start.split(\"__\")\nif part1 not in _:\n_[part1] = {}\nif part2 not in _[part1]:\n_[part1][part2] = {}\nif isinstance(finish, str):\n_[part1][part2] = {o[\"fields\"][finish]: o[\"pk\"] for o in lst}\nelif isinstance(finish, list):\n_[part1][part2] = {\n\"-\".join([o[\"fields\"][x] for x in finish]): o[\"pk\"] for o in lst\n}\nreturn _\n
Parses files from collections and generates fixtures for various models.
This function processes files from the specified collections and generates fixtures for different models, such as newspapers.dataprovider, newspapers.ingest, newspapers.digitisation, newspapers.newspaper, newspapers.issue, and newspapers.item.
It performs various steps, such as file listing, fixture generation, translation mapping, renaming fields, and saving fixtures to files.
Parameters:

collections (list, required): A list of collections from which files are processed and fixtures are generated.
cache_home (str, required): The directory path where the collections are located.
output (str, required): The directory path where the fixtures will be saved.
max_elements_per_file (int, required): The maximum number of elements per file when saving fixtures.

Returns:
    None: This function generates fixtures but does not return any value.
Source code in alto2txt2fixture/parser.py
def parse(\ncollections: list, cache_home: str, output: str, max_elements_per_file: int\n) -> None:\n\"\"\"\n Parses files from collections and generates fixtures for various models.\n This function processes files from the specified collections and generates\n fixtures for different models, such as `newspapers.dataprovider`,\n `newspapers.ingest`, `newspapers.digitisation`, `newspapers.newspaper`,\n `newspapers.issue`, and `newspapers.item`.\n It performs various steps, such as file listing, fixture generation,\n translation mapping, renaming fields, and saving fixtures to files.\n Args:\n collections: A list of collections from which files are\n processed and fixtures are generated.\n cache_home: The directory path where the collections are located.\n output: The directory path where the fixtures will be saved.\n max_elements_per_file: The maximum number of elements per file\n when saving fixtures.\n Returns:\n This function generates fixtures but does not return any value.\n \"\"\"\nglobal CACHE_HOME\nglobal OUTPUT\nglobal MAX_ELEMENTS_PER_FILE\nCACHE_HOME = cache_home\nOUTPUT = output\nMAX_ELEMENTS_PER_FILE = max_elements_per_file\n# Set up output directory\nreset_fixture_dir(OUTPUT)\n# Get file lists\nprint(\"\\nGetting file lists...\")\ndef issues_in_x(x):\nreturn \"issues\" in str(x.parent).split(\"/\")\ndef newspapers_in_x(x):\nreturn not any(\n[\ncondition\nfor y in str(x.parent).split(\"/\")\nfor condition in [\n\"issues\" in y,\n\"ingest\" in y,\n\"digitisation\" in y,\n\"data-provider\" in y,\n]\n]\n)\nall_json = [\nx for y in collections for x in (Path(CACHE_HOME) / y).glob(\"**/*.json\")\n]\nall_jsonl = [\nx for y in collections for x in (Path(CACHE_HOME) / y).glob(\"**/*.jsonl\")\n]\nprint(f\"--> {len(all_json):,} JSON files altogether\")\nprint(f\"--> {len(all_jsonl):,} JSONL files altogether\")\nprint(\"\\nSetting up fixtures...\")\n# Process data providers\ndef data_provider_in_x(x):\nreturn \"data-provider\" in str(x.parent).split(\"/\")\ndata_provider_json = list(\nfixtures(\nmodel=\"newspapers.dataprovider\",\nfilelist=[x for x in all_json if data_provider_in_x(x)],\nuniq_keys=[\"name\"],\n)\n)\nprint(f\"--> {len(data_provider_json):,} DataProvider fixtures\")\n# Process ingest\ndef ingest_in_x(x):\nreturn \"ingest\" in str(x.parent).split(\"/\")\ningest_json = list(\nfixtures(\nmodel=\"newspapers.ingest\",\nfilelist=[x for x in all_json if ingest_in_x(x)],\nuniq_keys=[\"lwm_tool_name\", \"lwm_tool_version\"],\n)\n)\nprint(f\"--> {len(ingest_json):,} Ingest fixtures\")\n# Process digitisation\ndef digitisation_in_x(x):\nreturn \"digitisation\" in str(x.parent).split(\"/\")\ndigitisation_json = list(\nfixtures(\nmodel=\"newspapers.digitisation\",\nfilelist=[x for x in all_json if digitisation_in_x(x)],\nuniq_keys=[\"software\"],\n)\n)\nprint(f\"--> {len(digitisation_json):,} Digitisation fixtures\")\n# Process newspapers\nnewspaper_json = list(\nfixtures(\nmodel=\"newspapers.newspaper\",\nfilelist=[file for file in all_json if newspapers_in_x(file)],\n)\n)\nprint(f\"--> {len(newspaper_json):,} Newspaper fixtures\")\n# Process issue\ntranslate = get_translator(\n[\nTranslatorTuple(\n\"publication__publication_code\", \"publication_code\", newspaper_json\n)\n]\n)\nrename = {\"publication\": {\"publication_code\": \"newspaper_id\"}}\nissue_json = list(\nfixtures(\nmodel=\"newspapers.issue\",\nfilelist=[file for file in all_json if issues_in_x(file)],\ntranslate=translate,\nrename=rename,\n)\n)\nprint(f\"--> {len(issue_json):,} Issue fixtures\")\n# Create translator/clear 
up memory before processing items\ntranslate = get_translator(\n[\n(\"issue__issue_identifier\", \"issue_code\", issue_json),\n(\"digitisation__software\", \"software\", digitisation_json),\n(\"data_provider__name\", \"name\", data_provider_json),\n(\n\"ingest__lwm_tool_identifier\",\n[\"lwm_tool_name\", \"lwm_tool_version\"],\ningest_json,\n),\n]\n)\nrename = {\n\"issue\": {\"issue_identifier\": \"issue_id\"},\n\"digitisation\": {\"software\": \"digitisation_id\"},\n\"data_provider\": {\"name\": \"data_provider_id\"},\n\"ingest\": {\"lwm_tool_identifier\": \"ingest_id\"},\n}\nsave_fixture(newspaper_json, \"Newspaper\")\nsave_fixture(issue_json, \"Issue\")\ndel newspaper_json\ndel issue_json\ngc.collect()\nprint(\"\\nSaving...\")\nsave_fixture(digitisation_json, \"Digitisation\")\nsave_fixture(ingest_json, \"Ingest\")\nsave_fixture(data_provider_json, \"DataProvider\")\n# Process items\nitem_json = fixtures(\nmodel=\"newspapers.item\",\nfilelist=all_jsonl,\ntranslate=translate,\nrename=rename,\n)\nsave_fixture(item_json, \"Item\")\nreturn\n
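A hedged example of invoking `parse` directly (the directory names and the element limit are placeholders, not project defaults); it expects a cache layout as produced by the router:

```python
# Sketch only: paths and the per-file limit below are illustrative placeholders.
from alto2txt2fixture.parser import parse

parse(
    collections=["hmd"],           # collection sub-folders inside `cache_home`
    cache_home="./cache",          # where the routed JSON/JSONL cache lives
    output="./output/fixtures",    # where the Django fixture files are written
    max_elements_per_file=10_000,  # cap on records per saved fixture file
)
```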
Resets the fixture directory by removing all JSON files inside it.
This function takes a directory path (output) as input and removes all JSON files within the directory.
Prior to removal, it prompts the user for confirmation to proceed. If the user confirms, the function clears the fixture directory by deleting the JSON files.
Parameters:

output (str | Path, required): The directory path of the fixture directory to be reset.

Raises:
    RuntimeError: If the output directory is not specified as a string.
Source code in alto2txt2fixture/parser.py
def reset_fixture_dir(output: str | Path) -> None:\n\"\"\"\n Resets the fixture directory by removing all JSON files inside it.\n This function takes a directory path (``output``) as input and removes all\n JSON files within the directory.\n Prior to removal, it prompts the user for confirmation to proceed. If the\n user confirms, the function clears the fixture directory by deleting the\n JSON files.\n Args:\n output: The directory path of the fixture directory to be reset.\n Raises:\n RuntimeError: If the ``output`` directory is not specified as a string.\n \"\"\"\nif not isinstance(output, str):\nraise RuntimeError(\"`output` directory needs to be specified as a string.\")\noutput = Path(output)\ny = input(\nf\"This command will automatically empty the fixture directory ({output.absolute()}). \"\n\"Do you want to proceed? [y/N]\"\n)\nif not y.lower() == \"y\":\noutput.mkdir(parents=True, exist_ok=True)\nreturn\nprint(\"\\nClearing up the fixture directory\")\n# Ensure directory exists\noutput.mkdir(parents=True, exist_ok=True)\n# Drop all JSON files\n[x.unlink() for x in Path(output).glob(\"*.json\")]\nreturn\n
uniq(filelist: list, keys: list = []) -> Generator[Any, None, None]
Generates unique items from a list of files based on specified keys.
This function takes a list of files and yields unique items based on a combination of keys. The keys are extracted from each file using the get_key_from function, and duplicate items are ignored.
Parameters:

filelist (list, required): A list of files from which unique items are generated.
keys (list, default []): A list of keys used for uniqueness. Each key specifies a field to be used for uniqueness checking in the generated items.

Yields:
    Any: A unique item from filelist.
Source code in alto2txt2fixture/parser.py
```python
def uniq(filelist: list, keys: list = []) -> Generator[Any, None, None]:
    """
    Generates unique items from a list of files based on specified keys.

    This function takes a list of files and yields unique items based on a
    combination of keys. The keys are extracted from each file using the
    ``get_key_from`` function, and duplicate items are ignored.

    Args:
        filelist: A list of files from which unique items are generated.
        keys: A list of keys used for uniqueness. Each key specifies
            a field to be used for uniqueness checking in the generated
            items.

    Yields:
        A unique item from `filelist`.
    """
    seen = set()
    for item in filelist:
        key = "-".join([get_key_from(item, x) for x in keys])
        if key not in seen:
            seen.add(key)
            yield item
        else:
            # Drop it if duplicate
            pass
```
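For illustration, `uniq` can be combined with a glob of cached JSON files (the paths here are hypothetical); only the first file per unique key combination is yielded:

```python
# Sketch with hypothetical cache paths: keeps one file per unique "name" value.
from pathlib import Path

from alto2txt2fixture.parser import uniq

json_files = sorted(Path("./cache/hmd/data-provider").glob("*.json"))
unique_files = list(uniq(json_files, keys=["name"]))
print(f"{len(unique_files)} of {len(json_files)} files kept")
```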
The fulltext app has a fulltext model class specified in lwmdb.fulltext.models.fulltext. A SQL table is generated from that fulltext class, and the JSON fixture structure generated from this class is where records will be stored.

extract_subdir (PathLike): Folder to extract self.compressed_files to.
plaintext_extension (str): What file extension to use to filter plaintext files.
Return class name with count and DataProvider if available.
Source code in alto2txt2fixture/plaintext.py
```python
def __str__(self) -> str:
    """Return class name with count and `DataProvider` if available."""
    return (
        f"{type(self).__name__} "
        f"for {len(self)} "
        f"{self._data_provider_code_quoted_with_trailing_space}files"
    )
```
Parameters:

output_path (PathLike | None, default None): Path to save compressed JSON files to. Uses self.json_export_compression_subdir if None is passed.
format (ArchiveFormatEnum | None, default None): What compression format to use from ArchiveFormatEnum. Uses self.json_export_compression_format if None is passed.

Note: Neither output_path nor format overwrite the related attributes of self.
Example
```pycon
>>> if is_platform_win:
...     pytest.skip('decompression fails on Windows: issue #55')
>>> plaintext_bl_lwm = getfixture('bl_lwm_plaintext_json_export')
<BLANKLINE>
...
>>> compressed_paths: Path = plaintext_bl_lwm.compress_json_exports(
...     format='tar')
<BLANKLINE>
...Compressing...'...01.json' to...'tar'...in:...
>>> compressed_paths
(...Path('.../plaintext_fixture-000001.json.tar'),)
```
Source code in alto2txt2fixture/plaintext.py
def compress_json_exports(\nself,\noutput_path: PathLike | None = None,\nformat: ArchiveFormatEnum | None = None,\n) -> tuple[Path, ...]:\n\"\"\"Compress `self._exported_json_paths` to `format`.\n Args:\n output_path:\n `Path` to save compressed `json` files to. Uses\n `self.json_export_compression_subdir` if `None` is passed.\n format:\n What compression format to use from `ArchiveFormatEnum`. Uses\n `self.json_export_compression_format` if `None` is passed.\n Note:\n Neither `output_path` nor `format` overwrite the related attributes\n of `self`.\n Returns: The the `output_path` passed to save compressed `json`.\n Example:\n ```pycon\n >>> if is_platform_win:\n ... pytest.skip('decompression fails on Windows: issue #55')\n >>> plaintext_bl_lwm = getfixture('bl_lwm_plaintext_json_export')\n <BLANKLINE>\n ...\n >>> compressed_paths: Path = plaintext_bl_lwm.compress_json_exports(\n ... format='tar')\n <BLANKLINE>\n ...Compressing...'...01.json' to...'tar'...in:...\n >>> compressed_paths\n (...Path('.../plaintext_fixture-000001.json.tar'),)\n ```\n \"\"\"\noutput_path = (\nPath(self.json_export_compression_subdir)\nif not output_path\nelse Path(output_path)\n)\nformat = self.json_export_compression_format if not format else format\ncompressed_paths: list[Path] = []\nfor json_path in self.exported_json_paths:\ncompressed_paths.append(\ncompress_fixture(json_path, output_path=output_path, format=format)\n)\nreturn tuple(compressed_paths)\n
The Archive class represents a zip archive of XML files. The class is used to extract information from a ZIP archive, and it contains several methods to process the data contained in the archive.
open(Archive) context manager
Archive can be opened with a context manager, which creates a meta object, with timings for the object. When closed, it will save the meta JSON to the correct paths.
Attributes:
path (Path): The path to the zip archive.
collection (str): The collection of the XML files in the archive. Default is "".
report (Path): The file path of the report file for the archive.
report_id (str): The report ID for the archive. If not provided, a random UUID is generated.
report_parent (Path): The parent directory of the report file for the archive.
jisc_papers (pd.DataFrame): A DataFrame of JISC papers.
size (str | float): The size of the archive, in human-readable format.
size_raw (str | float): The raw size of the archive, in bytes.
roots (Generator[ET.Element, None, None]): The root elements of the XML documents contained in the archive.
meta (dotdict): Metadata about the archive, such as its path, size, and number of contents.
A generator that yields instances of the Document class for each XML file in the ZIP archive.
It uses the tqdm library to display a progress bar in the terminal while it is running.
If the contents of the ZIP file are not empty, the method creates an instance of the Document class by passing the root element of the XML file, the collection name, meta information about the archive, and the JISC papers data frame (if provided) to the constructor of the Document class. The instance of the Document class is then returned by the generator.
Yields:
    Document: A Document class instance for each unzipped XML file.
Source code in alto2txt2fixture/router.py
def get_documents(self) -> Generator[Document, None, None]:\n\"\"\"\n A generator that yields instances of the Document class for each XML\n file in the ZIP archive.\n It uses the `tqdm` library to display a progress bar in the terminal\n while it is running.\n If the contents of the ZIP file are not empty, the method creates an\n instance of the ``Document`` class by passing the root element of the XML\n file, the collection name, meta information about the archive, and the\n JISC papers data frame (if provided) to the constructor of the\n ``Document`` class. The instance of the ``Document`` class is then\n returned by the generator.\n Yields:\n ``Document`` class instance for each unzipped `XML` file.\n \"\"\"\nfor xml_file in tqdm(\nself.filelist,\ndesc=f\"{Path(self.zip_file.filename).stem} ({self.meta.size})\",\nleave=False,\ncolour=\"green\",\n):\nwith self.zip_file.open(xml_file) as f:\nxml = f.read()\nif xml:\nyield Document(\nroot=ET.fromstring(xml),\ncollection=self.collection,\nmeta=self.meta,\njisc_papers=self.jisc_papers,\n)\n
Yields the root elements of the XML documents contained in the archive.
Source code in alto2txt2fixture/router.py
```python
def get_roots(self) -> Generator[ET.Element, None, None]:
    """
    Yields the root elements of the XML documents contained in the archive.
    """
    for xml_file in tqdm(self.filelist, leave=False, colour="blue"):
        with self.zip_file.open(xml_file) as f:
            xml = f.read()
            if xml:
                yield ET.fromstring(xml)
```
The Cache class provides a blueprint for creating and managing cache data. The class has several methods that help in getting the cache path, converting the data to a dictionary, and writing the cache data to a file.
It is inherited by many other classes in this document.
Initializes the Cache class object.
Source code in alto2txt2fixture/router.py
```python
def __init__(self):
    """
    Initializes the Cache class object.
    """
    pass
```
Returns the cache path, which is used to store the cache data. The path is normally constructed using some of the object's properties (collection, kind, and id) but can be changed when inherited.
Source code in alto2txt2fixture/router.py
```python
def get_cache_path(self) -> Path:
    """
    Returns the cache path, which is used to store the cache data.

    The path is normally constructed using some of the object's
    properties (collection, kind, and id) but can be changed when
    inherited.
    """
    return Path(f"{CACHE_HOME}/{self.collection}/{self.kind}/{self.id}.json")
```
write_to_cache(json_indent: int = JSON_INDENT) -> Optional[bool]
Writes the cache data to a file at the specified cache path. The cache data is first converted to a dictionary using the as_dict method. If the cache path already exists, the function returns True.
Source code in alto2txt2fixture/router.py
```python
def write_to_cache(self, json_indent: int = JSON_INDENT) -> Optional[bool]:
    """
    Writes the cache data to a file at the specified cache path. The cache
    data is first converted to a dictionary using the as_dict method. If
    the cache path already exists, the function returns True.
    """
    path = self.get_cache_path()
    try:
        if path.exists():
            return True
    except AttributeError:
        error(
            f"Error occurred when getting cache path for "
            f"{self.kind}: {path}. It was not of expected "
            f"type Path but of type {type(path)}:",
        )
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w+") as f:
        f.write(json.dumps(self.as_dict(), indent=json_indent))
    return
```
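A minimal sketch of how a subclass might satisfy the `Cache` interface, assuming (as the methods above do) that instances expose `collection`, `kind` and `id` attributes and an `as_dict` method. The class and field names here are invented for illustration and are not part of the library:

```python
# Hypothetical subclass, not part of the library: shows the attributes and
# `as_dict` method that `get_cache_path`/`write_to_cache` rely on.
from alto2txt2fixture.router import Cache


class ExampleRecord(Cache):
    kind = "example"  # becomes part of the cache path

    def __init__(self, collection: str, id: str, payload: dict):
        self.collection = collection
        self.id = id
        self.payload = payload

    def as_dict(self) -> dict:
        return self.payload


record = ExampleRecord("hmd", "0001234", {"title": "Example Gazette"})
record.write_to_cache()  # writes JSON to CACHE_HOME/hmd/example/0001234.json
```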
A Collection represents a group of newspaper archives from any passed alto2txt metadata output.
A Collection is initialised with a name and an optional pandas DataFrame of JISC papers. The archives property returns an iterable of the Archive objects within the collection.
The DataProvider class extends the Cache class and represents a newspaper data provider. The class has several properties and methods that allow creation of a data provider object and the manipulation of its data.
Attributes:
collection (str): A string representing publication collection
kind (str): Indication of object type, defaults to data-provider
providers_meta_data (list[FixtureDict]): Structured dict of metadata for known collection sources
collection_type (str): Related data sources and potential linkage source
index_field (str): Field name for querying existing records
Example
```pycon
>>> from pprint import pprint
>>> hmd = DataProvider("hmd")
>>> hmd.pk
2
>>> pprint(hmd.as_dict())
{'code': 'bl_hmd',
 'collection': 'newspapers',
 'legacy_code': 'hmd',
 'name': 'Heritage Made Digital',
 'source_note': 'British Library-funded digitised newspapers provided by the '
 'British Newspaper Archive'}
```
The Digitisation class extends the Cache class and represents a newspaper digitisation. The class has several properties and methods that allow creation of a digitisation object and the manipulation of its data.
Attributes:
root (ET.Element): An XML element that represents the root of the publication
collection (str): A string that represents the collection of the publication
Constructor method.
Source code in alto2txt2fixture/router.py
```python
def __init__(self, root: ET.Element, collection: str = ""):
    """Constructor method."""
    if not isinstance(root, ET.Element):
        raise RuntimeError(f"Expected root to be xml.etree.Element: {type(root)}")
    self.root: ET.Element = root
    self.collection: str = collection
```
A method that returns a dictionary representation of the digitisation object.
Returns:
    dict: Dictionary representation of the Digitisation object
Source code in alto2txt2fixture/router.py
def as_dict(self) -> dict:\n\"\"\"\n A method that returns a dictionary representation of the digitisation\n object.\n Returns:\n Dictionary representation of the Digitising object\n \"\"\"\ndic = {\nx.tag: x.text or \"\"\nfor x in self.root.findall(\"./process/*\")\nif x.tag\nin [\n\"xml_flavour\",\n\"software\",\n\"mets_namespace\",\n\"alto_namespace\",\n]\n}\nif not dic.get(\"software\"):\nreturn {}\nreturn dic\n
The Document class is a representation of a document that contains information about a publication, newspaper, item, digitisation, and ingest. This class holds all the relevant information about a document in a structured manner and provides properties that can be used to access different aspects of the document.
Attributes:
collection (str | None): A string that represents the collection of the publication
root (ET.Element | None): An XML element that represents the root of the publication
zip_file (str | None): A path to a valid zip file
jisc_papers (pd.DataFrame | None): A pandas DataFrame object that holds information about the JISC papers
meta (dotdict | None): TODO
Constructor method.
Source code in alto2txt2fixture/router.py
def __init__(self, *args, **kwargs):\n\"\"\"Constructor method.\"\"\"\nself.collection: str | None = kwargs.get(\"collection\")\nif not self.collection or not isinstance(self.collection, str):\nraise RuntimeError(\"A valid collection must be passed\")\nself.root: ET.Element | None = kwargs.get(\"root\")\nif not self.root or not isinstance(self.root, ET.Element):\nraise RuntimeError(\"A valid XML root must be passed\")\nself.zip_file: str | None = kwargs.get(\"zip_file\")\nif self.zip_file and not isinstance(self.zip_file, str):\nraise RuntimeError(\"A valid zip file must be passed\")\nself.jisc_papers: pd.DataFrame | None = kwargs.get(\"jisc_papers\")\nif not isinstance(self.jisc_papers, pd.DataFrame):\nraise RuntimeError(\n\"A valid DataFrame containing JISC papers must be passed\"\n)\nself.meta: dotdict | None = kwargs.get(\"meta\")\nself._publication_elem = None\nself._input_sub_path = None\nself._ingest = None\nself._digitisation = None\nself._item = None\nself._issue = None\nself._newspaper = None\nself._data_provider = None\n
The Ingest class extends the Cache class and represents a newspaper ingest. The class has several properties and methods that allow the creation of an ingest object and the manipulation of its data.
Attributes:
root (ET.Element): An XML element that represents the root of the publication
collection (str): A string that represents the collection of the publication
Constructor method.
Source code in alto2txt2fixture/router.py
```python
def __init__(self, root: ET.Element, collection: str = ""):
    """Constructor method."""
    if not isinstance(root, ET.Element):
        raise RuntimeError(f"Expected root to be xml.etree.Element: {type(root)}")
    self.root: ET.Element = root
    self.collection: str = collection
```
A method that returns a dictionary representation of the ingest object.
Returns:
    dict: Dictionary representation of the Ingest object
Source code in alto2txt2fixture/router.py
```python
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the ingest
    object.

    Returns:
        Dictionary representation of the Ingest object
    """
    return {
        f"lwm_tool_{x.tag}": x.text or ""
        for x in self.root.findall("./process/lwm_tool/*")
    }
```
The Issue class extends the Cache class and represents a newspaper issue. The class has several properties and methods that allow the creation of an issue object and the manipulation of its data.
Attributes:
root: An XML element that represents the root of the publication
newspaper (Newspaper | None): The parent newspaper
collection (str): A string that represents the collection of the publication
A method that returns a dictionary representation of the issue object.
Returns:
    dict: Dictionary representation of the Issue object
Source code in alto2txt2fixture/router.py
```python
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the issue
    object.

    Returns:
        Dictionary representation of the Issue object
    """
    if not self._issue:
        self._issue = dict(
            issue_code=self.issue_code,
            issue_date=self.issue_date,
            publication__publication_code=self.newspaper.publication_code,
            input_sub_path=self.input_sub_path,
        )
    return self._issue
```
Returns the path to the cache file for the issue object.
Returns:
    Path: Path to the cache file for the issue object
Source code in alto2txt2fixture/router.py
```python
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the issue object.

    Returns:
        Path to the cache file for the issue object
    """
    json_file = f"/{self.newspaper.publication_code}/issues/{self.issue_code}.json"
    return Path(
        f"{CACHE_HOME}/{self.collection}/"
        + "/".join(self.newspaper.number_paths)
        + json_file
    )
```
The Item class extends the Cache class and represents a newspaper item, i.e. an article. The class has several properties and methods that allow the creation of an article object and the manipulation of its data.
Attributes:
root (ET.Element): An XML element that represents the root of the publication
issue_code (str): A string that represents the issue code
digitisation (dict): TODO
ingest (dict): TODO
collection (str): A string that represents the collection of the publication
newspaper (Newspaper | None): The parent newspaper
meta (dotdict): TODO
Constructor method.
Source code in alto2txt2fixture/router.py
def __init__(\nself,\nroot: ET.Element,\nissue_code: str = \"\",\ndigitisation: dict = {},\ningest: dict = {},\ncollection: str = \"\",\nnewspaper: Optional[Newspaper] = None,\nmeta: dotdict = dotdict(),\n):\n\"\"\"Constructor method.\"\"\"\nif not isinstance(root, ET.Element):\nraise RuntimeError(f\"Expected root to be xml.etree.Element: {type(root)}\")\nif not isinstance(newspaper, Newspaper):\nraise RuntimeError(\"Expected newspaper to be of type router.Newspaper\")\nself.root: ET.Element = root\nself.issue_code: str = issue_code\nself.digitisation: dict = digitisation\nself.ingest: dict = ingest\nself.collection: str = collection\nself.newspaper: Newspaper | None = newspaper\nself.meta: dotdict = meta\nself._item_elem = None\nself._item_code = None\nself._item = None\npath: str = str(self.get_cache_path())\nif not self.meta.item_paths:\nself.meta.item_paths = [path]\nelif path not in self.meta.item_paths:\nself.meta.item_paths.append(path)\n
Returns the path to the cache file for the item (article) object.
Returns:
    Path: Path to the cache file for the article object
Source code in alto2txt2fixture/router.py
```python
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the item (article) object.

    Returns:
        Path to the cache file for the article object
    """
    return Path(
        f"{CACHE_HOME}/{self.collection}/"
        + "/".join(self.newspaper.number_paths)
        + f"/{self.newspaper.publication_code}/items.jsonl"
    )
```
Special cache-write function that appends rather than writes at the end of the process.
Returns:
    None
Source code in alto2txt2fixture/router.py
```python
def write_to_cache(self, json_indent=JSON_INDENT) -> None:
    """
    Special cache-write function that appends rather than writes at the
    end of the process.

    Returns:
        None.
    """
    path = self.get_cache_path()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a+") as f:
        f.write(json.dumps(self.as_dict(), indent=json_indent) + "\n")
    return
```
A method that returns a dictionary representation of the newspaper object.
Returns:
    dict: Dictionary representation of the Newspaper object
Source code in alto2txt2fixture/router.py
```python
def as_dict(self) -> dict:
    """
    A method that returns a dictionary representation of the newspaper
    object.

    Returns:
        Dictionary representation of the Newspaper object
    """
    if not self._newspaper:
        self._newspaper = dict(
            **dict(publication_code=self.publication_code, title=self.title),
            **{
                x.tag: x.text or ""
                for x in self.publication.findall("*")
                if x.tag in ["location"]
            },
        )
    return self._newspaper
```
Returns the path to the cache file for the newspaper object.
Returns:
    Path: Path to the cache file for the newspaper object
Source code in alto2txt2fixture/router.py
```python
def get_cache_path(self) -> Path:
    """
    Returns the path to the cache file for the newspaper object.

    Returns:
        Path to the cache file for the newspaper object
    """
    json_file = f"/{self.publication_code}/{self.publication_code}.json"
    return Path(
        f"{CACHE_HOME}/{self.collection}/" + "/".join(self.number_paths) + json_file
    )
```
A method that returns the publication code from the input sub-path of the publication process.
Returns:
    str | None: The code of the publication
Source code in alto2txt2fixture/router.py
```python
def publication_code_from_input_sub_path(self) -> str | None:
    """
    A method that returns the publication code from the input sub-path of
    the publication process.

    Returns:
        The code of the publication
    """
    g = PUBLICATION_CODE.findall(self.input_sub_path)
    if len(g) == 1:
        return g[0]
    return None
```
This function is responsible for setting up the path for the alto2txt mountpoint, setting up the JISC papers and routing the collections for processing.
Parameters:
collections (list, required): List of collection names
cache_home (str, required): Directory path for the cache
mountpoint (str, required): Directory path for the alto2txt mountpoint
jisc_papers_path (str, required): Path to the JISC papers
report_dir (str, required): Path to the report directory

Returns:
    None
Source code in alto2txt2fixture/router.py
def route(\ncollections: list,\ncache_home: str,\nmountpoint: str,\njisc_papers_path: str,\nreport_dir: str,\n) -> None:\n\"\"\"\n This function is responsible for setting up the path for the alto2txt\n mountpoint, setting up the JISC papers and routing the collections for\n processing.\n Args:\n collections: List of collection names\n cache_home: Directory path for the cache\n mountpoint: Directory path for the alto2txt mountpoint\n jisc_papers_path: Path to the JISC papers\n report_dir: Path to the report directory\n Returns:\n None\n \"\"\"\nglobal CACHE_HOME\nglobal MNT\nglobal REPORT_DIR\nCACHE_HOME = cache_home\nREPORT_DIR = report_dir\nMNT = Path(mountpoint) if isinstance(mountpoint, str) else mountpoint\nif not MNT.exists():\nerror(\nf\"The mountpoint provided for alto2txt does not exist. \"\nf\"Either create a local copy or blobfuse it to \"\nf\"`{MNT.absolute()}`.\"\n)\njisc_papers = setup_jisc_papers(path=jisc_papers_path)\nfor collection_name in collections:\ncollection = Collection(name=collection_name, jisc_papers=jisc_papers)\nif collection.empty:\nerror(\nf\"It looks like {collection_name} is empty in the \"\nf\"alto2txt mountpoint: `{collection.dir.absolute()}`.\"\n)\nfor archive in collection.archives:\nwith archive as _:\n[\n(\ndoc.item.write_to_cache(),\ndoc.newspaper.write_to_cache(),\ndoc.issue.write_to_cache(),\ndoc.data_provider.write_to_cache(),\ndoc.ingest.write_to_cache(),\ndoc.digitisation.write_to_cache(),\n)\nfor doc in archive.documents\n]\nreturn\n
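A hedged example of calling `route` (every path below is a placeholder for a local setup); it expects the alto2txt mountpoint to exist and writes the per-document caches consumed later by `parse`:

```python
# Sketch only: all paths are illustrative placeholders, not project defaults.
from alto2txt2fixture.router import route

route(
    collections=["hmd"],                   # collection folders under the mountpoint
    cache_home="./cache",                  # where per-document JSON caches are written
    mountpoint="./alto2txt-mount",         # local copy or blobfuse mount of alto2txt output
    jisc_papers_path="./jisc_papers.csv",  # JISC papers metadata file
    report_dir="./reports",                # where archive reports are saved
)
```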
Fields within the fields portion of a FixtureDict to fit lwmdb.
Attributes:
name (str): The name of the collection data source. For lwmdb this should be less than 600 characters.
code (str | NEWSPAPER_OCR_FORMATS): A short slug-like, url-compatible (replace spaces with -) str to uniquely identify a data provider in urls, api calls etc. Designed to fit NEWSPAPER_OCR_FORMATS and any future slug-like codes.
legacy_code (LEGACY_NEWSPAPER_OCR_FORMATS | None): Either blank or a legacy slug-like, url-compatible (replace spaces with -) str originally used by alto2txt (see LEGACY_NEWSPAPER_OCR_FORMATS and NEWSPAPER_OCR_FORMATS).
A typed dict for Plaintext Fixtures to match the lwmdb Fulltext model.
Attributes:
text (str): Plaintext, potentially quite large newspaper articles. May have unusual or unreadable sequences of characters due to issues with Optical Character Recognition quality.
path (str): Path of provided plaintext file. If compressed_path is None, this is the original relative Path of the plaintext file.
compressed_path (str | None): The path of a compressed data source, the extraction of which provides access to plaintext files.
finish (str | list): A string or list specifying the field(s) to be translated. If it is a string, the translated field will be a direct mapping of the specified field in each item of the input list. If it is a list, the translated field will be a hyphen-separated concatenation of the specified fields in each item of the input list.
lst (list[dict]): A list of dictionaries representing the items to be translated. Each dictionary should contain the necessary fields for translation, with the field names specified in the start parameter.

data_provider_index (str, default DATA_PROVIDER_INDEX): The dict fields key used to check matching collections names.

Returns:
    set[str]: A set of collections without a matching newspaper_collections record.
Example
```pycon
>>> check_newspaper_collection_configuration()
set()
>>> unmatched: set[str] = check_newspaper_collection_configuration(
... ["cat", "dog"])
<BLANKLINE>
...Warning: 2 `collections` not in `newspaper_collections`: ...
>>> unmatched == {'dog', 'cat'}
True
```
Note
Set orders are random, so we check unmatched == {'dog', 'cat'} to ensure correctness irrespective of order in the example above.
Source code in alto2txt2fixture/utils.py
def check_newspaper_collection_configuration(\ncollections: Iterable[str] = settings.COLLECTIONS,\nnewspaper_collections: Iterable[FixtureDict] = NEWSPAPER_COLLECTION_METADATA,\ndata_provider_index: str = DATA_PROVIDER_INDEX,\n) -> set[str]:\n\"\"\"Check the names in `collections` match the names in `newspaper_collections`.\n Arguments:\n collections:\n Names of newspaper collections, defaults to ``settings.COLLECTIONS``\n newspaper_collections:\n Newspaper collections in a list of `FixtureDict` format. Defaults\n to ``settings.FIXTURE_TABLE['dataprovider]``\n data_provider_index:\n `dict` `fields` `key` used to check matchiching `collections` name\n Returns:\n A set of ``collections`` without a matching `newspaper_collections` record.\n Example:\n ```pycon\n >>> check_newspaper_collection_configuration()\n set()\n >>> unmatched: set[str] = check_newspaper_collection_configuration(\n ... [\"cat\", \"dog\"])\n <BLANKLINE>\n ...Warning: 2 `collections` not in `newspaper_collections`: ...\n >>> unmatched == {'dog', 'cat'}\n True\n ```\n !!! note\n Set orders are random so checking `unmatched == {'dog, 'cat'}` to\n ensure correctness irrespective of order in the example above.\n \"\"\"\nnewspaper_collection_names: tuple[str, ...] = tuple(\ndict_from_list_fixture_fields(\nnewspaper_collections, field_name=data_provider_index\n).keys()\n)\ncollection_diff: set[str] = set(collections) - set(newspaper_collection_names)\nif collection_diff:\nwarning(\nf\"{len(collection_diff)} `collections` \"\nf\"not in `newspaper_collections`: {collection_diff}\"\n)\nreturn collection_diff\n
Clears the cache directory by removing all .json files in it.
Parameters:
dir (str | Path, required): The path of the directory to be cleared.

Source code in alto2txt2fixture/utils.py
```python
def clear_cache(dir: str | Path) -> None:
    """
    Clears the cache directory by removing all `.json` files in it.

    Args:
        dir: The path of the directory to be cleared.
    """
    dir = get_path_from(dir)
    y = input(
        f"Do you want to erase the cache path now that the "
        f"files have been generated ({dir.absolute()})? [y/N]"
    )
    if y.lower() == "y":
        info("Clearing up the cache directory")
        for x in dir.glob("*.json"):
            x.unlink()
```
Compress exported fixture files using make_archive.

Parameters:

path (PathLike, required): Path to file to compress
output_path (PathLike | str, default settings.OUTPUT): Compressed file name (without extension specified from format).
format (str | ArchiveFormatEnum, default ZIP_FILE_EXTENSION): A str of one of the registered compression formats. By default Python provides zip, tar, gztar, bztar, and xztar. See the ArchiveFormatEnum variable for options checked.
suffix (str): str to add to the compressed file name saved. For example: if path = plaintext_fixture-1.json and suffix = _compressed, then the saved file might be called plaintext_fixture-1_compressed.json.zip
create_lookup(lst: list = [], on: list = []) -> dict
Create a lookup dictionary from a list of dictionaries.
Parameters:
lst (list, default []): A list of dictionaries that should be used to generate the lookup.
on (list, default []): A list of keys from the dictionaries in the list that should be used as the keys in the lookup.

Returns:
    dict: The generated lookup dictionary.
Source code in alto2txt2fixture/utils.py
```python
def create_lookup(lst: list = [], on: list = []) -> dict:
    """
    Create a lookup dictionary from a list of dictionaries.

    Args:
        lst: A list of dictionaries that should be used to generate the lookup.
        on: A list of keys from the dictionaries in the list that should be used as the keys in the lookup.

    Returns:
        The generated lookup dictionary.
    """
    return {get_key(x, on): x["pk"] for x in lst}
```
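For example (with made-up fixture-style data), two fields can be combined into a hyphen-joined key:

```python
# Illustrative fixture-like dicts; keys are joined with "-" as in `get_key`.
from alto2txt2fixture.utils import create_lookup

records = [
    {"pk": 1, "fields": {"lwm_tool_name": "alto2txt", "lwm_tool_version": "1.0"}},
    {"pk": 2, "fields": {"lwm_tool_name": "alto2txt", "lwm_tool_version": "2.0"}},
]
lookup = create_lookup(records, on=["lwm_tool_name", "lwm_tool_version"])
print(lookup)  # {'alto2txt-1.0': 1, 'alto2txt-2.0': 2}
```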
def dict_from_list_fixture_fields(\nfixture_list: Iterable[FixtureDict] = NEWSPAPER_COLLECTION_METADATA,\nfield_name: str = DATA_PROVIDER_INDEX,\n) -> dict[str, FixtureDict]:\n\"\"\"Create a `dict` from ``fixture_list`` with ``attr_name`` as `key`.\n Args:\n fixture_list: `list` of `FixtureDict` with ``attr_name`` key `fields`.\n field_name: key for values within ``fixture_list`` `fields`.\n Returns:\n A `dict` where extracted `field_name` is key for related `FixtureDict` values.\n Example:\n ```pycon\n >>> fixture_dict: dict[str, FixtureDict] = dict_from_list_fixture_fields()\n >>> fixture_dict['hmd']['pk']\n 2\n >>> fixture_dict['hmd']['fields'][DATA_PROVIDER_INDEX]\n 'hmd'\n >>> fixture_dict['hmd']['fields']['code']\n 'bl_hmd'\n ```\n \"\"\"\nreturn {record[\"fields\"][field_name]: record for record in fixture_list}\n
Saves fixtures generated by a generator to separate CSV files.

This function takes an Iterable or Generator of fixtures and saves them to separate CSV files. The fixtures are saved in batches, where each batch is determined by the max_elements_per_file parameter.

Parameters:

fixtures (Iterable[FixtureDict] | Generator[FixtureDict, None, None], required): An Iterable or Generator of the fixtures to be saved.
prefix (str): A string prefix to be added to the file names of the saved fixtures.
def fixtures_dict2csv(\nfixtures: Iterable[FixtureDict] | Generator[FixtureDict, None, None],\nprefix: str = \"\",\noutput_path: PathLike | str = settings.OUTPUT,\nindex: bool = False,\nmax_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,\nfile_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,\n) -> None:\n\"\"\"Saves fixtures generated by a generator to separate separate `CSV` files.\n This function takes an `Iterable` or `Generator` of fixtures and saves to\n separate `CSV` files. The fixtures are saved in batches, where each batch\n is determined by the ``max_elements_per_file`` parameter.\n Args:\n fixtures:\n An `Iterable` or `Generator` of the fixtures to be saved.\n prefix:\n A string prefix to be added to the file names of the\n saved fixtures.\n output_path:\n Path to folder fixtures are saved to\n max_elements_per_file:\n Maximum `JSON` records saved in each file\n file_name_0_padding:\n Zeros to prefix the number of each fixture file name.\n Returns:\n This function saves fixtures to files and does not return a value.\n Example:\n ```pycon\n >>> tmp_path: Path = getfixture('tmp_path')\n >>> from pandas import read_csv\n >>> fixtures_dict2csv(NEWSPAPER_COLLECTION_METADATA,\n ... prefix='test', output_path=tmp_path)\n >>> imported_fixture = read_csv(tmp_path / 'test-000001.csv')\n >>> imported_fixture.iloc[1]['pk']\n 2\n >>> imported_fixture.iloc[1][DATA_PROVIDER_INDEX]\n 'hmd'\n ```\n \"\"\"\ninternal_counter: int = 1\ncounter: int = 1\nlst: list = []\nfile_name: str\ndf: DataFrame\nPath(output_path).mkdir(parents=True, exist_ok=True)\nfor item in fixtures:\nlst.append(fixture_fields(item, as_dict=True))\ninternal_counter += 1\nif internal_counter > max_elements_per_file:\ndf = DataFrame.from_records(lst)\nfile_name = f\"{prefix}-{str(counter).zfill(file_name_0_padding)}.csv\"\ndf.to_csv(Path(output_path) / file_name, index=index)\n# Save up some memory\ndel lst\ngc.collect()\n# Re-instantiate\nlst = []\ninternal_counter = 1\ncounter += 1\nelse:\ndf = DataFrame.from_records(lst)\nfile_name = f\"{prefix}-{str(counter).zfill(file_name_0_padding)}.csv\"\ndf.to_csv(Path(output_path) / file_name, index=index)\n
def free_hd_space_in_GB(\ndisk_usage_tuple: DiskUsageTuple | None = None, path: PathLike | None = None\n) -> float:\n\"\"\"Return remaing hard drive space estimate in gigabytes.\n Args:\n disk_usage_tuple:\n A `NamedTuple` normally returned from `disk_usage()` or `None`.\n path:\n A `path` to pass to `disk_usage` if `disk_usage_tuple` is `None`.\n Returns:\n A `float` from dividing the `disk_usage_tuple.free` value by `BYTES_PER_GIGABYTE`\n Example:\n ```pycon\n >>> space_in_gb = free_hd_space_in_GB()\n >>> space_in_gb > 1 # Hopefully true wherever run...\n True\n ```\n \"\"\"\nif not disk_usage_tuple:\nif not path:\npath = Path(getcwd())\ndisk_usage_tuple = disk_usage(path=path)\nassert disk_usage_tuple\nreturn disk_usage_tuple.free / BYTES_PER_GIGABYTE\n
def gen_fixture_tables(\nfixture_tables: dict[str, list[FixtureDict]] = {},\ninclude_fixture_pk_column: bool = True,\n) -> Generator[Table, None, None]:\n\"\"\"Generator of `rich.Table` instances from `FixtureDict` configuration tables.\n Args:\n fixture_tables: `dict` where `key` is for `Table` title and `value` is a `FixtureDict`\n include_fixture_pk_column: whether to include the `pk` field from `FixtureDict`\n Example:\n ```pycon\n >>> table_name: str = \"data_provider\"\n >>> tables = tuple(\n ... gen_fixture_tables(\n ... {table_name: NEWSPAPER_COLLECTION_METADATA}\n ... ))\n >>> len(tables)\n 1\n >>> assert tables[0].title == table_name\n >>> [column.header for column in tables[0].columns]\n ['pk', 'name', 'code', 'legacy_code', 'collection', 'source_note']\n ```\n \"\"\"\nfor name, fixture_records in fixture_tables.items():\nfixture_table: Table = Table(title=name)\nfor i, fixture_dict in enumerate(fixture_records):\nif i == 0:\n[\nfixture_table.add_column(name)\nfor name in fixture_fields(fixture_dict, include_fixture_pk_column)\n]\nrow_values: tuple[str, ...] = tuple(\nstr(x) for x in (fixture_dict[\"pk\"], *fixture_dict[\"fields\"].values())\n)\nfixture_table.add_row(*row_values)\nyield fixture_table\n
This function takes in a Path object path and returns a list of lists of zipfiles sorted and chunked according to certain conditions defined in the settings object (see settings.CHUNK_THRESHOLD).
Note: the function will also skip zip files of a certain file size, which can be specified in the settings object (see settings.SKIP_FILE_SIZE).
Parameters:
path (Path, required): The input path where the zipfiles are located

Returns:
    list: A list of lists of zipfiles; each inner list represents a chunk of zipfiles.
Source code in alto2txt2fixture/utils.py
def get_chunked_zipfiles(path: Path) -> list:\n\"\"\"This function takes in a `Path` object `path` and returns a list of lists\n of `zipfiles` sorted and chunked according to certain conditions defined\n in the `settings` object (see `settings.CHUNK_THRESHOLD`).\n Note: the function will also skip zip files of a certain file size, which\n can be specified in the `settings` object (see `settings.SKIP_FILE_SIZE`).\n Args:\n path: The input path where the zipfiles are located\n Returns:\n A list of lists of `zipfiles`, each inner list represents a chunk of\n zipfiles.\n \"\"\"\nzipfiles = sorted(\npath.glob(\"*.zip\"),\nkey=lambda x: x.stat().st_size,\nreverse=settings.START_WITH_LARGEST,\n)\nzipfiles = [x for x in zipfiles if x.stat().st_size <= settings.SKIP_FILE_SIZE]\nif len(zipfiles) > settings.CHUNK_THRESHOLD:\nchunks = array_split(zipfiles, len(zipfiles) / settings.CHUNK_THRESHOLD)\nelse:\nchunks = [zipfiles]\nreturn chunks\n
Get a string key from a dictionary using values from specified keys.
Parameters:
x (dict, default dict()): A dictionary from which the key is generated.
on (list, default []): A list of keys from the dictionary that should be used to generate the key.

Returns:
    str: The generated string key.
Source code in alto2txt2fixture/utils.py
```python
def get_key(x: dict = dict(), on: list = []) -> str:
    """
    Get a string key from a dictionary using values from specified keys.

    Args:
        x: A dictionary from which the key is generated.
        on: A list of keys from the dictionary that should be used to
            generate the key.

    Returns:
        The generated string key.
    """
    return f"{'-'.join([str(x['fields'][y]) for y in on])}"
```
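A quick illustration with an invented fixture-style dict:

```python
# Hypothetical input: the key is built from the listed `fields` values.
from alto2txt2fixture.utils import get_key

fixture = {
    "pk": 5,
    "fields": {"publication_code": "0001234", "issue_code": "0001234-18700101"},
}
print(get_key(fixture, on=["publication_code", "issue_code"]))
# -> '0001234-0001234-18700101'
```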
Provides the path to any given lockfile, which controls whether any existing files should be overwritten or not.
Parameters:
collection (str, required): Collection folder name
kind (NewspaperElements, required): Either newspaper, issue or item
dic (dict, required): A dictionary with required information for either kind passed

Returns:
    Path: Path to the resulting lockfile
Source code in alto2txt2fixture/utils.py
def get_lockfile(collection: str, kind: NewspaperElements, dic: dict) -> Path:\n\"\"\"\n Provides the path to any given lockfile, which controls whether any\n existing files should be overwritten or not.\n Args:\n collection: Collection folder name\n kind: Either `newspaper` or `issue` or `item`\n dic: A dictionary with required information for either `kind` passed\n Returns:\n Path to the resulting lockfile\n \"\"\"\np: Path\nbase = Path(f\"cache-lockfiles/{collection}\")\nif kind == \"newspaper\":\np = base / f\"newspapers/{dic['publication_code']}\"\nelif kind == \"issue\":\np = base / f\"issues/{dic['publication__publication_code']}/{dic['issue_code']}\"\nelif kind == \"item\":\ntry:\nif dic.get(\"issue_code\"):\np = base / f\"items/{dic['issue_code']}/{dic['item_code']}\"\nelif dic.get(\"issue__issue_identifier\"):\np = base / f\"items/{dic['issue__issue_identifier']}/{dic['item_code']}\"\nexcept KeyError:\nerror(\"An unknown error occurred (in get_lockfile)\")\nelse:\np = base / \"lockfile\"\np.parent.mkdir(parents=True, exist_ok=True) if settings.WRITE_LOCKFILES else None\nreturn p\n
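For instance (with invented publication and issue codes), requesting newspaper and issue lockfiles resolves to paths under cache-lockfiles/:

```python
# Sketch with hypothetical codes; parent directories are only created when
# `settings.WRITE_LOCKFILES` is enabled.
from alto2txt2fixture.utils import get_lockfile

newspaper_lock = get_lockfile("hmd", "newspaper", {"publication_code": "0001234"})
print(newspaper_lock)  # cache-lockfiles/hmd/newspapers/0001234

issue_lock = get_lockfile(
    "hmd",
    "issue",
    {"publication__publication_code": "0001234", "issue_code": "0001234-18700101"},
)
print(issue_lock)  # cache-lockfiles/hmd/issues/0001234/0001234-18700101
```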
Return datetime.now() as either a string or datetime object.
Parameters:
as_str (bool, default False): Whether to return the current time as a str or not.

Returns:
    datetime.datetime | str: datetime.now() in the pytz.UTC time zone as a string if as_str, else as a datetime.datetime object.
Source code in alto2txt2fixture/utils.py
```python
def get_now(as_str: bool = False) -> datetime.datetime | str:
    """
    Return `datetime.now()` as either a string or `datetime` object.

    Args:
        as_str: Whether to return `now` `time` as a `str` or not, default: `False`

    Returns:
        `datetime.now()` in `pytz.UTC` time zone as a string if `as_str`, else
        as a `datetime.datetime` object.
    """
    now = datetime.datetime.now(tz=pytz.UTC)
    if as_str:
        return str(now)
    else:
        assert isinstance(now, datetime.datetime)
        return now
```
Converts an input value into a Path object if it's not already one.
Parameters:
p (str | Path, required): The input value, which can be a string or a Path object.

Returns:
    Path: The input value as a Path object.
Source code in alto2txt2fixture/utils.py
```python
def get_path_from(p: str | Path) -> Path:
    """
    Converts an input value into a Path object if it's not already one.

    Args:
        p: The input value, which can be a string or a Path object.

    Returns:
        The input value as a Path object.
    """
    if isinstance(p, str):
        p = Path(p)
    if not isinstance(p, Path):
        raise RuntimeError(f"Unable to handle type: {type(p)}")
    return p
```
Returns a nice string for any given file size.

Parameters:

p (str | Path, required): Path to read the size from
raw (bool, default False): Whether to return the file size as a total number of bytes or a human-readable MB/GB amount.

Returns:
    str | float: A str followed by MB or GB for the size if not raw, otherwise a float.
Source code in alto2txt2fixture/utils.py
```python
def get_size_from_path(p: str | Path, raw: bool = False) -> str | float:
    """
    Returns a nice string for any given file size.

    Args:
        p: Path to read the size from
        raw: Whether to return the file size as total number of bytes or
            a human-readable MB/GB amount

    Returns:
        Return `str` followed by `MB` or `GB` for size if not `raw` otherwise `float`.
    """
    p = get_path_from(p)
    bytes = p.stat().st_size
    if raw:
        return bytes
    rel_size: float | int | str = round(bytes / 1000 / 1000 / 1000, 1)
    assert not isinstance(rel_size, str)
    if rel_size < 0.5:
        rel_size = round(bytes / 1000 / 1000, 1)
        rel_size = f"{rel_size}MB"
    else:
        rel_size = f"{rel_size}GB"
    return rel_size
```
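A short sketch (the file path is hypothetical) showing both the human-readable and raw forms:

```python
# Hypothetical path: any existing file will do.
from alto2txt2fixture.utils import get_size_from_path

print(get_size_from_path("./cache/hmd/0001234/0001234.json"))            # e.g. '0.1MB'
print(get_size_from_path("./cache/hmd/0001234/0001234.json", raw=True))  # size in bytes
```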
Return an ordered glob, filtering out any pesky, unwanted .DS_Store files from macOS.

Parameters:

p (str, required): Path to a directory to filter

Returns:
    list: Sorted list of files contained in the provided path without the ones whose names start with a .
Source code in alto2txt2fixture/utils.py
```python
def glob_filter(p: str) -> list:
    """
    Return ordered glob, filtered out any pesky, unwanted .DS_Store from macOS.

    Args:
        p: Path to a directory to filter

    Returns:
        Sorted list of files contained in the provided path without the ones
        whose names start with a `.`
    """
    return sorted([x for x in get_path_from(p).glob("*") if not x.name.startswith(".")])
```
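For example (the directory path is hypothetical), listing a cache folder without macOS artefacts:

```python
# Hypothetical directory: returns a sorted list of entries, skipping dot-files
# such as .DS_Store.
from alto2txt2fixture.utils import glob_filter

for path in glob_filter("./cache/hmd"):
    print(path.name)
```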
Return an OrderedDict of replacement 0-padded file names from path.
Parameters:
path (PathLike, required): PathLike to source files to rename.
output_path (PathLike | None, default None): PathLike to save renamed files to.
glob_regex_str (str, default '*'): str to match files to rename within path.
padding (int | None, default 0): How many digits (0s) to pad match_int with.
match_int_regex (str, default PADDING_0_REGEX_DEFAULT): Regular expression for matching numbers in s to pad. Only rename parts of Path(file_path).name; else replace across Path(file_path).parents as well.
index (int, default -1): Which index of number in s to pad with 0s. Like numbering a list, 0 indicates the first match and -1 indicates the last match.

Example
```pycon
>>> tmp_path: Path = getfixture('tmp_path')
>>> for i in range(4):
... (tmp_path / f'test_file-{i}.txt').touch(exist_ok=True)
>>> pprint(sorted(tmp_path.iterdir()))
[...Path('...test_file-0.txt'),
 ...Path('...test_file-1.txt'),
 ...Path('...test_file-2.txt'),
 ...Path('...test_file-3.txt')]
>>> pprint(glob_path_rename_by_0_padding(tmp_path))
{...Path('...test_file-0.txt'): ...Path('...test_file-00.txt'),
 ...Path('...test_file-1.txt'): ...Path('...test_file-01.txt'),
 ...Path('...test_file-2.txt'): ...Path('...test_file-02.txt'),
 ...Path('...test_file-3.txt'): ...Path('...test_file-03.txt')}
```
Source code in alto2txt2fixture/utils.py
````python
def glob_path_rename_by_0_padding(
    path: PathLike,
    output_path: PathLike | None = None,
    glob_regex_str: str = "*",
    padding: int | None = 0,
    match_int_regex: str = PADDING_0_REGEX_DEFAULT,
    index: int = -1,
) -> dict[PathLike, PathLike]:
    """Return an `OrderedDict` of replacement 0-padded file names from `path`.

    Params:
        path:
            `PathLike` to source files to rename.
        output_path:
            `PathLike` to save renamed files to.
        glob_regex_str:
            `str` to match files to rename within `path`.
        padding:
            How many digits (0s) to pad `match_int` with.
        match_int_regex:
            Regular expression for matching numbers in `s` to pad.
            Only rename parts of `Path(file_path).name`; else
            replace across `Path(file_path).parents` as well.
        index:
            Which index of number in `s` to pad with 0s.
            Like numbering a `list`, 0 indicates the first match
            and -1 indicates the last match.

    Example:
        ```pycon
        >>> tmp_path: Path = getfixture('tmp_path')
        >>> for i in range(4):
        ...     (tmp_path / f'test_file-{i}.txt').touch(exist_ok=True)
        >>> pprint(sorted(tmp_path.iterdir()))
        [...Path('...test_file-0.txt'),
         ...Path('...test_file-1.txt'),
         ...Path('...test_file-2.txt'),
         ...Path('...test_file-3.txt')]
        >>> pprint(glob_path_rename_by_0_padding(tmp_path))
        {...Path('...test_file-0.txt'): ...Path('...test_file-00.txt'),
         ...Path('...test_file-1.txt'): ...Path('...test_file-01.txt'),
         ...Path('...test_file-2.txt'): ...Path('...test_file-02.txt'),
         ...Path('...test_file-3.txt'): ...Path('...test_file-03.txt')}
        ```
    """
    try:
        assert Path(path).exists()
    except AssertionError:
        raise ValueError(f'path does not exist: "{Path(path)}"')
    paths_tuple: tuple[PathLike, ...] = path_globs_to_tuple(path, glob_regex_str)
    try:
        assert paths_tuple
    except AssertionError:
        raise FileNotFoundError(
            f"No files found matching 'glob_regex_str': "
            f"'{glob_regex_str}' in: '{path}'"
        )
    paths_to_index: tuple[tuple[str, int], ...] = tuple(
        int_from_str(str(matched_path), index=index, regex=match_int_regex)
        for matched_path in paths_tuple
    )
    max_index: int = max(index[1] for index in paths_to_index)
    max_index_digits: int = len(str(max_index))
    if not padding or padding < max_index_digits:
        padding = max_index_digits + 1
    new_names_dict: dict[PathLike, PathLike] = {}
    if output_path:
        if not Path(output_path).is_absolute():
            output_path = Path(path) / output_path
        logger.debug(f"Specified '{output_path}' for saving file copies")
    for i, old_path in enumerate(paths_tuple):
        match_str, match_int = paths_to_index[i]
        new_names_dict[old_path] = rename_by_0_padding(
            old_path, match_str=str(match_str), match_int=match_int, padding=padding
        )
        if output_path:
            new_names_dict[old_path] = (
                Path(output_path) / Path(new_names_dict[old_path]).name
            )
    return new_names_dict
````
````python
def int_from_str(
    s: str,
    index: int = -1,
    regex: str = PADDING_0_REGEX_DEFAULT,
) -> tuple[str, int]:
    """Return matched (or None) `regex` from `s` by index `index`.

    Params:
        s:
            `str` to match and via `regex`.
        index:
            Which index of number in `s` to pad with 0s.
            Like numbering a `list`, 0 indicates the first match
            and -1 indicates the last match.
        regex:
            Regular expression for matching numbers in `s` to pad.

    Example:
        ```pycon
        >>> int_from_str('a/path/to/fixture-03-05.txt')
        ('05', 5)
        >>> int_from_str('a/path/to/fixture-03-05.txt', index=0)
        ('03', 3)
        ```
    """
    matches: list[str] = [match for match in findall(regex, s) if match]
    match_str: str = matches[index]
    return match_str, int(match_str)
````
```python
list_json_files(
    p: str | Path,
    drill: bool = False,
    exclude_names: list = [],
    include_names: list = [],
) -> Generator[Path, None, None] | list[Path]
```
List json files under the path specified in p.
Parameters:
| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `p` | `str \| Path` | The path to search for `json` files | required |
| `drill` | `bool` | A flag indicating whether to drill down the subdirectories or not. Default is `False` | `False` |
| `exclude_names` | `list` | A list of file names to exclude from the search result. Default is an empty list | `[]` |
| `include_names` | `list` | A list of file names to include in the search result. If provided, the `exclude_names` argument will be ignored. Default is an empty list | `[]` |

Returns:

| Type | Description |
| ---- | ----------- |
| `Generator[Path, None, None] \| list[Path]` | A list of `Path` objects pointing to the found `json` files |
Source code in alto2txt2fixture/utils.py
```python
def list_json_files(
    p: str | Path,
    drill: bool = False,
    exclude_names: list = [],
    include_names: list = [],
) -> Generator[Path, None, None] | list[Path]:
    """
    List `json` files under the path specified in ``p``.

    Args:
        p: The path to search for `json` files
        drill: A flag indicating whether to drill down the subdirectories
            or not. Default is ``False``
        exclude_names: A list of file names to exclude from the search
            result. Default is an empty list
        include_names: A list of file names to include in search result.
            If provided, the ``exclude_names`` argument will be ignored.
            Default is an empty list

    Returns:
        A list of `Path` objects pointing to the found `json` files
    """
    q: str = "**/*.json" if drill else "*.json"
    files = get_path_from(p).glob(q)
    if exclude_names:
        files = list({x for x in files if x.name not in exclude_names})
    elif include_names:
        files = list({x for x in files if x.name in include_names})
    return sorted(files)
```
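A usage sketch (the directory layout and file names are invented for illustration):

```pycon
>>> from alto2txt2fixture.utils import list_json_files
>>> list_json_files('./output', drill=True, exclude_names=['test-000001.json'])
[PosixPath('output/Issue-000001.json'), PosixPath('output/Newspaper-000001.json')]
```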
Whether the program should crash if there is a json decode error, default: False
False
Returns:

| Type | Description |
| ---- | ----------- |
| `dict \| list` | The decoded `json` contents from the path, but an empty dictionary if the file cannot be decoded and `crash` is set to `False` |
Source code in alto2txt2fixture/utils.py
```python
def load_json(p: str | Path, crash: bool = False) -> dict | list:
    """
    Easier access to reading `json` files.

    Args:
        p: Path to read `json` from
        crash: Whether the program should crash if there is a `json` decode
            error, default: ``False``

    Returns:
        The decoded `json` contents from the path, but an empty dictionary
        if the file cannot be decoded and ``crash`` is set to ``False``
    """
    p = get_path_from(p)
    try:
        return json.loads(p.read_text())
    except json.JSONDecodeError:
        msg = f"Error: {p.read_text()}"
        error(msg, crash=crash)
    return {}
```
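A usage sketch (the path is illustrative):

```pycon
>>> from alto2txt2fixture.utils import load_json
>>> fixture = load_json('output/test-000001.json')              # {} is returned if decoding fails
>>> fixture = load_json('output/test-000001.json', crash=True)  # stop the run on a decode error instead
```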
Load multiple json files and return a list of their content.
Parameters:
| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `p` | `str \| Path` | The path to search for `json` files | required |
| `drill` | `bool` | A flag indicating whether to drill down the subdirectories or not. Default is `False` | `False` |
| `filter_na` | `bool` | A flag indicating whether to filter out content that is `None`. Default is `True`. | `True` |
| `crash` | `bool` | A flag indicating whether to raise an exception when an error occurs while loading a `json` file. Default is `False`. | `False` |

Returns:

| Type | Description |
| ---- | ----------- |
| `list` | A list of the content of the loaded `json` files. |
Source code in alto2txt2fixture/utils.py
```python
def load_multiple_json(
    p: str | Path,
    drill: bool = False,
    filter_na: bool = True,
    crash: bool = False,
) -> list:
    """
    Load multiple `json` files and return a list of their content.

    Args:
        p: The path to search for `json` files
        drill: A flag indicating whether to drill down the subdirectories
            or not. Default is `False`
        filter_na: A flag indicating whether to filter out the content that
            is `None`. Default is `True`.
        crash: A flag indicating whether to raise an exception when an
            error occurs while loading a `json` file. Default is `False`.

    Returns:
        A `list` of the content of the loaded `json` files.
    """
    files = list_json_files(p, drill=drill)
    content = [load_json(x, crash=crash) for x in files]
    return [x for x in content if x] if filter_na else content
```
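A usage sketch (the folder is illustrative); each successfully decoded file contributes one entry to the returned list:

```pycon
>>> from alto2txt2fixture.utils import load_multiple_json
>>> contents = load_multiple_json('./output', drill=True)
>>> all(isinstance(entry, (dict, list)) for entry in contents)
True
```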
Writes a '.' to a lockfile, after making sure the parent directory exists.
Parameters:
| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `lockfile` | `Path` | The path to the lock file to be created | required |

Returns:

| Type | Description |
| ---- | ----------- |
| `None` | None |
Source code in alto2txt2fixture/utils.py
```python
def lock(lockfile: Path) -> None:
    """
    Writes a '.' to a lockfile, after making sure the parent directory exists.

    Args:
        lockfile: The path to the lock file to be created

    Returns:
        None
    """
    lockfile.parent.mkdir(parents=True, exist_ok=True)
    lockfile.write_text("")
    return
```
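For example, marking a cache folder as processed (the path is illustrative):

```pycon
>>> from pathlib import Path
>>> from alto2txt2fixture.utils import lock
>>> lock(Path('cache/0002088/.lockfile'))   # creates parent folders and an empty lock file
>>> Path('cache/0002088/.lockfile').exists()
True
```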
````python
def rename_by_0_padding(
    file_path: PathLike,
    match_str: str | None = None,
    match_int: int | None = None,
    padding: int = FILE_NAME_0_PADDING_DEFAULT,
    replace_count: int = 1,
    exclude_parents: bool = True,
    reverse_int_match: bool = False,
) -> Path:
    """Return `file_path` with `0` `padding` `Path` change.

    Params:
        file_path:
            `PathLike` to rename.
        match_str:
            `str` to match and replace with padded `match_int`
        match_int:
            `int` to pad and replace `match_str`
        padding:
            How many digits (0s) to pad `match_int` with.
        exclude_parents:
            Only rename parts of `Path(file_path).name`; else
            replace across `Path(file_path).parents` as well.
        reverse_int_match:
            Whether to match from the end of the `file_path`.

    Example:
        ```pycon
        >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',
        ...                     match_str='05', match_int=5)
        <BLANKLINE>
        ...Path('a/path/to/3/fixture-03-000005.txt')...
        >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',
        ...                     match_str='03')
        <BLANKLINE>
        ...Path('a/path/to/3/fixture-000003-05.txt')...
        >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',
        ...                     match_str='05', padding=0)
        <BLANKLINE>
        ...Path('a/path/to/3/fixture-03-5.txt')...
        >>> rename_by_0_padding('a/path/to/3/fixture-03-05.txt',
        ...                     match_int=3)
        <BLANKLINE>
        ...Path('a/path/to/3/fixture-0000003-05.txt')...
        >>> rename_by_0_padding('a/path/to/3/f-03-05-0003.txt',
        ...                     match_int=3, padding=2,
        ...                     exclude_parents=False)
        <BLANKLINE>
        ...Path('a/path/to/03/f-03-05-0003.txt')...
        >>> rename_by_0_padding('a/path/to/3/f-03-05-0003.txt',
        ...                     match_int=3, padding=2,
        ...                     exclude_parents=False,
        ...                     replace_count=3, )
        <BLANKLINE>
        ...Path('a/path/to/03/f-003-05-00003.txt')...
        ```
    """
    if match_int is None and match_str in (None, ""):
        raise ValueError(f"At least `match_int` or `match_str` required; both None.")
    elif match_str and not match_int:
        match_int = int(match_str)
    elif match_int is not None and not match_str:
        assert str(match_int) in str(file_path)
        match_str = int_from_str(
            str(file_path),
            index=-1 if reverse_int_match else 0,
        )[0]
    assert match_int is not None and match_str is not None
    if exclude_parents:
        return Path(file_path).parent / Path(file_path).name.replace(
            match_str, str(match_int).zfill(padding), replace_count
        )
    else:
        return Path(
            str(file_path).replace(
                match_str, str(match_int).zfill(padding), replace_count
            )
        )
````
```python
save_fixture(
    generator: Sequence | Generator = [],
    prefix: str = "",
    output_path: PathLike | str = settings.OUTPUT,
    max_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,
    add_created: bool = True,
    json_indent: int = JSON_INDENT,
    file_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,
) -> None
```
Saves fixtures generated by a generator to separate JSON files.
This function takes a generator and saves the generated fixtures to separate JSON files. The fixtures are saved in batches, where each batch is determined by the max_elements_per_file parameter.
Parameters:
| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `generator` | `Sequence \| Generator` | A generator that yields the fixtures to be saved. | `[]` |
| `prefix` | `str` | A string prefix to be added to the file names of the saved fixtures. | `''` |
| `output_path` | `PathLike \| str` | Path to folder fixtures are saved to | `settings.OUTPUT` |
| `max_elements_per_file` | `int` | Maximum `JSON` records saved in each file | `settings.MAX_ELEMENTS_PER_FILE` |
| `add_created` | `bool` | Whether to add `created_at` and `updated_at` timestamps | `True` |
| `json_indent` | `int` | Number of indent spaces per line in saved `JSON` | `JSON_INDENT` |
| `file_name_0_padding` | `int` | Zeros to prefix the number of each fixture file name. | `FILE_NAME_0_PADDING_DEFAULT` |

Returns:

| Type | Description |
| ---- | ----------- |
| `None` | This function saves the fixtures to files but does not return any value. |
````python
def save_fixture(
    generator: Sequence | Generator = [],
    prefix: str = "",
    output_path: PathLike | str = settings.OUTPUT,
    max_elements_per_file: int = settings.MAX_ELEMENTS_PER_FILE,
    add_created: bool = True,
    json_indent: int = JSON_INDENT,
    file_name_0_padding: int = FILE_NAME_0_PADDING_DEFAULT,
) -> None:
    """Saves fixtures generated by a generator to separate JSON files.

    This function takes a generator and saves the generated fixtures to
    separate JSON files. The fixtures are saved in batches, where each batch
    is determined by the ``max_elements_per_file`` parameter.

    Args:
        generator:
            A generator that yields the fixtures to be saved.
        prefix:
            A string prefix to be added to the file names of the
            saved fixtures.
        output_path:
            Path to folder fixtures are saved to
        max_elements_per_file:
            Maximum `JSON` records saved in each file
        add_created:
            Whether to add `created_at` and `updated_at` `timestamps`
        json_indent:
            Number of indent spaces per line in saved `JSON`
        file_name_0_padding:
            Zeros to prefix the number of each fixture file name.

    Returns:
        This function saves the fixtures to files but does not return
        any value.

    Example:
        ```pycon
        >>> tmp_path: Path = getfixture('tmp_path')
        >>> save_fixture(NEWSPAPER_COLLECTION_METADATA,
        ...              prefix='test', output_path=tmp_path)
        >>> imported_fixture = load_json(tmp_path / 'test-000001.json')
        >>> imported_fixture[1]['pk']
        2
        >>> imported_fixture[1]['fields'][DATA_PROVIDER_INDEX]
        'hmd'
        >>> 'created_at' in imported_fixture[1]['fields']
        True
        ```
    """
    internal_counter = 1
    counter = 1
    lst = []
    file_name: str
    Path(output_path).mkdir(parents=True, exist_ok=True)
    for item in generator:
        lst.append(item)
        internal_counter += 1
        if internal_counter > max_elements_per_file:
            file_name = f"{prefix}-{str(counter).zfill(file_name_0_padding)}.json"
            write_json(
                p=Path(f"{output_path}/{file_name}"),
                o=lst,
                add_created=add_created,
                json_indent=json_indent,
            )
            # Save up some memory
            del lst
            gc.collect()
            # Re-instantiate
            lst = []
            internal_counter = 1
            counter += 1
    else:
        # Write any remaining items once the generator is exhausted
        file_name = f"{prefix}-{str(counter).zfill(file_name_0_padding)}.json"
        write_json(
            p=Path(f"{output_path}/{file_name}"),
            o=lst,
            add_created=add_created,
            json_indent=json_indent,
        )
    return
````
````python
def truncate_path_str(
    path: PathLike,
    max_length: int = MAX_TRUNCATE_PATH_STR_LEN,
    folder_filler_str: str = INTERMEDIATE_PATH_TRUNCATION_STR,
    head_parts: int = TRUNC_HEADS_PATH_DEFAULT,
    tail_parts: int = TRUNC_TAILS_PATH_DEFAULT,
    path_sep: str = sep,
    _force_type: Type[Path] | Type[PureWindowsPath] = Path,
) -> str:
    """If `len(text) > max_length` return `text` followed by `trail_str`.

    Args:
        path:
            `PathLike` object to truncate
        max_length:
            maximum length of `path` to allow, anything belond truncated
        folder_filler_str:
            what to fill intermediate path names with
        head_parts:
            how many parts of `path` from the root to keep.
            These must be `int` >= 0
        tail_parts:
            how many parts from the `path` tail the root to keep.
            These must be `int` >= 0
        path_sep:
            what `str` to replace `path` parts with if over `max_length`

    Returns:
        `text` truncated to `max_length` (if longer than `max_length`),
        with with `folder_filler_str` for intermediate folder names

    Note:
        For errors running on windows see:
        [#56](https://github.com/Living-with-machines/alto2txt2fixture/issues/56)

    Example:
        ```pycon
        >>> love_shadows: Path = (
        ...     Path('Standing') / 'in' / 'the' / 'shadows'/ 'of' / 'love.')
        >>> truncate_path_str(love_shadows)
        'Standing...love.'
        >>> truncate_path_str(love_shadows, max_length=100)
        'Standing...in...the...shadows...of...love.'
        >>> truncate_path_str(love_shadows, folder_filler_str="*")
        'Standing...*...*...*...*...love.'
        >>> root_love_shadows: Path = Path(sep) / love_shadows
        >>> truncate_path_str(root_love_shadows, folder_filler_str="*")
        <BLANKLINE>
        ...
        '...Standing...*...*...*...*...love.'
        >>> if is_platform_win:
        ...     pytest.skip('fails on certain Windows root paths: issue #56')
        >>> truncate_path_str(root_love_shadows,
        ...                   folder_filler_str="*", tail_parts=3)
        <BLANKLINE>
        ...
        '...Standing...*...*...shadows...of...love.'...
        ```
    """
    path = _force_type(normpath(path))
    if len(str(path)) > max_length:
        try:
            assert not (head_parts < 0 or tail_parts < 0)
        except AssertionError:
            logger.error(
                f"Both index params for `truncate_path_str` must be >=0: "
                f"(head_parts={head_parts}, tail_parts={tail_parts})"
            )
            return str(path)
        original_path_parts: tuple[str, ...] = path.parts
        head_index_fix: int = 0
        if path.is_absolute() or path.drive:
            head_index_fix += 1
            for part in original_path_parts[head_parts + head_index_fix :]:
                if not part:
                    head_index_fix += 1
                else:
                    break
            logger.debug(
                f"Adding {head_index_fix} to `head_parts`: {head_parts} "
                f"to truncate: '{path}'"
            )
        head_parts += head_index_fix
        try:
            assert head_parts + tail_parts < len(str(original_path_parts))
        except AssertionError:
            logger.error(
                f"Returning untruncated. Params "
                f"(head_parts={head_parts}, tail_parts={tail_parts}) "
                f"not valid to truncate: '{path}'"
            )
            return str(path)
        tail_index: int = len(original_path_parts) - tail_parts
        replaced_path_parts: tuple[str, ...] = tuple(
            part if (i < head_parts or i >= tail_index) else folder_filler_str
            for i, part in enumerate(original_path_parts)
        )
        replaced_start_str: str = "".join(replaced_path_parts[:head_parts])
        replaced_end_str: str = path_sep.join(
            path for path in replaced_path_parts[head_parts:]
        )
        return path_sep.join((replaced_start_str, replaced_end_str))
    else:
        return str(path)
````
````python
def valid_compression_files(files: Sequence[PathLike]) -> list[PathLike]:
    """Return a `tuple` of valid compression paths in `files`.

    Args:
        files:
            `Sequence` of files to filter compression types from.

    Returns:
        A list of files that could be decompressed.

    Example:
        ```pycon
        >>> valid_compression_files([
        ...     'cat.tar.bz2', 'dog.tar.bz3', 'fish.tgz', 'bird.zip',
        ...     'giraffe.txt', 'frog'
        ... ])
        ['cat.tar.bz2', 'fish.tgz', 'bird.zip']
        ```
    """
    return [
        file
        for file in files
        if "".join(Path(file).suffixes) in VALID_COMPRESSION_FORMATS
    ]
````
Easier access to writing json files. Checks whether parent exists.
Parameters:
| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `p` | `str \| Path` | Path to write `json` to | required |
| `o` | `dict` | Object to write to `json` file | required |
| `add_created` | `bool` | If set to `True` will add `created_at` and `updated_at` to the dictionary's fields. If `created_at` and `updated_at` already exist in the fields, they will be forcefully updated. | `True` |
| `json_indent` | `int` | What indentation format to write out the `JSON` file in | `JSON_INDENT` |
````python
def write_json(
    p: str | Path, o: dict, add_created: bool = True, json_indent: int = JSON_INDENT
) -> None:
    """
    Easier access to writing `json` files. Checks whether parent exists.

    Args:
        p: Path to write `json` to
        o: Object to write to `json` file
        add_created:
            If set to True will add `created_at` and `updated_at`
            to the dictionary's fields. If `created_at` and `updated_at`
            already exist in the fields, they will be forcefully updated.
        json_indent:
            What indetation format to write out `JSON` file in

    Returns:
        None

    Example:
        ```pycon
        >>> tmp_path: Path = getfixture('tmp_path')
        >>> path: Path = tmp_path / 'test-write-json-example.json'
        >>> write_json(p=path,
        ...            o=NEWSPAPER_COLLECTION_METADATA,
        ...            add_created=True)
        >>> imported_fixture = load_json(path)
        >>> imported_fixture[1]['pk']
        2
        >>> imported_fixture[1]['fields'][DATA_PROVIDER_INDEX]
        'hmd'
        ```
    """
    p = get_path_from(p)
    if not (isinstance(o, dict) or isinstance(o, list)):
        raise RuntimeError(f"Unable to handle data of type: {type(o)}")

    def _append_created_fields(o: dict):
        """Add `created_at` and `updated_at` fields to a `dict` with `FixtureDict` values."""
        return dict(
            **{k: v for k, v in o.items() if not k == "fields"},
            fields=dict(
                **{
                    k: v
                    for k, v in o["fields"].items()
                    if not k == "created_at" and not k == "updated_at"
                },
                **{"created_at": NOW_str, "updated_at": NOW_str},
            ),
        )

    try:
        if add_created and isinstance(o, dict):
            o = _append_created_fields(o)
        elif add_created and isinstance(o, list):
            o = [_append_created_fields(x) for x in o]
    except KeyError:
        error("An unknown error occurred (in write_json)")
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(json.dumps(o, indent=json_indent))
    return
````
The installation process should be fairly easy to take care of, using poetry:
+
$ poetry install
+
+
However, this is only the first step in the process. As the script works through the alto2txt collections, you will either need to choose the slower option — mounting them to your computer (using blobfuse) — or the faster option — downloading the required zip files from the Azure storage to your local hard drive. In the two following sections, both of those options are described.
+
Connecting alto2txt to the program
+
Downloading local copies of alto2txt on your computer
+
+
This option will take up a lot of hard drive space
+
As of the time of writing, downloading all of alto2txt’s metadata takes up about 185GB on your local drive.
+
+
You do not have to download all of the collections or all of the zip files for each collection, as long as you are aware that the resulting fixtures will be limited in scope.
+
Step 1: Log in to Azure using Microsoft Azure Storage Explorer
+
Microsoft Azure Storage Explorer (MASE) is a great and free tool for downloading content off Azure. Your first step is to download and install this product on your local computer.
+
Once you have opened MASE, you will need to sign into the appropriate Azure account.
+
Step 2: Download the alto2txt blob container to your hard drive
+
On your left-hand side, you should see a menu where you can navigate to the correct “blob container”: Living with Machines > Storage Accounts > alto2txt > Blob Containers:
+
+
You will want to replicate the same structure as the Blob Container itself in a folder on your hard drive:
+
+
Once you have the structure set up, you are ready to download all of the files needed. For each of the blob containers, make sure that you download the metadata directory only onto your computer:
+
+
Select all of the files and press the download button:
+
+
Make sure you save all the zip files inside the correct local folder:
+
+
The “Activities” bar will now show you the progress and speed:
+
+
Mounting alto2txt on your computer
+
+
This option will only work on a Linux or UNIX computer
+
If you have a Mac, your only option is downloading local copies, as described above.
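+

For reference, mounting is typically done with blobfuse. The following is a minimal sketch only: the mount folder, the temporary cache path and the `fuse_connection.cfg` file (which would hold the storage account credentials) are illustrative and not part of this repository.

```console
$ mkdir -p ~/alto2txt-mount /tmp/blobfusetmp
$ blobfuse ~/alto2txt-mount \
    --tmp-path=/tmp/blobfusetmp \
    --config-file=fuse_connection.cfg \
    -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
```

Once mounted, the metadata folders can be passed to the conversion via the `--mountpoint` option described earlier.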
If you choose other settings when you run the program, your output directory may look different from the information on this page.
+
+
Reports
+
Reports are automatically generated with a unique hash as the overarching folder structure. Inside the reports directory, you’ll find a JSON file for each alto2txt directory (organised by NLP identifier).
+
The report structure, thus, looks like this:
+
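As an illustrative sketch (the hash and the NLP-named JSON files are examples only), the layout looks roughly like this:

```
reports/
└── <unique-hash>/
    ├── 0002088.json
    ├── 0002194.json
    └── ...
```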
+
The JSON file has some good troubleshooting information. You’ll find that the contents are structured as a Python dictionary (or JavaScript Object). Here is an example:
+
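As an illustrative sketch of such a report (all values below are invented; the keys match the table that follows):

```json
{
  "path": "./news-datasets/0002088_metadata.zip",
  "bytes": 491905,
  "size": "0.5MB",
  "contents": 133,
  "start": "2023-02-21 14:33:26.812419",
  "newspaper_paths": ["./cache/newspapers/0002088.json"],
  "publication_codes": ["0002088"],
  "issue_paths": ["./cache/issues/0002088/1850-01-05.json"],
  "item_paths": ["./cache/items/0002088/1850-01-05/0002088-art0001.json"],
  "end": "2023-02-21 14:33:42.002117",
  "seconds": 15,
  "microseconds": 189698
}
```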
+
Here is an explanation of each of the keys in the dictionary:
+

| Key | Explanation | Data type |
| --- | ----------- | --------- |
| `path` | The input path for the zip file that is being converted. | string |
| `bytes` | The size of the input zip file represented in bytes. | integer |
| `size` | The size of the input zip file represented in a human-readable string. | string |
| `contents` | #TODO #3 | integer |
| `start` | Date and time when processing started (see also `end` below). | datestring |
| `newspaper_paths` | #TODO #3 | list (string) |
| `publication_codes` | A list of the NLPs that are contained in the input zip file. | list (string) |
| `issue_paths` | A list of all the issue paths that are contained in the cache directory. | list (string) |
| `item_paths` | A list of all the item paths that are contained in the cache directory. | list (string) |
| `end` | Date and time when processing ended (see also `start` above). | datestring |
| `seconds` | Seconds that the script spent interpreting the zip file (should be added to the `microseconds` below). | integer |
| `microseconds` | Microseconds that the script spent interpreting the zip file (should be added to the `seconds` above). | integer |
+
Fixtures
+
The most important output of the script is contained in the fixtures directory. This directory contains JSON files for all the different tables in the corresponding Django metadata database (i.e. DataProvider, Digitisation, Ingest, Issue, Newspaper, and Item). The numbering at the end of each file indicates the order of the files, as they are divided into a maximum of 2e6 (2,000,000) elements per file*:
+
+
Each JSON file contains a Python-like list (JavaScript Array) of dictionaries (JavaScript Objects), which have a primary key (pk), the related database model (in the example below the Django newspapers app’s newspaper table), and a nested dictionary/Object which contains all the values for the database’s table entry:
+
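As an illustrative sketch of one such entry (the field names and values inside `fields` are invented for the example; the `pk`, `model` and nested `fields` structure is the part to note):

```json
[
  {
    "pk": 1,
    "model": "newspapers.newspaper",
    "fields": {
      "publication_code": "0002088",
      "title": "An Example Newspaper Title",
      "created_at": "2023-02-21T14:33:26",
      "updated_at": "2023-02-21T14:33:26"
    }
  }
]
```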
+
+
* The maximum number of elements per file can be adjusted via the MAX_ELEMENTS_PER_FILE value of the settings object in settings.py.
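+

A minimal sketch of that adjustment (the exact shape of the settings object is assumed, not shown on this page):

```python
# in alto2txt2fixture/settings.py (sketch; the surrounding definition of `settings` is assumed)
settings.MAX_ELEMENTS_PER_FILE = int(1e6)  # down from the default of 2e6 for smaller fixture files
```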