SHAWI Transcription Repository

This git repository hosts the transcription data of the project The Shawi-type Arabic dialects (FWF P 33574).

PI: Stephan Procházka (University of Vienna)
National Cooperation Partner: Charly Mörth (Austrian Academy of Sciences)

Status

THIS IS PRELIMINARY DATA AND COPYRIGHTED MATERIAL!

If you want to use any material in this repository please contact PI Stephan Procházka (University of Vienna).

This will change at the end of the project.

Directory Structure

| Directory | Content | Remarks |
| --- | --- | --- |
| `001_src` | Original sources | Source documents (e.g. raw transcriptions) |
| `080_scripts_generic` | Conversion scripts | Mostly the ELAN2TEI conversion script (implemented in Python), which generates the initial TEI data prior to tokenization from the ELAN transcription documents in `122_elan` |
| `082_scripts_xsl` | XSLT scripts | XSLT scripts |
| `103_tei_w` | TEI-XML with tokens | This is where ELAN2TEI puts its output. Re-running ELAN2TEI overwrites all content in this directory, so do not make manual changes here; copy the file to `010_manannot` first. |
| `010_manannot` | Manually annotated TEI-XML | Tokenized TEI documents from `103_tei_w` which are manually annotated |
| `802_tei_odd` | TEI customization (ODD) | The source of truth for the SHAWI Schema and the HTML documentation generated from it |
| `130_vert_plain` | NoSketch Engine verticals | NoSketch Engine text verticals |
| `803_RNG-schematron` | Schemas | Derived from the ODD in `802_tei_odd` |
| `804_xsd` | Schemas | Derived from the ODD in `802_tei_odd` |
| `850_docs` | Documentation | Further data documentation, esp. the HTML documentation of the ODD |
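
Because the content of `103_tei_w` is regenerated on every conversion run, a file should be copied to `010_manannot` before anyone edits it by hand. The following is a minimal sketch of such a copy step, not a script from this repository: the function name `copy_for_annotation` is hypothetical, and it assumes that a file keeps the same name in both directories.

```python
from pathlib import Path
import shutil


def copy_for_annotation(filename, src_dir="103_tei_w", dst_dir="010_manannot"):
    """Copy one tokenized TEI file into the manual-annotation directory.

    Hypothetical helper: assumes the file keeps its name in both directories.
    """
    src = Path(src_dir) / filename
    dst = Path(dst_dir) / filename
    # Never overwrite a file that may already contain manual annotations.
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; refusing to overwrite it")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst
```

Refusing to overwrite an existing target is a deliberate choice here: the copy in `010_manannot` may already hold annotation work that cannot be regenerated.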

The oXygen project shawi.xpr contains the configuration for various transformation scenarios.

The directories css, html, js and xsl are used by the TEI Enricher.

Other data locations

Master files of the audio recordings are stored on the project's network share at the University of Vienna.
The metadata spreadsheet is hosted on SharePoint.
The SHAWI Dictionary is curated in [BaseX Curation](https://redmine.acdh.oeaw.ac.at/issues/11318).

General Workflow

For more information, refer to the SHAWI Data Processing and Curation Document.

The following steps happen before data is ingested into this repository:

Fieldwork (recording audio etc.) – The recordings so far cover only material collected in previous campaigns.
Collecting metadata – Metadata is collected and curated in the metadata spreadsheet.

Workflow steps reflected in the data in this repository:

Transcription and translation – Curators segment the audio recordings into sensible sets of "utterances" and transcribe and translate them using ELAN. When transcription is finished, the curator adds the ELAN document(s) to `122_elan` and pushes the changes to git.
Tokenization – This push triggers the ELAN2TEI conversion workflow, which takes all `*.eaf` files in `122_elan` and transforms them into tokenized standalone TEI documents, storing them under `103_tei_w`. Additionally, a TEI Corpus file is generated which includes corpus-level metadata and controlled vocabularies.
Annotation – After transformation to TEI, curators annotate the texts using the TEI Enricher and store the results under `010_manannot`.
Conversion to NoSke Verticals – During the tokenization process, a NoSke-compatible vertical is created which incorporates the annotations found in `010_manannot`.
Deployment – Integration of deployment into the workflow is TBD.
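
To make the input of the tokenization step more concrete, the following sketch reads the time-aligned annotations from an ELAN file using only the Python standard library. It is an illustration and not the ELAN2TEI script from `080_scripts_generic`: the function names `read_utterances` and `list_eaf_files` are hypothetical, and the sketch assumes plain time-aligned tiers (reference annotations that depend on other annotations are ignored).

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def read_utterances(eaf_path):
    """Return (tier_id, start_ms, end_ms, text) tuples from one ELAN file."""
    root = ET.parse(eaf_path).getroot()
    # TIME_SLOT elements map slot ids to millisecond offsets in the recording.
    slots = {
        ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
        for ts in root.iter("TIME_SLOT")
        if ts.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        # Only time-aligned annotations are read in this sketch.
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = (ann.findtext("ANNOTATION_VALUE") or "").strip()
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


def list_eaf_files(elan_dir="122_elan"):
    """List all *.eaf files in the ELAN directory, sorted by name."""
    return sorted(Path(elan_dir).glob("*.eaf"))
```

Calling `read_utterances` on each path returned by `list_eaf_files()` yields the segments a curator created in ELAN, which is roughly the material a conversion to TEI has to work from.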

Re-Deploy SHAWI Website

Start GitHub Workflow in the vicav-app repository https://github.com/acdh-oeaw/vicav-app:
- choose generate-workflow_vars-shawi and
- click re-run this job
- wait until it is done.
Go to ACDH-CH Rancher https://rancher.acdh-dev.oeaw.ac.at/dashboard/home and
- click on AC2 at the upper left corner of the screen or acdh-ch-cluster-2
- then search for vicav-test in the window in the upper right corner of the screen
- click on workloads (menu on the left) and on deployments
- now choose shawi-app-devel and
- click redeploy (three dots on the right)
- wait until it is done
