Recognition and identification of real-world entities is at the core of virtually any text mining application. As a matter of fact, referential units such as names of persons, locations and organizations underlie the semantics of texts and guide their interpretation. Around since the seminal Message Understanding Conference (MUC) evaluation cycle in the 1990s [1], named entity-related tasks have undergone major evolutions until now, from entity recognition and classification to entity disambiguation and linking [2,3,4,5]. Besides the general domain of well-written newswire data, named entity (NE) processing is also applied on specific domains, particularly bio-medical [6], and on more noisy inputs such as speech transcriptions [7] and tweets [8]. More recently, NE processing has also been called upon to contribute to the domain of digital humanities, where massive digitization of historical documents is producing huge amounts of texts.
De facto, NE processing tools are increasingly being used in the context of historical documents. Research activities in this domain target texts of different nature (e.g., publications by cultural institutions, state-related documents, genealogical data, historical newspapers) and different tasks (NE recognition and classification, entity linking, or both). Experiments involve different time periods (from 16th to 20th c.), focus on different domains, and use different typologies. This great variety demonstrates how many and varied the needs – and the challenges – are, but makes performance comparison difficult, not to say impossible. Compared to the standard analysis of present-time English, very often news, the application of NE tools on historical texts faces news challenges, which can be defined as follows: (i) noisy input texts, (ii) lack of coverage in linguistic resources and knowledge bases, and (iii) dynamics of language [9].
The objective of the tutorial is to provide the participants with essential knowledge with respect to a) NE processing in general and in DH, and b) how to apply NE recognition approaches. To this end, the session will be organized in two parts, theory and hands-on.
Throughout the sessions, the audience will learn about:
- the origins of named entity processing
- the resources needed for their processing
- the evaluation protocols
- the tools and algorithms used for their recognition, classification and disambiguation.
Participants will also learn how to run an existing NER system and, more interestingly, how to build or adapt a system, by training it on historical materials. Two main approaches will be considered:
- rule-based
- deep-learning
- Date: Tuesday 9 July from 9am to 4pm
- Venue: TivoliVredenburg building, room 'Hertz'
- Important: please come with a fully charged laptop
Time | Activity |
---|---|
08h45-09h00 | Welcome of participants |
09h00-10h30 | Theoretical part |
10h30-11h00 | Coffee break |
11h00-12h00 | Theoretical part |
12h00-13h00 | Hands-on |
13h00-14h00 | Lunch break |
14h00-14h45 | Hands-on |
14h45-15h15 | Coffee break |
15h15-16h00 | Hands-on |
Time | Topic |
---|---|
09h00-09h05 | Introduction, welcome |
09h05-09h20 | Segment 1: Context, Definition and Applications |
09h20-09h45 | Segment 2: Resources for Named Entities |
09h45-10h30 | Segment 3: Recognizing and classifying |
11h00-11h20 | Segment 4: Linking |
11h20-11h30 | Segment 5: Evaluating |
11h30-12h00 | Segment 6: Zoom on DH |
12h00-12h30 | Hands-on: System set up |
12h30-13h00 | Hands-on: Experimenting APIs and applying models |
14h00-14h45 | Hands-on: Experimenting APIs and applying models |
15h15-16h00 | Hands-on: Experimenting APIs and applying models |
In the hands-on session we will make use of two datasets consisting of historical texts:
- Quaero Dataset (French historical newspapers of the end of the XIX c.)
- impresso dataset (Swiss and Luxembourgish historical newspapers in French and German).
Additionally, we will provide a list of alternative datasets, both historical and contemporary, that participants can decide to work with, in full respect of copyrights. Finally, participants are welcome to bring to the workshop their own datasets in order to apply the code and tools we will present to them.
Hands-on material will be shared on GitHub and will include:
- Jupyter notebooks with explanations and code examples; if relevant, we will set up a multi user environment (Jupyter Hub) in order to reduce system setup time during the tutorial;
- a bibliography on the topic;
- a list of of available open source academic and industrial tools;
- slides of the tutorial.
- Maud Ehrmann (École polytechnique fédérale de Lausanne, DHLAB)
- Matteo Romanello (École polytechnique fédérale de Lausanne, DHLAB)
- Simon Clematide (University of Zurich, Institute of Computational Linguistics)
[1] R. Grishman and B. Sundheim. 1996. Message Understanding Conference - 6: A brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING ’96, pages 466–471, Stroudsburg, PA, USA. Association for Computational Linguistics.
[2] D. Nadeau and S. Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.
[3] D. Rao, P. McNamee, and M. Dredze, 2013. Multisource, Multilingual Information Extraction and Summarization, chapter Entity Linking: Finding Extracted Entities in a Knowledge Base. Pages 93–115. Springer Berlin Heidelberg, Berlin, Heidelberg.
[4] D. Nouvel, M. Ehrmann, S. and Rosset. 2015. Les Entités Nommées pour le Traitement Automatique des Langues, ISTE edition.
[5] M. Ehrmann, D. Nouvel, and S. Rosset. 2016. Named Entities Resources - Overview and Outlook. In N. Calzolari Conference Chair, K. Choukri, T. Declerck, M. Grobelnik, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the 10th International Conference on Language Resources and Evaluation, Portoro, Slovenia, May 2016.
[6] J.D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. Genia corpus, a semantically annotated corpus for bio-text mining. Bioinformatics, 19(suppl 1):180–182.
[7] O. Galibert, J. Leixa, G. Adda, K. Choukri, and G. Gravier. 2014. The ETAPE speech processing evaluation. In Proc. of the 9th International Conference on Language Resources and Evaluation (LREC’09), Reykjavik, Iceland.
[8] A. Ritter, S. Clark, M. Etzioni, and O. Etzioni. 2011. Named Entity Recognition in Tweets: An Experimental Study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1524–1534, Stroudsburg, PA, USA. Association for Computational Linguistics.
[9] M. Ehrmann, G. Colavizza, Y. Rochat, Y. and F. Kaplan. 2016. Diachronic evaluation of NER systems on old newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016) (No. EPFL-CONF-221391, pp. 97-107). Bochumer Linguistische Arbeitsberichte.
[10] O. Galibert, S. Rosset, C. Grouin, P. Zweigenbaum and L. Quintard. 2012. Extended Named Entity Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.