The last live session (about three hours) took place on June 11th, 2021, at 18:00 CEST on Twitch.
Quick Start: Go straight to the default learning path – Absolute Beginner (Long).
Welcome to this repository for the Python Programming for Linguists workshop.
In this workshop, consisting of several videos and exercises, as well as a live session (recordings/videos available), you will be introduced to Python and its application within (corpus) linguistics. After a short general introduction to programming as well as Python, we will utilize Python to solve several (corpus) linguistic exercises.
This workshop is specifically targeted towards people who have no prior experience programming. While this workshop is not intended to make you a programmer, you will gain a fundamental understanding of how programming works and how to proceed should you want to deepen your knowledge and skills. In addition, by looking at various example tasks that are commonly solved using existing software, we will try to deepen our understanding of how commonly used tools work under the hood.
Please be aware that this workshop was specifically designed as a first introduction to programming for non-coders and linguists and not as a fully-fledged Python course. Therefore, we will take some shortcuts, disobey some best practices, and hide away quite a few of the underlying complexities. If you are interested in a more thorough introduction or want to deepen your already existing knowledge, please refer to the final video in which I present many great resources. Also, feel free to have a look at the List of Additional Resources.
While the materials and exercises are targeted towards beginners, they are challenging, and this workshop is designed as an intense deep dive!
Please do not feel discouraged if you get stuck or if something seems too hard at first. I have provided solutions for all exercises, and you will also find lots of additional helpful resources in this repository. Also, while not required, you can prepare for this workshop by consulting other slower-paced introductory courses such as the ones listed in this document. You will get there! 🚀
This repository currently reflects the second iteration of this workshop (2021), which started as part of my Data Literacy for Linguists class taught at the University of Cologne. Therefore, I will be reusing most of the material from the 2020 rendition. In 2022, the material was slightly updated, most notably with a new solutions video for Exercises 8 to 17 and a note regarding the use of ChatGPT and similar AI systems. Hence, please do not get confused by the 2020, 2021, and 2022 folders and simply follow the learning path(s) provided below. That being said, feel free to explore as much as you want!
Originally, this workshop was inspired by workshops I held at 35c3 and 36c3 a/36c3 b.
After completing this workshop, you will be able to ...
- describe what programming essentially is about.
- name and describe some basic programming terminology.
- model simple problems in terms of data structures and basic algorithms.
- write basic scripts in Python in order to solve specific problems.
- utilize third-party libraries such as NLTK, spaCy, and TextDirectory.
- construct and apply basic regular expressions.
- utilize Python for text manipulation.
- utilize Python to perform concordance and frequency analysis.
- automatically annotate texts (PoS, Universal Dependencies, NER) using spaCy.
- scrape web data in order to build corpora (Web as Corpus) using Python.
- compute basic statistics using Python.
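To give a rough feel for a few of the outcomes listed above (regular expressions, text manipulation, frequency analysis, and concordance), here is a minimal sketch that uses only the Python standard library. It is not taken from the workshop notebooks: the sample sentence and the function names `tokenize`, `top_words`, and `concordance` are hypothetical and chosen purely for illustration. The workshop itself also relies on libraries such as NLTK and spaCy for these tasks.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return all runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=3):
    """Return the n most frequent tokens together with their counts."""
    return Counter(tokenize(text)).most_common(n)


def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for every hit of keyword."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text, made up for this sketch.
sample = "The cat sat on the mat. The dog sat too."

print(top_words(sample, 2))        # [('the', 3), ('sat', 2)]
print(concordance(sample, "sat"))  # two hits with up to two words of context
```

The regular expression here is deliberately naive (it only handles unaccented Latin letters); real corpora usually call for a proper tokenizer.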
This workshop is designed as a blend of asynchronous and synchronous elements. However, as everything will be recorded, you can also do this in a completely self-paced fashion.
The general idea is that you watch a series of videos and complete/attempt a series of exercises before joining the synchronous live session hosted on Twitch. During this live session, I will be solving Exercises 8 to 17 while you are invited to code along and to ask questions.
To make things as straightforward as possible, I have created three learning paths for you to follow.
The Absolute Beginner (Long) path also contains additional materials and exercises. If you are already somewhat familiar with Python, you can have a look at the Experienced in Python path.
The last live session (about three hours) took place on June 11th, 2021, at 18:00 CEST on Twitch.
If you are interested, last year's recording (slightly edited and polished) is available on YouTube.
Also, as of December 2022, pre-recorded solutions for Exercises 8 to 17, instead of the live session(s), are available for you to watch (see below).
I strongly encourage you to code along and to experiment with the exercises. The easiest way of doing this is to use Google Colab, for which you will need a Google Account. If you have never used Colab, you might want to have a look at this tutorial on YouTube.
If you do not want to rely on Google, you can also set up your own local development environment. For a tutorial on how to do this on Windows, have a look at the video "Setting Up Your Development Environment (Windows)."
The videos are intended to be paused from time to time. Do not feel forced to watch through a whole video before playing with the code 😀.
This is a list of the available materials. I would suggest following one of the learning paths provided above, but of course, you are free to use the materials as you see fit.
There is no fixed schedule for updating these materials as I am not actively teaching this workshop at the moment. While the "old" materials, e.g., from 2020, are largely kept as is, I have gone back and made some very minor changes – e.g., when I came across a typo.
All of these videos are currently hosted on YouTube (Playlist). Additional Technology Primers for Linguistics are available via their own YouTube playlist.
- 00 - Python Programming for Absolute Beginners
- 01 - The Pizza Problem
- 02 - Working with Files, Texts, and Regular Expressions
- 03 - Python for (Corpus) Linguists / Exercises 8 to 17 (2022 Recording) (Old 2020 Recording)
- 04 - Summary and Resources
- 05 - Setting Up Your Development Environment (Windows) (Alternative to using Google Colab. Please watch video 00 first in any case!)
- 06 - Getting Started with Google Colab
In addition to the videos and livestream, this workshop has 17 main exercises as well as a number of additional ones. Solutions to these exercises are available in the form of notebooks.
- Exercises 1 to 3 (Solutions)
- Exercises 4 and 5 (Solutions)
- Exercises 6 and 7 (Solutions)
- Exercises 8 to 17 (Solutions)
- Additional Exercise: Regular Expressions (Exercise Video)
- Additional Exercise: Frequency Distribution (Solutions)
Please note that for each exercise, you will find solutions in this repository. Don't feel bad if you cannot immediately solve the exercises - the solutions are there to help you. Of course, feel free to take apart these suggested solutions and play with them.
All of the slides (in both .pptx and .pdf) are available as well. See 2020, 2021, and 2022.
Aside from the main material, there are also a few advanced bonus notebooks in this repository for you to explore. Have a look at them to see more advanced and/or alternative solutions to some of the problems discussed in the workshop.
- Command Line Primer
- Markdown Primer
- Commenting in Python
- Video: A RegEx Primer for Linguistics
- Video: A Git Primer for Linguistics
- Video: A Shell Primer for Linguistics
- List of Additional Resources
This workshop is based on modern Python and requires a version of Python >= 3.6. All of the code, as well as the used external libraries, should be compatible with everything up to Python 3.9 as well. If you are interested, also have a look at the Coding Style Guide for this workshop in which I discuss how most of the code is styled and why.
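If you are working in a local environment, a quick way to confirm that your interpreter meets the minimum version stated above is a check along the following lines. This is a minimal sketch, not part of the workshop materials; the helper name `meets_minimum` is hypothetical.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


if meets_minimum():
    print("Python {}.{} is fine for this workshop.".format(*sys.version_info[:2]))
else:
    print("Please upgrade: this workshop needs Python 3.6 or newer.")
```

On Google Colab, this check should not be necessary, since the hosted runtime ships a recent Python version.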
You are (relatively) free to use all of these materials as you like.
- The code (notebooks and scripts) is licensed under the MIT License.
- The slides, videos, and exercises are licensed under a CC BY-SA 4.0 license.
In some exercises, this workshop relies on the wonderful HUM19UK corpus (Huddersfield, Utrecht, Middelburg Corpus of 19th Century Fiction) compiled by Fransina Stradling, Brian Walker, Dan McIntyre, Elliot Land, Hazel Price, and Michael Burke.
Unfortunately, the corpus website (linguisticsathuddersfield.com) cannot be reached anymore, and access to the corpus has become harder for the moment. Hence, for now, and in the spirit of their original licensing, I have made the data available through one of my servers and updated the download script accordingly.