This course was developed under a previous phase of the Yale Digital Humanities Lab. Now a part of Yale Library’s Computational Methods and Data department, the Lab no longer includes this course in its scope of work. As such, it will receive no further updates.
This repository contains materials for YData: Humanities Data Mining (S&DS 176 / S&DS 576), taught in the Spring of 2022 at Yale University. For more information on the course, please consult the preliminary syllabus or the course materials below (please note that some materials require a Yale NetID for access).
You can also view the syllabus from the first year the class was taught, Spring 2021.
In our first week of class, we will discuss some of the ways researchers from the humanities and beyond have used data mining, and we will take our first steps with the Python programming language.
Tuesday Slides
Thursday Lab Notebook
Readings
- Michael Witmore, “Text: A Massively Addressable Object”
- Ted Underwood, “Seven ways humanists are using computers to understand text”
In our second week of class, we will take a deeper dive into data: what it is, how it's created, and how we can find and use it. In particular, we'll explore Application Programming Interfaces (APIs), little machines that give us data to analyze!
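As a rough illustration of what working with an API involves, the sketch below builds a query URL and pulls fields out of a JSON response using only Python's standard library. The endpoint `https://api.example.org/search`, the parameter names, and the shape of the sample response are all hypothetical stand-ins, not a service used in the course.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Attach query parameters to a base URL, e.g. ?q=whitman&limit=2."""
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_text):
    """Parse a JSON response and return the title of each record."""
    payload = json.loads(response_text)
    return [record["title"] for record in payload["results"]]


# Hypothetical endpoint and parameters, for illustration only.
url = build_query_url("https://api.example.org/search", {"q": "whitman", "limit": 2})

# A made-up response body standing in for what such an API might return.
sample_response = '{"results": [{"title": "Leaves of Grass"}, {"title": "Drum-Taps"}]}'
titles = extract_titles(sample_response)
```

In practice the response text would come from a network request (for example with `urllib.request.urlopen`); that step is left out here so the sketch runs offline.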
Tuesday Slides
Readings
- Christof Schöch, “Big? Smart? Clean? Messy? Data in the Humanities”
- Johanna Drucker, “Why Distant Reading Isn’t” (VPN or on-campus network needed)
In our third week, we will consider strategies and best practices for visualizing data that take into account what kind of data we have, who we have in mind as our audience, what story we're aiming to tell, and where we think the visualization will circulate. For Thursday's lab, please download Tableau Public.
Tuesday Slides
Thursday Lab Tutorial
Readings
- Catherine D’Ignazio and Lauren Klein, “Feminist Data Visualization”
No problem set is assigned this week. Work on Project Review 1: Text Mining. You can find the prompt in Canvas under "Assignments."
In our fourth week, we will begin turning our attention to text analysis in more detail. In particular, we will experiment with an approach called named entity recognition, which can help us extract entities (names, locations, organizations) from text.
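To give a feel for what entity extraction produces, here is a deliberately simple sketch that tags entities by looking them up in a hand-made list (a gazetteer). This is a stand-in technique chosen so the example runs with the standard library alone; it is not a trained named entity recognition model, and the gazetteer entries, labels, and sample sentence are assumptions made up for illustration.

```python
import re

# A tiny hand-made gazetteer; the entries and labels are illustrative assumptions.
GAZETTEER = {
    "Martha Ballard": "PERSON",
    "Yale University": "ORGANIZATION",
    "New Haven": "LOCATION",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, surface_form, label) for every gazetteer match, in text order."""
    matches = []
    for name, label in gazetteer.items():
        # \b keeps a name from matching inside a longer word.
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            matches.append((m.start(), name, label))
    return sorted(matches)


sentence = "The course at Yale University in New Haven discusses Martha Ballard."
entities = find_entities(sentence)
```

A lookup table only finds names it already knows, which is the main reason statistical models are usually preferred for real corpora.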
Readings
- Richard Jean So, “All Models are Wrong”
- Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books”
In our fifth week, we will explore supervised methods for classifying and clustering data using Python. We will consider when such approaches could be helpful, as well as what the limitations are and what kind of data we need to have.
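As one possible illustration of supervised classification, the sketch below trains a small multinomial Naive Bayes classifier on labeled texts. The choice of Naive Bayes, the toy training sentences, and the genre labels are all assumptions for the sake of the example, not necessarily the method or data used in class.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count how often each term appears under each label."""
    term_counts = defaultdict(Counter)  # label -> term -> count
    doc_counts = Counter()              # label -> number of training documents
    for text, label in labeled_docs:
        term_counts[label].update(text.lower().split())
        doc_counts[label] += 1
    vocabulary = {term for counts in term_counts.values() for term in counts}
    return term_counts, doc_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the new text."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # prior for this label
        for term in text.lower().split():
            # Add-one (Laplace) smoothing so unseen terms do not zero out the score.
            score += math.log((counts[term] + 1) / (total_terms + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy training data: the labels are what make this "supervised".
training = [
    ("the ship sailed across the stormy sea", "adventure"),
    ("pirates boarded the ship at dawn", "adventure"),
    ("she danced at the ball and fell in love", "romance"),
    ("his love letter arrived before the wedding", "romance"),
]
model = train(training)
prediction = predict(model, "a love story ending in a wedding")
```

With only four training sentences the model is a toy, which echoes the point above about how much depends on the data we have.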
Readings
- Patrick Juola, “How a Computer Program Helped Show J.K. Rowling write A Cuckoo’s Calling” [sic]
- Franco Moretti, “The Slaughterhouse of Literature” (VPN or on-campus network needed)
In our sixth week, we will review several of the programming topics we have covered so far in the semester, and we'll explore a few new topics that will prove useful as we continue our data science work in the coming weeks. We will learn about topic modeling by looking at case studies and experimenting with model parameters. The particular approach we'll be using is called non-negative matrix factorization (NMF), which, like the classifier we trained in week five, starts with a term-document matrix.
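For readers curious about what NMF does under the hood, here is a minimal sketch using the classic multiplicative update rules in plain Python. It factors a small count matrix V into a document-by-topic matrix W and a topic-by-term matrix H, both non-negative. The matrix values, the orientation (rows as documents, columns as terms), the number of topics, and the iteration count are assumptions for illustration; a library implementation would be the practical choice for real data.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Sum of squared differences between V and its reconstruction W x H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for v_row, wh_row in zip(v, wh) for x, y in zip(v_row, wh_row))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (documents x terms) into W (documents x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(n_topics)] for _ in v]
    h = [[rng.random() for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update H elementwise: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[x * n / (d + eps) for x, n, d in zip(*rows)] for rows in zip(h, num, den)]
        # Update W elementwise: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[x * n / (d + eps) for x, n, d in zip(*rows)] for rows in zip(w, num, den)]
    return w, h


# Made-up counts: two documents lean on the first two terms, two on the last two.
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]
w, h = nmf(v, n_topics=2)
error = squared_error(v, w, h)
```

Each row of `h` can be read as a topic (a weighting over terms), and each row of `w` says how strongly a document draws on each topic. Changing `n_topics` is the kind of parameter experiment described above.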
Readings
- Ted Underwood, “Topic Modeling Made Just Simple Enough”
- Cameron Blevins, “Topic Modeling Martha Ballard’s Diary”
In our seventh week, we will begin our transition from text mining to image mining techniques by way of neural networks. On Thursday, we will focus on word embeddings, a technique for identifying words that appear in similar contexts.
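The idea that words appearing in similar contexts end up with similar vectors can be sketched without a neural network. The example below builds simple count-based context vectors (how often each word occurs near every other word) and compares them with cosine similarity. This is a simplified, count-based stand-in for learned embeddings, and the four-sentence corpus and window size are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to counts of the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(vectors, word):
    """Return the other word whose context vector is closest to that of `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


# A tiny invented corpus in which "king"/"queen" and "dog"/"cat" share contexts.
corpus = [
    "the king rules the land",
    "the queen rules the land",
    "the dog chased the ball",
    "the cat chased the ball",
]
vectors = context_vectors(corpus)
neighbor_of_king = most_similar(vectors, "king")
```

Learned embeddings compress this kind of context information into dense, low-dimensional vectors, but the comparison step (cosine similarity between word vectors) works the same way.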
Readings
- Gideon Lewis-Kraus, “The Great A.I. Awakening”
- Jonathan Fitzgerald, “Word Embeddings are the New Topic Models”
- Optional: Ryan Heuser, “Word Vectors in the Eighteenth Century”
- Optional: Ben Schmidt, “Vector Space Models for the Digital Humanities”
In our eighth week, we will start looking more closely at image mining, with an overview of projects, techniques, and data considerations. For hands-on practice, we will experiment with color extraction.
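One simple way to think about color extraction is to round every pixel to a coarse color bucket and count which buckets occur most often. The sketch below does this for a list of RGB tuples. The bucket size, the function name `dominant_colors`, and the six sample pixels are assumptions for illustration; the course notebooks may take a different approach (for example, clustering).

```python
from collections import Counter


def dominant_colors(pixels, n_colors=2, bucket=64):
    """Return the n most common colors after rounding each RGB channel down to a bucket."""
    quantized = Counter(
        tuple((channel // bucket) * bucket for channel in pixel) for pixel in pixels
    )
    return [color for color, _ in quantized.most_common(n_colors)]


# A made-up 6-pixel "image": mostly reddish, some bluish, one near-white pixel.
pixels = [
    (250, 10, 12), (245, 5, 20), (255, 30, 3),
    (10, 20, 240), (5, 60, 200),
    (250, 250, 250),
]
palette = dominant_colors(pixels)
```

For a real image, the pixel list would come from an imaging library rather than being typed by hand; that loading step is omitted so the example has no external dependencies.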
Tuesday Readings:
Thursday Notebooks and links:
In our tenth week, we will be discussing techniques for measuring and identifying image similarity. In particular, we will focus on Convolutional Neural Networks as our approach.
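The building block of a Convolutional Neural Network is the convolution itself: a small grid of weights slides over the image and produces a feature map. The sketch below applies a single hand-written filter to tiny grayscale grids and then compares images by the distance between their feature maps. It is only a toy under stated assumptions: a real CNN learns many filters across many layers, and the images, the kernel, and the helper names here are made up for illustration.

```python
import math


def convolve2d(image, kernel):
    """Slide a kernel over a 2D grid (no padding, stride 1) and sum elementwise products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def feature_distance(image_a, image_b, kernel):
    """Euclidean distance between the flattened feature maps of two images."""
    flat_a = [x for row in convolve2d(image_a, kernel) for x in row]
    flat_b = [x for row in convolve2d(image_b, kernel) for x in row]
    return math.dist(flat_a, flat_b)


# A hand-written vertical-edge kernel; in a trained CNN these weights are learned.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
# Three made-up 4x5 grayscale images: two with a dark-to-bright edge, one flat.
edge_a = [[0, 0, 0, 9, 9] for _ in range(4)]
edge_b = [[1, 0, 1, 8, 9] for _ in range(4)]
flat = [[5, 5, 5, 5, 5] for _ in range(4)]

feature_map = convolve2d(edge_a, kernel)
```

Images whose feature maps are close together (here, the two edge images) are treated as more similar than images whose feature maps are far apart.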
Tuesday Readings:
Tuesday In-Class Links:
- Lyrics Text Comparison
- Image Similarity Ordering
- Neural Neighbors (Meserve-Kunhardt Collection)
Thursday Notebooks:
In our eleventh week, we will look at methods for video (or moving image) analysis and consider when, why, and how we might go about it. As a capstone to our image analysis module, we will use the Distant Viewing Toolkit. We'll also explore classifying sound files according to musical genre with guest lecturer Nicole Cosme (Yale Music Department).
Tuesday Slides
Tuesday Notebook
Tuesday Readings
In our twelfth week, we will begin our open lab sessions, which are designed to pull together the material from the course while giving you time to work on your final projects. For our first open lab, we will discuss pathways for finding and preparing data.
In our thirteenth week, we will continue bringing the course material together by discussing how to identify an appropriate method based on your research question and available data. We will also review strategies for visualizing and sharing results.
In our fourteenth week, we will conclude with class presentations to showcase everyone's work. Thank you for an incredible semester!
The course materials are published under a CC BY 3.0 US license. This course, Humanities Data Mining, was created in 2021 by Dr. Catherine DeRose and Dr. Douglas Duhaime.