Microcredential komex Text-as-Data with Python
A crash course on how to use text-as-data and NLP methods in Python, from Bag-of-Words to Large Language Models
What Is This Course About?
Large swathes of text data, ranging from books to social media conversations, have the potential to revolutionize the social sciences. Traces of human behavior are encoded in these large text corpora, and modern text-as-data methods can help us analyze them efficiently and develop novel measurements. This 3-day in-person course introduces a wide range of text-as-data (TADA) approaches, from basic bag-of-words representations to more complex representations obtained from recent large language models (LLMs). Besides getting an overview of popular TADA methods, participants will learn hands-on approaches for analyzing textual data and validating the measurements they derive from it. We will discuss how to tailor these approaches to our needs, e.g., based on the characteristics of the corpora or data, or on varying computational budgets.
Learning Goals
- Get an idea of the basic building blocks of working with textual data, from data loading and preprocessing to analysis and validation.
- Get an overview of some important Python libraries for working with textual data, such as NLTK and Hugging Face Transformers
- Learn the theoretical background of basic and advanced NLP techniques and apply them to your own research
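
To give a flavor of the first building blocks named above (loading, preprocessing, and a basic bag-of-words representation), here is a minimal sketch using only the Python standard library. It is not course material: the toy corpus and the helper names `tokenize` and `bag_of_words` are made up for illustration, and libraries such as NLTK provide more robust tokenizers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return one token-count Counter per document."""
    return [Counter(tokenize(doc)) for doc in documents]


# Toy corpus, invented for illustration only.
docs = [
    "The senate passed the bill.",
    "The bill failed in the senate committee.",
]

bows = bag_of_words(docs)

# Shared vocabulary across all documents, in alphabetical order.
vocabulary = sorted(set().union(*bows))

# Document-term matrix: one row per document, one column per vocabulary term.
matrix = [[bow[term] for term in vocabulary] for bow in bows]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` counts how often each vocabulary term occurs in one document and ignores word order, which is the simplification that gives the bag-of-words representation its name.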
Assignments for the Course
We will have two types of assignments:
- Daily in-class exercises to be solved individually or in groups
- A small group project in which participants conceptualize a research project that uses existing large-scale textual datasets (including their own) and applies some of the text-as-data methods to investigate a social science research question
Schedule
Days 1-3 (3 days * 4 hours of instruction)
- 09.00-10.30 – Course
- 10.30-11.00 – Break
- 11.00-12.30 – Course
- 12.30-13.30 – Lunch break
- 13.30-14.30 – Course
- 15.00-16.00 – Office Hours
Recommended Readings for the Course
- Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. "Text as data." Journal of Economic Literature 57.3 (2019): 535-574.
- Grimmer, Justin, and Brandon M. Stewart. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political Analysis 21.3 (2013): 267-297.
- Hovy, Dirk. Text analysis in Python for social scientists: Discovery and exploration. Cambridge University Press, 2020.
Who Is Your Instructor?
Indira Sen is a postdoc in the Political Science department at the University of Konstanz. Her research focuses on understanding and characterizing the measurement quality of social science constructs, such as political attitudes and abusive content, from digital traces. Her work draws on NLP and measurement theory. You can reach her at @indiiigosky on Twitter or via https://indiiigo.github.io/