Microcredential komex Text-as-Data Methods in Python

Content

A crash course on how to use text-as-data and NLP methods in Python, from Bag-of-Words to Large Language Models

What Is This Course About?
Large volumes of text data, from books to social media conversations, have the potential to revolutionize the social sciences. Traces of human behavior are encoded in these large text corpora, and modern text-as-data methods can help us analyze them efficiently and develop novel measurements. This 3-day in-person course introduces a wide range of text-as-data (TADA) approaches, from basic bag-of-words representations to more complex representations obtained from recent large language models (LLMs). Besides getting an overview of popular TADA methods, participants will learn hands-on approaches for analyzing textual data and validating the measurements they derive from it. We will discuss how to tailor these approaches to our needs, e.g., based on the characteristics of the corpora or data, or on varying computational budgets.
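To make the simplest of these representations concrete, here is a minimal sketch of a bag-of-words document-term matrix in plain Python. It is an illustration only and not taken from the course materials: the helper names (tokenize, bag_of_words) and the two example sentences are made up, and the regular-expression tokenizer is a deliberately crude assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (crude, illustrative)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, document-term count matrix) for a list of texts."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix


# Made-up example documents.
docs = ["The vote was close.", "Close the vote, then vote again."]
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```

Each row counts how often each vocabulary word occurs in one document; word order is discarded, which is what distinguishes this representation from the contextual representations produced by LLMs.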

Learning Goals

Get an idea of the basic building blocks of working with textual data, from data loading and preprocessing to analysis and validation
Get an overview of some important Python libraries for working with textual data, such as NLTK and Hugging Face Transformers
Learn the theoretical background of basic and advanced NLP techniques and apply them to your own research
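As a rough illustration of how these building blocks fit together, the sketch below preprocesses a few texts, applies a dictionary-based measurement, and validates it against hand labels. Everything in it is hypothetical: the toy lexicon, the example texts, their labels, and the function names are assumptions made for illustration, not course material or any library's API.

```python
import re

# Hypothetical toy lexicon; a real study would use a validated dictionary.
NEGATIVE_WORDS = {"terrible", "awful", "hate"}


def preprocess(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def measure(text):
    """Label a text 1 if it contains any lexicon word, else 0."""
    return int(any(token in NEGATIVE_WORDS for token in preprocess(text)))


def accuracy(predicted, gold):
    """Share of predictions that agree with the hand-assigned labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up texts with made-up hand labels (1 = negative, 0 = not negative).
corpus = [
    ("I hate this policy", 1),
    ("What a lovely debate", 0),
    ("This is not great", 1),
    ("Awful turnout today", 1),
]
predicted = [measure(text) for text, _ in corpus]
gold = [label for _, label in corpus]
score = accuracy(predicted, gold)
print(predicted, score)
```

The third text is mislabeled because the lexicon cannot see negation, which is the kind of measurement error that validation against human annotations is meant to surface.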

Assignments for the Course
We will have two types of assignments:

Daily in-class exercises to be solved individually or in groups
A small group project in which participants conceptualize a research project, using existing large-scale textual datasets (including their own) and applying some of the text-as-data methods to investigate a social science research question

Schedule
Days 1-3 (4 hours of instruction per day)

09.00-10.30 – Course
10.30-11.00 – Break
11.00-12.30 – Course
12.30-13.30 – Lunch break
13.30-14.30 – Course
15.00-16.00 – Office Hours

Recommended Readings for the Course

Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. "Text as data." Journal of Economic Literature 57.3 (2019): 535-574.
Grimmer, Justin, and Brandon M. Stewart. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political Analysis 21.3 (2013): 267-297.
Hovy, Dirk. Text analysis in Python for social scientists: Discovery and exploration. Cambridge University Press, 2020.

Who Is Your Instructor?
Indira Sen is a postdoc in the Political Science department at the University of Konstanz. Her research focuses on understanding and characterizing the measurement quality of social science constructs, such as political attitudes and abusive content, measured from digital traces. Her work combines NLP and measurement theory. You can reach her at @indiiigosky on Twitter or via https://indiiigo.github.io/

Educational leave (Bildungszeit; can be claimed by employees in Baden-Württemberg)

The requirements of the Baden-Württemberg Educational Leave Act (Bildungszeitgesetz) are met

Fee

350 EUR (early bird: 270 EUR). Please note: you will gain access to our learning management system, Moodle, only after paying your course fee.

ECTS Credits

Contact for Questions

komex Office

Place

University of Konstanz

Date

17.02.2025 (All day)

18.02.2025 (All day)