Microcredential komex Text-as-Data with Python

Content 

A crash course on how to use text-as-data and NLP methods in Python: from Bag-of-Words to Large Language Models

What Is This Course About?
Large swathes of text data, ranging from books to social media conversations, have the potential to revolutionize the social sciences. Traces of human behavior are encoded in these large text corpora, and modern text-as-data methods can help us analyze them efficiently and develop novel measurements. This 3-day in-person course serves as an introduction to a wide range of text-as-data (TADA) approaches, from basic bag-of-words representations to more complex representations obtained from recent large language models (LLMs). Besides getting an overview of popular TADA methods, participants will learn hands-on approaches for analyzing textual data and validating the measurements they derive from it. We will discuss how to tailor these approaches to our needs, e.g., based on the characteristics of the corpora or data, or on varying computational budgets.

Learning Goals

  • Get an idea of the basic building blocks of working with textual data, from data loading and preprocessing to analysis and validation.
  • Get an overview of some important Python libraries for working with textual data, such as NLTK and Hugging Face Transformers.
  • Learn the theoretical background of basic and advanced NLP techniques and apply them to your own research.
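
To give a flavor of the simplest representation mentioned above, here is a minimal bag-of-words sketch. It is an illustration only and not taken from the course materials: it uses just the Python standard library rather than NLTK, and the helper names `tokenize` and `bag_of_words` as well as the two example sentences are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return a sorted vocabulary and one word-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is the set of all tokens seen in any document.
    vocab = sorted(set().union(*counts))
    # Each document becomes a vector of counts, aligned with the vocabulary.
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors


# Hypothetical toy corpus, for illustration only.
docs = [
    "The parliament debated the new climate law.",
    "Climate protests grew after the debate.",
]
vocab, vectors = bag_of_words(docs)

for doc, vector in zip(docs, vectors):
    print(doc)
    print(dict(zip(vocab, vector)))
```

Each document is reduced to word counts, so word order is discarded. Note also that "debate" and "debated" end up as separate vocabulary entries, which is the kind of issue that preprocessing steps such as stemming or lemmatization are meant to address.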


Assignments for the Course
We will have two types of assignments:

  • Daily in-class exercises to be solved individually or in groups
  • A small group project in which participants conceptualize a research project, using existing large-scale textual datasets (including their own) and applying some of the text-as-data methods to investigate a social science research question


Schedule
Days 1-3 (4 hours of course sessions per day)

  • 09.00-10.30 – Course
  • 10.30-11.00 – Break
  • 11.00-12.30 – Course
  • 12.30-13.30 – Lunch break
  • 13.30-14.30 – Course
  • 15.00-16.00 – Office Hours


Recommended Readings for the Course

  • Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. "Text as data." Journal of Economic Literature 57.3 (2019): 535-574.
  • Grimmer, Justin, and Brandon M. Stewart. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political analysis 21.3 (2013): 267-297.
  • Hovy, Dirk. Text analysis in Python for social scientists: Discovery and exploration. Cambridge University Press, 2020.


Who Is Your Instructor?
Indira Sen is a Postdoc at the Political Science department at the University of Konstanz. Her research is about understanding and characterizing the measurement quality of social science constructs, like political attitudes and abusive content, derived from digital traces. She works with NLP and measurement theory. You can reach her at @indiiigosky on Twitter or via https://indiiigo.github.io/

Bildungszeit (educational leave; can be claimed by employees in Baden-Württemberg)
The requirements of the Baden-Württemberg Educational Leave Act (Bildungszeitgesetz) are met
Fee 
350 EUR / Early bird 270 EUR. Please note: you will gain access to our learning management system, Moodle, only after paying your course fee.
ECTS Credits 
2
Contact for Questions 
Date 
17.02.2025 (All day)
18.02.2025 (All day)
19.02.2025 (All day)
Duration 
3 study days
Requirements 
Participants should have introductory Python knowledge, which can also be obtained in the KOMEX ‘Introduction to Python’ course.