Text Mining with Python for Economists

The IDSC offers courses on practice-oriented skills in economic research methods for national and international researchers. This introductory course on text mining with Python is now part of the IDSC repertoire. Contact us at idsc@iza.org if you are interested in a course at your institution.


Python, originally a general-purpose scripting language, is now a prime statistical language with a rich collection of modules covering regressions, machine learning, a wide range of statistical methods, high-quality graphing, agent-based simulations, etc. According to the TIOBE Index (https://www.tiobe.com/tiobe-index), as of February 18, 2018, Python is the fourth most popular programming language and the first among scripting languages. In comparison, Stata ranks somewhere between 50 and 100. According to the World Economic Forum, Python ranks among the top skills that the world's tech giants require of both engineers and data scientists.


As more and more markets (marriage market, transport market, labor market, etc.) move online or are born exclusively online, our ability to study markets and understand socioeconomic phenomena will depend on being able to leverage the internet as a data source. This means data and text mining will be an important skill for social scientists. In recognition of this fact, the European Parliament is working on excluding data and text mining from future digital copyright legislation. The course covers the basics of Python selectively, depending on which language elements are necessary for the examples. The core aim is to study:

  • The limits of working with Stata’s built-in rudimentary web browser and regular expressions.
  • The basics of how to install and manage a Python installation and its modules.
  • How to construct and brand a web browser in Python.
  • How to use Python to download pages from the web and store them.
  • How to use regular expressions (module: re) to harvest data out of HTML documents.
  • The data types Python provides for storing data (module: pandas).
  • Some graphing and basic regressions with Python.
  • Integration of Python with Stata.
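
To give a flavor of the download and harvesting steps listed above, here is a minimal sketch using only the standard library (urllib.request and re). It is not taken from the course notebooks. The user-agent string, the function names, the sample HTML and the regular expression are all made up for illustration, and "branding" is read here as setting a custom User-Agent header, which is an assumption. The download function is defined but never called, so the example runs without network access.

```python
import re
import urllib.request

# Hypothetical user-agent string: it identifies ("brands") our client to the server.
USER_AGENT = "econ-course-bot/0.1 (teaching example)"


def make_browser(user_agent=USER_AGENT):
    """Return a urllib opener that sends a custom User-Agent header."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def download(url, path, browser=None, timeout=30):
    """Fetch a page and store its raw bytes on disk (defined, not called here)."""
    browser = browser or make_browser()
    with browser.open(url, timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as fh:
        fh.write(data)
    return data


# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><td class="party">Party A</td><td class="share">31.5</td></tr>
  <tr><td class="party">Party B</td><td class="share">24.0</td></tr>
</table>
"""

# One match per table row: a party name followed by a numeric share.
ROW_PATTERN = re.compile(
    r'<td class="party">(?P<party>[^<]+)</td>\s*'
    r'<td class="share">(?P<share>[\d.]+)</td>'
)


def harvest(html):
    """Extract (party, share) tuples from the HTML text."""
    return [
        (m.group("party"), float(m.group("share")))
        for m in ROW_PATTERN.finditer(html)
    ]


rows = harvest(SAMPLE_HTML)
print(rows)
```

A list of tuples like rows can be handed to pandas, for instance with pandas.DataFrame(rows, columns=["party", "share"]), for storage and analysis. Regular expressions work well on regular, predictable markup like this sample; messier pages may call for a proper HTML parser.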

The lectures will be written in Jupyter notebooks which run in a web browser so that participants can play with the code as we go along. Example highlights include downloading data from Google Trends, RePEc, wahlrecht.de, LinkedIn, Yahoo Finance, etc.


Nikos Askitas

Head of IDSC