Day 2 - Classification of Political US Tweets

Introduction

Context: Presidential Debate 2020

Yes, we are going back in time to the 2020 US presidential debate, a time of lots of unhappy tweeting. It’s just too good a dataset and case to let go…

Data

  • Political tweets: https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/pol_tweets.gz, sourced from https://github.com/alexlitel/congresstweets. We’ve preprocessed the data a bit to make things easier. Labels: 1 = Democrat, 0 = Republican.

  • Tweets from around the time of the debate in October 2020 (8,000 tweets): https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/pres_debate_2020.gz

Both datasets are in JSON format.
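
As a starting point, here is a minimal sketch of reading one of the files once it has been downloaded locally. It uses only the Python standard library. The helper name `load_tweets` is made up for this example, and the two JSON layouts it handles (a list of records, or a column-oriented object keyed by row id) are assumptions about how the files are structured, so check the actual files before relying on it.

```python
import gzip
import json


def load_tweets(path):
    """Read a gzip-compressed JSON file and return a list of row dicts.

    Hypothetical helper: the supported layouts below are assumptions.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)

    # Layout 1 (assumed): already a list of records, one dict per tweet.
    if isinstance(data, list):
        return data

    # Layout 2 (assumed): column-oriented, e.g.
    # {"text": {"0": "...", "1": "..."}, "labels": {"0": 1, "1": 0}}
    columns = list(data)
    row_ids = list(data[columns[0]])
    return [{col: data[col][row_id] for col in columns} for row_id in row_ids]


# Usage, assuming the file was saved locally as "pol_tweets.gz":
# tweets = load_tweets("pol_tweets.gz")
# print(len(tweets), tweets[0])
```

If pandas is available, `pd.read_json(...)` on the same path or URL is likely the shorter route, since by default it infers gzip compression from the `.gz` extension.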

Tasks

  • Preprocess both datasets for NLP (supervised)
  • Start by building a classification model for the Congress tweets
  • Use a well-performing model to classify new data (tweets from the presidential debate)
  • Explore the different classes
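
The tasks above can be sketched end to end in a few lines. The example below is not the workshop's solution: it is a small pure-Python multinomial Naive Bayes classifier with add-one smoothing, chosen only because it runs without extra libraries. The function names, the tokenization rule, and the four training sentences are all invented for illustration; the real exercise would train on the Congress tweets and would more typically use a library such as scikit-learn.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9#@']+")


def tokenize(text):
    """Lowercase a tweet, drop URLs, and split it into word-like tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return TOKEN_RE.findall(text)


def train_naive_bayes(texts, labels):
    """Collect per-class document and token counts."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        token_counts[label].update(tokenize(text))
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    return class_counts, token_counts, vocabulary


def predict(model, text):
    """Return the class with the highest smoothed log-probability."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_tokens = sum(token_counts[label].values())
        for tok in tokenize(text):
            if tok in vocabulary:  # tokens never seen in training are skipped
                count = token_counts[label][tok]
                score += math.log((count + 1) / (total_tokens + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, using the label convention from the Data section
# (1 = Democrat, 0 = Republican). Not taken from the real dataset.
train_texts = [
    "We must expand healthcare and protect voting rights",
    "Climate action and healthcare for all families",
    "Secure the border and cut taxes",
    "Lower taxes and support our border patrol",
]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_texts, train_labels)
for new_tweet in ["healthcare is a right", "cut taxes now"]:
    print(new_tweet, "->", predict(model, new_tweet))
```

With the real data, the same three steps apply: tokenize both datasets the same way, fit and evaluate on a held-out part of the Congress tweets, then call the prediction step on the debate tweets and inspect which tokens dominate each predicted class.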

Schedule for the workshop

Time         Activity
11:40-12:00  Introduction to the context
12:00-13:00  Joint EDA and NLP refresher
13:15-14:45  Setting up the NLP workflow on Congress tweets
15:00-16:00  Predicting on new data, evaluation of results
16:15-17:00  Hand out Peergrade assignment, introduction to final project