This assignment is less structured than previous individual assignments.
You are given a collection of approximately 25k tweets that have been manually annotated by humans. The `class` column denotes: 0 - hate speech, 1 - offensive language, 2 - neither.
https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/twitter_hate.zip
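A minimal sketch for loading the data is shown below. Note that the exact file and column names inside the zip archive are assumptions and may need to be adjusted once you inspect the actual contents.

```python
# Sketch: load the annotated tweets directly from the zip archive.
# Assumption: the archive contains a single CSV with a "class" column (0/1/2).
import pandas as pd

url = "https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/twitter_hate.zip"

# pandas can read a zip archive directly when it contains a single CSV file
tweets = pd.read_csv(url, compression="zip")

print(tweets.shape)                     # expect roughly 25k rows
print(tweets["class"].value_counts())   # 0 = hate speech, 1 = offensive, 2 = neither
```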
Justify your preprocessing choices and explain possible alternatives (e.g. removing stopwords, identifying bi-/tri-grams, removing verbs, or using stemming or lemmatization).
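One possible preprocessing approach is sketched below. It is not the required solution; alternatives such as stemming instead of lemmatization, or keeping stopwords, are equally valid as long as you justify them. The `tweet` column name is an assumption about the dataset.

```python
# Sketch: lower-casing, stripping URLs/mentions, stopword removal, lemmatization.
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|@\w+|#", " ", text)   # strip URLs, mentions, hash signs
    tokens = re.findall(r"[a-z']+", text)         # keep simple word tokens
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

# Assumption: the raw text lives in a "tweet" column
tweets["clean"] = tweets["tweet"].apply(clean_tweet)
```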
Use the ML pipeline (learned in M1) to build a classification model that can identify offensive language and hate speech. It is not easy to get good results. Experiment with different models on the two types of text representations that you create in step 2.
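A minimal baseline pipeline is sketched below (TF-IDF features plus logistic regression, evaluated on a held-out test set). Treat it as a starting point only; swap in your second text representation (e.g. embeddings) and other classifiers to compare results.

```python
# Sketch: TF-IDF + logistic regression baseline with a stratified train/test split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    tweets["clean"], tweets["class"],
    test_size=0.2, random_state=42, stratify=tweets["class"],
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=5)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

pipe.fit(X_train, y_train)
print(classification_report(y_test, pipe.predict(X_test)))
```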
Bonus: Explore misclassified hate speech tweets versus those correctly predicted. Can you find specific patterns? Are some topics more prevalent in the tweets that the model identifies correctly?
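For the bonus task, a simple starting point (sketch, building on the hypothetical `pipe` baseline above) is to split the true hate speech tweets by whether the model predicted them correctly and inspect examples from each group:

```python
# Sketch: separate correctly and incorrectly classified hate speech tweets (class 0).
import pandas as pd

y_pred = pipe.predict(X_test)
results = pd.DataFrame({"tweet": X_test, "true": y_test, "pred": y_pred})

hate = results[results["true"] == 0]
missed = hate[hate["pred"] != 0]
correct = hate[hate["pred"] == 0]

print(f"{len(missed)} of {len(hate)} hate speech tweets misclassified")
print(missed["tweet"].head(10).to_list())   # inspect examples for patterns/topics
```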
The best-reported results for this dataset are:
| Class   | Precision |
|---------|-----------|
| 0       | 0.61      |
| 1       | 0.91      |
| 2       | 0.95      |
| Overall | 0.91      |
These results were achieved with advanced NLP feature engineering, so anything around an overall accuracy of 0.85 is fine. You will see that it is not easy to lift class 0 accuracy above 0.5.
Good Luck!
Submission: as a PDF (notebook including output).
Deadline: Friday 15.10.2021, 23:59:00, via Peergrade.io.