With the individual assignments, you have already performed most of the steps in a typical machine learning pipeline. You imported some data, cleaned it, and explored the variables and their relationships using summary statistics and visualisations. You also applied standard machine learning preprocessing procedures such as feature scaling and handling missing values.
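As a reminder, those preprocessing steps can be sketched in a few lines. This is a minimal illustration using scikit-learn on a hypothetical toy table (the column names and values are made up for demonstration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values, standing in for your imported dataset
df = pd.DataFrame({"age": [25.0, 32.0, np.nan, 41.0],
                   "income": [30_000.0, 45_000.0, 52_000.0, np.nan]})

# Fill missing values with each column's median
imputed = SimpleImputer(strategy="median").fit_transform(df)

# Standardise features to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)

print(scaled.mean(axis=0))  # each column now has mean ~0
```

The same two transformers generalise to most tabular datasets you might pick for this assignment.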
You practised unsupervised machine learning techniques for dimensionality reduction (e.g. PCA) and clustering (e.g. k-means) to discover latent relationships between features and groupings of observations. In the final workshop and online material, you then used supervised machine learning for regression and classification problems, creating models that predict an outcome of interest from input features.
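Both halves of that workflow fit in a short script. The sketch below assumes scikit-learn and uses the built-in Iris data purely as a stand-in for your own dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Unsupervised: project onto 2 principal components, then cluster the observations
X_2d = PCA(n_components=2).fit_transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Supervised: predict the known labels, evaluated on a held-out split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

In your own analysis you would of course discuss why each step is appropriate for your data rather than just running it.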
Now it is time to bring all these steps together and apply them to a setting you find interesting. This involves the following tasks.
In this exercise, you are asked to choose and obtain a dataset you consider interesting and appropriate for the required tasks. Some of you may already have ideas about interesting datasets. Many open datasets are available on the internet (e.g. Kaggle, individual projects such as Stanford Open Policing, or some of the DataCamp project datasets); here is a recent list of open data repositories for inspiration.
If you instead want to collect your own data (e.g. by scraping Twitter or other platforms), we will not hold you back. However, consider the timeframe.
The data should fulfill the following minimum requirements:
The analysis to be carried out by you has to contain elements of data manipulation, exploration, unsupervised and supervised ML.
Generally, you can combine parts from the individual assignments and use them as a template for the module assignment. Going beyond that is not required (but certainly appreciated). Below is a (rather detailed) checklist to make sure you have all the pieces.
Many of the steps are optional, so choose the methods you deem helpful and relevant for exploring your chosen problem.
Note: Quality > Quantity. Consider which analyses, summaries, and visualisations add value. Excessive and unselective output (e.g. running 20 different models without giving a reason, or producing every possible plot without discussing and evaluating the insights gained from them) will not be considered helpful but rather distracting.
You are asked to hand in two different report formats, namely:
The notebook targets a machine-learning literate audience. Here you can go deeper into the technical details and method considerations. Provide thorough documentation of the whole process and the methods used. Describe the intuition behind the selected methods, justify the choices made, and interpret the results (e.g. Why scaling? Why split the data? Why these tabulations and visualisations? What can be seen from …? How did you select a particular algorithm? Why did you scale features one way rather than another?).
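One choice worth justifying explicitly in the notebook is where scaling happens relative to the train/test split. A minimal sketch, assuming scikit-learn and its built-in breast-cancer data as a placeholder for your own:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training split only, so no
# information from the test set leaks into preprocessing
model = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")
```

A sentence in the notebook explaining this leakage argument is exactly the kind of justification the checklist asks for.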
Please provide the notebook as a PDF (knitted from Rmd or converted from ipynb) together with a public link to a functional Colab version (test it beforehand in your browser's incognito/private mode).
The stakeholder report (a simple PDF, no code) summarises the analysis for a non-technical audience. Here you do not need to discuss alternative approaches to standardisation and the like.
Instead, you should explain the analysis and results, emphasising their meaning and interpretation. Aim for a length of no more than 5 pages, including tables and visualisations.