Assignment 2 - UML with Pokemon
Description
This time you will work with Pokemon data. No data munging needed. Just old-school (U)ML.
Submission
Submit as a PDF (notebook and output).
Deadline: Monday 20.09.2021, 23:59:00, via Peergrade.io
Data
The data is available at the URL https://sds-aau.github.io/SDS-master/00_data/pokemon.csv and contains data on 800 Pokemon from the 1st to the 6th generation.
Tasks
- Give a brief overview of the data: which variables are there, how are they scaled, and how much do the data columns vary?
- Run a PCA on all numerical variables in the dataset. Hint: Don’t forget to scale them first. Use 4 components. What is the cumulative explained variance ratio? Hint: I am not sure this terminology and code were introduced during class, but try to look into cumulative explained variance and the sklearn package and see if you can figure out the code needed.
- Use a different dimensionality reduction method (e.g. UMAP or NMF). Do the findings differ?
- Perform a cluster analysis (KMeans) on all numerical variables (scaled, before PCA). Pick a realistic number of clusters; the choice is up to you, but the large clusters should remain mostly stable.
- Visualize the first 2 principal components and color the datapoints by cluster.
- Inspect the distribution of the variable Type1 across clusters. Does the algorithm separate the different types of Pokemon?
- Perform a cluster analysis on all numerical variables, scaled and AFTER dimensionality reduction, and visualize the first 2 principal components.
- Again, inspect the distribution of the variable “Type 1” across clusters. Does it differ from the distribution before dimensionality reduction?
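Note on terminology: the cumulative explained variance ratio asked for in the PCA task can be written as follows. This is a sketch of the usual definition, where the symbols $\lambda_i$ (the variance captured by the $i$-th principal component, sorted in decreasing order), $p$ (the number of scaled numerical variables) and $k$ (the number of components kept, here 4) are notation introduced only for this note.

$$
r_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j},
\qquad
R_k = \sum_{i=1}^{k} r_i
$$

In scikit-learn, the per-component ratios $r_i$ are exposed as the `explained_variance_ratio_` attribute of a fitted `PCA` object, so $R_k$ is their running sum.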
Solutions
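One possible starting point, not a reference solution: the sketch below covers scaling, PCA with the cumulative explained variance ratio, KMeans, and a cross-tabulation of type against cluster. The function names, the choice of 3 clusters, and the random seed are assumptions made for illustration, and the column name `Type1` is taken from the task text and may need adjusting to the actual CSV header.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def scale_numeric(df):
    """Standardize every numeric column to zero mean and unit variance.

    Note: this picks up *all* numeric columns, including any ID-like
    column, so check the selection against the data overview first.
    """
    numeric = df.select_dtypes(include="number")
    return StandardScaler().fit_transform(numeric)


def pca_summary(X_scaled, n_components=4):
    """Return the component scores and the cumulative explained variance ratio."""
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return components, cumulative


def cluster_labels(X, n_clusters=3, seed=42):
    """Fit KMeans and return one cluster label per row.

    n_clusters=3 and seed=42 are arbitrary placeholders, not recommendations.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(X)


def type_by_cluster(df, labels, type_col="Type1"):
    """Count how many Pokemon of each type fall into each cluster."""
    return pd.crosstab(df[type_col], labels,
                       rownames=[type_col], colnames=["cluster"])
```

A typical use would be to read the CSV with `pd.read_csv`, call `scale_numeric`, then pass the scaled matrix to `cluster_labels` for the clustering before PCA, and pass the output of `pca_summary` to `cluster_labels` for the clustering after dimensionality reduction. The two resulting `type_by_cluster` tables can then be compared, and the first two columns of the component scores can be plotted with `matplotlib.pyplot.scatter`, colored by label.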