Project 4: K-means Clustering with Zoo Dataset

Introduce the Problem

For this project, we aim to categorize different animals based on their characteristics using clustering. The goal is to identify distinct groups or clusters within the animal kingdom. We seek to answer questions such as: What natural groupings exist among these animals? How can clustering help in understanding animal traits and their relationships?

What is Clustering and How Does It Work?

Clustering is an unsupervised machine learning technique that groups data points into clusters based on their similarities. The K-means algorithm is a popular clustering method, where "K" is the number of clusters predefined by the user. The algorithm works as follows:

Initialization: Select K initial centroids at random.
Assignment: Assign each data point to its nearest centroid, forming K clusters.
Update: Recompute each centroid as the mean of the data points assigned to it.
Iteration: Repeat the assignment and update steps until the centroids no longer change significantly.
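
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in this project: the function name kmeans, its parameters, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Iteration: stop once the centroids have effectively stopped moving.
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

# Toy data (made up for illustration): two well-separated blobs in 2D.
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centroids = kmeans(X_toy, k=2)
print(labels)
print(centroids)
```

In practice a library implementation such as scikit-learn's KMeans is preferable, since it adds smarter initialization and multiple restarts.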

Introduce the Data

The dataset used is titled “Zoo Animals Classification” from Kaggle, which can be found here. The dataset includes various features of animals such as:

animal_name: The name of the animal.
hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic, catsize: Binary (0 or 1) attributes that describe the animal.
legs: The number of legs the animal has.
class_type: The animal's class label, an integer from 1 to 7.

Data Understanding/Visualization

We start with a quick examination of the dataset and then plot a correlation heatmap of the features. The heatmap shows how strongly pairs of features are related, which helps us anticipate how they might influence our clustering model.
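
As a rough sketch of what sits behind such a heatmap, the snippet below computes a pairwise correlation matrix with pandas. The six rows are a made-up stand-in using a few of the dataset's column names, not the real data, and the assumption that the heatmap shows Pearson correlations is ours.

```python
import pandas as pd

# Hypothetical stand-in rows (not the real Zoo data), four of the binary features.
sample = pd.DataFrame(
    {
        "hair":     [1, 1, 0, 0, 0, 1],
        "feathers": [0, 0, 1, 1, 0, 0],
        "eggs":     [0, 0, 1, 1, 1, 1],
        "milk":     [1, 1, 0, 0, 0, 1],
    },
    index=["mammal_a", "mammal_b", "bird_a", "bird_b", "fish_a", "egg_laying_mammal"],
)

# Pearson correlation between every pair of columns; values lie in [-1, 1].
corr = sample.corr()
print(corr.round(2))

# The matrix can then be handed to a plotting function, for example
# seaborn.heatmap(corr, annot=True), to draw the heatmap itself.
```

In this toy sample, hair and milk move together while milk and eggs tend to be opposed, which is the kind of pattern a heatmap makes visible at a glance.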

Pre-processing the Data

Pre-processing is crucial to ensure our data is clean and ready for modeling:

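Remove Non-Numeric Columns: Drop the animal_name column, a text identifier that K-means cannot use.
Standardize the Data: Use StandardScaler to rescale each feature to zero mean and unit variance, so that features with a larger range, such as legs, do not dominate the distance calculations.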
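
A minimal sketch of these two steps, assuming the data is held in a pandas DataFrame. The four-row frame below is a hypothetical stand-in with only two feature columns; the variable names zoo, features, and X_scaled are ours.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the loaded dataset (illustrative values only).
zoo = pd.DataFrame({
    "animal_name": ["animal_a", "animal_b", "animal_c", "animal_d"],
    "hair": [1, 0, 0, 1],
    "legs": [4, 0, 2, 6],
})

# Step 1: drop the text identifier so only numeric columns remain.
features = zoo.drop(columns=["animal_name"])

# Step 2: rescale every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(features)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

Without scaling, the legs column (which ranges over several units) would contribute far more to Euclidean distances than the 0/1 columns.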

Modeling (Clustering)

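We use the K-means algorithm to cluster the animals. The elbow method helps choose the number of clusters: we plot the inertia (the sum of squared distances from each point to its nearest centroid) against K and look for the "elbow", the point beyond which adding more clusters no longer reduces the inertia substantially.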
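
The elbow computation might look like the sketch below. It runs on synthetic data from make_blobs rather than the Zoo dataset, and the range of K values, n_init, and random_state are illustrative choices, not the settings used in this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the Zoo dataset): 150 points around 3 centers.
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

# Fit one model per candidate K and record its inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting these values against K gives the elbow curve; inertia always shrinks as K grows, so the useful signal is where the drop flattens out, not the minimum.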

We then fit the K-means model with the chosen number of clusters.

Storytelling (Clustering Analysis)

Using PCA for dimensionality reduction, we visualize the clusters. The clusters reveal patterns and similarities among the animals, answering our initial question of how animals can be grouped based on their traits.
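
As a hedged sketch of this step, the snippet below fits K-means and projects the points onto two principal components. The data is synthetic, and both the choice of 3 clusters and the use of 2 components are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Zoo dataset).
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-means with an illustrative number of clusters.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Project the scaled features onto the first two principal components.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

# coords[:, 0] and coords[:, 1] can be scatter-plotted, colored by cluster_labels.
print(coords.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratio is worth checking: if the two components capture only a small share of the variance, clusters that look overlapping in the plot may still be well separated in the full feature space.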

The silhouette score, which ranges from -1 to 1, is 0.3997. This indicates moderate clustering performance: the clusters are distinguishable but not sharply separated.
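
For reference, a silhouette score can be computed with scikit-learn's silhouette_score, as in the sketch below. It uses synthetic data and compares several values of K, so the numbers it prints are unrelated to the 0.3997 reported above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (not the Zoo dataset).
X_demo, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

# Mean silhouette over all points, for several candidate values of K.
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels_k)

for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
```

Comparing the score across values of K offers a second opinion on the choice suggested by the elbow plot.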

Impact Section

The project’s impact spans biological classification and conservation efforts. By understanding animal groupings, researchers and zoologists can better design conservation strategies, identify species at risk, and promote biodiversity.

Code

The full code for this project can be found here.

References

Kaggle Dataset: Zoo Animals Classification
Sklearn Documentation
Microsoft Copilot: Used to help understand syntax and interpret the data
Class Material