ML Starter — Titanic, Penguins & Iris


Welcome to the ML Starter series, a progressive set of tutorials for beginners. We’ll start with the basics of machine learning and gradually introduce more complex concepts using popular datasets.

All the datasets are available on Kaggle, but they are easier to load through seaborn and scikit-learn.
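As a minimal sketch of loading the data without going through Kaggle, the snippet below reads Iris from scikit-learn's bundled copy and wraps seaborn's `load_dataset` for the other datasets. The helper names `load_iris_frame` and `load_seaborn_frame` are made up for this illustration and are not part of the series' code.

```python
from sklearn.datasets import load_iris


def load_iris_frame():
    """Return the Iris data bundled with scikit-learn as one DataFrame."""
    bunch = load_iris(as_frame=True)
    frame = bunch.frame.copy()  # four feature columns plus an integer "target"
    # Map the integer labels to species names for readability.
    frame["species"] = frame["target"].map(dict(enumerate(bunch.target_names)))
    return frame


def load_seaborn_frame(name):
    """Fetch a seaborn example dataset such as "penguins" or "titanic".

    seaborn downloads these files on first use, so this needs network access.
    """
    import seaborn as sns  # imported lazily so the Iris loader works without it

    return sns.load_dataset(name)


iris = load_iris_frame()
print(iris.shape)
print(iris["species"].value_counts())
```

Calling `load_seaborn_frame("penguins")` or `load_seaborn_frame("titanic")` should return a pandas DataFrame in the same way, provided seaborn is installed and the machine is online.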

Getting Started

To get started, make sure you have the following libraries installed:

pip install pandas numpy scikit-learn seaborn matplotlib

You can use Jupyter Notebook or any Python IDE of your choice. The code is structured to run in a Jupyter Notebook environment, but you can adapt it to a plain Python script.

Note: The code includes some optional utility functions to help with data exploration and visualization; feel free to use them or not. You can also check the generic pipelines in this notebook here. It provides a generic pipeline for data preprocessing, model training, and evaluation using scikit-learn, and it can be adapted to any dataset. In src/utils.py you will find the utility functions, along with examples of how to write classes compatible with sklearn pipelines (you need a basic understanding of inheritance in Python to follow this part).
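To give an idea of what a class compatible with sklearn pipelines can look like, here is a small sketch. It is not taken from src/utils.py: the `ColumnDropper` class, the column names, and the toy data are all invented for illustration. The class inherits from scikit-learn's `TransformerMixin` and `BaseEstimator`, which is where the notion of inheritance comes in.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that drops the named DataFrame columns."""

    def __init__(self, columns=None):
        # Store constructor arguments unchanged so get_params and clone work.
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing is learned from the data; fit only has to return self.
        return self

    def transform(self, X):
        return X.drop(columns=list(self.columns or []))


# Toy data (invented): "row_id" carries no signal and should be dropped.
X = pd.DataFrame(
    {
        "row_id": [1, 2, 3, 4, 5, 6],
        "x1": [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],
        "x2": [1.0, 0.9, 1.1, 0.2, 0.1, 0.3],
    }
)
y = [0, 0, 0, 1, 1, 1]

pipe = Pipeline(
    [
        ("drop", ColumnDropper(columns=["row_id"])),
        ("model", LogisticRegression()),
    ]
)
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions)
```

`BaseEstimator` supplies `get_params` and `set_params`, and `TransformerMixin` supplies `fit_transform`, so the class only needs `fit` and `transform` to slot into a `Pipeline`.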

Iris Classification (Beginner)

The Iris dataset is a classic introduction to classification. In this section, we’ll:

The notebook for this task is available here.
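For readers who want a feel for the task before opening the notebook, the following is a minimal sketch of an Iris classifier. The choice of logistic regression, the 75/25 split, and the accuracy metric are assumptions made for this example; the notebook may do things differently.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features are the four flower measurements; labels are the three species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows, keeping class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the model on rows it has not seen during training.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```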

Penguins Classification (Intermediate)

The Palmer Penguins dataset adds complexity: it contains missing values and categorical features.

We’ll cover:

  1. Handling missing values
  2. Encoding features
  3. Comparing multiple models
  4. Evaluating model performance
  5. An introduction to cross-validation (to compare models)
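The steps above can be sketched in one scikit-learn pipeline. To keep the example runnable offline, it uses a small synthetic table rather than the real Penguins data: the values are randomly generated, the column names are only chosen to resemble the Penguins columns, and the two models compared are an arbitrary pick for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real Penguins dataset).
rng = np.random.default_rng(0)
n = 120
y = np.repeat(["species_a", "species_b", "species_c"], n // 3)
offset = np.repeat([0.0, 6.0, 12.0], n // 3)
X = pd.DataFrame(
    {
        "bill_length_mm": 40 + offset + rng.normal(0, 2, n),
        "flipper_length_mm": 190 + 2 * offset + rng.normal(0, 5, n),
        "island": rng.choice(["north", "south"], n),
    }
)
# Blank out a few cells so there is something to impute.
X.loc[rng.choice(n, 10, replace=False), "bill_length_mm"] = np.nan
X.loc[rng.choice(n, 10, replace=False), "island"] = np.nan

# Steps 1 and 2: impute missing values, then scale numbers and one-hot encode categories.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
categorical = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["bill_length_mm", "flipper_length_mm"]),
        ("cat", categorical, ["island"]),
    ]
)

# Steps 3 to 5: compare models by their mean 5-fold cross-validated accuracy.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the preprocessing inside the pipeline means the imputer, scaler, and encoder are refit on each training fold, so no information from the validation fold leaks into the comparison.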

The notebook for this task is available here.

Titanic Classification (Advanced)

Predict survival on the Titanic. This dataset is more complex: it mixes numerical and categorical features, contains missing values, and requires more advanced techniques.

The notebook for this task is available here.

Regression Task (Coming Soon)

Stay tuned for regression tasks using datasets like Boston Housing or California Housing. We’ll cover:

All notebooks and data are available on GitHub. Happy learning!