ML Starter — Titanic, Penguins & Iris

Welcome to the ML Starter series, a progressive set of tutorials for beginners. We’ll start with the basics of machine learning and gradually introduce more complex concepts using popular datasets.

All the datasets are available on Kaggle, but they are easier to load directly through seaborn and scikit-learn.

Disclaimer: This is not a comprehensive course, but rather a series of notebooks to get you started with practical machine learning tasks.
Disclaimer #2: For most of these datasets, a naive method can achieve good accuracy. Remember that the best model is the one that solves the problem (the simpler the better), not the most complex one.
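To make the second disclaimer concrete, here is a minimal sketch that compares a hand-written threshold rule with a fitted model on Iris. The thresholds (0.8 cm and 1.75 cm on petal width) and the function name `petal_width_rule` are illustrative choices for this example, not something taken from the notebooks. The rule is scored on the full dataset while the model is cross-validated, so treat the comparison as rough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
X, y = iris.data, iris.target
PETAL_WIDTH = 3  # column index of "petal width (cm)" in iris.data


def petal_width_rule(X):
    """Naive classifier: two hand-picked thresholds on a single feature."""
    width = X[:, PETAL_WIDTH]
    # 0 = setosa, 1 = versicolor, 2 = virginica
    return np.where(width < 0.8, 0, np.where(width < 1.75, 1, 2))


# Share of flowers the hand-written rule labels correctly.
rule_accuracy = float(np.mean(petal_width_rule(X) == y))

# Reference point: a standard model scored with 5-fold cross-validation.
model = LogisticRegression(max_iter=1000)
model_accuracy = float(cross_val_score(model, X, y, cv=5).mean())

print(f"Threshold rule accuracy:      {rule_accuracy:.3f}")
print(f"Logistic regression accuracy: {model_accuracy:.3f}")
```

If the two numbers are close, the simpler rule is a perfectly reasonable answer to the problem.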

Getting Started

To get started, make sure you have the following libraries installed:

pip install pandas numpy scikit-learn seaborn matplotlib

You can use Jupyter Notebook, or any Python IDE of your choice. The code is structured to be run in a Jupyter Notebook environment, but you can adapt it to any Python script.
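Before opening the notebooks, it can help to confirm that the libraries are importable. The sketch below is one possible check; the helper name `installed_versions` is made up for this example and is not part of the repository.

```python
import importlib

# Import names; note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "seaborn", "matplotlib"]


def installed_versions(modules=REQUIRED_MODULES):
    """Return {module name: version string, or None if the import fails}."""
    versions = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    status = version if version is not None else "MISSING (run the pip command above)"
    print(f"{name:<12} {status}")
```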

Note: The code includes some utility functions to help with data exploration and visualization. Feel free to use them or not; they are not mandatory. You can also check the generic pipelines notebook here. It provides a generic pipeline for data preprocessing, model training, and evaluation using scikit-learn, and it can be adapted to any dataset. In src/utils.py you will find utility functions and examples of how to write classes compatible with scikit-learn pipelines (understanding that part requires some familiarity with inheritance in Python).
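As an illustration of what a pipeline-compatible class can look like, here is a minimal sketch. `QuantileClipper` is a hypothetical transformer written for this README and is not the content of src/utils.py. It inherits from scikit-learn's `TransformerMixin` and `BaseEstimator`, which supply `fit_transform`, `get_params`, and `set_params`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class QuantileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical example transformer: clip each column to learned quantiles."""

    def __init__(self, lower=0.01, upper=0.99):
        # Only store the arguments here; scikit-learn clones estimators
        # by reading them back through get_params().
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by scikit-learn convention.
        self.lower_bounds_ = np.nanquantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.nanquantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Values outside the learned bounds are pulled back to the bounds.
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)


X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # made-up data with one outlier
pipeline = Pipeline([
    ("clip", QuantileClipper(lower=0.0, upper=0.75)),
    ("scale", StandardScaler()),
])
scaled = pipeline.fit_transform(X)
print(pipeline.named_steps["clip"].upper_bounds_)
print(scaled.ravel())
```

Because the bounds are learned in `fit` and applied in `transform`, the same clipping is reused on new data without looking at it again.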

Iris Classification (Beginner)

The Iris dataset is a classic introduction to classification. In this section, we’ll:

Load the Iris dataset
Explore feature distributions
Train a simple classifier
Evaluate model performance (accuracy, confusion matrix, etc.)
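The four steps above can be sketched as follows. This is a minimal example and not necessarily what the notebook does: the choice of k-nearest neighbors, the 25% test split, and the random seed are assumptions made here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the dataset as pandas objects.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# 2. Explore feature distributions with summary statistics.
print(X.describe().round(2))

# 3. Train a simple classifier on a stratified train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# 4. Evaluate on the held-out test set.
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted
print(f"Accuracy: {accuracy:.3f}")
print(matrix)
```

Off-diagonal entries in the confusion matrix show which species get mixed up, which accuracy alone does not reveal.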

The notebook for this task is available here.

Penguins Classification (Intermediate)

The Palmer Penguins dataset adds complexity: it contains missing values and categorical features.

We’ll cover:

Handling missing values
Feature encoding
Comparing multiple models
Evaluating model performance
An introduction to cross-validation (to compare models)
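One way to combine these steps is a scikit-learn pipeline that imputes, encodes, and then cross-validates each model. The sketch below is an assumption-laden example, not the notebook's code: the helper names, the two candidate models, and the imputation strategies are choices made here. The column names match seaborn's penguins dataset, but the `toy` frame is synthetic data generated only so the example runs offline; on the real data you would pass `sns.load_dataset("penguins")` to `compare_models` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
CATEGORICAL = ["island", "sex"]
TARGET = "species"


def make_preprocessor():
    """Impute then scale numeric columns; impute then one-hot encode categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, NUMERIC),
        ("cat", categorical, CATEGORICAL),
    ])


def compare_models(df, cv=5):
    """Return the mean cross-validated accuracy of each candidate model."""
    X, y = df[NUMERIC + CATEGORICAL], df[TARGET]
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    scores = {}
    for name, model in candidates.items():
        # Preprocessing lives inside the pipeline, so it is refit on every
        # training fold and never sees the validation fold.
        pipeline = Pipeline([("prep", make_preprocessor()), ("model", model)])
        scores[name] = float(cross_val_score(pipeline, X, y, cv=cv).mean())
    return scores


# Synthetic stand-in data (NOT the real measurements), 30 rows per species.
rng = np.random.default_rng(0)
n = 30
frames = []
for species, bill, flipper in [("Adelie", 39, 190), ("Chinstrap", 49, 196), ("Gentoo", 47, 217)]:
    frames.append(pd.DataFrame({
        "species": species,
        "island": rng.choice(["Biscoe", "Dream"], size=n),
        "bill_length_mm": rng.normal(bill, 2.5, size=n),
        "bill_depth_mm": rng.normal(17, 1.5, size=n),
        "flipper_length_mm": rng.normal(flipper, 6, size=n),
        "body_mass_g": rng.normal(4200, 500, size=n),
        "sex": rng.choice(["Male", "Female"], size=n),
    }))
toy = pd.concat(frames, ignore_index=True)
toy.loc[::10, "bill_length_mm"] = np.nan  # inject missing numeric values
toy.loc[5::15, "sex"] = np.nan            # inject missing categorical values

scores = compare_models(toy)
for name, score in scores.items():
    print(f"{name:<20} {score:.3f}")
```

Keeping the imputers and encoder inside the pipeline is the design choice that matters most here: it prevents information from the validation folds leaking into preprocessing.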

The notebook for this task is available here.

Titanic Classification (Advanced)

Predict survival on the Titanic. This dataset is more complex, with a mix of numerical and categorical features and missing values, and it requires more advanced techniques:

Advanced feature engineering (titles, family size)
Cross-validation
Ensemble methods
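A minimal sketch of the feature engineering and ensemble ideas listed above. It assumes the Kaggle column names (`Name`, `SibSp`, `Parch`); note that seaborn's copy of the dataset has no name column, so title extraction needs the Kaggle file. The helper names, the list of common titles, and the two models in the ensemble are assumptions made for this example, and the passengers below are invented.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}


def add_engineered_features(df):
    """Return a copy of df with Title and FamilySize columns added.

    Assumes Kaggle-style columns: Name ("Surname, Title. First names"),
    SibSp (siblings/spouses aboard) and Parch (parents/children aboard).
    """
    out = df.copy()
    # The title is the text between the comma and the first following period.
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Group uncommon titles so the encoder does not see near-empty categories.
    out["Title"] = title.where(title.isin(COMMON_TITLES), "Rare")
    # Count the passenger plus the relatives travelling with them.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out


def make_ensemble():
    """Soft-voting ensemble that averages predicted class probabilities."""
    return VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
        ],
        voting="soft",
    )


# Made-up passengers in the Kaggle name format, for illustration only.
passengers = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane (Mary Smith)", "Roe, Master. Tim", "Poe, Dr. Alex"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
})
features = add_engineered_features(passengers)
print(features[["Title", "FamilySize"]])
```

On the real data, the object returned by `make_ensemble()` could be placed at the end of a preprocessing pipeline and scored with `cross_val_score`, once `Title` and the other categorical columns are encoded.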

The notebook for this task is available here.

Regression Task (Coming Soon)

Stay tuned for regression tasks using datasets like Boston Housing or California Housing. We’ll cover:

Data preprocessing
Model training
Evaluation metrics and a bit more complexity.
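Until those notebooks are published, here is a rough sketch of the same three steps. It swaps in scikit-learn's bundled diabetes dataset as a stand-in, because it loads without a download; the dataset, the Ridge model, and the split settings are assumptions made for this example and not the planned notebook content.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in regression dataset bundled with scikit-learn.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Data preprocessing and model training in one pipeline.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation metrics on the held-out test set.
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

MAE and RMSE are in the units of the target, while R^2 is unitless, so they answer slightly different questions about the same predictions.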

All notebooks and data are available on GitHub. Happy learning!