Splitting Data: Train vs Test
In machine learning, we split datasets into training and testing sets to evaluate how well a model generalizes to unseen data.
- Training set – Used by the model to learn patterns.
- Testing set – Used to check performance on data the model has never seen before.
If we evaluate on the same data the model trained on, overfitting goes unnoticed: the model may have memorized the data instead of learning general rules.
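As a rough illustration of why a held-out set matters, the sketch below fits an unconstrained decision tree to the Iris data and scores it on both the rows it trained on and the rows it never saw. The choice of DecisionTreeClassifier, the 70/30 split, and the variable names are assumptions made for this example, and it assumes scikit-learn is already installed. The training score tends to be near perfect because the tree can memorize its inputs; the held-out score is the more honest estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unconstrained tree can keep growing until it fits the training rows closely.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print("Accuracy on training data:", train_acc)
print("Accuracy on test data:", test_acc)
```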
Using train_test_split in Scikit-learn
train_test_split() randomly divides data into training and test sets.
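Conceptually, a random split amounts to shuffling the row indices and cutting the shuffled list in two. The sketch below shows that idea using only the Python standard library. `simple_split` is a hypothetical helper written for illustration; it is not how scikit-learn implements train_test_split().

```python
import random

def simple_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and cut them into train and test parts (illustrative only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded shuffle, so the result is repeatable
    n_test = round(len(rows) * test_fraction)   # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(10))  # stand-in for 10 data rows
train_rows, test_rows = simple_split(rows, test_fraction=0.2)

print("Train rows:", train_rows)
print("Test rows:", test_rows)
```

Every row lands in exactly one of the two parts, which is the property that keeps the test set unseen during training.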
Basic Train-Test Split
    # Install scikit-learn in Jupyter Lite
    import piplite
    await piplite.install('scikit-learn')

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load dataset
    iris = load_iris()
    X, y = iris.data, iris.target

    # Split into train (80%) and test (20%)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print("Train size:", X_train.shape)
    print("Test size:", X_test.shape)
Controlling Randomness
The random_state parameter ensures reproducibility. Without it, every run may split the data differently.
Fixed Random State
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=123
    )

    print("Train size:", X_train.shape)
    print("Test size:", X_test.shape)
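One way to see the effect of random_state is to split the same data several times and compare the results. The sketch below is an illustration on a small made-up array (the data and the variable names are assumptions for this example): two calls with the same seed return identical test sets, while a different seed usually picks different rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 made-up rows with 2 features each
y = np.arange(20)                 # one made-up label per row

# Two calls with the same seed produce the same partition.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=123)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=123)
same_seed_match = np.array_equal(X_test_a, X_test_b)

# A different seed usually selects different rows for the test set.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.3, random_state=7)
different_seed_match = np.array_equal(X_test_a, X_test_c)

print("Same seed, same test rows:", same_seed_match)
print("Different seed, same test rows:", different_seed_match)
```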
Stratified Splits
For classification tasks, use stratify=y to maintain class proportions.
Stratified Split
    import numpy as np

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )

    # Check distribution
    unique_train, counts_train = np.unique(y_train, return_counts=True)
    unique_test, counts_test = np.unique(y_test, return_counts=True)

    print("Train distribution:", dict(zip(unique_train, counts_train)))
    print("Test distribution:", dict(zip(unique_test, counts_test)))
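Stratification matters most when classes are imbalanced. Iris has 50 samples per class, so the effect is subtle there. As a hedged illustration, the sketch below uses made-up labels (90 samples of class 0, 10 of class 1) and compares the test-set class counts with and without stratify. The data and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Plain random split: the minority count in the test set depends on the shuffle.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the test set keeps the 90/10 class ratio.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

plain_counts = np.bincount(y_test_plain, minlength=2)
strat_counts = np.bincount(y_test_strat, minlength=2)

print("Plain test counts [class 0, class 1]:", plain_counts)
print("Stratified test counts [class 0, class 1]:", strat_counts)
```

With stratify=y, the 20-row test set contains 18 samples of class 0 and 2 of class 1, matching the original 90/10 ratio. Without it, the minority count can drift in either direction.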
Key Takeaways
- Always split data before training so that overfitting can be detected.
- train_test_split() is a common, flexible way to create the split.
- Use stratify=y for classification tasks to preserve label proportions.
- Fix random_state for reproducibility.
What’s Next?
In the next lesson, we’ll explore the ML Workflow and Model Lifecycle.
Quiz
What is the primary reason for splitting a dataset into training and testing sets in machine learning?
- To reduce the size of the dataset
- To ensure the dataset is balanced
- To evaluate how well a model generalizes to unseen data
- To increase the complexity of the model