# Splitting Data: Train vs Test

 

In machine learning, we **split datasets** into *training* and *testing* sets to evaluate how well a model generalizes to unseen data.

- *Training set* – Used by the model to learn patterns. 
- *Testing set* – Used to check performance on data the model has never seen before.

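If we evaluate a model on the same data it was trained on, we cannot tell whether it has **overfit**, that is, memorized the training examples instead of learning general rules. A held-out test set exposes the problem: an overfit model scores well on the training data but noticeably worse on the test data.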
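To make the gap between training and test performance concrete, here is a minimal sketch. It assumes a `DecisionTreeClassifier` with default settings as the model (a choice made for illustration, not one the lesson prescribes) and uses `train_test_split`, which is covered in the next section. In Jupyter Lite, scikit-learn has to be installed first.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumption: an unconstrained decision tree, which can fit
# the training rows very closely
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on data the model has already seen vs. data it has not
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

The training score alone tends to look optimistic; the test score is the one that estimates performance on new data. On a small, clean dataset such as Iris the two numbers may be close, while on noisier data the gap is usually larger.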

 

## Using `train_test_split` in Scikit-learn

`train_test_split()` randomly divides data into training and test sets.

```python title="Basic Train-Test Split"
# Install scikit-learn in Jupyter Lite
# (skip the next two lines in a regular Python environment, where piplite is not available)
import piplite
await piplite.install('scikit-learn')

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
```
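As a small follow-up sketch, `test_size` can be given either as a fraction of the dataset (a float between 0 and 1) or as an absolute number of samples (an integer). The example below, which reloads Iris so it runs on its own, shows that both forms yield the same sizes for this 150-row dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# test_size as a fraction: 20% of 150 rows -> 30 test rows
X_train_frac, X_test_frac, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# test_size as an absolute count: exactly 30 test rows
X_train_abs, X_test_abs, _, _ = train_test_split(
    X, y, test_size=30, random_state=42
)

print("Fraction:", X_train_frac.shape, X_test_frac.shape)
print("Count:   ", X_train_abs.shape, X_test_abs.shape)
```

A fraction is usually more convenient because it keeps working when the dataset grows; an absolute count is handy when a fixed-size evaluation set is required.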

 

## Controlling Randomness

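Passing an integer as `random_state` seeds the shuffle, so the same split is produced on every run. With the default (`random_state=None`), each run may split the data differently, which makes results harder to compare.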

```python title="Fixed Random State"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)
```
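One way to see the effect of `random_state` is to split an array of row indices rather than the data itself. The sketch below is illustrative: the variable names and the second seed value (`7`) are arbitrary choices, not part of the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices standing in for a 150-row dataset
indices = np.arange(150)

# Same seed twice, then a different seed
_, test_a = train_test_split(indices, test_size=0.3, random_state=123)
_, test_b = train_test_split(indices, test_size=0.3, random_state=123)
_, test_c = train_test_split(indices, test_size=0.3, random_state=7)

same_seed_match = np.array_equal(test_a, test_b)
diff_seed_match = np.array_equal(test_a, test_c)

print("Same seed, same test rows:     ", same_seed_match)
print("Different seed, same test rows:", diff_seed_match)
```

Fixing the seed is useful when comparing models, because every model is then scored on exactly the same test rows.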

 

## Stratified Splits

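For classification tasks, pass `stratify=y` so that the training and test sets keep roughly the same class proportions as the full dataset. This matters most when the dataset is small or the classes are imbalanced.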

```python title="Stratified Split"
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Count how many samples of each class landed in each set
unique_train, counts_train = np.unique(y_train, return_counts=True)
unique_test, counts_test = np.unique(y_test, return_counts=True)

print("Train distribution:", dict(zip(unique_train, counts_train)))
print("Test distribution:", dict(zip(unique_test, counts_test)))
```
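Iris has equally sized classes, so the benefit of stratification is easier to see on imbalanced labels. The following sketch uses a hypothetical toy dataset (90 samples of class 0 and 10 of class 1, invented purely for illustration) and compares a plain split with a stratified one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Plain random split: the minority share in the test set can drift
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified split: the test set keeps the 90/10 ratio
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

plain_counts = np.bincount(y_test_plain, minlength=2)
strat_counts = np.bincount(y_test_strat, minlength=2)

print("Plain test counts (class 0, class 1):     ", plain_counts)
print("Stratified test counts (class 0, class 1):", strat_counts)
```

With stratification, the 20-sample test set contains 18 samples of class 0 and 2 of class 1, matching the overall 90/10 ratio. Without it, the minority count depends on the shuffle and can be higher or lower.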

 

## Key Takeaways

* Always **split** data before training so you can detect overfitting on data the model has not seen.
* `train_test_split()` is a quick, commonly used way to create a single random hold-out split.
* Use `stratify=y` for classification tasks to preserve label proportions.
* Fix `random_state` for reproducibility.

 

## What’s Next?

In the next lesson, we’ll explore the **ML Workflow and Model Lifecycle**.
