Model Selection and Cross-Validation
Choosing the right machine learning model is one of the most important steps in any ML project.
Even when two models achieve similar training performance, their ability to generalize to unseen data can differ substantially.
Why Model Selection Matters
- Avoid Overfitting – Some models perform exceptionally well on training data but fail on new data.
- Balance Accuracy and Complexity – A simpler model might generalize better than a complex one.
- Optimize Resources – The best model for a project balances predictive performance against training and inference cost.
What is Cross-Validation?
Cross-validation is a technique to evaluate model performance by splitting the dataset into multiple subsets (folds) and training/testing across different combinations.
For example, in k-fold cross-validation:
- The data is divided into k folds.
- For each fold:
  - Train the model on the other k-1 folds.
  - Test it on the held-out fold.
- Average the results to get a more reliable performance estimate.
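The steps above can be sketched directly with scikit-learn's `KFold` splitter. This is an illustrative example, not code from the lesson: the Iris dataset and logistic regression are stand-ins for any dataset and model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds...
    model.fit(X[train_idx], y[train_idx])
    # ...and test on the one held-out fold.
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Average the per-fold scores for a more reliable estimate.
print(f"Mean accuracy over {kf.get_n_splits()} folds: {sum(scores) / len(scores):.3f}")
```

In practice `cross_val_score` does this loop for you, as the example later in the lesson shows; the explicit loop just makes the train/test rotation visible.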
Common Cross-Validation Types
- K-Fold Cross-Validation – Most common, splits into k equal folds.
- Stratified K-Fold – Maintains class proportions in each fold (important for classification).
- Leave-One-Out (LOO) – Each observation is tested individually.
- ShuffleSplit – Repeated random train/test splits; the same sample may appear in the test set of multiple splits.
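To see why stratification matters for classification, here is a small sketch on hypothetical imbalanced labels (90 samples of class 0, 10 of class 1). Plain `KFold` without shuffling can leave some folds with no minority-class samples at all, while `StratifiedKFold` keeps the class proportions in every fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 of class 0, then 10 of class 1 (hypothetical data).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # Fraction of class-1 samples in each test fold
    fractions = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, [f"{f:.2f}" for f in fractions])
# KFold           ['0.00', '0.00', '0.00', '0.00', '0.50']
# StratifiedKFold ['0.10', '0.10', '0.10', '0.10', '0.10']
```

The unstratified folds wildly misrepresent the class balance, which distorts per-fold scores; the stratified folds all match the overall 10% minority rate.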
Example: Comparing Models with Cross-Validation
```python
# The piplite lines are only needed in JupyterLite (browser-based) notebooks;
# in a regular Python environment, install scikit-learn normally and skip them.
import piplite
await piplite.install('scikit-learn')

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define models
log_reg = LogisticRegression(max_iter=200)
knn = KNeighborsClassifier(n_neighbors=5)

# Cross-validation
log_scores = cross_val_score(log_reg, X, y, cv=5)
knn_scores = cross_val_score(knn, X, y, cv=5)

print(f"Logistic Regression mean score: {log_scores.mean():.3f}")
print(f"KNN mean score: {knn_scores.mean():.3f}")
```
This example uses 5-fold cross-validation to compare two models and select the one with the highest average accuracy.
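When comparing models this way, it also helps to look at the spread of the fold scores, not just the mean: a model with a slightly lower mean but much lower variance may be the safer choice. A minimal sketch of that comparison, reusing the same dataset and models as above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

best_name, best_mean = None, -1.0
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # same cv=5 for every model
    print(f"{name}: {scores.mean():.3f} (std {scores.std():.3f})")
    if scores.mean() > best_mean:
        best_name, best_mean = name, scores.mean()

print("Selected model:", best_name)
```

Note that both models are evaluated with the same `cv=5` setting, so the comparison is apples-to-apples.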
Key Takeaways
- Model selection ensures the chosen model is the best fit for both accuracy and efficiency.
- Cross-validation gives a more robust estimate of real-world performance.
- Always use the same cross-validation strategy when comparing models to ensure fairness.
What’s Next?
In the next lesson, we’ll wrap up the chapter with the Final Quiz – Machine Learning Essentials to review everything you’ve learned.