Model Selection and Cross-Validation
Choosing the right machine learning model is one of the most important steps in any ML project.
Even when two models achieve similar training performance, their ability to generalize to unseen data can differ substantially.
Why Model Selection Matters
- Avoid Overfitting – Some models perform exceptionally well on training data but fail on new data.
- Balance Accuracy and Complexity – A simpler model might generalize better than a complex one.
- Optimize Resources – The best model for a project balances predictive performance against training and inference cost.
What is Cross-Validation?
Cross-validation is a technique to evaluate model performance by splitting the dataset into multiple subsets (folds) and training/testing across different combinations.
For example, in k-fold cross-validation:
- The data is divided into k folds.
- For each fold:
  - Train the model on the other k-1 folds.
  - Test it on the held-out fold.
- Average the results to get a more reliable performance estimate.
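The steps above can be sketched directly with scikit-learn's `KFold` splitter. This is an illustrative example, not code from the lesson: the Iris dataset and logistic regression are stand-ins for any dataset and model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds...
    model.fit(X[train_idx], y[train_idx])
    # ...and test on the one held-out fold.
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Average the per-fold scores for a more reliable estimate.
print(f"Mean accuracy over {kf.get_n_splits()} folds: {sum(scores) / len(scores):.3f}")
```

In practice `cross_val_score` does this loop for you, as the example later in the lesson shows; the explicit loop just makes the train/test rotation visible.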
Common Cross-Validation Types
- K-Fold Cross-Validation – Most common, splits into k equal folds.
- Stratified K-Fold – Maintains class proportions in each fold (important for classification).
- Leave-One-Out (LOO) – Each observation is tested individually.
- ShuffleSplit – Repeated random train/test splits; the same sample may appear in the test set of multiple splits.
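To see why stratification matters for classification, here is a small sketch on hypothetical imbalanced labels (90 samples of class 0, 10 of class 1). Plain `KFold` without shuffling can leave some folds with no minority-class samples at all, while `StratifiedKFold` keeps the class proportions in every fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 90 of class 0, then 10 of class 1 (hypothetical data).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # Fraction of class-1 samples in each test fold
    fractions = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, [f"{f:.2f}" for f in fractions])
# KFold           ['0.00', '0.00', '0.00', '0.00', '0.50']
# StratifiedKFold ['0.10', '0.10', '0.10', '0.10', '0.10']
```

The unstratified folds wildly misrepresent the class balance, which distorts per-fold scores; the stratified folds all match the overall 10% minority rate.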
Example: Comparing Models with Cross-Validation
```python
# The piplite lines are only needed in JupyterLite (browser-based) notebooks;
# in a regular Python environment, install scikit-learn normally and skip them.
import piplite
await piplite.install('scikit-learn')

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Define models
log_reg = LogisticRegression(max_iter=200)
knn = KNeighborsClassifier(n_neighbors=5)

# Cross-validation
log_scores = cross_val_score(log_reg, X, y, cv=5)
knn_scores = cross_val_score(knn, X, y, cv=5)

print(f"Logistic Regression mean score: {log_scores.mean():.3f}")
print(f"KNN mean score: {knn_scores.mean():.3f}")
```
This example uses 5-fold cross-validation to compare two models and select the one with the highest average accuracy.
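When comparing models this way, it also helps to look at the spread of the fold scores, not just the mean: a model with a slightly lower mean but much lower variance may be the safer choice. A minimal sketch of that comparison, reusing the same dataset and models as above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

best_name, best_mean = None, -1.0
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # same cv=5 for every model
    print(f"{name}: {scores.mean():.3f} (std {scores.std():.3f})")
    if scores.mean() > best_mean:
        best_name, best_mean = name, scores.mean()

print("Selected model:", best_name)
```

Note that both models are evaluated with the same `cv=5` setting, so the comparison is apples-to-apples.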
Key Takeaways
- Model selection ensures the chosen model is the best fit for both accuracy and efficiency.
- Cross-validation gives a more robust estimate of real-world performance.
- Always use the same cross-validation strategy when comparing models to ensure fairness.
What’s Next?
In the next lesson, we’ll wrap up the chapter with the Final Quiz – Machine Learning Essentials to review everything you’ve learned.