Dataset Structure: Features and Labels
In machine learning, a dataset is typically organized into:
- Features (X) – The input variables used by the model to make predictions. Examples: age, height, number of purchases.
- Labels (y) – The target variable the model is trying to predict. Examples: whether an email is spam, the price of a house.
A model learns the relationship between features and labels in supervised learning.
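As a minimal sketch (the columns and values below are invented for illustration), a small set of features and labels can be represented as NumPy arrays:

```python
import numpy as np

# Hypothetical features: each row is one customer, each column is one
# feature (age, height in cm, number of purchases).
X = np.array([
    [25, 170, 3],
    [40, 165, 12],
    [31, 180, 7],
])

# Hypothetical labels: the target to predict for each row
# (1 = made a repeat purchase, 0 = did not).
y = np.array([0, 1, 1])

print("X shape:", X.shape)  # (3, 3) -> 3 samples, 3 features
print("y shape:", y.shape)  # (3,)   -> one label per sample
```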
Loading a Dataset in Scikit-learn
Scikit-learn provides built-in datasets for practice. One of the most famous is the Iris dataset.
Loading the Iris Dataset
```python
from sklearn.datasets import load_iris

iris = load_iris()

# Features (X) - shape: (samples, features)
X = iris.data
print("Feature shape:", X.shape)
print("First row of features:", X[0])

# Labels (y) - shape: (samples,)
y = iris.target
print("Label shape:", y.shape)
print("First label:", y[0])
```
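Running this prints a feature shape of (150, 4) and a label shape of (150,): the Iris dataset contains 150 flower samples, each described by four measurements, with one class label per sample.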
Inspecting Feature and Label Names
print("Feature names:", iris.feature_names) print("Target names:", iris.target_names)
Why This Matters
- Features are the information your model uses to make predictions.
- Labels define the correct answers during training.
- Organizing data correctly into X and y is essential for Scikit-learn functions like train_test_split() and .fit() (a brief sketch follows this list).
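Here is a brief sketch of how X and y flow into those functions, using the Iris data loaded above (splitting is covered in detail in the next lesson, and KNeighborsClassifier is just one example estimator):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Both functions take the features (X) and labels (y) as separate arguments.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier()
model.fit(X_train, y_train)         # learn the feature-label relationship
print(model.score(X_test, y_test))  # accuracy on unseen samples
```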
Key Takeaways
- X → input features, a 2D array of shape (n_samples, n_features) (a common shape pitfall is noted after this list).
- y → target labels, a 1D array of shape (n_samples,).
- Proper separation of features and labels is the first step in preparing data for training.
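One related pitfall, noted here as an extra aside: selecting a single feature column from X gives a 1D array, which Scikit-learn estimators will not accept as features; reshape it back to 2D first:

```python
from sklearn.datasets import load_iris

X = load_iris().data

one_feature = X[:, 0]                        # shape (150,)   -- 1D, not valid as X
one_feature_2d = one_feature.reshape(-1, 1)  # shape (150, 1) -- valid 2D X

print(one_feature.shape, one_feature_2d.shape)
```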
What’s Next?
In the next lesson, we’ll learn how to split data into training and testing sets to evaluate model performance.
Quiz
Understanding Dataset Structure
In a dataset used for machine learning, the input variables are referred to as ____.
- Features
- Labels
- Targets
- Outputs