Dataset Structure: Features and Labels
In machine learning, a dataset is typically organized into:
Features (X)
: The input variables used by the model to make predictions. For example, age, height, or number of purchases.
Labels (y)
: The target variable that the model is trying to predict. For example, whether an email is spam or the price of a house.
In supervised learning, a model learns the relationship between features and labels.
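To make the X and y layout concrete, here is a minimal sketch using NumPy. The values are invented purely for illustration (hypothetical people described by age, height, and number of purchases, with a hypothetical "repeat customer" label); they do not come from any real dataset.

```python
import numpy as np

# Hypothetical toy data: one row per person, one column per feature.
# Columns: age (years), height (cm), number of purchases.
X = np.array([
    [25, 170.0, 3],
    [41, 182.5, 12],
    [33, 160.2, 0],
    [58, 175.1, 7],
])

# Hypothetical labels, one per row of X: 1 = repeat customer, 0 = not.
y = np.array([1, 1, 0, 1])

print("X shape:", X.shape)  # (4, 3): 4 samples, 3 features
print("y shape:", y.shape)  # (4,): one label per sample

# Every sample needs exactly one label.
rows_match = X.shape[0] == y.shape[0]
print("Rows match:", rows_match)
```

The point of the sketch is the shape relationship: the number of rows in X must equal the length of y.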
Loading a Dataset in Scikit-learn
Scikit-learn provides built-in datasets for practice. One of the most famous is the Iris dataset.
    from sklearn.datasets import load_iris

    iris = load_iris()

    # Features (X) - shape: (samples, features)
    X = iris.data
    print("Feature shape:", X.shape)
    print("First row of features:", X[0])

    # Labels (y) - shape: (samples,)
    y = iris.target
    print("Label shape:", y.shape)
    print("First label:", y[0])
Inspecting Feature and Label Names
You can inspect the feature and label names of the Iris dataset using the following code:
    print("Feature names:", iris.feature_names)
    print("Target names:", iris.target_names)
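The names become more useful when paired with the data. The sketch below is one possible way to do that: it maps the integer labels in `iris.target` back to species names and prints the first sample with its feature names. The variable names `first_names` and `first_sample` are introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target_names is a NumPy array, so an array of integer labels
# can index it directly to recover the class names.
first_names = iris.target_names[iris.target[:5]]
print("First five labels as names:", first_names)

# Pair each feature name with its value in the first sample.
first_sample = dict(zip(iris.feature_names, iris.data[0]))
for name, value in first_sample.items():
    print(f"{name}: {value}")
```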
The following are some key points about features and labels:
- Features are the information your model uses to make predictions.
- Labels define the correct answers during training.
- X: input features, 2D array of shape (n_samples, n_features).
- y: target labels, 1D array of shape (n_samples,).
- Organizing data correctly into X and y is essential for Scikit-learn functions like train_test_split() and .fit().
- Proper separation of features and labels is the first step in preparing data for training.
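As a rough illustration of how X and y feed into train_test_split() and .fit(), here is a minimal sketch on the Iris dataset. The choice of KNeighborsClassifier, the 80/20 split, and random_state=42 are assumptions made for this example, not something the lesson prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Split features and labels together so rows stay aligned.
# test_size=0.2 holds out 20% of the samples (an arbitrary choice here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Any Scikit-learn classifier would do; k-nearest neighbors is
# used here only because it needs little configuration.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)  # learns from features and labels

accuracy = model.score(X_test, y_test)  # fraction of correct predictions
print("Train shapes:", X_train.shape, y_train.shape)
print("Test shapes:", X_test.shape, y_test.shape)
print("Test accuracy:", accuracy)
```

Both functions take X and y as separate arguments, which is why the 2D/1D layout described above matters.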