Lecture

Handling Missing and Duplicate Data

Real-world datasets are rarely perfect.

You'll often encounter missing values or duplicate rows that can skew your analysis.

Pandas provides powerful tools to identify and handle these issues efficiently.


Dealing with Missing Data

Missing values are usually represented as NaN (Not a Number) in pandas. None and, in datetime columns, NaT are treated as missing as well.

You can handle them in several ways:

  • Detect missing values using .isnull() or .notnull()
  • Drop missing data with .dropna()
  • Fill missing data using .fillna() (e.g., fill with a default value or forward-fill based on previous values)
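The three approaches above can be sketched on a small DataFrame. The column names and values here are made up for illustration, and the default fill values ("unknown", 0.0) are arbitrary assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns
df = pd.DataFrame({
    "city": ["Oslo", "Lima", None, "Cairo"],
    "temp": [3.0, np.nan, 21.5, np.nan],
})

# Detect: isnull() gives a True/False table; sum() counts the True values per column
missing_per_column = df.isnull().sum()

# Drop: keep only rows that have no missing value in any column
dropped = df.dropna()

# Fill with a default value per column (the defaults are illustrative)
filled_default = df.fillna({"city": "unknown", "temp": 0.0})

# Forward-fill: carry the last seen value down into the gaps
filled_forward = df["temp"].ffill()

print(missing_per_column)
print(dropped)
print(filled_default)
print(filled_forward)
```

Forward-filling only makes sense when row order carries meaning, for example in a time series. A leading gap stays missing because there is no earlier value to carry forward.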

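Handle missing values deliberately before computing statistics such as mean, sum, or correlation. By default pandas skips NaN in these calculations, so a result can silently rest on fewer observations than you expect, and the way you fill the gaps changes the answer.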
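A small sketch of how the treatment of NaN changes a mean. The scores are invented, and filling with 0 is used only to show the effect, not as a suggested strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry
scores = pd.Series([80.0, np.nan, 100.0])

# Default behaviour: NaN is skipped, so the mean uses two values
mean_skipping = scores.mean()

# skipna=False propagates the missing value, so the result is NaN
mean_strict = scores.mean(skipna=False)

# Filling with 0 treats the gap as a real score of zero and pulls the mean down
mean_zero_filled = scores.fillna(0).mean()

print(mean_skipping, mean_strict, mean_zero_filled)
```

The same three numbers give 90.0, NaN, or 60.0 depending on the choice, which is why the choice should be made explicitly.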


Handling Duplicate Entries

Duplicate rows can occur due to data entry errors or when merging datasets.

  • Use .duplicated() to flag duplicates
  • Use .drop_duplicates() to remove them

Always check if duplicates make sense in the context of your data. Not all repetition is bad.
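As an illustration, the sketch below flags and removes duplicates in a made-up orders table. The `subset` and `keep` arguments are real parameters of both methods, but the column names and data are assumptions for this example.

```python
import pandas as pd

# Hypothetical orders: row 2 repeats row 0 exactly
orders = pd.DataFrame({
    "customer": ["ana", "ben", "ana", "ana"],
    "item": ["pen", "pen", "pen", "ink"],
})

# Flag rows that repeat an earlier row across all columns
dup_mask = orders.duplicated()

# Remove full-row duplicates, keeping the first occurrence
deduped = orders.drop_duplicates()

# Compare on selected columns only: here, repeat customers
repeat_customer = orders.duplicated(subset=["customer"])

# Keep only the first row seen for each customer
first_per_customer = orders.drop_duplicates(subset=["customer"], keep="first")

print(dup_mask)
print(deduped)
print(first_per_customer)
```

Whether the repeated ("ana", "pen") row is an error or a genuine second purchase depends on the data, so inspecting the flagged rows before dropping them is usually the safer order of operations.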


Summary

Task                  Method                 Description
Detect missing        df.isnull()            Shows True for missing values
Drop missing rows     df.dropna()            Removes rows with any NaN
Fill missing values   df.fillna(value)       Replaces NaN with the specified value
Detect duplicates     df.duplicated()        Returns a Boolean Series
Drop duplicates       df.drop_duplicates()   Removes duplicate rows
Quiz

How can you fill missing data in a DataFrame using Pandas?

To replace NaN with a specified value, use the ______ method.
.isnull()
.dropna()
.fillna()
.duplicated()
