Data Preprocessing Using Python

Data preprocessing is the initial stage of data processing, carried out before machine learning algorithms are applied. The data we typically work with, whether from databases, Excel files, or other sources, is often raw and imperfect. For example, a dataset may contain missing values, columns of mixed data types, and so on. These issues need to be addressed first so that the data is easier to handle and the output meets our expectations.

We will work through the following steps one by one:

  • Importing libraries
  • Importing datasets
  • Handling missing data in the dataset
  • Converting string data into categories
  • Splitting the dataset into training and test sets
  • Feature scaling
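
As a rough preview, the steps above can be sketched in a few lines with pandas and scikit-learn. This is only a minimal sketch under stated assumptions: the eight rows are hypothetical stand-ins for the real file, and the choices made here (median and most-frequent imputation, one-hot encoding, a 75/25 split, standardization) are illustrative defaults, not necessarily the ones used in the module.

```python
# Importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Importing the dataset. Hypothetical stand-in rows are used here so the
# sketch runs on its own; with the real file this would be pd.read_csv(...).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, None, 54.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 11.13, 8.46, 51.86, 30.07],
    "Embarked": ["S", "C", "S", "S", None, "Q", "S", "C"],
})

# Handling missing data: median for the numeric Age column,
# most frequent value for the categorical Embarked column (assumed strategy).
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Converting string data into categories: a 0/1 code for Sex,
# one indicator column per port for Embarked.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)

# Splitting the dataset into training and test sets (75% / 25%).
X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature scaling: fit the scaler on the training set only, then apply
# the same transformation to the test set to avoid leaking test statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set alone keeps the test set unseen until evaluation, which is why the split comes before the scaling step in the list.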

Dataset Information

Data Source: Kaggle
Description: Provides information about Titanic passengers who survived and those who did not.
Number of records: 1309
Number of attributes: 12 (including the class attribute, Survived)

It consists of:

  • PassengerId: Sequential ID of the passenger record
  • Survived: Survival status (0: died, 1: survived)
  • Pclass: Passenger class (1: first class, 2: second class, 3: third class)
  • Name: Passenger’s name
  • Sex: Passenger’s gender (male, female)
  • Age: Passenger’s age
  • SibSp: Number of siblings and spouses aboard the ship
  • Parch: Number of parents and children aboard the ship
  • Ticket: Passenger ticket code
  • Fare: Ticket price paid by the passenger
  • Cabin: Cabin code
  • Embarked: Port of embarkation for the passenger (C: Cherbourg, Q: Queenstown, S: Southampton)
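
Once the file is loaded, the attribute list above can be checked against the data. The sketch below assumes pandas and uses a hypothetical three-row sample in the same column layout (the names, ticket codes, and cabin codes are made up), so that it runs without the downloaded file; with the real data, the in-memory sample would be replaced by the path to the CSV file.

```python
import io
import pandas as pd

# Hypothetical sample with the 12 documented columns; values are placeholders.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Passenger, Mr. A",male,22,1,0,T-001,7.25,,S\n'
    '2,1,1,"Passenger, Mrs. B",female,38,1,0,T-002,71.28,A10,C\n'
    '3,1,3,"Passenger, Miss C",female,,0,0,T-003,7.92,,Q\n'
)
df = pd.read_csv(sample_csv)

# Confirm that every documented attribute is present.
expected_columns = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]
missing_columns = [c for c in expected_columns if c not in df.columns]
print("Missing columns:", missing_columns)

# Inspect the data type of each column (numeric vs. string).
print(df.dtypes)

# Count the missing values per column; these are the columns
# that will need attention during preprocessing.
missing_counts = df.isna().sum()
print(missing_counts)
```

Columns reported as strings (such as Sex and Embarked) are candidates for category conversion, and columns with non-zero missing counts are candidates for imputation.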

Dataset

Download Dataset

MODULE

Download Module 7