Statistics Lab Work 7
Data Preprocessing Using Python
Data preprocessing is the first stage of data processing, carried out before any machine learning algorithm is applied. The data we work with day to day, whether from databases, Excel files, or other sources, is often imperfect: a dataset may contain missing values, mixed data types, and so on. These issues need to be resolved first so that the data is easier to work with and the output meets our expectations.
We will work through the following steps one by one:
- Importing libraries
- Importing datasets
- Handling missing data in the dataset
- Converting string data into categories
- Splitting the dataset into training and test sets
- Feature scaling
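The steps above can be sketched end to end as follows. This is only a minimal illustration, assuming pandas and scikit-learn are installed. The handful of rows is invented for the example and is not taken from the Kaggle file, and the specific choices (mean imputation, integer category codes, a 75/25 split, standardization) are assumptions rather than the only way to do each step.

```python
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Importing the dataset.
# Toy rows invented for illustration only; they are NOT real Titanic records.
df = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex":      ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Age":      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 13.00, 51.86, 11.13],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Handling missing data: replace missing ages with the column mean.
imputer = SimpleImputer(strategy="mean")
df["Age"] = imputer.fit_transform(df[["Age"]]).ravel()

# Converting string data into categories: map each distinct string to an integer code.
df["Sex"] = df["Sex"].astype("category").cat.codes

# Splitting the dataset into training and test sets (75% / 25%).
X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature scaling: fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The scaler is fitted on the training set alone so that no information from the test set leaks into it. The same care applies to the imputer in a real project; here it is fitted on the whole table only to keep the sketch in the same order as the list.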
Dataset Information
Data Source: Kaggle
Description: Provides information about Titanic passengers who survived and those who did not.
Number of records: 1309
Number of attributes: 12 (including class)
The attributes are:
- PassengerId: Sequential ID number of the passenger record
- Survived: Survival status (0: died, 1: survived)
- Pclass: Passenger ticket class (1: first class, 2: second class, 3: third class)
- Name: Passenger’s name
- Sex: Passenger’s gender (male, female)
- Age: Passenger’s age
- SibSp: Number of siblings and spouses aboard the ship
- Parch: Number of parents and children aboard the ship
- Ticket: Passenger ticket code
- Fare: Ticket price paid by the passenger
- Cabin: Cabin code
- Embarked: Port of embarkation for the passenger (C: Cherbourg, Q: Queenstown, S: Southampton)
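A quick way to check a table against the attribute list above is to load it and look at its shape, column types, and missing-value counts. The sketch below assumes pandas is installed and uses two made-up rows in the documented column layout, read from an in-memory string. With the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Two invented rows in the documented 12-column layout, for illustration only.
# They are NOT real passenger records.
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/0 0000,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,,1,0,PC 00000,71.28,C00,C
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of records and attributes.
print(df.shape)

# Data type pandas inferred for each attribute.
print(df.dtypes)

# Attributes that contain missing values, with their counts.
missing = df.isna().sum()
print(missing[missing > 0])
```

Empty fields in the CSV are read as missing values, which is why `Age` and `Cabin` show up in the last printout for these sample rows.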
All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.