Statistics Lab Work 7

Created: 2022-11-03 | Updated: 2022-11-03 | Lab Work


Data Preprocessing Using Python

Data preprocessing is the first stage of data processing, carried out before any machine learning algorithm is applied. The data we work with day to day, whether it comes from databases, Excel files, or other sources, is often messy or incomplete. For example, a dataset may contain missing values, columns with mixed data types, and so on. These issues need to be resolved first so that the data is easier to work with and the results match what we expect.

We will work through the following steps one by one:

Importing libraries
Importing datasets
Handling missing data in the dataset
Converting string data into categories
Splitting the dataset into training and test sets
Feature scaling
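
The steps above can be sketched end to end as follows. This is a minimal illustration rather than the module's actual solution: it assumes pandas and scikit-learn are installed, uses a small made-up table in place of the real Titanic file, and the choices of mean/mode imputation, a 25% test split, and StandardScaler are assumptions made for the example.

```python
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Importing the dataset.
# Hypothetical stand-in rows; with the real file this would be something like
# df = pd.read_csv("titanic.csv")  (file name is an assumption)
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0, np.nan, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.07, 11.13],
    "Embarked": ["S", "C", "S", "S", "Q", "S", None, "S"],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Handling missing data: mean for a numeric column, most frequent value for a text column
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Converting string data into categories
# Categories are sorted alphabetically, so female -> 0 and male -> 1
df["Sex"] = df["Sex"].astype("category").cat.codes
# One 0/1 indicator column per port of embarkation
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)

# Splitting the dataset into training and test sets
X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Feature scaling: fit on the training set only, then reuse it on the test set
num_cols = ["Age", "Fare"]
scaler = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

print(X_train.shape, X_test.shape)
```

The sketch follows the order of the list, so missing values are filled before the split. In a stricter setup the fill values would be computed from the training set only, in the same way the scaler is fitted here.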

Dataset Information

Data Source: Kaggle
Description: Provides information about Titanic passengers who survived and those who did not.
Number of records: 1309
Number of attributes: 12 (including class)

It consists of:

PassengerId: Sequential ID of the passenger record
Survived: Survival status (0: died, 1: survived)
Pclass: Passenger ticket class (1: first class, 2: second class, 3: third class)
Name: Passenger’s name
Sex: Passenger’s gender (male, female)
Age: Passenger’s age
SibSp: Number of siblings and spouses aboard the ship
Parch: Number of parents and children aboard the ship
Ticket: Passenger ticket code
Fare: Ticket price paid by the passenger
Cabin: Cabin code
Embarked: Port of embarkation for the passenger (C: Cherbourg, Q: Queenstown, S: Southampton)
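
A quick way to check a table against the attribute list above is to load it and look at which columns are non-numeric and which contain missing values. The sketch below assumes pandas is installed and reads three illustrative rows laid out with the same twelve columns from an in-memory string; the rows are a stand-in for the example, not the downloaded dataset.

```python
import io

import pandas as pd

# Illustrative rows in the same 12-column layout (not the real file).
# Names contain commas, so they are quoted.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,1,0,PC 17599,71.2833,C85,C\n'
    '6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q\n'
)
df = pd.read_csv(sample_csv)

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Columns that are not numeric and will need encoding (or dropping) later
text_columns = [
    col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])
]

print(df.shape)
print(missing_per_column[missing_per_column > 0])
print(text_columns)
```

On these sample rows, Age and Cabin are the columns with gaps, and Name, Sex, Ticket, Cabin, and Embarked are the text columns.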

Dataset

Download Dataset

MODULE

Download Module 7

Author: Azhar Rizki Zulma

Link: https://blog.zulma.id/posts/Statistics-Lab-Work-7/

Copyright Notice: All articles on this blog are licensed under CC BY-NC-SA 4.0 unless otherwise stated.

