Data Science with Python, Part 3: pandas
Contents¶
- Introduction to pandas
- Import Packages
- Reading in Data Set
- Understanding the Data Set
- Conversion of Column Data Types
- Creating A Data Frame
- Naming and Re-naming Columns
- Delete or Remove Columns
- Re-arrange Columns
- Slicing using loc and iloc
This is the third post in the Data Science with Python series. Read the previous posts here:
Introduction to pandas ¶
pandas is a Python package that helps you prepare your data for further analysis. It is a great package for data munging. Say you want to apply a machine learning technique that requires the data to be in a certain format, with no missing values et cetera: pandas is your friend.
For this tutorial, we will use the Echocardiogram Data Set from the UCI Machine Learning Repository. The file is in CSV format. Without rambling any further, let's dig in.
Import Packages ¶
First and foremost, we need to import the package. We will also import the numpy and matplotlib packages. It is common for pandas to be abbreviated as pd: less typing, shorter code, and it keeps things neat.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Reading in Data Set ¶
To read in the data set, we will use pd.read_csv().
# Read in data set
echo = pd.read_csv("/Users/azmirfakkri/Downloads/echocardiogram.csv")
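A side note: the original UCI file marks missing entries with "?". If your copy of the csv does the same, you can tell pd.read_csv to treat those entries as missing via its na_values parameter:
# Treat "?" as a missing value while reading (only if your file uses it)
echo = pd.read_csv("/Users/azmirfakkri/Downloads/echocardiogram.csv",
                   na_values="?")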
Understanding the Data Set ¶
Once you read in your data, you need to get a feel for it: see what columns it has, whether there are any missing values, and so on.
# View head
echo.head(10)
# View tail
echo.tail(10)
To quickly view the shape of the data set, use its shape attribute. As mentioned in a previous post, an attribute does not require brackets ().
# View shape
echo.shape
At this point we have gained some information about our data set:
- We now know that there are 133 rows or observations and 13 columns in this data set
- There are missing values
- The names of the columns, which we can list as shown below
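For example, the column names are available via the columns attribute:
# View column names
echo.columns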
I always stress to myself the importance of knowing the meaning of every column in the data set, as this will help you really understand what you are dealing with. For this data set, a description of the columns is given.
Attribute Information:
- survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
- alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
- age -- age in years when heart attack occurred
- pericardialeffusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
- fractionalshortening -- a measure of contractility of the heart; lower numbers are increasingly abnormal
- epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
- lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
- wallmotion-score -- a measure of how the segments of the left ventricle are moving
- wallmotion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
- mult -- a derived variable which can be ignored
- name -- the name of the patient (I have replaced them with "name")
- group -- meaningless, ignore it
- aliveat1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.
To get a complete picture of the data set, we can use echo.info().
# Info of data set
echo.info()
Using echo.info(), we can see that all columns contain at least one missing value.
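To make this concrete, we can count the missing values in each column using isnull() combined with sum():
# Count missing values in each column
echo.isnull().sum()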
Conversion of Column Data Types ¶
Based on the description, we know that there are three columns with categorical data: alive, pericardialeffusion, aliveat1.
We are going to change the data types of these columns to dtype="category". There are a few reasons why we want to do this:
- It will save memory
- It will make it easier to sort the data
- It will signal other libraries in Python to treat these columns as categorical data
We will use .astype() for this purpose, passing the data type we want to convert to, in this case category. It is also possible to create an ordered categorical, where the categories have a meaningful order; in recent pandas versions this is done through a CategoricalDtype with ordered=True, as sketched below.
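Here is a minimal sketch of an ordered categorical, using a hypothetical severity column that is not part of this data set:
# Hypothetical example: an ordered categorical (not from the echo data)
severity = pd.Series(['mild', 'severe', 'moderate', 'mild'])
ordered_type = pd.api.types.CategoricalDtype(
    categories=['mild', 'moderate', 'severe'], ordered=True)
severity = severity.astype(ordered_type)
severity.sort_values()  # sorts mild < moderate < severe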
However, for the three columns here, an ordering is unnecessary and meaningless. For example, in the pericardialeffusion column, having fluid (=1) is not better than not having fluid (=0), and vice versa.
# Convert column data type
echo["alive"] = echo["alive"].astype('category')
echo["pericardialeffusion"] = echo["pericardialeffusion"].astype('category')
echo["aliveat1"] = echo["aliveat1"].astype('category')
# View info
echo.info()
We can now see that the data types of the three columns have been changed to category. Note that our memory usage is now 11.2+ KB instead of 13.6+ KB.
Also note that, in older versions of pandas, this conversion had to be done column by column. Recent versions accept a dictionary mapping column names to dtypes, so it can be done in one call.
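A minimal sketch of the one-call version, equivalent to the three conversions above (assuming a recent pandas version):
# Convert all three columns in one call (recent pandas versions)
echo = echo.astype({'alive': 'category',
                    'pericardialeffusion': 'category',
                    'aliveat1': 'category'})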
Creating A Data Frame ¶
I am now going to diverge slightly from this topic. At this point, we have learnt how to read in a csv data set. I'm going to show how you can create your own data frame from lists.
For this purpose, let's extract all the values in the survival and age columns. A data frame has another attribute called .values. We transform the resulting values into a list using list().
# Extract values from survival column and transform it into a list
surv_list = list(echo['survival'].values)
# Extract values from age column and transform it into a list
age_list = list(echo['age'].values)
# Save the lists in one variable
values_list = [surv_list, age_list]
# Create list of column names
colnames_list = ['survival', 'age']
We now have two lists: values_list, containing the two lists of values extracted from the original data set, and colnames_list, containing the column names for our new data set. We will use these to create our data frame with pd.DataFrame().
We will zip these lists using Python's built-in zip() function, which returns an iterator of tuples (a zip object) from any number of iterables. When you zip lists with zip(), the result has to be unpacked using list() before you can print it; otherwise it will not show the values.
# Zip lists
lists_zipped = zip(colnames_list, values_list)
# Print lists_zipped
lists_zipped
Note that it does not return the values as expected, so let's unpack it using list().
# Unpack values using list
unpacked_zipped = list(lists_zipped)
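Printing the unpacked result now shows the (column name, values) tuples; here we peek at the first pair only, since each list holds 133 values:
# Print the first (column name, values) pair
print(unpacked_zipped[0])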
Once we have unpacked the values, we will use the dict() function to create a dictionary called mini_echo. Then we will use pd.DataFrame() to transform the dictionary into a data frame.
# Create dictionary
mini_echo = dict(unpacked_zipped)
# Create data frame
df = pd.DataFrame(mini_echo)
# View head
df.head(10)
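As an aside, the zip() and dict() steps can be skipped entirely by passing a dictionary literal straight to pd.DataFrame(); a minimal sketch using the two lists from above:
# Equivalent one-step construction from the two lists
df_direct = pd.DataFrame({'survival': surv_list, 'age': age_list})
df_direct.head()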
Naming and Re-naming Columns ¶
Let's use the data frame that we created above. Instead of survival, I want to rename that column to surv. We can use the .rename() method.
# Renaming a column
df.rename(columns={'survival':'surv'}, inplace=True)
# View head
df.head(10)
The column name survival is now updated to surv. Note that by specifying inplace=True, no copy of the data frame is created; the changes are made in the original data frame.
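If you would rather keep the original data frame unchanged, omit inplace=True; .rename() then returns a new data frame:
# Return a renamed copy, leaving df unchanged
df_copy = df.rename(columns={'surv': 'survival'})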
Next, I'm going to change the column names to a and b, to demonstrate a situation where you have a data frame with no meaningful column names.
# Rename column into a and b
df.rename(columns={'age':'a', 'surv':'b'}, inplace=True)
# View head
df.head(10)
To give the columns meaningful names again, we create a list of column names and assign it to df.columns.
# Name columns
df.columns = ['age', 'survival']
# View head
df.head()
Delete or Remove Columns ¶
We have gathered from the description of this data set that some of the columns are not useful and can be removed. We are going to remove these columns: wallmotion-score, mult, name and group.
For this purpose, we will use .drop(), and instead of column names we will specify the column positions.
# Remove columns
echo.drop(echo.columns[[7, 9, 10, 11]], axis=1, inplace=True)
# View head
echo.head()
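Dropping by position is fragile if the column order ever changes. For reference, the same result can be achieved by name, assuming the column labels match the description above (the columns have already been removed at this point, so this is just a sketch):
# Equivalent: drop the same columns by name instead of position
# (sketch only; the columns were already removed above)
echo = echo.drop(['wallmotion-score', 'mult', 'name', 'group'], axis=1)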
Re-arrange Columns ¶
The survival, alive and aliveat1 columns are closely related to each other, so it will be easier to have them located next to each other. Currently aliveat1 is the last column in the data set.
We can get the list of columns first and then re-arrange it to make the data set easier to analyse and understand.
# Get list of columns
list(echo.columns.values)
# Rearrange columns
echo = echo[['survival',
'alive',
'aliveat1',
'age',
'pericardialeffusion',
'fractionalshortening',
'epss',
'lvdd',
'wallmotion-index']]
# View head
echo.head(10)
Slicing using loc and iloc ¶
Let's learn how to select some elements using loc and iloc.
- loc is label-based selection
- iloc is position-based selection (using integer)
To select elements in a data frame, we need to specify the rows and columns we want. We will look at the difference between the two methods by selecting the 1st to 11th rows, with only the survival, alive and epss columns.
# Selecting using loc
echo.loc[0:10, ['survival', 'alive', 'epss']]
# Selecting using iloc
echo.iloc[0:11, [0, 1, 6]]
The index of this data frame is made of integers, starting from 0.
Notice the difference between the two methods:
- In loc we specify 0:10, while in iloc we specify 0:11. This is because loc is label-based and the label of the 11th row in this data frame is 10; with loc, the end label is included. iloc is position-based, following the usual [start:end) convention where the end is excluded, so we specify 11 (the position of the 12th row, which is not included). The sketch below makes this concrete.
- For loc, the column names are specified; for iloc, the column positions are specified.
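To make the label-versus-position distinction concrete, here is a small sketch with a made-up data frame (not the echo data) whose index consists of string labels:
# With string labels, loc slices by label (end included),
# while iloc slices by position (end excluded)
demo = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
demo.loc['a':'b']   # rows 'a' and 'b'
demo.iloc[0:2]      # first two rows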
That concludes this part.
At this stage, we have done some cleaning of the Echocardiogram data set, but we have not yet dealt with the missing values, which are very common in real data. In the next part of Data Science with Python, we will learn a few strategies for dealing with missing values.