Data Science with Python, Part 3: pandas

June 18, 2018

This is the third post in the Data Science with Python series. Read the previous posts here:

  1. Data Science with Python, Part 1: Introduction to Basics
  2. Data Science with Python, Part 2: NumPy

Introduction to pandas

pandas is a Python package that helps you prepare your data for further analysis; it is a great tool for data munging. Say you want to apply a machine learning technique that requires the data to be in a certain format, with no missing values et cetera: pandas is your friend.

For this tutorial, we will use the Echocardiogram Data Set from the UCI Machine Learning Repository. The file is in CSV format. Without rambling any further, let's dig in.

Import Packages

First and foremost, we need to import the package. We will also import the numpy and matplotlib packages. It is conventional to abbreviate pandas as pd: less typing and neater code.

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Reading in Data Set

To read in the data set, we will use pd.read_csv().

In [2]:
# Read in data set
echo = pd.read_csv("/Users/azmirfakkri/Downloads/echocardiogram.csv")
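The path above is local to my machine, so adjust it to wherever you saved your own copy. As a side note, pd.read_csv() accepts any file-like object, not just a path on disk, so a quick way to experiment without a file is to feed it an in-memory string (the tiny CSV below is made up for illustration):

```python
import io
import pandas as pd

# read_csv accepts any file-like object, not just a path on disk
csv_text = "survival,alive,age\n11.0,0,71.0\n19.0,0,72.0\n"
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)          # (2, 3)
print(list(demo.columns))  # ['survival', 'alive', 'age']
```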

Understanding the Data Set

Once you have read in your data, you need to get a feel for it: see what columns it has, whether there are any missing values, and so on.

In [3]:
# View head
echo.head(10)
Out[3]:
survival alive age pericardialeffusion fractionalshortening epss lvdd wallmotion-score wallmotion-index mult name group aliveat1
0 11.0 0.0 71.0 0.0 0.260 9.000 4.600 14.0 1.000 1.000 name 1 0.0
1 19.0 0.0 72.0 0.0 0.380 6.000 4.100 14.0 1.700 0.588 name 1 0.0
2 16.0 0.0 55.0 0.0 0.260 4.000 3.420 14.0 1.000 1.000 name 1 0.0
3 57.0 0.0 60.0 0.0 0.253 12.062 4.603 16.0 1.450 0.788 name 1 0.0
4 19.0 1.0 57.0 0.0 0.160 22.000 5.750 18.0 2.250 0.571 name 1 0.0
5 26.0 0.0 68.0 0.0 0.260 5.000 4.310 12.0 1.000 0.857 name 1 0.0
6 13.0 0.0 62.0 0.0 0.230 31.000 5.430 22.5 1.875 0.857 name 1 0.0
7 50.0 0.0 60.0 0.0 0.330 8.000 5.250 14.0 1.000 1.000 name 1 0.0
8 19.0 0.0 46.0 0.0 0.340 0.000 5.090 16.0 1.140 1.003 name 1 0.0
9 25.0 0.0 54.0 0.0 0.140 13.000 4.490 15.5 1.190 0.930 name 1 0.0
In [4]:
# View tail
echo.tail(10)
Out[4]:
survival alive age pericardialeffusion fractionalshortening epss lvdd wallmotion-score wallmotion-index mult name group aliveat1
123 31.0 0.0 61.0 0.0 0.18 0.0 4.48 11.0 1.375 0.570 name NaN NaN
124 12.0 0.0 61.0 1.0 0.19 13.2 5.04 19.0 1.730 0.786 name NaN NaN
125 36.0 0.0 48.0 0.0 0.15 12.0 3.66 10.0 1.000 0.714 name NaN NaN
126 17.0 0.0 NaN 0.0 0.09 6.8 4.96 13.0 1.080 0.857 name NaN NaN
127 21.0 0.0 61.0 0.0 0.14 25.5 5.16 14.0 1.270 0.786 name NaN NaN
128 7.5 1.0 64.0 0.0 0.24 12.9 4.72 12.0 1.000 0.857 name NaN NaN
129 41.0 0.0 64.0 0.0 0.28 5.4 5.47 11.0 1.100 0.714 name NaN NaN
130 36.0 0.0 69.0 0.0 0.20 7.0 5.05 14.5 1.210 0.857 name NaN NaN
131 22.0 0.0 57.0 0.0 0.14 16.1 4.36 15.0 1.360 0.786 name NaN NaN
132 20.0 0.0 62.0 0.0 0.15 0.0 4.51 15.5 1.409 0.786 name NaN NaN

To quickly view the shape of the data set, use its shape attribute. As mentioned in a previous post, an attribute does not require parentheses ().

In [5]:
# View shape
echo.shape
Out[5]:
(133, 13)

At this point we have gained some information about our data set:

  1. There are 133 rows, or observations, and 13 columns in this data set
  2. There are missing values
  3. The names of the columns

I always stress to myself the importance of knowing the meaning of every column in the data set, as this will help you really understand what you are dealing with. For this data set, a description of the columns is given.

Attribute Information:

  1. survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
  2. alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
  3. age -- age in years when heart attack occurred
  4. pericardialeffusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
  5. fractionalshortening -- a measure of contractility around the heart; lower numbers are increasingly abnormal
  6. epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
  7. lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
  8. wallmotion-score -- a measure of how the segments of the left ventricle are moving
  9. wallmotion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
  10. mult -- a derivate var which can be ignored
  11. name -- the name of the patient (I have replaced them with "name")
  12. group -- meaningless, ignore it
  13. aliveat1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.

To get a complete picture of the data set, we can use echo.info().

In [6]:
# Info of data set
echo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 13 columns):
survival                130 non-null float64
alive                   131 non-null float64
age                     126 non-null float64
pericardialeffusion     132 non-null float64
fractionalshortening    124 non-null float64
epss                    117 non-null float64
lvdd                    121 non-null float64
wallmotion-score        128 non-null float64
wallmotion-index        130 non-null float64
mult                    129 non-null float64
name                    131 non-null object
group                   110 non-null object
aliveat1                75 non-null float64
dtypes: float64(11), object(2)
memory usage: 13.6+ KB

Using echo.info() we can see that all columns contain at least one missing value.
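If you want the missing-value counts directly, rather than reading them off .info(), you can chain .isnull() with .sum(). A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# .isnull() returns a Boolean frame; summing it counts True (missing) per column
toy = pd.DataFrame({"age": [71.0, np.nan, 55.0],
                    "epss": [9.0, 6.0, np.nan]})
missing_per_column = toy.isnull().sum()

print(missing_per_column["age"])   # 1
print(missing_per_column["epss"])  # 1
```

Running echo.isnull().sum() on our data set would give the same per-column counts implied by the non-null numbers above.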

Conversion of Column Data Types

Based on the description, we know that three columns contain categorical data: alive, pericardialeffusion and aliveat1. We are going to change the data types of these columns to dtype="category". There are a few reasons why we want to do this:

  1. It will save memory
  2. It will make it easier to sort the data
  3. It will signal other libraries in Python to treat these columns as categorical data

We will use .astype() for this purpose. We need to specify the data type we intend to convert to, in this case category. .astype() can also take an ordered=True argument to create ordered categorical data. However, for these three columns an order is unnecessary and meaningless: in the pericardialeffusion column, for example, having fluid (=1) is not better than not having fluid (=0), or vice versa.
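For a column where order does matter, pandas also offers pd.CategoricalDtype. The sketch below uses a hypothetical severity column (not part of our data set) just to show how an ordered categorical behaves:

```python
import pandas as pd

# Hypothetical severity column where the order low < medium < high matters
severity = pd.Series(["low", "high", "medium", "low"])
ordered_type = pd.CategoricalDtype(categories=["low", "medium", "high"],
                                   ordered=True)
severity = severity.astype(ordered_type)

print(severity.dtype.ordered)  # True
print(severity.min())          # 'low' -- min/max respect the category order
```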

In [7]:
# Convert column data type
echo["alive"] = echo["alive"].astype('category')
echo["pericardialeffusion"] = echo["pericardialeffusion"].astype('category')
echo["aliveat1"] = echo["aliveat1"].astype('category')

# View info
echo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 13 columns):
survival                130 non-null float64
alive                   131 non-null category
age                     126 non-null float64
pericardialeffusion     132 non-null category
fractionalshortening    124 non-null float64
epss                    117 non-null float64
lvdd                    121 non-null float64
wallmotion-score        128 non-null float64
wallmotion-index        130 non-null float64
mult                    129 non-null float64
name                    131 non-null object
group                   110 non-null object
aliveat1                75 non-null category
dtypes: category(3), float64(8), object(2)
memory usage: 11.2+ KB

We can now see that the data types of the three columns have been changed to category. Note that our memory usage is now 11.2+ KB instead of 13.6+ KB.

Also note that here I have done the conversion column by column.
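In fact, .astype() also accepts a dictionary mapping column names to data types, which collapses the three calls above into one. A sketch on a toy frame (the column names are just illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"alive": [0.0, 1.0],
                    "pericardialeffusion": [0.0, 0.0],
                    "age": [71.0, 72.0]})

# One call converts several columns at once; unlisted columns keep their dtype
toy = toy.astype({"alive": "category", "pericardialeffusion": "category"})

print(toy.dtypes["alive"])  # category
print(toy.dtypes["age"])    # float64
```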

Creating A Data Frame

I am now going to diverge slightly from this topic. At this point, we have learnt how to read in a CSV data set; I'm going to show you how to create your own data frame from lists.

For this purpose, let's extract all the values in the survival and age columns. A data frame column has a .values attribute; we transform the resulting array into a list using list().

In [8]:
# Extract values from survival column and transform it into a list
surv_list = list(echo['survival'].values)

# Extract values from age column and transform it into a list
age_list = list(echo['age'].values)

# Save the lists in one variable
values_list = [surv_list, age_list]

# Create list of column names
colnames_list = ['survival', 'age']

We have two lists, values_list containing 2 lists of values extracted from the original data set, and colnames_list containing the column names for our new data set. We will use these lists to create our data frame using pd.DataFrame().

We will zip these lists using Python's built-in zip() function, which returns an iterator of tuples (a zip object) from any number of iterables.

When you zip lists with zip(), the result has to be consumed with list() before you can see its contents; otherwise, printing it only shows the zip object itself.

In [9]:
# Zip lists
lists_zipped = zip(colnames_list, values_list)

# Print lists_zipped
lists_zipped
Out[9]:
<zip at 0x1157c5708>

Note that it does not show the values as expected, so let's unpack it using list().

In [10]:
# Unpack values using list
unpacked_zipped = list(lists_zipped)

Once we have unpacked the values, we will use the dict() function to create a dictionary called mini_echo. Then, we will use pd.DataFrame() to transform the dictionary into a data frame.

In [11]:
# Create dictionary
mini_echo = dict(unpacked_zipped)

# Create data frame
df = pd.DataFrame(mini_echo)

# View head
df.head(10)
Out[11]:
age survival
0 71.0 11.0
1 72.0 19.0
2 55.0 16.0
3 60.0 57.0
4 57.0 19.0
5 68.0 26.0
6 62.0 13.0
7 60.0 50.0
8 46.0 19.0
9 54.0 25.0
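As an aside, the zip-then-dict detour is not strictly necessary: pd.DataFrame() also takes a dictionary of lists directly. A sketch with made-up values:

```python
import pandas as pd

# pd.DataFrame accepts a {column_name: values} dictionary directly
df_direct = pd.DataFrame({"survival": [11.0, 19.0, 16.0],
                          "age": [71.0, 72.0, 55.0]})

print(df_direct.shape)  # (3, 2)
```

The zip() route is still worth knowing, since column names and values often arrive as separate lists.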

Naming and Re-naming Columns

Let's use the data frame we created above. Instead of survival, I want to rename that column to surv. We can use the .rename() method.

In [12]:
# Renaming a column
df.rename(columns={'survival':'surv'}, inplace=True)

# View head
df.head(10)
Out[12]:
age surv
0 71.0 11.0
1 72.0 19.0
2 55.0 16.0
3 60.0 57.0
4 57.0 19.0
5 68.0 26.0
6 62.0 13.0
7 60.0 50.0
8 46.0 19.0
9 54.0 25.0

The column name survival is now updated to surv. Note that by specifying inplace=True, no copy of the data frame is created; the changes are made in the original data frame.
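Without inplace=True, .rename() leaves the original frame untouched and returns a renamed copy, which you would assign back yourself. A small sketch:

```python
import pandas as pd

df_demo = pd.DataFrame({"survival": [11.0, 19.0]})

# Default behaviour (inplace=False): a renamed copy is returned,
# and the original frame is left unchanged
renamed = df_demo.rename(columns={"survival": "surv"})

print(list(df_demo.columns))  # ['survival']
print(list(renamed.columns))  # ['surv']
```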

I'm going to change the column names to a and b, to demonstrate a situation where you have a data frame with no meaningful column names.

In [13]:
# Rename column into a and b
df.rename(columns={'age':'a', 'surv':'b'}, inplace=True)

# View head 
df.head(10)
Out[13]:
a b
0 71.0 11.0
1 72.0 19.0
2 55.0 16.0
3 60.0 57.0
4 57.0 19.0
5 68.0 26.0
6 62.0 13.0
7 60.0 50.0
8 46.0 19.0
9 54.0 25.0

To restore meaningful names, we create a list of column names and assign it to df.columns.

In [14]:
# Name columns
df.columns = ['age', 'survival']

# View head 
df.head()
Out[14]:
age survival
0 71.0 11.0
1 72.0 19.0
2 55.0 16.0
3 60.0 57.0
4 57.0 19.0

Delete or Remove Columns

We have gathered from the description of this data set that some of the columns are not useful and can be removed. We are going to remove these columns: wallmotion-score, mult, name and group.

For this purpose, we will use .drop() and instead of column names, we will specify the column numbers.

In [15]:
# Remove columns 
echo.drop(echo.columns[[7, 9, 10, 11]], axis=1, inplace=True)
In [16]:
# View head
echo.head()
Out[16]:
survival alive age pericardialeffusion fractionalshortening epss lvdd wallmotion-index aliveat1
0 11.0 0.0 71.0 0.0 0.260 9.000 4.600 1.00 0.0
1 19.0 0.0 72.0 0.0 0.380 6.000 4.100 1.70 0.0
2 16.0 0.0 55.0 0.0 0.260 4.000 3.420 1.00 0.0
3 57.0 0.0 60.0 0.0 0.253 12.062 4.603 1.45 0.0
4 19.0 1.0 57.0 0.0 0.160 22.000 5.750 2.25 0.0
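.drop() also accepts column labels directly, which is usually safer than positional indices because it does not break if the column order changes. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"survival": [11.0], "mult": [1.0],
                    "name": ["name"], "group": [1]})

# Dropping by label is robust to changes in column order
trimmed = toy.drop(columns=["mult", "name", "group"])

print(list(trimmed.columns))  # ['survival']
```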

Re-arrange Columns

The survival, alive and aliveat1 columns are closely related, so it will be easier to have them located next to each other. Currently aliveat1 is the last column in the data set.

We can get the list of columns first and then rearrange it to make the data set easier to analyse and understand.

In [17]:
# Get list of columns
list(echo.columns.values)
Out[17]:
['survival',
 'alive',
 'age',
 'pericardialeffusion',
 'fractionalshortening',
 'epss',
 'lvdd',
 'wallmotion-index',
 'aliveat1']
In [18]:
# Rearrange columns 
echo = echo[['survival',
 'alive',
 'aliveat1',
 'age',
 'pericardialeffusion',
 'fractionalshortening',
 'epss',
 'lvdd',
 'wallmotion-index']]

# View head
echo.head(10)
Out[18]:
survival alive aliveat1 age pericardialeffusion fractionalshortening epss lvdd wallmotion-index
0 11.0 0.0 0.0 71.0 0.0 0.260 9.000 4.600 1.000
1 19.0 0.0 0.0 72.0 0.0 0.380 6.000 4.100 1.700
2 16.0 0.0 0.0 55.0 0.0 0.260 4.000 3.420 1.000
3 57.0 0.0 0.0 60.0 0.0 0.253 12.062 4.603 1.450
4 19.0 1.0 0.0 57.0 0.0 0.160 22.000 5.750 2.250
5 26.0 0.0 0.0 68.0 0.0 0.260 5.000 4.310 1.000
6 13.0 0.0 0.0 62.0 0.0 0.230 31.000 5.430 1.875
7 50.0 0.0 0.0 60.0 0.0 0.330 8.000 5.250 1.000
8 19.0 0.0 0.0 46.0 0.0 0.340 0.000 5.090 1.140
9 25.0 0.0 0.0 54.0 0.0 0.140 13.000 4.490 1.190

Slicing using loc and iloc

Let's learn how to select some elements using loc and iloc.

  • loc is label-based selection
  • iloc is position-based selection (using integer)

To select elements in a data frame, we need to specify the rows and columns we want. Let's look at the difference between the two methods by selecting the 1st to 11th rows, with only the survival, alive and epss columns.

In [19]:
# Selecting using loc
echo.loc[0:10, ['survival', 'alive', 'epss']]
Out[19]:
survival alive epss
0 11.0 0.0 9.000
1 19.0 0.0 6.000
2 16.0 0.0 4.000
3 57.0 0.0 12.062
4 19.0 1.0 22.000
5 26.0 0.0 5.000
6 13.0 0.0 31.000
7 50.0 0.0 8.000
8 19.0 0.0 0.000
9 25.0 0.0 13.000
10 10.0 1.0 16.000
In [20]:
# Selecting using iloc
echo.iloc[0:11, [0, 1, 6]]
Out[20]:
survival alive epss
0 11.0 0.0 9.000
1 19.0 0.0 6.000
2 16.0 0.0 4.000
3 57.0 0.0 12.062
4 19.0 1.0 22.000
5 26.0 0.0 5.000
6 13.0 0.0 31.000
7 50.0 0.0 8.000
8 19.0 0.0 0.000
9 25.0 0.0 13.000
10 10.0 1.0 16.000

The index of this data frame is made of integers, starting from 0.

Notice the difference between the two methods:

  1. In loc, we specify 0:10, while in iloc, we specify 0:11. loc is label-based, and the label of the 11th row in this data frame is 10; with labels, the end of the slice is included. iloc is position-based and follows the usual [start:end) convention, where the end position is not included, so we specify 11.
  2. For loc, the column names are specified, but for iloc, the index of the columns are specified.
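The distinction is easiest to see on a frame whose index labels are not 0-based integers. A sketch:

```python
import pandas as pd

# Index labels deliberately do not match positions
df_idx = pd.DataFrame({"val": [10, 20, 30]}, index=[100, 101, 102])

print(df_idx.loc[100:101])  # label-based: end label 101 IS included
print(df_idx.iloc[0:2])     # position-based: end position 2 is NOT included
```

Both expressions return the same two rows here, but only because the slices were chosen to match; with the default integer index the off-by-one between loc and iloc is easy to miss.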

This section concludes this part.

At this stage, we have done some cleaning of the Echocardiogram data set, but we have not yet dealt with its missing values. Missing values are really common, so in the next part of Data Science with Python we will learn a few strategies for handling them.