Data Science with Python, Part 3: pandas
Contents¶
- Introduction to pandas
- Import Packages
- Reading in Data Set
- Understanding the Data Set
- Conversion of Column Data Types
- Creating A Data Frame
- Naming and Re-naming Columns
- Delete or Remove Columns
- Re-arrange Columns
- Slicing using loc and iloc
This is the third post in the Data Science with Python series. Read the previous posts here:
Introduction to pandas ¶
pandas is a Python package that helps you prepare your data for further analysis. It is a great package for data munging. Say you want to apply a machine learning technique that requires the data to be in a certain format, with no missing values et cetera: pandas is your friend.
For this tutorial, we will use the Echocardiogram Data Set from the UCI Machine Learning Repository. The file is in CSV format. Without rambling any further, let's dig in.
Import Packages ¶
First and foremost, we need to import the package. We will also import the numpy and matplotlib packages. It is common for pandas to be abbreviated as pd: less typing, shorter code, and it keeps things neat.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Reading in Data Set ¶
To read in the data set, we will use pd.read_csv().
# Read in data set
echo = pd.read_csv("/Users/azmirfakkri/Downloads/echocardiogram.csv")
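A side note: the original UCI file marks missing entries with "?". If your copy of the csv does the same, you can tell pd.read_csv to treat those entries as missing via its na_values parameter:
# Treat "?" as a missing value while reading (only if your file uses it)
echo = pd.read_csv("/Users/azmirfakkri/Downloads/echocardiogram.csv",
                   na_values="?")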
Understanding the Data Set ¶
Once you read in your data, you need to get a feel for it: see what columns it has, whether there are any missing values, and so on.
# View head
echo.head(10)
# View tail
echo.tail(10)
To quickly view the shape of the data set, use its shape attribute. As mentioned in a previous post, an attribute does not require brackets ().
# View shape
echo.shape
At this point we have gained some information about our data set:
- We now know that there are 133 rows or observations and 13 columns in this data set
- There are missing values
- The names of the columns, which we can list as shown below
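For example, the column names are available via the columns attribute:
# View column names
echo.columns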
I always stress to myself the importance of knowing the meaning of every column in the data set, as this will help you really understand what you are dealing with. For this data set, a description of the columns is given.
Attribute Information:
- survival -- the number of months patient survived (has survived, if patient is still alive). Because all the patients had their heart attacks at different times, it is possible that some patients have survived less than one year but they are still alive. Check the second variable to confirm this. Such patients cannot be used for the prediction task mentioned above.
- alive -- a binary variable. 0=dead at end of survival period, 1 means still alive
- age -- age in years when heart attack occurred
- pericardialeffusion -- binary. Pericardial effusion is fluid around the heart. 0=no fluid, 1=fluid
- fractionalshortening -- a measure of contractility of the heart; lower numbers are increasingly abnormal
- epss -- E-point septal separation, another measure of contractility. Larger numbers are increasingly abnormal.
- lvdd -- left ventricular end-diastolic dimension. This is a measure of the size of the heart at end-diastole. Large hearts tend to be sick hearts.
- wallmotion-score -- a measure of how the segments of the left ventricle are moving
- wallmotion-index -- equals wall-motion-score divided by number of segments seen. Usually 12-13 segments are seen in an echocardiogram. Use this variable INSTEAD of the wall motion score.
- mult -- a derived variable which can be ignored
- name -- the name of the patient (I have replaced them with "name")
- group -- meaningless, ignore it
- aliveat1 -- Boolean-valued. Derived from the first two attributes. 0 means patient was either dead after 1 year or had been followed for less than 1 year. 1 means patient was alive at 1 year.
To get a complete picture of the data set, we can use echo.info().
# Info of data set
echo.info()
Using echo.info(), we can see that all columns contain at least one missing value.
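To make this concrete, we can count the missing values in each column using isnull() combined with sum():
# Count missing values in each column
echo.isnull().sum()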
Conversion of Column Data Types ¶
Based on the description, we know that there are three columns with categorical data: alive, pericardialeffusion, aliveat1.
We are going to change the data types of these columns to dtype="category". There are a few reasons why we want to do this:
- It will save memory
- It will make it easier to sort the data
- It will signal other libraries in Python to treat these columns as categorical data
We will use .astype() for this purpose, passing the data type we want to convert to, in this case category. It is also possible to create an ordered categorical, where the categories have a meaningful order; in recent pandas versions this is done through a CategoricalDtype with ordered=True, as sketched below.
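Here is a minimal sketch of an ordered categorical, using a hypothetical severity column that is not part of this data set:
# Hypothetical example: an ordered categorical (not from the echo data)
severity = pd.Series(['mild', 'severe', 'moderate', 'mild'])
ordered_type = pd.api.types.CategoricalDtype(
    categories=['mild', 'moderate', 'severe'], ordered=True)
severity = severity.astype(ordered_type)
severity.sort_values()  # sorts mild < moderate < severe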
However, for the three columns here, an ordering is unnecessary and meaningless. For example, in the pericardialeffusion column, having fluid (=1) is not better than not having fluid (=0), and vice versa.
# Convert column data type
echo["alive"] = echo["alive"].astype('category')
echo["pericardialeffusion"] = echo["pericardialeffusion"].astype('category')
echo["aliveat1"] = echo["aliveat1"].astype('category')
# View info
echo.info()
We can now see that the data types of the three columns have been changed to category. Note that our memory usage is now 11.2+ KB instead of 13.6+ KB.
Also note that, in older versions of pandas, this conversion had to be done column by column. Recent versions accept a dictionary mapping column names to dtypes, so it can be done in one call.
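A minimal sketch of the one-call version, equivalent to the three conversions above (assuming a recent pandas version):
# Convert all three columns in one call (recent pandas versions)
echo = echo.astype({'alive': 'category',
                    'pericardialeffusion': 'category',
                    'aliveat1': 'category'})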
Creating A Data Frame ¶
I am now going to diverge slightly from this topic. At this point, we have learnt how to read in a csv data set. I'm going to show how you can create your own data frame from lists.
For this purpose, let's extract all the values in the survival and age columns. A data frame has another attribute called .values. We transform the resulting values into a list using list().
# Extract values from survival column and transform it into a list
surv_list = list(echo['survival'].values)
# Extract values from age column and transform it into a list
age_list = list(echo['age'].values)
# Save the lists in one variable
values_list = [surv_list, age_list]
# Create list of column names
colnames_list = ['survival', 'age']
We now have two lists: values_list, containing the two lists of values extracted from the original data set, and colnames_list, containing the column names for our new data set. We will use these to create our data frame with pd.DataFrame().
We will zip these lists using Python's built-in zip() function, which returns an iterator of tuples (a zip object) from any number of iterables. When you zip lists with zip(), the result has to be unpacked using list() before you can print it; otherwise it will not show the values.
# Zip lists
lists_zipped = zip(colnames_list, values_list)
# Print lists_zipped
lists_zipped
Note that it does not return the values as expected, so let's unpack it using list().
# Unpack values using list
unpacked_zipped = list(lists_zipped)
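Printing the unpacked result now shows the (column name, values) tuples; here we peek at the first pair only, since each list holds 133 values:
# Print the first (column name, values) pair
print(unpacked_zipped[0])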
Once we have unpacked the values, we will use the dict() function to create a dictionary called mini_echo. Then we will use pd.DataFrame() to transform the dictionary into a data frame.
# Create dictionary
mini_echo = dict(unpacked_zipped)
# Create data frame
df = pd.DataFrame(mini_echo)
# View head
df.head(10)
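As an aside, the zip() and dict() steps can be skipped entirely by passing a dictionary literal straight to pd.DataFrame(); a minimal sketch using the two lists from above:
# Equivalent one-step construction from the two lists
df_direct = pd.DataFrame({'survival': surv_list, 'age': age_list})
df_direct.head()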
Naming and Re-naming Columns ¶
Let's use the data frame that we created above. Instead of survival, I want to rename that column to surv. We can use the .rename() method.
# Renaming a column
df.rename(columns={'survival':'surv'}, inplace=True)
# View head
df.head(10)
The column name survival is now updated to surv. Note that by specifying inplace=True, no copy of the data frame is created; the changes are made in the original data frame.
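If you would rather keep the original data frame unchanged, omit inplace=True; .rename() then returns a new data frame:
# Return a renamed copy, leaving df unchanged
df_copy = df.rename(columns={'surv': 'survival'})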
Next, I'm going to change the column names to a and b, to demonstrate a situation where you have a data frame with no meaningful column names.
# Rename column into a and b
df.rename(columns={'age':'a', 'surv':'b'}, inplace=True)
# View head
df.head(10)
To give the columns meaningful names again, we create a list of column names and assign it to df.columns.
# Name columns
df.columns = ['age', 'survival']
# View head
df.head()
Delete or Remove Columns ¶
We have gathered from the description of this data set that some of the columns are not useful and can be removed. We are going to remove these columns: wallmotion-score, mult, name and group.
For this purpose, we will use .drop(), and instead of column names we will specify the column positions.
# Remove columns
echo.drop(echo.columns[[7, 9, 10, 11]], axis=1, inplace=True)
# View head
echo.head()
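Dropping by position is fragile if the column order ever changes. For reference, the same result can be achieved by name, assuming the column labels match the description above (the columns have already been removed at this point, so this is just a sketch):
# Equivalent: drop the same columns by name instead of position
# (sketch only; the columns were already removed above)
echo = echo.drop(['wallmotion-score', 'mult', 'name', 'group'], axis=1)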
Re-arrange Columns ¶
The survival, alive and aliveat1 columns are closely related to each other, so it will be easier to have them located next to each other. Currently aliveat1 is the last column in the data set.
We can get the list of columns first and then re-arrange it to make the data set easier to analyse and understand.
# Get list of columns
list(echo.columns.values)
# Rearrange columns
echo = echo[['survival',
'alive',
'aliveat1',
'age',
'pericardialeffusion',
'fractionalshortening',
'epss',
'lvdd',
'wallmotion-index']]
# View head
echo.head(10)
Slicing using loc and iloc ¶
Let's learn how to select some elements using loc and iloc.
- loc is label-based selection
- iloc is position-based selection (using integer)
To select elements in a data frame, we need to specify the rows and columns we want. We will look at the difference between the two methods by selecting the 1st to 11th rows, with only the survival, alive and epss columns.
# Selecting using loc
echo.loc[0:10, ['survival', 'alive', 'epss']]
# Selecting using iloc
echo.iloc[0:11, [0, 1, 6]]
The index of this data frame is made of integers, starting from 0.
Notice the difference between the two methods:
- In loc we specify 0:10, while in iloc we specify 0:11. This is because loc is label-based and the label of the 11th row in this data frame is 10; with loc, the end label is included. iloc is position-based, following the usual [start:end) convention where the end is excluded, so we specify 11 (the position of the 12th row, which is not included). The sketch below makes this concrete.
- For loc, the column names are specified; for iloc, the column positions are specified.
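To make the label-versus-position distinction concrete, here is a small sketch with a made-up data frame (not the echo data) whose index consists of string labels:
# With string labels, loc slices by label (end included),
# while iloc slices by position (end excluded)
demo = pd.DataFrame({'x': [10, 20, 30]}, index=['a', 'b', 'c'])
demo.loc['a':'b']   # rows 'a' and 'b'
demo.iloc[0:2]      # first two rows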
That concludes this part.
At this stage, we have done some cleaning of the Echocardiogram data set, but we have not yet dealt with the missing values, which are very common in real data. In the next part of Data Science with Python, we will learn a few strategies for dealing with missing values.