Data Science with Python, Part 2: NumPy

May 25, 2018

This is the second post in the Data Science with Python series. Read the first post here:

  1. Data Science with Python, Part 1: Introduction to Basics

Introduction to NumPy

NumPy stands for Numerical Python. It is one of the most commonly used packages in Python. So why do we need NumPy?
Let's look at an example. Suppose you have a list and you want to divide each of its elements by 2.

In [1]:
# Create a list
list1 = [1, 2, 3, 4, 5]

# Divide list by 2
list1 / 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-7a23a9964c02> in <module>()
      3
      4 # Divide list by 2
----> 5 list1 / 2

TypeError: unsupported operand type(s) for /: 'list' and 'int'

Python threw an error in this example. Note what the error says: Python does not know how to handle the division operator between a list and an integer. Let's fix this using NumPy.

In [2]:
# Import package 
import numpy as np

# Create numpy array 
np_list1 = np.array(list1)

# Divide np_list1
np_list1 / 2
Out[2]:
array([ 0.5,  1. ,  1.5,  2. ,  2.5])

This time, Python did not throw an error, and it gave us the result of the division element-wise!
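For comparison, here is a small sketch of the same division done with a plain Python list comprehension next to the NumPy version. The names halved_list and halved_array are my own, chosen only for illustration.

```python
import numpy as np

list1 = [1, 2, 3, 4, 5]

# Plain Python: we have to loop over the elements ourselves
halved_list = [value / 2 for value in list1]

# NumPy: the division is applied to every element of the array
halved_array = np.array(list1) / 2

print(halved_list)
print(halved_array)
```

Both approaches give the same numbers. The NumPy version simply lets us write the operation once for the whole array.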

Creating a NumPy Array

Remember that we learnt a few Python data types in the first post of this series. NumPy gives us another one: the array. Let's dissect what this sentence means: "NumPy’s main object is the homogeneous multidimensional array".

  • Homogeneous - a NumPy array can only contain one type of data, for example strings or integers. If there are multiple data types, they will be converted to a single data type.
  • Multidimensional array - it can have more than 1 dimension. Dimensions are called axes in NumPy.
In [3]:
# Create a NumPy array
x = np.array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14]], dtype=np.int16)

# View x
x
Out[3]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]], dtype=int16)

One note on creating a NumPy array: you need to provide a single list. When creating x, we put 3 sublists inside one list. So we need that extra pair of brackets to make sure we pass one list and not 3 separate lists.

We can also specify dtype, which is the second argument of np.array. Here we passed it by keyword as dtype=np.int16.
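To make the "homogeneous" point concrete, here is a minimal sketch of what happens when the input list mixes types, and of passing dtype explicitly. The variable names are made up for this example.

```python
import numpy as np

# Integers and a float together: every element becomes a float
mixed_numbers = np.array([1, 2, 3.5])

# A number and a string together: every element becomes a string
mixed_text = np.array([1, "two"])

# Passing dtype explicitly, as we did for x
small_ints = np.array([1, 2, 3], dtype=np.int16)

print(mixed_numbers.dtype)
print(mixed_text.dtype.kind)  # 'U' means a unicode string type
print(small_ints.dtype)
```

In each case the array ends up with exactly one dtype, no matter how many types were in the original list.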

NumPy Attributes

NumPy arrays have attributes associated with them. We will look at these in further detail using the example we created earlier.

First attribute: ndarray.ndim. Calling this on the array tells us the number of axes (dimensions) of the array.

In [4]:
x.ndim
Out[4]:
2

There are two axes in x.

Important: There are no parentheses () after an attribute, unlike methods, which are called as example.method().

Second attribute: ndarray.shape. It tells us the length of each axis (dimension) of the array. For a two-dimensional array the result is (n, m), where n is the number of rows and m is the number of columns.

In [5]:
x.shape
Out[5]:
(3, 5)

As we found out from the first attribute, this array has two axes. The first axis has length 3 (the rows) and the second axis has length 5 (the columns).

Third attribute: ndarray.size. It tells us the total number of elements in the array.

In [6]:
x.size
Out[6]:
15

There are 15 elements in the array.
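The first three attributes are related to each other, which the following short sketch tries to show. It rebuilds x with np.arange and reshape purely for brevity, which is my shortcut and not how x was created above.

```python
import numpy as np

# Same values as x, built with arange + reshape for brevity
x = np.arange(15, dtype=np.int16).reshape(3, 5)

n_rows, n_cols = x.shape

# ndim is the number of entries in shape
print(len(x.shape) == x.ndim)

# size is the product of the entries in shape
print(n_rows * n_cols == x.size)
```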

Fourth attribute: ndarray.dtype. It describes the type of the elements in the array.

In [7]:
x.dtype
Out[7]:
dtype('int16')

Our elements are 16-bit integers.

Fifth attribute: ndarray.itemsize. It tells us the size in bytes of each element in the array.

In [8]:
x.itemsize
Out[8]:
2

How did we get 2? When you look at our result for x.dtype, it has 16 at the end, which is the number of bits per element. Divide it by 8 (the number of bits in a byte) and you get 2. If your element type is int64, the itemsize is 64/8 = 8.
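The bits-divided-by-8 rule can be checked for a few dtypes with a quick sketch like the one below. The loop, the throwaway array arr, and the use of the nbytes attribute are my additions for illustration.

```python
import numpy as np

# itemsize is the number of bits in the dtype divided by 8
for dtype in (np.int8, np.int16, np.int32, np.int64, np.float64):
    arr = np.zeros(3, dtype=dtype)
    print(arr.dtype, arr.itemsize)

# Total memory used by the elements: size * itemsize
x = np.arange(15, dtype=np.int16).reshape(3, 5)
total_bytes = x.size * x.itemsize
print(total_bytes, x.nbytes)
```

The last line prints the same number twice, since nbytes is the total number of bytes taken up by the elements of the array.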

NumPy Basic Operations

As mentioned in the introduction of this post, NumPy allows you to do element-wise operations on arrays. Let's see the difference between using NumPy and not using NumPy when we do addition.

In [9]:
# Create lists
a = [1, 2, 3]
b = [1, 2, 3]

# Addition
a + b
Out[9]:
[1, 2, 3, 1, 2, 3]

In the example above, we are not using NumPy arrays. The addition operator simply concatenates the two lists.

Let's see what happens when we use NumPy arrays.

In [10]:
# Create NumPy array
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# Addition
a + b
Out[10]:
array([2, 4, 6])

Now, the elements are added element-wise. This is also true for subtraction.
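Here is a short sketch of the subtraction case. I changed the values of b from the example above so that the result is not all zeros.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])  # different values from above, for illustration

# Each element of a is subtracted from the matching element of b
difference = b - a
print(difference)
```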

The next operation that I want to describe is multiplication *. Let's create two NumPy arrays to demonstrate this.

In [11]:
# Create NumPy arrays
c = np.array([[1, 2],
              [3, 4]])
d = np.array([[5, 6],
              [7, 8]])

# Element-wise multiplication 
c*d
Out[11]:
array([[ 5, 12],
       [21, 32]])

As mentioned earlier, this is how we expect NumPy to behave. But what if, instead of element-wise multiplication, we want a matrix product? We can do that using dot, which is available both as a function and as a method. For that reason, we can use dot in two ways.

In [12]:
# Using dot as a method to get matrix product 
c.dot(d)
Out[12]:
array([[19, 22],
       [43, 50]])
In [13]:
# Using dot as a function to get matrix product 
np.dot(c, d)
Out[13]:
array([[19, 22],
       [43, 50]])

We can see that we get the same result, even though we asked for it in slightly different ways.
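As a side note that is not covered above, Python 3.5 and later also provide the @ operator, which gives the matrix product for two-dimensional arrays like these. A minimal sketch:

```python
import numpy as np

c = np.array([[1, 2],
              [3, 4]])
d = np.array([[5, 6],
              [7, 8]])

# Matrix product using the @ operator
product = c @ d
print(product)

# Same result as np.dot for these two-dimensional arrays
print(np.array_equal(product, np.dot(c, d)))
```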

Unary operations are operations that act on a single array, such as computing the sum of all its elements. Many of these operations are implemented as methods, which means you need to add parentheses, like this: example.method(). Let's use our x.

In [14]:
# Get sum of x
x.sum()
Out[14]:
105
In [15]:
# Find the maximum number in x
x.max()
Out[15]:
14
In [16]:
# Find the minimum number in x
x.min()
Out[16]:
0

We can also do operations across rows and columns in NumPy. Let's have a look at x again.

In [17]:
# View x
x
Out[17]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]], dtype=int16)

To do this we need to specify the axis parameter.

In [18]:
# Sum of each column, axis=0
x.sum(axis=0)
Out[18]:
array([15, 18, 21, 24, 27])
In [19]:
# Sum of each row, axis=1
x.sum(axis=1)
Out[19]:
array([10, 35, 60])
In [20]:
# Cumulative sum down each column, axis=0
x.cumsum(axis=0)
Out[20]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  7,  9, 11, 13],
       [15, 18, 21, 24, 27]])
In [21]:
# Cumulative sum along each row, axis=1
x.cumsum(axis=1)
Out[21]:
array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]])
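The axis parameter works the same way for the other methods we saw, such as max and min. The sketch below rebuilds x with np.arange for brevity, and the names col_max and row_min are mine.

```python
import numpy as np

x = np.arange(15, dtype=np.int16).reshape(3, 5)

# axis=0 collapses the rows: one result per column
col_max = x.max(axis=0)

# axis=1 collapses the columns: one result per row
row_min = x.min(axis=1)

print(col_max, col_max.shape)
print(row_min, row_min.shape)
```

One way to remember it: the axis you pass is the one that disappears from the shape of the result.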

Indexing and Slicing in NumPy

We can use square brackets to index and slice NumPy arrays. Let's look at some examples for one-dimensional and multidimensional arrays.

In [22]:
# Indexing one-dimensional array 
e = np.array([2, 4, 6, 8, 10])

# Get the third element from e
e[2]
Out[22]:
6
In [23]:
# Get the 1st to 3rd elements
e[0:3]
Out[23]:
array([2, 4, 6])
In [24]:
# Let's use our x
x
Out[24]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]], dtype=int16)
In [25]:
# Get element in the 2nd row, 4th column
x[1, 3]
Out[25]:
8
In [26]:
# Get elements in the 3rd row, 2nd to 4th columns
x[2, 1:4]
Out[26]:
array([11, 12, 13], dtype=int16)
In [27]:
# Get all elements in the 1st row 
x[0, :]
Out[27]:
array([0, 1, 2, 3, 4], dtype=int16)
In [28]:
# Get all elements in the 2nd column 
x[ : , 1]
Out[28]:
array([ 1,  6, 11], dtype=int16)
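Row and column slices can also be combined in one expression. The following sketch goes slightly beyond the examples above by using a negative index and a step, both of which behave as they do for Python lists. The variable names are my own.

```python
import numpy as np

x = np.arange(15, dtype=np.int16).reshape(3, 5)

# 1st to 2nd rows, 2nd to 3rd columns
block = x[0:2, 1:3]

# Last row, using a negative index
last_row = x[-1, :]

# Every other column, using a step of 2
every_other = x[:, ::2]

print(block)
print(last_row)
print(every_other)
```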

Manipulating NumPy Shape

We will go through a few commands for manipulating the shape of NumPy arrays.

Changing Array Shape

The following three commands return an array with a modified shape, but they do not change the shape of the original array. I will demonstrate this in the examples. We will use our x.

ndarray.ravel() - it will return a flattened array

In [29]:
# View x
x
Out[29]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]], dtype=int16)
In [30]:
# Flattened x
x.ravel()
Out[30]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14], dtype=int16)

ndarray.reshape() - it will return an array with a different shape as specified

In [31]:
# Change shape of x from (3, 5) to (5, 3)
x.reshape(5, 3)
Out[31]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]], dtype=int16)

ndarray.T - it will return a transposed array, meaning the rows and columns are interchanged

Note that there are no parentheses after ndarray.T, because it is an attribute and not a method

In [32]:
# Transpose
x.T
Out[32]:
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]], dtype=int16)

Let's check our x again and see if we have actually changed the original array.

In [34]:
# View x
x
Out[34]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]], dtype=int16)

Nope, we haven't! We still have the original array intact. The next method however will change the original array.
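The same check can be written as a short script that compares the shapes directly. This is only a sketch, and the names flat, reshaped and transposed are my own.

```python
import numpy as np

x = np.arange(15, dtype=np.int16).reshape(3, 5)

flat = x.ravel()
reshaped = x.reshape(5, 3)
transposed = x.T

# Each result has its own shape, while x keeps (3, 5)
print(flat.shape, reshaped.shape, transposed.shape, x.shape)

# reshape(5, 3) and T have the same shape but different contents
print(np.array_equal(reshaped, transposed))
```

Note that reshape refills the new shape row by row, whereas T swaps rows with columns, so the two (5, 3) results are not the same array.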

ndarray.resize() - it changes the shape of the original array in place and returns None instead of a new array

In [38]:
# Resize array
x.resize((5, 3))

# View updated x
x
Out[38]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]], dtype=int16)
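To see the difference from reshape, we can capture what resize returns. A minimal sketch, where the name result is mine:

```python
import numpy as np

x = np.array([[0, 1, 2, 3, 4],
              [5, 6, 7, 8, 9],
              [10, 11, 12, 13, 14]], dtype=np.int16)

# resize works in place, so there is nothing useful to assign
result = x.resize((5, 3))

print(result)
print(x.shape)
```

Because the return value is None, writing something like y = x.resize((5, 3)) would leave y empty. Use reshape when you want a new array and resize when you want to change the existing one.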

Stacking Arrays

We can stack arrays together in NumPy along different axes.

Let's create new arrays to demonstrate this.

In [41]:
# Create arrays
f = np.array([[10, 20], [30, 40]])
g = np.array([[50, 60], [70, 80]])

# Stack vertically
np.vstack((f, g))
Out[41]:
array([[10, 20],
       [30, 40],
       [50, 60],
       [70, 80]])
In [42]:
# Stack horizontally
np.hstack((f,g))
Out[42]:
array([[10, 20, 50, 60],
       [30, 40, 70, 80]])
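For two-dimensional arrays like f and g, the two stacking functions line up with the axes we used earlier. The sketch below compares them with np.concatenate, which is not covered in this post and is included only to show the connection.

```python
import numpy as np

f = np.array([[10, 20], [30, 40]])
g = np.array([[50, 60], [70, 80]])

vertical = np.vstack((f, g))
horizontal = np.hstack((f, g))

# Stacking vertically adds rows, stacking horizontally adds columns
print(vertical.shape)
print(horizontal.shape)

# For 2-D arrays these match concatenating along axis 0 and axis 1
print(np.array_equal(vertical, np.concatenate((f, g), axis=0)))
print(np.array_equal(horizontal, np.concatenate((f, g), axis=1)))
```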

In the next post of this series, we will have a look at pandas. It is a package that helps you deal with data frames.