
Python for Data Analysis

In [41]: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

In [42]: int_array.astype(calibers.dtype)
Out[42]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a dtype:

In [43]: empty_uint32 = np.empty(8, dtype='u4')

In [44]: empty_uint32

Out[44]:
array([       0,        0, 65904672,        0, 64856792,        0,
       39438163,        0], dtype=uint32)

Calling astype always creates a new array (a copy of the data), even if
the new dtype is the same as the old dtype.
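A quick way to see this copying behavior, using a small made-up array:

```python
import numpy as np

arr = np.arange(5, dtype=np.float64)
same = arr.astype(np.float64)   # same dtype, but still a brand-new copy

same[0] = 99.0                  # modifying the copy...
print(arr[0])                   # ...leaves the original untouched: 0.0
```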

It’s worth keeping in mind that floating point numbers, such as those
in float64 and float32 arrays, are only capable of approximating
fractional quantities. In complex computations, you may accrue some
floating point error, making comparisons only valid up to a certain
number of decimal places.
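To make that last point concrete, here is a small sketch (arrays made up for illustration) showing why exact equality comparisons on floats can fail, and how np.allclose compares values only up to a small tolerance:

```python
import numpy as np

a = np.array([0.1]) + np.array([0.2])
b = np.array([0.3])

print(a == b)             # exact comparison fails: [False]
print(np.allclose(a, b))  # True -- equal up to a small tolerance
```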

Operations between Arrays and Scalars

Arrays are important because they enable you to express batch operations on data
without writing any for loops. This is usually called vectorization. Arithmetic
operations between equal-size arrays apply the operation elementwise:

In [45]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [46]: arr
Out[46]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [47]: arr * arr
Out[47]:
array([[  1.,   4.,   9.],
       [ 16.,  25.,  36.]])

In [48]: arr - arr
Out[48]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

Arithmetic operations with scalars are as you would expect, propagating the value to
each element:

In [49]: 1 / arr
Out[49]:
array([[ 1.    ,  0.5   ,  0.3333],
       [ 0.25  ,  0.2   ,  0.1667]])

In [50]: arr ** 0.5
Out[50]:
array([[ 1.    ,  1.4142,  1.7321],
       [ 2.    ,  2.2361,  2.4495]])

The NumPy ndarray: A Multidimensional Array Object | 85

Operations between differently sized arrays are called broadcasting and will be discussed
in more detail in Chapter 12. A deep understanding of broadcasting is not necessary
for most of this book.
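As a minimal taste of what Chapter 12 covers (arrays here are made up): a one-dimensional array combined with a two-dimensional one gets stretched across its rows:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])  # shape (2, 3)
row = np.array([10., 20., 30.])               # shape (3,)

# row is broadcast across each row of arr
print(arr + row)
```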

Basic Indexing and Slicing

NumPy array indexing is a rich topic, as there are many ways you may want to select
a subset of your data or individual elements. One-dimensional arrays are simple; on
the surface they act similarly to Python lists:

In [51]: arr = np.arange(10)
In [52]: arr
Out[52]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [53]: arr[5]
Out[53]: 5
In [54]: arr[5:8]
Out[54]: array([5, 6, 7])
In [55]: arr[5:8] = 12
In [56]: arr
Out[56]: array([ 0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is
propagated (or broadcast, the term used henceforth) to the entire selection. An important
first distinction from lists is that array slices are views on the original array. This means
that the data is not copied, and any modifications to the view will be reflected in the
source array:

In [57]: arr_slice = arr[5:8]
In [58]: arr_slice[1] = 12345
In [59]: arr
Out[59]: array([ 0, 1, 2, 3, 4, 12, 12345, 12, 8, 9])
In [60]: arr_slice[:] = 64
In [61]: arr
Out[61]: array([ 0, 1, 2, 3, 4, 64, 64, 64, 8, 9])

If you are new to NumPy, you might be surprised by this, especially if you have used
other array programming languages that copy data more eagerly. As NumPy has
been designed with large data use cases in mind, you could imagine the performance and
memory problems if NumPy insisted on copying data left and right.


If you want a copy of a slice of an ndarray instead of a view, you will
need to explicitly copy the array; for example arr[5:8].copy().
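A small sketch contrasting the two behaviors (the array here is made up):

```python
import numpy as np

arr = np.arange(10)
view = arr[5:8]           # a view: shares memory with arr
copied = arr[5:8].copy()  # an independent copy

view[0] = 100             # visible through arr
copied[1] = -1            # not visible through arr

print(arr)
```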

With higher dimensional arrays, you have many more options. In a two-dimensional
array, the elements at each index are no longer scalars but rather one-dimensional
arrays:

In [62]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
In [63]: arr2d[2]
Out[63]: array([7, 8, 9])

Thus, individual elements can be accessed recursively. But that is a bit too much work,
so you can pass a comma-separated list of indices to select individual elements. So these
are equivalent:

In [64]: arr2d[0][2]
Out[64]: 3
In [65]: arr2d[0, 2]
Out[65]: 3

See Figure 4-1 for an illustration of indexing on a 2D array.

Figure 4-1. Indexing elements in a NumPy array

In multidimensional arrays, if you omit later indices, the returned object will be a lower-
dimensional ndarray consisting of all the data along the higher dimensions. So in the
2 × 2 × 3 array arr3d:

In [66]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
In [67]: arr3d
Out[67]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

arr3d[0] is a 2 × 3 array:

In [68]: arr3d[0]
Out[68]:
array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to arr3d[0]:

In [69]: old_values = arr3d[0].copy()

In [70]: arr3d[0] = 42

In [71]: arr3d
Out[71]:
array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [72]: arr3d[0] = old_values

In [73]: arr3d
Out[73]:
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0),
forming a 1-dimensional array:

In [74]: arr3d[1, 0]
Out[74]: array([7, 8, 9])

Note that in all of these cases where subsections of the array have been selected, the
returned arrays are views.

Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced using the
familiar syntax:

In [75]: arr[1:6]
Out[75]: array([ 1, 2, 3, 4, 64])

Higher dimensional objects give you more options, as you can slice one or more axes
and also mix integers. Consider the 2D array above, arr2d. Slicing this array is a bit
different:

In [76]: arr2d
Out[76]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [77]: arr2d[:2]
Out[77]:
array([[1, 2, 3],
       [4, 5, 6]])

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range
of elements along an axis. You can pass multiple slices just like you can pass multiple
indexes:

In [78]: arr2d[:2, 1:]
Out[78]:
array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions.
By mixing integer indexes and slices, you get lower dimensional slices:

In [79]: arr2d[1, :2]
Out[79]: array([4, 5])

In [80]: arr2d[2, :1]
Out[80]: array([7])

See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire
axis, so you can slice only higher dimensional axes by doing:

In [81]: arr2d[:, :1]
Out[81]:
array([[1],
       [4],
       [7]])

Of course, assigning to a slice expression assigns to the whole selection:

In [82]: arr2d[:2, 1:] = 0

Boolean Indexing

Let’s consider an example where we have some data in an array and an array of names
with duplicates. I’m going to use the randn function in numpy.random to generate
some normally distributed random data:

In [83]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [84]: data = randn(7, 4)

In [85]: names
Out[85]:
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'],
      dtype='|S4')

In [86]: data
Out[86]:
array([[-0.048 ,  0.5433, -0.2349,  1.2792],
       [-0.268 ,  0.5465,  0.0939, -2.0445],
       [-0.047 , -2.026 ,  0.7719,  0.3103],
       [ 2.1452,  0.8799, -0.0523,  0.0672],
       [-1.0023, -0.1698,  1.1503,  1.7289],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174, -0.9297, -1.2564]])

Figure 4-2. Two-dimensional array slicing

Suppose each name corresponds to a row in the data array and we wanted to select all
the rows with the corresponding name 'Bob'. Like arithmetic operations, comparisons
(such as ==) with arrays are also vectorized. Thus, comparing names with the string
'Bob' yields a boolean array:

In [87]: names == 'Bob'
Out[87]: array([ True, False, False, True, False, False, False], dtype=bool)

This boolean array can be passed when indexing the array:

In [88]: data[names == 'Bob']
Out[88]:
array([[-0.048 ,  0.5433, -0.2349,  1.2792],
       [ 2.1452,  0.8799, -0.0523,  0.0672]])

The boolean array must be of the same length as the axis it’s indexing. You can even
mix and match boolean arrays with slices or integers (or sequences of integers, more
on this later):

In [89]: data[names == 'Bob', 2:]
Out[89]:
array([[-0.2349,  1.2792],
       [-0.0523,  0.0672]])

In [90]: data[names == 'Bob', 3]
Out[90]: array([ 1.2792, 0.0672])

To select everything but 'Bob', you can either use != or negate the condition using ~
(older NumPy versions also accepted the - operator here, but modern releases require ~):

In [91]: names != 'Bob'
Out[91]: array([False,  True,  True, False,  True,  True,  True], dtype=bool)

In [92]: data[~(names == 'Bob')]
Out[92]:
array([[-0.268 ,  0.5465,  0.0939, -2.0445],
       [-0.047 , -2.026 ,  0.7719,  0.3103],
       [-1.0023, -0.1698,  1.1503,  1.7289],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174, -0.9297, -1.2564]])

To select two of the three names, combining multiple boolean conditions, use boolean
arithmetic operators like & (and) and | (or):

In [93]: mask = (names == 'Bob') | (names == 'Will')

In [94]: mask
Out[94]: array([True, False, True, True, True, False, False], dtype=bool)

In [95]: data[mask]
Out[95]:
array([[-0.048 ,  0.5433, -0.2349,  1.2792],
       [-0.047 , -2.026 ,  0.7719,  0.3103],
       [ 2.1452,  0.8799, -0.0523,  0.0672],
       [-1.0023, -0.1698,  1.1503,  1.7289]])

Selecting data from an array by boolean indexing always creates a copy of the data,
even if the returned array is unchanged.

The Python keywords and and or do not work with boolean arrays.
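To illustrate this with a made-up names array: using `and`/`or` on boolean arrays raises an error, because the truth value of a multi-element array is ambiguous, while `&` and `|` work elementwise:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will'])

try:
    (names == 'Bob') or (names == 'Will')    # bool() on an array raises
except ValueError:
    print('use | instead of or')

mask = (names == 'Bob') | (names == 'Will')  # elementwise: works
print(mask)
```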

Setting values with boolean arrays works in a common-sense way. To set all of the
negative values in data to 0 we need only do:

In [96]: data[data < 0] = 0

In [97]: data
Out[97]:
array([[ 0.    ,  0.5433,  0.    ,  1.2792],
       [ 0.    ,  0.5465,  0.0939,  0.    ],
       [ 0.    ,  0.    ,  0.7719,  0.3103],
       [ 2.1452,  0.8799,  0.    ,  0.0672],
       [ 0.    ,  0.    ,  1.1503,  1.7289],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174,  0.    ,  0.    ]])


Setting whole rows or columns using a 1D boolean array is also easy:

In [98]: data[names != 'Joe'] = 7

In [99]: data
Out[99]:
array([[ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 0.    ,  0.5465,  0.0939,  0.    ],
       [ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 7.    ,  7.    ,  7.    ,  7.    ],
       [ 0.1913,  0.4544,  0.4519,  0.5535],
       [ 0.5994,  0.8174,  0.    ,  0.    ]])

Fancy Indexing

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays.
Suppose we had an 8 × 4 array:

In [100]: arr = np.empty((8, 4))

In [101]: for i in range(8):
   .....:     arr[i] = i

In [102]: arr
Out[102]:
array([[ 0.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [ 4.,  4.,  4.,  4.],
       [ 5.,  5.,  5.,  5.],
       [ 6.,  6.,  6.,  6.],
       [ 7.,  7.,  7.,  7.]])

To select out a subset of the rows in a particular order, you can simply pass a list or
ndarray of integers specifying the desired order:

In [103]: arr[[4, 3, 0, 6]]
Out[103]:
array([[ 4.,  4.,  4.,  4.],
       [ 3.,  3.,  3.,  3.],
       [ 0.,  0.,  0.,  0.],
       [ 6.,  6.,  6.,  6.]])

Hopefully this code did what you expected! Using negative indices selects rows from
the end:

In [104]: arr[[-3, -5, -7]]
Out[104]:
array([[ 5.,  5.,  5.,  5.],
       [ 3.,  3.,  3.,  3.],
       [ 1.,  1.,  1.,  1.]])


Passing multiple index arrays does something slightly different; it selects a 1D array of
elements corresponding to each tuple of indices:

# more on reshape in Chapter 12
In [105]: arr = np.arange(32).reshape((8, 4))

In [106]: arr
Out[106]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [107]: arr[[1, 5, 7, 2], [0, 3, 1, 2]]
Out[107]: array([ 4, 23, 29, 10])

Take a moment to understand what just happened: the elements (1, 0), (5, 3), (7,
1), and (2, 2) were selected. The behavior of fancy indexing in this case is a bit different
from what some users might have expected (myself included), which is the rectangular
region formed by selecting a subset of the matrix’s rows and columns. Here is one way
to get that:

In [108]: arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
Out[108]:
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])

Another way is to use the np.ix_ function, which converts two 1D integer arrays to an
indexer that selects the square region:

In [109]: arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]
Out[109]:
array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])

Keep in mind that fancy indexing, unlike slicing, always copies the data into a new array.
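A quick demonstration of that copying behavior, next to the view behavior of slices (array made up):

```python
import numpy as np

arr = np.arange(8)

fancy = arr[[0, 2, 4]]   # fancy indexing: a new copy
fancy[0] = -1
print(arr[0])            # 0 -- original unchanged

sliced = arr[0:3]        # slicing: a view
sliced[0] = -1
print(arr[0])            # -1 -- original modified
```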

Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping which similarly returns a view on the un-
derlying data without copying anything. Arrays have the transpose method and also
the special T attribute:

In [110]: arr = np.arange(15).reshape((3, 5))

In [111]: arr
Out[111]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [112]: arr.T
Out[112]:
array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

When doing matrix computations, you will do this very often; for example, when
computing the inner matrix product XᵀX using np.dot:

In [113]: arr = np.random.randn(6, 3)

In [114]: np.dot(arr.T, arr)
Out[114]:
array([[ 2.584 ,  1.8753,  0.8888],
       [ 1.8753,  6.6636,  0.3884],
       [ 0.8888,  0.3884,  3.9781]])

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute
the axes (for extra mind bending):

In [115]: arr = np.arange(16).reshape((2, 2, 4))

In [116]: arr
Out[116]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [117]: arr.transpose((1, 0, 2))
Out[117]:
array([[[ 0,  1,  2,  3],
        [ 8,  9, 10, 11]],

       [[ 4,  5,  6,  7],
        [12, 13, 14, 15]]])

Simple transposing with .T is just a special case of swapping axes. ndarray has the
method swapaxes which takes a pair of axis numbers:

In [118]: arr
Out[118]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15]]])

In [119]: arr.swapaxes(1, 2)
Out[119]:
array([[[ 0,  4],
        [ 1,  5],
        [ 2,  6],
        [ 3,  7]],

       [[ 8, 12],
        [ 9, 13],
        [10, 14],
        [11, 15]]])

swapaxes similarly returns a view on the data without making a copy.
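Because the result is a view, writing through a swapped-axes array modifies the original; a small sketch (array made up):

```python
import numpy as np

arr = np.arange(6).reshape((2, 3))
swapped = arr.swapaxes(0, 1)   # shape (3, 2); no data copied

swapped[0, 1] = 99             # element (0, 1) of the view is arr[1, 0]
print(arr[1, 0])               # 99
```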


Universal Functions: Fast Element-wise Array Functions

A universal function, or ufunc, is a function that performs elementwise operations on
data in ndarrays. You can think of them as fast vectorized wrappers for simple functions
that take one or more scalar values and produce one or more scalar results.

Many ufuncs are simple elementwise transformations, like sqrt or exp:

In [120]: arr = np.arange(10)

In [121]: np.sqrt(arr)
Out[121]:
array([ 0.    ,  1.    ,  1.4142,  1.7321,  2.    ,  2.2361,  2.4495,
        2.6458,  2.8284,  3.    ])

In [122]: np.exp(arr)
Out[122]:
array([ 1. , 2.7183, 7.3891, 20.0855, 54.5982,
148.4132, 403.4288, 1096.6332, 2980.958 , 8103.0839])

These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays
(thus, binary ufuncs) and return a single array as the result:

In [123]: x = randn(8)

In [124]: y = randn(8)

In [125]: x
Out[125]:
array([ 0.0749,  0.0974,  0.2002, -0.2551,  0.4655,  0.9222,  0.446 ,
       -0.9337])

In [126]: y
Out[126]:
array([ 0.267 , -1.1131, -0.3361,  0.6117, -1.2323,  0.4788,  0.4315,
       -0.7147])

In [127]: np.maximum(x, y)  # element-wise maximum
Out[127]:
array([ 0.267 ,  0.0974,  0.2002,  0.6117,  0.4655,  0.9222,  0.446 ,
       -0.7147])

While not common, a ufunc can return multiple arrays. modf is one example, a
vectorized version of the built-in Python divmod: it returns the fractional and integral
parts of a floating point array:

In [128]: arr = randn(7) * 5

In [129]: np.modf(arr)
Out[129]:
(array([-0.6808,  0.0636, -0.386 ,  0.1393, -0.8806,  0.9363, -0.883 ]),
 array([-2.,  4., -3.,  5., -3.,  3., -6.]))


See Table 4-3 and Table 4-4 for a listing of available ufuncs.

Table 4-3. Unary ufuncs

Function                  Description
abs, fabs                 Compute the absolute value element-wise for integer, floating point, or
                          complex values. Use fabs as a faster alternative for non-complex-valued data
sqrt                      Compute the square root of each element. Equivalent to arr ** 0.5
square                    Compute the square of each element. Equivalent to arr ** 2
exp                       Compute the exponent e^x of each element
log, log10, log2, log1p   Natural logarithm (base e), log base 10, log base 2, and log(1 + x),
                          respectively
sign                      Compute the sign of each element: 1 (positive), 0 (zero), or -1 (negative)
ceil                      Compute the ceiling of each element, i.e. the smallest integer greater
                          than or equal to each element
floor                     Compute the floor of each element, i.e. the largest integer less than or
                          equal to each element
rint                      Round elements to the nearest integer, preserving the dtype
modf                      Return fractional and integral parts of array as separate arrays
isnan                     Return boolean array indicating whether each value is NaN (Not a Number)
isfinite, isinf           Return boolean array indicating whether each element is finite (non-inf,
                          non-NaN) or infinite, respectively
cos, cosh, sin, sinh, tan, tanh
                          Regular and hyperbolic trigonometric functions
arccos, arccosh, arcsin, arcsinh, arctan, arctanh
                          Inverse trigonometric functions
logical_not               Compute truth value of not x element-wise. Equivalent to ~arr

Table 4-4. Binary universal functions

Function Description
add Add corresponding elements in arrays
subtract Subtract elements in second array from first array
multiply Multiply array elements
divide, floor_divide Divide or floor divide (truncating the remainder)
power Raise elements in first array to powers indicated in second array
maximum, fmax Element-wise maximum. fmax ignores NaN
minimum, fmin Element-wise minimum. fmin ignores NaN
mod Element-wise modulus (remainder of division)
copysign Copy sign of values in second argument to values in first argument


greater, greater_equal, less, less_equal, equal, not_equal
                          Perform element-wise comparison, yielding boolean array. Equivalent to
                          infix operators >, >=, <, <=, ==, !=
logical_and, logical_or, logical_xor
                          Compute element-wise truth value of logical operation. Equivalent to
                          infix operators &, |, ^

Data Processing Using Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as
concise array expressions that might otherwise require writing loops. This practice of
replacing explicit loops with array expressions is commonly referred to as vectorization.
In general, vectorized array operations will often be one or two (or more) orders
of magnitude faster than their pure Python equivalents, with the biggest impact in
numerical computations of any kind. Later, in Chapter 12, I will explain broadcasting, a
powerful method for vectorizing computations.

As a simple example, suppose we wished to evaluate the function sqrt(x^2 + y^2)
across a regular grid of values. The np.meshgrid function takes two 1D arrays and
produces two 2D matrices corresponding to all pairs of (x, y) in the two arrays:

In [130]: points = np.arange(-5, 5, 0.01) # 1000 equally spaced points

In [131]: xs, ys = np.meshgrid(points, points)

In [132]: ys
Out[132]:
array([[-5. , -5. , -5. , ..., -5. , -5. , -5. ],

[-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
[-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
...,
[ 4.97, 4.97, 4.97, ..., 4.97, 4.97, 4.97],
[ 4.98, 4.98, 4.98, ..., 4.98, 4.98, 4.98],
[ 4.99, 4.99, 4.99, ..., 4.99, 4.99, 4.99]])

Now, evaluating the function is a simple matter of writing the same expression you
would write with two points:

In [134]: import matplotlib.pyplot as plt

In [135]: z = np.sqrt(xs ** 2 + ys ** 2)

In [136]: z
Out[136]:
array([[ 7.0711,  7.064 ,  7.0569, ...,  7.0499,  7.0569,  7.064 ],
       [ 7.064 ,  7.0569,  7.0499, ...,  7.0428,  7.0499,  7.0569],
       [ 7.0569,  7.0499,  7.0428, ...,  7.0357,  7.0428,  7.0499],
       ...,
       [ 7.0499,  7.0428,  7.0357, ...,  7.0286,  7.0357,  7.0428],
       [ 7.0569,  7.0499,  7.0428, ...,  7.0357,  7.0428,  7.0499],
       [ 7.064 ,  7.0569,  7.0499, ...,  7.0428,  7.0499,  7.0569]])


In [137]: plt.imshow(z, cmap=plt.cm.gray); plt.colorbar()
Out[137]: <matplotlib.colorbar.Colorbar instance at 0x4e46d40>

In [138]: plt.title("Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")
Out[138]: <matplotlib.text.Text at 0x4565790>

See Figure 4-3. Here I used the matplotlib function imshow to create an image plot from
a 2D array of function values.

Figure 4-3. Plot of function evaluated on grid

Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if
condition else y. Suppose we had a boolean array and two arrays of values:

In [140]: xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])

In [141]: yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])

In [142]: cond = np.array([True, False, True, True, False])

Suppose we wanted to take a value from xarr whenever the corresponding value in
cond is True otherwise take the value from yarr. A list comprehension doing this might
look like:

In [143]: result = [(x if c else y)
.....: for x, y, c in zip(xarr, yarr, cond)]

In [144]: result
Out[144]: [1.1000000000000001, 2.2000000000000002, 1.3, 1.3999999999999999, 2.5]


This has multiple problems. First, it will not be very fast for large arrays (because all
the work is being done in pure Python). Second, it will not work with multidimensional
arrays. With np.where you can write this very concisely:

In [145]: result = np.where(cond, xarr, yarr)

In [146]: result
Out[146]: array([ 1.1, 2.2, 1.3, 1.4, 2.5])

The second and third arguments to np.where don’t need to be arrays; one or both of
them can be scalars. A typical use of where in data analysis is to produce a new array of
values based on another array. Suppose you had a matrix of randomly generated data
and you wanted to replace all positive values with 2 and all negative values with -2.
This is very easy to do with np.where:

In [147]: arr = randn(4, 4)

In [148]: arr
Out[148]:
array([[ 0.6372,  2.2043,  1.7904,  0.0752],
       [-1.5926, -1.1536,  0.4413,  0.3483],
       [-0.1798,  0.3299,  0.7827, -0.7585],
       [ 0.5857,  0.1619,  1.3583, -1.3865]])

In [149]: np.where(arr > 0, 2, -2)
Out[149]:
array([[ 2,  2,  2,  2],
       [-2, -2,  2,  2],
       [-2,  2,  2, -2],
       [ 2,  2,  2, -2]])

In [150]: np.where(arr > 0, 2, arr) # set only positive values to 2
Out[150]:
array([[ 2.    ,  2.    ,  2.    ,  2.    ],
       [-1.5926, -1.1536,  2.    ,  2.    ],
       [-0.1798,  2.    ,  2.    , -0.7585],
       [ 2.    ,  2.    ,  2.    , -1.3865]])

The arrays passed to np.where can be more than just equal-size arrays or scalars.

With some cleverness you can use where to express more complicated logic; consider
this example where I have two boolean arrays, cond1 and cond2, and wish to assign a
different value for each of the 4 possible pairs of boolean values:

result = []
for i in range(n):
    if cond1[i] and cond2[i]:
        result.append(0)
    elif cond1[i]:
        result.append(1)
    elif cond2[i]:
        result.append(2)
    else:
        result.append(3)


While perhaps not immediately obvious, this for loop can be converted into a nested
where expression:

np.where(cond1 & cond2, 0,
np.where(cond1, 1,
np.where(cond2, 2, 3)))

In this particular example, we can also take advantage of the fact that boolean values
are treated as 0 or 1 in calculations, so this could alternatively be expressed (though a
bit more cryptically) as an arithmetic operation, as long as each term is masked so the
cases don't overlap:

result = 1 * (cond1 & ~cond2) + 2 * (cond2 & ~cond1) + 3 * ~(cond1 | cond2)
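A quick sanity check, with made-up cond1 and cond2 values covering all four cases, that the nested where matches the loop version:

```python
import numpy as np

cond1 = np.array([True, True, False, False])
cond2 = np.array([True, False, True, False])
n = 4

nested = np.where(cond1 & cond2, 0,
                  np.where(cond1, 1,
                           np.where(cond2, 2, 3)))

loop = []
for i in range(n):
    if cond1[i] and cond2[i]:
        loop.append(0)
    elif cond1[i]:
        loop.append(1)
    elif cond2[i]:
        loop.append(2)
    else:
        loop.append(3)

print(nested.tolist() == loop)   # True
```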

Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about
the data along an axis are accessible as array methods. Aggregations (often called
reductions) like sum, mean, and standard deviation std can be used either by calling the
array instance method or by using the top-level NumPy function:

In [151]: arr = np.random.randn(5, 4) # normally-distributed data

In [152]: arr.mean()
Out[152]: 0.062814911084854597

In [153]: np.mean(arr)
Out[153]: 0.062814911084854597

In [154]: arr.sum()
Out[154]: 1.2562982216970919

Functions like mean and sum take an optional axis argument which computes the statistic
over the given axis, resulting in an array with one fewer dimension:

In [155]: arr.mean(axis=1)
Out[155]: array([-1.2833, 0.2844, 0.6574, 0.6743, -0.0187])

In [156]: arr.sum(0)
Out[156]: array([-3.1003, -1.6189, 1.4044, 4.5712])
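To make the axis argument concrete, a tiny made-up example: axis=0 aggregates down the rows (one result per column), while axis=1 aggregates across the columns (one result per row):

```python
import numpy as np

arr = np.array([[1., 2.],
                [3., 4.]])

print(arr.mean(axis=0))   # column means: [2. 3.]
print(arr.mean(axis=1))   # row means:    [1.5 3.5]
```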

Other methods like cumsum and cumprod do not aggregate, instead producing an array
of the intermediate results:

In [157]: arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

In [158]: arr.cumsum(0)
Out[158]:
array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])

In [159]: arr.cumprod(1)
Out[159]:
array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]])

See Table 4-5 for a full listing. We’ll see many examples of these methods in action in
later chapters.


Table 4-5. Basic array statistical methods

Method Description
sum Sum of all the elements in the array or along an axis. Zero-length arrays have sum 0.
mean Arithmetic mean. Zero-length arrays have NaN mean.
std, var Standard deviation and variance, respectively, with optional degrees of freedom adjust-
ment (default denominator n).
min, max Minimum and maximum.
argmin, argmax Indices of minimum and maximum elements, respectively.
cumsum Cumulative sum of elements starting from 0
cumprod Cumulative product of elements starting from 1

Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the above methods. Thus, sum
is often used as a means of counting True values in a boolean array:

In [160]: arr = randn(100)

In [161]: (arr > 0).sum() # Number of positive values
Out[161]: 44

There are two additional methods, any and all, useful especially for boolean arrays.
any tests whether one or more values in an array is True, while all checks if every value
is True:

In [162]: bools = np.array([False, False, True, False])

In [163]: bools.any()
Out[163]: True

In [164]: bools.all()
Out[164]: False

These methods also work with non-boolean arrays, where non-zero elements evaluate
to True.

Sorting

Like Python’s built-in list type, NumPy arrays can be sorted in-place using the sort
method:

In [165]: arr = randn(8)

In [166]: arr
Out[166]:
array([ 0.6903,  0.4678,  0.0968, -0.1349,  0.9879,  0.0185, -1.3147,
       -0.5425])

In [167]: arr.sort()


In [168]: arr
Out[168]:
array([-1.3147, -0.5425, -0.1349,  0.0185,  0.0968,  0.4678,  0.6903,
        0.9879])

Multidimensional arrays can have each 1D section of values sorted in-place along an
axis by passing the axis number to sort:

In [169]: arr = randn(5, 3)

In [170]: arr
Out[170]:

array([[-0.7139, -1.6331, -0.4959],
[ 0.8236, -1.3132, -0.1935],
[-1.6748, 3.0336, -0.863 ],
[-0.3161, 0.5362, -2.468 ],
[ 0.9058, 1.1184, -1.0516]])

In [171]: arr.sort(1)

In [172]: arr
Out[172]:
array([[-1.6331, -0.7139, -0.4959],
       [-1.3132, -0.1935,  0.8236],
       [-1.6748, -0.863 ,  3.0336],
       [-2.468 , -0.3161,  0.5362],
       [-1.0516,  0.9058,  1.1184]])

The top-level function np.sort returns a sorted copy of an array instead of modifying
the array in place. A quick-and-dirty way to compute the quantiles of an array is to sort
it and select the value at a particular rank:

In [173]: large_arr = randn(1000)

In [174]: large_arr.sort()

In [175]: large_arr[int(0.05 * len(large_arr))] # 5% quantile
Out[175]: -1.5791023260896004
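On a small made-up array, the sort-and-index trick looks like this; np.percentile offers a more careful (interpolating) alternative:

```python
import numpy as np

data = np.array([5., 1., 4., 2., 3.])

srt = np.sort(data)                 # sorted copy; data itself unchanged
print(srt[int(0.2 * len(srt))])     # quick-and-dirty 20% quantile: 2.0

print(np.percentile(data, 20))      # interpolated 20th percentile
```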

For more details on using NumPy’s sorting methods, and more advanced techniques
like indirect sorts, see Chapter 12. Several other kinds of data manipulations related to
sorting (for example, sorting a table of data by one or more columns) are also to be
found in pandas.

Unique and Other Set Logic

NumPy has some basic set operations for one-dimensional ndarrays. Probably the most
commonly used one is np.unique, which returns the sorted unique values in an array:

In [176]: names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])

In [177]: np.unique(names)
Out[177]:
array(['Bob', 'Joe', 'Will'],
      dtype='|S4')

In [178]: ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])

In [179]: np.unique(ints)
Out[179]: array([1, 2, 3, 4])

Contrast np.unique with the pure Python alternative:

In [180]: sorted(set(names))
Out[180]: ['Bob', 'Joe', 'Will']

Another function, np.in1d, tests membership of the values in one array in another,
returning a boolean array:

In [181]: values = np.array([6, 0, 0, 3, 2, 5, 6])

In [182]: np.in1d(values, [2, 3, 6])
Out[182]: array([ True, False, False, True, True, False, True], dtype=bool)

See Table 4-6 for a listing of set functions in NumPy.

Table 4-6. Array set operations

Method Description
unique(x) Compute the sorted, unique elements in x
intersect1d(x, y) Compute the sorted, common elements in x and y
union1d(x, y) Compute the sorted union of elements
in1d(x, y) Compute a boolean array indicating whether each element of x is contained in y
setdiff1d(x, y) Set difference, elements in x that are not in y
setxor1d(x, y) Set symmetric differences; elements that are in either of the arrays, but not both
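A quick tour of these set functions on two small made-up arrays:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([3, 4, 5])

print(np.intersect1d(x, y))  # [3 4]
print(np.union1d(x, y))      # [1 2 3 4 5]
print(np.setdiff1d(x, y))    # [1 2]
print(np.setxor1d(x, y))     # [1 2 5]
```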

File Input and Output with Arrays

NumPy is able to save and load data to and from disk either in text or binary format.
In later chapters you will learn about tools in pandas for reading tabular data into
memory.

Storing Arrays on Disk in Binary Format

np.save and np.load are the two workhorse functions for efficiently saving and loading
array data on disk. Arrays are saved by default in an uncompressed raw binary format
with file extension .npy.

In [183]: arr = np.arange(10)

In [184]: np.save('some_array', arr)


If the file path does not already end in .npy, the extension will be appended. The array
on disk can then be loaded using np.load:

In [185]: np.load('some_array.npy')
Out[185]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can save multiple arrays in a zip archive by using np.savez and passing the arrays
as keyword arguments:

In [186]: np.savez('array_archive.npz', a=arr, b=arr)

When loading an .npz file, you get back a dict-like object which loads the individual
arrays lazily:

In [187]: arch = np.load('array_archive.npz')

In [188]: arch['b']
Out[188]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Saving and Loading Text Files

Loading text from files is a fairly standard task. The landscape of file reading and writing
functions in Python can be a bit confusing for a newcomer, so I will focus mainly on
the read_csv and read_table functions in pandas. It will at times be useful to load data
into vanilla NumPy arrays using np.loadtxt or the more specialized np.genfromtxt.
These functions have many options allowing you to specify different delimiters,
converter functions for certain columns, skipping of rows, and other things. Take a
simple case of a comma-separated file (CSV) like this:

In [191]: !cat array_ex.txt
0.580052,0.186730,1.040717,1.134411
0.194163,-0.636917,-0.938659,0.124094
-0.126410,0.268607,-0.695724,0.047428
-1.484413,0.004176,-0.744203,0.005487
2.302869,0.200131,1.670238,-1.881090
-0.193230,1.047233,0.482803,0.960334

This can be loaded into a 2D array like so:

In [192]: arr = np.loadtxt('array_ex.txt', delimiter=',')

In [193]: arr
Out[193]:
array([[ 0.5801, 0.1867, 1.0407, 1.1344],

[ 0.1942, -0.6369, -0.9387, 0.1241],
[-0.1264, 0.2686, -0.6957, 0.0474],
[-1.4844, 0.0042, -0.7442, 0.0055],
[ 2.3029, 0.2001, 1.6702, -1.8811],
[-0.1932, 1.0472, 0.4828, 0.9603]])

np.savetxt performs the inverse operation: writing an array to a delimited text file.
genfromtxt is similar to loadtxt but is geared for structured arrays and missing data
handling; see Chapter 12 for more on structured arrays.
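As a hedged sketch of the savetxt/loadtxt round trip described above (the data and file path are made up), including genfromtxt's ability to substitute a fill value where a field is empty:

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.5, 2.5], [3.5, 4.5]])
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

np.savetxt(path, arr, delimiter=',', fmt='%.2f')  # write as delimited text
loaded = np.loadtxt(path, delimiter=',')          # round trip back to an array

# genfromtxt treats an empty field as missing and can fill it:
with open(path, 'a') as f:
    f.write('5.5,\n')                             # a row with a missing field
patched = np.genfromtxt(path, delimiter=',', filling_values=-999.0)
```

Without filling_values, genfromtxt would place NaN in the missing slot, which is often exactly what you want for downstream numeric work.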


For more on file reading and writing, especially tabular or spreadsheet-
like data, see the later chapters involving pandas and DataFrame objects.

Linear Algebra

Linear algebra, like matrix multiplication, decompositions, determinants, and other
square matrix math, is an important part of any array library. Unlike some languages
like MATLAB, multiplying two two-dimensional arrays with * is an element-wise
product instead of a matrix dot product. As such, there is a function dot, both an array
method, and a function in the numpy namespace, for matrix multiplication:

In [194]: x = np.array([[1., 2., 3.], [4., 5., 6.]])

In [195]: y = np.array([[6., 23.], [-1, 7], [8, 9]])

In [196]: x
Out[196]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [197]: y
Out[197]:
array([[  6.,  23.],
       [ -1.,   7.],
       [  8.,   9.]])

In [198]: x.dot(y)  # equivalently np.dot(x, y)
Out[198]:
array([[  28.,   64.],
       [  67.,  181.]])

A matrix product between a 2D array and a suitably sized 1D array results in a 1D array:

In [199]: np.dot(x, np.ones(3))
Out[199]: array([ 6., 15.])

numpy.linalg has a standard set of matrix decompositions and things like inverse and
determinant. These are implemented under the hood using the same industry-standard
Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK,
or possibly (depending on your NumPy build) the Intel MKL:

In [201]: from numpy.linalg import inv, qr

In [202]: X = randn(5, 5)

In [203]: mat = X.T.dot(X)

In [204]: inv(mat)
Out[204]:
array([[ 3.0361, -0.1808, -0.6878, -2.8285, -1.1911],
       [-0.1808,  0.5035,  0.1215,  0.6702,  0.0956],
       [-0.6878,  0.1215,  0.2904,  0.8081,  0.3049],
       [-2.8285,  0.6702,  0.8081,  3.4152,  1.1557],
       [-1.1911,  0.0956,  0.3049,  1.1557,  0.6051]])

In [205]: mat.dot(inv(mat))


Out[205]:
array([[ 1.,  0.,  0.,  0., -0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0., -0.,  1., -0., -0.],
       [ 0.,  0.,  0.,  1., -0.],
       [ 0.,  0.,  0.,  0.,  1.]])

In [206]: q, r = qr(mat)

In [207]: r
Out[207]:
array([[ -6.9271,   7.389 ,   6.1227,  -7.1163,  -4.9215],
       [  0.    ,  -3.9735,  -0.8671,   2.9747,  -5.7402],
       [  0.    ,   0.    , -10.2681,   1.8909,   1.6079],
       [  0.    ,   0.    ,   0.    ,  -1.2996,   3.3577],
       [  0.    ,   0.    ,   0.    ,   0.    ,   0.5571]])

See Table 4-7 for a list of some of the most commonly-used linear algebra functions.

The scientific Python community is hopeful that there may be a matrix
multiplication infix operator implemented someday, providing a syn-
tactically nicer alternative to using np.dot. But for now this is the way.

Table 4-7. Commonly-used numpy.linalg functions

Function Description
diag Return the diagonal (or off-diagonal) elements of a square matrix as a 1D array, or convert a 1D array into a square
matrix with zeros on the off-diagonal
dot Matrix multiplication
trace Compute the sum of the diagonal elements
det Compute the matrix determinant
eig Compute the eigenvalues and eigenvectors of a square matrix
inv Compute the inverse of a square matrix
pinv Compute the Moore-Penrose pseudo-inverse of a square matrix
qr Compute the QR decomposition
svd Compute the singular value decomposition (SVD)
solve Solve the linear system Ax = b for x, where A is a square matrix
lstsq Compute the least-squares solution to y = Xb
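A brief sketch of two entries from Table 4-7, solve and lstsq; the matrices and right-hand sides here are arbitrary examples, not anything from the text:

```python
import numpy as np
from numpy.linalg import solve, lstsq

# solve handles a square system Ax = b directly, without forming inv(A)
A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])
x = solve(A, b)
assert np.allclose(A.dot(x), b)   # x solves the system

# lstsq handles overdetermined (non-square) systems in a least-squares sense
X = np.array([[1., 1.],
              [1., 2.],
              [1., 3.]])          # design matrix: intercept column + values
y = np.array([1., 2., 3.])
coef, resid, rank, sv = lstsq(X, y, rcond=None)
```

Preferring solve over inv(A).dot(b) is both faster and numerically safer for square systems.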

Random Number Generation

The numpy.random module supplements the built-in Python random with functions for
efficiently generating whole arrays of sample values from many kinds of probability


distributions. For example, you can get a 4 by 4 array of samples from the standard
normal distribution using normal:

In [208]: samples = np.random.normal(size=(4, 4))

In [209]: samples
Out[209]:
array([[ 0.1241, 0.3026, 0.5238, 0.0009],

[ 1.3438, -0.7135, -0.8312, -2.3702],
[-1.8608, -0.8608, 0.5601, -1.2659],
[ 0.1198, -1.0635, 0.3329, -2.3594]])

Python’s built-in random module, by contrast, only samples one value at a time. As you
can see from this benchmark, numpy.random is well over an order of magnitude faster
for generating very large samples:

In [210]: from random import normalvariate

In [211]: N = 1000000

In [212]: %timeit samples = [normalvariate(0, 1) for _ in xrange(N)]
1 loops, best of 3: 1.33 s per loop

In [213]: %timeit np.random.normal(size=N)
10 loops, best of 3: 57.7 ms per loop

See Table 4-8 for a partial list of functions available in numpy.random. I’ll give some
examples of leveraging these functions’ ability to generate large arrays of samples all at
once in the next section.

Table 4-8. Partial list of numpy.random functions

Function Description
seed Seed the random number generator
permutation Return a random permutation of a sequence, or return a permuted range
shuffle Randomly permute a sequence in place
rand Draw samples from a uniform distribution
randint Draw random integers from a given low-to-high range
randn Draw samples from a normal distribution with mean 0 and standard deviation 1 (MATLAB-like interface)
binomial Draw samples from a binomial distribution
normal Draw samples from a normal (Gaussian) distribution
beta Draw samples from a beta distribution
chisquare Draw samples from a chi-square distribution
gamma Draw samples from a gamma distribution
uniform Draw samples from a uniform [0, 1) distribution
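A quick sketch of the seed entry in Table 4-8: re-seeding the generator makes a sequence of draws exactly reproducible, which is handy for debugging simulations (the seed value here is arbitrary):

```python
import numpy as np

np.random.seed(12345)
first = np.random.normal(size=5)

np.random.seed(12345)            # same seed -> identical sequence of draws
second = np.random.normal(size=5)

# first and second are element-for-element identical
```

Note that the seed is global state for the numpy.random module, so any intervening draw shifts the sequence.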


Example: Random Walks

An illustrative application of array operations is the simulation of random walks.
Let’s first consider a simple random walk starting at 0 with steps of 1 and -1
occurring with equal probability. A pure Python way to implement a single random
walk with 1,000 steps using the built-in random module:

import random
position = 0
walk = [position]
steps = 1000
for i in xrange(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)

See Figure 4-4 for an example plot of the first 100 values on one of these random walks.

Figure 4-4. A simple random walk

You might make the observation that walk is simply the cumulative sum of the random
steps and could be evaluated as an array expression. Thus, I use the np.random module
to draw 1,000 coin flips at once, set these to 1 and -1, and compute the cumulative sum:

In [215]: nsteps = 1000
In [216]: draws = np.random.randint(0, 2, size=nsteps)
In [217]: steps = np.where(draws > 0, 1, -1)
In [218]: walk = steps.cumsum()


From this we can begin to extract statistics like the minimum and maximum value along
the walk’s trajectory:

In [219]: walk.min() In [220]: walk.max()
Out[219]: -3 Out[220]: 31

A more complicated statistic is the first crossing time, the step at which the random
walk reaches a particular value. Here we might want to know how long it took the

random walk to get at least 10 steps away from the origin 0 in either direction.

np.abs(walk) >= 10 gives us a boolean array indicating where the walk has reached or
exceeded 10, but we want the index of the first 10 or -10. Turns out this can be com-
puted using argmax, which returns the first index of the maximum value in the boolean
array (True is the maximum value):

In [221]: (np.abs(walk) >= 10).argmax()
Out[221]: 37

Note that using argmax here is not always efficient because it always makes a full scan
of the array. In this special case once a True is observed we know it to be the maximum
value.
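To illustrate the point on a small hand-built walk (the values below are made up for the example), argmax and np.nonzero give the same first-crossing index; neither short-circuits in NumPy, but nonzero can read as more explicit about intent:

```python
import numpy as np

walk = np.array([0, 3, -5, 12, 7, -11])
mask = np.abs(walk) >= 10          # boolean array: has the walk crossed 10?

first_via_argmax = mask.argmax()           # index of the first True
first_via_nonzero = np.nonzero(mask)[0][0] # same index, via the True positions
```

Both return 3 here, the position of the value 12, the first element whose absolute value reaches 10.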

Simulating Many Random Walks at Once

If your goal was to simulate many random walks, say 5,000 of them, you can generate
all of the random walks with minor modifications to the above code. The numpy.random
functions, if passed a 2-tuple, will generate a 2D array of draws, and we can compute
the cumulative sum across the rows to compute all 5,000 random walks in one shot:

In [222]: nwalks = 5000

In [223]: nsteps = 1000

In [224]: draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1

In [225]: steps = np.where(draws > 0, 1, -1)

In [226]: walks = steps.cumsum(1)

In [227]: walks
Out[227]:
array([[  1,   0,   1, ...,   8,   7,   8],
       [  1,   0,  -1, ...,  34,  33,  32],
       [  1,   0,  -1, ...,   4,   5,   4],
       ...,
       [  1,   2,   1, ...,  24,  25,  26],
       [  1,   2,   3, ...,  14,  13,  14],
       [ -1,  -2,  -3, ..., -24, -23, -22]])

Now, we can compute the maximum and minimum values obtained over all of the
walks:

In [228]: walks.max() In [229]: walks.min()
Out[228]: 138 Out[229]: -133


Out of these walks, let’s compute the minimum crossing time to 30 or -30. This is
slightly tricky because not all 5,000 of them reach 30. We can check this using the
any method:

In [230]: hits30 = (np.abs(walks) >= 30).any(1)

In [231]: hits30
Out[231]: array([False, True, False, ..., False, True, False], dtype=bool)

In [232]: hits30.sum() # Number that hit 30 or -30
Out[232]: 3410

We can use this boolean array to select out the rows of walks that actually cross the
absolute 30 level and call argmax across axis 1 to get the crossing times:

In [233]: crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)

In [234]: crossing_times.mean()
Out[234]: 498.88973607038122

Feel free to experiment with distributions for the steps other than equal-sized coin
flips. You need only use a different random number generation function, like
normal, to generate normally distributed steps with some mean and standard deviation:

In [235]: steps = np.random.normal(loc=0, scale=0.25,

.....: size=(nwalks, nsteps))
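The crossing-time computation shown earlier works unchanged with normally distributed steps. Here is a compact self-contained sketch of the whole pipeline; the walk count, seed, and threshold are arbitrary choices for the example, not the book's values:

```python
import numpy as np

np.random.seed(0)                        # arbitrary seed, for reproducibility
nwalks, nsteps = 100, 1000
steps = np.random.normal(loc=0, scale=0.25, size=(nwalks, nsteps))
walks = steps.cumsum(1)                  # one cumulative-sum walk per row

hits10 = (np.abs(walks) >= 10).any(1)    # which walks ever reach +/-10
crossing_times = (np.abs(walks[hits10]) >= 10).argmax(1)
```

Only the step-drawing line changed relative to the coin-flip version; the any/argmax logic is distribution-agnostic.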


CHAPTER 5

Getting Started with pandas

pandas will be the primary library of interest throughout much of the rest of the book.
It contains high-level data structures and manipulation tools designed to make data
analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to
use in NumPy-centric applications.
As a bit of background, I started building pandas in early 2008 during my tenure at
AQR, a quantitative investment management firm. At the time, I had a distinct set of
requirements that were not well-addressed by any single tool at my disposal:

• Data structures with labeled axes supporting automatic or explicit data alignment.
  This prevents common errors resulting from misaligned data and working with
  differently-indexed data coming from different sources.

• Integrated time series functionality.

• The same data structures handle both time series data and non-time series data.

• Arithmetic operations and reductions (like summing across an axis) would pass
  on the metadata (axis labels).

• Flexible handling of missing data.

• Merge and other relational operations found in popular databases (SQL-based,
  for example).
I wanted to be able to do all of these things in one place, preferably in a language well-
suited to general purpose software development. Python was a good candidate lan-
guage for this, but at that time there was not an integrated set of data structures and
tools providing this functionality.
Over the last four years, pandas has matured into a quite large library capable of solving
a much broader set of data handling problems than I ever anticipated, but it has ex-
panded in its scope without compromising the simplicity and ease-of-use that I desired
from the very beginning. I hope that after reading this book, you will find it to be just
as much of an indispensable tool as I do.
Throughout the rest of the book, I use the following import conventions for pandas:


In [1]: from pandas import Series, DataFrame

In [2]: import pandas as pd

Thus, whenever you see pd. in code, it’s referring to pandas. Series and DataFrame are
used so much that I find it easier to import them into the local namespace.

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame. While they are not a universal solution for every
problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index. The simplest
Series is formed from only an array of data:

In [4]: obj = Series([4, 7, -5, 3])

In [5]: obj
Out[5]:
0    4
1    7
2   -5
3    3

The string representation of a Series displayed interactively shows the index on the left
and the values on the right. Since we did not specify an index for the data, a default
one consisting of the integers 0 through N - 1 (where N is the length of the data) is
created. You can get the array representation and index object of the Series via its values
and index attributes, respectively:

In [6]: obj.values
Out[6]: array([ 4, 7, -5, 3])

In [7]: obj.index
Out[7]: Int64Index([0, 1, 2, 3])

Often it will be desirable to create a Series with an index identifying each data point:

In [8]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [9]: obj2
Out[9]:
d    4
b    7
a   -5
c    3


In [10]: obj2.index
Out[10]: Index([d, b, a, c], dtype=object)

Compared with a regular NumPy array, you can use values in the index when selecting
single values or a set of values:

In [11]: obj2['a']
Out[11]: -5

In [12]: obj2['d'] = 6

In [13]: obj2[['c', 'a', 'd']]
Out[13]:
c    3
a   -5
d    6

NumPy array operations, such as filtering with a boolean array, scalar multiplication,
or applying math functions, will preserve the index-value link:

In [14]: obj2
Out[14]:
d    6
b    7
a   -5
c    3

In [15]: obj2[obj2 > 0]
Out[15]:
d    6
b    7
c    3

In [16]: obj2 * 2
Out[16]:
d    12
b    14
a   -10
c     6

In [17]: np.exp(obj2)
Out[17]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
dict:

In [18]: 'b' in obj2
Out[18]: True

In [19]: 'e' in obj2
Out[19]: False

Should you have data contained in a Python dict, you can create a Series from it by
passing the dict:

In [20]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [21]: obj3 = Series(sdata)

In [22]: obj3
Out[22]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000

When only passing a dict, the index in the resulting Series will have the dict’s keys in
sorted order.

In [23]: states = ['California', 'Ohio', 'Oregon', 'Texas']

In [24]: obj4 = Series(sdata, index=states)

In [25]: obj4
Out[25]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000

In this case, 3 values found in sdata were placed in the appropriate locations, but since
no value for 'California' was found, it appears as NaN (not a number) which is con-
sidered in pandas to mark missing or NA values. I will use the terms “missing” or “NA”

to refer to missing data. The isnull and notnull functions in pandas should be used to
detect missing data:

In [26]: pd.isnull(obj4)
Out[26]:
California     True
Ohio          False
Oregon        False
Texas         False

In [27]: pd.notnull(obj4)
Out[27]:
California    False
Ohio           True
Oregon         True
Texas          True

Series also has these as instance methods:

In [28]: obj4.isnull()
Out[28]:
California True
Ohio False
Oregon False
Texas False

I discuss working with missing data in more detail later in this chapter.

A critical Series feature for many applications is that it automatically aligns differently-
indexed data in arithmetic operations:

In [29]: obj3
Out[29]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000

In [30]: obj4
Out[30]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000

In [31]: obj3 + obj4
Out[31]:
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN

Data alignment features are addressed as a separate topic.

Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:

In [32]: obj4.name = 'population'

In [33]: obj4.index.name = 'state'

In [34]: obj4
Out[34]:
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population

A Series’s index can be altered in place by assignment:

In [35]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [36]: obj
Out[36]:
Bob      4
Steve    7
Jeff    -5
Ryan     3

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an or-
dered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series (one for all sharing the same index). Compared with other
such DataFrame-like structures you may have used before (like R’s data.frame), row-
oriented and column-oriented operations in DataFrame are treated roughly symmet-
rically. Under the hood, the data is stored as one or more two-dimensional blocks rather
than a list, dict, or some other collection of one-dimensional arrays. The exact details
of DataFrame’s internals are far outside the scope of this book.

While DataFrame stores the data internally in a two-dimensional for-
mat, you can easily represent much higher-dimensional data in a tabular
format using hierarchical indexing, a subject of a later section and a key
ingredient in many of the more advanced data-handling features in pan-
das.


There are numerous ways to construct a DataFrame, though one of the most common
is from a dict of equal-length lists or NumPy arrays:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = DataFrame(data)

The resulting DataFrame will have its index assigned automatically as with Series, and
the columns are placed in sorted order:

In [38]: frame
Out[38]:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you specify a sequence of columns, the DataFrame’s columns will be exactly what
you pass:

In [39]: DataFrame(data, columns=['year', 'state', 'pop'])
Out[39]:

year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9

As with Series, if you pass a column that isn’t contained in data, it will appear with NA
values in the result:

In [40]: frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
....: index=['one', 'two', 'three', 'four', 'five'])

In [41]: frame2
Out[41]:

year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN

In [42]: frame2.columns
Out[42]: Index([year, state, pop, debt], dtype=object)

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [43]: frame2['state']
Out[43]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state

In [44]: frame2.year
Out[44]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year

Note that the returned Series have the same index as the DataFrame, and their name
attribute has been appropriately set.

Rows can also be retrieved by position or name by a couple of methods, such as the
ix indexing field (much more on this later):

In [45]: frame2.ix['three']
Out[45]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three

Columns can be modified by assignment. For example, the empty 'debt' column could
be assigned a scalar value or an array of values:

In [46]: frame2['debt'] = 16.5

In [47]: frame2
Out[47]:

year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5

In [48]: frame2['debt'] = np.arange(5.)

In [49]: frame2
Out[49]:
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4

When assigning lists or arrays to a column, the value’s length must match the length
of the DataFrame. If you assign a Series, it will instead be conformed exactly to the
DataFrame’s index, inserting missing values in any holes:

In [50]: val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [51]: frame2['debt'] = val

In [52]: frame2
Out[52]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict:

In [53]: frame2['eastern'] = frame2.state == 'Ohio'

In [54]: frame2
Out[54]:

year state pop debt eastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False

In [55]: del frame2['eastern']

In [56]: frame2.columns
Out[56]: Index([year, state, pop, debt], dtype=object)

The column returned when indexing a DataFrame is a view on the un-
derlying data, not a copy. Thus, any in-place modifications to the Series
will be reflected in the DataFrame. The column can be explicitly copied
using the Series’s copy method.
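A small sketch of the safe pattern from the caution above (column name and values are made up). Taking an explicit copy decouples the Series from the DataFrame, so mutating it cannot affect the original:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3]})

col = frame['x'].copy()   # an independent copy, safe to mutate
col[0] = 99               # frame['x'] is unaffected
```

This behavior is version-safe; whether mutating the *uncopied* column writes through to the DataFrame has changed across pandas releases, which is another reason to copy explicitly when you intend to modify.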

Another common form of data is a nested dict of dicts format:

In [57]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
keys as the row indices:

In [58]: frame3 = DataFrame(pop)

In [59]: frame3
Out[59]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

Of course you can always transpose the result:

In [60]: frame3.T
Out[60]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


The keys in the inner dicts are unioned and sorted to form the index in the result. This
isn’t true if an explicit index is specified:

In [61]: DataFrame(pop, index=[2001, 2002, 2003])
Out[61]:
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

Dicts of Series are treated much in the same way:

In [62]: pdata = {'Ohio': frame3['Ohio'][:-1],

....: 'Nevada': frame3['Nevada'][:2]}

In [63]: DataFrame(pdata)
Out[63]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7

For a complete list of things you can pass the DataFrame constructor, see Table 5-1.

If a DataFrame’s index and columns have their name attributes set, these will also be
displayed:

In [64]: frame3.index.name = 'year'; frame3.columns.name = 'state'

In [65]: frame3
Out[65]:
state Nevada Ohio
year
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6

Like Series, the values attribute returns the data contained in the DataFrame as a 2D
ndarray:

In [66]: frame3.values
Out[66]:
array([[ nan, 1.5],

[ 2.4, 1.7],
[ 2.9, 3.6]])

If the DataFrame’s columns are different dtypes, the dtype of the values array will be
chosen to accommodate all of the columns:

In [67]: frame2.values
Out[67]:
array([[2000, Ohio, 1.5, nan],

[2001, Ohio, 1.7, -1.2],
[2002, Ohio, 3.6, nan],
[2001, Nevada, 2.4, -1.5],
[2002, Nevada, 2.9, -1.7]], dtype=object)


Table 5-1. Possible data inputs to DataFrame constructor

Type Notes
2D ndarray A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples Each sequence becomes a column in the DataFrame. All sequences must be the same length.
NumPy structured/record array Treated as the “dict of arrays” case
dict of Series Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed.
dict of dicts Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case.
list of dicts or Series Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels
List of lists or tuples Treated as the “2D ndarray” case
Another DataFrame The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result

Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata
(like the axis name or names). Any array or other sequence of labels used when con-
structing a Series or DataFrame is internally converted to an Index:

In [68]: obj = Series(range(3), index=['a', 'b', 'c'])

In [69]: index = obj.index

In [70]: index
Out[70]: Index([a, b, c], dtype=object)

In [71]: index[1:]
Out[71]: Index([b, c], dtype=object)

Index objects are immutable and thus can’t be modified by the user:

In [72]: index[1] = 'd'
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-72-676fdeb26a68> in <module>()
----> 1 index[1] = 'd'
/Users/wesm/code/pandas/pandas/core/index.pyc in __setitem__(self, key, value)
302 def __setitem__(self, key, value):
303 """Disable the setting of values."""
--> 304 raise Exception(str(self.__class__) + ' object is immutable')
305
306 def __getitem__(self, key):
Exception: <class 'pandas.core.index.Index'> object is immutable


Immutability is important so that Index objects can be safely shared among data
structures:

In [73]: index = pd.Index(np.arange(3))

In [74]: obj2 = Series([1.5, -2.5, 0], index=index)

In [75]: obj2.index is index
Out[75]: True

Table 5-2 has a list of built-in Index classes in the library. With some development
effort, Index can even be subclassed to implement specialized axis indexing function-
ality.

Many users will not need to know much about Index objects, but they’re
nonetheless an important part of pandas’s data model.

Table 5-2. Main Index objects in pandas

Class Description
Index The most general Index object, representing axis labels in a NumPy array of Python objects.
Int64Index Specialized Index for integer values.
MultiIndex “Hierarchical” index object representing multiple levels of indexing on a single axis. Can be thought of as similar to an array of tuples.
DatetimeIndex Stores nanosecond timestamps (represented using NumPy’s datetime64 dtype).
PeriodIndex Specialized Index for Period data (timespans).

In addition to being array-like, an Index also functions as a fixed-size set:

In [76]: frame3
Out[76]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6

In [77]: 'Ohio' in frame3.columns
Out[77]: True

In [78]: 2003 in frame3.index
Out[78]: False

Each Index has a number of methods and properties for set logic and answering other
common questions about the data it contains. These are summarized in Table 5-3.


Table 5-3. Index methods and properties

Method Description
append Concatenate with additional Index objects, producing a new Index
diff Compute set difference as an Index
intersection Compute set intersection
union Compute set union
isin Compute boolean array indicating whether each value is contained in the passed collection
delete Compute new Index with element at index i deleted
drop Compute new index by deleting passed values
insert Compute new Index by inserting element at index i
is_monotonic Returns True if each element is greater than or equal to the previous element
is_unique Returns True if the Index has no duplicate values
unique Compute the array of unique values in the Index
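A short sketch exercising a few of the Table 5-3 operations on small hand-made Index objects (the labels are arbitrary):

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

both = left.intersection(right)   # labels present in both Indexes
either = left.union(right)        # all labels, combined
flags = left.isin(['a', 'd'])     # boolean array, one entry per label in left
```

Because these methods return new Index objects (or boolean arrays) rather than mutating, they compose naturally with reindexing and selection.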

Essential Functionality

In this section, I’ll walk you through the fundamental mechanics of interacting with
the data contained in a Series or DataFrame. Upcoming chapters will delve more deeply
into data analysis and manipulation topics using pandas. This book is not intended to
serve as exhaustive documentation for the pandas library; I instead focus on the most
important features, leaving the less common (that is, more esoteric) things for you to
explore on your own.

Reindexing

A critical method on pandas objects is reindex, which means to create a new object
with the data conformed to a new index. Consider a simple example from above:

In [79]: obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [80]: obj
Out[80]:
d 4.5
b 7.2
a -5.3
c 3.6

Calling reindex on this Series rearranges the data according to the new index, intro-
ducing missing values if any index values were not already present:

In [81]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [82]: obj2
Out[82]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN

In [83]: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[83]:
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a method such
as ffill which forward fills the values:

In [84]: obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [85]: obj3.reindex(range(6), method='ffill')
Out[85]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow

Table 5-4 lists available method options. At this time, interpolation more sophisticated
than forward- and backfilling would need to be applied after the fact.

Table 5-4. reindex method (interpolation) options

Argument Description
ffill or pad Fill (or carry) values forward
bfill or backfill Fill (or carry) values backward
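Forward filling can also be capped so that only gaps up to a given size are filled, via the limit argument (it appears in the reindex signature; the Series below is a made-up example):

```python
import pandas as pd

obj = pd.Series(['blue', 'yellow'], index=[0, 4])

# ffill, but carry a value forward across at most 2 consecutive missing slots
filled = obj.reindex(range(6), method='ffill', limit=2)
```

Here positions 1 and 2 receive 'blue', position 3 stays missing because it is three slots past the last observation, and position 5 receives 'yellow'.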

With DataFrame, reindex can alter either the (row) index, columns, or both. When
passed just a sequence, the rows are reindexed in the result:

In [86]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
....: columns=['Ohio', 'Texas', 'California'])

In [87]: frame
Out[87]:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [88]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [89]: frame2
Out[89]:
   Ohio  Texas  California
a     0      1           2
b   NaN    NaN         NaN
c     3      4           5
d     6      7           8

The columns can be reindexed using the columns keyword:

In [90]: states = ['Texas', 'Utah', 'California']

In [91]: frame.reindex(columns=states)
Out[91]:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

Both can be reindexed in one shot, though interpolation will only apply row-wise (axis
0):

In [92]: frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
....: columns=states)
Out[92]:
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8

As you’ll see soon, reindexing can be done more succinctly by label-indexing with ix:

In [93]: frame.ix[['a', 'b', 'c', 'd'], states]
Out[93]:
Texas Utah California
a 1 NaN 2
b NaN NaN NaN
c 4 NaN 5
d 7 NaN 8

Table 5-5. reindex function arguments

Argument Description
index New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An
Index will be used exactly as is without any copying
method Interpolation (fill) method, see Table 5-4 for options.
fill_value Substitute value to use when introducing missing data by reindexing
limit When forward- or backfilling, maximum size gap to fill
level Match simple Index on level of MultiIndex, otherwise select subset of
copy Do not copy underlying data if new index is equivalent to old index. True by default (i.e. always copy data).


Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list
without those entries. As that can require a bit of munging and set logic, the drop
method will return a new object with the indicated value or values deleted from an axis:

In [94]: obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [95]: new_obj = obj.drop('c')

In [96]: new_obj
Out[96]:
a    0
b    1
d    3
e    4

In [97]: obj.drop(['d', 'c'])
Out[97]:
a    0
b    1
e    4

With DataFrame, index values can be deleted from either axis:

In [98]: data = DataFrame(np.arange(16).reshape((4, 4)),
....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
....: columns=['one', 'two', 'three', 'four'])

In [99]: data.drop(['Colorado', 'Ohio'])
Out[99]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [100]: data.drop('two', axis=1)
Out[100]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [101]: data.drop(['two', 'four'], axis=1)
Out[101]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
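A compact sketch of the same idea; note that later pandas releases also accept the more explicit columns= keyword, which is an alternative spelling rather than something shown in this chapter:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# axis=1 drops column labels instead of row labels
no_two = data.drop('two', axis=1)

# equivalent keyword form available in later pandas versions
no_two_kw = data.drop(columns='two')

print(list(no_two.columns))  # ['one', 'three', 'four']
```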

Indexing, selection, and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can
use the Series's index values instead of only integers. Here are some examples of this:

In [102]: obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [103]: obj['b'] In [104]: obj[1]
Out[103]: 1.0 Out[104]: 1.0

In [105]: obj[2:4]
Out[105]:
c    2
d    3

In [106]: obj[['b', 'a', 'd']]
Out[106]:
b    1
a    0
d    3

In [107]: obj[[1, 3]]
Out[107]:
b    1
d    3

In [108]: obj[obj < 2]
Out[108]:
a    0
b    1

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [109]: obj['b':'c']
Out[109]:
b    1
c    2

Setting using these methods works just as you would expect:

In [110]: obj['b':'c'] = 5

In [111]: obj
Out[111]:
a    0
b    5
c    5
d    3
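The endpoint-inclusive behavior of label slicing is worth contrasting directly with positional slicing; this small sketch (using the pd.Series spelling) shows both on the same object:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Integer slicing follows the usual Python convention: endpoint excluded
positional = obj[1:3]        # 'b' and 'c' only

# Label slicing includes the endpoint
by_label = obj['b':'d']      # 'b', 'c', and 'd'

print(len(positional), len(by_label))  # 2 3
```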

As you’ve seen above, indexing into a DataFrame retrieves one or more columns,
either with a single value or a sequence:

In [112]: data = DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])

In [113]: data
Out[113]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [114]: data['two']
Out[114]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two

In [115]: data[['three', 'one']]
Out[115]:
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

Indexing like this has a few special cases. First, selecting rows by slicing or with a
boolean array:

In [116]: data[:2]
Out[116]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7

In [117]: data[data['three'] > 5]
Out[117]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

This might seem inconsistent to some readers, but this syntax arose out of practicality
and nothing more. Another use case is in indexing with a boolean DataFrame, such as
one produced by a scalar comparison:

In [118]: data < 5
Out[118]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [119]: data[data < 5] = 0

In [120]: data
Out[120]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

This is intended to make DataFrame syntactically more like an ndarray in this case.
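A quick sketch of the same masking operation, alongside .where, its non-mutating counterpart (an alternative spelling, not one used in this chapter):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Assigning through a boolean DataFrame zeroes every element below 5,
# just as it would on a plain ndarray
data[data < 5] = 0

# data.where keeps values where the condition is True and substitutes
# the other value elsewhere, returning a new object
capped = data.where(data >= 5, other=0)
```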

For DataFrame label-indexing on the rows, I introduce the special indexing field ix. It
enables you to select a subset of the rows and columns from a DataFrame with NumPy-like
notation plus axis labels. As I mentioned earlier, this is also a less verbose way to
do reindexing:

In [121]: data.ix['Colorado', ['two', 'three']]
Out[121]:
two 5
three 6
Name: Colorado

In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[122]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9

In [123]: data.ix[2]
Out[123]:
one       8
two       9
three    10
four     11
Name: Utah

In [124]: data.ix[:'Utah', 'two']
Out[124]:
Ohio        0
Colorado    5
Utah        9
Name: two

In [125]: data.ix[data.three > 5, :3]
Out[125]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14

So there are many ways to select and rearrange the data contained in a pandas object.
For DataFrame, Table 5-6 provides a short summary of many of them. You have a
number of additional options when working with hierarchical indexes, as you'll see
later.
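As an aside for readers on newer pandas: the ix field described here was later deprecated and removed, its label-based and integer-based halves split into .loc and .iloc. A rough translation of the examples above (the split into two accessors is the modern API, not something in the original text):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# label-based selection (data.ix['Colorado', ['two', 'three']] above)
sub = data.loc['Colorado', ['two', 'three']]

# integer-position selection (data.ix[2] above)
row = data.iloc[2]

# boolean row filter combined with a column subset
filtered = data.loc[data.three > 5, data.columns[:3]]

print(sub.tolist())  # [5, 6]
```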

When designing pandas, I felt that having to type frame[:, col] to select
a column was too verbose (and error-prone), since column selection is
one of the most common operations. Thus I made the design trade-off
to push all of the rich label-indexing into ix.

Table 5-6. Indexing options with DataFrame

Type                           Notes
obj[val]                       Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
obj.ix[val]                    Select single row or subset of rows from the DataFrame.
obj.ix[:, val]                 Select single column or subset of columns.
obj.ix[val1, val2]             Select both rows and columns.
reindex method                 Conform one or more axes to new indexes.
xs method                      Select single row or column as a Series by label.
icol, irow methods             Select single column or row, respectively, as a Series by integer location.
get_value, set_value methods   Select single value by row and column label.

Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between
objects with different indexes. When adding together objects, if any index pairs are not
the same, the respective index in the result will be the union of the index pairs. Let’s
look at a simple example:

In [126]: s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [127]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [128]: s1
Out[128]:
a    7.3
c   -2.5
d    3.4
e    1.5

In [129]: s2
Out[129]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1

Adding these together yields:

In [130]: s1 + s2
Out[130]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN

The internal data alignment introduces NA values in the indices that don’t overlap.
Missing values propagate in arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [131]: df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
.....: index=['Ohio', 'Texas', 'Colorado'])

In [132]: df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
.....: index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [133]: df1
Out[133]:
          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8

In [134]: df2
Out[134]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

Adding these together returns a DataFrame whose index and columns are the unions
of the ones in each DataFrame:

In [135]: df1 + df2
Out[135]:
            b   c    d   e
Colorado  NaN NaN  NaN NaN
Ohio        3 NaN    6 NaN
Oregon    NaN NaN  NaN NaN
Texas       9 NaN   12 NaN
Utah      NaN NaN  NaN NaN

Arithmetic methods with fill values

In arithmetic operations between differently-indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other:

In [136]: df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [137]: df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [138]: df1
Out[138]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

In [139]: df2
Out[139]:
    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

Adding these together results in NA values in the locations that don’t overlap:

In [140]: df1 + df2
Out[140]:
     a    b    c    d   e
0    0    2    4    6 NaN
1    9   11   13   15 NaN
2   18   20   22   24 NaN
3  NaN  NaN  NaN  NaN NaN

Using the add method on df1, I pass df2 and an argument to fill_value:

In [141]: df1.add(df2, fill_value=0)
Out[141]:
    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19

Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:

In [142]: df1.reindex(columns=df2.columns, fill_value=0)
Out[142]:
   a  b   c   d  e
0  0  1   2   3  0
1  4  5   6   7  0
2  8  9  10  11  0

Table 5-7. Flexible arithmetic methods

Method Description
add Method for addition (+)
sub Method for subtraction (-)
div Method for division (/)
mul Method for multiplication (*)
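Each method in Table 5-7 accepts fill_value just as add does above. A brief sketch (using the pd.DataFrame spelling):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# add with fill_value=0 treats missing positions as zero
summed = df1.add(df2, fill_value=0)

# sub behaves the same way: df1 has no row 3, so it contributes 0 there
diffed = df2.sub(df1, fill_value=0)

print(summed.loc[0, 'e'])  # 4.0 (comes solely from df2)
print(diffed.loc[3, 'a'])  # 15.0
```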

Operations between DataFrame and Series
As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. First,
as a motivating example, consider the difference between a 2D array and one of its rows:

In [143]: arr = np.arange(12.).reshape((3, 4))

In [144]: arr
Out[144]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [145]: arr[0]
Out[145]: array([ 0., 1., 2., 3.])

In [146]: arr - arr[0]
Out[146]:
array([[ 0., 0., 0., 0.],
       [ 4., 4., 4., 4.],
       [ 8., 8., 8., 8.]])

This is referred to as broadcasting and is explained in more detail in Chapter 12.
Operations between a DataFrame and a Series are similar:

In [147]: frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
.....: index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [148]: series = frame.ix[0]

In [149]: frame
Out[149]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [150]: series
Out[150]:
b    0
d    1
e    2
Name: Utah

By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame's columns, broadcasting down the rows:

In [151]: frame - series
Out[151]:
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9

If an index value is not found in either the DataFrame’s columns or the Series’s index,
the objects will be reindexed to form the union:

In [152]: series2 = Series(range(3), index=['b', 'e', 'f'])

In [153]: frame + series2
Out[153]:
        b   d   e   f
Utah    0 NaN   3 NaN
Ohio    3 NaN   6 NaN
Texas   6 NaN   9 NaN
Oregon  9 NaN  12 NaN

If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods. For example:

In [154]: series3 = frame['d']

In [155]: frame
Out[155]:
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

In [156]: series3
Out[156]:
Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d

In [157]: frame.sub(series3, axis=0)
Out[157]:
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1

The axis number that you pass is the axis to match on. In this case we mean to match
on the DataFrame’s row index and broadcast across.
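The row-matching behavior can be sketched as follows; note that later pandas versions also accept the clearer axis='index' spelling, which is an alternative rather than something used in this chapter:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series3 = frame['d']

# axis=0 matches the Series on the row index and broadcasts across columns
result = frame.sub(series3, axis=0)

# subtracting each row's own 'd' value leaves every row as (-1, 0, 1)
print(result.loc['Utah'].tolist())  # [-1.0, 0.0, 1.0]
```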

Function application and mapping

NumPy ufuncs (element-wise array methods) work fine with pandas objects:

In [158]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),

.....: index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [159]: frame
Out[159]:
               b         d         e
Utah   -0.204708  0.478943 -0.519439
Ohio   -0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221

In [160]: np.abs(frame)
Out[160]:
              b         d         e
Utah   0.204708  0.478943  0.519439
Ohio   0.555730  1.965781  1.393406
Texas  0.092908  0.281746  0.769023
Oregon 1.246435  1.007189  1.296221

Another frequent operation is applying a function on 1D arrays to each column or row.
DataFrame’s apply method does exactly this:

In [161]: f = lambda x: x.max() - x.min()

In [162]: frame.apply(f)
Out[162]:
b    1.802165
d    1.684034
e    2.689627

In [163]: frame.apply(f, axis=1)
Out[163]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656

Many of the most common array statistics (like sum and mean) are DataFrame methods,
so using apply is not necessary.
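For instance, a column-wise sum written with apply and the built-in method give identical results; a quick sketch:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'))

# These two are equivalent; the built-in method is both shorter and faster
via_apply = frame.apply(lambda x: x.sum())
via_method = frame.sum()

print(via_method.tolist())  # [18.0, 22.0, 26.0]
```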

The function passed to apply need not return a scalar value; it can also return a Series
with multiple values:

In [164]: def f(x):
.....: return Series([x.min(), x.max()], index=['min', 'max'])

In [165]: frame.apply(f)
Out[165]:
            b         d         e
min -0.555730  0.281746 -1.296221
max  1.246435  1.965781  1.393406

Element-wise Python functions can be used, too. Suppose you wanted to compute a
formatted string from each floating point value in frame. You can do this with applymap:

In [166]: format = lambda x: '%.2f' % x

In [167]: frame.applymap(format)
Out[167]:
            b     d      e
Utah    -0.20  0.48  -0.52
Ohio    -0.56  1.97   1.39
Texas    0.09  0.28   0.77
Oregon   1.25  1.01  -1.30

The reason for the name applymap is that Series has a map method for applying an
element-wise function:

In [168]: frame['e'].map(format)
Out[168]:
Utah -0.52
Ohio 1.39
Texas 0.77
Oregon -1.30
Name: e
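A self-contained sketch of Series.map with a formatting function (as an aside, recent pandas versions have renamed DataFrame.applymap to DataFrame.map, mirroring the Series method):

```python
import pandas as pd

s = pd.Series([-0.2047, 1.9658, 0.7690])

# map applies the function to each element, returning a new Series
formatted = s.map(lambda x: '%.2f' % x)

print(formatted.tolist())  # ['-0.20', '1.97', '0.77']
```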

Sorting and ranking

Sorting a data set by some criterion is another important built-in operation. To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object:

In [169]: obj = Series(range(4), index=['d', 'a', 'b', 'c'])

In [170]: obj.sort_index()
Out[170]:
a    1
b    2
c    3
d    0

With a DataFrame, you can sort by index on either axis:

In [171]: frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
.....: columns=['d', 'a', 'b', 'c'])

In [172]: frame.sort_index()
Out[172]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [173]: frame.sort_index(axis=1)
Out[173]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4


The data is sorted in ascending order by default, but can be sorted in descending order,
too:

In [174]: frame.sort_index(axis=1, ascending=False)
Out[174]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

To sort a Series by its values, use its order method:

In [175]: obj = Series([4, 7, -3, 2])

In [176]: obj.order()
Out[176]:
2   -3
3    2
0    4
1    7

Any missing values are sorted to the end of the Series by default:

In [177]: obj = Series([4, np.nan, 7, np.nan, -3, 2])

In [178]: obj.order()
Out[178]:
4   -3
5    2
0    4
2    7
1   NaN
3   NaN
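For readers on newer pandas: the order method was later removed, and sort_values is its replacement; it also lets you control where missing values land. A quick sketch:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# sort_values replaces the order method; NaNs sort to the end by default
tail_nans = obj.sort_values()

# na_position='first' moves missing values to the front instead
head_nans = obj.sort_values(na_position='first')
```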

On DataFrame, you may want to sort by the values in one or more columns. To do so,
pass one or more column names to the by option:

In [179]: frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [180]: frame
Out[180]:
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

In [181]: frame.sort_index(by='b')
Out[181]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

To sort by multiple columns, pass a list of names:

In [182]: frame.sort_index(by=['a', 'b'])
Out[182]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
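In later pandas releases, sort_index(by=...) was replaced by sort_values(by=...); the multi-column sort above translates directly:

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# sort_values(by=[...]) is the modern spelling of sort_index(by=[...]);
# ties on 'a' are broken by 'b'
result = frame.sort_values(by=['a', 'b'])

print(result.index.tolist())  # [2, 0, 3, 1]
```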
