
The Pandas data structure
Let's first get acquainted with two of Pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistics, social science, and many areas of engineering.
Series
A Series is a one-dimensional object similar to an array, a list, or a column in a table. Each item in a Series is assigned to an entry in an index:
>>> import numpy as np
>>> import pandas as pd
>>> s1 = pd.Series(np.random.rand(4), index=['a', 'b', 'c', 'd'])
>>> s1
a    0.6122
b    0.98096
c    0.3350
d    0.7221
dtype: float64
By default, if no index is passed, one is created with values ranging from 0 to N-1, where N is the length of the Series:
>>> s2 = pd.Series(np.random.rand(4))
>>> s2
0    0.6913
1    0.8487
2    0.8627
3    0.7286
dtype: float64
We can access the value of a Series by using the index:
>>> s1['c']
0.3350
>>> s1['c'] = 3.14
>>> s1[['c', 'a', 'b']]
c    3.14
a    0.6122
b    0.98096
dtype: float64
This accessing method is similar to a Python dictionary. Therefore, Pandas also allows us to initialize a Series object directly from a Python dictionary:
>>> s3 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'})
>>> s3
001      Nam
002     Mary
003    Peter
dtype: object
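The dictionary analogy extends a little further than bracket access. The following is a minimal sketch, reusing the s3 data from the example above; the label '999' and the fallback string 'unknown' are made up for illustration.

```python
import pandas as pd

s3 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'})

# Membership tests look at the index labels, like dictionary keys.
has_002 = '002' in s3

# get() returns a default instead of raising KeyError for a missing label.
# '999' and 'unknown' are illustrative values, not part of the original data.
fallback = s3.get('999', 'unknown')

# keys() returns the index of the Series.
labels = list(s3.keys())

print(has_002, fallback, labels)
```

Note that membership is checked against the labels, not the stored values, so 'Nam' in s3 would be False here.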
Sometimes, we want to filter or reorder the index of a Series created from a Python dictionary. In that case, we can pass the selected index list directly to the constructor, as in the first example. Only elements whose keys exist in the index list will be in the Series object. Conversely, index labels that are missing from the dictionary are initialized to NaN by default:
>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary', '003': 'Peter'}, index=['002', '001', '024', '065'])
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object
The library also supports functions that detect missing data:
>>> pd.isnull(s4)
002    False
001    False
024     True
065     True
dtype: bool
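Once missing entries are detected, they usually need to be dropped or replaced. The sketch below rebuilds the same Series and shows the notnull, dropna, and fillna methods; the replacement string 'unknown' is an arbitrary choice for illustration.

```python
import pandas as pd

names = {'001': 'Nam', '002': 'Mary', '003': 'Peter'}
s4 = pd.Series(names, index=['002', '001', '024', '065'])

missing = s4.isnull()          # method form of pd.isnull(s4)
present = s4.notnull()         # the logical opposite
cleaned = s4.dropna()          # keep only the labels that have a value
filled = s4.fillna('unknown')  # 'unknown' is an illustrative placeholder

print(cleaned)
print(filled)
```

Both dropna and fillna return new objects, so s4 itself is left unchanged.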
Similarly, we can also initialize a Series from a scalar value:
>>> s5 = pd.Series(2.71, index=['x', 'y'])
>>> s5
x    2.71
y    2.71
dtype: float64
A Series object can be initialized with NumPy objects as well, such as ndarray. Moreover, Pandas can automatically align data indexed in different ways in arithmetic operations:
>>> s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])
>>> s6
z    2.71
y    3.14
dtype: float64
>>> s5 + s6
x     NaN
y    5.85
z     NaN
dtype: float64
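Alignment produces NaN wherever a label exists in only one operand. If that is not the desired behavior, one option is the add method with its fill_value parameter, which substitutes a value for the missing side before adding. This is a sketch using the same s5 and s6 as above; choosing 0 as the fill value is an assumption about what "missing" should mean.

```python
import numpy as np
import pandas as pd

s5 = pd.Series(2.71, index=['x', 'y'])
s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])

# Plain addition: labels present in only one Series yield NaN.
total = s5 + s6

# add() with fill_value treats a label missing from one side as 0.
total_filled = s5.add(s6, fill_value=0)

print(total)
print(total_filled)
```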
The DataFrame
The DataFrame is a tabular data structure comprising a set of ordered columns and rows. It can be thought of as a group of Series objects, one per column, that share a common index. There are a number of ways to initialize a DataFrame object. Firstly, let's take a look at the common example of creating a DataFrame from a dictionary of lists:
>>> data = {'Year': [2000, 2005, 2010, 2014],
...         'Median_Age': [24.2, 26.4, 28.5, 30.3],
...         'Density': [244, 256, 268, 279]}
>>> df1 = pd.DataFrame(data)
>>> df1
   Density  Median_Age  Year
0      244        24.2  2000
1      256        26.4  2005
2      268        28.5  2010
3      279        30.3  2014
By default, the DataFrame constructor orders the columns alphabetically in the Pandas version used here (newer releases keep the dictionary's insertion order instead). We can set the order explicitly by passing the columns argument to the constructor:
>>> df2 = pd.DataFrame(data, columns=['Year', 'Density', 'Median_Age'])
>>> df2
   Year  Density  Median_Age
0  2000      244        24.2
1  2005      256        26.4
2  2010      268        28.5
3  2014      279        30.3
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
We can provide the index labels of a DataFrame in the same way as for a Series:
>>> df3 = pd.DataFrame(data, columns=['Year', 'Density', 'Median_Age'], index=['a', 'b', 'c', 'd'])
>>> df3.index
Index([u'a', u'b', u'c', u'd'], dtype='object')
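With both column names and row labels in place, a single cell can be addressed by a (row label, column name) pair. The sketch below rebuilds df3 from the same data and inspects its shape and labels; the variable names cols, shape, and density_2010 are introduced only for this illustration.

```python
import pandas as pd

data = {'Year': [2000, 2005, 2010, 2014],
        'Median_Age': [24.2, 26.4, 28.5, 30.3],
        'Density': [244, 256, 268, 279]}

df3 = pd.DataFrame(data,
                   columns=['Year', 'Density', 'Median_Age'],
                   index=['a', 'b', 'c', 'd'])

cols = list(df3.columns)               # column order follows the columns argument
shape = df3.shape                      # (number of rows, number of columns)
density_2010 = df3.loc['c', 'Density'] # row label 'c', column 'Density'

print(cols, shape, density_2010)
```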
We can construct a DataFrame out of nested lists as well:
>>> df4 = pd.DataFrame([
...     ['Peter', 16, 'pupil', 'TN', 'M', None],
...     ['Mary', 21, 'student', 'SG', 'F', None],
...     ['Nam', 22, 'student', 'HN', 'M', None],
...     ['Mai', 31, 'nurse', 'SG', 'F', None],
...     ['John', 28, 'lawyer', 'SG', 'M', None]],
...     columns=['name', 'age', 'career', 'province', 'sex', 'award'])
A column can be retrieved as a Series by its name, either with dictionary-like notation or as an attribute, provided that the column name is a syntactically valid attribute name:
>>> df4.name    # or df4['name']
0    Peter
1     Mary
2      Nam
3      Mai
4     John
Name: name, dtype: object
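Attribute access is convenient, but it has limits that bracket notation does not. The sketch below uses a small hypothetical DataFrame with one column name containing a space and another ('count') that collides with an existing DataFrame method, to show where only brackets work.

```python
import pandas as pd

# Hypothetical columns chosen to show the limits of attribute access.
df = pd.DataFrame({'name': ['Peter', 'Mary'],
                   'home town': ['TN', 'SG'],
                   'count': [1, 2]})

# For a simple name, both notations return the same Series.
same = df['name'].equals(df.name)

# A name with a space is not a valid attribute, so brackets are required.
town = df['home town']

# 'count' is also a DataFrame method, so df.count is the method,
# while df['count'] is the column.
count_col = df['count']
count_is_method = callable(df.count)

print(same, list(town), list(count_col), count_is_method)
```

Bracket notation is therefore the safer default in code that handles arbitrary column names.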
To modify or append a new column to the created DataFrame, we specify the column name and the value we want to assign:
>>> df4['award'] = None
>>> df4
    name  age   career province sex award
0  Peter   16    pupil       TN   M  None
1   Mary   21  student       SG   F  None
2    Nam   22  student       HN   M  None
3    Mai   31    nurse       SG   F  None
4   John   28   lawyer       SG   M  None
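The assigned value does not have to be a scalar. As a rough sketch on a reduced, hypothetical version of the table, a column can also be set from a list of matching length or computed from an existing column; the is_adult column and the threshold of 18 are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Peter', 'Mary', 'Nam'],
                   'age': [16, 21, 22]})

df['award'] = None                   # a scalar is repeated for every row
df['province'] = ['TN', 'SG', 'HN']  # a list must match the number of rows
df['is_adult'] = df['age'] >= 18     # derived from an existing column (illustrative)
df['age'] = df['age'] + 1            # an existing name is overwritten in place

print(df)
```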
Rows can be retrieved by label with loc or by position with iloc:
>>> df4.loc[1]    # df4.iloc[1] selects the same row by position
name           Mary
age              21
career      student
province         SG
sex               F
award          None
Name: 1, dtype: object
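The difference between label-based and position-based row access is easier to see when the index is not the default 0 to N-1 range. The sketch below uses a small hypothetical DataFrame with string labels and the loc and iloc indexers.

```python
import pandas as pd

# Hypothetical data with string row labels instead of the default integers.
df = pd.DataFrame({'name': ['Peter', 'Mary', 'Nam'],
                   'age': [16, 21, 22]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']      # row whose index label is 'b'
by_position = df.iloc[1]    # second row, regardless of its label
first_two = df.iloc[:2]     # positional slice, end excluded

print(by_label)
print(first_two)
```

Here df.loc['b'] and df.iloc[1] refer to the same row, and each returns it as a Series indexed by the column names.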
A DataFrame object can also be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array. The method to initialize a DataFrame object is similar to the examples above.
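As a brief sketch of two of these alternatives, the following builds one DataFrame from a list of dictionaries and another from a dictionary of Series. The records and labels are hypothetical; the point is that keys or labels missing from one input are filled with NaN.

```python
import pandas as pd

# List of dictionaries: each dictionary becomes one row.
records = [{'name': 'Peter', 'age': 16},
           {'name': 'Mary', 'age': 21, 'province': 'SG'}]
df_records = pd.DataFrame(records)

# Dictionary of Series: each Series becomes one column, and the
# row labels are the union of the individual indexes.
series_dict = {'age': pd.Series([16, 21], index=['peter', 'mary']),
               'province': pd.Series(['SG'], index=['mary'])}
df_series = pd.DataFrame(series_dict)

print(df_records)
print(df_series)
```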
Another common case is to provide a DataFrame with data from a location such as a text file. In this situation, we use the read_csv function, which expects the column separator to be a comma by default. However, we can change that by using the sep parameter:
# person.csv file
name,age,career,province,sex
Peter,16,pupil,TN,M
Mary,21,student,SG,F
Nam,22,student,HN,M
Mai,31,nurse,SG,F
John,28,lawyer,SG,M
# loading person.csv into a DataFrame
>>> df4 = pd.read_csv('person.csv')
>>> df4
    name  age   career province sex
0  Peter   16    pupil       TN   M
1   Mary   21  student       SG   F
2    Nam   22  student       HN   M
3    Mai   31    nurse       SG   F
4   John   28   lawyer       SG   M
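The example above relies on the default comma separator. To illustrate the sep parameter, here is a minimal sketch that reads semicolon-separated text; an in-memory io.StringIO buffer stands in for a file on disk, and the data is a shortened, hypothetical version of person.csv.

```python
import io

import pandas as pd

# In-memory stand-in for a semicolon-separated file (hypothetical data).
text = "name;age;career\nPeter;16;pupil\nMary;21;student\n"

df = pd.read_csv(io.StringIO(text), sep=';')

print(df)
```

Without sep=';', each line would be read as a single column, because no commas are present.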
While reading a data file, we sometimes want to skip a line or an invalid value. As of Pandas 0.16.2, read_csv supports over 50 parameters for controlling the loading process. Some commonly used parameters are as follows:
sep: This is the delimiter between columns. The default is a comma.
dtype: This is the data type for the data or for individual columns.
header: This sets the row number(s) to use as the column names.
skiprows: This sets the number of lines to skip at the start of the file, or the line numbers to skip.
error_bad_lines: This controls how invalid lines (with too many fields) are handled. By default, they cause an exception, and no DataFrame is returned. If we set this parameter to False, the bad lines are skipped.
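Several of these parameters can be combined in one call. The sketch below uses skiprows, header, and dtype on an in-memory buffer; the leading comment line in the data is invented to give skiprows something to skip. The error_bad_lines parameter is left out here because its availability depends on the Pandas version installed.

```python
import io

import pandas as pd

# Hypothetical file content: one junk line, then a header row, then data.
text = (
    "# exported from an unknown system\n"
    "name,age,career\n"
    "Peter,16,pupil\n"
    "Mary,21,student\n"
)

df = pd.read_csv(
    io.StringIO(text),
    skiprows=1,            # skip the first line of the file
    header=0,              # first remaining line holds the column names
    dtype={'age': float},  # read the age column as floating point
)

print(df.dtypes)
```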
Moreover, Pandas also supports reading a DataFrame directly from a database and writing it back, using functions such as read_frame or write_frame within the Pandas module. We will come back to these methods later in this chapter.
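As a rough preview of the database round trip, the sketch below writes a DataFrame to an in-memory SQLite database and reads it back. Note the assumption: it uses the to_sql method and the read_sql function, which are the interfaces found in more recent Pandas releases, rather than the read_frame and write_frame functions named above. The table name person and the data are hypothetical.

```python
import sqlite3

import pandas as pd

people = pd.DataFrame({'name': ['Peter', 'Mary'],
                       'age': [16, 21]})

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(':memory:')

# Write the DataFrame to a table called 'person' (hypothetical name),
# without storing the row index as an extra column.
people.to_sql('person', conn, index=False)

# Read the rows back with an ordinary SQL query.
loaded = pd.read_sql('SELECT name, age FROM person ORDER BY age', conn)

conn.close()
print(loaded)
```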