
Working interactively with IPython

In this section, we introduce IPython, an interactive Python console: a command-line shell that lets us explore concepts and methods interactively.

To run IPython, call it from the command line with the ipython command.

IPython starts and prints a short quick-help banner. The most interesting part is the last line, the input prompt: there you can import libraries and execute commands, and IPython will display the resulting objects. Another convenient feature of IPython is that you can redefine variables on the fly to see how the results change with different inputs.

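The examples use the standard Python version of the most widely supported Linux distribution at the time of writing (Ubuntu 16.04), that is, Python 2.7, which is why strings appear with a u prefix. The examples should behave equivalently under Python 3, although the exact output formatting may differ.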

First, let's import pandas and load a sample .csv file (a very common plain-text format with one record per line and comma-separated fields). It contains a famous dataset for classification problems: the measurements of four attributes for 150 iris plants, plus a column indicating the class (the species) of each plant:

In [1]: import pandas as pd #Import the pandas library with pd alias

In this line, we import pandas in the usual way, making its objects and methods available for use. The as modifier gives the library a succinct alias (pd) that we use as a prefix for everything in it:

In [2]: df = pd.read_csv("data/iris.csv") #import iris data as dataframe

In this line, we use the read_csv function, which splits each line on commas by default, and store the result in a DataFrame object.
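As a minimal sketch of how the separator is handled, the following uses small in-memory strings as hypothetical stand-ins for data/iris.csv (written for Python 3). By default read_csv splits on commas, the sep parameter sets another delimiter, and sep=None with engine="python" asks pandas to detect the delimiter itself.

```python
import io

import pandas as pd

# Hypothetical two-row samples standing in for a real file on disk
comma_text = "Sepal.Length,Species\n5.1,setosa\n4.9,setosa\n"
semicolon_text = "Sepal.Length;Species\n5.1;setosa\n4.9;setosa\n"

# Default behaviour: fields are split on commas
df_comma = pd.read_csv(io.StringIO(comma_text))

# Explicit separator for a semicolon-delimited file
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# sep=None with the Python engine lets pandas sniff the delimiter
df_sniffed = pd.read_csv(io.StringIO(semicolon_text), sep=None, engine="python")

print(list(df_comma.columns))
print(list(df_sniffed.columns))
```

Passing sep explicitly is usually the safer choice when the file format is known, since detection relies on a heuristic.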

Let's perform some simple exploration of the dataset:

In [3]: df.columns
Out[3]:
Index([u'Sepal.Length', u'Sepal.Width', u'Petal.Length', u'Petal.Width',
u'Species'],
dtype='object')

In [4]: df.head(3)
Out[4]:
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa

We can now see the column names of the dataset and inspect its first n rows (here, n = 3). Looking at these first records, you can see the varying measurements for the setosa iris class.

Now, let's access a single column and display its first three elements:

In [19]: df[u'Sepal.Length'].head(3)
Out[19]:
0 5.1
1 4.9
2 4.7
Name: Sepal.Length, dtype: float64
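To illustrate the difference between selecting one column and selecting several, here is a small sketch on a hypothetical three-row DataFrame that mirrors the iris column names (it is not the real file). A single label returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the iris column names
df = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Species": ["setosa", "setosa", "setosa"],
})

one_column = df["Sepal.Length"]                    # single label -> Series
two_columns = df[["Sepal.Length", "Sepal.Width"]]  # list of labels -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(two_columns.shape)           # (3, 2)
```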

Pandas includes many related functions for importing other data formats, such as HDF5 (read_hdf), JSON (read_json), and Excel (read_excel). For a complete list of formats, visit http://pandas.pydata.org/pandas-docs/stable/io.html .
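As a brief sketch of one of these readers, read_json can load a list of JSON records into a DataFrame. The JSON string below is a hypothetical in-memory sample (Python 3), not part of the iris file.

```python
import io

import pandas as pd

# Hypothetical JSON records standing in for a real file
json_text = (
    '[{"Sepal.Length": 5.1, "Species": "setosa"},'
    ' {"Sepal.Length": 4.9, "Species": "setosa"}]'
)

# Each JSON object becomes one row; its keys become the columns
df_json = pd.read_json(io.StringIO(json_text))

print(df_json.shape)  # (2, 2)
```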

In addition to these simple exploration methods, we will now use pandas to compute the descriptive statistics we have seen so far, in order to characterize the distribution of the Sepal.Length column:

#Describe the sepal length column
#Parenthesized print calls run under both Python 2 and Python 3
print("Mean: " + str(df[u'Sepal.Length'].mean()))
print("Standard deviation: " + str(df[u'Sepal.Length'].std()))
print("Kurtosis: " + str(df[u'Sepal.Length'].kurtosis()))
print("Skewness: " + str(df[u'Sepal.Length'].skew()))

And here are the main metrics of this distribution:

Mean: 5.84333333333
Standard deviation: 0.828066127978
Kurtosis: -0.552064041316
Skewness: 0.314910956637
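To make these numbers less opaque, the following sketch recomputes them by hand and compares the results with pandas. It assumes Python 3 and uses a short hypothetical sample rather than the iris column. By default, pandas returns the sample standard deviation (dividing by n - 1) and bias-corrected estimates of skewness and excess kurtosis.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not the real iris column
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 6.3, 7.1])

n = float(len(s))
dev = s - s.mean()

# Central moments of the sample (dividing by n)
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()
m4 = (dev ** 4).mean()

# Sample standard deviation: divide the squared deviations by n - 1
std_manual = np.sqrt((dev ** 2).sum() / (n - 1))

# Bias-corrected skewness: g1 = m3 / m2^1.5, then a small-sample correction
skew_manual = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

# Bias-corrected excess kurtosis: g2 = m4 / m2^2 - 3, then a correction
g2 = m4 / m2 ** 2 - 3
kurt_manual = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)

print(np.isclose(std_manual, s.std()))
print(np.isclose(skew_manual, s.skew()))
print(np.isclose(kurt_manual, s.kurtosis()))
```

All three comparisons should print True, which confirms which estimators pandas applies.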

Now we will check these metrics graphically by looking at the histogram of the distribution, this time using the built-in plot.hist method:

#Plot the data histogram to illustrate the measures
import matplotlib.pyplot as plt
%matplotlib inline
df[u'Sepal.Length'].plot.hist()

Histogram of the Iris Sepal Length
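The %matplotlib inline magic only works inside IPython or a notebook. In a plain Python script, a similar histogram can be written to an image file instead. This is a minimal sketch under stated assumptions: the sample values and the output filename are hypothetical, and the Agg backend is chosen so that no display is required.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files only
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample values, not the real iris column
sepal_length = pd.Series(
    [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 6.3, 7.1], name="Sepal.Length"
)

# plot.hist returns the matplotlib Axes holding the histogram
ax = sepal_length.plot.hist(bins=5)
ax.set_xlabel("Sepal length")
ax.set_title("Histogram of the Iris Sepal Length")

# Hypothetical output path inside the system temporary directory
out_path = os.path.join(tempfile.gettempdir(), "sepal_length_hist.png")
fig = ax.get_figure()
fig.savefig(out_path)
plt.close(fig)  # release the figure once it has been saved
```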

As the metrics show, the distribution is right-skewed, because the skewness is positive. It is also platykurtic (flatter than a normal distribution, with lighter tails), as the negative kurtosis indicates; pandas reports excess kurtosis, for which a normal distribution scores 0.
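To build some intuition for the signs of these two metrics, the sketch below computes them on synthetic samples from three well-known distributions. The samples and the seed are assumptions made for illustration and have nothing to do with the iris data. In theory, a normal distribution has skewness 0 and excess kurtosis 0, a uniform one has excess kurtosis -1.2, and an exponential one has skewness 2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
size = 100000

samples = {
    "normal": pd.Series(rng.normal(size=size)),            # symmetric reference
    "uniform": pd.Series(rng.uniform(size=size)),          # flat: negative kurtosis
    "exponential": pd.Series(rng.exponential(size=size)),  # long right tail
}

for name, s in samples.items():
    # kurtosis() is excess kurtosis: 0 for a normal distribution
    print(name, round(s.skew(), 2), round(s.kurtosis(), 2))
```

The sample estimates will not match the theoretical values exactly, but with this many points the signs and rough magnitudes should agree.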