Python绝技:运用Python成为顶级数据工程师
上QQ阅读APP看本书,新人免费读10天
设备和账号都新为新人

2.2.2 Pandas

Pandas是基于Numpy构建的高性能数据统计库,其内容如表2-2所示。它提供了Series、DataFrame、Panel三种数据结构,并在此数据结构的基础上提供创建、存取、统计、缺失值处理以及I/O输入输出等函数操作。Panel衍生于计量经济学的分析中,使用的频率并不是很高。因此本节着重围绕Series和DataFrame来介绍Pandas库。

表2-2 Pandas内容概览

官网手册

http://pandas.pydata.org/pandas-docs/stable/

内容概览

示例版本

版本号:0.19.1

日期:2015.11.03

示例讲解

(1)引入包

      >>>import numpy as np
      >>>import pandas as pd

(2)创建对象

      >>>#================创建Series对象================
      >>> #传入list创建Series,并生成默认的整形索引
      >>>s = pd.Series([1, 2, 3, 4, 5]) >>>prints
      0   1
      1   2
      2   3
      3   4
      4   5
      dtype: int64
      >>>s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])
  #指定索引创建Series
      >>>prints
      a   1
      b   2
      c   3
      d   4
      e   5
      dtype: int64
      >>>#传入Numpy的ndarray数组创建
      >>>pd.Series(np.arange(1, 5, 1))0   1
      1   2
      2   3
      3   4
      dtype: int64
      >>>#================创建DataFrame对象================
      ...
      >>>dates = pd.date_range('20170101', periods=5)
      >>>df = pd.DataFrame(np.random.randn(5,4), index=dates, columns=
    list('ABCD'))
        >>>print df
        ABCD
        2017-01-01-1.008850  0.389870  0.422930-0.239423
        2017-01-02-1.804773-0.122929-0.570285  1.609145
        2017-01-03-0.214131-1.887011  0.546617-1.250311
        2017-01-04  0.862515  0.417958-0.506498-0.330921
        2017-01-05  1.225338-0.182110  1.117512-1.338410
        df = pd.DataFrame({"a":1, "b":2, "c":3, "d":4, "e":("e1", "e2",
    "e3")}) #传入Python的dict创建DataFrame数组
        >>>print df
        abcde
        0  1  2  3  4  e1
        1  1  2  3  4  e2
        2  1  2  3  4  e3
        >>>#================创建Panel对象================
        >>>p = pd.Panel(np.random.randn(2, 5, 4), items=['a', 'b'],
        ...           major_axis=pd.date_range('1/1/2016', periods=5),
        ...           minor_axis=['c', 'd', 'e', 'f'])
        >>>print p
        <class 'pandas.core.panel.Panel'>
        Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
        Itemsaxis: atob
        Major_axis axis: 2016-01-01 00:00:00 to 2016-01-05 00:00:00
        Minor_axis axis: ctof

(3)数据查看

      >>>s = pd.Series([1, 2, 3, 4, 5])
      >>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c",
  "d", "e"], columns=["item1", "item2", "item3"])
      >>>#================head函数================
      ... s.head()
      0   1
      1   2
      2   3
      3   4
      4   5
      dtype: int64
      >>>df.head(3)
      item1    item2    item3
      a -0.954088  0.244918  1.145347
      b -0.807056  1.180588-0.153845
      c  0.100668  0.162599-0.320771
        >>>#================tail函数================
        ... s.tail(3)
        2   3
        3   4
        4   5
        dtype: int64
        >>>df.tail(2)
        item1    item2    item3
        d  0.741110-0.020918-0.482747
        e  0.756976-0.112055  0.335975
        #========通过index、columns、values查看df数据================
        >>>df.index
        Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
        >>>df.columns
        Index([u'item1', u'item2', u'item3'], dtype='object')
        >>>df.values
        array([[-0.95408841,  0.24491757,  1.14534653],
              [-0.80705606,  1.18058753, -0.15384471],
              [ 0.10066784,  0.16259905, -0.32077148],
              [ 0.74111005, -0.02091765, -0.48274747],
              [ 0.75697607, -0.11205458,  0.3359747 ]])
        >>>#================df行、列互换================
        ... df.T
        abcde
        item1-0.954088-0.807056  0.100668  0.741110  0.756976
        item2  0.244918  1.180588  0.162599-0.020918-0.112055
        item3  1.145347-0.153845-0.320771-0.482747  0.335975
        >>>#================df排序查看================
        ... df.sort_index(axis=1, ascending=False)
        item3    item2    item1
        a  1.145347  0.244918-0.954088
        b -0.153845  1.180588-0.807056
        c -0.320771  0.162599  0.100668
        d -0.482747-0.020918  0.741110
        e  0.335975-0.112055  0.756976

(4)数据选择

        >>> #内容:数据选择
        >>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c",
    "d", "e"], columns=["item1", "item2", "item3"])
        >>>#获取列
        ... df["item1"]
        a   -0.799240
        b   1.354446
        c   0.250335
        d   -0.203544
        e   0.703325
        Name: item1, dtype: float64
        >>>#获取行
        ... df[0:1]
        item1    item2    item3
        a -0.79924-0.612019-1.716188
        >>>#通过索引获取数据
        ... df.loc["a"]
        item1   -0.799240
        item2   -0.612019
        item3   -1.716188
        Name: a, dtype: float64
        >>>df.loc["a", ["item1", "item3"]]
        item1   -0.799240
        item3   -1.716188
        Name: a, dtype: float64
        >>>#通过位置获取数据
        >>>df.iloc[2]
        item1   0.250335
        item2   0.067347
        item3   0.058785
        Name: c, dtype: float64
        >>>df.iloc[2:3, [0, 1]]
        item1    item2
        c  0.250335  0.067347
        >>>#获取指定的单个数据
        ... df.iat[1, 1]
        0.08695444047161624

(5)数据设置

        >>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c",
    "d", "e"], columns=["item1", "item2", "item3"])
        >>>print df
        item1    item2    item3
        a  0.155408  0.128694  1.385727
        b -0.937322  0.789462  0.636947
        c  0.203996-0.573447-0.136809
        d  0.155883-0.537128-0.480155
        e -0.626413  0.854594  0.448322
        >>>#设置单个值
        ... df.iat[1,1] = 0.99999
        >>>print df
        item1    item2    item3
        a  0.155408  0.128694  1.385727
        b -0.937322  0.999990  0.636947
        c  0.203996-0.573447-0.136809
        d  0.155883-0.537128-0.480155
        e -0.626413  0.854594  0.448322
        >>>#设置一个序列
        ... df["item1"] = [0.1, 0.2, 0.3, 0.4, 0.5]
        >>>print df
        item1    item2    item3
        a   0.1  0.128694  1.385727
        b   0.2  0.999990  0.636947
        c   0.3-0.573447-0.136809
        d   0.4-0.537128-0.480155
        e   0.5  0.854594  0.448322

(6)缺失值处理

      >>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c",
  "d", "e"], columns=["item1", "item2", "item3"])
      >>>df1  =  df.reindex(index=["a",  "b",  "c",  "d"],  columns=list
  (df.columns) + ['item4'])
      >>>print df1
      item1    item2    item3  item4
      a -1.366190  0.384771-0.242857   NaN
      b  0.498889-0.415037  1.490942   NaN
      c -1.154120-0.323287  0.120690   NaN
      d -0.174297-1.495528-1.517813   NaN
      >>>df1.iat[0, 3] = 0.9999
      >>>print df1
      item1    item2    item3   item4
      a -1.366190  0.384771-0.242857  0.9999
      b  0.498889-0.415037  1.490942    NaN
      c -1.154120-0.323287  0.120690    NaN
      d -0.174297-1.495528-1.517813    NaN
      >>>#去除带有Nan空值的行
      ... df1.dropna(how='any')
      item1    item2    item3   item4
      a -1.36619  0.384771-0.242857  0.9999
      >>>#填补Nan空值
      ... df1.fillna(value = 0.1234)
      item1    item2    item3   item4
        a -1.366190  0.384771-0.242857  0.9999
        b  0.498889-0.415037  1.490942  0.1234
        c -1.154120-0.323287  0.120690  0.1234
        d -0.174297-1.495528-1.517813  0.1234
        >>>#获取控制布尔值掩码
        >>>pd.isnull(df1)
        item1  item2  item3  item4
        a False False False False
        b False False False True
        c False False False True
        d False False False True

(7)汇总与统计

        >>>df = pd.DataFrame({"a":[1, 2, 3], "b":[1, 3, 5], "c":[5, 4, 1],
    "d":[5, 6, 6], "e":[1, 7, 8]}, index=["r1", "r2", "r3"])
        >>>print df
        abcde
        r1  1  1  5  5  1
        r2  2  3  4  6  7
        r3  3  5  1  6  8
        #统计信息描述
        >>>df.describe()
        abcde
        count  3.0  3  3.000000  3.000000  3.000000
        mean   2.0  3  3.333333  5.666667  5.333333
        std   1.0  2  2.081666  0.577350  3.785939
        min   1.0  1  1.000000  5.000000  1.000000
        25%   1.5  2  2.500000  5.500000  4.000000
        50%   2.0  3  4.000000  6.000000  7.000000
        75%   2.5  4  4.500000  6.000000  7.500000
        max   3.0  5  5.000000  6.000000  8.000000
        #获取单项指标,注意传入0或者1的差别
        >>>df.count(0)
        a   3
        b   3
        c   3
        d   3
        e   3
        dtype: int64
        >>>df.count(1)
        r1   5
        r2   5
        r3   5
      dtype: int64
      >>> #内容:类SQL操作
      >>>df = pd.DataFrame({"a":[1, 2, 3], "b":[1, 3, 5], "c":[5, 4, 1],
  "d":[5, 6, 6], "e":[1, 7, 8]}, index=["r1", "r2", "r3"])
      >>>print df
      abcde
      r1  1  1  5  5  1
      r2  2  3  4  6  7
      r3  3  5  1  6  8
      >>>#类似where操作,其本质是先获取布尔掩码,之后获取值
      ... df[df.a> 1]
      abcde
      r2  2  3  4  6  7
      r3  3  5  1  6  8
      >>> #类似insert操作
      >>>s = df.loc["r1"]
      >>>print s
      a   1
      b   1
      c   5
      d   5
      e   1
      Name: r1, dtype: int64
      >>>df.append(s, ignore_index = False)
      abcde
      r1  1  1  5  5  1
      r2  2  3  4  6  7
      r3  3  5  1  6  8
      r1  1  1  5  5  1
      >>>#类似union操作
      ... pieces = [df[:1], df[1:2], df[2:]]
      >>>pd.concat(pieces)
      abcde
      r1  1  1  5  5  1
      r2  2  3  4  6  7
      r3  3  5  1  6  8
      >>>#类似join操作
      ... left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
      >>>print left
      keylval
      0  foo    1
      1  foo    2
      >>>left = pd.DataFrame({'key': ['id1', 'id2'], 'lval': [1, 2]})
      >>>right = pd.DataFrame({'key': ['id2', 'id3'], 'lval': [5, 6]})
      >>>print left
        keylval
        0  id1    1
        1  id2    2
        >>>print right
        keylval
        0  id2    5
        1  id3    6
        >>>pd.merge(left, right, on='key')
        keylval_xlval_y
        0  id2      2      5
        >>> #类似groupby操作
        >>>s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"],
    name="r1")
        >>>df1 = df.append(s, ignore_index = False)
        >>>print df1
        abcde
        r1  1  1  5  5  1
        r2  2  3  4  6  7
        r3  3  5  1  6  8
        r1  1  2  3  4  5
        >>>df1.groupby("a").sum()
        bcde
        a
        1  3  8  9  6
        2  3  4  6  7
        3  5  1  6  8

(8)时间序列

时间序列基于Timestamp和DatetimeIndex,以及Period和PeriodIndex进行操作。其主要用途是作为Series或者DataFrame的索引,并进行一系列的时间序列相关的操作。下面我们首先通过示例,对Timestamp和Period的数据形式做直观的展示,再接着说明如何生成以及应用该时间序列作为索引,示例如下。

      >>>#创建Timestamp
      ...pd.Timestamp('2017-01-01')
      Timestamp('2017-01-01 00:00:00')
      >>>pd.Timestamp(pd.datetime(2017, 1, 1))
      Timestamp('2017-01-01 00:00:00')
      >>>#创建Period
      ... pd.Period('2017-01')
      Period('2017-01', 'M')
        >>>pd.Period('2017-01', freq = 'D')
        Period('2017-01-01', 'D')
        >>> #DatatimeIndex索引生成的三种方式,示例如下:
        >>>#方式一
        ... dates = [pd.Timestamp(pd.datetime(2012, 5, 1)), pd.Timestamp
    (pd.datetime(2012, 5, 2)), pd.Timestamp(pd.datetime(2012, 5, 3))]
        >>>index = pd.DatetimeIndex(dates)
        >>>print index
        DatetimeIndex(['2012-05-01',     '2012-05-02',     '2012-05-03'],
    dtype='datetime64[ns]', freq=None, tz=None)
        >>>#方式二
        ... index = pd.date_range('2017-1-1', periods=3, freq='D')
        >>>print index
        DatetimeIndex(['2017-01-01',     '2017-01-02',     '2017-01-03'],
    dtype='datetime64[ns]', freq='D', tz=None)
        >>>#方式三
        ... #date_range生成的是日历时间,bdate_range生成的是工作日时间
        ... index = pd.bdate_range('2017-1-1', periods=10)
        >>>print index
        DatetimeIndex(['2017-01-02',     '2017-01-03',     '2017-01-04',
    '2017-01-05', '2017-01-06',        '2017-01-09',        '2017-01-10',
    '2017-01-11', '2017-01-12',     '2017-01-13'], dtype='datetime64[ns]',
    freq='B', tz=None)
        >>> #PeriodIndex索引生成的三种方式,示例如下:
        >>>#方式一
        ... prng = pd.period_range('1/1/2017', '3/1/2017', freq='M')
        >>>print prng
        PeriodIndex(['2017-01',  '2017-02',  '2017-03'],  dtype='int64',
    freq='M')
        >>>#方式二
        ... pd.PeriodIndex(['2017-1', '2017-2', '2017-3'], freq='M')
        PeriodIndex(['2017-01',  '2017-02',  '2017-03'],  dtype='int64',
    freq='M')
        >>>#方式三
        ... >>>pd.PeriodIndex(start='2017-01', freq='3M', periods=4)
        >>>PeriodIndex(['2017-01',   '2017-04',   '2017-07',   '2017-10'],
    dtype='period[3M]', freq='3M')
        >>> #上述内容简要介绍了时间索引,接下来的示例将进一步说明时间索引作为Series
    和DataFrame的索引,以及resample(重采样)函数的使用
        >>>rng = pd.date_range('1/1/2017', periods=6, freq='M')
        >>>ts.resample('2M', how="sum")
        2017-01-31   384
        2017-03-31   472
        2017-05-31   974
        2017-07-31   304
        2017-09-30   492
        2017-11-30   594
        2018-01-31   418
        Freq: 2M, dtype: int64
        >>>rng = pd.date_range('1/1/2017', periods=6, freq='M')
        >>>ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
        >>>print ts
        2017-01-31    91
        2017-02-28   248
        2017-03-31   256
        2017-04-30   193
        2017-05-31   482
        2017-06-30   100
        Freq: M, dtype: int64
        >>>ts.resample('3M').sum()
        2017-01-31    91
        2017-04-30   697
        2017-07-31   582
        Freq: 3M, dtype: int64
        >>>ts.resample('3M').mean()
        2017-01-31    91.000000
        2017-04-30   232.333333
        2017-07-31   291.000000
        Freq: 3M, dtype: float64

(9)文件I/O

同上节介绍的Numpy一样,Pandas提供了一系列简单易用的文件I/O函数,可以把Series以及DataFrame的数据以多种形式保存到文件,并从文件读取到内存中。可以保存的形式包括:csv、hdf、excel、json、html、sql、stata、sas、clipboard、pickle。此处简单介绍csv, excel的操作方式,其他格式的使用与此类似,可以参考官网进一步学习。

        >>>df = pd.DataFrame(np.random.randn(4,3), index=["r1", "r2", "r3",
    "r4"], columns=list('abc'))
        >>>#csv文件的存取
        >>>df.to_csv("df.csv")
        >>>pd.read_csv("df.csv")
        Unnamed: 0        abc
        0        r1  0.242875  0.360095-1.639387
        1        r2  2.335254-0.484403  0.776782
        2        r3-1.536690-0.691591-1.762023
        3        r4-0.423772  0.735941  0.206301
        >>>excel文件的存取
        >>>df.to_excel('df.xlsx', sheet_name='Sheet1')
        >>>pd.read_excel("df.xlsx", 'sheet1', index_col = None)
        abc
        0        r1  0.242875  0.360095-1.639387
        1        r2  2.335254-0.484403  0.776782
        2        r3-1.536690-0.691591-1.762023
        3        r4-0.423772  0.735941  0.206301