上QQ阅读APP看本书,新人免费读10天
设备和账号都新为新人
2.2.2 Pandas
Pandas是基于Numpy构建的高性能数据统计库,其内容如表2-2所示。它提供了Series、DataFrame、Panel三种数据结构,并在此数据结构的基础上提供创建、存取、统计、缺失值处理以及I/O输入输出等函数操作。Panel衍生于计量经济学的分析中,使用的频率并不是很高。因此本节着重围绕Series和DataFrame来介绍Pandas库。
表2-2 Pandas内容概览
官网手册
http://pandas.pydata.org/pandas-docs/stable/
内容概览
示例版本
版本号:0.19.1
日期:2015.11.03
示例讲解
(1)引入包
>>>import numpy as np >>>import pandas as pd
(2)创建对象
>>>#================创建Series对象================ >>> #传入list创建Series,并生成默认的整形索引 >>>s = pd.Series([1, 2, 3, 4, 5]) >>>prints 0 1 1 2 2 3 3 4 4 5 dtype: int64 >>>s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"]) #指定索引创建Series >>>prints a 1 b 2 c 3 d 4 e 5 dtype: int64 >>>#传入Numpy的ndarray数组创建 >>>pd.Series(np.arange(1, 5, 1))0 1 1 2 2 3 3 4 dtype: int64 >>>#================创建DataFrame对象================ ... >>>dates = pd.date_range('20170101', periods=5) >>>df = pd.DataFrame(np.random.randn(5,4), index=dates, columns= list('ABCD')) >>>print df ABCD 2017-01-01-1.008850 0.389870 0.422930-0.239423 2017-01-02-1.804773-0.122929-0.570285 1.609145 2017-01-03-0.214131-1.887011 0.546617-1.250311 2017-01-04 0.862515 0.417958-0.506498-0.330921 2017-01-05 1.225338-0.182110 1.117512-1.338410 df = pd.DataFrame({"a":1, "b":2, "c":3, "d":4, "e":("e1", "e2", "e3")}) #传入Python的dict创建DataFrame数组 >>>print df abcde 0 1 2 3 4 e1 1 1 2 3 4 e2 2 1 2 3 4 e3 >>>#================创建Panel对象================ >>>p = pd.Panel(np.random.randn(2, 5, 4), items=['a', 'b'], ... major_axis=pd.date_range('1/1/2016', periods=5), ... minor_axis=['c', 'd', 'e', 'f']) >>>print p <class 'pandas.core.panel.Panel'> Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis) Itemsaxis: atob Major_axis axis: 2016-01-01 00:00:00 to 2016-01-05 00:00:00 Minor_axis axis: ctof
(3)数据查看
>>>s = pd.Series([1, 2, 3, 4, 5]) >>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c", "d", "e"], columns=["item1", "item2", "item3"]) >>>#================head函数================ ... s.head() 0 1 1 2 2 3 3 4 4 5 dtype: int64 >>>df.head(3) item1 item2 item3 a -0.954088 0.244918 1.145347 b -0.807056 1.180588-0.153845 c 0.100668 0.162599-0.320771 >>>#================tail函数================ ... s.tail(3) 2 3 3 4 4 5 dtype: int64 >>>df.tail(2) item1 item2 item3 d 0.741110-0.020918-0.482747 e 0.756976-0.112055 0.335975 #========通过index、columns、values查看df数据================ >>>df.index Index([u'a', u'b', u'c', u'd', u'e'], dtype='object') >>>df.columns Index([u'item1', u'item2', u'item3'], dtype='object') >>>df.values array([[-0.95408841, 0.24491757, 1.14534653], [-0.80705606, 1.18058753, -0.15384471], [ 0.10066784, 0.16259905, -0.32077148], [ 0.74111005, -0.02091765, -0.48274747], [ 0.75697607, -0.11205458, 0.3359747 ]]) >>>#================df行、列互换================ ... df.T abcde item1-0.954088-0.807056 0.100668 0.741110 0.756976 item2 0.244918 1.180588 0.162599-0.020918-0.112055 item3 1.145347-0.153845-0.320771-0.482747 0.335975 >>>#================df排序查看================ ... df.sort_index(axis=1, ascending=False) item3 item2 item1 a 1.145347 0.244918-0.954088 b -0.153845 1.180588-0.807056 c -0.320771 0.162599 0.100668 d -0.482747-0.020918 0.741110 e 0.335975-0.112055 0.756976
(4)数据选择
>>> #内容:数据选择 >>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c", "d", "e"], columns=["item1", "item2", "item3"]) >>>#获取列 ... df["item1"] a -0.799240 b 1.354446 c 0.250335 d -0.203544 e 0.703325 Name: item1, dtype: float64 >>>#获取行 ... df[0:1] item1 item2 item3 a -0.79924-0.612019-1.716188 >>>#通过索引获取数据 ... df.loc["a"] item1 -0.799240 item2 -0.612019 item3 -1.716188 Name: a, dtype: float64 >>>df.loc["a", ["item1", "item3"]] item1 -0.799240 item3 -1.716188 Name: a, dtype: float64 >>>#通过位置获取数据 >>>df.iloc[2] item1 0.250335 item2 0.067347 item3 0.058785 Name: c, dtype: float64 >>>df.iloc[2:3, [0, 1]] item1 item2 c 0.250335 0.067347 >>>#获取指定的单个数据 ... df.iat[1, 1] 0.08695444047161624
(5)数据设置
>>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c", "d", "e"], columns=["item1", "item2", "item3"]) >>>print df item1 item2 item3 a 0.155408 0.128694 1.385727 b -0.937322 0.789462 0.636947 c 0.203996-0.573447-0.136809 d 0.155883-0.537128-0.480155 e -0.626413 0.854594 0.448322 >>>#设置单个值 ... df.iat[1,1] = 0.99999 >>>print df item1 item2 item3 a 0.155408 0.128694 1.385727 b -0.937322 0.999990 0.636947 c 0.203996-0.573447-0.136809 d 0.155883-0.537128-0.480155 e -0.626413 0.854594 0.448322 >>>#设置一个序列 ... df["item1"] = [0.1, 0.2, 0.3, 0.4, 0.5] >>>print df item1 item2 item3 a 0.1 0.128694 1.385727 b 0.2 0.999990 0.636947 c 0.3-0.573447-0.136809 d 0.4-0.537128-0.480155 e 0.5 0.854594 0.448322
(6)缺失值处理
>>>df = pd.DataFrame(np.random.randn(5, 3), index=["a", "b", "c", "d", "e"], columns=["item1", "item2", "item3"]) >>>df1 = df.reindex(index=["a", "b", "c", "d"], columns=list (df.columns) + ['item4']) >>>print df1 item1 item2 item3 item4 a -1.366190 0.384771-0.242857 NaN b 0.498889-0.415037 1.490942 NaN c -1.154120-0.323287 0.120690 NaN d -0.174297-1.495528-1.517813 NaN >>>df1.iat[0, 3] = 0.9999 >>>print df1 item1 item2 item3 item4 a -1.366190 0.384771-0.242857 0.9999 b 0.498889-0.415037 1.490942 NaN c -1.154120-0.323287 0.120690 NaN d -0.174297-1.495528-1.517813 NaN >>>#去除带有Nan空值的行 ... df1.dropna(how='any') item1 item2 item3 item4 a -1.36619 0.384771-0.242857 0.9999 >>>#填补Nan空值 ... df1.fillna(value = 0.1234) item1 item2 item3 item4 a -1.366190 0.384771-0.242857 0.9999 b 0.498889-0.415037 1.490942 0.1234 c -1.154120-0.323287 0.120690 0.1234 d -0.174297-1.495528-1.517813 0.1234 >>>#获取控制布尔值掩码 >>>pd.isnull(df1) item1 item2 item3 item4 a False False False False b False False False True c False False False True d False False False True
(7)汇总与统计
>>>df = pd.DataFrame({"a":[1, 2, 3], "b":[1, 3, 5], "c":[5, 4, 1], "d":[5, 6, 6], "e":[1, 7, 8]}, index=["r1", "r2", "r3"]) >>>print df abcde r1 1 1 5 5 1 r2 2 3 4 6 7 r3 3 5 1 6 8 #统计信息描述 >>>df.describe() abcde count 3.0 3 3.000000 3.000000 3.000000 mean 2.0 3 3.333333 5.666667 5.333333 std 1.0 2 2.081666 0.577350 3.785939 min 1.0 1 1.000000 5.000000 1.000000 25% 1.5 2 2.500000 5.500000 4.000000 50% 2.0 3 4.000000 6.000000 7.000000 75% 2.5 4 4.500000 6.000000 7.500000 max 3.0 5 5.000000 6.000000 8.000000 #获取单项指标,注意传入0或者1的差别 >>>df.count(0) a 3 b 3 c 3 d 3 e 3 dtype: int64 >>>df.count(1) r1 5 r2 5 r3 5 dtype: int64 >>> #内容:类SQL操作 >>>df = pd.DataFrame({"a":[1, 2, 3], "b":[1, 3, 5], "c":[5, 4, 1], "d":[5, 6, 6], "e":[1, 7, 8]}, index=["r1", "r2", "r3"]) >>>print df abcde r1 1 1 5 5 1 r2 2 3 4 6 7 r3 3 5 1 6 8 >>>#类似where操作,其本质是先获取布尔掩码,之后获取值 ... df[df.a> 1] abcde r2 2 3 4 6 7 r3 3 5 1 6 8 >>> #类似insert操作 >>>s = df.loc["r1"] >>>print s a 1 b 1 c 5 d 5 e 1 Name: r1, dtype: int64 >>>df.append(s, ignore_index = False) abcde r1 1 1 5 5 1 r2 2 3 4 6 7 r3 3 5 1 6 8 r1 1 1 5 5 1 >>>#类似union操作 ... pieces = [df[:1], df[1:2], df[2:]] >>>pd.concat(pieces) abcde r1 1 1 5 5 1 r2 2 3 4 6 7 r3 3 5 1 6 8 >>>#类似join操作 ... left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]}) >>>print left keylval 0 foo 1 1 foo 2 >>>left = pd.DataFrame({'key': ['id1', 'id2'], 'lval': [1, 2]}) >>>right = pd.DataFrame({'key': ['id2', 'id3'], 'lval': [5, 6]}) >>>print left keylval 0 id1 1 1 id2 2 >>>print right keylval 0 id2 5 1 id3 6 >>>pd.merge(left, right, on='key') keylval_xlval_y 0 id2 2 5 >>> #类似groupby操作 >>>s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"], name="r1") >>>df1 = df.append(s, ignore_index = False) >>>print df1 abcde r1 1 1 5 5 1 r2 2 3 4 6 7 r3 3 5 1 6 8 r1 1 2 3 4 5 >>>df1.groupby("a").sum() bcde a 1 3 8 9 6 2 3 4 6 7 3 5 1 6 8
(8)时间序列
时间序列基于Timestamp和DatetimeIndex,以及Period和PeriodIndex进行操作。其主要用途是作为Series或者DataFrame的索引,并进行一系列的时间序列相关的操作。下面我们首先通过示例,对Timestamp和Period的数据形式做直观的展示,再接着说明如何生成以及应用该时间序列作为索引,示例如下。
>>>#创建Timestamp ...pd.Timestamp('2017-01-01') Timestamp('2017-01-01 00:00:00') >>>pd.Timestamp(pd.datetime(2017, 1, 1)) Timestamp('2017-01-01 00:00:00') >>>#创建Period ... pd.Period('2017-01') Period('2017-01', 'M') >>>pd.Period('2017-01', freq = 'D') Period('2017-01-01', 'D') >>> #DatatimeIndex索引生成的三种方式,示例如下: >>>#方式一 ... dates = [pd.Timestamp(pd.datetime(2012, 5, 1)), pd.Timestamp (pd.datetime(2012, 5, 2)), pd.Timestamp(pd.datetime(2012, 5, 3))] >>>index = pd.DatetimeIndex(dates) >>>print index DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None, tz=None) >>>#方式二 ... index = pd.date_range('2017-1-1', periods=3, freq='D') >>>print index DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq='D', tz=None) >>>#方式三 ... #date_range生成的是日历时间,bdate_range生成的是工作日时间 ... index = pd.bdate_range('2017-1-1', periods=10) >>>print index DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06', '2017-01-09', '2017-01-10', '2017-01-11', '2017-01-12', '2017-01-13'], dtype='datetime64[ns]', freq='B', tz=None) >>> #PeriodIndex索引生成的三种方式,示例如下: >>>#方式一 ... prng = pd.period_range('1/1/2017', '3/1/2017', freq='M') >>>print prng PeriodIndex(['2017-01', '2017-02', '2017-03'], dtype='int64', freq='M') >>>#方式二 ... pd.PeriodIndex(['2017-1', '2017-2', '2017-3'], freq='M') PeriodIndex(['2017-01', '2017-02', '2017-03'], dtype='int64', freq='M') >>>#方式三 ... >>>pd.PeriodIndex(start='2017-01', freq='3M', periods=4) >>>PeriodIndex(['2017-01', '2017-04', '2017-07', '2017-10'], dtype='period[3M]', freq='3M') >>> #上述内容简要介绍了时间索引,接下来的示例将进一步说明时间索引作为Series 和DataFrame的索引,以及resample(重采样)函数的使用 >>>rng = pd.date_range('1/1/2017', periods=6, freq='M') >>>ts.resample('2M', how="sum") 2017-01-31 384 2017-03-31 472 2017-05-31 974 2017-07-31 304 2017-09-30 492 2017-11-30 594 2018-01-31 418 Freq: 2M, dtype: int64 >>>rng = pd.date_range('1/1/2017', periods=6, freq='M') >>>ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng) >>>print ts 2017-01-31 91 2017-02-28 248 2017-03-31 256 2017-04-30 193 2017-05-31 482 2017-06-30 100 Freq: M, dtype: int64 >>>ts.resample('3M').sum() 2017-01-31 91 2017-04-30 697 2017-07-31 582 Freq: 3M, dtype: int64 >>>ts.resample('3M').mean() 2017-01-31 91.000000 2017-04-30 232.333333 2017-07-31 291.000000 Freq: 3M, dtype: float64
(9)文件I/O
同上节介绍的Numpy一样,Pandas提供了一系列简单易用的文件I/O函数,可以把Series以及DataFrame的数据以多种形式保存到文件,并从文件读取到内存中。可以保存的形式包括:csv、hdf、excel、json、html、sql、stata、sas、clipboard、pickle。此处简单介绍csv, excel的操作方式,其他格式的使用与此类似,可以参考官网进一步学习。
>>>df = pd.DataFrame(np.random.randn(4,3), index=["r1", "r2", "r3", "r4"], columns=list('abc')) >>>#csv文件的存取 >>>df.to_csv("df.csv") >>>pd.read_csv("df.csv") Unnamed: 0 abc 0 r1 0.242875 0.360095-1.639387 1 r2 2.335254-0.484403 0.776782 2 r3-1.536690-0.691591-1.762023 3 r4-0.423772 0.735941 0.206301 >>>excel文件的存取 >>>df.to_excel('df.xlsx', sheet_name='Sheet1') >>>pd.read_excel("df.xlsx", 'sheet1', index_col = None) abc 0 r1 0.242875 0.360095-1.639387 1 r2 2.335254-0.484403 0.776782 2 r3-1.536690-0.691591-1.762023 3 r4-0.423772 0.735941 0.206301