Python数据处理:数据分析处理库Pandas


一、Pandas读取数据

import pandas as pd
df = pd.read_csv('./data/titanic.csv')
df.head(6)									#.head()可以读取前几条数据,指定前几条都可以,相对应还有.tail()
  • df.info()返回当前的信息:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
  • df.index:返回RangeIndex(start=0, stop=891, step=1)

  • df.columns:返回

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
  • df.dtypes:返回
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
  • df.values:返回
array([[1, 0, 3, ..., 7.25, nan, 'S'],
       [2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [3, 1, 3, ..., 7.925, nan, 'S'],
       ..., 
       [889, 0, 3, ..., 23.45, nan, 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C'],
       [891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)

二、Pandas基本操作

2.1 基本结构及操作

创建一个DataFrame结构:

data = {'country':['aaa','bbb','ccc'],
       'population':[10,12,14]}
df_data = pd.DataFrame(data)

df_data则为:

country population
0 aaa 10
1 bbb 12
2 ccc 14

seriesDataFrame中的一行/列,底层是ndarray结构。取指定的数据:

age = df['Age']
age[:5]

打印结果为:

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

pandas中索引可以由我们自己指定:

df = df.set_index('Name')
df.head()
Name PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S

然后打印df['Age'][:5],则结果为:

Name
Braund, Mr. Owen Harris                                22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    38.0
Heikkinen, Miss. Laina                                 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           35.0
Allen, Mr. William Henry                               35.0
Name: Age, dtype: float64

打印age['Allen, Mr. William Henry'],则结果为35.0

pandas可以对结果进行运算:

age = age + 10
age[:5]

打印结果为:

Name
Braund, Mr. Owen Harris                                32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    48.0
Heikkinen, Miss. Laina                                 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel)           45.0
Allen, Mr. William Henry                               45.0
Name: Age, dtype: float64

age = age *10同理,与numpy的语法一致。

age.mean()		#396.99117647058824,均值
age.max()		#900.0
age.min()		#104.2

describe()函数可以得到数据的基本统计特性,即数量、均值、标准差、最小值、25/50/75%值、最大值:

PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

2.2 pandas索引

上一节介绍的索引定位:

df[['Age','Fare']][:5]

打印结果为:

Age Fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500

2.2.1 iloc

iloc:用position来去定位。

df.iloc[0]的打印结果为:

PassengerId                          1
Survived                             0
Pclass                               3
Name           Braund, Mr. Owen Harris
Sex                               male
Age                                 22
SibSp                                1
Parch                                0
Ticket                       A/5 21171
Fare                              7.25
Cabin                              NaN
Embarked                             S
Name: 0, dtype: object

同理,df.iloc[0:5]的打印结果为:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

df.iloc[0:5,1:3]的打印结果为:

Survived Pclass
0 0 3
1 1 1
2 1 3
3 1 1
4 0 3

2.2.2 loc

loc:用label来去定位。

df = df.set_index('Name')
df.loc['Heikkinen, Miss. Laina']

打印结果为:

PassengerId                   3
Survived                      1
Pclass                        3
Sex                      female
Age                          26
SibSp                         0
Parch                         0
Ticket         STON/O2. 3101282
Fare                      7.925
Cabin                       NaN
Embarked                      S
Name: Heikkinen, Miss. Laina, dtype: object

打印某个指标:

df.loc['Heikkinen, Miss. Laina','Fare']			#7.9249999999999998

可以使用切片进行索引:

df.loc['Heikkinen, Miss. Laina':'Allen, Mr. William Henry',:]

打印结果为:

Name PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.925 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.100 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.050 NaN S

可以进行赋值操作:

df.loc['Heikkinen, Miss. Laina','Fare'] = 1000
df.head()

2.2.3 bool类型的索引

df['Fare'] > 40

返回结果为:

Name
Braund, Mr. Owen Harris                                      False
Cumings, Mrs. John Bradley (Florence Briggs Thayer)           True
Heikkinen, Miss. Laina                                        True
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  True
Allen, Mr. William Henry                                     False
Moran, Mr. James                                             False
McCarthy, Mr. Timothy J                                       True
......
df[df['Fare'] > 40][:5]

返回结果为:

Name PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 1000.0000 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
McCarthy, Mr. Timothy J 7 0 1 male 54.0 0 0 17463 51.8625 E46 S
Fortune, Mr. Charles Alexander 28 0 1 male 19.0 3 2 19950 263.0000 C23 C25 C27 S

df[df['Sex'] == 'male'][:5],返回的结果类似。

可以对这些数据进行运算:

  • 比如对Age作平均值df.loc[df['Sex'] == 'male','Age'].mean()

  • 计数(df['Age'] > 70).sum()

2.2.4 索引进阶isin/where/query

s = pd.Series(np.arange(5),index = np.arange(5)[::-1],dtype='int64')

判断一个数是否在当前序列中:

s.isin([1,3,4])
4    False
3     True
2    False
1     True
0     True
dtype: bool
s2 = pd.Series(np.arange(6),index = pd.MultiIndex.from_product([[0,1],['a','b','c']]))
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int32
s2.iloc[s2.index.isin([(1,'a'),(0,'b')])]
0  b    1
1  a    3
dates = pd.date_range('20220819',periods=8)
df = pd.DataFrame(np.random.randn(8,4),index=dates,columns=['A','B','C','D'])
A B C D
2022-08-19 -1.400125 1.250936 -0.220523 0.771322
2022-08-20 -0.837570 -0.321949 1.868991 -1.722395
2022-08-21 -0.578139 0.431476 -0.019026 -0.040017
2022-08-22 0.385588 2.463628 -0.066183 -0.948124
2022-08-23 0.833327 0.451540 -1.124371 -0.328375
2022-08-24 -0.387599 1.547158 -0.806947 0.233735
2022-08-25 -0.940428 0.286457 0.412619 0.531194
2022-08-26 -0.379863 -0.211559 -1.288098 1.592291
df.where(df < 0)
A B C D
2022-08-19 -1.400125 NaN -0.220523 NaN
2022-08-20 -0.837570 -0.321949 NaN -1.722395
2022-08-21 -0.578139 NaN -0.019026 -0.040017
2022-08-22 NaN NaN -0.066183 -0.948124
2022-08-23 NaN NaN -1.124371 -0.328375
2022-08-24 -0.387599 NaN -0.806947 NaN
2022-08-25 -0.940428 NaN NaN NaN
2022-08-26 -0.379863 -0.211559 -1.288098 NaN
df.where(df < 0,-df)					#不满足条件的可以指定取反
A B C D
2017-11-24 -1.690231 -0.338101 -1.071022 -1.084637
2017-11-25 -1.292291 -0.449885 -0.468264 -0.637102
2017-11-26 -0.602494 -0.591658 -0.301893 -1.050524
2017-11-27 -1.132170 -1.310110 -0.552812 -0.370947
2017-11-28 -0.113234 -0.859983 -0.381977 -0.371730
2017-11-29 -0.616029 -0.209225 -1.879964 -0.179152
2017-11-30 -0.554969 -0.656240 -2.449274 -0.302113
2017-12-01 -0.700342 -1.068990 -0.572698 -0.577581
df = pd.DataFrame(np.random.rand(10,3),columns = list('abc'))
a b c
0 0.760971 0.910097 0.480540
1 0.242792 0.778342 0.852441
2 0.147953 0.449719 0.539780
3 0.519164 0.936192 0.402399
4 0.365343 0.148621 0.176917
5 0.837852 0.283028 0.527734
6 0.729312 0.066871 0.747968
7 0.502851 0.462246 0.116735
8 0.472404 0.517753 0.945877
9 0.962282 0.300276 0.258252

查询操作

df.query('(a<b)')
df.query('(a<b) & (b<c)')

2.3 groupby操作

df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                  'data':[0,5,10,5,10,15,10,15,20]})
data key
0 0 A
1 5 B
2 10 C
3 5 A
4 10 B
5 15 C
6 10 A
7 15 B
8 20 C
for key in ['A','B','C']:
    print (key,df[df['key'] == key].sum())

打印结果为:

A data     15
key     AAA
dtype: object
B data     30
key     BBB
dtype: object
C data     45
key     CCC
dtype: object

若使用groupby操作,则为:

df.groupby('key').sum()
key data
A 15
B 30
C 45

也可以通过numpy格式得出上述结果,则为:df.groupby('key').aggregate(np.sum)

平均值也类似:

df.groupby('key').aggregate(np.mean)
key data
A 5
B 10
C 15
df = pd.read_csv('./data/titanic.csv')
df.groupby('Sex')['Age'].mean()

返回结果为:

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                           'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
A B C D
0 foo one -0.785250 1.010458
1 bar one 2.549941 1.704677
2 foo two -0.255153 -0.603249
3 bar three -0.954625 0.117662
4 foo two -0.548512 0.648127
5 bar two -0.642762 -1.111568
6 foo one 0.870697 0.556371
7 foo three 0.839937 0.798669
grouped = df.groupby('A')
grouped.count()
A B C D
bar 3 3 3
foo 5 5 5
grouped = df.groupby(['A','B'])
grouped.count()

image-20220817225035447

def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'a'
    else:
        return 'b'
grouped = df.groupby(get_letter_type,axis = 1)
grouped.count().iloc[0]
a    1
b    3
Name: 0, dtype: int64
df2.groupby(['X']).get_group('A')
X Y
0 A 1
2 A 3
grouped = df.groupby(['A','B'])
grouped.aggregate(np.sum)

相同的索引会进行合并显示:

相同的索引合并显示

grouped = df.groupby(['A','B'],as_index = False)
grouped.aggregate(np.sum)
A B C D
0 bar one 2.549941 1.704677
1 bar three -0.954625 0.117662
2 bar two -0.642762 -1.111568
3 foo one 0.085447 1.566829
4 foo three 0.839937 0.798669
5 foo two -0.803665 0.044878

重新构建索引:df.groupby(['A','B']).sum().reset_index(),也能显示以上结果。

得到分组组合个数:

grouped = df.groupby(['A','B'])
grouped.size()
A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64

获取统计特性:

grouped.describe().head()

获取统计特性

agg函数用于调用groupby函数之后,对数据做一些聚合操作,包括sum,min,max以及其他一些聚合函数:

grouped = df.groupby('A')
grouped['C'].agg([np.sum,np.mean,np.std])
A sum mean std
bar 0.952553 0.317518 1.939613
foo 0.121719 0.024344 0.781542

还可以通过字典结构指定分组名字:

grouped['C'].agg({'res_sum':np.sum,'res_mean':np.mean,'res_std':np.std})

2.4 数值运算

2.4.1 运算

构造DataFrame结构:

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]],index = ['a','b'],columns = ['A','B','C'])
A B C
a 1 2 3
b 4 5 6
df.sum()

默认按列求和,返回结果为:

A    5
B    7
C    9
dtype: int64

可以指定维度:

df.sum(axis = 1)					#也可以使用df.sum(axis = 'columns')

返回结果为:

a     6
b    15
dtype: int64

还有df.mean()df.min()df.max()df.median()(中位数)等。

2.4.2 二元统计

  • 统计协方差df.cov()
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 66231.000000 -0.626966 -7.561798 138.696504 -16.325843 -0.342697 161.883369
Survived -0.626966 0.236772 -0.137703 -0.551296 -0.018954 0.032017 6.221787
Pclass -7.561798 -0.137703 0.699015 -4.496004 0.076599 0.012429 -22.830196
Age 138.696504 -0.551296 -4.496004 211.019125 -4.163334 -2.344191 73.849030
SibSp -16.325843 -0.018954 0.076599 -4.163334 1.216043 0.368739 8.748734
Parch -0.342697 0.032017 0.012429 -2.344191 0.368739 0.649728 8.661052
Fare 161.883369 6.221787 -22.830196 73.849030 8.748734 8.661052 2469.436846
  • 统计相关系数df.corr()
PassengerId Survived Pclass Age SibSp Parch Fare
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000
  • **统计不同属性的个数value_counts**:df['Age'].value_counts()
24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
28.00    25
21.00    24
25.00    23
36.00    22
29.00    20
32.00    18
......

可以指定为升序:df['Age'].value_counts(ascending = True)

可以通过bins作分组:df['Age'].value_counts(ascending = True,bins = 5)

(64.084, 80.0]       11
(48.168, 64.084]     69
(0.339, 16.336]     100
(32.252, 48.168]    188
(16.336, 32.252]    346
Name: Age, dtype: int64
  • count()可以用于计数样本:df['Pclass'].count()

2.5 对象操作

2.5.1 Series结构的增删改查

data = [10,11,12]
index = ['a','b','c']
s = pd.Series(data = data,index = index)

返回结果为:

a    10
b    11
c    12
dtype: int64

查操作

s[0]				#10
s[0:2]				#a    10
					#b    11
					#dtype: int64
mask = [True,False,True]
s[mask]
s.loc['b']			#11
s.iloc[1]			#11

改操作

s1 = s.copy()
s1['a'] = 100
s1
a    100
b     11
c     12
dtype: int64

还可以使用replace函数进行修改(inplace = True将在原始series结构进行改变):

s1.replace(to_replace = 100,value = 101,inplace = True)

a    101
b     11
c     12
dtype: int64

可以修改索引:

s1.index										#Index(['a', 'b', 'c'], dtype='object')
s1.index = ['a','b','d']						#a    101 b     11 d     12 dtype: int64
s1.rename(index = {'a':'A'},inplace = True)		#单独修改某一索引,index值指定成字典结构

增操作

data = [100,110]
index = ['h','k']
s2 = pd.Series(data = data,index = index)
s3 = s1.append(s2)								#第一种
s3['j'] = 500									#第二种
s1.append(s2,ignore_index = True)				#ignore_index = True,改变原有索引方式

ignore_index = False

A    101
b     11
d     12
j    500
h    100
k    110
dtype: int64

ignore_index = True

0    101
1     11
2     12
3    500
4    100
5    110
dtype: int64

删操作

  • del s1['A']
  • s1.drop(['b','d'],inplace = True)

2.5.2 DataFrame结构的增删改查

data = [[1,2,3],[4,5,6]]
index = ['a','b']
columns = ['A','B','C']

df = pd.DataFrame(data=data,index=index,columns = columns)

查操作是类似的

  • df['A']
  • df.iloc[0]
  • df.loc['a']

改操作

  • df.loc['a']['A']
  • df.loc['a']['A'] = 150
  • df.index = ['f','g']

增操作

  • 增加行:
df.loc['c'] = [1,2,3]					#增加一行数据
data = [[1,2,3],[4,5,6]]
index = ['j','k']
columns = ['A','B','C']
df2 = pd.DataFrame(data=data,index=index,columns = columns)
df3 = pd.concat([df,df2],axis = 0)		#连接操作
  • 增加列:
df2['Tang'] = [10,11]					#增加新的列
df4 = pd.DataFrame([[10,11],[12,13]],index=['j','k'],columns=['D','E'])
df5 = pd.concat([df2,df4],axis = 1)		#增加多列
A B C Tang D E
j 1 2 3 10 10 11
k 4 5 6 11 12 13

删操作

df5.drop(['j'],axis=0,inplace = True)				#删除一行数据
A B C Tang D E
k 4 5 6 11 12 13
del df5['Tang']										#删除一列数据
df5.drop(['A','B','C'],axis = 1,inplace = True)		#删除多列数据

2.5.3 merge操作

left = pd.DataFrame({'key':['K0','K1','K2','K3'],'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']})
right = pd.DataFrame({'key':['K0','K1','K2','K3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']})
pd.merge(left,right,on = 'key')						#不加on默认以相同元素为键
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

若有两个键key1key2,执行pd.merge(left,right,on = 'key1'),则结果为:

key1 key2_x A B key2_y C D
0 K0 K0 A0 B0 K0 C0 D0
1 K1 K1 A1 B1 K1 C1 D1
2 K2 K2 A2 B2 K2 C2 D2
3 K3 K3 A3 B3 K3 C3 D3

若执行res = pd.merge(left, right, on = ['key1','key2']),则结果为:

key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K1 A1 B1 C1 D1
2 K2 K2 A2 B2 C2 D2
3 K3 K3 A3 B3 C3 D3

若两个键有部分不一致,即:

left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                     'key2': ['K0', 'K1', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                      'key2': ['K0', 'K1', 'K2', 'K4'],
                    'C': ['C0', 'C1', 'C2', 'C3'], 
                    'D': ['D0', 'D1', 'D2', 'D3']})

print (left)
print (right)

仍执行res = pd.merge(left, right, on = ['key1','key2']),则结果为:

key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K1 A1 B1 C1 D1
2 K2 K2 A2 B2 C2 D2

若指定how = outer,即取并集:res = pd.merge(left, right, on = ['key1', 'key2'], how = 'outer'),则结果为:

A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K1 K1 C1 D1
2 A2 B2 K2 K2 C2 D2
3 A3 B3 K3 K3 NaN NaN
4 NaN NaN K3 K4 C3 D3

加入指示器indicatorres = pd.merge(left, right, on = ['key1', 'key2'], how = 'outer', indicator = True),则结果为:

A B key1 key2 C D _merge
0 A0 B0 K0 K0 C0 D0 both
1 A1 B1 K1 K1 C1 D1 both
2 A2 B2 K2 K2 C2 D2 both
3 A3 B3 K3 K3 NaN NaN left_only
4 NaN NaN K3 K4 C3 D3 right_only

指定以左表/右表为基准:如res = pd.merge(left, right, how = 'left'),则结果为:

A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K1 K1 C1 D1
2 A2 B2 K2 K2 C2 D2
3 A3 B3 K3 K3 NaN NaN

2.6 显示设置

网址:http://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html?highlight=set_option#pandas.set_option

  • 显示当前显示数据最大行数:pd.get_option('display.max_rows')
  • 设置当前显示数据最大行数:pd.set_option('display.max_rows',6)

同理,列操作分别为:pd.get_option('display.max_columns')pd.set_option('display.max_columns',30)

  • 设置字符串长度:pd.get_option('display.max_colwidth')pd.set_option('display.max_colwidth',100)
  • 设置字符串小数点精度:pd.get_option('display.precision')pd.set_option('display.precision',5)

2.7 数据透视表

import pandas as pd
example = pd.DataFrame({'Month': ["January", "January", "January", "January", 
                                  "February", "February", "February", "February", 
                                  "March", "March", "March", "March"],
                   'Category': ["Transportation", "Grocery", "Household", "Entertainment",
                                "Transportation", "Grocery", "Household", "Entertainment",
                                "Transportation", "Grocery", "Household", "Entertainment"],
                   'Amount': [74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.]})
Month Category Amount
0 January Transportation 74.0
1 January Grocery 235.0
2 January Household 175.0
3 January Entertainment 100.0
4 February Transportation 115.0
5 February Grocery 240.0
6 February Household 225.0
7 February Entertainment 125.0
8 March Transportation 90.0
9 March Grocery 260.0
10 March Household 200.0
11 March Entertainment 120.0
example_pivot = example.pivot(index = 'Category',columns= 'Month',values = 'Amount')	#values代表统计属性
Month February January March
Category
Entertainment 125.0 100.0 120.0
Grocery 240.0 235.0 260.0
Household 225.0 175.0 200.0
Transportation 115.0 74.0 90.0
example_pivot.sum(axis = 1)
Category
Entertainment     345.0
Grocery           735.0
Household         600.0
Transportation    279.0
dtype: float64
example_pivot.sum(axis = 0)
Month
February    705.0
January     584.0
March       670.0
dtype: float64

案例:

df = pd.read_csv('./data/titanic.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
#默认值就是求平均
df.pivot_table(index = 'Sex',columns='Pclass',values='Fare')
Pclass 1 2 3
Sex
female 106.125798 21.970121 16.118810
male 67.226127 19.741782 12.661633
#指定为求最大值
df.pivot_table(index = 'Sex',columns='Pclass',values='Fare',aggfunc='max')
#指定求平均
df.pivot_table(index = 'Pclass',columns='Sex',values='Survived',aggfunc='mean')
Pclass 1 2 3
Sex
female 512.3292 65.0 69.55
male 512.3292 73.5 69.55

类似地,还有计数:df.pivot_table(index = 'Sex',columns='Pclass',values='Fare',aggfunc='count')

pandas中也有类似计数:pd.crosstab(index = df['Sex'],columns = df['Pclass'])

三、时间操作

3.1 时间基本操作

import pandas as pd
ts = pd.Timestamp('2022-08-17')				#Timestamp('2022-08-17 00:00:00')
ts.month									#8
ts.day										#17
ts + pd.Timedelta('5 days')					#Timestamp('2022-08-22 00:00:00')

此外,还有pd.to_datetime('2022-08-17')/pd.to_datetime('17/8/2022')来构造时间。

一般我们可以用于构造series结构:

s = pd.Series(['2022-08-17 00:00:00','2022-08-22 00:00:00','2022-08-23 00:00:00'])
0    2022-08-17 00:00:00
1    2022-08-22 00:00:00
2    2022-08-23 00:00:00
dtype: object

把它转换为datetime类型:

ts = pd.to_datetime(s)
0   2022-08-17
1   2022-08-22
2   2022-08-23
dtype: datetime64[ns]

现在我们可以调用它的属性:

ts.dt.hour
0    0
1    0
2    0
dtype: int64
ts.dt.weekday									#星期索引,周一的索引为0,依此类推
0    2
1    0
2    1
dtype: int64

构造series结构,可以直接指定datetime类型:

pd.Series(pd.date_range(start='2022-08-24',periods = 10,freq = '12H'))		#freq为时间间隔
0   2022-08-24 00:00:00
1   2022-08-24 12:00:00
2   2022-08-25 00:00:00
3   2022-08-25 12:00:00
4   2022-08-26 00:00:00
5   2022-08-26 12:00:00
6   2022-08-27 00:00:00
7   2022-08-27 12:00:00
8   2022-08-28 00:00:00
9   2022-08-28 12:00:00
dtype: datetime64[ns]

指定index列并解析date数据:

data = pd.read_csv('./data/flowdata.csv',index_col = 0,parse_dates = True)
data.head()

3.2 时间序列操作

进行取数据操作:【指定索引】

data[pd.Timestamp('2012-01-01 09:00'):pd.Timestamp('2012-01-01 19:00')]
data[('2012-01-01 09:00'):('2012-01-01 19:00')]
Time L06_347 LS06_347 LS06_348
2012-01-01 09:00:00 0.330750 0.293583 0.029750
2012-01-01 12:00:00 0.295000 0.285167 0.031750
2012-01-01 15:00:00 0.301417 0.287750 0.031417
2012-01-01 18:00:00 0.322083 0.304167 0.038083
data['2013']											#根据年份取
data['2012-01':'2012-03']								#根据年份+月份取
data[data.index.month == 1]								#根据布尔类型索引取出月份
data[(data.index.hour > 8) & (data.index.hour <12)]		#取出时间片段法1
data.between_time('08:00','12:00')						#取出时间片段法2,包含端点

对时间序列进行重采样data.resample('D').mean().head()/data.resample('D').max().head()

Time L06_347 LS06_347 LS06_348 month
2009-01-01 0.125010 0.092281 0.016635 1
2009-01-02 0.124146 0.095781 0.016406 1
2009-01-03 0.113562 0.085542 0.016094 1
2009-01-04 0.140198 0.102708 0.017323 1
2009-01-05 0.128812 0.104490 0.018167 1

取时间节点为多天(如3天):

data.resample('3D').mean().head()
data.resample('M').mean().head()						#以月份为单位

画图展示:

%matplotlib notebook									#进行画图展示
data.resample('M').mean().plot()
image-20220817180544761

四、Pandas常用操作

data = pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],'data':[4,3,2,1,12,3,4,5,7]})
group data
0 a 4
1 a 3
2 a 2
3 b 1
4 b 12
5 b 3
6 c 4
7 c 5
8 c 7

4.1 排序

data.sort_values(by=['group','data'],ascending = [False,True],inplace = True)
group data
6 c 4
7 c 5
8 c 7
3 b 1
5 b 3
4 b 12
2 a 2
1 a 3
0 a 4
data = pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,4,4]})
data.sort_values(by='k2')
k1 k2
2 one 1
1 one 2
0 one 3
3 two 3
4 two 3
5 two 4
6 two 4
data.drop_duplicates()							#去除重复数据
k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
5 two 4
data.drop_duplicates(subset = 'k1')				#指定列去重
k1 k2
0 one 3
3 two 3

4.2 同类项合并(作映射)

data = pd.DataFrame({'food':['A1','A2','B1','B2','B3','C1','C2'],'data':[1,2,3,4,5,6,7]})
def food_map(series):
    if series['food'] == 'A1':
        return 'A'
    elif series['food'] == 'A2':
        return 'A'
    elif series['food'] == 'B1':
        return 'B'
    elif series['food'] == 'B2':
        return 'B'
    elif series['food'] == 'B3':
        return 'B'
    elif series['food'] == 'C1':
        return 'C'
    elif series['food'] == 'C2':
        return 'C'
data['food_map'] = data.apply(food_map,axis = 'columns')		#对数据执行相同的操作(函数)
data food food_map
0 1 A1 A
1 2 A2 A
2 3 B1 B
3 4 B2 B
4 5 B3 B
5 6 C1 C
6 7 C2 C

或使用map操作:

food2Upper = {
    'A1':'A',
    'A2':'A',
    'B1':'B',
    'B2':'B',
    'B3':'B',
    'C1':'C',
    'C2':'C'
}
data['upper'] = data['food'].map(food2Upper)
data food food_map upper
0 1 A1 A A
1 2 A2 A A
2 3 B1 B B
3 4 B2 B B
4 5 B3 B B
5 6 C1 C C
6 7 C2 C C

4.3 assign、replace方法

import numpy as np
df = pd.DataFrame({'data1':np.random.randn(5),
                  'data2':np.random.randn(5)})
df2 = df.assign(ration = df['data1']/df['data2'])
data1 data2 ration
0 -1.069925 -0.186540 5.735617
1 0.636127 0.020425 31.143814
2 0.366197 -0.102836 -3.560992
3 -0.975327 0.451201 -2.161624
4 -1.562407 -2.436845 0.641160

使用drop方法可以删去ration列:df2.drop('ration',axis='columns',inplace=True)

data = pd.Series([1,2,3,4,5,6,7,8,9])
data.replace(9,np.nan,inplace=True)
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
dtype: float64

4.4 连续值离散化:cut方法

ages = [15,18,20,21,22,34,41,52,63,79]
bins = [10,40,80]
bins_res = pd.cut(ages,bins)
[(10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (10, 40], (40, 80], (40, 80], (40, 80], (40, 80]]
Categories (2, interval[int64]): [(10, 40] < (40, 80]]
bins_res.labels						#array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=int8)
pd.value_counts(bins_res)			#(10, 40]    6  (40, 80]    4  dtype: int64
pd.cut(ages,[10,30,50,80])
[(10, 30], (10, 30], (10, 30], (10, 30], (10, 30], (30, 50], (30, 50], (50, 80], (50, 80], (50, 80]]
Categories (3, interval[int64]): [(10, 30] < (30, 50] < (50, 80]]
group_names = ['Yonth','Mille','Old']
#pd.cut(ages,[10,20,50,80],labels=group_names)
pd.value_counts(pd.cut(ages,[10,20,50,80],labels=group_names))		#labels即bins换完之后的名字
Mille    4
Old      3
Yonth    3
dtype: int64

4.5 缺失值的情况

df = pd.DataFrame([range(3),[0, np.nan,0],[0,0,np.nan],range(3)])
0 1 2
0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0
df.isnull()							#查看是否有缺失值
0 1 2
0 False False False
1 False True False
2 False False True
3 False False False
df.isnull().any()					#哪一列是否有缺失值
0    False
1     True
2     True
dtype: bool
df.isnull().any(axis = 1)			#哪一行是否有缺失值
0    False
1     True
2     True
3    False
dtype: bool
df.fillna(5)						#填充缺失值
0	1	2
0	0	1.0	2.0
1	0	5.0	0.0
2	0	0.0	5.0
3	0	1.0	2.0
df[df.isnull().any(axis = 1)]		#通过布尔类型的索引找到带缺失值的行
0 1 2
1 0 NaN 0.0
2 0 0.0 NaN

4.6 字符串操作

import numpy as np
import pandas as pd
s = pd.Series(['A','b','B','gaer','GAER',np.nan])
0       A
1       b
2       B
3    gaer
4    GAER
5     NaN
dtype: object
s.str.lower()						#将字符串中的字母全部转为小写
0       a
1       b
2       b
3    gaer
4    gaer
5     NaN
dtype: object

同理,将字符串中的字母全部转为大写可使用:s.str.upper()

统计字符串长度:s.str.len()

去除字符串中的空格:strip方法

index = pd.Index(['  tang','yu   ','abc'])
index.str.strip()

注:

还可以只去掉左/右空格:index.str.lstrip(),index.str.rstrip()

df = pd.DataFrame(np.random.randn(3,2),columns = ['A a','B b'],index = range(3))
df.columns = df.columns.str.replace(' ','_')
A_a B_b
0 -1.392628 1.020082
1 0.866707 0.654731
2 -0.320871 1.360513
s = pd.Series(['a_b_C','c_d_e','f_g_h'])
s.str.split('_')
0    [a, b, C]
1    [c, d, e]
2    [f, g, h]
dtype: object

得到新的字段:

s.str.split('_',expand = True)
0 1 2
0 a b C
1 c d e
2 f g h
s.str.split('_',expand = True,n=1)
0 1
0 a b_C
1 c d_e
2 f g_h

进行包含的判断

s = pd.Series(['A','Aas','Afgew','Ager','Agre','Ager'])
0        A
1      Aas
2    Afgew
3     Ager
4     Agre
5     Ager
dtype: object
s.str.contains('Ag')
0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool

指定分隔符:

s = pd.Series(['a','a|b','a|c'])
s.str.get_dummies(sep = '|')
a b c
0 1 0 0
1 1 1 0
2 1 0 1

4.7 Pandas绘图

4.7.1 画折线图

%matplotlib inline
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10),index = np.arange(0,100,10))	 #生成标准正态分布的伪随机数(均值为0,方差为1) 
s.plot()

image-20220819121800261

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0), 
               index = np.arange(0, 100, 10), 
               columns = ['A', 'B', 'C', 'D'])
df.head()

Out[3]:

A B C D
0 -0.887275 0.191912 0.276001 -0.804419
10 -0.404738 -1.942269 0.183497 -1.845532
20 -0.746758 -1.534280 0.178036 -0.790430
30 -0.702495 -2.510148 1.879569 -0.421204
40 -0.023128 -1.389147 2.205663 -0.253191
df.plot()

image-20220819121911613

4.7.2 画柱状图

import matplotlib.pyplot as plt

fig,axes = plt.subplots(2,1)
data = pd.Series(np.random.rand(16),index=list('abcdefghijklmnop'))
data.plot(ax = axes[0],kind='bar')			#bar表示竖着画
data.plot(ax = axes[1],kind='barh')			#barh表示横着画
image-20220819122925825
df = pd.DataFrame(np.random.rand(6, 4), 
               index = ['one', 'two', 'three', 'four', 'five', 'six'], 
               columns = pd.Index(['A', 'B', 'C', 'D'], name = 'Genus'))
df.head()
Genus A B C D
one 0.130214 0.536757 0.243533 0.371248
two 0.424017 0.052330 0.932248 0.482683
three 0.084314 0.589451 0.876603 0.604232
four 0.561504 0.121044 0.303261 0.065200
five 0.680850 0.177105 0.314080 0.153842
df.plot(kind='bar')
image-20220819123407782

4.7.3 直方图

tips = pd.read_csv('tips.csv')
tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
tips.total_bill.plot(kind='hist',bins=50)
image-20220819123610876

4.7.4 散点图

macro = pd.read_csv('macrodata.csv')
macro.head()
data = macro[['quarter','realgdp','realcons']]
data.plot.scatter('quarter','realgdp')
image-20220819123833970
pd.scatter_matrix(data,color='g',alpha=0.3)
image-20220819124538160

五、Pandas大数据处理技巧

5.1 获取平均内存使用大小

g1 = pd.read_csv('game_logs.csv')
g1.shape								#(171907, 161),数据+列
g1.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 859.4 MB
for dtype in ['float64','int64','object']:
    selected_dtype = g1.select_dtypes(include = [dtype])
    mean_usage_b = selected_dtype.memory_usage(deep = True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print('平均内存占用',dtype,mean_usage_mb,'MB')
平均内存占用 float64 1.294733194204477 MB
平均内存占用 int64 1.1242000034877233 MB
平均内存占用 object 9.500870656363572 MB

查看各种类型能表达的最大数:

import numpy as np
int_types = ['uint8','int8','int16','int32','int64']		#uint8为无符号整数
for it in int_types:
    print(np.iinfo(it))
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

5.2 更改数据类型

向下转换数据类型:

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep = True).sum()
    else:
        usage_b = pandas_obj.memory_usage(deep = True)
    usage_mb = usage_b/1024 ** 2
    return '{:03.2f} MB'.format(usage_mb)
g1_int = g1.select_dtypes(include = ['int64'])
coverted_int = g1_int.apply(pd.to_numeric,downcast = 'unsigned')
print(mem_usage(g1_int))					#7.87 MB
print(mem_usage(coverted_int))				#1.48 MB

还可以将float64类型的数据转换为float类型。

gl_obj = gl.select_dtypes(include = ['object']).copy()
gl_obj.describe()

image-20220819163403748

可观察day_of_week列的unique值非常少,故将object类型转为category类型,即可将重复的数据映射到同一个内存空间:

dow = g1_obj.day_of_week
dow.head()
0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: object
dow_cat = dow.astype('category')
dow_cat.head()
0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: category
Categories (7, object): ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']
dow_cat.head(10).cat.codes					#查看编码
0    4
1    0
2    2
3    1
4    5
5    4
6    2
7    2
8    1
9    5
dtype: int8
print(mem_usage(dow))						#9.84 MB
print(mem_usage(dow_cat))					#0.16 MB

可根据以下程序进行判断重复数据是否较多:

converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = gl_obj[col]
print(mem_usage(gl_obj))						#751.64 MB
print(mem_usage(converted_obj))					#51.67 MB

5.3 对时间类型的优化

date = optimized_gl.date
date[:5]
0    18710504
1    18710505
2    18710506
3    18710508
4    18710509
Name: date, dtype: uint32
print (mem_usage(date))							#0.66 MB

指定为标准时间格式,占用内存将变多:

optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')
print (mem_usage(optimized_gl['date']))			#1.31 MB
optimized_gl['date'][:5]
0   1871-05-04
1   1871-05-05
2   1871-05-06
3   1871-05-08
4   1871-05-09
Name: date, dtype: datetime64[ns]

文章作者: ShiQuLiZhi
版权声明: 本博客所有文章除特别声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 ShiQuLiZhi !
评论
  目录