Pandas: a tool for manipulating big data

Tutorials

Pandas Cookbook:
- https://github.com/PacktPublishing/Pandas-Cookbook
Data School
- https://www.dataschool.io/best-python-pandas-resources/

Modifying the index

df.set_index turns the given column(s) into the index; it does not sort the rows
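
A minimal sketch of what set_index does, using a made-up frame (the column names `city` and `temp_c` are only for illustration). The column moves into the index and the row order stays as it was; sorting needs a separate `sort_index()` call.

```python
import pandas as pd

# Hypothetical sample data, not taken from the notes above.
df = pd.DataFrame({"city": ["Portland", "Berkeley"], "temp_c": [17.0, 25.0]})

indexed = df.set_index("city")               # 'city' becomes the index; a new frame is returned
print(indexed.index.tolist())                # row order is unchanged
print(indexed.sort_index().index.tolist())   # explicit sort, if it is wanted
```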

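df.reindex reorders the index and can add new index labels (at the same level, not a MultiIndex)

Signature: df.reindex(['the', 'new', 'labels'], axis='columns')
Notes:
- It actually builds a new dataframe based on the original df
- If new_labels has more labels than the original, the added rows (or columns) are filled with NaN
- new_labels must be array-like, i.e. something list-like
- If a label is new (not present in the original df)
  - the result holds NaN for it
  - to change label names instead, use df.rename()
Two ways to call it:
- df.reindex(['A', 'B'], axis='columns')
- df.reindex(columns=['A', 'B'])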
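
The notes above can be sketched on a small made-up frame: known labels are reordered, unknown labels come back as NaN, and the original frame is left alone.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Existing labels are kept; the unknown label 'C' is filled with NaN.
cols = df.reindex(columns=["B", "A", "C"])

# Same idea along the rows: label 2 does not exist, so that row is all NaN.
rows = df.reindex([1, 0, 2])

print(cols)
print(rows)
print(df.columns.tolist())  # the original frame is untouched
```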

df.rename changes the label names themselves (a true rename)

It maps the old labels to new labels
Signature:
- df.rename(mapper=…, axis='columns' or 'index')
- df.rename(columns=…)
```
df.rename(index=index_mapper, columns=columns_mapper, ...)
df.rename(mapper, axis={'index', 'columns'}, ...)
```
- mapper, columns, index:
  - dict-like or functions transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.
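  - In other words: a dict mapping, or a function that does the same job
Note on mapper:
- Correct
  - eg: {'a': 'good'}
- Wrong
  - eg: {'a': ['good']} (the new label must be hashable)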
```
TypeError: unhashable type: 'list'
```
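
A short sketch of the call forms listed above, on a made-up frame. All three return a new frame and leave the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

by_dict = df.rename(columns={"a": "good"})        # dict: old label -> new label
by_func = df.rename(columns=str.upper)            # function applied to every label
by_axis = df.rename({0: "first"}, axis="index")   # mapper + axis form

print(by_dict.columns.tolist())
print(by_func.columns.tolist())
print(by_axis.index.tolist())
```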

df.append

Notes:
- When appending a dataframe, the column names should match
- Columns that do not match produce NaN
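
The NaN behavior can be seen in a small sketch. It uses `pd.concat` rather than `df.append`, because `DataFrame.append` was removed in pandas 2.0; the alignment rule it shows is the same. The frames are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": [1], "B": [2]})
right = pd.DataFrame({"A": [3], "C": [4]})   # has 'C' where left has 'B'

# Columns are aligned by name; cells with no source value become NaN.
combined = pd.concat([left, right], ignore_index=True)
print(combined)
```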

Settings

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
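
`set_option` changes the setting for the rest of the session. When the change should only apply to one block, `pd.option_context` is an alternative; a minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_rows")

# Show every row, but only inside this block.
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")   # the previous value is restored
pd.reset_option("display.max_columns")      # back to the library default
```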

Assigning a Series or a single column of data to a DataFrame

Use assign
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Note
- The new column name is written as a bare keyword argument, without quotes ("")
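
As a sketch of the other value types assign accepts: a Series is aligned on the index, a plain list is assigned by position, and a callable receives the frame. The `humidity` and `station_id` columns are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({"temp_c": [17.0, 25.0]}, index=["Portland", "Berkeley"])

# Deliberately in a different order than df: alignment is by label.
humidity = pd.Series([0.4, 0.7], index=["Berkeley", "Portland"])

out = df.assign(
    humidity=humidity,                        # Series, aligned by index label
    station_id=[1, 2],                        # plain list, assigned by position
    temp_f=lambda x: x.temp_c * 9 / 5 + 32,   # callable, receives the frame
)
print(out)
```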

DataFrame deduplication

import pandas as pd
df = pd.DataFrame({'A':[1, 1, 2, 2, 2], 'B':[1, 2, 3, 4, 4]})
print('--'*10, 'original')
print(df)

# mask
ret = df.duplicated('A')
print('--'*10, 'mask of A')
print(ret)

# drop duplicated
ret = df[~df.duplicated('A')]
print('--'*10, 'drop duplicated A')
print(ret)
ret = df[~df.duplicated(['A', 'B'])]
print('--'*10, 'drop duplicated [A, B]')
print(ret)

-------------------- original
   A  B
0  1  1
1  1  2
2  2  3
3  2  4
4  2  4
-------------------- mask of A
0    False
1     True
2    False
3     True
4     True
dtype: bool
-------------------- drop duplicated A
   A  B
0  1  1
2  2  3
-------------------- drop duplicated [A, B]
   A  B
0  1  1
1  1  2
2  2  3
3  2  4
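
The boolean-mask approach above has a direct shorthand, `drop_duplicates`. A brief sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 2, 2], "B": [1, 2, 3, 4, 4]})

first_of_a = df.drop_duplicates(subset="A")               # same rows as df[~df.duplicated('A')]
last_of_a = df.drop_duplicates(subset="A", keep="last")   # keep the last occurrence instead
unique_rows = df.drop_duplicates()                        # compare on all columns

print(first_of_a)
print(last_of_a)
print(unique_rows)
```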

DataFrame.apply

The function receives a pd.Series
- axis=0 or 'index'
  - a column Series is passed in, i.e. one whole column
- axis=1 or 'columns'
  - a row Series is passed in, i.e. one whole row
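
To make the two axis settings concrete, this sketch records the `name` of each Series that the function receives, on a made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = df.apply(lambda col: col.sum(), axis=0)   # called once per column
row_sums = df.apply(lambda row: row.sum(), axis=1)   # called once per row

# The Series name tells which column or row was passed in.
col_names = df.apply(lambda col: col.name, axis=0).tolist()
row_names = df.apply(lambda row: row.name, axis=1).tolist()

print(col_names, row_names)
```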

Keeping the apply result as a DataFrame

Method 1: process a single column on its own, passing in a nested lambda function

import pandas as pd

def split_synonyms(cell):
    if isinstance(cell, str):
        return cell.split(';')
    return []

In [95]: pd.DataFrame({'english_name': ['hello', 'Lucy'], 'synonyms': ['Hello', 'lucy; Lily; Lue']})
Out[95]:
  english_name         synonyms
0        hello            Hello
1         Lucy  lucy; Lily; Lue


In [107]: df.loc[:, [ 'synonyms']].apply(lambda col: col.apply(split_synonyms), axis=0)
Out[107]:
              synonyms
0              [Hello]
1  [lucy,  Lily,  Lue]

# df = pd.DataFrame({'english_name': ['hello', 'Lucy'], 'synonyms': ['Hello', 'lucy; Lily; Lue']})
# result = df.loc[:, [ 'synonyms']].apply(lambda col: col.apply(split_synonyms), axis=0)
# type(result) --> pd.DataFrame

Method 2: the lambda function returns a pd.Series

In [114]: df2.apply(lambda row: pd.Series([row['english_name'], split_synonyms(row['synonyms'])], index=['english_name', 'synonyms']), axis=1)
Out[114]:
  english_name             synonyms
0        hello              [Hello]
1         Lucy  [lucy,  Lily,  Lue]

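Explanation
- A new pd.Series is built for each row
- Its index (i.e. the table header) has to be maintained by hand

Rebuilding via a Series

Principle
- A Series taken from a pd.DataFrame keeps the same Index
- So the extracted Series can be processed on its own
- Once converted, it is copied back into the original pd.DataFrame

Example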

import pandas as pd

df2 = pd.DataFrame({'english_name': ['hello', 'Lucy'], 'synonyms': ['Hello', 'lucy; Lily; Lue']})


In [142]: df2
Out[142]:
  english_name         synonyms
0        hello            Hello
1         Lucy  lucy; Lily; Lue

# original index
In [144]: a = df2.index

In [145]: a
Out[145]: RangeIndex(start=0, stop=2, step=1)

# index of the extracted Series
In [147]: b = df2['synonyms'].index

In [148]: b
Out[148]: RangeIndex(start=0, stop=2, step=1)

# confirm that the two indexes match
In [149]: a == b
Out[149]: array([ True,  True])

# extract the Series
In [151]: df2['synonyms']
Out[151]:
0              Hello
1    lucy; Lily; Lue
Name: synonyms, dtype: object

# apply on Series

In [153]: df2['synonyms'].apply(lambda cel: cel.split(';') if isinstance(cel, str) else [])
Out[153]:
0                [Hello]
1    [lucy,  Lily,  Lue]
Name: synonyms, dtype: object

# copy it back into the pd.DataFrame
In [154]: df2.loc[:, ['synonyms']] = pd.DataFrame(df2['synonyms'].apply(lambda cel: cel.split(';')))

In [155]: df2
Out[155]:
  english_name             synonyms
0        hello              [Hello]
1         Lucy  [lucy,  Lily,  Lue]

Implementing it with applymap

Apply applymap to a single-column pd.DataFrame
- Then assign the result back

Example

In [184]: df2 = pd.DataFrame({'english_name': ['hello', 'Lucy'], 'synonyms': ['Hello', 'lucy; Lily; Lue']})

In [185]: df2.loc[:, ['synonyms']].applymap(lambda x: x.split(';'))
Out[185]:
              synonyms
0              [Hello]
1  [lucy,  Lily,  Lue]

In [186]: type(df2.loc[:, ['synonyms']].applymap(lambda x: x.split(';')))
Out[186]: pandas.core.frame.DataFrame

DataFrame.applymap

Compared with DataFrame.apply
- apply acts on whole rows or columns
- applymap acts directly on every individual element
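
A small sketch of the contrast on made-up data. One assumption to flag: newer pandas versions (2.1 and later) offer `DataFrame.map` as the replacement name for `applymap`, so the sketch picks whichever method is available.

```python
import pandas as pd

df = pd.DataFrame({"x": ["a;b", "c"], "y": ["d", "e;f"]})

# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap

cells = elementwise(lambda s: s.split(";"))     # the function sees one cell at a time
lengths = df.apply(lambda col: col.str.len())   # the function sees a whole column

print(cells)
print(lengths)
```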

About NaN

How to test for it

import math

assert math.isnan(float('NaN'))

Note
- math.isnan only accepts numeric arguments

Properties
- == : cannot be used to test for NaN
- is : cannot be relied on to test for NaN
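
The properties above, plus the pandas-side alternative `pd.isna`, can be checked in a few lines. Unlike `math.isnan`, `pd.isna` also accepts non-numeric scalars.

```python
import math

import numpy as np
import pandas as pd

nan = float("nan")

print(nan == nan)        # False: NaN never compares equal, not even to itself
print(nan != nan)        # True: the usual self-inequality trick
print(math.isnan(nan))   # True; math.isnan("x") would raise TypeError
print(pd.isna(nan), pd.isna(None), pd.isna("x"))   # pd.isna accepts any scalar

s = pd.Series([1.0, np.nan, 3.0])
mask = s.isna()          # element-wise test on a Series
print(mask.tolist())
```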

Series

Series.map

Similar to Series.apply
Both act on individual elements, but map is much simpler in what it offers
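
A brief sketch of Series.map on made-up data: it takes either a dict (a lookup table) or a function, and with a plain function it gives the same result as Series.apply.

```python
import pandas as pd

s = pd.Series(["cat", "dog", "bird"])

mapped = s.map({"cat": "kitten", "dog": "puppy"})   # dict lookup; missing keys become NaN
upper = s.map(str.upper)                            # function, one element at a time
same = s.apply(str.upper)                           # apply gives the same result here

print(mapped.tolist())
print(upper.tolist())
```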

Element access

Note: the elements of a Series are accessed through its index (0, 1, 2, … by default); a Series has no column names to access by
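
A minimal sketch of index-based access on a Series pulled out of a made-up frame. `.loc` goes by label, `.iloc` by position, and the former column name survives only as the Series name.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
s = df["a"]             # the column name 'a' is kept only as s.name

by_label = s.loc["y"]   # label-based access
by_pos = s.iloc[0]      # position-based access

print(by_label, by_pos, s.name)
```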

Adding a column to a DataFrame

pd.DataFrame.insert
Reference: pandas.DataFrame.insert (pandas 1.3.1 documentation)
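
A short sketch of insert on a made-up frame. Unlike most DataFrame methods, it modifies the frame in place and returns None.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# insert(loc, column, value): put column 'b' at position 1, in place.
returned = df.insert(1, "b", [3, 4])

print(df.columns.tolist())
```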

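Modin library: speeding up pandas

Reference: GitHub - modin-project/modin: Modin: Speed up your Pandas workflows by changi…

Limiting the number of CPUs used

Principle
- Set through the MODIN_CPUS environment variable

Methods

Change the environment variable directly in the shell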
export MODIN_CPUS=4

Set it in Python before importing modin

import os
os.environ["MODIN_CPUS"] = "4"
import modin.pandas as pd

Progress bar

Principle
- Implemented with the tqdm package

Method

import modin.pandas as pd
from tqdm import tqdm
from modin.config import ProgressBar
ProgressBar.enable()

Distributed tool: Ray

Reference: What is Ray? (Ray v1.4.1)

Changing the /dev/shm limit

Reference: linux - How to resize /dev/shm? - Stack Overflow

Edit file /etc/fstab (with sudo if needed).
In this file, try to locate a line like this one : none /dev/shm tmpfs defaults,size=4G 0 0.
Case 1 - This line exists in your /etc/fstab file:

Modify the text after size=. For example if you want an 8G size, replace size=4G by size=8G.
Exit your text editor, then run (with sudo if needed) $ mount -o remount /dev/shm.
Case 2 - This line does NOT exists in your /etc/fstab file:

Append at the end of the file the line none /dev/shm tmpfs defaults,size=4G 0 0, and modify the text after size=. For example if you want an 8G size, replace size=4G by size=8G.
Exit your text editor, then run (with sudo if needed) $ mount /dev/shm.

Editing /etc/fstab

Change or add the following line

none /dev/shm tmpfs defaults,size=32G 0 0

The /dev/shm device in detail

Reference: linux下的/dev/shm/ 以及与swap目录的区别【转】 - Tinywan - 博客园

Notes

Differences between modin apply and pandas apply

pandas apply(…, result_type=None)

Returns a pd.Series

eg:

import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': list('xyz')})

result = df.apply(lambda row: row['a']**2, axis=1)

# output
0    1
1    4
2    9
dtype: int64

type(result)
# output
pandas.core.series.Series

modin apply(…, result_type=None)

Returns a pd.DataFrame
- columns: __reduced__

eg:

import modin.pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': list('xyz')})

result = df.apply(lambda row: row['a']**2, axis=1)

result
# output
    __reduced__
0	1
1	4
2	9

type(result)
# output
modin.pandas.dataframe.DataFrame
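
If downstream code expects a Series in both cases, one option is to normalize the result. The sketch below is only an illustration: `as_series` is a hypothetical helper, and the modin result is simulated with a plain pandas frame shaped like the output above, so it runs without modin installed. Whether a given modin version still returns this shape is an assumption taken from these notes.

```python
import pandas as pd

def as_series(result):
    """Hypothetical helper: collapse a one-column frame into an unnamed Series."""
    # Duck-typed check so it does not depend on which DataFrame class is used.
    if getattr(result, "ndim", 1) == 2 and result.shape[1] == 1:
        out = result.iloc[:, 0].copy()
        out.name = None          # drop the '__reduced__' name
        return out
    return result

# Plain-pandas stand-in for the modin result shown above.
modin_like = pd.DataFrame({"__reduced__": [1, 4, 9]})
series = as_series(modin_like)
print(series)
```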

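Parallel tools

modin
- Once enabled, everything runs in parallel by default
- Drawbacks
  - For small tasks or cross-row operations the overhead is too large and the result is actually slower
  - eg: df.groupby(), df.loc[df.where()], df.loc[:10000, :]
- Advantages
  - Simple to use; the code needs almost no changes
pandarallel
- Uses multiprocessing for parallelism
- Advantages
  - Parallelism can be switched on call by call
  - df.apply() vs. df.parallel_apply()
  - Can optionally use /dev/shm to share objects across cores
- Drawbacks
  - High overhead
joblib
- Advantages
  - Small code changes
- Drawbacks
  - Cross-process data sharing has to be set up by hand, which is tedious
Dask
- Advantages
  - Has optimizations for numpy and DataFrame
- Drawbacks
  - Requires too many code changes
  - dask's DataFrame is not quite the same as pd.DataFrame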

Modifying the index and column names

Principle
- Assign a list directly

Example

In [11]: df = pd.DataFrame({'a':[1, 2,3], 'b': [4, 5, 6]})

In [12]: df
Out[12]:
   a  b
0  1  4
1  2  5
2  3  6


# change the index
df.index = [1, 2, 3]
# assign a pd.Series
In [14]: df.index = df['a']

In [15]: df
Out[15]:
   a  b
a
1  1  4
2  2  5
3  3  6

# change the column names (the list length must match the number of columns)
df.columns = ['A', 'B']
In [16]: df.columns = df.loc[1, :]

In [17]: df
Out[17]:
1  1  4
a
1  1  4
2  2  5
3  3  6

pd.get_dummies: 0/1 conversion

Purpose
- Converts one column of data into multiple columns whose values are 0 or 1
Reference: pandas.get_dummies (pandas 1.3.1 documentation)

Code

>>> s = pd.Series(list('abca'))

>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
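
The same conversion can be applied to one column of a DataFrame. In this sketch the frame and the prefix `is` are made up, and `dtype=int` is passed explicitly because newer pandas versions return booleans by default rather than the 0/1 integers shown above.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "n": [1, 2, 3]})

# Only 'color' is encoded; 'n' is carried through unchanged.
dummies = pd.get_dummies(df, columns=["color"], prefix="is", dtype=int)
print(dummies)
```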

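DataFrame concatenation: append and concat

pd.concat(df_list)
- Concatenates several DataFrames directly
df.append(other_df)
- Appends one DataFrame to another (removed in pandas 2.0; use pd.concat there)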

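Column reordering: adjusting the order of columns

Explanation
- Changes the order of the columns (or of the rows)

Method

Select the labels again in the desired order

Example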

In [3]: df = pd.DataFrame({'A':[1, 1, 2, 2, 2], 'B':[1, 2, 3, 4, 4]})

In [4]: df
Out[4]:
   A  B
0  1  1
1  1  2
2  2  3
3  2  4
4  2  4

In [5]: df = df[['B', 'A']]

In [6]: df
Out[6]:
   B  A
0  1  1
1  2  1
2  3  2
3  4  2
4  4  2
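
Besides selecting with a list, `reindex(columns=...)` gives the same ordering. The two differ on a mistyped name: list selection raises KeyError, while reindex quietly creates a NaN column. A short sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

by_selection = df[["C", "A", "B"]]                 # a typo in a name raises KeyError
by_reindex = df.reindex(columns=["C", "A", "B"])   # a typo would yield a NaN column
rows_reversed = df.iloc[::-1]                      # rows can be reordered in the same spirit

print(by_selection)
print(rows_reversed)
```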

merge / join

pd.DataFrame.join
- Handles index-to-index or column-to-index joins
- Implemented on top of pd.merge()
Full functionality
- pd.merge()
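
A sketch of the column-to-index case on made-up frames, written once with join and once with the equivalent explicit merge call.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"y": [10, 20]}, index=["a", "b"])

# join: left column 'key' against the right index (a left join by default).
joined = left.join(right, on="key")

# merge: the same operation spelled out explicitly.
merged = pd.merge(left, right, left_on="key", right_index=True, how="left")

print(joined)
```

The row with key 'c' has no match on the right, so its `y` value is NaN.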
