Polars 与 Pandas 性能比较

Polars 是基于 Rust 开发的开源 DataFrame 库,专为 Python 中的高性能数据操作和分析而设计。通过并行处理和内存优化技术,达到了显著优于 Pandas 的性能。

Pandas 与 Polars 性能对比

通过 csv 写入(write_csv)、csv 读取(read_csv)、列选择(select_columns)、行筛选(filter_rows)、行排序(sort_rows)、空值填补(fill_na)、统计量计算(calc_stats)这 7 类 Benchmark 在 Pandas 和 Polars 下的耗时,评估这 2 个库的性能。评估结果如下图表所示,所有运行结果均基于 Lenovo Legion R9000P 2023(AMD R9-7945HX,96GB RAM,安静模式)得到。

Benchmark 项目 Pandas 耗时(s) Polars 耗时(s)
csv 写入(write_csv) 68.47109769999952 2.4320810999997775
csv 读取(read_csv) 18.033977799997956 1.293282899998303
列选择(select_columns) 2.6878174000012223 1.3185502999986056
行筛选(filter_rows) 20.69887430000381 1.9661064000029
行排序(sort_rows) 5.450892100001511 1.174701100004313
空值填补(fill_na) 4.529551200001151 0.7351737999997567
统计量计算(calc_stats) 6.443044399995415 0.5910871000014595
性能对比

示例代码

Google Colab 运行代码或下载 Jupyter Notebook

1 数据生成与存储速度比较

使用 Numpy 生成随机矩阵,构建 DataFrame,并存储为 csv 文件。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
## comparison of dataframes writing
print('Comparison of dataframes writing')

# create directory to store dataframes
if not os.path.exists('./random_dataframe/'):
os.mkdir('./random_dataframe/')

# generate dataframs using pandas
start = time.perf_counter()
columns = [f'{i}' for i in range(1000)]
for i in range(10):
array = np.random.rand(10000,1000) + i * 100
df = pd.DataFrame(array, columns=columns)
df.to_csv(f'./random_dataframe/{i+1}.csv')
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# generate dataframs using polars
start = time.perf_counter()
columns = [f'{i}' for i in range(1000)]
for i in range(10):
array = np.random.rand(10000,1000) + i * 100
df = pl.DataFrame(array, schema=columns)
df.write_csv(f'./random_dataframe/{i+11}.csv')
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of dataframes writing
Pandas Time: 68.47109769999952 seconds
Polars Time: 2.4320810999997775 seconds

2 数据读取速度比较

从 csv 文件读取 DataFrame。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
## comparison of dataframes reading
print('Comparison of dataframes reading')

# read dataframes using pandas
start = time.perf_counter()
for i in range(20):
df = pd.read_csv(f'./random_dataframe/{i+1}.csv')
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# read dataframes using polars
start = time.perf_counter()
for i in range(20):
df = pl.read_csv(f'./random_dataframe/{i+1}.csv')
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of dataframes reading
Pandas Time: 18.033977799997956 seconds
Polars Time: 1.293282899998303 seconds

3 数据按列选择速度比较

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
## comparison of select columns
print('Comparison of select columns')

# read dataframes using polars
df_dict_pl = {}
df_dict_pd = {}
for i in range(20):
df_pl = pl.read_csv(f'./random_dataframe/{i+1}.csv')
df_dict_pl[i] = df_pl
df_pd = df_pl.to_pandas()
df_dict_pd[i] = df_pd

# select columns using pandas
start = time.perf_counter()
for i in range(20000):
df = df_dict_pd[i%20]
df = df[['0','1','2','3','4']]
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# select columns using polars
start = time.perf_counter()
for i in range(20000):
df = df_dict_pl[i%20]
df = df.select(['0','1','2','3','4'])
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of select columns
Pandas Time: 2.6878174000012223 seconds
Polars Time: 1.3185502999986056 seconds

4 数据筛选速度比较

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
## comparison of filter rows
print('Comparison of filter rows')

# read dataframes using polars
df_dict_pl = {}
df_dict_pd = {}
for i in range(20):
df_pl = pl.read_csv(f'./random_dataframe/{i+1}.csv')
df_dict_pl[i] = df_pl
df_pd = df_pl.to_pandas()
df_dict_pd[i] = df_pd

# filter rows using pandas
start = time.perf_counter()
for i in range(2000):
df = df_dict_pd[i%20]
df = df[df['0'] > 0.5]
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# filter rows using polars
start = time.perf_counter()
for i in range(2000):
df = df_dict_pl[i%20]
df = df.filter(df['0'] > 0.5)
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of filter rows
Pandas Time: 20.69887430000381 seconds
Polars Time: 1.9661064000029 seconds

5 数据排序速度比较

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
## comparison of sort rows
print('Comparison of sort rows')

# read dataframes using polars
df_dict_pl = {}
df_dict_pd = {}
for i in range(20):
df_pl = pl.read_csv(f'./random_dataframe/{i+1}.csv')
df_dict_pl[i] = df_pl
df_pd = df_pl.to_pandas()
df_dict_pd[i] = df_pd

# sort rows using pandas
start = time.perf_counter()
for i in range(200):
df = df_dict_pd[i%20]
df = df.sort_values(by='0')
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# sort rows using polars
start = time.perf_counter()
for i in range(200):
df = df_dict_pl[i%20]
df = df.sort('0')
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of sort rows
Pandas Time: 5.450892100001511 seconds
Polars Time: 1.174701100004313 seconds

6 数据填补速度比较

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
## comparison of fill na
print('Comparison of fill na')

# read dataframes using polars
df_dict_pl = {}
df_dict_pd = {}
for i in range(20):
df_pl = pl.read_csv(f'./random_dataframe/{i+1}.csv')
df_dict_pl[i] = df_pl
df_pd = df_pl.to_pandas()
df_dict_pd[i] = df_pd

# create random nulls
for i in range(20):
df_pd = df_dict_pd[i]
df_pl = df_dict_pl[i]
for j in range(1000):
indexes = np.random.randint(0,10000,100)
df_pd.loc[indexes,f'{j}'] = np.nan
df_pl[indexes, f'{j}'] = np.nan
df_dict_pd[i] = df_pd
df_dict_pl[i] = df_pl

# fill nulls using pandas
start = time.perf_counter()
for i in range(200):
df = df_dict_pd[i%20]
df = df.fillna(0)
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# fill nulls using polars
start = time.perf_counter()
for i in range(200):
df = df_dict_pl[i%20]
df = df.fill_null(0)
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of fill na
Pandas Time: 4.529551200001151 seconds
Polars Time: 0.7351737999997567 seconds

7 数据统计量计算速度比较

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
## comparison of calculate dataframes statistics
print('Comparison of calculate dataframes statistics')

# read dataframes using polars
df_dict_pl = {}
df_dict_pd = {}
for i in range(20):
df_pl = pl.read_csv(f'./random_dataframe/{i+1}.csv')
df_dict_pl[i] = df_pl
df_pd = df_pl.to_pandas()
df_dict_pd[i] = df_pd

# calculate dataframes statistics using pandas
start = time.perf_counter()
for i in range(20):
df = df_dict_pd[i%20]
df.mean()
df.var()
df.std()
df.sum()
df.max()
df.min()
df.median()
df.quantile(0.25)
df.quantile(0.75)
end = time.perf_counter()
print(f'Pandas Time: {end - start} seconds')

# calculate dataframes statistics using polars
start = time.perf_counter()
for i in range(20):
df = df_dict_pl[i%20]
df.mean()
df.var()
df.std()
df.sum()
df.max()
df.min()
df.median()
df.quantile(0.25)
df.quantile(0.75)
end = time.perf_counter()
print(f'Polars Time: {end - start} seconds')

输出:

1
2
3
Comparison of calculate dataframes statistics
Pandas Time: 6.443044399995415 seconds
Polars Time: 0.5910871000014595 seconds