用 Python 实现 Expand from List

处理数据时偶尔会遇到某列中有包含多个值,用逗号或其他分隔符分割,如下表:

1
2
3
4
5
6
7
+-------------------------------------+------------+
| Company | Field |
+-------------------------------------+------------+
| [UBC, GoldmanSachs, Morgan Stanley] | Finance |
| [Amazon, Microsoft, Huawei] | Technology |
| [Pfizer] | Pharma |
+-------------------------------------+------------+

我们需要纵向展开表格中的list,得到如下结果:

1
2
3
4
5
6
7
8
9
10
11
+----------------+------------+
| Company | Field |
+----------------+------------+
| UBC | Finance |
| GoldmanSachs | Finance |
| Morgan Stanley | Finance |
| Amazon | Technology |
| Microsoft | Technology |
| Huawei | Technology |
| Pfizer | Pharma |
+----------------+------------+

用 Pandas 如何实现? 答案是 pandas.DataFrame.explode:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
>>>import pandas as pd
>>>data = {
'Company': [['UBC', 'GoldmanSachs', 'Morgan Stanley'],['Amazon', 'Microsoft', 'Huawei'], ['Pfizer']],
'Field': ['Finance','Technology','Pharma']}
>>>df = pd.DataFrame(data)
>>>df
'''
输出
+-------------------------------------+------------+
| Company | Field |
+-------------------------------------+------------+
| [UBC, GoldmanSachs, Morgan Stanley] | Finance |
| [Amazon, Microsoft, Huawei] | Technology |
| [Pfizer] | Pharma |
+-------------------------------------+------------+
'''
>>>df.explode('Company',ignore_index=True)
'''
输出
+----------------+------------+
| Company | Field |
+----------------+------------+
| UBC | Finance |
| GoldmanSachs | Finance |
| Morgan Stanley | Finance |
| Amazon | Technology |
| Microsoft | Technology |
| Huawei | Technology |
| Pfizer | Pharma |
+----------------+------------+
'''

对于非 list-like 的 element,则需要先预处理下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
>>>data = {
'Company': ['UBC|GoldmanSachs|Morgan Stanley','Amazon|Microsoft|Huawei', 'Pfizer'],
'Field': ['Finance', 'Technology', 'Pharma']
}
>>>df = pd.DataFrame(data)
>>>df
'''
+---------------------------------+---------+
| Company | Field |
+---------------------------------+---------+
| UBS|GoldmanSachs|Morgan Stanley | Finance |
| Amazon|Microsoft|Huawei | Finance |
| Pfizer | Finance |
+---------------------------------+---------+
'''
>>>df = df.assign(Company=df.Company.str.split('|'))
>>>df
'''
输出
+-------------------------------------+------------+
| Company | Field |
+-------------------------------------+------------+
| [UBC, GoldmanSachs, Morgan Stanley] | Finance |
| [Amazon, Microsoft, Huawei] | Technology |
| [Pfizer] | Pharma |
+-------------------------------------+------------+
'''

当然也可以用 iterrows 方法逐行处理:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
>>>data = {
'Company': ['UBC|GoldmanSachs|Morgan Stanley','Amazon|Microsoft|Huawei', 'Pfizer'],
'Field': ['Finance', 'Technology', 'Pharma']
}
>>>df = pd.DataFrame(data)
>>>for index, row in df.iterrows():
... row_list = row['Company'].split('|')
... df.loc[index, 'Company'] = row_list
...
>>>df
'''
输出
+-------------------------------------+------------+
| Company | Field |
+-------------------------------------+------------+
| [UBC, GoldmanSachs, Morgan Stanley] | Finance |
| [Amazon, Microsoft, Huawei] | Technology |
| [Pfizer] | Pharma |
+-------------------------------------+------------+
'''