处理数据时偶尔会遇到某列中有包含多个值,用逗号或其他分隔符分割,如下表:
1 2 3 4 5 6 7
| +-------------------------------------+------------+ | Company | Field | +-------------------------------------+------------+ | [UBC, GoldmanSachs, Morgan Stanley] | Finance | | [Amazon, Microsoft, Huawei] | Technology | | [Pfizer] | Pharma | +-------------------------------------+------------+
|
我们需要纵向展开表格中的list,得到如下结果:
1 2 3 4 5 6 7 8 9 10 11
| +----------------+------------+ | Company | Field | +----------------+------------+ | UBC | Finance | | GoldmanSachs | Finance | | Morgan Stanley | Finance | | Amazon | Technology | | Microsoft | Technology | | Huawei | Technology | | Pfizer | Pharma | +----------------+------------+
|
用 Pandas 如何实现? 答案是 pandas.DataFrame.explode:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
| >>>import pandas as pd >>>data = { 'Company': [['UBC', 'GoldmanSachs', 'Morgan Stanley'],['Amazon', 'Microsoft', 'Huawei'], ['Pfizer']], 'Field': ['Finance','Technology','Pharma']} >>>df = pd.DataFrame(data) >>>df ''' 输出 +-------------------------------------+------------+ | Company | Field | +-------------------------------------+------------+ | [UBC, GoldmanSachs, Morgan Stanley] | Finance | | [Amazon, Microsoft, Huawei] | Technology | | [Pfizer] | Pharma | +-------------------------------------+------------+ ''' >>>df.explode('Company',ignore_index=True) ''' 输出 +----------------+------------+ | Company | Field | +----------------+------------+ | UBC | Finance | | GoldmanSachs | Finance | | Morgan Stanley | Finance | | Amazon | Technology | | Microsoft | Technology | | Huawei | Technology | | Pfizer | Pharma | +----------------+------------+ '''
|
对于非 list-like 的 element,则需要先预处理下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
| >>>data = { 'Company': ['UBC|GoldmanSachs|Morgan Stanley','Amazon|Microsoft|Huawei', 'Pfizer'], 'Field': ['Finance', 'Technology', 'Pharma'] } >>>df = pd.DataFrame(data) >>>df ''' +---------------------------------+---------+ | Company | Field | +---------------------------------+---------+ | UBS|GoldmanSachs|Morgan Stanley | Finance | | Amazon|Microsoft|Huawei | Finance | | Pfizer | Finance | +---------------------------------+---------+ ''' >>>df = df.assign(Company=df.Company.str.split('|')) >>>df ''' 输出 +-------------------------------------+------------+ | Company | Field | +-------------------------------------+------------+ | [UBC, GoldmanSachs, Morgan Stanley] | Finance | | [Amazon, Microsoft, Huawei] | Technology | | [Pfizer] | Pharma | +-------------------------------------+------------+ '''
|
当然也可以用 iterrows 方法逐行处理:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
| >>>data = { 'Company': ['UBC|GoldmanSachs|Morgan Stanley','Amazon|Microsoft|Huawei', 'Pfizer'], 'Field': ['Finance', 'Technology', 'Pharma'] } >>>df = pd.DataFrame(data) >>>for index, row in df.iterrows(): ... row_list = row['Company'].split('|') ... df.loc[index, 'Company'] = row_list ... >>>df ''' 输出 +-------------------------------------+------------+ | Company | Field | +-------------------------------------+------------+ | [UBC, GoldmanSachs, Morgan Stanley] | Finance | | [Amazon, Microsoft, Huawei] | Technology | | [Pfizer] | Pharma | +-------------------------------------+------------+ '''
|