A Simple Recommendation Algorithm Model

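In everyday life we often see collected data being used to make suggestions or recommend products. For example, after you buy a book about Hadoop on JD.com (京东), the JD app shows Spark-related books near the top of the list. This article tries to implement a simple recommendation model in Python.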

Suppose we have the following user data:

users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

Each sub-list in users_interests holds the interests of one user. For example, the third-to-last user (users_interests[12]) is interested in pandas, R, and Python.

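In some situations recommendations can be made by hand from past experience: a librarian, for instance, can skilfully recommend books based on your interests or the books you like. More generally, though, we have to process huge amounts of data without any prior knowledge, which is beyond human experience and imagination. Let's have Python do the job.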

Recommending What Is Popular

A simple approach is to recommend whatever is most popular in our data:

from itertools import chain
from collections import Counter

# Flatten the per-user lists into a single list of interests
users_interests_unstack = list(chain(*users_interests))
# (interest, count) pairs, most frequent first
popular_interests = Counter(users_interests_unstack).most_common()
popular_interests
# The result looks like this:
# [('Python', 4),
#  ('R', 4),
#  ('Big Data', 3),
#  ('HBase', 3),
#  ('Java', 3),
#  ......

We can then recommend to a user the popular interests that they do not already have:

def most_popular_new_interests(user_interests, max_results=5):
    suggestions = []
    for interest, frequency in popular_interests:
        # Skip anything the user is already interested in
        if interest not in user_interests:
            # append takes a single argument, so pass the pair as a tuple
            suggestions.append((interest, frequency))

    return suggestions[:max_results]

Let's apply this function to user 1, whose interests are ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]. The result is:

most_popular_new_interests(users_interests[1])
# Result:
# [('Python', 4), ('R', 4), ('Big Data', 3), ('Java', 3), ('statistics', 3)]

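Of course, the assumption that "many people like Python, so our target user will like Python too" does not necessarily hold. But for a newly registered user about whom we have no prior data, this is still a reasonable approach.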
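
To make the new-user case concrete, here is a minimal self-contained sketch. The toy data and the names toy_users_interests, toy_popular, and popular_for are made up for this illustration and are not part of the article's dataset. It only shows that a user with no recorded interests simply receives the overall popularity ranking.

```python
from collections import Counter
from itertools import chain

# Toy data, made up for this illustration (not the article's dataset)
toy_users_interests = [
    ["Python", "R"],
    ["Python", "Java"],
    ["Python", "R", "Spark"],
]

# (interest, count) pairs, most frequent first
toy_popular = Counter(chain.from_iterable(toy_users_interests)).most_common()

def popular_for(user_interests, max_results=2):
    """Return the most popular interests the user does not already have."""
    return [(interest, count)
            for interest, count in toy_popular
            if interest not in user_interests][:max_results]

new_user = popular_for([])            # no history: plain popularity ranking
known_user = popular_for(["Python"])  # existing interests are filtered out
```

For the empty list nothing is filtered out, so the recommendation is the same for every new user; personalisation only starts once some interests are known.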

User-Based Collaborative Filtering

Based on a user's interests we can find users who are "similar" to them, and then recommend things that those similar users are interested in.

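How do we define similar users? Here we use cosine similarity as the measure of how alike two users are, and we one-hot encode each user's interests. "Users with similar interests" then means "users whose interest vectors point in almost the same direction".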
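
As a small illustration of what cosine similarity measures, the sketch below computes it by hand for a few one-hot vectors. The helper cosine and the three example users are hypothetical and exist only for this example; the article itself uses scikit-learn for the real computation.

```python
import math

def cosine(v, w):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Hypothetical one-hot vectors over three interests: [Java, Python, R]
alice = [1, 1, 0]   # Java, Python
bob = [0, 1, 1]     # Python, R
carol = [1, 1, 0]   # Java, Python

sim_alice_bob = cosine(alice, bob)      # one shared interest: about 0.5
sim_alice_carol = cosine(alice, carol)  # identical interests: about 1.0
```

Two users with no interest in common would get a similarity of 0, because the dot product of their vectors is 0.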

First, build a sorted list of the distinct interests:

# A set comprehension removes duplicates; sorted() returns a list
unique_interests = sorted({interest
                           for user_interests in users_interests
                           for interest in user_interests})

unique_interests
# output:
# ['Big Data',
#  'C++',
#  'Cassandra',
#  'HBase',
#  'Hadoop',
#  'Haskell',
#  ......

Then one-hot encode each user against this interest list:

def make_user_interest_vector(user_interests):
    # One entry per interest in unique_interests: 1 if the user has it, else 0
    user_interest_one_hot = []
    for interest in unique_interests:
        if interest in user_interests:
            user_interest_one_hot.append(1)
        else:
            user_interest_one_hot.append(0)
    return user_interest_one_hot

# One row per user, one column per interest
user_interest_matrix = []
for user_interests in users_interests:
    user_interest_matrix.append(make_user_interest_vector(user_interests))

This gives us the user-interest matrix user_interest_matrix. Next we use cosine similarity to compute the similarity between users:

from sklearn.metrics.pairwise import cosine_similarity

user_similarities = []

for i in user_interest_matrix:
    # Similarities between user i and every user j
    user_similarities_temp = []
    for j in user_interest_matrix:
        # cosine_similarity returns a 2x2 matrix here; [0, 1] is the i-j entry
        user_similarities_temp.append(cosine_similarity([i, j])[0, 1])
    user_similarities.append(user_similarities_temp)

The value user_similarities[i][j] is the similarity between user i and user j, and user_similarities[i] holds the similarities between user i and all users. Next, we use user_similarities[i] to find the users most similar to user i:

def most_similar_users_to(user_id):
    # Collect (other_user_id, similarity) for every other user with non-zero similarity
    pairs = []
    for other_user_id, similarity in enumerate(user_similarities[user_id]):
        if user_id != other_user_id and similarity > 0:
            pairs.append((other_user_id, similarity))

    # Most similar users first
    pairs.sort(key=lambda x: x[1], reverse=True)

    return pairs

most_similar_users_to(0)
# output:
# [(9, 0.5669467095138407),
#  (1, 0.3380617018914066),
#  (8, 0.1889822365046136),
#  (13, 0.1690308509457033),
#  (5, 0.1543033499620919)]
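
The function above returns every user with a non-zero similarity. If only the few nearest neighbours are needed, one possible variation is to keep just the top k with heapq.nlargest. The sketch below is an assumption-laden illustration: the function top_k_similar, the parameter k, and the similarity row are all hypothetical and not part of the original code.

```python
import heapq

# Hypothetical similarity row for user 0 (index = other user's id)
similarity_row = [1.0, 0.34, 0.0, 0.0, 0.19, 0.57]

def top_k_similar(user_id, row, k=2):
    """Return the k (other_user_id, similarity) pairs with the highest similarity."""
    candidates = [(other_id, sim)
                  for other_id, sim in enumerate(row)
                  if other_id != user_id and sim > 0]
    # nlargest returns the k best candidates, highest similarity first
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

nearest = top_k_similar(0, similarity_row)
```

With the row above, the two nearest neighbours of user 0 are users 5 and 1, in that order.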

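Once we have the group of most similar users, we can add up, for each interest, the similarities of the users who have that interest, and sort the totals to get the final result: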

from collections import defaultdict

def user_based_suggestions(user_id, include_current_interests=False):
    # Each interest accumulates the similarities of the users who have it
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity

    # Highest total weight first
    suggestions = sorted(suggestions.items(),
                         key=lambda x: x[1],
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

user_based_suggestions(0)
# output:
# [('MapReduce', 0.5669467095138407),
#  ('MongoDB', 0.50709255283711),
#  ('Postgres', 0.50709255283711),
#  ('NoSQL', 0.3380617018914066),
#  ('neural networks', 0.1889822365046136),
#  ('deep learning', 0.1889822365046136),
#  ('artificial intelligence', 0.1889822365046136),
#  ('databases', 0.1690308509457033),
#  ('MySQL', 0.1690308509457033),
#  ('Python', 0.1543033499620919),
#  ('R', 0.1543033499620919),
#  ('C++', 0.1543033499620919),
#  ('Haskell', 0.1543033499620919),
#  ('programming languages', 0.1543033499620919)]

From users_interests we can see that user 0 is very interested in big data, which is consistent with the predictions produced by the code above.
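
As a quick check of where these weights come from: MongoDB appears in the interests of users 1 and 13, so its weight should be the sum of those two users' similarities to user 0, while MapReduce is listed only by user 9. The sketch below redoes that arithmetic with the similarity values printed earlier; the variable names are introduced here only for illustration.

```python
# Similarities to user 0, copied from the most_similar_users_to(0) output
similarity_to_user_0 = {9: 0.5669467095138407,
                        1: 0.3380617018914066,
                        8: 0.1889822365046136,
                        13: 0.1690308509457033,
                        5: 0.1543033499620919}

# Users 1 and 13 both list MongoDB among their interests
mongodb_holders = [1, 13]
mongodb_score = sum(similarity_to_user_0[u] for u in mongodb_holders)

# MapReduce is listed only by user 9
mapreduce_score = similarity_to_user_0[9]
```

The MongoDB total comes to roughly 0.5071, matching the weight shown for it in the suggestion list, and MapReduce ranks first because user 9 is by far the most similar user.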

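In practice the number of interests can be very large. The user-interest matrix user_interest_matrix then becomes very sparse: most pairs of users share few or no interests, so their similarities are at or near zero, and this can seriously hurt the quality of the predictions that follow.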
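
A rough way to see the effect of sparsity is to measure how many entries of a one-hot matrix are set and how many pairs of users overlap at all. The sketch below does this on a tiny made-up matrix; toy_matrix, density, and shares_interest are hypothetical names used only for this example.

```python
# Toy one-hot matrix, made up for illustration: 3 users x 6 interests
toy_matrix = [
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

def density(matrix):
    """Fraction of entries that are non-zero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total

def shares_interest(u, v):
    """True if two one-hot rows have at least one interest in common."""
    return any(a and b for a, b in zip(u, v))

toy_density = density(toy_matrix)  # 4 of the 18 entries are set

# Count the pairs of distinct users with any common interest
overlapping_pairs = sum(
    shares_interest(toy_matrix[i], toy_matrix[j])
    for i in range(len(toy_matrix))
    for j in range(i + 1, len(toy_matrix)))
```

In this toy matrix no two users overlap, so every pairwise cosine similarity would be 0 and user-based filtering would have nothing to recommend.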

Item-Based Collaborative Filtering

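Another approach is to compute the similarity between interests directly, and then recommend the interests that are most similar to the user's current ones.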

First transpose the user-interest matrix so that rows correspond to interests and columns to users:

import numpy as np

# .T transposes the array; tolist() converts it back to nested Python lists
interest_user_matrix = np.array(user_interest_matrix).T.tolist()

unique_interests[0] is Big Data, so interest_user_matrix[0] is:
[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
This row means that users 0, 8, and 9 are interested in Big Data.
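
To see what the transposition does without NumPy, the following sketch transposes a tiny made-up matrix with zip and then reads off which users hold one interest. The matrix and all names in it are hypothetical and exist only for this example.

```python
# Toy user-interest matrix, made up for illustration:
# rows are users, columns are the interests ["Java", "Python", "R"]
toy_user_interest = [
    [1, 1, 0],  # user 0
    [0, 1, 1],  # user 1
    [0, 1, 0],  # user 2
]

# zip(*rows) regroups the values column by column, i.e. transposes
toy_interest_user = [list(column) for column in zip(*toy_user_interest)]

# Ids of the users interested in "Python" (row 1 of the transposed matrix)
python_fans = [user_id
               for user_id, flag in enumerate(toy_interest_user[1])
               if flag == 1]
```

After the transpose each row describes one interest, and the positions of the 1s in that row are the ids of the users who have it.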

Next we use cosine similarity again to obtain the interest similarity matrix:

interest_similarities = []

for i in interest_user_matrix:
    # Similarities between interest i and every interest j
    interest_similarities_temp = []
    for j in interest_user_matrix:
        interest_similarities_temp.append(cosine_similarity([i, j])[0, 1])
    interest_similarities.append(interest_similarities_temp)

The following function finds the interests most similar to a given interest, such as Big Data:

def most_similar_interests_to(interest_id):
    similarities = interest_similarities[interest_id]
    # (interest name, similarity) for every other interest with non-zero similarity
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda x: x[1],
                  reverse=True)

Running most_similar_interests_to(0) produces the following output:

[('Hadoop', 0.816496580927726),
('Java', 0.6666666666666669),
('MapReduce', 0.5773502691896258),
('Spark', 0.5773502691896258),
('Storm', 0.5773502691896258),
('Cassandra', 0.408248290463863),
('artificial intelligence', 0.408248290463863),
('deep learning', 0.408248290463863),
('neural networks', 0.408248290463863),
('HBase', 0.3333333333333334)]
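
The first two values can be checked by hand. For one-hot vectors the dot product is the number of users two interests share, and each norm is the square root of the number of users holding that interest. The sketch below applies this shortcut; the helper binary_cosine is a hypothetical name, and the user sets are read off the users_interests list.

```python
import math

# Users holding each interest, read off the users_interests list
big_data_users = {0, 8, 9}
hadoop_users = {0, 9}
java_users = {0, 5, 9}

def binary_cosine(a, b):
    """Cosine similarity of two 0/1 vectors, given the sets of indices that are 1."""
    return len(a & b) / math.sqrt(len(a) * len(b))

big_data_vs_hadoop = binary_cosine(big_data_users, hadoop_users)  # 2 / sqrt(6)
big_data_vs_java = binary_cosine(big_data_users, java_users)      # 2 / 3
```

Hadoop scores higher than Java even though both share two users with Big Data, because Java is also held by user 5, who has no interest in Big Data.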

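Finally, for each interest the user already has, we add up the similarities of the interests related to it, and recommend from the totals: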

def item_based_suggestions(user_id, include_current_interests=False):
    # Add up the similarities of everything related to the user's own interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    # Highest total weight first
    suggestions = sorted(suggestions.items(),
                         key=lambda x: x[1],
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

Running item_based_suggestions(0) gives the following result:

[('MapReduce', 1.861807319565799),
('MongoDB', 1.3164965809277258),
('Postgres', 1.3164965809277258),
('NoSQL', 1.2844570503761732),
('MySQL', 0.5773502691896258),
('databases', 0.5773502691896258),
('Haskell', 0.5773502691896258),
('programming languages', 0.5773502691896258),
('artificial intelligence', 0.408248290463863),
('deep learning', 0.408248290463863),
('neural networks', 0.408248290463863),
('C++', 0.408248290463863),
('Python', 0.2886751345948129),
('R', 0.2886751345948129)]