recommendation-engine

Recommendation Engine

Build recommendation systems for personalized content and product suggestions.

Recommendation Approaches

| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Collaborative | User-item interactions | Discovers hidden patterns | Cold start |
| Content-based | Item features | Works for new items | Limited discovery |
| Hybrid | Combines both | Best of both | Complex |
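
The content-based row has no accompanying code in this section, even though the hybrid recommender below expects a `ContentBasedFilter` with a `score()` method. A minimal sketch, assuming a dense user-item matrix and a per-item feature matrix (both argument names here are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

class ContentBasedFilter:
    """Score items by their feature similarity to what the user already liked."""

    def fit(self, item_features, user_item_matrix):
        # item_features: (n_items, n_features); user_item_matrix: dense (n_users, n_items)
        self.item_similarity = cosine_similarity(item_features)
        self.user_item_matrix = user_item_matrix

    def score(self, user_id):
        # Each item's score = interaction-weighted similarity to the user's items
        return self.user_item_matrix[user_id] @ self.item_similarity

    def recommend(self, user_id, n=10):
        scores = self.score(user_id)
        # Exclude items the user already interacted with
        scores[self.user_item_matrix[user_id].nonzero()[0]] = -np.inf
        return np.argsort(scores)[-n:][::-1]
```

Because scoring only needs item features, this works for brand-new items, which is exactly the "works for new items" advantage in the table.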

Collaborative Filtering

python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

class CollaborativeFilter:
    def __init__(self):
        self.user_item_matrix = None
        self.user_similarity = None
        self.item_similarity = None

    def fit(self, user_item_matrix):
        # Keep the matrix for recommendation-time lookups
        self.user_item_matrix = user_item_matrix
        # User-based similarity
        self.user_similarity = cosine_similarity(user_item_matrix)
        # Item-based similarity
        self.item_similarity = cosine_similarity(user_item_matrix.T)

    def recommend_for_user(self, user_id, n=10):
        # Weight each user's interactions by their similarity to this user
        scores = self.user_item_matrix.T.dot(self.user_similarity[user_id])
        # Exclude already interacted items ([-1] picks the column indices
        # for both a dense 1-D row and a sparse 1xN row)
        already_interacted = self.user_item_matrix[user_id].nonzero()[-1]
        scores[already_interacted] = -np.inf
        return np.argsort(scores)[-n:][::-1]
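
A quick sanity check of the user-based approach on a tiny made-up matrix; it repeats the fit/recommend steps inline so it runs standalone:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Tiny dense toy matrix: rows = users, cols = items, values = ratings
user_item = np.array([
    [5, 4, 0, 0],   # user 0
    [4, 5, 0, 1],   # user 1, similar taste to user 0
    [0, 0, 5, 4],   # user 2
], dtype=float)

user_sim = cosine_similarity(user_item)        # fit step
scores = user_sim[0] @ user_item               # weight interactions by similarity
scores[user_item[0].nonzero()[0]] = -np.inf    # exclude user 0's own items
top = np.argsort(scores)[-2:][::-1]            # item 3, rated only by similar user 1, ranks first
```

User 0 never saw item 3, but the highly similar user 1 rated it, so it surfaces ahead of item 2 (liked only by the dissimilar user 2).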

Matrix Factorization (SVD)

python
import numpy as np
from sklearn.decomposition import TruncatedSVD

class MatrixFactorization:
    def __init__(self, n_factors=50):
        self.svd = TruncatedSVD(n_components=n_factors)

    def fit(self, user_item_matrix):
        self.user_factors = self.svd.fit_transform(user_item_matrix)
        self.item_factors = self.svd.components_.T

    def predict(self, user_id, item_id):
        return np.dot(self.user_factors[user_id], self.item_factors[item_id])
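
On a toy block-structured ratings matrix (the numbers are illustrative), two latent factors are enough to separate the two taste clusters:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Illustrative 4x5 ratings matrix with two obvious taste clusters
ratings = np.array([
    [5, 4, 0, 0, 0],
    [4, 5, 0, 0, 0],
    [0, 0, 4, 5, 4],
    [0, 0, 5, 4, 5],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(csr_matrix(ratings))
item_factors = svd.components_.T

# User 0's predicted score for a liked item vs. an item from the other cluster
liked = user_factors[0] @ item_factors[0]
cross = user_factors[0] @ item_factors[2]
```

The dot product reconstructs a high score for the item user 0 actually rated and a near-zero score for the other cluster's item.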

Hybrid Recommender

python
import numpy as np

class HybridRecommender:
    def __init__(self, collab_weight=0.7, content_weight=0.3):
        # Both models are assumed to expose a score(user_id) method
        # returning item-score arrays of the same length
        self.collab = CollaborativeFilter()
        self.content = ContentBasedFilter()
        self.weights = (collab_weight, content_weight)

    def recommend(self, user_id, n=10):
        collab_scores = self.collab.score(user_id)
        content_scores = self.content.score(user_id)
        combined = self.weights[0] * collab_scores + self.weights[1] * content_scores
        return np.argsort(combined)[-n:][::-1]

Evaluation Metrics

  • Precision@K, Recall@K
  • NDCG (ranking quality)
  • Coverage (catalog diversity)
  • A/B test conversion rate
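
The first two metrics are simple enough to sketch inline (an illustrative stand-in; the `evaluation_metrics` module referenced later may differ in detail):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0
```

For example, recommending `[1, 2, 3, 4, 5]` against relevant set `{1, 3, 6}` gives Precision@5 = 0.4 and Recall@5 = 2/3.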

Cold Start Solutions

  • New users: Popular items, onboarding preferences, demographic-based
  • New items: Content-based bootstrapping, active learning
  • Exploration strategies: ε-greedy, Thompson sampling bandits
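
For the "popular items" fallback, a minimal sketch. The quick start below imports a `PopularityRecommender` from a `cold_start` module; this illustrative version just counts raw `(user, item)` interactions and ignores timestamps:

```python
from collections import Counter

class PopularityRecommender:
    """Cold-start fallback: rank items by total interaction count."""

    def fit(self, interactions):
        # interactions: iterable of (user_id, item_id) pairs
        self.popularity = Counter(item for _, item in interactions)

    def recommend(self, n=10):
        # Most interacted-with items first
        return [item for item, _ in self.popularity.most_common(n)]
```

A production version would typically decay old interactions by recency rather than counting all history equally.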

Quick Start: Build a Recommender in 5 Steps

python
from scipy.sparse import csr_matrix
import numpy as np

# 1. Prepare user-item interaction matrix
# rows = users, cols = items, values = ratings/interactions
ratings_data = [(0, 5, 5), (0, 10, 4), (1, 5, 3), ...]  # (user, item, rating)
n_users, n_items = 1000, 5000
row_idx = [r[0] for r in ratings_data]
col_idx = [r[1] for r in ratings_data]
ratings = [r[2] for r in ratings_data]
user_item_matrix = csr_matrix((ratings, (row_idx, col_idx)), shape=(n_users, n_items))

# 2. Choose and train model
from recommendation_engine import ItemBasedCollaborativeFilter  # See references
model = ItemBasedCollaborativeFilter(similarity_metric='cosine', k_neighbors=20)
model.fit(user_item_matrix)

# 3. Generate recommendations
recommendations = model.recommend(user_id=42, n=10)
print(recommendations)  # [(item_id, score), ...]

# 4. Evaluate on test set
from evaluation_metrics import precision_at_k, recall_at_k
test_items = {42: {10, 25, 30}}  # True relevant items for user 42
rec_items = [item for item, score in recommendations]
precision = precision_at_k(rec_items, test_items[42], k=10)
recall = recall_at_k(rec_items, test_items[42], k=10)
print(f"Precision@10: {precision:.3f}, Recall@10: {recall:.3f}")

# 5. Handle cold start
from cold_start import PopularityRecommender
popularity_model = PopularityRecommender()
popularity_model.fit(interactions_with_timestamps)
new_user_recs = popularity_model.recommend(n=10)

Known Issues Prevention

1. Popularity Bias

Problem: Recommending only popular items while ignoring the long tail reduces diversity and serendipity.
Solution: Balance popularity with personalization and apply re-ranking for diversity:
python
from typing import List, Tuple

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def diversify_recommendations(
    recommendations: List[Tuple[int, float]],
    item_features: np.ndarray,
    diversity_weight: float = 0.3
) -> List[Tuple[int, float]]:
    """Re-rank to increase diversity while maintaining relevance."""
    selected = []
    candidates = recommendations.copy()

    while len(selected) < len(recommendations) and candidates:
        if not selected:
            # First item: highest score
            selected.append(candidates.pop(0))
            continue

        # Compute diversity scores
        selected_features = item_features[[item for item, _ in selected]]
        diversity_scores = []

        for item, relevance in candidates:
            item_feature = item_features[item].reshape(1, -1)
            # Average distance to already selected items
            avg_distance = cosine_distances(item_feature, selected_features).mean()
            # Combined score: relevance + diversity
            combined = (1 - diversity_weight) * relevance + diversity_weight * avg_distance
            diversity_scores.append((item, relevance, combined))

        # Select item with best combined score
        best = max(diversity_scores, key=lambda x: x[2])
        selected.append((best[0], best[1]))
        candidates = [(i, s) for i, s, _ in diversity_scores if i != best[0]]

    return selected
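
A worked toy run of the same greedy re-ranking idea (features and relevance scores are made up): items 0 and 1 are near-duplicates, so with a 0.5 diversity weight the dissimilar item 2 overtakes item 1:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Toy catalog: items 0 and 1 are near-duplicates, item 2 is very different
item_features = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.5, 0.5]])
recommendations = [(0, 0.9), (1, 0.85), (2, 0.5), (3, 0.4)]  # (item, relevance)
diversity_weight = 0.5

selected = [recommendations[0]]          # highest-relevance item seeds the list
candidates = recommendations[1:]
while candidates:
    sel_feats = item_features[[i for i, _ in selected]]
    scored = []
    for item, rel in candidates:
        # Average cosine distance to everything already selected
        avg_dist = cosine_distances(item_features[item].reshape(1, -1), sel_feats).mean()
        scored.append((item, rel, (1 - diversity_weight) * rel + diversity_weight * avg_dist))
    best = max(scored, key=lambda s: s[2])
    selected.append((best[0], best[1]))
    candidates = [(i, r) for i, r, _ in scored if i != best[0]]

order = [i for i, _ in selected]  # [0, 2, 1, 3]: item 2 overtakes near-duplicate item 1
```

By pure relevance the order would be `[0, 1, 2, 3]`; the diversity term demotes item 1 because it adds almost nothing once item 0 is shown.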

2. Data Sparsity (Matrix >99% Empty)

Problem: Collaborative filtering fails when most users have rated <1% of items.
Solution: Use matrix factorization (SVD, ALS) instead of memory-based CF:
python
# ❌ Bad: User-based CF on sparse data (fails to find similar users)
user_cf = UserBasedCollaborativeFilter()
user_cf.fit(sparse_matrix)  # Most users have <10 ratings

# ✅ Good: Matrix factorization handles sparsity
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50)
user_factors = svd.fit_transform(sparse_matrix)
item_factors = svd.components_.T
# Predict rating: user_factors[u] @ item_factors[i]

3. Cold Start Without Fallback

Problem: Recommender crashes or returns empty results for new users/items.
Solution: Always implement fallback chain:
python
def recommend_with_fallback(user_id, n=10):
    """Graceful degradation through a fallback chain.

    The helpers (has_sufficient_history, user_demographics_available) and
    models (collaborative_filter, demographic_recommender,
    popularity_recommender) are application-specific and assumed to exist.
    """
    try:
        # Try personalized recommendations first
        if has_sufficient_history(user_id, min_interactions=5):
            return collaborative_filter.recommend(user_id, n)
    except Exception as e:
        logger.warning(f"CF failed for user {user_id}: {e}")

    # Fallback 1: Demographic-based
    if user_demographics_available(user_id):
        return demographic_recommender.recommend(user_id, n)

    # Fallback 2: Popularity
    return popularity_recommender.recommend(n)

4. Not Excluding Already-Interacted Items

Problem: Recommending items a user has already purchased or viewed wastes recommendation slots.
Solution: Always filter interacted items:
python
# ✅ Correct: Exclude interacted items
user_items = user_item_matrix[user_id].nonzero()[1]
scores[user_items] = -np.inf  # Ensure they don't appear in top-K
recommendations = np.argsort(scores)[-n:][::-1]

# ❌ Wrong: Forgetting to filter
recommendations = np.argsort(scores)[-n:][::-1]  # May include already purchased!

5. Ignoring Implicit Feedback Confidence

Problem: Treating all clicks/views equally. 1 view ≠ 100 views.
Solution: Weight by interaction strength (view count, watch time, etc.):
python
# For implicit feedback, use confidence weighting
confidence_matrix = 1 + alpha * np.log(1 + interaction_counts)

# In ALS the weighted loss term becomes C_ui * (P_ui - X_ui)²
# Higher confidence for items with more interactions
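
A runnable two-liner with made-up view counts shows the effect (alpha = 40 is the default suggested in the classic implicit-feedback ALS formulation):

```python
import numpy as np

# Hypothetical view counts per (user, item)
interaction_counts = np.array([[0, 1, 100],
                               [3, 0, 0]], dtype=float)
alpha = 40.0
confidence = 1 + alpha * np.log1p(interaction_counts)  # 1 view gets far less weight than 100
```

The log keeps a 100-view item from dominating by a factor of 100, while unobserved pairs keep the baseline confidence of 1.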

6. Not Evaluating Ranking Quality (Using Only Accuracy)

Problem: Good rating-prediction accuracy (low RMSE) doesn't guarantee good top-K recommendations.
Solution: Use ranking metrics (NDCG, MAP@K):
python
# ❌ Bad: Only RMSE
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# ✅ Good: Ranking metrics for top-K evaluation
from evaluation_metrics import ndcg_at_k, mean_average_precision_at_k

# NDCG rewards putting highly relevant items first
ndcg = ndcg_at_k(recommendations, relevance_scores, k=10)

# MAP@K considers precision at each relevant item position
map_score = mean_average_precision_at_k(all_recommendations, ground_truth, k=10)
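
If no metrics library is at hand, NDCG@K itself is only a few lines. This is an illustrative stand-in for the `evaluation_metrics` import above, taking graded relevance as a dict:

```python
import numpy as np

def ndcg_at_k(recommended, relevance, k=10):
    """NDCG@K with graded relevance: relevance maps item -> gain (0 if absent)."""
    gains = [relevance.get(item, 0.0) for item in list(recommended)[:k]]
    # Discounted cumulative gain: later ranks are log-discounted
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    # Ideal DCG: the best possible ordering of the known relevant items
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; swapping a highly relevant item below a less relevant one drops the score, which RMSE would never detect.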

7. Filter Bubble (Lack of Exploration)

Problem: Always recommending similar items limits discovery and reduces user engagement over time.
Solution: Implement an explore-exploit strategy:
python
class ExploreExploitRecommender:
    def __init__(self, base_model, epsilon=0.1):
        self.base_model = base_model
        self.epsilon = epsilon  # 10% exploration

    def recommend(self, user_id, n=10):
        # Exploit: Use trained model for most recommendations
        n_exploit = int(n * (1 - self.epsilon))
        exploitative_recs = self.base_model.recommend(user_id, n=n_exploit)

        # Explore: Add random diverse items (sample_diverse_items is
        # application-specific, e.g. uniform sampling across categories)
        n_explore = n - n_exploit
        explored_items = sample_diverse_items(n_explore)

        return exploitative_recs + explored_items
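
The cold-start section also mentions Thompson sampling. A minimal Beta-Bernoulli version over items (the item set and click-reward model here are illustrative):

```python
import numpy as np

class ThompsonSamplingRecommender:
    """Beta-Bernoulli bandit over items: sample a plausible click rate, play the best."""

    def __init__(self, n_items, seed=0):
        self.successes = np.ones(n_items)  # Beta alpha (one click pseudo-count)
        self.failures = np.ones(n_items)   # Beta beta (one skip pseudo-count)
        self.rng = np.random.default_rng(seed)

    def recommend(self):
        # One posterior draw per item; uncertainty drives exploration
        samples = self.rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def update(self, item, clicked):
        if clicked:
            self.successes[item] += 1
        else:
            self.failures[item] += 1
```

Unlike ε-greedy's fixed exploration rate, exploration here shrinks automatically as an item's posterior tightens around its observed click rate.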

When to Load References

Load reference files when you need detailed implementations:
  • Collaborative Filtering: Load
    references/collaborative-filtering-deep-dive.md
    for complete user-based and item-based CF implementations with similarity metrics (cosine, Pearson, Jaccard), scalability optimizations (sparse matrices, approximate nearest neighbors), and handling edge cases (cold start, sparsity)
  • Matrix Factorization: Load
    references/matrix-factorization-methods.md
    for SVD, ALS, and NMF implementations with hyperparameter tuning, implicit feedback handling, and advanced techniques (BPR, WARP)
  • Evaluation Metrics: Load
    references/evaluation-metrics-implementation.md
    for Precision@K, Recall@K, NDCG, coverage, diversity metrics, cross-validation strategies, and statistical significance testing (paired t-test, bootstrap confidence intervals)
  • Cold Start Solutions: Load
    references/cold-start-strategies.md
    for new user/item strategies (popularity-based, onboarding, demographic, content-based bootstrapping, active learning), explore-exploit approaches (ε-greedy, Thompson sampling), and hybrid fallback chains