algo-rec-content

Content-Based Recommendation

Overview

Content-based filtering recommends items whose features match a user's preference profile, which is built from that user's interaction history. Scoring one user against the catalog takes O(I × F) time, where I is the number of candidate items and F is the number of features per item. It handles new-item cold start because an item needs only features, not interaction history, to be scored.

When to Use

Trigger conditions:
  • Recommending based on item attributes (genre, category, keywords, price range)
  • New item cold start: items have features but no interaction data yet
  • When user privacy requires no cross-user data sharing
When NOT to use:
  • When serendipity matters (content-based creates filter bubbles)
  • When item features are unavailable or uninformative (use collaborative filtering (CF) instead)

Algorithm

IRON LAW: Content-Based Can Only Recommend SIMILAR Items
It cannot discover unexpected interests (filter bubble problem).
Users who only interact with action movies will only get action
movie recommendations — even if they'd love a documentary.

Phase 1: Input Validation

Extract a feature vector for each item (TF-IDF for text, one-hot encoding for categories, numeric values for quantitative attributes). Build the user profile from the weighted feature vectors of the items the user has interacted with. Gate: item features are extracted and the user profile vector is built.
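To make the feature-extraction step concrete, here is a minimal sketch in plain Python. The item schema (`text`, `category`, `price`), the function name `build_item_vectors`, the smoothed IDF formula, and the min-max scaling of price are all assumptions chosen for illustration; the source does not prescribe any of them.

```python
import math
from collections import Counter


def build_item_vectors(items):
    """Turn items into dense feature vectors.

    items: non-empty dict of item_id -> {"text": str, "category": str, "price": float}
    (a hypothetical schema). Returns (vectors, feature_names).
    """
    docs = {iid: it["text"].lower().split() for iid, it in items.items()}
    n_docs = len(docs)
    # Document frequency: in how many items each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vocab = sorted(df)
    categories = sorted({it["category"] for it in items.values()})
    prices = [it["price"] for it in items.values()]
    lo, hi = min(prices), max(prices)

    feature_names = (
        [f"text:{t}" for t in vocab]
        + [f"category:{c}" for c in categories]
        + ["price"]
    )
    vectors = {}
    for iid, it in items.items():
        tf = Counter(docs[iid])
        total = len(docs[iid]) or 1
        # TF-IDF with a smoothed IDF term (an assumed variant).
        tfidf = [
            (tf[t] / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t in vocab
        ]
        # One-hot encoding of the single category.
        one_hot = [1.0 if it["category"] == c else 0.0 for c in categories]
        # Min-max scale the numeric attribute into [0, 1].
        price = [(it["price"] - lo) / (hi - lo) if hi > lo else 0.0]
        vectors[iid] = tfidf + one_hot + price
    return vectors, feature_names
```

Keeping `feature_names` aligned with the vector positions is what later allows feature-level explanations of a recommendation.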

Phase 2: Core Algorithm

  1. Represent each item as a feature vector
  2. Build user profile: weighted centroid of interacted item vectors (weight by recency, rating, or engagement)
  3. Compute similarity between user profile and all candidate items (cosine similarity)
  4. Rank by similarity score, exclude already-interacted items
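The four steps above can be sketched in plain Python as follows. The function names, the `interactions` mapping of item ID to weight, and the `top_k` parameter are hypothetical; how the weights are derived (recency, rating, or engagement) is left to the caller.

```python
import math


def weighted_centroid(vectors, weights):
    """Weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(item_vectors, interactions, top_k=10):
    """Rank unseen items by similarity to the user's profile.

    item_vectors: dict item_id -> feature vector (step 1).
    interactions: dict item_id -> positive weight for items the user touched.
    """
    seen = [iid for iid in interactions if iid in item_vectors]
    if not seen or sum(interactions[i] for i in seen) <= 0:
        return []  # no usable history: the caller needs a fallback
    # Step 2: user profile as the weighted centroid of interacted items.
    profile = weighted_centroid(
        [item_vectors[i] for i in seen], [interactions[i] for i in seen]
    )
    # Step 3: similarity between the profile and every unseen candidate.
    scored = [
        (iid, cosine(profile, vec))
        for iid, vec in item_vectors.items()
        if iid not in interactions  # step 4: exclude already-interacted items
    ]
    # Step 4: rank by similarity, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The scoring loop visits every candidate once and touches every feature once, which is where the O(I × F) per-user cost comes from.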

Phase 3: Verification

Evaluate: does the recommendation list reflect the user's demonstrated preferences? Check diversity metrics. Gate: Recommendations are topically aligned with user history.
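The source does not name a specific diversity metric. One common choice, assumed here, is intra-list diversity: the average pairwise cosine distance between the recommended items. The sketch below pairs it with the mean similarity to the user profile as a rough alignment check; both function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_list_diversity(vectors):
    """Average pairwise (1 - cosine) over the recommended items.

    0.0 means all items point in the same direction; values near 1.0
    mean the list is spread across unrelated features.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


def mean_profile_similarity(profile, vectors):
    """Average similarity of the recommended items to the user profile."""
    if not vectors:
        return 0.0
    return sum(cosine(profile, v) for v in vectors) / len(vectors)
```

A list with high profile similarity but diversity close to zero is aligned with the user's history yet likely to reinforce the filter bubble described earlier.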

Phase 4: Output

Return ranked recommendations with feature-level explanations.

Output Format

```json
{
  "recommendations": [{"item_id": "456", "score": 0.87, "matching_features": ["genre:thriller", "director:Nolan"]}],
  "metadata": {"method": "content-based", "features_used": 15, "profile_items": 30}
}
```
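The source does not say how `matching_features` is derived. One plausible approach, assumed in this sketch, is to list the features with the largest positive contribution to the dot product between the profile and the item. The helper names `explain` and `format_output` are hypothetical.

```python
import json


def explain(profile, item_vec, feature_names, top_n=2):
    """Names of the features contributing most to the profile-item dot product."""
    contributions = [
        (p * x, name)
        for p, x, name in zip(profile, item_vec, feature_names)
        if p * x > 0
    ]
    contributions.sort(reverse=True)  # largest contribution first
    return [name for _, name in contributions[:top_n]]


def format_output(scored, profile, item_vectors, feature_names, profile_items):
    """Assemble the response in the shape shown above.

    scored: list of (item_id, score) pairs, already ranked.
    profile_items: number of interacted items used to build the profile.
    """
    recommendations = [
        {
            "item_id": item_id,
            "score": round(score, 2),
            "matching_features": explain(
                profile, item_vectors[item_id], feature_names
            ),
        }
        for item_id, score in scored
    ]
    return {
        "recommendations": recommendations,
        "metadata": {
            "method": "content-based",
            "features_used": len(feature_names),
            "profile_items": profile_items,
        },
    }


if __name__ == "__main__":
    names = ["genre:thriller", "genre:comedy", "director:Nolan"]
    out = format_output(
        [("456", 0.87)], [0.8, 0.1, 0.5], {"456": [1.0, 0.0, 1.0]}, names, 30
    )
    print(json.dumps(out, indent=2))
```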

Examples

Sample I/O

Input: User watched 5 sci-fi movies, 2 documentaries. Candidate: new sci-fi movie. Expected: High score (~0.8+) due to genre match with dominant preference.
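The expected score can be checked by hand under simplifying assumptions: genre is the only feature, it is one-hot encoded as [sci-fi, documentary], and all seven interactions carry equal weight. The profile is then [5/7, 2/7], and its cosine similarity to a pure sci-fi candidate is 5/√29.

```python
import math

# One-hot genre vectors: [sci_fi, documentary] (assumed feature space).
history = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 2

# Equal-weight centroid of the watched items: [5/7, 2/7].
profile = [sum(column) / len(history) for column in zip(*history)]

candidate = [1.0, 0.0]  # the new sci-fi movie

dot = sum(p * c for p, c in zip(profile, candidate))
norm_profile = math.sqrt(sum(p * p for p in profile))
norm_candidate = math.sqrt(sum(c * c for c in candidate))
score = dot / (norm_profile * norm_candidate)

print(round(score, 3))  # 0.928
```

With richer feature vectors (keywords, director, and so on) the score would usually be lower than in this genre-only case, which is consistent with the ~0.8+ expectation.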

Edge Cases

| Input | Expected | Why |
|---|---|---|
| New user, no history | Cannot build profile | New-user cold start; fall back to popularity |
| All items same features | Equal scores | No differentiation possible |
| User with diverse history | Moderate scores for all | Profile averages dilute signal |
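The new-user case can be handled with a thin wrapper that falls back to a popularity list when there is no history. This is only a sketch: `content_recommender` stands in for whatever content-based ranking function is in use, and `popular_items` is assumed to be a precomputed list ordered by popularity.

```python
def recommend_with_fallback(history, content_recommender, popular_items, top_k=10):
    """Use content-based ranking when a profile can be built, else popularity.

    history: list of item IDs the user has interacted with (may be empty).
    content_recommender: callable taking the history and returning ranked item IDs.
    popular_items: item IDs ordered from most to least popular.
    """
    if not history:
        # New-user cold start: no profile can be built from an empty history.
        return popular_items[:top_k]
    return content_recommender(history)[:top_k]
```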

Gotchas

  • Feature quality is everything: Garbage features → garbage recommendations. Invest in feature engineering.
  • Filter bubble: Users get increasingly narrow recommendations. Inject diversity by mixing in exploration items.
  • Profile drift: User preferences change over time. Apply temporal decay to older interactions.
  • Feature sparsity: Items with few features produce unreliable similarity. Set a minimum feature count threshold.
  • Over-specialization: A user who rated one jazz album highly shouldn't get ALL jazz. Weight by interaction count, not just rating.
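As an illustration of the temporal-decay advice, interaction weights can be reduced exponentially with age before the profile is built. The 30-day half-life, the `(item_id, rating, day)` tuple layout, and the function names are assumptions for this sketch, not values from the source.

```python
def decay_weight(age_days, half_life_days=30.0):
    """Exponential decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def interaction_weights(interactions, now_day, half_life_days=30.0):
    """Combine rating and recency into one weight per item.

    interactions: list of (item_id, rating, day) tuples, where `day` is a
    day index on the same scale as `now_day`.
    Repeated interactions with the same item accumulate, so the weight also
    reflects interaction count rather than rating alone.
    """
    weights = {}
    for item_id, rating, day in interactions:
        weight = rating * decay_weight(now_day - day, half_life_days)
        weights[item_id] = weights.get(item_id, 0.0) + weight
    return weights
```

The resulting dictionary can serve directly as the per-item weights of the weighted centroid that forms the user profile. A shorter half-life tracks changing tastes faster but makes the profile noisier.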

References

  • For hybrid approaches combining content and CF, see
    references/hybrid-strategies.md
  • For text-based feature extraction techniques, see
    references/feature-extraction.md