algo-rec-content
Content-Based Recommendation
Overview
Content-based filtering recommends items whose features match the user's preference profile, which is built from their interaction history. Scoring costs O(I × F) per user, where I is the number of candidate items and F is the number of features. It solves new-item cold start because items need only features, not interaction history.
When to Use
Trigger conditions:
- Recommending based on item attributes (genre, category, keywords, price range)
- New item cold start: items have features but no interaction data yet
- When user privacy requires no cross-user data sharing
When NOT to use:
- When serendipity matters (content-based creates filter bubbles)
- When item features are unavailable or uninformative (use collaborative filtering, CF, instead)
Algorithm
IRON LAW: Content-Based Can Only Recommend SIMILAR Items
It cannot discover unexpected interests (filter bubble problem).
Users who only interact with action movies will only get action
movie recommendations, even if they'd love a documentary.
Phase 1: Input Validation
Extract item feature vectors (TF-IDF for text, one-hot for categories, numerical for attributes). Build user profile from weighted item features of interacted items.
Gate: Item features extracted, user profile vector built.
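A minimal sketch of this phase, assuming items are stored as sparse dictionaries mapping feature names to values and that each interaction carries a single positive weight (for example a rating). The helper names `one_hot` and `build_profile` are hypothetical, and TF-IDF extraction for text is left out to keep the example short.

```python
from collections import defaultdict


def one_hot(prefix, values):
    """Encode categorical values as sparse one-hot features, e.g. 'genre:thriller' -> 1.0."""
    return {f"{prefix}:{v}": 1.0 for v in values}


def build_profile(interactions):
    """Return the weighted centroid of the interacted item vectors.

    interactions: list of (item_vector, weight) pairs. The weight is assumed
    to encode rating, recency, or engagement.
    """
    total = sum(w for _, w in interactions)
    if total <= 0:
        # Gate: without weighted interactions there is no profile to build.
        raise ValueError("cannot build a profile without positively weighted interactions")
    profile = defaultdict(float)
    for vec, w in interactions:
        for feat, val in vec.items():
            profile[feat] += w * val
    return {feat: val / total for feat, val in profile.items()}


# Illustrative history: (item vector, rating used as weight)
history = [
    ({**one_hot("genre", ["thriller"]), **one_hot("director", ["Nolan"])}, 5.0),
    (one_hot("genre", ["thriller"]), 3.0),
    (one_hot("genre", ["documentary"]), 2.0),
]
profile = build_profile(history)
```

With these weights the profile puts 0.8 on `genre:thriller`, 0.5 on `director:Nolan`, and 0.2 on `genre:documentary`.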
Phase 2: Core Algorithm
- Represent each item as a feature vector
- Build user profile: weighted centroid of interacted item vectors (weight by recency, rating, or engagement)
- Compute similarity between user profile and all candidate items (cosine similarity)
- Rank by similarity score, exclude already-interacted items
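The scoring and ranking steps above can be sketched as follows. This assumes sparse dictionary vectors and an already-built profile; the function names, item IDs, and `top_k` default are illustrative, not part of the original specification.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(profile, candidates, seen, top_k=10):
    """Score every unseen candidate against the profile and return the top_k."""
    scored = [
        (item_id, cosine(profile, vec))
        for item_id, vec in candidates.items()
        if item_id not in seen  # exclude already-interacted items
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


profile = {"genre:thriller": 0.8, "director:Nolan": 0.5, "genre:documentary": 0.2}
candidates = {
    "101": {"genre:thriller": 1.0, "director:Nolan": 1.0},
    "102": {"genre:documentary": 1.0},
    "103": {"genre:comedy": 1.0},
    "104": {"genre:thriller": 1.0},
}
ranked = recommend(profile, candidates, seen={"104"})
```

Item `101` ranks first because it shares two features with the profile, `102` follows on the weaker documentary signal, and `103` scores zero because it shares nothing with the profile.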
Phase 3: Verification
Evaluate: does the recommendation list reflect the user's demonstrated preferences? Check diversity metrics.
Gate: Recommendations are topically aligned with user history.
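The text does not say which diversity metric to check. One common choice, assumed here, is intra-list diversity: the average of one minus the pairwise cosine similarity over the recommended items. A sketch with hypothetical function names:

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_list_diversity(vectors):
    """Average pairwise dissimilarity: 0.0 means identical items, 1.0 means no shared features."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


narrow = [{"genre:action": 1.0}, {"genre:action": 1.0}]
varied = [{"genre:action": 1.0}, {"genre:documentary": 1.0}]
narrow_score = intra_list_diversity(narrow)
varied_score = intra_list_diversity(varied)
```

A value near zero for a user's list is a sign of the filter bubble described earlier, even when the list passes the topical-alignment gate.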
Phase 4: Output
Return ranked recommendations with feature-level explanations.
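One way to derive feature-level explanations, offered as a sketch rather than the prescribed method, is to rank each shared feature by its contribution to the dot product (profile value times item value). The helpers `explain` and `build_response` are hypothetical. Treating `features_used` as the profile size and `profile_items` as the number of interacted items is an assumption about the metadata fields.

```python
import json


def explain(profile, item_vec, top_n=2):
    """Return the features that contribute most to the profile/item dot product."""
    contrib = {
        feat: profile.get(feat, 0.0) * val
        for feat, val in item_vec.items()
        if profile.get(feat, 0.0) * val > 0
    }
    return sorted(contrib, key=contrib.get, reverse=True)[:top_n]


def build_response(profile, scored_items, profile_items):
    """scored_items: list of (item_id, score, item_vector) tuples, already ranked."""
    return {
        "recommendations": [
            {
                "item_id": item_id,
                "score": round(score, 2),
                "matching_features": explain(profile, vec),
            }
            for item_id, score, vec in scored_items
        ],
        "metadata": {
            "method": "content-based",
            "features_used": len(profile),  # assumed meaning of this field
            "profile_items": profile_items,  # assumed meaning of this field
        },
    }


profile = {"genre:thriller": 0.8, "director:Nolan": 0.5, "genre:documentary": 0.2}
item = {"genre:thriller": 1.0, "director:Nolan": 1.0, "year:2010": 1.0}
response = build_response(profile, [("456", 0.87, item)], profile_items=30)
print(json.dumps(response, indent=2))
```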
Output Format
```json
{
  "recommendations": [{"item_id": "456", "score": 0.87, "matching_features": ["genre:thriller", "director:Nolan"]}],
  "metadata": {"method": "content-based", "features_used": 15, "profile_items": 30}
}
```
Examples
Sample I/O
Input: User watched 5 sci-fi movies, 2 documentaries. Candidate: new sci-fi movie.
Expected: High score (~0.8+) due to genre match with dominant preference.
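To see where a score in that range can come from, here is a worked calculation. It rests on simplifying assumptions that are not in the original: genre is the only feature, it is one-hot encoded, and all seven interactions carry equal weight. Because the candidate is a unit vector on `genre:sci-fi`, the cosine similarity reduces to the profile's sci-fi component divided by the profile's norm.

```python
import math

# Equal-weight centroid of 5 sci-fi and 2 documentary one-hot vectors.
profile = {"genre:sci-fi": 5 / 7, "genre:documentary": 2 / 7}

profile_norm = math.sqrt(sum(v * v for v in profile.values()))

# The candidate is one-hot on sci-fi, so its norm is 1 and the dot product
# is just the profile's sci-fi component.
score = profile["genre:sci-fi"] / profile_norm
print(round(score, 3))  # 0.928
```

Under the same assumptions a new documentary would score 2/√29, roughly 0.37, which shows how the dominant genre drives the ranking.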
Edge Cases
| Input | Expected | Why |
|---|---|---|
| New user, no history | Cannot build profile | New-user cold start — use popularity |
| All items same features | Equal scores | No differentiation possible |
| User with diverse history | Moderate scores for all | Profile averages dilute signal |
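The first two rows of the table can be caught before any scoring takes place. The guard below is a hypothetical sketch; the function name and the strategy labels it returns are illustrative, not part of the original specification.

```python
def choose_strategy(history, candidates):
    """Pick a strategy label before scoring.

    history: list of item vectors the user has interacted with.
    candidates: dict of item_id -> sparse feature vector.
    """
    if not history:
        # New-user cold start: no profile can be built, fall back to popularity.
        return "popularity"
    distinct = {frozenset(vec.items()) for vec in candidates.values()}
    if len(distinct) <= 1:
        # Every candidate has the same features, so all scores would be equal.
        return "no-differentiation"
    return "content-based"


same = {"a": {"genre:action": 1.0}, "b": {"genre:action": 1.0}}
mixed = {"a": {"genre:action": 1.0}, "b": {"genre:drama": 1.0}}
```

The third row (diverse history) is not an error state, so it is better handled at scoring time than by a guard like this.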
Gotchas
- Feature quality is everything: Garbage features → garbage recommendations. Invest in feature engineering.
- Filter bubble: Users get increasingly narrow recommendations. Inject diversity by mixing in exploration items.
- Profile drift: User preferences change over time. Apply temporal decay to older interactions.
- Feature sparsity: Items with few features produce unreliable similarity. Set a minimum feature count threshold.
- Over-specialization: A user who rated one jazz album highly shouldn't get ALL jazz. Weight by interaction count, not just rating.
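The temporal decay mentioned under profile drift could be implemented as exponential decay with a half-life. This is a sketch under assumptions: the 30-day half-life is an arbitrary illustrative value, and multiplying the decay by the rating is one possible way to combine the two signals.

```python
def interaction_weight(rating, age_days, half_life_days=30.0):
    """Weight an interaction by its rating, halved for every half_life_days of age."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return rating * 0.5 ** (age_days / half_life_days)


fresh = interaction_weight(rating=5.0, age_days=0)    # 5.0
month_old = interaction_weight(rating=5.0, age_days=30)  # 2.5
two_months = interaction_weight(rating=5.0, age_days=60)  # 1.25
```

These weights can be used directly as the per-interaction weights when computing the weighted centroid for the user profile.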
References
- For hybrid approaches combining content and CF, see references/hybrid-strategies.md
- For text-based feature extraction techniques, see references/feature-extraction.md