xiaohongshu-search-summarizer


Xiaohongshu Search and Summarize


This skill automates extracting high-quality multi-modal content (text + images) from Xiaohongshu (小红书) and actively assists you in generating a deeply integrated, analytical final report for the user. Because of Xiaohongshu's aggressive anti-scraping mechanisms, direct HTTP requests or naive scraping often result in 404s or blocks. This skill bypasses these natively by simulating a real user through `playwright-cli` in a headed browser window.

It operates in two distinct phases:

Phase 1: Subagent Data Collection

  1. Simulate a search for the keyword on Xiaohongshu in a headed browser.
  2. Advance through each post's image slider so that all lazy-loaded pictures from the top N posts are fully loaded.
  3. Extract titles, descriptions, top comments, and all high-resolution images.
  4. Download those images to a local directory and generate a raw data document (`[keyword]_raw_data.md`).
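The download-and-stitch step (4) can be sketched in Python. The `posts` structure and its field names are assumptions about the script's internal data shape, not its actual implementation:

```python
from pathlib import Path

def build_raw_data_md(keyword, posts):
    """Assemble the raw data markdown from already-downloaded posts.

    `posts` is assumed (hypothetically) to be a list of dicts with
    `title`, `desc`, `comments` (list of str), and `images`
    (list of local file paths).
    """
    lines = [f"# Raw data for: {keyword}", ""]
    for i, post in enumerate(posts, 1):
        lines += [f"## Post {i}: {post['title']}", "", post["desc"], ""]
        lines += [f"- Comment: {c}" for c in post["comments"]]
        lines += [f"![image]({p})" for p in post["images"]]
        lines.append("")
    return "\n".join(lines)

def write_raw_data(keyword, posts, out_dir="."):
    """Write the stitched markdown next to the downloaded images."""
    path = Path(out_dir) / f"{keyword}_raw_data.md"
    path.write_text(build_raw_data_md(keyword, posts), encoding="utf-8")
    return path
```

Embedding the image paths as markdown image links is what lets Phase 2 locate the files for visual ingestion.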

Phase 2: AI Multi-Modal Synthesis (Your Job)

  1. You MUST use your file reading capabilities to read the `[keyword]_raw_data.md` file.
  2. Inside the raw data markdown, you will find paths to image files. You MUST use your file reading / vision capabilities on these image file paths to actually ingest and "see" their visual content. If you skip this step, you are only reading file names, not the images themselves!
  3. Analyze the texts, summarize the genuinely useful comments (discarding noise such as "pm me"), and interpret the semantic content of the images you just viewed (e.g. diagrams, guidelines, step-by-step UI flows).
  4. Compile everything into a single, beautifully synthesized, comprehensive report rather than a linear list of posts.

Dependencies

  • `playwright-cli` (must be available on the PATH)
  • `python3` (required to download images and stitch together the raw data markdown)

Usage Instructions

Step 1: Run the Extraction Script

Execute the wrapper script at `scripts/run.sh`. It accepts the following arguments:

```bash
/bin/bash <skill_dir>/scripts/run.sh "YOUR KEYWORD" <MAX_POSTS> <OUTPUT_DIRECTORY>
```

  • `YOUR KEYWORD`: the search term to look up on Xiaohongshu.
  • `<MAX_POSTS>`: (optional, default = 10) the number of top posts to scan.
  • `<OUTPUT_DIRECTORY>`: (optional, default = `./`) the directory where the raw data and images will be saved.

Example execution:

```bash
/bin/bash ~/.claude/skills/xiaohongshu-search-summarizer/scripts/run.sh "openclaw使用场景" 10 "./xhs_report_openclaw_scenarios"
```
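The optional-argument defaulting above can be mirrored in Python; this is a sketch of the documented semantics, not the actual `run.sh` implementation:

```python
def parse_args(argv):
    """Mirror the documented argument handling: keyword is required,
    MAX_POSTS defaults to 10, OUTPUT_DIRECTORY defaults to ./"""
    keyword = argv[0]
    max_posts = int(argv[1]) if len(argv) > 1 else 10
    out_dir = argv[2] if len(argv) > 2 else "./"
    return keyword, max_posts, out_dir
```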

Step 2: Read Raw Data & Images

Once the bash script finishes successfully, navigate to the `OUTPUT_DIRECTORY` and use your file reading capabilities to ingest the generated `[keyword]_raw_data.md` file. Inside this file, you will find descriptions, comments, and file paths pointing to `post_X_img_Y.webp` or `post_X_img_Y.jpg`.
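Pulling the image paths out of the raw data file can be sketched with a regex built from the `post_X_img_Y` naming pattern above; treat the exact pattern as an assumption about how the paths appear in the markdown:

```python
import re

def extract_image_paths(raw_markdown):
    """Find local image file paths (webp/jpg) referenced in the raw data."""
    return re.findall(r"\S*post_\d+_img_\d+\.(?:webp|jpg)", raw_markdown)
```

Each returned path can then be opened with vision capabilities to actually ingest the image content.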

Step 3: Synthesis & Summarization

This is the most critical step. Do not just return the raw markdown file to the user. Instead, write a polished, comprehensive markdown report that reorganizes the information logically while retaining a high level of detail.

Follow these strict compilation rules:
  • Do not list posts individually (e.g. avoid "Post 1: ... Post 2: ...").
  • Read the images: you MUST use your file reading and vision capabilities on the `.webp` or `.jpg` image files found in the raw data directory to interpret their contents.
  • Detailed & comprehensive synthesis: provide a highly detailed summary that includes the diverse viewpoints, nuances, and specific examples found across different posts. Avoid over-summarizing or losing important context; preserve the richness and diversity of the information.
  • Extract and merge themes: group ideas by concepts, steps, recurring themes, or pros/cons.
  • Evaluate comments: merge insights from valuable comments directly into the core narrative. Skip useless or repetitive comments, but preserve diverse opinions and helpful counter-arguments from the comments section.
  • Integrate images contextually: embed the most relevant, high-quality images directly into the flow of your final report to support the analytical points being made. Describe their visual meaning based on what you saw with your vision capabilities.
  • Save to OUTPUT_DIRECTORY: use your file writing capabilities to save your compiled final Markdown report directly into the same `<OUTPUT_DIRECTORY>` as the raw data (e.g. `<OUTPUT_DIRECTORY>/[keyword]_synthesis.md`), and give the user the path to it.

Error Handling

If you encounter `404 Not Found` or "element not visible" errors during the browser invocation:
  • Keep in mind that Xiaohongshu may present a login challenge. If the site pauses waiting for a login, instruct the user to check the `playwright-cli` browser window and complete the authentication manually, then run the script again.
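Re-running after the manual login is usually enough. A minimal retry wrapper (a sketch only; the actual `run.sh` has no such wrapper) might look like:

```python
import subprocess

def run_with_retry(cmd, attempts=2):
    """Run the extraction command; after a failure, pause so the user
    can complete the login challenge in the headed browser, then retry."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            input("Complete the login in the playwright-cli window, "
                  "then press Enter to retry...")
    return False
```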