robots-txt


SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.
When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

  • Robots.txt: Review Disallow/Allow; avoid blocking important pages
  • Crawler access: Ensure crawlers (including AI crawlers) can access key pages
  • Indexing: Misconfigured robots.txt can block indexing; verify no accidental blocks

Initial Assessment

Check for product marketing context first: if .claude/product-marketing-context.md or .cursor/product-marketing-context.md exists, read it for the site URL and indexing goals.

Identify:
  1. Site URL: base domain (e.g., https://example.com)
  2. Indexing scope: full site, partial, or specific paths to exclude
  3. AI crawler strategy: allow search/indexing crawlers vs. block training-data crawlers

Best Practices

Purpose and Limitations

| Point | Note |
| --- | --- |
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search results without a snippet) |
| No-index | Use a noindex meta tag or authentication for sensitive content; robots.txt is publicly readable |
| Indexed vs. non-indexed | Not all content should be indexed. robots.txt and noindex complement each other: robots.txt for path-level crawl control, noindex for page-level indexing control (see the indexing skill) |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
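As a page-level complement, the noindex directive goes in the page's HTML head (standard form shown below). Note that the page must stay crawlable: if robots.txt blocks the URL, crawlers never fetch the page and never see the tag.

```html
<meta name="robots" content="noindex">
```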

Location and Format

| Item | Requirement |
| --- | --- |
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |

Core Directives

| Directive | Purpose | Example |
| --- | --- | --- |
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block a path prefix | Disallow: /admin/ |
| Allow: | Permit a path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare the sitemap's absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query parameters (Yandex only) | See below |
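A minimal file combining these directives might look like the following sketch (the domain and paths are placeholders, not recommendations for any particular site):

```text
# Default group: applies to all crawlers not matched by a more specific group
User-agent: *
Disallow: /admin/
# Allow overrides the broader Disallow because its path is more specific
Allow: /admin/help/

# Sitemap must be an absolute URL; it applies to all crawlers
Sitemap: https://example.com/sitemap.xml
```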

Critical: Do Not Block Rendering Resources

  • Do not block CSS, JS, images; Google needs them to render pages
  • Only block paths that don't need crawling: admin, API, temp files
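A sketch of safe blocking, assuming hypothetical /admin/, /api/, and /tmp/ paths; note that no CSS, JS, or image directories are disallowed:

```text
User-agent: *
Disallow: /admin/   # back office
Disallow: /api/     # internal API endpoints
Disallow: /tmp/     # temporary files
# Do NOT add rules like "Disallow: /assets/" here:
# Google must fetch CSS/JS to render and evaluate pages
```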

AI Crawler Strategy

| User-agent | Purpose | Typical setting |
| --- | --- | --- |
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl | Disallow |
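The table above translates into per-agent groups. A common pattern, assuming the goal is AI search visibility without training use (RFC 9309 permits multiple User-agent lines per group):

```text
# AI search crawlers: allowed (an empty Disallow permits everything)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow:

# Training-data crawlers: blocked site-wide
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
```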

Clean-param (Yandex)

Tell Yandex to ignore common tracking parameters so they do not generate duplicate crawl URLs:

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid

Output Format

  • Current state (if auditing)
  • Recommended robots.txt (full file)
  • Compliance checklist
  • References: Google's robots.txt documentation

Related Skills

  • xml-sitemap: Sitemap URL to reference in robots.txt
  • site-crawlability: Broader crawl and structure guidance