robots-txt


SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.
When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

  • Robots.txt: Review Disallow/Allow; avoid blocking important pages
  • Crawler access: Ensure crawlers (including AI crawlers) can access key pages
  • Indexing: Misconfigured robots.txt can block indexing; verify no accidental blocks

Initial Assessment

Check for product marketing context first: if .claude/product-marketing-context.md or .cursor/product-marketing-context.md exists, read it for the site URL and indexing goals.

Identify:
  1. Site URL: base domain (e.g., https://example.com)
  2. Indexing scope: full site, partial, or specific paths to exclude
  3. AI crawler strategy: allow search/indexing crawlers vs. block training-data crawlers

Best Practices

Purpose and Limitations

| Point | Note |
| --- | --- |
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search results without a snippet) |
| No-index | Use a noindex meta tag or authentication for sensitive content; robots.txt is publicly readable |
| Indexed vs. non-indexed | Not all content should be indexed. robots.txt and noindex complement each other: robots.txt for path-level crawl control, noindex for page-level indexing control (see the indexing skill) |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
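As a page-level complement, the noindex directive goes in the page's HTML head (standard form shown below). Note that the page must stay crawlable: if robots.txt blocks the URL, crawlers never fetch the page and never see the tag.

```html
<meta name="robots" content="noindex">
```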

Location and Format

| Item | Requirement |
| --- | --- |
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |

Core Directives

| Directive | Purpose | Example |
| --- | --- | --- |
| User-agent: | Target crawler | User-agent: Googlebot, User-agent: * |
| Disallow: | Block a path prefix | Disallow: /admin/ |
| Allow: | Permit a path (can override Disallow) | Allow: /public/ |
| Sitemap: | Declare the sitemap's absolute URL | Sitemap: https://example.com/sitemap.xml |
| Clean-param: | Strip query parameters (Yandex only) | See below |
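A minimal file combining these directives might look like the following sketch (the domain and paths are placeholders, not recommendations for any particular site):

```text
# Default group: applies to all crawlers not matched by a more specific group
User-agent: *
Disallow: /admin/
# Allow overrides the broader Disallow because its path is more specific
Allow: /admin/help/

# Sitemap must be an absolute URL; it applies to all crawlers
Sitemap: https://example.com/sitemap.xml
```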

Critical: Do Not Block Rendering Resources

  • Do not block CSS, JS, images; Google needs them to render pages
  • Only block paths that don't need crawling: admin, API, temp files
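A sketch of safe blocking, assuming hypothetical /admin/, /api/, and /tmp/ paths; note that no CSS, JS, or image directories are disallowed:

```text
User-agent: *
Disallow: /admin/   # back office
Disallow: /api/     # internal API endpoints
Disallow: /tmp/     # temporary files
# Do NOT add rules like "Disallow: /assets/" here:
# Google must fetch CSS/JS to render and evaluate pages
```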

AI Crawler Strategy

| User-agent | Purpose | Typical setting |
| --- | --- | --- |
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl | Disallow |
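The table above translates into per-agent groups. A common pattern, assuming the goal is AI search visibility without training use (RFC 9309 permits multiple User-agent lines per group):

```text
# AI search crawlers: allowed (an empty Disallow permits everything)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow:

# Training-data crawlers: blocked site-wide
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /
```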

Clean-param (Yandex)

Tell Yandex to ignore common tracking parameters so they do not generate duplicate crawl URLs:

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid

Output Format

  • Current state (if auditing)
  • Recommended robots.txt (full file)
  • Compliance checklist
  • References: Google's robots.txt documentation

Related Skills

  • xml-sitemap: Sitemap URL to reference in robots.txt
  • site-crawlability: Broader crawl and structure guidance