geo-crawlers

AI Crawler Access Analysis Skill
Purpose
This skill analyzes a website's accessibility to AI crawlers -- the bots that AI companies use to discover, index, and train on web content. If AI crawlers are blocked, the site's content cannot appear in AI-generated responses regardless of its quality. Crawler access is the foundational technical requirement for GEO.
Key Insight
As of early 2026, many websites inadvertently block AI crawlers through overly aggressive robots.txt rules, inherited from legacy SEO configurations. An Originality.ai 2025 study found that over 35% of the top 1,000 websites block at least one major AI crawler, and 5-10% block all AI crawlers. Blocking AI crawlers is the single fastest way to become invisible in AI-generated search results.
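To illustrate how this happens in practice, here is a hypothetical legacy configuration of the kind described above: it allow-lists the traditional search bots and denies everything else, so every AI crawler falls through to the wildcard block.

```
# Hypothetical legacy SEO configuration: only named search bots may crawl
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Everything else -- including GPTBot, ClaudeBot, and PerplexityBot,
# which are not named above -- falls through to this wildcard block
User-agent: *
Disallow: /
```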
Complete AI Crawler Reference
Tier 1: Critical for AI Search Visibility (RECOMMEND: ALLOW)
These crawlers power the AI search products where users actively look for answers. Blocking them directly reduces your visibility in AI-generated responses.
GPTBot
- Operator: OpenAI
- User-Agent: `GPTBot`
- Full User-Agent String: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)`
- Purpose: Fetches content for ChatGPT's web browsing, plugins, and search features. Content accessed by GPTBot may be used to improve OpenAI models.
- Impact of Blocking: Content will NOT appear in ChatGPT Search results or be accessible when users ask ChatGPT to browse the web. This is the highest-impact AI crawler to allow.
- Recommendation: ALLOW -- ChatGPT has 300M+ weekly active users as of 2025. Blocking GPTBot removes your content from one of the largest AI search surfaces.
OAI-SearchBot
- Operator: OpenAI
- User-Agent: `OAI-SearchBot`
- Full User-Agent String: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://docs.openai.com/bots/overview)`
- Purpose: Specifically powers ChatGPT's search feature. Unlike GPTBot, content accessed by OAI-SearchBot is NOT used for model training -- only for live search results.
- Impact of Blocking: Content will not appear in ChatGPT's search results even if GPTBot is allowed.
- Recommendation: ALLOW -- This is a search-only crawler with no training implications. There is no strategic reason to block it.
ChatGPT-User
- Operator: OpenAI
- User-Agent: `ChatGPT-User`
- Full User-Agent String: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)`
- Purpose: Used when a ChatGPT user explicitly asks the model to visit a specific URL. Acts like a browser agent on behalf of the user.
- Impact of Blocking: ChatGPT cannot visit your pages when users ask it to read or summarize them. This prevents direct user-initiated traffic.
- Recommendation: ALLOW -- Blocking this bot prevents users who are actively trying to engage with your content from accessing it through ChatGPT.
ClaudeBot
- Operator: Anthropic
- User-Agent: `ClaudeBot`
- Full User-Agent String: `ClaudeBot/1.0; +https://www.anthropic.com/claude-bot`
- Purpose: Fetches web content for Claude's features including web search, citations, and analysis tools.
- Impact of Blocking: Content will not be accessible to Claude for web search or when users ask Claude to analyze specific URLs.
- Recommendation: ALLOW -- Claude is a major AI assistant with growing market share. Blocking ClaudeBot reduces your AI search footprint.
PerplexityBot
- Operator: Perplexity AI
- User-Agent: `PerplexityBot`
- Full User-Agent String: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`
- Purpose: Powers Perplexity's AI search engine, which provides sourced answers with direct citations and links back to source pages.
- Impact of Blocking: Content will not appear in Perplexity search results. Perplexity is one of the best referral traffic sources among AI search products because it always displays source links.
- Recommendation: ALLOW -- Perplexity drives actual referral traffic and always attributes sources. High-value AI crawler for publishers and businesses.
Tier 2: Important for Broader AI Ecosystem (RECOMMEND: ALLOW)
These crawlers serve large AI platforms or search ecosystems. Allowing them increases your content's reach.
Google-Extended
- Operator: Google
- User-Agent: `Google-Extended`
- Purpose: Controls whether Google uses your content for Gemini model training and AI Overviews improvement. CRITICAL NOTE: Blocking Google-Extended does NOT affect your Google Search rankings or your appearance in Google Search results. That is controlled by the standard Googlebot.
- Impact of Blocking: Content may not be used for Gemini training or to improve AI Overviews. However, your content can still appear in AI Overviews based on standard search indexing.
- Recommendation: ALLOW -- Blocking provides minimal content-protection upside while reducing your presence in Google's AI features. Since it does not affect standard search ranking, the only reason to block is a philosophical objection to training data usage.
GoogleOther
- Operator: Google
- User-Agent: `GoogleOther`
- Purpose: Used by Google for various non-search-ranking purposes including research, one-off crawls, and AI-related data collection.
- Impact of Blocking: Minimal impact on search rankings. May reduce presence in Google's AI research and experimental features.
- Recommendation: ALLOW -- Low risk, moderate potential benefit for AI feature inclusion.
Applebot-Extended
- Operator: Apple
- User-Agent: `Applebot-Extended`
- Purpose: Used by Apple to train and improve Apple Intelligence features, Siri, and Apple's AI products. Separate from standard Applebot (which powers Siri search and Spotlight Suggestions).
- Impact of Blocking: Content may not be used in Apple Intelligence features. Standard Siri and Spotlight functionality is unaffected (controlled by Applebot).
- Recommendation: ALLOW -- Apple Intelligence is integrated into all Apple devices (2B+ active devices). Presence in Apple's AI features has growing strategic value.
Amazonbot
- Operator: Amazon
- User-Agent: `Amazonbot`
- Full User-Agent String: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)`
- Purpose: Indexes content for Alexa answers and Amazon's AI features.
- Impact of Blocking: Content will not appear in Alexa voice responses or Amazon's AI-powered search features.
- Recommendation: ALLOW -- Relevant for voice search optimization. Lower priority than Tier 1 crawlers but no downside to allowing.
FacebookBot
- Operator: Meta
- User-Agent: `FacebookBot`
- Purpose: Used by Meta for AI features across Facebook, Instagram, WhatsApp, and the Meta AI assistant.
- Impact of Blocking: Content may not be accessible to Meta AI. Link previews on Facebook/Instagram are handled by a different crawler and are unaffected.
- Recommendation: ALLOW -- Meta AI is embedded in apps with 3B+ combined users. Growing importance for AI visibility.
Tier 3: Training-Only Crawlers (ALLOW or BLOCK Based on Strategy)
These crawlers are primarily used for AI model training rather than live search features. Blocking them does not affect AI search visibility.
CCBot
- Operator: Common Crawl (nonprofit)
- User-Agent: `CCBot`
- Full User-Agent String: `CCBot/2.0 (https://commoncrawl.org/faq/)`
- Purpose: Builds the Common Crawl dataset, which is used as training data by many AI companies (Google, Meta, Stability AI, and others).
- Impact of Blocking: Content will not appear in future Common Crawl datasets. Does NOT affect any live AI search product.
- Recommendation: CONTEXT-DEPENDENT -- Allow if you want maximum long-term AI training presence. Block if you want to control training data usage. No impact on search visibility.
anthropic-ai
- Operator: Anthropic
- User-Agent: `anthropic-ai`
- Purpose: Used by Anthropic for AI safety research and Claude model training. Separate from ClaudeBot (which powers live features).
- Impact of Blocking: Content will not be used for Claude training. Does NOT affect Claude's live search or web browsing features (controlled by ClaudeBot).
- Recommendation: CONTEXT-DEPENDENT -- Similar to CCBot. Allow for training presence, block for training data control. No impact on live AI search.
Bytespider
- Operator: ByteDance
- User-Agent: `Bytespider`
- Purpose: Used by ByteDance for various AI products including TikTok's AI features and Doubao (their ChatGPT competitor in China).
- Impact of Blocking: Content will not be used for ByteDance AI products. Minimal impact for Western-market businesses.
- Recommendation: BLOCK for most Western businesses (aggressive crawling behavior reported, minimal search visibility benefit). ALLOW if targeting Chinese/Asian markets.
cohere-ai
- Operator: Cohere
- User-Agent: `cohere-ai`
- Purpose: Used by Cohere for model training. Cohere powers enterprise AI solutions and the Coral chat product.
- Impact of Blocking: Content will not be used for Cohere model training. Minimal direct consumer-facing impact.
- Recommendation: CONTEXT-DEPENDENT -- Low priority. Allow or block based on general training data stance.
Recommendation Matrix Summary
| Crawler | Tier | Recommendation | Reason |
|---|---|---|---|
| GPTBot | 1 | ALLOW | Powers ChatGPT Search (300M+ users) |
| OAI-SearchBot | 1 | ALLOW | Search-only, no training use |
| ChatGPT-User | 1 | ALLOW | User-initiated browsing |
| ClaudeBot | 1 | ALLOW | Claude web search and analysis |
| PerplexityBot | 1 | ALLOW | Best referral traffic AI search |
| Google-Extended | 2 | ALLOW | Gemini features; no search rank impact |
| GoogleOther | 2 | ALLOW | Google AI research |
| Applebot-Extended | 2 | ALLOW | Apple Intelligence (2B+ devices) |
| Amazonbot | 2 | ALLOW | Alexa and Amazon AI |
| FacebookBot | 2 | ALLOW | Meta AI (3B+ app users) |
| CCBot | 3 | Context | Training data only |
| anthropic-ai | 3 | Context | Training data only |
| Bytespider | 3 | BLOCK | Aggressive crawler, low benefit |
| cohere-ai | 3 | Context | Training data only |
Maximum AI Visibility Configuration (robots.txt)
For sites wanting maximum AI search visibility:
```
# AI Crawlers - ALLOWED for AI search visibility
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: FacebookBot
Allow: /

# AI Crawlers - BLOCKED (aggressive/low value)
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
```
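As a quick sanity check that a configuration like the one above does what you intend, Python's standard-library `urllib.robotparser` can evaluate it per crawler before deployment. A minimal sketch (the abbreviated robots.txt body and the example.com URL are illustrative):

```python
from urllib.robotparser import RobotFileParser

# An abbreviated version of the maximum-visibility configuration above.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
"""

def crawler_allowed(robots_txt: str, user_agent: str,
                    url: str = "https://example.com/") -> bool:
    """True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

for bot in ("GPTBot", "Bytespider", "CCBot", "PerplexityBot"):
    status = "Allowed" if crawler_allowed(ROBOTS_TXT, bot) else "Blocked"
    print(f"{bot}: {status}")
```

Note that a crawler with no matching group and no `User-agent: *` entry defaults to allowed, which is why PerplexityBot reports Allowed here even though it is not listed.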
Analysis Procedure
Step 1: Fetch and Parse robots.txt
- Use WebFetch to retrieve `[domain]/robots.txt`.
- Parse all User-agent directives and their associated Allow/Disallow rules.
- For each AI crawler in the reference list above:
  - Check if there is a specific User-agent block for that crawler
  - Check if there is a wildcard (`User-agent: *`) block that would apply
  - Determine effective access: Allowed, Blocked, or Not Mentioned (inherits wildcard rules)
- Note any `Crawl-delay` directives that may slow AI crawler access.
- Check for `Sitemap` directives (AI crawlers use these for discovery).
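The per-crawler classification in this step can be sketched in Python. This is a deliberately simplified parser: it only recognizes blanket `Allow: /` and `Disallow: /` rules (no path patterns or `Crawl-delay`), and the sample robots.txt is hypothetical:

```python
def parse_robots(robots_txt):
    """Map each lowercased user-agent token to its (directive, path) rules."""
    rules, agents, in_group = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()    # strip comments
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_group:                       # a rule line ended the last group
                agents, in_group = [], False
            agents.append(value.lower())
            rules.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_group = True
            for agent in agents:
                rules[agent].append((field, value))
    return rules

def effective_access(robots_txt, crawler):
    """Allowed / Blocked / Not Mentioned, considering blanket rules only."""
    rules = parse_robots(robots_txt)
    group = rules.get(crawler.lower())
    if group is None:                          # inherits wildcard rules
        wildcard = rules.get("*", [])
        blocked = ("disallow", "/") in wildcard
        return "Not Mentioned (wildcard blocks)" if blocked else "Not Mentioned (allowed)"
    if ("allow", "/") in group:
        return "Allowed"
    return "Blocked" if ("disallow", "/") in group else "Allowed"

SAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

for bot in ("GPTBot", "Bytespider", "ClaudeBot"):
    print(bot, "->", effective_access(SAMPLE, bot))
```

Here ClaudeBot reports "Not Mentioned (allowed)" because the wildcard group only restricts `/admin/`, not the whole site.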
Step 2: Check Meta Robots Tags
- For a sample of 5-10 key pages, fetch the HTML and check for:
  - `<meta name="robots" content="noindex">` -- blocks all bots
  - `<meta name="robots" content="nofollow">` -- prevents link following
  - `<meta name="robots" content="noai">` -- emerging tag to block AI use
  - `<meta name="robots" content="noimageai">` -- blocks AI image training
  - Bot-specific meta tags: `<meta name="GPTBot" content="noindex">`
- Record any page-level overrides of the robots.txt directives.
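The meta-tag checks in this step can be automated with the standard-library `html.parser`; a sketch (the watched-name list and the sample HTML are illustrative):

```python
from html.parser import HTMLParser

# Meta names to inspect: the generic "robots" name plus bot-specific names.
WATCHED_NAMES = {"robots", "gptbot", "claudebot", "perplexitybot"}

class RobotsMetaScanner(HTMLParser):
    """Collect robots-related <meta> directives from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.directives = []               # (meta name, content) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in WATCHED_NAMES:
            self.directives.append((name, (attr.get("content") or "").lower()))

SAMPLE_HTML = (
    '<html><head>'
    '<meta name="robots" content="noai, noimageai">'
    '<meta name="GPTBot" content="noindex">'
    '</head><body>...</body></html>'
)

scanner = RobotsMetaScanner()
scanner.feed(SAMPLE_HTML)
print(scanner.directives)
# → [('robots', 'noai, noimageai'), ('gptbot', 'noindex')]
```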
Step 3: Check HTTP Headers
- For the same sample pages, check response headers for:
  - `X-Robots-Tag: noindex` -- HTTP header equivalent of meta noindex
  - `X-Robots-Tag: noai` -- HTTP header to block AI use
  - `X-Robots-Tag: noimageai` -- blocks AI image training
  - Bot-specific headers: `X-Robots-Tag: GPTBot: noindex`
- Note that HTTP headers override meta tags and apply to non-HTML resources too.
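Header inspection can be sketched the same way. The function below works on any header mapping, so it can be fed from a live response (e.g. `urllib.request.urlopen(url).headers.items()`) or, as here, from a hypothetical sample:

```python
AI_RESTRICTING_TOKENS = ("noindex", "noai", "noimageai")

def ai_blocking_headers(headers):
    """Return the X-Robots-Tag restriction tokens found in `headers`,
    a mapping of response header names to values.

    Note the substring check is a simplification: a value containing
    only "noimageai" would also report "noai"."""
    findings = []
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for token in AI_RESTRICTING_TOKENS:
            if token in value.lower():
                findings.append(token)
    return findings

sample_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Robots-Tag": "noai, noimageai",
}
print(ai_blocking_headers(sample_headers))
# → ['noai', 'noimageai']
```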
Step 4: Check for AI-Specific Files
- Check for `/llms.txt` (emerging standard for AI crawler guidance).
- Check for `/.well-known/ai-plugin.json` (OpenAI plugin manifest).
- Check for `/ai.txt` (proposed standard, similar to ads.txt for AI).
- Record presence/absence and quality of each file.
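The file checks reduce to a probe over the three paths. To keep the example self-contained it takes the HTTP fetcher as a parameter (in practice, a thin wrapper around `urllib.request` issuing HEAD requests); the stub fetcher below is hypothetical:

```python
AI_FILES = ("/llms.txt", "/.well-known/ai-plugin.json", "/ai.txt")

def probe_ai_files(domain, fetch_status):
    """Map each AI-specific path to Present/Absent.
    `fetch_status` is any callable taking a URL and returning an HTTP
    status code."""
    return {
        path: "Present" if fetch_status(f"https://{domain}{path}") == 200 else "Absent"
        for path in AI_FILES
    }

# Hypothetical stub: pretend the site serves only /llms.txt.
stub = lambda url: 200 if url.endswith("/llms.txt") else 404
print(probe_ai_files("example.com", stub))
# → {'/llms.txt': 'Present', '/.well-known/ai-plugin.json': 'Absent', '/ai.txt': 'Absent'}
```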
Step 5: Assess JavaScript Rendering Requirements
- Check if the site is a Single Page Application (SPA) or heavily JavaScript-rendered.
- AI crawlers vary in their JavaScript rendering capabilities:
- GPTBot: Limited JS rendering
- ClaudeBot: Limited JS rendering
- PerplexityBot: Limited JS rendering
- Googlebot: Full JS rendering (but Google-Extended inherits this)
- If critical content requires JS rendering, flag this as a potential issue.
- Check for Server-Side Rendering (SSR) or Static Site Generation (SSG) as mitigations.
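A rough first pass on this step: if a page ships an empty application-root container and almost no server-rendered text, its content probably only exists after JavaScript runs. A crude heuristic sketch (the character threshold and the id list are arbitrary assumptions, not a standard):

```python
import re

def likely_requires_js(html, min_visible_chars=200):
    """Crude SPA heuristic: an app-root div plus very little
    server-rendered text suggests content is filled in client-side."""
    has_app_root = bool(re.search(r'id=["\'](?:root|app|__next)["\']', html))
    no_scripts = re.sub(r"<script\b.*?</script>", "", html, flags=re.S | re.I)
    visible = re.sub(r"<[^>]+>", " ", no_scripts)      # drop remaining tags
    visible_chars = len(re.sub(r"\s+", "", visible))
    return has_app_root and visible_chars < min_visible_chars

spa_page = '<html><body><div id="root"></div><script>/*bundle*/</script></body></html>'
ssr_page = ("<html><body><article>"
            + "Full server-rendered paragraph text. " * 20
            + "</article></body></html>")
print(likely_requires_js(spa_page))   # → True
print(likely_requires_js(ssr_page))   # → False
```

A real assessment should still compare the raw HTML against a rendered snapshot, since many frameworks hydrate server-rendered markup inside an app-root div.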
Output Format
Generate a file called `GEO-CRAWLER-ACCESS.md`:

```markdown
# AI Crawler Access Report: [Domain]

Analysis Date: [Date]
Domain: [Domain]
robots.txt Status: [Found/Not Found/Error]

## Crawler Access Summary

| Crawler | Operator | Tier | Status | Impact |
|---|---|---|---|---|
| GPTBot | OpenAI | 1 | [Allowed/Blocked/Not Mentioned] | [Impact description] |
| OAI-SearchBot | OpenAI | 1 | [Status] | [Impact] |
| ChatGPT-User | OpenAI | 1 | [Status] | [Impact] |
| ClaudeBot | Anthropic | 1 | [Status] | [Impact] |
| PerplexityBot | Perplexity | 1 | [Status] | [Impact] |
| Google-Extended | Google | 2 | [Status] | [Impact] |
| GoogleOther | Google | 2 | [Status] | [Impact] |
| Applebot-Extended | Apple | 2 | [Status] | [Impact] |
| Amazonbot | Amazon | 2 | [Status] | [Impact] |
| FacebookBot | Meta | 2 | [Status] | [Impact] |
| CCBot | Common Crawl | 3 | [Status] | [Impact] |
| anthropic-ai | Anthropic | 3 | [Status] | [Impact] |
| Bytespider | ByteDance | 3 | [Status] | [Impact] |
| cohere-ai | Cohere | 3 | [Status] | [Impact] |

## AI Visibility Score: [X]/100

Tier 1 Access: [X/5 crawlers allowed]
Tier 2 Access: [X/5 crawlers allowed]
Tier 3 Access: [X/4 crawlers allowed]

## Critical Issues

[List any Tier 1 crawlers that are blocked]

## Recommendations

### Immediate Actions

[Specific robots.txt changes needed]

### robots.txt Recommendation

[Complete recommended robots.txt content for AI crawlers]

## Additional Technical Findings

- Meta Robots Tags: [Findings]
- X-Robots-Tag Headers: [Findings]
- JavaScript Rendering: [Assessment]
- llms.txt: [Present/Absent]
- Sitemap Accessibility: [Assessment]
```
Scoring for Crawler Access
The AI Crawler Access Score is calculated as:
| Component | Weight | Scoring |
|---|---|---|
| Tier 1 Crawlers Allowed | 50% | 20 points per Tier 1 crawler allowed (5 crawlers = 100 points max, scaled to 50) |
| Tier 2 Crawlers Allowed | 25% | 20 points per Tier 2 crawler allowed (5 crawlers = 100 points max, scaled to 25) |
| No Blanket AI Blocks | 15% | Full points if no wildcard (`User-agent: *`) `Disallow: /` rule blocks AI crawlers by default |
| AI-Specific Files Present | 10% | 5 points for llms.txt, 5 points for sitemap accessible to AI crawlers |
Final score = sum of all weighted components, capped at 100.
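The weighting table above reduces to a short function. A sketch (boolean inputs for the last three components, matching the table's point values):

```python
def crawler_access_score(tier1_allowed, tier2_allowed, no_blanket_block,
                         has_llms_txt, sitemap_accessible):
    """AI Crawler Access Score per the component weights above.
    `tier1_allowed` / `tier2_allowed` count how many of the five
    crawlers in each tier are allowed."""
    score = (tier1_allowed / 5) * 50          # Tier 1 crawlers: 50%
    score += (tier2_allowed / 5) * 25         # Tier 2 crawlers: 25%
    score += 15 if no_blanket_block else 0    # no blanket AI block: 15%
    score += 5 if has_llms_txt else 0         # AI-specific files: 10% total
    score += 5 if sitemap_accessible else 0
    return min(round(score), 100)             # capped at 100

print(crawler_access_score(5, 5, True, True, True))   # → 100
print(crawler_access_score(3, 5, True, False, True))  # → 75
```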