xcrawl-crawl

XCrawl Crawl

Overview

This skill orchestrates full-site or scoped crawling with XCrawl Crawl APIs. Default behavior is raw passthrough: return upstream API response bodies as-is.

Required Local Config

Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.

Path: `~/.xcrawl/config.json`

```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```

Read the API key from the local config file only; do not require global environment variables.
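A minimal sketch for creating this file from the shell (the key is a placeholder to substitute; the `chmod` is an added precaution, not a requirement stated above):

```shell
# Create the local config directory and write the key file
mkdir -p ~/.xcrawl
cat > ~/.xcrawl/config.json <<'EOF'
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
EOF
# Restrict permissions, since the file holds a credential
chmod 600 ~/.xcrawl/config.json
```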

Credits and Account Setup

Using XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/. After registration, they can activate the free 1000-credit plan before running requests.

Tool Permission Policy

Request runtime permissions for `curl` and `node` only. Do not request Python, shell helper scripts, or other runtime permissions.

API Surface

  • Start crawl: `POST /v1/crawl`
  • Read result: `GET /v1/crawl/{crawl_id}`
  • Base URL: `https://run.xcrawl.com`
  • Required header: `Authorization: Bearer <XCRAWL_API_KEY>`

Usage Examples

cURL (create + result)

```bash
# Read the API key from the local config file
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

# Start the crawl
CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","crawler":{"limit":100,"max_depth":2},"output":{"formats":["markdown","links"]}}')"

echo "$CREATE_RESP"

# Extract crawl_id from the create response
CRAWL_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.crawl_id||"")' "$CREATE_RESP")"

# Read the result
curl -sS -X GET "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```

Node

```bash
node -e '
const fs=require("fs");
// Read the API key from the local config file
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
// Scoped crawl: docs pages only, Japanese locale, three output formats
const body={url:"https://example.com",crawler:{limit:300,max_depth:3,include:["/docs/.*"],exclude:["/blog/.*"]},request:{locale:"ja-JP"},output:{formats:["markdown","links","json"]}};
fetch("https://run.xcrawl.com/v1/crawl",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```

Request Parameters

Request endpoint and headers

  • Endpoint: `POST https://run.xcrawl.com/v1/crawl`
  • Headers:
    • `Content-Type: application/json`
    • `Authorization: Bearer <api_key>`

Request body: top-level fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | Yes | - | Site entry URL |
| `crawler` | object | No | - | Crawler config |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async callback config |

crawler

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `limit` | integer | No | `100` | Max pages |
| `include` | string[] | No | - | Include only matching URLs (regex supported) |
| `exclude` | string[] | No | - | Exclude matching URLs (regex supported) |
| `max_depth` | integer | No | `3` | Max depth from entry URL |
| `include_entire_domain` | boolean | No | `false` | Crawl the full site instead of only subpaths |
| `include_subdomains` | boolean | No | `false` | Include subdomains |
| `include_external_links` | boolean | No | `false` | Include external links |
| `sitemaps` | boolean | No | `true` | Use the site sitemap |
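As a sketch of how these fields combine, the following builds a bounded, docs-only crawl payload; the entry URL and path patterns are illustrative, not defaults:

```shell
# Build a scoped crawl payload with node and print it for inspection
PAYLOAD="$(node -e '
const body = {
  url: "https://example.com/docs/",
  crawler: {
    limit: 50,                        // stay under the 100-page default
    max_depth: 2,                     // shallower than the default 3
    include: ["/docs/.*"],            // regex: docs pages only
    exclude: ["/docs/archive/.*"],    // skip archived pages
    sitemaps: true                    // discover pages via the sitemap
  }
};
process.stdout.write(JSON.stringify(body));
')"
echo "$PAYLOAD"
```

The resulting string can be passed directly as the `-d` body of the cURL create call.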

proxy

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `location` | string | No | `US` | ISO 3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; requests with the same ID attempt to reuse the same exit node |

request

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `locale` | string | No | `en-US,en;q=0.9` | Affects the `Accept-Language` header |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects user agent and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |

js_render

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop `1080`, mobile `874`) |

output

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |

`output.formats` enum:
  • `html`
  • `raw_html`
  • `markdown`
  • `links`
  • `summary`
  • `screenshot`
  • `json`
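As a sketch of structured extraction, the payload below requests both `markdown` and schema-guided `json` output; the prompt wording and schema fields (`title`, `summary`) are illustrative assumptions, not part of the API:

```shell
# Build an output config combining markdown with JSON extraction
PAYLOAD="$(node -e '
const body = {
  url: "https://example.com/docs/",
  output: {
    formats: ["markdown", "json"],   // json enables prompt/schema extraction
    json: {
      prompt: "Extract the page title and a one-line summary.",
      json_schema: {
        type: "object",
        properties: {
          title:   { type: "string" },
          summary: { type: "string" }
        },
        required: ["title", "summary"]
      }
    }
  }
};
process.stdout.write(JSON.stringify(body));
')"
echo "$PAYLOAD"
```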

webhook

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |
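None of the examples above exercise the webhook block, so as a sketch, this payload registers a callback for only the terminal events; the receiver URL and token header are hypothetical:

```shell
# Build a crawl payload that reports completion to a callback URL
PAYLOAD="$(node -e '
const body = {
  url: "https://example.com",
  webhook: {
    url: "https://hooks.example.net/xcrawl",    // hypothetical receiver
    headers: { "X-Callback-Token": "<your_token>" },
    events: ["completed", "failed"]             // skip the started event
  }
};
process.stdout.write(JSON.stringify(body));
')"
echo "$PAYLOAD"
```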

Response Parameters

Create response (`POST /v1/crawl`)

| Field | Type | Description |
| --- | --- | --- |
| `crawl_id` | string | Task ID |
| `endpoint` | string | Always `crawl` |
| `version` | string | Version |
| `status` | string | Always `pending` |

Result response (`GET /v1/crawl/{crawl_id}`)

| Field | Type | Description |
| --- | --- | --- |
| `crawl_id` | string | Task ID |
| `endpoint` | string | Always `crawl` |
| `version` | string | Version |
| `status` | string | `pending` / `crawling` / `completed` / `failed` |
| `url` | string | Entry URL |
| `data` | object[] | Per-page result array |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |

`data[]` fields follow `output.formats`:
  • `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
  • `metadata` (page metadata)
  • `traffic_bytes` (traffic in bytes)
  • `credits_used` (credits used for the page)
  • `credits_detail` (credit usage breakdown)
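Once a result body is in hand, per-page fields can be tallied locally. This sketch sums credits over a hypothetical completed response; the sample data is invented for illustration:

```shell
# Sum per-page credits from a result body and report basic stats
RESULT='{"crawl_id":"demo","status":"completed","data":[{"markdown":"# A","credits_used":2},{"markdown":"# B","credits_used":3}],"total_credits_used":5}'
node -e '
const j = JSON.parse(process.argv[1]);
// Add up credits_used across the per-page result array
const perPage = (j.data || []).reduce((sum, page) => sum + (page.credits_used || 0), 0);
console.log(`status=${j.status} pages=${(j.data || []).length} credits=${perPage}`);
' "$RESULT"
```

For the sample above this prints `status=completed pages=2 credits=5`, matching `total_credits_used`.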

Workflow

  1. Confirm the business objective and crawl boundary.
     • What content is required, what must be excluded, and what is the completion signal.
  2. Draft a bounded crawl request.
     • Prefer explicit limits and path constraints.
  3. Start the crawl and capture task metadata.
     • Record `crawl_id`, the initial status, and the request payload.
  4. Poll `GET /v1/crawl/{crawl_id}` until a terminal state.
     • Track `pending`, `crawling`, `completed`, or `failed`.
  5. Return the raw create/result responses.
     • Do not synthesize derived summaries unless explicitly requested.
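The polling step can be sketched as a small shell helper. `poll_crawl` is a hypothetical name introduced here; it takes the command that fetches a status body as its arguments, so the real `curl` call (commented below) plugs in unchanged:

```shell
# Poll a status-fetching command until the crawl reaches a terminal state
poll_crawl() {
  while :; do
    RESP="$("$@")"   # run the supplied fetch command
    STATE="$(node -e 'try{process.stdout.write(JSON.parse(process.argv[1]).status||"")}catch(e){}' "$RESP")"
    case "$STATE" in
      completed|failed) break ;;   # terminal states
    esac
    sleep 5                        # back off between polls
  done
  printf '%s\n' "$RESP"            # emit the final raw body
}

# Real usage, given API_KEY and CRAWL_ID from the create step:
# poll_crawl curl -sS "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" -H "Authorization: Bearer ${API_KEY}"
```

Keeping the fetch command injectable also makes the loop easy to exercise against a canned status body before spending credits.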

Output Contract

Return:
  • Endpoint flow (`POST /v1/crawl` + `GET /v1/crawl/{crawl_id}`)
  • The `request_payload` used for the create request
  • Raw response body from the create call
  • Raw response body from the result call
  • Error details when a request fails

Do not generate summaries unless the user explicitly requests one.

Guardrails

  • Never run an unbounded crawl without explicit constraints.
  • Do not present speculative page counts as final coverage.
  • Do not hardcode provider-specific tool schemas in core logic.
  • Highlight policy, legal, or website-usage risks when relevant.