# xcrawl-crawl
## Overview

This skill orchestrates full-site or scoped crawling with the XCrawl Crawl APIs. The default behavior is raw passthrough: return upstream API response bodies as-is.
## Required Local Config

Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.

Path: `~/.xcrawl/config.json`

```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```

Read the API key from the local config file only. Do not require global environment variables.
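One way to create this file from a POSIX shell (a sketch; substitute a real key for the placeholder):

```shell
# Create the config directory and write the key file.
mkdir -p "$HOME/.xcrawl"
cat > "$HOME/.xcrawl/config.json" <<'EOF'
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
EOF
# Restrict the file to the current user, since it holds a credential.
chmod 600 "$HOME/.xcrawl/config.json"
```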
## Credits and Account Setup

Using the XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/. After registration, they can activate the free 1000-credit plan before running requests.

## Tool Permission Policy
Request runtime permissions for `curl` and `node` only. Do not request Python, shell helper scripts, or other runtime permissions.

## API Surface
- Start crawl: `POST /v1/crawl`
- Read result: `GET /v1/crawl/{crawl_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`
## Usage Examples

### cURL (create + result)

```bash
# Read the API key from the local config file.
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

# Start a bounded crawl.
CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","crawler":{"limit":100,"max_depth":2},"output":{"formats":["markdown","links"]}}')"
echo "$CREATE_RESP"

# Extract the task ID and read the result.
CRAWL_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.crawl_id||"")' "$CREATE_RESP")"
curl -sS -X GET "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```
### Node

```bash
node -e '
const fs=require("fs");
// Read the API key from the local config file.
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
// Scoped crawl: cap pages and depth, include docs paths only, exclude the blog.
const body={url:"https://example.com",crawler:{limit:300,max_depth:3,include:["/docs/.*"],exclude:["/blog/.*"]},request:{locale:"ja-JP"},output:{formats:["markdown","links","json"]}};
fetch("https://run.xcrawl.com/v1/crawl",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```

## Request Parameters
### Request endpoint and headers

- Endpoint: `POST https://run.xcrawl.com/v1/crawl`
- Headers: `Content-Type: application/json`, `Authorization: Bearer <api_key>`
### Request body: top-level fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Site entry URL |
| `crawler` | object | No | - | Crawler config |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async callback config |
### crawler

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `limit` | integer | No | | Max pages |
| `include` | string[] | No | - | Include only matching URLs (regex supported) |
| `exclude` | string[] | No | - | Exclude matching URLs (regex supported) |
| `max_depth` | integer | No | | Max depth from entry URL |
| | boolean | No | | Crawl full site instead of only subpaths |
| | boolean | No | | Include subdomains |
| | boolean | No | | Include external links |
| | boolean | No | | Use site sitemap |
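Combining the crawler fields above, a bounded request body might look like the following sketch (values are illustrative; field names come from the table and the usage examples):

```shell
# A bounded crawl body: cap the page count, limit depth, and constrain paths.
BODY='{
  "url": "https://example.com",
  "crawler": {
    "limit": 50,
    "max_depth": 2,
    "include": ["/docs/.*"],
    "exclude": ["/blog/.*"]
  },
  "output": {"formats": ["markdown", "links"]}
}'
# Sanity-check that the body parses as JSON before sending it.
node -e 'JSON.parse(process.argv[1]); console.log("valid JSON")' "$BODY"
```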
### proxy

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | | ISO-3166-1 alpha-2 country code |
| | string | No | Auto-generated | Sticky session ID; the same ID attempts to reuse the same exit node |
### request

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | | Affects |
| | string | No | | |
| | object map | No | - | Cookie key/value pairs |
| | object map | No | - | Header key/value pairs |
| | boolean | No | | Return main content only |
| | boolean | No | | Attempt to block ad resources |
| | boolean | No | | Skip TLS verification |
### js_render

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| | boolean | No | | Enable browser rendering |
| | string | No | | |
| | integer | No | - | Viewport width |
| | integer | No | - | Viewport height |
### output

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `formats` | string[] | No | | Output formats |
| | string | No | | |
| | string | No | - | Extraction prompt |
| | object | No | - | JSON Schema |

Supported `output.formats` values: `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`.
### webhook

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| | string | No | - | Callback URL |
| | object map | No | - | Custom callback headers |
| | string[] | No | | Events |
## Response Parameters

### Create response (`POST /v1/crawl`)

| Field | Type | Description |
|---|---|---|
| `crawl_id` | string | Task ID |
| | string | Always |
| | string | Version |
| | string | Always |
### Result response (`GET /v1/crawl/{crawl_id}`)

| Field | Type | Description |
|---|---|---|
| `crawl_id` | string | Task ID |
| | string | Always |
| | string | Version |
| | string | |
| | string | Entry URL |
| `data` | object[] | Per-page result array |
| | string | Start time (ISO 8601) |
| | string | End time (ISO 8601) |
| | integer | Total credits used |

Depending on the requested `output.formats`, each item in `data[]` may contain `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, and `json`, plus:

- `metadata` (page metadata)
- `traffic_bytes` (traffic in bytes)
- `credits_used` (credits used for the page)
- `credits_detail` (credit usage breakdown)
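As a local sketch of handling a result body, the snippet below parses a hypothetical completed response. The `status` field name and all sample values are assumptions; `crawl_id` and `data` match the tables above.

```shell
# Hypothetical result body, shaped per the tables above.
RESULT_BODY='{"crawl_id":"abc123","status":"completed","data":[{"markdown":"# Example","metadata":{"url":"https://example.com"}}]}'

# Print the task state, then the markdown of each page that produced one.
node -e '
const j = JSON.parse(process.argv[1]);
console.log(j.status);
for (const page of j.data || []) {
  if (page.markdown) console.log(page.markdown);
}
' "$RESULT_BODY"
```

In practice, run this against the raw body returned by `GET /v1/crawl/{crawl_id}` instead of the sample.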
## Workflow

- Confirm the business objective and crawl boundary.
  - What content is required, what content must be excluded, and what the completion signal is.
- Draft a bounded crawl request.
  - Prefer explicit limits and path constraints.
- Start the crawl and capture task metadata.
  - Record `crawl_id`, the initial status, and the request payload.
- Poll `GET /v1/crawl/{crawl_id}` until a terminal state.
  - Track `pending`, `crawling`, `completed`, or `failed`.
- Return the raw create/result responses.
  - Do not synthesize derived summaries unless explicitly requested.
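The poll-until-terminal step can be sketched as a small loop. Here `fetch_status` is a stand-in for the authenticated `curl` call, and the `status` field name is an assumption:

```shell
# Placeholder for: curl -sS "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" \
#                    -H "Authorization: Bearer ${API_KEY}"
fetch_status() { echo '{"status":"completed"}'; }

# Poll until the task reaches a terminal state, then emit the raw body.
poll_until_done() {
  while true; do
    body="$(fetch_status)"
    state="$(node -e 'process.stdout.write(JSON.parse(process.argv[1]).status||"")' "$body")"
    case "$state" in
      completed|failed) echo "$body"; return 0 ;;
      *) sleep 5 ;;
    esac
  done
}

poll_until_done
```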
## Output Contract

Return:

- Endpoint flow (`POST /v1/crawl` + `GET /v1/crawl/{crawl_id}`)
- The `request_payload` used for the create request
- Raw response body from the create call
- Raw response body from the result call
- Error details when a request fails

Do not generate summaries unless the user explicitly requests a summary.
## Guardrails

- Never run an unbounded crawl without explicit constraints.
- Do not present speculative page counts as final coverage.
- Do not hardcode provider-specific tool schemas in core logic.
- Highlight policy, legal, or website-usage risks when relevant.