xcrawl-crawl

XCrawl Crawl

Overview

This skill orchestrates full-site or scoped crawling with XCrawl Crawl APIs. Default behavior is raw passthrough: return upstream API response bodies as-is.

Required Local Config

Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.

Path: `~/.xcrawl/config.json`

```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```

Read the API key from the local config file only; do not require global environment variables.
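A minimal sketch for creating this file from the shell (the key is a placeholder to substitute; the `chmod` is an added precaution, not a requirement stated above):

```shell
# Create the local config directory and write the key file
mkdir -p ~/.xcrawl
cat > ~/.xcrawl/config.json <<'EOF'
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
EOF
# Restrict permissions, since the file holds a credential
chmod 600 ~/.xcrawl/config.json
```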

Credits and Account Setup

Using XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/. After registration, they can activate the free 1000-credit plan before running requests.

Tool Permission Policy

Request runtime permissions for `curl` and `node` only. Do not request Python, shell helper scripts, or other runtime permissions.

API Surface

  • Start crawl: `POST /v1/crawl`
  • Read result: `GET /v1/crawl/{crawl_id}`
  • Base URL: `https://run.xcrawl.com`
  • Required header: `Authorization: Bearer <XCRAWL_API_KEY>`

Usage Examples

cURL (create + result)

```bash
# Read the API key from the local config file
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

# Start the crawl
CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","crawler":{"limit":100,"max_depth":2},"output":{"formats":["markdown","links"]}}')"

echo "$CREATE_RESP"

# Extract crawl_id from the create response
CRAWL_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.crawl_id||"")' "$CREATE_RESP")"

# Read the result
curl -sS -X GET "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```

Node

```bash
node -e '
const fs=require("fs");
// Read the API key from the local config file
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
// Scoped crawl: docs pages only, Japanese locale, three output formats
const body={url:"https://example.com",crawler:{limit:300,max_depth:3,include:["/docs/.*"],exclude:["/blog/.*"]},request:{locale:"ja-JP"},output:{formats:["markdown","links","json"]}};
fetch("https://run.xcrawl.com/v1/crawl",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```

Request Parameters

Request endpoint and headers

  • Endpoint: `POST https://run.xcrawl.com/v1/crawl`
  • Headers:
    • `Content-Type: application/json`
    • `Authorization: Bearer <api_key>`

Request body: top-level fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | Yes | - | Site entry URL |
| `crawler` | object | No | - | Crawler config |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async callback config |

crawler

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `limit` | integer | No | `100` | Max pages |
| `include` | string[] | No | - | Include only matching URLs (regex supported) |
| `exclude` | string[] | No | - | Exclude matching URLs (regex supported) |
| `max_depth` | integer | No | `3` | Max depth from entry URL |
| `include_entire_domain` | boolean | No | `false` | Crawl the full site instead of only subpaths |
| `include_subdomains` | boolean | No | `false` | Include subdomains |
| `include_external_links` | boolean | No | `false` | Include external links |
| `sitemaps` | boolean | No | `true` | Use the site sitemap |
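As a sketch of how these fields combine, the following builds a bounded, docs-only crawl payload; the entry URL and path patterns are illustrative, not defaults:

```shell
# Build a scoped crawl payload with node and print it for inspection
PAYLOAD="$(node -e '
const body = {
  url: "https://example.com/docs/",
  crawler: {
    limit: 50,                        // stay under the 100-page default
    max_depth: 2,                     // shallower than the default 3
    include: ["/docs/.*"],            // regex: docs pages only
    exclude: ["/docs/archive/.*"],    // skip archived pages
    sitemaps: true                    // discover pages via the sitemap
  }
};
process.stdout.write(JSON.stringify(body));
')"
echo "$PAYLOAD"
```

The resulting string can be passed directly as the `-d` body of the cURL create call.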

proxy

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `location` | string | No | `US` | ISO 3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; requests with the same ID attempt to reuse the same exit node |

request

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `locale` | string | No | `en-US,en;q=0.9` | Affects the `Accept-Language` header |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects user agent and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |

js_render

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop `1080`, mobile `874`) |

output

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |

`output.formats` enum:
  • `html`
  • `raw_html`
  • `markdown`
  • `links`
  • `summary`
  • `screenshot`
  • `json`
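As a sketch of structured extraction, the payload below requests both `markdown` and schema-guided `json` output; the prompt wording and schema fields (`title`, `summary`) are illustrative assumptions, not part of the API:

```shell
# Build an output config combining markdown with JSON extraction
PAYLOAD="$(node -e '
const body = {
  url: "https://example.com/docs/",
  output: {
    formats: ["markdown", "json"],   // json enables prompt/schema extraction
    json: {
      prompt: "Extract the page title and a one-line summary.",
      json_schema: {
        type: "object",
        properties: {
          title:   { type: "string" },
          summary: { type: "string" }
        },
        required: ["title", "summary"]
      }
    }
  }
};
process.stdout.write(JSON.stringify(body));
')"
echo "$PAYLOAD"
```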

webhook

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |
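None of the examples above exercise the webhook block, so as a sketch, this payload registers a callback for only the terminal events; the receiver URL and token header are hypothetical:

```shell
# Build a crawl payload that reports completion to a callback URL
PAYLOAD="$(node -e '
const body = {
  url: "https://example.com",
  webhook: {
    url: "https://hooks.example.net/xcrawl",    // hypothetical receiver
    headers: { "X-Callback-Token": "<your_token>" },
    events: ["completed", "failed"]             // skip the started event
  }
};
process.stdout.write(JSON.stringify(body));
')"
echo "$PAYLOAD"
```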

Response Parameters

Create response (`POST /v1/crawl`)

| Field | Type | Description |
| --- | --- | --- |
| `crawl_id` | string | Task ID |
| `endpoint` | string | Always `crawl` |
| `version` | string | Version |
| `status` | string | Always `pending` |

Result response (`GET /v1/crawl/{crawl_id}`)

| Field | Type | Description |
| --- | --- | --- |
| `crawl_id` | string | Task ID |
| `endpoint` | string | Always `crawl` |
| `version` | string | Version |
| `status` | string | `pending` / `crawling` / `completed` / `failed` |
| `url` | string | Entry URL |
| `data` | object[] | Per-page result array |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |

`data[]` fields follow `output.formats`:
  • `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
  • `metadata` (page metadata)
  • `traffic_bytes` (traffic in bytes)
  • `credits_used` (credits used for the page)
  • `credits_detail` (credit usage breakdown)
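Once a result body is in hand, per-page fields can be tallied locally. This sketch sums credits over a hypothetical completed response; the sample data is invented for illustration:

```shell
# Sum per-page credits from a result body and report basic stats
RESULT='{"crawl_id":"demo","status":"completed","data":[{"markdown":"# A","credits_used":2},{"markdown":"# B","credits_used":3}],"total_credits_used":5}'
node -e '
const j = JSON.parse(process.argv[1]);
// Add up credits_used across the per-page result array
const perPage = (j.data || []).reduce((sum, page) => sum + (page.credits_used || 0), 0);
console.log(`status=${j.status} pages=${(j.data || []).length} credits=${perPage}`);
' "$RESULT"
```

For the sample above this prints `status=completed pages=2 credits=5`, matching `total_credits_used`.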

Workflow

  1. Confirm the business objective and crawl boundary.
     • What content is required, what must be excluded, and what is the completion signal.
  2. Draft a bounded crawl request.
     • Prefer explicit limits and path constraints.
  3. Start the crawl and capture task metadata.
     • Record `crawl_id`, the initial status, and the request payload.
  4. Poll `GET /v1/crawl/{crawl_id}` until a terminal state.
     • Track `pending`, `crawling`, `completed`, or `failed`.
  5. Return the raw create/result responses.
     • Do not synthesize derived summaries unless explicitly requested.
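The polling step can be sketched as a small shell helper. `poll_crawl` is a hypothetical name introduced here; it takes the command that fetches a status body as its arguments, so the real `curl` call (commented below) plugs in unchanged:

```shell
# Poll a status-fetching command until the crawl reaches a terminal state
poll_crawl() {
  while :; do
    RESP="$("$@")"   # run the supplied fetch command
    STATE="$(node -e 'try{process.stdout.write(JSON.parse(process.argv[1]).status||"")}catch(e){}' "$RESP")"
    case "$STATE" in
      completed|failed) break ;;   # terminal states
    esac
    sleep 5                        # back off between polls
  done
  printf '%s\n' "$RESP"            # emit the final raw body
}

# Real usage, given API_KEY and CRAWL_ID from the create step:
# poll_crawl curl -sS "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" -H "Authorization: Bearer ${API_KEY}"
```

Keeping the fetch command injectable also makes the loop easy to exercise against a canned status body before spending credits.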

Output Contract

Return:
  • Endpoint flow (`POST /v1/crawl` + `GET /v1/crawl/{crawl_id}`)
  • The `request_payload` used for the create request
  • Raw response body from the create call
  • Raw response body from the result call
  • Error details when a request fails

Do not generate summaries unless the user explicitly requests one.

Guardrails

  • Never run an unbounded crawl without explicit constraints.
  • Do not present speculative page counts as final coverage.
  • Do not hardcode provider-specific tool schemas in core logic.
  • Highlight policy, legal, or website-usage risks when relevant.