# XCrawl Scrape

## Overview

This skill handles single-page extraction with XCrawl Scrape APIs. Default behavior is raw passthrough: return upstream API response bodies as-is.

## Required Local Config

Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.

Path: `~/.xcrawl/config.json`

```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```

Read the API key from the local config file only. Do not require global environment variables.

## Credits and Account Setup

Using XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/. After registration, they can activate the free 1000-credit plan before running requests.

## Tool Permission Policy

Request runtime permissions for `curl` and `node` only. Do not request Python, shell helper scripts, or other runtime permissions.

## API Surface

- Start scrape: `POST /v1/scrape`
- Read async result: `GET /v1/scrape/{scrape_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`

## Usage Examples

### cURL (sync)

```bash
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","mode":"sync","output":{"formats":["markdown","links"]}}'
```

### cURL (async create + result)

```bash
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com/product/1","mode":"async","output":{"formats":["json"]},"json":{"prompt":"Extract title and price."}}')"

echo "$CREATE_RESP"

SCRAPE_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.scrape_id||"")' "$CREATE_RESP")"

curl -sS -X GET "https://run.xcrawl.com/v1/scrape/${SCRAPE_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```
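When `mode=async`, the create call returns immediately and the result must be polled until the task reaches a terminal status. The loop can be sketched in Node as follows; the interval, attempt cap, and the injectable `fetchJson` parameter are assumptions made for illustration and testability, not part of the API.

```javascript
// pollScrape: repeatedly GET /v1/scrape/{scrape_id} until the task reaches a
// terminal status. fetchJson(url) must resolve to the parsed response body;
// injecting it keeps the sketch testable without real network calls.
async function pollScrape(scrapeId, fetchJson, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const body = await fetchJson(`https://run.xcrawl.com/v1/scrape/${scrapeId}`);
    if (body.status === "completed" || body.status === "failed") {
      return body; // raw response body, per the passthrough contract
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Scrape ${scrapeId} not terminal after ${maxAttempts} polls`);
}
```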

### Node

```bash
node -e '
const fs=require("fs");
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
const body={url:"https://example.com",mode:"sync",output:{formats:["markdown","json"]},json:{prompt:"Extract title and publish date."}};
fetch("https://run.xcrawl.com/v1/scrape",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```

## Request Parameters

### Request endpoint and headers

- Endpoint: `POST https://run.xcrawl.com/v1/scrape`
- Headers:
  - `Content-Type: application/json`
  - `Authorization: Bearer <api_key>`

### Request body: top-level fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | Yes | - | Target URL |
| `mode` | string | No | `sync` | `sync` or `async` |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async webhook config (`mode=async`) |

### proxy

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `location` | string | No | `US` | ISO-3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; the same ID attempts to reuse the same exit node |

### request

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `locale` | string | No | `en-US,en;q=0.9` | Affects the `Accept-Language` header |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects UA and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |
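For instance, a request body using a `request` block to emulate a mobile device with a session cookie might look like this (all values are illustrative):

```json
{
  "url": "https://example.com",
  "mode": "sync",
  "request": {
    "device": "mobile",
    "cookies": { "session_id": "abc123" },
    "headers": { "Referer": "https://example.com/" },
    "only_main_content": false
  },
  "output": { "formats": ["markdown"] }
}
```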

### js_render

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop `1080`, mobile `874`) |

### output

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |

`output.formats` enum:

- `html`
- `raw_html`
- `markdown`
- `links`
- `summary`
- `screenshot`
- `json`
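Composed in code, a request body using the fields above can be built and validated before sending. This is a sketch; the `buildScrapeBody` helper and its option names are illustrative, with the format list mirroring the enum above.

```javascript
// buildScrapeBody: assemble a POST /v1/scrape body, rejecting output formats
// that are not in the documented enum.
const VALID_FORMATS = ["html", "raw_html", "markdown", "links", "summary", "screenshot", "json"];

function buildScrapeBody(url, { mode = "sync", formats = ["markdown"], jsonPrompt } = {}) {
  const unknown = formats.filter((f) => !VALID_FORMATS.includes(f));
  if (unknown.length > 0) {
    throw new Error(`Unsupported output.formats: ${unknown.join(", ")}`);
  }
  const body = { url, mode, output: { formats } };
  if (jsonPrompt) {
    body.json = { prompt: jsonPrompt }; // meaningful when formats includes "json"
  }
  return body;
}
```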

### webhook

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |

## Response Parameters

### Sync create response (`mode=sync`)

| Field | Type | Description |
| --- | --- | --- |
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Result data |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |

`data` fields (based on `output.formats`):

- `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
- `metadata` (page metadata)
- `traffic_bytes`
- `credits_used`
- `credits_detail`

`credits_detail` fields:

| Field | Type | Description |
| --- | --- | --- |
| `base_cost` | integer | Base scrape cost |
| `traffic_cost` | integer | Traffic cost |
| `json_extract_cost` | integer | JSON extraction cost |

### Async create response (`mode=async`)

| Field | Type | Description |
| --- | --- | --- |
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | Always `pending` |

### Async result response (`GET /v1/scrape/{scrape_id}`)

| Field | Type | Description |
| --- | --- | --- |
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `pending` / `crawling` / `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Same shape as sync `data` |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |

## Workflow

1. Restate the user goal as an extraction contract.
   - URL scope, required fields, accepted nulls, and precision expectations.
2. Build the scrape request body.
   - Keep only necessary options.
   - Prefer explicit `output.formats`.
3. Execute the scrape and capture task metadata.
   - Track `scrape_id`, `status`, and timestamps.
   - If async, poll until `completed` or `failed`.
4. Return raw API responses directly.
   - Do not synthesize or compress fields by default.

## Output Contract

Return:

- Endpoint(s) used and mode (`sync` or `async`)
- The `request_payload` used for the request
- The raw response body from each API call
- Error details when a request fails

Do not generate summaries unless the user explicitly requests one.

## Guardrails

- Do not invent unsupported output fields.
- Do not hardcode provider-specific tool schemas in core logic.
- Call out uncertainty when page structure is unstable.