# XCrawl Scrape

## Overview

This skill handles single-page extraction with XCrawl Scrape APIs. Default behavior is raw passthrough: return upstream API response bodies as-is.

## Required Local Config

Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.

Path: `~/.xcrawl/config.json`

```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```

Read the API key from the local config file only. Do not require global environment variables.

## Credits and Account Setup

Using XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/. After registration, they can activate the free 1000-credit plan before running requests.

## Tool Permission Policy

Request runtime permissions for `curl` and `node` only. Do not request Python, shell helper scripts, or other runtime permissions.

## API Surface

- Start scrape: `POST /v1/scrape`
- Read async result: `GET /v1/scrape/{scrape_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`

## Usage Examples

### cURL (sync)

```bash
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","mode":"sync","output":{"formats":["markdown","links"]}}'
```

### cURL (async create + result)

```bash
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com/product/1","mode":"async","output":{"formats":["json"]},"json":{"prompt":"Extract title and price."}}')"

echo "$CREATE_RESP"

SCRAPE_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.scrape_id||"")' "$CREATE_RESP")"

curl -sS -X GET "https://run.xcrawl.com/v1/scrape/${SCRAPE_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```
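When `mode=async`, the create call returns immediately and the result must be polled until the task reaches a terminal status. The loop can be sketched in Node as follows; the interval, attempt cap, and the injectable `fetchJson` parameter are assumptions made for illustration and testability, not part of the API.

```javascript
// pollScrape: repeatedly GET /v1/scrape/{scrape_id} until the task reaches a
// terminal status. fetchJson(url) must resolve to the parsed response body;
// injecting it keeps the sketch testable without real network calls.
async function pollScrape(scrapeId, fetchJson, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const body = await fetchJson(`https://run.xcrawl.com/v1/scrape/${scrapeId}`);
    if (body.status === "completed" || body.status === "failed") {
      return body; // raw response body, per the passthrough contract
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Scrape ${scrapeId} not terminal after ${maxAttempts} polls`);
}
```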

### Node

```bash
node -e '
const fs=require("fs");
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
const body={url:"https://example.com",mode:"sync",output:{formats:["markdown","json"]},json:{prompt:"Extract title and publish date."}};
fetch("https://run.xcrawl.com/v1/scrape",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```

## Request Parameters

### Request endpoint and headers

- Endpoint: `POST https://run.xcrawl.com/v1/scrape`
- Headers:
  - `Content-Type: application/json`
  - `Authorization: Bearer <api_key>`

### Request body: top-level fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | Yes | - | Target URL |
| `mode` | string | No | `sync` | `sync` or `async` |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async webhook config (`mode=async`) |

### proxy

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `location` | string | No | `US` | ISO-3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; the same ID attempts to reuse the same exit node |

### request

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `locale` | string | No | `en-US,en;q=0.9` | Affects the `Accept-Language` header |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects UA and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |
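For instance, a request body using a `request` block to emulate a mobile device with a session cookie might look like this (all values are illustrative):

```json
{
  "url": "https://example.com",
  "mode": "sync",
  "request": {
    "device": "mobile",
    "cookies": { "session_id": "abc123" },
    "headers": { "Referer": "https://example.com/" },
    "only_main_content": false
  },
  "output": { "formats": ["markdown"] }
}
```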

### js_render

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop `1080`, mobile `874`) |

### output

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |

`output.formats` enum:

- `html`
- `raw_html`
- `markdown`
- `links`
- `summary`
- `screenshot`
- `json`
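Composed in code, a request body using the fields above can be built and validated before sending. This is a sketch; the `buildScrapeBody` helper and its option names are illustrative, with the format list mirroring the enum above.

```javascript
// buildScrapeBody: assemble a POST /v1/scrape body, rejecting output formats
// that are not in the documented enum.
const VALID_FORMATS = ["html", "raw_html", "markdown", "links", "summary", "screenshot", "json"];

function buildScrapeBody(url, { mode = "sync", formats = ["markdown"], jsonPrompt } = {}) {
  const unknown = formats.filter((f) => !VALID_FORMATS.includes(f));
  if (unknown.length > 0) {
    throw new Error(`Unsupported output.formats: ${unknown.join(", ")}`);
  }
  const body = { url, mode, output: { formats } };
  if (jsonPrompt) {
    body.json = { prompt: jsonPrompt }; // meaningful when formats includes "json"
  }
  return body;
}
```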

### webhook

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |

## Response Parameters

### Sync create response (`mode=sync`)

| Field | Type | Description |
| --- | --- | --- |
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Result data |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |

`data` fields (based on `output.formats`):

- `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
- `metadata` (page metadata)
- `traffic_bytes`
- `credits_used`
- `credits_detail`

`credits_detail` fields:

| Field | Type | Description |
| --- | --- | --- |
| `base_cost` | integer | Base scrape cost |
| `traffic_cost` | integer | Traffic cost |
| `json_extract_cost` | integer | JSON extraction cost |

### Async create response (`mode=async`)

| Field | Type | Description |
| --- | --- | --- |
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | Always `pending` |

### Async result response (`GET /v1/scrape/{scrape_id}`)

| Field | Type | Description |
| --- | --- | --- |
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `pending` / `crawling` / `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Same shape as sync `data` |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |

## Workflow

1. Restate the user goal as an extraction contract.
   - URL scope, required fields, accepted nulls, and precision expectations.
2. Build the scrape request body.
   - Keep only necessary options.
   - Prefer explicit `output.formats`.
3. Execute the scrape and capture task metadata.
   - Track `scrape_id`, `status`, and timestamps.
   - If async, poll until `completed` or `failed`.
4. Return raw API responses directly.
   - Do not synthesize or compress fields by default.

## Output Contract

Return:

- Endpoint(s) used and mode (`sync` or `async`)
- The `request_payload` used for the request
- The raw response body from each API call
- Error details when a request fails

Do not generate summaries unless the user explicitly requests one.

## Guardrails

- Do not invent unsupported output fields.
- Do not hardcode provider-specific tool schemas in core logic.
- Call out uncertainty when page structure is unstable.