apify-actor-development

Apify Actor Development

Important: Before you begin, fill in the `generatedBy` property in the meta section of `.actor/actor.json`. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.
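
The step above can also be scripted. The following is a minimal sketch, assuming the property lives at `meta.generatedBy` as described above; the helper name `set_generated_by` is hypothetical and not part of any Apify tooling.

```python
import json
from pathlib import Path


def set_generated_by(actor_json_path: Path, tool_and_model: str) -> dict:
    """Write meta.generatedBy into an actor.json file and return the updated config."""
    config = json.loads(actor_json_path.read_text(encoding="utf-8"))
    # Create the meta section if the file does not have one yet (an assumption
    # about how a missing section should be handled).
    meta = config.setdefault("meta", {})
    meta["generatedBy"] = tool_and_model
    actor_json_path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

For example, `set_generated_by(Path(".actor/actor.json"), "Claude Code with Claude Sonnet 4.5")` would update the file in place while leaving all other keys untouched.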

What are Apify Actors?

Actors are serverless programs inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems. They're packaged as Docker images and run in isolated containers in the cloud.

Core Concepts:

Accept well-defined JSON input
Perform isolated tasks (web scraping, automation, data processing)
Produce structured JSON output to datasets and/or store data in key-value stores
Can run from seconds to hours or even indefinitely
Persist state and can be restarted

Prerequisites & Setup (MANDATORY)

Before creating or modifying actors, verify that the `apify` CLI is installed by running `apify --help`.

If it is not installed, you can run:

```bash
curl -fsSL https://apify.com/install-cli.sh | bash
```

Or (Mac): `brew install apify-cli`

Or (Windows): `irm https://apify.com/install-cli.ps1 | iex`

Or: `npm install -g apify-cli`

When the apify CLI is installed, check that it is logged in with:

```bash
apify info  # Should return your username
```

If it is not logged in, check if the APIFY_TOKEN environment variable is defined (if not, ask the user to generate one on https://console.apify.com/settings/integrations and then define APIFY_TOKEN with it).

Then run:

```bash
apify login -t $APIFY_TOKEN
```


Template Selection

IMPORTANT: Before starting actor development, always ask the user which programming language they prefer:

JavaScript - Use `apify create <actor-name> -t project_empty`

TypeScript - Use `apify create <actor-name> -t ts_empty`

Python - Use `apify create <actor-name> -t python-empty`

Use the appropriate CLI command based on the user's language choice. Additional packages (Crawlee, Playwright, etc.) can be installed later as needed.
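
The language-to-template mapping above can be captured in a small lookup. This is only an illustrative sketch: the template names come from the list above, while the helper `build_create_command` is a hypothetical name introduced here.

```python
# Template names as listed above; the mapping keys are lowercase language names.
TEMPLATES = {
    "javascript": "project_empty",
    "typescript": "ts_empty",
    "python": "python-empty",
}


def build_create_command(language: str, actor_name: str) -> list[str]:
    """Return the `apify create` argument list for the user's chosen language."""
    key = language.strip().lower()
    if key not in TEMPLATES:
        supported = ", ".join(sorted(TEMPLATES))
        raise ValueError(f"Unsupported language {language!r}; expected one of: {supported}")
    return ["apify", "create", actor_name, "-t", TEMPLATES[key]]
```

The returned list could be passed to `subprocess.run` or joined with spaces for display, for example `" ".join(build_create_command("Python", "my-actor"))`.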

Quick Start Workflow

Create actor project - Run the appropriate `apify create` command based on user's language preference (see Template Selection above)
Install dependencies
- JavaScript/TypeScript: `npm install`
- Python: `pip install -r requirements.txt`
Implement logic - Write the actor code in `src/main.py`, `src/main.js`, or `src/main.ts`
Configure schemas - Update input/output schemas in `.actor/input_schema.json`, `.actor/output_schema.json`, and `.actor/dataset_schema.json`
Configure platform settings - Update `.actor/actor.json` with actor metadata (see references/actor-json.md)
Write documentation - Create comprehensive README.md for the marketplace
Test locally - Run `apify run` to verify functionality (see Local Testing section below)
Deploy - Run `apify push` to deploy the actor on the Apify platform (actor name is defined in `.actor/actor.json`)

Best Practices

✓ Do:

Use `apify run` to test actors locally (configures Apify environment and storage)
Use Apify SDK (`apify`) for code running ON Apify platform
Validate input early with proper error handling and fail gracefully
Use CheerioCrawler for static HTML (10x faster than browsers)
Use PlaywrightCrawler only for JavaScript-heavy sites
Use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
Implement retry strategies with exponential backoff
Use proper concurrency: HTTP (10-50), Browser (1-5)
Set sensible defaults in `.actor/input_schema.json`
Define output schema in `.actor/output_schema.json`
Clean and validate data before pushing to dataset
Use semantic CSS selectors with fallback strategies
Respect robots.txt, ToS, and implement rate limiting
Always use `apify/log` package - censors sensitive data (API keys, tokens, credentials)
Implement readiness probe handler (required if your Actor uses standby mode)
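
As an illustration of the "retry strategies with exponential backoff" item in the list above, here is a generic sketch using only the Python standard library. It is not an Apify or Crawlee API; the function name, the default delays, and the 10% jitter are all assumptions chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Re-raise the original error once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt (base, 2x, 4x, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the delay schedule can be checked in tests without actually waiting.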

✗ Don't:

Use `npm start`, `npm run start`, `npx apify run`, or similar commands to run actors (use `apify run` instead)
Rely on `Dataset.getInfo()` for final counts on Cloud
Use browser crawlers when HTTP/Cheerio works
Hardcode values that should be in input schema or environment variables
Skip input validation or error handling
Overload servers - use appropriate concurrency and delays
Scrape prohibited content or ignore Terms of Service
Store personal/sensitive data unless explicitly permitted
Use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x)
Use `additionalHttpHeaders` - use `preNavigationHooks` instead
Disable standby mode without explicit permission
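
To make the input-validation guidance in the lists above concrete, the sketch below checks a raw input object before any work starts and fails with a clear message. The field names `startUrls` and `maxItems`, the default of 100, and the `ActorInput` type are hypothetical; a real actor would mirror whatever its own `.actor/input_schema.json` defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActorInput:
    start_urls: list[str]
    max_items: int


def parse_input(raw: object) -> ActorInput:
    """Validate raw actor input early and raise ValueError with a readable message."""
    if not isinstance(raw, dict):
        raise ValueError("Input must be a JSON object")

    start_urls = raw.get("startUrls")
    if not isinstance(start_urls, list) or not start_urls:
        raise ValueError("'startUrls' must be a non-empty list")
    for url in start_urls:
        if not isinstance(url, str) or not url.startswith(("http://", "https://")):
            raise ValueError(f"Invalid URL in 'startUrls': {url!r}")

    # Fall back to a default instead of hardcoding the limit elsewhere in the code.
    max_items = raw.get("maxItems", 100)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or max_items < 1:
        raise ValueError("'maxItems' must be a positive integer")

    return ActorInput(start_urls=list(start_urls), max_items=max_items)
```

Calling this once at the top of the entry point keeps validation errors separate from scraping errors and lets the run fail before any requests are made.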

Logging

See references/logging.md for complete logging documentation including available log levels and best practices for JavaScript/TypeScript and Python.

Check `usesStandbyMode` in `.actor/actor.json` - only implement the readiness probe handler if it is set to `true`.
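
The `usesStandbyMode` check described above could be automated along these lines. This is a minimal sketch: the helper name is hypothetical, and treating a missing key as "not enabled" is an assumption rather than documented behavior.

```python
import json
from pathlib import Path


def uses_standby_mode(actor_json_path: Path) -> bool:
    """Return True only when actor.json explicitly sets usesStandbyMode to true."""
    config = json.loads(actor_json_path.read_text(encoding="utf-8"))
    # Assumption: a missing or non-boolean value means standby mode is not enabled.
    return config.get("usesStandbyMode") is True
```

For example, `uses_standby_mode(Path(".actor/actor.json"))` can gate whether a readiness probe handler needs to be written at all.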

Commands

```bash
apify run          # Run Actor locally
apify login        # Authenticate account
apify push         # Deploy to Apify platform (uses name from .actor/actor.json)
apify help         # List all commands
```

IMPORTANT: Always use `apify run` to test actors locally. Do not use `npm run start`, `npm start`, `yarn start`, or other package manager commands - these will not properly configure the Apify environment and storage.

Local Testing

When testing an actor locally with `apify run`, provide input data by creating a JSON file at:

`storage/key_value_stores/default/INPUT.json`

This file should contain the input parameters defined in your `.actor/input_schema.json`. The actor will read this input when running locally, mirroring how it receives input on the Apify platform.
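
Creating the local input file described above can be sketched as follows. The path is the one given above; the helper name `write_local_input` and the sample keys in the usage note are hypothetical.

```python
import json
from pathlib import Path

# Relative location of the local input file, as given above.
INPUT_PATH = Path("storage/key_value_stores/default/INPUT.json")


def write_local_input(project_dir: Path, actor_input: dict) -> Path:
    """Write `actor_input` as JSON to the local INPUT.json and return its path."""
    target = project_dir / INPUT_PATH
    # The storage directories may not exist before the first local run.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(actor_input, indent=2) + "\n", encoding="utf-8")
    return target
```

For instance, `write_local_input(Path("."), {"startUrls": ["https://example.com"]})` prepares input for a subsequent `apify run`, assuming the actor's input schema defines a `startUrls` field.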

Standby Mode

See references/standby-mode.md for complete standby mode documentation including readiness probe implementation for JavaScript/TypeScript and Python.

Project Structure

.actor/
├── actor.json           # Actor config: name, version, env vars, runtime
├── input_schema.json    # Input validation & Console form definition
└── output_schema.json   # Output storage and display templates
src/
└── main.js/ts/py       # Actor entry point
storage/                # Local storage (mirrors Cloud)
├── datasets/           # Output items (JSON objects)
├── key_value_stores/   # Files, config, INPUT
└── request_queues/     # Pending crawl requests
Dockerfile              # Container image definition
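
A quick sanity check against the layout above might look like the sketch below. It treats the `.actor/` files, the `Dockerfile`, and one of the three entry points as required, and skips `storage/` on the assumption that it is populated by local runs; both choices are assumptions, and the helper name is hypothetical.

```python
from pathlib import Path

# Files from the tree above that this sketch treats as required.
EXPECTED_FILES = (
    ".actor/actor.json",
    ".actor/input_schema.json",
    ".actor/output_schema.json",
    "Dockerfile",
)
# Exactly which entry point exists depends on the chosen language.
ENTRY_POINTS = ("src/main.js", "src/main.ts", "src/main.py")


def missing_project_files(project_dir: Path) -> list[str]:
    """Return the expected paths that are absent from `project_dir`."""
    missing = [p for p in EXPECTED_FILES if not (project_dir / p).is_file()]
    if not any((project_dir / entry).is_file() for entry in ENTRY_POINTS):
        missing.append("src/main.{js,ts,py}")
    return missing
```

An empty list means every file the sketch looks for is present; it says nothing about whether the file contents are valid.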

Actor Configuration

See references/actor-json.md for complete actor.json structure and configuration options.

Input Schema

See references/input-schema.md for input schema structure and examples.

Output Schema

See references/output-schema.md for output schema structure, examples, and template variables.

Dataset Schema

See references/dataset-schema.md for dataset schema structure, configuration, and display properties.

Key-Value Store Schema

See references/key-value-store-schema.md for key-value store schema structure, collections, and configuration.

Apify MCP Tools

If the MCP server is configured, use these tools for documentation:

`search-apify-docs` - Search documentation
`fetch-apify-docs` - Get full doc pages

Otherwise, the MCP server URL is: https://mcp.apify.com/?tools=docs

Resources

docs.apify.com/llms.txt - Apify quick reference documentation
docs.apify.com/llms-full.txt - Apify complete documentation
https://crawlee.dev/llms.txt - Crawlee quick reference documentation
https://crawlee.dev/llms-full.txt - Crawlee complete documentation
whitepaper.actor - Complete Actor specification
