alibabacloud-pai-rec-diagnosis

PAI-Rec Engine Diagnosis and Configuration Validation

This skill provides comprehensive diagnostic and validation capabilities for Alibaba Cloud PAI-Rec (Programmable Recommendation System) engines, including interface troubleshooting and configuration analysis.

Scenario Description

PAI-Rec is Alibaba Cloud's programmable recommendation system that provides intelligent recommendation capabilities. This skill helps users:
  1. Diagnose PAI-Rec Engine Interface Issues: When engine API returns errors or unexpected results, trace the request through EAS service logs and engine configurations to identify root causes.
  2. Validate Engine Configurations: Analyze engine configuration files for potential issues, inconsistencies, or misconfigurations before deployment.
Architecture: PAI-EAS Service + PAI-Rec Engine + Engine Configuration Management

Key Components

  • PAI-EAS Service: Elastic Algorithm Service hosting the recommendation engine
  • PAI-Rec Engine: The recommendation engine processing requests
  • Engine Configuration: Configuration files defining engine behavior
  • Service Logs: EAS service logs containing request traces

Installation

Pre-check: Aliyun CLI >= 3.3.3 required
Run `aliyun version` to verify that the installed version is >= 3.3.3. If the CLI is not installed or the version is too low, run `curl -fsSL https://aliyuncli.alicdn.com/setup.sh | bash` to install or update it, or see references/cli-installation-guide.md for installation instructions.
Pre-check: Aliyun CLI plugin update required
[MUST] Run `aliyun configure set --auto-plugin-install true` to enable automatic plugin installation. [MUST] Run `aliyun plugin update` to bring all installed plugins up to date.
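The version pre-check is easy to get wrong with plain string comparison ("3.10.0" sorts before "3.3.3" as text). A minimal sketch of a numeric comparison follows. It assumes only that the output of `aliyun version` contains a dotted numeric version somewhere in it; the helper names `parse_version` and `meets_minimum` are illustrative and not part of the skill.

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted numeric version (e.g. '3.3.3') found in text."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(reported: str, minimum: str = "3.3.3") -> bool:
    """Compare versions component by component, so 3.10.0 ranks above 3.3.3."""
    return parse_version(reported) >= parse_version(minimum)


if __name__ == "__main__":
    # Sample strings only; feed in the real output of `aliyun version`.
    for sample in ("3.3.3", "3.10.0", "3.0.278"):
        print(sample, meets_minimum(sample))
```

Comparing integer tuples rather than strings is the whole point of the sketch; everything else is scaffolding.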

Authentication

Pre-check: Alibaba Cloud Credentials Required
Security Rules:
  • NEVER read, echo, or print AK/SK values (e.g., `echo $ALIBABA_CLOUD_ACCESS_KEY_ID` is FORBIDDEN)
  • NEVER ask the user to input AK/SK directly in the conversation or command line
  • NEVER use `aliyun configure set` with literal credential values
  • ONLY use `aliyun configure list` to check credential status
bash
aliyun configure list
Check the output for a valid profile (AK, STS, or OAuth identity).
If no valid profile exists, STOP here.
  1. Obtain credentials from the Alibaba Cloud Console
  2. Configure credentials outside of this session (via `aliyun configure` in a terminal, or environment variables in the shell profile)
  3. Return and re-run after `aliyun configure list` shows a valid profile

RAM Policy

[MUST] Permission Failure Handling: When any command or API call fails because of a permission error at any point during execution, follow this process:
  1. Read references/ram-policies.md to get the full list of permissions required by this skill
  2. Use the `ram-permission-diagnose` skill to guide the user through requesting the necessary permissions
  3. Pause and wait until the user confirms that the required permissions have been granted
For the detailed RAM policies required by this skill, see references/ram-policies.md.

Parameter Confirmation

IMPORTANT: Parameter Confirmation — Before executing any command or API call, ALL user-customizable parameters (e.g., RegionId, service names, instance IDs, environment names, config names, etc.) MUST be confirmed with the user. Do NOT assume or use default values without explicit user approval.

Required Parameters

| Parameter Name | Required/Optional | Description | Default Value |
| --- | --- | --- | --- |
| `service-name` | Required | PAI-EAS service name | N/A |
| `cluster-id` | Required | EAS cluster/region ID | N/A |
| `instance-id` | Required | PAI-Rec instance ID | N/A |
| `request-id` | Required (for diagnosis) | Request ID from engine API response | N/A |
| `config-name` | Optional | Engine configuration name | N/A |
| `environment` | Optional | Engine environment (Prod/Pre) | N/A |
| `engine-config-id` | Optional | Specific engine config version ID | N/A |
| `region` | Optional | Alibaba Cloud region | N/A |

Core Workflow

At the start of the Core Workflow (before any CLI invocation): [MUST] Enable AI-Mode — AI-mode is required for Agent Skill execution. Run the following commands before any CLI invocation:
bash
aliyun configure ai-mode enable
aliyun configure ai-mode set-user-agent --user-agent "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"
[MUST] Disable AI-Mode at EVERY exit point — Before delivering the final response for ANY reason, always disable AI-mode first. This applies to ALL exit paths: workflow success, workflow failure, error/exception, user cancellation, session end, or any other scenario where no further CLI commands will be executed. AI-mode is only used for Agent Skill invocation scenarios and MUST NOT remain enabled after the skill stops running.
bash
aliyun configure ai-mode disable
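The "disable at every exit point" rule maps naturally onto a try/finally. The following is a sketch under stated assumptions, not the skill's own implementation: it wraps the three `aliyun configure ai-mode` commands quoted above in a context manager, and the `run` parameter is a hypothetical injection point so the pattern can be exercised without the CLI installed.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

USER_AGENT = "AlibabaCloud-Agent-Skills/alibabacloud-pai-rec-diagnosis"


def run_cli(args: Sequence[str]) -> None:
    """Default runner: execute one CLI command and raise if it fails."""
    subprocess.run(list(args), check=True)


@contextmanager
def ai_mode(run: Callable[[Sequence[str]], None] = run_cli) -> Iterator[None]:
    """Enable AI-mode on entry and disable it on every exit path."""
    try:
        run(["aliyun", "configure", "ai-mode", "enable"])
        run(["aliyun", "configure", "ai-mode", "set-user-agent",
             "--user-agent", USER_AGENT])
        yield
    finally:
        # Runs on success, on exceptions, and on early returns alike.
        run(["aliyun", "configure", "ai-mode", "disable"])


if __name__ == "__main__":
    recorded: list = []
    with ai_mode(run=recorded.append):  # record commands instead of executing
        pass
    for command in recorded:
        print(" ".join(command))
```

Placing the enable call inside the `try` means the disable command is still issued if enabling itself fails partway.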

Workflow 1: PAI-Rec Engine Interface Diagnosis

This workflow helps diagnose issues when a PAI-Rec engine API returns errors or unexpected results.
Input Example:
Service Name: embedding_recall
API Response:
{
    "code": 299,
    "msg": "items size not enough",
    "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
    "size": 0,
    "experiment_id": "",
    "items": []
}

Step 1: Retrieve EAS Service Information

Get the service details to find the EAS service ID and configuration:
bash
aliyun eas describe-service \
  --cluster-id <cluster-id> \
  --service-name <service-name>
What to extract:
  • `Resource`: EAS service resource ID (e.g., `eas-r-1v4qb1yan3qmnjwxqe`)
  • `ServiceConfig.envs`: Environment variables containing:
    • `REGION`: The region
    • `INSTANCE_ID`: PAI-Rec instance ID
    • `CONFIG_NAME`: Engine configuration name
    • `PAIREC_ENVIRONMENT`: Environment (product/prepub)
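Pulling the four environment variables out of the service description can be sketched as below. The response shape is an assumption, not something this document specifies: the sketch supposes that `ServiceConfig` arrives either as a JSON string or as an object, and that `envs` is a list of `{"name": ..., "value": ...}` entries. Adjust it to the actual output of `describe-service`; `extract_engine_env` is an illustrative name.

```python
import json
from typing import Any

WANTED = ("REGION", "INSTANCE_ID", "CONFIG_NAME", "PAIREC_ENVIRONMENT")


def extract_engine_env(service: dict[str, Any]) -> dict[str, str]:
    """Return the PAI-Rec related env vars found in a service description."""
    config = service.get("ServiceConfig", {})
    if isinstance(config, str):
        # Assumption: the config may be delivered as an embedded JSON string.
        config = json.loads(config)
    found: dict[str, str] = {}
    for entry in config.get("envs", []):
        # Assumption: each entry is shaped like {"name": "...", "value": "..."}.
        if entry.get("name") in WANTED:
            found[entry["name"]] = str(entry.get("value", ""))
    return found


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {"ServiceConfig": json.dumps(
        {"envs": [{"name": "PAIREC_ENVIRONMENT", "value": "prepub"}]})}
    print(extract_engine_env(sample))
```

Any of the four names missing from the result is worth confirming with the user rather than defaulting, in line with the parameter confirmation rule.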

Step 2: Extract Request ID from API Response

Parse the API response JSON to get the `request_id` field. This value is used to search the service logs.
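As a small illustration of this step, the sketch below parses the example response from the workflow input and returns its `request_id`. The function name `extract_request_id` is illustrative; the only thing taken from the source is the field name and the sample payload.

```python
import json


def extract_request_id(response_text: str) -> str:
    """Return the request_id from an engine API response body."""
    payload = json.loads(response_text)
    request_id = payload.get("request_id")
    if not isinstance(request_id, str) or not request_id:
        raise ValueError("API response has no usable request_id")
    return request_id


if __name__ == "__main__":
    # The sample response shown in the workflow input example.
    sample = """{
        "code": 299,
        "msg": "items size not enough",
        "request_id": "941b4e14-d1c5-489f-a184-b2b17f8b4fdb",
        "size": 0,
        "experiment_id": "",
        "items": []
    }"""
    print(extract_request_id(sample))
```

Failing loudly on a missing or empty `request_id` avoids running a log search with an empty keyword.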

Step 3: Query EAS Service Logs

Use the request ID as the sole filter to search service logs. Do NOT pass `--start-time` / `--end-time` when searching PAI-Rec business logs:
bash
aliyun eas describe-service-log \
  --cluster-id <cluster-id> \
  --service-name <service-name> \
  --keyword <request-id> \
  --page-size 500
[CRITICAL] Known CLI pitfall: keyword-only lookup is required for business logs.
  • When only `--keyword` is supplied (no time range), the CLI returns the full PAI-Rec application trace (`controller.go` / `feed.go` / `recall.go` / `rank_service.go` etc.) matching the request_id.
  • As soon as `--start-time` / `--end-time` are added, even if the window covers the real log timestamp, the CLI silently drops business logs and returns only infrastructure noise (`/bin/sh` wrapper heartbeats, `502 Bad Gateway` retries, `postgres.go dbstat`).
  • Therefore: for request-level diagnosis, always omit the time range and rely on `--keyword <request-id>` alone.
Notes:
  • `--keyword`: Use the full `request_id` extracted from the API response (case-sensitive exact match).
  • `--page-size`: Raise to 500 to capture the entire trace in a single page; a single request usually matches fewer than 30 entries.
  • `--start-time` / `--end-time`: Only use these for broad time-window scans without `--keyword` (e.g., when investigating non-request-specific issues). The required format is `yyyy-MM-dd HH:mm:ss` in UTC (space separator, no `T`, no `Z`). ISO-8601 forms like `2025-04-28T00:00:00Z` are rejected with `InvalidParameter`.
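The two rules above (never combine a keyword with a time range, and format times as `yyyy-MM-dd HH:mm:ss` in UTC) can be encoded in a small argument builder. This is a sketch only: it uses just the flags quoted in this step, and the helper names `format_log_time` and `build_log_query` are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def format_log_time(moment: datetime) -> str:
    """Render a timezone-aware datetime as 'yyyy-MM-dd HH:mm:ss' in UTC."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def build_log_query(cluster_id: str, service_name: str,
                    keyword: Optional[str] = None,
                    start: Optional[datetime] = None,
                    end: Optional[datetime] = None,
                    page_size: int = 500) -> list[str]:
    """Build describe-service-log arguments, refusing keyword + time range."""
    if keyword and (start or end):
        raise ValueError("use --keyword alone for request-level lookups")
    args = ["aliyun", "eas", "describe-service-log",
            "--cluster-id", cluster_id, "--service-name", service_name]
    if keyword:
        args += ["--keyword", keyword]
    if start:
        args += ["--start-time", format_log_time(start)]
    if end:
        args += ["--end-time", format_log_time(end)]
    return args + ["--page-size", str(page_size)]


if __name__ == "__main__":
    print(build_log_query("<cluster-id>", "<service-name>",
                          keyword="<request-id>"))
```

Rejecting the keyword-plus-time-range combination up front turns the silent log-dropping behaviour into an explicit error on the caller's side.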

Step 4: List Engine Configurations

Map the environment and list matching configurations:
Environment Mapping:
  • `product` → `Prod`
  • `prepub` → `Pre`
bash
aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --status Released \
  --name <config-name>
What to extract:
  • Find the configuration with `Status: Released`
  • Get `EngineConfigId` and `Version`
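The environment mapping and the "find the Released configuration" step could be sketched as follows. The field names `Status`, `EngineConfigId`, and `Version` come from this step; the surrounding response envelope is not specified here, so `pick_released` takes the list of configuration entries directly. Expecting exactly one Released entry is an assumption of the sketch, not a documented guarantee.

```python
ENVIRONMENT_MAP = {"product": "Prod", "prepub": "Pre"}


def to_cli_environment(pairec_environment: str) -> str:
    """Translate PAIREC_ENVIRONMENT into the --environment value."""
    try:
        return ENVIRONMENT_MAP[pairec_environment]
    except KeyError:
        raise ValueError(
            f"unmapped PAIREC_ENVIRONMENT: {pairec_environment!r}") from None


def pick_released(configs: list[dict]) -> dict:
    """Return the single Released entry, or raise if none or several match."""
    released = [c for c in configs if c.get("Status") == "Released"]
    if len(released) != 1:
        raise LookupError(
            f"expected exactly one Released config, found {len(released)}")
    return released[0]


if __name__ == "__main__":
    print(to_cli_environment("prepub"))
```

Raising on an unmapped environment, instead of passing the raw value through, keeps a typo in the service env vars from turning into an empty result list.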

Step 5: Get Engine Configuration Details

bash
aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id>
What to extract:
  • `ConfigValue`: The actual engine configuration (JSON/YAML)

Step 5.5 (Optional): Static Config Sanity Check

Optionally run `scripts/validate.py` against the retrieved `ConfigValue` to quickly rule out structural, reference, and naming errors in the engine configuration before diving into the log trace. See Workflow 2 § Step 3 and references/config-validation.md for usage, exit codes, and the full rule list.
bash
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin
When to run: when the log trace points at a specific configuration element (e.g. a `RecallConfs` / `FilterConfs` / `SceneConfs` entry), or when the configuration is being diagnosed for the first time in this skill session.
When to skip: when the log trace already shows a decisive non-config root cause (e.g. a `scene_id` not present in `SceneConfs`, a 5xx from an upstream EAS dependency, a missing feature table). `validate.py` is a static checker and cannot detect request-time mismatches between client input and configuration.
[MUST] Scoping rule for the final report:
  • `validate.py` findings may enter the final diagnosis ONLY when they are directly tied to the log evidence for the current `request_id` (e.g. the log blames a `RecallConf` name that `validate.py` flags as duplicated or dangling).
  • Findings unrelated to the current `request_id` trace MUST NOT be added to the final conclusion. They remain an internal sanity-check signal only. This preserves the evidence-only reporting rule in Step 6.
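One way to picture the scoping rule is as a filter over the validator's findings. The sketch below is hypothetical throughout: the output format of `validate.py` is documented in references/config-validation.md rather than here, so the `name` key on each finding is an assumed shape, and plain substring matching against log lines is a deliberately crude stand-in for "directly tied to the log evidence".

```python
def findings_tied_to_trace(findings: list[dict],
                           log_lines: list[str]) -> list[dict]:
    """Keep only findings whose config name appears in the request's logs."""
    kept = []
    for finding in findings:
        # Assumed shape: each finding names the config entry it concerns.
        name = finding.get("name")
        if name and any(name in line for line in log_lines):
            kept.append(finding)
    return kept
```

Findings filtered out this way stay available as an internal signal; they are simply not eligible for the final report.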

Step 6: Comprehensive Analysis

Analyze the following components together:
  1. API Response: Error code, message, and returned data
  2. Service Logs: Trace logs for the request_id showing processing flow
  3. Engine Configuration: Settings that may affect the behavior
Common Issues to Check:
  • Configuration mismatches (e.g., recall settings, filtering rules)
  • Resource limitations (e.g., insufficient items, timeout settings)
  • Data source issues (e.g., table access, feature availability)
  • Environment inconsistencies (e.g., prod config in prepub environment)
[MUST] Evidence-only reporting rule:
The final diagnosis delivered to the user MUST be grounded strictly in what the EAS service logs and the engine configuration directly show. Apply the following constraints:
  • Report only what is observed. Quote the exact log line (file:line, level, message) and the exact config fragment that proves each claim.
  • State the direct causal chain from log evidence to the API response, and stop there.
  • Do NOT add any of the following unless the user explicitly asks:
    • Speculative root causes not visible in logs/config (e.g., "client probably sent wrong X")
    • Fix recommendations or remediation steps
    • Conditional "if X then Y" scenarios
    • Tangential best-practice advice (security, fallback design, naming, etc.)
    • Guesses about upstream systems, client code, or data sources not covered by the logs/config
  • If the evidence is insufficient to reach a conclusion, state explicitly what additional data (specific log lines, other config versions, other environments) is needed, instead of guessing.
  • Recommendations are opt-in only. Provide fixes/suggestions only when the user explicitly requests them in a follow-up.

Workflow 2: PAI-Rec Engine Configuration Validation

This workflow validates engine configurations for potential issues.
Input: Configuration name and environment (Prod/Pre)

Step 1: List Configuration Versions

If the user doesn't provide `engine-config-id`, list the available versions:
bash
aliyun pairecservice list-engine-configs \
  --instance-id <instance-id> \
  --environment <Prod|Pre> \
  --name <config-name>
Display to the user:
  • `Version`: Version number
  • `Status`: Configuration status (Released/Draft/Archived)
  • `GmtCreateTime`: Creation timestamp
  • `EngineConfigId`: Version ID
Ask the user to select a version or provide the `engine-config-id`.

Step 2: Retrieve Configuration Details

bash
aliyun pairecservice get-engine-config \
  --instance-id <instance-id> \
  --engine-config-id <engine-config-id>

Step 3: Run Schema + Rule Validation

[MUST] Feed the extracted `ConfigValue` JSON into `scripts/validate.py`. The script enforces the JSON Schema (`references/schema.json`) plus reference-consistency rules, and exits with status 0 on pass and 1 on failure.
bash
# From stdin (recommended when ConfigValue is already in memory)
printf '%s' "$CONFIG_VALUE" | python3 scripts/validate.py --stdin

# From a saved JSON file
python3 scripts/validate.py /tmp/engine-config.json

# From an inline JSON string
python3 scripts/validate.py '{"RunMode":"product","RecallConfs":[...]}'

Requires `jsonschema` (`pip install jsonschema`); if it is missing, the script falls back to
rule-only validation without Schema checks.

**What the script checks (summary):**

1. **Structure** — JSON well-formedness, required fields, types (`RunMode`,
   `RecallConfs`, `FilterConfs`, `SortConfs`, `AlgoConfs`, `SceneConfs`, `RankConf`,
   `FeatureConfs`, `UserFeatureConfs`, `DebugConfs`, `FeatureLogConfs`,
   `CallBackConfs`, `PipelineConfs`, etc.)
2. **Enum values** — `RecallType` / `FilterType` / `SortType` / `RunMode` /
   `DebugConfs.OutputType` / `GeneralRankConfs.ActionConfs[].ActionType`
3. **Reference consistency** — `SceneConfs.RecallNames` → `RecallConfs`;
   `FilterNames` → `FilterConfs`; `SortNames` → `SortConfs`;
   `RankConf.RankAlgoList` → `AlgoConfs`; any `DaoConf.AdapterType` +
   `*Name` → the corresponding `*Confs` (Hologres / Redis / MySQL / TableStore /
   FeatureStore / …)
4. **Business rules**
   - `User2ItemExposureFilter` with `WriteLog=true` + FeatureStore adapter: must set
     `TimeInterval > 0`
   - `PriorityAdjustCountFilter` in `accumulator` mode: `Count` must be strictly
     increasing (use `Type="fix"` for independent per-recall caps)
   - `PipelineConfs.*.Name` must be globally unique
   - `DebugConfs.Rate` must be an integer in `[0, 100]`
5. **Duplicate name detection** within `RecallConfs`, `FilterConfs`, `SortConfs`,
   `AlgoConfs`

Detailed usage, exit codes, example outputs and the full rule list live in
[references/config-validation.md](references/config-validation.md).
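
To make the reference-consistency and duplicate-name checks concrete, here is a minimal sketch of both. It is not the code of `scripts/validate.py`, which this document does not show. It assumes each `RecallConfs` entry carries a `Name` key, and it finds `RecallNames` lists anywhere inside `SceneConfs` rather than assuming a particular nesting depth. The function names are illustrative.

```python
from collections import Counter
from typing import Any, Iterator


def duplicate_names(entries: list[dict]) -> list[str]:
    """Return names that occur more than once in a *Confs list."""
    counts = Counter(e.get("Name") for e in entries if e.get("Name"))
    return sorted(name for name, seen in counts.items() if seen > 1)


def iter_lists(node: Any, key: str) -> Iterator[list]:
    """Yield every list stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for child_key, value in node.items():
            if child_key == key and isinstance(value, list):
                yield value
            else:
                yield from iter_lists(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from iter_lists(item, key)


def dangling_recall_names(config: dict) -> list[str]:
    """Return RecallNames referenced in SceneConfs but absent in RecallConfs."""
    defined = {e.get("Name") for e in config.get("RecallConfs", [])}
    referenced = {
        name
        for names in iter_lists(config.get("SceneConfs", {}), "RecallNames")
        for name in names
    }
    return sorted(referenced - defined)
```

The same two helpers would extend to `FilterNames` → `FilterConfs` and `SortNames` → `SortConfs` by changing the key and the list being compared.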

Step 4: Evidence-Grounded Report

Report to the user based strictly on the script's output plus any additional inspection of `ConfigValue`:
  • ✅ Checks passed (Schema clean, references resolved, no rule violations)
  • ⚠️ Warnings reported by the script (severity=warning) or inconsistencies observed in `ConfigValue` (e.g. naming collisions between `RankScore` variables and model output fields, env/region mismatches)
  • ❌ Errors reported by the script (severity=error) or missing required fields
  • Missing-evidence notes — what extra data (other config versions, model signatures, etc.) would be needed to turn a warning into a confirmed error
Do not add speculative fixes or best-practice tangents; suggestions are provided only when the user explicitly asks for them.

Success Verification Method

For detailed verification steps, see references/verification-method.md.
Quick Verification:
  1. For Diagnosis Workflow:
    • Service information retrieved successfully
    • Logs found containing the request_id
    • Configuration loaded correctly
    • Root cause identified
  2. For Validation Workflow:
    • Configuration retrieved successfully
    • All validation checks executed
    • Issues clearly reported
    • Recommendations provided (only when the user explicitly asked for them)

Cleanup

This skill performs read-only operations and does not create any resources that require cleanup.

Best Practices

  1. Always capture request_id: When reporting API issues, include the full response with request_id for accurate log correlation.
  2. Log queries (keyword only, no time range): For request-level diagnosis, pass `--keyword <request_id>` to `aliyun eas describe-service-log` and leave `--start-time` / `--end-time` unset. Combining a keyword with a time range filters out business logs due to a CLI quirk (see Workflow 1, Step 3). Only use time ranges for broad non-request scans, and only with the `yyyy-MM-dd HH:mm:ss` UTC format (no `T`, no `Z`).
  3. Environment awareness: Always verify that configurations match the target environment (Prod vs Pre).
  4. Version control: When validating configurations, check multiple versions if issues persist across deployments.
  5. Log retention: EAS service logs are retained for limited periods; diagnose issues promptly after occurrence.
  6. Configuration backup: Before applying changes based on validation results, ensure current configurations are backed up.
  7. Cross-reference: Compare working configurations with problematic ones to identify differences.
  8. Service status: Check EAS service status before diagnosing; service-level issues may mask configuration problems.
  9. Evidence-only conclusions: Ground every statement in the diagnosis on a specific log line or config fragment. Do not speculate, do not propose fixes, and do not volunteer best-practice advice unless the user explicitly asks. If the evidence is insufficient, say what is missing rather than inferring.
  10. Structured analysis: Follow the systematic workflow rather than jumping to conclusions based on error messages alone.
  11. Document findings: Keep track of recurring issues and their resolutions for faster future diagnosis.

Reference Links

| Reference Document | Description |
| --- | --- |
| RAM Policies | Required RAM permissions for PAI-Rec and EAS APIs |
| Related Commands | Complete CLI command reference |
| Verification Method | Detailed verification procedures |
| CLI Installation Guide | Alibaba Cloud CLI installation instructions |
| Configuration Examples | Sample engine configurations and common patterns |
| Config Validation | `scripts/validate.py` usage, exit codes, rule catalogue |
| Troubleshooting Guide | Common issues and solutions |