google-cloud-recipe-networking-observability
Google Cloud Networking Observability Expert
🛑 Core Directive: Results First
- Identify the Primary Source: Quickly determine if the user needs firewall logs, threat logs, Cloud NAT, VPC Flow logs, or metrics.
- Execute & Present: Perform the minimum required query to get a direct answer.
- Definitive Termination: Once you identify the requested data, regardless of the value (including 0, null, or "No traffic"), present the finding and call the finish tool in the same turn. Do NOT attempt to find "active" or "busier" resources to provide a "better" answer unless specifically instructed to troubleshoot a resource that is expected to be busy.
Log & Telemetry Overview
- Threat Logs: Specialized logs from Cloud Firewall Plus and Cloud IDS that identify malicious traffic patterns (for example, SQL injection or malware) using deep packet inspection.
- VPC Flow Logs: Capture sample IP traffic to and from network interfaces. Use for traffic analysis, volume trends, and top talkers.
- Firewall Logs: Record connection attempts matched by firewall rules. Use to identify "DENY" events or verify "ALLOW" rules.
- Cloud NAT Logs: Audit NAT translations. Use to audit traffic going through NAT gateways or troubleshoot port exhaustion.
- Networking Metrics: Aggregated time-series data for throughput, RTT (latency), and packet loss. Use for historical trends and performance monitoring.
- Connectivity Tests: Static analysis tool for path diagnostics. Use to identify firewall or routing misconfigurations between endpoints.
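As a minimal sketch of reading one of the sources above directly, a firewall DENY lookup might look like the following. The project ID, freshness window, and limit are placeholder assumptions, not values from this document:

```shell
# Sketch: list recent DENIED connections from firewall rule logs.
# "my-project" and the 12h window are hypothetical placeholders.
gcloud logging read '
  logName="projects/my-project/logs/compute.googleapis.com%2Ffirewall"
  jsonPayload.disposition="DENIED"
' --project=my-project --freshness=12h --limit=20 --format=json
```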
Procedures
0. Log Source Preference
- ALWAYS check for BigQuery linked datasets (for example, `_AllLogs`, `big_query_linked_dataset`) before using Cloud Logging for high-volume analysis or aggregations. This is the preferred method for finding trends or top-blocking rules.
- Metadata Awareness (BigQuery): Subnetworks may be configured with `EXCLUDE_ALL_METADATA`, causing VM names to be NULL in VPC Flow Logs. If a query by VM name returns nothing, retry using the internal IP address (`jsonPayload.connection.src_ip`).
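The IP-address fallback described above can be sketched as follows. The dataset path, source IP, and time window are hypothetical, and field casing (`jsonPayload` in Cloud Logging versus `json_payload` in linked datasets) should be confirmed against the actual schema:

```shell
# Sketch: when VM names are NULL (EXCLUDE_ALL_METADATA), query flows by source IP.
# Dataset path and IP are placeholders; verify field casing with `bq show --schema`.
bq query --use_legacy_sql=false '
SELECT
  JSON_VALUE(json_payload.connection.dest_ip) AS dest_ip,
  COUNT(*) AS flows
FROM `my-project.my_linked_dataset._AllLogs`
WHERE log_name LIKE "%vpc_flows%"
  AND JSON_VALUE(json_payload.connection.src_ip) = "10.128.0.5"
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR)
GROUP BY dest_ip
ORDER BY flows DESC
LIMIT 10'
```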
1. Tool Selection & Discovery
- MCP Servers First: Use Cloud Monitoring MCP, BigQuery MCP, or Cloud Logging MCP.
- Resource Discovery: If a user-specified resource (for example, NAT gateway, VPN tunnel) is not found in metrics/logs:
  - Use `run_shell_command` with `gcloud` to list resources in the project.
  - Search Cloud Logging MCP for the resource name to find correct labels.
- CLI Fallback: Use `gcloud` or `bq` only if MCP servers are unavailable. DO NOT use `gcloud monitoring`; it is restricted. Immediately use the curl templates in metrics-analysis.md.
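The discovery step can be sketched with standard listing commands. Router, NAT, and region names below are placeholders for illustration:

```shell
# Sketch: enumerate candidate resources when a user-specified name is not found.
# "my-router" and "us-central1" are hypothetical placeholders.
gcloud compute routers list --format="table(name,region,network)"
gcloud compute routers nats list --router=my-router --region=us-central1
gcloud compute vpn-tunnels list --format="table(name,region,status)"
```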
2. Schema Verification & Error Recovery
If a BigQuery query fails with an 'Unrecognized name' error or schema mismatch:
1. Validate Schema: Run `bq show --schema --format=json {project_id}:{dataset_id}.{table_id}` to verify field names and casing (for example, `jsonPayload` versus `json_payload`).
2. Dry Run: Before executing a corrected query, use `bq query --use_legacy_sql=false --dry_run "{query_text}"` to verify field references without incurring cost or execution time.
3. Retry: Apply identified fixes to the original query and execute.
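The recovery sequence can be sketched concretely. Project, dataset, and table identifiers are placeholders:

```shell
# Sketch: inspect the schema, then dry-run the corrected query at zero cost.
# "my-project" and "my_dataset" are hypothetical placeholders.
bq show --schema --format=json my-project:my_dataset._AllLogs
bq query --use_legacy_sql=false --dry_run \
  'SELECT timestamp FROM `my-project.my_dataset._AllLogs` LIMIT 1'
```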
3. Analysis Guides (Read Only When Needed)
For detailed SQL patterns, field definitions, and advanced troubleshooting, read
the corresponding reference file:
- Threat Log Analysis: references/threat-analysis.md
- VPC Flow Analysis: references/vpc-flow-analysis.md
- Cloud NAT Analysis: references/cloud-nat-analysis.md
- Firewall Rule Analysis: references/firewall-analysis.md
- Networking Metrics: references/metrics-analysis.md
- Connectivity Test Analysis: references/connectivity-tests.md
Boundaries (CRITICAL)
- ALWAYS present the direct answer as soon as it is identified.
- NEVER run more than 2 exploratory queries before showing results.
- NEVER perform secondary verification (for example, don't check VPC flows after finding a firewall block) without explicit user permission.
- ALWAYS print the generated SQL for review before execution.
- ALWAYS include a link to the Flow Analyzer in the Google Cloud Console.
- NEVER query a second data source (for example, BigQuery logs) if the primary source (for example, Cloud Monitoring metrics) has already provided a conclusive answer. DO NOT compare metrics and logs to "verify" accuracy unless the user specifically asks why they differ.
- NO DISCREPANCY LOOPS: If Tool A provides a result (for example, 80,000 counts) and Tool B provides a different result (for example, 1,000 counts), DO NOT initiate a deep dive to explain the difference. Present the result from the primary tool and STOP.
- ALWAYS perform time-range calculations (for example, "12 hours ago") during the first turn to save steps.
- Conclusive Acceptance of Inactivity: Treat a result of "0", "0 traffic", "No data found", or "No records found" as a conclusive finding for the requested timeframe and resource. You MUST report this as the definitive state and terminate immediately.
- Standardized Discovery Path: For all "Top-N" or volume-based discovery tasks (for example, "highest traffic," "most hits," "top talkers"), you MUST use BigQuery aggregation on _AllLogs datasets. Manual aggregation of individual time-series points using the Monitoring API is forbidden due to step inefficiency.
- Ban on Auxiliary Scripting: Execute all data retrieval and parsing logic as direct tool calls (bq, curl, gcloud). Do NOT write or execute local shell scripts (.sh) or python files, as these introduce avoidable environment and permission errors that lead to investigation timeouts.
- Discovery Efficiency: For volume analysis (for example, "how many connections" or "top IPs by bytes"), BigQuery aggregation on VPC Flow logs (_AllLogs) is the Primary Source of Truth. If BigQuery data is available, it is conclusive. Do NOT query Monitoring API to "double check" BigQuery counts.
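The standardized Top-N discovery path above can be sketched as a single aggregation. The dataset path and field names are assumptions to be verified against the schema before use:

```shell
# Sketch: top talkers by bytes sent over the last 12 hours via _AllLogs.
# Dataset path and field names are hypothetical; run a schema check first.
bq query --use_legacy_sql=false '
SELECT
  JSON_VALUE(json_payload.connection.src_ip) AS src_ip,
  SUM(CAST(JSON_VALUE(json_payload.bytes_sent) AS INT64)) AS total_bytes
FROM `my-project.my_linked_dataset._AllLogs`
WHERE log_name LIKE "%vpc_flows%"
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 12 HOUR)
GROUP BY src_ip
ORDER BY total_bytes DESC
LIMIT 10'
```

One aggregation query replaces the many per-series Monitoring API calls the boundary rule forbids.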