# Notebook Guidance
## When to Use a Notebook
Before choosing to use a notebook, evaluate the task complexity using these heuristics.

Use a notebook if you meet at least one of these criteria:

- 📈 Data Insights & Storytelling: Use a notebook for any request to "give insights", "find trends", "explore data", or "analyze data". These tasks benefit from using visualizations to present the data.
- 📊 Visualizations are requested: The user explicitly asks for charts or plots.
- 🔄 Stateful / Iterative Exploration: You need to run a query, inspect results, and decide the next query based on those results while keeping state in memory.

Skip the notebook ONLY if:

- 📝 Simple Fact/Status: The request only requires a single number (e.g., "how many rows") or a status check (e.g., "when was this table updated").
- 🏃 Schema Preview: The request is only about the schema or field types.

Golden Rule of Data Storytelling: If any analytical insight, trend, or comparison is involved, favor a notebook and a visualization. A notebook is the "standard" environment for our developer workflow; do not avoid it because of "overhead".
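The heuristics above can be sketched as a small decision helper. This is a hypothetical illustration — the keyword lists are assumptions for demonstration, not part of any official spec:

```python
# Hypothetical sketch of the notebook-selection heuristics above.
# Keyword lists are illustrative assumptions, not an official spec.

INSIGHT_KEYWORDS = ("insight", "trend", "explore", "analyze", "analysis")
SIMPLE_KEYWORDS = ("how many rows", "when was", "schema", "field types")

def should_use_notebook(request: str, wants_chart: bool = False,
                        needs_iteration: bool = False) -> bool:
    """Return True when at least one 'use a notebook' criterion is met."""
    text = request.lower()
    if wants_chart or needs_iteration:
        return True
    if any(k in text for k in INSIGHT_KEYWORDS):
        return True
    # Simple fact/status or schema-only requests stay out of a notebook.
    if any(k in text for k in SIMPLE_KEYWORDS):
        return False
    # Golden rule: when in doubt, favor the notebook.
    return True
```

The final branch encodes the golden rule: ambiguous requests default to a notebook rather than away from it.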
## Notebook Best Practices
> [!IMPORTANT] Agent execution rules: Your behavior MUST depend on whether the `notebook_execute_cell` tool is available in your current context:
> - If the `notebook_execute_cell` tool is available: You MUST follow the incremental GENERATE CELL -> EXECUTE CELL -> VALIDATE flow.
> - If the `notebook_execute_cell` tool is NOT available: You MUST generate the complete notebook and request user execution.

- CONDITIONAL EXECUTION FLOW:
  - If the `notebook_execute_cell` tool is available: Follow the STEP BY STEP GENERATE CELL -> EXECUTE CELL -> VALIDATE OUTPUT flow. Generate ONE cell, execute it, then verify the output. If the output is data (e.g., a dataframe), you MUST inspect it to confirm the logic is correct before generating the next step. Batch generation of an entire notebook is strictly prohibited because error propagation in notebooks is expensive to fix.
  - If the `notebook_execute_cell` tool is NOT available:
    - Create the whole notebook at once.
    - Tell the user to run the notebook.
    - Tell the user to let you know once the notebook run is completed, so you can check the outputs to verify they are correct and fix any errors.
- IDENTIFY DATA EARLY: Use @skill:discovering-gcp-data-assets or BigQuery list tools to find the correct `project.dataset.table` before writing ANY code. If the table ID is missing, ask the user.
- CLEAN FINAL STATE: The final notebook MUST NOT have failed cells. If a cell fails, you MUST fix it. If you tried several versions, delete the failed attempts before you present the notebook to the user.
- LOGICAL CHUNK FIDELITY: Keep cells small: one logical transformation or visualization per cell. Group related cells into logical units (e.g., a `%%bqsql` BigQuery magic cell followed immediately by a Python visualization cell for those results). Use descriptive markdown cells to separate and document different logical sections.
- GENERATE VISUALIZATIONS: Always accompany data insights with visualizations; charts are often more effective than raw numbers for communicating trends and comparisons.
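The incremental GENERATE -> EXECUTE -> VALIDATE loop might look like the following sketch. Here `execute_cell` is a local stub standing in for the real `notebook_execute_cell` tool, and the validation logic is an assumption for illustration:

```python
# Sketch of the incremental cell loop. `execute_cell` is a stand-in stub
# for the real notebook_execute_cell tool; replace it with the actual tool call.

def execute_cell(source: str) -> dict:
    """Stub executor: pretend to run a cell and return its output."""
    return {"status": "ok", "output": f"ran: {source}"}

def run_incrementally(planned_cells: list[str]) -> list[dict]:
    results = []
    for source in planned_cells:
        result = execute_cell(source)      # EXECUTE one cell at a time
        if result["status"] != "ok":       # VALIDATE before moving on
            raise RuntimeError(f"Cell failed, fix before continuing: {source}")
        results.append(result)             # inspect the output here if it is data
    return results
```

The point of the loop structure is that no later cell is generated until the previous cell's output has been checked, which is exactly what the batch-generation prohibition above is guarding against.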
## Kernel & Environment Management
Notebooks run in specific kernels (execution backends). You MUST ensure the kernel's Python environment contains the necessary libraries (`bigframes`, `ipykernel`, etc.).

### Kernel Types
- Local Python: Standard Python 3 kernel running on the notebook host (Managed instance, local machine).
- Cloud Spark Remote (Dataproc Serverless): Transient Spark environment managed by GCP. Use for large-scale data processing.
- Cloud Spark Remote (Dataproc Cluster): Persistent Spark clusters for shared or custom configurations.
- Colab (Managed): Ephemeral Google-managed runtimes.
### No Active Kernel / Setup Check
- Infer or Ask about Kernel Preferences:
  - Infer from Context:
    - If the task mentions "Spark", "PySpark", or "distributed compute", or if the active workspace is already a Spark cluster, lean towards Remote Spark.
    - If the task is focused on "BigQuery", "BigFrames", or standard API calls, lean towards Local Python.
  - Ask when Ambiguous: If multiple options fit, ask whether the user prefers a Local Python or a Cloud/Remote kernel (e.g., Colab, Spark).
- For Local Setup: Use @skill:managing-python-dependencies to verify whether a virtual environment exists. If not, create one. Ensure `ipykernel` is installed in that environment. Install any other relevant libraries.
- For Remote Setup: Advise the user to use the UI to select the appropriate remote kernel.

> [!IMPORTANT] HARD STOP on kernel failure: If a cell execution returns "no active kernel" or any kernel-not-found error, you MUST stop immediately. Do NOT scaffold, generate, or insert any further cells. Inform the user which kernel is needed (e.g., PySpark / Dataproc Serverless) and wait for explicit confirmation that a kernel is active before proceeding with notebook execution.
### Proper Library Installation
#### 1. Local Kernels
Before installing any Python libraries, you MUST use @skill:managing-python-dependencies to detect how Python dependencies are managed in the project.

#### 2. Remote Kernels (Spark/Colab)
Since these are often ephemeral or managed by GCP:

- Check first (REQUIRED): Before writing any `%pip install` cell, run `%pip list` or `import <package>` to confirm the package is not already present. Managed runtimes (Dataproc Serverless, Colab) pre-install many common packages. Only install what is confirmed missing.
- Use `%pip install <package>` in the first cell if a package is confirmed missing and it's the only way to modify the runtime.

When in doubt about the kernel type or preferred installation method, ask the user for clarification.
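The "check first" step can also be done without shelling out to pip: the standard-library `importlib.util.find_spec` reports whether a package is importable in the current kernel. A minimal sketch:

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if `package` is importable in the current kernel."""
    return importlib.util.find_spec(package) is not None

# Only emit a %pip install cell for packages confirmed missing.
needed = ["sys_does_not_exist_xyz", "json"]  # example names, first one is fake
missing = [p for p in needed if not is_installed(p)]
```

This avoids a redundant `%pip install` cell when the managed runtime already ships the package.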
## Data Analysis & Visualization Rules
Guidelines for performing exploratory data analysis, data cleaning, and
visualization in notebooks.
### Notebook Layout
The notebook should read like a story. While you have flexibility (e.g., multiple visualizations for one data cell, or data cells building on each other), aim for this general flow:

1. Title & Objective (Markdown Cell)
   - What is this notebook for? (e.g., `# Retention Analysis`)
2. Section Header (Markdown Cell)
   - What are we looking at now? (e.g., `## Exploring User Retention`)
3. Data Acquisition/Transformation (Python cell, may contain `%%bqsql` magics)
   - Query BigQuery or transform data.
4. Verification (Optional but Recommended) (Python Cell)
   - `df.head()` or assert sanity checks.
5. Visualization (The Goal) (Python Cell)
   - Plot the insight (e.g., `df.plot()`).

Repeat steps 2-5 for each new sub-topic or insight. You can have multiple Data cells before a Visualization, or multiple Visualizations from one Data cell. The key is to keep them grouped logically and separated by Markdown headers.

6. Final Summary (Markdown Cell)
   - At the end of the notebook, add a markdown cell containing a summary paragraph that summarizes the findings to the user. The summary MUST follow these guidelines:
     - MUST NOT add Python code to the summary.
     - The summary MUST NOT start with a code block.
     - The summary MUST be strictly grounded in the numerical data verified in the notebook.
     - The summary MUST ONLY contain the following three sections:
       - Q&A: If the data analysis task contains questions (implied or explicit), you MUST answer them based on the solving process. Skip this section if there are no questions to answer.
       - Data Analysis Key Findings: Summarize the key analysis findings in bullet points; it's a plus to quote the numbers from the previous steps. Only report high-value findings, skip the obvious ones.
       - Insights or Next Steps: Provide 1-2 concise insights or next steps in bullet points.
7. Next Steps: After the notebook has been successfully executed and verified, and the summary is complete, notify the user and propose next-step suggestions.
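The layout above can be represented as an ordered list of cells. This is only a sketch — the cell sources are placeholders, not real queries:

```python
# Sketch of the layout as (cell_type, source) pairs; sources are placeholders.
layout = [
    ("markdown", "# Retention Analysis"),                       # title & objective
    ("markdown", "## Exploring User Retention"),                # section header
    ("code",     "%%bqsql df_retention\nSELECT placeholder"),   # data acquisition
    ("code",     "df_retention.head()"),                        # verification
    ("code",     "df_retention.plot()"),                        # visualization
    ("markdown", "## Summary\n- Q&A\n- Data Analysis Key Findings"
                 "\n- Insights or Next Steps"),                 # final summary
]

# Every section opens with markdown, and the notebook ends with the summary cell.
assert layout[0][0] == "markdown" and layout[-1][0] == "markdown"
```

Each code cell is sandwiched between markdown so the notebook reads as a narrative rather than a script.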
### Plotting Rules
- You MUST use different colors for different features to ensure plots are readable for humans.
- When creating a plot, you MUST adjust the figure size based on the number of features. The labels and legends MUST NOT overlap.
- You SHOULD arrange the layout wisely. Using subplots CAN help in placing different plots effectively.
- You MUST use inline figures to present figures and plots along with code and text in the notebook.
- For clustering, use PCA to reduce to 2D before scatter plotting.
- Use Line Charts ONLY for continuous data (e.g. time series) where interpolation between points is meaningful.
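A minimal sketch of several of these rules together — figure size scaled to the number of features, one distinct color per feature, subplots for layout, and a line chart for a continuous series. The data here is synthetic and the sizing factor (2.5 inches per feature) is an illustrative assumption:

```python
# Sketch: scale figure size with feature count, distinct color per feature.
import matplotlib
matplotlib.use("Agg")  # headless backend; in a notebook use inline figures
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
features = {f"feature_{i}": rng.normal(size=50).cumsum() for i in range(4)}

n = len(features)
# 2.5 inches of height per feature is an illustrative choice, not a rule.
fig, axes = plt.subplots(n, 1, figsize=(8, 2.5 * n), sharex=True)
colors = plt.cm.tab10(np.linspace(0, 1, n))  # distinct color per feature
for ax, (name, values), color in zip(axes, features.items(), colors):
    ax.plot(values, color=color, label=name)  # line chart: continuous series
    ax.legend(loc="upper left")
fig.tight_layout()  # keep labels and legends from overlapping
```

In a notebook you would rely on inline rendering (`%matplotlib inline` or the default in modern Jupyter) instead of the `Agg` backend used here for headless execution.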
### Data Cleaning Rules
- You MUST be careful about missing values and duplicated values.
- You MUST NOT drop columns unless absolutely necessary. Dropping columns is irreversible.
- You SHOULD focus on columns directly related to accomplishing the task; not every column NEEDS to be cleaned.
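A sketch of a non-destructive cleaning pass following these rules. It is shown with pandas for brevity; BigFrames DataFrames expose the same `isna()` / `duplicated()` / `drop_duplicates()` methods, and the sample data is invented:

```python
# Sketch of a non-destructive cleaning pass on invented sample data.
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "revenue": [10.0, 20.0, 20.0, None],
    "notes":   ["a", "b", "b", None],   # not task-relevant: leave it alone
})

task_cols = ["user_id", "revenue"]                     # only clean relevant columns
missing = df[task_cols].isna().sum()                   # inspect before acting
dup_rows = int(df.duplicated(subset=task_cols).sum())  # count, don't silently drop

# Filter rows rather than dropping whole columns (dropping columns is irreversible).
cleaned = df.drop_duplicates(subset=task_cols).dropna(subset=["revenue"])
```

Note that `cleaned` keeps all three columns: only rows are filtered, and the task-irrelevant `notes` column is never touched.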
## Specialized Notebook Guidance
Refer to the following resources for guidance on specific notebook topics:
### 1. BigQuery in Notebooks
Use `%%bqsql` BigFrames magics for BigQuery SQL queries. These cells support native BigQuery SQL execution and data export to BigFrames dataframes.

> [!IMPORTANT]
> - Unless specified by the user, always use SQL for querying BigQuery.
> - DO NOT use the standard BigQuery Python client library (`google.cloud.bigquery`) or `pandas.read_gbq`.
> - Mandatory dataframe export: Always provide a dataframe name, e.g. `%%bqsql <df_name>`. This makes it easy to use results in follow-up Python cells.
> - Verify that `bigframes` version `2.38.0` and above is installed in the notebook runtime environment. If it is missing, ask the user if they would like you to upgrade for them.

Example `%%bqsql` magic usage:
Initialize BigFrames and load `%%bqsql` magics:

```python
import bigframes
import bigframes.pandas as bpd
%load_ext bigframes
```
> [!CAUTION] Always use `%load_ext bigframes` exactly as shown. Do not load
> submodules — for example, `%load_ext bigframes.magics` or `%load_ext
> bigframes.bigquery` are not valid and must not be used.
> [!IMPORTANT] The `bigframes` library must be installed. Determine if bigframes
> needs to be installed by following @skill:managing-python-dependencies.
```python
%%bqsql df_sample
SELECT * FROM `project.dataset.table` LIMIT 10
```

#### Anti-patterns (NEVER DO THESE)
> [!CAUTION]
> 1. NO Python SDK for Queries: Do not switch to `client.query(sql).to_dataframe()` if SQL fails. Fix the SQL syntax instead.
> 2. NO Mixing Logic: Do not put Python code in the same cell as `%%bqsql` magics.
#### Working with SQL Results in Python
Magic cells with `%%bqsql <df_name>` produce a BigQuery DataFrame. In subsequent cells, you can use `<df_name>` directly.

> [!IMPORTANT] You MUST use BigFrames for data exploration, manipulation, splitting, etc. You MUST use BQML SQL or `bigframes.ml` for machine learning tasks. You MUST NOT use pandas or Scikit-learn.
#### BigQuery DataFrame Tips
- Avoid `.to_pandas()`: You MUST NOT use `.to_pandas()` to download the entire dataset into memory. There are some exceptions:
  - An error message explicitly requests you to use `to_pandas()`.
  - You are going to visualize the data, and the visualization library does not accept BigFrames Dataframe/Series instances. In this case, reduce the amount of data you are going to download before calling `.to_pandas()`.
- Avoid `read_gbq()` for SQL: Do not write SQL queries and execute them with `read_gbq()`. Use BigFrames Dataframe/Series methods instead.
- Use the BigFrames ML package for Machine Learning Tasks: Do not use Scikit-learn or other ML libraries with BigFrames dataframes. Import your tools/classes from `bigframes.ml`.
- Stay in the Cloud: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
- Accessors over UDFs/Lambdas:
  - Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs.
  - Do not use lambdas with `Series.map()` or `DataFrame.apply()`.
- Schema Verification: Do not assume the schema of intermediate outputs. Check `.dtypes` after loading, and use `.peek()` with `display()` or `.head()`.
- Visualization: BigFrames Dataframes mostly work directly with Matplotlib, Seaborn, and other plotting libraries. If your attempt didn't work, try using the "plot" accessor. If that didn't work either, you MUST sample or aggregate your data to make it small enough before calling `to_pandas()`.
- Model Persistence: To persist a model, use `model.to_gbq()`. To load a persisted model, use `bpd.read_gbq_model()`.
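The accessor rule in practice, demonstrated with pandas since BigFrames implements a large subset of the pandas accessor API (`.str.*`, `.dt.*`); on a BigFrames DataFrame these accessors push work down to BigQuery instead of requiring a remote UDF. The sample data is invented:

```python
# Accessors instead of apply()/lambdas; invented sample data.
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "b@test.org"],
    "ts": pd.to_datetime(["2024-01-15", "2024-02-01"]),
})

# Preferred: built-in accessors run as vectorized (pushed-down) operations.
domains = df["email"].str.split("@").str[1]
months = df["ts"].dt.month

# Avoid: df["email"].apply(lambda e: e.split("@")[1]) — on a BigFrames
# DataFrame this would need a remote UDF and is far slower.
```

The same expressions read identically on BigFrames, which is what makes the accessor-first habit cheap to adopt.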
### 2. Machine Learning in Notebooks
Integration with machine learning workflows and best practices.

- Guide: Use @skill:ml-best-practices.
- MUST READ WHEN: The task involves machine learning, training a model, clustering, classification, regression, or time-series forecasting.

If any "MUST READ WHEN" condition is met, you MUST read the corresponding guide before proceeding.