# Notebook Guidance

## When to Use a Notebook
Before choosing to use a notebook, evaluate the task's complexity using these
heuristics.

Use a notebook if at least one of these criteria is met:
- 📈 Data Insights & Storytelling: Use a notebook for any request to "give
insights", "find trends", "explore data", or "analyze data". These tasks
benefit from using visualizations to present the data.
- 📊 Visualizations are requested: The user explicitly asks for charts or
plots.
- 🔄 Stateful / Iterative Exploration: You need to run a query, inspect
results, and decide the next query based on those results while keeping
state in memory.
Do NOT use a notebook ONLY when the request is limited to one of the following:
- 📝 Simple Fact/Status: The request only requires a single number (e.g.,
"how many rows") or a status check (e.g., "when was this table updated").
- 🏃‍♂️ Schema Preview: The request is only about the schema or field
  types.
Golden Rule of Data Storytelling: If any analytical insight, trend, or
comparison is involved, favor a notebook and a visualization. A notebook is the
"standard" environment for our developer workflow; do not avoid it because of
"overhead".
## Notebook Best Practices
> [!IMPORTANT]
> Agent execution rules: Your behavior MUST depend on whether the notebook
> tool is available in your current context:
>
> - If the notebook tool is available: You MUST follow the incremental
>   GENERATE CELL -> EXECUTE CELL -> VALIDATE flow.
> - If the notebook tool is NOT available: You MUST generate the complete
>   notebook and request user execution.
- CONDITIONAL EXECUTION FLOW:
  - If the notebook tool is available: Follow the STEP BY STEP
    GENERATE CELL -> EXECUTE CELL -> VALIDATE OUTPUT flow. Generate
    ONE cell, execute it, then verify the output. If the output is data
    (e.g., a dataframe), you MUST inspect it to confirm the logic is correct
    before generating the next step. Batch generation of an entire notebook
    is strictly prohibited because error propagation in notebooks is
    expensive to fix.
  - If the notebook tool is NOT available:
    - Create the whole notebook at once.
    - Tell the user to run the notebook.
    - Tell the user to let you know once the notebook run is completed so
      you can check the outputs to verify they are correct and fix any
      errors.
- IDENTIFY DATA EARLY: Use @skill:discovering-gcp-data-assets or BigQuery
  list tools to find the correct table ID before writing ANY code. If the
  table ID is missing, ask the user.
- CLEAN FINAL STATE: The final notebook MUST NOT have failed cells. If a
cell fails, you MUST fix it. If you tried several versions, delete the
failed attempts before you present the notebook to the user.
- LOGICAL CHUNK FIDELITY: Keep cells small. One logical transformation or
visualization per cell. Group related cells into logical units (e.g., a
BigQuery magic cell followed immediately by a Python visualization
cell for those results). Use descriptive markdown cells to separate and
document different logical sections.
- GENERATE VISUALIZATIONS: Always accompany data insights with
visualizations; charts are often more effective than raw numbers for
communicating trends and comparisons.
## Kernel & Environment Management
Notebooks run in specific Kernels (execution backends). You MUST ensure the
kernel's Python environment contains the necessary libraries (e.g.,
`bigframes`, `pandas`).
### Kernel Types
- Local Python: Standard Python 3 kernel running on the notebook host
(Managed instance, local machine).
- Cloud Spark Remote (Dataproc Serverless): Transient Spark environment
managed by GCP. Use for large-scale data processing.
- Cloud Spark Remote (Dataproc Cluster): Persistent Spark clusters for
shared or custom configurations.
- Colab (Managed): Ephemeral Google-managed runtimes.
### No Active Kernel / Setup Check
- Infer or Ask about Kernel Preferences:
  - Infer from Context:
    - If the task mentions "Spark", "PySpark", or "distributed compute", or
      if the active workspace is already a Spark cluster, lean towards
      Remote Spark.
    - If the task is focused on "BigQuery", "BigFrames", or standard API
      calls, lean towards Local Python.
  - Ask when Ambiguous: If multiple options fit, ask whether the user
    prefers a Local Python or a Cloud/Remote kernel (e.g., Colab, Spark).
- For Local Setup: Use @skill:managing-python-dependencies to verify whether
  a virtual environment exists. If not, create one. Ensure the kernel
  package (e.g., `ipykernel`) is installed in that environment, and install
  any other relevant libraries.
- For Remote Setup: Advise the user to use the UI to select the
appropriate remote kernel.
> [!IMPORTANT]
> HARD STOP on kernel failure: If a cell execution returns "no active
> kernel" or any kernel-not-found error, you MUST stop immediately. Do NOT
> scaffold, generate, or insert any further cells. Inform the user which
> kernel is needed (e.g., PySpark / Dataproc Serverless) and wait for
> explicit confirmation that a kernel is active before proceeding with
> notebook execution.
### Proper Library Installation

#### 1. Local Kernels
Before installing any Python libraries, you MUST use
@skill:managing-python-dependencies to detect how Python dependencies are
managed in the project.
#### 2. Remote Kernels (Spark/Colab)
Since these are often ephemeral or managed by GCP:
- Check first (REQUIRED): Before writing any install cell, run
  `%pip show <package>` or attempt an import to confirm the package is not
  already present. Managed runtimes (Dataproc Serverless, Colab) pre-install
  many common packages. Only install what is confirmed missing (see the
  sketch after this list).
- Use `%pip install` in the first cell if a package is confirmed missing and
  it's the only way to modify the runtime.
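A minimal sketch of this check-then-install flow on a managed runtime; the
package name `db-dtypes` is a hypothetical stand-in, not a required dependency:

```python
# Sketch: verify a package exists before installing on an ephemeral runtime.
# "db_dtypes" / "db-dtypes" is an illustrative stand-in package name.
import importlib.util

if importlib.util.find_spec("db_dtypes") is None:
    # Confirmed missing: install into the live kernel environment.
    %pip install db-dtypes
else:
    print("db-dtypes already present; skipping install.")
```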
When in doubt about the kernel type or preferred installation method, ask the
user for clarification.
## Data Analysis & Visualization Rules
Guidelines for performing exploratory data analysis, data cleaning, and
visualization in notebooks.
### Notebook Layout
The notebook should read like a story. While you have flexibility (e.g.,
multiple visualizations for one data cell, or data cells building on each
other), aim for this general flow:
1. Title & Objective (Markdown Cell)
   - What is this notebook for? (e.g., `# Retention Analysis for Q3 Signups`)
2. Section Header (Markdown Cell)
   - What are we looking at now? (e.g., `## Exploring User Retention`)
3. Data Acquisition/Transformation (Python cell, may contain magics)
   - Query BigQuery or transform data.
4. Verification (Optional but Recommended) (Python Cell)
5. Visualization (The Goal) (Python Cell)
   - Plot the insight (e.g., a retention curve per signup cohort).
Repeat steps 2-5 for each new sub-topic or insight. You can have multiple Data
cells before a Visualization, or multiple Visualizations from one Data cell. The
key is to keep them grouped logically and separated by Markdown headers.

6. Final Summary (Markdown Cell)
   - At the end of the notebook, add a markdown cell containing a summary
     paragraph that summarizes the findings for the user. The summary MUST
     follow these guidelines (see the sketch after this list):
     - MUST NOT add Python code to the summary.
     - The summary MUST NOT start with a code block.
     - The summary MUST be strictly grounded in the numerical data verified in
       the notebook.
     - The summary MUST ONLY contain the following three sections:
       - Q&A: If the data analysis task contains questions (implied or
         explicit), you MUST answer them based on the solving process. Skip
         this section if there are no questions to answer.
       - Data Analysis Key Findings: Summarize the key analysis findings in
         bullet points; quoting the numbers from the previous steps is a plus.
         Only report high-value findings and skip the obvious ones.
       - Insights or Next Steps: Provide 1-2 concise insights or next steps
         in bullet points.
- Next Steps: After the notebook has been successfully executed and verified,
  and the summary is complete, notify the user and propose suggested next
  steps.
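A sketch of what the final summary cell could look like; the section names come
from the guidelines above, and the numbers are purely illustrative placeholders,
not real results:

```markdown
## Summary

### Q&A
- **Did retention improve after the release?** Yes; week-4 retention rose
  from 31% to 38% (illustrative values).

### Data Analysis Key Findings
- Three of the fourteen product lines accounted for the large majority of
  revenue (illustrative).
- Churn was concentrated in accounts created before 2021 (illustrative).

### Insights or Next Steps
- Validate whether the retention lift persists when refunded users are
  excluded.
- Drill into the pre-2021 cohort to isolate the churn driver.
```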
### Plotting Rules
- You MUST use different colors for different features to ensure plots are
  readable for humans (see the sketch after this list).
- When creating a plot, you MUST adjust the figure size based on the number of
features. The labels and legends MUST NOT overlap.
- You SHOULD arrange the layout wisely. Using subplots CAN help in placing
different plots effectively.
- You MUST use inline figures to present figures and plots along with code and
text in the notebook.
- For clustering, use PCA to reduce to 2D before scatter plotting.
- Use Line Charts ONLY for continuous data (e.g., time series) where
  interpolation between points is meaningful.
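A minimal sketch applying these rules with Matplotlib; `df_monthly` and its
columns are hypothetical and would come from an earlier data cell:

```python
# Sketch: two subplots, distinct colors per feature, figure sized to fit.
import matplotlib.pyplot as plt

%matplotlib inline  # inline figures (usually the notebook default)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))  # widen for two panels

# Different colors for different features keep the lines distinguishable.
ax1.plot(df_monthly["month"], df_monthly["revenue"], color="tab:blue", label="revenue")
ax1.plot(df_monthly["month"], df_monthly["cost"], color="tab:orange", label="cost")
ax1.set_title("Revenue vs. cost over time")  # continuous data -> line chart
ax1.legend()

# Categorical comparison -> bar chart, not a line chart.
ax2.bar(df_monthly["month"].astype(str), df_monthly["revenue"], color="tab:green")
ax2.set_title("Revenue by month")
ax2.tick_params(axis="x", rotation=45)  # keep x labels from overlapping

plt.tight_layout()  # prevents labels/legends from colliding between panels
plt.show()
```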
### Data Cleaning Rules
- You MUST be careful about missing values and duplicated values (see the
  sketch after this list).
- You MUST NOT drop columns unless absolutely necessary. Dropping columns is
irreversible.
- You SHOULD focus on columns directly related to accomplishing the task; not
every column NEEDS to be cleaned.
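A hedged sketch of a first cleaning pass, assuming a BigFrames DataFrame `df`
whose `user_id` column matters for the task (the names are illustrative):

```python
# Sketch: inspect missing values and duplicates before changing anything.
missing_per_column = df.isna().sum()   # where are the gaps?
print(missing_per_column)

# Remove fully duplicated rows; compare row counts to see the impact.
rows_before = df.shape[0]
df = df.drop_duplicates()
print(f"dropped {rows_before - df.shape[0]} duplicate rows")

# Repair only the task-relevant column instead of dropping it.
df["user_id"] = df["user_id"].fillna("unknown")
```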
## Specialized Notebook Guidance
Refer to the following resources for guidance on specific notebook topics:
### 1. BigQuery in Notebooks
Use BigFrames magics
for BigQuery SQL queries. These cells support
native BigQuery SQL execution and data export to BigFrames dataframes.
> [!IMPORTANT]
>
> - Unless specified by the user, always use SQL for querying BigQuery.
> - DO NOT use the standard BigQuery Python client library
>   (`google.cloud.bigquery`) or the `%%bigquery` magic.
> - Mandatory dataframe export: Always provide a dataframe name, e.g.
>   `%%bqsql df_sample`. This makes it easy to use results in follow-up
>   Python cells.
> - Verify that `bigframes` (at the required minimum version or above) is
>   installed in the notebook runtime environment. If it is missing, ask the
>   user if they would like you to upgrade for them.
Example `%%bqsql` magic usage:
```python
# Initialize BigFrames and load %%bqsql magics
import bigframes
import bigframes.pandas as bpd
%load_ext bigframes
```
> [!CAUTION]
> Always use `%load_ext bigframes` exactly as shown. Do not load
> submodules; for example, `%load_ext bigframes.magics` and
> `%load_ext bigframes.bigquery` are not valid and must not be used.
> [!IMPORTANT]
> The `bigframes` library must be installed. Determine whether `bigframes`
> needs to be installed by following @skill:managing-python-dependencies.
```python
%%bqsql df_sample
SELECT * FROM `project.dataset.table` LIMIT 10
```
#### Anti-patterns (NEVER DO THESE)
> [!CAUTION]
>
> 1. NO Python SDK for Queries: Do not switch to
>    `client.query(sql).to_dataframe()` if SQL fails. Fix the SQL syntax
>    instead.
> 2. NO Mixing Logic: Do not put Python code in the same cell as `%%bqsql`
>    magics.
#### Working with SQL Results in Python
Magic cells with a dataframe name (e.g., `%%bqsql df_sample`) produce a
BigQuery DataFrame. In subsequent cells, you can use `df_sample` directly.
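For example (a sketch assuming the `df_sample` export from the cell above; the
`status` column is illustrative):

```python
# df_sample is a BigFrames DataFrame backed by BigQuery, not local memory.
print(df_sample.shape)  # row/column counts, computed server-side

# Pandas-style operations compile to BigQuery SQL under the hood.
status_counts = df_sample["status"].value_counts()
status_counts
```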
> [!IMPORTANT]
> You MUST use BigFrames for data exploration, manipulation, splitting,
> etc. You MUST use BQML SQL or bigframes.ml for machine learning tasks.
> You MUST NOT use pandas or Scikit-learn.
#### BigQuery DataFrame Tips
- Avoid `to_pandas()`: You MUST NOT use `to_pandas()` to download the
  entire dataset into memory (see the sketch after this list). There are
  some exceptions:
  - An error message explicitly requests you to use `to_pandas()`.
  - You are going to visualize the data, and the visualization library
    does not accept BigFrames DataFrame/Series instances. In this case,
    reduce the amount of data you are going to download before calling
    `to_pandas()`.
- Avoid `read_gbq()` for SQL: Do not write SQL queries and execute them
  with `bpd.read_gbq()`. Use BigFrames DataFrame/Series methods instead.
- Use the BigFrames ML package for Machine Learning Tasks: Do not use
  Scikit-learn or other ML libraries with BigFrames dataframes. Import your
  tools/classes from `bigframes.ml`.
- Stay in the Cloud: Perform data cleaning, transformation, and analysis
via BigFrames methods to leverage BigQuery's scale.
- Accessors over UDFs/Lambdas:
  - Prefer built-in accessors (e.g., `.str`, `.dt`) over remote UDFs.
  - Do not use lambdas with `apply()` or `map()`.
- Schema Verification: Do not assume the schema of intermediate outputs.
  Check `dtypes` after loading, and preview rows with `peek()` or `head()`.
- Visualization: BigFrames DataFrames mostly work directly with
  Matplotlib, Seaborn, and other plotting libraries. If your attempt didn't
  work, try using the `plot` accessor. If that didn't work either, you MUST
  sample or aggregate your data to make it small enough before calling
  `to_pandas()`.
- Model Persistence: To persist a model, use `to_gbq()`. To load a
  persisted model, use `read_gbq_model()`.
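A compact sketch tying these tips together; `df_sample` comes from the earlier
`%%bqsql df_sample` cell, and the `category`/`price` columns are illustrative:

```python
# Verify the schema instead of assuming it.
print(df_sample.dtypes)

# Cheap preview without downloading the table.
df_sample.peek(5)

# Stay in the cloud: the aggregation executes in BigQuery.
avg_price = df_sample.groupby("category")["price"].mean()

# Only the small aggregated result is downloaded for plotting.
avg_price.to_pandas().plot(kind="bar", color="tab:blue")
```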
### 2. Machine Learning in Notebooks
Integration with machine learning workflows and best practices.

- Guide: Use the dedicated machine-learning notebook guide.
- MUST READ WHEN: The task involves machine learning, training a model,
  clustering, classification, regression, or time-series forecasting.
If any "MUST READ WHEN" condition is met, you MUST read the corresponding guide
before proceeding.