Flows Design Review

This is step 3 of the Flows app certification flow:

flows-app-brief  →  build  →  flows-code-review  →  flows-design-review (this skill)  →  flows-external-app-submit

This is the manual design quality assessment described in docs.cognite.com/cdf/flows/guides/quality-guidelines. Target overall average: 3.8 or higher to be launch-ready.

Operating rules

Automate first, ask second. For every question Q1–Q10, run the probes listed below to gather hard evidence from the repo and propose a draft score (1–5) with rationale before asking the user. The user's job is to confirm or override the proposed score, not to grade from scratch. This dramatically reduces the manual burden.
The task walkthrough (Step 2) is the one part that cannot be skipped — automation cannot tell whether a user "gets lost" navigating a screen. Capture it manually and use it to override the auto-derived scores where lived experience disagrees.
Use
```
AskQuestion
```
for every score so answers are structured. For each question present three options: (a) accept the draft score, (b) override with a specific score, (c) override + add a note.
Pre-fill user, tasks, and persona context from
```
App-Brief.md
```
frontmatter when present.

Step 0 — Pre-scan before prompting

Always pre-scan before asking the user anything. Read these sources silently and surface what you found as evidence — never as scores, never auto-saved:

Source	Use it for
`App-Brief.md` frontmatter	Pre-fill primary user ( `userRole` ), tasks ( `oneSentenceStory` ), success criteria
`package.json`	Confirm `@cognite/aura` is installed and surface its version (informs Q1)
Latest `reviews/code-review/feedback-round-<N>/code-review-report.md`	Pull design-adjacent findings (accessibility, error handling, UX copy) and present them as evidence under Q4/Q10
`src/*/.{ts,tsx,css}`	Q1 probe — grep for hard-coded hex/rgb colors and raw `px` / `rem` values outside Aura tokens
`src/*/.{ts,tsx}`	Q5 probe — `onClick` on non-button elements without `role` / `tabIndex`
`src/*/.{ts,tsx}`	Q10 probe — icon buttons missing `aria-label` , `<img>` without `alt` , missing focus styles

Show the user the pre-scan results in your opening message before any scoring. They are starting points, not verdicts. The manual task walkthrough (Step 2) and user-assigned scores remain authoritative.

Step 0b — Choose feedback round

Look at

reviews/design-review/

. If it doesn't exist, this is round 1. Otherwise increment to the next missing

feedback-round-<N>/

directory.

Step 1 — Confirm user and tasks

Per the docs, "the quality assessment is only as useful as the clarity of the user and tasks it's based on."

App-Brief.md

exists, parse

userRole

oneSentenceStory

, and

successCriteria

from its frontmatter and propose them as the primary user and tasks. Ask the user to confirm or extend.

Capture, via

AskQuestion

Primary user — specific role and context (e.g. "Maintenance engineers on offshore platforms").
2–3 critical tasks — the workflows this user needs to complete (e.g. "Check pump vibration alerts", "Schedule maintenance work").
Context — experience level, time constraints, device, success criteria.

Step 2 — Walk each task end-to-end (manual)

Instruct the user to:

Open the app as that user in a clean browser session with representative test data.
Complete each task from beginning to end without shortcuts.
Note pain points: where they get stuck, confused, or make errors.

For each task, prompt the user to paste back: what happened, where they got stuck, and any screenshots / notes. Capture these as

taskWalkthroughs[]

for the report.

Do NOT proceed to scoring until the user confirms they walked every task. If they refuse, write a stub report that records "task walkthrough skipped" and exits — do not score.

Step 3 — Score the 10 questions (probe → propose → confirm)

For every question Q1–Q10, follow the same loop:

Run the listed probes. They are concrete shell / grep / lint / build commands that produce hard evidence from the repo.
Propose a draft score (1–5) based on the probe results and the rubric. Show your work: which probe results led to which score.
Cross-check against the user's task-walkthrough notes from Step 2 (especially for navigation, clickability, error prevention).
Ask the user via
AskQuestion
with three options: (a) accept the proposed score
N
, (b) override with a specific score, (c) override + add a note.
Capture the final score, a one-line rationale, and an improvement note.

Heuristics for translating probe results into a draft score

These thresholds are starting points — adjust based on the specific evidence and the rubric language. The user always has the final say.

Signal	Drift toward
0 anti-pattern matches, lint clean for the relevant rule	5
≤ 3 small matches, mostly in one file	4
5–15 matches across several files, or 1 systemic issue	3
15+ matches, or pervasive anti-pattern	2
Anti-pattern is the default style	1

Per-question automated probes

Each question's probe list is the first thing the agent should run before asking the user anything about that question. Always state which probes were run and what they returned.

The 10 questions and rubric

Q1 — Aura design system consistency. Are you using Aura tokens, layouts, components and patterns correctly?

Probes (automatable):

```
grep -c '@cognite/aura' package.json
```
— confirm Aura is a dependency

grep -rlE "from '@cognite/aura'" --include='*.ts' --include='*.tsx' src | wc -l

— count files importing Aura

grep -rlE '#[0-9a-fA-F]{3,8}' --include='*.css' --include='*.tsx' --include='*.ts' src

— files with hard-coded hex colors

grep -rlE '\b(rgb|rgba|hsl|hsla)\(' --include='*.tsx' --include='*.css' src

— files with raw rgb/hsl values

npx eslint . --ext .ts,.tsx --rule '{"aura/no-overriding-styles":"error"}' --no-eslintrc --quiet 2>&1 | tail -5

or read the existing lint output for

aura/no-overriding-styles

warning counts

Translate to draft score: 0 hard-coded colors + 0

aura/no-overriding-styles

warnings → 5. Few warnings (1–5) → 4. Many warnings (>15) or no Aura imports → 2–3.

5 Excellent: All Aura tokens applied correctly, no hard-coded values. Proper responsive sizing and page layouts. Aura components used without style overrides. Best practices followed.
4 Good: Mostly Aura tokens and components with 1–2 minor exceptions. Layout spacing mostly consistent. Minimal style overrides.
3 Average: Mix of Aura and custom elements. Some proper spacing, some random values. Overriding styles in multiple places.
2 Below average: Frequently custom colors, typography, or spacing instead of Aura tokens. Heavy customization that breaks patterns.
1 Poor: Not using Aura at all. Custom colors, fonts, spacing throughout.

Q2 — Navigation, layout and hierarchy. Can users tell where they are and navigate easily?

Probes (partially automatable — relies on Step 2 walkthrough):

```
grep -rcE '<Route\b' --include='*.tsx' src
```
— count routes (informs navigation surface)
```
grep -rlE 'Breadcrumb' --include='*.tsx' src
```
— files using breadcrumb components (location cues)

grep -rlE 'NavLink|Link to=|useLocation' --include='*.tsx' src

— navigation primitives in use

grep -rlE '<Topbar|<Sidebar|<Header' --include='*.tsx' src

— top-level chrome

Look at the route tree (
```
src/routes/
```
) and ask: does each non-trivial page show its own title and a way back?

Translate to draft score: Default to the walkthrough finding since navigation feel is hard to measure statically. Use probes to flag risks (e.g. routes without breadcrumbs).

5: Current location always clear. Easy navigation forward/back. Consistent menus. Strong visual hierarchy. Content flows logically (F/Z pattern).
4: Usually clear. Navigation mostly consistent. Minor exceptions.
3: Sometimes unclear. Navigation works but not always intuitive. Hierarchy exists but not always clear.
2: Often lost or confused. Navigation changes between pages. Weak hierarchy.
1: No indication of current location. No clear navigation. Inconsistent structure.

Q3 — Clear labels and language. Are buttons, inputs, and actions labeled clearly?

Probes (automatable):

grep -rcE ">(Submit|OK|Click here|Go|Yes|No)<" --include='*.tsx' src

— count vague button labels

```
grep -rcE '<Button[^>]*>[[:space:]]*</Button>' --include='*.tsx' src
```
— empty buttons (icon-only without label needs aria-label, handled in Q10)

grep -rlE '<Label\b' --include='*.tsx' src

and

grep -rlE '<input\b' --include='*.tsx' src

— input elements vs labels; mismatch suggests unlabeled inputs

```
grep -rcE 'placeholder=' --include='*.tsx' src
```
— placeholder-as-label is an anti-pattern; high count without matching
```
<Label>
```
is a smell

Translate to draft score: 0 vague labels + every input has a matching label → 5. Few placeholder-only inputs → 4. Vague labels in several places → 3.

5: Every element has a clear, specific label. Plain, action-oriented language ("Save changes", "Delete item").
4: Most labels clear. Minor ambiguity.
3: Labels present but sometimes vague ("Submit", "OK"). Some unnecessary jargon.
2: Many labels unclear. Heavy technical terms without explanation.
1: Labels missing, confusing, or jargon-laden.

Q4 — System feedback and validation. Do users know what's happening? Are forms easy to use?

Probes (automatable):

grep -rlE 'isLoading|isPending|<Skeleton|<Loader|<Spinner' --include='*.tsx' src

— files with loading affordances

grep -rlE 'isError|onError|<Alert|toast\.' --include='*.tsx' src

— files with error/success affordances

grep -rlE 'useMutation' --include='*.tsx' src

— mutation sites; cross-check that each has

onSuccess

onError

handlers

```
grep -rlE 'ErrorBoundary' --include='*.tsx' src
```
— error boundaries (also cross-checked in code review)
For each route/feature folder, ratio of (loading + error files) ÷ (data-fetching files) should be ≈ 1

Translate to draft score: Loading and error states present on every fetch/mutation → 5. A few mutations without explicit error handling → 4. Mixed coverage → 3.

5: Immediate feedback. Clear loading states. Helpful success/error messages. All fields labeled, required fields marked, real-time validation with specific messages.
4: Most actions provide feedback. Loading states present. Validation mostly helpful.
3: Some feedback but inconsistent. Loading states sometimes missing. Generic error messages.
2: Minimal feedback. Users often don't know if actions worked. Validation only on submit.
1: No feedback. Silent failures. Technical error codes.

Q5 — Clickability and interactions. Is it obvious what's clickable?

Probes (automatable):

grep -rcE '<div[^>]*onClick' --include='*.tsx' src

—

onClick

<div>

(non-semantic, often missing keyboard support)

grep -rcE '<span[^>]*onClick' --include='*.tsx' src

— same for

<span>

grep -rcE 'role="button"' --include='*.tsx' src

— explicit role assignments (good if

<div onClick>

is unavoidable)

```
grep -rcE 'hover:|focus:' --include='*.tsx' src
```
— Tailwind hover/focus utility usage (high = good)

grep -rcE 'cursor-pointer' --include='*.tsx' src

— explicit pointer cursor

Translate to draft score: 0

<div onClick>

without role + many hover/focus utilities → 5. 1–3 violations → 4. Many

onClick

on non-button elements → 2–3.

5: All clickable items look clickable. Hover effects on interactive elements. Cursor changes appropriately.
4: Most interactive elements obvious. Hover effects mostly present.
3: Inconsistent hover states. Occasionally unclear what's interactive.
2: Many interactive elements don't look clickable. Few hover effects.
1: Can't tell what's clickable. No visual feedback.

Q6 — Error prevention and recovery. Can users undo or cancel destructive actions?

Probes (partially automatable):

grep -rilE 'delete|remove|archive|reset' --include='*.tsx' src | head -20

— files with potentially destructive actions

grep -rlE 'AlertDialog|ConfirmDialog|window\.confirm' --include='*.tsx' src

— confirm-dialog usage

grep -rcE 'variant="destructive"|destructive' --include='*.tsx' src

— destructive button styling

For each file with destructive verbs, check there is a corresponding
```
AlertDialog
```
/
```
ConfirmDialog
```
invocation in the same file or its imports

N/A guidance: Read-only viewer apps (the common case for Flows demos) have no destructive actions and should score 5 by default with a "no destructive actions" rationale. Do not penalize an app for not having confirmations it does not need.

5: Confirmation dialogs before destructive actions. Auto-save prevents data loss. Clear undo or cancel options. OR the app has no destructive actions.
4: Most destructive actions have warnings. Some auto-save or undo.
3: Some warnings for major actions. Limited undo/cancel.
2: Few warnings. No undo. Easy to lose work.
1: No warnings. No undo. Frequent accidental data loss.

Q7 — Responsive design and multi-device support. Does it work on different screen sizes?

Probes (automatable):

grep -rcE '\b(sm|md|lg|xl|2xl):' --include='*.tsx' src

— Tailwind responsive utility usage (high = good)

grep -E '<meta name="viewport"' index.html

— viewport meta tag present

grep -rcE 'overflow-x-auto|overflow-x-scroll' --include='*.tsx' src

— horizontal scroll containers (often a smell)

grep -rcE '\bw-\[[0-9]+px\]|\bh-\[[0-9]+px\]' --include='*.tsx' src

— fixed-px sizing (usually breaks small screens)

Read
```
App-Brief.md
```
```
userRole
```
— if it says "desktop or laptop in control room" the app may be intentionally desktop-only; this is acceptable per the rubric ("Hidden or limited on mobile if not intended for mobile")

Translate to draft score: If app is desktop-only by design (per App-Brief) and renders cleanly on laptop down to 13" → 5. Mixed responsive utility usage → 4. Many fixed-px sizes → 3.

5: Seamless across desktop, tablet, mobile. Touch targets 40px+. Text readable. No horizontal scrolling. Hover states accounted for on touch. OR intentionally desktop-only per the brief and clean on supported sizes.
4: Works well on most devices. Minor issues.
3: Functional on multiple devices but not optimized. Some layout issues on smaller screens.
2: Poor mobile/tablet experience. Layouts break.
1: Desktop only. Broken on mobile/tablet.

Q8 — Empty states and first-time experience. When there's no data, is it clear what to do next?

Probes (automatable):

grep -rilE 'empty|no\s+(data|results|items|files|matches)' --include='*.tsx' src

— files with empty-state copy

grep -rlE '<EmptyState|EmptyPlaceholder' --include='*.tsx' src

— explicit empty-state components

For each panel/list module (anything with
```
.list(
```
or
```
.items.map(
```
), check there is at least one branch handling
```
items.length === 0
```
with user-visible copy. List the panels that DO and DO NOT.

grep -rcE 'items\.length === 0|items\.length > 0' --include='*.tsx' src

— explicit empty checks

Translate to draft score: Every data-fetching panel has an empty-state branch with copy → 5. One or two missing → 4. Many panels missing → 2–3.

5: All empty states show helpful messages and clear next steps. First-time users know exactly what to do.
4: Most empty states helpful. Minor gaps.
3: Some empty states explained. First-time users can figure it out.
2: Many blank pages with no guidance.
1: Blank pages everywhere. No guidance.

Q9 — Performance and efficiency. Does the app load quickly?

Probes (automatable):

First, check whether a recent build already exists — avoids a slow rebuild when

dist/

is fresh:

bash

find dist -maxdepth 1 -newer package.json -name '*.js' 2>/dev/null | wc -l
du -sh dist/ 2>/dev/null

If the count is 0 (no recent build), fall back to:

bash

npm run build 2>&1 | tail -20

Then gather the remaining metrics:

grep -rcE 'React\.lazy|lazy\(' --include='*.tsx' src

— code-split routes (good)

grep -rcE 'useMemo|useCallback' --include='*.tsx' src

— memoization usage (informs render efficiency)

grep -rlE 'useVirtual|react-window|react-virtual' --include='*.tsx' src

— list virtualization (good for big lists)

grep -rlE '\.list\([^)]*\)' --include='*.ts' --include='*.tsx' src | xargs -I{} grep -l 'limit:' {} 2>/dev/null | wc -l

vs total list call sites — pagination coverage

Cross-reference the latest
```
code-review-report.md
```
criterion 2.3 (Limits & pages) score

Translate to draft score: Build under 1 MB gzipped + every list has a limit + react-query in use → 5. Bundle 1–2 MB or some lists missing limits → 4. Bundle > 2 MB or systemic unbounded fetches → 2–3.

5: Fast loading with progressive content. Bulk actions, keyboard shortcuts. Common tasks take minimal clicks.
4: Reasonable loading. Most tasks streamlined.
3: Acceptable performance. Tasks moderate effort. Few shortcuts.
2: Slow loading. Tasks require many steps.
1: Very slow or unresponsive.

Q10 — Accessibility (WCAG AA 2.1). Can people use it with assistive tech?

Probes (automatable):

Count
```
<img>
```
tags and
```
<img>
```
tags with
```
alt
```
attributes separately to identify missing alt text:
bash
```
grep -rcE '<img\b' --include='*.tsx' src
grep -rcE '<img[^>]*\balt=' --include='*.tsx' src
```
Any difference means images are missing
```
alt
```
.

grep -rcE '<button[^>]*>[[:space:]]*<(svg|Icon)' --include='*.tsx' src

— icon-only buttons (need

aria-label

)

grep -rcE 'aria-label=' --include='*.tsx' src

— ARIA label usage

grep -rcE 'focus-visible:|focus:' --include='*.tsx' src

— focus styles

```
grep -rcE 'tabIndex=\{-1\}|tabIndex="?-1' --include='*.tsx' src
```
— elements removed from tab order (sometimes intentional, sometimes a bug)

eslint-plugin-jsx-a11y

is installed:

npx eslint . --ext .ts,.tsx --no-eslintrc --rule '{"jsx-a11y/alt-text":"error","jsx-a11y/anchor-is-valid":"error","jsx-a11y/click-events-have-key-events":"error"}' 2>&1 | tail -10

If
```
axe-core
```
is available: suggest the user run an axe scan in the running app and paste results — automation can flag candidates, not enforce contrast

Translate to draft score: 0 missing alts + 0 icon-only buttons without aria-label + focus styles everywhere → 5. A few violations → 4. Systemic gaps → 2–3.

5: All interactions via keyboard. Text contrast meets WCAG AA. Clear focus indicators. Proper ARIA labels. Alt text on images. Touch targets 40px+ / mouse targets 20px+. Form errors announced to screen readers.
4: Most requirements met. Minor exceptions.
3: Basic keyboard support but missing for some features. Mostly acceptable contrast. Focus indicators present but not always clear.
2: Limited keyboard support. Multiple contrast failures. Weak focus indicators.
1: No keyboard navigation. Poor contrast. No focus indicators. Not usable with assistive tech.

Step 4 — Compute average and quality level

Average = sum of all 10 scores ÷ 10.

Map to the quality level table from the docs:

Average	Quality level	Recommendation
4.5 – 5.0	Excellent — ready to launch	Minor improvements over time
3.8 – 4.4	Good — launch with minor fixes	Address lower-scoring areas
3.0 – 3.7	Average — needs improvement	Fix major problems before launching
Below 3.0	Needs significant work	Substantial improvements required

flows-external-app-submit

gates on average ≥ 3.8.

Step 5 — Write the report

Create

reviews/design-review/feedback-round-<N>/design-review-report.md

with this structure:

markdown

# Design Review — <appName> — round <N>

## User and tasks

- **Primary user:** ...
- **Tasks evaluated:**
  1. ...
  2. ...
  3. ...
- **Context:** ...

## Task walkthrough findings

- **Task 1 — ...** ...
- **Task 2 — ...** ...
- **Task 3 — ...** ...

## Scores

| Question | Score | Rationale | Improvement note |
| --- | --- | --- | --- |
| Q1 Aura consistency | n | ... | ... |
| Q2 Navigation & hierarchy | n | ... | ... |
| Q3 Labels & language | n | ... | ... |
| Q4 Feedback & validation | n | ... | ... |
| Q5 Clickability | n | ... | ... |
| Q6 Error prevention | n | ... | ... |
| Q7 Responsive | n | ... | ... |
| Q8 Empty states | n | ... | ... |
| Q9 Performance | n | ... | ... |
| Q10 Accessibility | n | ... | ... |

## Summary

- Average score: <X.X>
- Quality level: <Excellent | Good | Average | Needs significant work>

## Must Fix (any score < 3)

- ...

## Should Fix (any score 3 – 3.7)

- ...

## Nice to Fix (any score 3.8 – 4.4)

- ...

The

Average score:

line must be machine-readable in exactly that format —

flows-external-app-submit

parses it.

Step 6 — Print the gate status

After writing, print to the terminal:

The average score
The quality level
Whether the result meets the
```
flows-external-app-submit
```
gate (≥ 3.8)
If below 3.8, instruct the user to fix Must Fix and Should Fix items and re-run this skill in a new feedback round.

flows-design-review

NPX Install

Tags

SKILL.md Content

Flows Design Review

Operating rules

Step 0 — Pre-scan before prompting

Step 0b — Choose feedback round

Step 1 — Confirm user and tasks

Step 2 — Walk each task end-to-end (manual)

Step 3 — Score the 10 questions (probe → propose → confirm)

Heuristics for translating probe results into a draft score

Per-question automated probes

The 10 questions and rubric

Step 4 — Compute average and quality level

Step 5 — Write the report

Step 6 — Print the gate status