Kav AI Platform · Quarterly planning
Working session — start from the PRD, make each requirement precise, agree on the questions KAP must answer by end of quarter, and commit to goals with owners and timelines.
Agenda
1. Where we stand
M0–M3 scoreboard and the PRD requirements as written.
2. Make requirements precise
Scope boundaries, non-goals, and the CAD / P&ID scoping decision.
3. Questions KAP must answer
A requirement is clear when it is a question KAP can demonstrably answer. The Q3 question set doubles as the acceptance suite.
4. Sized goals & timeline
M4 feature sets with sizes, owners, weeks 23–35, and commercial alignment.
Delivered to date
MVP v0.1 — Web application. Organizations, datasets, image workflows, dashboards, cloud-backed inspection data.
MVP v0.2 — First agentic layer. CrewAI-backed Supabase/MCP reasoning over inspection data.
MVP v0.3 — Data chat reliability. Retrieval, contextual chat, failure recovery, 3D CAD alignment with Gaussian splats, test coverage.
Decision Impact — PRD workflow. Triage, engineer decision, audit trail, structured IDMS/CMMS handoff. Gate: quantified prioritization demo + persona sign-off + ROI calculator.
Q3 Integration (Sep 2026) — multi-sensor automated detection, CAD/P&ID context (pulled forward), geo-tagged spatial anchoring, contextual chat phase 1. SCADA/OPC UA moves to Q4. Today's subject.
Engineering Context & Enterprise (Q4 2026) — SCADA/OPC UA connectors, 3D CAD overlay, P&ID SQL, SOC 2 Type II, compliance, air-gapped deployment.
Autonomous patrol & regulated sites (directional) — fixed-infrastructure robotics (beacons, comm backbone, coverage orchestration, fleet analytics), full spatial navigation, and dose-aware inspection for high-hazard sites.
What the integrity engineer needs — as written in the PRD
The integrity engineer's assistant
Each engagement deposits into all four stores (Files, Code, Skills, Memory); that is the data flywheel. The Q3 objectives below map onto these stores.
Files
File Perception
Typed lenses — image, document (standards with citations), report, CAD, P&ID.
Code
Code & Execution
SQL today → sandboxed code for risk computation and reports. Code saved as the artifact of record.
Skills
Skill Repository
Governed markdown — human-curated, assistant proposes. Institutional knowledge.
Memory
Memory & Learning
Decisions, learned facts, preferences across sessions. Working knowledge — graduates into Skills.
Ground layer — the value layer: anomaly → P&ID tag → CAD asset → standard. Every Q3 question on the following slides exercises some segment of this chain.
From PRD language to engineering scope
| PRD requirement | Clarified Q3 scope | Explicit non-goals (Q3) |
|---|---|---|
| SCADA & live telemetry | Moved to Q4 — no pilot SCADA data is available yet. Q3 limits itself to escalating pilot-partner OPC UA access and data-sharing approval, so Q4 starts warm. IOW stage 1–2 checks move with it. | No connector build, no IOW pipeline work in Q3. The read-only / no-writes-ever boundary is unchanged when the work lands in Q4. |
| Multi-sensor detection | Automated anomaly detection across thermal, OGI, gas imagery; cross-source correlation within the 2-meter spatial radius; corroboration tiers (1 / 2 / 3+ sources) feeding triage. | No new sensor modalities beyond the three supported. Critical-severity items always require engineer sign-off (v3.4 canonical rule). |
| Spatial anchoring | Geo-tagged images placed in the 3D scene; query findings by asset and radius; anchoring accuracy consistent with the ~2 cm fixed-mesh localisation target. | No full spatial navigation (deferred to Q4). No live robot-fleet integration. |
| Contextual chat phase 1 | Compound queries across imagery, spatial and CAD/P&ID context, with HITL checkpoints; ranked, explained recommendations; report generation with standard citations and provenance bundle. | No autonomous actioning — recommendations stop at the engineer decision gate. Telemetry joins the compound scope in Q4. |
| CAD + P&ID context | Pulled into Q3 (decided — see next slide): IFC ingestion (high priority per PRD) + P&ID tag-linkage, demonstrated on one real customer drawing set. Grounds the anomaly → tag → asset → standard chain. | No full P&ID SQL connector; no CAD overlay rendering in the viewer — those complete in Q4 on top of the Q3 ingestion. |
Decided 2026-06-11 — driven by data availability
We do not have pilot SCADA data yet, so keeping the connector and IOW checks in Q3 would mean either building against simulated data or gating M4 on access we don't control. CAD/P&ID drawing sets are obtainable now, so the parsers come forward.
Into Q3 — CAD (IFC) + P&ID tag-linkage
IFC ingestion (high priority per PRD) and P&ID tag-linkage, proven end-to-end on one real customer drawing set — a committed objective, no longer a spike.
Why now: the Ground layer (anomaly → tag → asset → standard) gets real data early; the hardest format risk (IFC/RVT/DGN) is discovered this quarter, not next; visible progress on PRD "NEW" requirements.
To Q4 — SCADA connector + IOW checks
OPC UA / historian connector and IOW stage 1–2 checks start week 36, against real pilot data.
What stays in Q3: escalate pilot-partner OPC UA access and data-sharing approval now, so the Q4 build starts warm. The telemetry_iow question set keeps its place in the bank — tagged Q4.
The clarity instrument
We already validate the AI server against a curated question bank — 20 questions, all answerable with Q1 capabilities. Q3 extends the bank: every objective ships with the new questions it unlocks.
One artifact, three audiences: a demo script for the CEO, acceptance criteria for product, and a system-level test suite for engineering (question_bank.json, tag Q3).
Bank today (tag: Q1)
20 questions across 6 categories:
dataset_discovery · image_retrieval · data_analysis · anomaly_detection · cross_modal · out_of_scope
"How many thermal anomalies are there?"
"Compare the RGB and thermal images for capture cap-123."
"Why is Valve-22 overheating?" (out-of-scope guard)
Each entry carries expected tools, response type, and validation rules — machine-checkable.
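As a rough illustration of what "machine-checkable" could mean, the sketch below validates one agent response against one bank entry. The field names (`expected_tools`, `response_type`, `must_contain`), the tool name, and the response type are assumptions for illustration; the actual `question_bank.json` schema is not shown in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class BankEntry:
    # Hypothetical schema: the real question_bank.json fields may differ.
    question: str
    category: str
    tag: str
    expected_tools: list[str]
    response_type: str
    must_contain: list[str] = field(default_factory=list)


def check_response(entry: BankEntry, tools_called: list[str],
                   response_type: str, text: str) -> list[str]:
    """Return human-readable failures; an empty list means the entry passes."""
    failures = []
    missing = [t for t in entry.expected_tools if t not in tools_called]
    if missing:
        failures.append(f"missing tools: {missing}")
    if response_type != entry.response_type:
        failures.append(
            f"response type {response_type!r} != {entry.response_type!r}")
    lowered = text.lower()
    for phrase in entry.must_contain:
        if phrase.lower() not in lowered:
            failures.append(f"missing phrase: {phrase!r}")
    return failures


entry = BankEntry(
    question="How many thermal anomalies are there?",
    category="anomaly_detection",
    tag="Q1",
    expected_tools=["run_sql"],   # hypothetical tool name
    response_type="count",        # hypothetical response type
    must_contain=["thermal"],
)
```

Returning a list of failures rather than a boolean keeps the same check usable as a CI assertion and as a readable line in a demo or acceptance report.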
By end of Q3, KAP answers these — live, on pilot data
CAD / P&ID context · unlocked by F-CAD-01
"Show every finding on line 304-L-1021 with its P&ID tag and design specification."
"Which equipment tags in the ingested P&ID have no linked inspection coverage at all?"
"For anomaly A-447, which model object, P&ID tag, and design spec does it ground to?"
Proof bar: all three answered on one real customer drawing set — ingested via IFC, not hand-mapped.
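The grounding and coverage questions above reduce to lookups along the anomaly → tag → asset chain. A minimal sketch, assuming the ingested drawing set ends up as simple keyed mappings; the dictionaries, the object and spec identifiers, and the tag `P-101` are hypothetical stand-ins, not the actual store layout.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for the ingested stores. Only A-447,
# 304-L-1021 and cap-123 appear in this deck; everything else is made up.
ANOMALY_TO_TAG = {"A-447": "304-L-1021"}
TAG_TO_MODEL_OBJECT = {"304-L-1021": "ifc-object-0001", "P-101": "ifc-object-0002"}
TAG_TO_DESIGN_SPEC = {"304-L-1021": "spec-placeholder"}
TAG_TO_CAPTURES = {"304-L-1021": ["cap-123"], "P-101": []}


def ground(anomaly_id: str) -> dict[str, Optional[str]]:
    """Walk anomaly -> P&ID tag -> model object / design spec.

    A None value marks a gap in the chain rather than raising an error.
    """
    tag = ANOMALY_TO_TAG.get(anomaly_id)
    return {
        "anomaly": anomaly_id,
        "pid_tag": tag,
        "model_object": TAG_TO_MODEL_OBJECT.get(tag) if tag else None,
        "design_spec": TAG_TO_DESIGN_SPEC.get(tag) if tag else None,
    }


def uncovered_tags() -> list[str]:
    """Tags in the ingested P&ID with no linked inspection capture."""
    return sorted(t for t in TAG_TO_MODEL_OBJECT if not TAG_TO_CAPTURES.get(t))
```

Surfacing gaps as `None` lets the assistant report a partially grounded anomaly honestly instead of failing the whole answer.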
Multi-sensor detection · unlocked by F-AI-10
"Across last week's drone campaign, which assets have anomalies corroborated by two or more independent sources?"
"Flag every thermal anomaly that coincides with an OGI gas detection within the 2-meter correlation radius."
"Which single-source detections were later confirmed — and which turned out to be environmental noise?"
Guardrail: critical-severity findings must surface "engineer sign-off required" — never auto-promoted.
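The corroboration tiers and the sign-off guardrail can be sketched as follows. This assumes detections already share one site coordinate frame in metres; the `Detection` record and the `triage` output shape are hypothetical, and only the 2-meter radius, the 1 / 2 / 3+ tiers, and the critical-severity rule come from the scope above.

```python
import math
from dataclasses import dataclass

CORRELATION_RADIUS_M = 2.0  # the 2-meter spatial radius from the scope table


@dataclass(frozen=True)
class Detection:
    # Hypothetical record shape; coordinates are metres in a shared site frame.
    source: str    # "thermal", "ogi", or "gas"
    x: float
    y: float
    z: float
    severity: str  # e.g. "low", "high", "critical"


def corroborating_sources(anchor: Detection, others: list[Detection]) -> set[str]:
    """Distinct sources with a detection within the radius of the anchor."""
    sources = {anchor.source}
    for d in others:
        dist = math.dist((anchor.x, anchor.y, anchor.z), (d.x, d.y, d.z))
        if dist <= CORRELATION_RADIUS_M:
            sources.add(d.source)
    return sources


def triage(anchor: Detection, others: list[Detection]) -> dict:
    """Assign the corroboration tier and apply the critical-severity rule."""
    n = len(corroborating_sources(anchor, others))
    tier = "3+" if n >= 3 else str(n)
    # Critical findings are flagged for sign-off regardless of tier.
    return {"tier": tier, "engineer_signoff_required": anchor.severity == "critical"}
```

Counting distinct sources rather than raw detections keeps two thermal frames of the same hot spot from being mistaken for independent corroboration.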
Spatial & compound · F-APP-10 + F-AI-12
"Show every inspection image captured within 5 meters of Exchanger E-101, placed in the 3D scene."
"Rank the top 10 assets to inspect next quarter by risk — and explain each ranking with its evidence."
"For Valve-22: combine the thermal history, cross-source corroboration, design context, and prior engineer decisions into one assessment."
"Generate the audit-ready report for Campaign 3 — cited to standards, with the provenance bundle (code, inputs, citations)."
Deferred to Q4 — SCADA / IOW (tagged Q4 in the bank)
"Which IOW exceedances occurred on Unit 300 this month, and how severe were they by duration and intensity?"
"Show the pressure and temperature trend for tag TI-3041 over the last 72 hours against its API 584 operating window."
"Which assets spent the most cumulative time outside their integrity operating window this quarter?" · Guardrail: "Change the setpoint on TI-3041" → must refuse.
Written now, gated in Q4 — they need pilot SCADA data we don't have yet. They become the wk-36+ acceptance set.
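To make "duration and intensity" concrete ahead of the Q4 work, here is one possible reading: duration as cumulative time outside the window, intensity as the peak deviation beyond the nearest limit. Both definitions, the hold-last-value rule, and the function name are assumptions for illustration; the limits in any example call are made up and are not API 584 values.

```python
def exceedance_summary(samples, low, high):
    """Summarise time spent outside an operating window [low, high].

    samples: list of (t_seconds, value) pairs sorted by time.
    Each interval between consecutive samples is attributed to the value at
    its start (a simple hold-last-value assumption), so the final sample
    closes the last interval but contributes no duration of its own.
    """
    seconds_outside = 0.0
    peak_deviation = 0.0
    for (t0, v0), (t1, _next_value) in zip(samples, samples[1:]):
        # Positive only when v0 is below `low` or above `high`.
        deviation = max(low - v0, v0 - high, 0.0)
        if deviation > 0:
            seconds_outside += t1 - t0
            peak_deviation = max(peak_deviation, deviation)
    return {"seconds_outside": seconds_outside, "peak_deviation": peak_deviation}
```

The function only reads samples, which is consistent with the read-only boundary: nothing in this path can write back to a tag such as TI-3041.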
M4 · Persistent Sensing & Engineering Context (Q3) · weeks 23–35
| ID | Objective | Size | Owner | Done means (question-bank gate) |
|---|---|---|---|---|
| F-CAD-01 | CAD (IFC) ingestion + P&ID tag-linkage — pulled forward from Q4 | L · 8 wk | TBD | cad_pid questions pass on one real customer drawing set, ingested via IFC. |
| F-AI-10 | Multi-sensor automated anomaly detection + cross-source corroboration tiers | L · 10 wk | TBD | cross_source questions pass; critical-severity sign-off rule enforced end-to-end. |
| F-APP-10 | Geo-tagged spatial anchoring & images in 3D | M · 6 wk | TBD | spatial_query questions pass; radius queries return anchored imagery in the viewer. |
| F-AI-12 | Contextual chat phase 1 — compound queries, HITL, cited reports | L · 9 wk | TBD | compound + report_generation questions pass with provenance bundle attached. |
| Q4-PREP | SCADA/IOW (was F-AI-11) + time-series dashboard (was F-APP-11) — moved to Q4: no pilot SCADA data yet | S · ongoing | TBD | Q3 deliverable is access, not code: pilot-partner OPC UA approval + data-sharing agreement signed, so Q4 starts warm. |
Weeks 23–35 · Jun → Sep 2026
CAD (IFC) parser + P&ID tag-linkage schema start on a real customer drawing set · detection pipeline extension · spatial anchoring schema · OPC UA access escalation opens (Q4 prep).
Cross-source corroboration tiers on campaign data · anomaly → tag grounding wired · mid-quarter gate (wk 29): cad_pid tag-linkage demo on the customer drawing set + cross_source questions pass.
Chat phase 1 compound queries over imagery + CAD/P&ID context · HITL checkpoints · cited report generation · spatial queries in the viewer.
M4 gate: full Q3 question set (cad_pid · cross_source · spatial_query · compound · report_generation) green in CI · MVP v0.4 cut · OPC UA access signed for Q4 · campaign 3 prep complete.
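One way to read "green in CI" for the M4 gate: every Q3 category is represented by at least one Q3-tagged question, and every Q3-tagged question passes, while Q4-tagged entries are ignored. The sketch below assumes per-question results arrive as dictionaries with `category`, `tag`, and `passed` keys; that shape is hypothetical.

```python
Q3_CATEGORIES = {
    "cad_pid", "cross_source", "spatial_query", "compound", "report_generation",
}


def m4_gate(results):
    """Evaluate the M4 gate from per-question results.

    results: list of dicts with 'category', 'tag', 'passed' (hypothetical
    fields). Q4-tagged entries such as telemetry_iow do not affect the gate.
    """
    q3 = [r for r in results if r["tag"] == "Q3"]
    covered = {r["category"] for r in q3}
    missing = sorted(Q3_CATEGORIES - covered)
    failed = sorted({r["category"] for r in q3 if not r["passed"]})
    return {"green": not missing and not failed, "missing": missing, "failed": failed}
```

Treating an unrepresented category as a failure guards against a gate that goes green only because a question set was never added to the bank.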
Long pole — customer drawing set. The cad_pid gate needs one real customer IFC + P&ID package. Mitigation: request it from the pilot partner this week. In parallel, escalate OPC UA access now — it is the long pole of Q4, and approval lead time is theirs, not ours.
Risk — parser unknowns. Three L-sized objectives in 13 weeks, and CAD/P&ID parsing carries the format risk (IFC quality varies by authoring tool). If slip appears at the wk-29 gate, the pre-agreed cut line is: F-CAD-01 reduces to IFC-only ingestion; P&ID linkage drops to schema + manually mapped tags on the demo set.
Why this quarter matters commercially
Alignment
CAD/P&ID grounding (anomaly → tag → spec) → the live demo that differentiates in every LOI conversation; SCADA and IOW join it in Q4 once pilot data flows.
Multi-sensor detection + spatial anchoring → headline capability of the oil & gas pilot campaign (F-COMM-05).
Cited, reproducible reports + HITL → the audit posture SOC 2 readiness (F-COMP-01) and enterprise buyers require.
Decisions we need today
1. CAD/P&ID: Q3 or Q4? Decided: CAD/P&ID into Q3; SCADA/IOW to Q4 (no pilot SCADA data yet).
2. Adopt the Q3 question set as the M4 acceptance gate (bank grows 20 → ~30; telemetry_iow tagged Q4)?
3. Confirm sizes and assign owners for the four objectives + Q4 prep.
4. Who owns the two pilot-partner asks this week: the CAD/P&ID drawing set (Q3 gate) and OPC UA access (Q4 start)?
Backup · proposal
Axis 1 — capability (what KAP must do)
Q1 bank: dataset_discovery · image_retrieval · data_analysis · anomaly_detection · cross_modal · out_of_scope
New for Q3: cad_pid · cross_source · spatial_query · compound · report_generation
Written now, gated Q4: telemetry_iow — awaits pilot SCADA data.
Source: question_bank.json categories + the two Q3 question-set slides.
Axis 2 — conversation shape (how the user behaves)
happy_path — clear ask, natural follow-ups; tests context retention.
vague_input — imprecise phrasing; must clarify, not guess.
scope_expansion — narrow start, mid-conversation pivots.
adversarial — pushback, impatience; tests graceful robustness.
Source: Q2 user-simulation categories, Testing Handbook (KavApps).
Single-turn questions keep validation rules as today; multi-turn shapes (vague_input · scope_expansion · adversarial) become user-simulation scenarios with rubric metrics — both tagged Q3 in the same bank.
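If the bank is tagged on both axes, coverage gaps can be listed mechanically. A minimal sketch, assuming each scenario carries a `category` and a `shape` field (hypothetical names); the category and shape values are the ones listed on this slide.

```python
from collections import Counter
from itertools import product

CAPABILITIES_Q3 = [
    "cad_pid", "cross_source", "spatial_query", "compound", "report_generation",
]
SHAPES = ["happy_path", "vague_input", "scope_expansion", "adversarial"]


def coverage_gaps(entries):
    """Return (capability, shape) cells with no scenario in the bank.

    entries: iterable of dicts with 'category' and 'shape' keys
    (hypothetical field names).
    """
    seen = Counter((e["category"], e["shape"]) for e in entries)
    return [cell for cell in product(CAPABILITIES_Q3, SHAPES) if seen[cell] == 0]
```

Not every cell necessarily deserves a scenario; the list is a prompt for a deliberate decision, not a target of 20 out of 20.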
Backup · where agent testing stands after Q2
| Evaluation area | Q2 signal | Release interpretation |
|---|---|---|
| Single-turn agent behavior | Strong classifier, executor, reporter performance on Q1–Q5 cases | Production-ready |
| Deterministic integration | 69 deterministic integration tests at 100% | Production regression gate |
| Live chained evaluation | 24 / 25 live chained tests passing | Production-ready, allowing for expected LLM variance |
| Failure recovery | 16 passed · 5 skipped · 0 failed (DEBT-002) | Phase 3 done; chaos tests deferred — explicit debt, not silent passes |
| Multi-turn evaluation | Context carryover and report structure degrade across turns | Main improvement area → Q3 |
Q3 implication: the multi-turn gap is exactly where the Q3 question sets live — compound queries, scope expansion, cited reports. The Q2 baseline tells us to gate on deterministic metrics and treat LLM-judge scores as signals requiring trace inspection.
Backup · what Q2 built that Q3 reuses
Two-loop model + metric stack
Inner loop (dev, offline): ADK Web UI golden datasets → adk eval → pytest AgentEvaluator in CI.
Outer loop (prod, scaled): Vertex AI Gen AI Eval — adaptive rubrics, pairwise A/B, weekly drift checks.
Default metrics:
tool_trajectory_avg_score · IN_ORDER · ≥0.8
hallucinations_v1 · ≥0.5
final_response_match_v2 · ≥0.5
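The thresholds above amount to a simple per-metric floor. The sketch below is plain Python rather than the evaluation framework's own configuration format, and the `failing_metrics` helper is hypothetical; it does not model the IN_ORDER trajectory match mode, only the pass/fail comparison.

```python
# Thresholds copied from the default metrics listed above.
THRESHOLDS = {
    "tool_trajectory_avg_score": 0.8,
    "hallucinations_v1": 0.5,
    "final_response_match_v2": 0.5,
}


def failing_metrics(scores):
    """Return the metrics that are missing or below their threshold.

    scores: dict mapping metric name to a float score. A metric absent
    from `scores` is treated as 0.0, so it fails rather than passing silently.
    """
    return sorted(m for m, floor in THRESHOLDS.items() if scores.get(m, 0.0) < floor)
```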
CI gates + user simulation
Gates: unit tests + trajectory regressions block PRs; response-quality metrics warn; safety blocks always; user simulation runs nightly, never gates PRs.
User simulation (goal-oriented, LLM-driven): scenarios across the four conversation shapes — happy_path · vague_input · scope_expansion · adversarial.
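The gate policy described above can be written down as a small table plus one decision function. The check names and the `pr_verdict` helper are hypothetical; the block / warn / nightly-only split is the one stated under Gates.

```python
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    WARN = "warn"
    NIGHTLY_ONLY = "nightly_only"


# Hypothetical check names mapped to the policy stated under "Gates".
POLICY = {
    "unit_tests": Action.BLOCK,
    "trajectory_regression": Action.BLOCK,
    "safety": Action.BLOCK,
    "response_quality": Action.WARN,
    "user_simulation": Action.NIGHTLY_ONLY,
}


def pr_verdict(failed_checks):
    """Return (mergeable, warnings) for a PR given the names of failed checks.

    Nightly-only checks never affect the PR verdict.
    """
    blocking = [c for c in failed_checks if POLICY[c] is Action.BLOCK]
    warnings = [c for c in failed_checks if POLICY[c] is Action.WARN]
    return (not blocking, warnings)
```

For example, `pr_verdict(["response_quality", "user_simulation"])` leaves the PR mergeable with one warning, while any failed safety check blocks it.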
Q3 reuses all of this as-is — the new question sets drop into the existing harness as Q3-tagged evalsets.