Database & Cloud Storage Handbook

Supabase schema, multi-cloud storage, migrations, and data types in KAP

Overview

KAP uses Supabase (managed Postgres, Auth, Storage, Realtime) as the primary database and thumbnail store, paired with Azure Blob Storage for high-resolution asset storage.

This handbook summarizes how the database and cloud storage fit the product, where migrations live, how TypeScript types stay aligned, and the multi-cloud storage architecture of KAP.

Product data types (conceptual)

Day-to-day features are organized around three user-visible data types; table names in Postgres may differ until a future rename.

Concept	Primary table / API today
Datasets	`datasets` (campaigns are dataset rows under an organization)
Notes	`location_notes` (REST and routes under `/api/.../notes`)
Reports	`pdf_reports` (PDF URLs plus KRML fields: `report_id`, `rendered`, `metadata`, optional `template_name`)

Work-order-style workflows use notes with structured metadata (for example kind: "work_order") rather than a separate work_orders table unless reporting needs justify one later.

KRML — shared report format (not engine-specific)

KRML (Kav AI Report Markup Language) is a platform-level convention: structured report sections, rendered Markdown (and optional PDF binaries via url), and JSON metadata stored on pdf_reports. Any part of the stack may produce or consume KRML-shaped rows—web exports, batch jobs, AG-UI chat report pipelines, or agent runtimes—as long as they honor the same table and metadata shapes (e.g. kind: "krml_report", layout templates with kind: "krml_layout_template").

The Hermes agent runtime currently ships the reference krml_* tool implementations (krml_render_report, krml_save_report, etc.) under ai/src/kavai/systems/hermes/. That is one integration path, not the definition of KRML. New engines or services should reuse pdf_reports and the same field semantics rather than inventing parallel report tables.

Schema changes (migrations)

Location: supabase/migrations/ at the repository root.
Apply: Use your team's standard path (Supabase CLI db push, Dashboard SQL, or CI) so supabase_migrations.schema_migrations stays in sync with the repo.
Idempotency: Prefer IF NOT EXISTS / DROP IF EXISTS patterns where replays or branches are common.

New tables and columns should follow existing RLS patterns: gate access through organization membership (organization_members + datasets.organization_id) unless the row is intentionally public.

TypeScript types

Generated file: web/lib/database.generated.ts (and companions if your project splits types).
Refresh: From web/, run npm run db:types after migrations are applied (requires Supabase CLI and project linkage as documented in the web package).

Do not hand-edit generated tables for long; regenerate after DDL changes so inserts and selects stay type-safe.

`pdf_reports` column semantics (KRML)

Persisted reports and templates live in pdf_reports (legacy hermes_* tables were removed in favor of this single store):

Report rows: report_id, name (title), rendered (Markdown), status, facility, metadata (typically includes kind: "krml_report").
Layout templates: rows with template_name set and metadata.kind: "krml_layout_template" (section_order, section_config).

Hermes registers LLM-callable tools named krml_* (krml_save_report, …) implemented in ai/src/kavai/systems/hermes/tools/krml_renderer.py and skills/krml_composition.py. Web or other services should use the same columns and metadata conventions when writing report rows so all consumers stay interoperable.

Cloud Storage Architecture

KAP employs a hybrid, multi-cloud storage architecture designed to optimize for heavy asset sizes (such as high-resolution RGB and thermal imagery) while ensuring rapid loading of thumbnails and documents.

Azure Blob Storage (Primary Asset Store)

To keep database-associated storage lightweight and cost-effective, raw high-fidelity assets are housed on Azure Blob Storage.

RGB and Thermal Images: Stored in dedicated Azure Blob containers (e.g., raw-rgb and raw-thermal). This allows high-throughput pipeline access for the Orion and Argus ML runtimes.
Access Pattern: The backend generates short-lived, secure Shared Access Signature (SAS) tokens to serve these files directly to authorized front-end clients, preventing public exposure of sensitive industrial datasets.

Azure URL and Token Generation Mechanism

When a client requests a full-size image, the KAP server performs dynamic Shared Access Signature (SAS) token generation rather than storing public URLs in the database.

Resolution of Credentials:
- The dataset record is fetched containing the encrypted/stored access key under dataset.credentials (accountName and accountKey).
- The target storage configuration is derived from dataset.storage_path, which is formatted as accountName:containerName.
Blob Name Extraction:
- The full path to the image is retrieved from image.storage_path (e.g., container-name/DJI_20240807/DJI_20240807194032_0274_V.JPG).
- The application strips the container prefix from the path to get the exact blob name relative to the container root:
```
const blobName = image.storage_path.split('/').slice(1).join('/');
```
SAS Query Parameter Construction:
- The server instantiates StorageSharedKeyCredential(accountName, accountKey).
- Read permissions are explicitly enabled:
```
const permissions = new BlobSASPermissions();
permissions.read = true;
```
- An expiration timestamp is set (typically 60 minutes from the generation time).
- The @azure/storage-blob library compiles these parameters into a secure query string:
```
const sasToken = generateBlobSASQueryParameters({
  containerName,
  blobName,
  permissions,
  expiresOn: expiryTime,
}, sharedKeyCredential).toString();
```

URL Compilation:

The final signed URL is assembled and returned to the client as:

https://${accountName}.blob.core.windows.net/${containerName}/${blobName}?${sasToken}

This approach guarantees that raw, high-resolution imagery is never exposed to the public internet while minimizing database overhead by only storing the metadata path.

REST APIs for Azure Blob Storage

The Next.js backend exposes dedicated REST endpoints to handle container validation, file discovery, and SAS-signing:

/api/datasets/verify-storage (POST): Validates storage connection credentials (connectionString or accountName/accountKey) and container reachability.
/api/datasets/list-bucket (GET): Discovers and lists available blobs (images and videos) within a given container, supporting optional directory prefix filtering.
/api/datasets/onboarding (POST): Initiates crawler configuration to scan Azure containers and register raw imagery into the Postgres database.
/api/datasets/[slug]/image-url (POST): Generates a signed SAS URL to fetch a full-size image. This route performs active, cached HEAD checks to detect cloud provider outages or subscription blockages (e.g., 403 Forbidden).
/api/datasets/[slug]/sign-video (POST): Dynamically SAS-signs video streams (MP4s) for frontend browser playback.
/api/datasets/[slug]/index-images (POST): Triggers background worker execution to crawl the Azure container and index missing assets or update geodata.

Programmatic Asset Access in Backend Functions

Backend systems (Node.js web server and Python ML runtimes) access raw image data differently depending on the processing task.

Next.js (Node.js Backend)

Node.js servers utilize the standard @azure/storage-blob SDK wrapped in the unified CloudStorageService client (web/lib/cloud-storage.ts).

Full Image Download: Downloads the complete binary blob into a local Buffer for processing or resizing:

const storageService = new CloudStorageService("azure", dataset.credentials);
const imageBuffer = await storageService.downloadBlob(containerName, blobName);

Fast Header/EXIF Parsing: To extract GPS and orientation tags without wasting network bandwidth on large images, the client retrieves only the first 1MB containing the header payload:
```
const headerBuffer = await storageService.downloadBlobHeader(containerName, blobName);
```

Python AI Runtimes (Argus & Orion)

Machine learning inference pipelines running in Python utilize two access patterns:

Signed HTTP Fetch: The gateway fetches signed SAS URLs from the Next.js API layer and forwards them to the Python runtimes. Runtimes fetch the bytes over standard HTTP/HTTPS:
```
import requests
response = requests.get(signed_sas_url)
image_bytes = response.content
```

Direct SDK Integration: Since azure-storage-blob is installed in the pixi python dependencies, offline batch runtimes can connect directly to Azure:

from azure.storage.blob import BlobServiceClient

# Using Connection String or Shared Credentials
client = BlobServiceClient.from_connection_string(connection_string)
blob = client.get_blob_client(container=container, blob=blob_path)

# Read full bytes
image_bytes = blob.download_blob().readall()

Image UUID Resolution to Raw Azure Asset (Python)

If a Python backend function is initialized with only an image UUID, it must dynamically resolve the container name, cloud credentials, and exact blob path before fetching the file.

Required Parameters:
- storage_path of the target image (extracted from the images table).
- credentials of the parent dataset (JSON containing connectionString or accountName/accountKey).
- storage_path of the parent dataset (extracted from the datasets table to determine the target container name).

Resolution and Download Steps:

import os
from supabase import create_client
from azure.storage.blob import BlobServiceClient

# A. Initialize Supabase Client (bypassing RLS with service role key for system operations)
supabase = create_client(
    supabase_url=os.environ["SUPABASE_URL"],
    supabase_key=os.environ["SUPABASE_SERVICE_ROLE_KEY"]
)

image_id = "your-image-uuid"

# B. Query the Image Metadata
img_res = supabase.table("images").select("storage_path, dataset_id").eq("id", image_id).single().execute()
image_data = img_res.data
raw_image_path = image_data["storage_path"]  # e.g., "raw-rgb/DJI_0274_V.JPG"
dataset_id = image_data["dataset_id"]

# C. Query the Parent Dataset Storage and Credentials
ds_res = supabase.table("datasets").select("credentials, storage_path").eq("id", dataset_id).single().execute()
dataset_data = ds_res.data
credentials = dataset_data["credentials"]        # e.g., {"accountName": "...", "accountKey": "..."}
storage_config = dataset_data["storage_path"]    # e.g., "accountName:containerName"

# D. Parse Storage Target Details
# Extract the actual container name from the storage_path configuration
_, container_name = storage_config.split(":")

# Extract the relative blob path by stripping the container prefix from the image storage path
blob_name = "/".join(raw_image_path.split("/")[1:])

# E. Initialize Azure Blob Client and Fetch Content
if "connectionString" in credentials:
    blob_service = BlobServiceClient.from_connection_string(credentials["connectionString"])
else:
    from azure.storage.blob import StorageSharedKeyCredential
    blob_service = BlobServiceClient(
        account_url=f"https://{credentials['accountName']}.blob.core.windows.net",
        credential=StorageSharedKeyCredential(credentials["accountName"], credentials["accountKey"])
    )

blob_client = blob_service.get_blob_client(container=container_name, blob=blob_name)
image_bytes = blob_client.download_blob().readall()

Supabase Storage (Lightweight Assets & Reports)

Supabase Storage is utilized for highly interactive and frequently accessed metadata or generated assets.

Thumbnails: Scaled-down versions of RGB and thermal images are stored directly in Supabase Storage buckets (e.g., thumbnails) to drive the high-performance UI galleries without fetching large MB-sized raw assets from Azure.
Generated Reports: Markdown and PDF binaries generated by KRML composition pipelines are stored in the reports bucket.
Access & Security: Access is restricted and gated via standard Row Level Security (RLS) policies based on user organization membership, mirroring database-level security.

Multi-Cloud Expansion (AWS S3 & Google Cloud Storage)

KAP is architected to support future multi-cloud enterprise deployments by abstracting the storage layer:

Storage Provider Abstraction: A unified, provider-agnostic storage client interface allows seamless switching or concurrent use of Azure Blob Storage, AWS S3, and Google Cloud Storage (GCS).
Interoperability: The images and pdf_reports tables store relative URIs (e.g., provider://bucket/path) rather than provider-specific hardcoded URLs, enabling the gateway runtime to resolve the correct signed URLs dynamically.

Workspace and schema direction: docs/proposals/20260418_workspace_architecture.md (Schema & Backend section).
Environment variables for DB access: web/.env / .env.example (SUPABASE_*, POSTGRES_URL_NON_POOLING, etc.).

Quick reference (common tables)

The live project may include additional tables (Auth, Storage, Realtime, extensions). Application tables frequently touched by KAP include:

organizations, organization_members, profiles
datasets, images, data_groups
location_notes, annotations
pdf_reports, chat_sessions, messages
gas_readings, equipment, investigations (and related domain tables)

Use Supabase Studio or information_schema when you need an authoritative column list for your environment.

Overview​

Product data types (conceptual)​

KRML — shared report format (not engine-specific)​

Schema changes (migrations)​

TypeScript types​

pdf_reports column semantics (KRML)​

Cloud Storage Architecture​

Azure Blob Storage (Primary Asset Store)​

Azure URL and Token Generation Mechanism​

REST APIs for Azure Blob Storage​

Programmatic Asset Access in Backend Functions​

Next.js (Node.js Backend)​

Python AI Runtimes (Argus & Orion)​

Image UUID Resolution to Raw Azure Asset (Python)​

Supabase Storage (Lightweight Assets & Reports)​

Multi-Cloud Expansion (AWS S3 & Google Cloud Storage)​

Related documentation​

Quick reference (common tables)​