Observability
Philosophy
Every ProcedureExecution (PKO) should be auditable and every IssueOccurrence should be detectable without manual checking. This aligns with the Arcadia method’s emphasis on providing meaningful information for decision-makers.
ProcedureExecution-level auditability
The pipeline records every ProcedureExecution in procedure_execution_record:
| Field | Purpose | Ontological note |
|---|---|---|
| run_id | Unique identifier for correlation | IAO Identifier |
| process_initiated_at / process_completed_at | Duration tracking | BFO Temporal Region boundaries |
| step | Current PKO Step (for resume/debug) | PKO ExecutionStatus |
| orders_count / line_items_count | Volume Measurement Data | IAO Measurement Datum (ratio) |
| error_text | IssueOccurrence details | PKO error handling |
| file_path | R2 bearer_entity_key for raw JSONL IBE | IAO concretized_by reference |
Raw JSONL is archived in R2 (IBE store) for traceability and replay.
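As a minimal sketch, one row of procedure_execution_record could be typed as below, with the run duration derived from its two timestamps. The field names come from the table above; the TypeScript types, the nullable columns, the helper name `durationMs`, and the sample values are assumptions for illustration, not the pipeline's actual schema.

```typescript
// Hypothetical shape for one procedure_execution_record row.
// Field names follow the table above; the types are assumed.
interface ProcedureExecutionRecord {
  run_id: string;
  process_initiated_at: string;        // ISO 8601 timestamp (assumed format)
  process_completed_at: string | null; // null while the run is in progress (assumed)
  step: string;
  orders_count: number;
  line_items_count: number;
  error_text: string | null;
  file_path: string;                   // R2 key of the archived raw JSONL
}

// Returns the run duration in milliseconds, or null if the run has not
// completed or either timestamp cannot be parsed.
function durationMs(record: ProcedureExecutionRecord): number | null {
  if (record.process_completed_at === null) return null;
  const started = Date.parse(record.process_initiated_at);
  const finished = Date.parse(record.process_completed_at);
  if (Number.isNaN(started) || Number.isNaN(finished)) return null;
  return finished - started;
}

// Illustrative values only.
const exampleRecord: ProcedureExecutionRecord = {
  run_id: 'run-0001',
  process_initiated_at: '2024-01-01T02:00:00Z',
  process_completed_at: '2024-01-01T02:07:30Z',
  step: 'completed',
  orders_count: 120,
  line_items_count: 340,
  error_text: null,
  file_path: 'raw/run-0001.jsonl',
};

console.log(durationMs(exampleRecord)); // 450000 (7.5 minutes)
```

Returning null for an unfinished run keeps "still running" distinct from "finished instantly", which matters when the same value feeds a duration alert.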
Workers logging
Cloudflare Workers provide structured logging via console.log() + Workers Logs:

```typescript
console.log(JSON.stringify({
  level: 'info',
  event: 'procedure_execution_completed',
  run_id: runId,
  orders: count,
  duration_ms: Date.now() - startTime,
}));
```

Logpush forwards logs to external destinations (e.g., Datadog, S3, R2) for retention beyond Workers’ built-in log viewer.
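To keep log lines consistent across the codebase, the JSON.stringify call can be wrapped in a small helper. The following is a sketch only: `formatLog` and `writeLog` are hypothetical names, and the set of levels is an assumption.

```typescript
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

// Builds one JSON log line. Caller-supplied fields are spread first so that
// they cannot overwrite the reserved `level` and `event` keys.
function formatLog(
  level: LogLevel,
  event: string,
  fields: Record<string, unknown> = {},
): string {
  return JSON.stringify({ ...fields, level, event });
}

// Writes the line through the console method that matches the level.
function writeLog(
  level: LogLevel,
  event: string,
  fields: Record<string, unknown> = {},
): void {
  const line = formatLog(level, event, fields);
  if (level === 'error') console.error(line);
  else if (level === 'warn') console.warn(line);
  else console.log(line);
}

writeLog('info', 'procedure_execution_completed', {
  run_id: 'run-0001', // illustrative value
  orders: 120,
  duration_ms: 450000,
});
```

Separating formatting from writing keeps the formatter a pure function, so it can be unit tested without capturing console output.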
Health endpoint
GET /api/health reports EngineeredSystem availability:
```json
{
  "status": "ok",
  "database": "connected",
  "last_sync": {
    "run_id": "uuid",
    "process_completed_at": "ISO8601",
    "status": "completed"
  },
  "queue_depth": 0
}
```

Key metrics
| Metric | Source | Alert threshold |
|---|---|---|
| Daily sync duration | procedure_execution_record | > 15 minutes |
| Sync not completed | procedure_execution_record staleness | No completed execution in 26 hours |
| Queue retry count | Queue metrics | > 10 retries/hour |
| DLQ growth | Queue DLQ | Any message in DLQ (exhausted FallbackSteps) |
| API error rate | Workers analytics | > 5% 5xx in 5-minute window |
| Mart freshness | performance_measurement_dataset.last_refreshed | > 2 hours stale |
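The thresholds in the table can be checked mechanically. The sketch below assumes a hypothetical `MetricsSnapshot` input collected elsewhere (its field names are not from the pipeline) and returns the names of the metrics whose threshold is breached.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// Hypothetical snapshot of the metrics in the table above.
// Timestamps are epoch milliseconds.
interface MetricsSnapshot {
  lastSyncDurationMs: number;
  lastSyncCompletedAt: number;
  martLastRefreshedAt: number;
  queueRetriesLastHour: number;
  dlqDepth: number;
  requestsLast5Min: number;
  serverErrorsLast5Min: number;
}

// Returns the names of all breached thresholds, in table order.
function breachedThresholds(m: MetricsSnapshot, now: number = Date.now()): string[] {
  const breaches: string[] = [];
  if (m.lastSyncDurationMs > 15 * MINUTE_MS) breaches.push('daily_sync_duration');
  if (now - m.lastSyncCompletedAt > 26 * HOUR_MS) breaches.push('sync_not_completed');
  if (m.queueRetriesLastHour > 10) breaches.push('queue_retry_count');
  if (m.dlqDepth > 0) breaches.push('dlq_growth');
  // Guard against division by zero when there was no traffic in the window.
  const errorRate =
    m.requestsLast5Min === 0 ? 0 : m.serverErrorsLast5Min / m.requestsLast5Min;
  if (errorRate > 0.05) breaches.push('api_error_rate');
  if (now - m.martLastRefreshedAt > 2 * HOUR_MS) breaches.push('mart_freshness');
  return breaches;
}
```

Passing `now` as a parameter rather than reading the clock inside the function makes the staleness checks deterministic in tests.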
Alerting
Alerts are derived from metrics and surfaced via Cloudflare notifications or an external webhook:
- Critical: ProcedureExecution failure, DLQ messages (exhausted FallbackSteps), database unreachable
- Warning: Execution duration approaching limit, elevated error rate, stale Measurement Dataset
- Info: Successful ProcedureExecution completion, large batch processed
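One way to encode the three severity levels is a lookup table from event name to severity, used when building the alert payload. This is a sketch under stated assumptions: the event names, the `AlertPayload` shape, and the choice to treat unknown events as warnings are all hypothetical.

```typescript
type Severity = 'critical' | 'warning' | 'info';

// Hypothetical event names; the severities follow the list above.
const SEVERITY_BY_EVENT: Record<string, Severity> = {
  procedure_execution_failed: 'critical',
  dlq_message_present: 'critical',
  database_unreachable: 'critical',
  execution_duration_near_limit: 'warning',
  elevated_error_rate: 'warning',
  stale_measurement_dataset: 'warning',
  procedure_execution_completed: 'info',
  large_batch_processed: 'info',
};

// Hypothetical payload shape for a notification or webhook body.
interface AlertPayload {
  severity: Severity;
  event: string;
  message: string;
  emitted_at: string; // ISO 8601
}

// Builds an alert payload. Unknown events default to 'warning' (an assumption)
// so that an unmapped event is still seen without paging anyone.
function buildAlert(event: string, message: string, now: Date = new Date()): AlertPayload {
  const severity: Severity = SEVERITY_BY_EVENT[event] ?? 'warning';
  return { severity, event, message, emitted_at: now.toISOString() };
}

console.log(buildAlert('database_unreachable', 'health check could not reach the database').severity); // critical
```

The resulting object can be serialized with JSON.stringify and posted to whichever webhook destination is configured.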
Request tracing
Every request gets a unique ID for correlation:

```typescript
app.use('*', async (c, next) => {
  const requestId = c.req.header('cf-ray') || crypto.randomUUID();
  c.set('requestId', requestId);
  c.header('X-Request-Id', requestId);
  await next();
});
```

All log entries within a request include the request ID for end-to-end tracing.
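To make sure every log entry carries the request ID, a logger can be bound to the ID once per request. The sketch below is framework-independent; `createRequestLogger`, the `request_id` field name, and the injectable sink are assumptions for illustration.

```typescript
type LogSink = (line: string) => void;

// Returns a logging function that stamps every entry with the given request ID.
// The sink defaults to console.log and can be replaced in tests.
function createRequestLogger(requestId: string, sink: LogSink = console.log) {
  return (
    level: 'info' | 'warn' | 'error',
    event: string,
    fields: Record<string, unknown> = {},
  ): void => {
    // request_id is written last so caller fields cannot overwrite it.
    sink(JSON.stringify({ ...fields, level, event, request_id: requestId }));
  };
}

// Usage with an in-memory sink and an illustrative request ID.
const capturedLines: string[] = [];
const requestLog = createRequestLogger('req-123', (line) => capturedLines.push(line));
requestLog('info', 'orders_listed', { count: 3 });
```

In the middleware above, such a logger could be created right after `requestId` is computed and stored on the context with `c.set`, so that handlers retrieve it with `c.get` instead of building log lines by hand.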