Monorepo consolidation: workspace, shared types, transport plans, docker/swam assets
This commit is contained in:
44
observability/INSTRUMENTATION.md
Normal file
44
observability/INSTRUMENTATION.md
Normal file
@@ -0,0 +1,44 @@
|
||||
# Instrumentation Requirements (Control Plane Dashboards)
|
||||
|
||||
## Global Conventions
|
||||
- Build/version labeling: every service exports a gauge metric `*_build_info{service,version,git_sha} 1`
|
||||
- Correlation identifiers:
|
||||
- Logs include `correlation_id` and `request_id` fields for all request spans
|
||||
- Traces propagate `traceparent` end-to-end and expose trace IDs in logs
|
||||
- Cardinality safety:
|
||||
- `tenant_id` must not be a label on high-frequency metrics unless bounded (sampling, rollups, or explicit allowlist)
|
||||
|
||||
## Dashboard: Noisy Neighbor & Tenant Health
|
||||
Required metrics (examples):
|
||||
- `http_request_duration_ms_bucket{service,route,method,status}` histogram
|
||||
- `job_duration_ms_bucket{job_kind,status}` histogram for control-plane jobs
|
||||
- Optional bounded tenant signals:
|
||||
- `tenant_active_jobs{tenant_id}` only if tenant count is bounded and enforced
|
||||
|
||||
## Dashboard: API Regression & Deployment
|
||||
Required metrics (examples):
|
||||
- `*_build_info{service,version,git_sha} 1`
|
||||
- `http_request_duration_ms_bucket{service,route,method,status}`
|
||||
- `http_requests_total{service,route,method,status}`
|
||||
Deploy markers:
|
||||
- Grafana annotations for deploy events (vertical markers) or a low-cardinality metric like:
|
||||
- `deploy_event{service,version,git_sha} 1` (sparse, emitted once per deploy)
|
||||
|
||||
## Dashboard: Storage & Event Bus Bottlenecks
|
||||
Required metrics (examples):
|
||||
- Storage:
|
||||
- `process_resident_memory_bytes{service}`
|
||||
- `disk_io_time_seconds_total{device}` (node-exporter)
|
||||
- `mdbx_*` or equivalent libmdbx metrics if exposed by storage services
|
||||
- Event bus / JetStream:
|
||||
- `nats_*` / `jetstream_*` metrics for consumer lag, ack latency, stream bytes, and redeliveries
|
||||
|
||||
## Dashboard: Infrastructure Exhaustion
|
||||
Required metrics (examples):
|
||||
- Node exporter:
|
||||
- `node_cpu_seconds_total`
|
||||
- `node_memory_MemAvailable_bytes`
|
||||
- `node_filesystem_avail_bytes`
|
||||
- Container/service level:
|
||||
- `process_open_fds`
|
||||
- `tokio_*` runtime metrics (if enabled) for saturation indicators
|
||||
Reference in New Issue
Block a user