chore: full stack stability and migration fixes, plus react UI progress
@@ -1,310 +1,369 @@
# Milestone 7: CI/CD & Operability

**Goal:** Every commit is validated. Deployments are reproducible and observable.

**Depends on:** M0 (Security), M1 (Foundation)

---

## 7.1 — Rust CI Pipeline

### 7.1.1 Add Rust jobs to CI

**File:** `.github/workflows/ci.yml`

Add a new job before the existing frontend jobs:

```yaml
  rust:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy

      - name: Cache cargo registry and build
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

      - name: Check formatting
        run: cargo fmt --all --check

      - name: Run clippy
        run: cargo clippy --workspace -- -D warnings

      - name: Build workspace
        run: cargo build --workspace

      - name: Run tests
        run: cargo test --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
          JWT_SECRET: test-secret-for-ci-only-not-production
          DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres

      # Assumes sqlx-cli is not already present on the runner image.
      - name: Install sqlx-cli
        run: cargo install sqlx-cli --no-default-features --features native-tls,postgres

      - name: Verify sqlx offline data
        run: cargo sqlx prepare --check --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```

### /Users/vlad/Developer/madapes/madbase/_milestones/M7_cicd_operability.md
|
||||
```markdown
|
||||
1: # Milestone 7: CI/CD & Operability
|
||||
2:
|
||||
3: **Goal:** Every commit is validated. Deployments are reproducible and observable.
|
||||
4:
|
||||
5: **Infrastructure:**
|
||||
6: - Container runtime: Podman
|
||||
7: - Container orchestration: Podman Compose
|
||||
8: - CI/CD platform: Gitea Actions (git.madapes.com)
|
||||
9: - Container registry: git.madapes.com
|
||||
10:
|
||||
11: **Depends on:** M0 (Security), M1 (Foundation)
|
||||
12:
|
||||
13: ---
|
||||
14:
|
||||
15: ## 7.1 — Rust CI Pipeline
|
||||
16:
|
||||
17: ### 7.1.1 Add Rust jobs to CI
|
||||
18:
|
||||
19: **File:** `.gitea/workflows/ci.yml`
|
||||
20:
|
||||
21: Add a new job before the existing frontend jobs:
|
||||
22:
|
||||
23: ```yaml
|
||||
24: rust:
|
||||
25: runs-on: ubuntu-latest
|
||||
26: services:
|
||||
27: postgres:
|
||||
28: image: postgres:15
|
||||
29: env:
|
||||
30: POSTGRES_PASSWORD: postgres
|
||||
31: ports:
|
||||
32: - 5432:5432
|
||||
33: options: >-
|
||||
34: --health-cmd pg_isready
|
||||
35: --health-interval 10s
|
||||
36: --health-timeout 5s
|
||||
37: --health-retries 5
|
||||
38: steps:
|
||||
39: - uses: actions/checkout@v4
|
||||
40:
|
||||
41: - name: Install Rust toolchain
|
||||
42: uses: dtolnay/rust-toolchain@stable
|
||||
43: with:
|
||||
44: components: rustfmt, clippy
|
||||
45:
|
||||
46: - name: Cache cargo registry and build
|
||||
47: uses: actions/cache@v4
|
||||
48: with:
|
||||
49: path: |
|
||||
50: ~/.cargo/registry
|
||||
51: ~/.cargo/git
|
||||
52: target
|
||||
53: key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
|
||||
54:
|
||||
55: - name: Check formatting
|
||||
56: run: cargo fmt --all --check
|
||||
57:
|
||||
58: - name: Run clippy
|
||||
59: run: cargo clippy --workspace -- -D warnings
|
||||
60:
|
||||
61: - name: Build workspace
|
||||
62: run: cargo build --workspace
|
||||
63:
|
||||
64: - name: Run tests
|
||||
65: run: cargo test --workspace
|
||||
66: env:
|
||||
67: DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
|
||||
68: JWT_SECRET: test-secret-for-ci-only-not-production
|
||||
69: DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres
|
||||
70:
|
||||
71: - name: Verify sqlx offline data
|
||||
72: run: cargo install sqlx-cli --no-default-features --features native-tls,postgres && cargo sqlx prepare --check --workspace
|
||||
73: env:
|
||||
74: DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
|
||||
75: ```
|
||||
76:
|
||||
77: ### 7.1.2 Enable sqlx offline mode
|
||||
78:
|
||||
79: Run locally:
|
||||
80: ```bash
|
||||
81: cargo sqlx prepare --workspace
|
||||
82: ```
|
||||
83:
|
||||
84: This creates `.sqlx/` directory with query metadata. Check it into git. Add the CI step above to verify it stays in sync.
|
||||
85:
|
||||
86: ### 7.1.3 Fix the lint job
|
||||
87:
|
||||
88: **File:** `.gitea/workflows/ci.yml`
|
||||
89:
|
||||
90: ```yaml
|
||||
91: # BEFORE
|
||||
92: run: npm run lint || true
|
||||
93:
|
||||
94: # AFTER
|
||||
95: run: npm run lint
|
||||
96: ```
|
||||
97:
|
||||
98: ### 7.1.4 Pin Gitea Actions
|
||||
99:
|
||||
100: Update all `@v3` to `@v4` throughout the file:
|
||||
101: - `actions/checkout@v3` → `@v4`
|
||||
102: - `actions/setup-node@v3` → `@v4`
|
||||
103: - `actions/upload-artifact@v3` → `@v4`
|
||||
104: - `codecov/codecov-action@v3` → `@v4`
|
||||
105:
|
||||
106: ### 7.1.5 Add Podman build job
|
||||
107:
|
||||
108: ```yaml
|
||||
109: podman-build:
|
||||
110: runs-on: ubuntu-latest
|
||||
111: needs: rust
|
||||
112: container:
|
||||
113: image: docker.io/podman/stable:latest
|
||||
114: steps:
|
||||
115: - uses: actions/checkout@v4
|
||||
116:
|
||||
117: - name: Build gateway-runtime
|
||||
118: run: podman build --target gateway-runtime -t git.madapes.com/madbase/gateway:ci .
|
||||
119:
|
||||
120: - name: Build worker-runtime
|
||||
121: run: podman build --target worker-runtime -t git.madapes.com/madbase/worker:ci .
|
||||
122:
|
||||
123: - name: Build control-runtime
|
||||
124: run: podman build --target control-runtime -t git.madapes.com/madbase/control:ci .
|
||||
125:
|
||||
126: - name: Build proxy-runtime
|
||||
127: run: podman build --target proxy-runtime -t git.madapes.com/madbase/proxy:ci .
|
||||
128:
|
||||
129: - name: Login to registry
|
||||
130: if: github.ref == 'refs/heads/main'
|
||||
131: run: echo "${{ secrets.REGISTRY_PASSWORD }}" | podman login git.madapes.com -u "${{ secrets.REGISTRY_USER }}" --password-stdin
|
||||
132:
|
||||
133: - name: Push images
|
||||
134: if: github.ref == 'refs/heads/main'
|
||||
135: run: |
|
||||
136: podman push git.madapes.com/madbase/gateway:ci
|
||||
137: podman push git.madapes.com/madbase/worker:ci
|
||||
138: podman push git.madapes.com/madbase/control:ci
|
||||
139: podman push git.madapes.com/madbase/proxy:ci
|
||||
140: ```
|
||||
141:
|
||||
142: ---
|
||||
143:
|
||||
144: ## 7.2 — Container Improvements (Podman)
|
||||
145:
|
||||
146: ### 7.2.1 Slim runtime images
|
||||
147:
|
||||
148: **File:** `Dockerfile` — all runtime stages (compatible with Podman)
|
||||
149:
|
||||
150: ```dockerfile
|
||||
151: # BEFORE
|
||||
152: FROM rust:latest AS worker-runtime
|
||||
153:
|
||||
154: # AFTER — shared base
|
||||
155: FROM debian:bookworm-slim AS runtime-base
|
||||
156: RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates curl libssl3 && rm -rf /var/lib/apt/lists/*
|
||||
157: RUN useradd -r -s /bin/false madbase
|
||||
158:
|
||||
159: FROM runtime-base AS worker-runtime
|
||||
160: WORKDIR /app
|
||||
161: COPY --from=builder /app/target/release/worker .
|
||||
162: USER madbase
|
||||
163: EXPOSE 8002
|
||||
164: HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
|
||||
165: CMD ["./worker"]
|
||||
166: ```
|
||||
167:
|
||||
168: ### 7.2.2 Create .containerignore
|
||||
169:
|
||||
170: ```
|
||||
171: .git
|
||||
172: target
|
||||
173: docs
|
||||
174: *.md
|
||||
175: env
|
||||
176: scripts
|
||||
177: _milestones
|
||||
178: .gitea
|
||||
179: control-plane-ui/node_modules
|
||||
180: control-plane-ui/dist
|
||||
181: ```
|
||||
182:
|
||||
183: > **Note:** While `.dockerignore` also works with Podman, `.containerignore` is the modern standard that works across all OCI-compliant container runtimes.
|
||||
184:
|
||||
185: ### 7.2.3 Pin image tags
|
||||
186:
|
||||
187: Replace all ` :latest` tags:
|
||||
188: - `cargo-chef:latest-rust-latest` → `cargo-chef:0.1.68-rust-1.77`
|
||||
189: - `victoriametrics/victoria-metrics:latest` → `v1.101.0`
|
||||
190: - `grafana/loki:latest` → `2.9.6`
|
||||
191: - `grafana/grafana:latest` → `10.4.2`
|
||||
192: - `victoriametrics/vmagent:latest` → `v1.101.0`
|
||||
193:
|
||||
194: ### 7.2.4 Update compose configuration for Podman Compose
|
||||
195:
|
||||
196: **File:** `compose.yaml` (or `docker-compose.yaml`)
|
||||
197:
|
||||
198: Ensure compatibility with Podman Compose:
|
||||
199:
|
||||
200: ```yaml
|
||||
201: services:
|
||||
202: gateway:
|
||||
203: image: git.madapes.com/madbase/gateway:latest
|
||||
204: ports:
|
||||
205: - "8000:8000"
|
||||
206: environment:
|
||||
207: - DATABASE_URL=${DATABASE_URL}
|
||||
208: - JWT_SECRET=${JWT_SECRET}
|
||||
209: depends_on:
|
||||
210: - postgres
|
||||
211: restart: unless-stopped
|
||||
212: healthcheck:
|
||||
213: test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
|
||||
214: interval: 10s
|
||||
215: timeout: 3s
|
||||
216: retries: 3
|
||||
217:
|
||||
218: # ... other services ...
|
||||
219: ```
|
||||
220:
|
||||
221: Run with Podman Compose:
|
||||
222: ```bash
|
||||
223: podman-compose up -d
|
||||
224: ```
|
||||
225:
|
||||
226: ---
|
||||
227:
|
||||
228: ## 7.3 — Observability
|
||||
229:
|
||||
230: ### 7.3.1 Create config files
|
||||
231:
|
||||
232: See M1 for `config/prometheus.yml` and `config/vmagent.yml` content.
|
||||
233:
|
||||
234: ### 7.3.2 Request correlation IDs
|
||||
235:
|
||||
236: **File:** `gateway/src/proxy.rs` — `proxy_request` function
|
||||
237:
|
||||
238: ```rust
|
||||
239: use uuid::Uuid;
|
||||
240:
|
||||
241: // Generate or propagate request ID
|
||||
242: let request_id = req.headers()
|
||||
243: .get("x-request-id")
|
||||
244: .and_then(|v| v.to_str().ok())
|
||||
245: .map(|s| s.to_string())
|
||||
246: .unwrap_or_else(|| Uuid::new_v4().to_string());
|
||||
247:
|
||||
248: // Add to proxied request
|
||||
249: request_builder = request_builder.header("x-request-id", &request_id);
|
||||
250:
|
||||
251: // Add to response
|
||||
252: response_builder = response_builder.header("x-request-id", &request_id);
|
||||
253: ```
|
||||
254:
|
||||
255: Use `tracing::Span` with the request ID for log correlation:
|
||||
256: ```rust
|
||||
257: let span = tracing::info_span!("request", id = %request_id);
|
||||
258: ```
|
||||
259:
|
||||
260: ### 7.3.3 OpenTelemetry tracing
|
||||
261:
|
||||
262: Add dependencies:
|
||||
263: ```toml
|
||||
264: opentelemetry = "0.22"
|
||||
265: opentelemetry-otlp = "0.15"
|
||||
266: tracing-opentelemetry = "0.23"
|
||||
267: ```
|
||||
268:
|
||||
269: Initialize in `gateway/src/main.rs`:
|
||||
270: ```rust
|
||||
271: if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
|
||||
272: let tracer = opentelemetry_otlp::new_pipeline()
|
||||
273: .tracing()
|
||||
274: .with_exporter(opentelemetry_otlp::new_exporter().tonic().with_endpoint(otlp_endpoint))
|
||||
275: .install_batch(opentelemetry_sdk::runtime::Tokio)?;
|
||||
276:
|
||||
277: let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
|
||||
278: // Add to the subscriber registry
|
||||
279: }
|
||||
280: ```
|
||||
281:
|
||||
282: ### 7.3.4 Alerting rules
|
||||
283:
|
||||
284: Create `config/alerts.yml` for Grafana alerting or VictoriaMetrics vmalert:
|
||||
285:
|
||||
286: ```yaml
|
||||
287: groups:
|
||||
288: - name: madbase
|
||||
289: rules:
|
||||
290: - alert: ServiceDown
|
||||
291: expr: up == 0
|
||||
292: for: 1m
|
||||
293: labels:
|
||||
294: severity: critical
|
||||
295:
|
||||
296: - alert: HighErrorRate
|
||||
297: expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
|
||||
298: for: 5m
|
||||
299: labels:
|
||||
300: severity: warning
|
||||
301:
|
||||
302: - alert: HighLatency
|
||||
303: expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
|
||||
304: for: 5m
|
||||
305: labels:
|
||||
306: severity: warning
|
||||
307: ```
|
||||
308:
|
||||
309: ---
|
||||
310:
|
||||
311: ## Completion Requirements
|
||||
312:
|
||||
313: This milestone is **not complete** until every item below is satisfied.
|
||||
314:
|
||||
315: ### 1. Full Test Suite — All Green
|
||||
316:
|
||||
317: - [ ] `cargo test --workspace` passes with **zero failures**
|
||||
318: - [ ] `cargo fmt --all -- --check` passes (no formatting issues)
|
||||
319: - [ ] `cargo clippy --workspace -- -D warnings` passes (no warnings)
|
||||
320: - [ ] `cargo sqlx prepare --check` passes (offline query data is up to date)
|
||||
321: - [ ] All **pre-existing tests** still pass (no regressions)
|
||||
322: - [ ] **New tests** are written for CI/operability features:
|
||||
323:
|
||||
324: | Test | Location | What it validates |
|
||||
325: |------|----------|-------------------|
|
||||
326: | `test_request_id_middleware` | `gateway/src/middleware.rs` | Request without `X-Request-Id` gets one generated; request with one keeps it |
|
||||
327: | `test_request_id_propagated` | `gateway/src/proxy.rs` | `X-Request-Id` from proxy request appears in upstream headers |
|
||||
328: | `test_health_endpoint_worker` | `gateway/src/bin/worker.rs` | `GET /health` returns 200 with JSON status |
|
||||
329: | `test_health_endpoint_system` | `gateway/src/bin/system.rs` | `GET /health` returns 200 with JSON status |
|
||||
330: | `test_health_endpoint_proxy` | `gateway/src/bin/proxy.rs` | `GET /health` returns 200 with JSON status |
|
||||
331: | `test_podman_build_proxy` | `.gitea/workflows/ci.yml` | Podman build target `proxy-runtime` succeeds (CI job) |
|
||||
332: | `test_podman_build_worker` | `.gitea/workflows/ci.yml` | Podman build target `worker-runtime` succeeds (CI job) |
|
||||
333: | `test_podman_build_control` | `.gitea/workflows/ci.yml` | Podman build target `control-runtime` succeeds (CI job) |
|
||||
334:
|
||||
335: ### 2. CI Pipeline Verification
|
||||
336:
|
||||
337: - [ ] CI passes on a clean PR: `cargo fmt`, `cargo clippy`, `cargo build`, `cargo test` all green
|
||||
338: - [ ] `cargo sqlx prepare --check` passes in CI
|
||||
339: - [ ] Podman build succeeds for all 4 targets (gateway, worker, control, proxy)
|
||||
340: - [ ] CI caches Rust build artifacts (via `actions/cache` as in 7.1.1, `actions-rust-lang/setup-rust-toolchain`, or `Swatinem/rust-cache`)
|
||||
341: - [ ] CI runs in under 15 minutes for a clean build
|
||||
342: - [ ] Images are successfully pushed to `git.madapes.com` on main branch
|
||||
343:
|
||||
344: ### 3. Podman / Operability Verification
|
||||
345:
|
||||
346: - [ ] Runtime images are under 200MB each (down from ~1.5GB)
|
||||
347: - [ ] Containers run as non-root user (`USER madbase`)
|
||||
348: - [ ] `podman inspect <image>` shows a `HEALTHCHECK` for each runtime image (build with `podman build --format docker`; Podman's default OCI image format drops `HEALTHCHECK`)
|
||||
349: - [ ] `.containerignore` exists and excludes `target/`, `.git/`, `env/`, `_milestones/`, `docs/`
|
||||
350: - [ ] All container image tags are pinned (no ` :latest` in Dockerfile)
|
||||
351: - [ ] `podman-compose up -d` successfully starts all services
|
||||
352: - [ ] Images can be pulled from `git.madapes.com` in production
|
||||
353:
|
||||
354: ### 4. Observability Verification
|
||||
355:
|
||||
356: - [ ] `X-Request-Id` header appears in proxy responses
|
||||
357: - [ ] Logs contain structured JSON with request IDs (verify via `podman logs proxy | jq .`)
|
||||
358: - [ ] Prometheus/VictoriaMetrics scrapes metrics from all services
|
||||
359: - [ ] Grafana dashboards show request rate, latency p50/p95/p99, error rate
|
||||
360: - [ ] Alerting rules fire for: service down >1 min, 5xx rate >0.1 req/s for 5 min, p95 latency >2 s for 5 min (the thresholds defined in 7.3.4)
|
||||
361:
|
||||
362: ### 5. CI Gate
|
||||
363:
|
||||
364: - [ ] The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
|
||||
365: - [ ] All milestones M0–M6 tests pass in the CI pipeline retroactively
|
||||
366: - [ ] Gitea Actions workflows are properly configured with secrets for registry access
|
||||
```

### 7.1.2 Enable sqlx offline mode

Run locally (this needs `sqlx-cli`, which `cargo install sqlx-cli` provides):

```bash
cargo sqlx prepare --workspace
```

This creates a `.sqlx/` directory with query metadata. Check it into git. The "Verify sqlx offline data" CI step in 7.1.1 verifies that it stays in sync.

### 7.1.3 Fix the lint job

**File:** `.github/workflows/ci.yml` line 29

The trailing `|| true` forces a zero exit status, so lint failures never fail the job. Remove it:

```yaml
# BEFORE
run: npm run lint || true

# AFTER
run: npm run lint
```
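
Why `|| true` hides failures can be checked outside CI. The following is a minimal sketch, assuming a POSIX `sh` is available; `false` stands in for a failing `npm run lint`, and `shell_succeeds` is a hypothetical helper, not part of this repository.

```rust
use std::process::Command;

/// Run a command line through `sh -c` and report whether it exited with status 0.
fn shell_succeeds(command_line: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(command_line)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

fn main() {
    // `false` is a stand-in for a lint run that found problems.
    println!("false         -> success: {}", shell_succeeds("false"));
    // Appending `|| true` turns the same failure into a success.
    println!("false || true -> success: {}", shell_succeeds("false || true"));
}
```

A CI runner only sees the final exit status, so with `|| true` the step is reported as passed no matter what the linter found.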

### 7.1.4 Pin GitHub Actions

Update all `@v3` to `@v4` throughout the file:

- `actions/checkout@v3` → `@v4`
- `actions/setup-node@v3` → `@v4`
- `actions/upload-artifact@v3` → `@v4`
- `codecov/codecov-action@v3` → `@v4`

### 7.1.5 Add Docker build job

```yaml
  docker:
    runs-on: ubuntu-latest
    needs: rust
    steps:
      - uses: actions/checkout@v4

      - name: Build gateway-runtime
        run: docker build --target gateway-runtime -t madbase/gateway:ci .

      - name: Build worker-runtime
        run: docker build --target worker-runtime -t madbase/worker:ci .

      - name: Build control-runtime
        run: docker build --target control-runtime -t madbase/control:ci .

      - name: Build proxy-runtime
        run: docker build --target proxy-runtime -t madbase/proxy:ci .
```

---

## 7.2 — Docker Improvements

### 7.2.1 Slim runtime images

**File:** `Dockerfile` — all runtime stages

```dockerfile
# BEFORE
FROM rust:latest AS worker-runtime

# AFTER — shared base
FROM debian:bookworm-slim AS runtime-base
# curl is needed by the HEALTHCHECK below; bookworm-slim does not ship it.
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates curl libssl3 \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase

FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
```
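
The `HEALTHCHECK` only works if the service answers `GET /health` with a 2xx status. The sketch below shows that contract using only the standard library. It is an illustration, not the worker's real server: `respond` and `handle` are hypothetical names, the JSON body shape is an assumption, and the demo binds an OS-assigned port instead of 8002.

```rust
use std::io::{BufRead, BufReader, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Build a complete HTTP/1.1 response for a request line such as
/// "GET /health HTTP/1.1". Only `GET /health` is answered with 200.
fn respond(request_line: &str) -> String {
    let mut parts = request_line.split_whitespace();
    let (status, body) = match (parts.next(), parts.next()) {
        (Some("GET"), Some("/health")) => ("200 OK", r#"{"status":"ok"}"#),
        _ => ("404 Not Found", r#"{"status":"not_found"}"#),
    };
    format!(
        "HTTP/1.1 {status}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    )
}

/// Serve one connection: read the request line, drain the headers, reply.
fn handle(mut stream: TcpStream) -> std::io::Result<()> {
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut request_line = String::new();
    reader.read_line(&mut request_line)?;
    // Drain the remaining headers up to the blank line that ends them.
    let mut header = String::new();
    loop {
        header.clear();
        if reader.read_line(&mut header)? == 0 || header == "\r\n" {
            break;
        }
    }
    stream.write_all(respond(&request_line).as_bytes())
}

fn main() -> std::io::Result<()> {
    // Port 0 asks the OS for any free port; the plan's worker listens on 8002.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let server = thread::spawn(move || -> std::io::Result<()> {
        let (stream, _peer) = listener.accept()?;
        handle(stream)
    });

    // Play the role of the health checker: send one request, print the reply.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n")?;
    let mut response = String::new();
    client.read_to_string(&mut response)?;
    println!("{response}");
    server.join().expect("server thread panicked")
}
```

`curl -f` exits non-zero for HTTP status 400 and above, so the 404 branch would mark the container unhealthy while the 200 branch keeps it healthy.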

### 7.2.2 Create .dockerignore

```
.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist
```

### 7.2.3 Pin image tags

Replace all `:latest` tags:

- `cargo-chef:latest-rust-latest` → `cargo-chef:0.1.68-rust-1.77`
- `victoriametrics/victoria-metrics:latest` → `:v1.101.0`
- `grafana/loki:latest` → `:2.9.6`
- `grafana/grafana:latest` → `:10.4.2`
- `victoriametrics/vmagent:latest` → `:v1.101.0`
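
A check like this can be automated. The sketch below is one possible heuristic, not an existing tool in this repository: `is_pinned` is a hypothetical helper that treats a reference as pinned when it has a digest, or a tag in which no dash-separated part is `latest`. It cannot detect other moving tags such as `bookworm-slim`.

```rust
/// Return true when an image reference carries a digest or an explicit tag
/// that does not contain `latest`. A reference without a tag defaults to
/// `latest`, so it counts as unpinned.
fn is_pinned(image: &str) -> bool {
    // A digest (`name@sha256:...`) always identifies one exact image.
    if image.contains('@') {
        return true;
    }
    // Only a colon after the last `/` starts a tag; an earlier colon belongs
    // to a registry port such as `localhost:5000/app`.
    let last_segment = image.rsplit('/').next().unwrap_or(image);
    match last_segment.split_once(':') {
        Some((_, tag)) => !tag.is_empty() && !tag.split('-').any(|part| part == "latest"),
        None => false,
    }
}

fn main() {
    for image in [
        "cargo-chef:latest-rust-latest",
        "cargo-chef:0.1.68-rust-1.77",
        "grafana/grafana:latest",
        "grafana/grafana:10.4.2",
    ] {
        println!("{image}: pinned = {}", is_pinned(image));
    }
}
```

Run over every `FROM` and `image:` reference, a function like this could back the "no `:latest`" completion item with a failing CI step instead of a manual review.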

---

## 7.3 — Observability

### 7.3.1 Create config files

See M1 for `config/prometheus.yml` and `config/vmagent.yml` content.

### 7.3.2 Request correlation IDs

**File:** `gateway/src/proxy.rs` — `proxy_request` function

```rust
use uuid::Uuid;

// Generate or propagate request ID
let request_id = req.headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());

// Add to proxied request
request_builder = request_builder.header("x-request-id", &request_id);

// Add to response
response_builder = response_builder.header("x-request-id", &request_id);
```

Use `tracing::Span` with the request ID for log correlation:

```rust
let span = tracing::info_span!("request", id = %request_id);
```
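
The propagate-or-generate rule is easy to unit test once it is separated from the HTTP types. The following is a minimal sketch using only the standard library: `resolve_request_id` is a hypothetical helper, a `HashMap` with lowercased keys stands in for the real header map, and treating a blank header as missing is an added assumption that the snippet above does not make.

```rust
use std::collections::HashMap;

/// Return the caller's `x-request-id` if present and non-blank; otherwise
/// call `generate` for a fresh one. Header names are assumed to be lowercased.
fn resolve_request_id(
    headers: &HashMap<String, String>,
    generate: impl FnOnce() -> String,
) -> String {
    headers
        .get("x-request-id")
        .map(|value| value.trim())
        .filter(|value| !value.is_empty())
        .map(str::to_owned)
        .unwrap_or_else(generate)
}

fn main() {
    // Stand-in generator; the gateway snippet uses `Uuid::new_v4()` here.
    let generate = || String::from("generated-id");

    let mut headers = HashMap::new();
    println!("without header: {}", resolve_request_id(&headers, generate));

    headers.insert("x-request-id".to_string(), "abc-123".to_string());
    println!("with header:    {}", resolve_request_id(&headers, generate));
}
```

Injecting the generator keeps the function deterministic in tests, which is the behaviour `test_request_id_middleware` in the completion table is meant to cover.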

### 7.3.3 OpenTelemetry tracing

Add dependencies:

```toml
opentelemetry = "0.22"
# Provides `opentelemetry_sdk::runtime::Tokio`, used in the snippet below.
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"
```

Initialize in `gateway/src/main.rs`:

```rust
use opentelemetry_otlp::WithExportConfig; // brings `with_endpoint` into scope

if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic().with_endpoint(otlp_endpoint))
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add to the subscriber registry
}
```

### 7.3.4 Alerting rules

Create `config/alerts.yml` for Grafana alerting or VictoriaMetrics vmalert:

```yaml
groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
```
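
To make the `HighErrorRate` threshold concrete: `rate(...[5m])` is roughly the counter increase over the window divided by the window length in seconds. `> 0.1` therefore means more than one 5xx response every ten seconds per series, independent of total traffic. The sketch below reproduces that arithmetic and, for comparison, the 5xx share of all requests. The function names and sample numbers are illustrative, and counter resets and Prometheus' extrapolation are deliberately ignored.

```rust
/// Approximate PromQL `rate()`: counter increase divided by window seconds.
/// Ignores counter resets and the extrapolation Prometheus applies.
fn per_second_rate(first: f64, last: f64, window_secs: f64) -> f64 {
    (last - first) / window_secs
}

/// Share of requests that failed, as a fraction between 0 and 1.
fn error_ratio(errors_per_sec: f64, total_per_sec: f64) -> f64 {
    if total_per_sec == 0.0 {
        0.0
    } else {
        errors_per_sec / total_per_sec
    }
}

fn main() {
    let window = 300.0; // the 5m range in the rule
    let errors = per_second_rate(1_000.0, 1_045.0, window); // 45 new 5xx responses
    let total = per_second_rate(50_000.0, 53_000.0, window); // 3000 new requests
    println!("5xx rate: {errors:.3} req/s, above 0.1: {}", errors > 0.1);
    println!("5xx ratio: {:.3}", error_ratio(errors, total));
}
```

With these sample counters the absolute rate is 0.15 req/s, which is above the rule's threshold, while only 1.5% of requests failed. A percentage-based requirement needs the ratio form rather than the absolute rate.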

---

## Completion Requirements

This milestone is **not complete** until every item below is satisfied.

### 1. Full Test Suite — All Green

- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] `cargo fmt --all -- --check` passes (no formatting issues)
- [ ] `cargo clippy --workspace -- -D warnings` passes (no warnings)
- [ ] `cargo sqlx prepare --check` passes (offline query data is up to date)
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for CI/operability features:

| Test | Location | What it validates |
|------|----------|-------------------|
| `test_request_id_middleware` | `gateway/src/middleware.rs` | Request without `X-Request-Id` gets one generated; request with one keeps it |
| `test_request_id_propagated` | `gateway/src/proxy.rs` | `X-Request-Id` from proxy request appears in upstream headers |
| `test_health_endpoint_worker` | `gateway/src/bin/worker.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_system` | `gateway/src/bin/system.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_proxy` | `gateway/src/bin/proxy.rs` | `GET /health` returns 200 with JSON status |
| `test_docker_build_proxy` | `.github/workflows/ci.yml` | Docker build target `proxy-runtime` succeeds (CI job) |
| `test_docker_build_worker` | `.github/workflows/ci.yml` | Docker build target `worker-runtime` succeeds (CI job) |
| `test_docker_build_control` | `.github/workflows/ci.yml` | Docker build target `control-runtime` succeeds (CI job) |

### 2. CI Pipeline Verification

- [ ] CI passes on a clean PR: `cargo fmt`, `cargo clippy`, `cargo build`, `cargo test` all green
- [ ] `cargo sqlx prepare --check` passes in CI
- [ ] Docker build succeeds for all 4 targets (gateway, worker, control, proxy)
- [ ] CI caches Rust build artifacts (via `actions/cache` as in 7.1.1, `actions-rust-lang/setup-rust-toolchain`, or `Swatinem/rust-cache`)
- [ ] CI runs in under 15 minutes for a clean build

### 3. Docker / Operability Verification

- [ ] Runtime images are under 200MB each (down from ~1.5GB)
- [ ] Containers run as non-root user (`USER madbase`)
- [ ] `docker inspect <image>` shows a `HEALTHCHECK` for each runtime image
- [ ] `.dockerignore` exists and excludes `target/`, `.git/`, `env/`, `_milestones/`, `docs/`
- [ ] All Docker image tags are pinned (no `:latest`)

### 4. Observability Verification

- [ ] `X-Request-Id` header appears in proxy responses
- [ ] Logs contain structured JSON with request IDs (verify via `docker compose logs proxy | jq .`)
- [ ] Prometheus/VictoriaMetrics scrapes metrics from all services
- [ ] Grafana dashboards show request rate, latency p50/p95/p99, error rate
- [ ] Alerting rules fire for: service down >1 min, 5xx rate >0.1 req/s for 5 min, p95 latency >2 s for 5 min (the thresholds defined in 7.3.4)

### 5. CI Gate

- [ ] The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
- [ ] All milestones M0–M6 tests pass in the CI pipeline retroactively
