# Milestone 7: CI/CD & Operability

**Goal:** Every commit is validated. Deployments are reproducible and observable.

**Depends on:** M0 (Security), M1 (Foundation)

---

## 7.1 — Rust CI Pipeline

### 7.1.1 Add Rust jobs to CI

**File:** `.github/workflows/ci.yml`

Add a new job before the existing frontend jobs:

```yaml
rust:
  runs-on: ubuntu-latest
  services:
    postgres:
      image: postgres:15
      env:
        POSTGRES_PASSWORD: postgres
      ports:
        - 5432:5432
      options: >-
        --health-cmd pg_isready
        --health-interval 10s
        --health-timeout 5s
        --health-retries 5
  steps:
    - uses: actions/checkout@v4

    - name: Install Rust toolchain
      uses: dtolnay/rust-toolchain@stable
      with:
        components: rustfmt, clippy

    - name: Cache cargo registry and build
      uses: actions/cache@v4
      with:
        path: |
          ~/.cargo/registry
          ~/.cargo/git
          target
        key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

    - name: Check formatting
      run: cargo fmt --all --check

    - name: Run clippy
      run: cargo clippy --workspace -- -D warnings

    - name: Build workspace
      run: cargo build --workspace

    - name: Run tests
      run: cargo test --workspace
      env:
        DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
        JWT_SECRET: test-secret-for-ci-only-not-production
        DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres

    # `cargo sqlx` comes from sqlx-cli, which the runner image does not ship
    - name: Install sqlx-cli
      run: cargo install sqlx-cli --no-default-features --features rustls,postgres

    - name: Verify sqlx offline data
      run: cargo sqlx prepare --check --workspace
      env:
        DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```

### 7.1.2 Enable sqlx offline mode

Run locally:

```bash
cargo sqlx prepare --workspace
```

This creates a `.sqlx/` directory containing query metadata. Check it into git. The `Verify sqlx offline data` step in the job above checks that it stays in sync with the queries in the code.
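As a rough illustration of what "checked in and in sync" means on disk, the sketch below counts the cached query files in `.sqlx/`. It assumes the layout used by recent sqlx releases, where each prepared query is stored as `query-<hash>.json`; the helper name `count_offline_queries` is hypothetical and not part of sqlx or this repository.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Counts files named `query-*.json` directly inside `dir`.
/// Assumption: this matches how `cargo sqlx prepare` names its cache files.
fn count_offline_queries(dir: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let name = entry?.file_name();
        let name = name.to_string_lossy();
        if name.starts_with("query-") && name.ends_with(".json") {
            count += 1;
        }
    }
    Ok(count)
}

fn main() {
    // Run from the workspace root, where `cargo sqlx prepare --workspace` writes `.sqlx/`.
    match count_offline_queries(Path::new(".sqlx")) {
        Ok(0) => eprintln!(".sqlx/ exists but holds no cached queries"),
        Ok(n) => println!("{n} cached queries found in .sqlx/"),
        Err(e) => eprintln!("cannot read .sqlx/: {e}"),
    }
}
```

A check like this only confirms that the directory was committed; `cargo sqlx prepare --check` remains the authoritative test that the metadata matches the queries.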
### 7.1.3 Fix the lint job

**File:** `.github/workflows/ci.yml` line 29

```yaml
# BEFORE
run: npm run lint || true

# AFTER
run: npm run lint
```

### 7.1.4 Pin GitHub Actions

Update all `@v3` to `@v4` throughout the file:

- `actions/checkout@v3` → `@v4`
- `actions/setup-node@v3` → `@v4`
- `actions/upload-artifact@v3` → `@v4`
- `codecov/codecov-action@v3` → `@v4`

### 7.1.5 Add Docker build job

```yaml
docker:
  runs-on: ubuntu-latest
  needs: rust
  steps:
    - uses: actions/checkout@v4

    - name: Build gateway-runtime
      run: docker build --target gateway-runtime -t madbase/gateway:ci .

    - name: Build worker-runtime
      run: docker build --target worker-runtime -t madbase/worker:ci .

    - name: Build control-runtime
      run: docker build --target control-runtime -t madbase/control:ci .

    - name: Build proxy-runtime
      run: docker build --target proxy-runtime -t madbase/proxy:ci .
```

---

## 7.2 — Docker Improvements

### 7.2.1 Slim runtime images

**File:** `Dockerfile` — all runtime stages

```dockerfile
# BEFORE
FROM rust:latest AS worker-runtime

# AFTER — shared base
FROM debian:bookworm-slim AS runtime-base
# curl is needed by the HEALTHCHECK in each runtime stage
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates curl libssl3 \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase

FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
```

### 7.2.2 Create .dockerignore

```
.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist
```

### 7.2.3 Pin image tags

Replace all `:latest` tags:

- `cargo-chef:latest-rust-latest` → `cargo-chef:0.1.68-rust-1.77`
- `victoriametrics/victoria-metrics:latest` → `:v1.101.0`
- `grafana/loki:latest` → `:2.9.6`
- `grafana/grafana:latest` → `:10.4.2`
- `victoriametrics/vmagent:latest` → `:v1.101.0`

---

## 7.3 — Observability

### 7.3.1 Create config files

See M1 for `config/prometheus.yml` and `config/vmagent.yml` content.

### 7.3.2 Request correlation IDs

**File:** `gateway/src/proxy.rs` — `proxy_request` function

```rust
use uuid::Uuid;

// Generate or propagate request ID
let request_id = req.headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());

// Add to proxied request
request_builder = request_builder.header("x-request-id", &request_id);

// Add to response
response_builder = response_builder.header("x-request-id", &request_id);
```

Use `tracing::Span` with the request ID for log correlation:

```rust
let span = tracing::info_span!("request", id = %request_id);
```

### 7.3.3 OpenTelemetry tracing

Add dependencies:

```toml
opentelemetry = "0.22"
# Provides `opentelemetry_sdk::runtime::Tokio`, used by `install_batch` below
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"
```

Initialize in `gateway/src/main.rs`:

```rust
// Tracing export is opt-in: only enabled when an OTLP endpoint is configured
if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add to the subscriber registry
}
```

### 7.3.4 Alerting rules

Create `config/alerts.yml` for Grafana alerting or
VictoriaMetrics vmalert:

```yaml
groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
```

---

## Completion Requirements

This milestone is **not complete** until every item below is satisfied.

### 1. Full Test Suite — All Green

- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] `cargo fmt --all -- --check` passes (no formatting issues)
- [ ] `cargo clippy --workspace -- -D warnings` passes (no warnings)
- [ ] `cargo sqlx prepare --check` passes (offline query data is up to date)
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for CI/operability features:

| Test | Location | What it validates |
|------|----------|-------------------|
| `test_request_id_middleware` | `gateway/src/middleware.rs` | Request without `X-Request-Id` gets one generated; request with one keeps it |
| `test_request_id_propagated` | `gateway/src/proxy.rs` | `X-Request-Id` from proxy request appears in upstream headers |
| `test_health_endpoint_worker` | `gateway/src/bin/worker.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_system` | `gateway/src/bin/system.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_proxy` | `gateway/src/bin/proxy.rs` | `GET /health` returns 200 with JSON status |
| `test_docker_build_proxy` | `.github/workflows/ci.yml` | Docker build target `proxy-runtime` succeeds (CI job) |
| `test_docker_build_worker` | `.github/workflows/ci.yml` | Docker build target `worker-runtime` succeeds (CI job) |
| `test_docker_build_control` | `.github/workflows/ci.yml` | Docker build target `control-runtime` succeeds (CI job) |

### 2. CI Pipeline Verification

- [ ] CI passes on a clean PR: `cargo fmt`, `cargo clippy`, `cargo build`, `cargo test` all green
- [ ] `cargo sqlx prepare --check` passes in CI
- [ ] Docker build succeeds for all 4 targets built by the `docker` job (gateway, worker, control, proxy)
- [ ] CI caches Rust build artifacts (via `actions/cache`, `actions-rust-lang/setup-rust-toolchain`, or `Swatinem/rust-cache`)
- [ ] CI runs in under 15 minutes for a clean build

### 3. Docker / Operability Verification

- [ ] Runtime images are under 200MB each (down from ~1.5GB)
- [ ] Containers run as non-root user (`USER madbase`)
- [ ] `docker inspect <image>` shows a `HEALTHCHECK` for each runtime image
- [ ] `.dockerignore` exists and excludes `target/`, `.git/`, `env/`, `_milestones/`, `docs/`
- [ ] All Docker image tags are pinned (no `:latest`)

### 4. Observability Verification

- [ ] `X-Request-Id` header appears in proxy responses
- [ ] Logs contain structured JSON with request IDs (verify via `docker compose logs proxy | jq .`)
- [ ] Prometheus/VictoriaMetrics scrapes metrics from all services
- [ ] Grafana dashboards show request rate, latency p50/p95/p99, error rate
- [ ] Alerting rules fire as defined in `config/alerts.yml`: service down >1 min, 5xx rate >0.1 req/s for 5 min, p95 latency >2s for 5 min

### 5. CI Gate

- [ ] The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
- [ ] All milestones M0–M6 tests pass in the CI pipeline retroactively
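
The behaviour that `test_request_id_middleware` is meant to validate (generate an ID when none is supplied, keep an incoming one) can be sketched as a pure function, which keeps the test free of HTTP plumbing. This is only an illustration: `resolve_request_id` is a hypothetical helper, and the closure stands in for `Uuid::new_v4().to_string()` so that the example needs nothing beyond the standard library.

```rust
/// Returns the incoming request ID if present, otherwise a freshly generated one.
/// `generate` is a stand-in for the real ID source (e.g. a UUID v4 generator).
fn resolve_request_id<F>(incoming: Option<&str>, generate: F) -> String
where
    F: FnOnce() -> String,
{
    incoming.map(str::to_string).unwrap_or_else(generate)
}

fn main() {
    // Hypothetical fixed generator; the gateway would use a UUID here.
    let fallback = || String::from("generated-id");

    // An incoming header value is propagated unchanged.
    println!("{}", resolve_request_id(Some("abc-123"), fallback));

    // A missing header falls back to the generator.
    println!("{}", resolve_request_id(None, fallback));
}
```

Injecting the generator makes both branches deterministic in a unit test, while the handler itself can pass the UUID-based closure.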