madbase/_milestones/M7_cicd_operability.md
Vlad Durnea cffdf8af86
2026-03-15 12:35:42 +02:00


Milestone 7: CI/CD & Operability

Goal: Every commit is validated. Deployments are reproducible and observable.

Depends on: M0 (Security), M1 (Foundation)


7.1 — Rust CI Pipeline

7.1.1 Add Rust jobs to CI

File: .github/workflows/ci.yml

Add a new job before the existing frontend jobs:

  rust:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy

      - name: Cache cargo registry and build
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

      - name: Check formatting
        run: cargo fmt --all --check

      - name: Run clippy
        run: cargo clippy --workspace -- -D warnings

      - name: Build workspace
        run: cargo build --workspace

      - name: Run tests
        run: cargo test --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
          JWT_SECRET: test-secret-for-ci-only-not-production
          DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres

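      # cargo sqlx is provided by sqlx-cli, which the runner does not ship
      - name: Install sqlx-cli
        run: cargo install sqlx-cli --no-default-features --features native-tls,postgres

      - name: Verify sqlx offline data
        run: cargo sqlx prepare --check --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres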

7.1.2 Enable sqlx offline mode

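Run locally (this needs sqlx-cli installed and DATABASE_URL pointing at a database with the current schema):

cargo sqlx prepare --workspace

This creates a .sqlx/ directory containing query metadata. Commit it to git so the workspace can build without a live database, and keep the "Verify sqlx offline data" CI step from 7.1.1 so the metadata stays in sync with the queries.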

7.1.3 Fix the lint job

File: .github/workflows/ci.yml, line 29

The trailing || true forces a zero exit status, so lint errors never fail the job. Remove it:

# BEFORE
run: npm run lint || true

# AFTER
run: npm run lint

7.1.4 Pin GitHub Actions

Update all @v3 to @v4 throughout the file:

  • actions/checkout@v3 → @v4
  • actions/setup-node@v3 → @v4
  • actions/upload-artifact@v3 → @v4
  • codecov/codecov-action@v3 → @v4

7.1.5 Add Docker build job

  docker:
    runs-on: ubuntu-latest
    needs: rust
    steps:
      - uses: actions/checkout@v4

      - name: Build gateway-runtime
        run: docker build --target gateway-runtime -t madbase/gateway:ci .

      - name: Build worker-runtime
        run: docker build --target worker-runtime -t madbase/worker:ci .

      - name: Build control-runtime
        run: docker build --target control-runtime -t madbase/control:ci .

      - name: Build proxy-runtime
        run: docker build --target proxy-runtime -t madbase/proxy:ci .

7.2 — Docker Improvements

7.2.1 Slim runtime images

File: Dockerfile — all runtime stages

# BEFORE
FROM rust:latest AS worker-runtime

# AFTER — shared base
FROM debian:bookworm-slim AS runtime-base
# curl is required by the HEALTHCHECK in each runtime stage
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libssl3 curl \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase

FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
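The HEALTHCHECK above depends on curl being present in the runtime image. If shipping curl in every image is undesirable, one alternative is a tiny probe built with the Rust standard library only. The sketch below is an assumption, not part of the existing codebase: the function name probe_health and the idea of a separate probe binary are hypothetical, and only the port 8002 and the /health path come from the Dockerfile above.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Sends `GET /health` to `addr` and reports whether the status code is 200.
/// Hypothetical helper; only the status line of the response is inspected.
fn probe_health(addr: &str, timeout: Duration) -> std::io::Result<bool> {
    let socket_addr = addr.to_socket_addrs()?.next().ok_or_else(|| {
        std::io::Error::new(std::io::ErrorKind::InvalidInput, "address did not resolve")
    })?;

    let mut stream = TcpStream::connect_timeout(&socket_addr, timeout)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.set_write_timeout(Some(timeout))?;

    let request = format!("GET /health HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;

    // The status line looks like "HTTP/1.1 200 OK"; the second token is the code.
    let mut status_line = String::new();
    BufReader::new(&stream).read_line(&mut status_line)?;
    Ok(status_line.split_whitespace().nth(1) == Some("200"))
}

fn main() {
    // Connection errors (nothing listening, timeout) are treated as unhealthy.
    let healthy = probe_health("127.0.0.1:8002", Duration::from_secs(3)).unwrap_or(false);
    // A real HEALTHCHECK probe would translate this into an exit code (0 or 1).
    println!("healthy: {healthy}");
}
```

Whether this is worth it depends on how much image size and attack surface matter compared with maintaining one more binary; installing curl remains the simpler route.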

7.2.2 Create .dockerignore

.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist

7.2.3 Pin image tags

Replace all :latest tags:

  • cargo-chef:latest-rust-latest → cargo-chef:0.1.68-rust-1.77
  • victoriametrics/victoria-metrics:latest → :v1.101.0
  • grafana/loki:latest → :2.9.6
  • grafana/grafana:latest → :10.4.2
  • victoriametrics/vmagent:latest → :v1.101.0
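To keep tags pinned after this one-off cleanup, a small check could scan the Dockerfile and compose files for unpinned references. The following is a minimal sketch under stated assumptions: the function names are hypothetical, only `FROM <image>` and `image: <image>` lines are inspected, and any tag containing "latest" (such as latest-rust-latest) is treated as unpinned.

```rust
use std::collections::HashSet;

/// True if the reference has a digest or a tag that does not contain "latest".
fn is_pinned(image: &str) -> bool {
    // A digest reference (name@sha256:...) is always pinned.
    if image.contains('@') {
        return true;
    }
    // Look for the tag only after the last '/', so a registry port such as
    // localhost:5000/app is not mistaken for a tag.
    let name = image.rsplit('/').next().unwrap_or(image);
    match name.split_once(':') {
        Some((_, tag)) => !tag.is_empty() && !tag.contains("latest"),
        None => false, // no tag at all resolves to :latest
    }
}

/// Returns every unpinned image reference found in `text`.
fn unpinned_images(text: &str) -> Vec<String> {
    let mut stages: HashSet<String> = HashSet::new();
    let mut unpinned = Vec::new();

    for line in text.lines() {
        let mut words = line.split_whitespace();
        let image = match words.next() {
            Some("FROM") => {
                // Skip flags such as --platform=linux/amd64.
                let image = words.find(|w| !w.starts_with("--"));
                // Remember `AS <stage>` so a later `FROM <stage>` is not reported.
                if let (Some(keyword), Some(stage)) = (words.next(), words.next()) {
                    if keyword.eq_ignore_ascii_case("as") {
                        stages.insert(stage.to_string());
                    }
                }
                image
            }
            Some("image:") => words.next(),
            _ => None,
        };

        if let Some(image) = image {
            if !stages.contains(image) && !is_pinned(image) {
                unpinned.push(image.to_string());
            }
        }
    }
    unpinned
}

fn main() {
    let sample = "FROM rust:latest AS builder\nFROM debian:bookworm-slim AS runtime-base\n";
    for image in unpinned_images(sample) {
        println!("unpinned image: {image}");
    }
}
```

Run against the real files in CI, a non-empty result would fail the job, which turns the "no :latest" requirement into an enforced rule rather than a convention.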

7.3 — Observability

7.3.1 Create config files

See M1 for config/prometheus.yml and config/vmagent.yml content.

7.3.2 Request correlation IDs

File: gateway/src/proxy.rs, proxy_request function

use uuid::Uuid;

// Generate or propagate request ID
let request_id = req.headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());

// Add to proxied request
request_builder = request_builder.header("x-request-id", &request_id);

// Add to response
response_builder = response_builder.header("x-request-id", &request_id);

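Create a tracing span that carries the request ID and run the request handling inside it (in async code, via tracing::Instrument), so that every log line emitted for the request can be correlated: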

let span = tracing::info_span!("request", id = %request_id);
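The generate-or-propagate decision can be isolated in a pure function, which makes it easy to cover with a unit test such as test_request_id_middleware. The sketch below uses only the standard library; the function name, the injected generator closure (standing in for Uuid::new_v4) and the 128-character limit are assumptions for illustration, not existing code. Validating the client-supplied value is an addition the snippet above does not perform.

```rust
/// Maximum accepted length for a client-supplied request ID (an assumed limit).
const MAX_REQUEST_ID_LEN: usize = 128;

/// Returns the incoming `x-request-id` value if it is usable, otherwise the
/// value produced by `generate`.
fn resolve_request_id<F>(incoming: Option<&str>, generate: F) -> String
where
    F: FnOnce() -> String,
{
    incoming
        .map(str::trim)
        // Reject empty or oversized values instead of echoing them into logs.
        .filter(|id| !id.is_empty() && id.len() <= MAX_REQUEST_ID_LEN)
        // Only printable, non-space ASCII is accepted.
        .filter(|id| id.chars().all(|c| c.is_ascii_graphic()))
        .map(str::to_owned)
        .unwrap_or_else(generate)
}

fn main() {
    // Stand-in for `Uuid::new_v4().to_string()`.
    let generated = || "generated-id".to_string();

    println!("{}", resolve_request_id(Some("abc-123"), generated));
    println!("{}", resolve_request_id(None, generated));
}
```

Passing the generator as a closure keeps the function deterministic under test while the real handler can still supply a UUID.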

7.3.3 OpenTelemetry tracing

Add dependencies:

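opentelemetry = "0.22"
# Needed for opentelemetry_sdk::runtime::Tokio used in the initialization below
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"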

Initialize in gateway/src/main.rs:

if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic().with_endpoint(otlp_endpoint))
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add to the subscriber registry
}

7.3.4 Alerting rules

Create config/alerts.yml for Grafana alerting or VictoriaMetrics vmalert:

groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      - alert: HighErrorRate
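        # Share of 5xx responses among all responses (5%), not an absolute rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05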
        for: 5m
        labels:
          severity: warning

      - alert: HighLatency
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
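The for: field in each rule means the expression must stay true for that whole duration before the alert fires; until then the alert is only pending, and a single healthy evaluation resets it. The following is a simplified model of that behavior, written as a sketch to make the semantics concrete. It is not how vmalert or Grafana implement it, and all type and function names are hypothetical.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
enum AlertState {
    Inactive,
    Pending,
    Firing,
}

/// Simplified threshold rule with a `for:`-style hold duration.
struct Rule {
    threshold: f64,
    hold: Duration,
    /// How long the threshold has been breached continuously, if at all.
    breached_for: Option<Duration>,
}

impl Rule {
    fn new(threshold: f64, hold: Duration) -> Self {
        Rule { threshold, hold, breached_for: None }
    }

    /// Feeds one evaluation result; `interval` is the time since the previous one.
    fn evaluate(&mut self, value: f64, interval: Duration) -> AlertState {
        if value <= self.threshold {
            // Any healthy sample resets the pending timer.
            self.breached_for = None;
            return AlertState::Inactive;
        }
        let elapsed = match self.breached_for {
            None => Duration::ZERO, // first breaching sample starts the timer
            Some(previous) => previous + interval,
        };
        self.breached_for = Some(elapsed);
        if elapsed >= self.hold {
            AlertState::Firing
        } else {
            AlertState::Pending
        }
    }
}

fn main() {
    // Error ratio above 5% held for 2 minutes, evaluated once per minute.
    let mut rule = Rule::new(0.05, Duration::from_secs(120));
    let minute = Duration::from_secs(60);
    for ratio in [0.10, 0.10, 0.10, 0.01] {
        println!("ratio {ratio:.2} -> {:?}", rule.evaluate(ratio, minute));
    }
}
```

In this model a short error spike that recovers within the hold window never pages anyone, which is the reason the rules above use for: 5m rather than firing on the first bad sample.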

Completion Requirements

This milestone is not complete until every item below is satisfied.

1. Full Test Suite — All Green

  • cargo test --workspace passes with zero failures
  • cargo fmt --all -- --check passes (no formatting issues)
  • cargo clippy --workspace -- -D warnings passes (no warnings)
  • cargo sqlx prepare --check --workspace passes (offline query data is up to date)
  • All pre-existing tests still pass (no regressions)
  • New tests are written for CI/operability features:
| Test | Location | What it validates |
| --- | --- | --- |
| test_request_id_middleware | gateway/src/middleware.rs | Request without X-Request-Id gets one generated; request with one keeps it |
| test_request_id_propagated | gateway/src/proxy.rs | X-Request-Id from proxy request appears in upstream headers |
| test_health_endpoint_worker | gateway/src/bin/worker.rs | GET /health returns 200 with JSON status |
| test_health_endpoint_system | gateway/src/bin/system.rs | GET /health returns 200 with JSON status |
| test_health_endpoint_proxy | gateway/src/bin/proxy.rs | GET /health returns 200 with JSON status |
| test_docker_build_proxy | .github/workflows/ci.yml | Docker build target proxy-runtime succeeds (CI job) |
| test_docker_build_worker | .github/workflows/ci.yml | Docker build target worker-runtime succeeds (CI job) |
| test_docker_build_control | .github/workflows/ci.yml | Docker build target control-runtime succeeds (CI job) |

2. CI Pipeline Verification

  • CI passes on a clean PR: cargo fmt, cargo clippy, cargo build, cargo test all green
  • cargo sqlx prepare --check passes in CI
  • Docker build succeeds for all 4 targets built in 7.1.5 (gateway, worker, control, proxy)
  • CI caches Rust build artifacts (via actions/cache as configured in 7.1.1, or via actions-rust-lang/setup-rust-toolchain or Swatinem/rust-cache)
  • CI runs in under 15 minutes for a clean build

3. Docker / Operability Verification

  • Runtime images are under 200MB each (down from ~1.5GB)
  • Containers run as non-root user (USER madbase)
  • docker inspect <image> shows a HEALTHCHECK for each runtime image
  • .dockerignore exists and excludes target/, .git/, env/, _milestones/, docs/
  • All Docker image tags are pinned (no :latest)

4. Observability Verification

  • X-Request-Id header appears in proxy responses
  • Logs contain structured JSON with request IDs (verify via docker compose logs proxy | jq .)
  • Prometheus/VictoriaMetrics scrapes metrics from all services
  • Grafana dashboards show request rate, latency p50/p95/p99, error rate
  • Alerting rules fire for: service down >1min, error rate >5%, p99 latency >2s

5. CI Gate

  • The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
  • All tests from milestones M0 through M6 pass in the CI pipeline retroactively