Milestone 7: CI/CD & Operability
Goal: Every commit is validated. Deployments are reproducible and observable.
Depends on: M0 (Security), M1 (Foundation)
7.1 — Rust CI Pipeline
7.1.1 Add Rust jobs to CI
File: .github/workflows/ci.yml
Add a new job before the existing frontend jobs:
  rust:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy
      - name: Cache cargo registry and build
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
      - name: Check formatting
        run: cargo fmt --all --check
      - name: Run clippy
        run: cargo clippy --workspace -- -D warnings
      - name: Build workspace
        run: cargo build --workspace
      - name: Run tests
        run: cargo test --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
          JWT_SECRET: test-secret-for-ci-only-not-production
          DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres
      # sqlx-cli is not preinstalled on GitHub-hosted runners, so install it first
      - name: Install sqlx-cli
        run: cargo install sqlx-cli --locked --no-default-features --features rustls,postgres
      - name: Verify sqlx offline data
        run: cargo sqlx prepare --check --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
7.1.2 Enable sqlx offline mode
Run locally:
cargo sqlx prepare --workspace
This creates a .sqlx/ directory containing query metadata. Commit it to git. The "Verify sqlx offline data" step in 7.1.1 checks that it stays in sync with the queries in the code.
7.1.3 Fix the lint job
File: .github/workflows/ci.yml line 29
# BEFORE
run: npm run lint || true
# AFTER
run: npm run lint
7.1.4 Pin GitHub Actions
Update all @v3 to @v4 throughout the file:
- actions/checkout@v3 → @v4
- actions/setup-node@v3 → @v4
- actions/upload-artifact@v3 → @v4
- codecov/codecov-action@v3 → @v4
7.1.5 Add Docker build job
  docker:
    runs-on: ubuntu-latest
    needs: rust
    steps:
      - uses: actions/checkout@v4
      - name: Build gateway-runtime
        run: docker build --target gateway-runtime -t madbase/gateway:ci .
      - name: Build worker-runtime
        run: docker build --target worker-runtime -t madbase/worker:ci .
      - name: Build control-runtime
        run: docker build --target control-runtime -t madbase/control:ci .
      - name: Build proxy-runtime
        run: docker build --target proxy-runtime -t madbase/proxy:ci .
7.2 — Docker Improvements
7.2.1 Slim runtime images
File: Dockerfile — all runtime stages
# BEFORE
FROM rust:latest AS worker-runtime
# AFTER — shared base
FROM debian:bookworm-slim AS runtime-base
# curl is required by the HEALTHCHECK in each runtime stage
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libssl3 curl \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase
FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
7.2.2 Create .dockerignore
.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist
7.2.3 Pin image tags
Replace all :latest tags:
- cargo-chef:latest-rust-latest → cargo-chef:0.1.68-rust-1.77
- victoriametrics/victoria-metrics:latest → :v1.101.0
- grafana/loki:latest → :2.9.6
- grafana/grafana:latest → :10.4.2
- victoriametrics/vmagent:latest → :v1.101.0
7.3 — Observability
7.3.1 Create config files
See M1 for config/prometheus.yml and config/vmagent.yml content.
7.3.2 Request correlation IDs
File: gateway/src/proxy.rs — proxy_request function
use uuid::Uuid;
// Generate or propagate request ID
let request_id = req
    .headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());
// Add to proxied request
request_builder = request_builder.header("x-request-id", &request_id);
// Add to response
response_builder = response_builder.header("x-request-id", &request_id);
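Create a tracing span that carries the request ID so every log line emitted while handling the request can be correlated:
let span = tracing::info_span!("request", id = %request_id);
// In async code, attach the span to the handler future with tracing::Instrument
// (future.instrument(span)) instead of holding span.enter() across an .await.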
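The "generate or propagate" decision above can be isolated into a small pure function, which makes it straightforward to unit-test (for example in test_request_id_middleware). The following is a minimal sketch, not the gateway's actual code: the function name resolve_request_id, the injected generate closure (standing in for Uuid::new_v4().to_string() so the sketch needs no external crates), and the 128-character cap are all assumptions made for illustration.

```rust
/// Returns the caller-supplied request ID when it is usable, otherwise a fresh one.
///
/// `incoming` is the raw `x-request-id` header value, if present.
/// `generate` is a stand-in for `Uuid::new_v4().to_string()` (hypothetical seam
/// added here so the sketch compiles with the standard library only).
pub fn resolve_request_id<F>(incoming: Option<&str>, generate: F) -> String
where
    F: FnOnce() -> String,
{
    match incoming.map(str::trim) {
        // Propagate a non-empty ID of reasonable length.
        // The 128-character cap is an assumption, not something the plan specifies.
        Some(id) if !id.is_empty() && id.len() <= 128 => id.to_string(),
        // Missing, blank, or oversized header: mint a new ID.
        _ => generate(),
    }
}
```

Called with Some("abc-123") the function returns "abc-123" unchanged; called with None or a blank value it returns whatever generate produces. Rejecting blank or oversized values goes slightly beyond the snippet above and is a design choice to keep arbitrary client input out of logs.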
7.3.3 OpenTelemetry tracing
Add dependencies:
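opentelemetry = "0.22"
# Needed for opentelemetry_sdk::runtime::Tokio used in the initialization code
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"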
Initialize in gateway/src/main.rs:
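// with_endpoint is provided by the WithExportConfig trait
use opentelemetry_otlp::WithExportConfig;

// Tracing export is opt-in: only enabled when an OTLP endpoint is configured
if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add `telemetry` to the subscriber registry
}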
7.3.4 Alerting rules
Create config/alerts.yml for Grafana alerting or VictoriaMetrics vmalert:
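groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
      # Ratio of 5xx responses to all responses, per job (threshold: 5%)
      - alert: HighErrorRate
        expr: >-
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
      # p99 latency above 2 seconds, per job
      - alert: HighLatency
        expr: >-
          histogram_quantile(0.99,
          sum by (le, job) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning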
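To make the latency rule easier to reason about, here is a rough sketch of how a quantile is estimated from cumulative histogram buckets: find the bucket that contains the target rank, then interpolate linearly inside it. It mirrors the general approach behind histogram_quantile but is not the Prometheus or VictoriaMetrics implementation; the Bucket type and estimate_quantile function are hypothetical names introduced only for this illustration.

```rust
/// One cumulative histogram bucket: all observations with value <= `le`.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy)]
pub struct Bucket {
    /// Upper bound in seconds; f64::INFINITY for the +Inf bucket.
    pub le: f64,
    /// Cumulative count (or per-second rate) of observations up to `le`.
    pub count: f64,
}

/// Estimates the `q`-quantile (0.0..=1.0) from cumulative buckets sorted by `le`.
/// Returns `None` for empty input, zero observations, or `q` out of range.
pub fn estimate_quantile(q: f64, buckets: &[Bucket]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) {
        return None;
    }
    // The last bucket is cumulative over everything, so it holds the total.
    let total = buckets.last()?.count;
    if total <= 0.0 {
        return None;
    }
    let rank = q * total;
    let mut prev_le = 0.0;
    let mut prev_count = 0.0;
    for b in buckets {
        if b.count >= rank {
            if b.le.is_infinite() {
                // The rank falls in the +Inf bucket: the best available
                // estimate is the highest finite bound seen so far.
                return Some(prev_le);
            }
            let in_bucket = b.count - prev_count;
            if in_bucket <= 0.0 {
                // Degenerate empty bucket; fall back to its upper bound.
                return Some(b.le);
            }
            // Linear interpolation between the bucket's lower and upper bound.
            return Some(prev_le + (b.le - prev_le) * (rank - prev_count) / in_bucket);
        }
        prev_le = b.le;
        prev_count = b.count;
    }
    None
}
```

One practical consequence: the estimate can never be finer than the bucket boundaries, so a 2-second threshold is only meaningful if the histogram has a bucket boundary at or near 2 seconds. For example, with cumulative counts 90, 95, 99, 100 at bounds 0.5, 1, 2, 5 seconds, the 0.99 quantile comes out at approximately 2.0 seconds, right at the threshold.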
Completion Requirements
This milestone is not complete until every item below is satisfied.
1. Full Test Suite — All Green
- cargo test --workspace passes with zero failures
- cargo fmt --all -- --check passes (no formatting issues)
- cargo clippy --workspace -- -D warnings passes (no warnings)
- cargo sqlx prepare --check passes (offline query data is up to date)
- All pre-existing tests still pass (no regressions)
- New tests are written for CI/operability features:
| Test | Location | What it validates |
|---|---|---|
| test_request_id_middleware | gateway/src/middleware.rs | Request without X-Request-Id gets one generated; request with one keeps it |
| test_request_id_propagated | gateway/src/proxy.rs | X-Request-Id from proxy request appears in upstream headers |
| test_health_endpoint_worker | gateway/src/bin/worker.rs | GET /health returns 200 with JSON status |
| test_health_endpoint_system | gateway/src/bin/system.rs | GET /health returns 200 with JSON status |
| test_health_endpoint_proxy | gateway/src/bin/proxy.rs | GET /health returns 200 with JSON status |
| test_docker_build_proxy | .github/workflows/ci.yml | Docker build target proxy-runtime succeeds (CI job) |
| test_docker_build_worker | .github/workflows/ci.yml | Docker build target worker-runtime succeeds (CI job) |
| test_docker_build_control | .github/workflows/ci.yml | Docker build target control-runtime succeeds (CI job) |
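The health-endpoint rows in the table above all check the same contract: GET /health returns 200 with a JSON status. A framework-independent sketch of that contract might look like the following. The function name handle and the exact body {"status":"ok"} are assumptions for illustration; the plan only requires a 200 response carrying a JSON status.

```rust
/// Routing decision for a health probe, independent of any web framework.
/// Returns (status code, content type, body).
/// The name `handle` and the JSON shape are hypothetical.
pub fn handle(method: &str, path: &str) -> (u16, &'static str, &'static str) {
    match (method, path) {
        // Liveness probe used by the Docker HEALTHCHECK and by monitoring.
        ("GET", "/health") => (200, "application/json", r#"{"status":"ok"}"#),
        // Everything else is out of scope for this sketch.
        _ => (404, "text/plain", "not found"),
    }
}
```

Keeping the decision in a plain function like this lets the three binaries share one implementation, and lets each test assert on the status code and body without starting a server.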
2. CI Pipeline Verification
- CI passes on a clean PR: cargo fmt, cargo clippy, cargo build, cargo test all green
- cargo sqlx prepare --check passes in CI
- Docker build succeeds for all 4 targets built in 7.1.5 (gateway, worker, control, proxy)
- CI caches Rust build artifacts (via actions/cache as in 7.1.1, or actions-rust-lang/setup-rust-toolchain or Swatinem/rust-cache)
- CI runs in under 15 minutes for a clean build
3. Docker / Operability Verification
- Runtime images are under 200MB each (down from ~1.5GB)
- Containers run as non-root user (USER madbase)
- docker inspect <image> shows a HEALTHCHECK for each runtime image
- .dockerignore exists and excludes target/, .git/, env/, _milestones/, docs/
- All Docker image tags are pinned (no :latest)
4. Observability Verification
- X-Request-Id header appears in proxy responses
- Logs contain structured JSON with request IDs (verify via docker compose logs --no-log-prefix proxy | jq .)
- Prometheus/VictoriaMetrics scrapes metrics from all services
- Grafana dashboards show request rate, latency p50/p95/p99, error rate
- Alerting rules fire for: service down >1min, error rate >5%, p99 latency >2s
5. CI Gate
- The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
- All tests from milestones M0–M6 pass in the CI pipeline retroactively