madbase/_milestones/M7_cicd_operability.md
Vlad Durnea cffdf8af86
wip:milestone 0 fixes
2026-03-15 12:35:42 +02:00

# Milestone 7: CI/CD & Operability

**Goal:** Every commit is validated. Deployments are reproducible and observable.

**Depends on:** M0 (Security), M1 (Foundation)

---
## 7.1 — Rust CI Pipeline
### 7.1.1 Add Rust jobs to CI
**File:** `.github/workflows/ci.yml`

Add a new job before the existing frontend jobs:
```yaml
  rust:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy
      - name: Cache cargo registry and build
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
      - name: Check formatting
        run: cargo fmt --all --check
      - name: Run clippy
        run: cargo clippy --workspace -- -D warnings
      - name: Build workspace
        run: cargo build --workspace
      - name: Run tests
        run: cargo test --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
          JWT_SECRET: test-secret-for-ci-only-not-production
          DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres
      # `cargo sqlx` is provided by sqlx-cli, which is not preinstalled on the runner
      - name: Install sqlx-cli
        run: cargo install sqlx-cli --no-default-features --features rustls,postgres --locked
      - name: Verify sqlx offline data
        run: cargo sqlx prepare --check --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```
### 7.1.2 Enable sqlx offline mode
Run locally:
```bash
cargo sqlx prepare --workspace
```
This requires `sqlx-cli` (`cargo install sqlx-cli`) and creates a `.sqlx/` directory containing query metadata. Commit it to git; the `Verify sqlx offline data` step in 7.1.1 checks that it stays in sync.
### 7.1.3 Fix the lint job
**File:** `.github/workflows/ci.yml` line 29
```yaml
# BEFORE
run: npm run lint || true
# AFTER
run: npm run lint
```
### 7.1.4 Pin GitHub Actions
Update all `@v3` to `@v4` throughout the file:
- `actions/checkout@v3` → `@v4`
- `actions/setup-node@v3` → `@v4`
- `actions/upload-artifact@v3` → `@v4`
- `codecov/codecov-action@v3` → `@v4`
### 7.1.5 Add Docker build job
```yaml
  docker:
    runs-on: ubuntu-latest
    needs: rust
    steps:
      - uses: actions/checkout@v4
      - name: Build gateway-runtime
        run: docker build --target gateway-runtime -t madbase/gateway:ci .
      - name: Build worker-runtime
        run: docker build --target worker-runtime -t madbase/worker:ci .
      - name: Build control-runtime
        run: docker build --target control-runtime -t madbase/control:ci .
      - name: Build proxy-runtime
        run: docker build --target proxy-runtime -t madbase/proxy:ci .
```
---
## 7.2 — Docker Improvements
### 7.2.1 Slim runtime images
**File:** `Dockerfile` — all runtime stages
```dockerfile
# BEFORE
FROM rust:latest AS worker-runtime

# AFTER: shared base
FROM debian:bookworm-slim AS runtime-base
# curl is needed by the HEALTHCHECK below; the slim image does not ship it
RUN apt-get update && apt-get install -y --no-install-recommends \
        ca-certificates libssl3 curl \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase

FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
```
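The `HEALTHCHECK` above passes only if the service answers `GET /health` with a non-error status, because `curl -f` exits non-zero on HTTP 4xx/5xx responses. As a rough illustration of that contract, here is a dependency-free sketch using only the standard library. The function names (`route`, `render_response`, `serve_once`) and the `{"status":"ok"}` body shape are hypothetical; the real services would presumably expose the endpoint through the web framework they already use.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::TcpListener;

/// Map a raw HTTP request line to a (status code, JSON body) pair.
/// The body shape is an assumption, not the project's actual schema.
fn route(request_line: &str) -> (u16, &'static str) {
    let mut parts = request_line.split_whitespace();
    match (parts.next(), parts.next()) {
        (Some("GET"), Some("/health")) => (200, r#"{"status":"ok"}"#),
        _ => (404, r#"{"status":"not_found"}"#),
    }
}

/// Serialize a complete HTTP/1.1 response for the given request line.
fn render_response(request_line: &str) -> String {
    let (code, body) = route(request_line);
    let reason = if code == 200 { "OK" } else { "Not Found" };
    format!(
        "HTTP/1.1 {code} {reason}\r\nContent-Type: application/json\r\nContent-Length: {len}\r\nConnection: close\r\n\r\n{body}",
        len = body.len()
    )
}

/// Answer exactly one connection, then return.
/// A real service would loop and delegate to its framework's router.
fn serve_once(listener: &TcpListener) -> std::io::Result<()> {
    let (mut stream, _) = listener.accept()?;
    let mut request_line = String::new();
    BufReader::new(stream.try_clone()?).read_line(&mut request_line)?;
    stream.write_all(render_response(&request_line).as_bytes())
}
```

Keeping the routing decision in a pure function (`route`) makes the health contract testable without opening a socket, which fits the `test_health_endpoint_*` items in the completion requirements.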
### 7.2.2 Create .dockerignore
```
.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist
```
### 7.2.3 Pin image tags
Replace all `:latest` tags:
- `cargo-chef:latest-rust-latest` → `cargo-chef:0.1.68-rust-1.77`
- `victoriametrics/victoria-metrics:latest` → `:v1.101.0`
- `grafana/loki:latest` → `:2.9.6`
- `grafana/grafana:latest` → `:10.4.2`
- `victoriametrics/vmagent:latest` → `:v1.101.0`
---
## 7.3 — Observability
### 7.3.1 Create config files
See M1 for `config/prometheus.yml` and `config/vmagent.yml` content.
### 7.3.2 Request correlation IDs
**File:** `gateway/src/proxy.rs` → `proxy_request` function
```rust
use uuid::Uuid;
// Generate or propagate request ID
let request_id = req.headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());
// Add to proxied request
request_builder = request_builder.header("x-request-id", &request_id);
// Add to response
response_builder = response_builder.header("x-request-id", &request_id);
```
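Create a `tracing` span that carries the request ID, and run the request handling inside it (in async code, via `tracing::Instrument::instrument`) so that every log line emitted for the request is correlated: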
```rust
let span = tracing::info_span!("request", id = %request_id);
```
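The generate-or-propagate decision can also be isolated in a small pure function, which makes it straightforward to cover with `test_request_id_middleware`. The following is a minimal sketch under stated assumptions: `resolve_request_id` and `MAX_REQUEST_ID_LEN` are hypothetical names, the `generate` closure stands in for `Uuid::new_v4().to_string()` so the example has no dependencies, and the validation of client-supplied values (length cap, printable ASCII only) is an optional hardening step that the snippet in this section does not perform.

```rust
/// Upper bound on accepted client-supplied IDs (an assumption, not from the milestone).
const MAX_REQUEST_ID_LEN: usize = 128;

/// Return the caller's `x-request-id` when it is usable, otherwise a freshly
/// generated one. `generate` is a stand-in for `Uuid::new_v4().to_string()`.
fn resolve_request_id(incoming: Option<&str>, generate: impl FnOnce() -> String) -> String {
    match incoming.map(str::trim) {
        // Propagate only non-empty, bounded, printable-ASCII values so that a
        // hostile header cannot inject whitespace or newlines into log lines.
        Some(id)
            if !id.is_empty()
                && id.len() <= MAX_REQUEST_ID_LEN
                && id.bytes().all(|b| b.is_ascii_graphic()) =>
        {
            id.to_owned()
        }
        // Missing or rejected header: fall back to a new ID.
        _ => generate(),
    }
}
```

In the gateway, the `incoming` argument would come from the request's `x-request-id` header, and the returned value would be set on both the proxied request and the response.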
### 7.3.3 OpenTelemetry tracing
Add dependencies:
```toml
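opentelemetry = "0.22"
# Provides `opentelemetry_sdk::runtime::Tokio`, used by `install_batch` below
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"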
```
Initialize in `gateway/src/main.rs`:
```rust
// Brings `with_endpoint` into scope
use opentelemetry_otlp::WithExportConfig;

// Tracing export is enabled only when an OTLP endpoint is configured
if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add to the subscriber registry
}
```
### 7.3.4 Alerting rules
Create `config/alerts.yml` for Grafana alerting or VictoriaMetrics vmalert:
```yaml
groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
      - alert: HighErrorRate
        # Share of 5xx responses among all requests, per job, above 5%
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
      - alert: HighLatency
        # p99 latency above 2 seconds
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
```
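To make the arithmetic behind these thresholds concrete, here is a small sketch of the two calculations involved: an error ratio (5xx rate divided by total rate) and a quantile estimated from cumulative histogram buckets by linear interpolation, which is the idea behind `histogram_quantile`. It is a simplified illustration, not the PromQL implementation; the function names and the sample bucket values in the usage note are made up for the example.

```rust
/// Fraction of requests that failed, given per-second rates such as those
/// produced by `rate(...)`. Returns 0.0 when there is no traffic.
fn error_ratio(error_rate: f64, total_rate: f64) -> f64 {
    if total_rate > 0.0 {
        error_rate / total_rate
    } else {
        0.0
    }
}

/// Estimate quantile `q` from cumulative `(upper_bound, count)` buckets,
/// sorted by bound, with a final `+Inf` bucket holding the total count.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    // Rank of the observation we are looking for.
    let rank = q * total;
    // First bucket whose cumulative count reaches that rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    let (upper, count) = buckets[idx];
    if upper.is_infinite() {
        // The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets.get(idx.checked_sub(1)?).map(|b| b.0);
    }
    let (lower, below) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count <= below {
        return Some(upper);
    }
    // Assume observations are spread evenly inside the bucket.
    Some(lower + (upper - lower) * (rank - below) / (count - below))
}
```

For example, with buckets `[(0.5, 50.0), (1.0, 80.0), (2.0, 95.0), (f64::INFINITY, 100.0)]`, the 0.9 quantile lands in the `(1.0, 2.0]` bucket and is interpolated between its bounds, while the 0.99 quantile falls into the `+Inf` bucket and is reported as the highest finite bound, 2.0. This is why latency alerts are only as precise as the bucket boundaries the services export.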
---
## Completion Requirements
This milestone is **not complete** until every item below is satisfied.
### 1. Full Test Suite — All Green
- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] `cargo fmt --all -- --check` passes (no formatting issues)
- [ ] `cargo clippy --workspace -- -D warnings` passes (no warnings)
- [ ] `cargo sqlx prepare --check --workspace` passes (offline query data is up to date)
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for CI/operability features:

| Test | Location | What it validates |
|------|----------|-------------------|
| `test_request_id_middleware` | `gateway/src/middleware.rs` | Request without `X-Request-Id` gets one generated; request with one keeps it |
| `test_request_id_propagated` | `gateway/src/proxy.rs` | `X-Request-Id` from proxy request appears in upstream headers |
| `test_health_endpoint_worker` | `gateway/src/bin/worker.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_system` | `gateway/src/bin/system.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_proxy` | `gateway/src/bin/proxy.rs` | `GET /health` returns 200 with JSON status |
| `test_docker_build_proxy` | `.github/workflows/ci.yml` | Docker build target `proxy-runtime` succeeds (CI job) |
| `test_docker_build_worker` | `.github/workflows/ci.yml` | Docker build target `worker-runtime` succeeds (CI job) |
| `test_docker_build_control` | `.github/workflows/ci.yml` | Docker build target `control-runtime` succeeds (CI job) |
### 2. CI Pipeline Verification
- [ ] CI passes on a clean PR: `cargo fmt`, `cargo clippy`, `cargo build`, `cargo test` all green
- [ ] `cargo sqlx prepare --check` passes in CI
- [ ] Docker build succeeds for all 4 targets built by the `docker` job in 7.1.5 (gateway, worker, control, proxy)
- [ ] CI caches Rust build artifacts (via `actions-rust-lang/setup-rust-toolchain` or `Swatinem/rust-cache`)
- [ ] CI runs in under 15 minutes for a clean build
### 3. Docker / Operability Verification
- [ ] Runtime images are under 200MB each (down from ~1.5GB)
- [ ] Containers run as non-root user (`USER madbase`)
- [ ] `docker inspect <image>` shows a `HEALTHCHECK` for each runtime image
- [ ] `.dockerignore` exists and excludes `target/`, `.git/`, `env/`, `_milestones/`, `docs/`
- [ ] All Docker image tags are pinned (no `:latest`)
### 4. Observability Verification
- [ ] `X-Request-Id` header appears in proxy responses
- [ ] Logs contain structured JSON with request IDs (verify via `docker compose logs proxy | jq .`)
- [ ] Prometheus/VictoriaMetrics scrapes metrics from all services
- [ ] Grafana dashboards show request rate, latency p50/p95/p99, error rate
- [ ] Alerting rules fire for: service down >1min, error rate >5%, p99 latency >2s
### 5. CI Gate
- [ ] The CI workflow itself is the gate — this milestone's success means CI is the gatekeeper for all future milestones
- [ ] All tests from milestones M0 through M6 pass in the CI pipeline retroactively