# Milestone 7: CI/CD & Operability

**Goal:** Every commit is validated. Deployments are reproducible and observable.

**Depends on:** M0 (Security), M1 (Foundation)

---

## 7.1 — Rust CI Pipeline

### 7.1.1 Add Rust jobs to CI

**File:** `.github/workflows/ci.yml`

Add a new job before the existing frontend jobs:

```yaml
  rust:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4

      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy

      - name: Cache cargo registry and build
        uses: actions/cache@v4
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}

      - name: Check formatting
        run: cargo fmt --all --check

      - name: Run clippy
        run: cargo clippy --workspace -- -D warnings

      - name: Build workspace
        run: cargo build --workspace

      - name: Run tests
        run: cargo test --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
          JWT_SECRET: test-secret-for-ci-only-not-production
          DEFAULT_TENANT_DB_URL: postgres://postgres:postgres@localhost:5432/postgres

      # Assumption: sqlx-cli is not preinstalled on the runner, so the
      # `cargo sqlx` subcommand must be installed before the check below.
      - name: Install sqlx-cli
        run: cargo install sqlx-cli --no-default-features --features native-tls,postgres

      - name: Verify sqlx offline data
        run: cargo sqlx prepare --check --workspace
        env:
          DATABASE_URL: postgres://postgres:postgres@localhost:5432/postgres
```

### 7.1.2 Enable sqlx offline mode

Run locally:

```bash
cargo sqlx prepare --workspace
```

This creates a `.sqlx/` directory containing query metadata. Check it into git. The `Verify sqlx offline data` CI step in 7.1.1 checks that it stays in sync with the queries in the code.

### 7.1.3 Fix the lint job

**File:** `.github/workflows/ci.yml` line 29

Remove the `|| true` suffix so that lint errors fail the job instead of being silently ignored:

```yaml
# BEFORE
run: npm run lint || true

# AFTER
run: npm run lint
```

### 7.1.4 Pin GitHub Actions

Update every `@v3` action reference to `@v4` throughout the file:

- `actions/checkout@v3` → `@v4`
- `actions/setup-node@v3` → `@v4`
- `actions/upload-artifact@v3` → `@v4`
- `codecov/codecov-action@v3` → `@v4`

### 7.1.5 Add Docker build job

```yaml
  docker:
    runs-on: ubuntu-latest
    needs: rust
    steps:
      - uses: actions/checkout@v4

      - name: Build gateway-runtime
        run: docker build --target gateway-runtime -t madbase/gateway:ci .

      - name: Build worker-runtime
        run: docker build --target worker-runtime -t madbase/worker:ci .

      - name: Build control-runtime
        run: docker build --target control-runtime -t madbase/control:ci .

      - name: Build proxy-runtime
        run: docker build --target proxy-runtime -t madbase/proxy:ci .
```

---

## 7.2 — Docker Improvements

### 7.2.1 Slim runtime images

**File:** `Dockerfile` — all runtime stages

```dockerfile
# BEFORE
FROM rust:latest AS worker-runtime

# AFTER — shared base
FROM debian:bookworm-slim AS runtime-base
# curl is needed by the HEALTHCHECK in the runtime stages
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates libssl3 curl \
    && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false madbase

FROM runtime-base AS worker-runtime
WORKDIR /app
COPY --from=builder /app/target/release/worker .
USER madbase
EXPOSE 8002
HEALTHCHECK --interval=10s --timeout=3s CMD curl -f http://localhost:8002/health || exit 1
CMD ["./worker"]
```

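The `HEALTHCHECK` above only works if the worker answers `GET /health` on port 8002 with a success status. The following is a minimal, standard-library-only sketch of what such a response could look like. The names `health_response` and `is_health_request` and the body shape `{"status":"ok"}` are assumptions for illustration; the actual services presumably serve this route through their HTTP framework.

```rust
/// Builds a complete HTTP/1.1 response for `GET /health`.
/// Hypothetical helper: the JSON body shape is an assumption.
fn health_response() -> String {
    let body = r#"{"status":"ok"}"#;
    format!(
        "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    )
}

/// Returns true when a raw HTTP request line targets the health endpoint,
/// for example "GET /health HTTP/1.1".
fn is_health_request(request_line: &str) -> bool {
    let mut parts = request_line.split_whitespace();
    matches!((parts.next(), parts.next()), (Some("GET"), Some("/health")))
}
```

With `curl -f`, any status of 400 or above makes the health check fail, so returning 200 here is what marks the container as healthy.
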
### 7.2.2 Create .dockerignore

```
.git
target
docs
*.md
env
scripts
_milestones
.github
control-plane-ui/node_modules
control-plane-ui/dist
```

### 7.2.3 Pin image tags

Replace all `:latest` tags:

- `cargo-chef:latest-rust-latest` → `cargo-chef:0.1.68-rust-1.77`
- `victoriametrics/victoria-metrics:latest` → `:v1.101.0`
- `grafana/loki:latest` → `:2.9.6`
- `grafana/grafana:latest` → `:10.4.2`
- `victoriametrics/vmagent:latest` → `:v1.101.0`

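The completion checklist requires that no image uses a floating tag. One way to check a single image reference mechanically is sketched below. `is_pinned` is a hypothetical helper, not part of the project, and treating any tag that contains `latest` (such as `latest-rust-latest`) as unpinned is an assumption chosen to match the list above.

```rust
/// Returns true if an image reference carries an explicit, non-floating tag
/// or a digest. Hypothetical helper for illustration only.
fn is_pinned(image_ref: &str) -> bool {
    // A digest (`name@sha256:...`) identifies exactly one image.
    if image_ref.contains('@') {
        return true;
    }
    // Look for the tag only after the last '/', so a registry port such as
    // `localhost:5000/app` is not mistaken for a tag.
    let last_segment = image_ref.rsplit('/').next().unwrap_or(image_ref);
    match last_segment.split_once(':') {
        Some((_, tag)) => !tag.is_empty() && !tag.contains("latest"),
        // No tag at all: Docker falls back to `latest`.
        None => false,
    }
}
```

For example, `is_pinned("grafana/loki:2.9.6")` is true, while `is_pinned("grafana/grafana:latest")` and `is_pinned("cargo-chef:latest-rust-latest")` are both false.
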
---

## 7.3 — Observability

### 7.3.1 Create config files

See M1 for `config/prometheus.yml` and `config/vmagent.yml` content.

### 7.3.2 Request correlation IDs

**File:** `gateway/src/proxy.rs` — `proxy_request` function

```rust
use uuid::Uuid;

// Generate or propagate request ID
let request_id = req.headers()
    .get("x-request-id")
    .and_then(|v| v.to_str().ok())
    .map(|s| s.to_string())
    .unwrap_or_else(|| Uuid::new_v4().to_string());

// Add to proxied request
request_builder = request_builder.header("x-request-id", &request_id);

// Add to response
response_builder = response_builder.header("x-request-id", &request_id);
```

Use `tracing::Span` with the request ID for log correlation:

```rust
let span = tracing::info_span!("request", id = %request_id);
```

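The generate-or-propagate rule can also be isolated into a small pure function, which makes the behaviour expected of `test_request_id_middleware` easy to check without an HTTP stack. This is a sketch under assumptions: `resolve_request_id` and the injected `generate` closure are hypothetical names (the snippet above calls `Uuid::new_v4()` inline), and treating an empty or whitespace-only header as missing is an added assumption.

```rust
/// Returns the incoming request ID if it is present and non-empty,
/// otherwise calls `generate` to create a fresh one.
fn resolve_request_id<F>(incoming: Option<&str>, generate: F) -> String
where
    F: FnOnce() -> String,
{
    match incoming.map(str::trim) {
        // Propagate the caller-supplied ID unchanged (apart from trimming).
        Some(id) if !id.is_empty() => id.to_string(),
        // Header missing or blank: fall back to a generated ID.
        _ => generate(),
    }
}
```

Inside `proxy_request` this could be called as `resolve_request_id(header_value, || Uuid::new_v4().to_string())`, where `header_value` is the `Option<&str>` read from the `x-request-id` header.
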
### 7.3.3 OpenTelemetry tracing

Add dependencies:

```toml
opentelemetry = "0.22"
# Provides `opentelemetry_sdk::runtime::Tokio`, used in the initialization below
opentelemetry_sdk = { version = "0.22", features = ["rt-tokio"] }
opentelemetry-otlp = "0.15"
tracing-opentelemetry = "0.23"
```

Initialize in `gateway/src/main.rs`:

```rust
// Tracing export is opt-in: it is only enabled when the endpoint is configured
if let Ok(otlp_endpoint) = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT") {
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(opentelemetry_otlp::new_exporter().tonic().with_endpoint(otlp_endpoint))
        .install_batch(opentelemetry_sdk::runtime::Tokio)?;

    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    // Add to the subscriber registry
}
```

### 7.3.4 Alerting rules

Create `config/alerts.yml` for Grafana alerting or VictoriaMetrics vmalert. The thresholds match the completion requirements (service down >1min, error rate >5%, p99 latency >2s):

```yaml
groups:
  - name: madbase
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical

      - alert: HighErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning

      - alert: HighLatency
        # p99 request latency in seconds
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warning
```

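The `HighLatency` rule relies on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The function below is a simplified sketch of that idea, not the actual Prometheus or VictoriaMetrics implementation; the name, signature, and edge-case handling are assumptions made for illustration.

```rust
/// Estimates the quantile `q` (0.0..=1.0) from cumulative histogram buckets.
/// `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
/// bound, with the last bucket representing +Inf.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    let mut lower_bound = 0.0;
    let mut lower_count = 0.0;
    for &(upper_bound, count) in buckets {
        if count >= rank {
            if upper_bound.is_infinite() {
                // The rank falls in the +Inf bucket: report the highest finite bound.
                return Some(lower_bound);
            }
            let in_bucket = count - lower_count;
            if in_bucket <= 0.0 {
                return Some(upper_bound);
            }
            // Assume observations are spread evenly within the bucket.
            let fraction = (rank - lower_count) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = count;
    }
    None
}
```

Because the result is interpolated, its accuracy depends on the bucket layout: a 2s threshold is only meaningful if the histogram has a bucket boundary at or near 2 seconds.
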
---

## Completion Requirements

This milestone is **not complete** until every item below is satisfied.

### 1. Full Test Suite — All Green

- [ ] `cargo test --workspace` passes with **zero failures**
- [ ] `cargo fmt --all -- --check` passes (no formatting issues)
- [ ] `cargo clippy --workspace -- -D warnings` passes (no warnings)
- [ ] `cargo sqlx prepare --check --workspace` passes (offline query data is up to date)
- [ ] All **pre-existing tests** still pass (no regressions)
- [ ] **New tests** are written for CI/operability features:

| Test | Location | What it validates |
|------|----------|-------------------|
| `test_request_id_middleware` | `gateway/src/middleware.rs` | Request without `X-Request-Id` gets one generated; request with one keeps it |
| `test_request_id_propagated` | `gateway/src/proxy.rs` | `X-Request-Id` from proxy request appears in upstream headers |
| `test_health_endpoint_worker` | `gateway/src/bin/worker.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_system` | `gateway/src/bin/system.rs` | `GET /health` returns 200 with JSON status |
| `test_health_endpoint_proxy` | `gateway/src/bin/proxy.rs` | `GET /health` returns 200 with JSON status |
| `test_docker_build_proxy` | `.github/workflows/ci.yml` | Docker build target `proxy-runtime` succeeds (CI job) |
| `test_docker_build_worker` | `.github/workflows/ci.yml` | Docker build target `worker-runtime` succeeds (CI job) |
| `test_docker_build_control` | `.github/workflows/ci.yml` | Docker build target `control-runtime` succeeds (CI job) |

### 2. CI Pipeline Verification

- [ ] CI passes on a clean PR: `cargo fmt`, `cargo clippy`, `cargo build`, `cargo test` all green
- [ ] `cargo sqlx prepare --check --workspace` passes in CI
- [ ] Docker build succeeds for all 4 runtime targets built by the `docker` job in 7.1.5 (gateway, worker, control, proxy)
- [ ] CI caches Rust build artifacts (via `actions/cache`, `Swatinem/rust-cache`, or `actions-rust-lang/setup-rust-toolchain`)
- [ ] CI runs in under 15 minutes for a clean build

### 3. Docker / Operability Verification

- [ ] Runtime images are under 200MB each (down from ~1.5GB)
- [ ] Containers run as non-root user (`USER madbase`)
- [ ] `docker inspect <image>` shows a `HEALTHCHECK` for each runtime image
- [ ] `.dockerignore` exists and excludes `target/`, `.git/`, `env/`, `_milestones/`, `docs/`
- [ ] All Docker image tags are pinned (no `:latest`)

### 4. Observability Verification

- [ ] `X-Request-Id` header appears in proxy responses
- [ ] Logs contain structured JSON with request IDs (verify via `docker compose logs proxy | jq .`)
- [ ] Prometheus/VictoriaMetrics scrapes metrics from all services
- [ ] Grafana dashboards show request rate, latency p50/p95/p99, error rate
- [ ] Alerting rules fire for: service down >1min, error rate >5%, p99 latency >2s

### 5. CI Gate

- [ ] The CI workflow itself is the gate: once this milestone is complete, CI is the gatekeeper for all future milestones
- [ ] All tests from milestones M0–M6 pass in the CI pipeline retroactively