# MadBase Caching Strategy

## Overview

MadBase implements a **two-tier caching architecture** that maintains the simplicity of the 4-pillar system while providing enterprise-grade caching capabilities.

## Architecture

### Tier 1: L1 Cache (In-Memory)

- **Technology**: moka (Rust)
- **Location**: Proxy / Worker nodes
- **Purpose**: Ultra-low latency for frequently accessed data
- **Typical Use Cases**:
  - Project configurations
  - JWT validation cache
  - Hot database query results
  - API response caching

### Tier 2: L2 Cache (Redis)

- **Technology**: Redis 7
- **Location**: State Pillar (Pillar 3)
- **Purpose**: Shared state across the entire cluster
- **Typical Use Cases**:
  - Distributed session storage
  - Realtime presence tracking
  - Rate limiting counters
  - Distributed locking
  - Pub/Sub messaging

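In practice the two tiers compose as a read-through chain: look in L1 first, fall back to L2, and only on a miss in both go to PostgreSQL. The following is a minimal sketch of that flow, assuming the `moka` and `redis` crates; the `TieredCache` type and its construction are illustrative, not MadBase's actual API.

```rust
use std::time::Duration;

use moka::future::Cache;
use redis::AsyncCommands;

/// Hypothetical composition of the two cache tiers.
pub struct TieredCache {
    l1: Cache<String, String>,             // per-node, in-process (moka)
    l2: redis::aio::MultiplexedConnection, // shared, hosted on the State Pillar
}

impl TieredCache {
    pub async fn new(redis_url: &str) -> redis::RedisResult<Self> {
        let l1 = Cache::builder()
            .max_capacity(10_000)
            .time_to_live(Duration::from_secs(60))
            .build();
        let client = redis::Client::open(redis_url)?;
        let l2 = client.get_multiplexed_async_connection().await?;
        Ok(Self { l1, l2 })
    }

    /// L1 first, then L2; a miss here means the caller goes to PostgreSQL.
    pub async fn get(&mut self, key: &str) -> redis::RedisResult<Option<String>> {
        if let Some(v) = self.l1.get(key).await {
            return Ok(Some(v));
        }
        let v: Option<String> = self.l2.get(key).await?;
        if let Some(ref hit) = v {
            // Promote the L2 hit into L1 for subsequent requests on this node
            self.l1.insert(key.to_string(), hit.clone()).await;
        }
        Ok(v)
    }
}
```

An L2 hit is promoted into L1 so later requests on the same node stay sub-microsecond, while the L1 TTL keeps per-node copies from drifting for too long.
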
## State Pillar Integration

The **State Pillar** (formerly "Database Pillar") now hosts both PostgreSQL and Redis:

```
┌─────────────────────────────────────────┐
│            State Pillar Node            │
├─────────────────────────────────────────┤
│  ┌──────────┐        ┌─────────────┐    │
│  │PostgreSQL│        │    Redis    │    │
│  │  :5432   │        │    :6379    │    │
│  └──────────┘        └─────────────┘    │
│        │                    │           │
│       └──────────┬──────────┘           │
│                  ▼                      │
│           ┌─────────────┐               │
│           │   HAProxy   │               │
│           │ :5433/:6379 │               │
│           └─────────────┘               │
└─────────────────────────────────────────┘
```

### Why This Approach?

1. **Resource Symmetry**: Both PostgreSQL and Redis are memory-intensive and share the same VPS requirements
2. **HA Piggybacking**: Pillar 3 already manages HA via Patroni and etcd, so Redis benefits from the same infrastructure
3. **Centralized State**: Maintains a clean separation of Compute (Worker/Proxy) vs. State (DB/Redis)
4. **Zero Complexity**: No new pillar is needed; the existing one is simply enhanced

## Features

### 1. Shared Auth Sessions

Users can now stay logged in even if the Proxy node handling their request changes:

```rust
use auth::SessionManager;

// Create a session
let session_token = session_manager
    .create_session(user_id, email, "authenticated".to_string())
    .await?;

// Validate on any proxy node
let session = session_manager
    .validate_session(&session_token)
    .await?;
```

### 2. Realtime Presence

Track "Who is online" across multiple Worker nodes:

```rust
use realtime::PresenceManager;

// User joins a channel
presence_manager
    .join_channel(user_id, "public-chat".to_string(), None)
    .await?;

// Get online count
let count = presence_manager
    .get_channel_online_count("public-chat".to_string())
    .await?;
```

### 3. Distributed Locking

Prevent race conditions during background operations:

```rust
use common::DistributedLock;

let lock = DistributedLock::new(
    redis_client,
    "migration:lock".to_string(),
    30, // 30-second TTL
);

if lock.acquire().await? {
    // Perform the critical section
    lock.release().await?;
}
```

### 4. Rate Limiting

Distributed rate limiting across all instances:

```rust
use gateway::rate_limit::RateLimitMiddleware;

// Check IP-based rate limit
if !middleware.check_ip(&user_ip).await? {
    return Err("Rate limit exceeded");
}
```

## Configuration

### Environment Variables

```bash
# PostgreSQL
DATABASE_URL="postgres://user:pass@db:5432/madbase"

# Redis (optional - falls back to L1 only if unset)
REDIS_URL="redis://db:6379/0"

# Cache TTL
CACHE_TTL_SECONDS=3600
```

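Because `REDIS_URL` is optional, the L2 tier should only be wired up when the variable is present; otherwise a node runs with the in-memory L1 cache alone. A rough sketch of that decision, assuming the `redis` crate (the function names here are illustrative):

```rust
use std::env;

/// Returns Some(client) when REDIS_URL is set, or None for L1-only mode.
fn build_l2_client() -> Option<redis::Client> {
    match env::var("REDIS_URL") {
        Ok(url) => redis::Client::open(url).ok(),
        Err(_) => None, // no L2: fall back to the in-memory cache only
    }
}

/// Reads CACHE_TTL_SECONDS, defaulting to 3600 when unset or malformed.
fn cache_ttl_seconds() -> u64 {
    env::var("CACHE_TTL_SECONDS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(3600)
}
```
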

### Cache Keyspaces

| Pattern | Purpose | TTL |
|---------|---------|-----|
| `session:{token}` | User sessions | 3600s |
| `presence:channel:{name}:user:{id}` | User presence | 60s |
| `ratelimit:ip:{addr}` | IP rate limiting | 60s |
| `ratelimit:user:{id}` | User rate limiting | 60s |
| `lock:{name}` | Distributed locks | Configurable |

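To keep the keyspace consistent across crates, key construction is best centralized in small helpers rather than scattered `format!` calls. An illustrative (hypothetical) set of helpers matching the patterns above:

```rust
/// Build the Redis keys used by the table above.
fn session_key(token: &str) -> String {
    format!("session:{token}")
}

fn presence_key(channel: &str, user_id: &str) -> String {
    format!("presence:channel:{channel}:user:{user_id}")
}

fn ratelimit_ip_key(addr: &str) -> String {
    format!("ratelimit:ip:{addr}")
}

fn lock_key(name: &str) -> String {
    format!("lock:{name}")
}
```
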
## HAProxy Configuration

The State Pillar's HAProxy routes both PostgreSQL and Redis traffic:

```haproxy
listen primary
    bind *:5433
    mode tcp
    server patroni1 patroni:5432 check

listen redis
    bind *:6379
    mode tcp
    server redis1 redis:6379 check
```

## Scaling Strategy

### Horizontal Scaling

- **Proxy Nodes**: Add more proxies; they all share the same Redis cache
- **Worker Nodes**: Add more workers; presence tracking works seamlessly across them
- **State Nodes**: Scale to 3 or 5 nodes for HA; Redis is replicated via Sentinel/Cluster

### Vertical Scaling

- Upgrade the State Node plan for more RAM (benefits both PostgreSQL and Redis)
- Typical: CX21 (8GB) → CX31 (16GB) → CX41 (32GB)

## Monitoring

Redis is monitored alongside PostgreSQL:

- **HAProxy Stats**: http://db-node:7000
- **Grafana Dashboard**: "State Pillar Performance"
- **Metrics**:
  - Redis memory usage
  - Cache hit/miss ratios
  - Connection pool utilization
  - Rate limit enforcement

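Hit/miss ratios are easiest to derive from a labelled counter that Grafana can divide. A small sketch of such a counter, assuming the `prometheus` crate; the metric and label names are illustrative, not MadBase's actual metrics:

```rust
use prometheus::{IntCounterVec, Opts, Registry};

/// Register a counter labelled by tier (l1/l2) and result (hit/miss);
/// the hit ratio itself is computed in Grafana from these series.
fn register_cache_metrics(registry: &Registry) -> prometheus::Result<IntCounterVec> {
    let counter = IntCounterVec::new(
        Opts::new("cache_requests_total", "Cache lookups by tier and result"),
        &["tier", "result"],
    )?;
    registry.register(Box::new(counter.clone()))?;
    Ok(counter)
}

// At a lookup site:
// cache_requests.with_label_values(&["l1", "hit"]).inc();
```
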
## Best Practices

1. **Session Management**: Use appropriate TTLs (shorter for sensitive data)
2. **Presence Tracking**: Implement heartbeats to keep users "online"
3. **Rate Limiting**: Use different limits for different user tiers
4. **Distributed Locks**: Always set reasonable TTLs to prevent deadlocks
5. **Cache Invalidation**: Use versioned keys or explicit deletion (see the sketch below)

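For practice 5, one way to implement versioned keys is to keep a per-project generation counter in Redis and embed it in every cache key: bumping the counter invalidates everything at once, and the stale entries simply age out via their TTLs. A rough sketch, assuming the `redis` crate; the key layout and function names are illustrative:

```rust
use redis::AsyncCommands;

/// Build a cache key that embeds the project's current cache generation.
async fn versioned_key(
    con: &mut redis::aio::MultiplexedConnection,
    project: &str,
    key: &str,
) -> redis::RedisResult<String> {
    // Current generation for this project (defaults to 0 if never set)
    let version: Option<u64> = con.get(format!("cache_version:{project}")).await?;
    Ok(format!("{project}:v{}:{key}", version.unwrap_or(0)))
}

/// Invalidate every cached entry for a project by bumping its generation.
async fn invalidate_project(
    con: &mut redis::aio::MultiplexedConnection,
    project: &str,
) -> redis::RedisResult<u64> {
    con.incr(format!("cache_version:{project}"), 1).await
}
```
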
## Migration Guide

### From Single-Node to Cluster

1. Update the State Pillar image to include Redis
2. Set `REDIS_URL` in all Proxy/Worker configurations
3. Deploy `SessionManager` in the Auth handlers
4. Enable presence tracking in the Realtime module
5. Update rate limiting to use distributed counters

### Testing

```bash
# Test Redis connection
redis-cli -h db-node ping

# Test session creation
curl -X POST http://localhost:8000/auth/v1/token \
  -d '{"email":"test@example.com","password":"password"}'

# Check presence
redis-cli -h db-node SMEMBERS "presence:channel:public:users"
```

## Performance

### Expected Latency

| Operation | L1 Cache (moka) | L2 Cache (Redis) | Database |
|-----------|-----------------|------------------|----------|
| Get       | <1μs            | 1-2ms            | 10-50ms  |
| Set       | <1μs            | 1-2ms            | 10-50ms  |
| Delete    | <1μs            | 1-2ms            | 10-50ms  |

### Cache Hit Ratios

- **L1 Hit**: 95%+ for frequently accessed data
- **L2 Hit**: 80%+ for shared state
- **Miss**: Falls through to the database

## Future Enhancements

- [ ] Redis Cluster for horizontal scaling
- [ ] Pub/Sub for real-time events
- [ ] Bloom filters for existence checks
- [ ] HyperLogLog for cardinality estimation
- [ ] Geospatial indexing for location features