# CI/CD — GitHub → AWS (ECS Fargate + Amplify)

> Pipeline for pushing changes from the GitHub repo to AWS. The main Next.js
> app ships to **ECS Fargate** (via ECR); public static surfaces (marketing
> landing, docs, or any auxiliary site) ship to **Amplify**.
>
> Two deploy workflows, both driven by branch. `push` to `main` → prod. `push`
> to `develop` → staging. `pull_request` → CI gates only (no deploy).

---

## 1. Topology

```
┌────────────────────────────────────────────────────────────────┐
│                         GitHub repo                             │
│    main          develop         feature/*                      │
└─────┬──────────────┬──────────────────────────────┬─────────────┘
      │              │                              │
  push/merge     push/merge                      pull_request
      │              │                              │
      ▼              ▼                              ▼
┌──────────────────────────────────────┐   ┌────────────────────┐
│      GitHub Actions (this repo)      │   │   CI gates only    │
│                                       │   │   (typecheck +     │
│  ┌──────────────┐ ┌────────────────┐ │   │    test + build)   │
│  │ deploy-      │ │ deploy-amplify │ │   └────────────────────┘
│  │ fargate.yml  │ │ .yml           │ │
│  └──────┬───────┘ └────────┬───────┘ │
└─────────┼──────────────────┼─────────┘
          │                  │
          ▼                  ▼
   ┌──────────────┐   ┌──────────────┐
   │  AWS ECR     │   │ AWS Amplify  │
   │  (image)     │   │  (static /   │
   └──────┬───────┘   │   SSG site)  │
          │           └──────┬───────┘
          ▼                  ▼
   ┌──────────────┐   ┌──────────────┐
   │ ECS Fargate  │   │ Amplify edge │
   │ (main app:   │   │ (marketing / │
   │  Next.js SSR │   │  public docs)│
   │  + API +     │   └──────────────┘
   │  schedulers) │
   └──────────────┘
```

**Why split it?**

- Fargate carries the full Next.js App Router: SSR, API routes, SSE streams,
  background schedulers, broker integrations, LLM calls. Amplify's
  Lambda-per-route model doesn't handle the always-on scheduler cleanly.
- Amplify is the right home for fully static (or lightly dynamic) surfaces:
  the public marketing site, partner/docs pages, landing pages for lead
  capture. Amplify gives you a global CDN, automatic previews on PRs, and
  branch-per-environment deploys without paying for an idle Fargate task.

If you don't have a separate marketing site today, you can defer the Amplify
pipeline — the Fargate pipeline alone is enough to ship the app. Everything
below for Amplify is additive.

---

## 2. Branch + environment mapping

| Git branch | Target env | Fargate stack | Amplify branch | URL pattern |
|---|---|---|---|---|
| `main` | prod | `agencio-predict-prod` | `main` | `app.agencio-predict.com` (Fargate), `agencio-predict.com` (Amplify) |
| `develop` | staging | `agencio-predict-staging` | `develop` | `staging.app.*`, `staging.*` |
| `feature/*` | — | none | PR preview | Ephemeral preview URL per PR |

PR previews are Amplify-only by default. If you need full-stack preview
environments (with their own RDS/Redis) that's a separate ask — cost is
~$60/month per live preview.
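
The mapping in the table can be expressed as a small shell helper that a workflow step could call. This is a minimal sketch: the function names `env_for_ref` and `stack_for_ref` are hypothetical, and only the branch and stack names come from the table.

```bash
# Hypothetical helper: map a Git ref to the target environment from the table.
# Prints nothing and returns 1 for refs that do not deploy (e.g. feature/*).
env_for_ref() {
  case "$1" in
    refs/heads/main)    echo "prod" ;;
    refs/heads/develop) echo "staging" ;;
    *)                  return 1 ;;
  esac
}

# Stack names follow the agencio-predict-<env> pattern used in the table.
stack_for_ref() {
  local target
  target="$(env_for_ref "$1")" || return 1
  echo "agencio-predict-${target}"
}
```

In a workflow step, `stack_for_ref "$GITHUB_REF"` would yield the stack to update, and a non-zero exit signals "no deploy for this ref".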

---

## 3. First-time AWS prerequisites

Do these **once per environment** before any pipeline run. CloudFormation
can't create a stack that references secrets that don't exist yet.

### 3.1 Secrets Manager entries

For each env (`staging`, `prod`), create:

| Secret name | Value |
|---|---|
| `/agencio-predict/<env>/database_url` | `postgresql://user:pass@host/db` |
| `/agencio-predict/<env>/redis_url` | `redis://host:6379` |
| `/agencio-predict/<env>/jwt_secret` | 32+ random bytes (hex or base64) |
| `/agencio-predict/<env>/credentials_encryption_key` | 32-byte key, hex-encoded (64 hex characters), for AES-256-GCM encryption of broker creds |
| `/agencio-predict/<env>/claude_api_key` *(optional)* | Anthropic key |
| `/agencio-predict/<env>/polygon_api_key` *(optional)* | Polygon key |
| `/agencio-predict/<env>/sentry_dsn` *(optional)* | Sentry project DSN |

```bash
aws secretsmanager create-secret \
  --name /agencio-predict/prod/jwt_secret \
  --secret-string "$(openssl rand -hex 48)"
```

Grab each secret's full ARN (`aws secretsmanager describe-secret --secret-id
...`) — CloudFormation wants ARNs, not names.
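
Collecting the ARNs by hand is error-prone, so a loop over the required secret names may help. A sketch under the assumption that the secrets follow the naming scheme in the table; the function names are hypothetical.

```bash
# Hypothetical helper: print the full ARN of one secret in one environment.
secret_arn() {  # args: env key
  aws secretsmanager describe-secret \
    --secret-id "/agencio-predict/$1/$2" \
    --query ARN --output text
}

# Print "<key>=<arn>" for the four required (non-optional) secrets.
print_required_secret_arns() {  # args: env
  local key
  for key in database_url redis_url jwt_secret credentials_encryption_key; do
    printf '%s=%s\n' "$key" "$(secret_arn "$1" "$key")"
  done
}
```

Running `print_required_secret_arns prod` would list the values to paste into the CloudFormation parameters.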

### 3.2 ACM certificate

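Issue one certificate covering every hostname you plan to serve (apex +
app + staging). An ALB can only use a certificate from its own region, and a
custom certificate for Amplify must be in `us-east-1`, so one shared
certificate only works when the ALB also runs in `us-east-1`. Otherwise,
request a separate certificate in the ALB's region.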

```bash
aws acm request-certificate \
  --domain-name agencio-predict.com \
  --subject-alternative-names app.agencio-predict.com staging.agencio-predict.com staging.app.agencio-predict.com \
  --validation-method DNS
```

Add the CNAME validation records to Route53 (or your DNS provider) and wait
for `ISSUED` status.
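
The validation records and the wait can both be driven from the CLI. A minimal sketch; the function names are hypothetical, and the certificate ARN is whatever `request-certificate` returned.

```bash
# Print name, type, and value of each DNS validation record for a certificate.
acm_validation_records() {  # args: certificate_arn
  aws acm describe-certificate --certificate-arn "$1" \
    --query 'Certificate.DomainValidationOptions[].ResourceRecord.[Name,Type,Value]' \
    --output text
}

# Block until ACM reports the certificate as validated.
# The waiter polls on its own schedule and fails if validation takes too long.
wait_for_certificate() {  # args: certificate_arn
  aws acm wait certificate-validated --certificate-arn "$1"
}
```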

### 3.3 VPC + subnets

Two public subnets in different AZs is the minimum. If you already have a
VPC, reuse it; if not, spin one up with the AWS "VPC with public subnets"
wizard. Capture the `VpcId` + subnet IDs.

### 3.4 ECR repository

```bash
aws ecr create-repository --repository-name agencio-predict
```

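Optional: the first `deploy-fargate.yml` run will also create the repository
if absent, so you can skip this step. The command itself is not idempotent:
running it against an existing repository fails with
`RepositoryAlreadyExistsException`.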
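
A create-if-missing wrapper makes the step safe to re-run. This is a sketch of one way to do it, not necessarily what the workflow does; the function name is hypothetical.

```bash
# Create the ECR repository only when describe-repositories cannot find it.
ensure_ecr_repo() {  # args: repository_name
  if aws ecr describe-repositories --repository-names "$1" >/dev/null 2>&1; then
    echo "exists: $1"
  else
    aws ecr create-repository --repository-name "$1" >/dev/null \
      && echo "created: $1"
  fi
}
```

`ensure_ecr_repo agencio-predict` then prints which path it took. Note that any `describe-repositories` failure (including missing permissions) falls through to the create attempt.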

### 3.5 Amplify app (if using)

Create the Amplify app once through the console or CLI:

```bash
aws amplify create-app \
  --name agencio-predict-marketing \
  --repository https://github.com/Agencio-Bertha/agencio-predict \
  --access-token "$GITHUB_PAT" \
  --platform WEB
```

Then create branches inside it:

```bash
aws amplify create-branch --app-id <app-id> --branch-name main --stage PRODUCTION
aws amplify create-branch --app-id <app-id> --branch-name develop --stage DEVELOPMENT
```

Capture the `app-id` and `branch-name`s — the pipeline needs them.

---

## 4. OIDC — no long-lived AWS keys in GitHub

GitHub Actions authenticates to AWS via OIDC. You register GitHub as a
trusted identity provider once per account, then each workflow assumes a
dedicated IAM role scoped to exactly the actions it needs. There are no
`AWS_ACCESS_KEY_ID` secrets stored in GitHub, ever.

### 4.1 Register GitHub as an OIDC provider

```bash
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
```

One-time per AWS account.

### 4.2 Create the deploy role

Create `deploy-role-trust.json`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
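        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": [
            "repo:Agencio-Bertha/agencio-predict:ref:refs/heads/*",
            "repo:Agencio-Bertha/agencio-predict:environment:prod",
            "repo:Agencio-Bertha/agencio-predict:environment:staging"
          ]
        }
      }
    }
  ]
}
```

The `sub` condition restricts this role to this repo. This matters: without
it, any GitHub Actions workflow on any repo could assume your role. Jobs that
declare `environment:` (see §4.3 and §5) present a subject of the form
`repo:<owner>/<repo>:environment:<name>` instead of the branch ref, so the
environment patterns are listed alongside the branch pattern.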
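
As a sanity check on the trust policy, the two subject formats GitHub issues can be reproduced in shell: a job without an `environment:` key presents the branch ref, while a job that declares an environment presents `repo:<owner>/<repo>:environment:<name>`. A minimal sketch; the function names are hypothetical, and the `case` glob only approximates IAM's `StringLike` matching.

```bash
# Subject for a branch-triggered job with no `environment:` key.
oidc_sub_for_branch() {  # args: owner/repo branch
  echo "repo:$1:ref:refs/heads/$2"
}

# Subject for a job that declares `environment: <name>`.
oidc_sub_for_environment() {  # args: owner/repo environment
  echo "repo:$1:environment:$2"
}

# Return 0 when the subject matches the glob pattern, 1 otherwise.
sub_matches() {  # args: subject pattern
  case "$1" in
    $2) return 0 ;;
    *)  return 1 ;;
  esac
}
```

An environment-scoped subject does not match a pattern ending in `ref:refs/heads/*`, which is why a job using GitHub Environments needs its own entry in the trust policy.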

```bash
aws iam create-role \
  --role-name agencio-predict-deploy \
  --assume-role-policy-document file://deploy-role-trust.json

aws iam attach-role-policy \
  --role-name agencio-predict-deploy \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPowerUser

aws iam attach-role-policy \
  --role-name agencio-predict-deploy \
  --policy-arn arn:aws:iam::aws:policy/AmazonECS_FullAccess
```

Then add a custom inline policy for CloudFormation + Amplify trigger rights.
Minimum viable:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "cloudformation:*", "Resource": "*" },
    { "Effect": "Allow", "Action": "amplify:StartJob", "Resource": "arn:aws:amplify:*:*:apps/*/branches/*/jobs/*" },
    { "Effect": "Allow", "Action": "iam:PassRole", "Resource": "arn:aws:iam::*:role/agencio-predict-*" }
  ]
}
```
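
Attaching the inline policy is one more CLI call. A sketch assuming the JSON above was saved as `deploy-role-inline.json`; that file name and the policy name `agencio-predict-deploy-inline` are hypothetical, while the role name comes from the `create-role` step.

```bash
# Attach the inline policy document to the deploy role.
attach_inline_deploy_policy() {  # args: [policy_file]
  local policy_file="${1:-deploy-role-inline.json}"
  aws iam put-role-policy \
    --role-name agencio-predict-deploy \
    --policy-name agencio-predict-deploy-inline \
    --policy-document "file://${policy_file}"
}
```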

### 4.3 Add the role ARN as a GitHub secret

In GitHub: **Settings → Secrets and variables → Actions → New repository
secret**:

- `AWS_DEPLOY_ROLE_ARN` = `arn:aws:iam::<account>:role/agencio-predict-deploy`
- `AWS_REGION` = `us-east-1` (or your region)
- `AMPLIFY_APP_ID` = from step 3.5

Environment-scoped secrets (cleaner):

- GitHub → Settings → Environments → `prod`, `staging` — each gets its own
  `AWS_DEPLOY_ROLE_ARN` etc. The workflow selects the right env by branch.
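
The same values can be set from a terminal with the GitHub CLI instead of the web UI. A sketch, assuming `gh` is authenticated against this repo and the `prod` and `staging` environments already exist; the wrapper name is hypothetical.

```bash
# Store one environment-scoped secret via the GitHub CLI.
set_env_secret() {  # args: environment name value
  gh secret set "$2" --env "$1" --body "$3"
}

# Example (placeholders left as in the list above):
# set_env_secret prod AWS_DEPLOY_ROLE_ARN "arn:aws:iam::<account>:role/agencio-predict-deploy"
# set_env_secret prod AWS_REGION "us-east-1"
```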

---

## 5. Fargate pipeline — `.github/workflows/deploy-fargate.yml`

Triggered by push to `main` or `develop`. Builds the image, pushes to ECR,
rolls forward the CloudFormation stack. Failed deploys auto-rollback via the
`DeploymentCircuitBreaker` config in `deploy/aws/cloudformation.yml`.

File lives at `.github/workflows/deploy-fargate.yml` — see the companion
workflow file added alongside this doc.

Key knobs:

- **Image tag** = short git SHA. Gives you traceability from "which commit
  is running in prod?" back to GitHub.
- **Environment** = `prod` for `main`, `staging` for `develop`. GitHub
  Environments gate the deploy — you can require manual approval on `prod`.
- **Parameter overrides** = pulled from CloudFormation's existing stack
  state. First deploy needs full params via repository variables; every
  subsequent deploy reuses what's already in the stack.
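
The image-tag knob can be sketched as two small helpers. The 7-character length is an assumption (the text only says "short"), the function names are hypothetical, and the repository name is the one from §3.4.

```bash
# Truncate a full commit SHA to its first 7 characters.
short_sha() {  # args: full_sha
  printf '%.7s\n' "$1"
}

# Compose the ECR image URI for a given account, region, and tag.
image_uri() {  # args: account_id region tag
  echo "$1.dkr.ecr.$2.amazonaws.com/agencio-predict:$3"
}

# Example for a workflow step, where GITHUB_SHA holds the full commit SHA
# (AWS_ACCOUNT_ID is a hypothetical variable):
# image_uri "$AWS_ACCOUNT_ID" "$AWS_REGION" "$(short_sha "$GITHUB_SHA")"
```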

Runtime: ~6–8 minutes end-to-end. ~4 min Docker build, ~1 min push, ~2 min
CloudFormation update with rolling replacement.

## 6. Amplify pipeline — `.github/workflows/deploy-amplify.yml`

Triggered by push to `main` or `develop`. Amplify has native GitHub
integration and can auto-deploy on push, but we route through Actions so
branch-based access control, approval gates, and deploy notifications land
in the same place as the Fargate pipeline.

The workflow calls `aws amplify start-job --app-id $ID --branch-name $BRANCH
--job-type RELEASE`. Amplify pulls from GitHub itself.
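
Because `start-job` returns as soon as the job is queued, a workflow that should fail on a broken build has to poll for the result. A minimal sketch of that pattern, not the actual workflow contents; the function names and the default 10-second interval are assumptions.

```bash
# Start a RELEASE job and print its job ID.
start_amplify_release() {  # args: app_id branch
  aws amplify start-job --app-id "$1" --branch-name "$2" \
    --job-type RELEASE --query 'jobSummary.jobId' --output text
}

# Poll the job until it reaches a terminal status.
wait_for_amplify_job() {  # args: app_id branch job_id [interval_seconds]
  local status
  while :; do
    status="$(aws amplify get-job --app-id "$1" --branch-name "$2" \
      --job-id "$3" --query 'job.summary.status' --output text)" || return 1
    case "$status" in
      SUCCEED) return 0 ;;
      FAILED|CANCELLED)
        echo "Amplify job $3 ended as $status" >&2
        return 1 ;;
    esac
    sleep "${4:-10}"
  done
}
```

The loop has no upper bound of its own, so a real workflow step should also set a job-level timeout.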

If the marketing site lives in a subdirectory (`marketing/site` or similar),
configure Amplify's build spec (`amplify.yml`) to `cd` into that directory
before running `npm ci && npm run build`. The workflow filter on
`paths: ['marketing/**']` skips Amplify deploys when only the app changed.

Runtime: ~2–3 minutes. Amplify builds + deploys + cache-invalidates the
CDN in one step.

## 7. PR previews (Amplify)

Enable via the Amplify console: **App settings → Previews → Enable**. Every
open PR gets a unique `pr-42.<app-id>.amplifyapp.com` URL built from
the PR head commit. Auto-destroys on merge/close.

Keep turned off for branches that don't ship to the marketing site — PR
previews aren't free and can add up with many open PRs.

---

## 8. CI gates (existing — `.github/workflows/ci.yml`)

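Run on every PR + push. Defined separately from the deploy workflows, which
must not proceed until these gates pass. Workflows triggered by the same push
start in parallel by default, so enforce the ordering explicitly (for example
with required status checks on `main`/`develop`, or by triggering the deploys
through `workflow_run`).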

- Typecheck across all 4 workspaces (`packages/be`, `packages/fe`,
  `packages/shared`, `apps/web`)
- Lint (ESLint)
- Next.js build with test env vars
- npm audit (non-blocking)
- Docker build smoke test (no push)

Vitest tests (269 today) should run here too — add a `test` job if not
already present:

```yaml
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20.x, cache: 'npm' }
      - run: npm ci
      - run: npm test --workspaces --if-present
```

---

## 9. Rollback procedures

### 9.1 Fargate — bad deploy already in prod

```bash
# List recent task-def revisions
aws ecs list-task-definitions --family-prefix agencio-predict-prod \
  --status ACTIVE --sort DESC --max-items 5

# Flip the service back to the prior revision
aws ecs update-service \
  --cluster agencio-predict-prod \
  --service agencio-predict-prod \
  --task-definition agencio-predict-prod:41   # prior good
```

The `DeploymentCircuitBreaker` usually catches failed deploys before they
reach this stage. Manual rollback is for "health checks pass but app
misbehaves" cases.
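
After flipping the task definition it helps to wait until the service settles before declaring the rollback done. A sketch that combines the `update-service` call above with the ECS waiter; the function name is hypothetical.

```bash
# Point the service at a known-good task definition, then wait for stability.
rollback_service() {  # args: cluster service task_definition (family:revision)
  aws ecs update-service --cluster "$1" --service "$2" \
    --task-definition "$3" >/dev/null || return 1
  # Returns non-zero if the service does not stabilize within the waiter's limit.
  aws ecs wait services-stable --cluster "$1" --services "$2"
}

# Example, using the revision from the commands above:
# rollback_service agencio-predict-prod agencio-predict-prod agencio-predict-prod:41
```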

### 9.2 Amplify — bad deploy on main

```bash
aws amplify list-jobs --app-id <app-id> --branch-name main --max-results 5
aws amplify start-job --app-id <app-id> --branch-name main \
  --job-type RETRY --job-id <previous-good-job-id>
```

Or revert the commit on `main`: the revert is a new push, so a fresh release
starts within ~30 seconds (through the deploy workflow, or directly if
Amplify auto-build is enabled).

---

## 10. Monitoring a deploy in real time

```bash
# Fargate — watch task replacement
aws ecs describe-services --cluster agencio-predict-prod \
  --services agencio-predict-prod \
  --query 'services[0].events[0:10].[createdAt,message]' --output table

# Fargate — tail new-task logs
aws logs tail /ecs/agencio-predict-prod --follow --since 2m

# Amplify — poll job status
aws amplify get-job --app-id <app-id> --branch-name main --job-id <job-id>
```

Also available:

- **CloudWatch Container Insights** (enabled by our stack) — CPU/memory
  per-task, request counts.
- **ALB target group health** — how many tasks are `healthy` vs `unhealthy`
  after the rolling replacement.
- **Sentry** — crash reports from the new image land within seconds of the
  first real request. If `SENTRY_DSN` is set on the task env, you'll see
  errors tagged with the git SHA in the release.
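
The target group check can also be scripted. A sketch that counts targets in a given state; the function name is hypothetical, and the target group ARN has to be looked up from the stack's resources.

```bash
# Count targets of a target group that are in the given health state.
count_targets_in_state() {  # args: target_group_arn state
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].TargetHealth.State' --output text |
    tr '\t' '\n' | grep -c -x "$2"
}

# Example: count_targets_in_state "<target-group-arn>" healthy
```

`grep -c` still prints `0` when nothing matches, although its exit status is then non-zero.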

---

## 11. Security posture

- **No long-lived AWS keys.** OIDC short-lived credentials per-workflow-run.
- **No raw secrets in the pipeline.** All credentials are injected from AWS
  Secrets Manager at task start, never written as plaintext in workflow YAML
  or in the task definition's environment block.
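- **Repo scoping.** The OIDC trust policy `sub` condition pins the deploy
  role to the `Agencio-Bertha/agencio-predict` repo, so workflows in forks or
  other repos can't assume it. The `refs/heads/*` wildcard matches every
  branch in this repo; restricting prod deploys to `main` is handled by the
  Environment rules in the next bullet.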
- **GitHub Environments for approvals.** The `prod` environment should
  require 1 reviewer + restrict to `main` branch (Settings → Environments
  → `prod` → Deployment protection rules). Staging can auto-deploy.
- **`ALLOW_DEV_AUTH_BYPASS=false`** hardcoded in the task env. The deploy
  pipeline never sets this to `true` — dev-only setting.

---

## 12. Cost notes (monthly, approximate)

| Item | Cost |
|---|---|
| ECS Fargate (2× 1-vCPU / 2GB, 24/7) | ~$55 |
| ALB | ~$18 |
| CloudWatch Logs (30-day retention, 10 GB/mo) | ~$6 |
| ECR storage (5 images × 400 MB) | ~$0.20 |
| Amplify (pay-per-build + bandwidth, low traffic) | ~$5 |
| GitHub Actions (within the plan's included minutes; 3,000/month on the Team plan) | $0 |
| **Per-environment total** | **~$85/mo** |

Production + staging side-by-side: ~$170/mo before RDS, ElastiCache, data
transfer, and Sentry/Datadog. See `docs/25-data-feed-llm-opex.md` for the
full OPEX model.
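
The totals in the table are rounded; the arithmetic behind them can be checked directly from the line items. A small sketch (the function name is hypothetical):

```bash
# Sum a list of monthly line-item costs in USD.
monthly_total() {  # args: one cost per line item
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum }'
}

# Line items from the table: Fargate, ALB, logs, ECR, Amplify, Actions.
per_env="$(monthly_total 55 18 6 0.20 5 0)"
both="$(awk -v t="$per_env" 'BEGIN { printf "%.2f\n", 2 * t }')"
echo "per environment: \$${per_env}  prod + staging: \$${both}"
```

This prints `per environment: $84.20  prod + staging: $168.40`, which the table rounds up to ~$85 and ~$170.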

---

## 13. First-time deployment walkthrough

Assuming all prerequisites from §3 + §4 are done:

```bash
# 1. Merge whatever you want to ship into `main`.

# 2. From your laptop, run the first deploy manually so you supply all the
#    CloudFormation parameters. Subsequent deploys will reuse the stack state.

cd deploy/aws

PARAMS="VpcId=vpc-xxx \
  SubnetIds=subnet-a,subnet-b \
  CertificateArn=arn:aws:acm:us-east-1:...:certificate/... \
  DatabaseUrlSecretArn=arn:aws:secretsmanager:us-east-1:...:secret:/agencio-predict/prod/database_url-xxx \
  RedisUrlSecretArn=arn:aws:secretsmanager:us-east-1:...:secret:/agencio-predict/prod/redis_url-xxx \
  JwtSecretArn=arn:aws:secretsmanager:us-east-1:...:secret:/agencio-predict/prod/jwt_secret-xxx \
  CredentialsEncryptionKeySecretArn=arn:aws:secretsmanager:us-east-1:...:secret:/agencio-predict/prod/credentials_encryption_key-xxx" \
./deploy.sh prod

# 3. Note the LoadBalancerDnsName from the output. Point your Route53 alias
#    (app.agencio-predict.com) at it.

# 4. Test the endpoint:
curl -sf https://app.agencio-predict.com/api/health

# 5. If Amplify: push marketing site content to `main`, watch the deploy
#    workflow, connect the apex domain in the Amplify console.

# 6. Every subsequent deploy is fully hands-off — push to `main` and the
#    pipeline handles it.
```

---

## 14. Troubleshooting

### Deploy fails with `ResourceNotReady: ECS service stable`

The image started but health checks never went green. Usual suspects:

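1. `/api/health` returns non-200 because DATABASE_URL is wrong. Check the
   Secrets Manager value. Fargate has no host to `docker exec` into, so use
   ECS Exec (`aws ecs execute-command`, if enabled on the service) to verify
   from inside the container that the database is reachable.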
2. Container can't start because of a missing env var. Tail CloudWatch
   logs for the `predict-web` log stream from the latest task.
3. Security group blocks ALB → task traffic. The CloudFormation stack
   configures this correctly; only happens if you edit it by hand.
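
For the first two suspects, the stop reason ECS records on the failed task is often the fastest clue. A sketch that looks up one recently stopped task of the service and prints its reasons; the function name is hypothetical, and ECS only keeps stopped tasks visible for a limited time.

```bash
# Print the stop reason (and first container's reason) of one stopped task.
last_stopped_task_reason() {  # args: cluster service
  local task
  task="$(aws ecs list-tasks --cluster "$1" --service-name "$2" \
    --desired-status STOPPED --query 'taskArns[0]' --output text)" || return 1
  # With --output text, a missing value is printed as "None".
  if [ -z "$task" ] || [ "$task" = "None" ]; then
    echo "no stopped tasks found" >&2
    return 1
  fi
  aws ecs describe-tasks --cluster "$1" --tasks "$task" \
    --query 'tasks[0].[stoppedReason, containers[0].reason]' --output text
}

# Example: last_stopped_task_reason agencio-predict-prod agencio-predict-prod
```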

### `AccessDenied` from the workflow

The deploy role policy doesn't cover the action. Check the role's attached
policies + inline policies against §4.2's minimum set.

### Amplify build fails: "cannot find module"

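Usually monorepo path handling: `amplify.yml` has to install from the repo
root and build the right workspace (or `cd` into that workspace before
`npm ci`). Example for a marketing subsite that installs at the root and
builds one workspace. If the site is a static export, point `baseDirectory`
at the export output directory rather than `.next`: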

```yaml
version: 1
frontend:
  phases:
    preBuild:
      commands:
        - npm ci
    build:
      commands:
        - npm run build --workspace=@agencio-predict/marketing
  artifacts:
    baseDirectory: marketing/site/.next
    files:
      - '**/*'
```

### "New image deploys but old code keeps serving"

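CloudFront / Amplify CDN cache. Amplify invalidates its CDN cache as part of
each release, so force a fresh release of the branch (`start-deployment` only
applies to manually deployed apps that are not connected to a repository):

```bash
aws amplify start-job --app-id <id> --branch-name main --job-type RELEASE
```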

Or in the ALB case, confirm the new tasks are actually healthy — the ALB
only routes to `healthy` targets, so if the new image fails health checks
you'll keep getting old-task responses until the deployment circuit breaker
trips and rolls back.
