📋 Detailed Implementation Guide

Testing & Monitoring Suite
Complete Documentation

Step-by-step instructions, code samples, and verification checklists for implementing the full testing and monitoring infrastructure.

🤖 AI-Accelerated Timeline

This guide assumes that Claude is used alongside tools such as Cursor for code generation, configuration, and documentation.

Manual coding: ~5 weeks → AI-assisted: ~1-2 weeks ✨ 3-4x faster

⚡ Quick Start (TL;DR)

For experienced engineers, here's the condensed implementation path:

Phase 0: Create staging branch → Configure branch protection → Create CONTRIBUTING.md
Phase 1: Create 8 new KV namespaces + 1 Hyperdrive → Update wrangler.toml → Set secrets
Phase 2: Create 3 GitHub workflows → Set secrets/variables → Enable status checks
Phase 3: pnpm add @sentry/cloudflare → Wrap index.ts → Update error handler
Phase 4: Add [observability] to wrangler.toml → Configure Axiom export in CF Dashboard
Phase 5: Create 3 Axiom dashboards → Configure 4 monitors
Phase 6: Create 5 Checkly checks (health, find×2, validate×2)
Phase 7: Create 3 K6 scripts → Add to CI workflow
Phase 8: Install CodeRabbit app → Create .coderabbit.yaml
Phase 9: Add coverage thresholds → Create test factories

⚠️ Critical Path

Phases 0, 1, and 2 must be completed in order (Phase 0 → Phase 1 → Phase 2); together they block all other work.

📋 Prerequisites

Required Access

Resource | Required Permission | Who Can Grant
GitHub Repository | Admin | Repository owner
Cloudflare Account | Edit Workers, KV, Hyperdrive | Account owner
Sentry Organization | Admin (to create project) | Sentry admin
Axiom Organization | Admin (to create datasets) | Axiom admin
Checkly Account | Admin | Account owner
Slack Workspace | Create webhook | Workspace admin

Required Tools (Local Machine)

# Verify installations
node --version    # >= 20.0.0
pnpm --version    # >= 9.0.0
wrangler --version # >= 4.0.0
git --version     # >= 2.40.0

# Optional but recommended
k6 --version      # For local load testing
jq --version      # For JSON processing

Pre-Implementation Checklist

✓ Complete before starting
  • All team members have GitHub repository access
  • Cloudflare API token created with Workers permissions
  • Slack channel created for notifications (#email-finder-alerts)
  • Slack incoming webhook URL created (Channel Settings → Integrations → Incoming Webhooks)
  • Budget approved for external services (Sentry, Checkly)
  • Team notified of upcoming workflow changes
  • Current production deployment is stable
  • Verify test:e2e script exists in package.json (or create it - see below)

test:e2e Script Setup

If the test:e2e script doesn't exist in package.json, add it:

{
  "scripts": {
    "test:e2e": "vitest run --config vitest.e2e.config.ts"
  }
}

Create vitest.e2e.config.ts if it doesn't exist:

import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    globals: true,
    environment: 'node',
    include: ['tests/e2e/**/*.test.ts'],
    testTimeout: 30000, // 30s timeout for E2E tests
    hookTimeout: 30000,
  },
});

🔍 Current State Analysis

Current Stack

Component | Technology
Runtime | Cloudflare Workers with Hono v4.7
Language | TypeScript with Valibot validation
Database | Neon PostgreSQL via Hyperdrive
Caching | Cloudflare KV (4 namespaces)
Linting | Biome v2.0.0
Testing | Vitest v4.0 with unit and E2E tests
Logging | Axiom (basic integration)

Gaps Identified

⚠️ Critical Issue

Staging and production share the same KV namespaces and Hyperdrive configuration, so the two environments read and write the same cache entries and the same database.

  • No documented Git flow or branch protection rules
  • No CI/CD pipeline (GitHub Actions)
  • No error tracking (Sentry)
  • No distributed tracing (OpenTelemetry)
  • No synthetic monitoring (Checkly)
  • No load testing (K6)
  • No AI code review (CodeRabbit)

🛠️ Tool Stack Summary

Category | Tool | Purpose
CI/CD | GitHub Actions | Automated testing and deployment
Code Review | CodeRabbit | AI-powered PR reviews
Error Tracking | Sentry | Exception monitoring, performance insights
Tracing | Cloudflare OTEL | Distributed tracing with automatic instrumentation
Logging/Dashboards | Axiom | Events, monitors, dashboards
Synthetic Monitoring | Checkly | API health checks, multi-region testing
Load Testing | K6 | Performance and stress testing
Linting | Biome | Code quality (already configured)
Unit/E2E Testing | Vitest | Automated test suite
Notifications | Slack | Team alerts and deployment notifications

Phase 0: Git Flow & Best Practices

Time: Day 1 Morning (~2-3 hours)

Git Flow Overview

┌──────────────────────────────────────────────────────────────────┐
│                        PRODUCTION                                │
│                           main                                   │
│                            ▲                                     │
│                            │ PR (requires approval + CI)         │
│                            │                                     │
│                        STAGING                                   │
│                         staging                                  │
│                            ▲                                     │
│                            │ PR (requires CI)*                   │
│                            │                                     │
│                       DEVELOPMENT                                │
│              feature/* | fix/* | chore/*                         │
└──────────────────────────────────────────────────────────────────┘

* CI status check requirement is enabled in Phase 2 after workflows are created.

Branch Naming Conventions

Type | Pattern | Example
Production | main | main
Staging | staging | staging
Feature | feature/<ticket>-<desc> | feature/EF-123-add-validation
Bug Fix | fix/<ticket>-<desc> | fix/EF-456-null-pointer
Chore | chore/<desc> | chore/update-dependencies
Hotfix | hotfix/<desc> | hotfix/critical-auth-fix
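
The naming conventions above can be checked mechanically, for example in a pre-push hook. The following is a minimal sketch, not part of the original tooling: the function name classifyBranch and the exact regular expressions are assumptions, and it assumes ticket IDs look like EF-123 and descriptions are lowercase kebab-case, as in the examples.

```typescript
// Hypothetical helper: the kinds and patterns mirror the branch naming table.
type BranchKind = 'production' | 'staging' | 'feature' | 'fix' | 'chore' | 'hotfix';

const KEBAB = '[a-z0-9]+(?:-[a-z0-9]+)*'; // assumed shape of <desc>
const TICKET = '[A-Z]+-\\d+';              // assumed shape of <ticket>, e.g. EF-123

const BRANCH_PATTERNS: ReadonlyArray<[BranchKind, RegExp]> = [
  ['production', /^main$/],
  ['staging', /^staging$/],
  ['feature', new RegExp(`^feature/${TICKET}-${KEBAB}$`)],
  ['fix', new RegExp(`^fix/${TICKET}-${KEBAB}$`)],
  ['chore', new RegExp(`^chore/${KEBAB}$`)],
  ['hotfix', new RegExp(`^hotfix/${KEBAB}$`)],
];

// Returns the matching branch kind, or null when the name breaks the convention.
function classifyBranch(name: string): BranchKind | null {
  for (const [kind, pattern] of BRANCH_PATTERNS) {
    if (pattern.test(name)) return kind;
  }
  return null;
}
```

A null result would be the signal to reject the branch name; how strictly to enforce that is a team decision.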

Mandatory Rules

⚠️ No Exceptions

1. Never push directly to main
2. Never push directly to staging
3. All code must pass through staging before production
4. All PRs require CI checks to pass*
5. PRs to main require approval

Branch Protection Configuration

Configure in GitHub Repository Settings → Branches:

Setting (main branch) | Value
Require pull request before merging | ✅ Enabled
Require approvals | ✅ 1 approval required
Dismiss stale approvals | ✅ Enabled
Require status checks | ❌ Disabled (enable in Phase 2)
Require up to date branches | ✅ Enabled
Allow force pushes | ❌ Disabled

Commit Message Convention

Follow Conventional Commits format: <type>(<scope>): <description>

Type | Description | Example
feat | New feature | feat(validation): add bypass_cache parameter
fix | Bug fix | fix(smtp): handle timeout errors gracefully
docs | Documentation only | docs(readme): update deployment instructions
refactor | Code change (no bug fix or feature) | refactor(cache): simplify key generation
test | Adding or updating tests | test(find): add edge case coverage
chore | Maintenance tasks | chore(deps): update @sentry/cloudflare
ci | CI/CD changes | ci: add K6 smoke test to staging
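
To make the format concrete, here is a small sketch of a header parser for the types listed above. It is an illustration only: parseCommitHeader and its regular expression are hypothetical, and the scope is treated as optional because the ci example in the table has none.

```typescript
// The commit types from the table above.
const COMMIT_TYPES = ['feat', 'fix', 'docs', 'refactor', 'test', 'chore', 'ci'] as const;
type CommitType = (typeof COMMIT_TYPES)[number];

interface ParsedCommit {
  type: CommitType;
  scope: string | null;
  description: string;
}

// <type>(<scope>): <description>, with "(<scope>)" optional.
const HEADER_PATTERN = /^([a-z]+)(?:\(([^()\s]+)\))?: (\S.*)$/;

// Returns the parsed header, or null when it does not follow the convention.
function parseCommitHeader(header: string): ParsedCommit | null {
  const match = HEADER_PATTERN.exec(header);
  if (!match) return null;
  const [, type, scope, description] = match;
  if (!(COMMIT_TYPES as readonly string[]).includes(type)) return null;
  return { type: type as CommitType, scope: scope ?? null, description };
}
```

For instance, parseCommitHeader('ci: add K6 smoke test to staging') yields type 'ci' with a null scope, while a header such as 'update stuff' is rejected.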

Verification Checklist

🛑 STOP: Complete before Phase 1
  • CONTRIBUTING.md created with Git flow documentation
  • Branch protection configured for main (WITHOUT status checks)
  • Branch protection configured for staging (WITHOUT status checks)
  • staging branch created from main
  • Test: Direct push to main is blocked
  • Test: Direct push to staging is blocked
  • Test: PR to main without approval is blocked
  • Team notified of new workflow rules

Phase 1: Infrastructure Separation (CRITICAL)

Time: Day 1 Afternoon (~3-4 hours)

⚠️ Critical Pre-requisite

This phase MUST be completed before CI/CD. Staging and production currently share the same KV namespaces and Hyperdrive, causing data isolation issues.

Create Staging KV Namespaces

# Create staging-specific KV namespaces
wrangler kv namespace create "PATTERN_CACHE_STAGING"
wrangler kv namespace create "DOMAIN_CACHE_STAGING"
wrangler kv namespace create "RESULT_CACHE_STAGING"
wrangler kv namespace create "NEGATIVE_CACHE_STAGING"

Create Production KV Namespaces

# Create production-specific KV namespaces
wrangler kv namespace create "PATTERN_CACHE_PRODUCTION"
wrangler kv namespace create "DOMAIN_CACHE_PRODUCTION"
wrangler kv namespace create "RESULT_CACHE_PRODUCTION"
wrangler kv namespace create "NEGATIVE_CACHE_PRODUCTION"

Create Staging Hyperdrive

wrangler hyperdrive create email-finder-staging \
  --connection-string="postgresql://..."

Update wrangler.toml

# Staging Environment
[env.staging]
name = "email-finder-service-staging"
vars = { ENVIRONMENT = "staging" }

[[env.staging.kv_namespaces]]
binding = "PATTERN_CACHE"
id = "<NEW_STAGING_PATTERN_CACHE_ID>"

[[env.staging.kv_namespaces]]
binding = "DOMAIN_CACHE"
id = "<NEW_STAGING_DOMAIN_CACHE_ID>"

# ... repeat for all KV namespaces

[[env.staging.hyperdrive]]
binding = "HYPERDRIVE"
id = "<NEW_STAGING_HYPERDRIVE_ID>"
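
Because the whole point of this phase is that no namespace ID is shared between environments, it can help to check that property explicitly after editing the configuration. The sketch below is one possible way to do it; the KvBinding shape and findSharedNamespaces are hypothetical, and the IDs in the usage lines are made-up placeholders, not real namespace IDs.

```typescript
// Hypothetical, trimmed-down view of one kv_namespaces entry.
interface KvBinding {
  binding: string;
  id: string;
}

// Returns the staging binding names whose namespace ID also appears in production.
// An empty result means the two environments are isolated at the KV level.
function findSharedNamespaces(staging: KvBinding[], production: KvBinding[]): string[] {
  const productionIds = new Set(production.map((kv) => kv.id));
  return staging.filter((kv) => productionIds.has(kv.id)).map((kv) => kv.binding);
}

// Usage with placeholder IDs: DOMAIN_CACHE still points at the production namespace.
const stagingBindings: KvBinding[] = [
  { binding: 'PATTERN_CACHE', id: 'staging-pattern-id' },
  { binding: 'DOMAIN_CACHE', id: 'prod-domain-id' },
];
const productionBindings: KvBinding[] = [
  { binding: 'PATTERN_CACHE', id: 'prod-pattern-id' },
  { binding: 'DOMAIN_CACHE', id: 'prod-domain-id' },
];
const sharedBindings = findSharedNamespaces(stagingBindings, productionBindings);
console.log(sharedBindings); // [ 'DOMAIN_CACHE' ]
```

The same comparison applies to the Hyperdrive IDs; the input would come from however the team chooses to read wrangler.toml.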

Set Staging Secrets

wrangler secret put API_KEYS --env staging
wrangler secret put AXIOM_API_TOKEN --env staging
wrangler secret put SOCKS5_PROXY_USER --env staging
wrangler secret put SOCKS5_PROXY_PASS --env staging
wrangler secret put OCHECKER_API_KEY --env staging
wrangler secret put NO2BOUNCE_API_KEY --env staging
wrangler secret put MILLIONVERIFIER_API_KEY --env staging
wrangler secret put BYTEMINE_API_TOKEN --env staging

# Note: SENTRY_DSN will be set in Phase 3 after Sentry is configured

Set Production Secrets

# Skip if production secrets already exist from current deployment
# Verify with: wrangler secret list --env production

wrangler secret put API_KEYS --env production
wrangler secret put AXIOM_API_TOKEN --env production
wrangler secret put SOCKS5_PROXY_USER --env production
wrangler secret put SOCKS5_PROXY_PASS --env production
wrangler secret put OCHECKER_API_KEY --env production
wrangler secret put NO2BOUNCE_API_KEY --env production
wrangler secret put MILLIONVERIFIER_API_KEY --env production
wrangler secret put BYTEMINE_API_TOKEN --env production

# Note: SENTRY_DSN will be set in Phase 3 after Sentry is configured

Note: If this is an existing production service, these secrets may already be configured. Run wrangler secret list --env production to verify. Only set secrets that are missing.

Verification Checklist

🛑 STOP: Complete before Phase 2
  • Staging KV namespaces created with unique IDs (4 namespaces)
  • Production KV namespaces created with unique IDs (4 namespaces)
  • Staging Hyperdrive created (separate from production)
  • wrangler.toml updated with staging-specific IDs
  • wrangler.toml updated with production-specific IDs
  • Staging secrets configured (8 secrets, excluding SENTRY_DSN)
  • Production secrets verified or configured (8 secrets, excluding SENTRY_DSN)
  • Test deploy to staging: wrangler deploy --env staging
  • Verify staging /health/ready returns 200
  • Test deploy to production: wrangler deploy --env production
  • Verify production /health/ready returns 200
  • Write test data to staging KV, verify it does NOT appear in production

Phase 2: CI/CD Pipeline with GitHub Actions

Time: Day 2 (~4-5 hours)

GitHub Repository Variables

Go to Repository Settings → Secrets and variables → Actions → Variables:

Variable | Value
STAGING_API_URL | https://email-finder-service-staging.<subdomain>.workers.dev
PRODUCTION_API_URL | https://email-finder-service.<subdomain>.workers.dev
SENTRY_ENABLED | false (change to true after Phase 3)

Create CI Workflow

.github/workflows/ci.yml
name: CI

on:
  pull_request:
    branches: [main, staging]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 9

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'pnpm'

      - run: pnpm install --frozen-lockfile

      - name: Type Check
        run: pnpm typecheck

      - name: Lint
        run: pnpm lint

      - name: Unit Tests
        run: pnpm test

      - name: Dead Code Check
        run: pnpm knip

Create Deploy Staging Workflow

.github/workflows/deploy-staging.yml
name: Deploy Staging

on:
  push:
    branches: [staging]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with: { version: 9 }
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'pnpm' }
      - run: pnpm install --frozen-lockfile
      - run: pnpm typecheck
      - run: pnpm lint
      - run: pnpm test
      - run: pnpm knip

  deploy:
    needs: ci
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with: { version: 9 }
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'pnpm' }
      - run: pnpm install --frozen-lockfile

      - name: Deploy to Staging
        run: pnpm wrangler deploy --env staging
        env:
          CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}

      - name: Post-Deploy Smoke Test
        run: |
          sleep 5
          response=$(curl -s -o /dev/null -w "%{http_code}" \
            "${{ vars.STAGING_API_URL }}/health/ready")
          if [ "$response" != "200" ]; then
            echo "Smoke test failed! Got HTTP $response"
            exit 1
          fi
          echo "Smoke test passed!"

      - name: E2E Tests
        run: pnpm test:e2e
        env:
          API_URL: ${{ vars.STAGING_API_URL }}
          API_KEY: ${{ secrets.STAGING_API_KEY }}

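      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1.25.0
        with:
          # toJSON escapes quotes and newlines in the commit message so the payload stays valid JSON
          payload: |
            {"text": ${{ toJSON(format('Staging deploy {0}: {1}', job.status, github.event.head_commit.message)) }}}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK  # v1 of the action needs this for incoming webhooks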
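
The notification step embeds a commit message in a JSON payload, and commit messages can contain quotes and line breaks. As an illustration of the escaping involved, here is a sketch of building the same payload in TypeScript; buildDeployNotification is a hypothetical helper, not something the workflows call, and keeping only the subject line is an assumption about what is useful in Slack.

```typescript
type DeployStatus = 'success' | 'failure' | 'cancelled';

// Builds the JSON body for a Slack incoming webhook.
// JSON.stringify handles quoting, so the result is valid JSON for any commit message.
function buildDeployNotification(
  environment: 'Staging' | 'Production',
  status: DeployStatus,
  commitMessage: string,
): string {
  // Keep only the first line (the commit subject) to avoid multi-line alerts.
  const subject = commitMessage.split('\n')[0].trim();
  return JSON.stringify({ text: `${environment} deploy ${status}: ${subject}` });
}
```

A message such as fix(smtp): handle "timeout" errors then round-trips through JSON.parse unchanged instead of breaking the payload.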

Create Deploy Production Workflow

.github/workflows/deploy-production.yml
name: Deploy Production

on:
  push:
    branches: [main]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with: { version: 9 }
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'pnpm' }
      - run: pnpm install --frozen-lockfile
      - run: pnpm typecheck
      - run: pnpm lint
      - run: pnpm test
      - run: pnpm knip

  # E2E tests must pass against staging BEFORE deploying to production
  e2e-staging-gate:
    needs: ci
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
        with: { version: 9 }
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'pnpm' }
      - run: pnpm install --frozen-lockfile
      - name: E2E Tests Against Staging
        run: pnpm test:e2e
        env:
          API_URL: ${{ vars.STAGING_API_URL }}
          API_KEY: ${{ secrets.STAGING_API_KEY }}

  deploy:
    needs: e2e-staging-gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for Sentry commits

      - uses: pnpm/action-setup@v4
        with: { version: 9 }
      - uses: actions/setup-node@v4
        with: { node-version: 20, cache: 'pnpm' }
      - run: pnpm install --frozen-lockfile

      - name: Deploy to Production
        run: pnpm wrangler deploy --env production
        env:
          CLOUDFLARE_ACCOUNT_ID: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          CLOUDFLARE_API_TOKEN: ${{ secrets.CLOUDFLARE_API_TOKEN }}

      - name: Post-Deploy Smoke Test
        run: |
          sleep 5
          response=$(curl -s -o /dev/null -w "%{http_code}" \
            "${{ vars.PRODUCTION_API_URL }}/health/ready")
          if [ "$response" != "200" ]; then
            echo "Production smoke test failed! Got HTTP $response"
            exit 1
          fi
          echo "Production smoke test passed!"

      - name: Create Sentry Release
        if: ${{ vars.SENTRY_ENABLED == 'true' }}
        uses: getsentry/action-release@v1
        env:
          SENTRY_AUTH_TOKEN: ${{ secrets.SENTRY_AUTH_TOKEN }}
          SENTRY_ORG: ${{ secrets.SENTRY_ORG }}
          SENTRY_PROJECT: ${{ secrets.SENTRY_PROJECT }}
        with:
          environment: production
          set_commits: auto

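      - name: Notify Slack
        if: always()
        uses: slackapi/slack-github-action@v1.25.0
        with:
          # toJSON escapes quotes and newlines in the commit message so the payload stays valid JSON
          payload: |
            {"text": ${{ toJSON(format('Production deploy {0}: {1}', job.status, github.event.head_commit.message)) }}}
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK  # v1 of the action needs this for incoming webhooks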

Required GitHub Secrets

Secret | Purpose | When to Add
CLOUDFLARE_ACCOUNT_ID | Cloudflare account for deployments | Phase 2
CLOUDFLARE_API_TOKEN | Wrangler authentication | Phase 2
STAGING_API_KEY | API key for staging E2E tests | Phase 2
SLACK_WEBHOOK_URL | Deployment notifications (GitHub Actions only) | Phase 2
SENTRY_AUTH_TOKEN | Sentry release management | Phase 3
SENTRY_ORG | Sentry organization slug | Phase 3
SENTRY_PROJECT | Sentry project slug | Phase 3

Note on Slack Webhooks: This webhook is used only for GitHub Actions deployment notifications. Axiom (Phase 5) and Checkly (Phase 6) require separate Slack integrations configured in their respective dashboards. You can use the same Slack channel but each service needs its own webhook/integration.

Enable Status Checks

📝 After the CI workflow runs once

Update branch protection rules to require the CI / ci status check for both main and staging branches.

Verification Checklist

🛑 STOP: Complete before Phase 3
  • GitHub variables configured: STAGING_API_URL, PRODUCTION_API_URL, SENTRY_ENABLED
  • .github/workflows/ci.yml created and committed
  • .github/workflows/deploy-staging.yml created and committed
  • .github/workflows/deploy-production.yml created and committed
  • GitHub secrets configured: CLOUDFLARE_ACCOUNT_ID, CLOUDFLARE_API_TOKEN
  • GitHub secrets configured: STAGING_API_KEY, SLACK_WEBHOOK_URL
  • Test PR to staging triggers CI workflow
  • Update main branch protection to require CI / ci status check
  • Update staging branch protection to require CI / ci status check
  • Test: PR with failing CI cannot be merged
  • Staging deployment succeeds and smoke test passes
  • Slack notifications received

Phase 3: Sentry Error Tracking

Time: Day 3 (~2-3 hours)

Install Sentry SDK

pnpm add @sentry/cloudflare

Configure wrangler.toml

[version_metadata]
binding = "CF_VERSION_METADATA"

Update Type Definitions

src/config/bindings.ts
export interface Env {
  // ... existing bindings ...

  /** Environment name (staging, production) - set in wrangler.toml */
  ENVIRONMENT?: string;

  /** Sentry DSN for error reporting */
  SENTRY_DSN?: string;

  /** Cloudflare Worker version metadata (auto-populated) */
  CF_VERSION_METADATA?: {
    id: string;
    tag: string;
    timestamp: string;
  };
}

Wrap Worker Export

src/index.ts
import * as Sentry from '@sentry/cloudflare';
import { app } from './app';
import { scheduled } from './scheduled';
import type { Env } from './config/bindings';

export default Sentry.withSentry(
  (env: Env) => ({
    dsn: env.SENTRY_DSN,
    release: env.CF_VERSION_METADATA?.id,
    environment: env.ENVIRONMENT ?? 'development',
    tracesSampleRate: env.ENVIRONMENT === 'production' ? 0.1 : 1.0,
    sendDefaultPii: false,
  }),
  {
    fetch: app.fetch,
    scheduled,
  } as ExportedHandler<Env>
);
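
One possible refinement, shown here only as a sketch, is to move the options callback into a named pure function so the sampling and environment defaults can be unit-tested without loading the Sentry SDK. SentryEnv, SentryOptions, and buildSentryOptions are hypothetical names; SentryEnv is a trimmed stand-in for the Env interface, and the values mirror the configuration above.

```typescript
// Trimmed stand-in for the Env bindings that the Sentry options depend on.
interface SentryEnv {
  SENTRY_DSN?: string;
  ENVIRONMENT?: string;
  CF_VERSION_METADATA?: { id: string };
}

interface SentryOptions {
  dsn: string | undefined;
  release: string | undefined;
  environment: string;
  tracesSampleRate: number;
  sendDefaultPii: boolean;
}

// Same values as the inline callback: sample 10% of traces in production,
// 100% everywhere else, and never send default PII.
function buildSentryOptions(env: SentryEnv): SentryOptions {
  const environment = env.ENVIRONMENT ?? 'development';
  return {
    dsn: env.SENTRY_DSN,
    release: env.CF_VERSION_METADATA?.id,
    environment,
    tracesSampleRate: environment === 'production' ? 0.1 : 1.0,
    sendDefaultPii: false,
  };
}
```

With this in place the first argument to Sentry.withSentry could simply be buildSentryOptions, assuming the project's Env type is compatible with the fields used here.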

Error Handler Integration

Update src/http/middleware/error-handler.ts to capture exceptions:

src/http/middleware/error-handler.ts
import * as Sentry from '@sentry/cloudflare';

// Inside the error handler, before returning the response:
Sentry.captureException(err, {
  extra: {
    requestId,
    path: instance,
    method: c.req.method,
  },
  tags: {
    endpoint: c.req.path,
    environment: c.env.ENVIRONMENT,
  },
});

Set Sentry DSN Secrets

# Get DSN from Sentry project settings
wrangler secret put SENTRY_DSN --env staging
wrangler secret put SENTRY_DSN --env production

Enable Sentry in CI/CD

  1. Add GitHub secrets: SENTRY_AUTH_TOKEN, SENTRY_ORG, SENTRY_PROJECT
  2. Change GitHub variable SENTRY_ENABLED from false to true

Verification Checklist

🛑 STOP: Complete before Phase 4
  • @sentry/cloudflare installed in package.json
  • wrangler.toml has [version_metadata] binding
  • src/config/bindings.ts has SENTRY_DSN and CF_VERSION_METADATA types
  • src/index.ts wrapped with Sentry.withSentry()
  • SENTRY_DSN secret set for staging environment
  • SENTRY_DSN secret set for production environment
  • Deploy to staging and trigger a test error
  • Verify error appears in Sentry dashboard
  • GitHub variable SENTRY_ENABLED changed to true

Phase 4: OpenTelemetry Tracing

Time: Day 4 (~2-3 hours)

Enable Cloudflare Automatic Tracing

[observability]
enabled = true

[observability.logs]
enabled = true
invocation_logs = true
head_sampling_rate = 1

[observability.tracing]
enabled = true
head_sampling_rate = 1

Configure OTEL Export to Axiom

  1. Navigate to Workers & Pages → Your Worker → Settings → Observability
  2. Under "Trace export", click "Add destination"
  3. Select "HTTP" as destination type
  4. Configure:
    • Endpoint: https://api.axiom.co/v1/traces
    • Header: Authorization: Bearer <AXIOM_API_TOKEN>
    • Header: X-Axiom-Dataset: email-finder-traces

Verification Checklist

🛑 STOP: Complete before Phase 5
  • wrangler.toml has [observability] section with tracing enabled
  • Deploy to staging with observability enabled
  • Cloudflare Dashboard shows traces in Workers → Observability
  • Axiom trace export configured in Cloudflare Dashboard
  • Make several API requests to staging
  • Verify traces appear in Axiom dataset

Phase 5: Axiom Dashboards & Monitors

Time: Day 4 (~2-3 hours)

Create Dashboards

Dashboard | Metrics
Operations | Requests/min, success/failure rate, response times (p50/p95/p99), cache hit ratio
Cost Analytics | Cost per request by validator, daily/weekly trends, validator efficiency
Infrastructure Health | Circuit breaker states, rate limit triggers, DB query latency

Create Monitors

Monitor | Condition | Action
High Error Rate | error_rate > 5% for 5m | Slack alert
Slow Response | p95_latency > 10s for 5m | Slack alert
Circuit Open | circuit_state = 'open' | Slack alert
Provider Down | provider_errors > 10 in 1m | Slack alert
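
To show what the first two conditions mean numerically, the sketch below evaluates them over one window of request samples. It is not how Axiom implements monitors: RequestSample, errorRate, percentile, and firingMonitors are hypothetical names, errors are assumed to be responses with status 500 or higher, and p95 uses the nearest-rank method.

```typescript
interface RequestSample {
  status: number;
  durationMs: number;
}

// Fraction of samples treated as errors (assumption: HTTP status >= 500).
function errorRate(samples: RequestSample[]): number {
  if (samples.length === 0) return 0;
  return samples.filter((s) => s.status >= 500).length / samples.length;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Names of the monitors that would fire for one 5-minute window of samples.
function firingMonitors(windowSamples: RequestSample[]): string[] {
  const firing: string[] = [];
  if (errorRate(windowSamples) > 0.05) firing.push('High Error Rate');
  if (percentile(windowSamples.map((s) => s.durationMs), 95) > 10_000) {
    firing.push('Slow Response');
  }
  return firing;
}
```

For example, a window with 90 fast successes and 10 slow 500 responses has an error rate of 10% and a p95 above 10 seconds, so both monitors fire.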

Axiom Slack Integration Setup

  1. In Axiom Dashboard, go to Settings → Integrations
  2. Add Slack integration (OAuth or Webhook)
  3. Select the #email-finder-alerts channel
  4. Test the integration with a sample alert

Note: This is a separate integration from the GitHub Actions Slack webhook. Axiom manages its own Slack connection.

Verification Checklist

🛑 STOP: Complete before Phase 6
  • Operations Dashboard created in Axiom
  • Cost Analytics Dashboard created in Axiom
  • Infrastructure Health Dashboard created in Axiom
  • All 4 monitors configured
  • Slack integration configured for alerts
  • Test alert by temporarily lowering threshold

Phase 6: Checkly Synthetic Monitoring

Time: Day 5 (~2-3 hours)

API Check Strategy

  • Cache-hit checks (every 5 minutes): Use consistent test data that hits cache
  • Cache-bypass checks (every 60 minutes): Use bypass_cache=true for full validation

Check Volume Summary

Check | Frequency | Regions | Daily Volume | External Cost
Health | 1 min | 4 | 5,760/day | None
Find (cached) | 5 min | 1 | 288/day | None
Find (bypass) | 60 min | 1 | 24/day | Yes
Validate (cached) | 5 min | 1 | 288/day | None
Validate (bypass) | 60 min | 1 | 24/day | Yes

Total external API consumption: 48 calls/day
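
The daily volumes follow from the formula runs per day = (1,440 minutes / frequency in minutes) × regions. The sketch below reproduces the arithmetic; the CheckSpec shape and the variable names are illustrative only.

```typescript
interface CheckSpec {
  name: string;
  everyMinutes: number;
  regions: number;
  hitsExternalApis: boolean;
}

const MINUTES_PER_DAY = 24 * 60;

// Runs per day = (minutes per day / frequency) * number of regions.
function runsPerDay(check: CheckSpec): number {
  return (MINUTES_PER_DAY / check.everyMinutes) * check.regions;
}

const plannedChecks: CheckSpec[] = [
  { name: 'Health', everyMinutes: 1, regions: 4, hitsExternalApis: false },
  { name: 'Find (cached)', everyMinutes: 5, regions: 1, hitsExternalApis: false },
  { name: 'Find (bypass)', everyMinutes: 60, regions: 1, hitsExternalApis: true },
  { name: 'Validate (cached)', everyMinutes: 5, regions: 1, hitsExternalApis: false },
  { name: 'Validate (bypass)', everyMinutes: 60, regions: 1, hitsExternalApis: true },
];

// Only the cache-bypass checks reach external validation providers.
const externalCallsPerDay = plannedChecks
  .filter((check) => check.hitsExternalApis)
  .reduce((sum, check) => sum + runsPerDay(check), 0);

console.log(runsPerDay(plannedChecks[0])); // 5760
console.log(externalCallsPerDay);          // 48
```

Changing a frequency or region count in the list shows immediately how the external call budget would move.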

Checkly Slack Integration Setup

  1. In Checkly Dashboard, go to Alerts → Alert Channels
  2. Add Slack channel integration
  3. Authorize Checkly to post to #email-finder-alerts
  4. Configure alert conditions (failure, recovery, degraded)
  5. Test with a manual alert

Note: This is a separate integration from GitHub Actions and Axiom. Each service maintains its own Slack connection for reliability.

Verification Checklist

🛑 STOP: Complete before Phase 7
  • Checkly account created and configured
  • Health check created (1 min frequency, 4 regions)
  • Find cached check created (5 min frequency)
  • Find bypass check created (60 min frequency)
  • Validate cached check created (5 min frequency)
  • Validate bypass check created (60 min frequency)
  • All checks passing for staging and production
  • Slack alert channel configured

Phase 7: K6 Load Testing

Time: Day 5 (~2 hours)

Create Load Test Scripts

Create tests/load/ directory with three scripts:

tests/load/smoke-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

const API_URL = __ENV.API_URL;
const API_KEY = __ENV.API_KEY || '';

export const options = {
  vus: 5,
  duration: '30s',
  thresholds: {
    http_req_duration: ['p(95)<2000'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const healthRes = http.get(`${API_URL}/health/ready`);
  check(healthRes, {
    'health status is 200': (r) => r.status === 200,
  });
  sleep(1);
}

Load Test Script

tests/load/load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

const API_URL = __ENV.API_URL;
const API_KEY = __ENV.API_KEY || '';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up to 50 users
    { duration: '5m', target: 50 },   // Stay at 50 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<3000'],
    http_req_failed: ['rate<0.05'],
  },
};

export default function () {
  const healthRes = http.get(`${API_URL}/health/ready`);
  check(healthRes, {
    'health status is 200': (r) => r.status === 200,
  });

  if (API_KEY) {
    const findRes = http.post(
      `${API_URL}/find`,
      JSON.stringify({
        full_name: `User ${__VU}-${__ITER}`,
        domain: 'k6-load-test.example.com',
      }),
      {
        headers: {
          'Content-Type': 'application/json',
          'X-API-Key': API_KEY,
        },
      }
    );
    check(findRes, {
      'find responds': (r) => r.status === 200 || r.status === 404,
    });
  }

  sleep(1);
}

Stress Test Script

tests/load/stress-test.js
/**
 * WARNING: This test should ONLY run against STAGING with mocked external providers.
 * Running against production will consume significant external API credits.
 */
import http from 'k6/http';
import { check, sleep } from 'k6';

const API_URL = __ENV.API_URL;
const API_KEY = __ENV.API_KEY || '';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Warm up
    { duration: '3m', target: 100 },  // Increase load
    { duration: '3m', target: 150 },  // Push further
    { duration: '3m', target: 200 },  // Near limit
    { duration: '2m', target: 0 },    // Recovery
  ],
  thresholds: {
    http_req_duration: ['p(95)<5000'],  // More lenient
    http_req_failed: ['rate<0.10'],     // Allow up to 10% failures
  },
};

export default function () {
  const healthRes = http.get(`${API_URL}/health/ready`);
  check(healthRes, {
    'health status is 200': (r) => r.status === 200,
  });

  // Only test find endpoint occasionally (1 in 10 iterations)
  if (API_KEY && __ITER % 10 === 0) {
    const findRes = http.post(
      `${API_URL}/find`,
      JSON.stringify({
        full_name: `Stress User ${__VU}`,
        domain: 'k6-stress-test.example.com',
      }),
      {
        headers: {
          'Content-Type': 'application/json',
          'X-API-Key': API_KEY,
        },
      }
    );
    check(findRes, {
      'find responds under stress': (r) => 
        r.status === 200 || r.status === 404 || r.status === 429,
    });
  }

  sleep(0.5);
}

Add K6 to CI/CD Pipeline

Add this step to deploy-staging.yml after E2E tests:

- name: Run K6 Smoke Test
  uses: grafana/k6-action@v0.3.1
  with:
    filename: tests/load/smoke-test.js
  env:
    API_URL: ${{ vars.STAGING_API_URL }}
    API_KEY: ${{ secrets.STAGING_API_KEY }}

Running Load Tests Manually

# Smoke test (CI)
k6 run tests/load/smoke-test.js -e API_URL=https://... -e API_KEY=...

# Load test (performance baseline)
k6 run tests/load/load-test.js -e API_URL=https://... -e API_KEY=...

# Stress test (staging only!)
k6 run tests/load/stress-test.js -e API_URL=https://staging... -e API_KEY=...

⚠️ Warning: Stress Tests

Stress tests should ONLY run against STAGING with mocked external providers. Running against production will consume significant external API credits.

Verification Checklist

🛑 STOP: Complete before Phase 8
  • tests/load/ directory created
  • smoke-test.js created
  • load-test.js created
  • stress-test.js created with warning comment
  • Run smoke test locally: all checks pass
  • K6 step added to deploy-staging.yml
  • K6 smoke test passes in CI pipeline

Phase 8: CodeRabbit AI Code Review

Time: Day 6 (~1-2 hours)

Install CodeRabbit GitHub App

  1. Go to CodeRabbit GitHub App
  2. Install on LeadMagic/email-finder-service repository
  3. Grant permissions: read code, write PR comments

Create Configuration File

.coderabbit.yaml
language: en
early_access: false
reviews:
  auto_review:
    enabled: true
    drafts: false
    base_branches:
      - main
      - staging
  request_changes_workflow: false
  high_level_summary: true
  poem: false
  review_status: true
  path_filters:
    - '!**/pnpm-lock.yaml'
    - '!**/package-lock.json'
    - '!**/*.md'
  path_instructions:
    - path: 'src/**/*.ts'
      instructions: |
        Focus on:
        - Type safety and proper error handling
        - Performance implications for Cloudflare Workers
        - Security vulnerabilities
        - Proper async/await handling
    - path: 'tests/**/*.ts'
      instructions: |
        Focus on:
        - Test coverage completeness
        - Edge case handling
        - Mock correctness
chat:
  auto_reply: true

Verification Checklist

🛑 STOP: Complete before Phase 9
  • CodeRabbit GitHub App installed on repository
  • .coderabbit.yaml created and committed
  • Create test PR with a code change
  • CodeRabbit posts automatic review comment
  • Review contains high-level summary

Phase 9: Enhanced Test Suite

Time: Day 6 (~2 hours)

Coverage Requirements

vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    globals: true,
    environment: 'node',
    include: ['tests/**/*.test.ts'],
    coverage: {
      provider: 'v8',
      reporter: ['text', 'json', 'html'],
      exclude: ['node_modules/', 'tests/', '**/*.d.ts', '**/*.config.*'],
      thresholds: {
        statements: 70,
        branches: 60,
        functions: 70,
        lines: 70,
      },
    },
  },
});

Test Data Factories

tests/factories/email.factory.ts
export function createFindRequest(overrides = {}) {
  return {
    full_name: 'Test User',
    domain: 'example.com',
    ...overrides,
  };
}

export function createValidateRequest(overrides = {}) {
  return {
    email: 'test@example.com',
    ...overrides,
  };
}
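
The factories above accept untyped overrides. A typed variant, sketched here under the assumption that the request bodies have the shapes shown, lets the compiler catch misspelled fields. The FindRequest and ValidateRequest interfaces and the makeFactory helper are hypothetical; the optional bypass_cache field is inferred from the parameter mentioned earlier in this guide.

```typescript
// Assumed request shapes; adjust to the service's real schemas.
interface FindRequest {
  full_name: string;
  domain: string;
  bypass_cache?: boolean;
}

interface ValidateRequest {
  email: string;
  bypass_cache?: boolean;
}

// Builds a factory that merges per-test overrides onto fixed defaults.
// Each call returns a fresh object, so tests cannot leak state into each other.
function makeFactory<T extends object>(defaults: T) {
  return (overrides: Partial<T> = {}): T => ({ ...defaults, ...overrides });
}

const createFindRequest = makeFactory<FindRequest>({
  full_name: 'Test User',
  domain: 'example.com',
});

const createValidateRequest = makeFactory<ValidateRequest>({
  email: 'test@example.com',
});
```

A test can then write createFindRequest({ domain: 'acme.test', bypass_cache: true }) and rely on the remaining fields keeping their defaults.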

Verification Checklist

✅ All Phases Complete!
  • vitest.config.ts updated with coverage thresholds
  • tests/factories/ directory created
  • Factory functions created for common test data
  • Run pnpm test - all tests pass
  • Run pnpm test -- --coverage - coverage meets thresholds
  • Coverage report generated in coverage/ directory

📅 Implementation Order

Day | Phase | Duration | Key Deliverables
Day 1 AM | Phase 0: Git Flow | ~2-3h | CONTRIBUTING.md, branch protection, staging branch
Day 1 PM | Phase 1: Infrastructure (CRITICAL) | ~3-4h | 8 KV namespaces, Hyperdrive, wrangler.toml updates
Day 2 | Phase 2: CI/CD Pipeline | ~4-5h | 3 GitHub workflows, secrets, status checks enabled
Day 3 | Phase 3: Sentry Integration | ~2-3h | SDK installed, error handler integrated, DSN secrets set
Day 4 | Phase 4 & 5: OTEL + Axiom | ~4-6h | Tracing enabled, 3 dashboards, 4 monitors
Day 5 | Phase 6 & 7: Checkly + K6 | ~4-5h | 5 Checkly checks, 3 K6 scripts, K6 in CI
Day 6 | Phase 8 & 9: CodeRabbit + Tests | ~3-4h | CodeRabbit app, coverage thresholds, test factories

📊 Timeline Comparison

| Approach | Estimated Time | Notes |
| --- | --- | --- |
| Manual coding | ~5 weeks | Traditional development without AI assistance |
| AI-assisted (Claude + Cursor) | ~1-2 weeks | 3-4x faster; recommended approach |

Phase Dependencies

Sequential (must be in order): Phase 0 → Phase 1 → Phase 2
Parallelizable (after Phase 2): Phases 3-9 can be done in any order or simultaneously
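
The dependency rule can be expressed as a small check, for example in a planning script. This is only a sketch: `canStart` is a hypothetical helper, and it assumes phases are numbered 0 through 9 as in this guide.

```typescript
// Phases that must be completed in order before anything else can begin.
const SEQUENTIAL_PHASES = [0, 1, 2];
const LAST_PHASE = 9;

// Returns true when every prerequisite of `phase` is in `completed`.
// Phases 0-2 require all earlier phases; phases 3-9 require 0, 1, and 2.
function canStart(phase: number, completed: ReadonlySet<number>): boolean {
  if (!Number.isInteger(phase) || phase < 0 || phase > LAST_PHASE) {
    throw new RangeError(`Unknown phase: ${phase}`);
  }
  return SEQUENTIAL_PHASES.filter((p) => p < phase).every((p) => completed.has(p));
}
```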

🔄 Rollback Procedures

Deployment Rollback

# List recent deployments (the output includes the IDs)
wrangler deployments list --env production

# Roll back to a specific deployment (the ID is a positional argument)
wrangler rollback <ID> --env production

# Or roll back to the previous deployment
wrangler rollback --env production

Rollback Decision Matrix

| Symptom | Severity | Action |
| --- | --- | --- |
| Error rate > 10% | P1 | Immediate rollback |
| Latency > 10s (p95) | P1 | Immediate rollback |
| Error rate 5-10% | P2 | Investigate, rollback if not resolved in 15 min |
| Feature not working | P3 | Investigate, hotfix if possible |
| Minor issue | P4 | Fix forward in next deployment |
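
The metric-based rows of the matrix can be encoded as a pure function, which is handy when wiring the thresholds into a monitor or a post-deploy check. The sketch below assumes the error rate is given in percent and p95 latency in milliseconds. `decideRollback` is a hypothetical name. The P3 and P4 rows are left out because they depend on human judgment rather than a metric.

```typescript
type RollbackDecision = { severity: 'P1' | 'P2'; action: string } | null;

// Maps the two measurable symptoms from the matrix to a severity and action.
// Returns null when neither threshold is crossed.
function decideRollback(errorRatePct: number, p95LatencyMs: number): RollbackDecision {
  if (errorRatePct > 10 || p95LatencyMs > 10000) {
    return { severity: 'P1', action: 'Immediate rollback' };
  }
  if (errorRatePct >= 5) {
    return { severity: 'P2', action: 'Investigate, rollback if not resolved in 15 min' };
  }
  return null;
}
```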

🚨 Incident Response

Severity Levels

| Level | Description | Response Time | Examples |
| --- | --- | --- | --- |
| P1 | Service down, all customers affected | 15 minutes | 500 errors, deployment failure |
| P2 | Degraded service, partial impact | 1 hour | Slow responses, one provider down |
| P3 | Minor issue, workaround available | 4 hours | Non-critical feature broken |
| P4 | Cosmetic, no functional impact | Next sprint | Typo in response |

Incident Response Process

1. DETECT → Alert triggered or customer report
2. ASSESS → Check Sentry, Axiom, Cloudflare; determine severity
3. COMMUNICATE → Notify team in Slack #email-finder-alerts
4. MITIGATE → Rollback if deployment caused; enable circuit breaker if provider issue
5. RESOLVE → Identify root cause, implement fix, deploy via normal process
6. POST-MORTEM → Document timeline, define action items (P1/P2 only)
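
The COMMUNICATE step can be sketched as a small message builder. Slack incoming webhooks accept a JSON body with a `text` field. The `Incident` shape and the `buildSlackPayload` name are hypothetical, and the response times are copied from the Severity Levels table.

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

// Response times from the Severity Levels table.
const RESPONSE_TIME: Record<Severity, string> = {
  P1: '15 minutes',
  P2: '1 hour',
  P3: '4 hours',
  P4: 'Next sprint',
};

// Hypothetical incident record; adapt the fields to the team's needs.
interface Incident {
  severity: Severity;
  summary: string;
  detectedAt: Date;
}

// Builds the JSON body for a Slack incoming webhook.
function buildSlackPayload(incident: Incident): { text: string } {
  // Post-mortems are required for P1 and P2 only (step 6).
  const postMortem = incident.severity === 'P1' || incident.severity === 'P2';
  return {
    text: [
      `[${incident.severity}] ${incident.summary}`,
      `Detected: ${incident.detectedAt.toISOString()}`,
      `Target response time: ${RESPONSE_TIME[incident.severity]}`,
      `Post-mortem required: ${postMortem ? 'yes' : 'no'}`,
    ].join('\n'),
  };
}
```

The returned object can be serialized with `JSON.stringify` and sent as the body of a POST request to the webhook URL.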

💻 Local Development Guide

Initial Setup

# Clone repository
git clone git@github.com:LeadMagic/email-finder-service.git
cd email-finder-service

# Install dependencies
pnpm install

# Verify setup
pnpm typecheck && pnpm lint && pnpm test

Environment Configuration

Create a .dev.vars file in the repository root:

API_KEYS=["dev-key-12345"]
AXIOM_API_TOKEN=xaat-xxx
SOCKS5_PROXY_USER=xxx
SOCKS5_PROXY_PASS=xxx
OCHECKER_API_KEY=xxx
NO2BOUNCE_API_KEY=xxx
MILLIONVERIFIER_API_KEY=xxx
BYTEMINE_API_TOKEN=xxx
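
Note that `API_KEYS` is a JSON array stored as a string, so a malformed value is an easy mistake to make in `.dev.vars`. The sketch below shows one way such a value could be validated at startup. How the service actually parses `API_KEYS` is an assumption, and `parseApiKeys` is a hypothetical helper.

```typescript
// Parses a JSON-array string such as '["dev-key-12345"]' into a set of keys.
// Throws a descriptive error instead of failing later on the first request.
function parseApiKeys(raw: string | undefined): Set<string> {
  if (!raw) {
    throw new Error('API_KEYS is not set');
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error('API_KEYS must be valid JSON');
  }
  if (
    !Array.isArray(parsed) ||
    !parsed.every((k): k is string => typeof k === 'string' && k.length > 0)
  ) {
    throw new Error('API_KEYS must be a JSON array of non-empty strings');
  }
  return new Set<string>(parsed);
}
```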

Running Locally

# Start development server
pnpm dev
# → Server running at http://localhost:8787

# Test health endpoint
curl http://localhost:8787/health/ready

Common Development Tasks

| Task | Command |
| --- | --- |
| Format code | pnpm format |
| Lint code | pnpm lint |
| Fix lint issues | pnpm lint:fix |
| Type check | pnpm typecheck |
| Run tests | pnpm test |
| Run E2E tests | pnpm test:e2e |
| Check dead code | pnpm knip |

🔧 Troubleshooting Guide

Common Issues

1. "Wrangler not authenticated"

# Solution:
wrangler login
# Or set environment variable:
export CLOUDFLARE_API_TOKEN="your-token"

2. "KV namespace not found"

# Verify namespace exists:
wrangler kv namespace list
# Compare IDs with wrangler.toml
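
The ID comparison can be automated once both inputs are available as text. The sketch below makes two assumptions: the namespace list has been parsed from JSON into objects with `id` and `title` fields, and wrangler.toml declares namespaces as `[[kv_namespaces]]` or `[[env.<name>.kv_namespaces]]` tables rather than inline arrays. Both function names are hypothetical.

```typescript
interface KvNamespace {
  id: string;
  title: string;
}

// Collects "table-header binding" -> id for every kv_namespaces table in the TOML text.
// Other tables that also carry an `id` key (for example hyperdrive) are skipped.
function configuredKvIds(toml: string): Map<string, string> {
  const ids = new Map<string, string>();
  // Split the file in front of every line that starts a [table] or [[array-of-tables]].
  for (const section of toml.split(/^(?=[ \t]*\[)/m)) {
    const header = (section.split('\n', 1)[0] ?? '').trim();
    if (!/^\[\[(.+\.)?kv_namespaces\]\]$/.test(header)) continue;
    const binding = /^[ \t]*binding\s*=\s*"([^"]+)"/m.exec(section)?.[1];
    const id = /^[ \t]*id\s*=\s*"([^"]+)"/m.exec(section)?.[1];
    if (binding && id) ids.set(`${header} ${binding}`, id);
  }
  return ids;
}

// Returns one line per configured namespace whose ID is absent from the account listing.
function missingNamespaces(toml: string, listed: KvNamespace[]): string[] {
  const known = new Set(listed.map((ns) => ns.id));
  return Array.from(configuredKvIds(toml))
    .filter(([, id]) => !known.has(id))
    .map(([key, id]) => `${key}: ${id}`);
}
```

An empty result means every configured ID exists in the account. Any returned line names the table and binding to correct in wrangler.toml.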

3. "Hyperdrive connection failed"

# 1. Verify Hyperdrive exists
wrangler hyperdrive list

# 2. Test database connection directly
psql "$DATABASE_URL"

# 3. Recreate if needed
wrangler hyperdrive create email-finder-staging --connection-string="..."

4. "CI workflow failing"

| Error | Cause | Solution |
| --- | --- | --- |
| Type error | TypeScript issue | Run pnpm typecheck locally |
| Lint error | Code style issue | Run pnpm lint:fix |
| Test failed | Broken test | Run pnpm test locally |
| Deploy failed | Missing secret | Add secret to GitHub repository |

๐Ÿ”Environment Variables Summary

GitHub Repository Variables

VariableInitial ValuePurpose
STAGING_API_URLhttps://email-finder-service-staging.*.workers.devStaging endpoint
PRODUCTION_API_URLhttps://email-finder-service.*.workers.devProduction endpoint
SENTRY_ENABLEDfalseControls Sentry releases

GitHub Actions Secrets

| Secret | Purpose | When to Add |
| --- | --- | --- |
| CLOUDFLARE_ACCOUNT_ID | Cloudflare account | Phase 2 |
| CLOUDFLARE_API_TOKEN | Wrangler auth | Phase 2 |
| STAGING_API_KEY | E2E tests | Phase 2 |
| SLACK_WEBHOOK_URL | Notifications | Phase 2 |
| SENTRY_AUTH_TOKEN | Sentry releases | Phase 3 |
| SENTRY_ORG | Sentry org slug | Phase 3 |
| SENTRY_PROJECT | Sentry project slug | Phase 3 |

Cloudflare Worker Secrets

Set via wrangler secret put <NAME> --env <ENV>

| Secret | Purpose | When to Add |
| --- | --- | --- |
| API_KEYS | Application API keys | Phase 1 |
| AXIOM_API_TOKEN | Axiom logging | Phase 1 |
| SOCKS5_PROXY_USER | SMTP proxy credentials | Phase 1 |
| SOCKS5_PROXY_PASS | SMTP proxy credentials | Phase 1 |
| OCHECKER_API_KEY | OChecker validation | Phase 1 |
| NO2BOUNCE_API_KEY | No2Bounce validation | Phase 1 |
| MILLIONVERIFIER_API_KEY | MillionVerifier validation | Phase 1 |
| BYTEMINE_API_TOKEN | ByteMine enrichment | Phase 1 |
| SENTRY_DSN | Sentry project DSN | Phase 3 |

📖 Glossary

| Term | Definition |
| --- | --- |
| Circuit Breaker | Pattern that prevents cascading failures by stopping requests to failing services |
| Cold Start | Initial latency when a Worker instance is first created |
| DSN | Data Source Name - connection string for Sentry |
| E2E Test | End-to-end test that validates entire user flows |
| Hono | Lightweight web framework for Cloudflare Workers |
| Hyperdrive | Cloudflare's database connection pooling service |
| KV | Cloudflare's key-value storage service |
| MTTR | Mean Time To Recovery - average time to recover from failure |
| OTEL | OpenTelemetry - standard for distributed tracing |
| SLI/SLO | Service Level Indicator/Objective - metrics and targets for service health |
| Smoke Test | Quick test to verify basic functionality after deployment |
| Synthetic Monitoring | Automated tests that simulate user behavior |
| TTL | Time To Live - expiration time for cached data |
| Wrangler | Cloudflare's CLI tool for Workers development |

โš ๏ธRisk Assessment

Implementation Risks

RiskProbabilityImpactMitigation
Shared KV data corruption during migrationMediumHighPhase 1 creates new namespaces (no migration needed)
CI/CD breaks existing workflowLowMediumGradual rollout, status checks added after CI works
Sentry adds latencyLowLowAsync error reporting, minimal overhead
Checkly costs exceed budgetLowLowOptimized check frequency, 48 external calls/day
Team resistance to new workflowMediumMediumClear documentation, training session

Rollback Difficulty

| Component | Difficulty | Time | Notes |
| --- | --- | --- | --- |
| Worker code | Easy | < 1 min | Cloudflare instant rollback |
| KV namespaces | Medium | 5-10 min | Update wrangler.toml + deploy |
| Secrets | Easy | < 2 min | Re-set via wrangler CLI |
| GitHub workflows | Easy | < 2 min | Git revert + push |
| Branch protection | Easy | < 2 min | GitHub UI changes |

🎯 Success Metrics

| Metric | Target |
| --- | --- |
| CI/CD | All deployments automated, < 10 min pipeline |
| Error Tracking | < 5 min MTTR for P1 issues |
| Monitoring | 99.9% uptime visibility |
| Testing | > 70% code coverage |
| Code Review | 100% PR coverage with AI review |
| Git Flow | 100% compliance with branch protection rules |

๐Ÿ“Files to Create/Modify

Files to Create

CONTRIBUTING.md
.github/
  workflows/
    ci.yml
    deploy-staging.yml
    deploy-production.yml
.coderabbit.yaml
tests/
  load/
    smoke-test.js
    load-test.js
    stress-test.js
  factories/
    email.factory.ts
  integration/
    (new integration tests)

Files to Modify

| File | Changes | Phase |
| --- | --- | --- |
| wrangler.toml | Add staging/production KV IDs, ENVIRONMENT vars, version_metadata, observability | 1, 3, 4 |
| src/index.ts | Wrap export with Sentry.withSentry() | 3 |
| src/config/bindings.ts | Add ENVIRONMENT, SENTRY_DSN, CF_VERSION_METADATA types | 3 |
| src/http/middleware/error-handler.ts | Add Sentry.captureException() | 3 |
| vitest.config.ts | Add coverage thresholds | 9 |
| package.json | Add @sentry/cloudflare dependency | 3 |
| .github/workflows/deploy-staging.yml | Add K6 smoke test step | 7 |

🔮 Future Improvements

Planned improvements after all phases are implemented and stabilized:

Secrets Management

๐Ÿ” Doppler Integration

Centralize all secrets in Doppler for unified management with version history, audit trail, and automated rotation.

When to adopt: When team grows beyond 5 developers or compliance requires audit trails.

Infrastructure Improvements

| Improvement | Benefit | Trigger |
| --- | --- | --- |
| Separate Staging Database | Full data isolation, safe schema testing | High Priority |
| Preview Environments | Ephemeral environments per PR | PR queue > 3 |
| Canary Deployments | Gradual rollout with auto-rollback | Deployment incidents > 2/quarter |
| Feature Flags | Runtime toggles without deployments | Need A/B testing |

Enhanced Observability

| Improvement | Description | Priority |
| --- | --- | --- |
| Public Status Page | Real-time incident communication | P2 |
| SLO/SLI Dashboards | Formal service level objectives with error budgets | P2 |
| OpenAPI Documentation | Auto-generated API docs with interactive explorer | P2 |

Testing Enhancements

| Tool | Purpose | Priority |
| --- | --- | --- |
| Pact | Consumer-driven contract testing | P4 |
| Stryker | Mutation testing for test quality | P4 |
| Chaos Engineering | Deliberately inject failures to test resilience | P4 |

Security Enhancements

| Tool | Purpose | Priority |
| --- | --- | --- |
| Dependabot | Automated dependency updates | P1 |
| Snyk | Deep vulnerability analysis | P2 |
| Gitleaks | Secret detection in code | P2 |

📄 Document Information

Version: 1.0.0  |  Last Updated: January 2026

For the complete markdown source, download detailed_plan.md
