Deploy.
Survive.
Scale.
Battle-tested answers for production deployment, JVM crises, and system stability questions. Tailored for a decade of production scars.
The Four Golden Signals — Always Start Here
Every production incident investigation starts with these four: Latency, Traffic, Errors, and Saturation. Know them cold — interviewers frame war stories around them.
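To make the four signals concrete, here is a minimal, illustrative sketch of how each one falls out of raw request data — the `Request` record and all numbers are made up for the example, not a real metrics API:

```java
import java.util.List;

public class GoldenSignals {

    public record Request(long latencyMs, boolean error) {}

    /** Signal 1 — Latency: p99 over a non-empty window of requests. */
    public static long p99(List<Request> window) {
        long[] sorted = window.stream().mapToLong(Request::latencyMs).sorted().toArray();
        return sorted[(int) Math.ceil(sorted.length * 0.99) - 1];
    }

    /** Signal 2 — Traffic: requests per second over the window. */
    public static double rps(List<Request> window, double windowSeconds) {
        return window.size() / windowSeconds;
    }

    /** Signal 3 — Errors: fraction of failed requests. */
    public static double errorRate(List<Request> window) {
        return (double) window.stream().filter(Request::error).count() / window.size();
    }

    /** Signal 4 — Saturation: utilization of a bounded resource (e.g. a thread pool). */
    public static double saturation(int inUse, int capacity) {
        return (double) inUse / capacity;
    }
}
```

In production these come from Micrometer/Prometheus rather than hand-rolled code, but the definitions are exactly these four ratios.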
Zero-Downtime Deployment Strategies
Walk me through Blue-Green, Canary, and Rolling deployments. When would you pick each in a real Java microservices system?
**Mid**

| Strategy | Mechanism | Rollback Speed | Infra Cost | Best For |
|---|---|---|---|---|
| Blue-Green | Two full envs; switch LB/DNS | Instant (flip back) | 2x | Major releases, DB schema changes |
| Canary | Route X% traffic to new version | Minutes (reroute) | ~1.1x | High-risk feature flags, A/B testing |
| Rolling | Replace pods one by one | Slow (re-deploy old) | 1x | Low-risk patches, stateless services |
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod during update
      maxUnavailable: 0    # Zero pods down at any time
  template:
    spec:
      containers:
        - name: payment-service
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 45   # Give the JVM time to warm up
```
`maxUnavailable: 0` is critical — without it, the default (25% of pods may be unavailable) lets the rolling update take pods out of service before their replacements are ready, cutting capacity mid-rollout. Also, if no `readinessProbe` is configured, Kubernetes routes traffic to a new pod before Spring Boot has finished loading its `ApplicationContext` — those requests fail silently.
Blue-Green with Spring Boot — the DB migration problem: The hardest part of Blue-Green is backward-compatible DB schema changes. Use Expand-Contract (Parallel Change) pattern: in v1 add the new column as nullable → deploy v2 that writes both old and new columns → drop the old column only in v3.
Canary deployment went bad — 5% of users hitting errors but dashboards look green. How do you catch this?
**Hard**

Add a `version` label to all Prometheus metrics, then filter Grafana dashboards by `version="v2-canary"` separately from `version="v1-stable"`:

```properties
# application.properties
management.metrics.tags.version=${APP_VERSION:unknown}
management.metrics.tags.pod=${POD_NAME:unknown}
```

MDC logging — add the version to every log line:

```java
@Component
public class VersionMdcFilter implements Filter {

    @Value("${app.version:unknown}")
    private String version;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        MDC.put("appVersion", version);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("appVersion");
        }
    }
}
```
CI/CD Pipeline Design
Design a production-grade CI/CD pipeline for a Java microservice. What gates do you add before prod?
**Hard**

Pipeline stages for a 10-year engineer — give the full gated pipeline, not the junior build-test-deploy answer.
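A tool-agnostic sketch of the stages and prod gates such a pipeline typically carries — stage names and tool choices here are illustrative, not prescriptive:

```yaml
# Illustrative stage list — tools named are examples only
stages:
  - name: build-and-unit-test      # compile + unit tests, fail fast (<5 min)
  - name: static-analysis          # SonarQube quality gate, SpotBugs
  - name: security-scan            # SAST + dependency CVE scan (OWASP, Trivy)
  - name: package                  # immutable image, tagged with the git SHA
  - name: integration-test         # Testcontainers against real DB/Kafka
  - name: deploy-staging
  - name: e2e-and-load-test        # compare latency against a stored baseline
  - name: deploy-canary            # small traffic slice + automated metric analysis
  - name: promote-to-prod          # automatic, or manual approval in regulated envs
```

The senior signal is the gates after packaging: load-test baseline comparison and automated canary analysis, not just "tests pass."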
Deployment succeeded but 30% of requests are 500s. Rollback or investigate? What's your decision tree?
**Hard**

At a 30% error rate, roll back first and investigate afterwards — restore service, then dig through logs, metrics, and the bad version's artifacts. Investigate-in-place only when the blast radius is small and a fix is minutes away.

```bash
# Rollback to previous revision
argocd app rollback payment-service --revision 42

# Or kubectl — roll back the Deployment
kubectl rollout undo deployment/payment-service -n production
kubectl rollout status deployment/payment-service -n production
```
Kubernetes Operations
Java pod keeps restarting with "OOMKilled" in Kubernetes. Debug and fix it.
**Hard**

`kubectl describe pod` shows:

```
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
```
Exit code 137 = 128 + 9 (SIGKILL) — here, the kernel OOM killer. Kubernetes enforces `resources.limits.memory` at the cgroup level. Java was historically bad at this: before container support (JDK 10, backported to 8u191), the JVM read total host memory and sized its heap from that, ignoring the container limit.
`kubectl top pod <name>` shows actual memory use; `kubectl describe pod` shows the OOMKilled events. Check whether it's heap or off-heap (Metaspace, native threads, DirectByteBuffer) that's growing. On modern JVMs container awareness is on by default (`-XX:+UseContainerSupport`); set `MaxRAMPercentage` instead of a fixed `-Xmx`:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"   # limit ≈ 1.5x the heap that MaxRAMPercentage allocates
    cpu: "1000m"
env:
  - name: JAVA_TOOL_OPTIONS
    # 60% of 768Mi ≈ 460Mi heap; -Xss256k shrinks the per-thread stack
    # (default 512k-1MB)
    value: >-
      -XX:MaxRAMPercentage=60.0 -XX:InitialRAMPercentage=50.0
      -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError
      -XX:HeapDumpPath=/tmp/heapdump.hprof -Xss256k
```
Off-heap is the usual surprise: Metaspace (cap with `-XX:MaxMetaspaceSize=128m`), DirectByteBuffer (Netty/NIO — limit with `-XX:MaxDirectMemorySize=64m`), and thread count × stack size. Each thread uses ~256KB–1MB of native memory.
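A quick sanity-check of the memory budget arithmetic above — all numbers are illustrative, the point is that heap + off-heap must fit under the container limit:

```java
public class MemoryBudget {

    /** Heap the JVM sizes from -XX:MaxRAMPercentage against the cgroup limit. */
    public static long heapBytes(long containerLimitBytes, double maxRamPercentage) {
        return (long) (containerLimitBytes * maxRamPercentage / 100.0);
    }

    /** Rough native (off-heap) budget: thread stacks + Metaspace + direct buffers. */
    public static long offHeapBytes(int threads, long stackBytes,
                                    long metaspaceBytes, long directBytes) {
        return threads * stackBytes + metaspaceBytes + directBytes;
    }

    public static void main(String[] args) {
        long mi = 1024 * 1024;
        long limit = 768 * mi;                                            // container limit
        long heap = heapBytes(limit, 60.0);                               // ≈ 460Mi
        long offHeap = offHeapBytes(200, 256 * 1024, 128 * mi, 64 * mi);  // ≈ 242Mi
        System.out.println("fits=" + (heap + offHeap < limit));           // prints fits=true
    }
}
```

If the sum exceeds the limit, the OOM killer fires even though the heap itself never filled — which is exactly why "OOMKilled with a healthy-looking heap" happens.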
CrashLoopBackOff on a Spring Boot service in Kubernetes. Your debug sequence?
**Mid**

1. `kubectl logs <pod> --previous` — read the logs from the crashed container. Look for the stack trace, not just "Terminated".
2. `kubectl describe pod <pod>` — check the exit code (1 = app crash, 137 = OOMKilled, 143 = SIGTERM timeout), the liveness probe failure count, and the Events section.
3. Common causes: bean creation failure (`@PostConstruct`), mis-sized heap → OOMKilled, liveness probe firing before the JVM warms up, failed DB connection at startup.
4. `kubectl exec -it <pod> -- /bin/sh` (if the container is still alive) — run `jcmd 1 VM.flags` or `jmap -heap 1` to inspect the live JVM.

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60   # Spring Boot + Hibernate init can take 30-45s
  periodSeconds: 10
  failureThreshold: 3       # 3 failures × 10s = 30s grace before kill
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
```
JVM Tuning & GC Under Production Load
Application slows every 2 hours in production — high latency spikes. What's your investigation process?
**Hard**

Periodic latency spikes at regular intervals = GC pause is the first suspect. The pattern (every 2 hours) suggests a Full GC triggered by the old generation filling up.
Enable GC logging with `-Xlog:gc*:file=/tmp/gc.log:time,uptime:filecount=5,filesize=20m` and look for "Pause Full" entries — these stop the world. Inspect the heap with `jcmd <pid> GC.heap_info` and `jmap -histo:live <pid>`, and watch which object types grow unchecked.

```
# G1GC — recommended for API services (low pause)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200                # Target max pause (G1 will try to meet it)
-XX:G1HeapRegionSize=16m                # Tune for large-object allocation
-XX:InitiatingHeapOccupancyPercent=45   # Trigger concurrent GC earlier
-XX:+G1UseAdaptiveIHOP                  # Let the JVM adapt the threshold
-Xlog:gc*:file=/tmp/gc.log:time,uptime:filecount=5,filesize=20m

# ZGC — for Java 15+ with <10ms pause requirements (fintech/HFT)
-XX:+UseZGC
-XX:SoftMaxHeapSize=4g                  # Collect aggressively before hitting -Xmx
```
If old-gen reclamation still lags, let mixed collections take more old regions per cycle via `-XX:G1OldCSetRegionThresholdPercent`, or reduce the allocation rate itself by pooling frequently recreated objects.
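Pooling can be as simple as a bounded free-list. A minimal, illustrative sketch — not a production-grade pool (no object reset, no leak tracking):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

public class ReusePool<T> {

    private final ArrayBlockingQueue<T> pool;
    private final Supplier<T> factory;

    public ReusePool(int capacity, Supplier<T> factory) {
        this.pool = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
    }

    /** Reuse a pooled instance if one is available, otherwise create a new one. */
    public T acquire() {
        T obj = pool.poll();
        return obj != null ? obj : factory.get();
    }

    /** Return an instance for reuse; silently dropped if the pool is full. */
    public void release(T obj) {
        pool.offer(obj);
    }
}
```

Wrapping `acquire`/`release` around a hot loop of large buffers (e.g. 64KB `byte[]`s) can cut the allocation pressure that feeds GC churn — but measure first; modern GCs handle short-lived objects cheaply.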
What is False Sharing in Java and when have you seen it cause production issues?
**Hard**

False Sharing occurs when two threads on different CPU cores update different variables that happen to share the same 64-byte CPU cache line. When core A writes its variable, the entire cache line is invalidated on core B — forcing B to reload it from L2/L3 cache. At high frequency, this causes significant performance degradation despite no actual data sharing.
```java
public class MetricsHolder {
    public volatile long requestCount = 0; // Thread A updates this
    public volatile long errorCount = 0;   // Thread B updates this
    // Adjacent in memory → same cache line → false sharing!
}
```
```java
public class MetricsHolder {
    @jdk.internal.vm.annotation.Contended
    public volatile long requestCount = 0;

    @jdk.internal.vm.annotation.Contended
    public volatile long errorCount = 0;
    // JVM pads each field (128 bytes by default) → different cache lines
}
// Requires the -XX:-RestrictContended JVM flag for non-JDK classes
```

Or manual padding (works on any JDK, no flags, predates `@Contended`):

```java
public class PaddedLong {
    public volatile long value = 0;
    public long p1, p2, p3, p4, p5, p6, p7; // 7×8 = 56 bytes of padding
}
```
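For plain hot counters, though, the JDK already solves this: `LongAdder` stripes updates across internal cells that are themselves contention-padded, avoiding both lock contention and false sharing — usually the simpler production fix. A sketch of the counter class rebuilt on top of it:

```java
import java.util.concurrent.atomic.LongAdder;

public class Metrics {

    private final LongAdder requestCount = new LongAdder();
    private final LongAdder errorCount = new LongAdder();

    public void onRequest() { requestCount.increment(); }  // contention-free fast path
    public void onError()   { errorCount.increment(); }

    // sum() folds the striped cells — call it from the (rare) read path
    public long requests()  { return requestCount.sum(); }
    public long errors()    { return errorCount.sum(); }
}
```

The tradeoff: `sum()` is not a snapshot under concurrent writes, which is fine for metrics but not for invariants.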
OutOfMemoryError — Detection & Recovery
Production: java.lang.OutOfMemoryError: Java heap space — live system. Your exact response.
**Hard**

First, capture evidence: `jcmd <pid> GC.heap_dump /tmp/heapdump.hprof`, or pre-configure `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/` to auto-capture on crash. Then look for the two patterns behind most heap OOMs:

```java
// PATTERN 1: Unbounded static cache
public class UserCache {
    private static final Map<String, User> cache = new HashMap<>();
    // Grows forever — use Caffeine/Guava with a size limit and TTL
}

// FIX:
private final Cache<String, User> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(30))
        .build();

// PATTERN 2: ThreadLocal leak in a thread pool
private static ThreadLocal<UserContext> ctx = new ThreadLocal<>();
// Never cleaned — pool threads are reused, so the value lives forever

// FIX: always clean up in a finally block
try {
    ctx.set(new UserContext(userId));
    processRequest();
} finally {
    ctx.remove(); // CRITICAL in thread pools
}
```
OutOfMemoryError: Metaspace in production. Different from heap OOM — how do you handle it?
**Hard**

Metaspace holds class metadata (class definitions, method bytecode). Unlike heap, Metaspace is native memory (off-heap). By default it grows until the OS memory limit — it can silently consume all available native memory.
Causes:
- Dynamic class generation (Groovy scripts, CGLIB proxies, ASM) creating classes that are never unloaded
- Multiple hot-deploys in app servers (Tomcat/JBoss) without ClassLoader cleanup → ClassLoader leak
- Plugins/OSGi bundles repeatedly loaded without unloading old versions
```
# Cap Metaspace so dead classes are collected before native memory is exhausted
-XX:MaxMetaspaceSize=256m
# If Metaspace keeps growing toward the cap, you have a class/ClassLoader leak

# Log every class load/unload
-verbose:class

# JFR recording: capture class-loading events
jcmd <pid> JFR.start duration=60s filename=metaspace.jfr
# Analyze in JMC: look for classes loaded but never unloaded
```
A Spring-specific culprit: DevTools' restart ClassLoader accidentally left enabled in production — disable it with `-Dspring.devtools.restart.enabled=false`.
Thread Contention & Deadlocks
Production service is hung — no responses, no errors in logs. Diagnose a deadlock live.
**Hard**

1. Take thread dumps: `kill -3 <pid>` (prints to stdout) or `jstack <pid> > /tmp/tdump.txt`. Take 3 dumps at 10-second intervals to separate threads that are stuck from threads that are progressing.
2. `grep -A 20 "BLOCKED" /tmp/tdump.txt`. A deadlock shows as Thread A blocked waiting for a lock held by Thread B, while B is blocked waiting for a lock held by A.
3. `jstack` prints a "Found one Java-level deadlock" section automatically when it detects a circular lock dependency.
4. Prevention: prefer `tryLock(timeout)` on `ReentrantLock` over `synchronized` so threads can never block indefinitely:

```java
private final ReentrantLock lockA = new ReentrantLock();
private final ReentrantLock lockB = new ReentrantLock();

public void transfer() throws InterruptedException {
    boolean gotA = false, gotB = false;
    try {
        gotA = lockA.tryLock(100, TimeUnit.MILLISECONDS);
        gotB = lockB.tryLock(100, TimeUnit.MILLISECONDS);
        if (gotA && gotB) {
            doTransfer();
        } else {
            // Back off and retry — no deadlock possible
            throw new RetryableException("Could not acquire locks");
        }
    } finally {
        if (gotB) lockB.unlock();
        if (gotA) lockA.unlock();
    }
}
```
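Deadlocks can also be detected programmatically with the JDK's `ThreadMXBean` — a minimal sketch that could back a custom liveness check (wiring it into an endpoint is left as an assumption, not shown):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {

    /** Returns the ids of deadlocked threads, or null if none are detected. */
    public static long[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Detects cycles on both intrinsic monitors and j.u.c. ownable synchronizers
        return mx.findDeadlockedThreads();
    }
}
```

Feeding this into a liveness probe turns a silent hang into a failing health check — the pod gets restarted instead of serving nothing.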
Database & Connection Pool Issues
HikariCP: "Connection is not available, request timed out after 30000ms." Production fix.
**Hard**

```
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms
```
Root causes in order of likelihood:
- Pool size too small for the traffic — all connections in use, requests queue up
- Connection leak — a `@Transactional` method opens a connection but an exception path never returns it
- Slow queries holding connections for long durations — starving the pool
- DB-side: `max_connections` exhausted, so the DB server refuses new connections
```properties
# application.properties
# Default is 10 — tune per service
spring.datasource.hikari.maximum-pool-size=20
spring.datasource.hikari.minimum-idle=5
# Fail fast instead of a 30s wait
spring.datasource.hikari.connection-timeout=5000
spring.datasource.hikari.idle-timeout=300000
# Recycle before the DB kills idle connections
spring.datasource.hikari.max-lifetime=600000
# Log a stack trace if a connection is held >2s
spring.datasource.hikari.leak-detection-threshold=2000
```
Optimal pool size via Little's Law: connections needed ≈ query arrival rate × average connection hold time, plus headroom for bursts. HikariCP's own guidance is `pool size = (core_count × 2) + effective_spindle_count` — small, fixed pools beat large ones.
```java
@Test
void detectConnectionLeak() {
    // Enable leak detection in the test config, then verify that
    // active connections return to baseline after the call
    HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
    callServiceMethod();
    assertThat(pool.getActiveConnections()).isZero(); // all returned
}
```
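The Little's Law sizing can be sketched as a tiny helper — the numbers in the usage comment are purely illustrative:

```java
public class PoolSizing {

    /**
     * Little's Law: connections in use ≈ arrival rate × average hold time.
     * headroom > 1.0 adds a safety margin for bursts.
     */
    public static int poolSize(double queriesPerSecond, double avgHoldSeconds,
                               double headroom) {
        return (int) Math.ceil(queriesPerSecond * avgHoldSeconds * headroom);
    }
}
// 40 qps × 250ms hold × 1.5 headroom → poolSize(40, 0.25, 1.5) → 15, not 100
```

The interview-grade point: a pool larger than the DB can serve concurrently just moves the queue from your app into the database.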
Circuit Breaker, Retry & Bulkhead
Design a resilient service call with Resilience4j — Circuit Breaker + Retry + Bulkhead. Production config.
**Hard**

| Pattern | Protects Against | Mechanism |
|---|---|---|
| Circuit Breaker | Cascading failure to dead service | CLOSED→OPEN after X% failures, half-open probe |
| Retry | Transient network glitches | Exponential backoff with jitter |
| Bulkhead | One slow service starving thread pool | Separate thread pool per downstream |
| Rate Limiter | Overloading downstream | Max calls per time window |
| Time Limiter | Slow downstream holding threads | Cancel call after timeout |
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentGateway:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20
        failureRateThreshold: 50          # OPEN after 50% failures in the last 20 calls
        waitDurationInOpenState: 10s      # Wait before the half-open probe
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallRateThreshold: 80         # Also open if 80% of calls are slow
        slowCallDurationThreshold: 2s
  retry:
    instances:
      paymentGateway:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2   # 500ms → 1s → 2s
        retryExceptions:
          - java.net.ConnectException
          - java.net.SocketTimeoutException
        ignoreExceptions:
          - com.app.exception.InvalidPaymentException  # Don't retry business errors
  bulkhead:
    instances:
      paymentGateway:
        maxConcurrentCalls: 20
        maxWaitDuration: 100ms
```
```java
@Service
public class PaymentClient {

    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "paymentFallback")
    @Retry(name = "paymentGateway")
    @Bulkhead(name = "paymentGateway")
    @TimeLimiter(name = "paymentGateway")
    public CompletableFuture<PaymentResult> charge(PaymentRequest req) {
        return CompletableFuture.supplyAsync(() -> gateway.process(req));
    }

    // Fallback: must have the same return type + an extra Throwable param
    private CompletableFuture<PaymentResult> paymentFallback(
            PaymentRequest req, Throwable t) {
        log.error("Payment gateway down, routing to fallback", t);
        // Queue for async processing or return a cached result
        return CompletableFuture.completedFuture(PaymentResult.queued(req));
    }
}
```
Observability — Logs, Metrics, Traces
How do you trace a single user request across 8 microservices to find where latency is added?
**Hard**

Distributed tracing is the answer. OpenTelemetry (OTel) is the current standard — vendor-neutral, exportable to Jaeger, Zipkin, Datadog, or New Relic.
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
```

```properties
# application.properties
# 100% sampling in staging, ~10% in prod
management.tracing.sampling.probability=1.0
management.otlp.tracing.endpoint=http://otel-collector:4318/v1/traces
logging.pattern.level=%5p [${spring.application.name},%X{traceId},%X{spanId}]
```
```java
@KafkaListener(topics = "payment-events")
public void consume(ConsumerRecord<String, String> record) {
    // Trace context arrives in the Kafka record headers. With the Spring Kafka
    // OTel bridge on the classpath, extraction and span continuation are
    // automatic — no manual record.headers() handling needed here.
    processEvent(record.value());
}
```
JFR vs JProfiler vs async-profiler — which do you use in production and why?
**Mid**

| Tool | Overhead | Prod Safe? | Best For |
|---|---|---|---|
| JFR (Java Flight Recorder) | <1% | Yes — built into JDK | Always-on profiling, GC, I/O, threads, allocations |
| async-profiler | 1–3% | Yes — sampling only | CPU flame graphs, allocation profiling, lock contention |
| JProfiler / YourKit | 10–30% | No — too heavy | Dev/staging deep profiling with GUI |
| VisualVM | 5–15% | No | Local heap dump analysis |
```
# Start a JFR recording (JDK 11+)
jcmd <pid> JFR.start duration=120s filename=/tmp/recording.jfr settings=profile

# Or: always-on with a rolling file (low overhead) — JVM startup flag
-XX:StartFlightRecording=disk=true,maxage=1h,maxsize=500m,filename=/recordings/app.jfr,settings=default

# async-profiler — 30s CPU flame graph
./profiler.sh -d 30 -f /tmp/flamegraph.html <pid>
```
Distributed Consistency & Transactions
Explain the Transactional Outbox Pattern — why Saga alone isn't enough.
**Hard**

Saga handles the distributed transaction logic. But there's a fundamental problem: how do you atomically write to the DB AND publish to Kafka? If the service crashes between these two operations, you get inconsistency — DB updated but the event never published.
```java
@Transactional
public void placeOrder(Order order) {
    orderRepo.save(order); // write the order

    // Write the event to an OUTBOX table in the SAME transaction:
    // commit    → both order + outbox entry persist atomically
    // rollback  → neither persists
    OutboxEvent event = new OutboxEvent(
            "ORDER_PLACED", JsonUtils.toJson(order), Instant.now());
    outboxRepo.save(event);
    // Kafka is NOT called here — no dual-write problem
}

// Separate outbox publisher (CDC or polling)
@Scheduled(fixedDelay = 1000)
public void publishPendingEvents() {
    List<OutboxEvent> pending = outboxRepo.findUnpublished();
    pending.forEach(event -> {
        kafka.send(event.getType(), event.getPayload());
        event.markPublished();
        outboxRepo.save(event); // idempotent update
    });
}
```
CAP Theorem in practice — which databases choose what, and how does this affect your prod system design?
**Hard**

| System | CAP Choice | Why | Java Use Case |
|---|---|---|---|
| PostgreSQL / MySQL | CP | Strong consistency, may be unavailable during partition | Payments, orders, user accounts |
| Cassandra | AP | Always available, eventual consistency | Audit logs, time-series, sensor data |
| DynamoDB | AP (configurable) | Tunable consistency per read | Session store, product catalog |
| MongoDB | CP (with write concern) | Replica set majority writes = strong consistency | Document stores, content systems |
| Redis (Sentinel) | AP | Redis Cluster can lose writes during failover | Cache only — never source of truth |
| ZooKeeper / etcd | CP | Raft/Paxos consensus | Leader election, distributed config |
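For the tunable-consistency rows, the quorum arithmetic behind "configurable" is worth knowing cold: with replication factor N, reading R replicas and writing W replicas is strongly consistent exactly when R + W > N, because the read and write quorums must overlap on at least one replica. A one-liner to make it concrete:

```java
public class QuorumCheck {

    /** R + W > N guarantees read and write quorums overlap on >= 1 replica. */
    public static boolean stronglyConsistent(int n, int r, int w) {
        return r + w > n;
    }
}
// RF=3 with QUORUM reads and writes: stronglyConsistent(3, 2, 2) → true
// RF=3 with ONE/ONE:                 stronglyConsistent(3, 1, 1) → false (eventual)
```

This is the Cassandra/DynamoDB interview answer in one inequality: you buy consistency by paying read or write latency.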
Security in Production
Beyond JWT auth — what security layers does a production Java API need?
**Mid**

`/actuator/env` and `/actuator/heapdump` expose sensitive data. Restrict them to the internal network only, and serve management endpoints on a separate port from the app:

```properties
# application.properties
# Separate management port — never exposed to the internet
management.server.port=8081
management.endpoints.web.exposure.include=health,metrics,prometheus
management.endpoints.web.exposure.exclude=env,beans,heapdump,threaddump
management.endpoint.health.show-details=when-authorized
# Plus a Kubernetes NetworkPolicy allowing 8081 only from the monitoring namespace
```
Rapid Fire — Senior-Level One-Shots
What's the difference between liveness and readiness probes, and what happens if you mix them up?
**Mid**

Liveness: is the JVM alive? If it fails → the pod is restarted. Use it for deadlock detection. Readiness: is the app ready to serve traffic? If it fails → the pod is removed from the load balancer (no restart). Use it for DB connection health and downstream dependencies. Mix them up and you get restart storms: wire a DB ping into liveness and every DB blip kills healthy pods instead of just pausing their traffic.
GraalVM Native Image in Kubernetes — what breaks and what's the production tradeoff?
**Hard**

GraalVM compiles Java to a native binary (AOT). Startup: ~50ms vs ~8 seconds. Memory: ~60% reduction. Great for Kubernetes where pods scale rapidly.
What breaks:
- Dynamic class loading (CGLIB proxies, runtime Groovy) — must configure reflection hints
- Many libraries not yet native-image compatible (check GraalVM reachability metadata repo)
- Build time: 3–5 minutes vs 30 seconds for JVM — slower CI
- No JIT optimization — peak throughput lower than JVM for long-running services
- Debugging native binaries is harder — no JFR, limited profiling tools
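For the first bullet, reflection hints are usually supplied as JSON picked up from `META-INF/native-image/`. A hypothetical entry (the class name is illustrative) using GraalVM's standard reflect-config keys:

```json
[
  {
    "name": "com.app.payment.PaymentRequest",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true,
    "allDeclaredFields": true
  }
]
```

In Spring Boot 3, AOT processing generates most of these hints automatically; hand-written entries are the escape hatch for libraries the analysis can't see through.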
Continuous Delivery vs Continuous Deployment — and what gate do you add between them?
**Easy**

Continuous Delivery: the pipeline ensures every build is deployable; a human clicks deploy to production. Continuous Deployment: every green pipeline commit goes to production automatically — no human click.

Most mature teams run Continuous Deployment with automated gates: e2e pass + load-test baseline + security scan + canary analysis. Human approval remains only for major releases or regulated environments (fintech/healthcare).
Pod is running but /health returns 200 and service is still broken. How?
**Hard**

A health endpoint returning 200 only means Spring Boot's Actuator health check passed. It doesn't mean your business logic works. Common scenarios:
- The health check only pings the DB with `SELECT 1` — but the query users actually run is deadlocked
- An external dependency (payment gateway) is down — but the health check doesn't include it
- Feature flag misconfigured — requests routed to disabled code path, returns 200 with empty body
- Cache serving stale data — "healthy" but responses are hours old
How do you do a database schema migration safely with zero downtime and Blue-Green deployment?
**Hard**

The hardest part of zero-downtime deployment is the DB schema change: you can't deploy a new schema while old pods are still running — they'll fail against it. Use Expand-Contract (Parallel Change) across 3 releases:
1. **Expand:** add the new column as `NULL` with no constraints. Old code ignores it; new code writes both old and new columns.
2. **Migrate:** backfill existing rows in the background, then tighten constraints (`NOT NULL`, defaults).
3. **Contract:** drop the old column only once every pod reads the new one.

```sql
-- V1__add_email_hash_column.sql (SAFE — additive, nullable)
ALTER TABLE users ADD COLUMN email_hash VARCHAR(64) NULL;

-- V2__backfill_email_hash.sql (run as a background job, not a migration)
-- DO NOT use Flyway for large backfills — it locks the table.
-- Instead: a Spring Batch job with batch size 1000, pausable.

-- V3__drop_old_email.sql (only safe after ALL pods use email_hash)
ALTER TABLE users DROP COLUMN email;  -- release 3
```