Deploy.
Survive.
Scale.
Battle-tested answers for production deployment, JVM crises, and system stability questions. Tailored for a decade of production scars.
The Four Golden Signals — Always Start Here
Every production incident investigation starts with these four: Latency, Traffic, Errors, and Saturation. Know them cold — interviewers frame war stories around them.
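To make the four signals concrete, here is a minimal, illustrative sketch of how each one falls out of raw request data — the `Request` record and all numbers are made up for the example, not a real metrics API:

```java
import java.util.List;

public class GoldenSignals {

    public record Request(long latencyMs, boolean error) {}

    /** Signal 1 — Latency: p99 over a non-empty window of requests. */
    public static long p99(List<Request> window) {
        long[] sorted = window.stream().mapToLong(Request::latencyMs).sorted().toArray();
        return sorted[(int) Math.ceil(sorted.length * 0.99) - 1];
    }

    /** Signal 2 — Traffic: requests per second over the window. */
    public static double rps(List<Request> window, double windowSeconds) {
        return window.size() / windowSeconds;
    }

    /** Signal 3 — Errors: fraction of failed requests. */
    public static double errorRate(List<Request> window) {
        return (double) window.stream().filter(Request::error).count() / window.size();
    }

    /** Signal 4 — Saturation: utilization of a bounded resource (e.g. a thread pool). */
    public static double saturation(int inUse, int capacity) {
        return (double) inUse / capacity;
    }
}
```

In production these come from Micrometer/Prometheus rather than hand-rolled code, but the definitions are exactly these four ratios.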
Zero-Downtime Deployment Strategies
Walk me through Blue-Green, Canary, and Rolling deployments. When would you pick each in a real Java microservices system?
**Mid**

| Strategy | Mechanism | Rollback Speed | Infra Cost | Best For |
|---|---|---|---|---|
| Blue-Green | Two full envs; switch LB/DNS | Instant (flip back) | 2x | Major releases, DB schema changes |
| Canary | Route X% traffic to new version | Minutes (reroute) | ~1.1x | High-risk feature flags, A/B testing |
| Rolling | Replace pods one by one | Slow (re-deploy old) | 1x | Low-risk patches, stateless services |
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # Allow 1 extra pod during update
      maxUnavailable: 0    # Zero pods down at any time
  template:
    spec:
      containers:
        - name: payment-service
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 45   # Give the JVM time to warm up
```
`maxUnavailable: 0` is critical — without it, the default (25% of pods may be unavailable) lets the rolling update take pods out of service before their replacements are ready, cutting capacity mid-rollout. Also, if no `readinessProbe` is configured, Kubernetes routes traffic to a new pod before Spring Boot has finished loading its `ApplicationContext` — those requests fail silently.
Blue-Green with Spring Boot — the DB migration problem: The hardest part of Blue-Green is backward-compatible DB schema changes. Use Expand-Contract (Parallel Change) pattern: in v1 add the new column as nullable → deploy v2 that writes both old and new columns → drop the old column only in v3.
Canary deployment went bad — 5% of users hitting errors but dashboards look green. How do you catch this?
**Hard**

Add a `version` label to all Prometheus metrics, then filter Grafana dashboards by `version="v2-canary"` separately from `version="v1-stable"`:

```properties
# application.properties
management.metrics.tags.version=${APP_VERSION:unknown}
management.metrics.tags.pod=${POD_NAME:unknown}
```

MDC logging — add the version to every log line:

```java
@Component
public class VersionMdcFilter implements Filter {

    @Value("${app.version:unknown}")
    private String version;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        MDC.put("appVersion", version);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("appVersion");
        }
    }
}
```
CI/CD Pipeline Design
Design a production-grade CI/CD pipeline for a Java microservice. What gates do you add before prod?
**Hard**

Pipeline stages for a 10-year engineer — give the full gated pipeline, not the junior build-test-deploy answer.
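A tool-agnostic sketch of the stages and prod gates such a pipeline typically carries — stage names and tool choices here are illustrative, not prescriptive:

```yaml
# Illustrative stage list — tools named are examples only
stages:
  - name: build-and-unit-test      # compile + unit tests, fail fast (<5 min)
  - name: static-analysis          # SonarQube quality gate, SpotBugs
  - name: security-scan            # SAST + dependency CVE scan (OWASP, Trivy)
  - name: package                  # immutable image, tagged with the git SHA
  - name: integration-test         # Testcontainers against real DB/Kafka
  - name: deploy-staging
  - name: e2e-and-load-test        # compare latency against a stored baseline
  - name: deploy-canary            # small traffic slice + automated metric analysis
  - name: promote-to-prod          # automatic, or manual approval in regulated envs
```

The senior signal is the gates after packaging: load-test baseline comparison and automated canary analysis, not just "tests pass."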
Deployment succeeded but 30% of requests are 500s. Rollback or investigate? What's your decision tree?
**Hard**

At a 30% error rate, roll back first and investigate afterwards — restore service, then dig through logs, metrics, and the bad version's artifacts. Investigate-in-place only when the blast radius is small and a fix is minutes away.

```bash
# Rollback to previous revision
argocd app rollback payment-service --revision 42

# Or kubectl — roll back the Deployment
kubectl rollout undo deployment/payment-service -n production
kubectl rollout status deployment/payment-service -n production
```
Kubernetes Operations
Java pod keeps restarting with "OOMKilled" in Kubernetes. Debug and fix it.
**Hard**

`kubectl describe pod` shows:

```
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
```
Exit code 137 = 128 + 9 (SIGKILL) — here, the kernel OOM killer. Kubernetes enforces `resources.limits.memory` at the cgroup level. Java was historically bad at this: before container support (JDK 10, backported to 8u191), the JVM read total host memory and sized its heap from that, ignoring the container limit.
`kubectl top pod <name>` shows actual memory use; `kubectl describe pod` shows the OOMKilled events. Check whether it's heap or off-heap (Metaspace, native threads, DirectByteBuffer) that's growing. On modern JVMs container awareness is on by default (`-XX:+UseContainerSupport`); set `MaxRAMPercentage` instead of a fixed `-Xmx`:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"   # limit ≈ 1.5x the heap that MaxRAMPercentage allocates
    cpu: "1000m"
env:
  - name: JAVA_TOOL_OPTIONS
    # 60% of 768Mi ≈ 460Mi heap; -Xss256k shrinks the per-thread stack
    # (default 512k-1MB)
    value: >-
      -XX:MaxRAMPercentage=60.0 -XX:InitialRAMPercentage=50.0
      -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError
      -XX:HeapDumpPath=/tmp/heapdump.hprof -Xss256k
```
Off-heap is the usual surprise: Metaspace (cap with `-XX:MaxMetaspaceSize=128m`), DirectByteBuffer (Netty/NIO — limit with `-XX:MaxDirectMemorySize=64m`), and thread count × stack size. Each thread uses ~256KB–1MB of native memory.
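A quick sanity-check of the memory budget arithmetic above — all numbers are illustrative, the point is that heap + off-heap must fit under the container limit:

```java
public class MemoryBudget {

    /** Heap the JVM sizes from -XX:MaxRAMPercentage against the cgroup limit. */
    public static long heapBytes(long containerLimitBytes, double maxRamPercentage) {
        return (long) (containerLimitBytes * maxRamPercentage / 100.0);
    }

    /** Rough native (off-heap) budget: thread stacks + Metaspace + direct buffers. */
    public static long offHeapBytes(int threads, long stackBytes,
                                    long metaspaceBytes, long directBytes) {
        return threads * stackBytes + metaspaceBytes + directBytes;
    }

    public static void main(String[] args) {
        long mi = 1024 * 1024;
        long limit = 768 * mi;                                            // container limit
        long heap = heapBytes(limit, 60.0);                               // ≈ 460Mi
        long offHeap = offHeapBytes(200, 256 * 1024, 128 * mi, 64 * mi);  // ≈ 242Mi
        System.out.println("fits=" + (heap + offHeap < limit));           // prints fits=true
    }
}
```

If the sum exceeds the limit, the OOM killer fires even though the heap itself never filled — which is exactly why "OOMKilled with a healthy-looking heap" happens.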
CrashLoopBackOff on a Spring Boot service in Kubernetes. Your debug sequence?
**Mid**

1. `kubectl logs <pod> --previous` — read the logs from the crashed container. Look for the stack trace, not just "Terminated".
2. `kubectl describe pod <pod>` — check the exit code (1 = app crash, 137 = OOMKilled, 143 = SIGTERM timeout), the liveness probe failure count, and the Events section.
3. Common causes: bean creation failure (`@PostConstruct`), mis-sized heap → OOMKilled, liveness probe firing before the JVM warms up, failed DB connection at startup.
4. `kubectl exec -it <pod> -- /bin/sh` (if the container is still alive) — run `jcmd 1 VM.flags` or `jmap -heap 1` to inspect the live JVM.

```yaml
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60   # Spring Boot + Hibernate init can take 30-45s
  periodSeconds: 10
  failureThreshold: 3       # 3 failures × 10s = 30s grace before kill
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
```
JVM Tuning & GC Under Production Load
Application slows every 2 hours in production — high latency spikes. What's your investigation process?
**Hard**

Periodic latency spikes at regular intervals = GC pause is the first suspect. The pattern (every 2 hours) suggests a Full GC triggered by the old generation filling up.
Enable GC logging with `-Xlog:gc*:file=/tmp/gc.log:time,uptime:filecount=5,filesize=20m` and look for "Pause Full" entries — these stop the world. Inspect the heap with `jcmd <pid> GC.heap_info` and `jmap -histo:live <pid>`, and watch which object types grow unchecked.

```
# G1GC — recommended for API services (low pause)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200                # Target max pause (G1 will try to meet it)
-XX:G1HeapRegionSize=16m                # Tune for large-object allocation
-XX:InitiatingHeapOccupancyPercent=45   # Trigger concurrent GC earlier
-XX:+G1UseAdaptiveIHOP                  # Let the JVM adapt the threshold
-Xlog:gc*:file=/tmp/gc.log:time,uptime:filecount=5,filesize=20m

# ZGC — for Java 15+ with <10ms pause requirements (fintech/HFT)
-XX:+UseZGC
-XX:SoftMaxHeapSize=4g                  # Collect aggressively before hitting -Xmx
```
If old-gen reclamation still lags, let mixed collections take more old regions per cycle via `-XX:G1OldCSetRegionThresholdPercent`, or reduce the allocation rate itself by pooling frequently recreated objects.
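Pooling can be as simple as a bounded free-list. A minimal, illustrative sketch — not a production-grade pool (no object reset, no leak tracking):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

public class ReusePool<T> {

    private final ArrayBlockingQueue<T> pool;
    private final Supplier<T> factory;

    public ReusePool(int capacity, Supplier<T> factory) {
        this.pool = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
    }

    /** Reuse a pooled instance if one is available, otherwise create a new one. */
    public T acquire() {
        T obj = pool.poll();
        return obj != null ? obj : factory.get();
    }

    /** Return an instance for reuse; silently dropped if the pool is full. */
    public void release(T obj) {
        pool.offer(obj);
    }
}
```

Wrapping `acquire`/`release` around a hot loop of large buffers (e.g. 64KB `byte[]`s) can cut the allocation pressure that feeds GC churn — but measure first; modern GCs handle short-lived objects cheaply.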
What is False Sharing in Java and when have you seen it cause production issues?
**Hard**

False Sharing occurs when two threads on different CPU cores update different variables that happen to share the same 64-byte CPU cache line. When core A writes its variable, the entire cache line is invalidated on core B — forcing B to reload it from L2/L3 cache. At high frequency, this causes significant performance degradation despite no actual data sharing.
```java
public class MetricsHolder {
    public volatile long requestCount = 0; // Thread A updates this
    public volatile long errorCount = 0;   // Thread B updates this
    // Adjacent in memory → same cache line → false sharing!
}
```
```java
public class MetricsHolder {
    @jdk.internal.vm.annotation.Contended
    public volatile long requestCount = 0;

    @jdk.internal.vm.annotation.Contended
    public volatile long errorCount = 0;
    // JVM pads each field (128 bytes by default) → different cache lines
}
// Requires the -XX:-RestrictContended JVM flag for non-JDK classes
```

Or manual padding (works on any JDK, no flags, predates `@Contended`):

```java
public class PaddedLong {
    public volatile long value = 0;
    public long p1, p2, p3, p4, p5, p6, p7; // 7×8 = 56 bytes of padding
}
```
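For plain hot counters, though, the JDK already solves this: `LongAdder` stripes updates across internal cells that are themselves contention-padded, avoiding both lock contention and false sharing — usually the simpler production fix. A sketch of the counter class rebuilt on top of it:

```java
import java.util.concurrent.atomic.LongAdder;

public class Metrics {

    private final LongAdder requestCount = new LongAdder();
    private final LongAdder errorCount = new LongAdder();

    public void onRequest() { requestCount.increment(); }  // contention-free fast path
    public void onError()   { errorCount.increment(); }

    // sum() folds the striped cells — call it from the (rare) read path
    public long requests()  { return requestCount.sum(); }
    public long errors()    { return errorCount.sum(); }
}
```

The tradeoff: `sum()` is not a snapshot under concurrent writes, which is fine for metrics but not for invariants.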
OutOfMemoryError — Detection & Recovery
Production: java.lang.OutOfMemoryError: Java heap space — live system. Your exact response.
**Hard**

First, capture evidence: `jcmd <pid> GC.heap_dump /tmp/heapdump.hprof`, or pre-configure `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/` to auto-capture on crash. Then look for the two patterns behind most heap OOMs:

```java
// PATTERN 1: Unbounded static cache
public class UserCache {
    private static final Map<String, User> cache = new HashMap<>();
    // Grows forever — use Caffeine/Guava with a size limit and TTL
}

// FIX:
private final Cache<String, User> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(Duration.ofMinutes(30))
        .build();

// PATTERN 2: ThreadLocal leak in a thread pool
private static ThreadLocal<UserContext> ctx = new ThreadLocal<>();
// Never cleaned — pool threads are reused, so the value lives forever

// FIX: always clean up in a finally block
try {
    ctx.set(new UserContext(userId));
    processRequest();
} finally {
    ctx.remove(); // CRITICAL in thread pools
}
```
OutOfMemoryError: Metaspace in production. Different from heap OOM — how do you handle it?
**Hard**

Metaspace holds class metadata (class definitions, method bytecode). Unlike heap, Metaspace is native memory (off-heap). By default it grows until the OS memory limit — it can silently consume all available native memory.
Causes:
- Dynamic class generation (Groovy scripts, CGLIB proxies, ASM) creating classes that are never unloaded
- Multiple hot-deploys in app servers (Tomcat/JBoss) without ClassLoader cleanup → ClassLoader leak
- Plugins/OSGi bundles repeatedly loaded without unloading old versions
```
# Cap Metaspace so dead classes are collected before native memory is exhausted
-XX:MaxMetaspaceSize=256m
# If Metaspace keeps growing toward the cap, you have a class/ClassLoader leak

# Log every class load/unload
-verbose:class

# JFR recording: capture class-loading events
jcmd <pid> JFR.start duration=60s filename=metaspace.jfr
# Analyze in JMC: look for classes loaded but never unloaded
```
A Spring-specific culprit: DevTools' restart ClassLoader accidentally left enabled in production — disable it with `-Dspring.devtools.restart.enabled=false`.
Thread Contention & Deadlocks
Production service is hung — no responses, no errors in logs. Diagnose a deadlock live.
**Hard**

1. Take thread dumps: `kill -3 <pid>` (prints to stdout) or `jstack <pid> > /tmp/tdump.txt`. Take 3 dumps at 10-second intervals to separate threads that are stuck from threads that are progressing.
2. `grep -A 20 "BLOCKED" /tmp/tdump.txt`. A deadlock shows as Thread A blocked waiting for a lock held by Thread B, while B is blocked waiting for a lock held by A.
3. `jstack` prints a "Found one Java-level deadlock" section automatically when it detects a circular lock dependency.
4. Prevention: prefer `tryLock(timeout)` on `ReentrantLock` over `synchronized` so threads can never block indefinitely:

```java
private final ReentrantLock lockA = new ReentrantLock();
private final ReentrantLock lockB = new ReentrantLock();

public void transfer() throws InterruptedException {
    boolean gotA = false, gotB = false;
    try {
        gotA = lockA.tryLock(100, TimeUnit.MILLISECONDS);
        gotB = lockB.tryLock(100, TimeUnit.MILLISECONDS);
        if (gotA && gotB) {
            doTransfer();
        } else {
            // Back off and retry — no deadlock possible
            throw new RetryableException("Could not acquire locks");
        }
    } finally {
        if (gotB) lockB.unlock();
        if (gotA) lockA.unlock();
    }
}
```
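Deadlocks can also be detected programmatically with the JDK's `ThreadMXBean` — a minimal sketch that could back a custom liveness check (wiring it into an endpoint is left as an assumption, not shown):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {

    /** Returns the ids of deadlocked threads, or null if none are detected. */
    public static long[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Detects cycles on both intrinsic monitors and j.u.c. ownable synchronizers
        return mx.findDeadlockedThreads();
    }
}
```

Feeding this into a liveness probe turns a silent hang into a failing health check — the pod gets restarted instead of serving nothing.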
Database & Connection Pool Issues
HikariCP: "Connection is not available, request timed out after 30000ms." Production fix.
**Hard**

```
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms
```
Root causes in order of likelihood:
- Pool size too small for the traffic — all connections in use, requests queue up
- Connection leak — a `@Transactional` method opens a connection but an exception path never returns it
- Slow queries holding connections for long durations — starving the pool
- DB-side: `max_connections` exhausted, so the DB server refuses new connections
```properties
# application.properties
# Default is 10 — tune per service
spring.datasource.hikari.maximum-pool-size=20
spring.datasource.hikari.minimum-idle=5
# Fail fast instead of a 30s wait
spring.datasource.hikari.connection-timeout=5000
spring.datasource.hikari.idle-timeout=300000
# Recycle before the DB kills idle connections
spring.datasource.hikari.max-lifetime=600000
# Log a stack trace if a connection is held >2s
spring.datasource.hikari.leak-detection-threshold=2000
```
Optimal pool size via Little's Law: connections needed ≈ query arrival rate × average connection hold time, plus headroom for bursts. HikariCP's own guidance is `pool size = (core_count × 2) + effective_spindle_count` — small, fixed pools beat large ones.
```java
@Test
void detectConnectionLeak() {
    // Enable leak detection in the test config, then verify that
    // active connections return to baseline after the call
    HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
    callServiceMethod();
    assertThat(pool.getActiveConnections()).isZero(); // all returned
}
```
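The Little's Law sizing can be sketched as a tiny helper — the numbers in the usage comment are purely illustrative:

```java
public class PoolSizing {

    /**
     * Little's Law: connections in use ≈ arrival rate × average hold time.
     * headroom > 1.0 adds a safety margin for bursts.
     */
    public static int poolSize(double queriesPerSecond, double avgHoldSeconds,
                               double headroom) {
        return (int) Math.ceil(queriesPerSecond * avgHoldSeconds * headroom);
    }
}
// 40 qps × 250ms hold × 1.5 headroom → poolSize(40, 0.25, 1.5) → 15, not 100
```

The interview-grade point: a pool larger than the DB can serve concurrently just moves the queue from your app into the database.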
Circuit Breaker, Retry & Bulkhead
Design a resilient service call with Resilience4j — Circuit Breaker + Retry + Bulkhead. Production config.
**Hard**

| Pattern | Protects Against | Mechanism |
|---|---|---|
| Circuit Breaker | Cascading failure to dead service | CLOSED→OPEN after X% failures, half-open probe |
| Retry | Transient network glitches | Exponential backoff with jitter |
| Bulkhead | One slow service starving thread pool | Separate thread pool per downstream |
| Rate Limiter | Overloading downstream | Max calls per time window |
| Time Limiter | Slow downstream holding threads | Cancel call after timeout |
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentGateway:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20
        failureRateThreshold: 50          # OPEN after 50% failures in the last 20 calls
        waitDurationInOpenState: 10s      # Wait before the half-open probe
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallRateThreshold: 80         # Also open if 80% of calls are slow
        slowCallDurationThreshold: 2s
  retry:
    instances:
      paymentGateway:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2   # 500ms → 1s → 2s
        retryExceptions:
          - java.net.ConnectException
          - java.net.SocketTimeoutException
        ignoreExceptions:
          - com.app.exception.InvalidPaymentException  # Don't retry business errors
  bulkhead:
    instances:
      paymentGateway:
        maxConcurrentCalls: 20
        maxWaitDuration: 100ms
```
```java
@Service
public class PaymentClient {

    @CircuitBreaker(name = "paymentGateway", fallbackMethod = "paymentFallback")
    @Retry(name = "paymentGateway")
    @Bulkhead(name = "paymentGateway")
    @TimeLimiter(name = "paymentGateway")
    public CompletableFuture<PaymentResult> charge(PaymentRequest req) {
        return CompletableFuture.supplyAsync(() -> gateway.process(req));
    }

    // Fallback: must have the same return type + an extra Throwable param
    private CompletableFuture<PaymentResult> paymentFallback(
            PaymentRequest req, Throwable t) {
        log.error("Payment gateway down, routing to fallback", t);
        // Queue for async processing or return a cached result
        return CompletableFuture.completedFuture(PaymentResult.queued(req));
    }
}
```
Observability — Logs, Metrics, Traces
How do you trace a single user request across 8 microservices to find where latency is added?
**Hard**

Distributed tracing is the answer. OpenTelemetry (OTel) is the current standard — vendor-neutral, exportable to Jaeger, Zipkin, Datadog, or New Relic.
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
```

```properties
# application.properties
# 100% sampling in staging, ~10% in prod
management.tracing.sampling.probability=1.0
management.otlp.tracing.endpoint=http://otel-collector:4318/v1/traces
logging.pattern.level=%5p [${spring.application.name},%X{traceId},%X{spanId}]
```
```java
@KafkaListener(topics = "payment-events")
public void consume(ConsumerRecord<String, String> record) {
    // Trace context arrives in the Kafka record headers. With the Spring Kafka
    // OTel bridge on the classpath, extraction and span continuation are
    // automatic — no manual record.headers() handling needed here.
    processEvent(record.value());
}
```
JFR vs JProfiler vs async-profiler — which do you use in production and why?
**Mid**

| Tool | Overhead | Prod Safe? | Best For |
|---|---|---|---|
| JFR (Java Flight Recorder) | <1% | Yes — built into JDK | Always-on profiling, GC, I/O, threads, allocations |
| async-profiler | 1–3% | Yes — sampling only | CPU flame graphs, allocation profiling, lock contention |
| JProfiler / YourKit | 10–30% | No — too heavy | Dev/staging deep profiling with GUI |
| VisualVM | 5–15% | No | Local heap dump analysis |
```
# Start a JFR recording (JDK 11+)
jcmd <pid> JFR.start duration=120s filename=/tmp/recording.jfr settings=profile

# Or: always-on with a rolling file (low overhead) — JVM startup flag
-XX:StartFlightRecording=disk=true,maxage=1h,maxsize=500m,filename=/recordings/app.jfr,settings=default

# async-profiler — 30s CPU flame graph
./profiler.sh -d 30 -f /tmp/flamegraph.html <pid>
```
Distributed Consistency & Transactions
Explain the Transactional Outbox Pattern — why Saga alone isn't enough.
**Hard**

Saga handles the distributed transaction logic. But there's a fundamental problem: how do you atomically write to the DB AND publish to Kafka? If the service crashes between these two operations, you get inconsistency — DB updated but the event never published.
```java
@Transactional
public void placeOrder(Order order) {
    orderRepo.save(order); // write the order

    // Write the event to an OUTBOX table in the SAME transaction:
    // commit    → both order + outbox entry persist atomically
    // rollback  → neither persists
    OutboxEvent event = new OutboxEvent(
            "ORDER_PLACED", JsonUtils.toJson(order), Instant.now());
    outboxRepo.save(event);
    // Kafka is NOT called here — no dual-write problem
}

// Separate outbox publisher (CDC or polling)
@Scheduled(fixedDelay = 1000)
public void publishPendingEvents() {
    List<OutboxEvent> pending = outboxRepo.findUnpublished();
    pending.forEach(event -> {
        kafka.send(event.getType(), event.getPayload());
        event.markPublished();
        outboxRepo.save(event); // idempotent update
    });
}
```
CAP Theorem in practice — which databases choose what, and how does this affect your prod system design?
**Hard**

| System | CAP Choice | Why | Java Use Case |
|---|---|---|---|
| PostgreSQL / MySQL | CP | Strong consistency, may be unavailable during partition | Payments, orders, user accounts |
| Cassandra | AP | Always available, eventual consistency | Audit logs, time-series, sensor data |
| DynamoDB | AP (configurable) | Tunable consistency per read | Session store, product catalog |
| MongoDB | CP (with write concern) | Replica set majority writes = strong consistency | Document stores, content systems |
| Redis (Sentinel) | AP | Redis Cluster can lose writes during failover | Cache only — never source of truth |
| ZooKeeper / etcd | CP | Raft/Paxos consensus | Leader election, distributed config |
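For the tunable-consistency rows, the quorum arithmetic behind "configurable" is worth knowing cold: with replication factor N, reading R replicas and writing W replicas is strongly consistent exactly when R + W > N, because the read and write quorums must overlap on at least one replica. A one-liner to make it concrete:

```java
public class QuorumCheck {

    /** R + W > N guarantees read and write quorums overlap on >= 1 replica. */
    public static boolean stronglyConsistent(int n, int r, int w) {
        return r + w > n;
    }
}
// RF=3 with QUORUM reads and writes: stronglyConsistent(3, 2, 2) → true
// RF=3 with ONE/ONE:                 stronglyConsistent(3, 1, 1) → false (eventual)
```

This is the Cassandra/DynamoDB interview answer in one inequality: you buy consistency by paying read or write latency.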
Security in Production
Beyond JWT auth — what security layers does a production Java API need?
**Mid**

`/actuator/env` and `/actuator/heapdump` expose sensitive data. Restrict them to the internal network only, and serve management endpoints on a separate port from the app:

```properties
# application.properties
# Separate management port — never exposed to the internet
management.server.port=8081
management.endpoints.web.exposure.include=health,metrics,prometheus
management.endpoints.web.exposure.exclude=env,beans,heapdump,threaddump
management.endpoint.health.show-details=when-authorized
# Plus a Kubernetes NetworkPolicy allowing 8081 only from the monitoring namespace
```
Rapid Fire — Senior-Level One-Shots
What's the difference between liveness and readiness probes, and what happens if you mix them up?
**Mid**

Liveness: is the JVM alive? If it fails → the pod is restarted. Use it for deadlock detection. Readiness: is the app ready to serve traffic? If it fails → the pod is removed from the load balancer (no restart). Use it for DB connection health and downstream dependencies. Mix them up and you get restart storms: wire a DB ping into liveness and every DB blip kills healthy pods instead of just pausing their traffic.
GraalVM Native Image in Kubernetes — what breaks and what's the production tradeoff?
**Hard**

GraalVM compiles Java to a native binary (AOT). Startup: ~50ms vs ~8 seconds. Memory: ~60% reduction. Great for Kubernetes where pods scale rapidly.
What breaks:
- Dynamic class loading (CGLIB proxies, runtime Groovy) — must configure reflection hints
- Many libraries not yet native-image compatible (check GraalVM reachability metadata repo)
- Build time: 3–5 minutes vs 30 seconds for JVM — slower CI
- No JIT optimization — peak throughput lower than JVM for long-running services
- Debugging native binaries is harder — no JFR, limited profiling tools
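For the first bullet, reflection hints are usually supplied as JSON picked up from `META-INF/native-image/`. A hypothetical entry (the class name is illustrative) using GraalVM's standard reflect-config keys:

```json
[
  {
    "name": "com.app.payment.PaymentRequest",
    "allDeclaredConstructors": true,
    "allDeclaredMethods": true,
    "allDeclaredFields": true
  }
]
```

In Spring Boot 3, AOT processing generates most of these hints automatically; hand-written entries are the escape hatch for libraries the analysis can't see through.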
Continuous Delivery vs Continuous Deployment — and what gate do you add between them?
**Easy**

Continuous Delivery: the pipeline ensures every build is deployable; a human clicks deploy to production. Continuous Deployment: every green pipeline commit goes to production automatically — no human click.

Most mature teams run Continuous Deployment with automated gates: e2e pass + load-test baseline + security scan + canary analysis. Human approval remains only for major releases or regulated environments (fintech/healthcare).
Pod is running but /health returns 200 and service is still broken. How?
**Hard**

A health endpoint returning 200 only means Spring Boot's Actuator health check passed. It doesn't mean your business logic works. Common scenarios:
- The health check only pings the DB with `SELECT 1` — but the query users actually run is deadlocked
- An external dependency (payment gateway) is down — but the health check doesn't include it
- Feature flag misconfigured — requests routed to disabled code path, returns 200 with empty body
- Cache serving stale data — "healthy" but responses are hours old
How do you do a database schema migration safely with zero downtime and Blue-Green deployment?
**Hard**

The hardest part of zero-downtime deployment is the DB schema change: you can't deploy a new schema while old pods are still running — they'll fail against it. Use Expand-Contract (Parallel Change) across 3 releases:
1. **Expand:** add the new column as `NULL` with no constraints. Old code ignores it; new code writes both old and new columns.
2. **Migrate:** backfill existing rows in the background, then tighten constraints (`NOT NULL`, defaults).
3. **Contract:** drop the old column only once every pod reads the new one.

```sql
-- V1__add_email_hash_column.sql (SAFE — additive, nullable)
ALTER TABLE users ADD COLUMN email_hash VARCHAR(64) NULL;

-- V2__backfill_email_hash.sql (run as a background job, not a migration)
-- DO NOT use Flyway for large backfills — it locks the table.
-- Instead: a Spring Batch job with batch size 1000, pausable.

-- V3__drop_old_email.sql (only safe after ALL pods use email_hash)
ALTER TABLE users DROP COLUMN email;  -- release 3
```