| Dimension | Synchronous (REST/gRPC) | Asynchronous (Kafka/RabbitMQ) |
|---|---|---|
| Coupling | Temporal coupling – both must be UP simultaneously | Temporally decoupled – producer/consumer independent |
| Latency | Low for simple request-response | Higher – eventual consistency |
| Complexity | Simple mental model, easy to debug | Complex: ordering, idempotency, dead-letter queues |
| Use When | Real-time response needed (payment gateway, auth) | High throughput, fan-out, audit logs, event sourcing |
| Failure Mode | Cascading failures if downstream is slow | Message accumulation, consumer lag, poison pills |
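The accumulation/consumer-lag failure mode in the last row follows directly from temporal decoupling: the producer keeps accepting work while the consumer is down, and the backlog waits. A minimal, broker-free sketch of that property (the `TemporalDecoupling` class and its methods are illustrative stand-ins, not a real broker API):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class TemporalDecoupling {
    // Stand-in for a broker topic: producer writes even if no consumer is running
    private final Queue<String> orders = new ArrayDeque<>();

    public void publish(String order) {
        orders.add(order); // succeeds regardless of consumer availability
    }

    // Consumer drains the backlog whenever it comes (back) online
    public int drain() {
        int processed = 0;
        while (!orders.isEmpty()) {
            orders.poll();
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        TemporalDecoupling broker = new TemporalDecoupling();
        broker.publish("order-1"); // consumer "down" – messages accumulate
        broker.publish("order-2");
        System.out.println(broker.drain()); // consumer recovers, processes backlog
    }
}
```

With a synchronous call, those two publishes would have failed outright; here they simply wait, which is exactly the lag/poison-pill operational burden the table warns about.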
Gateway routing: `/api/orders/**` → order-service, `/api/users/**` → user-service.

```yaml
# 10% traffic to v2, 90% to v1
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  http:
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
```
```java
@Transactional
public void placeOrder(Order order) {
    // 1. Save business entity
    orderRepository.save(order);

    // 2. Save event to outbox table – SAME transaction
    OutboxEvent event = OutboxEvent.builder()
        .aggregateType("ORDER")
        .aggregateId(order.getId())
        .eventType("ORDER_PLACED")
        .payload(toJson(order))
        .status(OutboxStatus.PENDING)
        .build();
    outboxRepository.save(event);
    // No Kafka call here! Relay handles it.
}

// Separate relay service (polls or uses CDC)
@Scheduled(fixedDelay = 1000)
public void relayEvents() {
    List<OutboxEvent> pending =
        outboxRepository.findByStatus(OutboxStatus.PENDING);
    pending.forEach(evt -> {
        kafkaTemplate.send(evt.getEventType(), evt.getPayload());
        outboxRepository.markPublished(evt.getId());
    });
}
```
```java
// Command handler
@CommandHandler
public void handle(PlaceOrderCommand cmd) {
    AggregateLifecycle.apply(new OrderPlacedEvent(
        cmd.getOrderId(), cmd.getItems(), Instant.now()));
}

// Event handler (updates read model)
@EventHandler
public void on(OrderPlacedEvent event) {
    // Write to Elasticsearch read model
    orderReadRepository.save(OrderSummaryView.from(event));
}

// Query handler
@QueryHandler
public OrderSummaryView handle(GetOrderSummaryQuery query) {
    return orderReadRepository.findById(query.getOrderId());
}
```
| Aspect | Choreography Saga | Orchestration Saga |
|---|---|---|
| Control | Decentralized – each service reacts to events | Centralized – Saga Orchestrator controls the flow |
| Coupling | Loose – services don't know each other | Orchestrator coupled to each participant |
| Complexity | Hard to track overall workflow | Easy to visualize and monitor |
| Best For | Simple, short workflows (2-3 steps) | Complex workflows (5+ steps, conditional logic) |
| Tools | Kafka events | Temporal, Axon, custom orchestrator |
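Choreography from the left column can be sketched without any broker: each "service" subscribes to the events it consumes and emits the next one, with no central coordinator knowing the overall flow. A toy in-memory event bus (all class, topic, and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class ChoreographyDemo {
    // Stand-in for Kafka: topic name -> subscribed handlers
    private final Map<String, List<Consumer<String>>> bus = new HashMap<>();
    private final List<String> trail = new ArrayList<>();

    void subscribe(String topic, Consumer<String> handler) {
        bus.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    void emit(String topic, String orderId) {
        trail.add(topic);
        bus.getOrDefault(topic, List.of()).forEach(h -> h.accept(orderId));
    }

    public static void main(String[] args) {
        ChoreographyDemo demo = new ChoreographyDemo();
        // Each "service" only knows the events it consumes and produces –
        // the end-to-end workflow is emergent, not declared anywhere
        demo.subscribe("OrderCreated", id -> demo.emit("InventoryReserved", id));
        demo.subscribe("InventoryReserved", id -> demo.emit("PaymentCharged", id));
        demo.emit("OrderCreated", "order-42");
        System.out.println(demo.trail);
    }
}
```

Note that the full `OrderCreated → InventoryReserved → PaymentCharged` chain exists only in the emitted trail, which is precisely why the table calls choreography "hard to track".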
```java
@Service
public class OrderSagaOrchestrator {

    @SagaEventHandler(associationProperty = "orderId")
    public void on(OrderCreatedEvent event) {
        // Step 1: Reserve inventory
        commandGateway.send(new ReserveInventoryCommand(event.getOrderId()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(InventoryReservedEvent event) {
        // Step 2: Charge payment
        commandGateway.send(new ChargePaymentCommand(event.getOrderId()));
    }

    @SagaEventHandler(associationProperty = "orderId")
    public void on(PaymentFailedEvent event) {
        // Compensating transaction – release inventory
        commandGateway.send(new ReleaseInventoryCommand(event.getOrderId()));
        SagaLifecycle.end();
    }
}
```
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        # Sliding window: count-based, 10 calls
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        # Open circuit if 50% of calls fail
        failureRateThreshold: 50
        # Stay open for 30s before half-open
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        # Also record timeouts as failures
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
```
```java
@CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
@Retry(name = "paymentService")
@TimeLimiter(name = "paymentService")
public CompletableFuture<PaymentResponse> processPayment(PaymentRequest req) {
    return CompletableFuture.supplyAsync(() -> paymentClient.charge(req));
}

public CompletableFuture<PaymentResponse> paymentFallback(
        PaymentRequest req, Throwable ex) {
    // Queue for later processing
    pendingPaymentQueue.enqueue(req);
    return CompletableFuture.completedFuture(
        PaymentResponse.pending(req.getOrderId()));
}
```
Like a ship's bulkhead compartments: if one compartment floods, it doesn't sink the whole ship. Similarly, a slow payment service won't consume all threads and prevent inventory checks.
```yaml
# application.yml
resilience4j:
  bulkhead:
    instances:
      inventoryService:
        maxConcurrentCalls: 5
        maxWaitDuration: 100ms
```

```java
// Usage
@Bulkhead(name = "inventoryService", type = Bulkhead.Type.SEMAPHORE)
public Inventory checkInventory(String productId) {
    return inventoryClient.check(productId);
}
```
```java
@Component
public class IdempotencyFilter extends OncePerRequestFilter {

    @Autowired
    private RedisTemplate<String, String> redis;

    @Override
    protected void doFilterInternal(HttpServletRequest req,
                                    HttpServletResponse res,
                                    FilterChain chain)
            throws ServletException, IOException {
        String key = req.getHeader("Idempotency-Key");
        if (key == null) {
            chain.doFilter(req, res);
            return;
        }
        String cached = redis.opsForValue().get("idem:" + key);
        if (cached != null) {
            // Return cached response – no duplicate processing
            res.setStatus(HttpServletResponse.SC_OK);
            res.getWriter().write(cached);
            return;
        }
        // Wrap the response so the body can be captured after processing
        ContentCachingResponseWrapper wrapper =
            new ContentCachingResponseWrapper(res);
        chain.doFilter(req, wrapper);
        String body = new String(wrapper.getContentAsByteArray(),
            wrapper.getCharacterEncoding());
        redis.opsForValue().set("idem:" + key, body, Duration.ofHours(24));
        wrapper.copyBodyToResponse(); // flush the captured body to the client
    }
}
```
| Protocol | Behavior | Latency |
|---|---|---|
| Eager (default) | Revoke ALL partitions, then reassign | High – full stop-the-world |
| Cooperative Incremental | Only revoke/reassign changed partitions | Low – no full stop |
```yaml
# application.yml
spring:
  kafka:
    consumer:
      group-id: order-processing-group
      heartbeat-interval: 3s
      properties:
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        max.poll.interval.ms: 300000   # Increase if processing is slow
        session.timeout.ms: 45000
```
Use `CooperativeStickyAssignor` in production. With Eager rebalancing and 50 partitions, a rebalance can cause 30–60 second processing pauses under high load – unacceptable for fintech.

```java
@Bean
public DefaultErrorHandler kafkaErrorHandler(
        KafkaTemplate<String, Object> kafkaTemplate) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(kafkaTemplate,
            (record, ex) -> new TopicPartition(
                record.topic() + ".DLT", record.partition()));

    // Retry 3 times with 1s, 2s, 4s backoff before DLT
    ExponentialBackOffWithMaxRetries backoff =
        new ExponentialBackOffWithMaxRetries(3);
    backoff.setInitialInterval(1000);
    backoff.setMultiplier(2);

    return new DefaultErrorHandler(recoverer, backoff);
}

// DLT consumer for manual review/replay
@KafkaListener(topics = "orders.DLT", groupId = "dlt-handler")
public void handleDeadLetter(ConsumerRecord<?, ?> record,
        @Header(KafkaHeaders.EXCEPTION_MESSAGE) String exMessage) {
    log.error("DLT message: {} | Error: {}", record.value(), exMessage);
    alertingService.notify(record, exMessage);
}
```
| Semantic | Risk | How |
|---|---|---|
| At-most-once | Message loss | Commit offset before processing |
| At-least-once | Duplicate processing | Commit offset after processing (default) |
| Exactly-once | Complexity | Idempotent producer + transactional API |
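In practice, at-least-once delivery plus an idempotent consumer gives exactly-once *effects* without the transactional machinery: duplicates arrive, but they are detected and skipped. A minimal sketch (the in-memory `processed` set stands in for a Redis or database dedupe store keyed by message ID; all names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentConsumer {
    // Stand-in for a durable dedupe store keyed by message ID
    private final Set<String> processed = new HashSet<>();
    private int applied = 0;

    // At-least-once delivery may hand us the same message twice;
    // the business effect is applied only once
    public void onMessage(String messageId) {
        if (!processed.add(messageId)) {
            return; // duplicate – skip
        }
        applied++;
    }

    public int appliedCount() { return applied; }

    public static void main(String[] args) {
        IdempotentConsumer c = new IdempotentConsumer();
        c.onMessage("m-1");
        c.onMessage("m-1"); // redelivery after a missed offset commit
        c.onMessage("m-2");
        System.out.println(c.appliedCount()); // duplicates had no extra effect
    }
}
```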
```yaml
# application.yml
spring:
  kafka:
    producer:
      transaction-id-prefix: "tx-"
      acks: all
      properties:
        enable.idempotence: true
    consumer:
      isolation-level: read_committed  # Only read committed msgs
```

```java
// Transactional producer usage
@Transactional("kafkaTransactionManager")
public void processAndPublish(ConsumerRecord<?, ?> record) {
    // DB update + Kafka publish in ONE transaction
    orderRepo.updateStatus(record.value());
    kafkaTemplate.send("order.processed", record.value());
    // Both commit atomically or both roll back
}
```
Spring Security 6 (Spring Boot 3) removed WebSecurityConfigurerAdapter. A senior engineer must know the new component-based security configuration.
Key changes:
- Declare a `SecurityFilterChain` `@Bean` instead of extending a base class
- `http.authorizeHttpRequests(auth -> auth...)` replaces `authorizeRequests`
- `@EnableMethodSecurity` replaces `@EnableGlobalMethodSecurity`
- `antMatchers` is deprecated – use `requestMatchers`

```java
@Configuration
@EnableMethodSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        return http
            .csrf(AbstractHttpConfigurer::disable)  // Stateless JWT
            .sessionManagement(s -> s
                .sessionCreationPolicy(SessionCreationPolicy.STATELESS))
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/api/auth/**").permitAll()
                .requestMatchers("/api/admin/**").hasRole("ADMIN")
                .anyRequest().authenticated())
            .addFilterBefore(jwtFilter, UsernamePasswordAuthenticationFilter.class)
            .build();
    }

    @Bean
    public JwtAuthFilter jwtFilter() {
        return new JwtAuthFilter(jwtUtil, userDetailsService);
    }
}
```
Service A receives a JWT from the API Gateway. Service A calls Service B (via Feign/WebClient). How does B know who the original user is?
```java
@Component
public class FeignJwtInterceptor implements RequestInterceptor {

    @Override
    public void apply(RequestTemplate template) {
        ServletRequestAttributes attrs =
            (ServletRequestAttributes) RequestContextHolder.getRequestAttributes();
        if (attrs != null) {
            String token = attrs.getRequest().getHeader("Authorization");
            if (token != null) {
                template.header("Authorization", token);
                // Also propagate trace ID
                template.header("X-Trace-Id", MDC.get("traceId"));
            }
        }
    }
}
```
```text
Order Service DB ──► Kafka (OrderPlaced)  ──┐
User Service DB  ──► Kafka (UserUpdated)  ──┼──► Analytics Consumer ──► Snowflake ──► Reporting Dashboard (BI)
Payment DB       ──► Kafka (PaymentDone)  ──┘
```
```text
// Traditional DB – stores current balance
accounts: { id: 1, balance: 500 }

// Event Store – stores history
events: [
  { type: "AccountOpened",  amount: 1000, ts: 2024-01-01 },
  { type: "MoneyWithdrawn", amount: 200,  ts: 2024-01-02 },
  { type: "MoneyWithdrawn", amount: 300,  ts: 2024-01-03 }
]
// Balance = 1000 - 200 - 300 = 500 (replayed)

// Snapshots avoid full replay on large histories
snapshot: { balance: 800, afterEventSeq: 50 }
// Replay only events 51+ from snapshot
```
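Replay is just a left fold over the event history, and a snapshot only changes the starting accumulator. A self-contained sketch (the `Event` record and event-type names mirror the example above, not any particular event-store API; `MoneyDeposited` is an assumed extra type):

```java
import java.util.List;

public class EventReplay {
    record Event(String type, long amount) {}

    // Current state = fold over events after the snapshot point
    static long replay(long snapshotBalance, List<Event> eventsAfterSnapshot) {
        long balance = snapshotBalance;
        for (Event e : eventsAfterSnapshot) {
            balance = switch (e.type()) {
                case "AccountOpened", "MoneyDeposited" -> balance + e.amount();
                case "MoneyWithdrawn" -> balance - e.amount();
                default -> balance; // unknown events don't affect balance
            };
        }
        return balance;
    }

    public static void main(String[] args) {
        // Full replay from an empty (zero) snapshot
        long balance = replay(0, List.of(
            new Event("AccountOpened", 1000),
            new Event("MoneyWithdrawn", 200),
            new Event("MoneyWithdrawn", 300)));
        System.out.println(balance); // 1000 - 200 - 300 = 500
    }
}
```

With a snapshot, the same function is called with `snapshotBalance` and only the events recorded after `afterEventSeq`.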
| Pillar | What | Tools |
|---|---|---|
| Logs | Discrete events – what happened | ELK Stack, Loki |
| Metrics | Aggregated measurements – CPU, latency p99 | Prometheus + Grafana |
| Traces | Request journey across services – WHY it's slow | Zipkin, Jaeger, Tempo |
```xml
<!-- pom.xml dependencies -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-zipkin</artifactId>
</dependency>
```

```yaml
# application.yml
management:
  tracing:
    sampling:
      probability: 1.0  # 100% in dev, 0.01-0.1 in prod
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

# Auto-injected in logs:
# 2024-01-15 [order-service,traceId=abc123,spanId=def456] ...
```
```java
@Service
public class PaymentService {

    @Autowired
    private Tracer tracer;

    public PaymentResult processPayment(PaymentRequest req) {
        Span span = tracer.nextSpan()
            .name("payment.process")
            .tag("payment.method", req.getMethod())
            .tag("payment.amount", req.getAmount().toString())
            .start();
        try (var ws = tracer.withSpan(span)) {
            return gatewayClient.charge(req);
        } catch (Exception e) {
            span.error(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```
- `http_server_requests_seconds` – track p50, p95, p99 percentiles
- Error rate: `rate(http_server_requests_total{status=~"5.."}[5m])`
- Throughput: `rate(http_server_requests_total[1m])`

```yaml
# Alert: Error rate > 1% for 5 mins (SLO breach)
- alert: HighErrorRate
  expr: |
    rate(http_server_requests_total{status=~"5.."}[5m])
      / rate(http_server_requests_total[5m]) > 0.01
  for: 5m
  labels: { severity: critical }
  annotations:
    summary: "Error rate SLO breach on {{ $labels.job }}"

# Alert: p99 latency > 500ms
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_server_requests_seconds_bucket[5m])) > 0.5
```
```java
@Service
public class OrderService {

    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderService(MeterRegistry registry) {
        orderCounter = Counter.builder("orders.placed.total")
            .description("Total orders placed")
            .tag("env", "prod")
            .register(registry);
        orderTimer = Timer.builder("order.processing.duration")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }
}
```
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: order-service
          image: order-service:2.1.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"  # OOMKill if exceeded
              cpu: "500m"      # Throttled if exceeded
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 30
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
| Probe | Purpose | Action on Fail |
|---|---|---|
| Liveness | Is the app alive? (not deadlocked) | Restart container |
| Readiness | Can app serve traffic? (DB connected) | Remove from Service endpoints |
| Startup | Has app finished starting up? | Delay liveness checks (for slow starts) |
Spring Boot Actuator exposes `/actuator/health/liveness` and `/actuator/health/readiness`. The readiness probe automatically goes DOWN when the app is gracefully shutting down, removing it from load balancer rotation.

Strangler Fig in practice: route `/notifications/**` to the new service. The monolith's notification code still exists but is bypassed.

```yaml
# application.yml – one line to enable virtual threads!
spring:
  threads:
    virtual:
      enabled: true
```

```java
// Or programmatically
@Bean
public TomcatProtocolHandlerCustomizer<?> virtualThreads() {
    return handler -> handler.setExecutor(
        Executors.newVirtualThreadPerTaskExecutor());
}

// Structured Concurrency (Java 21 Preview)
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    // Fork parallel tasks
    var orderTask = scope.fork(() -> orderService.get(id));
    var userTask  = scope.fork(() -> userService.get(id));
    var payTask   = scope.fork(() -> paymentService.get(id));

    scope.join();           // Wait for all
    scope.throwIfFailed();  // Propagate any failure

    // Access results
    return new OrderSummary(
        orderTask.get(), userTask.get(), payTask.get());
}
```
| Aspect | Virtual Threads (Loom) | Reactive (WebFlux) |
|---|---|---|
| Code Style | Imperative – easy to read/debug | Reactive chains – complex |
| Debugging | Normal stack traces | Mangled reactive stack traces |
| Performance | Excellent for I/O-bound | Excellent for I/O-bound |
| CPU-bound | No benefit | No benefit |
| Migration | One config line | Full rewrite required |
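The I/O-bound row is easy to demonstrate: 1,000 blocking calls on virtual threads finish in roughly the time of one call, in plain imperative code with normal stack traces. A minimal sketch (requires Java 21; the 100ms sleep stands in for a blocking I/O call):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        Instant start = Instant.now();
        // 1,000 blocking "I/O calls" – one cheap virtual thread each.
        // A platform-thread pool this size would be prohibitively expensive.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 1_000).forEach(i ->
                executor.submit(() -> {
                    Thread.sleep(100); // simulated blocking I/O
                    return i;
                }));
        } // close() waits for all submitted tasks
        long ms = Duration.between(start, Instant.now()).toMillis();
        // Tasks ran concurrently: total wall time is far below 1,000 x 100ms
        System.out.println(ms < 5_000);
    }
}
```

For CPU-bound work, as the table notes, this buys nothing: there are still only as many carrier threads as cores.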
```shell
# Maven – build native image
mvn -Pnative native:compile

# Docker buildpack (no local GraalVM needed)
mvn spring-boot:build-image -Pnative

# Result: ~50ms startup vs 8s JVM
# Started OrderServiceApplication in 0.087 seconds
```
```java
@Service
public class RAGService {

    @Autowired
    private ChatClient chatClient;

    @Autowired
    private VectorStore vectorStore;

    public String askWithContext(String question) {
        // 1. Similarity search in vector store
        List<Document> relevant = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(5)
                .withSimilarityThreshold(0.7));

        // 2. Augment prompt with retrieved context
        String context = relevant.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n"));

        // 3. Call LLM with context
        return chatClient.prompt()
            .system("Answer based only on this context:\n" + context)
            .user(question)
            .call()
            .content();
    }
}
```
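Under the hood, a similarity threshold like the 0.7 above is typically cosine similarity between embedding vectors. A toy 2-D sketch of that math (the vectors are made up purely for illustration; real embeddings have hundreds or thousands of dimensions):

```java
public class CosineSimilarity {
    // Similarity search ranks documents by cosine similarity between
    // the question's embedding and each document's embedding
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] question     = {1.0, 0.0};
        double[] relevantDoc  = {0.9, 0.1};  // nearly same direction
        double[] unrelatedDoc = {0.0, 1.0};  // orthogonal
        // A 0.7 threshold would keep the first document and drop the second
        System.out.println(cosine(question, relevantDoc) > 0.7);
        System.out.println(cosine(question, unrelatedDoc) > 0.7);
    }
}
```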