Horizontal vs Vertical Scaling
Choosing the right strategy for your system's growth
🔼 Vertical Scaling (Scale Up)
- Add more CPU, RAM, SSD to a single server
- No code changes needed — just upgrade hardware
- Limited by physical hardware ceiling
- Single point of failure — if it dies, everything dies
- Expensive at high-end tiers (diminishing returns)
- Zero distributed coordination overhead
↔ Horizontal Scaling (Scale Out)
- Add more machines/nodes to a cluster
- Requires stateless design + load balancer
- Theoretically unlimited scale
- Better fault tolerance (one node dies → others serve)
- Complexity: network latency, consistency, coordination
- Cloud-native: spin up/down instances on demand
Here's how a Spring Boot microservice achieves horizontal scaling. The key is stateless design — no session stored in memory, state lives in Redis/DB.
[Diagram: Load Balancer → Instance 1 / Instance 2 / … / Instance N → Shared State (Redis/DB)]
```yaml
# application.yml — Stateless session via Redis
spring:
  session:
    store-type: redis        # ← Store session in Redis, not in-memory
  data:
    redis:
      host: redis.example.com
      port: 6379

server:
  port: 8080

# Now ANY instance can handle ANY request.
# The load balancer can route freely — no sticky sessions needed.
```
```java
@RestController
@RequestMapping("/api/orders")
public class OrderController {

    @Autowired
    private OrderService orderService;

    // ✅ STATELESS — no instance-level state
    // All state is in DB/Redis — safe for horizontal scaling
    @GetMapping("/{orderId}")
    public ResponseEntity<Order> getOrder(@PathVariable Long orderId) {
        return ResponseEntity.ok(orderService.findById(orderId));
    }

    // ❌ BAD — storing state in an instance variable:
    // private List<Order> cache = new ArrayList<>();
    // If the ALB routes the next request to a different instance — cache miss!
}
```
Don't just say "horizontal scaling is better" — talk about trade-offs. Vertical scaling is a good fit for a PostgreSQL primary: you can't easily shard a relational DB, so you scale it up. Horizontal scaling works best for stateless compute layers (API servers, workers). Show that you pick the strategy per component.
CAP Theorem
The fundamental trade-off every distributed system must make
In any distributed system, you can only guarantee 2 out of 3 properties simultaneously:
Consistency + Partition Tolerance
All nodes return the same (latest) data. System may reject requests during a network partition to preserve consistency.
Availability + Partition Tolerance
System always responds but data might be stale across nodes. Nodes become eventually consistent after partition heals.
Consistency + Availability
Not realistic in distributed systems. Network partitions are inevitable — this only works on a single-node system.
A bank account balance MUST be consistent. If a network partition happens, we return an error rather than serve stale data. Availability is sacrificed for correctness.
```java
@Service
@Transactional(isolation = Isolation.SERIALIZABLE) // Strongest isolation
public class BankTransferService {

    @Autowired
    private AccountRepository accountRepo;

    /**
     * CP Strategy: Acquire a pessimistic lock on both accounts.
     * If the DB is unreachable (partition), the transaction fails with an exception.
     * ✅ Consistent ✅ Partition-tolerant ❌ Unavailable during partition
     */
    public void transfer(Long fromId, Long toId, BigDecimal amount) {
        // Pessimistic locks — concurrent transfers on the same accounts must wait
        Account from = accountRepo.findByIdWithLock(fromId)
                .orElseThrow(AccountNotFoundException::new);
        Account to = accountRepo.findByIdWithLock(toId)
                .orElseThrow(AccountNotFoundException::new);

        if (from.getBalance().compareTo(amount) < 0)
            throw new InsufficientFundsException();

        from.debit(amount);
        to.credit(amount);
        accountRepo.saveAll(List.of(from, to));
        // If any step fails → the entire transaction rolls back (ACID)
    }
}

// Repository with pessimistic lock
public interface AccountRepository extends JpaRepository<Account, Long> {
    @Lock(LockModeType.PESSIMISTIC_WRITE)
    @Query("SELECT a FROM Account a WHERE a.id = :id")
    Optional<Account> findByIdWithLock(@Param("id") Long id);
}
```
A social media feed can show slightly stale data. It's far better to always respond (even with 5-second-old posts) than to refuse requests during a partition.
```java
@Service
public class FeedService {

    @Autowired
    private FeedRepository feedRepo;                     // DynamoDB-backed

    @Autowired
    private RedisTemplate<String, Object> redisTemplate; // Read cache

    /**
     * AP Strategy: Read from cache first, fall back to the DB.
     * If the DB is partitioned, serve cached (possibly stale) data.
     * ✅ Always available ✅ Partition-tolerant ❌ May serve stale posts
     */
    @SuppressWarnings("unchecked")
    public List<Post> getFeed(String userId) {
        String cacheKey = "feed:" + userId;

        // Try cache first
        List<Post> cached = (List<Post>) redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return cached;

        try {
            List<Post> posts = feedRepo.getRecentPosts(userId);
            redisTemplate.opsForValue().set(cacheKey, posts, Duration.ofSeconds(30));
            return posts;
        } catch (DataAccessException e) {
            // DB unreachable during partition — degrade to an empty feed, don't fail.
            // Availability over consistency: an empty feed beats a 500 error.
            return Collections.emptyList();
        }
    }
}
```
Network partitions WILL happen in any distributed system. Packets get dropped. Nodes go down. You can never eliminate P. So the real choice is always: CP (sacrifice Availability) vs AP (sacrifice Consistency). If an interviewer says "what about CA?", explain that it's only valid for single-node systems — not distributed architectures.
Consistency Models
From strongest to weakest — pick based on business requirements
| Model | Guarantee | Trade-off | Real Use Case |
|---|---|---|---|
| Strong | Every read returns the latest write — no exceptions. All nodes in sync. | High latency — must wait for replicas (typically a quorum) to acknowledge before responding. | Banking ledgers, financial transactions, inventory counts |
| Causal | If A caused B, all nodes see A before B. Unrelated operations can be in any order. | Moderate latency — only causal chains are ordered, not all operations. | Chat apps (replies after messages), comment threads, collaborative editors |
| Read-Your-Writes | After you write, your own subsequent reads reflect that write immediately. | Other users may still see old data momentarily. Sticky routing needed. | User profile updates, settings changes, shopping cart |
| Eventual | Given no new writes, all replicas will converge to same value — eventually. | Reads may return stale data for milliseconds to seconds. | DynamoDB, Cassandra, DNS, social media likes/views, CDN |
```java
@Service
public class UserProfileService {

    @Autowired
    private UserRepository repo;               // Read replicas (eventual)

    @Autowired
    private RedisTemplate<String, User> redis; // Write-through cache

    /**
     * Read-Your-Writes Consistency:
     * When a user updates their own profile, we immediately write to Redis.
     * The next read returns from Redis (fresh), not from a replica (stale).
     */
    public void updateProfile(String userId, UpdateProfileRequest req) {
        User user = repo.findById(userId).orElseThrow();
        user.update(req);
        repo.save(user); // Write to the primary DB

        // Immediately cache the fresh copy — Read-Your-Writes guarantee
        redis.opsForValue().set("user:" + userId, user, Duration.ofMinutes(5));
    }

    public User getProfile(String userId) {
        // Read from Redis first (your fresh write will be here)
        User cached = redis.opsForValue().get("user:" + userId);
        if (cached != null) return cached; // Cache hit — fresh data

        // Cache miss — read from a DB replica (may be slightly stale for others)
        return repo.findById(userId).orElseThrow();
    }
}
```
In a chat application, if User A sends "Hello" and then "How are you?", all consumers must see "Hello" before "How are you?". Kafka guarantees this within a partition — use the same conversation_id as the partition key. All messages of the same conversation go to the same partition and are consumed in order.
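The key-to-partition mapping above can be sketched in a few lines. This is a toy stand-in, not Kafka's actual partitioner (Kafka's default hashes the key bytes with murmur2); `String.hashCode()` is used here purely to illustrate the property that matters — the same key always lands on the same partition.

```java
public class KeyedPartitioning {

    // Same key → same partition, which is what gives per-conversation ordering.
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String conversationId, int numPartitions) {
        return Math.floorMod(conversationId.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int p1 = partitionFor("conversation-42", 8);
        int p2 = partitionFor("conversation-42", 8);
        // Both messages of the conversation land on the same partition
        System.out.println(p1 == p2);
    }
}
```

Note the flip side: ordering is only guaranteed *within* a partition, so messages from different conversations may interleave freely — which is exactly what you want for throughput.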
ACID vs BASE
Transactional correctness vs distributed scalability
🔒 ACID — Traditional Databases
- Atomicity — All operations succeed or all rollback. No partial writes.
- Consistency — DB moves from one valid state to another. Constraints enforced.
- Isolation — Concurrent transactions don't interfere with each other.
- Durability — Once committed, data survives crashes (write-ahead log).
🚀 BASE — Distributed Systems
- Basically Available — System is always up; occasional partial failures OK.
- Soft State — Data may change over time without input (replication lag).
- Eventually Consistent — All nodes will converge to same value, eventually.
- Trade correctness for scalability and availability.
```java
@Service
public class OrderService {

    @Autowired private OrderRepository orderRepo;
    @Autowired private InventoryRepository inventoryRepo;
    @Autowired private PaymentRepository paymentRepo;

    /**
     * ATOMICITY:   All 3 steps succeed or ALL are rolled back.
     * CONSISTENCY: Stock never goes negative (enforced by a DB constraint).
     * ISOLATION:   Concurrent orders don't see each other's uncommitted writes.
     *              (READ_COMMITTED alone doesn't stop two orders reading the same
     *              stock value — pair it with a CHECK constraint or a lock.)
     * DURABILITY:  After commit, data survives a server crash.
     */
    @Transactional(isolation = Isolation.READ_COMMITTED,
                   propagation = Propagation.REQUIRED,
                   rollbackFor = Exception.class)
    public Order placeOrder(OrderRequest req) {
        // Step 1: Reserve inventory (Atomicity — if this fails, nothing persists)
        Inventory inv = inventoryRepo.findByProductId(req.getProductId());
        if (inv.getStock() < req.getQuantity())
            throw new InsufficientStockException(); // → triggers rollback
        inv.decrementStock(req.getQuantity());
        inventoryRepo.save(inv);

        // Step 2: Create the order record
        Order order = new Order(req, OrderStatus.PENDING);
        orderRepo.save(order);

        // Step 3: Create the payment record
        Payment payment = new Payment(order.getId(), req.getAmount());
        paymentRepo.save(payment);

        // ✅ All 3 steps committed atomically
        return order;
    }
}
```
READ_UNCOMMITTED → fastest, dirty reads possible. READ_COMMITTED → most common, no dirty reads. REPEATABLE_READ → prevents non-repeatable reads (MySQL default). SERIALIZABLE → strongest, fully sequential, prevents phantom reads but slowest. For most Spring Boot APIs, READ_COMMITTED is the right default.
For high-scale systems (millions of writes/sec), you embrace BASE. Example: storing product view counts or shopping cart data where exact consistency isn't critical.
```java
@Service
public class ViewCounterService {

    @Autowired private DynamoDbClient dynamoDb;
    @Autowired private StringRedisTemplate redisTemplate; // needed by getCount()

    /**
     * BASE Pattern: We don't care if the view count is 1,001,234 or 1,001,235.
     * Basically Available:   We always accept view events.
     * Soft State:            The count may be slightly off during replication.
     * Eventually Consistent: Counts converge across replicas.
     */
    public void incrementViewCount(String productId) {
        // Atomic increment — DynamoDB ADD operation
        UpdateItemRequest request = UpdateItemRequest.builder()
                .tableName("ProductStats")
                .key(Map.of("productId", AttributeValue.fromS(productId)))
                .updateExpression("ADD viewCount :inc")
                .expressionAttributeValues(Map.of(":inc", AttributeValue.fromN("1")))
                .build();
        dynamoDb.updateItem(request);
        // Fire and forget — no ACID guarantee needed here.
        // The count is eventually consistent across read replicas.
    }

    // For exact counts that matter — Redis INCR is atomic (single-threaded)
    public Long getCount(String productId) {
        return redisTemplate.opsForValue().increment("views:" + productId, 0L);
    }
}
```
Monolith → Microservices Migration
The Strangler Pattern, bounded contexts, and when NOT to migrate
⚠️ When to Break a Monolith
- Codebase >500K LOC — slow compilations, slow deploys
- Teams blocking each other on the same codebase
- One module needs 10× more compute than others
- Different parts need different tech stacks
- Deploy cycles slow due to tight coupling
- Production outages caused by unrelated changes
✅ When to KEEP the Monolith
- Early-stage startup — move fast, iterate
- Team smaller than 5-6 engineers
- Traffic fits comfortably on one instance
- Domain is not well-understood yet (premature abstraction)
- Operational complexity of microservices isn't justified
- Martin Fowler's rule: "Don't start with microservices"
Named after the strangler fig tree that grows around a host tree — you gradually build the new system around the old one until the old one can be removed.
Map your domain into bounded contexts: Order, Payment, Inventory, User, Notification. Each context becomes a candidate microservice.
Put Spring Cloud Gateway / AWS API Gateway in front. Initially it routes 100% to monolith. New services will get their routes added here.
Start with a read-heavy, isolated module (e.g., Product Catalog). Build the new Spring Boot microservice, redirect /products/** in gateway.
When Order Service creates an order, publish order.created Kafka event. Inventory and Notification services consume it — no direct HTTP coupling.
As each service is extracted, the monolith shrinks. Eventually the monolith has 0 responsibilities — decommission it.
```yaml
# application.yml — Spring Cloud Gateway
spring:
  cloud:
    gateway:
      routes:
        # Route 1: Extracted Product Service (new microservice)
        - id: product-service
          uri: lb://product-service      # lb:// = Eureka load-balanced
          predicates:
            - Path=/api/products/**
          filters:
            - StripPrefix=1

        # Route 2: Extracted User Service
        - id: user-service
          uri: lb://user-service
          predicates:
            - Path=/api/users/**

        # Fallback: Everything else goes to the monolith (still running!)
        - id: legacy-monolith
          uri: http://monolith.internal:8080
          predicates:
            - Path=/**                   # Catch-all
```
Stateless Services
The foundational principle enabling horizontal scaling in Spring Boot
Instead of storing session in server memory (stateful), use JWT tokens. The token carries all user information. Any instance can validate it without consulting a central session store.
```java
@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Autowired
    private JwtAuthFilter jwtFilter;

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        return http
            .csrf(AbstractHttpConfigurer::disable)
            // ✅ CRITICAL: SessionCreationPolicy.STATELESS
            // Spring Security will NOT create an HttpSession.
            // No server-side session — safe for horizontal scaling.
            .sessionManagement(sm -> sm
                .sessionCreationPolicy(SessionCreationPolicy.STATELESS))
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/api/auth/**").permitAll()
                .anyRequest().authenticated())
            // Add the JWT filter before UsernamePasswordAuthenticationFilter
            .addFilterBefore(jwtFilter, UsernamePasswordAuthenticationFilter.class)
            .build();
    }
}

@Component
public class JwtAuthFilter extends OncePerRequestFilter {

    @Autowired
    private JwtUtil jwtUtil;

    @Override
    protected void doFilterInternal(HttpServletRequest req, HttpServletResponse res,
                                    FilterChain chain) throws ServletException, IOException {
        String header = req.getHeader("Authorization");
        if (header != null && header.startsWith("Bearer ")) {
            String token = header.substring(7);
            if (jwtUtil.isValid(token)) {
                // Extract claims from the token — NO DB lookup needed!
                // Any instance can do this with the same secret key.
                String username = jwtUtil.extractUsername(token);
                List<String> roles = jwtUtil.extractRoles(token);
                UsernamePasswordAuthenticationToken auth =
                        new UsernamePasswordAuthenticationToken(
                                username, null, mapRoles(roles));
                SecurityContextHolder.getContext().setAuthentication(auth);
            }
        }
        chain.doFilter(req, res);
    }

    private List<SimpleGrantedAuthority> mapRoles(List<String> roles) {
        return roles.stream().map(SimpleGrantedAuthority::new).toList();
    }
}
```
Database per Service Pattern
Polyglot persistence and loose coupling in microservices
No shared database between services. Services communicate only via APIs or events — never via direct DB queries on another service's schema.
```java
// Order Service — publishes events, never queries another service's DB
@Service
public class OrderSagaOrchestrator {

    @Autowired private KafkaTemplate<String, Object> kafka;
    @Autowired private OrderRepository orderRepo;

    /**
     * SAGA: Distributed transaction without cross-service DB access
     * Step 1: Create order (PENDING) in the Order DB
     * Step 2: Publish event → Payment Service consumes
     * Step 3: Payment success/fail event → Inventory Service
     * Step 4: If any step fails → compensating transaction (rollback via events)
     */
    @Transactional
    public Order createOrder(OrderRequest req) {
        Order order = new Order(req, OrderStatus.PENDING);
        orderRepo.save(order); // Save in OWN DB only

        // Publish event — Payment Service will pick it up
        kafka.send("order.created", new OrderCreatedEvent(
                order.getId(), req.getUserId(), req.getAmount(), req.getProductId()));
        return order;
    }

    // Listen for results from Payment Service
    @KafkaListener(topics = "payment.completed")
    public void onPaymentCompleted(PaymentCompletedEvent event) {
        Order order = orderRepo.findById(event.getOrderId()).orElseThrow();
        order.setStatus(OrderStatus.CONFIRMED);
        orderRepo.save(order);

        // Trigger the next step in the saga
        kafka.send("inventory.reserve", new ReserveInventoryEvent(
                order.getId(), event.getProductId(), event.getQuantity()));
    }

    @KafkaListener(topics = "payment.failed")
    public void onPaymentFailed(PaymentFailedEvent event) {
        // COMPENSATING TRANSACTION — roll back the order
        Order order = orderRepo.findById(event.getOrderId()).orElseThrow();
        order.setStatus(OrderStatus.CANCELLED);
        orderRepo.save(order);

        // Publish order.cancelled so other services can compensate too
        kafka.send("order.cancelled", new OrderCancelledEvent(order.getId()));
    }
}
```
API Gateway
Single entry point — Spring Cloud Gateway with filters, rate limiting, auth
```yaml
# application.yml — Spring Cloud Gateway
spring:
  cloud:
    gateway:
      default-filters:
        - DedupeResponseHeader=Access-Control-Allow-Credentials Access-Control-Allow-Origin
        - name: RequestRateLimiter
          args:
            redis-rate-limiter.replenishRate: 10   # 10 req/sec per user
            redis-rate-limiter.burstCapacity: 20   # Allow a burst of 20
            key-resolver: "#{@userKeyResolver}"
      routes:
        - id: order-service
          uri: lb://ORDER-SERVICE
          predicates:
            - Path=/api/orders/**
          filters:
            - RewritePath=/api/orders/(?<segment>.*), /orders/${segment}
            - name: CircuitBreaker                 # Resilience4j circuit breaker
              args:
                name: orderServiceCB
                fallbackUri: forward:/fallback/orders
        - id: product-service
          uri: lb://PRODUCT-SERVICE
          predicates:
            - Path=/api/products/**
          filters:
            - name: Retry
              args:
                retries: 3
                statuses: BAD_GATEWAY
```
```java
@Component
public class JwtGatewayFilter implements GlobalFilter, Ordered {

    @Autowired
    private JwtUtil jwtUtil;

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
        ServerHttpRequest request = exchange.getRequest();
        String path = request.getPath().value();

        // Skip auth for public endpoints
        if (path.startsWith("/api/auth/")) return chain.filter(exchange);

        String token = extractToken(request);
        if (token == null || !jwtUtil.isValid(token)) {
            exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
            return exchange.getResponse().setComplete();
        }

        // Forward user identity headers to downstream services
        String userId = jwtUtil.extractUserId(token);
        ServerHttpRequest mutatedReq = request.mutate()
                .header("X-User-Id", userId)
                .header("X-User-Roles", jwtUtil.extractRoles(token).toString())
                .build();
        return chain.filter(exchange.mutate().request(mutatedReq).build());
    }

    @Override
    public int getOrder() { return -1; } // Runs before the other filters
}
```
Handling Eventual Consistency
Idempotency, Saga, Outbox Pattern, DLQ — making distributed systems reliable
The #1 production problem: a service commits to its DB successfully but fails to publish to Kafka (the dual-write problem). The Outbox Pattern solves this — make message publishing part of the same DB transaction.
```java
// Entity: outbox_events table
@Entity
@Table(name = "outbox_events")
public class OutboxEvent {
    @Id @GeneratedValue private UUID id;
    private String topic;            // "order.created"
    private String aggregateId;      // orderId
    private String payload;          // JSON event body
    private OutboxStatus status;     // PENDING / PUBLISHED
    private LocalDateTime createdAt;
}

@Service
public class OrderService {

    @Autowired private OrderRepository orderRepo;
    @Autowired private OutboxRepository outboxRepo;
    @Autowired private ObjectMapper mapper;

    @Transactional // BOTH writes in ONE DB transaction
    public Order createOrder(OrderRequest req) throws JsonProcessingException {
        Order order = orderRepo.save(new Order(req));

        // Write to the outbox table IN THE SAME TRANSACTION as the order.
        // If Kafka is down — no problem, the outbox record is safe in the DB.
        outboxRepo.save(new OutboxEvent("order.created", order.getId().toString(),
                mapper.writeValueAsString(new OrderCreatedEvent(order))));
        return order;
    }
}

// Scheduled poller publishes pending outbox events
@Component
public class OutboxPoller {

    @Autowired private OutboxRepository outboxRepo;
    @Autowired private KafkaTemplate<String, String> kafka;

    @Scheduled(fixedDelay = 1000) // Every 1 second
    public void publishPending() {
        outboxRepo.findByStatus(OutboxStatus.PENDING).forEach(event -> {
            try {
                kafka.send(event.getTopic(), event.getAggregateId(), event.getPayload())
                     .get(); // Synchronous wait for the broker ack
                event.setStatus(OutboxStatus.PUBLISHED);
                outboxRepo.save(event);
            } catch (Exception e) {
                // Retry on the next poll — Kafka may be temporarily down
                log.warn("Failed to publish outbox event {}", event.getId());
            }
        });
    }
}
```
```java
@KafkaListener(topics = "order.created", groupId = "inventory-group")
public void onOrderCreated(OrderCreatedEvent event,
                           @Header(KafkaHeaders.RECEIVED_KEY) String key) {
    // Idempotency: Check whether we already processed this event.
    // Redelivery MUST NOT cause duplicate stock deductions.
    if (processedEventRepo.existsByEventId(event.getEventId())) {
        log.info("Duplicate event {} — skipping", event.getEventId());
        return; // Safe to ignore — already processed
    }

    inventoryService.reserveStock(event.getProductId(), event.getQuantity());

    // Mark as processed (upsert by eventId)
    processedEventRepo.save(new ProcessedEvent(event.getEventId()));
}

// Dead Letter Queue config — failed messages go there after 3 retries
@Bean
public DefaultErrorHandler kafkaErrorHandler(KafkaOperations<?, ?> template) {
    DeadLetterPublishingRecoverer dlq = new DeadLetterPublishingRecoverer(template,
            (r, e) -> new TopicPartition(r.topic() + ".DLQ", r.partition()));
    ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1000L);
    backOff.setMultiplier(2.0); // 1s → 2s → 4s
    return new DefaultErrorHandler(dlq, backOff);
}
```
Availability & Latency Targets
99.99% uptime, <200ms p99 — how to actually achieve them
| SLA | Downtime/Year | Downtime/Month | How to Achieve |
|---|---|---|---|
| 99% | 3.65 days | 7.2 hours | Single region, basic monitoring |
| 99.9% | 8.77 hours | 43.8 minutes | Health checks, auto-restart, single AZ redundancy |
| 99.99% | 52.6 minutes | 4.4 minutes | Multi-AZ, active-active, circuit breakers, chaos engineering |
| 99.999% | 5.26 minutes | 26 seconds | Multi-region, zero-downtime deploys, Netflix-level ops |
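The downtime figures in the table fall out of simple arithmetic — and chaining services multiplies the pain: three 99.9% services called in series give only 0.999³ ≈ 99.70% end to end. A quick sketch with illustrative numbers:

```java
public class AvailabilityMath {

    // Allowed downtime per year for a given availability target
    // (using an average year of 365.25 days)
    static double downtimeHoursPerYear(double availability) {
        return (1 - availability) * 365.25 * 24;
    }

    public static void main(String[] args) {
        // 99.99% alone → roughly 52.6 minutes/year (matches the table)
        System.out.printf("99.99%% -> %.1f min/year%n",
                downtimeHoursPerYear(0.9999) * 60);

        // Serial composition: availabilities multiply
        double chain = Math.pow(0.999, 3); // Gateway -> Order -> Payment
        System.out.printf("Three-9s chain: %.2f%% (about %.0f h/year down)%n",
                chain * 100, downtimeHoursPerYear(chain));
    }
}
```

This is why each extra synchronous hop in a call chain quietly lowers your effective SLA — and why circuit breakers and async fallbacks (below) matter so much.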
```yaml
# application.yml
resilience4j:
  circuitbreaker:
    instances:
      payment-service:
        slidingWindowSize: 10                    # Evaluate the last 10 calls
        failureRateThreshold: 50                 # Open if ≥50% fail
        waitDurationInOpenState: 10000           # Wait 10s before half-open
        permittedNumberOfCallsInHalfOpenState: 3
        registerHealthIndicator: true
  retry:
    instances:
      payment-service:
        maxAttempts: 3
        waitDuration: 500ms
        retryExceptions:
          - java.io.IOException
          - org.springframework.web.client.HttpServerErrorException
  ratelimiter:
    instances:
      payment-service:
        limitForPeriod: 100                      # 100 calls per refresh period
        limitRefreshPeriod: 1s
        timeoutDuration: 500ms                   # Wait at most 500ms for a permit
```
```java
@Service
public class PaymentClient {

    @Autowired private RestTemplate restTemplate;
    @Autowired private KafkaTemplate<String, Object> kafka; // used by the fallback

    /**
     * Circuit Breaker states:
     * CLOSED    → Normal operation, all calls go through
     * OPEN      → ≥50% failures — calls SHORT-CIRCUIT to the fallback immediately
     * HALF-OPEN → Test with a few calls — if OK, go back to CLOSED
     */
    @CircuitBreaker(name = "payment-service", fallbackMethod = "paymentFallback")
    @Retry(name = "payment-service")
    @RateLimiter(name = "payment-service")
    public PaymentResponse processPayment(PaymentRequest req) {
        return restTemplate.postForObject(
                "http://payment-service/api/payments", req, PaymentResponse.class);
    }

    // Fallback — called when the circuit is OPEN or all retries are exhausted
    public PaymentResponse paymentFallback(PaymentRequest req, Exception e) {
        log.error("Payment service unavailable. Queuing payment: {}", req.getOrderId());
        // Graceful degradation: queue for async processing
        kafka.send("payment.pending", req);
        return PaymentResponse.builder()
                .status(PaymentStatus.QUEUED)
                .message("Payment queued — will process shortly")
                .build();
    }
}
```
Spring Cloud Microservices Stack
Service discovery, config server, distributed tracing — production patterns
```java
// Eureka Server (dedicated service)
@SpringBootApplication
@EnableEurekaServer
public class ServiceRegistryApplication {
    public static void main(String[] args) {
        SpringApplication.run(ServiceRegistryApplication.class, args);
    }
}

// Microservice client registration
@SpringBootApplication
@EnableDiscoveryClient
public class OrderServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }

    // Load-balanced RestTemplate — service-name URLs resolve via Eureka
    @Bean
    @LoadBalanced
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

// Calling another service by name (not a hardcoded URL!)
@Service
public class OrderServiceImpl {

    @Autowired private RestTemplate restTemplate; // the @LoadBalanced one

    public Product getProduct(Long productId) {
        // Eureka resolves "PRODUCT-SERVICE" to an actual IP/port
        // and load-balances across multiple instances
        return restTemplate.getForObject(
                "http://PRODUCT-SERVICE/api/products/" + productId, Product.class);
    }
}
```
```yaml
# Dependencies: micrometer-tracing-bridge-brave, zipkin-reporter-brave
spring:
  application:
    name: order-service              # Appears in Zipkin traces

management:
  tracing:
    sampling:
      probability: 1.0               # 100% in dev; use e.g. 0.1 in prod
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

# Traces propagate automatically across REST + Kafka calls.
# The trace ID threads through: Gateway → Order → Payment → Inventory.
# Find the slowest service easily in the Zipkin UI.
```
Kafka + Spring Boot
Event-driven microservices — producers, consumers, partitioning, ordering
```java
@Configuration
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, Object> producerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);

        // Idempotent producer — exactly-once semantics at the Kafka level
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // acks=all: wait for all in-sync replicas to acknowledge
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry with backoff
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);
        return new DefaultKafkaProducerFactory<>(props);
    }
}

// Publishing with a partition key for the ordering guarantee
@Service
public class EventPublisher {

    @Autowired private KafkaTemplate<String, Object> kafka;

    public void publishOrderEvent(Order order) {
        OrderCreatedEvent event = new OrderCreatedEvent(order);

        // KEY = orderId → all events for the same order go to the SAME PARTITION.
        // Guarantees ordering: CREATED → PAID → SHIPPED → DELIVERED
        kafka.send("order.events", order.getId().toString(), event)
             .whenComplete((result, ex) -> {
                 if (ex != null) log.error("Failed to publish event", ex);
                 else log.info("Published to partition {}",
                         result.getRecordMetadata().partition());
             });
    }
}
```
Docker — Containerizing Spring Boot
Multi-stage builds, Docker Compose for local dev, best practices
Multi-stage builds keep the final image small (no JDK, no Maven in prod image) and secure (minimal attack surface). Builder stage compiles; runtime stage only runs.
```dockerfile
# ────────────────────────────────────────
# STAGE 1: Build
# ────────────────────────────────────────
FROM eclipse-temurin:21-jdk-alpine AS builder
WORKDIR /workspace

# Copy Maven wrapper and pom.xml first (layer caching for dependencies)
COPY mvnw .
COPY .mvn .mvn
COPY pom.xml .

# Download dependencies (cached layer — only re-runs if pom.xml changes)
RUN ./mvnw dependency:go-offline -q

# Copy source and build
COPY src src
RUN ./mvnw package -DskipTests -q

# ────────────────────────────────────────
# STAGE 2: Extract layers (Spring Boot 3 layered jar)
# ────────────────────────────────────────
FROM eclipse-temurin:21-jre-alpine AS extractor
WORKDIR /workspace
COPY --from=builder /workspace/target/*.jar app.jar
RUN java -Djarmode=layertools -jar app.jar extract

# ────────────────────────────────────────
# STAGE 3: Runtime (smallest possible image)
# ────────────────────────────────────────
FROM eclipse-temurin:21-jre-alpine
WORKDIR /app

# Non-root user for security
RUN addgroup -S spring && adduser -S spring -G spring
USER spring:spring

# Layered copy — only changed layers re-download on deploy
COPY --from=extractor /workspace/dependencies/ ./
COPY --from=extractor /workspace/spring-boot-loader/ ./
COPY --from=extractor /workspace/snapshot-dependencies/ ./
COPY --from=extractor /workspace/application/ ./

EXPOSE 8080

# JVM tuning for containers
ENV JAVA_OPTS="-XX:MaxRAMPercentage=75.0 -XX:+UseContainerSupport -XX:+UseG1GC"
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS org.springframework.boot.loader.launch.JarLauncher"]
```
```yaml
version: '3.9'

services:
  # ── Infrastructure ──────────────────────────────────────
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: ordersdb
    ports: ["5432:5432"]
    volumes: [postgres_data:/var/lib/postgresql/data]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin"]
      interval: 5s

  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    command: redis-server --appendonly yes

  kafka:
    image: confluentinc/cp-kafka:7.6.0
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on: [zookeeper]
    ports: ["9092:9092"]

  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  # ── Services ────────────────────────────────────────────
  api-gateway:
    build: ./api-gateway
    ports: ["8080:8080"]
    environment:
      EUREKA_URI: http://service-registry:8761/eureka
    depends_on: [service-registry]

  order-service:
    build: ./order-service
    environment:
      SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/ordersdb
      SPRING_KAFKA_BOOTSTRAP_SERVERS: kafka:9092
      EUREKA_URI: http://service-registry:8761/eureka
    depends_on:
      postgres: { condition: service_healthy }
      kafka: { condition: service_started }
    deploy:
      replicas: 2   # 2 instances for a local HA test

  service-registry:
    build: ./service-registry
    ports: ["8761:8761"]

volumes:
  postgres_data:
```
Kubernetes — Deploying Spring Boot Microservices
Pods, Deployments, Services, HPA, ConfigMaps, Health Probes
- Pod — Smallest deployable unit. Contains 1+ containers. Ephemeral — if a Pod dies, K8s creates a new one.
- Deployment — Manages ReplicaSets. Declares desired state: "run 3 replicas of order-service:v2". K8s enforces it.
- Service — Stable DNS name + virtual IP for a set of Pods. Types: ClusterIP, NodePort, LoadBalancer.
- ConfigMap — Key-value config injected as env vars or mounted files. Decouples config from the image.
- Secret — Sensitive data (passwords, tokens) stored base64-encoded — that's encoding, not encryption. Injected like a ConfigMap; enable etcd encryption at rest for real protection.
- HPA — Horizontal Pod Autoscaler — scales replicas automatically based on CPU/memory or custom metrics (e.g., Kafka lag).
- Ingress — HTTP routing rules. Maps external URLs to internal Services. Acts like an API Gateway at the infra level.
- Namespace — Virtual cluster within a cluster. Isolates dev/staging/prod environments in the same K8s cluster.
- PersistentVolume — Storage that survives Pod restarts. Backed by EBS, EFS, NFS. Required for stateful apps (DBs).
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
  labels:
    app: order-service
    version: v2.1.0
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate          # Zero-downtime deploys
    rollingUpdate:
      maxUnavailable: 1          # At most 1 Pod down during an update
      maxSurge: 1                # At most 1 extra Pod during an update
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: myregistry.ecr.aws/order-service:v2.1.0
          ports:
            - containerPort: 8080
          # Resource limits — prevent one Pod from starving others
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"        # 0.25 CPU core
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Spring Boot Actuator health probes
          livenessProbe:         # K8s restarts the Pod if this fails
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10
          readinessProbe:        # K8s stops sending traffic until ready
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 5
          # Config from ConfigMap + Secrets
          envFrom:
            - configMapRef:
                name: order-service-config
            - secretRef:
                name: order-service-secrets
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: production
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP                # Internal only — the Gateway routes to this
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20                # Auto-scales up to 20 Pods
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60 # Scale up if CPU >60%
```
```yaml
# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true        # Exposes /actuator/health/liveness + /readiness
      show-details: always
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
    db:
      enabled: true          # Checks DB connection
    redis:
      enabled: true          # Checks Redis connection
    kafka:
      enabled: true          # Checks Kafka connection
```

```java
// Custom health indicator for a downstream dependency
@Component
public class PaymentServiceHealthIndicator implements HealthIndicator {

    @Autowired
    private PaymentClient paymentClient;

    @Override
    public Health health() {
        try {
            paymentClient.ping();  // Quick health check call
            return Health.up()
                .withDetail("status", "Payment service reachable")
                .build();
        } catch (Exception e) {
            // If the payment service is down, mark this Pod as NOT READY.
            // K8s will stop routing traffic here until it recovers.
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}
```
AWS Services for Microservices
EKS, ECS, ALB, RDS, ElastiCache, MSK, SQS, SNS — mapped to Spring Boot
| AWS Service | Replaces | Spring Boot Integration | Key Config |
|---|---|---|---|
| EKS | Self-managed K8s | Deploy via kubectl, Helm charts | Node groups, Fargate profiles |
| ECS + Fargate | EKS (simpler) | Task definitions, service auto-scaling | No server management needed |
| RDS PostgreSQL (Multi-AZ) | Self-hosted DB | spring.datasource.url = RDS endpoint | Read replicas for read scaling |
| ElastiCache Redis (Cluster) | Self-hosted Redis | spring.data.redis.host = ElastiCache endpoint | Multi-AZ, automatic failover |
| MSK (Managed Kafka) | Self-hosted Kafka | spring.kafka.bootstrap-servers = MSK brokers | IAM auth, in-transit encryption |
| SQS + SNS | Kafka (simpler) | spring-cloud-aws-messaging | Fan-out: SNS topic → multiple SQS queues |
| ALB | Nginx / HAProxy | Ingress controller annotations | Path-based routing, SSL termination |
| Secrets Manager | ConfigMap Secrets | spring-cloud-aws-secrets-manager | Auto-rotation, accessed via SDK |
| CloudWatch | ELK Stack | AWS CloudWatch Logs appender | Container Insights for K8s metrics |
| ECR | Docker Hub | Push images with docker push after aws ecr get-login-password login | IAM-based auth, private registry |
```xml
<!-- pom.xml dependency -->
<dependency>
  <groupId>io.awspring.cloud</groupId>
  <artifactId>spring-cloud-aws-secrets-manager-config</artifactId>
</dependency>
```

```yaml
# bootstrap.yml — loads BEFORE the application context
spring:
  cloud:
    aws:
      secretsmanager:
        enabled: true
      region: ap-south-1           # Mumbai region for India
      credentials:
        instance-profile: true     # Use the EKS Pod IAM role (IRSA)

# Secret name in AWS: /prod/order-service/db
# Contains JSON: { "username": "admin", "password": "super-secret" }
# Spring automatically maps it to spring.datasource.username / .password
```

```java
// IRSA (IAM Roles for Service Accounts) — no hardcoded AWS keys in Pods!
// Kubernetes ServiceAccount → IAM Role → Secrets Manager access
// This is the production-secure way. No AWS_ACCESS_KEY_ID needed.
// Annotate the Kubernetes ServiceAccount:
//   annotations:
//     eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/order-service-role
```
```yaml
name: Build and Deploy to EKS
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Step 1: Build the Spring Boot JAR
      - name: Build JAR
        run: mvn clean package -DskipTests

      # Step 2: Configure AWS credentials (OIDC — no secrets stored!)
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ap-south-1

      # Step 3: Push the image to ECR
      - name: Login to ECR
        run: aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
      - name: Build & Push Docker Image
        run: |
          docker build -t $ECR_URI/order-service:${{ github.sha }} .
          docker push $ECR_URI/order-service:${{ github.sha }}

      # Step 4: Deploy to EKS
      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name prod-cluster --region ap-south-1
      - name: Deploy (Rolling Update)
        run: |
          kubectl set image deployment/order-service \
            order-service=$ECR_URI/order-service:${{ github.sha }} \
            -n production
          # Wait for rollout to complete before marking the deploy successful
          kubectl rollout status deployment/order-service -n production
```
Senior Interview Q&A
Exact answers structured for 10 YOE interviews — trade-offs first, examples always
🎯 Must-Know Interview Questions with Model Answers
(1) Outbox Pattern — write events to a DB table in the same transaction as your business data, then a poller publishes to Kafka. Zero event loss. (2) Idempotent consumers — check eventId before processing, skip duplicates. Safe to retry. (3) Saga pattern — choreography via Kafka events for multi-service transactions, with compensating transactions on failure. (4) Dead Letter Queues — failed messages after 3 retries go to DLQ for manual inspection or replay."
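The idempotent-consumer check in point (2) reduces to a few lines of plain Java. This sketch uses an in-memory HashSet where a real service would use a durable dedupe store (a Redis SETNX key or a DB unique constraint on eventId); the class and method names are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an idempotent Kafka consumer: process each eventId at most once.
public class IdempotentConsumer {

    // Stands in for a durable store (Redis SETNX / DB unique key on eventId).
    private final Set<String> processedEventIds = new HashSet<>();

    // Returns true if the event was processed, false if it was a duplicate.
    public boolean process(String eventId, Runnable businessLogic) {
        if (!processedEventIds.add(eventId)) {
            return false;          // Already seen — ack and skip, safe to retry delivery
        }
        businessLogic.run();       // First delivery — run the real work
        return true;
    }
}
```

Because duplicates are silently skipped, the producer (or Kafka itself) can redeliver freely without corrupting state, which is what makes at-least-once delivery safe.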
(1) ReplicaSets — always maintain N Pod replicas, restart if one crashes. (2) Rolling updates — maxUnavailable=1 means zero downtime deploys. (3) Liveness probes — kill and restart unhealthy Pods. (4) Readiness probes — remove Pod from Service until healthy. (5) HPA — auto-scale on CPU/custom metrics. (6) Pod Disruption Budgets — prevent all Pods going down during node drains. On AWS EKS, spread across 3 AZs for zone failure tolerance."
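The Pod Disruption Budget in point (6) is a small manifest of its own; a minimal sketch, reusing the order-service labels from earlier (the resource name is illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: production
spec:
  minAvailable: 2          # Keep at least 2 Pods running during voluntary
  selector:                # disruptions (node drains, cluster upgrades)
    matchLabels:
      app: order-service
```

With 3 replicas and `minAvailable: 2`, Kubernetes will evict at most one Pod at a time during a node drain, so the service never loses quorum of healthy instances.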
@Transactional with PostgreSQL, choose isolation level based on need (READ_COMMITTED for most cases, SERIALIZABLE for critical financial ops). BASE when scale is priority and slight inconsistency is acceptable — product view counters, social feeds, user activity logs. DynamoDB or Cassandra, eventual consistency accepted. Most systems are hybrid: ACID core (orders, payments) with BASE peripherals (analytics, notifications)."
(1) Multi-AZ EKS with 3 node groups across AZs — one AZ failure doesn't impact service. (2) RDS Multi-AZ with automatic failover. (3) ElastiCache Redis Cluster Mode — sharded + replicated. (4) Circuit breakers (Resilience4j) — prevent cascade failures. (5) ALB health checks — route away from unhealthy instances. (6) Rolling deployments — zero downtime updates. (7) Chaos engineering — regular failure injection to find weak points. (8) CloudWatch alarms + PagerDuty for instant alerts."
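The Resilience4j circuit breaker in point (4) is mostly configuration. A minimal sketch, assuming the resilience4j-spring-boot starter is on the classpath; the instance name and thresholds are illustrative, not values from the text:

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        sliding-window-size: 20             # Evaluate the last 20 calls
        failure-rate-threshold: 50          # Open the circuit if >50% of them fail
        wait-duration-in-open-state: 10s    # Stay open 10s, then allow probe calls
        permitted-number-of-calls-in-half-open-state: 3
```

A method is then guarded with `@CircuitBreaker(name = "paymentService", fallbackMethod = "...")`; while the circuit is open, calls fail fast to the fallback instead of piling up threads against a dead dependency, which is what stops the cascade.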
(1) Redis caching — cache hot data, avoid DB for reads (sub-millisecond). (2) DB indexing — add composite indexes on frequent query patterns, use EXPLAIN ANALYZE to find slow queries. (3) Async processing — push non-critical work to Kafka (notifications, emails) — respond immediately. (4) CDN (CloudFront) — serve static assets from edge. (5) Connection pooling (HikariCP) — avoid connection setup cost per request. (6) Reduce network hops — co-locate services in same AZ, use gRPC between services."
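The cache-aside pattern behind point (1) can be sketched in plain Java. Here a HashMap stands in for Redis and the `dbLookup` function for the repository call; names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside sketch: check the cache first, fall back to the DB on a miss,
// and populate the cache so the next read is served without touching the DB.
public class CacheAside {

    private final Map<Long, String> cache = new HashMap<>(); // stands in for Redis
    private int dbHits = 0;                                   // instrumentation for the sketch

    public String getProduct(long id, Function<Long, String> dbLookup) {
        return cache.computeIfAbsent(id, key -> {
            dbHits++;                      // cache miss — hit the DB exactly once
            return dbLookup.apply(key);
        });
    }

    public int getDbHits() { return dbHits; }
}
```

In production the same shape appears as Spring's `@Cacheable` over a Redis cache manager; the point for the interview is that repeated reads of hot data cost one DB round trip total, not one per request.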
Senior-Level Pro Tips
What separates 10 YOE answers from 3 YOE answers in system design interviews
✅ Senior-Level Answer Patterns
- Always lead with trade-offs, not just solutions
- Quantify: "This reduces latency from ~800ms to ~5ms with Redis"
- Mention failure scenarios proactively: "What if Kafka is down?"
- Reference real systems: "Netflix does this with Zuul / Hystrix"
- Show evolution: "Start with monolith, extract when needed"
- Bring in operational concerns: observability, alerting, on-call
❌ Junior-Level Pitfalls
- Jumping to microservices without justification
- Using SERIALIZABLE isolation for everything ("it's safest")
- Forgetting that distributed systems can fail partially
- Not mentioning idempotency for distributed writes
- Shared database between microservices ("just use one DB")
- No mention of monitoring, alerting, or observability
Scale (DAU, QPS, data volume), consistency requirements, latency SLA, budget. Never design without numbers.
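The numbers-first habit can be made concrete with a back-of-the-envelope helper; the DAU and per-user figures below are made-up inputs for illustration, not data from the text:

```java
// Back-of-envelope capacity estimate: turn DAU into average and peak QPS.
public class CapacityEstimate {

    static final long SECONDS_PER_DAY = 86_400L;

    // Average QPS if traffic were spread evenly across the day.
    public static long averageQps(long dau, long requestsPerUserPerDay) {
        return (dau * requestsPerUserPerDay) / SECONDS_PER_DAY;
    }

    // Rough peak: multiply by a peak-to-average factor (2-5x is a common rule of thumb).
    public static long peakQps(long avgQps, int peakFactor) {
        return avgQps * peakFactor;
    }

    public static void main(String[] args) {
        long avg = averageQps(10_000_000L, 10);   // e.g. 10M DAU, 10 requests/user/day
        System.out.println("avg QPS = " + avg + ", peak QPS = " + peakQps(avg, 3));
    }
}
```

Ten million DAU at ten requests each works out to roughly 1,157 QPS average, so a 3x peak factor means designing for about 3,500 QPS, a number that immediately tells you how many instances, how much cache, and whether a single DB primary can cope.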
For each data store / service: is it CP or AP? Explicitly state the trade-off you're making.
What happens when DB is down? Kafka is down? A service crashes? Always have circuit breakers + fallbacks.
Draw end-to-end: Client → CDN → ALB → API Gateway → Service → Cache/DB → Event → Consumer
Metrics (Prometheus/CloudWatch), distributed tracing (Zipkin), centralized logging (ELK/CloudWatch Logs), alerts.
Auth (JWT at gateway), rate limiting, CORS, SSL termination — all at the gateway, not per service.