I follow a resource-oriented design approach. Everything starts with identifying your nouns (resources), not your actions (verbs).
RESOURCE NAMING
- /getUser — verb in URI ❌
- /updateUser?action=activate — RPC style ❌
- /users — collection (plural noun) ✓
- /users/{id} — single resource ✓
- /users/{id}/orders — nested sub-resource ✓
HTTP VERB MAPPING
| Verb | Action | Safe? | Idempotent? | Example |
|---|---|---|---|---|
| GET | Read | ✓ | ✓ | GET /users/5 |
| POST | Create | ✗ | ✗ | POST /users |
| PUT | Full Replace | ✗ | ✓ | PUT /users/5 |
| PATCH | Partial Update | ✗ | Mostly | PATCH /users/5 |
| DELETE | Remove | ✗ | ✓ | DELETE /users/5 |
PRODUCTION CONCERNS
- Versioning: URI versioning (/v1/users) for public APIs, header versioning for internal microservices
- Pagination: Cursor-based for large datasets, offset for simple internal tools
- Security: OAuth2 / JWT + HTTPS always on
- Error Handling: Structured JSON errors with machine-readable codes and a traceId
- Rate Limiting: Token Bucket at the gateway level
| Feature | PUT | PATCH |
|---|---|---|
| Action | Replace entire resource | Partial update (delta) |
| Payload | Full object required | Only changed fields |
| Bandwidth | Higher | Lower |
| Idempotent? | Always ✓ | Usually, not guaranteed |
| Side Effect | Missing fields → set to null/default | Untouched fields unchanged |
With PUT /users/5 {"name":"Alice"} (no email field in the payload), the email should technically be set to null. Many devs accidentally use PUT when they mean PATCH, causing data loss in production.
Use PUT when onboarding a user via a config file that always has all fields. Use PATCH for user profile edits where they only change their profile picture — you don't want to accidentally erase their name.
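The difference is easy to demonstrate with plain dictionaries. A minimal sketch (the resource and field names are illustrative, not from any particular framework):

```python
def put(resource: dict, payload: dict) -> dict:
    """PUT: full replace. The payload becomes the entire resource."""
    return dict(payload)

def patch(resource: dict, payload: dict) -> dict:
    """PATCH: merge. Only fields present in the payload change."""
    return {**resource, **payload}

user = {"name": "Bob", "email": "bob@example.com"}

print(put(user, {"name": "Alice"}))    # {'name': 'Alice'}  <- email is gone!
print(patch(user, {"name": "Alice"}))  # {'name': 'Alice', 'email': 'bob@example.com'}
```

The PUT result silently drops the email, which is exactly the production data-loss failure described above.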
Never return plain text or stack traces. I use a structured envelope that serves both machines (the code field) and developers (the traceId field for log correlation).
- Never expose stack traces, SQL errors, or internal hostnames to clients
- Always log full detail server-side, return only what's needed
- Use Problem Details (RFC 7807) format in enterprise APIs for standardization
- Validation errors (400) should list all field errors at once — not one at a time
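One possible shape for such an envelope, as a sketch (field names like code and traceId follow the conventions above; the exact schema is illustrative, loosely inspired by RFC 7807):

```python
import json
import uuid

def error_response(status: int, code: str, message: str, field_errors=None) -> str:
    """Build a structured error body: machine-readable code + traceId for log correlation."""
    body = {
        "status": status,
        "code": code,                  # machine-readable, e.g. "VALIDATION_FAILED"
        "message": message,            # human-readable summary
        "traceId": str(uuid.uuid4()),  # correlate with server-side logs
    }
    if field_errors:                   # report ALL field errors at once
        body["errors"] = field_errors
    return json.dumps(body)

print(error_response(400, "VALIDATION_FAILED", "Request body invalid",
                     [{"field": "email", "issue": "must be a valid address"},
                      {"field": "age", "issue": "must be >= 0"}]))
```

Note the client never sees a stack trace; the traceId is what a developer quotes when filing a support ticket, and you grep your logs for it.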
- Transport: TLS 1.2+ (HTTPS) everywhere. Reject plain HTTP. Enforce HSTS.
- Authentication (AuthN): OAuth 2.0 for 3rd-party delegated access. JWT for stateless service-to-service. API Keys for machine clients with IP whitelisting.
- Authorization (AuthZ): RBAC (Role-Based) or ABAC (Attribute-Based). Validate permissions on every request — never trust client claims.
- Input Validation: Reject malformed inputs at the gateway. Prevent SQL Injection, XSS, Path Traversal.
- Rate Limiting: Per-user and per-IP limits. Return 429 Too Many Requests with a Retry-After header.
- CORS: Strict allowlist of trusted origins. Never Access-Control-Allow-Origin: * for authenticated APIs.
- Secrets: Never put API keys in URLs. Use the Authorization header or request body.
- Audit Logging: Log all write operations (POST/PUT/PATCH/DELETE) with user, IP, timestamp, and payload hash.
| Level | Name | What it means | Real World |
|---|---|---|---|
| L0 | Swamp of POX | HTTP as a tunnel. One endpoint. SOAP / XML-RPC. | Legacy enterprise SOAP |
| L1 | Resources | Multiple URIs like /users, /products. But still just POST everywhere. | Early internal APIs |
| L2 | HTTP Verbs | Correct GET/POST/PUT/DELETE + Status Codes. This is where 90% of "REST" APIs live. | Stripe, GitHub API |
| L3 | HATEOAS | API drives the client via links in responses. Self-documenting conversation. | Rarely seen in practice |
HATEOAS (Hypermedia as the Engine of Application State) means the API response tells the client what actions are available next — instead of the client hardcoding URL patterns.
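A HATEOAS-style response can be sketched as a plain payload with a _links section. This is a hand-rolled illustration, not any specific hypermedia standard; the URLs, relation names, and the "pending order" rules are invented for the example:

```python
import json

def order_representation(order_id: str, status: str) -> dict:
    """Embed the next available actions as links, so clients follow them
    instead of hardcoding URL patterns."""
    links = {"self": {"href": f"/orders/{order_id}"}}
    if status == "pending":  # actions only valid while the order is open
        links["cancel"] = {"href": f"/orders/{order_id}/cancel", "method": "POST"}
        links["pay"] = {"href": f"/orders/{order_id}/payment", "method": "POST"}
    return {"id": order_id, "status": status, "_links": links}

print(json.dumps(order_representation("o-42", "pending"), indent=2))
```

The key point: a shipped order simply has no cancel link, so the client's UI disables the button without duplicating business rules.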
Idempotency means making the same request multiple times produces the exact same result as making it once. Mathematically: f(f(x)) = f(x).
In distributed systems, network retries are inevitable. A client never knows if a request was lost in transit or if the server processed it and the response was lost. Without idempotency, retrying a payment could charge a user twice.
| Method | Idempotent? | Why |
|---|---|---|
| GET | ✓ Yes | Read-only. Multiple reads don't change state. |
| HEAD | ✓ Yes | Same as GET but no body. Pure metadata. |
| OPTIONS | ✓ Yes | Describes capabilities only. |
| PUT | ✓ Yes | Sets resource to a specific state. Running 10 times = same result. |
| DELETE | ✓ Yes | After first delete, resource is gone. All subsequent are no-ops (404 or 200). |
| POST | ✗ No | Creates new resources each time. Two POST /payments = two charges. |
| PATCH | Varies | {"name":"Alice"} is idempotent. {"views": views+1} is not. |
IDEMPOTENCY-KEY FLOW
1. Client generates a UUID key.
2. Client sends the request with Idempotency-Key: abc-123.
3. Server checks Redis for the key.
4. Key not found → process the request and cache the response.
5. Key found → return the cached result without reprocessing.
- Client must generate a UUID and include it on every retry of the same logical operation
- Use Redis with a 24-48 hour TTL — long enough to cover retry windows
- For banking: use a DB unique constraint on (userId, idempotencyKey) for ACID compliance
- Return the exact same response code (e.g., 201) on replayed requests — don't return 200
WORKED EXAMPLE
1. Client generates UUID abc-999 and sends the payment request.
2. The response is lost in transit, so the client retries with the same Idempotency-Key: abc-999.
3. Server finds abc-999 in Redis and returns the cached 200 OK. No second charge.
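The server-side check-then-cache logic can be sketched with a dict standing in for Redis (a real implementation would use an atomic SET with NX and a TTL; the handler and field names here are illustrative):

```python
import uuid

idempotency_cache = {}  # stands in for Redis: SET key value NX EX 86400

def handle_payment(idempotency_key: str, amount: int):
    """Replay-safe POST handler: same key -> same cached response, no double charge."""
    if idempotency_key in idempotency_cache:
        return idempotency_cache[idempotency_key]      # replay: return cached result
    charge_id = str(uuid.uuid4())                      # the side effect happens once
    response = (201, {"chargeId": charge_id, "amount": amount})
    idempotency_cache[idempotency_key] = response      # cache status code AND body
    return response

first = handle_payment("abc-999", 500)
retry = handle_payment("abc-999", 500)
assert first == retry   # identical response, including the 201 status code
```

Caching the tuple of status code plus body is what lets the replay return the original 201 rather than a misleading 200, matching the checklist above.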
| Type | How it works | Best for | Avoid when |
|---|---|---|---|
| Offset | ?page=3&size=20 (SQL: OFFSET 60 LIMIT 20) | Small datasets, admin panels, user expects "page 5" | Large datasets, real-time data |
| Cursor/Keyset | ?cursor=last_id (SQL: WHERE id > ? LIMIT 20) | Infinite scroll, social feeds, large datasets | When random page access is required |
| Seek/Keyset+ | Multi-column cursor: (date, id) | Complex sorting with non-unique columns | Simple use cases |
| Time-based | ?since=timestamp&until=timestamp | Audit logs, event streams, sync APIs | User-facing paginated lists |
OFFSET 500000 LIMIT 20 forces the DB to scan and discard 500,000 rows before returning 20. It's O(N) and gets slower every page. At page 10,000, your query takes seconds.
If you expose a raw cursor like ?cursor=102, devs start building logic around it (e.g., "cursor + 1 = next page"), coupling them to your DB structure. Base64 encoding says "this is a black box — just pass it back."
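An opaque cursor is just the keyset columns serialized and Base64-encoded. A minimal sketch (the cursor contents and column names are illustrative; production cursors are often also signed or encrypted):

```python
import base64
import json

def encode_cursor(last_id: int, last_created_at: str) -> str:
    """Opaque cursor: clients pass it back verbatim; no arithmetic is possible."""
    raw = json.dumps({"id": last_id, "created_at": last_created_at})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

cursor = encode_cursor(102, "2024-01-05T12:00:00Z")
print(cursor)                 # an opaque token, e.g. 'eyJpZCI6IDEwMiwg...'
print(decode_cursor(cursor))  # {'id': 102, 'created_at': '2024-01-05T12:00:00Z'}

# Server-side keyset query (sketch):
#   WHERE (created_at, id) > (?, ?) ORDER BY created_at, id LIMIT 20
```

The (created_at, id) pair also doubles as the tie-breaker for non-unique sort columns, which the Seek/Keyset+ row above refers to.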
HTTP has a built-in caching system most developers underuse. There are two layers: freshness (Cache-Control) and validation (ETags).
ETAG VALIDATION FLOW
1. Client: GET /products/5
2. Server: 200 OK + ETag: "v3-abc"
3. Client caches the response.
4. Later, the client repeats the request with If-None-Match: "v3-abc".
5. Server: 304 Not Modified (empty body). The client keeps using its cached copy.
| Header | Purpose | Example |
|---|---|---|
| Cache-Control: max-age=3600 | Client caches for 1 hour, no server check | Static assets, config data |
| Cache-Control: no-cache | Must revalidate with server each time | User-specific data |
| Cache-Control: private | CDN won't cache; only the browser can | Auth-scoped responses |
| ETag: "v1-hash" | Content fingerprint for conditional requests | Product catalogue, config |
| 304 Not Modified | Data unchanged — use your cache | Massive bandwidth savings |
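The ETag/304 handshake can be sketched in a few lines. This is an illustration of the mechanism, not a real server; here the ETag is derived from a content hash (other schemes, like version counters, work too):

```python
import hashlib
import json

def make_etag(resource: dict) -> str:
    """Content fingerprint: identical bytes always produce the identical ETag."""
    payload = json.dumps(resource, sort_keys=True).encode()
    return '"' + hashlib.sha256(payload).hexdigest()[:12] + '"'

def conditional_get(resource: dict, if_none_match):
    """Return 304 with an empty body when the client's cached copy is still current."""
    etag = make_etag(resource)
    if if_none_match == etag:
        return 304, None, etag      # nothing to transfer
    return 200, resource, etag      # full body + fresh ETag

product = {"id": 5, "name": "Widget", "price": 999}
status1, body1, etag = conditional_get(product, None)   # first fetch: 200 + body
status2, body2, _ = conditional_get(product, etag)      # revalidation: 304, no body
print(status1, status2)  # 200 304
```

The bandwidth saving comes from step two: the server still does the work of computing the ETag, but ships zero payload bytes.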
| Algorithm | Compression Ratio | Speed | Support |
|---|---|---|---|
| gzip | ~70% reduction | Fast | Universal (all browsers) |
| brotli (br) | ~80% reduction | Slightly slower | Modern browsers + HTTPS only |
| deflate | ~65% reduction | Fast | Legacy; avoid |
- Enable brotli on your CDN/Nginx for modern clients. Fall back to gzip for others.
- Compress anything over 1KB. Skip tiny responses — compression overhead isn't worth it.
- Never compress binary data (images, videos) — they're already compressed.
ASYNC TASK FLOW (202 ACCEPTED)
1. Client kicks off the job; the server responds 202 Accepted with Location: /tasks/job-99.
2. Client polls the task URL → {"status": "processing", "percent": 60}
3. When finished → {"status": "done", "result": "/reports/7"}
Alternative: the client supplies a callbackUrl in the initial POST. When done, the server pushes the result to that URL. This avoids wasted polling traffic and reduces latency to completion notification.
Never expose batch operations as POST /users in a loop — that's 10,000 HTTP round trips. Create a dedicated batch endpoint.
- Endpoint: POST /users/batch
- Atomic (All-or-Nothing): Wrap all inserts in a DB transaction. One failure rolls back everything. Use for financial data where partial state is dangerous.
- Partial Success (Preferred for large batches): Process all items and report per-item success/failure. Use 207 Multi-Status.
- Async Batch: For very large batches (10k+), return 202 Accepted + job ID immediately. Process in the background. Return results via polling or webhook.
For partial success, return 207 (or 200), NOT 400. The batch endpoint itself succeeded. Individual items may have failed, and that's reflected per-item inside the response.
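A partial-success batch response can be sketched like this (the endpoint name, validation rule, and response shape are illustrative; 207 Multi-Status bodies vary between APIs):

```python
import json

def batch_create_users(items):
    """Process every item; report a per-item outcome. The batch call itself
    returns 207 even when some items fail."""
    results = []
    for i, item in enumerate(items):
        if "email" not in item:                 # illustrative validation rule
            results.append({"index": i, "status": 400, "error": "email required"})
        else:
            results.append({"index": i, "status": 201, "id": f"u-{i}"})
    return 207, {"results": results}

status, body = batch_create_users([{"email": "a@x.com"}, {"name": "no-email"}])
print(status)                      # 207 even though item 1 failed
print(json.dumps(body, indent=2))
```

Clients then iterate over results, retrying or surfacing only the failed indices instead of re-submitting the entire batch.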
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URI Versioning | /v1/users | Visible in browser, easy to test, cache-friendly, most common | Violates REST (URI should be stable resource address) |
| Header Versioning | Accept-Version: v1 | Clean URIs, pure REST | Can't test in browser, complex caching rules |
| Media Type | Accept: application/vnd.myapi.v2+json | Purist REST, content negotiation | Most complex to implement and debug |
| Query Param | /users?version=2 | Simple to add | Messy, breaks caching, not recommended |
My default: URI versioning (/v1/) for public APIs — it wins on developer ergonomics, discoverability, and CDN cacheability. Header versioning for strict internal microservices where URL cleanliness matters. Never query params.
- Maintain at most 2 major versions simultaneously — deprecation cost is real
- Set a deprecation timeline upfront (e.g., v1 → sunset in 6 months after v2 launch)
- Add Deprecation: true and Sunset: <date> response headers to warn clients automatically
- Never make breaking changes inside a version (removing fields, changing types)
| Algorithm | How it works | Use case |
|---|---|---|
| Token Bucket | Bucket fills at fixed rate. Each request costs 1 token. Allows short bursts. | API endpoints, default choice |
| Leaky Bucket | Queue smooths traffic. Output rate is constant regardless of input bursts. | Payment processing, precise throttling |
| Fixed Window | Reset counter every N seconds. Simple but allows 2x burst at window boundaries. | Simple internal tools |
| Sliding Window | Counts requests in the last N seconds, rolling. Accurate, no boundary burst. | High-security APIs |
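The Token Bucket row above, the default choice, can be sketched in a few lines. This is a single-process illustration; a real gateway would keep the bucket state in Redis with atomic operations:

```python
import time

class TokenBucket:
    """Bucket refills at `rate` tokens/sec up to `capacity`; each request costs
    one token. Allows short bursts up to the bucket capacity."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False            # caller should respond 429 + Retry-After

bucket = TokenBucket(rate=5, capacity=10)   # 5 req/s steady, bursts up to 10
burst = [bucket.allow() for _ in range(12)]
print(burst.count(True))   # 10: the burst capacity; the remaining 2 get a 429
```

The burst-friendliness is the distinguishing property: a client that was idle can momentarily exceed the steady rate, which Leaky Bucket deliberately forbids.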
| Pattern | Direction | Connection | Best For |
|---|---|---|---|
| Polling | Client → Server (pull) | Repeated short HTTP calls | Relaxed latency requirements, simple setup |
| Long Polling | Client → Server (pull) | Client waits until data available | Near real-time, no WebSocket |
| Webhook | Server → Client (push) | One-shot HTTP POST per event | Event-driven callbacks, integrations |
| SSE | Server → Client (push) | Persistent one-way stream | Live dashboards, notifications in browser |
| WebSocket | Bi-directional | Persistent full-duplex | Chat, live collaboration, gaming |
1. Secret exchange: At registration, issue the consumer a webhook_secret (e.g., whsec_abc123xyz). Shared only between you and them.
2. Signing: Build the signing string timestamp + "." + payload_json. Compute HMAC-SHA256(signing_string, webhook_secret). Attach to a header: X-Signature: t=1700000000,v1=abc123...
3. Verification: The receiver recomputes the HMAC and checks computed == X-Signature. If not equal → reject with 401.
4. Replay protection: If |current_time - timestamp| > 300 seconds → reject. An attacker capturing a valid request can't replay it after 5 minutes.
- IP Allowlisting: Optional but rigid. Stripe publishes their outbound IP ranges. Allowlist them in your firewall.
- Rotate secrets: Allow secret rotation without downtime by accepting two valid signatures during a transition window.
- Use HTTPS on both sides: Webhook receiver endpoint must be HTTPS. Never accept webhook deliveries over HTTP.
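The sign-and-verify steps above can be sketched with the standard library (the secret value and header format are illustrative, modeled on the scheme described in this section):

```python
import hashlib
import hmac
import time

SECRET = b"whsec_abc123xyz"   # illustrative shared secret

def sign(payload_json: str, timestamp: int) -> str:
    """Producer side: HMAC over 'timestamp.payload', shipped in one header."""
    signing_string = f"{timestamp}.{payload_json}".encode()
    digest = hmac.new(SECRET, signing_string, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"

def verify(payload_json: str, header: str, tolerance: int = 300) -> bool:
    """Consumer side: recompute, constant-time compare, and reject stale timestamps."""
    parts = dict(p.split("=", 1) for p in header.split(","))
    timestamp, received = int(parts["t"]), parts["v1"]
    if abs(time.time() - timestamp) > tolerance:        # replay protection
        return False
    expected = hmac.new(SECRET, f"{timestamp}.{payload_json}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received)      # constant-time compare

payload = '{"paymentId":"pay_123"}'
header = sign(payload, int(time.time()))
print(verify(payload, header))                                  # True
print(verify(payload, sign(payload, int(time.time()) - 600)))   # False: too old
```

hmac.compare_digest matters here: a naive == comparison leaks timing information an attacker can use to forge signatures byte by byte.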
BACKOFF SCHEDULE
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | Immediate | 0s |
| 2 | 30 seconds | 30s |
| 3 | 5 minutes | ~5m |
| 4 | 30 minutes | ~35m |
| 5 | 2 hours | ~2.5h |
| ... | Exponential + Jitter | ... |
| 25 | DLQ | ~72 hours |
Always add jitter: delay = base_delay * (1 + random(0, 0.2)). Otherwise thousands of retries hit the recovering server at the exact same second, making recovery impossible (Thundering Herd).
- Timeout per attempt: Hard limit of 5 seconds. If no 2xx in 5s, count as failure.
- Success criteria: Only HTTP 2xx counts as success. 3xx and 5xx → retry. 410 Gone → stop permanently.
- Fast failure for 4xx: 400/401/403 responses usually mean a client bug — stop retrying, notify them immediately.
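A capped exponential backoff with jitter can be sketched as below. This is a simplified doubling schedule, not the exact table above (which uses hand-tuned early steps); the base delay and cap are illustrative:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 30.0, cap: float = 7200.0) -> float:
    """Exponential backoff, capped at `cap` seconds, with up to 20% random
    jitter so thousands of retries don't align into a thundering herd."""
    delay = min(cap, base_delay * (2 ** (attempt - 1)))
    return delay * (1 + random.uniform(0, 0.2))

for attempt in (1, 2, 3, 6):
    print(f"attempt {attempt}: ~{backoff_delay(attempt):.0f}s")
```

Two clients retrying the same failed delivery will now land tens of seconds apart instead of colliding on the exact same instant.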
In distributed systems, webhooks guarantee "at-least-once delivery", never "exactly-once". If your server times out before returning 200 OK, we retry — but you may have already processed it. Duplicates will happen.
- Return 200 OK even for duplicate events — otherwise the sender will keep retrying infinitely
- Use a DB unique constraint on event_id as your safety net — it's atomic and prevents race conditions
- Process webhooks asynchronously — return 200 immediately, put the event on an internal queue
- Store processed event IDs for at least the retry window duration (e.g., 72 hours)
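An idempotent consumer boils down to a dedupe check before the side effect. A minimal sketch (a set stands in for the DB unique constraint; event names are illustrative):

```python
processed_event_ids = set()   # stands in for a DB unique constraint on event_id
side_effects = []             # stands in for the real business logic / queue

def receive_webhook(event: dict) -> int:
    """Always return 200, even for duplicates, but apply the side effect only once."""
    if event["event_id"] in processed_event_ids:
        return 200                       # duplicate delivery: ack it, do nothing
    processed_event_ids.add(event["event_id"])
    side_effects.append(event["type"])   # e.g. enqueue for async processing
    return 200

receive_webhook({"event_id": "evt_1", "type": "payment.success"})
receive_webhook({"event_id": "evt_1", "type": "payment.success"})  # redelivery
print(len(side_effects))  # 1: processed exactly once despite two deliveries
```

In a real system the add-to-set and the side effect must be atomic (one DB transaction), otherwise a crash between them reintroduces the race the unique constraint was meant to close.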
1. payment.created is sent, but routed through a congested network path.
2. payment.success is sent and arrives first via a fast path. Status set to "Success".
3. payment.created finally arrives. If naively applied, it overwrites status back to "Pending". 💥 Bug.
SOLUTIONS
- Timestamp comparison: Include created_at in each event. Only apply the update if the incoming timestamp is newer than the current state's timestamp.
- Version/Sequence numbers: Each event carries a sequence: 3 field. Only apply if incoming_sequence > current_sequence.
- Re-fetch pattern (Best Practice): Treat the webhook as a "nudge" only. Ignore the payload. Call GET /payments/{id} to fetch the absolute latest state from the source of truth. This eliminates ordering entirely.
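The sequence-number guard is a one-line comparison. A minimal sketch of the out-of-order scenario above (state shape and event fields are illustrative):

```python
state = {"status": "unknown", "sequence": 0}

def apply_event(event: dict) -> bool:
    """Apply the event only if it is newer than what we have already seen."""
    if event["sequence"] <= state["sequence"]:
        return False                     # stale or duplicate: drop it
    state.update(status=event["status"], sequence=event["sequence"])
    return True

apply_event({"sequence": 2, "status": "Success"})   # fast-path event arrives first
apply_event({"sequence": 1, "status": "Pending"})   # late event: correctly ignored
print(state["status"])  # 'Success': not clobbered back to 'Pending'
```

The same shape works with timestamps instead of sequence numbers, at the cost of clock-skew edge cases, which is why monotonic sequence numbers are the safer choice when the producer can supply them.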
| Thin Payload | Fat Payload | |
|---|---|---|
| Contains | Just IDs: {"paymentId":"pay_123"} | Full resource snapshot |
| Pros | Always fresh — client fetches latest state | No extra API call needed |
| Cons | Extra API call required per event | May be stale by the time client processes it |
| Best For | Frequently changing resources, security-sensitive data | Immutable events, audit logs |
Webhooks are asynchronous and run in the background. When something breaks, the developer has no idea why. Observability is what separates a professional webhook system from a prototype.
- Delivery Logs Dashboard: Show each sent event: payload sent, timestamp, HTTP status returned, response body, delivery duration, attempt number.
- Manual Resend: A "Resend Event" button that re-triggers delivery of a specific event. Invaluable for developers fixing their endpoint during testing.
- Test Events: Allow sending synthetic test events (like payment.success with fake data) without creating real transactions.
- Proactive Alerting: Email/Slack alert: "Your webhook endpoint has failed 90% of deliveries in the last hour. We will disable it in 24 hours."
- CLI / SDK helper: Stripe has stripe listen --forward-to localhost:3000, which proxies live events to local dev environments. Huge DX win.
ARCHITECTURE SOLUTION
Producer emits payment.success → event lands on a message queue (never send directly) → N delivery workers per client → HTTP POST to the client endpoint.
- Queue-based decoupling: Never send webhooks directly from the API server thread. All events go into Kafka/SQS first.
- Per-client queues: Maintain separate queues per customer. One slow customer doesn't block others.
- Concurrency limits per client: Max N concurrent HTTP connections to a single endpoint. No client gets hammered with 1000 simultaneous requests.
- Circuit breaker per endpoint: If an endpoint fails 50% of attempts in 60 seconds, stop sending temporarily. Allow recovery before resuming.
- Jitter on retries: Spread retries across a time window instead of all at once.
1. A payment.success event is queued in the Kafka topic webhook-deliveries.
2. After all retries are exhausted, the event moves to the webhook-dlq Kafka topic. It stores full context: payload, all attempt timestamps, HTTP responses received.
3. The customer is alerted (e.g., "deliveries are failing to shop.com/hooks"). The dashboard shows a DLQ count badge.
4. Once the endpoint is fixed, events are replayed into webhook-deliveries with a fresh retry budget.
- DLQ is not "trash" — it's a recovery mechanism. Make replayability a first-class feature.
- Store the reason for each failure (timeout, 500, connection refused) to help devs debug.
- Allow selective replay (replay specific events, not all) for large DLQs.
- Auto-disable endpoints that consistently fill the DLQ — protect your delivery workers.
- Keep DLQ events for 30 days minimum. Financial events potentially longer.
Breaking changes in webhook payloads can crash production systems without any deployment on the consumer's side. Schema evolution must be extremely careful.
- Version pinning: Each webhook subscription stores the API version at registration time. A user who signed up under apiVersion: "2023-01-15" always gets that payload shape, even when v2 is live.
- Additive changes only: Adding new fields is safe. Renaming, removing, or changing types of existing fields is a breaking change requiring a new version.
- Migration period: Announce deprecation 6+ months early. Send a Deprecation: true warning header in webhook requests. Document the migration path.
- Dual delivery: During migration, send events in both old and new format simultaneously. Consumers can migrate at their pace.
2XX — SUCCESS
4XX — CLIENT ERROR
5XX — SERVER ERROR
Key Concepts to Always Mention
| Topic | When Asked About... | Always Mention |
|---|---|---|
| REST Design | How to design an API | Resource naming, HTTP semantics, status codes, versioning, pagination, security from day 1 |
| Idempotency | POST safety, payments | Idempotency-Key header, Redis with TTL, "at-least-once delivery", UUID client generation |
| Pagination | List endpoints at scale | Cursor-based > offset for scale, phantom rows problem, Base64 opaque cursors, tie-breaker for non-unique sorts |
| Caching | Performance | Cache-Control headers, ETags, 304 Not Modified, CDN, brotli compression |
| Long Tasks | Slow operations | 202 Accepted, polling endpoint, webhook callback option, never block connections |
| Webhooks | Event-driven systems | HMAC signature, replay attack prevention (timestamp check), exponential backoff + jitter, DLQ, idempotent consumer, re-fetch pattern for ordering |
| Rate Limiting | Abuse prevention | Token Bucket algorithm, 429 status, X-RateLimit-* headers, per-IP + per-key + per-endpoint limits, Redis atomic increments |
| Versioning | Breaking changes | URI versioning for public, header for internal, additive-only within version, Sunset headers |
| Bulk APIs | Batch operations | 207 Multi-Status, partial success, atomic vs non-atomic trade-offs, async for large batches |
Senior-Level Phrasing That Impresses
- "I design for the failure path first — what happens on network timeout, retry, crash, partial failure."
- "In distributed systems, at-least-once delivery is the guarantee you can make. Exactly-once requires distributed transactions — expensive and often unnecessary."
- "The re-fetch pattern treats webhooks as signals, not state. You always fetch fresh data from the source of truth."
- "Offset pagination is O(N). Cursor pagination uses the B-Tree index — it's O(log N). At 10M rows, that's the difference between milliseconds and seconds."
- "Security is layered: transport (TLS), authentication (OAuth/JWT), authorization (RBAC), input validation, rate limiting. Removing any layer creates a gap."
- "I'd monitor P99 latency, not just averages. Averages hide the 1% of users who experience a 10-second timeout."