// REST_API + WEBHOOKS + SYSTEM_DESIGN
50+ Questions · 9 Topics · 10yr Level

API Design & Webhooks Interview Mastery

Comprehensive senior-level interview preparation covering REST principles, distributed system patterns, reliability engineering, and webhook architecture. Every answer is designed to impress at the system design level.

REST Design · Idempotency · Pagination · Caching · Webhooks · Rate Limiting · Versioning · Async APIs · Bulk Operations
01
REST Design
// resource modeling · verbs · status codes · security
Q-01
How do you design a REST API for a resource? Walk me through the complete approach.
Mid Level
🎯 What Interviewer Wants: A standardized approach covering naming conventions, HTTP verb semantics, status codes, and production concerns like pagination, security, and versioning.

I follow a resource-oriented design approach. Everything starts with identifying your nouns (resources), not your actions (verbs).

RESOURCE NAMING

  • /getUser — verb in URI ❌
  • /updateUser?action=activate — RPC style ❌
  • /users — collection (plural noun) ✓
  • /users/{id} — single resource ✓
  • /users/{id}/orders — nested sub-resource ✓

HTTP VERB MAPPING

Verb   | Action         | Safe? | Idempotent? | Example
GET    | Read           | ✓     | ✓           | GET /users/5
POST   | Create         | ✗     | ✗           | POST /users
PUT    | Full Replace   | ✗     | ✓           | PUT /users/5
PATCH  | Partial Update | ✗     | Mostly      | PATCH /users/5
DELETE | Remove         | ✗     | ✓           | DELETE /users/5

PRODUCTION CONCERNS

  • Versioning: URI versioning /v1/users for public APIs, header versioning for internal microservices
  • Pagination: Cursor-based for large datasets, offset for simple internal tools
  • Security: OAuth2 / JWT + HTTPS always on
  • Error Handling: Structured JSON errors with machine-readable codes and a traceId
  • Rate Limiting: Token Bucket at the gateway level
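The verb semantics above can be captured as data so that infrastructure code (retry layers, audit filters) can key off them. A minimal, illustrative sketch; the class and method names are my own, not a standard API:

```java
public class VerbSemantics {
    // Safe = no state change; idempotent = repeating has no extra effect.
    public enum Verb {
        GET(true, true), POST(false, false), PUT(false, true),
        PATCH(false, false), DELETE(false, true);

        public final boolean safe, idempotent;
        Verb(boolean safe, boolean idempotent) {
            this.safe = safe;
            this.idempotent = idempotent;
        }
    }

    // A retry middleware may auto-retry only idempotent verbs; POST needs an Idempotency-Key.
    public static boolean safeToRetry(Verb v) {
        return v.idempotent;
    }
}
```

This is why a generic HTTP client can retry GET/PUT/DELETE transparently but must never blindly retry POST.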
Q-02
What is the difference between PUT and PATCH? When would you use each?
Foundational
🎯 What Interviewer Wants: Understanding of idempotency implications, payload size, and real-world side effects.
Feature     | PUT                                  | PATCH
Action      | Replace entire resource              | Partial update (delta)
Payload     | Full object required                 | Only changed fields
Bandwidth   | Higher                               | Lower
Idempotent? | Always ✓                             | Usually, not guaranteed
Side Effect | Missing fields → set to null/default | Untouched fields unchanged
⚠️ Critical Trap: If you send PUT /users/5 {"name":"Alice"} without the email field, the email should technically be set to null. Many devs accidentally use PUT when they mean PATCH, causing data loss in production.
📦 Real World Example

Use PUT when onboarding a user via a config file that always has all fields. Use PATCH for user profile edits where they only change their profile picture — you don't want to accidentally erase their name.

Q-03
How do you design consistent error responses in a REST API?
Mid Level
🎯 What Interviewer Wants: Consistency, debuggability, and not leaking internal implementation details.

Never return plain text or stack traces. I use a structured envelope that serves both machines (the code field) and developers (the traceId field for log correlation).

JSON — Error Response Schema
{
  "code": "USER_NOT_FOUND",               // machine-readable — client can switch() on this
  "message": "User with ID 42 not found", // human-readable
  "traceId": "abc-123-xyz-789",           // correlates to Splunk/Datadog logs
  "timestamp": "2024-10-27T10:00:00Z",    // for debugging time-based issues
  "errors": [                             // for validation failures
    { "field": "email", "issue": "Invalid format" }
  ]
}
  • Never expose stack traces, SQL errors, or internal hostnames to clients
  • Always log full detail server-side, return only what's needed
  • Use the RFC 7807 "Problem Details" format in enterprise APIs for standardization
  • Validation errors (400) should list all field errors at once — not one at a time
Q-04
How do you secure a REST API end-to-end?
Senior Level
🎯 What Interviewer Wants: Multi-layer security thinking — not just "use HTTPS". They want you to cover AuthN vs AuthZ, attack vectors, and operational concerns.
  • Transport: TLS 1.2+ (HTTPS) everywhere. Reject plain HTTP. Enforce HSTS.
  • Authentication (AuthN): OAuth 2.0 for 3rd-party delegated access. JWT for stateless service-to-service. API Keys for machine clients with IP whitelisting.
  • Authorization (AuthZ): RBAC (Role-Based) or ABAC (Attribute-Based). Validate permissions on every request — never trust client claims.
  • Input Validation: Reject malformed inputs at the gateway. Prevent SQL Injection, XSS, Path Traversal.
  • Rate Limiting: Per-user and per-IP limits. Return 429 Too Many Requests with Retry-After header.
  • CORS: Strict allowlist of trusted origins. Never Access-Control-Allow-Origin: * for authenticated APIs.
  • Secrets: Never put API keys in URLs. Use Authorization header or request body.
  • Audit Logging: Log all write operations (POST/PUT/PATCH/DELETE) with user, IP, timestamp, and payload hash.
🔴 Common 10-Year Mistake: Checking authentication at the gateway but skipping authorization inside microservices. A compromised internal service can then do anything. Always re-check permissions at the resource level too.
Q-05
What is the Richardson Maturity Model? Where do most APIs actually sit?
Senior Level
🎯 What Interviewer Wants: That you can articulate REST maturity beyond just "use HTTP verbs correctly."
Level | Name         | What it means                                                              | Real World
L0    | Swamp of POX | HTTP as a tunnel. One endpoint. SOAP / XML-RPC.                            | Legacy enterprise SOAP
L1    | Resources    | Multiple URIs like /users, /products, but still just POST everywhere.      | Early internal APIs
L2    | HTTP Verbs   | Correct GET/POST/PUT/DELETE + status codes. Where 90% of "REST" APIs live. | Stripe, GitHub API
L3    | HATEOAS      | The API drives the client via links in responses. A self-documenting conversation. | Rarely seen in practice
💡 Senior Answer: "Most APIs are Level 2 and that's perfectly fine for production use. Level 3 (HATEOAS) sounds great in theory but adds complexity without proportional benefit for most teams. I'd only advocate for it in truly public, long-lived APIs."
Q-06
What is HATEOAS and when would you actually use it?
Expert Level

HATEOAS (Hypermedia as the Engine of Application State) means the API response tells the client what actions are available next — instead of the client hardcoding URL patterns.

JSON — HATEOAS Response for Pending Order
{
  "id": 101,
  "status": "pending",
  "amount": 2000,
  "_links": {
    "self":   { "href": "/orders/101",        "method": "GET" },
    "cancel": { "href": "/orders/101/cancel", "method": "POST" },
    "pay":    { "href": "/orders/101/pay",    "method": "PUT" }
  }
  // If status were "shipped" → NO "cancel" link would appear here
}
Key Insight: State-driven links prevent impossible actions from being called. A client never needs to check "is this order cancellable?" — the link either appears or it doesn't. This is the real power.
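The state-driven link logic can be sketched in a few lines of plain Java. This is a hypothetical illustration (class, method, and link names are my own), not a HATEOAS framework:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderLinks {
    // Builds the "_links" section for an order based on its current status,
    // mirroring the JSON example above.
    public static Map<String, String> linksFor(long orderId, String status) {
        Map<String, String> links = new LinkedHashMap<>();
        links.put("self", "/orders/" + orderId);
        if ("pending".equals(status)) {
            // Cancellable and payable only while pending.
            links.put("cancel", "/orders/" + orderId + "/cancel");
            links.put("pay", "/orders/" + orderId + "/pay");
        }
        return links; // a shipped order simply exposes no "cancel" link
    }
}
```

The client never encodes business rules; it just checks whether the link is present.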
02
Idempotency
// retry safety · payment systems · distributed reliability
Q-07
What is idempotency and why does it matter in distributed systems?
Mid Level

Idempotency means making the same request multiple times produces the exact same result as making it once. Mathematically: f(f(x)) = f(x).

In distributed systems, network retries are inevitable. A client never knows if a request was lost in transit or if the server processed it and the response was lost. Without idempotency, retrying a payment could charge a user twice.

Senior framing: "Idempotency is what lets you safely add retry logic without adding catastrophic side effects. It's the difference between a resilient system and a financial audit nightmare."
Q-08
Which HTTP methods are idempotent and why?
Foundational
Method  | Idempotent? | Why
GET     | ✓ Yes       | Read-only. Multiple reads don't change state.
HEAD    | ✓ Yes       | Same as GET but no body. Pure metadata.
OPTIONS | ✓ Yes       | Describes capabilities only.
PUT     | ✓ Yes       | Sets the resource to a specific state. Running it 10 times = same result.
DELETE  | ✓ Yes       | After the first delete, the resource is gone. Subsequent calls are no-ops (404 or 200).
POST    | ✗ No        | Creates a new resource each time. Two POST /payments = two charges.
PATCH   | Varies      | {"name":"Alice"} is idempotent. {"views": views+1} is not.
Q-09
How do you implement an idempotent POST endpoint? Design it end-to-end.
Senior Level
🎯 What Interviewer Wants: The Idempotency-Key pattern used by Stripe/PayPal. They also want a server-side storage strategy.
  1. Client generates a UUID key
  2. Client sends POST /payments with header Idempotency-Key: abc-123
  3. Server checks Redis for the key:
     • New? → process the payment + cache the response
     • Seen? → return the cached result
Java — Spring Boot Pseudocode
@PostMapping("/payments")
public ResponseEntity<PaymentResponse> process(
        @RequestHeader("Idempotency-Key") String key,
        @RequestBody PaymentRequest req) {

    // 1. Check Redis for this key
    String cached = redis.get("idem:" + key);
    if (cached != null) {
        // Replay the cached response with the SAME status code as the original (201)
        return ResponseEntity.status(201).body(deserialize(cached));
    }

    // 2. Process the payment (first time only)
    PaymentResponse result = paymentService.charge(req);

    // 3. Cache the response with a 24h TTL
    redis.setex("idem:" + key, 86400, serialize(result));
    return ResponseEntity.status(201).body(result);
}
  • Client must generate a UUID and include it on every retry of the same logical operation
  • Use Redis with a 24-48 hour TTL — long enough to cover retry windows
  • For banking: use DB unique constraint on (userId, idempotencyKey) for ACID compliance
  • Return the exact same response code (e.g., 201) on replayed requests — don't return 200
Q-10
Give me a real-world failure scenario where idempotency saves the day.
Mid Level
📦 Real World — Mobile Payment on Spotty Network
T=0.0s
User on a train taps "Pay ₹2,000". App generates key UUID: abc-999. Sends request.
T=0.3s
Server receives the request, charges the card. Payment succeeds internally.
T=0.4s
Train enters a tunnel. TCP connection drops. The 200 OK response is lost.
T=3.0s
App timeout fires. Auto-retry sends the same request with the same Idempotency-Key: abc-999.
T=3.2s
Server finds abc-999 in Redis. Returns cached 200 OK. No second charge.
💸 Without idempotency: The user is charged ₹4,000. Unless they file a support ticket, they never get refunded. This is a trust-destroying, legally exposing bug.
04
Caching
// cache-control · etags · compression · CDN
Q-14
How do you implement HTTP caching properly? Explain ETags and 304.
Senior Level

HTTP has a built-in caching system most developers underuse. There are two layers: freshness (Cache-Control) and validation (ETags).

  1. Client → GET /products/5
  2. Server → 200 OK + ETag: "v3-abc"
  3. Client caches the response
  4. Client re-requests with If-None-Match: "v3-abc"
  5. Server → 304 Not Modified (empty body)
Header                      | Purpose                                      | Example use
Cache-Control: max-age=3600 | Client caches for 1 hour, no server check    | Static assets, config data
Cache-Control: no-cache     | Must revalidate with the server each time    | User-specific data
Cache-Control: private      | CDN won't cache; only the browser can        | Auth-scoped responses
ETag: "v1-hash"             | Content fingerprint for conditional requests | Product catalogue, config
304 Not Modified            | Data unchanged — use your cache              | Massive bandwidth savings
Real impact: A product listing API returning 50KB of JSON, called 10,000 times/hour, with a 1-hour Cache-Control cuts your bandwidth from 500MB/hr to near-zero during cache-valid periods.
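The server side of the validation flow fits in a few lines. A minimal sketch, assuming a SHA-256 fingerprint of the response body (the hash choice and method names are illustrative; any stable content hash works as an ETag):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ETagCheck {
    // Derives a short, quoted fingerprint of the response body.
    public static String etagOf(String body) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256")
                    .digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("\"");
            for (int i = 0; i < 8; i++) sb.append(String.format("%02x", h[i]));
            return sb.append('"').toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // 304 when the client's If-None-Match matches the current ETag;
    // otherwise 200 with the full body.
    public static int statusFor(String ifNoneMatch, String currentEtag) {
        return currentEtag.equals(ifNoneMatch) ? 304 : 200;
    }
}
```

On a match the server sends only headers, which is where the bandwidth savings come from.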
Q-15
How do you handle response compression and what's the difference between gzip and brotli?
Mid Level
HTTP Headers — Compression Negotiation
// Client sends (in order of preference):
Accept-Encoding: br, gzip, deflate

// Server responds with the chosen encoding:
Content-Encoding: br
Content-Type: application/json
Algorithm   | Compression ratio | Speed           | Support
gzip        | ~70% reduction    | Fast            | Universal (all browsers)
brotli (br) | ~80% reduction    | Slightly slower | Modern browsers, HTTPS only
deflate     | ~65% reduction    | Fast            | Legacy; avoid
  • Enable brotli on your CDN/Nginx for modern clients. Fall back to gzip for others.
  • Compress anything over 1KB. Skip tiny responses — compression overhead isn't worth it.
  • Never compress binary data (images, videos) — they're already compressed.
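To see the effect on a repetitive JSON payload, here is a small sketch using the JDK's built-in GZIPOutputStream (the JDK has no native brotli encoder, so gzip stands in; class and method names are my own):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    // Compresses a byte payload with gzip and returns the compressed bytes.
    public static byte[] gzip(byte[] input) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(input);
            } // closing flushes the gzip trailer
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Highly repetitive JSON (repeated keys, similar rows) compresses dramatically, which is why list endpoints benefit the most.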
05
Async / Long-Running Operations
// 202 accepted · polling · webhooks callback · timeouts
Q-16
How do you design an API for a task that takes 5+ minutes? (e.g., generate a PDF report)
Senior Level
🎯 What Interviewer Wants: The "Asynchronous Request-Reply" pattern. They want to hear about 202, polling endpoints, and not blocking connections.
  1. Client kicks off the job: POST /reports
  2. Server responds: 202 Accepted with Location: /tasks/job-99
  3. Client polls: GET /tasks/job-99 → {"status":"processing", "percent": 60}
  4. Eventually: {"status":"done", "result": "/reports/7"}
JSON — Polling Response States
// While processing:
{ "status": "processing", "percent": 60, "estimatedSeconds": 45 }

// On completion:
{ "status": "completed", "resultUrl": "/reports/999" }

// On failure:
{ "status": "failed", "error": "PDF generation timeout", "retryable": true }
💡 Senior upgrade: "Instead of polling, I'd prefer to combine this with a Webhook callback. The client registers a callbackUrl in the initial POST. When done, the server pushes the result to that URL. This avoids wasted polling traffic and reduces latency to completion notification."
06
Bulk Operations
// batch endpoints · partial failure · 207 multi-status
Q-17
Design a REST API to create 10,000 users at once. Handle partial failures.
Expert Level

Never implement bulk creation by having the client call POST /users in a loop — that's 10,000 HTTP round trips. Create a dedicated batch endpoint.

  • Endpoint: POST /users/batch
  • Atomic (All-or-Nothing): Wrap all inserts in a DB transaction. One failure rolls back everything. Use for financial data where partial state is dangerous.
  • Partial Success (Preferred for large batches): Process all, report per-item success/failure. Use 207 Multi-Status.
  • Async Batch: For very large batches (10k+), return 202 Accepted + job ID immediately. Process in background. Return results via polling or webhook.
JSON — 207 Multi-Status Response
{
  "summary": { "total": 3, "success": 2, "failed": 1 },
  "results": [
    { "clientId": "row-1", "status": 201, "resourceId": 1001 },
    { "clientId": "row-2", "status": 400, "error": "Email 'x@y' already exists" },
    { "clientId": "row-3", "status": 201, "resourceId": 1003 }
  ]
}
⚠️ Important: The top-level HTTP status is 207 (or 200), NOT 400. The batch endpoint itself succeeded. Individual items may have failed, and that's reflected per-item inside the response.
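The partial-success semantics can be sketched as a pure function. The validation rules and names below are invented for illustration; the point is that a bad item never aborts the batch:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchCreate {
    public static final class ItemResult {
        public final String clientId;
        public final int status;   // per-item HTTP-style status
        public final String error;
        ItemResult(String clientId, int status, String error) {
            this.clientId = clientId;
            this.status = status;
            this.error = error;
        }
    }

    // Processes every item; failures are reported per item, never aborting the batch.
    public static List<ItemResult> createUsers(List<String> emails) {
        Set<String> seen = new HashSet<>();
        List<ItemResult> results = new ArrayList<>();
        for (int i = 0; i < emails.size(); i++) {
            String email = emails.get(i);
            boolean valid = email.contains("@") && seen.add(email);
            results.add(valid
                ? new ItemResult("row-" + (i + 1), 201, null)
                : new ItemResult("row-" + (i + 1), 400, "Invalid or duplicate email"));
        }
        return results; // the top-level HTTP status would be 207 regardless of per-item failures
    }
}
```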
07
API Versioning
// URI · header · content-type · deprecation
Q-18
Compare URI versioning vs Header versioning. What's your recommendation?
Senior Level
Strategy          | Example                               | Pros                                                          | Cons
URI Versioning    | /v1/users                             | Visible in browser, easy to test, cache-friendly, most common | Violates REST purism (a URI should be a stable resource address)
Header Versioning | Accept-Version: v1                    | Clean URIs, purer REST                                        | Can't test in a browser, complex caching rules
Media Type        | Accept: application/vnd.myapi.v2+json | Purist REST, content negotiation                              | Most complex to implement and debug
Query Param       | /users?version=2                      | Simple to add                                                 | Messy, breaks caching, not recommended
💡 My recommendation: URI versioning (/v1/) for public APIs — it wins on developer ergonomics, discoverability, and CDN cacheability. Header versioning for strict internal microservices where URL cleanliness matters. Never query params.
  • Maintain at most 2 major versions simultaneously — deprecation cost is real
  • Set a deprecation timeline upfront (e.g., v1 → sunset in 6 months after v2 launch)
  • Add Deprecation: true and Sunset: date response headers to warn clients automatically
  • Never make breaking changes inside a version (removing fields, changing types)
08
Rate Limiting
// token bucket · leaky bucket · 429 · headers
Q-19
How do you design and communicate rate limiting to clients?
Senior Level
HTTP — Rate Limit Response Headers
// On every response, include these:
X-RateLimit-Limit: 1000        // requests allowed per window
X-RateLimit-Remaining: 37      // requests left in the current window
X-RateLimit-Reset: 1709459200  // Unix timestamp when the window resets

// When the limit is exceeded (429):
HTTP/1.1 429 Too Many Requests
Retry-After: 60                // seconds until the client can retry
Algorithm      | How it works                                                                         | Use case
Token Bucket   | Bucket fills at a fixed rate; each request costs 1 token. Allows short bursts.       | API endpoints, default choice
Leaky Bucket   | Queue smooths traffic; output rate is constant regardless of input bursts.           | Payment processing, precise throttling
Fixed Window   | Counter resets every N seconds. Simple, but allows a 2x burst at window boundaries.  | Simple internal tools
Sliding Window | Counts requests in the last N seconds, rolling. Accurate, no boundary burst.         | High-security APIs
Advanced consideration: Implement rate limits at multiple levels — per IP (DDoS), per API key (fair use), per user, and per endpoint (expensive endpoints get tighter limits). Use Redis with atomic increment operations to ensure correctness across distributed nodes.
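A single-node Token Bucket can be sketched in a few lines; time is injected as a parameter for determinism. This is illustrative only (in production the state would live in Redis with atomic operations, as noted above):

```java
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;  // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(long capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;          // start full: allows an initial burst
        this.lastRefill = nowNanos;
    }

    // Returns true if the request is admitted, consuming one token.
    public synchronized boolean tryAcquire(long nowNanos) {
        tokens = Math.min(capacity, tokens + (nowNanos - lastRefill) * refillPerNano);
        lastRefill = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                    // caller should respond 429 + Retry-After
    }
}
```

Capacity controls the burst size; the refill rate controls the sustained throughput, which is exactly the knob pair interviewers expect you to name.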
09
Webhooks
// reliability · security · retries · DLQ · ordering · observability
Q-20
What is a Webhook? How does it differ from Polling and Server-Sent Events?
Mid Level
Pattern      | Direction              | Connection                           | Best For
Polling      | Client → Server (pull) | Repeated short HTTP calls            | Relaxed latency requirements, simple setup
Long Polling | Client → Server (pull) | Client waits until data is available | Near real-time without WebSockets
Webhook      | Server → Client (push) | One-shot HTTP POST per event         | Event-driven callbacks, integrations
SSE          | Server → Client (push) | Persistent one-way stream            | Live dashboards, in-browser notifications
WebSocket    | Bi-directional         | Persistent full-duplex               | Chat, live collaboration, gaming
💡 Polling waste example: Checking every 5 seconds for a payment update = 12 requests/minute per user. With 100k users: 1.2M wasted API calls/minute. A webhook makes 1 call when the event occurs.
Q-21
How do you secure webhooks against replay attacks and impersonation?
Senior Level
🎯 What Interviewer Wants: Just saying "HTTPS" is not enough. They want HMAC signatures, replay-attack prevention, and IP allowlisting.
🔐 Stripe-style HMAC Signature Flow
Setup
User registers webhook URL. Your system generates a unique webhook_secret (e.g., whsec_abc123xyz). Shared only between you and them.
Signing (Your Server)
When event fires: construct signing string = timestamp + "." + payload_json. Compute HMAC-SHA256(signing_string, webhook_secret). Attach to header: X-Signature: t=1700000000,v1=abc123...
Verification (Client Server)
Client extracts timestamp and signature. Recomputes the HMAC using their stored secret. Compares: computed == X-Signature. If not equal → reject 401.
Replay Prevention
Client checks: |current_time - timestamp| > 300 seconds → reject. An attacker capturing a valid request can't replay it after 5 minutes.
  • IP Allowlisting: An optional extra layer, though inflexible to maintain. Stripe publishes its outbound IP ranges; allowlist them in your firewall.
  • Rotate secrets: Allow secret rotation without downtime by accepting two valid signatures during a transition window.
  • Use HTTPS on both sides: Webhook receiver endpoint must be HTTPS. Never accept webhook deliveries over HTTP.
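The signing and verification steps above can be sketched with the JDK's built-in HmacSHA256. The signing-string format and 300-second window mirror the flow described; the exact wire format varies by provider, and names here are illustrative:

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class WebhookSig {
    // Provider side: HMAC-SHA256 over "timestamp.payload", hex-encoded.
    public static String sign(String secret, long timestamp, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] raw = mac.doFinal((timestamp + "." + payload).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : raw) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Consumer side: reject stale timestamps (replay window), then recompute and compare.
    // Production code should compare with MessageDigest.isEqual for constant time.
    public static boolean verify(String secret, long timestamp, String payload,
                                 String signature, long nowEpochSeconds) {
        if (Math.abs(nowEpochSeconds - timestamp) > 300) return false; // older than 5 min → replay risk
        return sign(secret, timestamp, payload).equals(signature);
    }
}
```

Note that the timestamp is part of the signed string, so an attacker cannot simply swap in a fresh timestamp on a captured request.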
Q-22
Design a reliable retry strategy for webhook delivery with exponential backoff.
Senior Level
⚡ Real Scenario — E-commerce Store is Down During Black Friday
T+0s — Attempt 1
Payment succeeds. Webhook fired. E-commerce server returns 503 (overloaded). Mark attempt FAILED.
T+30s — Attempt 2
Retry. Still 503. Back off.
T+5m — Attempt 3
Retry. Still 503. Exponential back off.
T+30m — Attempt 4
Store is recovering. 200 OK returned. Delivered.

BACKOFF SCHEDULE

Attempt | Delay                | Cumulative
1       | Immediate            | 0s
2       | 30 seconds           | 30s
3       | 5 minutes            | ~5m
4       | 30 minutes           | ~35m
5       | 2 hours              | ~2.5h
...     | Exponential + Jitter | ...
25      | → DLQ                | ~72 hours
Add Jitter: Don't retry at exactly T+30s for all clients simultaneously. Add random variance: delay = base_delay * (1 + random(0, 0.2)). Otherwise thousands of retries hit the recovering server at the exact same second — making recovery impossible (Thundering Herd).
  • Timeout per attempt: Hard limit of 5 seconds. If no 2xx in 5s, count as failure.
  • Success criteria: Only HTTP 2xx counts as success. Timeouts, 3xx, and 5xx responses are retried; 410 Gone stops retries permanently.
  • Fast failure for 4xx: 400/401/403 responses usually mean a client bug — stop retrying, notify them immediately.
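The delay computation can be sketched as a small pure function. The base, cap, and jitter range below are illustrative (the schedule above is hand-tuned rather than purely exponential):

```java
import java.util.Random;

public class Backoff {
    // delay = min(base * 2^(attempt-1), cap), then stretched by 0..20% jitter.
    // attempt is 1-based; the shift is clamped to avoid long overflow.
    public static long delayMillis(int attempt, long baseMillis, long capMillis, Random rng) {
        long exp = baseMillis << Math.min(attempt - 1, 20);
        long capped = Math.min(exp, capMillis);
        double jitter = 1.0 + rng.nextDouble() * 0.2; // spread retries, avoid thundering herd
        return (long) (capped * jitter);
    }
}
```

With a 30s base and a 2h cap, early attempts retry quickly while late attempts settle at roughly two hours apart, plus per-client variance.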
Q-23
How do you prevent processing duplicate webhook events on the consumer side?
Senior Level
🎯 What Interviewer Wants: The concept of "at-least-once delivery" and idempotent consumers. Not just "use a unique ID" — but HOW.

In distributed systems, webhooks guarantee "at-least-once delivery", never "exactly-once". If the consumer's server times out before returning 200 OK, the provider retries — but the consumer may have already processed the event. Duplicates will happen.

Java — Idempotent Webhook Consumer
@PostMapping("/webhooks/stripe")
public ResponseEntity<Void> handleWebhook(@RequestBody WebhookEvent event) {
    // 1. Check if we've seen this event ID before
    if (processedEventRepo.exists(event.getId())) {
        log.info("Duplicate event {}. Skipping.", event.getId());
        return ResponseEntity.ok().build(); // still return 200!
    }

    // 2. Record it BEFORE processing — the DB unique constraint on event_id
    //    makes this atomic and safe against concurrent duplicates
    processedEventRepo.save(event.getId());

    // 3. Process business logic. In production, enqueue this and return 200
    //    immediately instead of blocking the webhook request; shown inline
    //    here for brevity.
    orderService.markPaid(event.getData().getOrderId());

    return ResponseEntity.ok().build();
}
  • Return 200 OK even for duplicate events — otherwise the sender will keep retrying infinitely
  • Use a DB unique constraint on event_id as your safety net — it's atomic and prevents race conditions
  • Process webhooks asynchronously — return 200 immediately, put event on internal queue
  • Store processed event IDs for at least the retry window duration (e.g., 72 hours)
Q-24
How do you handle out-of-order webhook events? (Race conditions)
Expert Level
⚡ Real Scenario — payment.success arrives before payment.created
Event A — T=1000
payment.created sent. Routed through congested network path.
Event B — T=1005
payment.success sent. Arrives first via fast path. Status set to "Success".
Event A arrives late
payment.created arrives. If naively applied: overwrites status back to "Pending". 💥 Bug.

SOLUTIONS

  • Timestamp comparison: Include created_at in each event. Only apply update if incoming timestamp is newer than current state's timestamp.
  • Version/Sequence numbers: Each event has a sequence: 3 field. Only apply if incoming_sequence > current_sequence.
  • Re-fetch pattern (Best Practice): Treat the webhook as a "nudge" only. Ignore the payload. Call GET /payments/{id} to fetch the absolute latest state from the source of truth. This eliminates ordering entirely.
Re-fetch is the cleanest solution. Your webhook handler becomes: "Something changed for payment X → fetch /payments/X → apply the current state." No ordering logic needed. Always consistent.
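If re-fetching is not an option, the timestamp-comparison approach amounts to a guard on every state update. A minimal sketch with illustrative names:

```java
public class PaymentState {
    // Last-write-wins by event timestamp: a late "created" event can never
    // roll an already-succeeded payment back to pending.
    private String status = "none";
    private long stateTimestamp = -1;

    // Applies the update only if the event is newer than the current state.
    public boolean applyIfNewer(String newStatus, long eventTimestamp) {
        if (eventTimestamp <= stateTimestamp) {
            return false; // stale or duplicate event → ignore
        }
        this.status = newStatus;
        this.stateTimestamp = eventTimestamp;
        return true;
    }

    public String status() {
        return status;
    }
}
```

The same guard works with monotonically increasing sequence numbers, which avoid clock-skew issues that wall-clock timestamps carry.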
Q-25
Design the ideal webhook payload structure. Thin vs Fat payload?
Mid Level
JSON — Ideal Webhook Payload (Stripe-style)
{
  "id": "evt_5Av7sXY2...",     // Unique event ID — for idempotency
  "object": "event",
  "type": "payment.success",   // Route logic without parsing data
  "apiVersion": "2024-01-01",  // Version the client is pinned to
  "created": 1700000000,       // Unix timestamp — for ordering
  "data": {
    "object": {                // Fat payload: full resource snapshot
      "id": "pay_abc123",
      "amount": 2000,
      "currency": "inr",
      "status": "succeeded"
    },
    "previousAttributes": {    // What changed (for diff use cases)
      "status": "pending"
    }
  }
}
         | Thin Payload                                           | Fat Payload
Contains | Just IDs: {"paymentId":"pay_123"}                      | Full resource snapshot
Pros     | Always fresh — client fetches the latest state         | No extra API call needed
Cons     | Extra API call required per event                      | May be stale by the time the client processes it
Best For | Frequently changing resources, security-sensitive data | Immutable events, audit logs
Q-26
How do you build observability and debugging tools for webhook users?
Expert Level

Webhooks are asynchronous and run in the background. When something breaks, the developer has no idea why. Observability is what separates a professional webhook system from a prototype.

  • Delivery Logs Dashboard: Show each sent event: payload sent, timestamp, HTTP status returned, response body, delivery duration, attempt number.
  • Manual Resend: A "Resend Event" button that re-triggers delivery of a specific event. Invaluable for developers fixing their endpoint during testing.
  • Test Events: Allow sending synthetic test events (like payment.success with fake data) without creating real transactions.
  • Proactive Alerting: Email/Slack alert: "Your webhook endpoint has failed 90% of deliveries in the last hour. We will disable it in 24 hours."
  • CLI / SDK helper: Stripe has stripe listen --forward-to localhost:3000 that proxies live events to local dev environments. Huge DX win.
Q-27
What is the Thundering Herd problem in webhooks? How do you architect around it?
Expert Level
🔴 The Problem: You have 100,000 events queued during an outage. When the client's server recovers, all retries fire simultaneously — 100k HTTP requests in seconds. This crashes the just-recovering server, triggering more failures, more retries. A death spiral.

ARCHITECTURE SOLUTION

Event occurs (payment.success)
  → Kafka / SQS queue (never send directly)
  → Worker pool (N workers per client)
  → HTTP POST to the client endpoint
  • Queue-based decoupling: Never send webhooks directly from the API server thread. All events go into Kafka/SQS first.
  • Per-client queues: Maintain separate queues per customer. One slow customer doesn't block others.
  • Concurrency limits per client: Max N concurrent HTTP connections to a single endpoint. No client gets hammered with 1000 simultaneous requests.
  • Circuit breaker per endpoint: If an endpoint fails 50% of attempts in 60 seconds, stop sending temporarily. Allow recovery before resuming.
  • Jitter on retries: Spread retries across a time window instead of all at once.
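The per-endpoint circuit breaker from the list above can be sketched minimally; thresholds and names are illustrative, and production breakers usually model an explicit half-open probe state:

```java
public class CircuitBreaker {
    private final int maxFailures;
    private final long cooldownMillis;
    private int failures = 0;
    private long openedAt = -1; // -1 means the breaker is closed

    public CircuitBreaker(int maxFailures, long cooldownMillis) {
        this.maxFailures = maxFailures;
        this.cooldownMillis = cooldownMillis;
    }

    // Called before each delivery attempt to this endpoint.
    public boolean allowSend(long nowMillis) {
        if (openedAt < 0) return true;                 // closed → deliver
        if (nowMillis - openedAt >= cooldownMillis) {  // cooldown over → probe again
            openedAt = -1;
            failures = 0;
            return true;
        }
        return false;                                  // open → skip, let the endpoint recover
    }

    public void recordFailure(long nowMillis) {
        if (++failures >= maxFailures) openedAt = nowMillis; // trip the breaker
    }

    public void recordSuccess() {
        failures = 0;
    }
}
```

One breaker instance per endpoint, combined with per-client queues, is what stops a single broken consumer from burning your delivery workers.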
Q-28
Design a Dead Letter Queue (DLQ) system for webhook failures. End to end.
Expert Level
📦 Full DLQ Flow — Payment Gateway
T+0
payment.success event queued in Kafka topic webhook-deliveries.
T+72h
25 delivery attempts. All failed (client server is broken). Max retries exhausted.
DLQ Move
Worker moves event to webhook-dlq Kafka topic. Stores full context: payload, all attempt timestamps, HTTP responses received.
Alert Fired
System sends email: "10 webhook events failed delivery and moved to DLQ. Your endpoint: shop.com/hooks". Dashboard shows DLQ count badge.
Recovery
Developer fixes their endpoint. Clicks "Replay DLQ" in dashboard. Events re-queued to webhook-deliveries with fresh retry budget.
  • DLQ is not "trash" — it's a recovery mechanism. Make replayability a first-class feature.
  • Store the reason for each failure (timeout, 500, connection refused) to help devs debug.
  • Allow selective replay (replay specific events, not all) for large DLQs.
  • Auto-disable endpoints that consistently fill the DLQ — protect your delivery workers.
  • Keep DLQ events for 30 days minimum. Financial events potentially longer.
Q-29
How do you handle webhook payload versioning without breaking existing consumers?
Expert Level

Breaking changes in webhook payloads can crash production systems without any deployment on the consumer's side. Schema evolution must be extremely careful.

  • Version pinning: Each webhook subscription stores the API version at registration time. User signed up under apiVersion: "2023-01-15" → always gets that payload shape, even when v2 is live.
  • Additive changes only: Adding new fields is safe. Renaming, removing, or changing types of existing fields is a breaking change requiring a new version.
  • Migration period: Announce deprecation 6+ months early. Send Deprecation: true warning header in webhook requests. Document the migration path.
  • Dual delivery: During migration, send events in both old and new format simultaneously. Consumers can migrate at their pace.
JSON — Versioning in Payload
{
  "id": "evt_123",
  "apiVersion": "2024-01-15",  // Which schema this payload follows
  "type": "payment.success",
  "data": { ... }
}
REF
HTTP Status Codes Cheat Sheet
// the ones that actually matter in interviews

2XX — SUCCESS

200 OK — Standard success for GET, PUT, PATCH, DELETE
201 CREATED — POST success. Include a Location header with the new resource's URL.
202 ACCEPTED — Async jobs. Processing started. Include a polling URL.
204 NO CONTENT — Success with no response body. Use for DELETE.
207 MULTI-STATUS — Batch operations with mixed per-item success/failure.

4XX — CLIENT ERROR

400 BAD REQUEST — Validation failed, malformed JSON, missing required fields.
401 UNAUTHORIZED — No auth token / invalid token. "Who are you?"
403 FORBIDDEN — Authenticated but no permission. "I know who you are, but no."
404 NOT FOUND — Resource doesn't exist. Also used to hide the existence of private resources.
409 CONFLICT — Resource already exists. Duplicate email, concurrent update conflict.
410 GONE — Resource permanently deleted. Stop retrying (unlike 404).
422 UNPROCESSABLE ENTITY — Syntactically valid but semantically wrong (e.g., start date after end date).
429 TOO MANY REQUESTS — Rate limit exceeded. Include a Retry-After header.

5XX — SERVER ERROR

500 INTERNAL SERVER ERROR — Unexpected crash. Log it; don't expose internals to the client.
502 BAD GATEWAY — Upstream service failed. The proxy/gateway got an invalid response.
503 SERVICE UNAVAILABLE — Server temporarily down. Include Retry-After. Circuit breaker open.
504 GATEWAY TIMEOUT — Upstream took too long. Consider async patterns.
CS
Senior Dev Answer Cheat Sheet
// pattern phrases that show seniority

Key Concepts to Always Mention

Topic         | When Asked About...     | Always Mention
REST Design   | How to design an API    | Resource naming, HTTP semantics, status codes, versioning, pagination, security from day 1
Idempotency   | POST safety, payments   | Idempotency-Key header, Redis with TTL, "at-least-once delivery", client-generated UUIDs
Pagination    | List endpoints at scale | Cursor-based > offset at scale, phantom rows problem, Base64 opaque cursors, tie-breakers for non-unique sorts
Caching       | Performance             | Cache-Control headers, ETags, 304 Not Modified, CDN, brotli compression
Long Tasks    | Slow operations         | 202 Accepted, polling endpoint, webhook callback option, never block connections
Webhooks      | Event-driven systems    | HMAC signatures, replay-attack prevention (timestamp check), exponential backoff + jitter, DLQ, idempotent consumers, re-fetch pattern for ordering
Rate Limiting | Abuse prevention        | Token Bucket algorithm, 429 status, X-RateLimit-* headers, per-IP + per-key + per-endpoint limits, Redis atomic increments
Versioning    | Breaking changes        | URI versioning for public APIs, headers for internal, additive-only within a version, Sunset headers
Bulk APIs     | Batch operations        | 207 Multi-Status, partial success, atomic vs non-atomic trade-offs, async for large batches

Senior-Level Phrasing That Impresses

  • "I design for the failure path first — what happens on network timeout, retry, crash, partial failure."
  • "In distributed systems, at-least-once delivery is the guarantee you can make. Exactly-once requires distributed transactions — expensive and often unnecessary."
  • "The re-fetch pattern treats webhooks as signals, not state. You always fetch fresh data from the source of truth."
  • "Offset pagination is O(N). Cursor pagination uses the B-Tree index — it's O(log N). At 10M rows, that's the difference between milliseconds and seconds."
  • "Security is layered: transport (TLS), authentication (OAuth/JWT), authorization (RBAC), input validation, rate limiting. Removing any layer creates a gap."
  • "I'd monitor P99 latency, not just averages. Averages hide the 1% of users who experience a 10-second timeout."