I follow a resource-oriented design approach. Everything starts with identifying your nouns (resources), not your actions (verbs).
RESOURCE NAMING
- /getUser — verb in URI ❌
- /updateUser?action=activate — RPC style ❌
- /users — collection (plural noun) ✓
- /users/{id} — single resource ✓
- /users/{id}/orders — nested sub-resource ✓
HTTP VERB MAPPING
| Verb | Action | Safe? | Idempotent? | Example |
|---|---|---|---|---|
| GET | Read | ✓ | ✓ | GET /users/5 |
| POST | Create | ✗ | ✗ | POST /users |
| PUT | Full Replace | ✗ | ✓ | PUT /users/5 |
| PATCH | Partial Update | ✗ | Mostly | PATCH /users/5 |
| DELETE | Remove | ✗ | ✓ | DELETE /users/5 |
PRODUCTION CONCERNS
- Versioning: URI versioning (/v1/users) for public APIs, header versioning for internal microservices
- Pagination: Cursor-based for large datasets, offset for simple internal tools
- Security: OAuth2 / JWT + HTTPS always on
- Error Handling: Structured JSON errors with machine-readable codes and a traceId
- Rate Limiting: Token Bucket at the gateway level
| Feature | PUT | PATCH |
|---|---|---|
| Action | Replace entire resource | Partial update (delta) |
| Payload | Full object required | Only changed fields |
| Bandwidth | Higher | Lower |
| Idempotent? | Always ✓ | Usually, not guaranteed |
| Side Effect | Missing fields → set to null/default | Untouched fields unchanged |
With PUT /users/5 {"name":"Alice"} (no email field in the payload), the email should technically be set to null. Many devs accidentally use PUT when they mean PATCH, causing data loss in production.
Use PUT when onboarding a user via a config file that always has all fields. Use PATCH for user profile edits where they only change their profile picture — you don't want to accidentally erase their name.
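The difference is easy to demonstrate with plain dictionaries. A minimal sketch (the resource and field names are illustrative, not from any particular framework):

```python
def put(resource: dict, payload: dict) -> dict:
    """PUT: full replace. The payload becomes the entire resource."""
    return dict(payload)

def patch(resource: dict, payload: dict) -> dict:
    """PATCH: merge. Only fields present in the payload change."""
    return {**resource, **payload}

user = {"name": "Bob", "email": "bob@example.com"}

print(put(user, {"name": "Alice"}))    # {'name': 'Alice'}  <- email is gone!
print(patch(user, {"name": "Alice"}))  # {'name': 'Alice', 'email': 'bob@example.com'}
```

The PUT result silently drops the email, which is exactly the production data-loss failure described above.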
Never return plain text or stack traces. I use a structured envelope that serves both machines (the code field) and developers (the traceId field for log correlation).
- Never expose stack traces, SQL errors, or internal hostnames to clients
- Always log full detail server-side, return only what's needed
- Use Problem Details (RFC 7807) format in enterprise APIs for standardization
- Validation errors (400) should list all field errors at once — not one at a time
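One possible shape for such an envelope, as a sketch (field names like code and traceId follow the conventions above; the exact schema is illustrative, loosely inspired by RFC 7807):

```python
import json
import uuid

def error_response(status: int, code: str, message: str, field_errors=None) -> str:
    """Build a structured error body: machine-readable code + traceId for log correlation."""
    body = {
        "status": status,
        "code": code,                  # machine-readable, e.g. "VALIDATION_FAILED"
        "message": message,            # human-readable summary
        "traceId": str(uuid.uuid4()),  # correlate with server-side logs
    }
    if field_errors:                   # report ALL field errors at once
        body["errors"] = field_errors
    return json.dumps(body)

print(error_response(400, "VALIDATION_FAILED", "Request body invalid",
                     [{"field": "email", "issue": "must be a valid address"},
                      {"field": "age", "issue": "must be >= 0"}]))
```

Note the client never sees a stack trace; the traceId is what a developer quotes when filing a support ticket, and you grep your logs for it.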
- Transport: TLS 1.2+ (HTTPS) everywhere. Reject plain HTTP. Enforce HSTS.
- Authentication (AuthN): OAuth 2.0 for 3rd-party delegated access. JWT for stateless service-to-service. API Keys for machine clients with IP whitelisting.
- Authorization (AuthZ): RBAC (Role-Based) or ABAC (Attribute-Based). Validate permissions on every request — never trust client claims.
- Input Validation: Reject malformed inputs at the gateway. Prevent SQL Injection, XSS, Path Traversal.
- Rate Limiting: Per-user and per-IP limits. Return 429 Too Many Requests with a Retry-After header.
- CORS: Strict allowlist of trusted origins. Never Access-Control-Allow-Origin: * for authenticated APIs.
- Secrets: Never put API keys in URLs. Use the Authorization header or request body.
- Audit Logging: Log all write operations (POST/PUT/PATCH/DELETE) with user, IP, timestamp, and payload hash.
| Level | Name | What it means | Real World |
|---|---|---|---|
| L0 | Swamp of POX | HTTP as a tunnel. One endpoint. SOAP / XML-RPC. | Legacy enterprise SOAP |
| L1 | Resources | Multiple URIs like /users, /products. But still just POST everywhere. | Early internal APIs |
| L2 | HTTP Verbs | Correct GET/POST/PUT/DELETE + Status Codes. This is where 90% of "REST" APIs live. | Stripe, GitHub API |
| L3 | HATEOAS | API drives the client via links in responses. Self-documenting conversation. | Rarely seen in practice |
HATEOAS (Hypermedia as the Engine of Application State) means the API response tells the client what actions are available next — instead of the client hardcoding URL patterns.
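A HATEOAS-style response can be sketched as a plain payload with a _links section. This is a hand-rolled illustration, not any specific hypermedia standard; the URLs, relation names, and the "pending order" rules are invented for the example:

```python
import json

def order_representation(order_id: str, status: str) -> dict:
    """Embed the next available actions as links, so clients follow them
    instead of hardcoding URL patterns."""
    links = {"self": {"href": f"/orders/{order_id}"}}
    if status == "pending":  # actions only valid while the order is open
        links["cancel"] = {"href": f"/orders/{order_id}/cancel", "method": "POST"}
        links["pay"] = {"href": f"/orders/{order_id}/payment", "method": "POST"}
    return {"id": order_id, "status": status, "_links": links}

print(json.dumps(order_representation("o-42", "pending"), indent=2))
```

The key point: a shipped order simply has no cancel link, so the client's UI disables the button without duplicating business rules.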
Idempotency means making the same request multiple times produces the exact same result as making it once. Mathematically: f(f(x)) = f(x).
In distributed systems, network retries are inevitable. A client never knows if a request was lost in transit or if the server processed it and the response was lost. Without idempotency, retrying a payment could charge a user twice.
| Method | Idempotent? | Why |
|---|---|---|
| GET | ✓ Yes | Read-only. Multiple reads don't change state. |
| HEAD | ✓ Yes | Same as GET but no body. Pure metadata. |
| OPTIONS | ✓ Yes | Describes capabilities only. |
| PUT | ✓ Yes | Sets resource to a specific state. Running 10 times = same result. |
| DELETE | ✓ Yes | After first delete, resource is gone. All subsequent are no-ops (404 or 200). |
| POST | ✗ No | Creates new resources each time. Two POST /payments = two charges. |
| PATCH | Varies | {"name":"Alice"} is idempotent. {"views": views+1} is not. |
IDEMPOTENCY-KEY FLOW
1. Client generates a UUID key.
2. Client sends the request with Idempotency-Key: abc-123.
3. Server checks Redis for the key.
4. Key not found → process the request and cache the response.
5. Key found → return the cached result without reprocessing.
- Client must generate a UUID and include it on every retry of the same logical operation
- Use Redis with a 24-48 hour TTL — long enough to cover retry windows
- For banking: use a DB unique constraint on (userId, idempotencyKey) for ACID compliance
- Return the exact same response code (e.g., 201) on replayed requests — don't return 200
WORKED EXAMPLE
1. Client generates UUID abc-999 and sends the payment request.
2. The response is lost in transit, so the client retries with the same Idempotency-Key: abc-999.
3. Server finds abc-999 in Redis and returns the cached 200 OK. No second charge.
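The server-side check-then-cache logic can be sketched with a dict standing in for Redis (a real implementation would use an atomic SET with NX and a TTL; the handler and field names here are illustrative):

```python
import uuid

idempotency_cache = {}  # stands in for Redis: SET key value NX EX 86400

def handle_payment(idempotency_key: str, amount: int):
    """Replay-safe POST handler: same key -> same cached response, no double charge."""
    if idempotency_key in idempotency_cache:
        return idempotency_cache[idempotency_key]      # replay: return cached result
    charge_id = str(uuid.uuid4())                      # the side effect happens once
    response = (201, {"chargeId": charge_id, "amount": amount})
    idempotency_cache[idempotency_key] = response      # cache status code AND body
    return response

first = handle_payment("abc-999", 500)
retry = handle_payment("abc-999", 500)
assert first == retry   # identical response, including the 201 status code
```

Caching the tuple of status code plus body is what lets the replay return the original 201 rather than a misleading 200, matching the checklist above.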
| Type | How it works | Best for | Avoid when |
|---|---|---|---|
| Offset | ?page=3&size=20 (SQL: OFFSET 60 LIMIT 20) | Small datasets, admin panels, user expects "page 5" | Large datasets, real-time data |
| Cursor/Keyset | ?cursor=last_id (SQL: WHERE id > ? LIMIT 20) | Infinite scroll, social feeds, large datasets | When random page access is required |
| Seek/Keyset+ | Multi-column cursor: (date, id) | Complex sorting with non-unique columns | Simple use cases |
| Time-based | ?since=timestamp&until=timestamp | Audit logs, event streams, sync APIs | User-facing paginated lists |
OFFSET 500000 LIMIT 20 forces the DB to scan and discard 500,000 rows before returning 20. It's O(N) and gets slower every page. At page 10,000, your query takes seconds.
If you expose a raw cursor like ?cursor=102, devs start building logic around it (e.g., "cursor + 1 = next page"), coupling them to your DB structure. Base64 encoding says "this is a black box — just pass it back."
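An opaque cursor is just the keyset columns serialized and Base64-encoded. A minimal sketch (the cursor contents and column names are illustrative; production cursors are often also signed or encrypted):

```python
import base64
import json

def encode_cursor(last_id: int, last_created_at: str) -> str:
    """Opaque cursor: clients pass it back verbatim; no arithmetic is possible."""
    raw = json.dumps({"id": last_id, "created_at": last_created_at})
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

cursor = encode_cursor(102, "2024-01-05T12:00:00Z")
print(cursor)                 # an opaque token, e.g. 'eyJpZCI6IDEwMiwg...'
print(decode_cursor(cursor))  # {'id': 102, 'created_at': '2024-01-05T12:00:00Z'}

# Server-side keyset query (sketch):
#   WHERE (created_at, id) > (?, ?) ORDER BY created_at, id LIMIT 20
```

The (created_at, id) pair also doubles as the tie-breaker for non-unique sort columns, which the Seek/Keyset+ row above refers to.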
HTTP has a built-in caching system most developers underuse. There are two layers: freshness (Cache-Control) and validation (ETags).
ETAG VALIDATION FLOW
1. Client: GET /products/5
2. Server: 200 OK + ETag: "v3-abc"
3. Client caches the response.
4. Later, the client repeats the request with If-None-Match: "v3-abc".
5. Server: 304 Not Modified (empty body). The client keeps using its cached copy.
| Header | Purpose | Example |
|---|---|---|
| Cache-Control: max-age=3600 | Client caches for 1 hour, no server check | Static assets, config data |
| Cache-Control: no-cache | Must revalidate with server each time | User-specific data |
| Cache-Control: private | CDN won't cache; only the browser can | Auth-scoped responses |
| ETag: "v1-hash" | Content fingerprint for conditional requests | Product catalogue, config |
| 304 Not Modified | Data unchanged — use your cache | Massive bandwidth savings |
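The ETag/304 handshake can be sketched in a few lines. This is an illustration of the mechanism, not a real server; here the ETag is derived from a content hash (other schemes, like version counters, work too):

```python
import hashlib
import json

def make_etag(resource: dict) -> str:
    """Content fingerprint: identical bytes always produce the identical ETag."""
    payload = json.dumps(resource, sort_keys=True).encode()
    return '"' + hashlib.sha256(payload).hexdigest()[:12] + '"'

def conditional_get(resource: dict, if_none_match):
    """Return 304 with an empty body when the client's cached copy is still current."""
    etag = make_etag(resource)
    if if_none_match == etag:
        return 304, None, etag      # nothing to transfer
    return 200, resource, etag      # full body + fresh ETag

product = {"id": 5, "name": "Widget", "price": 999}
status1, body1, etag = conditional_get(product, None)   # first fetch: 200 + body
status2, body2, _ = conditional_get(product, etag)      # revalidation: 304, no body
print(status1, status2)  # 200 304
```

The bandwidth saving comes from step two: the server still does the work of computing the ETag, but ships zero payload bytes.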
| Algorithm | Compression Ratio | Speed | Support |
|---|---|---|---|
| gzip | ~70% reduction | Fast | Universal (all browsers) |
| brotli (br) | ~80% reduction | Slightly slower | Modern browsers + HTTPS only |
| deflate | ~65% reduction | Fast | Legacy; avoid |
- Enable brotli on your CDN/Nginx for modern clients. Fall back to gzip for others.
- Compress anything over 1KB. Skip tiny responses — compression overhead isn't worth it.
- Never compress binary data (images, videos) — they're already compressed.
ASYNC TASK FLOW (202 ACCEPTED)
1. Client kicks off the job; the server responds 202 Accepted with Location: /tasks/job-99.
2. Client polls the task URL → {"status": "processing", "percent": 60}
3. When finished → {"status": "done", "result": "/reports/7"}
Alternative: the client supplies a callbackUrl in the initial POST. When done, the server pushes the result to that URL. This avoids wasted polling traffic and reduces latency to completion notification.
Never expose batch operations as POST /users in a loop — that's 10,000 HTTP round trips. Create a dedicated batch endpoint.
- Endpoint: POST /users/batch
- Atomic (All-or-Nothing): Wrap all inserts in a DB transaction. One failure rolls back everything. Use for financial data where partial state is dangerous.
- Partial Success (Preferred for large batches): Process all items and report per-item success/failure. Use 207 Multi-Status.
- Async Batch: For very large batches (10k+), return 202 Accepted + job ID immediately. Process in the background. Return results via polling or webhook.
For partial success, return 207 (or 200), NOT 400. The batch endpoint itself succeeded. Individual items may have failed, and that's reflected per-item inside the response.
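A partial-success batch response can be sketched like this (the endpoint name, validation rule, and response shape are illustrative; 207 Multi-Status bodies vary between APIs):

```python
import json

def batch_create_users(items):
    """Process every item; report a per-item outcome. The batch call itself
    returns 207 even when some items fail."""
    results = []
    for i, item in enumerate(items):
        if "email" not in item:                 # illustrative validation rule
            results.append({"index": i, "status": 400, "error": "email required"})
        else:
            results.append({"index": i, "status": 201, "id": f"u-{i}"})
    return 207, {"results": results}

status, body = batch_create_users([{"email": "a@x.com"}, {"name": "no-email"}])
print(status)                      # 207 even though item 1 failed
print(json.dumps(body, indent=2))
```

Clients then iterate over results, retrying or surfacing only the failed indices instead of re-submitting the entire batch.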
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URI Versioning | /v1/users | Visible in browser, easy to test, cache-friendly, most common | Violates REST (URI should be stable resource address) |
| Header Versioning | Accept-Version: v1 | Clean URIs, pure REST | Can't test in browser, complex caching rules |
| Media Type | Accept: application/vnd.myapi.v2+json | Purist REST, content negotiation | Most complex to implement and debug |
| Query Param | /users?version=2 | Simple to add | Messy, breaks caching, not recommended |
My default: URI versioning (/v1/) for public APIs — it wins on developer ergonomics, discoverability, and CDN cacheability. Header versioning for strict internal microservices where URL cleanliness matters. Never query params.
- Maintain at most 2 major versions simultaneously — deprecation cost is real
- Set a deprecation timeline upfront (e.g., v1 → sunset in 6 months after v2 launch)
- Add Deprecation: true and Sunset: <date> response headers to warn clients automatically
- Never make breaking changes inside a version (removing fields, changing types)
| Algorithm | How it works | Use case |
|---|---|---|
| Token Bucket | Bucket fills at fixed rate. Each request costs 1 token. Allows short bursts. | API endpoints, default choice |
| Leaky Bucket | Queue smooths traffic. Output rate is constant regardless of input bursts. | Payment processing, precise throttling |
| Fixed Window | Reset counter every N seconds. Simple but allows 2x burst at window boundaries. | Simple internal tools |
| Sliding Window | Counts requests in the last N seconds, rolling. Accurate, no boundary burst. | High-security APIs |
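The Token Bucket row above, the default choice, can be sketched in a few lines. This is a single-process illustration; a real gateway would keep the bucket state in Redis with atomic operations:

```python
import time

class TokenBucket:
    """Bucket refills at `rate` tokens/sec up to `capacity`; each request costs
    one token. Allows short bursts up to the bucket capacity."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # refill proportionally to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False            # caller should respond 429 + Retry-After

bucket = TokenBucket(rate=5, capacity=10)   # 5 req/s steady, bursts up to 10
burst = [bucket.allow() for _ in range(12)]
print(burst.count(True))   # 10: the burst capacity; the remaining 2 get a 429
```

The burst-friendliness is the distinguishing property: a client that was idle can momentarily exceed the steady rate, which Leaky Bucket deliberately forbids.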
| Pattern | Direction | Connection | Best For |
|---|---|---|---|
| Polling | Client → Server (pull) | Repeated short HTTP calls | Relaxed latency requirements, simple setup |
| Long Polling | Client → Server (pull) | Client waits until data available | Near real-time, no WebSocket |
| Webhook | Server → Client (push) | One-shot HTTP POST per event | Event-driven callbacks, integrations |
| SSE | Server → Client (push) | Persistent one-way stream | Live dashboards, notifications in browser |
| WebSocket | Bi-directional | Persistent full-duplex | Chat, live collaboration, gaming |
1. Secret exchange: At registration, issue the consumer a webhook_secret (e.g., whsec_abc123xyz). Shared only between you and them.
2. Signing: Build the signing string timestamp + "." + payload_json. Compute HMAC-SHA256(signing_string, webhook_secret). Attach to a header: X-Signature: t=1700000000,v1=abc123...
3. Verification: The receiver recomputes the HMAC and checks computed == X-Signature. If not equal → reject with 401.
4. Replay protection: If |current_time - timestamp| > 300 seconds → reject. An attacker capturing a valid request can't replay it after 5 minutes.
- IP Allowlisting: Optional but rigid. Stripe publishes their outbound IP ranges. Allowlist them in your firewall.
- Rotate secrets: Allow secret rotation without downtime by accepting two valid signatures during a transition window.
- Use HTTPS on both sides: Webhook receiver endpoint must be HTTPS. Never accept webhook deliveries over HTTP.
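The sign-and-verify steps above can be sketched with the standard library (the secret value and header format are illustrative, modeled on the scheme described in this section):

```python
import hashlib
import hmac
import time

SECRET = b"whsec_abc123xyz"   # illustrative shared secret

def sign(payload_json: str, timestamp: int) -> str:
    """Producer side: HMAC over 'timestamp.payload', shipped in one header."""
    signing_string = f"{timestamp}.{payload_json}".encode()
    digest = hmac.new(SECRET, signing_string, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"

def verify(payload_json: str, header: str, tolerance: int = 300) -> bool:
    """Consumer side: recompute, constant-time compare, and reject stale timestamps."""
    parts = dict(p.split("=", 1) for p in header.split(","))
    timestamp, received = int(parts["t"]), parts["v1"]
    if abs(time.time() - timestamp) > tolerance:        # replay protection
        return False
    expected = hmac.new(SECRET, f"{timestamp}.{payload_json}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received)      # constant-time compare

payload = '{"paymentId":"pay_123"}'
header = sign(payload, int(time.time()))
print(verify(payload, header))                                  # True
print(verify(payload, sign(payload, int(time.time()) - 600)))   # False: too old
```

hmac.compare_digest matters here: a naive == comparison leaks timing information an attacker can use to forge signatures byte by byte.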
BACKOFF SCHEDULE
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | Immediate | 0s |
| 2 | 30 seconds | 30s |
| 3 | 5 minutes | ~5m |
| 4 | 30 minutes | ~35m |
| 5 | 2 hours | ~2.5h |
| ... | Exponential + Jitter | ... |
| 25 | DLQ | ~72 hours |
Always add jitter: delay = base_delay * (1 + random(0, 0.2)). Otherwise thousands of retries hit the recovering server at the exact same second, making recovery impossible (Thundering Herd).
- Timeout per attempt: Hard limit of 5 seconds. If no 2xx in 5s, count as failure.
- Success criteria: Only HTTP 2xx counts as success. 3xx and 5xx → retry. 410 Gone → stop permanently.
- Fast failure for 4xx: 400/401/403 responses usually mean a client bug — stop retrying, notify them immediately.
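A capped exponential backoff with jitter can be sketched as below. This is a simplified doubling schedule, not the exact table above (which uses hand-tuned early steps); the base delay and cap are illustrative:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 30.0, cap: float = 7200.0) -> float:
    """Exponential backoff, capped at `cap` seconds, with up to 20% random
    jitter so thousands of retries don't align into a thundering herd."""
    delay = min(cap, base_delay * (2 ** (attempt - 1)))
    return delay * (1 + random.uniform(0, 0.2))

for attempt in (1, 2, 3, 6):
    print(f"attempt {attempt}: ~{backoff_delay(attempt):.0f}s")
```

Two clients retrying the same failed delivery will now land tens of seconds apart instead of colliding on the exact same instant.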
In distributed systems, webhooks guarantee "at-least-once delivery", never "exactly-once". If your server times out before returning 200 OK, we retry — but you may have already processed it. Duplicates will happen.
- Return 200 OK even for duplicate events — otherwise the sender will keep retrying infinitely
- Use a DB unique constraint on event_id as your safety net — it's atomic and prevents race conditions
- Process webhooks asynchronously — return 200 immediately, put the event on an internal queue
- Store processed event IDs for at least the retry window duration (e.g., 72 hours)
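An idempotent consumer boils down to a dedupe check before the side effect. A minimal sketch (a set stands in for the DB unique constraint; event names are illustrative):

```python
processed_event_ids = set()   # stands in for a DB unique constraint on event_id
side_effects = []             # stands in for the real business logic / queue

def receive_webhook(event: dict) -> int:
    """Always return 200, even for duplicates, but apply the side effect only once."""
    if event["event_id"] in processed_event_ids:
        return 200                       # duplicate delivery: ack it, do nothing
    processed_event_ids.add(event["event_id"])
    side_effects.append(event["type"])   # e.g. enqueue for async processing
    return 200

receive_webhook({"event_id": "evt_1", "type": "payment.success"})
receive_webhook({"event_id": "evt_1", "type": "payment.success"})  # redelivery
print(len(side_effects))  # 1: processed exactly once despite two deliveries
```

In a real system the add-to-set and the side effect must be atomic (one DB transaction), otherwise a crash between them reintroduces the race the unique constraint was meant to close.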
1. payment.created is sent, but routed through a congested network path.
2. payment.success is sent and arrives first via a fast path. Status set to "Success".
3. payment.created finally arrives. If naively applied, it overwrites status back to "Pending". 💥 Bug.
SOLUTIONS
- Timestamp comparison: Include created_at in each event. Only apply the update if the incoming timestamp is newer than the current state's timestamp.
- Version/Sequence numbers: Each event carries a sequence: 3 field. Only apply if incoming_sequence > current_sequence.
- Re-fetch pattern (Best Practice): Treat the webhook as a "nudge" only. Ignore the payload. Call GET /payments/{id} to fetch the absolute latest state from the source of truth. This eliminates ordering entirely.
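The sequence-number guard is a one-line comparison. A minimal sketch of the out-of-order scenario above (state shape and event fields are illustrative):

```python
state = {"status": "unknown", "sequence": 0}

def apply_event(event: dict) -> bool:
    """Apply the event only if it is newer than what we have already seen."""
    if event["sequence"] <= state["sequence"]:
        return False                     # stale or duplicate: drop it
    state.update(status=event["status"], sequence=event["sequence"])
    return True

apply_event({"sequence": 2, "status": "Success"})   # fast-path event arrives first
apply_event({"sequence": 1, "status": "Pending"})   # late event: correctly ignored
print(state["status"])  # 'Success': not clobbered back to 'Pending'
```

The same shape works with timestamps instead of sequence numbers, at the cost of clock-skew edge cases, which is why monotonic sequence numbers are the safer choice when the producer can supply them.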
| Thin Payload | Fat Payload | |
|---|---|---|
| Contains | Just IDs: {"paymentId":"pay_123"} | Full resource snapshot |
| Pros | Always fresh — client fetches latest state | No extra API call needed |
| Cons | Extra API call required per event | May be stale by the time client processes it |
| Best For | Frequently changing resources, security-sensitive data | Immutable events, audit logs |
Webhooks are asynchronous and run in the background. When something breaks, the developer has no idea why. Observability is what separates a professional webhook system from a prototype.
- Delivery Logs Dashboard: Show each sent event: payload sent, timestamp, HTTP status returned, response body, delivery duration, attempt number.
- Manual Resend: A "Resend Event" button that re-triggers delivery of a specific event. Invaluable for developers fixing their endpoint during testing.
- Test Events: Allow sending synthetic test events (like payment.success with fake data) without creating real transactions.
- Proactive Alerting: Email/Slack alert: "Your webhook endpoint has failed 90% of deliveries in the last hour. We will disable it in 24 hours."
- CLI / SDK helper: Stripe has stripe listen --forward-to localhost:3000, which proxies live events to local dev environments. Huge DX win.
ARCHITECTURE SOLUTION
Producer emits payment.success → event lands on a message queue (never send directly) → N delivery workers per client → HTTP POST to the client endpoint.
- Queue-based decoupling: Never send webhooks directly from the API server thread. All events go into Kafka/SQS first.
- Per-client queues: Maintain separate queues per customer. One slow customer doesn't block others.
- Concurrency limits per client: Max N concurrent HTTP connections to a single endpoint. No client gets hammered with 1000 simultaneous requests.
- Circuit breaker per endpoint: If an endpoint fails 50% of attempts in 60 seconds, stop sending temporarily. Allow recovery before resuming.
- Jitter on retries: Spread retries across a time window instead of all at once.
1. A payment.success event is queued in the Kafka topic webhook-deliveries.
2. After all retries are exhausted, the event moves to the webhook-dlq Kafka topic. It stores full context: payload, all attempt timestamps, HTTP responses received.
3. The customer is alerted (e.g., "deliveries are failing to shop.com/hooks"). The dashboard shows a DLQ count badge.
4. Once the endpoint is fixed, events are replayed into webhook-deliveries with a fresh retry budget.
- DLQ is not "trash" — it's a recovery mechanism. Make replayability a first-class feature.
- Store the reason for each failure (timeout, 500, connection refused) to help devs debug.
- Allow selective replay (replay specific events, not all) for large DLQs.
- Auto-disable endpoints that consistently fill the DLQ — protect your delivery workers.
- Keep DLQ events for 30 days minimum. Financial events potentially longer.
Breaking changes in webhook payloads can crash production systems without any deployment on the consumer's side. Schema evolution must be extremely careful.
- Version pinning: Each webhook subscription stores the API version at registration time. A user who signed up under apiVersion: "2023-01-15" always gets that payload shape, even when v2 is live.
- Additive changes only: Adding new fields is safe. Renaming, removing, or changing types of existing fields is a breaking change requiring a new version.
- Migration period: Announce deprecation 6+ months early. Send a Deprecation: true warning header in webhook requests. Document the migration path.
- Dual delivery: During migration, send events in both old and new format simultaneously. Consumers can migrate at their pace.
2XX — SUCCESS
4XX — CLIENT ERROR
5XX — SERVER ERROR
Key Concepts to Always Mention
| Topic | When Asked About... | Always Mention |
|---|---|---|
| REST Design | How to design an API | Resource naming, HTTP semantics, status codes, versioning, pagination, security from day 1 |
| Idempotency | POST safety, payments | Idempotency-Key header, Redis with TTL, "at-least-once delivery", UUID client generation |
| Pagination | List endpoints at scale | Cursor-based > offset for scale, phantom rows problem, Base64 opaque cursors, tie-breaker for non-unique sorts |
| Caching | Performance | Cache-Control headers, ETags, 304 Not Modified, CDN, brotli compression |
| Long Tasks | Slow operations | 202 Accepted, polling endpoint, webhook callback option, never block connections |
| Webhooks | Event-driven systems | HMAC signature, replay attack prevention (timestamp check), exponential backoff + jitter, DLQ, idempotent consumer, re-fetch pattern for ordering |
| Rate Limiting | Abuse prevention | Token Bucket algorithm, 429 status, X-RateLimit-* headers, per-IP + per-key + per-endpoint limits, Redis atomic increments |
| Versioning | Breaking changes | URI versioning for public, header for internal, additive-only within version, Sunset headers |
| Bulk APIs | Batch operations | 207 Multi-Status, partial success, atomic vs non-atomic trade-offs, async for large batches |
Senior-Level Phrasing That Impresses
- "I design for the failure path first — what happens on network timeout, retry, crash, partial failure."
- "In distributed systems, at-least-once delivery is the guarantee you can make. Exactly-once requires distributed transactions — expensive and often unnecessary."
- "The re-fetch pattern treats webhooks as signals, not state. You always fetch fresh data from the source of truth."
- "Offset pagination is O(N). Cursor pagination uses the B-Tree index — it's O(log N). At 10M rows, that's the difference between milliseconds and seconds."
- "Security is layered: transport (TLS), authentication (OAuth/JWT), authorization (RBAC), input validation, rate limiting. Removing any layer creates a gap."
- "I'd monitor P99 latency, not just averages. Averages hide the 1% of users who experience a 10-second timeout."