Everything a Backend Engineer Should Know
1. API Design & REST Principles {#api-design}
Understanding REST
REST (Representational State Transfer) is an architectural style for designing networked applications. It's not a protocol or standard, but a set of constraints that, when applied correctly, makes your API predictable, scalable, and easy to understand.
Why REST Matters:
- Standardization: Everyone follows similar patterns, making APIs intuitive
- Scalability: Stateless design allows easy horizontal scaling
- Caching: Built-in HTTP caching mechanisms improve performance
- Flexibility: Language and platform agnostic
Core Principles Explained:
- Client-Server Separation: The client (frontend) and server (backend) are independent. They can evolve separately without affecting each other.
- Stateless: Each request contains all information needed to process it. The server doesn't store client state between requests. This makes servers simpler and more scalable.
- Cacheable: Responses must define themselves as cacheable or non-cacheable. This reduces client-server interactions and improves performance.
- Uniform Interface: Resources are identified by URLs, and operations are performed using standard HTTP methods. This creates consistency across all APIs.
- Layered System: The client can't tell whether it's connected directly to the server or through intermediaries (load balancers, caches, proxies).
RESTful API Design
HTTP Methods: Idempotency Explained
Idempotency means calling the operation multiple times produces the same result as calling it once. This is crucial for retry logic and fault tolerance.
- GET: Idempotent & Safe (no side effects)
- PUT: Idempotent (updating same resource multiple times = same result)
- DELETE: Idempotent (deleting same resource multiple times = same result)
- POST: NOT Idempotent (creates new resource each time)
- PATCH: May or may not be idempotent (depends on implementation)
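The idempotency rules above can be sketched with a toy in-memory store (the store and handler names are illustrative, not a real framework):

```python
import uuid

users = {}  # toy in-memory store standing in for a database

def put_user(user_id: str, data: dict) -> dict:
    """PUT: idempotent - replaying the request leaves the same single resource."""
    users[user_id] = data
    return users[user_id]

def post_user(data: dict) -> str:
    """POST: not idempotent - every replay creates a brand-new resource."""
    new_id = str(uuid.uuid4())
    users[new_id] = data
    return new_id

# Retrying a PUT is safe: still one resource, same final state
put_user("42", {"name": "Ada"})
put_user("42", {"name": "Ada"})
# Retrying a POST is not: two distinct resources now exist
id_a = post_user({"name": "Ada"})
id_b = post_user({"name": "Ada"})
```

This is why retry logic can blindly re-send PUT/DELETE but needs an idempotency key (e.g. a client-generated request ID) before re-sending POST.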
HTTP Status Codes You Must Know
| Code | Meaning | When to Use |
|---|---|---|
| 2xx Success | ||
| 200 | OK | Successful GET, PUT, PATCH |
| 201 | Created | Successful POST |
| 204 | No Content | Successful DELETE |
| 3xx Redirection | ||
| 301 | Moved Permanently | Resource moved |
| 304 | Not Modified | Cached response valid |
| 4xx Client Errors | ||
| 400 | Bad Request | Invalid input |
| 401 | Unauthorized | Authentication required |
| 403 | Forbidden | Authenticated but no permission |
| 404 | Not Found | Resource doesn't exist |
| 409 | Conflict | Resource conflict (duplicate) |
| 422 | Unprocessable Entity | Validation failed |
| 429 | Too Many Requests | Rate limit exceeded |
| 5xx Server Errors | ||
| 500 | Internal Server Error | Generic server error |
| 502 | Bad Gateway | Upstream server error |
| 503 | Service Unavailable | Server overloaded |
| 504 | Gateway Timeout | Upstream timeout |
API Design Best Practices
# ✅ GOOD: RESTful resource-based URLs
GET /api/v1/users # List users
GET /api/v1/users/{id} # Get user
POST /api/v1/users # Create user
PUT /api/v1/users/{id} # Update user
PATCH /api/v1/users/{id} # Partial update
DELETE /api/v1/users/{id} # Delete user
# Nested resources
GET /api/v1/users/{id}/posts # User's posts
GET /api/v1/posts?user_id={id} # Alternative
# ❌ BAD: Verb-based URLs
GET /api/v1/getUser?id=123
POST /api/v1/createUser
POST /api/v1/deleteUser?id=123
Request/Response Structure
from pydantic import BaseModel, Field, validator
from typing import Optional, List
from datetime import datetime
# Request Schema
class CreateUserRequest(BaseModel):
email: str = Field(..., regex=r'^[\w\.-]+@[\w\.-]+\.\w+$')
name: str = Field(..., min_length=2, max_length=100)
age: Optional[int] = Field(None, ge=0, le=150)
@validator('email')
def email_must_be_lowercase(cls, v):
return v.lower()
# Response Schema
class UserResponse(BaseModel):
id: int
email: str
name: str
age: Optional[int]
created_at: datetime
updated_at: datetime
class Config:
orm_mode = True
# Error Response
class ErrorResponse(BaseModel):
error: str
message: str
details: Optional[dict] = None
timestamp: datetime = Field(default_factory=datetime.utcnow)
request_id: Optional[str] = None
# Paginated Response
class PaginatedResponse(BaseModel):
items: List[UserResponse]
total: int
page: int
per_page: int
total_pages: int
@property
def has_next(self) -> bool:
return self.page < self.total_pages
@property
def has_prev(self) -> bool:
return self.page > 1
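The pagination fields above reduce to a little arithmetic; a standalone sketch of how page/per_page map to SQL LIMIT/OFFSET and total_pages:

```python
import math

def paginate(total: int, page: int, per_page: int) -> dict:
    """Derive the metadata PaginatedResponse carries (plain-dict sketch)."""
    total_pages = math.ceil(total / per_page)
    return {
        "page": page,
        "per_page": per_page,
        "total_pages": total_pages,
        "offset": (page - 1) * per_page,  # SQL: LIMIT per_page OFFSET offset
        "has_next": page < total_pages,
        "has_prev": page > 1,
    }

meta = paginate(total=95, page=2, per_page=20)
```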
API Versioning Strategies
# 1. URL Path Versioning (Recommended)
@app.get("/api/v1/users")
@app.get("/api/v2/users")
# 2. Header Versioning
@app.get("/api/users")
async def get_users(api_version: str = Header(default="v1")):
if api_version == "v2":
return new_format()
return old_format()
# 3. Query Parameter (Not Recommended)
@app.get("/api/users")
async def get_users(version: str = "v1"):
pass
# 4. Content Negotiation
# Accept: application/vnd.myapi.v2+json
HATEOAS (Hypermedia as the Engine of Application State)
{
"id": 123,
"name": "John Doe",
"email": "john@example.com",
"_links": {
"self": {
"href": "/api/v1/users/123"
},
"posts": {
"href": "/api/v1/users/123/posts"
},
"update": {
"href": "/api/v1/users/123",
"method": "PUT"
},
"delete": {
"href": "/api/v1/users/123",
"method": "DELETE"
}
}
}
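A response like the one above is usually assembled by a small helper; a sketch (the paths follow the example, the function name is made up):

```python
def user_links(user_id: int, base: str = "/api/v1") -> dict:
    """Build the _links block for a user resource."""
    href = f"{base}/users/{user_id}"
    return {
        "self": {"href": href},
        "posts": {"href": f"{href}/posts"},
        "update": {"href": href, "method": "PUT"},
        "delete": {"href": href, "method": "DELETE"},
    }

links = user_links(123)
```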
2. Database Fundamentals {#database-fundamentals}
Understanding Databases
Databases are the backbone of most applications, storing and managing data efficiently. As a backend engineer, understanding how databases work internally is crucial for building performant systems.
Key Concepts You Must Understand:
1. ACID vs BASE:
- ACID (SQL databases): Strong consistency, reliability, data integrity
- BASE (NoSQL databases): Availability, eventual consistency, scalability
2. Why Database Choice Matters:
- Wrong choice = Performance bottlenecks, scaling issues, increased costs
- Right choice = Smooth operations, happy users, easier maintenance
3. Read vs Write Patterns:
- Read-heavy (social media feeds): Consider read replicas, caching
- Write-heavy (logging, analytics): Consider write-optimized databases
- Balanced (e-commerce): Need careful indexing and optimization
ACID Properties
ACID ensures database transactions are processed reliably. Understanding ACID is fundamental to designing robust systems.
Atomicity - "All or Nothing": Think of it like a bank transfer. Either both the debit and credit happen, or neither happens. There's no in-between state where money disappears or duplicates.
Consistency - "Valid State Always": The database must always be in a valid state. All constraints, triggers, and rules are enforced. If a transaction violates a rule (like negative balance), it's rejected entirely.
Isolation - "Concurrent but Safe": Multiple transactions can run simultaneously without interfering with each other. Each transaction feels like it's the only one running, even though hundreds might be executing concurrently.
Durability - "Permanent Once Committed": Once a transaction is committed, it's permanent, even if the server crashes immediately after. Data is written to non-volatile storage before the commit completes.
Real-World Example:
Transfer $100 from Account A to Account B:
1. BEGIN TRANSACTION
2. Check if A has >= $100 (Consistency)
3. Deduct $100 from A
4. Add $100 to B
5. If any step fails, ROLLBACK everything (Atomicity)
6. Other transactions can't see partial state (Isolation)
7. COMMIT - now it's permanent (Durability)
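The seven steps above can be run end to end with SQLite's transaction support (a minimal sketch; a real system would use a server database, but the ACID mechanics are the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 50.0), (2, 0.0)")
conn.commit()

def transfer(conn, from_id: int, to_id: int, amount: float) -> bool:
    try:
        with conn:  # BEGIN; COMMIT on success, ROLLBACK if anything raises
            cur = conn.execute(
                "UPDATE accounts SET balance = balance - ? "
                "WHERE user_id = ? AND balance >= ?",
                (amount, from_id, amount),
            )
            if cur.rowcount == 0:  # consistency rule: no negative balances
                raise ValueError("Insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE user_id = ?",
                (amount, to_id),
            )
        return True
    except ValueError:
        return False

rejected = transfer(conn, 1, 2, 100.0)  # atomicity: nothing changed
accepted = transfer(conn, 1, 2, 30.0)
balances = dict(conn.execute("SELECT user_id, balance FROM accounts"))
```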
Database Normalization
Normalization is the process of organizing data to reduce redundancy and improve data integrity. It's about structuring your database logically.
Why Normalize?
- Eliminate Redundancy: Don't store the same data in multiple places
- Prevent Anomalies: Avoid update, insert, and delete issues
- Improve Integrity: Ensure data consistency
- Easier Maintenance: Changes in one place reflect everywhere
When NOT to Normalize:
- Data warehouses: Denormalized for query performance
- Read-heavy systems: Joins are expensive at scale
- Caching layers: Denormalized for fast access
The Normal Forms Explained:
1NF (First Normal Form) - Atomic Values: Each cell contains a single value, not lists or arrays.
❌ Bad: products = "Laptop, Mouse, Keyboard"
✅ Good: Separate rows for each product
2NF (Second Normal Form) - No Partial Dependencies: Every non-key column depends on the entire primary key, not just part of it.
❌ Bad: (OrderID, ProductID) -> CustomerName
(CustomerName depends only on OrderID, not ProductID)
✅ Good: Separate Orders and OrderItems tables
3NF (Third Normal Form) - No Transitive Dependencies: Non-key columns don't depend on other non-key columns.
❌ Bad: User table with City and Country
(Country depends on City, not UserID)
✅ Good: Separate Cities table with Country
Example:
-- ❌ Unnormalized (Repeating Groups)
CREATE TABLE orders (
order_id INT,
customer_name VARCHAR(100),
products VARCHAR(500), -- "Product1, Product2, Product3"
prices VARCHAR(100) -- "10.00, 20.00, 15.00"
);
-- ✅ Normalized (3NF)
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100) UNIQUE
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT REFERENCES customers(customer_id),
order_date TIMESTAMP,
total_amount DECIMAL(10,2)
);
CREATE TABLE order_items (
order_item_id INT PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
product_id INT REFERENCES products(product_id),
quantity INT,
unit_price DECIMAL(10,2)
);
CREATE TABLE products (
product_id INT PRIMARY KEY,
name VARCHAR(200),
price DECIMAL(10,2)
);
Indexing Strategies
Indexes are like a book's table of contents: they help you find data quickly without scanning every page. However, indexes come with trade-offs.
How Indexes Work: Instead of scanning every row (Sequential Scan), the database uses a sorted data structure (usually B-Tree) to jump directly to the data you need.
Without Index:
Scan 1 million rows → Find your record → 500ms
With Index:
Jump to index → Find pointer → Get data → 5ms
The Index Trade-off:
- Faster Reads: Queries become much faster (10-100x improvement)
- Slower Writes: Every INSERT/UPDATE/DELETE must update indexes
- Storage Cost: Indexes consume disk space (can be 20-50% of table size)
When to Add Indexes:
- Columns in WHERE clauses
- Columns used in JOINs
- Columns in ORDER BY
- Foreign keys
When NOT to Add Indexes:
- Small tables (<1000 rows)
- Tables with frequent writes and rare reads
- Columns with low cardinality (e.g., boolean fields)
- Columns rarely queried
Index Types Explained:
B-Tree Index (Default):
- Best for: Equality (=) and range queries (<, >, BETWEEN)
- Most common, works for 95% of cases
- Keeps data sorted
Hash Index:
- Best for: Exact matches only (=)
- Very fast but can't do range queries
- Rarely used in practice
GIN/GiST (Generalized Indexes):
- Best for: Full-text search, JSON data, arrays
- Specialized indexes for complex data types
Partial Index:
- Best for: Querying subset of data frequently
- Example: Index only active users, not deleted ones
- Saves space and improves performance
-- B-Tree Index (Default, most common)
CREATE INDEX idx_users_email ON users(email);
-- Unique Index
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
-- Composite Index (Column order matters!)
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
-- Partial Index (PostgreSQL)
CREATE INDEX idx_active_users ON users(email) WHERE active = true;
-- Covering Index (Includes extra columns)
CREATE INDEX idx_orders_covering ON orders(user_id)
INCLUDE (total_amount, created_at);
-- Full-Text Search (PostgreSQL)
CREATE INDEX idx_posts_content_fts
ON posts USING GIN(to_tsvector('english', content));
-- JSON Index (PostgreSQL)
CREATE INDEX idx_user_metadata ON users USING GIN(metadata);
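You can watch the planner pick an index with EXPLAIN QUERY PLAN; SQLite is used here for a self-contained demo, but PostgreSQL's EXPLAIN tells the same story:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, active INTEGER)")
conn.execute("CREATE INDEX idx_users_email ON users(email)")

def plan(sql: str) -> str:
    # The last column of EXPLAIN QUERY PLAN output describes the access path
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

indexed = plan("SELECT * FROM users WHERE email = 'a@b.com'")
# e.g. "SEARCH users USING INDEX idx_users_email (email=?)"
unindexed = plan("SELECT * FROM users WHERE active = 1")
# e.g. "SCAN users" - a full table scan, since active has no index
```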
Query Optimization
# ❌ BAD: N+1 Query Problem
users = db.query("SELECT * FROM users")
for user in users:
posts = db.query("SELECT * FROM posts WHERE user_id = ?", user.id)
# 1 + N queries!
# ✅ GOOD: Single Query with JOIN
results = db.query("""
SELECT u.*, p.id as post_id, p.title, p.content
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
""")
# ✅ GOOD: Batch Loading
user_ids = [u.id for u in users]
posts = db.query(
"SELECT * FROM posts WHERE user_id IN (?)",
user_ids
)
posts_by_user = group_by(posts, 'user_id')
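The group_by helper used above isn't a standard function; a minimal version with defaultdict:

```python
from collections import defaultdict

def group_by(rows: list, key: str) -> dict:
    """Bucket rows by a column so each user's posts are one dict lookup away."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped

posts = [
    {"id": 1, "user_id": 10, "title": "first"},
    {"id": 2, "user_id": 20, "title": "second"},
    {"id": 3, "user_id": 10, "title": "third"},
]
posts_by_user = group_by(posts, "user_id")
```

With this in place, batch loading turns N+1 queries into exactly two: one for the users, one IN-query for all their posts.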
Database Transactions
from contextlib import asynccontextmanager
@asynccontextmanager
async def transaction():
"""Transaction context manager"""
conn = await db_pool.acquire()
try:
async with conn.transaction():
yield conn
# Automatic COMMIT
except Exception:
# Automatic ROLLBACK
raise
finally:
await db_pool.release(conn)
# Usage
async def transfer_money(from_user: int, to_user: int, amount: float):
"""Transfer money between users (atomic operation)"""
async with transaction() as conn:
        # Deduct from sender; asyncpg's execute returns a status string
        result = await conn.execute("""
            UPDATE accounts
            SET balance = balance - $1
            WHERE user_id = $2 AND balance >= $1
        """, amount, from_user)
        # "UPDATE 0" means no row matched, i.e. the balance check failed
        if result == "UPDATE 0":
            raise ValueError("Insufficient funds")
# Add to receiver
await conn.execute("""
UPDATE accounts
SET balance = balance + $1
WHERE user_id = $2
""", amount, to_user)
# Both succeed or both fail (ACID)
Database Sharding
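Sharding splits one logical table across several database servers by a shard key, so each server stores and serves only a fraction of the rows. The core of any sharding setup is a stable routing function; a hash-based sketch (the shard names are hypothetical):

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id: int) -> str:
    """Route a shard key to a shard. Uses md5 rather than Python's hash(),
    which is randomized per process and would route inconsistently."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

routes = {uid: shard_for(uid) for uid in range(8)}
```

The trade-off: queries by shard key hit one server, but cross-shard joins and transactions become application-level problems, and changing the shard count moves data around; consistent hashing mitigates that last issue.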
SQL vs NoSQL
| Feature | SQL (PostgreSQL) | NoSQL (MongoDB) |
|---|---|---|
| Schema | Fixed, Strict | Flexible |
| Scaling | Vertical (Harder) | Horizontal (Easier) |
| Transactions | ACID | Eventual consistency by default (multi-document ACID since 4.0) |
| Joins | Powerful | Limited |
| Use Case | Complex queries, Relations | Large scale, Flexible data |
| When to Use | Financial, Traditional apps | Real-time, Big Data, IoT |
3. Authentication & Authorization {#authentication-authorization}
Understanding Auth: The Foundation of Security
Authentication answers: "Who are you?" Authorization answers: "What are you allowed to do?"
These are often confused but are fundamentally different concepts. Getting auth right is crucial: mistakes lead to security breaches, data leaks, and compliance issues.
Common Mistakes Developers Make:
- Storing passwords in plain text - NEVER do this!
- Rolling your own crypto - Use established libraries
- Confusing authentication with authorization - They're different!
- Trusting client-side validation - Always validate server-side
- Not implementing rate limiting - Opens door to brute force attacks
Authentication Methods Compared
| Method | Use Case | Pros | Cons |
|---|---|---|---|
| Session Cookies | Traditional web apps | Secure, server controls sessions | Not stateless, scaling issues |
| JWT | Modern APIs, mobile apps | Stateless, scalable | Can't revoke easily, token size |
| OAuth 2.0 | Third-party login | User convenience, no password storage | Complex implementation |
| API Keys | Service-to-service | Simple | Less secure, hard to rotate |
JWT (JSON Web Token) Deep Dive
What is JWT? JWT is a compact, URL-safe token that contains claims (information) about a user. It's digitally signed, so you can verify it wasn't tampered with.
JWT Structure:
header.payload.signature
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.
eyJzdWIiOiIxMjM0IiwiZXhwIjoxNjE2MjM5MDIyfQ.
SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c
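The header and payload segments above are just base64url-encoded JSON; anyone can decode them without the secret (which is exactly why a JWT must never carry sensitive data, and why decoding is not the same as verifying):

```python
import base64
import json

token = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJzdWIiOiIxMjM0IiwiZXhwIjoxNjE2MjM5MDIyfQ."
    "SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
)

def decode_segment(segment: str) -> dict:
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped '=' padding
    return json.loads(base64.urlsafe_b64decode(padded))

header_b64, payload_b64, signature_b64 = token.split(".")
header = decode_segment(header_b64)    # {'alg': 'HS256', 'typ': 'JWT'}
payload = decode_segment(payload_b64)  # {'sub': '1234', 'exp': 1616239022}
# Only recomputing HMAC-SHA256(header_b64 + "." + payload_b64, secret) and
# comparing it to the signature proves the token wasn't tampered with.
```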
How JWT Works:
- User logs in with credentials
- Server verifies credentials
- Server creates JWT with user info (payload)
- Server signs JWT with secret key
- JWT sent to client
- Client includes JWT in subsequent requests
- Server verifies signature and extracts user info
JWT vs Sessions:
JWT Advantages:
- Stateless: No server-side session storage needed
- Scalable: Works across multiple servers
- Mobile-friendly: Easy to use in mobile apps
- Microservices: Each service can verify tokens independently
JWT Disadvantages:
- Can't revoke: Once issued, valid until expiration
- Size: Larger than session IDs (sent with every request)
- Secret management: Must keep signing key secure across all servers
When to Use JWT:
- Building APIs consumed by mobile apps
- Microservices architecture
- Need to scale horizontally
- Want stateless authentication
When to Use Sessions:
- Traditional server-rendered web apps
- Need to revoke access immediately
- Single server or few servers
- Smaller request payloads preferred
JWT Implementation
from datetime import datetime, timedelta
import jwt  # PyJWT
from fastapi import HTTPException
from passlib.context import CryptContext
pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
SECRET_KEY = "your-secret-key-keep-it-secret"  # load from env/secret manager in production
ALGORITHM = "HS256"
class AuthService:
@staticmethod
def hash_password(password: str) -> str:
"""Hash password using bcrypt"""
return pwd_context.hash(password)
@staticmethod
def verify_password(plain_password: str, hashed_password: str) -> bool:
"""Verify password against hash"""
return pwd_context.verify(plain_password, hashed_password)
@staticmethod
def create_access_token(
user_id: int,
expires_delta: timedelta = timedelta(hours=1)
) -> str:
"""Create JWT access token"""
expire = datetime.utcnow() + expires_delta
payload = {
"sub": str(user_id), # Subject (user ID)
"exp": expire, # Expiration
"iat": datetime.utcnow(), # Issued at
"type": "access"
}
return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)
@staticmethod
def create_refresh_token(user_id: int) -> str:
"""Create long-lived refresh token"""
expire = datetime.utcnow() + timedelta(days=30)
payload = {
"sub": str(user_id),
"exp": expire,
"iat": datetime.utcnow(),
"type": "refresh"
}
return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)
@staticmethod
def decode_token(token: str) -> dict:
"""Decode and verify JWT token"""
try:
payload = jwt.decode(
token,
SECRET_KEY,
algorithms=[ALGORITHM]
)
return payload
except jwt.ExpiredSignatureError:
raise HTTPException(
status_code=401,
detail="Token has expired"
)
except jwt.InvalidTokenError:
raise HTTPException(
status_code=401,
detail="Invalid token"
)
# Usage
@app.post("/auth/login")
async def login(credentials: LoginRequest):
# Verify credentials
user = await db.get_user_by_email(credentials.email)
if not user or not AuthService.verify_password(
credentials.password,
user.hashed_password
):
raise HTTPException(status_code=401, detail="Invalid credentials")
# Generate tokens
access_token = AuthService.create_access_token(user.id)
refresh_token = AuthService.create_refresh_token(user.id)
return {
"access_token": access_token,
"refresh_token": refresh_token,
"token_type": "bearer"
}
# Protected endpoint
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
security = HTTPBearer()
async def get_current_user(
credentials: HTTPAuthorizationCredentials = Depends(security)
):
"""Dependency to get current authenticated user"""
token = credentials.credentials
payload = AuthService.decode_token(token)
user_id = int(payload["sub"])
user = await db.get_user_by_id(user_id)
if not user:
raise HTTPException(status_code=401, detail="User not found")
return user
@app.get("/api/me")
async def get_me(current_user: User = Depends(get_current_user)):
"""Get current user info"""
return current_user
Role-Based Access Control (RBAC)
from enum import Enum
from functools import wraps
class Role(str, Enum):
ADMIN = "admin"
MODERATOR = "moderator"
USER = "user"
class Permission(str, Enum):
CREATE_USER = "create:user"
READ_USER = "read:user"
UPDATE_USER = "update:user"
DELETE_USER = "delete:user"
MANAGE_ROLES = "manage:roles"
# Role-Permission mapping
ROLE_PERMISSIONS = {
Role.ADMIN: [
Permission.CREATE_USER,
Permission.READ_USER,
Permission.UPDATE_USER,
Permission.DELETE_USER,
Permission.MANAGE_ROLES
],
Role.MODERATOR: [
Permission.READ_USER,
Permission.UPDATE_USER
],
Role.USER: [
Permission.READ_USER
]
}
def require_permission(permission: Permission):
"""Decorator to check user permission"""
def decorator(func):
@wraps(func)
async def wrapper(*args, current_user: User, **kwargs):
user_permissions = ROLE_PERMISSIONS.get(current_user.role, [])
if permission not in user_permissions:
raise HTTPException(
status_code=403,
detail=f"Permission denied: {permission}"
)
return await func(*args, current_user=current_user, **kwargs)
return wrapper
return decorator
# Usage
@app.delete("/api/users/{user_id}")
@require_permission(Permission.DELETE_USER)
async def delete_user(
user_id: int,
current_user: User = Depends(get_current_user)
):
"""Delete user (admin only)"""
await db.delete_user(user_id)
return {"status": "deleted"}
OAuth 2.0 Flow
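The authorization-code flow, in outline: (1) the app redirects the user to the provider's authorize endpoint; (2) the user consents and the provider redirects back with a short-lived authorization code; (3) the backend exchanges the code plus its client secret for tokens; (4) the backend calls APIs with the access token. A sketch of step 1; the endpoint URL and parameter values are illustrative:

```python
import secrets
from urllib.parse import urlencode

def build_authorize_url(client_id: str, redirect_uri: str, scope: str):
    """Construct the redirect that starts the authorization-code flow."""
    state = secrets.token_urlsafe(16)  # random; echoed back to block CSRF
    params = {
        "response_type": "code",       # ask for an authorization code
        "client_id": client_id,
        "redirect_uri": redirect_uri,  # must exactly match the registered URI
        "scope": scope,
        "state": state,
    }
    return "https://auth.example.com/oauth/authorize?" + urlencode(params), state

url, state = build_authorize_url("my-app", "https://myapp.example/callback", "read:user")
```

On the callback, the backend must check that the returned state matches before exchanging the code; public clients (SPAs, mobile apps) should additionally use PKCE.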
4. Caching Strategies {#caching-strategies}
Understanding Caching: The Performance Multiplier
Caching is storing frequently accessed data in a fast-access location to avoid expensive operations (database queries, API calls, computations). It's one of the most effective ways to improve performance.
The Caching Golden Rule:
"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton
Why Caching Matters:
Without Caching:
User Request → API Server → Database Query (100ms) → Response
100 requests = 10 seconds of DB time
With Caching:
User Request → API Server → Cache Hit (2ms) → Response
100 requests = 0.2 seconds
50x improvement!
Real-World Impact:
- Amazon: 100ms delay = 1% revenue loss
- Google: 500ms delay = 20% traffic drop
- Walmart: 1 second delay = 2% conversion loss
Cache Levels Explained
Modern applications use multiple cache layers, each serving different purposes:
1. CDN Cache (Edge)
- Speed: ~10-50ms
- Location: Geographically distributed
- Purpose: Static assets (images, CSS, JS)
- Example: CloudFront, Cloudflare
2. Application Cache (In-Memory)
- Speed: ~1-5ms
- Location: Same server as application
- Purpose: Hot data, session data
- Example: In-process dictionary, LRU cache
3. Distributed Cache (Redis/Memcached)
- Speed: ~5-20ms
- Location: Separate server(s)
- Purpose: Shared data across app servers
- Example: Redis, Memcached
4. Database Cache
- Speed: ~50-100ms
- Location: Database server
- Purpose: Query results, table/page data
- Example: PostgreSQL shared_buffers, MySQL InnoDB buffer pool
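Layer 2 above can be as simple as functools.lru_cache for per-process hot data (a sketch; the lookup function is a stand-in for an expensive query):

```python
from functools import lru_cache

db_calls = 0  # counts how often we fall through to the "database"

@lru_cache(maxsize=1024)
def get_user_profile(user_id: int) -> tuple:
    global db_calls
    db_calls += 1                      # pretend this is an expensive query
    return (user_id, f"user-{user_id}")

get_user_profile(1)   # miss - hits the "database"
get_user_profile(1)   # hit - served from process memory
get_user_profile(2)   # miss
stats = get_user_profile.cache_info()  # CacheInfo(hits=1, misses=2, ...)
```

The catch: each app process has its own copy, so invalidation across servers needs layer 3 (Redis) or short TTLs; lru_cache also has no TTL, so it only suits data you can serve slightly stale or evict explicitly with cache_clear().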
Cache Patterns: When to Use What
1. Cache-Aside (Lazy Loading)
- Best for: Read-heavy workloads
- How it works: Check cache → Miss → Query DB → Store in cache
- Pros: Only cache what's actually used
- Cons: Cache miss penalty, potential cache stampede
When to use:
- User profiles
- Product catalogs
- Blog posts
- Any read-heavy data
2. Write-Through
- Best for: Read-heavy with occasional writes
- How it works: Write to DB → Immediately write to cache
- Pros: Cache always consistent with DB
- Cons: Write latency (two operations)
When to use:
- Data that's read often after being written
- Consistency is critical
- Write performance acceptable
3. Write-Behind (Write-Back)
- Best for: Write-heavy workloads
- How it works: Write to cache → Async write to DB later
- Pros: Very fast writes
- Cons: Data loss risk if cache crashes
When to use:
- Logging systems
- Analytics counters
- Session data
- Gaming leaderboards
4. Read-Through
- Best for: Encapsulating cache logic
- How it works: Cache library handles DB queries automatically
- Pros: Application doesn't handle cache misses
- Cons: Tighter coupling between cache and data source
When to use:
- Simplifying application code
- Standard data access patterns
- When cache library provides this feature
Cache Invalidation: The Hard Problem
Cache invalidation is deciding when cached data is no longer valid and needs refreshing. Getting this wrong leads to stale data or cache thrashing.
Invalidation Strategies:
1. Time-based (TTL - Time To Live)
Best for: Data that changes predictably
Example: "Weather forecast valid for 1 hour"
Pros: Simple, predictable
Cons: Might serve stale data, or expire too early
2. Event-based
Best for: Data you control
Example: "When user updates profile, invalidate cache"
Pros: Always fresh, efficient
Cons: Complex to implement, must track all events
3. Hybrid (TTL + Events)
Best for: Most production systems
Example: "Cache for 5 minutes, but invalidate on updates"
Pros: Best of both worlds
Cons: More complexity
Cache Stampede (Thundering Herd): When cache expires, multiple requests hit database simultaneously, causing overload.
11:00:00 - Cache expires
11:00:01 - 1000 requests arrive
11:00:01 - All 1000 hit database (STAMPEDE!)
11:00:05 - Database crashes
Solution: Cache Locking First request acquires lock, fetches data. Others wait for first to complete.
Caching Patterns
# 1. Cache-Aside (Lazy Loading)
async def get_user_cache_aside(user_id: int):
# Try cache first
cached = await redis.get(f"user:{user_id}")
if cached:
return json.loads(cached)
# Cache miss - query database
user = await db.get_user(user_id)
# Store in cache
await redis.setex(
f"user:{user_id}",
3600, # TTL: 1 hour
json.dumps(user)
)
return user
# 2. Write-Through
async def update_user_write_through(user_id: int, data: dict):
# Update database
user = await db.update_user(user_id, data)
# Update cache immediately
await redis.setex(
f"user:{user_id}",
3600,
json.dumps(user)
)
return user
# 3. Write-Behind (Write-Back)
from asyncio import Queue
write_queue = Queue()
async def update_user_write_behind(user_id: int, data: dict):
# Update cache immediately
await redis.setex(f"user:{user_id}", 3600, json.dumps(data))
# Queue database write
await write_queue.put(("update_user", user_id, data))
return data
async def process_write_queue():
"""Background worker to process queued writes"""
while True:
operation, user_id, data = await write_queue.get()
try:
if operation == "update_user":
await db.update_user(user_id, data)
except Exception as e:
logger.error(f"Write-behind error: {e}")
write_queue.task_done()
# 4. Read-Through
class CachingRepository:
async def get_user(self, user_id: int):
# Cache handles everything
return await cache.get_or_fetch(
f"user:{user_id}",
lambda: db.get_user(user_id),
ttl=3600
)
Cache Invalidation Strategies
# 1. Time-based (TTL)
await redis.setex("key", 300, value) # Expires in 5 minutes
# 2. Event-based
async def update_user(user_id: int, data: dict):
user = await db.update_user(user_id, data)
# Invalidate related caches
await redis.delete(f"user:{user_id}")
await redis.delete(f"user:{user_id}:posts")
await redis.delete(f"user:{user_id}:profile")
# 3. Pattern-based
async def delete_user_caches(user_id: int):
"""Delete all caches related to user"""
pattern = f"user:{user_id}:*"
cursor = 0
while True:
cursor, keys = await redis.scan(cursor, match=pattern)
if keys:
await redis.delete(*keys)
if cursor == 0:
break
# 4. Tag-based
async def set_with_tags(key: str, value: str, tags: List[str]):
"""Store value with tags for group invalidation"""
await redis.set(key, value)
for tag in tags:
await redis.sadd(f"tag:{tag}", key)
async def invalidate_by_tag(tag: str):
"""Invalidate all keys with given tag"""
keys = await redis.smembers(f"tag:{tag}")
if keys:
await redis.delete(*keys)
await redis.delete(f"tag:{tag}")
Cache Stampede Prevention
import asyncio
from typing import Optional
class AntiStampede:
"""Prevent cache stampede with locking"""
def __init__(self, redis):
self.redis = redis
self.local_locks = {}
async def get_or_compute(
self,
key: str,
compute_func: callable,
ttl: int = 300
):
# Try cache
cached = await self.redis.get(key)
if cached:
return json.loads(cached)
# Acquire lock
lock_key = f"lock:{key}"
lock_acquired = await self.redis.set(
lock_key,
"1",
ex=30, # Lock expires in 30s
nx=True # Only set if not exists
)
if lock_acquired:
try:
# We have the lock - compute value
value = await compute_func()
# Store in cache
await self.redis.setex(key, ttl, json.dumps(value))
return value
finally:
# Release lock
await self.redis.delete(lock_key)
else:
# Someone else is computing - wait and retry
await asyncio.sleep(0.1)
return await self.get_or_compute(key, compute_func, ttl)
# Usage
anti_stampede = AntiStampede(redis)
@app.get("/expensive-data")
async def get_expensive_data():
async def compute():
# Expensive operation
await asyncio.sleep(5)
return {"data": "result"}
return await anti_stampede.get_or_compute(
"expensive_data",
compute,
ttl=3600
)
5. Message Queues & Async Processing {#message-queues}
Understanding Asynchronous Processing
Synchronous operations block: you wait for them to complete before moving on. Asynchronous operations don't block: you start them and continue with other work.
Why Async Processing Matters:
Imagine ordering food at a restaurant:
Synchronous (Bad):
1. Take order from customer 1
2. Cook food for customer 1
3. Serve customer 1
4. Now take order from customer 2
❌ Customer 2 waits 20 minutes just to order!
Asynchronous (Good):
1. Take order from customer 1 → Give ticket #1
2. Take order from customer 2 → Give ticket #2
3. Kitchen cooks both simultaneously
4. Serve when ready
✅ Both customers happy!
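The restaurant picture maps directly onto asyncio: start all the work, then await the results together. A runnable sketch:

```python
import asyncio
import time

async def cook(order_id: int) -> str:
    await asyncio.sleep(0.1)  # simulated kitchen time (non-blocking)
    return f"order {order_id} ready"

async def main():
    start = time.perf_counter()
    # All three "orders" cook concurrently instead of one after another
    results = await asyncio.gather(cook(1), cook(2), cook(3))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
# ~0.1s total for three 0.1s tasks, not ~0.3s
```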
Message Queues Explained
A message queue is like a todo list that multiple workers can process. It decouples producers (who create tasks) from consumers (who execute tasks).
Key Concepts:
- Producer: Creates messages and adds them to the queue
- Consumer/Worker: Pulls messages from the queue and processes them
- Message: Unit of work (e.g., "send email", "process order")
- Queue: Storage for messages waiting to be processed
- Broker: System managing queues (RabbitMQ, Redis, Kafka)
Benefits of Message Queues:
1. Decoupling:
Without Queue:
API → Email Service (if it's down, API fails)
With Queue:
API → Queue → Email Service (API succeeds immediately)
2. Load Leveling:
Traffic spike: 1000 requests/second
API: Add all to queue immediately (fast)
Workers: Process 100/second steadily
3. Reliability:
- Messages persist even if consumer crashes
- Automatic retry on failure
- Dead letter queue for problematic messages
4. Scalability:
- Add more workers during high load
- Remove workers during low load
- Workers can be on different servers
When to Use Message Queues:
Use for:
- Sending emails/notifications
- Processing images/videos
- Generating reports
- Data synchronization
- Background jobs
- Any slow operation (>1 second)
Don't use for:
- Real-time user interactions
- Operations needing immediate response
- Simple, fast operations (<100ms)
Queue Patterns
1. Work Queue (Task Queue)
- One message → One worker
- Used for: Background jobs
- Example: Celery, RQ
2. Publish/Subscribe
- One message → Multiple subscribers
- Used for: Event broadcasting
- Example: User registered → Send email + Update analytics + Create profile
3. Request/Reply
- Send request → Wait for response
- Used for: RPC-style communication
- Example: Microservice communication
4. Priority Queue
- Process high-priority messages first
- Used for: VIP users, critical operations
- Example: Premium user orders before regular orders
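Pattern 4 can be sketched with the standard library's PriorityQueue, where a lower number means served first:

```python
import queue

q = queue.PriorityQueue()
# Entries are (priority, task); tuples compare element-by-element,
# so priority 0 items always come out before priority 5 items.
q.put((5, "send newsletter batch"))
q.put((0, "charge premium order"))
q.put((1, "resize avatar"))

served = []
while not q.empty():
    priority, task = q.get()
    served.append(task)
```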
Task Queue with Celery
from celery import Celery
from kombu import Queue
# Initialize Celery
celery_app = Celery(
'tasks',
broker='redis://localhost:6379/0',
backend='redis://localhost:6379/1'
)
# Configure queues
celery_app.conf.task_queues = (
Queue('high_priority', routing_key='high'),
Queue('default', routing_key='default'),
Queue('low_priority', routing_key='low'),
)
celery_app.conf.task_routes = {
'tasks.send_email': {'queue': 'low_priority'},
'tasks.process_payment': {'queue': 'high_priority'},
}
# Define tasks
@celery_app.task(bind=True, max_retries=3)
def send_email(self, to: str, subject: str, body: str):
"""Send email asynchronously"""
try:
# Send email logic
smtp.send(to, subject, body)
return {"status": "sent", "to": to}
except Exception as e:
# Retry with exponential backoff
raise self.retry(exc=e, countdown=2 ** self.request.retries)
@celery_app.task
def process_order(order_id: int):
"""Process order in background"""
order = db.get_order(order_id)
# Update inventory
inventory.reserve(order.items)
# Charge payment
payment.charge(order.total)
# Send confirmation
send_email.delay(
order.user.email,
"Order Confirmation",
f"Your order #{order_id} is confirmed"
)
@celery_app.task
def generate_report(user_id: int, report_type: str):
    """Generate report - long running task"""
    user = db.get_user(user_id)  # needed below for the notification email
    data = fetch_report_data(user_id, report_type)
    # Process data (may take minutes)
    report = process_report_data(data)
    # Save to S3
    s3_url = upload_to_s3(report)
    # Notify user
    send_email.delay(
        user.email,
        "Report Ready",
        f"Your report is ready: {s3_url}"
    )
# FastAPI integration
@app.post("/orders")
async def create_order(order: OrderCreate):
    # Create order in database
    order_id = await db.create_order(order)
    # Process asynchronously
    process_order.delay(order_id)
    return {
        "order_id": order_id,
        "status": "processing"
    }

@app.post("/reports")
async def request_report(report: ReportRequest, user: User = Depends(get_current_user)):
    # Queue report generation
    task = generate_report.delay(user.id, report.type)
    return {
        "task_id": task.id,
        "status": "queued",
        "message": "Report generation started"
    }

@app.get("/tasks/{task_id}")
async def get_task_status(task_id: str):
    """Check task status"""
    task = celery_app.AsyncResult(task_id)
    return {
        "task_id": task_id,
        "status": task.state,
        "result": task.result if task.ready() else None
    }
Event-Driven Architecture
from typing import Callable, Dict, List
import asyncio

class EventBus:
    """Simple in-memory event bus"""
    def __init__(self):
        self.subscribers: Dict[str, List[Callable]] = {}

    def subscribe(self, event_type: str, handler: Callable = None):
        """Subscribe to event (callable directly or usable as a decorator)"""
        def _register(fn: Callable):
            self.subscribers.setdefault(event_type, []).append(fn)
            return fn
        return _register(handler) if handler else _register

    async def publish(self, event_type: str, data: dict):
        """Publish event to all subscribers"""
        if event_type not in self.subscribers:
            return
        tasks = [handler(data) for handler in self.subscribers[event_type]]
        # Execute all handlers concurrently
        await asyncio.gather(*tasks, return_exceptions=True)
# Initialize event bus
event_bus = EventBus()
# Event handlers
async def send_welcome_email(data: dict):
    """Handler: Send welcome email"""
    # .delay() enqueues the Celery task and returns immediately - no await
    send_email.delay(
        data['email'],
        "Welcome!",
        f"Welcome to our platform, {data['name']}!"
    )

async def track_user_registration(data: dict):
    """Handler: Track analytics"""
    await analytics.track('user_registered', data)

async def create_user_profile(data: dict):
    """Handler: Create default profile"""
    await db.create_profile(data['user_id'])
# Subscribe handlers
event_bus.subscribe('user_registered', send_welcome_email)
event_bus.subscribe('user_registered', track_user_registration)
event_bus.subscribe('user_registered', create_user_profile)
# Publish events
@app.post("/auth/register")
async def register_user(user: UserCreate):
    # Create user
    new_user = await db.create_user(user)
    # Publish event
    await event_bus.publish('user_registered', {
        'user_id': new_user.id,
        'email': new_user.email,
        'name': new_user.name
    })
    return new_user
6. Microservices Architecture {#microservices}
The Monolith vs Microservices Debate
Monolith: Single application with all features in one codebase. Microservices: Multiple small applications, each handling one feature.
The Harsh Truth:
"You need to be this tall to use microservices" - Martin Fowler
Most startups should start with a monolith. Microservices add complexity that only makes sense at scale.
When Monolith Makes Sense
Use Monolith when:
- Team size: <10 developers
- Traffic: <1000 requests/second
- You're building MVP or new product
- Team lacks microservices experience
- Deployment simplicity is important
Monolith Advantages:
- Simple deployment: One application to deploy
- Easy debugging: Everything in one place
- No network latency: Direct function calls
- Easier testing: Test entire app together
- Lower infrastructure cost: One server, not many
Monolith Done Right:
Modular monolith:
- Organize code into modules/packages
- Clear boundaries between domains
- Can extract to microservices later
When Microservices Make Sense
Use Microservices when:
- Team size: >20 developers
- Different parts scale differently
- Need to deploy features independently
- Different parts use different technologies
- Have DevOps expertise
Microservices Advantages:
- Independent deployment: Deploy one service without affecting others
- Technology diversity: Use best tool for each job
- Isolated failures: One service down ≠ entire system down
- Team autonomy: Teams own their services
- Selective scaling: Scale only what needs it
Microservices Disadvantages:
- Complexity: Network calls, distributed tracing, service discovery
- Data consistency: No ACID transactions across services
- Testing difficulty: Integration tests are complex
- Operational overhead: More services = more monitoring, logging
- Initial investment: Requires infrastructure and tooling
The Migration Path
Don't Rewrite Everything!
Phase 1: Modular Monolith
├── user_service (module)
├── order_service (module)
├── payment_service (module)
└── notification_service (module)
Phase 2: Extract Bottlenecks
├── Monolith (most features)
└── Image Processing Service (extracted)
Phase 3: Gradual Extraction
├── Monolith (core features)
├── Image Processing Service
├── Notification Service (extracted)
└── Payment Service (extracted)
Phase 4: Full Microservices
├── User Service
├── Order Service
├── Payment Service
└── Notification Service
Service Communication: Sync vs Async
Synchronous (REST, gRPC):
Pros:
- Simple to understand
- Immediate response
- Easy debugging
Cons:
- Tight coupling
- Cascading failures
- Lower throughput
Best for:
- Real-time user requests
- Operations needing immediate response
- Simple request-response patterns
Asynchronous (Message Queue, Events):
Pros:
- Loose coupling
- Higher throughput
- Fault tolerance
Cons:
- Complex to debug
- Eventual consistency
- More infrastructure
Best for:
- Background processing
- Event broadcasting
- Long-running operations
The Hybrid Approach (Most Production Systems):
- Sync for user-facing APIs
- Async for internal service communication
- Message queues for background jobs
Microservices Patterns You Must Know
1. API Gateway
- Single entry point for all clients
- Handles routing, auth, rate limiting
- Aggregates data from multiple services
- Problem it solves: Clients don't have to manage multiple endpoints
2. Service Discovery
- Services register themselves (name + location)
- Clients discover services dynamically
- Handles service instances coming/going
- Problem it solves: Hardcoded service URLs don't work at scale
3. Circuit Breaker
- Prevents cascading failures
- Fast-fail when service is down
- Automatic recovery attempts
- Problem it solves: One slow service doesn't bring down entire system
4. Saga Pattern
- Manages distributed transactions
- Each service does local transaction
- Compensation for rollback
- Problem it solves: Can't use database transactions across services
Service Communication Patterns
1. Synchronous (REST/gRPC)
import httpx

class OrderService:
    async def create_order(self, user_id: int, items: List[dict]):
        async with httpx.AsyncClient() as client:
            # Call User Service
            user_response = await client.get(
                f"http://user-service/api/users/{user_id}"
            )
            user = user_response.json()
            # Call Inventory Service
            inventory_response = await client.post(
                "http://inventory-service/api/reserve",
                json={"items": items}
            )
        # Create order
        order = await db.create_order(user_id, items)
        return order
2. Asynchronous (Message Queue)
# Order Service
@app.post("/orders")
async def create_order(order: OrderCreate):
    order_id = await db.create_order(order)
    # Publish event
    await event_bus.publish('order_created', {
        'order_id': order_id,
        'user_id': order.user_id,
        'items': order.items
    })
    return {"order_id": order_id}

# Inventory Service (separate service)
@event_bus.subscribe('order_created')
async def reserve_inventory(data: dict):
    await inventory.reserve(data['items'])

# Notification Service (separate service)
@event_bus.subscribe('order_created')
async def send_order_notification(data: dict):
    user = await user_service.get_user(data['user_id'])
    await send_email(user.email, f"Order #{data['order_id']} created")
Service Discovery
import random

import consul

class ServiceRegistry:
    """Service discovery with Consul"""
    def __init__(self):
        self.consul = consul.Consul(host='localhost', port=8500)

    async def register_service(
        self,
        service_name: str,
        service_id: str,
        host: str,
        port: int
    ):
        """Register service with Consul"""
        self.consul.agent.service.register(
            name=service_name,
            service_id=service_id,
            address=host,
            port=port,
            check=consul.Check.http(
                f"http://{host}:{port}/health",
                interval="10s"
            )
        )

    async def discover_service(self, service_name: str) -> str:
        """Discover a healthy service endpoint"""
        index, services = self.consul.health.service(
            service_name,
            passing=True
        )
        if not services:
            raise Exception(f"Service {service_name} not found")
        # Simple load balancing - random selection
        service = random.choice(services)
        host = service['Service']['Address']
        port = service['Service']['Port']
        return f"http://{host}:{port}"
# Usage
registry = ServiceRegistry()

# On service startup
await registry.register_service(
    "order-service",
    "order-service-1",
    "10.0.1.5",
    8000
)

# When calling another service
user_service_url = await registry.discover_service("user-service")
async with httpx.AsyncClient() as client:
    response = await client.get(f"{user_service_url}/api/users/123")
API Gateway Pattern
from fastapi import FastAPI, HTTPException
import asyncio
import httpx

app = FastAPI()

# Service URLs (from service discovery)
SERVICES = {
    "user": "http://user-service:8001",
    "order": "http://order-service:8002",
    "payment": "http://payment-service:8003"
}

class APIGateway:
    """API Gateway for routing and aggregation"""
    @staticmethod
    async def forward_request(
        service: str,
        path: str,
        method: str = "GET",
        **kwargs
    ):
        """Forward request to microservice"""
        base_url = SERVICES.get(service)
        if not base_url:
            raise HTTPException(404, f"Service {service} not found")
        url = f"{base_url}{path}"
        async with httpx.AsyncClient(timeout=10.0) as client:
            response = await client.request(method, url, **kwargs)
        return response.json()
    @staticmethod
    async def aggregate_data(user_id: int):
        """Aggregate data from multiple services"""
        # Parallel requests
        user_task = APIGateway.forward_request(
            "user", f"/api/users/{user_id}"
        )
        orders_task = APIGateway.forward_request(
            "order", f"/api/orders?user_id={user_id}"
        )
        user, orders = await asyncio.gather(
            user_task,
            orders_task,
            return_exceptions=True
        )
        return {
            "user": user if not isinstance(user, Exception) else None,
            "orders": orders if not isinstance(orders, Exception) else []
        }
# Gateway endpoints
@app.get("/api/users/{user_id}")
async def get_user(user_id: int):
    """Route to user service"""
    return await APIGateway.forward_request(
        "user",
        f"/api/users/{user_id}"
    )

@app.get("/api/users/{user_id}/dashboard")
async def get_user_dashboard(user_id: int):
    """Aggregate data from multiple services"""
    return await APIGateway.aggregate_data(user_id)
Circuit Breaker for Microservices
from enum import Enum
from datetime import datetime

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Circuit breaker for service calls"""
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        success_threshold: int = 2
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

    async def call(self, service_name: str, func, *args, **kwargs):
        """Execute function with circuit breaker"""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                # Return cached or fallback response
                raise ServiceUnavailableError(
                    f"{service_name} circuit breaker is OPEN"
                )
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception:
            await self._on_failure()
            raise

    def _should_attempt_reset(self) -> bool:
        if not self.last_failure_time:
            return True
        elapsed = (datetime.now() - self.last_failure_time).total_seconds()
        return elapsed >= self.recovery_timeout

    async def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        if self.state == CircuitState.CLOSED:
            self.failure_count = 0

    async def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            logger.warning(
                f"Circuit breaker opened after {self.failure_count} failures"
            )
# Usage
user_service_breaker = CircuitBreaker()

async def call_user_service(user_id: int):
    async def make_request():
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://user-service/api/users/{user_id}"
            )
            return response.json()
    return await user_service_breaker.call(
        "user-service",
        make_request
    )
7. System Design Patterns {#system-design-patterns}
Understanding System Design Patterns
System design patterns are proven solutions to common architectural problems. They're the "recipes" of software engineering: you don't invent them, you apply them.
Why Patterns Matter:
- Speed: Don't reinvent the wheel
- Communication: Shared vocabulary with team ("Let's use circuit breaker")
- Reliability: Battle-tested solutions
- Maintainability: Well-understood patterns are easier to maintain
When to Use Patterns:
- You recognize the problem they solve
- You understand their trade-offs
- Your scale justifies the complexity
When NOT to Use Patterns:
- Because they're trendy
- Because Netflix uses them (you're not Netflix)
- Without understanding the problem first
The Saga Pattern: Distributed Transactions
The Problem: In microservices, you can't use database transactions across services. How do you ensure consistency?
Example Scenario:
Order Process:
1. Reserve payment ($100)
2. Reserve inventory (1 widget)
3. Schedule shipping
What if step 3 fails?
Need to undo steps 1 and 2 (compensate)
Traditional Database Transaction (Monolith):
BEGIN TRANSACTION;
INSERT INTO orders ...;
UPDATE inventory SET quantity = quantity - 1 ...;
UPDATE payments SET status = 'charged' ...;
COMMIT; -- All succeed or all fail
Saga Pattern (Microservices):
No distributed transaction available!
Solution: Compensating transactions
Success Flow:
1. Payment Service: Reserve $100 ✓
2. Inventory Service: Reserve 1 widget ✓
3. Shipping Service: Schedule ✓
→ Order complete
Failure Flow:
1. Payment Service: Reserve $100 ✓
2. Inventory Service: Reserve 1 widget ✓
3. Shipping Service: Schedule ✗ FAIL
4. Inventory Service: COMPENSATE - Release widget
5. Payment Service: COMPENSATE - Refund $100
→ Order cancelled, no inconsistency
Two Saga Approaches:
Choreography (Event-Based):
- Services publish events
- Other services react to events
- Decentralized coordination
Orchestration (Coordinator):
- Central coordinator manages flow
- Explicitly calls each service
- Centralized coordination
When to Use Saga:
- Microservices architecture
- Need consistency across services
- Can't use distributed transactions
- Business process spans multiple services
Trade-offs:
- Pro: Eventual consistency across services
- Con: Complex to implement and debug
- Con: Compensations must be idempotent
from dataclasses import dataclass
from typing import Callable, List
from enum import Enum

class SagaStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensate: Callable

class SagaOrchestrator:
    """Orchestrator for distributed transactions"""
    def __init__(self):
        self.steps: List[SagaStep] = []
        self.completed_steps: List[str] = []

    def add_step(
        self,
        name: str,
        action: Callable,
        compensate: Callable
    ):
        """Add step to saga"""
        self.steps.append(SagaStep(name, action, compensate))

    async def execute(self, context: dict):
        """Execute saga"""
        try:
            # Execute all steps
            for step in self.steps:
                logger.info(f"Executing step: {step.name}")
                result = await step.action(context)
                context[step.name] = result
                self.completed_steps.append(step.name)
            logger.info("Saga completed successfully")
            return context
        except Exception as e:
            logger.error(f"Saga failed: {e}")
            await self.compensate(context)
            raise

    async def compensate(self, context: dict):
        """Compensate completed steps in reverse order"""
        logger.info("Starting compensation")
        for step_name in reversed(self.completed_steps):
            step = next(s for s in self.steps if s.name == step_name)
            try:
                logger.info(f"Compensating: {step.name}")
                await step.compensate(context)
            except Exception as e:
                logger.error(f"Compensation failed for {step.name}: {e}")
                # Continue compensating other steps
# Example: Order Saga
async def create_order_saga(order_data: dict):
    saga = SagaOrchestrator()
    # Step 1: Reserve Payment
    saga.add_step(
        "reserve_payment",
        action=lambda ctx: payment_service.reserve(
            ctx['order_data']['user_id'],
            ctx['order_data']['total']
        ),
        compensate=lambda ctx: payment_service.release(
            ctx['reserve_payment']['payment_id']
        )
    )
    # Step 2: Reserve Inventory
    saga.add_step(
        "reserve_inventory",
        action=lambda ctx: inventory_service.reserve(
            ctx['order_data']['items']
        ),
        compensate=lambda ctx: inventory_service.release(
            ctx['reserve_inventory']['reservation_id']
        )
    )
    # Step 3: Create Order
    saga.add_step(
        "create_order",
        action=lambda ctx: order_service.create(ctx['order_data']),
        compensate=lambda ctx: order_service.cancel(
            ctx['create_order']['order_id']
        )
    )
    # Step 4: Schedule Shipping
    saga.add_step(
        "schedule_shipping",
        action=lambda ctx: shipping_service.schedule(
            ctx['create_order']['order_id']
        ),
        compensate=lambda ctx: shipping_service.cancel(
            ctx['schedule_shipping']['shipping_id']
        )
    )
    # Execute saga
    context = {"order_data": order_data}
    return await saga.execute(context)
CQRS (Command Query Responsibility Segregation)
The Big Idea: Separate reads (queries) and writes (commands) into different models. They don't have to use the same database structure or even the same database!
Why CQRS Exists:
Traditional Approach (Single Model):
Database Table: Users
- Optimized for both reads and writes
- Compromise: Not optimal for either
- Complex queries hurt write performance
- Write traffic impacts read performance
CQRS Approach (Separate Models):
Write Model (Commands):
- Simple, normalized structure
- Optimized for data integrity
- Fast writes
Read Model (Queries):
- Denormalized, flat structure
- Optimized for specific queries
- Fast reads
- Can use different database (e.g., Elasticsearch)
Real-World Example: E-commerce Product Page
Without CQRS:
-- One complex query to get everything
SELECT
p.*,
c.name as category,
AVG(r.rating) as avg_rating,
COUNT(r.id) as review_count,
i.quantity as stock
FROM products p
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN reviews r ON p.id = r.product_id
LEFT JOIN inventory i ON p.id = i.product_id
WHERE p.id = 123
GROUP BY p.id;
-- Slow (200ms), impacts write performance
With CQRS:
# Write Model (Creating product)
def create_product(data):
    product = db.insert('products', data)
    publish_event('product_created', product)
    return product.id

# Read Model (Materialized view)
redis.set('product:123', {
    'id': 123,
    'name': 'Laptop',
    'category': 'Electronics',
    'avg_rating': 4.5,
    'review_count': 234,
    'stock': 15
})

# Query (instant!)
def get_product(product_id):
    return redis.get(f'product:{product_id}')  # 2ms!
When to Use CQRS:
- Read/Write patterns differ significantly
- Read-heavy with complex queries (reporting, dashboards)
- Need different consistency models (strong writes, eventual read)
- Want to optimize each independently
When NOT to Use CQRS:
- Simple CRUD applications
- Read/write patterns are similar
- Team lacks experience (adds complexity)
- Can't tolerate eventual consistency
CQRS Benefits:
- Performance: Optimize reads and writes independently
- Scalability: Scale read and write databases separately
- Flexibility: Use best database for each (PostgreSQL for writes, Elasticsearch for reads)
- Simplicity: Each model is simpler (no compromise)
CQRS Challenges:
- Eventual consistency: Read model lags behind writes
- Complexity: Two models to maintain
- Synchronization: Must keep read model updated
- Learning curve: Team needs to understand pattern
Rate Limiting: Protecting Your System
Rate limiting prevents abuse by limiting how many requests a client can make. It's essential for stability, security, and fair usage.
Why Rate Limiting Matters:
Without Rate Limiting:
Malicious actor:
- Sends 100,000 requests/second
- Your servers crash
- Legitimate users can't access system
- You pay massive cloud bills
With Rate Limiting:
After 1,000 requests/hour:
- Block additional requests
- Return 429 (Too Many Requests)
- System stays stable
- Fair usage for everyone
Rate Limiting Algorithms Explained:
1. Token Bucket (Most Common)
How it works:
- Bucket holds tokens (e.g., 100 tokens)
- Each request consumes 1 token
- Bucket refills at constant rate (e.g., 10 tokens/second)
- If bucket empty, reject request
Characteristics:
- Allows bursts: Can use all tokens at once
- Smooth over time: Refills constantly
- Flexible: Different rates for different operations
Use Case: API rate limiting (100 requests/minute with burst of 20)
2. Leaky Bucket
How it works:
- Requests enter a queue (bucket)
- Requests leave at constant rate (leak)
- If bucket full, reject request
Characteristics:
- Smooth output: Enforces constant rate
- No bursts: Can't exceed rate
- Queue-based: FIFO processing
Use Case: Network traffic shaping, preventing spikes
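The drain-at-a-constant-rate behavior can be sketched in a few lines. This is a minimal in-process version; `capacity` and `leak_rate` are illustrative parameters, and a production limiter would keep this state in Redis rather than local memory:

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up and drain at a fixed rate."""
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity      # max queued requests
        self.leak_rate = leak_rate    # requests drained per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(time.monotonic())
            return True
        return False                  # bucket full - reject

    def _leak(self):
        """Drain requests that have 'leaked out' since the last check."""
        now = time.monotonic()
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained > 0:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now
```

With `capacity=2`, the third back-to-back request is rejected until the bucket has had time to drain.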
3. Fixed Window
How it works:
- Count requests in fixed time window (e.g., per minute)
- Reset counter at window boundary
- If count exceeds limit, reject
Characteristics:
- Simple: Easy to implement
- Boundary issue: 2x rate at window edge
- Not accurate: Burst at window transitions
Example Problem:
Limit: 100 requests/minute
11:00:30 - 100 requests (allowed)
11:01:00 - Window resets
11:01:01 - 100 requests (allowed)
Result: 200 requests in 31 seconds!
Use Case: Simple rate limiting, low-accuracy needs
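The simplicity shows in code: a fixed window needs only a counter and a window start time. This in-memory sketch still has the boundary problem described above; the parameters are illustrative:

```python
import time

class FixedWindowCounter:
    """Fixed window: count requests per window, reset at the boundary."""
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            # New window: reset the counter (this reset is what allows
            # 2x the rate across a window boundary)
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False
```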
4. Sliding Window (Most Accurate)
How it works:
- Track requests in rolling time window
- Count requests in last N seconds
- Remove old requests as window slides
Characteristics:
- Accurate: No boundary issues
- Fair: True rolling average
- Memory intensive: Must store timestamps
Use Case: Premium APIs, precise rate limiting
Comparing Algorithms:
| Algorithm | Accuracy | Complexity | Bursts | Memory |
|---|---|---|---|---|
| Token Bucket | Good | Low | Yes | Low |
| Leaky Bucket | Good | Medium | No | Medium |
| Fixed Window | Poor | Very Low | Yes | Very Low |
| Sliding Window | Excellent | High | Controlled | High |
Production Recommendation:
- API Gateway: Sliding Window (accuracy matters)
- Internal Services: Token Bucket (flexibility + performance)
- Simple Apps: Fixed Window (ease of implementation)
Rate Limiting Strategies:
Per User:
User A: 1000 requests/hour
User B: 1000 requests/hour
Per IP:
IP 1.2.3.4: 1000 requests/hour
(Useful for anonymous APIs)
Per API Key:
Key abc123: 10,000 requests/hour (paid)
Key xyz789: 1,000 requests/hour (free)
Global:
All users combined: 100,000 requests/hour
(Protects system capacity)
Per Endpoint:
POST /orders: 100 requests/hour (expensive)
GET /products: 1000 requests/hour (cheap)
Response Headers (Be Transparent):
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 842
X-RateLimit-Reset: 1617123456
Retry-After: 60
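A small helper can build these headers from whatever limiter state you track. The header names follow the `X-RateLimit-*` convention shown above; computing the reset time as "now plus one window" is a simplifying assumption, since a real limiter would report the actual window end:

```python
import time

def rate_limit_headers(limit: int, used: int, window_seconds: int) -> dict:
    """Build rate-limit response headers from limiter state."""
    remaining = max(0, limit - used)
    reset_at = int(time.time()) + window_seconds  # simplification: next window
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_at),
    }
    if remaining == 0:
        # Only send Retry-After alongside a 429 response
        headers["Retry-After"] = str(window_seconds)
    return headers
```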
CQRS Implementation
# Command Side (Writes)
from pydantic import BaseModel

class CreateUserCommand(BaseModel):
    email: str
    name: str

class CommandHandler:
    async def handle_create_user(self, command: CreateUserCommand):
        # Write to primary database
        user = await write_db.create_user(
            email=command.email,
            name=command.name
        )
        # Publish event
        await event_bus.publish('user_created', {
            'user_id': user.id,
            'email': user.email,
            'name': user.name
        })
        return user.id

# Query Side (Reads)
class UserQuery:
    async def get_user_profile(self, user_id: int):
        # Read from optimized read database
        return await read_db.get_user_profile(user_id)

    async def search_users(self, query: str):
        # Read from search-optimized database
        return await elasticsearch.search_users(query)

# Event Handler (Sync read database)
@event_bus.subscribe('user_created')
async def sync_user_to_read_db(data: dict):
    """Synchronize write DB to read DB"""
    await read_db.upsert_user(data)
    await elasticsearch.index_user(data)

# API Endpoints
@app.post("/users")  # Command
async def create_user(user: CreateUserCommand):
    handler = CommandHandler()
    user_id = await handler.handle_create_user(user)
    return {"user_id": user_id}

@app.get("/users/{user_id}")  # Query
async def get_user(user_id: int):
    query = UserQuery()
    return await query.get_user_profile(user_id)
Rate Limiting Algorithms
1. Token Bucket
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self, tokens: int = 1) -> bool:
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
2. Sliding Window
async def sliding_window_rate_limit(
    user_id: int,
    max_requests: int = 100,
    window_seconds: int = 60
) -> bool:
    """Sliding window rate limiting with Redis sorted sets"""
    now = time.time()
    window_start = now - window_seconds
    key = f"rate_limit:{user_id}"
    # Remove entries that have slid out of the window
    await redis.zremrangebyscore(key, 0, window_start)
    # Count requests in window
    count = await redis.zcard(key)
    if count < max_requests:
        # Add current request
        await redis.zadd(key, {str(now): now})
        await redis.expire(key, window_seconds)
        return True
    return False
8. Scaling & Performance {#scaling-performance}
Understanding Scalability
Scalability is your system's ability to handle increased load. But here's the catch: scaling isn't just about handling more traffic; it's about handling it cost-effectively.
The Scaling Journey:
Stage 1: Single Server (0-100 users)
└── Everything on one machine
Stage 2: Vertical Scaling (100-1K users)
└── Bigger server
Stage 3: Horizontal Scaling (1K-100K users)
├── Multiple app servers
├── Load balancer
└── Database replication
Stage 4: Distributed (100K-1M+ users)
├── Microservices
├── Database sharding
├── CDN
├── Caching layers
└── Message queues
Common Misconception:
"Scaling = Adding more servers"
Reality:
"Scaling = Identifying bottlenecks and addressing them systematically"
Vertical vs Horizontal Scaling
Vertical Scaling (Scale Up):
Before: 4 CPU, 8GB RAM
After: 16 CPU, 64GB RAM
Pros:
✓ Simpler (no code changes)
✓ No consistency issues
✓ Lower latency (no network)
✓ Easier to maintain
Cons:
✗ Physical limits (can't scale infinitely)
✗ Expensive (exponential cost)
✗ Single point of failure
✗ Downtime during upgrades
Best for:
- Databases (harder to scale horizontally)
- Legacy applications
- Limited budget/time
Horizontal Scaling (Scale Out):
Before: 1 server
After: 10 servers
Pros:
✓ Unlimited scaling (add more servers)
✓ Linear cost
✓ Fault tolerance (one server fails, others work)
✓ Zero-downtime deployments
Cons:
✗ Complex architecture
✗ Network latency
✗ Consistency challenges
✗ More operational overhead
Best for:
- Stateless applications
- Web servers
- Microservices
- High availability needs
The Sweet Spot: Most applications use both:
- Vertical scaling for databases
- Horizontal scaling for application servers
Performance Bottlenecks: The Hierarchy
Fix in this order (biggest impact first):
1. Database Queries (Biggest Impact)
Problem: Slow queries
Solution: Indexes, query optimization
Impact: 10-100x improvement
Without index: 500ms
With index: 5ms
2. N+1 Queries
Problem: Multiple database calls
Solution: JOINs or batch loading
Impact: 10-50x improvement
100 queries = 5 seconds
1 query = 100ms
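The difference is easy to demonstrate with an in-memory SQLite database; the schema and data here are made up for illustration. Both functions compute per-user order totals, but the first issues one query per user while the second does it in a single JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

def totals_n_plus_one():
    """N+1: one query for users, then one query per user (N extra round trips)."""
    users = conn.execute("SELECT id, name FROM users").fetchall()
    result = {}
    for user_id, name in users:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,)
        ).fetchone()
        result[name] = row[0]
    return result

def totals_single_query():
    """Same result in one round trip using a JOIN."""
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)
```

Both return identical results; the single-query version just replaces N round trips with one.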
3. Caching
Problem: Repeated expensive operations
Solution: Redis/Memcached
Impact: 5-50x improvement
Database: 100ms
Cache: 2ms
4. Async Processing
Problem: Blocking operations
Solution: Background jobs
Impact: 5-10x improvement
Sync: User waits 5 seconds
Async: User waits 0.1 seconds
5. Code Optimization
Problem: Inefficient algorithms
Solution: Better algorithms/data structures
Impact: 2-5x improvement
O(n²) → O(n log n) or O(n)
6. Hardware/Scaling
Problem: Not enough resources
Solution: More servers
Impact: 2-10x improvement
1 server → 5 servers = 5x capacity
Load Balancing Strategies
Load balancing distributes traffic across multiple servers. But different algorithms suit different scenarios.
1. Round Robin (Default)
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (repeat)
Pros: Simple, fair distribution
Cons: Doesn't consider server load
Best for: Homogeneous servers, stateless apps
2. Least Connections
Server A: 5 connections
Server B: 3 connections ← Route here
Server C: 8 connections
Pros: Adapts to load
Cons: Needs connection tracking
Best for: Long-lived connections (WebSockets)
3. IP Hash (Sticky Sessions)
hash(client_ip) % num_servers = server
Same client always routes to same server
Pros: Session affinity
Cons: Uneven distribution if few clients
Best for: Stateful apps, caching benefits
4. Weighted Round Robin
Server A (weight 3): 60% traffic
Server B (weight 2): 40% traffic
Pros: Handles different server capacities
Cons: Manual weight configuration
Best for: Heterogeneous servers
Database Scaling Strategies
1. Read Replicas (Read-Heavy Workloads)
Primary DB (writes)
│
├── Replica 1 (reads)
├── Replica 2 (reads)
└── Replica 3 (reads)
Benefits:
- Distribute read load across replicas
- 3 replicas = 3x read capacity
- No application changes needed
Trade-offs:
- Replication lag (eventual consistency)
- Writes still bottlenecked on primary
2. Database Sharding (Write-Heavy Workloads)
User IDs 1-1000 → Shard 1
User IDs 1001-2000 → Shard 2
User IDs 2001-3000 → Shard 3
Benefits:
- Distribute both reads and writes
- Near-linear scaling
- Each shard is smaller (better performance)
Trade-offs:
- Complex queries (cross-shard JOINs)
- Rebalancing is difficult
- Application must handle routing
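The "application must handle routing" part can be sketched as a small router. This assumes hash-based partitioning with hypothetical shard identifiers; range-based sharding (IDs 1-1000 to shard 1, as above) would map ID ranges instead:

```python
import hashlib

class ShardRouter:
    """Route a user's queries to one of N shards by hashing the user ID."""
    def __init__(self, shard_dsns: list):
        # Hypothetical shard connection strings / identifiers
        self.shard_dsns = shard_dsns

    def shard_for(self, user_id: int) -> str:
        # Hash the key so load spreads evenly across shards;
        # the same user always lands on the same shard
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        index = int(digest, 16) % len(self.shard_dsns)
        return self.shard_dsns[index]
```

Note the trade-off baked into the modulo: changing the shard count remaps most keys, which is why rebalancing is hard (consistent hashing mitigates this).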
3. Caching Layer (Read-Heavy with Hot Data)
Application → Cache (Redis) → Database
Benefits:
- Extremely fast (1-5ms vs 50-100ms)
- Reduces database load by 70-90%
- Easy to add
Trade-offs:
- Cache invalidation complexity
- Additional infrastructure
- Memory cost
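The cache-aside flow behind this layer can be sketched with plain dicts standing in for Redis and the database. The TTL logic mirrors what a Redis SETEX would do; note how a cached entry keeps serving the old value after the database changes, which is the invalidation problem mentioned above:

```python
import time

class CacheAside:
    """Cache-aside: check cache first, fall back to the DB, then populate."""
    def __init__(self, db: dict, ttl_seconds: int = 300):
        self.db = db          # stand-in for the real database
        self.ttl = ttl_seconds
        self.cache = {}       # stand-in for Redis

    def get(self, key: str):
        entry = self.cache.get(key)
        if entry and entry["expires"] > time.monotonic():
            return entry["value"]      # cache hit: ~2ms in real life
        value = self.db.get(key)       # cache miss: hit the database
        if value is not None:
            self.cache[key] = {
                "value": value,
                "expires": time.monotonic() + self.ttl
            }
        return value
```

Until the TTL expires (or the entry is explicitly invalidated on write), readers see the stale cached value.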
Load Balancing Algorithms
from typing import List
import random
import hashlib

class LoadBalancer:
    def __init__(self, servers: List[str]):
        self.servers = servers
        self.current_index = 0

    def round_robin(self) -> str:
        """Distribute requests evenly"""
        server = self.servers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.servers)
        return server

    def least_connections(self, connections: dict) -> str:
        """Route to server with fewest connections"""
        return min(self.servers, key=lambda s: connections.get(s, 0))

    def ip_hash(self, client_ip: str) -> str:
        """Consistent routing based on IP"""
        hash_value = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        index = hash_value % len(self.servers)
        return self.servers[index]

    def weighted_round_robin(self, weights: dict) -> str:
        """Weighted random selection (approximates weighted round robin)"""
        weighted_servers = []
        for server in self.servers:
            weight = weights.get(server, 1)
            weighted_servers.extend([server] * weight)
        return random.choice(weighted_servers)

    def random_selection(self) -> str:
        """Random server selection"""
        return random.choice(self.servers)
Database Replication
class DatabaseRouter:
    """Route queries to appropriate database"""
    def __init__(self):
        self.primary = create_connection("primary-db:5432")
        self.replicas = [
            create_connection("replica-1:5432"),
            create_connection("replica-2:5432"),
            create_connection("replica-3:5432")
        ]
        self.replica_index = 0

    async def execute_write(self, query: str, *args):
        """Execute write on primary"""
        return await self.primary.execute(query, *args)

    async def execute_read(self, query: str, *args):
        """Execute read on replica (round-robin load balanced)"""
        replica = self.replicas[self.replica_index]
        self.replica_index = (self.replica_index + 1) % len(self.replicas)
        return await replica.execute(query, *args)

# Usage
db = DatabaseRouter()

# Writes go to primary
await db.execute_write("INSERT INTO users (name) VALUES ($1)", "John")

# Reads distributed across replicas
users = await db.execute_read("SELECT * FROM users")
Connection Pooling Best Practices
# Optimal pool sizing formula
# connections = ((core_count * 2) + effective_spindle_count)
import asyncpg
import multiprocessing
async def create_optimized_pool():
core_count = multiprocessing.cpu_count()
    # For SSD: effective_spindle_count ≈ core_count
    # For HDD: effective_spindle_count ≈ number of physical drives
optimal_size = (core_count * 2) + core_count
pool = await asyncpg.create_pool(
dsn="postgresql://user:pass@localhost/db",
min_size=optimal_size // 2, # Always maintain half
max_size=optimal_size,
command_timeout=60.0,
max_queries=50000, # Recycle connections
max_inactive_connection_lifetime=300.0 # 5 minutes
)
return pool
9. Security Best Practices {#security}
Security: Not Optional, Not Later
The Harsh Reality:
- Average cost of a data breach: ~$4.35 million (IBM's 2022 Cost of a Data Breach report)
- 43% of cyberattacks reportedly target small businesses
- A widely cited (though debated) statistic: 60% of small companies fold within 6 months of a breach
Security is NOT:
- Something to add later
- Just for big companies
- Only frontend (CORS, XSS)
- A one-time implementation
Security IS:
- Prevention: Stop attacks before they happen
- Detection: Know when you're under attack
- Response: Minimize damage when breached
- Recovery: Get back online quickly
OWASP Top 10: The Most Critical Vulnerabilities
The Open Web Application Security Project (OWASP) maintains a list of the most critical security risks. Every backend engineer must know these.
1. Injection (SQL, NoSQL, Command)
What: Attacker inserts malicious code into queries
Impact: Database takeover, data theft
Example:
Input: "admin' OR '1'='1"
Query: SELECT * FROM users WHERE username='admin' OR '1'='1'
Result: Bypasses authentication!
Prevention: Parameterized queries, input validation
2. Broken Authentication
What: Weak password policies, session management
Impact: Account takeover
Examples:
- No rate limiting → Brute force
- Predictable session IDs
- Passwords in URLs
Prevention: Strong passwords, MFA, secure sessions
3. Sensitive Data Exposure
What: Exposing confidential data
Impact: Privacy violations, compliance issues
Examples:
- Sending credit cards in logs
- Storing passwords in plain text
- API keys in code
Prevention: Encryption, hashing, secure storage
4. XML External Entities (XXE)
What: Processing untrusted XML
Impact: Server-side request forgery, DoS
Prevention: Disable XML external entity processing
5. Broken Access Control
What: Users access unauthorized resources
Impact: Privilege escalation, data leaks
Example:
User A: /api/users/123/orders ✅
User A: /api/users/456/orders ❌ (Should fail but doesn't!)
Prevention: Server-side authorization checks
6. Security Misconfiguration
What: Default configs, unnecessary features
Impact: Various vulnerabilities
Examples:
- Debug mode in production
- Default passwords
- Unnecessary services running
Prevention: Security hardening, minimal setup
7. Cross-Site Scripting (XSS)
What: Injecting malicious scripts
Impact: Session hijacking, defacement
Example:
Input: <script>steal_cookies()</script>
Display: Executes in victim's browser!
Prevention: Input sanitization, CSP headers
8. Insecure Deserialization
What: Deserializing untrusted data
Impact: Remote code execution
Prevention: Avoid deserialization of user input
9. Using Components with Known Vulnerabilities
What: Outdated libraries with CVEs
Impact: Various exploits
Example: Using Node.js package with known RCE
Prevention: Regular updates, dependency scanning
10. Insufficient Logging & Monitoring
What: Not tracking security events
Impact: Can't detect or respond to attacks
Prevention: Log all security events, alert on anomalies
Defense in Depth: Layered Security
Don't rely on a single security measure. Use multiple layers:
Layer 1: Network (Firewall, VPC)
        ↓
Layer 2: Infrastructure (OS hardening, patches)
        ↓
Layer 3: Application (Input validation, auth)
        ↓
Layer 4: Data (Encryption, hashing)
        ↓
Layer 5: Monitoring (Logging, alerts)
If one layer fails, others protect you.
Password Security: Beyond Hashing
Password Storage Hierarchy (Worst to Best):
❌ Plain Text
Database: password = "MyPassword123"
Impact: Immediate breach if database leaked
❌ MD5/SHA1 (Broken)
Database: password = md5("MyPassword123")
Impact: Rainbow table attacks, extremely fast cracking
❌ SHA256 (Better but still bad)
Database: password = sha256("MyPassword123")
Impact: GPU cracking is very fast
✅ Bcrypt/Argon2 (Correct)
Database: password = bcrypt("MyPassword123", rounds=12)
Impact: Computationally expensive to crack
Time to crack: Years with current hardware
Why Bcrypt/Argon2?
- Slow by design: Takes 100-500ms to hash (vs 0.001ms for SHA256)
- Adaptive: Can increase rounds as hardware improves
- Salt built-in: Protects against rainbow tables
- Battle-tested: Industry standard
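The "slow by design" point is easy to demonstrate: a general-purpose hash like SHA-256 can be computed enormously fast on a single CPU core, which is exactly what makes it crackable. A quick timing sketch (the measured rate will vary by machine):

```python
import hashlib
import time

def hashes_per_second(duration: float = 0.2) -> int:
    """Count how many SHA-256 password guesses one CPU core makes per second."""
    count = 0
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        hashlib.sha256(b"MyPassword123").hexdigest()
        count += 1
    return int(count / duration)

rate = hashes_per_second()
print(f"~{rate:,} SHA-256 hashes/second on one core")
# bcrypt at 12 rounds manages only a handful of hashes per second on similar hardware
```

GPU rigs multiply that single-core rate by orders of magnitude, which is why an adaptive, deliberately slow function is the only safe choice for password storage.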
Password Policies That Actually Work:
Effective:
- Minimum 8-12 characters
- Check against breached password database (Have I Been Pwned)
- Account lockout after failed attempts
- Multi-factor authentication
Ineffective (Stop Doing These):
- Forced password rotation every 90 days
- Complex requirements (1 uppercase, 1 number, 1 special)
- Security questions
- Password hints
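The breached-password check works via Have I Been Pwned's k-anonymity range API: you send only the first five hex characters of the password's SHA-1 hash and match the suffix locally, so the password itself never leaves your server. A sketch (the network call is shown but not exercised here):

```python
import hashlib
import urllib.request
from typing import Tuple

def hibp_prefix_suffix(password: str) -> Tuple[str, str]:
    """Split the password's SHA-1 into the 5-char prefix sent to the API
    and the suffix that is matched locally."""
    sha1 = hashlib.sha1(password.encode()).hexdigest().upper()
    return sha1[:5], sha1[5:]

def is_breached(password: str) -> bool:
    """Query the range endpoint and scan the response for our suffix."""
    prefix, suffix = hibp_prefix_suffix(password)
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode()
    # Each response line looks like "SUFFIX:COUNT"
    return any(line.split(":")[0] == suffix for line in body.splitlines())

print(hibp_prefix_suffix("password"))
# ('5BAA6', '1E4C9B93F3F0682250B6CF8331B7EE68FD8')
```

Rejecting any password found in the breach corpus blocks the exact credentials attackers try first, without imposing arbitrary complexity rules on users.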
Common Attack Vectors
1. Brute Force
Attack: Try many passwords rapidly
Defense: Rate limiting + account lockout
Example:
- Allow 5 login attempts per 15 minutes
- After 5 failures, lock account for 15 minutes
- Alert security team after 10 failures
2. DDoS (Distributed Denial of Service)
Attack: Overwhelm server with traffic
Defense: Rate limiting + CDN + Auto-scaling
Example:
- Use CloudFlare/AWS Shield
- Implement global rate limits
- Auto-scale during attacks
3. Session Hijacking
Attack: Steal user session
Defense: Secure cookies + HTTPS + regenerate sessions
Example:
- HttpOnly cookies (prevent JavaScript access)
- Secure flag (HTTPS only)
- SameSite attribute (CSRF protection)
- Regenerate session ID after login
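Those cookie attributes can be demonstrated with the standard library alone; web frameworks set the same header for you. A sketch (the session-ID scheme is illustrative):

```python
from http.cookies import SimpleCookie
import secrets

def session_cookie(session_id: str) -> str:
    """Build a Set-Cookie header value with the session-hijacking defenses above."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["httponly"] = True   # JavaScript cannot read it
    cookie["session"]["secure"] = True     # sent over HTTPS only
    cookie["session"]["samesite"] = "Lax"  # basic CSRF protection
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()

header = session_cookie(secrets.token_urlsafe(32))
print(header)  # includes HttpOnly, Secure, and SameSite=Lax attributes
```

Generating the session ID with `secrets` (not `random`) is part of the defense: predictable session IDs are one of the broken-authentication examples above.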
4. Man-in-the-Middle (MITM)
Attack: Intercept communication
Defense: HTTPS everywhere + certificate pinning
Example:
- Force HTTPS (redirect HTTP to HTTPS)
- HSTS header (browser-enforced HTTPS)
- Valid SSL certificates
SQL Injection Prevention
# ❌ VULNERABLE to SQL Injection
def get_user_unsafe(user_id: str):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return db.execute(query)
# Attacker: user_id = "1 OR 1=1; DROP TABLE users;--"

# ✅ SAFE: Use parameterized queries
def get_user_safe(user_id: int):
    query = "SELECT * FROM users WHERE id = $1"
    return db.execute(query, user_id)

# ✅ SAFE: Use an ORM (Django-style shown)
def get_user_orm(user_id: int):
    return User.objects.filter(id=user_id).first()
XSS Prevention
from html import escape
# ❌ VULNERABLE to XSS
@app.get("/user/{name}")
async def greet_unsafe(name: str):
    return {"message": f"<h1>Hello {name}</h1>"}
# Attacker: name = "<script>alert('XSS')</script>"

# ✅ SAFE: Escape HTML
@app.get("/user/{name}")
async def greet_safe(name: str):
    safe_name = escape(name)
    return {"message": f"<h1>Hello {safe_name}</h1>"}

# ✅ SAFE: Use templates with auto-escaping enabled
from jinja2 import Template
template = Template("<h1>Hello {{ name }}</h1>", autoescape=True)
html = template.render(name=name)  # {{ name }} is escaped on render
Password Security
from passlib.context import CryptContext
import secrets
pwd_context = CryptContext(
schemes=["bcrypt"],
deprecated="auto",
bcrypt__rounds=12 # Work factor (higher = slower = more secure)
)
class PasswordSecurity:
@staticmethod
def hash_password(password: str) -> str:
"""Hash password with bcrypt"""
return pwd_context.hash(password)
@staticmethod
def verify_password(plain: str, hashed: str) -> bool:
"""Verify password against hash"""
return pwd_context.verify(plain, hashed)
@staticmethod
def generate_strong_password(length: int = 16) -> str:
"""Generate cryptographically secure password"""
alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!@#$%^&*()"
return ''.join(secrets.choice(alphabet) for _ in range(length))
@staticmethod
def validate_password_strength(password: str) -> bool:
"""Validate password meets requirements"""
if len(password) < 8:
return False
has_upper = any(c.isupper() for c in password)
has_lower = any(c.islower() for c in password)
has_digit = any(c.isdigit() for c in password)
has_special = any(c in "!@#$%^&*()" for c in password)
return all([has_upper, has_lower, has_digit, has_special])
CORS Configuration
from fastapi.middleware.cors import CORSMiddleware
app.add_middleware(
CORSMiddleware,
allow_origins=[
"https://example.com",
"https://app.example.com"
], # Specific origins, NOT "*" in production!
allow_credentials=True,
allow_methods=["GET", "POST", "PUT", "DELETE"],
allow_headers=["Authorization", "Content-Type"],
max_age=3600 # Cache preflight requests for 1 hour
)
Rate Limiting for Security
from datetime import datetime, timedelta
from collections import defaultdict
class SecurityRateLimiter:
"""Rate limiter to prevent brute force attacks"""
def __init__(self):
self.attempts = defaultdict(list)
async def check_login_attempts(
self,
identifier: str, # email or IP
max_attempts: int = 5,
window_minutes: int = 15
) -> bool:
"""Check if too many login attempts"""
now = datetime.now()
window_start = now - timedelta(minutes=window_minutes)
# Clean old attempts
self.attempts[identifier] = [
attempt for attempt in self.attempts[identifier]
if attempt > window_start
]
# Check limit
if len(self.attempts[identifier]) >= max_attempts:
return False # Too many attempts
# Record attempt
self.attempts[identifier].append(now)
return True
security_limiter = SecurityRateLimiter()
@app.post("/auth/login")
async def login(credentials: LoginRequest, request: Request):
identifier = f"{credentials.email}:{request.client.host}"
if not await security_limiter.check_login_attempts(identifier):
raise HTTPException(
status_code=429,
detail="Too many login attempts. Try again in 15 minutes."
)
# Proceed with login...
10. Monitoring & Observability {#monitoring}
Monitoring vs Observability: Understanding the Difference
Monitoring tells you WHAT is wrong:
- CPU is at 90%
- Response time is 2 seconds
- Error rate is 5%
Observability tells you WHY it's wrong:
- CPU is high because database queries are slow
- Response time is slow because external API is timing out
- Errors are from payment gateway being down
The Reality:
"If you can't measure it, you can't improve it." - Peter Drucker
Most production issues are discovered by users, not monitoring. This should never happen.
The Three Pillars of Observability
These three pillars work together to give you complete visibility:
1. Metrics (WHAT is happening?)
Examples:
- Request count: 1,250 requests/second
- Error rate: 0.5%
- CPU usage: 65%
- Memory usage: 4.2 GB
- Database connections: 45/100
Use Case: Quick overview, dashboards, alerts
Tools: Prometheus, Grafana, CloudWatch
2. Logs (WHY did it happen?)
Examples:
- "Failed to connect to database: connection timeout"
- "User 123 attempted invalid payment"
- "API rate limit exceeded for client ABC"
Use Case: Debugging, audit trails, troubleshooting
Tools: ELK Stack, Splunk, CloudWatch Logs
3. Traces (WHERE did it happen?)
Example:
Request ID: abc123
├── API Gateway: 5ms
├── Auth Service: 10ms
├── User Service: 150ms ← SLOW!
│   ├── Database Query: 145ms ← ROOT CAUSE!
│   └── Cache Check: 5ms
└── Response: 2ms
Use Case: Distributed systems, microservices debugging
Tools: Jaeger, Zipkin, AWS X-Ray
Why You Need All Three:
Metrics alone:
- "API latency is 500ms" → But which endpoint? Which user?
Logs alone:
- "1 million log lines" → How do you find the problem?
Traces alone:
- "This request was slow" → But is it always slow? How often?
Together:
- Metrics alert you: "API latency spiking!"
- Traces show you: "User service database query is slow"
- Logs tell you: "Deadlock detected on orders table"
What to Monitor: The Essential Metrics
System Metrics (Infrastructure Health):
✓ CPU Usage (target: <70%)
✓ Memory Usage (target: <80%)
✓ Disk Usage (target: <85%)
✓ Network I/O
✓ File Descriptors (often overlooked!)
Alert when:
- Any metric >80% for 5 minutes
- Trend shows rapid increase
Application Metrics (Business Health):
✓ Request Rate (requests/second)
✓ Error Rate (% of failed requests)
✓ Latency (P50, P95, P99)
✓ Active Users
✓ Business KPIs (orders, signups, etc.)
Alert when:
- Error rate >1% for 5 minutes
- P99 latency >1 second
- Business KPI drops >20%
Database Metrics (Data Layer Health):
✓ Connection Pool Usage
✓ Query Latency
✓ Slow Query Count
✓ Replication Lag
✓ Deadlocks
Alert when:
- Connection pool >90% for 5 minutes
- Slow queries >100/minute
- Replication lag >5 seconds
External Dependencies:
✓ Third-party API response time
✓ Payment gateway status
✓ Email service status
✓ CDN health
Alert when:
- External API error rate >5%
- Response time >2x baseline
The Golden Signals (Google SRE)
Google's Site Reliability Engineering team identified 4 critical metrics:
1. Latency (Speed)
How long does it take to serve a request?
Measure:
- P50 (median): 50% of requests faster than this
- P95: 95% of requests faster than this
- P99: 99% of requests faster than this
Why P99 matters:
If P99 = 2 seconds, 1 in 100 requests takes at least that long.
At 1 million requests/day, that's 10,000 painfully slow requests every single day!
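Percentiles are computed from the sorted latency samples; a nearest-rank sketch makes clear why the median can look healthy while the tail is terrible (the sample data is made up):

```python
import math
from typing import List

def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    p% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank - 1, 0)]

# 100 simulated request latencies (ms): 95 fast, 5 slow outliers
latencies = [50.0] * 95 + [2000.0] * 5
print(percentile(latencies, 50))  # 50.0   — the median looks great
print(percentile(latencies, 99))  # 2000.0 — the tail tells the real story
```

This is why dashboards that show only averages or medians routinely hide the exact users who are suffering.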
2. Traffic (Demand)
How much demand is placed on your system?
Measure:
- Requests per second
- Active connections
- Bandwidth usage
Why it matters:
Helps you understand if slow response is due to load
3. Errors (Quality)
What percentage of requests are failing?
Measure:
- HTTP 5xx errors
- HTTP 4xx errors
- Exception rate
- Failed background jobs
Why it matters:
Users can't complete their tasks
4. Saturation (Capacity)
How "full" is your service?
Measure:
- CPU/Memory utilization
- Disk I/O
- Network I/O
- Queue depth
Why it matters:
Predicts when you'll need to scale
Alerting Best Practices
The Problem with Bad Alerts:
- Too many → Alert fatigue → Ignore real issues
- Too few → Miss critical problems
- Unclear → Waste time investigating
Good Alert Rules:
1. Actionable
❌ Bad: "CPU usage is 85%"
✅ Good: "CPU usage >80% for 5 minutes. Check logs for slow queries."
Every alert should answer:
- What's wrong?
- Why should I care?
- What should I do?
2. Severity Levels
P1 (Critical): System down, immediate action
- Page on-call engineer
- Example: Website returning 500 errors
P2 (High): Degraded service, fix within hours
- Email + Slack
- Example: Slow database queries
P3 (Medium): Needs attention, fix within days
- Ticket created
- Example: Disk usage at 85%
P4 (Low): Informational, fix when convenient
- Log only
- Example: Non-critical service restarted
3. Avoid Alert Fatigue
Rules:
- Alert on symptoms, not causes
- Use thresholds with time windows (not instant)
- Implement alert aggregation
- Regular alert review and tuning
Example:
β "5xx error occurred" (fires 1000 times)
β
"5xx error rate >1% for 5 minutes" (fires once)
Health Checks: The Early Warning System
Health checks tell load balancers and orchestrators if your service is healthy.
Types of Health Checks:
1. Liveness Check
Question: "Are you alive?"
Purpose: Should we restart you?
Check:
- Process is running
- Can accept requests
Endpoint: GET /health/live
Response: 200 OK or 503 Service Unavailable
2. Readiness Check
Question: "Are you ready to serve traffic?"
Purpose: Should we send you requests?
Check:
- Database connected
- Cache connected
- Dependencies available
Endpoint: GET /health/ready
Response: 200 OK or 503 Service Unavailable
3. Startup Check
Question: "Have you finished starting up?"
Purpose: Is initialization complete?
Check:
- Configuration loaded
- Database migrations run
- Caches warmed
Endpoint: GET /health/startup
Response: 200 OK or 503 Service Unavailable
Health Check Best Practices:
- Keep checks fast (<1 second)
- Don't check every dependency deeply (too slow)
- Return appropriate HTTP status codes
- Include timestamp in response
- Log failed checks
Structured Logging
import logging
import json
from datetime import datetime
from contextvars import ContextVar
# Context variable for request ID
request_id_var: ContextVar[str] = ContextVar('request_id', default='')
class JSONFormatter(logging.Formatter):
"""JSON formatter for structured logging"""
def format(self, record):
log_data = {
"timestamp": datetime.utcnow().isoformat(),
"level": record.levelname,
"message": record.getMessage(),
"logger": record.name,
"module": record.module,
"function": record.funcName,
"line": record.lineno
}
# Add request ID if available
request_id = request_id_var.get()
if request_id:
log_data["request_id"] = request_id
# Add extra fields
if hasattr(record, 'user_id'):
log_data["user_id"] = record.user_id
if hasattr(record, 'duration_ms'):
log_data["duration_ms"] = record.duration_ms
if record.exc_info:
log_data["exception"] = self.formatException(record.exc_info)
return json.dumps(log_data)
# Configure logger
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# Usage
logger.info("User logged in", extra={"user_id": 123})
logger.error("Payment failed", extra={
"user_id": 123,
"order_id": 456,
"amount": 99.99
})
Health Check Endpoint
from enum import Enum
class HealthStatus(str, Enum):
HEALTHY = "healthy"
DEGRADED = "degraded"
UNHEALTHY = "unhealthy"
class HealthCheck:
async def check_database(self) -> tuple[bool, str]:
"""Check database connection"""
try:
await db.execute("SELECT 1")
return True, "Database OK"
except Exception as e:
return False, f"Database error: {str(e)}"
async def check_redis(self) -> tuple[bool, str]:
"""Check Redis connection"""
try:
await redis.ping()
return True, "Redis OK"
except Exception as e:
return False, f"Redis error: {str(e)}"
async def check_disk_space(self) -> tuple[bool, str]:
"""Check available disk space"""
import shutil
stats = shutil.disk_usage("/")
percent_used = (stats.used / stats.total) * 100
if percent_used > 90:
return False, f"Disk usage critical: {percent_used:.1f}%"
elif percent_used > 80:
return True, f"Disk usage warning: {percent_used:.1f}%"
return True, f"Disk usage OK: {percent_used:.1f}%"
async def get_health(self) -> dict:
"""Get overall health status"""
checks = {
"database": await self.check_database(),
"redis": await self.check_redis(),
"disk": await self.check_disk_space()
}
# Determine overall status
failed = [k for k, (ok, _) in checks.items() if not ok]
if not failed:
status = HealthStatus.HEALTHY
elif len(failed) < len(checks):
status = HealthStatus.DEGRADED
else:
status = HealthStatus.UNHEALTHY
return {
"status": status,
"checks": {
k: {"status": "ok" if ok else "failed", "message": msg}
for k, (ok, msg) in checks.items()
},
"timestamp": datetime.utcnow().isoformat()
}
health_checker = HealthCheck()
@app.get("/health")
async def health():
result = await health_checker.get_health()
status_code = {
HealthStatus.HEALTHY: 200,
HealthStatus.DEGRADED: 200,
HealthStatus.UNHEALTHY: 503
}[result["status"]]
return Response(
content=json.dumps(result),
status_code=status_code,
media_type="application/json"
)
Distributed Tracing
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
jaeger_exporter = JaegerExporter(
agent_host_name="localhost",
agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
BatchSpanProcessor(jaeger_exporter)
)
# Usage
@app.get("/users/{user_id}")
async def get_user(user_id: int):
with tracer.start_as_current_span("get_user") as span:
span.set_attribute("user_id", user_id)
# Database query
with tracer.start_as_current_span("db_query"):
user = await db.get_user(user_id)
# Cache store
with tracer.start_as_current_span("cache_store"):
await redis.set(f"user:{user_id}", json.dumps(user))
return user
11. CI/CD & DevOps {#cicd-devops}
CI/CD Pipeline
Dockerfile Best Practices
# Multi-stage build for smaller images
FROM python:3.11-slim as builder
# Install dependencies in builder stage
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt
# Final stage
FROM python:3.11-slim
# Create non-root user
RUN useradd -m -u 1000 appuser
# Copy dependencies from builder
COPY --from=builder /root/.local /home/appuser/.local
# Set working directory
WORKDIR /app
# Copy application code
COPY --chown=appuser:appuser . .
# Switch to non-root user
USER appuser
# Add .local/bin to PATH
ENV PATH=/home/appuser/.local/bin:$PATH
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD python healthcheck.py || exit 1
# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Docker Compose for Development
version: '3.8'
services:
api:
build: .
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://user:pass@db:5432/mydb
- REDIS_URL=redis://redis:6379/0
depends_on:
- db
- redis
volumes:
- ./:/app
command: uvicorn main:app --reload --host 0.0.0.0
db:
image: postgres:15
environment:
- POSTGRES_USER=user
- POSTGRES_PASSWORD=pass
- POSTGRES_DB=mydb
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
redis:
image: redis:7-alpine
ports:
- "6379:6379"
worker:
build: .
command: celery -A tasks worker --loglevel=info
environment:
- DATABASE_URL=postgresql://user:pass@db:5432/mydb
- REDIS_URL=redis://redis:6379/0
depends_on:
- db
- redis
volumes:
postgres_data:
12. Data Structures & Algorithms {#data-structures}
Time Complexity Cheat Sheet
| Operation | Array | Linked List | Hash Table | Binary Tree | Heap |
|---|---|---|---|---|---|
| Access | O(1) | O(n) | O(1) avg | O(log n) | O(1) |
| Search | O(n) | O(n) | O(1) avg | O(log n) | O(n) |
| Insert | O(n) | O(1) | O(1) avg | O(log n) | O(log n) |
| Delete | O(n) | O(1) | O(1) avg | O(log n) | O(log n) |
Common Algorithms Every Backend Engineer Should Know
from typing import List

# 1. Binary Search (O(log n))
def binary_search(arr: List[int], target: int) -> int:
left, right = 0, len(arr) - 1
while left <= right:
mid = (left + right) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
left = mid + 1
else:
right = mid - 1
return -1
# 2. Two Pointers Pattern
def remove_duplicates(arr: List[int]) -> int:
"""Remove duplicates from sorted array in-place"""
if not arr:
return 0
write = 1
for read in range(1, len(arr)):
if arr[read] != arr[read - 1]:
arr[write] = arr[read]
write += 1
return write
# 3. Sliding Window
def max_sum_subarray(arr: List[int], k: int) -> int:
"""Maximum sum of k consecutive elements"""
window_sum = sum(arr[:k])
max_sum = window_sum
for i in range(k, len(arr)):
window_sum = window_sum - arr[i - k] + arr[i]
max_sum = max(max_sum, window_sum)
return max_sum
# 4. BFS (Breadth-First Search)
from collections import deque
def bfs(graph: dict, start: str):
visited = set()
queue = deque([start])
visited.add(start)
while queue:
node = queue.popleft()
print(node)
for neighbor in graph[node]:
if neighbor not in visited:
visited.add(neighbor)
queue.append(neighbor)
# 5. DFS (Depth-First Search)
def dfs(graph: dict, node: str, visited: set = None):
if visited is None:
visited = set()
visited.add(node)
print(node)
for neighbor in graph[node]:
if neighbor not in visited:
dfs(graph, neighbor, visited)
# 6. LRU Cache
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity: int):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key: int) -> int:
if key not in self.cache:
return -1
# Move to end (most recently used)
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key: int, value: int) -> None:
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
# Remove least recently used (first item)
self.cache.popitem(last=False)
Essential Backend Concepts Checklist
✅ API Design
- RESTful principles
- HTTP status codes
- API versioning
- Request/Response schemas
- Error handling patterns
✅ Database
- SQL and NoSQL differences
- ACID properties
- Normalization
- Indexing strategies
- Query optimization
- Transactions
- Replication and sharding
✅ Authentication & Security
- JWT tokens
- OAuth 2.0
- Password hashing (bcrypt)
- RBAC (Role-Based Access Control)
- SQL injection prevention
- XSS prevention
- CORS configuration
✅ Performance & Scaling
- Caching strategies
- Database connection pooling
- Horizontal vs Vertical scaling
- Load balancing
- CDN usage
- Async processing
✅ Architecture
- Monolith vs Microservices
- API Gateway pattern
- Service discovery
- Circuit breaker
- Message queues
- Event-driven architecture
✅ DevOps
- Docker containerization
- CI/CD pipelines
- Infrastructure as Code
- Monitoring & logging
- Health checks
- Blue-green deployment
✅ Data Structures & Algorithms
- Time/Space complexity
- Hash tables
- Trees and graphs
- Sorting algorithms
- Search algorithms
- Dynamic programming
Recommended Learning Path
The 6-Month Backend Engineer Journey
This isn't just a curriculum: it's a battle-tested path from junior to mid-level backend engineer. Each month builds on the previous one.
Month 1-2: Fundamentals (Foundation Phase)
Goal: Build solid foundations. 80% of problems in production come from not understanding basics.
Week 1-2: API Design
Learn:
- RESTful principles (read Richardson Maturity Model)
- HTTP methods and status codes
- Request/Response design
- API versioning strategies
Practice:
- Build a TODO API
- Implement proper error handling
- Add pagination
- Write API documentation (OpenAPI/Swagger)
Outcome: Can design clean, intuitive APIs
Week 3-4: Database Fundamentals
Learn:
- SQL basics (SELECT, JOIN, WHERE, GROUP BY)
- Database design (normalization)
- Indexes and their impact
- ACID properties
Practice:
- Design database for blog platform
- Write complex queries with JOINs
- Create indexes and measure improvement
- Use EXPLAIN to understand query plans
Outcome: Can design efficient database schemas
Week 5-6: Authentication & Security
Learn:
- Password hashing (bcrypt)
- JWT tokens
- OAuth 2.0 flow
- OWASP Top 10
Practice:
- Implement user registration/login
- Add JWT authentication
- Implement role-based access control
- Security audit your code
Outcome: Can implement secure authentication
Week 7-8: Testing & Best Practices
Learn:
- Unit testing
- Integration testing
- Test-driven development (TDD)
- Code organization
Practice:
- Write tests for your API
- Achieve >80% code coverage
- Refactor code for testability
- Set up CI/CD pipeline
Outcome: Can write maintainable, tested code
Month 1-2 Milestone Project: Build a complete blog API with:
- User authentication (JWT)
- CRUD operations for posts
- Comments system
- Search functionality
- Pagination
- 80%+ test coverage
Month 3-4: Intermediate (Scaling Phase)
Goal: Learn to build systems that scale and perform well.
Week 9-10: Caching
Learn:
- Cache-aside pattern
- Write-through vs write-behind
- Redis fundamentals
- Cache invalidation strategies
Practice:
- Add Redis caching to your blog API
- Implement cache warming
- Measure performance improvements
- Handle cache stampede
Outcome: Can dramatically improve API performance
Week 11-12: Async Processing
Learn:
- Message queues (RabbitMQ/Redis)
- Background jobs (Celery)
- Event-driven architecture
- Task queues
Practice:
- Add email notifications (async)
- Implement image processing queue
- Handle task failures and retries
- Monitor queue depth
Outcome: Can offload slow operations
Week 13-14: Database Optimization
Learn:
- Query optimization
- Connection pooling
- N+1 query problem
- Database replication
Practice:
- Profile slow queries
- Add appropriate indexes
- Implement connection pooling
- Set up read replica
Outcome: Can optimize database performance
Week 15-16: API Optimization
Learn:
- Rate limiting
- Response compression
- Pagination strategies
- API versioning
Practice:
- Implement rate limiting
- Add response compression
- Optimize payload sizes
- Version your API
Outcome: Can build production-ready APIs
Month 3-4 Milestone Project: Scale your blog API to handle:
- 1000 requests/second
- 1 million users
- Response time <100ms (P95)
- Background image processing
Month 5-6: Advanced (Architecture Phase)
Goal: Understand system architecture and distributed systems.
Week 17-18: Microservices
Learn:
- Microservices vs monolith
- Service communication
- API Gateway pattern
- Service discovery
Practice:
- Break blog into microservices
- Implement API gateway
- Add service-to-service auth
- Handle partial failures
Outcome: Can design microservices architecture
Week 19-20: Monitoring & Observability
Learn:
- Metrics (Prometheus)
- Logging (structured logs)
- Tracing (Jaeger)
- Alerting
Practice:
- Add Prometheus metrics
- Implement structured logging
- Set up distributed tracing
- Create meaningful alerts
Outcome: Can debug production issues
Week 21-22: System Design
Learn:
- Design patterns (Saga, CQRS, Circuit Breaker)
- Load balancing
- Database sharding
- CDN usage
Practice:
- Design Twitter-like system
- Design URL shortener
- Design notification service
- Design for 100M users
Outcome: Can design scalable systems
Week 23-24: DevOps & Deployment
Learn:
- Docker containerization
- Kubernetes basics
- CI/CD pipelines
- Infrastructure as code
Practice:
- Containerize your application
- Set up CI/CD (GitHub Actions)
- Deploy to Kubernetes
- Implement blue-green deployment
Outcome: Can deploy and operate systems
Month 5-6 Milestone Project: Design and implement a URL shortener that:
- Handles 10,000 writes/second
- Handles 100,000 reads/second
- 99.99% uptime
- Global distribution (multi-region)
- Complete monitoring and alerting
Ongoing: Continuous Learning
Daily (30 minutes):
- Read engineering blogs
- Netflix Tech Blog
- Uber Engineering
- Cloudflare Blog
- AWS Architecture Blog
Weekly (2 hours):
- Practice algorithms (LeetCode)
- Study system design
- Read documentation
- Experiment with new tech
Monthly:
- Deep dive into one topic
- Build side project
- Contribute to open source
- Write technical blog post
Quarterly:
- Learn new framework/language
- Take online course
- Attend conference/meetup
- Review and update goals
The Learning Strategy That Actually Works
1. Learn by Building (Not Just Watching)
❌ Bad: Watch 10 tutorials
✅ Good: Watch 1 tutorial, build 10 projects
2. Teach to Learn
- Write blog posts
- Answer StackOverflow questions
- Mentor juniors
- Give tech talks
Teaching forces deep understanding
3. Read Code More Than Write
Study production codebases:
- Django (web framework)
- Flask (minimalist framework)
- FastAPI (modern async)
- Redis (database)
Learn patterns and practices
4. Failure is the Best Teacher
Break things intentionally:
- Crash your database
- Overload your server
- Simulate network failures
- Practice recovery
Production won't be forgiving
5. Build in Public
- Share progress on Twitter/LinkedIn
- Open source your projects
- Get feedback early
- Build a portfolio
Visibility leads to opportunities
Common Pitfalls to Avoid
1. Tutorial Hell
Problem: Endlessly watching tutorials
Solution: Build projects after each tutorial
2. Premature Optimization
Problem: Optimizing before measuring
Solution: Profile first, optimize second
3. Overengineering
Problem: Using microservices for todo app
Solution: Start simple, scale when needed
4. Ignoring Fundamentals
Problem: Jumping to advanced topics
Solution: Master basics first
5. Learning Alone
Problem: No feedback, no motivation
Solution: Join communities, find mentors
Success Metrics: Are You Ready?
Junior → Mid-Level Engineer:
- Can design and implement RESTful APIs
- Understand database design and optimization
- Can implement authentication and authorization
- Write maintainable, tested code
- Debug production issues independently
- Estimate tasks accurately
- Communicate technical decisions clearly
Mid-Level → Senior Engineer:
- Design systems for scale (1M+ users)
- Lead technical discussions
- Mentor junior engineers
- Make architectural decisions
- Handle incidents and postmortems
- Balance technical debt vs features
- Think about business impact
Final Thoughts: The Backend Engineer Mindset
1. User First
Every technical decision impacts users.
Fast response time = happy users
Downtime = lost revenue
Security breach = lost trust
2. Measure Everything
"In God we trust; all others must bring data." (attributed to W. Edwards Deming)
No measurement = no improvement
Metrics guide decisions
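Measurement can start very small. This is a minimal sketch (names are illustrative; a real system would export to Prometheus, StatsD, or similar) of a decorator that records the latency of every call:

```python
import time
from functools import wraps

def timed(metrics):
    """Decorator that appends each call's latency (in seconds)
    to a shared dict, keyed by function name."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics.setdefault(fn.__name__, []).append(
                    time.perf_counter() - start)
        return wrapper
    return decorator

metrics = {}

@timed(metrics)
def handle_request():
    time.sleep(0.01)  # stand-in for real work
    return "done"

handle_request()
print(f"handle_request latency: {metrics['handle_request'][0]:.4f}s")
```

Once numbers like these exist, "is it slow?" becomes a question with an answer instead of an opinion.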
3. Simplicity Wins
A simple solution that works > a complex solution that's perfect
Complexity is the enemy of reliability
Start simple, scale when needed
4. Plan for Failure
Everything fails:
- Servers crash
- Networks partition
- Databases deadlock
- APIs time out
Design for failure, not success
5. Keep Learning
Technology changes constantly:
- New frameworks
- New paradigms
- New best practices
Learning never stops
The Journey Never Ends
Backend engineering is a marathon, not a sprint. You'll never "finish" learning, and that's the exciting part!
Remember:
- Month 1: You'll feel overwhelmed → Normal
- Month 3: You'll start connecting dots → Progress
- Month 6: You'll feel competent → Achievement
- Year 1: You'll realize how much you don't know → Wisdom
- Year 3: You'll mentor others → Mastery
- Year 5+: You'll specialize → Expertise
Every senior engineer was once a confused junior who didn't give up.
Your Next Step
Pick one topic from this guide that you don't fully understand. Spend the next week:
- Reading about it
- Building something with it
- Breaking it
- Fixing it
- Teaching someone else
Then move to the next topic.
Small consistent progress > Big sporadic effort
Recommended Resources
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann
- "System Design Interview" by Alex Xu
- "Clean Code" by Robert C. Martin
- "Building Microservices" by Sam Newman
Websites
- System Design Primer (GitHub)
- Web Dev Simplified (YouTube)
- Backend.fyi - System design articles
- High Scalability blog
Practice
- LeetCode - Algorithms
- System Design Primer - Architecture
- HackerRank - Coding challenges
- Exercism - Code practice with mentoring
Remember: Backend engineering is a journey, not a destination. Keep learning, keep building, and always optimize for simplicity first.