Everything a Backend Engineer Should Know

November 4, 2025 · 74 min read
"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." - Martin Fowle

1. API Design & REST Principles {#api-design}

Understanding REST

REST (Representational State Transfer) is an architectural style for designing networked applications. It's not a protocol or standard, but a set of constraints that, when applied correctly, makes your API predictable, scalable, and easy to understand.

Why REST Matters:

  • Standardization: Everyone follows similar patterns, making APIs intuitive
  • Scalability: Stateless design allows easy horizontal scaling
  • Caching: Built-in HTTP caching mechanisms improve performance
  • Flexibility: Language and platform agnostic

Core Principles Explained:

  1. Client-Server Separation: The client (frontend) and server (backend) are independent. They can evolve separately without affecting each other.

  2. Stateless: Each request contains all information needed to process it. The server doesn't store client state between requests. This makes servers simpler and more scalable.

  3. Cacheable: Responses must define themselves as cacheable or non-cacheable. This reduces client-server interactions and improves performance.

  4. Uniform Interface: Resources are identified by URLs, and operations are performed using standard HTTP methods. This creates consistency across all APIs.

  5. Layered System: Client can't tell if it's connected directly to the server or through intermediaries (load balancers, caches, proxies).

RESTful API Design

HTTP Methods: Idempotency Explained

Idempotency means calling the operation multiple times produces the same result as calling it once. This is crucial for retry logic and fault tolerance.

  • GET: Idempotent & Safe (no side effects)
  • PUT: Idempotent (updating same resource multiple times = same result)
  • DELETE: Idempotent (deleting same resource multiple times = same result)
  • POST: NOT Idempotent (creates new resource each time)
  • PATCH: May or may not be idempotent (depends on implementation)
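
To make the distinction concrete, here is a minimal sketch (the in-memory store and route names are hypothetical): retrying the PUT leaves the same final state, while retrying the POST creates a duplicate.

from fastapi import FastAPI

app = FastAPI()
users: dict[int, dict] = {}  # hypothetical in-memory store
next_id = 1

@app.put("/users/{user_id}")
async def put_user(user_id: int, body: dict):
    # Idempotent: the Nth retry produces the same state as the 1st call
    users[user_id] = body
    return users[user_id]

@app.post("/users")
async def post_user(body: dict):
    # NOT idempotent: every retry allocates a new ID and a new resource
    global next_id
    user_id, next_id = next_id, next_id + 1
    users[user_id] = body
    return {"id": user_id, **body}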

HTTP Status Codes You Must Know

Code | Meaning               | When to Use
---- | --------------------- | -------------------------------
2xx Success
200  | OK                    | Successful GET, PUT, PATCH
201  | Created               | Successful POST
204  | No Content            | Successful DELETE
3xx Redirection
301  | Moved Permanently     | Resource moved
304  | Not Modified          | Cached response valid
4xx Client Errors
400  | Bad Request           | Invalid input
401  | Unauthorized          | Authentication required
403  | Forbidden             | Authenticated but no permission
404  | Not Found             | Resource doesn't exist
409  | Conflict              | Resource conflict (duplicate)
422  | Unprocessable Entity  | Validation failed
429  | Too Many Requests     | Rate limit exceeded
5xx Server Errors
500  | Internal Server Error | Generic server error
502  | Bad Gateway           | Upstream server error
503  | Service Unavailable   | Server overloaded
504  | Gateway Timeout       | Upstream timeout
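
In FastAPI these map naturally onto route declarations; a small sketch (the handlers are hypothetical stand-ins):

from fastapi import FastAPI, HTTPException, Response

app = FastAPI()

@app.post("/api/v1/users", status_code=201)  # Created
async def create_user(body: dict):
    return {"id": 1, **body}

@app.delete("/api/v1/users/{user_id}", status_code=204)  # No Content
async def delete_user(user_id: int):
    return Response(status_code=204)

@app.get("/api/v1/users/{user_id}")
async def get_user(user_id: int):
    user = None  # stand-in for a database lookup
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
    return user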

API Design Best Practices

# βœ… GOOD: RESTful resource-based URLs
GET    /api/v1/users              # List users
GET    /api/v1/users/{id}         # Get user
POST   /api/v1/users              # Create user
PUT    /api/v1/users/{id}         # Update user
PATCH  /api/v1/users/{id}         # Partial update
DELETE /api/v1/users/{id}         # Delete user

# Nested resources
GET    /api/v1/users/{id}/posts   # User's posts
GET    /api/v1/posts?user_id={id} # Alternative

# ❌ BAD: Verb-based URLs
GET    /api/v1/getUser?id=123
POST   /api/v1/createUser
POST   /api/v1/deleteUser?id=123

Request/Response Structure

from pydantic import BaseModel, Field, validator
from typing import Optional, List
from datetime import datetime

# Request Schema
class CreateUserRequest(BaseModel):
    email: str = Field(..., regex=r'^[\w\.-]+@[\w\.-]+\.\w+$')
    name: str = Field(..., min_length=2, max_length=100)
    age: Optional[int] = Field(None, ge=0, le=150)
    
    @validator('email')
    def email_must_be_lowercase(cls, v):
        return v.lower()

# Response Schema
class UserResponse(BaseModel):
    id: int
    email: str
    name: str
    age: Optional[int]
    created_at: datetime
    updated_at: datetime
    
    class Config:
        orm_mode = True

# Error Response
class ErrorResponse(BaseModel):
    error: str
    message: str
    details: Optional[dict] = None
    timestamp: datetime = Field(default_factory=datetime.utcnow)
    request_id: Optional[str] = None

# Paginated Response
class PaginatedResponse(BaseModel):
    items: List[UserResponse]
    total: int
    page: int
    per_page: int
    total_pages: int
    
    @property
    def has_next(self) -> bool:
        return self.page < self.total_pages
    
    @property
    def has_prev(self) -> bool:
        return self.page > 1

API Versioning Strategies

# 1. URL Path Versioning (Recommended)
@app.get("/api/v1/users")
@app.get("/api/v2/users")

# 2. Header Versioning
@app.get("/api/users")
async def get_users(api_version: str = Header(default="v1")):
    if api_version == "v2":
        return new_format()
    return old_format()

# 3. Query Parameter (Not Recommended)
@app.get("/api/users")
async def get_users(version: str = "v1"):
    pass

# 4. Content Negotiation
# Accept: application/vnd.myapi.v2+json

HATEOAS (Hypermedia as the Engine of Application State)

{
    "id": 123,
    "name": "John Doe",
    "email": "john@example.com",
    "_links": {
        "self": {
            "href": "/api/v1/users/123"
        },
        "posts": {
            "href": "/api/v1/users/123/posts"
        },
        "update": {
            "href": "/api/v1/users/123",
            "method": "PUT"
        },
        "delete": {
            "href": "/api/v1/users/123",
            "method": "DELETE"
        }
    }
}

2. Database Fundamentals {#database-fundamentals}

Understanding Databases

Databases are the backbone of most applications, storing and managing data efficiently. As a backend engineer, understanding how databases work internally is crucial for building performant systems.

Key Concepts You Must Understand:

1. ACID vs BASE:

  • ACID (SQL databases): Strong consistency, reliability, data integrity
  • BASE (NoSQL databases): Availability, eventual consistency, scalability

2. Why Database Choice Matters:

  • Wrong choice = Performance bottlenecks, scaling issues, increased costs
  • Right choice = Smooth operations, happy users, easier maintenance

3. Read vs Write Patterns:

  • Read-heavy (social media feeds): Consider read replicas, caching
  • Write-heavy (logging, analytics): Consider write-optimized databases
  • Balanced (e-commerce): Need careful indexing and optimization

ACID Properties

ACID ensures database transactions are processed reliably. Understanding ACID is fundamental to designing robust systems.

Atomicity - "All or Nothing": Think of it like a bank transfer. Either both the debit and credit happen, or neither happens. There's no in-between state where money disappears or duplicates.

Consistency - "Valid State Always": The database must always be in a valid state. All constraints, triggers, and rules are enforced. If a transaction violates a rule (like negative balance), it's rejected entirely.

Isolation - "Concurrent but Safe": Multiple transactions can run simultaneously without interfering with each other. Each transaction feels like it's the only one running, even though hundreds might be executing concurrently.

Durability - "Permanent Once Committed": Once a transaction is committed, it's permanentβ€”even if the server crashes immediately after. Data is written to non-volatile storage before the commit completes.

Real-World Example:

Transfer $100 from Account A to Account B:

1. BEGIN TRANSACTION
2. Check if A has >= $100 (Consistency)
3. Deduct $100 from A
4. Add $100 to B
5. If any step fails, ROLLBACK everything (Atomicity)
6. Other transactions can't see partial state (Isolation)
7. COMMIT - now it's permanent (Durability)

Database Normalization

Normalization is the process of organizing data to reduce redundancy and improve data integrity. It's about structuring your database logically.

Why Normalize?

  • Eliminate Redundancy: Don't store the same data in multiple places
  • Prevent Anomalies: Avoid update, insert, and delete issues
  • Improve Integrity: Ensure data consistency
  • Easier Maintenance: Changes in one place reflect everywhere

When NOT to Normalize:

  • Data warehouses: Denormalized for query performance
  • Read-heavy systems: Joins are expensive at scale
  • Caching layers: Denormalized for fast access

The Normal Forms Explained:

1NF (First Normal Form) - Atomic Values: Each cell contains a single value, not lists or arrays.

❌ Bad: products = "Laptop, Mouse, Keyboard"
βœ… Good: Separate rows for each product

2NF (Second Normal Form) - No Partial Dependencies: Every non-key column depends on the entire primary key, not just part of it.

❌ Bad: (OrderID, ProductID) -> CustomerName
     (CustomerName depends only on OrderID, not ProductID)
βœ… Good: Separate Orders and OrderItems tables

3NF (Third Normal Form) - No Transitive Dependencies: Non-key columns don't depend on other non-key columns.

❌ Bad: User table with City and Country
     (Country depends on City, not UserID)
βœ… Good: Separate Cities table with Country

Example:

-- ❌ Unnormalized (Repeating Groups)
CREATE TABLE orders (
    order_id INT,
    customer_name VARCHAR(100),
    products VARCHAR(500), -- "Product1, Product2, Product3"
    prices VARCHAR(100)    -- "10.00, 20.00, 15.00"
);

-- βœ… Normalized (3NF)
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100) UNIQUE
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT REFERENCES customers(customer_id),
    order_date TIMESTAMP,
    total_amount DECIMAL(10,2)
);

CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity INT,
    unit_price DECIMAL(10,2)
);

CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name VARCHAR(200),
    price DECIMAL(10,2)
);

Indexing Strategies

Indexes are like a book's table of contentsβ€”they help you find data quickly without scanning every page. However, indexes come with trade-offs.

How Indexes Work: Instead of scanning every row (Sequential Scan), the database uses a sorted data structure (usually B-Tree) to jump directly to the data you need.

Without Index:
Scan 1 million rows β†’ Find your record β†’ 500ms

With Index:
Jump to index β†’ Find pointer β†’ Get data β†’ 5ms

The Index Trade-off:

  • Faster Reads: Queries become much faster (10-100x improvement)
  • Slower Writes: Every INSERT/UPDATE/DELETE must update indexes
  • Storage Cost: Indexes consume disk space (can be 20-50% of table size)

When to Add Indexes:

  • Columns in WHERE clauses
  • Columns used in JOINs
  • Columns in ORDER BY
  • Foreign keys

When NOT to Add Indexes:

  • Small tables (<1000 rows)
  • Tables with frequent writes and rare reads
  • Columns with low cardinality (e.g., boolean fields)
  • Columns rarely queried

Index Types Explained:

B-Tree Index (Default):

  • Best for: Equality (=) and range queries (<, >, BETWEEN)
  • Most common, works for 95% of cases
  • Keeps data sorted

Hash Index:

  • Best for: Exact matches only (=)
  • Very fast but can't do range queries
  • Rarely used in practice

GIN/GiST (Generalized Indexes):

  • Best for: Full-text search, JSON data, arrays
  • Specialized indexes for complex data types

Partial Index:

  • Best for: Querying subset of data frequently
  • Example: Index only active users, not deleted ones
  • Saves space and improves performance

-- B-Tree Index (Default, most common)
CREATE INDEX idx_users_email ON users(email);

-- Unique Index
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

-- Composite Index (Column order matters!)
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);

-- Partial Index (PostgreSQL)
CREATE INDEX idx_active_users ON users(email) WHERE active = true;

-- Covering Index (Includes extra columns)
CREATE INDEX idx_orders_covering ON orders(user_id) 
    INCLUDE (total_amount, created_at);

-- Full-Text Search (PostgreSQL)
CREATE INDEX idx_posts_content_fts 
    ON posts USING GIN(to_tsvector('english', content));

-- JSON Index (PostgreSQL)
CREATE INDEX idx_user_metadata ON users USING GIN(metadata);

Query Optimization

# ❌ BAD: N+1 Query Problem
users = db.query("SELECT * FROM users")
for user in users:
    posts = db.query("SELECT * FROM posts WHERE user_id = ?", user.id)
    # 1 + N queries!

# βœ… GOOD: Single Query with JOIN
results = db.query("""
    SELECT u.*, p.id as post_id, p.title, p.content
    FROM users u
    LEFT JOIN posts p ON u.id = p.user_id
""")

# βœ… GOOD: Batch Loading
user_ids = [u.id for u in users]
posts = db.query(
    "SELECT * FROM posts WHERE user_id IN (?)",
    user_ids
)
posts_by_user = group_by(posts, 'user_id')

Database Transactions

from contextlib import asynccontextmanager

@asynccontextmanager
async def transaction():
    """Transaction context manager"""
    conn = await db_pool.acquire()
    try:
        async with conn.transaction():
            yield conn
        # Automatic COMMIT
    except Exception:
        # Automatic ROLLBACK
        raise
    finally:
        await db_pool.release(conn)

# Usage
async def transfer_money(from_user: int, to_user: int, amount: float):
    """Transfer money between users (atomic operation)"""
    async with transaction() as conn:
        # Deduct from sender (in asyncpg, execute() returns a status string)
        result = await conn.execute("""
            UPDATE accounts 
            SET balance = balance - $1 
            WHERE user_id = $2 AND balance >= $1
        """, amount, from_user)
        
        # Check if deduction was successful
        if result == "UPDATE 0":
            raise ValueError("Insufficient funds")
        
        # Add to receiver
        await conn.execute("""
            UPDATE accounts 
            SET balance = balance + $1 
            WHERE user_id = $2
        """, amount, to_user)
        
        # Both succeed or both fail (ACID)

Database Sharding
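
Sharding splits one large dataset horizontally across multiple database servers, each holding a subset of rows chosen by a shard key (commonly a user or tenant ID). All reads and writes for a given key route to the same shard, which spreads load but makes cross-shard joins and resharding expensive. A minimal sketch of hash-based shard routing (the connection strings are hypothetical):

import hashlib

# One entry per physical database (hypothetical DSNs)
SHARDS = [
    "postgres://db-shard-0",
    "postgres://db-shard-1",
    "postgres://db-shard-2",
    "postgres://db-shard-3",
]

def shard_for(user_id: int) -> str:
    """Route a shard key to a shard via a stable hash"""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All queries for user 123 consistently hit the same shard
dsn = shard_for(123)

Note that simple modulo hashing remaps most keys when a shard is added; production systems usually prefer consistent hashing or a directory service to avoid that.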

SQL vs NoSQL

Feature      | SQL (PostgreSQL)            | NoSQL (MongoDB)
------------ | --------------------------- | --------------------------
Schema       | Fixed, strict               | Flexible
Scaling      | Vertical (harder)           | Horizontal (easier)
Transactions | ACID                        | Eventually consistent
Joins        | Powerful                    | Limited
Use Case     | Complex queries, relations  | Large scale, flexible data
When to Use  | Financial, traditional apps | Real-time, big data, IoT

3. Authentication & Authorization {#authentication-authorization}

Understanding Auth: The Foundation of Security

Authentication answers: "Who are you?" Authorization answers: "What are you allowed to do?"

These are often confused but are fundamentally different concepts. Getting auth right is crucialβ€”mistakes lead to security breaches, data leaks, and compliance issues.

Common Mistakes Developers Make:

  1. Storing passwords in plain text - NEVER do this!
  2. Rolling your own crypto - Use established libraries
  3. Confusing authentication with authorization - They're different!
  4. Trusting client-side validation - Always validate server-side
  5. Not implementing rate limiting - Opens door to brute force attacks

Authentication Methods Compared

Method          | Use Case                 | Pros                                   | Cons
--------------- | ------------------------ | -------------------------------------- | --------------------------------
Session Cookies | Traditional web apps     | Secure, server controls sessions       | Not stateless, scaling issues
JWT             | Modern APIs, mobile apps | Stateless, scalable                    | Can't revoke easily, token size
OAuth 2.0       | Third-party login        | User convenience, no password storage  | Complex implementation
API Keys        | Service-to-service       | Simple                                 | Less secure, hard to rotate

JWT (JSON Web Token) Deep Dive

What is JWT? JWT is a compact, URL-safe token that contains claims (information) about a user. It's digitally signed, so you can verify it wasn't tampered with.

JWT Structure:

header.payload.signature

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.
eyJzdWIiOiIxMjM0IiwiZXhwIjoxNjE2MjM5MDIyfQ.
SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c

How JWT Works:

  1. User logs in with credentials
  2. Server verifies credentials
  3. Server creates JWT with user info (payload)
  4. Server signs JWT with secret key
  5. JWT sent to client
  6. Client includes JWT in subsequent requests
  7. Server verifies signature and extracts user info

JWT vs Sessions:

JWT Advantages:

  • Stateless: No server-side session storage needed
  • Scalable: Works across multiple servers
  • Mobile-friendly: Easy to use in mobile apps
  • Microservices: Each service can verify tokens independently

JWT Disadvantages:

  • Can't revoke: Once issued, valid until expiration
  • Size: Larger than session IDs (sent with every request)
  • Secret management: Must keep signing key secure across all servers

When to Use JWT:

  • Building APIs consumed by mobile apps
  • Microservices architecture
  • Need to scale horizontally
  • Want stateless authentication

When to Use Sessions:

  • Traditional server-rendered web apps
  • Need to revoke access immediately
  • Single server or few servers
  • Smaller request payloads preferred

JWT (JSON Web Token)

from datetime import datetime, timedelta
import jwt
from passlib.context import CryptContext

pwd_context = CryptContext(schemes=["bcrypt"], deprecated="auto")
SECRET_KEY = "your-secret-key-keep-it-secret"  # in production, load from an env var or secret manager
ALGORITHM = "HS256"

class AuthService:
    @staticmethod
    def hash_password(password: str) -> str:
        """Hash password using bcrypt"""
        return pwd_context.hash(password)
    
    @staticmethod
    def verify_password(plain_password: str, hashed_password: str) -> bool:
        """Verify password against hash"""
        return pwd_context.verify(plain_password, hashed_password)
    
    @staticmethod
    def create_access_token(
        user_id: int,
        expires_delta: timedelta = timedelta(hours=1)
    ) -> str:
        """Create JWT access token"""
        expire = datetime.utcnow() + expires_delta
        
        payload = {
            "sub": str(user_id),  # Subject (user ID)
            "exp": expire,         # Expiration
            "iat": datetime.utcnow(),  # Issued at
            "type": "access"
        }
        
        return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)
    
    @staticmethod
    def create_refresh_token(user_id: int) -> str:
        """Create long-lived refresh token"""
        expire = datetime.utcnow() + timedelta(days=30)
        
        payload = {
            "sub": str(user_id),
            "exp": expire,
            "iat": datetime.utcnow(),
            "type": "refresh"
        }
        
        return jwt.encode(payload, SECRET_KEY, algorithm=ALGORITHM)
    
    @staticmethod
    def decode_token(token: str) -> dict:
        """Decode and verify JWT token"""
        try:
            payload = jwt.decode(
                token,
                SECRET_KEY,
                algorithms=[ALGORITHM]
            )
            return payload
        except jwt.ExpiredSignatureError:
            raise HTTPException(
                status_code=401,
                detail="Token has expired"
            )
        except jwt.InvalidTokenError:
            raise HTTPException(
                status_code=401,
                detail="Invalid token"
            )

# Usage
@app.post("/auth/login")
async def login(credentials: LoginRequest):
    # Verify credentials
    user = await db.get_user_by_email(credentials.email)
    
    if not user or not AuthService.verify_password(
        credentials.password,
        user.hashed_password
    ):
        raise HTTPException(status_code=401, detail="Invalid credentials")
    
    # Generate tokens
    access_token = AuthService.create_access_token(user.id)
    refresh_token = AuthService.create_refresh_token(user.id)
    
    return {
        "access_token": access_token,
        "refresh_token": refresh_token,
        "token_type": "bearer"
    }

# Protected endpoint
from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer

security = HTTPBearer()

async def get_current_user(
    credentials: HTTPAuthorizationCredentials = Depends(security)
):
    """Dependency to get current authenticated user"""
    token = credentials.credentials
    payload = AuthService.decode_token(token)
    
    user_id = int(payload["sub"])
    user = await db.get_user_by_id(user_id)
    
    if not user:
        raise HTTPException(status_code=401, detail="User not found")
    
    return user

@app.get("/api/me")
async def get_me(current_user: User = Depends(get_current_user)):
    """Get current user info"""
    return current_user

Role-Based Access Control (RBAC)

from enum import Enum
from functools import wraps

class Role(str, Enum):
    ADMIN = "admin"
    MODERATOR = "moderator"
    USER = "user"

class Permission(str, Enum):
    CREATE_USER = "create:user"
    READ_USER = "read:user"
    UPDATE_USER = "update:user"
    DELETE_USER = "delete:user"
    MANAGE_ROLES = "manage:roles"

# Role-Permission mapping
ROLE_PERMISSIONS = {
    Role.ADMIN: [
        Permission.CREATE_USER,
        Permission.READ_USER,
        Permission.UPDATE_USER,
        Permission.DELETE_USER,
        Permission.MANAGE_ROLES
    ],
    Role.MODERATOR: [
        Permission.READ_USER,
        Permission.UPDATE_USER
    ],
    Role.USER: [
        Permission.READ_USER
    ]
}

def require_permission(permission: Permission):
    """Decorator to check user permission"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, current_user: User, **kwargs):
            user_permissions = ROLE_PERMISSIONS.get(current_user.role, [])
            
            if permission not in user_permissions:
                raise HTTPException(
                    status_code=403,
                    detail=f"Permission denied: {permission}"
                )
            
            return await func(*args, current_user=current_user, **kwargs)
        return wrapper
    return decorator

# Usage
@app.delete("/api/users/{user_id}")
@require_permission(Permission.DELETE_USER)
async def delete_user(
    user_id: int,
    current_user: User = Depends(get_current_user)
):
    """Delete user (admin only)"""
    await db.delete_user(user_id)
    return {"status": "deleted"}

OAuth 2.0 Flow
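
The most common flow (Authorization Code) works like this:

  1. User clicks "Login with Provider" and is redirected to the provider's authorization endpoint
  2. User authenticates with the provider and grants consent
  3. Provider redirects back to your app with a short-lived authorization code
  4. Your server exchanges the code for tokens (server-to-server, using the client secret)
  5. Your server uses the access token to call the provider's APIs on the user's behalf

A minimal sketch of steps 1 and 4 in FastAPI; the provider URLs and credentials below are hypothetical placeholders:

import httpx
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

# Hypothetical provider endpoints and client credentials
AUTHORIZE_URL = "https://provider.example.com/oauth/authorize"
TOKEN_URL = "https://provider.example.com/oauth/token"
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"
REDIRECT_URI = "https://myapp.example.com/auth/callback"

@app.get("/auth/login")
async def oauth_login():
    # Step 1: send the user to the provider's consent screen
    return RedirectResponse(
        f"{AUTHORIZE_URL}?response_type=code"
        f"&client_id={CLIENT_ID}&redirect_uri={REDIRECT_URI}&scope=profile"
    )

@app.get("/auth/callback")
async def oauth_callback(code: str):
    # Step 4: exchange the authorization code for tokens
    async with httpx.AsyncClient() as client:
        resp = await client.post(TOKEN_URL, data={
            "grant_type": "authorization_code",
            "code": code,
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "redirect_uri": REDIRECT_URI,
        })
    tokens = resp.json()  # access_token, refresh_token, expires_in
    return {"access_token": tokens.get("access_token")}

A production flow should also carry a random state parameter through the redirect to protect against CSRF.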


4. Caching Strategies {#caching-strategies}

Understanding Caching: The Performance Multiplier

Caching is storing frequently accessed data in a fast-access location to avoid expensive operations (database queries, API calls, computations). It's one of the most effective ways to improve performance.

The Caching Golden Rule:

"There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

Why Caching Matters:

Without Caching:

User Request β†’ API Server β†’ Database Query (100ms) β†’ Response
100 requests = 10 seconds of DB time

With Caching:

User Request β†’ API Server β†’ Cache Hit (2ms) β†’ Response
100 requests = 0.2 seconds
50x improvement!

Real-World Impact:

  • Amazon: 100ms delay = 1% revenue loss
  • Google: 500ms delay = 20% traffic drop
  • Walmart: 1 second delay = 2% conversion loss

Cache Levels Explained

Modern applications use multiple cache layers, each serving different purposes:

1. CDN Cache (Edge)

  • Speed: ~10-50ms
  • Location: Geographically distributed
  • Purpose: Static assets (images, CSS, JS)
  • Example: CloudFront, Cloudflare

2. Application Cache (In-Memory)

  • Speed: ~1-5ms
  • Location: Same server as application
  • Purpose: Hot data, session data
  • Example: In-process dictionary, LRU cache (see the sketch after this list)

3. Distributed Cache (Redis/Memcached)

  • Speed: ~5-20ms
  • Location: Separate server(s)
  • Purpose: Shared data across app servers
  • Example: Redis, Memcached

4. Database Cache

  • Speed: ~50-100ms
  • Location: Database server
  • Purpose: Query results, table data
  • Example: PostgreSQL query cache
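
For the application-cache layer, Python's standard library already provides an in-process LRU cache; a minimal sketch (the exchange-rate lookup is a hypothetical stand-in for an expensive call):

from functools import lru_cache

@lru_cache(maxsize=1024)  # in-process LRU: evicts least-recently-used entries
def get_exchange_rate(currency: str) -> float:
    print(f"fetching rate for {currency}...")  # stand-in for a DB/API call
    return 1.08 if currency == "EUR" else 1.0

get_exchange_rate("EUR")  # miss: executes the function body
get_exchange_rate("EUR")  # hit: served from memory, body never runs

The caveat: this cache is per-process, so each app server holds its own copy. For shared state across servers you move up to the distributed layer (Redis/Memcached).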

Cache Patterns: When to Use What

1. Cache-Aside (Lazy Loading)

  • Best for: Read-heavy workloads
  • How it works: Check cache β†’ Miss β†’ Query DB β†’ Store in cache
  • Pros: Only cache what's actually used
  • Cons: Cache miss penalty, potential cache stampede

When to use:

  • User profiles
  • Product catalogs
  • Blog posts
  • Any read-heavy data

2. Write-Through

  • Best for: Read-heavy with occasional writes
  • How it works: Write to DB β†’ Immediately write to cache
  • Pros: Cache always consistent with DB
  • Cons: Write latency (two operations)

When to use:

  • Data that's read often after being written
  • Consistency is critical
  • Write performance acceptable

3. Write-Behind (Write-Back)

  • Best for: Write-heavy workloads
  • How it works: Write to cache β†’ Async write to DB later
  • Pros: Very fast writes
  • Cons: Data loss risk if cache crashes

When to use:

  • Logging systems
  • Analytics counters
  • Session data
  • Gaming leaderboards

4. Read-Through

  • Best for: Encapsulating cache logic
  • How it works: Cache library handles DB queries automatically
  • Pros: Application doesn't handle cache misses
  • Cons: Tighter coupling between cache and data source

When to use:

  • Simplifying application code
  • Standard data access patterns
  • When cache library provides this feature

Cache Invalidation: The Hard Problem

Cache invalidation is deciding when cached data is no longer valid and needs refreshing. Getting this wrong leads to stale data or cache thrashing.

Invalidation Strategies:

1. Time-based (TTL - Time To Live)

Best for: Data that changes predictably
Example: "Weather forecast valid for 1 hour"
Pros: Simple, predictable
Cons: Might serve stale data, or expire too early

2. Event-based

Best for: Data you control
Example: "When user updates profile, invalidate cache"
Pros: Always fresh, efficient
Cons: Complex to implement, must track all events

3. Hybrid (TTL + Events)

Best for: Most production systems
Example: "Cache for 5 minutes, but invalidate on updates"
Pros: Best of both worlds
Cons: More complexity

Cache Stampede (Thundering Herd): When cache expires, multiple requests hit database simultaneously, causing overload.

11:00:00 - Cache expires
11:00:01 - 1000 requests arrive
11:00:01 - All 1000 hit database (STAMPEDE!)
11:00:05 - Database crashes

Solution: Cache Locking. The first request acquires a lock and fetches the data; the others wait for it to complete (see the AntiStampede implementation below).

Caching Patterns

# 1. Cache-Aside (Lazy Loading)
async def get_user_cache_aside(user_id: int):
    # Try cache first
    cached = await redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Cache miss - query database
    user = await db.get_user(user_id)
    
    # Store in cache
    await redis.setex(
        f"user:{user_id}",
        3600,  # TTL: 1 hour
        json.dumps(user)
    )
    
    return user


# 2. Write-Through
async def update_user_write_through(user_id: int, data: dict):
    # Update database
    user = await db.update_user(user_id, data)
    
    # Update cache immediately
    await redis.setex(
        f"user:{user_id}",
        3600,
        json.dumps(user)
    )
    
    return user


# 3. Write-Behind (Write-Back)
from asyncio import Queue

write_queue = Queue()

async def update_user_write_behind(user_id: int, data: dict):
    # Update cache immediately
    await redis.setex(f"user:{user_id}", 3600, json.dumps(data))
    
    # Queue database write
    await write_queue.put(("update_user", user_id, data))
    
    return data

async def process_write_queue():
    """Background worker to process queued writes"""
    while True:
        operation, user_id, data = await write_queue.get()
        
        try:
            if operation == "update_user":
                await db.update_user(user_id, data)
        except Exception as e:
            logger.error(f"Write-behind error: {e}")
        
        write_queue.task_done()


# 4. Read-Through
class CachingRepository:
    async def get_user(self, user_id: int):
        # Cache handles everything
        return await cache.get_or_fetch(
            f"user:{user_id}",
            lambda: db.get_user(user_id),
            ttl=3600
        )

Cache Invalidation Strategies

# 1. Time-based (TTL)
await redis.setex("key", 300, value)  # Expires in 5 minutes

# 2. Event-based
async def update_user(user_id: int, data: dict):
    user = await db.update_user(user_id, data)
    
    # Invalidate related caches
    await redis.delete(f"user:{user_id}")
    await redis.delete(f"user:{user_id}:posts")
    await redis.delete(f"user:{user_id}:profile")

# 3. Pattern-based
async def delete_user_caches(user_id: int):
    """Delete all caches related to user"""
    pattern = f"user:{user_id}:*"
    
    cursor = 0
    while True:
        cursor, keys = await redis.scan(cursor, match=pattern)
        if keys:
            await redis.delete(*keys)
        if cursor == 0:
            break

# 4. Tag-based
async def set_with_tags(key: str, value: str, tags: list[str]):
    """Store value with tags for group invalidation"""
    await redis.set(key, value)
    
    for tag in tags:
        await redis.sadd(f"tag:{tag}", key)

async def invalidate_by_tag(tag: str):
    """Invalidate all keys with given tag"""
    keys = await redis.smembers(f"tag:{tag}")
    if keys:
        await redis.delete(*keys)
    await redis.delete(f"tag:{tag}")

Cache Stampede Prevention

import asyncio
import json
from typing import Callable, Optional

class AntiStampede:
    """Prevent cache stampede with locking"""
    
    def __init__(self, redis):
        self.redis = redis
        self.local_locks = {}
    
    async def get_or_compute(
        self,
        key: str,
        compute_func: Callable,
        ttl: int = 300
    ):
        # Try cache
        cached = await self.redis.get(key)
        if cached:
            return json.loads(cached)
        
        # Acquire lock
        lock_key = f"lock:{key}"
        lock_acquired = await self.redis.set(
            lock_key,
            "1",
            ex=30,  # Lock expires in 30s
            nx=True  # Only set if not exists
        )
        
        if lock_acquired:
            try:
                # We have the lock - compute value
                value = await compute_func()
                
                # Store in cache
                await self.redis.setex(key, ttl, json.dumps(value))
                
                return value
            finally:
                # Release lock
                await self.redis.delete(lock_key)
        else:
            # Someone else is computing - wait and retry
            await asyncio.sleep(0.1)
            return await self.get_or_compute(key, compute_func, ttl)

# Usage
anti_stampede = AntiStampede(redis)

@app.get("/expensive-data")
async def get_expensive_data():
    async def compute():
        # Expensive operation
        await asyncio.sleep(5)
        return {"data": "result"}
    
    return await anti_stampede.get_or_compute(
        "expensive_data",
        compute,
        ttl=3600
    )

5. Message Queues & Async Processing {#message-queues}

Understanding Asynchronous Processing

Synchronous operations blockβ€”you wait for them to complete before moving on. Asynchronous operations don't blockβ€”you start them and continue with other work.

Why Async Processing Matters:

Imagine ordering food at a restaurant:

Synchronous (Bad):

1. Take order from customer 1
2. Cook food for customer 1
3. Serve customer 1
4. Now take order from customer 2
β†’ Customer 2 waits 20 minutes just to order!

Asynchronous (Good):

1. Take order from customer 1 β†’ Give ticket #1
2. Take order from customer 2 β†’ Give ticket #2
3. Kitchen cooks both simultaneously
4. Serve when ready
β†’ Both customers happy!

Message Queues Explained

A message queue is like a todo list that multiple workers can process. It decouples producers (who create tasks) from consumers (who execute tasks).

Key Concepts:

  • Producer: Creates messages and adds them to the queue
  • Consumer/Worker: Pulls messages from the queue and processes them
  • Message: Unit of work (e.g., "send email", "process order")
  • Queue: Storage for messages waiting to be processed
  • Broker: System managing queues (RabbitMQ, Redis, Kafka)

Benefits of Message Queues:

1. Decoupling:

Without Queue:
API β†’ Email Service (if it's down, API fails)

With Queue:
API β†’ Queue β†’ Email Service (API succeeds immediately)

2. Load Leveling:

Traffic spike: 1000 requests/second
API: Add all to queue immediately (fast)
Workers: Process 100/second steadily

3. Reliability:

  • Messages persist even if consumer crashes
  • Automatic retry on failure
  • Dead letter queue for problematic messages

4. Scalability:

  • Add more workers during high load
  • Remove workers during low load
  • Workers can be on different servers

When to Use Message Queues:

Use for:

  • Sending emails/notifications
  • Processing images/videos
  • Generating reports
  • Data synchronization
  • Background jobs
  • Any slow operation (>1 second)

Don't use for:

  • Real-time user interactions
  • Operations needing immediate response
  • Simple, fast operations (<100ms)

Queue Patterns

1. Work Queue (Task Queue)

  • One message β†’ One worker
  • Used for: Background jobs
  • Example: Celery, RQ

2. Publish/Subscribe

  • One message β†’ Multiple subscribers
  • Used for: Event broadcasting
  • Example: User registered β†’ Send email + Update analytics + Create profile

3. Request/Reply

  • Send request β†’ Wait for response
  • Used for: RPC-style communication
  • Example: Microservice communication

4. Priority Queue

  • Process high-priority messages first
  • Used for: VIP users, critical operations
  • Example: Premium user orders before regular orders

Task Queue with Celery

from celery import Celery
from kombu import Queue

# Initialize Celery
celery_app = Celery(
    'tasks',
    broker='redis://localhost:6379/0',
    backend='redis://localhost:6379/1'
)

# Configure queues
celery_app.conf.task_queues = (
    Queue('high_priority', routing_key='high'),
    Queue('default', routing_key='default'),
    Queue('low_priority', routing_key='low'),
)

celery_app.conf.task_routes = {
    'tasks.send_email': {'queue': 'low_priority'},
    'tasks.process_payment': {'queue': 'high_priority'},
}

# Define tasks
@celery_app.task(bind=True, max_retries=3)
def send_email(self, to: str, subject: str, body: str):
    """Send email asynchronously"""
    try:
        # Send email logic
        smtp.send(to, subject, body)
        return {"status": "sent", "to": to}
    except Exception as e:
        # Retry with exponential backoff
        raise self.retry(exc=e, countdown=2 ** self.request.retries)

@celery_app.task
def process_order(order_id: int):
    """Process order in background"""
    order = db.get_order(order_id)
    
    # Update inventory
    inventory.reserve(order.items)
    
    # Charge payment
    payment.charge(order.total)
    
    # Send confirmation
    send_email.delay(
        order.user.email,
        "Order Confirmation",
        f"Your order #{order_id} is confirmed"
    )

@celery_app.task
def generate_report(user_id: int, report_type: str):
    """Generate report - long running task"""
    data = fetch_report_data(user_id, report_type)
    
    # Process data (may take minutes)
    report = process_report_data(data)
    
    # Save to S3
    s3_url = upload_to_s3(report)
    
    # Notify user
    user = db.get_user(user_id)
    send_email.delay(
        user.email,
        "Report Ready",
        f"Your report is ready: {s3_url}"
    )

# FastAPI integration
@app.post("/orders")
async def create_order(order: OrderCreate):
    # Create order in database
    order_id = await db.create_order(order)
    
    # Process asynchronously
    process_order.delay(order_id)
    
    return {
        "order_id": order_id,
        "status": "processing"
    }

@app.post("/reports")
async def request_report(report: ReportRequest, user: User = Depends(get_current_user)):
    # Queue report generation
    task = generate_report.delay(user.id, report.type)
    
    return {
        "task_id": task.id,
        "status": "queued",
        "message": "Report generation started"
    }

@app.get("/tasks/{task_id}")
async def get_task_status(task_id: str):
    """Check task status"""
    task = celery_app.AsyncResult(task_id)
    
    return {
        "task_id": task_id,
        "status": task.state,
        "result": task.result if task.ready() else None
    }

Event-Driven Architecture

from typing import Callable, Dict, List
import asyncio

class EventBus:
    """Simple in-memory event bus"""
    
    def __init__(self):
        self.subscribers: Dict[str, List[Callable]] = {}
    
    def subscribe(self, event_type: str, handler: Callable):
        """Subscribe to event"""
        if event_type not in self.subscribers:
            self.subscribers[event_type] = []
        self.subscribers[event_type].append(handler)
    
    async def publish(self, event_type: str, data: dict):
        """Publish event to all subscribers"""
        if event_type not in self.subscribers:
            return
        
        tasks = []
        for handler in self.subscribers[event_type]:
            tasks.append(handler(data))
        
        # Execute all handlers concurrently
        await asyncio.gather(*tasks, return_exceptions=True)

# Initialize event bus
event_bus = EventBus()

# Event handlers
async def send_welcome_email(data: dict):
    """Handler: Send welcome email"""
    user_id = data['user_id']
    email = data['email']
    
    # Celery's .delay() is synchronous - it only enqueues the task
    send_email.delay(
        email,
        "Welcome!",
        f"Welcome to our platform, {data['name']}!"
    )

async def track_user_registration(data: dict):
    """Handler: Track analytics"""
    await analytics.track('user_registered', data)

async def create_user_profile(data: dict):
    """Handler: Create default profile"""
    await db.create_profile(data['user_id'])

# Subscribe handlers
event_bus.subscribe('user_registered', send_welcome_email)
event_bus.subscribe('user_registered', track_user_registration)
event_bus.subscribe('user_registered', create_user_profile)

# Publish events
@app.post("/auth/register")
async def register_user(user: UserCreate):
    # Create user
    new_user = await db.create_user(user)
    
    # Publish event
    await event_bus.publish('user_registered', {
        'user_id': new_user.id,
        'email': new_user.email,
        'name': new_user.name
    })
    
    return new_user

6. Microservices Architecture {#microservices}

The Monolith vs Microservices Debate

Monolith: Single application with all features in one codebase.
Microservices: Multiple small applications, each handling one feature.

The Harsh Truth:

"You need to be this tall to use microservices" - Martin Fowler

Most startups should start with a monolith. Microservices add complexity that only makes sense at scale.

When Monolith Makes Sense

Use Monolith when:

  • Team size: <10 developers
  • Traffic: <1000 requests/second
  • You're building MVP or new product
  • Team lacks microservices experience
  • Deployment simplicity is important

Monolith Advantages:

  • Simple deployment: One application to deploy
  • Easy debugging: Everything in one place
  • No network latency: Direct function calls
  • Easier testing: Test entire app together
  • Lower infrastructure cost: One server, not many

Monolith Done Right:

Modular monolith:
- Organize code into modules/packages
- Clear boundaries between domains
- Can extract to microservices later

When Microservices Make Sense

Use Microservices when:

  • Team size: >20 developers
  • Different parts scale differently
  • Need to deploy features independently
  • Different parts use different technologies
  • Have DevOps expertise

Microservices Advantages:

  • Independent deployment: Deploy one service without affecting others
  • Technology diversity: Use best tool for each job
  • Isolated failures: One service down β‰  entire system down
  • Team autonomy: Teams own their services
  • Selective scaling: Scale only what needs it

Microservices Disadvantages:

  • Complexity: Network calls, distributed tracing, service discovery
  • Data consistency: No ACID transactions across services
  • Testing difficulty: Integration tests are complex
  • Operational overhead: More services = more monitoring, logging
  • Initial investment: Requires infrastructure and tooling

The Migration Path

Don't Rewrite Everything!

Phase 1: Modular Monolith
β”œβ”€β”€ user_service (module)
β”œβ”€β”€ order_service (module)
β”œβ”€β”€ payment_service (module)
└── notification_service (module)

Phase 2: Extract Bottlenecks
β”œβ”€β”€ Monolith (most features)
└── Image Processing Service (extracted)

Phase 3: Gradual Extraction
β”œβ”€β”€ Monolith (core features)
β”œβ”€β”€ Image Processing Service
β”œβ”€β”€ Notification Service (extracted)
└── Payment Service (extracted)

Phase 4: Full Microservices
β”œβ”€β”€ User Service
β”œβ”€β”€ Order Service
β”œβ”€β”€ Payment Service
└── Notification Service

Service Communication: Sync vs Async

Synchronous (REST, gRPC):

Pros:
- Simple to understand
- Immediate response
- Easy debugging

Cons:
- Tight coupling
- Cascading failures
- Lower throughput

Best for:
- Real-time user requests
- Operations needing immediate response
- Simple request-response patterns

Asynchronous (Message Queue, Events):

Pros:
- Loose coupling
- Higher throughput
- Fault tolerance

Cons:
- Complex to debug
- Eventual consistency
- More infrastructure

Best for:
- Background processing
- Event broadcasting
- Long-running operations

The Hybrid Approach (Most Production Systems):

  • Sync for user-facing APIs
  • Async for internal service communication
  • Message queues for background jobs

Microservices Patterns You Must Know

1. API Gateway

  • Single entry point for all clients
  • Handles routing, auth, rate limiting
  • Aggregates data from multiple services
  • Problem it solves: Clients don't manage multiple endpoints

2. Service Discovery

  • Services register themselves (name + location)
  • Clients discover services dynamically
  • Handles service instances coming/going
  • Problem it solves: Hardcoded service URLs don't work at scale

3. Circuit Breaker

  • Prevents cascading failures
  • Fast-fail when service is down
  • Automatic recovery attempts
  • Problem it solves: One slow service doesn't bring down entire system

4. Saga Pattern

  • Manages distributed transactions
  • Each service does local transaction
  • Compensation for rollback
  • Problem it solves: Can't use database transactions across services

Service Communication Patterns

1. Synchronous (REST/gRPC)

import httpx

class OrderService:
    async def create_order(self, user_id: int, items: List[dict]):
        # Call User Service
        async with httpx.AsyncClient() as client:
            user_response = await client.get(
                f"http://user-service/api/users/{user_id}"
            )
            user = user_response.json()
        
        # Call Inventory Service
        async with httpx.AsyncClient() as client:
            inventory_response = await client.post(
                "http://inventory-service/api/reserve",
                json={"items": items}
            )
        
        # Create order
        order = await db.create_order(user_id, items)
        
        return order

2. Asynchronous (Message Queue)

# Order Service
@app.post("/orders")
async def create_order(order: OrderCreate):
    order_id = await db.create_order(order)
    
    # Publish event
    await event_bus.publish('order_created', {
        'order_id': order_id,
        'user_id': order.user_id,
        'items': order.items
    })
    
    return {"order_id": order_id}

# Inventory Service (separate service)
# Inventory Service (separate service)
async def reserve_inventory(data: dict):
    items = data['items']
    await inventory.reserve(items)

event_bus.subscribe('order_created', reserve_inventory)

# Notification Service (separate service)
async def send_order_notification(data: dict):
    user_id = data['user_id']
    order_id = data['order_id']
    
    user = await user_service.get_user(user_id)
    await send_email(user.email, f"Order #{order_id} created")

event_bus.subscribe('order_created', send_order_notification)

Service Discovery

import consul

class ServiceRegistry:
    """Service discovery with Consul"""
    
    def __init__(self):
        self.consul = consul.Consul(host='localhost', port=8500)
    
    async def register_service(
        self,
        service_name: str,
        service_id: str,
        host: str,
        port: int
    ):
        """Register service with Consul"""
        self.consul.agent.service.register(
            name=service_name,
            service_id=service_id,
            address=host,
            port=port,
            check=consul.Check.http(
                f"http://{host}:{port}/health",
                interval="10s"
            )
        )
    
    async def discover_service(self, service_name: str) -> str:
        """Discover service endpoint"""
        index, services = self.consul.health.service(
            service_name,
            passing=True
        )
        
        if not services:
            raise Exception(f"Service {service_name} not found")
        
        # Simple load balancing - random selection
        import random
        service = random.choice(services)
        
        host = service['Service']['Address']
        port = service['Service']['Port']
        
        return f"http://{host}:{port}"

# Usage
registry = ServiceRegistry()

# On service startup
await registry.register_service(
    "order-service",
    "order-service-1",
    "10.0.1.5",
    8000
)

# When calling another service
user_service_url = await registry.discover_service("user-service")
async with httpx.AsyncClient() as client:
    response = await client.get(f"{user_service_url}/api/users/123")

API Gateway Pattern

from fastapi import FastAPI, HTTPException
import httpx

app = FastAPI()

# Service URLs (from service discovery)
SERVICES = {
    "user": "http://user-service:8001",
    "order": "http://order-service:8002",
    "payment": "http://payment-service:8003"
}

class APIGateway:
    """API Gateway for routing and aggregation"""
    
    @staticmethod
    async def forward_request(
        service: str,
        path: str,
        method: str = "GET",
        **kwargs
    ):
        """Forward request to microservice"""
        base_url = SERVICES.get(service)
        if not base_url:
            raise HTTPException(404, f"Service {service} not found")
        
        url = f"{base_url}{path}"
        
        async with httpx.AsyncClient(timeout=10.0) as client:
            if method == "GET":
                response = await client.get(url, **kwargs)
            elif method == "POST":
                response = await client.post(url, **kwargs)
            elif method == "PUT":
                response = await client.put(url, **kwargs)
            elif method == "DELETE":
                response = await client.delete(url, **kwargs)
            else:
                raise HTTPException(405, f"Method {method} not allowed")
            
            return response.json()
    
    @staticmethod
    async def aggregate_data(user_id: int):
        """Aggregate data from multiple services"""
        # Parallel requests
        user_task = APIGateway.forward_request(
            "user", f"/api/users/{user_id}"
        )
        orders_task = APIGateway.forward_request(
            "order", f"/api/orders?user_id={user_id}"
        )
        
        user, orders = await asyncio.gather(
            user_task,
            orders_task,
            return_exceptions=True
        )
        
        return {
            "user": user if not isinstance(user, Exception) else None,
            "orders": orders if not isinstance(orders, Exception) else []
        }

# Gateway endpoints
@app.get("/api/users/{user_id}")
async def get_user(user_id: int):
    """Route to user service"""
    return await APIGateway.forward_request(
        "user",
        f"/api/users/{user_id}"
    )

@app.get("/api/users/{user_id}/dashboard")
async def get_user_dashboard(user_id: int):
    """Aggregate data from multiple services"""
    return await APIGateway.aggregate_data(user_id)

Circuit Breaker for Microservices

from enum import Enum
from datetime import datetime, timedelta

class ServiceUnavailableError(Exception):
    """Raised when a call is rejected because the circuit is open"""
    pass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Circuit breaker for service calls"""
    
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        success_threshold: int = 2
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
    
    async def call(self, service_name: str, func, *args, **kwargs):
        """Execute function with circuit breaker"""
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
            else:
                # Return cached or fallback response
                raise ServiceUnavailableError(
                    f"{service_name} circuit breaker is OPEN"
                )
        
        try:
            result = await func(*args, **kwargs)
            await self._on_success()
            return result
        except Exception as e:
            await self._on_failure()
            raise
    
    def _should_attempt_reset(self) -> bool:
        if not self.last_failure_time:
            return True
        
        elapsed = (datetime.now() - self.last_failure_time).total_seconds()
        return elapsed >= self.recovery_timeout
    
    async def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        
        if self.state == CircuitState.CLOSED:
            self.failure_count = 0
    
    async def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            logger.warning(
                f"Circuit breaker opened after {self.failure_count} failures"
            )

# Usage
user_service_breaker = CircuitBreaker()

async def call_user_service(user_id: int):
    async def make_request():
        async with httpx.AsyncClient() as client:
            response = await client.get(
                f"http://user-service/api/users/{user_id}"
            )
            return response.json()
    
    return await user_service_breaker.call(
        "user-service",
        make_request
    )

7. System Design Patterns {#system-design-patterns}

Understanding System Design Patterns

System design patterns are proven solutions to common architectural problems. They're the "recipes" of software engineeringβ€”you don't invent them, you apply them.

Why Patterns Matter:

  • Speed: Don't reinvent the wheel
  • Communication: Shared vocabulary with team ("Let's use circuit breaker")
  • Reliability: Battle-tested solutions
  • Maintainability: Well-understood patterns are easier to maintain

When to Use Patterns:

  • You recognize the problem they solve
  • You understand their trade-offs
  • Your scale justifies the complexity

When NOT to Use Patterns:

  • Because they're trendy
  • Because Netflix uses them (you're not Netflix)
  • Without understanding the problem first

The Saga Pattern: Distributed Transactions

The Problem: In microservices, you can't use database transactions across services. How do you ensure consistency?

Example Scenario:

Order Process:
1. Reserve payment ($100)
2. Reserve inventory (1 widget)
3. Schedule shipping

What if step 3 fails?
Need to undo steps 1 and 2 (compensate)

Traditional Database Transaction (Monolith):

BEGIN TRANSACTION;
  INSERT INTO orders ...;
  UPDATE inventory SET quantity = quantity - 1 ...;
  UPDATE payments SET status = 'charged' ...;
COMMIT; -- All succeed or all fail

Saga Pattern (Microservices):

No distributed transaction available!
Solution: Compensating transactions

Success Flow:
1. Payment Service: Reserve $100 βœ“
2. Inventory Service: Reserve 1 widget βœ“
3. Shipping Service: Schedule βœ“
β†’ Order complete

Failure Flow:
1. Payment Service: Reserve $100 βœ“
2. Inventory Service: Reserve 1 widget βœ“
3. Shipping Service: Schedule βœ— FAIL
4. Inventory Service: COMPENSATE - Release widget
5. Payment Service: COMPENSATE - Refund $100
β†’ Order cancelled, no inconsistency

Two Saga Approaches:

Choreography (Event-Based):

  • Services publish events
  • Other services react to events
  • Decentralized coordination

Orchestration (Coordinator):

  • Central coordinator manages flow
  • Explicitly calls each service
  • Centralized coordination

When to Use Saga:

  • Microservices architecture
  • Need consistency across services
  • Can't use distributed transactions
  • Business process spans multiple services

Trade-offs:

  • Pro: Eventual consistency across services
  • Con: Complex to implement and debug
  • Con: Compensations must be idempotent

from dataclasses import dataclass
from typing import Callable, List
from enum import Enum

class SagaStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensate: Callable

class SagaOrchestrator:
    """Orchestrator for distributed transactions"""
    
    def __init__(self):
        self.steps: List[SagaStep] = []
        self.completed_steps: List[str] = []
    
    def add_step(
        self,
        name: str,
        action: Callable,
        compensate: Callable
    ):
        """Add step to saga"""
        self.steps.append(SagaStep(name, action, compensate))
    
    async def execute(self, context: dict):
        """Execute saga"""
        try:
            # Execute all steps
            for step in self.steps:
                logger.info(f"Executing step: {step.name}")
                
                result = await step.action(context)
                context[step.name] = result
                self.completed_steps.append(step.name)
            
            logger.info("Saga completed successfully")
            return context
        
        except Exception as e:
            logger.error(f"Saga failed at step: {e}")
            await self.compensate(context)
            raise
    
    async def compensate(self, context: dict):
        """Compensate completed steps in reverse order"""
        logger.info("Starting compensation")
        
        for step_name in reversed(self.completed_steps):
            step = next(s for s in self.steps if s.name == step_name)
            
            try:
                logger.info(f"Compensating: {step.name}")
                await step.compensate(context)
            except Exception as e:
                logger.error(f"Compensation failed for {step.name}: {e}")
                # Continue compensating other steps

# Example: Order Saga
async def create_order_saga(order_data: dict):
    saga = SagaOrchestrator()
    
    # Step 1: Reserve Payment
    saga.add_step(
        "reserve_payment",
        action=lambda ctx: payment_service.reserve(
            ctx['order_data']['user_id'],
            ctx['order_data']['total']
        ),
        compensate=lambda ctx: payment_service.release(
            ctx['reserve_payment']['payment_id']
        )
    )
    
    # Step 2: Reserve Inventory
    saga.add_step(
        "reserve_inventory",
        action=lambda ctx: inventory_service.reserve(
            ctx['order_data']['items']
        ),
        compensate=lambda ctx: inventory_service.release(
            ctx['reserve_inventory']['reservation_id']
        )
    )
    
    # Step 3: Create Order
    saga.add_step(
        "create_order",
        action=lambda ctx: order_service.create(ctx['order_data']),
        compensate=lambda ctx: order_service.cancel(
            ctx['create_order']['order_id']
        )
    )
    
    # Step 4: Schedule Shipping
    saga.add_step(
        "schedule_shipping",
        action=lambda ctx: shipping_service.schedule(
            ctx['create_order']['order_id']
        ),
        compensate=lambda ctx: shipping_service.cancel(
            ctx['schedule_shipping']['shipping_id']
        )
    )
    
    # Execute saga
    context = {"order_data": order_data}
    return await saga.execute(context)

CQRS (Command Query Responsibility Segregation)

The Big Idea: Separate reads (queries) and writes (commands) into different models. They don't have to use the same database structure or even the same database!

Why CQRS Exists:

Traditional Approach (Single Model):

Database Table: Users
- Optimized for both reads and writes
- Compromise: Not optimal for either
- Complex queries hurt write performance
- Write traffic impacts read performance

CQRS Approach (Separate Models):

Write Model (Commands):
- Simple, normalized structure
- Optimized for data integrity
- Fast writes

Read Model (Queries):
- Denormalized, flat structure
- Optimized for specific queries
- Fast reads
- Can use different database (e.g., Elasticsearch)

Real-World Example: E-commerce Product Page

Without CQRS:

-- One complex query to get everything
SELECT 
  p.*, 
  c.name as category,
  AVG(r.rating) as avg_rating,
  COUNT(r.id) as review_count,
  i.quantity as stock
FROM products p
  LEFT JOIN categories c ON p.category_id = c.id
  LEFT JOIN reviews r ON p.id = r.product_id
  LEFT JOIN inventory i ON p.id = i.product_id
WHERE p.id = 123
GROUP BY p.id;

-- Slow (200ms), impacts write performance

With CQRS:

# Write Model (Creating product)
def create_product(data):
    product = db.insert('products', data)
    publish_event('product_created', product)
    return product.id

# Read Model (materialized view, kept in sync by event handlers)
redis.set('product:123', json.dumps({
    'id': 123,
    'name': 'Laptop',
    'category': 'Electronics',
    'avg_rating': 4.5,
    'review_count': 234,
    'stock': 15
}))

# Query (instant!)
def get_product(product_id):
    return json.loads(redis.get(f'product:{product_id}'))  # ~2ms

When to Use CQRS:

  • Read/Write patterns differ significantly
  • Read-heavy with complex queries (reporting, dashboards)
  • Need different consistency models (strong writes, eventual read)
  • Want to optimize each independently

When NOT to Use CQRS:

  • Simple CRUD applications
  • Read/write patterns are similar
  • Team lacks experience (adds complexity)
  • Don't have eventual consistency tolerance

CQRS Benefits:

  • Performance: Optimize reads and writes independently
  • Scalability: Scale read and write databases separately
  • Flexibility: Use best database for each (PostgreSQL for writes, Elasticsearch for reads)
  • Simplicity: Each model is simpler (no compromise)

CQRS Challenges:

  • Eventual consistency: Read model lags behind writes
  • Complexity: Two models to maintain
  • Synchronization: Must keep read model updated
  • Learning curve: Team needs to understand pattern

Rate Limiting: Protecting Your System

Rate limiting prevents abuse by limiting how many requests a client can make. It's essential for stability, security, and fair usage.

Why Rate Limiting Matters:

Without Rate Limiting:

Malicious actor:
- Sends 100,000 requests/second
- Your servers crash
- Legitimate users can't access system
- You pay massive cloud bills

With Rate Limiting:

After 1,000 requests/hour:
- Block additional requests
- Return 429 (Too Many Requests)
- System stays stable
- Fair usage for everyone

Rate Limiting Algorithms Explained:

1. Token Bucket (Most Common)

How it works:

  • Bucket holds tokens (e.g., 100 tokens)
  • Each request consumes 1 token
  • Bucket refills at constant rate (e.g., 10 tokens/second)
  • If bucket empty, reject request

Characteristics:

  • Allows bursts: Can use all tokens at once
  • Smooth over time: Refills constantly
  • Flexible: Different rates for different operations

Use Case: API rate limiting (100 requests/minute with burst of 20)

2. Leaky Bucket

How it works:

  • Requests enter a queue (bucket)
  • Requests leave at constant rate (leak)
  • If bucket full, reject request

Characteristics:

  • Smooth output: Enforces constant rate
  • No bursts: Can't exceed rate
  • Queue-based: FIFO processing

Use Case: Network traffic shaping, preventing spikes
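
To make this concrete, here is a minimal in-process sketch of a leaky bucket implemented as a meter (a counter approximation of the queue); the class and parameter names are my own, not from any particular library:

import time

class LeakyBucket:
    """Requests 'fill' the bucket; it drains at a fixed rate."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # max requests that can be queued
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0.0            # current fill level
        self.last_check = time.time()

    def allow(self) -> bool:
        now = time.time()
        # Drain based on elapsed time since the last call
        self.water = max(0.0, self.water - (now - self.last_check) * self.leak_rate)
        self.last_check = now

        if self.water < self.capacity:
            self.water += 1
            return True
        return False  # bucket full: reject


limiter = LeakyBucket(capacity=10, leak_rate=5.0)  # drains 5 requests/second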

3. Fixed Window

How it works:

  • Count requests in fixed time window (e.g., per minute)
  • Reset counter at window boundary
  • If count exceeds limit, reject

Characteristics:

  • Simple: Easy to implement
  • Boundary issue: 2x rate at window edge
  • Not accurate: Burst at window transitions

Example Problem:

Limit: 100 requests/minute
11:00:30 - 100 requests (allowed)
11:01:00 - Window resets
11:01:01 - 100 requests (allowed)
Result: 200 requests in 31 seconds!

Use Case: Simple rate limiting, low-accuracy needs
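
Despite the boundary issue, a fixed window is trivial to implement. A sketch using a Redis counter, assuming the same async redis client used in the sliding window implementation later in this section:

import time

async def fixed_window_rate_limit(
    user_id: int,
    max_requests: int = 100,
    window_seconds: int = 60
) -> bool:
    # Bucket all requests into the current window number
    window = int(time.time() // window_seconds)
    key = f"rate_limit:fixed:{user_id}:{window}"

    count = await redis.incr(key)
    if count == 1:
        # First request in this window: expire the key so old windows clean up
        await redis.expire(key, window_seconds)

    return count <= max_requests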

4. Sliding Window (Most Accurate)

How it works:

  • Track requests in rolling time window
  • Count requests in last N seconds
  • Remove old requests as window slides

Characteristics:

  • Accurate: No boundary issues
  • Fair: True rolling average
  • Memory intensive: Must store timestamps

Use Case: Premium APIs, precise rate limiting

Comparing Algorithms:

| Algorithm      | Accuracy  | Complexity | Bursts     | Memory   |
|----------------|-----------|------------|------------|----------|
| Token Bucket   | Good      | Low        | Yes        | Low      |
| Leaky Bucket   | Good      | Medium     | No         | Medium   |
| Fixed Window   | Poor      | Very Low   | Yes        | Very Low |
| Sliding Window | Excellent | High       | Controlled | High     |

Production Recommendation:

  • API Gateway: Sliding Window (accuracy matters)
  • Internal Services: Token Bucket (flexibility + performance)
  • Simple Apps: Fixed Window (ease of implementation)

Rate Limiting Strategies:

Per User:

User A: 1000 requests/hour
User B: 1000 requests/hour

Per IP:

IP 1.2.3.4: 1000 requests/hour
(Useful for anonymous APIs)

Per API Key:

Key abc123: 10,000 requests/hour (paid)
Key xyz789: 1,000 requests/hour (free)

Global:

All users combined: 100,000 requests/hour
(Protects system capacity)

Per Endpoint:

POST /orders: 100 requests/hour (expensive)
GET /products: 1000 requests/hour (cheap)

Response Headers (Be Transparent):

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 842
X-RateLimit-Reset: 1617123456
Retry-After: 60
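
A sketch of how these headers might be attached in FastAPI; limiter.check() is a hypothetical placeholder for whichever algorithm you chose above:

from fastapi import Request
from fastapi.responses import JSONResponse

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    # limiter.check() is hypothetical: returns (allowed, remaining, reset_epoch)
    allowed, remaining, reset_at = limiter.check(request.client.host)

    if not allowed:
        return JSONResponse(
            status_code=429,
            content={"detail": "Too many requests"},
            headers={"Retry-After": "60"}
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = "1000"
    response.headers["X-RateLimit-Remaining"] = str(remaining)
    response.headers["X-RateLimit-Reset"] = str(reset_at)
    return response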
CQRS Implementation Example

# Command Side (Writes)
from pydantic import BaseModel

class CreateUserCommand(BaseModel):
    email: str
    name: str

class CommandHandler:
    async def handle_create_user(self, command: CreateUserCommand):
        # Write to primary database
        user = await write_db.create_user(
            email=command.email,
            name=command.name
        )
        
        # Publish event
        await event_bus.publish('user_created', {
            'user_id': user.id,
            'email': user.email,
            'name': user.name
        })
        
        return user.id

# Query Side (Reads)
class UserQuery:
    async def get_user_profile(self, user_id: int):
        # Read from optimized read database
        return await read_db.get_user_profile(user_id)
    
    async def search_users(self, query: str):
        # Read from search-optimized database
        return await elasticsearch.search_users(query)

# Event Handler (Sync read database)
@event_bus.subscribe('user_created')
async def sync_user_to_read_db(data: dict):
    """Synchronize write DB to read DB"""
    await read_db.upsert_user(data)
    await elasticsearch.index_user(data)

# API Endpoints
@app.post("/users")  # Command
async def create_user(user: CreateUserCommand):
    handler = CommandHandler()
    user_id = await handler.handle_create_user(user)
    return {"user_id": user_id}

@app.get("/users/{user_id}")  # Query
async def get_user(user_id: int):
    query = UserQuery()
    return await query.get_user_profile(user_id)

Rate Limiting Implementations

1. Token Bucket

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()
    
    def consume(self, tokens: int = 1) -> bool:
        self._refill()
        
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
    
    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now

2. Sliding Window

async def sliding_window_rate_limit(
    user_id: int,
    max_requests: int = 100,
    window_seconds: int = 60
) -> bool:
    """Sliding window rate limiting with Redis"""
    now = time.time()
    window_start = now - window_seconds
    
    key = f"rate_limit:{user_id}"
    
    # Remove old entries
    await redis.zremrangebyscore(key, 0, window_start)
    
    # Count requests in window
    count = await redis.zcard(key)
    
    if count < max_requests:
        # Add current request
        await redis.zadd(key, {str(now): now})
        await redis.expire(key, window_seconds)
        return True
    
    return False

8. Scaling & Performance {#scaling-performance}

Understanding Scalability

Scalability is your system's ability to handle increased load. But here's the catch: scaling isn't just about handling more traffic; it's about handling it cost-effectively.

The Scaling Journey:

Stage 1: Single Server (0-100 users)
└── Everything on one machine

Stage 2: Vertical Scaling (100-1K users)
└── Bigger server

Stage 3: Horizontal Scaling (1K-100K users)
├── Multiple app servers
├── Load balancer
└── Database replication

Stage 4: Distributed (100K-1M+ users)
├── Microservices
├── Database sharding
├── CDN
├── Caching layers
└── Message queues

Common Misconception:

"Scaling = Adding more servers"

Reality:

"Scaling = Identifying bottlenecks and addressing them systematically"

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up):

Before: 4 CPU, 8GB RAM
After:  16 CPU, 64GB RAM

Pros:
✓ Simpler (no code changes)
✓ No consistency issues
✓ Lower latency (no network)
✓ Easier to maintain

Cons:
✗ Physical limits (can't scale infinitely)
✗ Expensive (cost grows faster than capacity)
✗ Single point of failure
✗ Downtime during upgrades

Best for:
- Databases (harder to scale horizontally)
- Legacy applications
- Limited budget/time

Horizontal Scaling (Scale Out):

Before: 1 server
After:  10 servers

Pros:
✓ Near-unlimited scaling (keep adding servers)
✓ Roughly linear cost
✓ Fault tolerance (one server fails, others keep working)
✓ Zero-downtime deployments

Cons:
✗ Complex architecture
✗ Network latency
✗ Consistency challenges
✗ More operational overhead

Best for:
- Stateless applications
- Web servers
- Microservices
- High availability needs

The Sweet Spot: Most applications use both:

  1. Vertical scaling for databases
  2. Horizontal scaling for application servers

Performance Bottlenecks: The Hierarchy

Fix in this order (biggest impact first):

1. Database Queries (Biggest Impact)

Problem: Slow queries
Solution: Indexes, query optimization
Impact: 10-100x improvement

Without index: 500ms
With index: 5ms

2. N+1 Queries

Problem: Multiple database calls
Solution: JOINs or batch loading
Impact: 10-50x improvement

100 queries = 5 seconds
1 query = 100ms
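
A sketch of the fix, assuming an asyncpg-style client with $1 placeholders:

# ❌ N+1: one query for the posts, then one more per post
posts = await db.fetch("SELECT * FROM posts LIMIT 100")
for post in posts:
    author = await db.fetchrow(
        "SELECT name FROM users WHERE id = $1", post["author_id"]
    )  # 100 extra round trips

# ✅ One round trip with a JOIN
posts = await db.fetch("""
    SELECT p.*, u.name AS author_name
    FROM posts p
    JOIN users u ON u.id = p.author_id
    LIMIT 100
""")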

3. Caching

Problem: Repeated expensive operations
Solution: Redis/Memcached
Impact: 5-50x improvement

Database: 100ms
Cache: 2ms

4. Async Processing

Problem: Blocking operations
Solution: Background jobs
Impact: 5-10x improvement

Sync: User waits 5 seconds
Async: User waits 0.1 seconds

5. Code Optimization

Problem: Inefficient algorithms
Solution: Better algorithms/data structures
Impact: 2-5x improvement

O(n²) → O(n log n) or O(n)

6. Hardware/Scaling

Problem: Not enough resources
Solution: More servers
Impact: 2-10x improvement

1 server → 5 servers = 5x capacity

Load Balancing Strategies

Load balancing distributes traffic across multiple servers. But different algorithms suit different scenarios.

1. Round Robin (Default)

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (repeat)

Pros: Simple, fair distribution
Cons: Doesn't consider server load
Best for: Homogeneous servers, stateless apps

2. Least Connections

Server A: 5 connections
Server B: 3 connections  ← Route here
Server C: 8 connections

Pros: Adapts to load
Cons: Needs connection tracking
Best for: Long-lived connections (WebSockets)

3. IP Hash (Sticky Sessions)

hash(client_ip) % num_servers = server

Same client always routes to same server

Pros: Session affinity
Cons: Uneven distribution if few clients
Best for: Stateful apps, caching benefits

4. Weighted Round Robin

Server A (weight 3): 60% traffic
Server B (weight 2): 40% traffic

Pros: Handles different server capacities
Cons: Manual weight configuration
Best for: Heterogeneous servers

Database Scaling Strategies

1. Read Replicas (Read-Heavy Workloads)

Primary DB (writes)
   ↓
   ├─→ Replica 1 (reads)
   ├─→ Replica 2 (reads)
   └─→ Replica 3 (reads)

Benefits:
- Distribute read load across replicas
- 3 replicas = 3x read capacity
- No application changes needed

Trade-offs:
- Replication lag (eventual consistency)
- Writes still bottlenecked on primary

2. Database Sharding (Write-Heavy Workloads)

User IDs 1-1000     → Shard 1
User IDs 1001-2000  → Shard 2
User IDs 2001-3000  → Shard 3

Benefits:
- Distribute both reads and writes
- Near-linear scaling
- Each shard is smaller (better performance)

Trade-offs:
- Complex queries (cross-shard JOINs)
- Rebalancing is difficult
- Application must handle routing
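
A minimal sketch of the application-side routing for the range scheme above (names are illustrative; real systems often hash the key instead, or use a lookup service):

class ShardRouter:
    def __init__(self, shards: list, shard_size: int = 1000):
        self.shards = shards          # one connection/pool per shard
        self.shard_size = shard_size  # user IDs per shard, as above

    def shard_for_user(self, user_id: int):
        index = (user_id - 1) // self.shard_size
        return self.shards[min(index, len(self.shards) - 1)]

    async def get_user(self, user_id: int):
        shard = self.shard_for_user(user_id)
        return await shard.fetchrow(
            "SELECT * FROM users WHERE id = $1", user_id
        )

Hash-based sharding (shard = hash(user_id) % num_shards) distributes load more evenly, at the cost of harder range queries and resharding.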

3. Caching Layer (Read-Heavy with Hot Data)

Application → Cache (Redis) → Database

Benefits:
- Extremely fast (1-5ms vs 50-100ms)
- Reduces database load by 70-90%
- Easy to add

Trade-offs:
- Cache invalidation complexity
- Additional infrastructure
- Memory cost
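
The classic cache-aside read path, sketched with an async Redis client and an asyncpg-style database handle (both assumed, as in earlier examples):

import json

async def get_product_cached(product_id: int):
    key = f"product:{product_id}"

    cached = await redis.get(key)
    if cached:
        return json.loads(cached)  # cache hit: ~1-5ms

    row = await db.fetchrow(
        "SELECT * FROM products WHERE id = $1", product_id
    )  # cache miss: ~50-100ms
    await redis.set(key, json.dumps(dict(row)), ex=300)  # 5-minute TTL
    return dict(row)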

Load Balancing Algorithms

from typing import List
import random
import hashlib

class LoadBalancer:
    def __init__(self, servers: List[str]):
        self.servers = servers
        self.current_index = 0
    
    def round_robin(self) -> str:
        """Distribute requests evenly"""
        server = self.servers[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.servers)
        return server
    
    def least_connections(self, connections: dict) -> str:
        """Route to server with fewest connections"""
        return min(self.servers, key=lambda s: connections.get(s, 0))
    
    def ip_hash(self, client_ip: str) -> str:
        """Consistent routing based on IP"""
        hash_value = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        index = hash_value % len(self.servers)
        return self.servers[index]
    
    def weighted_round_robin(self, weights: dict) -> str:
        """Round robin with server weights"""
        weighted_servers = []
        for server in self.servers:
            weight = weights.get(server, 1)
            weighted_servers.extend([server] * weight)
        
        return random.choice(weighted_servers)
    
    def random_selection(self) -> str:
        """Random server selection"""
        return random.choice(self.servers)

Database Replication

class DatabaseRouter:
    """Route queries to appropriate database"""
    
    def __init__(self):
        # create_connection is a stand-in for your driver's connect call
        self.primary = create_connection("primary-db:5432")
        self.replicas = [
            create_connection("replica-1:5432"),
            create_connection("replica-2:5432"),
            create_connection("replica-3:5432")
        ]
        self.replica_index = 0
    
    async def execute_write(self, query: str, *args):
        """Execute write on primary"""
        return await self.primary.execute(query, *args)
    
    async def execute_read(self, query: str, *args):
        """Execute read on replica (round-robin load balanced)"""
        replica = self.replicas[self.replica_index]
        self.replica_index = (self.replica_index + 1) % len(self.replicas)
        return await replica.execute(query, *args)

# Usage
db = DatabaseRouter()

# Writes go to primary
await db.execute_write("INSERT INTO users (name) VALUES ($1)", "John")

# Reads distributed across replicas
users = await db.execute_read("SELECT * FROM users")

Connection Pooling Best Practices

# Optimal pool sizing formula
# connections = ((core_count * 2) + effective_spindle_count)

import asyncpg
import multiprocessing

async def create_optimized_pool():
    core_count = multiprocessing.cpu_count()
    
    # For SSD: effective_spindle_count ≈ core_count
    # For HDD: effective_spindle_count ≈ number of physical drives
    optimal_size = (core_count * 2) + core_count
    
    pool = await asyncpg.create_pool(
        dsn="postgresql://user:pass@localhost/db",
        min_size=optimal_size // 2,  # Always maintain half
        max_size=optimal_size,
        command_timeout=60.0,
        max_queries=50000,  # Recycle connections
        max_inactive_connection_lifetime=300.0  # 5 minutes
    )
    
    return pool

9. Security Best Practices {#security}

Security: Not Optional, Not Later

The Harsh Reality:

  • Average cost of data breach: $4.35 million
  • 43% of cyberattacks target small businesses
  • 60% of companies go out of business within 6 months of a breach

Security is NOT:

  • Something to add later
  • Just for big companies
  • Only frontend (CORS, XSS)
  • A one-time implementation

Security IS:

  • Prevention: Stop attacks before they happen
  • Detection: Know when you're under attack
  • Response: Minimize damage when breached
  • Recovery: Get back online quickly

OWASP Top 10: The Most Critical Vulnerabilities

The Open Web Application Security Project (OWASP) maintains a list of the most critical security risks. Every backend engineer must know these.

1. Injection (SQL, NoSQL, Command)

What: Attacker inserts malicious code into queries
Impact: Database takeover, data theft
Example: 
Input: "admin' OR '1'='1"
Query: SELECT * FROM users WHERE username='admin' OR '1'='1'
Result: Bypasses authentication!

Prevention: Parameterized queries, input validation

2. Broken Authentication

What: Weak password policies, session management
Impact: Account takeover
Examples:
- No rate limiting → Brute force
- Predictable session IDs
- Passwords in URLs

Prevention: Strong passwords, MFA, secure sessions

3. Sensitive Data Exposure

What: Exposing confidential data
Impact: Privacy violations, compliance issues
Examples:
- Sending credit cards in logs
- Storing passwords in plain text
- API keys in code

Prevention: Encryption, hashing, secure storage

4. XML External Entities (XXE)

What: Processing untrusted XML
Impact: Server-side request forgery, DOS
Prevention: Disable XML external entity processing
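
In Python, the usual advice is the defusedxml package, a drop-in replacement for the stdlib parsers. A sketch (untrusted_xml is a placeholder):

# ❌ Risky with untrusted input on some parsers/versions
import xml.etree.ElementTree as ET
root = ET.fromstring(untrusted_xml)

# ✅ defusedxml rejects entity expansion and external entities
import defusedxml.ElementTree as DET
root = DET.fromstring(untrusted_xml)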

5. Broken Access Control

What: Users access unauthorized resources
Impact: Privilege escalation, data leaks
Example:
User A: /api/users/123/orders ✓
User A: /api/users/456/orders ✗ (Should fail but doesn't!)

Prevention: Server-side authorization checks

6. Security Misconfiguration

What: Default configs, unnecessary features
Impact: Various vulnerabilities
Examples:
- Debug mode in production
- Default passwords
- Unnecessary services running

Prevention: Security hardening, minimal setup

7. Cross-Site Scripting (XSS)

What: Injecting malicious scripts
Impact: Session hijacking, defacement
Example:
Input: <script>steal_cookies()</script>
Display: Executes in victim's browser!

Prevention: Input sanitization, CSP headers

8. Insecure Deserialization

What: Deserializing untrusted data
Impact: Remote code execution
Prevention: Avoid deserialization of user input
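
In Python this usually means preferring JSON over pickle for anything that crosses a trust boundary (untrusted_bytes is a placeholder):

import pickle
import json

# ❌ pickle can execute arbitrary code embedded in the payload
obj = pickle.loads(untrusted_bytes)

# ✅ JSON can only produce plain data (dicts, lists, strings, numbers)
obj = json.loads(untrusted_bytes)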

9. Using Components with Known Vulnerabilities

What: Outdated libraries with CVEs
Impact: Various exploits
Example: Using Node.js package with known RCE

Prevention: Regular updates, dependency scanning

10. Insufficient Logging & Monitoring

What: Not tracking security events
Impact: Can't detect or respond to attacks
Prevention: Log all security events, alert on anomalies

Defense in Depth: Layered Security

Don't rely on a single security measure. Use multiple layers:

Layer 1: Network (Firewall, VPC)
   ↓
Layer 2: Infrastructure (OS hardening, patches)
   ↓
Layer 3: Application (Input validation, auth)
   ↓
Layer 4: Data (Encryption, hashing)
   ↓
Layer 5: Monitoring (Logging, alerts)

If one layer fails, others protect you.

Password Security: Beyond Hashing

Password Storage Hierarchy (Worst to Best):

❌ Plain Text

Database: password = "MyPassword123"
Impact: Immediate breach if database leaked

❌ MD5/SHA1 (Broken)

Database: password = md5("MyPassword123")
Impact: Rainbow table attacks, extremely fast cracking

❌ SHA256 (Better but still bad)

Database: password = sha256("MyPassword123")
Impact: GPU cracking is very fast

✅ Bcrypt/Argon2 (Correct)

Database: password = bcrypt("MyPassword123", rounds=12)
Impact: Computationally expensive to crack
Time to crack: Years with current hardware

Why Bcrypt/Argon2?

  • Slow by design: Takes 100-500ms to hash (vs 0.001ms for SHA256)
  • Adaptive: Can increase rounds as hardware improves
  • Salt built-in: Protects against rainbow tables
  • Battle-tested: Industry standard

Password Policies That Actually Work:

Effective:

  • Minimum 8-12 characters
  • Check against breached password database (Have I Been Pwned)
  • Account lockout after failed attempts
  • Multi-factor authentication

Ineffective (Stop Doing These):

  • Forced password rotation every 90 days
  • Complex requirements (1 uppercase, 1 number, 1 special)
  • Security questions
  • Password hints

Common Attack Vectors

1. Brute Force

Attack: Try many passwords rapidly
Defense: Rate limiting + account lockout

Example:
- Allow 5 login attempts per 15 minutes
- After 5 failures, lock account for 15 minutes
- Alert security team after 10 failures

2. DDoS (Distributed Denial of Service)

Attack: Overwhelm server with traffic
Defense: Rate limiting + CDN + Auto-scaling

Example:
- Use CloudFlare/AWS Shield
- Implement global rate limits
- Auto-scale during attacks

3. Session Hijacking

Attack: Steal user session
Defense: Secure cookies + HTTPS + regenerate sessions

Example:
- HttpOnly cookies (prevent JavaScript access)
- Secure flag (HTTPS only)
- SameSite attribute (CSRF protection)
- Regenerate session ID after login
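
In FastAPI those cookie attributes look roughly like this (create_session is a hypothetical helper):

from fastapi import Response

@app.post("/auth/login")
async def login(response: Response):
    # ... verify credentials first, then issue a *fresh* session ID ...
    session_id = create_session()  # hypothetical helper

    response.set_cookie(
        key="session_id",
        value=session_id,
        httponly=True,    # JavaScript cannot read it
        secure=True,      # sent over HTTPS only
        samesite="lax",   # CSRF mitigation
        max_age=3600
    )
    return {"status": "ok"}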

4. Man-in-the-Middle (MITM)

Attack: Intercept communication
Defense: HTTPS everywhere + certificate pinning

Example:
- Force HTTPS (redirect HTTP to HTTPS)
- HSTS header (browser-enforced HTTPS)
- Valid SSL certificates
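
A sketch of both defenses as FastAPI middleware (in practice the redirect usually lives at the load balancer):

from fastapi import Request
from fastapi.responses import RedirectResponse

@app.middleware("http")
async def force_https(request: Request, call_next):
    if request.url.scheme == "http":
        return RedirectResponse(
            url=str(request.url.replace(scheme="https")),
            status_code=308
        )

    response = await call_next(request)
    # Tell browsers to use HTTPS for the next year, subdomains included
    response.headers["Strict-Transport-Security"] = (
        "max-age=31536000; includeSubDomains"
    )
    return response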

SQL Injection Prevention

# ❌ VULNERABLE to SQL Injection
def get_user_unsafe(user_id: str):
    query = f"SELECT * FROM users WHERE id = {user_id}"
    return db.execute(query)
    # Attacker: user_id = "1 OR 1=1; DROP TABLE users;--"

# ✅ SAFE: Use parameterized queries
def get_user_safe(user_id: int):
    query = "SELECT * FROM users WHERE id = $1"
    return db.execute(query, user_id)

# ✅ SAFE: Use ORM
def get_user_orm(user_id: int):
    return User.objects.filter(id=user_id).first()

XSS Prevention

from html import escape

# ❌ VULNERABLE to XSS
@app.get("/user/{name}")
async def greet_unsafe(name: str):
    return {"message": f"<h1>Hello {name}</h1>"}
    # Attacker: name = "<script>alert('XSS')</script>"

# ✅ SAFE: Escape HTML
@app.get("/user/{name}")
async def greet_safe(name: str):
    safe_name = escape(name)
    return {"message": f"<h1>Hello {safe_name}</h1>"}

# ✅ SAFE: Use templates with auto-escaping enabled
from jinja2 import Template

template = Template("<h1>Hello {{ name }}</h1>", autoescape=True)
return template.render(name=name)

Password Security

from passlib.context import CryptContext
import secrets

pwd_context = CryptContext(
    schemes=["bcrypt"],
    deprecated="auto",
    bcrypt__rounds=12  # Work factor (higher = slower = more secure)
)

class PasswordSecurity:
    @staticmethod
    def hash_password(password: str) -> str:
        """Hash password with bcrypt"""
        return pwd_context.hash(password)
    
    @staticmethod
    def verify_password(plain: str, hashed: str) -> bool:
        """Verify password against hash"""
        return pwd_context.verify(plain, hashed)
    
    @staticmethod
    def generate_strong_password(length: int = 16) -> str:
        """Generate cryptographically secure password"""
        alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!@#$%^&*()"
        return ''.join(secrets.choice(alphabet) for _ in range(length))
    
    @staticmethod
    def validate_password_strength(password: str) -> bool:
        """Validate password meets requirements"""
        if len(password) < 8:
            return False
        
        has_upper = any(c.isupper() for c in password)
        has_lower = any(c.islower() for c in password)
        has_digit = any(c.isdigit() for c in password)
        has_special = any(c in "!@#$%^&*()" for c in password)
        
        return all([has_upper, has_lower, has_digit, has_special])

CORS Configuration

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://example.com",
        "https://app.example.com"
    ],  # Specific origins, NOT "*" in production!
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    max_age=3600  # Cache preflight requests for 1 hour
)

Rate Limiting for Security

from datetime import datetime, timedelta
from collections import defaultdict

class SecurityRateLimiter:
    """Rate limiter to prevent brute force attacks"""
    
    def __init__(self):
        # In-memory store: fine for a single process; use Redis when
        # running multiple instances behind a load balancer
        self.attempts = defaultdict(list)
    
    async def check_login_attempts(
        self,
        identifier: str,  # email or IP
        max_attempts: int = 5,
        window_minutes: int = 15
    ) -> bool:
        """Check if too many login attempts"""
        now = datetime.now()
        window_start = now - timedelta(minutes=window_minutes)
        
        # Clean old attempts
        self.attempts[identifier] = [
            attempt for attempt in self.attempts[identifier]
            if attempt > window_start
        ]
        
        # Check limit
        if len(self.attempts[identifier]) >= max_attempts:
            return False  # Too many attempts
        
        # Record attempt
        self.attempts[identifier].append(now)
        return True

security_limiter = SecurityRateLimiter()

@app.post("/auth/login")
async def login(credentials: LoginRequest, request: Request):
    identifier = f"{credentials.email}:{request.client.host}"
    
    if not await security_limiter.check_login_attempts(identifier):
        raise HTTPException(
            status_code=429,
            detail="Too many login attempts. Try again in 15 minutes."
        )
    
    # Proceed with login...

10. Monitoring & Observability {#monitoring}

Monitoring vs Observability: Understanding the Difference

Monitoring tells you WHAT is wrong:

  • CPU is at 90%
  • Response time is 2 seconds
  • Error rate is 5%

Observability tells you WHY it's wrong:

  • CPU is high because database queries are slow
  • Response time is slow because external API is timing out
  • Errors are from payment gateway being down

The Reality:

"If you can't measure it, you can't improve it." - Peter Drucker

Yet in practice, most production issues are first discovered by users, not monitoring. That should never happen.

The Three Pillars of Observability

These three pillars work together to give you complete visibility:

1. Metrics (WHAT is happening?)

Examples:
- Request count: 1,250 requests/second
- Error rate: 0.5%
- CPU usage: 65%
- Memory usage: 4.2 GB
- Database connections: 45/100

Use Case: Quick overview, dashboards, alerts
Tools: Prometheus, Grafana, CloudWatch

2. Logs (WHY did it happen?)

Examples:
- "Failed to connect to database: connection timeout"
- "User 123 attempted invalid payment"
- "API rate limit exceeded for client ABC"

Use Case: Debugging, audit trails, troubleshooting
Tools: ELK Stack, Splunk, CloudWatch Logs

3. Traces (WHERE did it happen?)

Example:
Request ID: abc123
  ├─ API Gateway: 5ms
  ├─ Auth Service: 10ms
  ├─ User Service: 150ms ← SLOW!
  │   ├─ Database Query: 145ms ← ROOT CAUSE!
  │   └─ Cache Check: 5ms
  └─ Response: 2ms

Use Case: Distributed systems, microservices debugging
Tools: Jaeger, Zipkin, AWS X-Ray

Why You Need All Three:

Metrics alone:

  • "API latency is 500ms" β†’ But which endpoint? Which user?

Logs alone:

  • "1 million log lines" β†’ How do you find the problem?

Traces alone:

  • "This request was slow" β†’ But is it always slow? How often?

Together:

  1. Metrics alert you: "API latency spiking!"
  2. Traces show you: "User service database query is slow"
  3. Logs tell you: "Deadlock detected on orders table"

What to Monitor: The Essential Metrics

System Metrics (Infrastructure Health):

✓ CPU Usage (target: <70%)
✓ Memory Usage (target: <80%)
✓ Disk Usage (target: <85%)
✓ Network I/O
✓ File Descriptors (often overlooked!)

Alert when:
- Any metric >80% for 5 minutes
- Trend shows rapid increase

Application Metrics (Business Health):

✓ Request Rate (requests/second)
✓ Error Rate (% of failed requests)
✓ Latency (P50, P95, P99)
✓ Active Users
✓ Business KPIs (orders, signups, etc.)

Alert when:
- Error rate >1% for 5 minutes
- P99 latency >1 second
- Business KPI drops >20%

Database Metrics (Data Layer Health):

✓ Connection Pool Usage
✓ Query Latency
✓ Slow Query Count
✓ Replication Lag
✓ Deadlocks

Alert when:
- Connection pool >90% for 5 minutes
- Slow queries >100/minute
- Replication lag >5 seconds

External Dependencies:

✓ Third-party API response time
✓ Payment gateway status
✓ Email service status
✓ CDN health

Alert when:
- External API error rate >5%
- Response time >2x baseline

The Golden Signals (Google SRE)

Google's Site Reliability Engineering team identified 4 critical metrics:

1. Latency (Speed)

How long does it take to serve a request?

Measure:
- P50 (median): 50% of requests faster than this
- P95: 95% of requests faster than this
- P99: 99% of requests faster than this

Why P99 matters:
If P99 = 2 seconds, 1 in 100 users has a terrible experience.
At 1 million requests/day, that's 10,000 angry users!

2. Traffic (Demand)

How much demand is placed on your system?

Measure:
- Requests per second
- Active connections
- Bandwidth usage

Why it matters:
Helps you understand if slow response is due to load

3. Errors (Quality)

What percentage of requests are failing?

Measure:
- HTTP 5xx errors
- HTTP 4xx errors
- Exception rate
- Failed background jobs

Why it matters:
Users can't complete their tasks

4. Saturation (Capacity)

How "full" is your service?

Measure:
- CPU/Memory utilization
- Disk I/O
- Network I/O
- Queue depth

Why it matters:
Predicts when you'll need to scale
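
All four signals can be captured with a few prometheus_client instruments on the FastAPI app from earlier examples; a minimal sketch:

import time
from prometheus_client import Counter, Gauge, Histogram
from fastapi import Request

REQUESTS = Counter(
    "http_requests_total", "Requests", ["method", "path", "status"]
)  # traffic; filter status=~"5.." for errors
LATENCY = Histogram(
    "http_request_duration_seconds", "Latency", ["path"]
)  # latency: P50/P95/P99 come from the histogram
IN_FLIGHT = Gauge("http_requests_in_flight", "Active requests")  # saturation proxy

@app.middleware("http")
async def golden_signals(request: Request, call_next):
    IN_FLIGHT.inc()
    start = time.time()
    status = "500"  # assume the worst if the handler raises
    try:
        response = await call_next(request)
        status = str(response.status_code)
        return response
    finally:
        IN_FLIGHT.dec()
        LATENCY.labels(path=request.url.path).observe(time.time() - start)
        REQUESTS.labels(request.method, request.url.path, status).inc()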

Alerting Best Practices

The Problem with Bad Alerts:

  • Too many β†’ Alert fatigue β†’ Ignore real issues
  • Too few β†’ Miss critical problems
  • Unclear β†’ Waste time investigating

Good Alert Rules:

1. Actionable

❌ Bad: "CPU usage is 85%"
βœ… Good: "CPU usage >80% for 5 minutes. Check logs for slow queries."

Every alert should answer:
- What's wrong?
- Why should I care?
- What should I do?

2. Severity Levels

P1 (Critical): System down, immediate action
- Page on-call engineer
- Example: Website returning 500 errors

P2 (High): Degraded service, fix within hours
- Email + Slack
- Example: Slow database queries

P3 (Medium): Needs attention, fix within days
- Ticket created
- Example: Disk usage at 85%

P4 (Low): Informational, fix when convenient
- Log only
- Example: Non-critical service restarted

3. Avoid Alert Fatigue

Rules:
- Alert on symptoms, not causes
- Use thresholds with time windows (not instant)
- Implement alert aggregation
- Regular alert review and tuning

Example:
❌ "5xx error occurred" (fires 1000 times)
βœ… "5xx error rate >1% for 5 minutes" (fires once)

Health Checks: The Early Warning System

Health checks tell load balancers and orchestrators if your service is healthy.

Types of Health Checks:

1. Liveness Check

Question: "Are you alive?"
Purpose: Should we restart you?

Check:
- Process is running
- Can accept requests

Endpoint: GET /health/live
Response: 200 OK or 503 Service Unavailable

2. Readiness Check

Question: "Are you ready to serve traffic?"
Purpose: Should we send you requests?

Check:
- Database connected
- Cache connected
- Dependencies available

Endpoint: GET /health/ready
Response: 200 OK or 503 Service Unavailable

3. Startup Check

Question: "Have you finished starting up?"
Purpose: Is initialization complete?

Check:
- Configuration loaded
- Database migrations run
- Caches warmed

Endpoint: GET /health/startup
Response: 200 OK or 503 Service Unavailable

Health Check Best Practices:

  • Keep checks fast (<1 second)
  • Don't check every dependency deeply (too slow)
  • Return appropriate HTTP status codes
  • Include timestamp in response
  • Log failed checks

Structured Logging

import logging
import json
from datetime import datetime
from contextvars import ContextVar

# Context variable for request ID
request_id_var: ContextVar[str] = ContextVar('request_id', default='')

class JSONFormatter(logging.Formatter):
    """JSON formatter for structured logging"""
    
    def format(self, record):
        log_data = {
            "timestamp": datetime.utcnow().isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            "module": record.module,
            "function": record.funcName,
            "line": record.lineno
        }
        
        # Add request ID if available
        request_id = request_id_var.get()
        if request_id:
            log_data["request_id"] = request_id
        
        # Add extra fields
        if hasattr(record, 'user_id'):
            log_data["user_id"] = record.user_id
        
        if hasattr(record, 'duration_ms'):
            log_data["duration_ms"] = record.duration_ms
        
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)
        
        return json.dumps(log_data)

# Configure logger
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("User logged in", extra={"user_id": 123})
logger.error("Payment failed", extra={
    "user_id": 123,
    "order_id": 456,
    "amount": 99.99
})

Health Check Endpoint

from enum import Enum
from fastapi import Response

class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"

class HealthCheck:
    async def check_database(self) -> tuple[bool, str]:
        """Check database connection"""
        try:
            await db.execute("SELECT 1")
            return True, "Database OK"
        except Exception as e:
            return False, f"Database error: {str(e)}"
    
    async def check_redis(self) -> tuple[bool, str]:
        """Check Redis connection"""
        try:
            await redis.ping()
            return True, "Redis OK"
        except Exception as e:
            return False, f"Redis error: {str(e)}"
    
    async def check_disk_space(self) -> tuple[bool, str]:
        """Check available disk space"""
        import shutil
        stats = shutil.disk_usage("/")
        percent_used = (stats.used / stats.total) * 100
        
        if percent_used > 90:
            return False, f"Disk usage critical: {percent_used:.1f}%"
        elif percent_used > 80:
            return True, f"Disk usage warning: {percent_used:.1f}%"
        return True, f"Disk usage OK: {percent_used:.1f}%"
    
    async def get_health(self) -> dict:
        """Get overall health status"""
        checks = {
            "database": await self.check_database(),
            "redis": await self.check_redis(),
            "disk": await self.check_disk_space()
        }
        
        # Determine overall status
        failed = [k for k, (ok, _) in checks.items() if not ok]
        
        if not failed:
            status = HealthStatus.HEALTHY
        elif len(failed) < len(checks):
            status = HealthStatus.DEGRADED
        else:
            status = HealthStatus.UNHEALTHY
        
        return {
            "status": status,
            "checks": {
                k: {"status": "ok" if ok else "failed", "message": msg}
                for k, (ok, msg) in checks.items()
            },
            "timestamp": datetime.utcnow().isoformat()
        }

health_checker = HealthCheck()

@app.get("/health")
async def health():
    result = await health_checker.get_health()
    
    status_code = {
        HealthStatus.HEALTHY: 200,
        HealthStatus.DEGRADED: 200,
        HealthStatus.UNHEALTHY: 503
    }[result["status"]]
    
    return Response(
        content=json.dumps(result),
        status_code=status_code,
        media_type="application/json"
    )

Distributed Tracing

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

# Usage
@app.get("/users/{user_id}")
async def get_user(user_id: int):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user_id", user_id)
        
        # Database query
        with tracer.start_as_current_span("db_query"):
            user = await db.get_user(user_id)
        
        # Cache store
        with tracer.start_as_current_span("cache_store"):
            await redis.set(f"user:{user_id}", json.dumps(user))
        
        return user

11. CI/CD & DevOps {#cicd-devops}

CI/CD Pipeline
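
The sections below cover the container image; the pipeline itself can be as simple as a single GitHub Actions workflow. A minimal sketch (job and file names are illustrative):

# .github/workflows/ci.yml
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest --cov

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .
      # push to your registry and trigger deployment here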

Dockerfile Best Practices

# Multi-stage build for smaller images
FROM python:3.11-slim as builder

# Install dependencies in builder stage
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Final stage
FROM python:3.11-slim

# Create non-root user
RUN useradd -m -u 1000 appuser

# Copy dependencies from builder
COPY --from=builder /root/.local /home/appuser/.local

# Set working directory
WORKDIR /app

# Copy application code
COPY --chown=appuser:appuser . .

# Switch to non-root user
USER appuser

# Add .local/bin to PATH
ENV PATH=/home/appuser/.local/bin:$PATH

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python healthcheck.py || exit 1

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose for Development

version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/mydb
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - db
      - redis
    volumes:
      - ./:/app
    command: uvicorn main:app --reload --host 0.0.0.0

  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=mydb
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/mydb
      - REDIS_URL=redis://redis:6379/0
    depends_on:
      - db
      - redis

volumes:
  postgres_data:

12. Data Structures & Algorithms {#data-structures}

Time Complexity Cheat Sheet

| Operation | Array | Linked List | Hash Table | Binary Tree | Heap     |
|-----------|-------|-------------|------------|-------------|----------|
| Access    | O(1)  | O(n)        | O(1) avg   | O(log n)    | O(1)     |
| Search    | O(n)  | O(n)        | O(1) avg   | O(log n)    | O(n)     |
| Insert    | O(n)  | O(1)        | O(1) avg   | O(log n)    | O(log n) |
| Delete    | O(n)  | O(1)        | O(1) avg   | O(log n)    | O(log n) |

Common Algorithms Every Backend Engineer Should Know

from typing import List

# 1. Binary Search (O(log n))
def binary_search(arr: List[int], target: int) -> int:
    left, right = 0, len(arr) - 1
    
    while left <= right:
        mid = (left + right) // 2
        
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    
    return -1


# 2. Two Pointers Pattern
def remove_duplicates(arr: List[int]) -> int:
    """Remove duplicates from sorted array in-place"""
    if not arr:
        return 0
    
    write = 1
    for read in range(1, len(arr)):
        if arr[read] != arr[read - 1]:
            arr[write] = arr[read]
            write += 1
    
    return write


# 3. Sliding Window
def max_sum_subarray(arr: List[int], k: int) -> int:
    """Maximum sum of k consecutive elements"""
    window_sum = sum(arr[:k])
    max_sum = window_sum
    
    for i in range(k, len(arr)):
        window_sum = window_sum - arr[i - k] + arr[i]
        max_sum = max(max_sum, window_sum)
    
    return max_sum


# 4. BFS (Breadth-First Search)
from collections import deque

def bfs(graph: dict, start: str):
    visited = set()
    queue = deque([start])
    visited.add(start)
    
    while queue:
        node = queue.popleft()
        print(node)
        
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)


# 5. DFS (Depth-First Search)
def dfs(graph: dict, node: str, visited: set = None):
    if visited is None:
        visited = set()
    
    visited.add(node)
    print(node)
    
    for neighbor in graph[node]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)


# 6. LRU Cache
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.cache = OrderedDict()
        self.capacity = capacity
    
    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        
        # Move to end (most recently used)
        self.cache.move_to_end(key)
        return self.cache[key]
    
    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            self.cache.move_to_end(key)
        
        self.cache[key] = value
        
        if len(self.cache) > self.capacity:
            # Remove least recently used (first item)
            self.cache.popitem(last=False)

Essential Backend Concepts Checklist

✅ API Design

  • RESTful principles
  • HTTP status codes
  • API versioning
  • Request/Response schemas
  • Error handling patterns

✅ Database

  • SQL and NoSQL differences
  • ACID properties
  • Normalization
  • Indexing strategies
  • Query optimization
  • Transactions
  • Replication and sharding

✅ Authentication & Security

  • JWT tokens
  • OAuth 2.0
  • Password hashing (bcrypt)
  • RBAC (Role-Based Access Control)
  • SQL injection prevention
  • XSS prevention
  • CORS configuration

✅ Performance & Scaling

  • Caching strategies
  • Database connection pooling
  • Horizontal vs Vertical scaling
  • Load balancing
  • CDN usage
  • Async processing

✅ Architecture

  • Monolith vs Microservices
  • API Gateway pattern
  • Service discovery
  • Circuit breaker
  • Message queues
  • Event-driven architecture

✅ DevOps

  • Docker containerization
  • CI/CD pipelines
  • Infrastructure as Code
  • Monitoring & logging
  • Health checks
  • Blue-green deployment

✅ Data Structures & Algorithms

  • Time/Space complexity
  • Hash tables
  • Trees and graphs
  • Sorting algorithms
  • Search algorithms
  • Dynamic programming

Recommended Learning Path

The 6-Month Backend Engineer Journey

This isn't just a curriculum; it's a battle-tested path from junior to mid-level backend engineer. Each month builds on the previous one.

Month 1-2: Fundamentals (Foundation Phase)

Goal: Build solid foundations. 80% of problems in production come from not understanding basics.

Week 1-2: API Design

Learn:
- RESTful principles (read Richardson Maturity Model)
- HTTP methods and status codes
- Request/Response design
- API versioning strategies

Practice:
- Build a TODO API
- Implement proper error handling
- Add pagination
- Write API documentation (OpenAPI/Swagger)

Outcome: Can design clean, intuitive APIs

Week 3-4: Database Fundamentals

Learn:
- SQL basics (SELECT, JOIN, WHERE, GROUP BY)
- Database design (normalization)
- Indexes and their impact
- ACID properties

Practice:
- Design database for blog platform
- Write complex queries with JOINs
- Create indexes and measure improvement
- Use EXPLAIN to understand query plans

Outcome: Can design efficient database schemas

Week 5-6: Authentication & Security

Learn:
- Password hashing (bcrypt)
- JWT tokens
- OAuth 2.0 flow
- OWASP Top 10

Practice:
- Implement user registration/login
- Add JWT authentication
- Implement role-based access control
- Security audit your code

Outcome: Can implement secure authentication

Week 7-8: Testing & Best Practices

Learn:
- Unit testing
- Integration testing
- Test-driven development (TDD)
- Code organization

Practice:
- Write tests for your API
- Achieve >80% code coverage
- Refactor code for testability
- Set up CI/CD pipeline

Outcome: Can write maintainable, tested code

Month 1-2 Milestone Project: Build a complete blog API with:

  • User authentication (JWT)
  • CRUD operations for posts
  • Comments system
  • Search functionality
  • Pagination
  • 80%+ test coverage

Month 3-4: Intermediate (Scaling Phase)

Goal: Learn to build systems that scale and perform well.

Week 9-10: Caching

Learn:
- Cache-aside pattern
- Write-through vs write-behind
- Redis fundamentals
- Cache invalidation strategies

Practice:
- Add Redis caching to your blog API
- Implement cache warming
- Measure performance improvements
- Handle cache stampede

Outcome: Can dramatically improve API performance

Week 11-12: Async Processing

Learn:
- Message queues (RabbitMQ/Redis)
- Background jobs (Celery)
- Event-driven architecture
- Task queues

Practice:
- Add email notifications (async)
- Implement image processing queue
- Handle task failures and retries
- Monitor queue depth

Outcome: Can offload slow operations

Week 13-14: Database Optimization

Learn:
- Query optimization
- Connection pooling
- N+1 query problem
- Database replication

Practice:
- Profile slow queries
- Add appropriate indexes
- Implement connection pooling
- Set up read replica

Outcome: Can optimize database performance

Week 15-16: API Optimization

Learn:
- Rate limiting
- Response compression
- Pagination strategies
- API versioning

Practice:
- Implement rate limiting
- Add response compression
- Optimize payload sizes
- Version your API

Outcome: Can build production-ready APIs

Month 3-4 Milestone Project: Scale your blog API to handle:

  • 1000 requests/second
  • 1 million users
  • Response time <100ms (P95)
  • Background image processing

Month 5-6: Advanced (Architecture Phase)

Goal: Understand system architecture and distributed systems.

Week 17-18: Microservices

Learn:
- Microservices vs monolith
- Service communication
- API Gateway pattern
- Service discovery

Practice:
- Break blog into microservices
- Implement API gateway
- Add service-to-service auth
- Handle partial failures

Outcome: Can design microservices architecture

Week 19-20: Monitoring & Observability

Learn:
- Metrics (Prometheus)
- Logging (structured logs)
- Tracing (Jaeger)
- Alerting

Practice:
- Add Prometheus metrics
- Implement structured logging
- Set up distributed tracing
- Create meaningful alerts

Outcome: Can debug production issues

Week 21-22: System Design

Learn:
- Design patterns (Saga, CQRS, Circuit Breaker)
- Load balancing
- Database sharding
- CDN usage

Practice:
- Design Twitter-like system
- Design URL shortener
- Design notification service
- Design for 100M users

Outcome: Can design scalable systems

Week 23-24: DevOps & Deployment

Learn:
- Docker containerization
- Kubernetes basics
- CI/CD pipelines
- Infrastructure as code

Practice:
- Containerize your application
- Set up CI/CD (GitHub Actions)
- Deploy to Kubernetes
- Implement blue-green deployment

Outcome: Can deploy and operate systems

Month 5-6 Milestone Project: Design and implement a URL shortener that:

  • Handles 10,000 writes/second
  • Handles 100,000 reads/second
  • 99.99% uptime
  • Global distribution (multi-region)
  • Complete monitoring and alerting

Ongoing: Continuous Learning

Daily (30 minutes):

  • Read engineering blogs
    • Netflix Tech Blog
    • Uber Engineering
    • Cloudflare Blog
    • AWS Architecture Blog

Weekly (2 hours):

  • Practice algorithms (LeetCode)
  • Study system design
  • Read documentation
  • Experiment with new tech

Monthly:

  • Deep dive into one topic
  • Build side project
  • Contribute to open source
  • Write technical blog post

Quarterly:

  • Learn new framework/language
  • Take online course
  • Attend conference/meetup
  • Review and update goals

The Learning Strategy That Actually Works

1. Learn by Building (Not Just Watching)

❌ Bad: Watch 10 tutorials
✅ Good: Watch 1 tutorial, build 10 projects

2. Teach to Learn

- Write blog posts
- Answer StackOverflow questions
- Mentor juniors
- Give tech talks

Teaching forces deep understanding

3. Read Code More Than Write

Study production codebases:
- Django (web framework)
- Flask (minimalist framework)
- FastAPI (modern async)
- Redis (database)

Learn patterns and practices

4. Failure is the Best Teacher

Break things intentionally:
- Crash your database
- Overload your server
- Simulate network failures
- Practice recovery

Production won't be forgiving

5. Build in Public

- Share progress on Twitter/LinkedIn
- Open source your projects
- Get feedback early
- Build a portfolio

Visibility leads to opportunities

Common Pitfalls to Avoid

1. Tutorial Hell

Problem: Endlessly watching tutorials
Solution: Build projects after each tutorial

2. Premature Optimization

Problem: Optimizing before measuring
Solution: Profile first, optimize second

3. Overengineering

Problem: Using microservices for todo app
Solution: Start simple, scale when needed

4. Ignoring Fundamentals

Problem: Jumping to advanced topics
Solution: Master basics first

5. Learning Alone

Problem: No feedback, no motivation
Solution: Join communities, find mentors

Success Metrics: Are You Ready?

Junior → Mid-Level Engineer:

  • Can design and implement RESTful APIs
  • Understand database design and optimization
  • Can implement authentication and authorization
  • Write maintainable, tested code
  • Debug production issues independently
  • Estimate tasks accurately
  • Communicate technical decisions clearly

Mid-Level → Senior Engineer:

  • Design systems for scale (1M+ users)
  • Lead technical discussions
  • Mentor junior engineers
  • Make architectural decisions
  • Handle incidents and postmortems
  • Balance technical debt vs features
  • Think about business impact

Final Thoughts: The Backend Engineer Mindset

1. User First

Every technical decision impacts users.
Fast response time = happy users
Downtime = lost revenue
Security breach = lost trust

2. Measure Everything

"In God we trust; all others must bring data."
No measurement = no improvement
Metrics guide decisions

3. Simplicity Wins

Simple solution working > Complex solution perfect
Complexity is the enemy of reliability
Start simple, scale when needed

4. Plan for Failure

Everything fails:
- Servers crash
- Networks partition
- Databases deadlock
- APIs timeout

Design for failure, not success

5. Keep Learning

Technology changes constantly:
- New frameworks
- New paradigms
- New best practices

Learning never stops

The Journey Never Ends

Backend engineering is a marathon, not a sprint. You'll never "finish" learning, and that's the exciting part!

Remember:

  • Month 1: You'll feel overwhelmed β†’ Normal
  • Month 3: You'll start connecting dots β†’ Progress
  • Month 6: You'll feel competent β†’ Achievement
  • Year 1: You'll realize how much you don't know β†’ Wisdom
  • Year 3: You'll mentor others β†’ Mastery
  • Year 5+: You'll specialize β†’ Expertise

Every senior engineer was once a confused junior who didn't give up.

Your Next Step

Pick one topic from this guide that you don't fully understand. Spend the next week:

  1. Reading about it
  2. Building something with it
  3. Breaking it
  4. Fixing it
  5. Teaching someone else

Then move to the next topic.

Small consistent progress > Big sporadic effort


Recommended Resources

Books

  • "Designing Data-Intensive Applications" by Martin Kleppmann
  • "System Design Interview" by Alex Xu
  • "Clean Code" by Robert C. Martin
  • "Building Microservices" by Sam Newman

Websites

  • System Design Primer (GitHub)
  • Web Dev Simplified (YouTube)
  • Backend.fyi - System design articles
  • High Scalability blog

Practice

  • LeetCode - Algorithms
  • System Design Primer - Architecture
  • HackerRank - Coding challenges
  • Exercism - Code practice with mentoring

Remember: Backend engineering is a journey, not a destination. Keep learning, keep building, and always optimize for simplicity first.