Interactive System Design Practice: 20+ Complete Problems with Solutions
🎯 Comprehensive System Design Problem Bank
Complete Practice Framework
This collection provides 20+ system design problems with step-by-step solutions, evaluation criteria, self-scoring rubrics, and time-management guidelines for both L6 and L7 engineering manager interviews.
📋 Problem Categories and Difficulty Progression
L6 Problems: Component-Level Systems (15 Problems)
Focus on designing individual services or systems with clear scope and requirements.
L7 Problems: Platform-Level Systems
Focus on building platforms, infrastructure, and systems used by multiple teams.
🏗️ L6 System Design Problems
Problem 1: URL Shortening Service (Beginner)
Time Limit: 45 minutes
Problem Statement
Design a URL shortening service like bit.ly that can handle 100 million URLs shortened per day with a 10:1 read-to-write ratio.
Requirements Gathering (5 minutes)
Functional Requirements:
- Shorten long URLs to 6-7 character short URLs
- Redirect short URLs to original URLs
- Custom aliases (optional)
- Basic analytics (click count)
- URL expiration (optional)
Non-Functional Requirements:
- 100M URLs/day = ~1,200 writes/second
- 10:1 read ratio = ~12,000 reads/second
- 99.9% availability
- Sub-100ms latency for redirects
- Global distribution
Scale Estimation:
```python
# Daily capacity planning
urls_per_day = 100_000_000
reads_per_day = urls_per_day * 10
writes_per_second = urls_per_day / 86400  # ~1,200/sec
reads_per_second = reads_per_day / 86400  # ~12,000/sec

# Storage estimation
avg_url_length = 100  # bytes
storage_per_day = urls_per_day * avg_url_length  # ~10GB/day
storage_5_years = storage_per_day * 365 * 5      # ~18TB
```
Step-by-Step Solution Walkthrough
Step 1: High-Level Architecture (10 minutes)
```mermaid
graph TB
    Client[Client] --> LB[Load Balancer]
    LB --> API[API Gateway]
    API --> App[Application Servers]
    App --> Cache[(Redis Cache)]
    App --> DB[(Database)]
    App --> Analytics[(Analytics DB)]
```
Components:
- Load Balancer: Distribute traffic, SSL termination
- API Gateway: Rate limiting, authentication, monitoring
- Application Servers: Business logic, URL encoding/decoding
- Cache: Hot URLs and metadata
- Database: URL mappings storage
- Analytics DB: Click tracking and reporting
Step 2: Database Design (10 minutes)
```sql
-- URL mappings table
CREATE TABLE url_mappings (
    short_url  VARCHAR(10) PRIMARY KEY,
    long_url   TEXT NOT NULL,
    user_id    BIGINT,
    created_at TIMESTAMP,
    expires_at TIMESTAMP,
    INDEX idx_created_at (created_at),
    INDEX idx_user_id (user_id)
);

-- Analytics table
CREATE TABLE click_analytics (
    short_url  VARCHAR(10),
    click_time TIMESTAMP,
    ip_address INET,
    user_agent TEXT,
    country    VARCHAR(2),
    INDEX time_url (click_time, short_url),
    INDEX url_time (short_url, click_time)
);
```
Step 3: URL Encoding Algorithm (8 minutes)
Three candidate encoding strategies (counter-based, hash-based, and random with collision check), shown as separate methods:
```python
import hashlib
import random
import string
import time

BASE62_ALPHABET = string.digits + string.ascii_letters

class URLShortener:
    def __init__(self, store):
        self.store = store  # persistent key-value store: short_code -> long_url
        self.counter = 0
        self.base_url = "https://short.ly/"

    def encode_counter(self):
        """Method 1: counter-based. Simple and collision-free, but predictable."""
        self.counter += 1
        return self._base62(self.counter)

    def encode_hash(self, long_url, user_id=None):
        """Method 2: hash-based. Salt with user and time to reduce collisions."""
        digest = hashlib.md5(f"{long_url}{user_id}{time.time()}".encode()).hexdigest()
        return digest[:7]  # take the first 7 chars

    def encode_random(self):
        """Method 3: random code with a collision check against the store."""
        while True:
            code = ''.join(random.choices(BASE62_ALPHABET, k=7))
            if not self.store.exists(code):
                return code

    def shorten(self, long_url, user_id=None):
        code = self.encode_random()
        self.store.put(code, long_url)
        return f"{self.base_url}{code}"

    @staticmethod
    def _base62(n):
        if n == 0:
            return BASE62_ALPHABET[0]
        digits = []
        while n:
            n, r = divmod(n, 62)
            digits.append(BASE62_ALPHABET[r])
        return ''.join(reversed(digits))
```
Step 4: Scaling Considerations (8 minutes)
Read Scaling:
- Redis cache for hot URLs (80/20 rule); see the lookup sketch after this list
- Read replicas for database
- CDN for global distribution
- Cache warming strategies
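Most redirect traffic should be served from the cache. A minimal cache-aside lookup sketch, assuming the redis-py client and a DB-API-style connection (host names, key prefix, and TTL are illustrative):
```python
import redis  # assumes the redis-py client

r = redis.Redis(host="cache.internal", port=6379)

def resolve(short_code, db):
    """Cache-aside lookup for the hot redirect path."""
    cached = r.get(f"url:{short_code}")
    if cached is not None:
        return cached.decode()
    # Cache miss: read the primary store, then populate the cache.
    row = db.execute(
        "SELECT long_url FROM url_mappings WHERE short_url = ?", (short_code,)
    ).fetchone()
    if row is None:
        return None  # surfaces as an HTTP 404
    r.set(f"url:{short_code}", row[0], ex=3600)  # 1-hour TTL keeps the hot set fresh
    return row[0]
```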
Write Scaling:
- Database sharding by short_url prefix
- Write-through caching
- Async analytics processing
Step 5: Operations and Monitoring (4 minutes)
```yaml
# Key metrics
- URL creation rate (per second)
- Redirect latency (p95, p99)
- Cache hit ratio
- Error rates (404s, 5xxs)
- Database connection pool health

# Alerts
- Latency > 100ms for 5 minutes
- Error rate > 1% for 2 minutes
- Cache hit ratio < 80%
- Database replication lag > 1 second
```
Self-Assessment Criteria
Total Score: ___/50 points
Time Management:
- Requirements: 5 min ✓
- High-level design: 10 min ✓
- Database design: 10 min ✓
- Algorithm: 8 min ✓
- Scaling: 8 min ✓
- Operations: 4 min ✓
Problem 2: Real-Time Chat System (Intermediate)
Time Limit: 60 minutes
Problem Statement
Design a real-time chat system like WhatsApp that supports 50 million daily active users, with 1:1 messaging, group chats, message history, and online presence.
Requirements Analysis (8 minutes)
Functional Requirements:
- Send/receive messages in real-time
- 1:1 and group messaging (up to 100 members)
- Message history and search
- Online/offline presence
- Message delivery status (sent, delivered, read)
- File sharing (images, documents)
Non-Functional Requirements:
- 50M daily active users
- Peak: 10M concurrent users
- Message latency < 100ms
- 99.99% availability
- Global distribution
- End-to-end encryption
Scale Estimation:
```python
# User and message calculations
daily_active_users = 50_000_000
concurrent_users_peak = 10_000_000
messages_per_user_per_day = 50
total_daily_messages = daily_active_users * messages_per_user_per_day  # 2.5B messages/day
messages_per_second = total_daily_messages / 86400  # ~29,000 messages/sec
peak_messages_per_second = messages_per_second * 3  # ~87,000 messages/sec

# Storage calculations
avg_message_size = 100  # bytes
daily_storage = total_daily_messages * avg_message_size  # 250GB/day
monthly_storage = daily_storage * 30                     # 7.5TB/month
```
Step-by-Step Solution
Step 1: High-Level Architecture (15 minutes)
```mermaid
graph TB
    Client[Mobile/Web Client] --> Gateway[API Gateway]
    Gateway --> Chat[Chat Service]
    Gateway --> Presence[Presence Service]
    Gateway --> Notification[Notification Service]
    Chat --> MsgDB[(Message DB)]
    Chat --> Queue[Message Queue]
    Presence --> PresenceDB[(Presence Cache)]
    Queue --> Delivery[Message Delivery Service]
    Delivery --> Push[Push Notification]
    Chat --> FileStore[File Storage]
    Chat --> Search[Search Service]
```
Core Components:
- API Gateway: Authentication, rate limiting, routing
- Chat Service: Message processing, validation, storage
- Presence Service: Online/offline status tracking
- Message Queue: Reliable message delivery
- Delivery Service: Handle message routing and delivery status
- Notification Service: Push notifications for offline users
Step 2: Real-Time Communication (12 minutes)
WebSocket Connection Management:
```python
import json

class ConnectionManager:
    def __init__(self):
        self.connections = {}   # user_id -> [websocket connections]
        self.user_servers = {}  # user_id -> server_id

    async def connect_user(self, user_id, websocket, server_id):
        if user_id not in self.connections:
            self.connections[user_id] = []
        self.connections[user_id].append(websocket)
        self.user_servers[user_id] = server_id
        # Update presence
        await self.update_presence(user_id, "online")

    async def send_message(self, recipient_id, message):
        if recipient_id in self.connections:
            # Recipient is online on this server
            for connection in self.connections[recipient_id]:
                await connection.send_text(json.dumps(message))
        else:
            # Route to the correct server, or queue for later delivery
            await self.route_message(recipient_id, message)
```
Message Queue Architecture:
```yaml
# Message queue topics
topics:
  - chat.direct.{user_id}  # Direct messages to a user
  - chat.group.{group_id}  # Group messages
  - presence.updates       # Presence updates
  - delivery.status        # Message delivery confirmations

# Message format
message_schema:
  message_id: uuid
  sender_id: integer
  recipient_id: integer
  group_id: integer (optional)
  content: text
  message_type: "text|image|file"
  timestamp: iso8601
  delivery_status: "sent|delivered|read"
```
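The Presence Service can be kept simple with key TTLs. A minimal sketch assuming redis-py and the `presence:{user_id}` key pattern used in the caching section below (the TTL value is illustrative):
```python
import redis

r = redis.Redis()
PRESENCE_TTL = 30  # seconds; a missed heartbeat lets the key expire

def heartbeat(user_id):
    # Client pings every ~10s; each ping refreshes the key's TTL.
    r.set(f"presence:{user_id}", "online", ex=PRESENCE_TTL)

def is_online(user_id):
    # If the key expired (no recent heartbeat), the user is offline.
    return r.exists(f"presence:{user_id}") == 1
```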
Step 3: Database Design (10 minutes)
```sql
-- Messages table (partitioned by time)
CREATE TABLE messages (
    message_id UUID,
    conversation_id BIGINT NOT NULL,
    sender_id BIGINT NOT NULL,
    content TEXT,
    message_type VARCHAR(20),
    timestamp TIMESTAMP NOT NULL,
    delivery_status VARCHAR(20),
    -- the partition column must be part of the primary key
    PRIMARY KEY (message_id, timestamp),
    INDEX conv_time (conversation_id, timestamp),
    INDEX sender_time (sender_id, timestamp)
) PARTITION BY RANGE (timestamp);

-- Conversations table
CREATE TABLE conversations (
    conversation_id BIGINT PRIMARY KEY,
    conversation_type ENUM('direct', 'group'),
    created_at TIMESTAMP,
    last_message_at TIMESTAMP,
    participant_count INTEGER,
    INDEX type_updated (conversation_type, last_message_at)
);

-- Conversation participants
CREATE TABLE conversation_participants (
    conversation_id BIGINT,
    user_id BIGINT,
    joined_at TIMESTAMP,
    role ENUM('admin', 'member'),
    last_read_message_id UUID,
    PRIMARY KEY (conversation_id, user_id),
    INDEX user_conversations (user_id, last_read_message_id)
);
```
Step 4: Scaling Strategy (10 minutes)
Horizontal Scaling:
- Chat Service: Stateless, scale by adding servers
- Database Sharding: Partition by conversation_id or user_id (see the routing sketch after this list)
- Message Queue: Use Apache Kafka with multiple partitions
- File Storage: Distributed object storage (S3, GCS)
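For the sharding point above, a minimal conversation-to-shard routing sketch (the shard count and mapping are illustrative):
```python
import hashlib

SHARD_COUNT = 64  # fixed logical shards, mapped to physical databases separately

def shard_for(conversation_id: int) -> int:
    # Hash rather than modulo the raw ID so sequential IDs spread evenly.
    digest = hashlib.md5(str(conversation_id).encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT

def db_for(conversation_id, shard_map):
    # shard_map: logical shard -> connection pool; the indirection makes
    # resharding a metadata change instead of a rehash of every key.
    return shard_map[shard_for(conversation_id)]
```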
Caching Strategy:
```python
# Multi-level caching
cache_layers = {
    "L1": "Local server cache (recent messages)",
    "L2": "Redis cluster (conversation metadata)",
    "L3": "CDN (file attachments, media)"
}

# Cache key patterns
patterns = {
    "recent_messages": "conversation:{conv_id}:messages:recent",
    "presence": "presence:{user_id}",
    "unread_counts": "unread:{user_id}",
    "group_metadata": "group:{group_id}:metadata"
}
```
Step 5: Advanced Features (8 minutes)
Message Delivery Guarantees:
```python
class MessageDeliveryService:
    def __init__(self):
        self.retry_queue = {}
        self.delivery_timeout = 30  # seconds

    async def deliver_message(self, message):
        try:
            # Attempt immediate delivery
            success = await self.send_to_recipient(message)
            if success:
                await self.mark_delivered(message.id)
            else:
                # Queue for retry
                await self.queue_for_retry(message)
        except Exception as e:
            await self.handle_delivery_error(message, e)

    async def handle_offline_users(self, message):
        # Store in the offline message queue
        await self.store_offline_message(message)
        # Send a push notification
        await self.send_push_notification(message)
```
End-to-End Encryption:
- Client-side key generation and management
- Message encryption before sending
- Key exchange protocols for new conversations
- Forward secrecy with rotating keys
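A minimal sketch of the key-exchange-plus-encrypt flow, assuming the PyNaCl library (its Box combines X25519 key agreement with authenticated encryption); a real deployment would add identity verification and rotating session keys for forward secrecy:
```python
from nacl.public import PrivateKey, Box

# Each client generates its keypair locally; only public keys leave the device.
alice_sk = PrivateKey.generate()
bob_sk = PrivateKey.generate()

# Key exchange: each side derives the same shared secret from its own private
# key and the peer's public key (X25519 Diffie-Hellman).
alice_box = Box(alice_sk, bob_sk.public_key)
bob_box = Box(bob_sk, alice_sk.public_key)

ciphertext = alice_box.encrypt(b"hello, bob")  # a random nonce is prepended
assert bob_box.decrypt(ciphertext) == b"hello, bob"  # authenticated decryption
```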
Performance Optimizations:
```yaml
optimizations:
  - Message batching for group chats
  - Connection pooling and keep-alive
  - Message compression for large content
  - Smart presence updates (debouncing)
  - Read receipt aggregation
  - Lazy loading of message history
  - Image/file thumbnail generation
```
Self-Assessment Criteria
Total Score: ___/50 points
Problem 3: Distributed Cache System (Advanced)
Time Limit: 60 minutes
Problem Statement
Design a distributed caching system like Redis Cluster that can store 1TB of data across multiple nodes with high availability, automatic failover, and consistent hashing.
Requirements Analysis (10 minutes)
Functional Requirements:
- GET, SET, DELETE operations
- Data expiration (TTL)
- Data persistence options
- Atomic operations
- Pub/Sub functionality
- Support for various data types (strings, lists, sets, hashes)
Non-Functional Requirements:
- 1TB total storage capacity
- 10M operations per second
- Sub-millisecond latency (p99 < 1ms)
- 99.99% availability
- Automatic failover
- Horizontal scalability
Step-by-Step Solution
Step 1: Architecture Design (15 minutes)
```mermaid
graph TB
    Client[Client Applications] --> Proxy[Cache Proxy/Router]
    Proxy --> Node1[Cache Node 1]
    Proxy --> Node2[Cache Node 2]
    Proxy --> Node3[Cache Node 3]
    Proxy --> NodeN[Cache Node N]
    Node1 --> Replica1[Replica 1-1]
    Node2 --> Replica2[Replica 2-1]
    Node3 --> Replica3[Replica 3-1]
    ConfigMgr[Config Manager] --> Proxy
    ConfigMgr --> Node1
    ConfigMgr --> Node2
    ConfigMgr --> Node3
```
Step 2: Consistent Hashing Implementation (12 minutes)
```python
import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self, nodes=None, replicas=150):
        self.replicas = replicas  # virtual nodes per physical node
        self.ring = {}
        self.sorted_keys = []
        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            virtual_key = f"{node}:{i}"
            hash_key = self._hash(virtual_key)
            self.ring[hash_key] = node
            bisect.insort(self.sorted_keys, hash_key)

    def remove_node(self, node):
        for i in range(self.replicas):
            virtual_key = f"{node}:{i}"
            hash_key = self._hash(virtual_key)
            if hash_key in self.ring:
                del self.ring[hash_key]
                self.sorted_keys.remove(hash_key)

    def get_node(self, key):
        if not self.ring:
            return None
        hash_key = self._hash(key)
        idx = bisect.bisect_right(self.sorted_keys, hash_key)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring
        return self.ring[self.sorted_keys[idx]]
```
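Usage of the ring above (node names are illustrative):
```python
ring = ConsistentHashRing(nodes=["cache-1", "cache-2", "cache-3"])
print(ring.get_node("user:42"))  # deterministic owner for this key

# Adding a node remaps only ~1/N of the keyspace instead of reshuffling everything.
ring.add_node("cache-4")
```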
Step 3: Replication and Consistency (10 minutes)
```python
import asyncio

class ReplicationManager:
    def __init__(self, replication_factor=3):
        self.replication_factor = replication_factor
        self.hash_ring = ConsistentHashRing()

    def get_replica_nodes(self, key):
        """Get the primary and replica nodes for a key."""
        primary = self.hash_ring.get_node(key)
        replicas = []
        # Walk the next N-1 positions in the ring
        start_idx = self.hash_ring.sorted_keys.index(
            self.hash_ring._hash(f"{primary}:0")
        )
        for i in range(1, self.replication_factor):
            idx = (start_idx + i) % len(self.hash_ring.sorted_keys)
            replica_hash = self.hash_ring.sorted_keys[idx]
            replica_node = self.hash_ring.ring[replica_hash]
            if replica_node not in replicas and replica_node != primary:
                replicas.append(replica_node)
        return primary, replicas

    async def write_with_replication(self, key, value, ttl=None):
        primary, replicas = self.get_replica_nodes(key)
        # Write to the primary first
        success = await self.write_to_node(primary, key, value, ttl)
        if not success:
            raise Exception(f"Primary write failed for key {key}")
        # Replicate asynchronously
        tasks = []
        for replica in replicas:
            task = asyncio.create_task(
                self.write_to_node(replica, key, value, ttl)
            )
            tasks.append(task)
        # Wait for at least one replica to succeed
        successful_replicas = 0
        for task in asyncio.as_completed(tasks):
            try:
                await task
                successful_replicas += 1
            except Exception:
                pass  # log the error but continue
        return successful_replicas >= 1
```
Step 4: High Availability and Failover (8 minutes)
```python
import asyncio

class FailoverManager:
    def __init__(self):
        self.health_check_interval = 5  # seconds
        self.failed_nodes = set()
        self.node_health = {}
        self.all_nodes = []  # populated from the cluster config manager

    async def health_check_loop(self):
        while True:
            await asyncio.sleep(self.health_check_interval)
            await self.check_all_nodes()

    async def check_all_nodes(self):
        for node in self.all_nodes:
            try:
                # Simple ping/health check
                response = await self.ping_node(node)
                if response.status_code == 200:
                    self.mark_node_healthy(node)
                else:
                    self.mark_node_failed(node)
            except Exception:
                self.mark_node_failed(node)

    def mark_node_failed(self, node):
        if node not in self.failed_nodes:
            self.failed_nodes.add(node)
            # Trigger redistribution of the failed node's data
            asyncio.create_task(self.redistribute_data(node))
            # Update client routing tables
            self.update_routing_tables()

    async def redistribute_data(self, failed_node):
        """Redistribute data from a failed node to the remaining nodes."""
        # Get all keys owned by the failed node (read from its replicas)
        keys_to_redistribute = await self.get_keys_for_node(failed_node)
        for key in keys_to_redistribute:
            # Find the new primary and replicas on the updated ring
            new_primary, new_replicas = self.get_replica_nodes(key)
            # Copy the data from a surviving replica
            value = await self.read_from_replica(key)
            if value:
                await self.write_with_replication(key, value)
```
Step 5: Performance Optimization (8 minutes)
Memory Management:
```python
import heapq
import sys
import time
from collections import OrderedDict

class MemoryManager:
    def __init__(self, max_memory_gb=100):
        self.max_memory = max_memory_gb * 1024 * 1024 * 1024  # bytes
        self.current_memory = 0
        self.lru_cache = OrderedDict()
        self.ttl_heap = []  # min-heap of (expiry_time, key)

    def calculate_size(self, value):
        return sys.getsizeof(value)  # rough estimate; real caches track exact sizes

    def set_with_eviction(self, key, value, ttl=None):
        value_size = self.calculate_size(value)
        # Evict until the new value fits
        while self.current_memory + value_size > self.max_memory:
            self.evict_oldest()
        # Set the value
        self.lru_cache[key] = {
            'value': value,
            'size': value_size,
            'ttl': time.time() + ttl if ttl else None
        }
        self.current_memory += value_size
        # Schedule TTL expiration if needed
        if ttl:
            heapq.heappush(self.ttl_heap, (time.time() + ttl, key))

    def evict_oldest(self):
        if not self.lru_cache:
            return
        # Evicts the least-recently-inserted entry; a get() that calls
        # move_to_end() on hits would make this true LRU.
        key, data = self.lru_cache.popitem(last=False)
        self.current_memory -= data['size']
```
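The ttl_heap above is pushed to but never drained; a background sweep like this sketch (consistent with the class fields above) would evict expired entries:
```python
import heapq
import time

def expire_due_keys(self):
    """Pop expired entries off the min-heap; run periodically on a timer."""
    now = time.time()
    while self.ttl_heap and self.ttl_heap[0][0] <= now:
        _, key = heapq.heappop(self.ttl_heap)
        entry = self.lru_cache.get(key)
        # The key may have been evicted or rewritten since it was scheduled.
        if entry and entry['ttl'] and entry['ttl'] <= now:
            del self.lru_cache[key]
            self.current_memory -= entry['size']
```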
Network Optimization:
```yaml
optimizations:
  - Connection pooling and pipelining
  - Binary protocol (faster than text)
  - Batch operations support
  - Compression for large values
  - Zero-copy operations where possible
  - Async I/O with epoll/kqueue
  - NUMA-aware memory allocation
```
Self-Assessment Criteria
Total Score: ___/50 points
🚀 L7 System Design Problems
Problem 4: Serverless Computing Platform (Expert)
Time Limit: 90 minutes
Problem Statement
Design a serverless computing platform like AWS Lambda that can execute millions of functions per day across multiple regions with automatic scaling, multiple runtime support, and comprehensive billing.
Requirements Analysis (15 minutes)
Functional Requirements:
- Support multiple runtimes (Node.js, Python, Java, Go, .NET)
- Automatic scaling (0 to thousands of instances)
- Function deployment and versioning
- Event-driven triggers from various sources
- VPC support and security isolation
- Environment variables and configuration
- Monitoring and logging
Non-Functional Requirements:
- Handle millions of function executions per day
- Cold start times < 1 second for most runtimes
- Support functions from 128MB to 10GB memory
- 99.99% availability
- Multi-region deployment
- Pay-per-use billing model
- Enterprise security and compliance
Step-by-Step Solution
Step 1: Platform Architecture (25 minutes)
```mermaid
graph TB
    API[Lambda API] --> ControlPlane[Control Plane]
    EventSources[Event Sources] --> EventRouter[Event Router]
    ControlPlane --> FunctionRegistry[Function Registry]
    ControlPlane --> DeploymentService[Deployment Service]
    EventRouter --> InvokeService[Invoke Service]
    InvokeService --> WorkerPool[Worker Pool Manager]
    WorkerPool --> Worker1[Worker Node 1]
    WorkerPool --> Worker2[Worker Node 2]
    WorkerPool --> WorkerN[Worker Node N]
    Worker1 --> Container1[Container Pool]
    Worker2 --> Container2[Container Pool]
    ControlPlane --> Billing[Billing Service]
    ControlPlane --> Monitoring[Monitoring Service]
    FunctionRegistry --> CodeStore[Code Storage S3]
    Monitoring --> Logs[CloudWatch Logs]
```
Core Platform Components:
Control Plane Services:
```python
class LambdaControlPlane:
    def __init__(self):
        self.function_registry = FunctionRegistry()
        self.deployment_service = DeploymentService()
        self.billing_service = BillingService()
        self.monitoring = MonitoringService()

    async def create_function(self, function_config):
        # Validate function configuration
        await self.validate_config(function_config)
        # Store function metadata
        function_id = await self.function_registry.register(function_config)
        # Deploy function code
        await self.deployment_service.deploy(function_id, function_config)
        # Set up monitoring
        await self.monitoring.setup_function_monitoring(function_id)
        return function_id

    async def invoke_function(self, function_id, event, context):
        # Get function configuration
        config = await self.function_registry.get_config(function_id)
        # Route to the appropriate worker pool
        worker_pool = await self.select_worker_pool(config)
        # Execute the function
        result = await worker_pool.execute(function_id, event, context)
        # Record billing metrics
        await self.billing_service.record_execution(
            function_id, result.duration, result.memory_used
        )
        return result
```
Step 2: Container Management and Runtime (20 minutes)
Container Lifecycle Management:
```python
class ContainerManager:
    def __init__(self, docker_client):
        self.docker_client = docker_client  # injected async container runtime client
        self.container_pools = {}    # runtime -> container pool
        self.cold_start_cache = {}   # function_id -> prepared container
        self.warm_containers = {}    # function_id -> list of warm containers

    async def get_container(self, function_id, runtime, memory_limit):
        # Check for a warm container first
        if self.warm_containers.get(function_id):
            return self.warm_containers[function_id].pop()
        # Check the cold start cache
        if function_id in self.cold_start_cache:
            return self.cold_start_cache.pop(function_id)
        # Create a new container (cold start)
        return await self.create_container(function_id, runtime, memory_limit)

    async def create_container(self, function_id, runtime, memory_limit):
        # Create an isolated container environment
        container_config = {
            'image': f'lambda-runtime-{runtime}',
            'memory_limit': memory_limit,
            'cpu_limit': self.calculate_cpu_limit(memory_limit),
            'network_mode': 'none',  # isolated by default
            'security_opts': ['no-new-privileges'],
            'read_only': True,
            'tmpfs': {'/tmp': 'rw,size=512m'}
        }
        container = await self.docker_client.create_container(**container_config)
        # Load function code
        await self.load_function_code(container, function_id)
        # Initialize the runtime environment
        await self.initialize_runtime(container, runtime)
        return container

    async def warm_up_containers(self, function_id, runtime, memory_limit,
                                 expected_concurrency):
        """Pre-warm containers based on predicted load."""
        for _ in range(expected_concurrency):
            container = await self.create_container(function_id, runtime, memory_limit)
            self.warm_containers.setdefault(function_id, []).append(container)
```
Runtime Environment Setup:
Python |
---|
| class RuntimeManager:
def __init__(self):
self.runtime_configs = {
'nodejs18.x': {
'base_image': 'public.ecr.aws/lambda/nodejs:18',
'init_command': 'npm install',
'handler_format': 'index.handler'
},
'python3.9': {
'base_image': 'public.ecr.aws/lambda/python:3.9',
'init_command': 'pip install -r requirements.txt',
'handler_format': 'lambda_function.lambda_handler'
},
'java11': {
'base_image': 'public.ecr.aws/lambda/java:11',
'init_command': 'mvn package',
'handler_format': 'com.example.Handler::handleRequest'
}
}
async def setup_runtime_environment(self, container, runtime, function_code):
config = self.runtime_configs[runtime]
# Copy function code to container
await self.copy_code_to_container(container, function_code)
# Install dependencies
if config['init_command']:
await self.run_in_container(container, config['init_command'])
# Set up runtime-specific environment
await self.configure_runtime(container, runtime)
# Prepare function handler
await self.prepare_handler(container, config['handler_format'])
|
Step 3: Auto-Scaling and Load Management (15 minutes)
Scaling Algorithm:
Python |
---|
| class AutoScaler:
def __init__(self):
self.scaling_metrics = {}
self.scaling_policies = {}
self.current_capacity = {}
async def scale_function(self, function_id):
metrics = await self.get_function_metrics(function_id)
# Calculate required capacity
required_capacity = self.calculate_required_capacity(metrics)
current_capacity = self.current_capacity.get(function_id, 0)
if required_capacity > current_capacity:
# Scale up
scale_up_count = min(
required_capacity - current_capacity,
self.max_scale_up_rate
)
await self.provision_containers(function_id, scale_up_count)
elif required_capacity < current_capacity * 0.7: # Scale down threshold
# Scale down
scale_down_count = current_capacity - required_capacity
await self.terminate_containers(function_id, scale_down_count)
def calculate_required_capacity(self, metrics):
# Factors to consider:
# - Current request rate
# - Request queue length
# - Average execution time
# - Cold start frequency
# - Predicted traffic patterns
request_rate = metrics['requests_per_second']
avg_execution_time = metrics['avg_execution_time_ms'] / 1000
queue_length = metrics['queue_length']
# Basic capacity calculation
base_capacity = request_rate * avg_execution_time
# Add buffer for queue and variability
buffer_capacity = queue_length * 0.5
# Add cold start buffer
cold_start_buffer = base_capacity * 0.2
return int(base_capacity + buffer_capacity + cold_start_buffer)
|
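Plugging illustrative numbers into calculate_required_capacity shows the shape of the formula:
```python
metrics = {
    "requests_per_second": 1000,
    "avg_execution_time_ms": 200,  # 0.2s per invocation
    "queue_length": 50,
}
# base     = 1000 * 0.2 = 200 concurrent executors (Little's law)
# buffer   = 50 * 0.5   = 25
# cold     = 200 * 0.2  = 40
# required = int(200 + 25 + 40) = 265 containers
```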
Step 4: Billing and Metering (10 minutes)
```python
import time

class BillingService:
    def __init__(self):
        self.billing_rates = {
            'requests': 0.0000002,             # $0.20 per 1M requests
            'compute_gb_second': 0.0000166667  # per GB-second
        }
        self.usage_aggregator = UsageAggregator()

    async def record_execution(self, function_id, duration_ms, memory_mb):
        # Calculate billable units
        duration_seconds = duration_ms / 1000
        memory_gb = memory_mb / 1024
        compute_gb_seconds = memory_gb * duration_seconds
        # Record usage metrics
        usage_record = {
            'function_id': function_id,
            'timestamp': time.time(),
            'duration_ms': duration_ms,
            'memory_mb': memory_mb,
            'compute_gb_seconds': compute_gb_seconds,
            'billable_requests': 1
        }
        await self.usage_aggregator.record(usage_record)

    async def calculate_monthly_bill(self, account_id, month):
        usage_data = await self.usage_aggregator.get_monthly_usage(
            account_id, month
        )
        total_requests = usage_data['total_requests']
        total_compute_gb_seconds = usage_data['total_compute_gb_seconds']
        request_charges = total_requests * self.billing_rates['requests']
        compute_charges = total_compute_gb_seconds * self.billing_rates['compute_gb_second']
        return {
            'request_charges': request_charges,
            'compute_charges': compute_charges,
            'total': request_charges + compute_charges
        }
```
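A quick sanity check of the rates above with hypothetical monthly usage:
```python
requests = 10_000_000             # invocations per month
duration_s, memory_gb = 0.3, 0.5  # 300ms at 512MB

compute_gb_seconds = requests * memory_gb * duration_s  # 1,500,000 GB-s
compute_charge = compute_gb_seconds * 0.0000166667      # ~$25.00
request_charge = requests * 0.0000002                   # $2.00
total = compute_charge + request_charge                 # ~$27.00/month
```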
Step 5: Monitoring and Operations (5 minutes)
```yaml
# Key platform metrics
platform_metrics:
  - total_executions_per_second
  - cold_start_rate
  - error_rate_by_runtime
  - average_execution_duration
  - container_utilization
  - queue_depth_by_function
  - billing_accuracy_rate

# Operational dashboards
dashboards:
  - Platform Overview: High-level metrics
  - Function Performance: Per-function metrics
  - Resource Utilization: Container and compute usage
  - Billing Analytics: Cost breakdown and trends
  - Error Analysis: Error patterns and debugging
```
Self-Assessment Criteria
Total Score: ___/50 points
Problem 5: Machine Learning Platform (Expert)
Time Limit: 90 minutes
Problem Statement
Design a comprehensive machine learning platform that supports the full ML lifecycle: data ingestion, model training, deployment, monitoring, and A/B testing for hundreds of ML teams.
Requirements Analysis (15 minutes)
Functional Requirements:
- Data pipeline orchestration
- Distributed model training (CPU/GPU)
- Model versioning and registry
- Model deployment and serving
- A/B testing and experimentation
- Feature store management
- Model monitoring and drift detection
- Multi-framework support (TensorFlow, PyTorch, XGBoost)
Non-Functional Requirements:
- Support 500+ ML engineers
- Handle petabytes of training data
- Serve millions of predictions/second
- Sub-100ms model inference latency
- 99.9% uptime for model serving
- Global deployment across regions
- Auto-scaling for training and serving
Step-by-Step Solution
Step 1: Platform Architecture (25 minutes)
```mermaid
graph TB
    DataSources[Data Sources] --> DataPipeline[Data Pipeline]
    DataPipeline --> FeatureStore[Feature Store]
    MLWorkflow[ML Workflow Engine] --> Training[Training Service]
    MLWorkflow --> ModelRegistry[Model Registry]
    Training --> ComputeCluster[Compute Cluster]
    ComputeCluster --> GPU[GPU Nodes]
    ComputeCluster --> CPU[CPU Nodes]
    ModelRegistry --> Serving[Model Serving]
    Serving --> LoadBalancer[Load Balancer]
    LoadBalancer --> ServingNodes[Serving Nodes]
    ExperimentService[Experiment Service] --> ABTesting[A/B Testing]
    Monitoring[ML Monitoring] --> DriftDetection[Drift Detection]
    MetadataStore[Metadata Store] --> MLWorkflow
    MetadataStore --> Training
    MetadataStore --> Serving
```
Platform Components:
ML Workflow Engine:
```python
class MLWorkflowEngine:
    def __init__(self):
        self.workflow_scheduler = WorkflowScheduler()
        self.resource_manager = ResourceManager()
        self.metadata_store = MetadataStore()
        self.model_registry = ModelRegistry()
        self.serving_engine = ServingEngine()

    async def create_training_job(self, job_config):
        # Validate job configuration
        await self.validate_job_config(job_config)
        # Allocate resources
        resources = await self.resource_manager.allocate(
            job_config['resource_requirements']
        )
        # Create the training pipeline
        pipeline = TrainingPipeline(
            job_config=job_config,
            resources=resources,
            metadata_store=self.metadata_store
        )
        # Schedule execution
        job_id = await self.workflow_scheduler.schedule(pipeline)
        return job_id

    async def deploy_model(self, model_id, deployment_config):
        # Get the model from the registry
        model = await self.model_registry.get_model(model_id)
        # Create the deployment
        deployment = ModelDeployment(
            model=model,
            config=deployment_config,
            serving_engine=self.serving_engine
        )
        # Deploy with a blue-green strategy
        deployment_id = await deployment.deploy_blue_green()
        return deployment_id
```
Step 2: Distributed Training System (20 minutes)
Training Job Management:
Python |
---|
| class DistributedTrainingManager:
def __init__(self):
self.job_queue = JobQueue()
self.cluster_manager = ClusterManager()
self.storage_manager = StorageManager()
async def submit_training_job(self, job_spec):
# Analyze resource requirements
resource_needs = await self.analyze_resource_needs(job_spec)
# Queue job with priority
job_id = await self.job_queue.enqueue(job_spec, resource_needs)
# Try immediate scheduling if resources available
if await self.cluster_manager.has_available_resources(resource_needs):
await self.schedule_job_immediately(job_id)
return job_id
async def schedule_job_immediately(self, job_id):
job_spec = await self.job_queue.get_job(job_id)
# Allocate compute nodes
nodes = await self.cluster_manager.allocate_nodes(
job_spec['resource_requirements']
)
# Set up distributed training
if job_spec['framework'] == 'tensorflow':
await self.setup_tensorflow_distributed(job_spec, nodes)
elif job_spec['framework'] == 'pytorch':
await self.setup_pytorch_distributed(job_spec, nodes)
# Start training
await self.start_training_job(job_id, nodes)
async def setup_pytorch_distributed(self, job_spec, nodes):
# Set up PyTorch distributed training
master_node = nodes[0]
worker_nodes = nodes[1:]
# Configure distributed settings
dist_config = {
'backend': 'nccl' if job_spec['use_gpu'] else 'gloo',
'init_method': f'tcp://{master_node.ip}:23456',
'rank': 0,
'world_size': len(nodes)
}
# Deploy training script to all nodes
for i, node in enumerate(nodes):
node_config = dist_config.copy()
node_config['rank'] = i
await self.deploy_training_script(node, job_spec, node_config)
|
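On each node, the deployed training script initializes the process group from that per-node config. A minimal entry point assuming PyTorch's torch.distributed (device placement and the data loader are elided):
```python
import torch
import torch.distributed as dist

def train_entry(node_config, model):
    # node_config comes from setup_pytorch_distributed above
    dist.init_process_group(
        backend=node_config["backend"],          # nccl for GPU, gloo for CPU
        init_method=node_config["init_method"],  # tcp://<master_ip>:23456
        rank=node_config["rank"],
        world_size=node_config["world_size"],
    )
    # With nccl, move the model to this rank's GPU before wrapping.
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)
    # ...standard training loop; gradients are all-reduced across ranks
    return ddp_model
```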
Auto-scaling for Training:
```python
class TrainingAutoScaler:
    def __init__(self):
        self.scaling_policies = {}
        self.default_policy = {}  # fallback policy for unknown job types
        self.node_pool = NodePool()

    async def scale_training_cluster(self, job_type, current_queue_length):
        policy = self.scaling_policies.get(job_type, self.default_policy)
        # Calculate the nodes required to drain the queue
        required_nodes = self.calculate_required_nodes(
            current_queue_length, policy
        )
        current_nodes = await self.node_pool.get_active_nodes(job_type)
        if required_nodes > len(current_nodes):
            # Scale up
            additional_nodes = required_nodes - len(current_nodes)
            await self.node_pool.add_nodes(job_type, additional_nodes)
        elif required_nodes < len(current_nodes) * 0.7:
            # Scale down
            excess_nodes = len(current_nodes) - required_nodes
            await self.node_pool.remove_idle_nodes(job_type, excess_nodes)
```
Step 3: Model Serving and Inference (15 minutes)
High-Performance Model Serving:
```python
import asyncio
import time

class ModelServingEngine:
    def __init__(self):
        self.model_cache = ModelCache()
        self.load_balancer = LoadBalancer()
        self.batch_processor = BatchProcessor()

    async def serve_prediction(self, model_id, input_data):
        # Get the model from cache, loading it on a miss
        model = await self.model_cache.get_model(model_id)
        # Batch if the model supports it
        if model.supports_batching:
            return await self.batch_processor.add_to_batch(
                model_id, input_data
            )
        else:
            # Single prediction
            return await self.predict_single(model, input_data)

    async def predict_single(self, model, input_data):
        # Preprocess input
        processed_input = await model.preprocess(input_data)
        # Run inference
        start_time = time.time()
        prediction = await model.predict(processed_input)
        inference_time = time.time() - start_time
        # Postprocess output
        result = await model.postprocess(prediction)
        # Record metrics
        await self.record_prediction_metrics(
            model.id, inference_time, len(input_data)
        )
        return result

class BatchProcessor:
    def __init__(self, batch_size=32, max_wait_time=0.01):
        self.batch_size = batch_size
        self.max_wait_time = max_wait_time  # seconds; 10ms keeps latency low
        self.pending_batches = {}

    async def add_to_batch(self, model_id, input_data):
        if model_id not in self.pending_batches:
            self.pending_batches[model_id] = {
                'inputs': [],
                'futures': [],
                'created_at': time.time()
            }
        # Create a future this caller can await
        future = asyncio.Future()
        batch = self.pending_batches[model_id]
        batch['inputs'].append(input_data)
        batch['futures'].append(future)
        # Flush the batch when full or when the wait deadline passes
        if (len(batch['inputs']) >= self.batch_size or
                time.time() - batch['created_at'] > self.max_wait_time):
            asyncio.create_task(self.process_batch(model_id))
        return await future
```
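process_batch is referenced above but not shown; a minimal sketch consistent with the class (it assumes the processor can reach the model cache and that models expose a batch predict call; both are assumptions):
```python
async def process_batch(self, model_id):
    batch = self.pending_batches.pop(model_id, None)
    if not batch:
        return  # already flushed by a concurrent caller
    model = await self.model_cache.get_model(model_id)
    # One forward pass over the whole batch amortizes per-call overhead.
    predictions = await model.predict_batch(batch['inputs'])
    for future, prediction in zip(batch['futures'], predictions):
        if not future.done():
            future.set_result(prediction)
```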
Step 4: Feature Store and Data Management (10 minutes)
Feature Store Implementation:
```python
class DataQualityException(Exception):
    pass

class FeatureStore:
    def __init__(self):
        self.online_store = OnlineFeatureStore()    # low-latency serving
        self.offline_store = OfflineFeatureStore()  # batch processing
        self.feature_registry = FeatureRegistry()

    async def register_feature_group(self, feature_group_spec):
        # Validate feature definitions
        await self.validate_feature_group(feature_group_spec)
        # Register in the feature registry
        group_id = await self.feature_registry.register(feature_group_spec)
        # Set up data pipelines
        await self.setup_feature_pipelines(group_id, feature_group_spec)
        return group_id

    async def get_features(self, feature_names, entity_id):
        # Fetch features for real-time serving
        features = {}
        for feature_name in feature_names:
            feature_value = await self.online_store.get_feature(
                feature_name, entity_id
            )
            features[feature_name] = feature_value
        return features

    async def get_historical_features(self, feature_names, time_range):
        # Fetch point-in-time features for training data
        return await self.offline_store.get_features(
            feature_names, time_range
        )

class FeaturePipeline:
    def __init__(self):
        self.data_processor = DataProcessor()
        self.quality_checker = DataQualityChecker()

    async def process_feature_updates(self, raw_data):
        # Data quality checks
        quality_report = await self.quality_checker.validate(raw_data)
        if not quality_report.is_valid:
            raise DataQualityException(quality_report.errors)
        # Transform the data
        processed_features = await self.data_processor.transform(raw_data)
        # Update both online and offline stores
        await self.update_stores(processed_features)
```
Step 5: ML Monitoring and Experimentation (5 minutes)
```python
class MLMonitoringSystem:
    def __init__(self):
        self.drift_detector = DriftDetector()
        self.performance_monitor = PerformanceMonitor()
        self.alert_manager = AlertManager()
        self.drift_threshold = 0.2    # illustrative thresholds
        self.quality_threshold = 0.9

    async def monitor_model_performance(self, model_id):
        # Get recent predictions and ground truth
        recent_data = await self.get_recent_model_data(model_id)
        # Check for data drift
        drift_score = await self.drift_detector.detect_drift(
            model_id, recent_data
        )
        if drift_score > self.drift_threshold:
            await self.alert_manager.send_drift_alert(model_id, drift_score)
        # Monitor prediction quality
        quality_metrics = await self.performance_monitor.calculate_metrics(
            recent_data
        )
        if quality_metrics['accuracy'] < self.quality_threshold:
            await self.alert_manager.send_quality_alert(
                model_id, quality_metrics
            )
```
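The original does not specify a drift method; one common choice for numeric features is a two-sample Kolmogorov-Smirnov test between the training and recent serving distributions, sketched here with scipy:
```python
from scipy.stats import ks_2samp

def drift_score(training_values, recent_values):
    """KS statistic in [0, 1]; larger means the distributions diverge more."""
    statistic, _p_value = ks_2samp(training_values, recent_values)
    return statistic

# e.g. flag drift when drift_score(...) > 0.2 for any monitored feature
```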
Self-Assessment Criteria
Total Score: ___/50 points
🎯 Evaluation Criteria and Self-Scoring Rubrics
L6 System Design Evaluation Framework
Technical Competency Assessment (1-5 Scale)
Architecture Design (25% of total score):
- 5 - Expert: Innovative, scalable architecture with deep trade-off analysis
- 4 - Proficient: Solid architecture addressing most requirements with good trade-offs
- 3 - Competent: Basic working solution with reasonable component design
- 2 - Developing: Architecture has gaps or doesn't address key requirements
- 1 - Inadequate: Fundamental architectural flaws or missing components
Scalability Considerations (25% of total score):
- 5: Handles massive scale (billions of operations) with sophisticated optimization
- 4: Addresses high scale (millions to billions) with appropriate strategies
- 3: Considers moderate scale with basic scaling approaches
- 2: Limited scalability planning or unrealistic assumptions
- 1: No scalability considerations or doesn't work at stated scale
Database and Storage Design (20% of total score):
- 5: Optimal data modeling with advanced partitioning and consistency strategies
- 4: Good schema design with appropriate database choices and indexing
- 3: Functional database design meeting basic requirements
- 2: Database design has performance issues or missing considerations
- 1: Poor data modeling or inappropriate technology choices
System Integration and APIs (15% of total score):
- 5: Elegant API design with comprehensive integration patterns
- 4: Good API design with appropriate integration approaches
- 3: Basic APIs that meet functional requirements
- 2: API design has usability or integration issues
- 1: Poor API design or missing integration considerations
Operational Concerns (15% of total score):
- 5: Comprehensive operational strategy including monitoring, alerting, deployment
- 4: Good operational planning with key concerns addressed
- 3: Basic operational considerations covered
- 2: Limited operational planning
- 1: No operational considerations
L7 System Design Evaluation Framework
Platform Architecture (30% of total score):
- 5: Innovative platform design enabling multiple use cases and teams
- 4: Solid platform architecture with good abstraction and extensibility
- 3: Platform meets requirements with reasonable architecture
- 2: Limited platform thinking or extensibility concerns
- 1: Not truly platform-level thinking
Multi-Tenant Considerations (20% of total score):
- 5: Sophisticated multi-tenancy with isolation, security, and resource management
- 4: Good multi-tenant design with key concerns addressed
- 3: Basic multi-tenancy support
- 2: Limited multi-tenant considerations
- 1: No multi-tenancy planning
Organizational Impact (20% of total score):
- 5: Platform enables hundreds of teams with clear governance and standards
- 4: Supports multiple teams with good coordination mechanisms
- 3: Designed for team collaboration
- 2: Limited organizational considerations
- 1: Single team focus
Technology Strategy (15% of total score):
- 5: Forward-looking technology choices with industry impact potential
- 4: Strategic technology decisions aligned with long-term goals
- 3: Appropriate technology choices for requirements
- 2: Limited strategic technology thinking
- 1: Poor or outdated technology choices
Business Impact (15% of total score):
- 5: Clear business value with quantified impact and ROI considerations
- 4: Good business alignment with impact awareness
- 3: Understands business requirements
- 2: Limited business consideration
- 1: No business impact awareness
Time Management Assessment
L6 Problems (60 minutes)
Expected Time Distribution:
- Requirements and Scale (8 minutes) - 13%
- High-Level Architecture (15 minutes) - 25%
- Detailed Design (20 minutes) - 33%
- Scaling and Optimization (12 minutes) - 20%
- Operations and Trade-offs (5 minutes) - 9%

Scoring:
- Finished within time limit: Full points
- 10% over time: -5 points
- 20% over time: -10 points
- Did not complete: Score based on completed sections only
L7 Problems (90 minutes)
Expected Time Distribution:
- Requirements and Strategy (15 minutes) - 17%
- Platform Architecture (25 minutes) - 28%
- Core System Design (25 minutes) - 28%
- Advanced Features (15 minutes) - 17%
- Business and Operations (10 minutes) - 11%

Scoring:
- Finished within time limit: Full points
- 10% over time: -3 points
- 20% over time: -7 points
- Did not complete: Score based on completed sections only
Communication and Explanation Assessment
Clarity and Structure (1-5 Scale)
- 5: Crystal clear explanations with logical flow and excellent use of diagrams
- 4: Good explanations that are easy to follow with appropriate visual aids
- 3: Clear enough explanations with basic structure
- 2: Some clarity issues or poor structure
- 1: Confusing or hard to follow explanations
Technical Depth vs Breadth Balance
- 5: Perfect balance of high-level architecture and detailed implementation
- 4: Good balance with appropriate level of detail
- 3: Adequate balance for the audience
- 2: Too much detail in some areas, not enough in others
- 1: Poor balance, either too shallow or too deep throughout
Trade-off Analysis and Decision Making
- 5: Sophisticated analysis of trade-offs with clear decision criteria
- 4: Good trade-off analysis with justified decisions
- 3: Basic trade-off consideration
- 2: Limited trade-off analysis
- 1: No trade-off consideration or poor decisions
📈 Practice Progression and Difficulty Scaling
Beginner Level Practice (Months 1-2)
Focus: Building foundational system design skills
Recommended Problems:
1. URL Shortener (45 min)
2. Pastebin (45 min)
3. Basic Chat Application (60 min)
4. Simple Social Media Feed (60 min)
5. Basic File Storage System (60 min)
Success Criteria:
- Complete problems within time limits
- Address all functional requirements
- Basic understanding of scaling concepts
- Clear communication of design decisions
Intermediate Level Practice (Months 3-4)
Focus: Handling complexity and scale
Recommended Problems:
1. Twitter-like Social Network (60 min)
2. Ride Sharing System (60 min)
3. Video Streaming Platform (75 min)
4. E-commerce Platform (75 min)
5. Distributed Cache System (75 min)
Success Criteria:
- Handle millions to billions of scale
- Consider multiple scaling dimensions
- Address consistency and availability trade-offs
- Include operational and monitoring aspects
Advanced Level Practice (Months 5-6)
Focus: Platform-level thinking and innovation
Recommended L7 Problems:
1. AWS Lambda Platform (90 min)
2. Container Orchestration System (90 min)
3. Machine Learning Platform (90 min)
4. Data Processing Platform (90 min)
5. Developer Platform (90 min)
Success Criteria:
- Platform-level architecture thinking
- Multi-tenant and organizational considerations
- Advanced scaling and optimization techniques
- Business and strategic impact awareness
🚀 Getting Started with System Design Practice
Week 1: Foundation Building
Day 1-2: Study system design fundamentals
- Review scaling concepts
- Learn about common architectural patterns
- Understand database scaling strategies
Day 3-4: First practice problems
- Complete URL Shortener problem (untimed)
- Focus on understanding over speed
- Review solution and identify gaps
Day 5-7: Timed practice introduction
- Attempt Pastebin problem with 45-minute timer
- Practice explaining design decisions out loud
- Self-assess using provided rubrics
Week 2-3: Building Competency
Week 2: Complete 3 L6 beginner problems with timing
Week 3: Move to intermediate problems, focus on scaling
Month 2: Developing Expertise
Weeks 4-7: Work through all intermediate problems
Week 8: First attempt at L7 problem (if targeting L7 role)
Ongoing Practice Routine
Weekly Schedule:
- Monday: New problem practice (60-90 minutes)
- Wednesday: Review previous solutions, identify patterns (30 minutes)
- Friday: Mock interview with peer using these problems (60 minutes)
- Weekend: Study advanced concepts, read case studies (60 minutes)
Start with the URL Shortener problem and progress through the difficulty levels. Use our Mock Interview Framework to practice these problems under realistic interview conditions.
Continue to: Behavioral Question Bank →