Amazon-Scale System Design Problems for L6/L7 Interviews
Overview
This guide presents real-world system design problems at Amazon scale, focusing on the complexity and trade-offs expected in L6/L7 engineering manager interviews. Each problem includes actual Amazon constraints, AWS service integration, and leadership considerations.
Problem Complexity by Level
L6 System Design Focus
- Scale: Millions of users, regional scope
- Complexity: 5-10 services, single region optimization
- Leadership: Technical decision trade-offs, team coordination
- Timeline: Quarterly delivery with annual vision
L7 System Design Focus
- Scale: Hundreds of millions of users, global scope
- Complexity: 20+ services, multi-region architecture
- Leadership: Organizational capability, industry standards
- Timeline: Multi-year platform evolution
Problem 1: Design Amazon's Black Friday Flash Sale System
Problem Statement
Design a flash sale system for Amazon that can handle Black Friday/Cyber Monday traffic. The system needs to manage limited inventory items (like PS5, latest iPhone) that sell out in seconds while maintaining fairness and preventing bot attacks.
Real Amazon Constraints
- Peak Traffic: 140,000 requests/second during lightning deals
- Inventory: 10,000 units selling out in <30 seconds
- Global Distribution: 20+ countries simultaneously
- Fairness: One item per customer enforced
- Bot Prevention: 65% of traffic from automated systems
L6 Approach: Regional Flash Sale System
Architecture Components
Text Only |
---|
| ┌─────────────────────────────────────────────────────────────┐
│ CloudFront (CDN) │
│ Rate Limiting: 100 req/min/IP │
└─────────────────┬───────────────────────────────────────────┘
│
┌─────────────────▼───────────────────────────────────────────┐
│ API Gateway + WAF │
│ - Bot detection rules │
│ - Geographic restrictions │
└─────────────────┬───────────────────────────────────────────┘
│
┌─────────────────▼───────────────────────────────────────────┐
│ Application Load Balancer │
│ Auto-scaling: 50-500 instances │
└─────────┬───────────────────────────┬───────────────────────┘
│ │
┌─────────▼──────────┐ ┌────────▼──────────────┐
│ Queue Service │ │ Inventory Service │
│ (ECS Fargate) │ │ (ECS Fargate) │
│ │ │ │
│ - Fair queuing │ │ - Atomic operations │
│ - Position update │ │ - Stock reservation │
└─────────┬──────────┘ └────────┬──────────────┘
│ │
┌─────────▼──────────────────────────▼───────────────┐
│ DynamoDB (Inventory Store) │
│ - Conditional writes │
│ - Atomic counters │
│ - Global secondary indexes │
└─────────────────────────────────────────────────────┘
|
Key Design Decisions
1. Virtual Queue System
Python |
---|
| # Queue entry with fair positioning
def enter_queue(customer_id, item_id):
# Check eligibility (one per customer)
if dynamodb.get_item(Key={'customer_id': customer_id, 'item_id': item_id}):
return {'error': 'Already in queue or purchased'}
# Assign queue position using atomic counter
position = dynamodb.update_item(
Key={'item_id': item_id},
UpdateExpression='ADD queue_counter :inc',
ExpressionAttributeValues={':inc': 1},
ReturnValues='UPDATED_NEW'
)
# Create queue entry with TTL
queue_entry = {
'customer_id': customer_id,
'item_id': item_id,
'position': position,
'timestamp': time.now(),
'ttl': time.now() + 300, # 5-minute reservation window
'status': 'waiting'
}
# Store in DynamoDB with conditional put
dynamodb.put_item(
Item=queue_entry,
ConditionExpression='attribute_not_exists(customer_id)'
)
|
2. Inventory Management
Python |
---|
| # Atomic inventory decrement with reservation
def reserve_inventory(customer_id, item_id):
try:
response = dynamodb.update_item(
Key={'item_id': item_id},
UpdateExpression='SET available_stock = available_stock - :dec',
ConditionExpression='available_stock > :zero',
ExpressionAttributeValues={
':dec': 1,
':zero': 0
}
)
# Create reservation record
create_reservation(customer_id, item_id)
# Send to payment processing
sqs.send_message(
QueueUrl=PAYMENT_QUEUE,
MessageBody=json.dumps({
'customer_id': customer_id,
'item_id': item_id,
'reservation_id': reservation_id
})
)
except ConditionalCheckFailedException:
return {'error': 'Out of stock'}
|
3. Bot Prevention
- AWS WAF rules with rate limiting per IP
- CAPTCHA challenges for suspicious behavior patterns
- Device fingerprinting using AWS Fraud Detector
- Behavioral analysis: mouse movement, click patterns
Scalability Considerations
- Auto-scaling: ECS Fargate scales from 50 to 500 tasks in 90 seconds
- Database: DynamoDB auto-scaling with 40,000 RCU/WCU burst capacity
- Caching: ElastiCache for session management and queue positions
- Queue Processing: SQS with parallel consumers for payment processing
Monitoring & Metrics
- Business Metrics: Sales velocity, conversion rate, fairness index
- Technical Metrics: P50/P99 latency, error rates, scaling events
- Security Metrics: Bot detection rate, fraud attempts blocked
Enhanced Architecture for Global Scale
Text Only |
---|
| ┌────────────────────────────────────────────────────────────────┐
│ Global Traffic Management │
│ Route 53 with Geolocation │
└────────┬──────────────────┬──────────────────┬────────────────┘
│ │ │
┌────▼─────┐ ┌────▼─────┐ ┌────▼─────┐
│ US-EAST │ │ EU-WEST │ │ AP-SOUTH │
│ Region │ │ Region │ │ Region │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
┌────────▼──────────────────▼──────────────────▼────────────────┐
│ Global Inventory Coordination Layer │
│ (Custom service on EKS with Istio) │
│ │
│ - Distributed consensus (Raft) │
│ - Global inventory allocation │
│ - Cross-region replication │
└────────┬───────────────────────────────────────────────────────┘
│
┌────────▼───────────────────────────────────────────────────────┐
│ DynamoDB Global Tables │
│ with custom conflict resolution │
└─────────────────────────────────────────────────────────────────┘
|
Advanced Features for L7 Scale
1. Global Inventory Coordination
Python |
---|
| class GlobalInventoryCoordinator:
def __init__(self):
self.regions = ['us-east-1', 'eu-west-1', 'ap-south-1']
self.raft_cluster = RaftConsensus(nodes=self.regions)
def allocate_inventory(self, item_id, total_inventory):
"""Distribute inventory across regions based on demand prediction"""
# Get demand forecast per region
demand_forecast = self.get_demand_forecast(item_id)
# Calculate optimal allocation
allocations = {}
for region in self.regions:
weight = demand_forecast[region] / sum(demand_forecast.values())
allocations[region] = int(total_inventory * weight)
# Handle remainder with round-robin
remainder = total_inventory - sum(allocations.values())
for i in range(remainder):
region = self.regions[i % len(self.regions)]
allocations[region] += 1
# Achieve consensus across regions
if self.raft_cluster.propose_allocation(allocations):
self.apply_allocation(allocations)
return allocations
def handle_stockout(self, region, item_id):
"""Redistribute inventory when a region sells out"""
# Check other regions for available stock
available = self.check_global_inventory(item_id)
if available > 0:
# Initiate cross-region transfer
source_region = self.find_donor_region(item_id)
transfer_amount = min(available, 100) # Transfer in batches
self.initiate_transfer(
from_region=source_region,
to_region=region,
item_id=item_id,
amount=transfer_amount
)
|
2. Machine Learning for Fraud Detection
Python |
---|
| class FraudDetectionPipeline:
def __init__(self):
self.model = self.load_sagemaker_model()
self.feature_store = FeatureStore()
def score_transaction(self, customer_data):
features = self.extract_features(customer_data)
# Real-time features
features.update({
'velocity_1min': self.get_velocity(customer_data['ip'], 60),
'device_trust_score': self.get_device_score(customer_data['device_id']),
'behavioral_score': self.analyze_behavior(customer_data['session'])
})
# Get prediction from SageMaker endpoint
fraud_score = self.model.predict(features)
if fraud_score > 0.8:
# High risk - require additional verification
return 'CHALLENGE'
elif fraud_score > 0.5:
# Medium risk - add to queue with delay
return 'DELAY'
else:
# Low risk - proceed normally
return 'ALLOW'
|
3. Global State Synchronization
YAML |
---|
| # Kubernetes StatefulSet for distributed coordination
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: inventory-coordinator
spec:
serviceName: inventory-coordinator
replicas: 5
template:
spec:
containers:
- name: coordinator
image: inventory-coordinator:latest
env:
- name: RAFT_CLUSTER_SIZE
value: "5"
- name: CONSENSUS_TIMEOUT
value: "500ms"
volumeMounts:
- name: data
mountPath: /var/lib/coordinator
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
|
Leadership Considerations for L7
Organizational Alignment:
- Coordinate with 15+ teams across fulfillment, payments, fraud
- Establish SLAs with each team for flash sale events
- Create war room protocols for major sale events
Capacity Planning:
Python |
---|
| def capacity_planning_model():
# Historical data analysis
peak_traffic = analyze_historical_peaks()
growth_rate = calculate_yoy_growth()
# Predict next year's requirements
predicted_peak = peak_traffic * (1 + growth_rate) * 1.5 # 50% buffer
# Resource requirements
compute_instances = predicted_peak / 1000 # 1000 RPS per instance
dynamodb_capacity = predicted_peak * 2 # 2x for burst
# Cost analysis
annual_cost = calculate_infrastructure_cost(
instances=compute_instances,
dynamodb_wcu=dynamodb_capacity,
data_transfer=predicted_peak * 100 # 100KB per request
)
return {
'predicted_peak_rps': predicted_peak,
'required_instances': compute_instances,
'dynamodb_capacity': dynamodb_capacity,
'annual_cost': annual_cost,
'cost_per_transaction': annual_cost / total_transactions
}
|
Problem 2: Design Amazon's Recommendation System
Problem Statement
Design a recommendation system that powers "Customers who bought this also bought" and "Recommended for you" features across Amazon's entire catalog of 350M+ products.
Real Amazon Constraints
- Catalog Size: 350M+ products across 50+ categories
- User Base: 300M+ active users globally
- Latency Requirements: <100ms for recommendations
- Personalization: Real-time updates based on browsing behavior
- Scale: 1M+ recommendation requests per second
L6 Solution: Category-Specific Recommendation Engine
[Content continues with detailed architecture...]
[Content continues with advanced ML pipelines...]
Problem 3: Design Amazon Prime Video's Live Sports Streaming
Problem Statement
Design a system to stream Thursday Night Football to 15M concurrent viewers with <3 second latency and 4K quality, including real-time stats and multiple camera angles.
Real Amazon Constraints
- Concurrent Viewers: 15M peak during prime games
- Video Quality: 4K HDR at 60fps (25 Mbps bitrate)
- Latency: <3 seconds end-to-end
- Global Reach: Available in 200+ countries
- Interactive Features: Real-time stats, replays, multiple angles
[Content continues with detailed solutions...]
Interview Success Framework
How to Approach These Problems
For L6 Candidates
- Start with requirements clarification (5 minutes)
- Estimate scale and constraints (5 minutes)
- Design high-level architecture (15 minutes)
- Deep dive into 2-3 critical components (15 minutes)
- Discuss trade-offs and optimizations (10 minutes)
- Address operational concerns (5 minutes)
- Leave time for questions (5 minutes)
For L7 Candidates
- Business context and strategic implications (5 minutes)
- Global scale estimation and constraints (5 minutes)
- Platform architecture with extensibility (20 minutes)
- Organizational and team structure (10 minutes)
- Innovation and industry leadership (10 minutes)
- Risk management and compliance (5 minutes)
- Q&A and strategic discussion (5 minutes)
Key Differentiators by Level
L6 Technical Excellence
- Detailed knowledge of AWS services
- Clear articulation of trade-offs
- Focus on operational excellence
- Team-level implementation planning
L7 Strategic Vision
- Platform thinking and extensibility
- Industry-wide impact consideration
- Organizational capability building
- Innovation and IP creation
Common Pitfalls to Avoid
- Over-engineering: Don't design for 10x scale unless justified
- Under-specifying: Include specific AWS services and configurations
- Ignoring costs: Always discuss cost implications
- Missing operations: Include monitoring, deployment, rollback
- Forgetting security: Address authentication, encryption, compliance
Practice Checklist