System Design Fundamentals for Amazon L6/L7 Interviews¶

🏗️ The Amazon System Design Reality¶

System design interviews at Amazon are not just about technical architecturethey're about demonstrating your ability to think at scale, make trade-offs under uncertainty, and design systems that embody Amazon's operational excellence. For L6/L7 candidates, the expectation is that you can design production-ready systems that serve millions to billions of users while considering cost, reliability, and maintainability.

2024 Candidate Insights

L7 Success (September 2024): "They didn't just want to see I could design a systemthey wanted to see I could design a system that would survive Amazon's scale and operational complexity."

L6 Feedback (November 2024): "I focused too much on the perfect technical solution and not enough on the practical trade-offs. Amazon wants to see how you make decisions with incomplete information."

📊 System Design Interview Structure at Amazon¶

Interview Format and Expectations¶

Level	Duration	Scope	Success Criteria
L6	60 minutes	Multi-service system	Production-ready design with clear scaling strategy
L7	90 minutes	Platform/ecosystem	Innovative architecture with organizational impact

Typical Question Progression¶

Markdown
**Minutes 0-10: Problem Clarification**
- Functional requirements gathering
- Non-functional requirements (scale, performance, availability)
- Constraints and assumptions validation
- Success metrics definition

**Minutes 10-30: High-Level Architecture**
- System components identification
- Data flow design
- API design and contracts
- Technology stack selection

**Minutes 30-50: Deep Dive and Scaling**
- Database design and sharding strategies
- Caching layers and performance optimization
- Reliability and fault tolerance
- Security and compliance considerations

**Minutes 50-60: Operational Excellence**
- Monitoring and alerting strategy
- Deployment and rollback procedures
- Cost optimization approaches
- Maintenance and evolution planning

🎯 Amazon's System Design Philosophy¶

Core Principles for L6/L7 Design¶

1. Customer Obsession in Architecture¶

Python
# Amazon's approach to system design prioritizes customer experience
customer_obsession_design = {
    "latency": "Single-digit milliseconds for critical paths",
    "availability": "99.99% for customer-facing services",
    "consistency": "Eventual consistency acceptable if UX benefits",
    "error_handling": "Graceful degradation over hard failures",
    "personalization": "Built-in from day one, not bolted on"
}

# Example: Design decisions for Prime Video streaming
design_decisions = {
    "CDN_strategy": "Edge locations globally for < 50ms latency",
    "adaptive_bitrate": "Quality adjusts to network conditions",
    "offline_viewing": "Download capability for poor connectivity",
    "error_recovery": "Seamless failover to different quality streams"
}

2. Operational Excellence Framework¶

Markdown
**Amazon's Operational Excellence Pillars:**

1. **Design for Operations**
   - Every component has monitoring/alerting built-in
   - Automated deployment and rollback procedures
   - Clear ownership and on-call responsibilities
   - Runbooks for all failure scenarios

2. **Build for Scale**
   - Horizontal scaling as default approach
   - Stateless services and shared-nothing architecture
   - Database sharding strategies from day one
   - Queue-based asynchronous processing

3. **Embrace Failure**
   - Circuit breakers and bulkheads pattern
   - Chaos engineering and fault injection
   - Multi-region active-active architecture
   - Graceful degradation over hard failures

4. **Optimize Costs**
   - Resource-based pricing models
   - Auto-scaling based on demand patterns
   - Spot instances and reserved capacity optimization
   - Data lifecycle management and archival strategies

3. Leadership Principles in Technical Design¶

Ownership in System Design: - Design systems you can operate and maintain long-term - Include monitoring, alerting, and debugging capabilities - Plan for on-call burden and operational complexity - Consider total cost of ownership, not just development cost

Deliver Results Through Architecture: - Focus on business outcomes, not just technical elegance - Design for measurable customer impact - Include A/B testing and experimentation capabilities - Plan for rapid iteration and feature development

🏗️ Essential System Design Components¶

1. Load Balancing and Traffic Management¶

Amazon's Approach to Load Balancing¶

Markdown
**Application Load Balancer (ALB) Patterns:**
- Layer 7 routing based on content, headers, or path
- Health checks with custom endpoints and criteria
- Weighted routing for gradual traffic shifting
- Multi-target group routing for blue-green deployments

**Network Load Balancer (NLB) for High Performance:**
- Layer 4 routing for ultra-low latency requirements
- Static IP addresses for whitelisting scenarios
- Cross-zone load balancing for availability
- Connection draining for graceful scaling

**Global Load Balancing with Route 53:**
- Geolocation routing for compliance requirements
- Latency-based routing for performance optimization
- Health checks with fallback regions
- Weighted policies for traffic splitting

Real-World Example: E-commerce Platform Load Balancing¶

YAML
# Production load balancing configuration for L6 scale
load_balancer_architecture:
  global_dns:
    primary_region: "us-east-1"
    secondary_region: "us-west-2"
    health_check_interval: "30s"
    failover_threshold: "3_consecutive_failures"

  application_tier:
    alb_configuration:
      target_groups:
        - web_servers: "weight: 80"
        - mobile_api: "weight: 20"
      health_check: "/health"
      deregistration_delay: "300s"

  auto_scaling:
    min_capacity: 4
    max_capacity: 100
    target_cpu_utilization: 70
    scale_out_cooldown: "300s"
    scale_in_cooldown: "300s"

2. Database Design and Data Management¶

Choosing the Right Database for Amazon Scale¶

Markdown
**RDS vs DynamoDB Decision Matrix:**

| Requirement | RDS (PostgreSQL/MySQL) | DynamoDB |
|-------------|------------------------|----------|
| **ACID Transactions** |  Full ACID compliance | L Limited transaction support |
| **Complex Queries** |  SQL, joins, aggregations | L Simple key-value queries only |
| **Scaling** | L Vertical scaling limits |  Automatic horizontal scaling |
| **Performance** | L I/O bound, predictable load |  Single-digit millisecond latency |
| **Cost at Scale** | L Expensive for large datasets |  Pay-per-request pricing |
| **Operational Burden** | L Requires tuning and maintenance |  Fully managed |

**Decision Framework:**
- Use RDS for: Financial systems, complex reporting, existing SQL workflows
- Use DynamoDB for: User sessions, gaming leaderboards, IoT data, real-time applications
- Use both: CQRS pattern with RDS for writes, DynamoDB for reads

Database Sharding Strategies for L6/L7 Scale¶

Python
# Horizontal sharding implementation example
class ShardingStrategy:
    def __init__(self, num_shards=16):
        self.num_shards = num_shards
        self.shard_map = {}

    def get_shard_key(self, user_id):
        """Consistent hashing for user data distribution"""
        return hash(user_id) % self.num_shards

    def get_database_connection(self, shard_key):
        """Return appropriate database connection for shard"""
        return f"user_db_shard_{shard_key}"

    def cross_shard_query(self, query_params):
        """Handle queries that span multiple shards"""
        results = []
        for shard in range(self.num_shards):
            shard_result = self.query_shard(shard, query_params)
            results.extend(shard_result)
        return self.merge_results(results)

# L6 Challenge: Design for 10M users
# L7 Challenge: Design for 1B+ users with global distribution

3. Caching Architecture¶

Multi-Layer Caching Strategy¶

Markdown
**Amazon's Caching Hierarchy:**

1. **CDN Layer (CloudFront)**
   - Static assets (images, CSS, JavaScript)
   - API responses with appropriate TTL
   - Geographic distribution for global latency
   - Cache invalidation strategies

2. **Application-Level Caching (ElastiCache)**
   - Redis for complex data structures and pub/sub
   - Memcached for simple key-value caching
   - Cluster mode for high availability
   - Cache-aside vs write-through patterns

3. **Database Query Caching**
   - Query result caching for expensive operations
   - Materialized views for complex aggregations
   - Database connection pooling
   - Query optimization and indexing strategies

**Cache Invalidation Strategies:**
- Time-based TTL for frequently changing data
- Event-driven invalidation for critical updates
- Cache versioning for gradual rollouts
- Cache warming for predictable traffic patterns

Real-World Caching Implementation¶

JSON
{
  "caching_strategy": {
    "user_profile": {
      "cache_type": "Redis",
      "ttl": "1_hour",
      "invalidation": "user_update_event",
      "backup_strategy": "database_fallback"
    },
    "product_catalog": {
      "cache_type": "CloudFront + ElastiCache",
      "ttl": "24_hours",
      "invalidation": "inventory_change_event",
      "warming_strategy": "scheduled_job_4am"
    },
    "search_results": {
      "cache_type": "Elasticsearch + Redis",
      "ttl": "15_minutes",
      "invalidation": "content_update_event",
      "personalization": "user_based_cache_keys"
    }
  },
  "performance_targets": {
    "cache_hit_ratio": "> 95%",
    "cache_latency": "< 1ms",
    "fallback_latency": "< 100ms"
  }
}

4. Microservices Architecture and Communication¶

Service Decomposition Strategy¶

Markdown
**Amazon's Microservices Principles:**

1. **Single Responsibility**
   - Each service owns one business capability
   - Clear boundaries and minimal coupling
   - Independent deployment and scaling
   - Team ownership and accountability

2. **Database per Service**
   - No shared databases between services
   - Data consistency through event sourcing
   - SAGA pattern for distributed transactions
   - API contracts for data access

3. **Fault Isolation**
   - Circuit breaker pattern implementation
   - Bulkhead pattern for resource isolation
   - Timeout and retry configurations
   - Graceful degradation strategies

**Service Communication Patterns:**
- Synchronous: REST APIs for real-time operations
- Asynchronous: SQS/SNS for eventual consistency
- Event sourcing: Event store for audit trails
- CQRS: Separate read/write models for optimization

Inter-Service Communication Example¶

YAML
# Microservices communication architecture
services:
  user_service:
    responsibilities:
      - User authentication and authorization
      - Profile management
      - Preference settings
    apis:
      - POST /users/authenticate
      - GET /users/{user_id}/profile
      - PUT /users/{user_id}/preferences

  order_service:
    responsibilities:
      - Order creation and management
      - Payment processing coordination
      - Order history and tracking
    dependencies:
      - user_service: "user authentication"
      - inventory_service: "stock validation"
      - payment_service: "payment processing"

  notification_service:
    responsibilities:
      - Email and SMS notifications
      - Push notification management
      - Notification preferences
    communication_pattern: "event_driven"
    message_queues:
      - order_events: "SQS queue for order updates"
      - user_events: "SNS topic for user changes"

<� Amazon-Specific System Design Patterns¶

1. The "Everything is an API" Pattern¶

Markdown
**Amazon's API-First Philosophy:**

Every system component should be accessible via well-defined APIs:
- Internal teams consume APIs just like external customers
- Versioning strategy for backward compatibility
- Rate limiting and authentication for all APIs
- Comprehensive documentation and SDK generation

**API Design Principles:**
1. **RESTful when possible, GraphQL when complex**
2. **Idempotent operations for reliability**
3. **Consistent error handling and status codes**
4. **Comprehensive monitoring and analytics**
5. **Security by default (HTTPS, authentication, authorization)**

2. The "Two-Pizza Team" Architecture Pattern¶

Python
# Service architecture for independent team ownership
team_service_architecture = {
    "team_size": "6-8 engineers",
    "service_ownership": "full_lifecycle",
    "deployment_independence": True,
    "technology_choice_freedom": True,
    "operational_responsibility": "24/7_on_call",

    "boundaries": {
        "data_ownership": "exclusive_database_access",
        "api_contracts": "backward_compatible_versioning",
        "deployment_pipeline": "independent_ci_cd",
        "monitoring": "team_specific_dashboards"
    },

    "communication": {
        "sync_apis": "real_time_operations",
        "async_events": "eventual_consistency",
        "shared_nothing": "no_direct_database_access"
    }
}

3. The "Backwards Compatibility" Pattern¶

Markdown
**API Evolution Strategy:**

1. **Additive Changes Only**
   - New fields can be added to responses
   - New optional parameters can be added to requests
   - New endpoints can be added to existing services

2. **Deprecation Timeline**
   - Announce deprecation 6 months in advance
   - Provide migration guides and tools
   - Monitor usage and assist remaining clients
   - Remove deprecated functionality after 12 months

3. **Versioning Strategy**
   - URL versioning: /v1/users, /v2/users
   - Header versioning: API-Version: 2.0
   - Content negotiation: Accept: application/vnd.api+json;version=2

**Example: User Service API Evolution**

Version 1.0¶

GET /v1/users/{id} Response: {"id": 123, "name": "John", "email": "john@example.com"}

Version 2.0 (Backwards Compatible)¶

GET /v1/users/{id} Response: {"id": 123, "name": "John", "email": "john@example.com", "preferences": {...}}

Version 2.0 (New Endpoint)¶

GET /v2/users/{id} Response: {"user_id": 123, "full_name": "John Doe", "contact": {"email": "john@example.com"}}

Text Only

## =� Scaling Strategies for L6/L7 Interviews

### Horizontal vs Vertical Scaling Decision Framework

```markdown
**When to Use Horizontal Scaling:**
 Stateless applications
 Read-heavy workloads
 Predictable traffic patterns
 Cost-sensitive operations
 High availability requirements

**When to Use Vertical Scaling:**
 Database operations requiring ACID properties
 Legacy applications with shared state
 Complex in-memory caching requirements
 Temporary traffic spikes
 Rapid deployment needs

**Hybrid Approach for L7 Scale:**
- Auto-scaling groups for web tier (horizontal)
- RDS with read replicas (mixed approach)
- Elasticsearch clusters (horizontal)
- Cache clusters with automatic failover (horizontal)

Real-World Scaling Example: Video Streaming Service¶

YAML
# L7-level scaling architecture for global video streaming
global_streaming_architecture:
  content_delivery:
    edge_locations: 200+
    origin_servers: 
      - primary: "us-east-1"
      - secondary: "eu-west-1"
      - tertiary: "ap-southeast-1"

  video_processing:
    encoding_pipeline:
      instances: "auto-scaling 10-1000"
      instance_type: "c5.24xlarge with GPU"
      processing_queue: "SQS FIFO with visibility timeout"

    storage_tier:
      hot_storage: "S3 Standard for recent content"
      warm_storage: "S3 IA for older content"
      cold_storage: "Glacier for archival"

  user_facing_services:
    api_gateway:
      throttling: "10000 requests/second per user"
      caching: "CloudFront with 24-hour TTL"

    recommendation_engine:
      compute: "EMR clusters with Spark"
      data_store: "DynamoDB with GSI"
      real_time_updates: "Kinesis Data Streams"

  database_architecture:
    user_data: "DynamoDB with global tables"
    content_metadata: "Aurora PostgreSQL with cross-region replicas"
    analytics: "Redshift with daily ETL from S3"

= Security and Compliance at Scale¶

Security by Design Principles¶

Markdown
**Amazon's Security Framework for System Design:**

1. **Identity and Access Management**
   - IAM roles and policies for service-to-service communication
   - Least privilege principle for all system components
   - Regular access reviews and automatic policy enforcement
   - Multi-factor authentication for administrative access

2. **Data Protection**
   - Encryption at rest using AWS KMS
   - Encryption in transit with TLS 1.3
   - Data classification and handling procedures
   - Key rotation and access logging

3. **Network Security**
   - VPC isolation and security groups
   - Private subnets for database and application tiers
   - WAF protection for web-facing applications
   - DDoS protection with AWS Shield

4. **Monitoring and Incident Response**
   - CloudTrail for API call logging
   - GuardDuty for threat detection
   - Security hub for compliance monitoring
   - Automated incident response workflows

Compliance Considerations for L6/L7 Design¶

Python
# Compliance requirements affecting system architecture
compliance_requirements = {
    "GDPR": {
        "data_residency": "EU data must stay in EU regions",
        "right_to_be_forgotten": "Hard delete capability required",
        "data_portability": "Export APIs for user data",
        "consent_management": "Granular privacy preferences"
    },

    "SOX": {
        "audit_trails": "Immutable logs for financial transactions",
        "change_management": "Approval workflows for production changes",
        "access_controls": "Segregation of duties enforcement",
        "data_integrity": "Checksums and validation for financial data"
    },

    "HIPAA": {
        "encryption": "AES-256 for all PHI data",
        "access_logging": "Detailed audit logs for data access",
        "business_associate_agreements": "Third-party service contracts",
        "minimum_necessary": "Role-based data access controls"
    }
}

=� Performance Optimization and Monitoring¶

Key Performance Metrics for L6/L7 Systems¶

Markdown
**Application Performance Metrics:**
- Response time percentiles (P50, P95, P99, P99.9)
- Throughput (requests per second)
- Error rates and error distribution
- Availability and uptime measurements

**Infrastructure Performance Metrics:**
- CPU utilization and memory usage
- Network throughput and latency
- Disk I/O and storage utilization
- Database connection pool metrics

**Business Performance Metrics:**
- Conversion rates and user engagement
- Revenue per user and customer lifetime value
- Feature adoption and usage patterns
- Customer satisfaction scores

Monitoring Architecture Example¶

JSON
{
  "monitoring_stack": {
    "metrics_collection": {
      "application_metrics": "CloudWatch custom metrics",
      "infrastructure_metrics": "CloudWatch agent",
      "custom_dashboards": "CloudWatch dashboards + Grafana"
    },

    "logging_strategy": {
      "application_logs": "CloudWatch Logs with structured JSON",
      "access_logs": "ALB access logs to S3",
      "audit_logs": "CloudTrail for API calls"
    },

    "alerting_framework": {
      "real_time_alerts": "CloudWatch alarms + SNS",
      "escalation_policy": "PagerDuty integration",
      "alert_fatigue_prevention": "ML-based anomaly detection"
    },

    "distributed_tracing": {
      "service": "AWS X-Ray",
      "custom_instrumentation": "OpenTelemetry",
      "performance_analysis": "Request flow visualization"
    }
  }
}

<� L6 vs L7 System Design Differentiation¶

L6 System Design Expectations¶

Markdown
**L6 Focus Areas:**
- Design scalable systems for millions of users
- Demonstrate understanding of common patterns
- Show ability to handle typical scaling challenges
- Focus on production readiness and reliability
- Understand cost implications and trade-offs

**Typical L6 Problems:**
- Design a URL shortener (like bit.ly)
- Build a chat messaging system
- Create a ride-sharing service backend
- Design a social media feed system
- Build a payment processing system

**L6 Success Criteria:**
 Clear understanding of system components
 Appropriate technology choices with justification
 Scaling strategy for expected growth
 Consideration of failure modes and reliability
 Monitoring and operational aspects

L7 System Design Expectations¶

Markdown
**L7 Focus Areas:**
- Design platform-level systems affecting entire organizations
- Demonstrate innovative thinking and industry influence
- Show deep understanding of distributed systems trade-offs
- Focus on long-term evolution and organizational impact
- Consider regulatory, compliance, and global scale challenges

**Typical L7 Problems:**
- Design AWS Lambda (serverless computing platform)
- Build a global CDN from scratch
- Create a machine learning platform for the company
- Design a real-time fraud detection system
- Build a global payment infrastructure

**L7 Success Criteria:**
 Innovation beyond standard patterns
 Deep understanding of distributed systems principles
 Consideration of organizational and team implications
 Global scale and multi-region considerations
 Platform thinking and ecosystem development

=� System Design Interview Checklist¶

Pre-Interview Preparation (4 Weeks Out)¶

Markdown
**Fundamental Concepts Review:**
- [ ] CAP theorem and consistency models
- [ ] Database sharding and replication strategies
- [ ] Caching patterns and cache invalidation
- [ ] Load balancing and traffic routing
- [ ] Microservices communication patterns
- [ ] Security and authentication mechanisms

**AWS Services Deep Dive:**
- [ ] EC2, ELB, Auto Scaling fundamentals
- [ ] RDS vs DynamoDB use cases and trade-offs
- [ ] S3 storage classes and lifecycle policies
- [ ] CloudFront and global content delivery
- [ ] SQS, SNS, and asynchronous messaging
- [ ] Lambda and serverless architecture patterns

**Practice Problems:**
- [ ] Complete 20+ system design problems
- [ ] Practice both L6 and L7 level problems
- [ ] Time yourself: 60 minutes for L6, 90 minutes for L7
- [ ] Record yourself and review for clarity
- [ ] Get feedback from experienced engineers

During the Interview¶

Markdown
**First 10 Minutes - Requirements Gathering:**
- [ ] Ask clarifying questions about functional requirements
- [ ] Establish scale estimates (users, requests, data size)
- [ ] Confirm non-functional requirements (availability, consistency)
- [ ] Validate assumptions with the interviewer
- [ ] Define success metrics and constraints

**Next 20 Minutes - High-Level Design:**
- [ ] Draw system components and their relationships
- [ ] Explain data flow through the system
- [ ] Choose appropriate technologies with justification
- [ ] Address the primary use cases
- [ ] Get interviewer buy-in before diving deeper

**Next 20 Minutes - Detailed Design:**
- [ ] Design database schema and access patterns
- [ ] Implement caching strategy
- [ ] Address scaling bottlenecks
- [ ] Design APIs and data contracts
- [ ] Consider security and authentication

**Final 10 Minutes - Wrap Up:**
- [ ] Discuss monitoring and operational aspects
- [ ] Address failure modes and reliability
- [ ] Consider cost optimization opportunities
- [ ] Discuss future evolution and maintenance
- [ ] Answer any remaining questions

Post-Interview Self-Assessment¶

Markdown
**Technical Accuracy:**
- [ ] Did I choose appropriate technologies for the scale?
- [ ] Were my estimates reasonable and well-justified?
- [ ] Did I address the major scaling bottlenecks?
- [ ] Were my API designs clear and consistent?
- [ ] Did I consider security and reliability adequately?

**Communication Effectiveness:**
- [ ] Did I ask good clarifying questions?
- [ ] Was my explanation clear and well-structured?
- [ ] Did I engage with the interviewer's feedback?
- [ ] Did I manage time effectively?
- [ ] Was I able to defend my design choices?

**Amazon Culture Alignment:**
- [ ] Did I demonstrate customer obsession in my design?
- [ ] Did I show ownership thinking about operations?
- [ ] Did I consider cost optimization throughout?
- [ ] Did I show dive deep technical understanding?
- [ ] Did I demonstrate deliver results focus?

=� Advanced Topics for L7 Candidates¶

Distributed Systems Consensus¶

Python
# Raft consensus algorithm implementation concepts
class RaftNode:
    def __init__(self, node_id, cluster_nodes):
        self.node_id = node_id
        self.cluster_nodes = cluster_nodes
        self.state = "follower"  # follower, candidate, leader
        self.current_term = 0
        self.voted_for = None
        self.log = []
        self.commit_index = 0

    def start_election(self):
        """Candidate starts election process"""
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id

        # Request votes from other nodes
        votes_received = 1  # Vote for self
        for node in self.cluster_nodes:
            if node != self.node_id:
                vote = self.request_vote(node)
                if vote:
                    votes_received += 1

        # Become leader if majority votes received
        if votes_received > len(self.cluster_nodes) // 2:
            self.state = "leader"
            self.send_heartbeats()

    def replicate_log_entry(self, entry):
        """Leader replicates log entry to followers"""
        if self.state != "leader":
            return False

        self.log.append(entry)
        successful_replications = 1  # Leader has entry

        for node in self.cluster_nodes:
            if node != self.node_id:
                success = self.send_append_entries(node, entry)
                if success:
                    successful_replications += 1

        # Commit if majority of nodes have replicated
        if successful_replications > len(self.cluster_nodes) // 2:
            self.commit_index = len(self.log) - 1
            return True

        return False

Event Sourcing and CQRS Implementation¶

Markdown
**Event Sourcing for L7 Scale Systems:**

Benefits for Amazon-scale systems:
- Complete audit trail for compliance
- Time-travel queries for debugging
- Event replay for system recovery
- Horizontal scaling of read models

Implementation considerations:
- Event versioning and schema evolution
- Snapshot strategies for performance
- Event ordering and causality
- Eventual consistency handling

**CQRS Implementation Pattern:**
```json
{
  "command_side": {
    "purpose": "Handle writes and business logic",
    "database": "Aurora PostgreSQL with ACID properties",
    "scaling": "Vertical scaling with read replicas",
    "events": "Publish events to Kinesis after commit"
  },
  "query_side": {
    "purpose": "Optimized read models for different use cases",
    "databases": [
      "DynamoDB for user-facing queries",
      "Elasticsearch for search functionality",
      "Redshift for analytics and reporting"
    ],
    "scaling": "Horizontal scaling per use case",
    "consistency": "Eventual consistency acceptable"
  },
  "event_stream": {
    "technology": "Kinesis Data Streams",
    "retention": "7 days for replay capability",
    "consumers": "Lambda functions for read model updates",
    "ordering": "Partition by aggregate ID"
  }
}

=� Key Takeaways and Next Steps¶

Critical Success Factors

Start with Requirements: Always clarify functional and non-functional requirements before designing
Think in Trade-offs: Every design decision has trade-offs; articulate them clearly
Design for Operations: Include monitoring, alerting, and debugging from the beginning
Scale Appropriately: Don't over-engineer for scale you don't need, but plan for growth
Embrace Amazon Culture: Show customer obsession, ownership, and operational excellence in your design

Recommended Learning Path¶

Markdown
**Week 1-2: Foundation Building**
- [ ] Review distributed systems fundamentals
- [ ] Study AWS services and their use cases
- [ ] Practice basic system design problems

**Week 3-4: Pattern Recognition**
- [ ] Learn common system design patterns
- [ ] Study real-world architectures (Netflix, Uber, etc.)
- [ ] Practice intermediate complexity problems

**Week 5-6: Advanced Concepts**
- [ ] Dive deep into consensus algorithms and distributed transactions
- [ ] Study microservices patterns and anti-patterns
- [ ] Practice L7-level platform design problems

**Week 7-8: Interview Simulation**
- [ ] Conduct mock interviews with experienced engineers
- [ ] Practice time management and communication
- [ ] Refine your design process and templates

🔄 Trade-Off Analysis Framework for Amazon Interviews¶

The Amazon Decision-Making Process¶

Markdown
**Amazon's Trade-off Analysis Framework:**

1. **Customer Impact Assessment**
   - How does this decision affect customer experience?
   - What are the latency/availability implications?
   - How does it impact cost passed to customers?

2. **Operational Complexity Evaluation**
   - What's the operational burden of this choice?
   - How does it affect on-call load and debugging?
   - What's the learning curve for the team?

3. **Cost-Benefit Analysis**
   - What are the infrastructure costs at scale?
   - How does this impact development velocity?
   - What's the total cost of ownership over 3 years?

4. **Risk Assessment**
   - What are the failure modes of this approach?
   - How does this affect our blast radius?
   - What's our recovery strategy if this fails?

CAP Theorem Applied to Real Amazon Systems¶

graph TD
    A[CAP Theorem in Practice] --> B[Consistency]
    A --> C[Availability] 
    A --> D[Partition Tolerance]

    B --> B1[Banking Systems<br/>Choose CP]
    B --> B2[Strong Consistency<br/>ACID Transactions]

    C --> C1[User Sessions<br/>Choose AP]
    C --> C2[Shopping Cart<br/>Eventual Consistency OK]

    D --> D1[Always Required<br/>in Distributed Systems]
    D --> D2[Network Partitions<br/>Will Happen]

    B1 --> E[RDS with Multi-AZ<br/>Sacrifice Availability<br/>During Failover]
    C1 --> F[DynamoDB Global Tables<br/>Sacrifice Consistency<br/>For Availability]

Real-World CAP Theorem Examples¶

Python
# Example: Amazon shopping cart trade-offs
shopping_cart_decisions = {
    "consistency_model": "eventual_consistency",
    "rationale": {
        "customer_impact": "User can continue shopping even during network issues",
        "business_impact": "Better conversion rates vs perfect consistency",
        "technical_trade_off": "Accept duplicate items vs cart unavailability"
    },

    "implementation": {
        "primary_storage": "DynamoDB with local strong consistency",
        "conflict_resolution": "Last writer wins with timestamp",
        "reconciliation": "Background process merges conflicting states",
        "user_experience": "Show merged cart with confirmation dialog"
    },

    "monitoring": {
        "conflict_rate": "< 0.01% of cart operations",
        "reconciliation_latency": "< 30 seconds",
        "customer_complaints": "Track 'missing items' reports"
    }
}

Latency vs Throughput Trade-offs¶

Markdown
**When to Optimize for Latency:**
- Real-time user interactions (search suggestions, chat)
- Financial transactions requiring immediate feedback
- Gaming and interactive applications
- Mobile applications with limited bandwidth

**When to Optimize for Throughput:**
- Batch processing and ETL operations
- Log processing and analytics
- Image/video processing pipelines
- Background synchronization tasks

**Amazon's Approach to Latency-Throughput Balance:**

graph LR
    A[Request] --> B{Latency Critical?}
    B -->|Yes| C[P99 < 100ms]
    B -->|No| D[Batch Processing]

    C --> E[Cache First]
    C --> F[Synchronous API]
    C --> G[Premium Infrastructure]

    D --> H[Queue Based]
    D --> I[Asynchronous Processing]
    D --> J[Cost Optimized Instances]

    E --> K[ElastiCache]
    F --> L[Direct Database]
    G --> M[Provisioned Capacity]

    H --> N[SQS/SNS]
    I --> O[Lambda/Batch]
    J --> P[Spot Instances]

Cost vs Performance Analysis¶

JSON
{
  "cost_performance_matrix": {
    "high_performance_high_cost": {
      "use_cases": ["Real-time trading", "Gaming leaderboards", "Live streaming"],
      "aws_services": ["DynamoDB On-Demand", "Lambda Provisioned Concurrency", "ElastiCache"],
      "cost_multiplier": "3-5x standard",
      "performance_gain": "10-100x latency improvement"
    },

    "balanced_approach": {
      "use_cases": ["E-commerce", "Social media", "Content management"],
      "aws_services": ["RDS", "EC2 Auto Scaling", "CloudFront"],
      "cost_multiplier": "1-2x standard",
      "performance_gain": "2-5x latency improvement"
    },

    "cost_optimized": {
      "use_cases": ["Data processing", "Backup systems", "Analytics"],
      "aws_services": ["Spot Instances", "S3 IA/Glacier", "Reserved Instances"],
      "cost_multiplier": "0.1-0.5x standard",
      "performance_trade_off": "Higher latency acceptable"
    }
  }
}

🏗️ Complete System Design Walkthrough: URL Shortener¶

L6-Level URL Shortener Design (60-Minute Interview)¶

Phase 1: Requirements Gathering (10 minutes)¶

Markdown
**Functional Requirements:**
- Shorten long URLs to 6-7 character codes
- Redirect users when they click shortened URLs
- Custom aliases (optional)
- URL expiration (optional)
- Analytics on click counts

**Non-Functional Requirements:**
- 100M URLs created per month
- 100:1 read-to-write ratio (10B redirects per month)
- 99.9% availability
- < 100ms latency for redirects
- 5 years of URL retention

**Scale Estimation:**
- 100M URLs/month = ~40 URLs/second
- 10B redirects/month = ~4000 redirects/second
- Storage: 100M URLs × 500 bytes = 50GB per month
- Bandwidth: 4000 RPS × 500 bytes = 2MB/s

Phase 2: High-Level Architecture (15 minutes)¶

graph TB
    subgraph "User Layer"
        U1[Web Browser]
        U2[Mobile App]
        U3[API Clients]
    end

    subgraph "Load Balancer Layer"
        LB[Application Load Balancer]
    end

    subgraph "Application Layer"
        WS1[Web Server 1]
        WS2[Web Server 2]
        WS3[Web Server N]
    end

    subgraph "Cache Layer"
        RC[Redis Cluster]
    end

    subgraph "Database Layer"
        RW[Primary DB<br/>Read/Write]
        R1[Read Replica 1]
        R2[Read Replica 2]
    end

    subgraph "Analytics Layer"
        KDS[Kinesis Data Streams]
        KDA[Kinesis Analytics]
        S3[S3 Analytics Store]
    end

    U1 --> LB
    U2 --> LB
    U3 --> LB

    LB --> WS1
    LB --> WS2
    LB --> WS3

    WS1 --> RC
    WS2 --> RC
    WS3 --> RC

    WS1 --> RW
    WS1 --> R1
    WS2 --> R2

    WS1 --> KDS
    WS2 --> KDS
    WS3 --> KDS

    KDS --> KDA
    KDA --> S3

Phase 3: API Design (10 minutes)¶

Python
# REST API Design for URL Shortener

class URLShortenerAPI:
    """
    API endpoints for L6-level URL shortener system
    """

    def create_short_url(self, request):
        """
        POST /api/v1/urls
        {
            "long_url": "https://example.com/very/long/url",
            "custom_alias": "my-link",  # optional
            "expiration_date": "2024-12-31T23:59:59Z"  # optional
        }

        Response:
        {
            "short_url": "https://short.ly/abc123",
            "short_code": "abc123",
            "long_url": "https://example.com/very/long/url",
            "created_at": "2024-01-01T00:00:00Z",
            "expires_at": "2024-12-31T23:59:59Z"
        }
        """
        pass

    def redirect_url(self, short_code):
        """
        GET /{short_code}

        Response: 301 Redirect to long URL
        Headers: Location: https://example.com/very/long/url
        """
        pass

    def get_analytics(self, short_code):
        """
        GET /api/v1/analytics/{short_code}

        Response:
        {
            "short_code": "abc123",
            "total_clicks": 1500,
            "unique_clicks": 1200,
            "clicks_by_date": {...},
            "top_referrers": [...],
            "geographic_distribution": {...}
        }
        """
        pass

# URL encoding algorithm
import base62
import hashlib
import time

def generate_short_code(long_url, user_id=None):
    """
    Generate short code using base62 encoding
    """
    # Approach 1: Counter-based (simple, sequential)
    # Pros: Predictable length, no collisions
    # Cons: Predictable, requires coordination

    # Approach 2: Hash-based (chosen for this design)
    # Pros: Distributed generation, random
    # Cons: Potential collisions, need retry logic

    hash_input = f"{long_url}_{user_id}_{time.time()}"
    hash_value = hashlib.sha256(hash_input.encode()).hexdigest()

    # Take first 43 bits for base62 encoding (7 characters)
    numeric_hash = int(hash_value[:11], 16)
    short_code = base62.encode(numeric_hash)[:7]

    return short_code

Phase 4: Database Design (10 minutes)¶

SQL
-- Database schema for URL shortener
-- Using PostgreSQL for ACID properties and complex queries

-- Main URLs table
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_code VARCHAR(10) UNIQUE NOT NULL,
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    is_active BOOLEAN DEFAULT TRUE,
    click_count BIGINT DEFAULT 0
);

-- Indexes for performance
CREATE INDEX idx_urls_short_code ON urls(short_code);
CREATE INDEX idx_urls_user_id ON urls(user_id);
CREATE INDEX idx_urls_created_at ON urls(created_at);

-- Analytics table for detailed click tracking
CREATE TABLE url_clicks (
    id BIGSERIAL PRIMARY KEY,
    short_code VARCHAR(10) NOT NULL,
    clicked_at TIMESTAMP DEFAULT NOW(),
    ip_address INET,
    user_agent TEXT,
    referrer TEXT,
    country VARCHAR(2),
    city VARCHAR(100)
);

-- Partitioning strategy for analytics table
CREATE TABLE url_clicks_y2024m01 PARTITION OF url_clicks
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Users table (optional, for authenticated users)
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE,
    api_key VARCHAR(64) UNIQUE,
    created_at TIMESTAMP DEFAULT NOW(),
    plan_type VARCHAR(20) DEFAULT 'free',
    monthly_limit INTEGER DEFAULT 1000
);

Phase 5: Scaling Strategy (10 minutes)¶

Markdown
**Scaling Bottlenecks and Solutions:**

1. **Database Write Bottleneck (URL Creation)**
   - Problem: 40 writes/second on single primary
   - Solution: Database sharding by short_code hash
   - Implementation: 16 shards, route by hash(short_code) % 16

2. **Database Read Bottleneck (URL Redirects)**
   - Problem: 4000 reads/second from single database
   - Solution: Read replicas + aggressive caching
   - Implementation: 3 read replicas, 95% cache hit ratio

3. **Cache Layer Scaling**
   - Problem: Redis memory limits and single point of failure
   - Solution: Redis Cluster with automatic failover
   - Implementation: 6 nodes (3 primary + 3 replicas)

4. **Application Layer Scaling**
   - Problem: EC2 instances hitting CPU/memory limits
   - Solution: Auto Scaling Groups with target tracking
   - Implementation: Min 2, Max 20 instances, 70% CPU target

Python
# Sharding strategy implementation
class URLShardingStrategy:
    def __init__(self, num_shards=16):
        self.num_shards = num_shards
        self.shard_connections = self._initialize_connections()

    def get_shard_for_write(self, short_code):
        """Route new URL creation to appropriate shard"""
        shard_id = hash(short_code) % self.num_shards
        return self.shard_connections[shard_id]['write']

    def get_shard_for_read(self, short_code):
        """Route URL reads to appropriate shard (with read replicas)"""
        shard_id = hash(short_code) % self.num_shards
        # Use read replica for better performance
        return self.shard_connections[shard_id]['read']

    def get_user_urls(self, user_id):
        """Query across all shards for user's URLs"""
        results = []
        for shard in self.shard_connections:
            shard_results = shard['read'].execute(
                "SELECT * FROM urls WHERE user_id = %s", (user_id,)
            )
            results.extend(shard_results)
        return results

# Caching strategy
class URLCacheManager:
    def __init__(self, redis_cluster):
        self.redis = redis_cluster
        self.url_ttl = 24 * 60 * 60  # 24 hours
        self.analytics_ttl = 5 * 60   # 5 minutes

    def get_url(self, short_code):
        """Get URL from cache with fallback to database"""
        cached_url = self.redis.get(f"url:{short_code}")
        if cached_url:
            return json.loads(cached_url)

        # Cache miss - fetch from database
        url_data = self.database.get_url(short_code)
        if url_data:
            self.redis.setex(
                f"url:{short_code}", 
                self.url_ttl, 
                json.dumps(url_data)
            )
        return url_data

    def increment_click_count(self, short_code):
        """Async increment click count"""
        # Use Redis for fast increment, sync to DB periodically
        pipeline = self.redis.pipeline()
        pipeline.incr(f"clicks:{short_code}")
        pipeline.expire(f"clicks:{short_code}", self.analytics_ttl)
        pipeline.execute()

Phase 6: Monitoring and Operations (5 minutes)¶

JSON
{
  "monitoring_strategy": {
    "application_metrics": {
      "url_creation_rate": "Target: 40/sec, Alert if > 100/sec",
      "redirect_latency": "P95 < 50ms, Alert if P95 > 100ms",
      "cache_hit_ratio": "Target: > 95%, Alert if < 90%",
      "error_rate": "Target: < 0.1%, Alert if > 1%"
    },

    "infrastructure_metrics": {
      "database_connections": "Alert if > 80% of pool",
      "redis_memory_usage": "Alert if > 75%",
      "ec2_cpu_utilization": "Auto-scale at 70%",
      "alb_target_response_time": "Alert if > 200ms"
    },

    "business_metrics": {
      "daily_url_creation": "Track trends and capacity planning",
      "top_domains": "Identify potential abuse patterns",
      "geographic_distribution": "Plan CDN expansion",
      "user_retention": "Track API usage patterns"
    }
  },

  "alerting_runbooks": {
    "high_redirect_latency": {
      "investigation_steps": [
        "Check cache hit ratio",
        "Verify database performance",
        "Check network connectivity",
        "Review recent deployments"
      ],
      "escalation_path": "L1 -> L2 -> Engineering Manager",
      "auto_mitigation": "Scale out application tier"
    }
  }
}

Common L6 Interview Mistakes to Avoid¶

Markdown
**Technical Mistakes:**
❌ Starting with complex sharding before simple solutions
❌ Over-engineering for scale not in requirements  
❌ Ignoring caching until the end of the interview
❌ Not considering data consistency requirements
❌ Forgetting about URL validation and security

**Process Mistakes:**
❌ Jumping to detailed design without high-level architecture
❌ Not asking clarifying questions about scale and requirements
❌ Spending too much time on one component
❌ Not considering operational aspects (monitoring, deployment)
❌ Not justifying technology choices

**Amazon-Specific Mistakes:**
❌ Not considering cost optimization strategies
❌ Ignoring customer impact of design decisions  
❌ Not planning for failure scenarios
❌ Overlooking compliance and security requirements
❌ Not showing ownership thinking about operations

Related Resources: - AWS Services Deep Dive - Detailed AWS service patterns and use cases - L6 System Design Problems - Practice problems for L6 candidates - L7 System Design Problems - Advanced problems for L7 candidates
- Scale Architecture Patterns - Advanced scaling strategies - System Design Case Studies - Real-world implementation examples

Continue to: AWS Services for System Design �