🏗️
Domain 1.0 — Cloud Architecture & Design
Service models, deployment models, cloud-native concepts, storage, networking, HA, and design patterns
24%
Cloud Service Models — IaaS, PaaS, SaaS, FaaS
| Model | You Manage | Provider Manages | Examples |
| IaaS | OS, middleware, runtime, apps, data | Virtualization, servers, storage, networking | AWS EC2, Azure VMs, GCP Compute Engine |
| PaaS | Applications, data | OS, runtime, middleware, infrastructure | AWS Elastic Beanstalk, Azure App Service, Heroku |
| SaaS | Data, user config only | Everything — entire stack | Microsoft 365, Salesforce, Google Workspace |
| FaaS | Function code only | Execution environment, scaling, servers | AWS Lambda, Azure Functions, Google Cloud Functions |
| CaaS | Containers, apps | Container orchestration infrastructure | AWS ECS/EKS, Azure AKS, Google GKE |
Memory Tip: As you go from IaaS → PaaS → SaaS, you manage LESS and the provider manages MORE. IaaS = most control. SaaS = least control. The trade-off is always control vs convenience.
Cloud Deployment Models
- Public cloud: Infrastructure owned and operated by a CSP, shared among multiple tenants. Pay-as-you-go. Examples: AWS, Azure, GCP. Lowest upfront cost.
- Private cloud: Dedicated infrastructure for a single organization. On-premises or hosted. Maximum control and compliance. Higher cost.
- Hybrid cloud: Mix of public and private cloud with orchestration between them. Enables workload portability and "cloud bursting."
- Multi-cloud: Using services from multiple CSPs simultaneously (e.g., AWS + Azure). Avoids vendor lock-in, improves resilience.
- Community cloud: Shared among organizations with common concerns (government, healthcare). Managed by members or third party.
- Cloud bursting: Running workloads on-premises but automatically scaling into public cloud when demand spikes.
- Vendor lock-in: Dependency on a specific CSP's proprietary services. Mitigated by using open standards, containers, and multi-cloud architecture.
Virtualization & Containers
- Hypervisor Type 1 (bare-metal): Runs directly on hardware. No host OS. Examples: VMware ESXi, Microsoft Hyper-V, KVM. Best performance.
- Hypervisor Type 2 (hosted): Runs on top of a host OS. Examples: VMware Workstation, VirtualBox. Used for development/testing.
- VM: Full OS virtualization. Each VM has its own kernel. Heavy — minutes to boot. Strong isolation.
- Container: Shares host OS kernel. Lightweight — seconds to start. Application + dependencies packaged together. Docker is dominant runtime.
- Docker image: Read-only template built from a Dockerfile. Images are layered and cached.
- Docker container: Running instance of an image. Ephemeral by default — data lost when container stops unless mounted volume is used.
- Kubernetes (K8s): Container orchestration platform. Manages deployment, scaling, self-healing, and load balancing of containerized apps.
- Pod: Smallest K8s unit — one or more containers sharing network and storage resources.
- Container registry: Stores Docker images. Examples: Docker Hub, AWS ECR, Azure ACR, Harbor (private).
Serverless & Cloud-Native Concepts
- Serverless: No server management. Code runs in response to events. Provider handles all infrastructure, scaling, and availability. Pay per execution.
- FaaS: Function as a Service. Stateless functions triggered by events. Cold start latency is a limitation.
- Microservices: Application decomposed into small, independent, loosely coupled services. Each deployable independently. Enables agility.
- Monolith vs microservices: Monolith = single deployable unit, easy to develop but hard to scale. Microservices = independent scaling but complex orchestration.
- API Gateway: Single entry point for all API calls. Handles routing, auth, rate limiting, SSL termination. Examples: AWS API Gateway, Kong.
- Event-driven architecture: Services communicate via events. Decoupled. Examples: AWS SNS/SQS, Azure Event Hubs, Kafka.
- Service mesh: Infrastructure layer for microservice communication. Handles service discovery, load balancing, encryption (mTLS). Example: Istio.
- 12-Factor App: Methodology for building cloud-native apps. Key factors: one codebase, explicit dependencies, config in environment variables, stateless processes.
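The "config in environment variables" factor can be sketched as below. This is a minimal illustration, not a prescribed layout: the variable names `APP_DB_URL` and `APP_MAX_WORKERS`, the `Settings` class, and the default of 4 workers are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_url: str
    max_workers: int


def load_settings(env=None):
    """Build settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    # Required value: fail fast at startup if it is missing.
    db_url = env.get("APP_DB_URL")
    if not db_url:
        raise RuntimeError("APP_DB_URL is not set")
    # Optional value with a default; environment values are always strings.
    max_workers = int(env.get("APP_MAX_WORKERS", "4"))
    return Settings(db_url=db_url, max_workers=max_workers)
```

Because nothing is hardcoded, the same build can run in dev, staging, and production with only the environment changing.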
Cloud Storage Types
| Type | Description | Use Cases | Examples |
| Object storage | Stores unstructured data as objects with metadata + unique key. Flat namespace. | Backups, media, static websites, archives | AWS S3, Azure Blob, GCP Cloud Storage |
| Block storage | Raw storage volumes attached to VMs. Like a virtual hard drive. Low latency. | OS volumes, databases, high-performance apps | AWS EBS, Azure Managed Disks, GCP Persistent Disk |
| File storage | Shared file system accessed via NFS or SMB. Hierarchical directories. | Shared application data, lift-and-shift, home dirs | AWS EFS, Azure Files, GCP Filestore |
| Archive storage | Lowest cost tier. Retrieval can take minutes to hours, depending on provider and tier. | Long-term compliance retention, cold backups | Amazon S3 Glacier, Azure Archive, GCP Archive storage |
- Storage tiering: Automatically moves data between hot/warm/cold tiers based on access frequency. Reduces cost.
- Ephemeral storage: Temporary storage tied to a VM's lifecycle. Lost when instance stops. Use for temp files only.
- IOPS: Input/Output Operations Per Second — key metric for block storage performance.
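IOPS only tells half the story: throughput is IOPS multiplied by the I/O size. The sketch below works that arithmetic through. The function names are made up for illustration, and real volumes usually also enforce a separate throughput cap, which this ignores.

```python
import math


def throughput_mib_per_s(iops: int, io_size_kib: int) -> float:
    """Throughput = IOPS x I/O size, converted from KiB/s to MiB/s."""
    return iops * io_size_kib / 1024


def iops_needed(target_mib_per_s: float, io_size_kib: int) -> int:
    """Inverse: IOPS required to sustain a target throughput."""
    return math.ceil(target_mib_per_s * 1024 / io_size_kib)
```

For example, 3,000 IOPS at 16 KiB per operation is 46.875 MiB/s, while sustaining 100 MiB/s at the same I/O size needs 6,400 IOPS.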
Cloud Networking
- VPC/VNet: Virtual Private Cloud (AWS) / Virtual Network (Azure). Logically isolated network within the cloud. You control IP ranges, subnets, routing, gateways.
- Subnet: Subdivision of a VPC. Public subnet = has route to internet gateway. Private subnet = no direct internet route.
- Internet Gateway (IGW): Enables internet access for resources in a public subnet. Attached to VPC.
- NAT Gateway: Allows private subnet instances to initiate outbound internet connections without exposing them inbound.
- Security Group: Stateful virtual firewall at instance/resource level. Allow rules only — if you allow inbound, return traffic is automatic.
- Network ACL (NACL): Stateless firewall at subnet level. Explicit allow AND deny rules. Both inbound and outbound must be configured.
- VPC Peering: Direct private connectivity between two VPCs. Non-transitive — A↔B and B↔C does NOT mean A↔C.
- Transit Gateway: Hub-and-spoke model connecting multiple VPCs and on-premises networks centrally. Transitive routing supported.
- VPN Gateway / Direct Connect: VPN = encrypted tunnel over internet. Direct Connect / ExpressRoute = dedicated private fiber link to cloud. Higher bandwidth and lower latency.
- CDN: Content Delivery Network — distributes content to edge locations globally for low-latency access. AWS CloudFront, Azure CDN, Cloudflare.
- Load balancer types: Application LB (Layer 7, HTTP/HTTPS, path-based routing). Network LB (Layer 4, TCP/UDP, ultra-low latency). Global LB (across regions).
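The non-transitive nature of VPC peering versus transit gateway routing can be sketched as a reachability check. This is a simplified model with hypothetical function names: it assumes route tables and security rules already permit the traffic, and that the transit gateway routes between every attached VPC.

```python
def reachable_via_peering(peerings, src, dst):
    """VPC peering is non-transitive: only a direct peering connects two VPCs."""
    direct = {frozenset(pair) for pair in peerings}
    return frozenset((src, dst)) in direct


def reachable_via_transit_gateway(attachments, src, dst):
    """Hub-and-spoke: any two VPCs attached to the same hub can reach each other."""
    return src in attachments and dst in attachments
```

With peerings A↔B and B↔C, `reachable_via_peering` returns False for A to C; attaching A, B, and C to one transit gateway makes the same pair reachable.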
High Availability & Fault Tolerance
- Availability Zone (AZ): One or more isolated data centers within a cloud region, with independent power, cooling, and networking. Running across multiple AZs protects against single-AZ failure.
- Region: Geographic area containing multiple AZs. Regions are completely independent.
- RTO: Recovery Time Objective — maximum acceptable downtime. How fast must you restore?
- RPO: Recovery Point Objective — maximum acceptable data loss. How old can recovered data be?
- Active-active: Multiple instances serve traffic simultaneously. Near-instant failover with little or no downtime, because surviving instances are already serving traffic.
- Active-passive: Primary instance serves traffic; standby ready to take over. Small failover delay.
- Auto Scaling: Automatically adjusts number of compute instances based on demand metrics (CPU, requests). Maintains performance and controls cost.
- Health check: Load balancers periodically check instance health. Unhealthy instances removed from rotation automatically.
- Multi-region deployment: Workload runs across multiple geographic regions. Highest availability but highest cost and complexity.
- Chaos engineering: Intentionally injecting failures (Netflix Chaos Monkey) to test resilience and validate HA design.
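Why multi-AZ and multi-region designs raise availability can be shown with a little arithmetic. The sketch below assumes failures are independent, which real AZs approximate but do not guarantee; the function names are illustrative.

```python
import math


def redundant_availability(single: float, copies: int) -> float:
    """Redundant copies in parallel: the system is down only if every copy is down."""
    return 1 - (1 - single) ** copies


def serial_availability(components) -> float:
    """Components in series: the system is up only if every component is up."""
    return math.prod(components)
```

Two independent copies at 99% each give roughly 99.99%, while chaining two 99.9% dependencies in series drops the total to roughly 99.8%.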
Cloud Design Patterns
- Loose coupling: Services interact via APIs or queues, not direct calls. Failure of one service doesn't cascade. Foundational cloud design principle.
- Stateless design: Applications store no session state locally. State stored in shared cache (Redis) or database. Required for horizontal scaling.
- Circuit breaker: Detects repeated failures to a service and "opens" the circuit (stops calling it) to prevent cascade failures. Self-heals over time.
- Queue-based load leveling: Place a message queue between producer and consumer. Smooths traffic spikes and decouples systems.
- Retry pattern: Automatically retry failed transient operations (network hiccup, throttling). Use exponential backoff to avoid overload.
- CQRS: Command Query Responsibility Segregation — separate read and write models for scalability.
- Strangler fig: Incrementally replace a legacy monolith by routing functionality piece by piece to new microservices until old system is retired.
- Blue/Green deployment: Run two identical production environments. Switch traffic from blue (old) to green (new) instantly. Easy rollback.
- Canary deployment: Route small percentage of traffic to new version. Monitor for errors before full rollout.
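The circuit breaker pattern from the list above can be sketched as a small wrapper class. This is a minimal single-threaded illustration, not a production library: the class name, the threshold of 3 failures, and the 30-second reset timeout are assumptions, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of calling the struggling service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The "self-heals over time" behavior is the half-open step: after the timeout, one trial call is allowed, and its outcome decides whether the circuit closes or opens again.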
🎯 Most-Tested Architecture Concepts: Know the service model stack cold (IaaS/PaaS/SaaS/FaaS). Understand VPC architecture including public/private subnets, IGW, NAT Gateway, Security Groups vs NACLs. Know the difference between VMs and containers (kernel sharing, boot time, isolation). Master HA concepts: AZs vs Regions, active-active vs active-passive, RTO vs RPO.
🔒
Domain 2.0 — Cloud Security
Shared responsibility, IAM, encryption, network security, compliance, zero trust, and cloud-specific threats
22%
Shared Responsibility Model
- Core principle: Security is a shared responsibility between the CSP and the customer. The division depends on the service model.
- CSP always responsible for: Physical security of data centers, hardware, network infrastructure, hypervisor layer.
- Customer always responsible for: Data classification, identity & access management, client-side encryption, network traffic protection.
- IaaS split: CSP = hardware/virtualization/physical. Customer = OS, middleware, applications, data, network config.
- PaaS split: CSP adds OS and runtime management. Customer = application code and data only.
- SaaS split: CSP manages nearly everything. Customer = user access, data, and compliance responsibility.
- Key exam trap: Even in SaaS, the customer is responsible for their DATA and who has ACCESS to it. The CSP is never responsible for your data loss due to misconfigured permissions.
Identity & Access Management (IAM)
- IAM: Framework controlling who can do what to which cloud resources. Foundation of cloud security.
- Principle of Least Privilege: Grant only the minimum permissions required to perform a task. Always.
- IAM user: Individual identity with credentials. Avoid using root/admin accounts for daily tasks.
- IAM role: Identity with attached permissions that issues temporary credentials when assumed by services, applications, or users. Better than long-term access keys for inter-service auth.
- IAM policy: JSON document defining allowed/denied actions on specific resources. Attached to users, groups, or roles.
- RBAC: Role-Based Access Control — permissions based on job role. Most common model.
- ABAC: Attribute-Based Access Control — permissions based on resource/user attributes (tags). More granular than RBAC.
- MFA: Multi-Factor Authentication. Should be enforced for all privileged accounts and console access.
- Service account: Non-human identity for applications/workloads to authenticate to cloud services without user credentials.
- Federated identity: Uses existing external identity provider (SAML 2.0, OIDC, Active Directory) for SSO into cloud. Avoids managing separate cloud credentials.
- Just-in-time access: Privileged access granted only when needed, for a limited time. Reduces attack surface.
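How an IAM policy document yields an allow or deny decision can be sketched as follows. This is a deliberately simplified evaluator in the style of AWS policies: it only models `Effect`, `Action`, and `Resource`, uses shell-style wildcards, and ignores conditions, principals, and case-insensitive action matching. The bucket name is a placeholder.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def is_allowed(policy, action, resource):
    """Explicit Deny wins; otherwise at least one Allow is needed; default is deny."""
    allowed = False
    for stmt in policy["Statement"]:
        matches = any(fnmatchcase(action, a) for a in _as_list(stmt["Action"])) and any(
            fnmatchcase(resource, r) for r in _as_list(stmt["Resource"])
        )
        if not matches:
            continue
        if stmt["Effect"] == "Deny":
            return False  # an explicit deny overrides any allow
        allowed = True
    return allowed


# Least privilege: read-only access to one bucket, with a carve-out for a sensitive prefix.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Deny", "Action": "s3:*",
         "Resource": "arn:aws:s3:::example-bucket/secret/*"},
    ],
}
```

Anything not explicitly allowed (for example `s3:PutObject`) falls through to the implicit deny, which is what makes least privilege the default posture.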
Encryption & Key Management
- Encryption at rest: Data encrypted when stored on disk. Default in most cloud services. Typically uses AES-256.
- Encryption in transit: Data encrypted while moving over network. Use TLS 1.2 or later. Enforced via HTTPS, VPN tunnels, or mutual TLS between services.
- End-to-end encryption (E2EE): Data encrypted from source to destination. CSP cannot decrypt even if they wanted to.
- CSP-managed keys: Provider generates and manages keys. Easiest, least control. Default for most services.
- Customer-managed keys (CMK): You create and manage keys in a KMS. Provider encrypts/decrypts using your key. Revoke key = data inaccessible.
- Customer-provided keys (BYOK): Bring Your Own Key. You import your keys into cloud KMS. Maximum control.
- KMS: Key Management Service — cloud service to create, store, rotate, and audit cryptographic keys. AWS KMS, Azure Key Vault, GCP Cloud KMS.
- HSM: Hardware Security Module — dedicated hardware for key operations. Tamper-resistant. AWS CloudHSM, Azure Dedicated HSM.
- Key rotation: Periodically generating new encryption keys. Limits exposure if a key is compromised. Should be automated.
- Secrets management: Store API keys, credentials, certificates securely. AWS Secrets Manager, HashiCorp Vault, Azure Key Vault.
Network Security in the Cloud
- Security Group: Stateful instance-level firewall. Only allow rules. Return traffic automatically allowed. Default: deny all inbound.
- NACL: Stateless subnet-level ACL. Both allow and deny rules. Must explicitly allow return traffic. Rules processed in order.
- WAF: Web Application Firewall. Layer 7 protection. Blocks SQL injection, XSS, OWASP Top 10 attacks. AWS WAF, Azure WAF.
- DDoS protection: Cloud-native mitigation. AWS Shield Standard (free) / Advanced. Azure DDoS Protection.
- Private endpoints: Access cloud services (S3, databases) over private network without internet traversal. Uses VPC endpoint / Private Link.
- Bastion host / Jump server: Hardened VM in public subnet used as single entry point for SSH/RDP to private subnet instances. Minimizes attack surface.
- Zero Trust Network Access (ZTNA): "Never trust, always verify." Often replaces traditional VPN access. Grants per-application access based on identity + device posture.
- East-west traffic: Traffic between services within a cloud environment. Often less scrutinized — should be encrypted and segmented.
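The stateful versus stateless difference between Security Groups and NACLs is easier to see in a toy model. The sketch below is an illustration only: the class names are hypothetical, rules match on destination port alone, and NACL rules are `(rule_number, low_port, high_port, action)` tuples evaluated lowest number first.

```python
class StatefulFirewall:
    """Security-group-like: allow rules only; return traffic is tracked automatically."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.connections = set()

    def inbound(self, client, client_port, server_port):
        if server_port in self.inbound_ports:
            self.connections.add((client, client_port, server_port))
            return True
        return False  # default: deny all inbound

    def outbound_reply(self, client, client_port, server_port):
        # No outbound rule needed: the reply belongs to a tracked connection.
        return (client, client_port, server_port) in self.connections


class StatelessAcl:
    """NACL-like: ordered allow/deny rules; each direction is evaluated on its own."""

    def __init__(self, inbound_rules, outbound_rules):
        self.inbound_rules = inbound_rules
        self.outbound_rules = outbound_rules

    @staticmethod
    def _evaluate(rules, port):
        for _, low, high, action in sorted(rules):
            if low <= port <= high:
                return action == "allow"  # first matching rule wins
        return False  # implicit deny when nothing matches

    def inbound(self, dest_port):
        return self._evaluate(self.inbound_rules, dest_port)

    def outbound(self, dest_port):
        return self._evaluate(self.outbound_rules, dest_port)
```

With the stateless ACL, allowing inbound 443 is not enough: the reply goes to the client's ephemeral port, so an outbound rule covering that range (commonly 1024 to 65535) is also required.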
Compliance & Governance Frameworks
| Framework | Focus Area | Key Requirement |
| SOC 2 | Service org controls | Trust Services Criteria: security, availability, processing integrity, confidentiality, privacy |
| ISO 27001 | Info security mgmt | ISMS — comprehensive security management system |
| PCI DSS | Payment card data | Encryption, access control, monitoring cardholder data |
| HIPAA | Healthcare data (PHI) | Encryption, audit logs, access controls for patient data |
| GDPR | EU personal data | Consent, data residency, right to erasure |
| FedRAMP | US federal cloud | NIST-based authorization for federal workloads |
| NIST CSF | Cybersecurity framework | Identify, Protect, Detect, Respond, Recover |
| CSA CCM | Cloud controls | Cloud-specific security controls matrix |
- Data residency: Legal requirement that data must remain within specific geographic boundaries. Managed via region selection and data replication policies.
- Cloud Security Posture Management (CSPM): Continuously monitors cloud configurations for misconfigurations against security best practices and compliance standards.
Cloud-Specific Security Threats & Controls
- Misconfiguration: #1 cause of cloud security incidents. Public S3 buckets, open security groups, overly permissive IAM roles. Use CSPM to detect.
- Credential exposure: Hardcoded API keys in source code, committed to GitHub. Use secrets management and key rotation.
- Insecure APIs: Cloud services exposed via API. Require authentication, use HTTPS, implement rate limiting and input validation.
- VM escape / container escape: Attacker breaks out of VM or container to access hypervisor or host OS. Patching and privilege separation mitigate.
- Side-channel attacks: Exploiting shared hardware (Spectre, Meltdown) to leak data across tenant boundaries in multi-tenant environments.
- Data exfiltration: Unauthorized transfer of data out of cloud environment. Monitor egress traffic, use DLP, encrypt data.
- Cloud logging: Enable all audit logs. AWS CloudTrail, Azure Monitor, GCP Cloud Audit Logs. Critical for incident response and compliance.
- CWPP: Cloud Workload Protection Platform — security for cloud VMs, containers, and serverless functions. Runtime protection.
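Credential exposure is often caught with simple pattern scanning before code is committed. The sketch below looks for strings shaped like long-term AWS access key IDs (the `AKIA` prefix followed by 16 uppercase letters or digits). It is a naive illustration, not a replacement for a dedicated secret scanner, and it does not detect secret keys or other providers' credentials.

```python
import re

# Shape of a long-term AWS access key ID: "AKIA" + 16 uppercase alphanumerics.
ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text):
    """Return (line_number, match) pairs for every suspected key ID in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

A check like this fits naturally as a pre-commit hook or CI step; any hit should be treated as a leaked key and rotated, not just deleted from the file.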
🚨 Shared Responsibility Exam Trap: The most common Cloud+ exam trap is blaming the wrong party. In SaaS, if a user accidentally deletes data due to misconfigured permissions — that is the CUSTOMER's fault, not the CSP's. The CSP is responsible for availability and the underlying infrastructure, but never for how you configure access to your own data. This distinction appears in multiple scenario questions.
🚀
Domain 3.0 — Cloud Deployment
IaC, CI/CD pipelines, migration strategies, testing, and cloud resource provisioning
20%
Infrastructure as Code (IaC)
- IaC: Managing and provisioning infrastructure through machine-readable configuration files instead of manual GUI processes.
- Benefits: Consistency, repeatability, version control, automated testing, disaster recovery (rebuild from code).
- Declarative IaC: Describe the desired end state; the tool figures out how to achieve it. Examples: Terraform, AWS CloudFormation, Azure ARM templates.
- Imperative IaC: Specify the exact steps to execute. Examples: Ansible playbooks with specific tasks, shell scripts.
- Terraform: Open-source, multi-cloud IaC tool. Uses HCL. State file tracks deployed resources. Plan → Apply workflow.
- CloudFormation: AWS-native IaC. YAML or JSON templates. Stacks group related resources.
- Ansible: Agentless configuration management tool. Uses YAML playbooks. SSH-based. Good for configuration drift correction.
- Configuration drift: When actual infrastructure deviates from the desired state defined in IaC. Detected and corrected by drift detection tools.
- Idempotency: Applying the same IaC configuration multiple times produces the same result. No side effects on re-runs.
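Declarative IaC, drift, and idempotency fit together in one small loop: compare desired state to actual state, compute a plan, apply it. The sketch below models resources as plain dictionaries; the function names and resource keys are hypothetical and no real tool's API is implied.

```python
def plan(desired, actual):
    """Diff desired vs actual state, similar in spirit to a plan step."""
    return {
        "create": {k: v for k, v in desired.items() if k not in actual},
        "update": {k: v for k, v in desired.items() if k in actual and actual[k] != v},
        "delete": [k for k in actual if k not in desired],
    }


def converge(desired, actual):
    """Apply the plan and return the new actual state."""
    changes = plan(desired, actual)
    state = {k: v for k, v in actual.items() if k not in changes["delete"]}
    state.update(changes["create"])
    state.update(changes["update"])
    return state
```

Running `converge` a second time produces an empty plan and the same state, which is exactly what idempotency means; a non-empty plan against unchanged code is how drift shows up.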
CI/CD Pipelines & DevOps
- CI (Continuous Integration): Developers frequently merge code to shared repo. Automated build and test runs on every commit. Catches bugs early.
- CD (Continuous Delivery): Code always in a deployable state. Deployment to production is manual trigger after automated testing passes.
- CD (Continuous Deployment): Fully automated — every passing build automatically deployed to production without human intervention.
- Pipeline stages: Source → Build → Test (unit/integration/security) → Package → Deploy → Monitor.
- DevOps: Cultural and technical practice unifying development and operations for faster, more reliable software delivery.
- DevSecOps: Integrates security into every stage of the CI/CD pipeline. "Shift left" — find security issues early when they're cheapest to fix.
- GitOps: Uses Git as single source of truth for both application and infrastructure state. Changes go through Git pull requests.
- Artifact repository: Stores build outputs (container images, compiled binaries, libraries). Examples: JFrog Artifactory, Nexus, AWS CodeArtifact.
- Pipeline tools: Jenkins, GitHub Actions, GitLab CI/CD, AWS CodePipeline, Azure DevOps.
Cloud Migration Strategies — The 6 Rs
- Rehost ("Lift & Shift"): Move workload to cloud with no code changes. Fastest migration. No cloud-native optimization. Use for: quick wins, tight timelines.
- Replatform ("Lift & Tinker"): Minor cloud optimizations without changing core architecture. Example: move database to RDS managed service instead of self-managing on EC2.
- Refactor/Re-architect: Redesign application to be cloud-native. Microservices, serverless, containers. Highest value but most effort and risk.
- Repurchase: Replace existing application with SaaS equivalent. Example: replace on-premises CRM with Salesforce.
- Retire: Decommission applications that are no longer needed. Reduces cost and complexity.
- Retain: Keep on-premises for now. Application has compliance requirements, mainframe dependency, or isn't ready for cloud.
- Migration tools: AWS Migration Hub, Azure Migrate, Google Migrate for Compute. Assess, plan, and track migration.
- Cutover: The moment of switching from old system to new. Requires rollback plan. Often done during maintenance window.
Testing in the Cloud
- Unit testing: Tests individual functions/modules in isolation. Fastest. Run on every commit in CI pipeline.
- Integration testing: Tests interactions between multiple components or services. Ensures they work together correctly.
- Load testing: Tests system behavior under expected and peak load. Tools: Apache JMeter, AWS Load Testing, Locust.
- Stress testing: Pushes system beyond capacity to find breaking point and observe failure behavior.
- Penetration testing: Simulates attacker to find exploitable vulnerabilities. Check the CSP's penetration testing policy first: some require advance notice or prohibit certain tests (such as DoS simulation). Not the same as vulnerability scanning.
- SAST: Static Application Security Testing — analyzes source code for vulnerabilities without running it. Runs in CI pipeline.
- DAST: Dynamic Application Security Testing — tests running application from outside. Finds runtime vulnerabilities SAST misses.
- Regression testing: Ensures new code changes haven't broken existing functionality.
- Canary testing: Deploy to small percentage of users first. Monitor before full rollout. Low-risk production validation.
Deployment Strategies
- Rolling deployment: Gradually replaces old instances with new ones. Zero downtime. Rollback is slow (reverse the roll). Some users see old version, some new.
- Blue/Green deployment: Two identical environments. Switch traffic all at once. Instant rollback by switching back. Double the infrastructure cost during deployment.
- Canary deployment: Route small traffic % (1–5%) to new version. Monitor metrics. Gradually increase percentage. Catches issues before full rollout.
- A/B testing: Route different users to different versions to compare metrics (conversion rate, engagement). Business-driven, not just risk mitigation.
- In-place upgrade: Update software on existing instances. Fastest but risky — can cause downtime if upgrade fails.
- Immutable infrastructure: Never update existing servers. Always build new instances from image and replace old ones. Eliminates configuration drift.
- Feature flags: Enable/disable features at runtime without code deployment. Gradual rollout, instant disable if issues.
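Canary routing (and percentage-based feature flags) usually needs each user to land on the same version every time. One common way to get that, sketched below under the assumption that a stable user ID is available, is to hash the ID into a bucket from 0 to 99. The function names are illustrative.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Because the bucket never changes, raising the percentage from 5 to 25 keeps the original canary users on the new version and only adds more; setting it to 0 is an instant rollback.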
Containers & Orchestration in Deployment
- Dockerfile: Text file with instructions to build a Docker image. Each instruction creates a new layer.
- Docker Compose: Defines and runs multi-container applications. YAML file specifies services, networks, and volumes.
- Kubernetes Deployment: K8s resource that manages a ReplicaSet of identical pods. Handles rolling updates and rollbacks.
- Kubernetes Service: Stable network endpoint exposing a set of pods. Types: ClusterIP (internal), NodePort, LoadBalancer (external).
- Kubernetes Ingress: Manages external HTTP/HTTPS access to services. HTTP routing, SSL termination, virtual hosting.
- Helm: Kubernetes package manager. Helm charts are reusable templates for K8s applications.
- Image scanning: Scanning container images for known CVEs before deployment. Integrated into CI/CD pipeline. Tools: Trivy, Clair, Snyk.
- Namespace: K8s logical isolation within a cluster. Separates teams/environments in the same cluster. Resource quotas and RBAC can be applied per namespace.
💡 The 6 Rs Migration Strategy: Know all six — Rehost, Replatform, Refactor, Repurchase, Retire, Retain. Exam scenarios will describe a business situation and ask which strategy is most appropriate. Rehost = fastest/cheapest. Refactor = most optimized/expensive. Repurchase = replace with SaaS. Retire = just turn it off. Retain = not ready for cloud yet.
⚙️
Domain 4.0 — Cloud Operations & Support
Monitoring, cost management, automation, patching, performance, SLAs, backup/DR, and change management
22%
Monitoring & Observability
- The 3 pillars of observability: Metrics (numeric measurements over time), Logs (discrete events), Traces (path of a request through distributed services).
- Metrics: CPU, memory, network I/O, latency, error rates, request counts. Collected at regular intervals.
- Logs: Application, system, access, audit logs. Centralize in log management platform. AWS CloudWatch Logs, Azure Monitor Logs, ELK Stack.
- Distributed tracing: Tracks a single request as it flows through microservices. Essential for debugging latency in distributed systems. AWS X-Ray, Jaeger, Zipkin.
- Alerting: Threshold-based (CPU > 80%) or anomaly-based alerts. Alert fatigue is a real problem — tune thresholds carefully.
- Dashboard: Real-time visualization of key metrics. AWS CloudWatch, Azure Monitor, Grafana + Prometheus.
- Synthetic monitoring: Simulated user transactions to test availability and performance from external perspective.
- APM: Application Performance Monitoring — tracks end-user experience, code-level performance, database queries. New Relic, Datadog, Dynatrace.
- SIEM: Security Information and Event Management. Aggregates and correlates security logs for threat detection. AWS Security Hub, Microsoft Sentinel (formerly Azure Sentinel), Splunk.
Cloud Cost Management & Optimization
- FinOps: Cloud financial management practice — collaboration between finance, engineering, and operations to manage cloud spend.
- On-demand pricing: Pay per hour/second with no commitment. Highest unit cost. Flexible.
- Reserved Instances / Savings Plans: Commit to 1 or 3 years. 40–72% discount over on-demand. Best for stable, predictable workloads.
- Spot / Preemptible instances: Cheapest option (60–90% discount). Instances can be reclaimed on short notice (for example, a 2-minute warning on AWS) when the CSP needs capacity. For fault-tolerant batch jobs.
- Right-sizing: Matching instance type and size to actual workload requirements. Eliminate over-provisioned resources.
- Auto Scaling: Automatically scales in (removes) instances when demand drops. Critical for cost efficiency.
- Storage tiering: Move infrequently accessed data from expensive hot storage to cheaper cold/archive tiers automatically.
- Tagging: Apply metadata tags to all cloud resources for cost allocation, chargeback, and ownership tracking. Essential for FinOps.
- Cost anomaly detection: Automated alerts when spending deviates from baseline. AWS Cost Anomaly Detection, Azure Cost Management.
- Egress costs: Transferring data OUT of the cloud is usually charged. Transferring IN is typically free. Factor into architecture decisions.
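The pricing models above are easiest to compare with numbers. The sketch below uses a hypothetical on-demand rate of $0.10 per hour and discounts picked from within the ranges listed (40% reserved, 70% spot); actual rates vary by provider, region, and instance type.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours / 12 months


def monthly_cost(hourly_rate, instances, discount=0.0, hours=HOURS_PER_MONTH):
    """Monthly compute cost for a fleet at a given discount off on-demand."""
    return hourly_rate * (1 - discount) * instances * hours


ON_DEMAND_RATE = 0.10  # hypothetical USD per instance-hour

on_demand = monthly_cost(ON_DEMAND_RATE, instances=10)
reserved = monthly_cost(ON_DEMAND_RATE, instances=10, discount=0.40)
spot = monthly_cost(ON_DEMAND_RATE, instances=10, discount=0.70)
```

For ten always-on instances this works out to about $730 on-demand, $438 reserved, and $219 spot per month, which is why stable baseline load belongs on commitments and interruptible batch work on spot.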
Automation & Orchestration
- Auto Scaling Groups: Automatically add/remove compute instances based on policies (CPU, schedule, request count). Maintain desired state.
- Event-driven automation: Trigger actions based on events. Example: new file in S3 triggers Lambda to process it.
- Runbook automation: Documented procedures converted to automated scripts. Execute consistent, repeatable operational tasks.
- AWS Systems Manager / Azure Automation: Manage and automate tasks across fleets of VMs — patching, compliance, configuration.
- Scheduler: Run tasks on a schedule (cron). Scale down dev environments at night. Run batch jobs weekly.
- Self-healing: Auto Scaling replaces unhealthy instances. Kubernetes restarts failed containers. Eliminate manual intervention for common failures.
- Policy-as-code: Define compliance and governance policies as code. AWS Service Control Policies, OPA (Open Policy Agent), Azure Policy.
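Metric-based auto scaling typically scales the instance count in proportion to how far the metric is from its target. The sketch below follows the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler; cloud auto scaling policies differ in details such as cooldowns, which are not modeled here. The min/max defaults are assumptions.

```python
import math


def desired_replicas(current, metric, target, min_replicas=1, max_replicas=10):
    """desired = ceil(current * metric / target), clamped to the allowed range."""
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))
```

Four instances at 90% CPU against a 60% target scale out to six; the same four at 30% scale in to two, which is where the cost saving comes from.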
Backup, DR & Business Continuity
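- Backup types: Full (all data), Incremental (changes since last backup of any type), Differential (changes since last full). Incremental = smallest backup, slowest restore (last full plus every incremental since). Differential = faster restore than incremental (last full plus latest differential). Full = fastest restore, largest backup.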
- Snapshot: Point-in-time copy of a volume or database. Stored in object storage. Fast to create. Used for cloud-native backup.
- 3-2-1 backup rule: 3 copies · 2 different media/storage types · 1 copy offsite/different region.
- Geo-redundant storage: Data replicated asynchronously to secondary region. Survives regional outage.
- Pilot light DR: Minimal infrastructure running in DR region (just core services). Scale up from AMI/snapshot during failover. RTO: hours.
- Warm standby DR: Scaled-down version of full environment running in DR region. RTO: minutes.
- Active-active (multi-site) DR: Full production capacity in multiple regions simultaneously. RTO: seconds. Highest cost.
- Backup testing: Regularly test restore procedures. Untested backups are unreliable. Automated restore testing is best practice.
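The restore trade-off between incremental and differential backups comes down to how many backup sets a restore must read. The sketch below computes that chain for a chronological backup list; the function name and labels are illustrative, and it assumes at least one full backup exists.

```python
def restore_chain(backups):
    """backups: chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential". Returns the sets needed to
    restore to the most recent backup."""
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    start = last_full + 1
    # A differential already contains every change since the last full.
    diffs = [i for i in range(start, len(backups)) if backups[i][1] == "differential"]
    if diffs:
        chain.append(backups[diffs[-1]])
        start = diffs[-1] + 1
    # Each remaining incremental must be replayed in order.
    chain.extend(b for b in backups[start:] if b[1] == "incremental")
    return chain
```

A week of Sunday-full plus daily incrementals needs every set since Sunday, while Sunday-full plus daily differentials needs only two sets regardless of the day.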
SLAs, SLOs & Change Management
- SLA: Service Level Agreement — contractual commitment between CSP and customer. Defines availability guarantees and remedies (credits) for breach.
- SLO: Service Level Objective — internal performance target. More ambitious than SLA. Example: 99.95% availability internal target vs 99.9% SLA commitment.
- SLI: Service Level Indicator — actual metric being measured. Example: measured uptime percentage over the period.
- Availability math: 99.9% = 8.76 hours downtime/year. 99.95% = 4.38 hours. 99.99% = 52.6 minutes. 99.999% = 5.3 minutes.
- Error budget: Allowable amount of downtime/errors within SLO. If error budget is exhausted, freeze new deployments. SRE concept.
- Change management: Formal process for requesting, reviewing, approving, implementing, and documenting changes to production.
- Change types: Standard (low-risk, pre-approved, routine). Normal (requires approval). Emergency (critical fix, expedited approval).
- CAB: Change Advisory Board — reviews and approves significant changes.
- Rollback plan: Every change must include a tested rollback procedure. Essential for risk mitigation.
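The availability figures and the error budget idea both come from one formula: allowed downtime = period × (1 − availability). The sketch below works it through, assuming a 365-day year and a 30-day window for the error budget; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted by an availability target over the given period."""
    return period_minutes * (1 - availability_percent / 100)


def error_budget_remaining(slo_percent, downtime_minutes, period_minutes=MINUTES_PER_30_DAYS):
    """Minutes of downtime still available before the SLO is breached."""
    return allowed_downtime_minutes(slo_percent, period_minutes) - downtime_minutes
```

At 99.9% the yearly allowance is 525.6 minutes (8.76 hours). A 99.95% SLO over 30 days allows 21.6 minutes, so after a 10-minute outage 11.6 minutes of budget remain; a negative result is the signal to freeze deployments.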
Patching & Configuration Management
- Patch management: Systematic process for testing and applying security and feature updates to OS, middleware, and applications.
- Immutable patching: Don't patch running VMs — build new patched image and redeploy. Eliminates drift. Cloud-native best practice.
- Vulnerability scanning: Automated scanning of cloud resources for known CVEs and misconfigurations. Amazon Inspector, Microsoft Defender for Cloud (formerly Azure Defender), Qualys.
- Golden image / AMI: Pre-configured, hardened VM image used as a base for all deployments. Contains approved OS, patches, agents, and configuration.
- Configuration drift: Divergence between intended and actual configuration. IaC and configuration management tools detect and correct drift.
- CMDB: Configuration Management Database — records all configuration items (CIs) and their relationships. Source of truth for infrastructure.
- Patch baseline: Defines which patches are required and their criticality thresholds. Critical patches applied within 24–72 hours in most frameworks.
💰 Cost Optimization — Exam Favorite: Know the three instance pricing models: on-demand (flexible, expensive), reserved/savings plans (1–3 year commitment, 40–72% off, best for stable workloads), and spot/preemptible (cheapest 60–90% off, but can be terminated — only for fault-tolerant batch jobs). Right-sizing + auto scaling + storage tiering + tagging are the four pillars of cloud cost management. The exam will give you a scenario and ask which option reduces cost most effectively.
🛠️
Domain 5.0 — Troubleshooting
Cloud troubleshooting methodology, connectivity issues, performance problems, security incidents, and deployment failures
12%
Cloud Troubleshooting Methodology
- Step 1 — Identify the problem: Check dashboards, alerts, and logs. Define exact symptoms. Determine blast radius (how much is affected?).
- Step 2 — Establish theory: What changed recently? Deployment? Config change? Traffic spike? Hardware failure? Check change log first.
- Step 3 — Test the theory: Check CloudWatch/Azure Monitor metrics. Review CloudTrail/audit logs. Reproduce in non-production if possible.
- Step 4 — Plan of action: Define fix with rollback plan. Assess blast radius of the fix. Get approval if production change needed.
- Step 5 — Implement: Apply fix. Make one change at a time. Document what you did and when.
- Step 6 — Verify: Confirm issue resolved AND no regression introduced. Monitor for recurrence.
- Step 7 — Document: Record root cause, timeline, resolution, and preventive measures. Post-incident review (blameless postmortem).
Connectivity & Networking Issues
- Cannot reach instance: Check Security Group (inbound rules correct port/IP?), NACL, route table, IGW attached, instance running, correct public IP.
- Cannot reach internet from private subnet: NAT Gateway configured? Route table has route to NAT GW? NAT GW in public subnet with IGW? Elastic IP attached?
- VPC peering not working: Peering connection accepted? Route tables on BOTH sides updated? Security Groups allow traffic? NACL rules?
- DNS resolution failure: Check DNS settings in VPC (enableDnsSupport, enableDnsHostnames). Check Route 53 resolver rules if using custom DNS.
- On-premises to cloud connectivity: VPN tunnel up? BGP session established? Correct routes advertised? Firewall rules? Check VPN CloudWatch metrics.
- High latency: Traffic routing correctly? Wrong region? CDN misconfigured? Throttling? Check network metrics for packet loss and retransmits.
- Load balancer returning 502/504: Backend targets unhealthy? Health checks misconfigured? Target port wrong? Instance overloaded?
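A quick TCP probe can narrow down which layer to inspect first. As a rule of thumb (not a guarantee), a timeout suggests packets are being dropped silently, which points at a Security Group, NACL, or missing route, while a refused connection suggests the host was reached but nothing is listening on that port. The sketch below uses only the Python standard library; the function name and the result labels are illustrative.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: network path works, service is not listening.
        return "refused"
    except socket.timeout:
        # No answer at all: traffic is likely filtered or not routed.
        return "timeout"
    except OSError:
        # Anything else, e.g. DNS resolution failure or no route to host.
        return "error"
```

Example: `probe_tcp("10.0.1.5", 22)` run from a bastion host (the address is a placeholder) returning `"timeout"` would send you to the Security Group and NACL rules before looking at the instance itself.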
Performance & Resource Issues
- High CPU: Right-size instance (scale up). Enable Auto Scaling. Profile application for CPU-intensive code. Check for runaway processes.
- Memory exhaustion: Application memory leak or insufficient RAM. Upgrade instance type. Add memory limits to containers. Check for runaway or orphaned processes holding memory.
- High storage latency (IOPS): Block storage IOPS limit reached. Upgrade to higher-performance storage tier (gp2 → gp3 or io1/io2). Monitor burst credits (for example, the gp2 BurstBalance metric in CloudWatch).
- Database performance: Missing indexes, inefficient queries, connection pool exhaustion, storage I/O bottleneck. Enable slow query logging. Consider read replicas.
- Cold start latency (Lambda/FaaS): First invocation slow while environment initializes. Use Provisioned Concurrency. Keep functions warm. Reduce package size.
- API throttling: Exceeding rate limits. Implement exponential backoff and retry. Request quota increase from CSP. Use API caching.
- Container OOMKilled: Container exceeded memory limit and was killed. Increase memory limits. Fix memory leak in application.
- Cascading failures: One service failure triggers others. Circuit breaker pattern, timeouts, and bulkhead isolation prevent cascade.
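The exponential backoff mentioned under API throttling can be sketched as follows. This is a minimal illustration using the "full jitter" variant, one common choice among several; `ThrottledError` is a hypothetical stand-in for whatever rate-limit error a given SDK raises, and the default delays are arbitrary.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a provider's rate-limit error (for example HTTP 429)."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # The delay ceiling doubles each attempt (base, 2x, 4x, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time up to the ceiling so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the wait can be replaced in tests. Bounding the number of attempts matters as much as the delay: unbounded retries against a struggling service are one way cascading failures start.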
Security Incident Response in the Cloud
- Containment first: Isolate compromised instance by removing from load balancer, revoking IAM credentials, modifying security group to block all traffic.
- Preserve evidence: Take snapshot of compromised volume before terminating. Preserve CloudTrail and VPC Flow Logs for forensics.
- Compromised IAM credentials: Immediately revoke/rotate the key. Review CloudTrail for all API calls made with that key. Assess damage scope.
- Public S3 bucket (data exposure): Immediately make bucket private. Enable S3 Block Public Access. Review access logs for exfiltration. Enable GuardDuty.
- Unusual API activity: Check CloudTrail for actions from unexpected IPs, regions, or times. GuardDuty or Microsoft Defender for Cloud (formerly Azure Defender) may alert on anomalous behavior.
- Cryptomining infection: Unusual CPU spike, unexpected egress to mining pools. Check running processes, outbound connections. Snapshot the volume for evidence, then terminate and replace instance.
- VPC Flow Logs: Capture metadata for all traffic in VPC. Essential for forensics — shows source/dest IPs, ports, protocol, accept/reject status.
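For forensics, rejected flows are often the first thing to pull out of VPC Flow Logs. The sketch below parses records in the default (version 2) space-separated format and keeps the fields named in the bullet above. The sample records are made up for illustration, and custom log formats would need a different field order.

```python
from typing import NamedTuple, Optional


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int  # IANA protocol number, e.g. 6 = TCP
    action: str    # ACCEPT or REJECT


def parse_flow_record(line: str) -> Optional[FlowRecord]:
    """Parse one default-format flow log record; return None for NODATA/SKIPDATA records."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 fields, got {len(fields)}")
    # Default order: version account-id interface-id srcaddr dstaddr srcport dstport
    # protocol packets bytes start end action log-status
    if fields[13] != "OK":
        return None  # these records carry "-" placeholders instead of traffic fields
    return FlowRecord(fields[3], fields[4], int(fields[5]), int(fields[6]),
                      int(fields[7]), fields[12])


# Illustrative records only, not real traffic.
sample = """\
2 123456789010 eni-0abc 10.0.1.5 10.0.2.9 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK
2 123456789010 eni-0abc 203.0.113.7 10.0.1.5 51000 22 6 3 180 1700000000 1700000060 REJECT OK
2 123456789010 eni-0abc - - - - - - - 1700000000 1700000060 - NODATA
"""

records = [r for r in map(parse_flow_record, sample.splitlines()) if r is not None]
rejected = [r for r in records if r.action == "REJECT"]
for r in rejected:
    print(f"REJECT {r.srcaddr}:{r.srcport} -> {r.dstaddr}:{r.dstport} proto={r.protocol}")
```

Flow logs record metadata only, so a REJECT entry shows that a Security Group or NACL blocked the flow but not what the payload contained.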
Deployment & Application Failures
- Deployment failure: Check CI/CD pipeline logs. Image pull error? Insufficient quota? Unhealthy health check during deployment? Wrong config/secrets?
- Container crash loop: Container starts, crashes immediately, restarts repeatedly. Check container logs (kubectl logs). Check startup dependencies. Missing env vars?
- Image pull error: Wrong image name/tag? Container registry credentials expired? Private registry accessible from cluster? Network policy blocking?
- IaC apply failure: State file drift? Resource quota exceeded? Insufficient IAM permissions? Dependency ordering issue? Check provider-specific error messages.
- Application 500 errors after deployment: New code bug? Database schema migration failed? Config variable missing? Roll back deployment if critical.
- Auto Scaling not triggering: Check scaling policy thresholds. CloudWatch alarm firing? Cooldown period active? Min/max limits reached? Service role permissions?
- Certificate errors: TLS cert expired? Wrong domain (CN/SAN mismatch)? Self-signed cert not trusted? Certificate not provisioned in the region the service expects?
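Certificate expiry is easy to check before it causes an outage. The sketch below uses Python's standard `ssl` module; the function names are illustrative. Note that with the default context, an already expired, untrusted, or mismatched certificate makes the handshake itself fail with `ssl.SSLCertVerificationError`, which is a diagnosis in its own right.

```python
import socket
import ssl
import time


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now=None) -> float:
    """Days left before the notAfter timestamp (negative if already expired).

    not_after uses the format returned by getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400
```

Usage: `days_until_expiry(fetch_not_after("example.com"))` returns the remaining days for that host's certificate; alerting when the value drops below a chosen threshold (say 30 days) is a common preventive measure.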
Key Cloud Troubleshooting Tools
- AWS CloudTrail — Logs all API calls to AWS. Who did what, when, from where. Essential for security investigation and compliance.
- VPC Flow Logs — Captures network traffic metadata in VPC. Shows allowed/denied connections. Used to debug Security Group and NACL issues.
- CloudWatch / Azure Monitor — Metrics, logs, alarms, dashboards. Central observability platform for each CSP.
- AWS Config / Azure Policy — Records configuration changes to resources over time. Shows what changed and when. Compliance evaluation.
- kubectl logs/describe/events — Kubernetes troubleshooting. Logs = container output. Describe = resource state. Events = recent cluster events.
- AWS Trusted Advisor — Automated best practice checks across cost, performance, security, fault tolerance, and service limits.
- Cloud Shell / CLI — Browser-based or local CLI (aws cli, az cli, gcloud) for direct resource querying and management.
🎯 Troubleshooting Layer-by-Layer: Always eliminate layers systematically. For connectivity: Instance state → Security Group → NACL → Route table → IGW/NAT → VPC peering → On-premises firewall. For application issues: Health check → Target group → LB listener rules → App logs → Dependencies (DB, cache, external APIs). For cost spikes: Check new resources deployed → check data transfer → check auto scaling events → check reserved instance expiry.
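The layer-by-layer approach amounts to running ordered checks and stopping at the first one that fails. A minimal sketch follows; the check results are hard-coded placeholders, and in practice each callable would query the provider's API or CLI.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_failing_layer(checks: Iterable[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run (layer_name, check) pairs in order; return the first layer whose check fails."""
    for name, check in checks:
        if not check():
            return name
    return None  # every layer passed


# Hypothetical results for the connectivity order given above.
connectivity_checks = [
    ("instance state", lambda: True),
    ("security group", lambda: True),
    ("NACL", lambda: False),
    ("route table", lambda: True),
]

culprit = first_failing_layer(connectivity_checks)
print(f"first failing layer: {culprit}")
```

Stopping at the first failure keeps the investigation to one change at a time: fix that layer, rerun the checks, and only then move on to the next.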