Scenario-Based Advanced DevOps Interview Questions and Answers
1. Your production deployment failed. What will you do?
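Roll back to the last known-good release immediately to restore service, then check the deployment and application logs to identify the root cause. Fix and verify the issue in a lower environment before redeploying.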
2. Application is slow in production. How do you troubleshoot?
Check CPU and memory usage, pod status, and application logs, then look for slow database queries, network latency, and whether autoscaling is keeping up with load.
3. Kubernetes pod keeps restarting. What is your approach?
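Check the pod logs (including the previous container's logs with kubectl logs --previous), run kubectl describe pod to see events and the last exit reason, and review resource limits, liveness probe settings, and crash errors.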
4. CI pipeline is failing intermittently. What will you check?
Look for flaky tests, network issues, failing dependency downloads, and resource limits on the build agents.
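For steps that fail because of transient network problems, such as dependency downloads, one common mitigation is to retry with exponential backoff. The following is a minimal sketch, not a definitive implementation; the function name retry and its parameters are hypothetical.

```python
import time


def retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

Retries suit network-bound steps only. Wrapping a flaky test in a retry hides the flakiness instead of fixing it.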
5. Secrets are hardcoded in code. How do you fix this?
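Move the secrets to Vault or a cloud secret manager and inject them at runtime, for example via environment variables. Rotate any secret that was committed, since it remains in version control history.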
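On the application side, reading an injected secret might look like the sketch below. It assumes the platform has already placed the value in an environment variable; the helper name require_secret and the variable name DB_PASSWORD are hypothetical.

```python
import os


def require_secret(name, environ=os.environ):
    """Return the secret stored in the given environment variable.

    Fails fast at startup if the variable is missing or empty, so the
    application never runs with a blank credential.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


# Typical use at startup (hypothetical variable name):
# db_password = require_secret("DB_PASSWORD")
```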
6. One microservice is causing whole system failure. What will you do?
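Use a circuit breaker so callers fail fast instead of piling up on the broken service, isolate the faulty service, roll back its version, and scale it independently of the rest of the system.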
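To make the circuit breaker idea concrete, here is a minimal single-threaded sketch. The class name, thresholds, and timeout are assumptions for illustration; production systems usually rely on a library or a service mesh rather than hand-rolled code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting on timeouts, which keeps one failing dependency from exhausting threads and connections upstream.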
7. How do you perform zero downtime deployment?
Use rolling updates, blue-green, or canary deployments, with health checks gating traffic to new instances.
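The canary part can be illustrated with a small routing sketch: send a fixed percentage of requests to the new version, chosen deterministically so the same user always lands on the same version. This is an assumption-laden illustration (the function name and the use of a user ID as the key are hypothetical); in practice the split is usually configured in the load balancer, ingress, or service mesh.

```python
import hashlib


def routes_to_canary(request_key, canary_percent):
    """Return True if this key falls in the canary share of traffic.

    The key (for example a user ID) is hashed into one of 100 buckets, so
    routing is stable across requests and across processes.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < canary_percent
```

Raising canary_percent step by step while watching health checks and error rates moves traffic gradually, and setting it back to 0 acts as an immediate rollback.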
8. Production DB is down. Immediate steps?
Fail over to a replica, check alerts, analyze the logs, and restore from backup if needed.
9. Disk is full on server. What will you do?
Find the largest files, clean up old logs, extend the volume if needed, and set up log rotation so the problem does not recur.
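Log rotation is often handled by a system tool such as logrotate, but an application can also cap its own log size. The sketch below uses Python's standard RotatingFileHandler; the helper name build_rotating_logger and the size limits are assumptions chosen for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path, max_bytes=10 * 1024 * 1024, backup_count=5):
    """Return a logger whose file is rotated once it reaches max_bytes.

    At most backup_count old files (path.1, path.2, ...) are kept, so total
    disk use stays bounded at roughly max_bytes * (backup_count + 1).
    """
    logger = logging.getLogger(f"rotating:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backup_count)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```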
10. A new release caused errors. What is your rollback strategy?
Roll back the deployment, verify stability, analyze metrics to find the cause, and then apply a fix.
11. Kubernetes cluster node is not responding.
Drain the node so its pods are rescheduled onto healthy nodes, check the node logs, then restart or replace the node.
12. High latency reported by users.
Check the load balancer, network, pod scaling, database performance, and CDN.
13. How do you handle configuration drift?
Detect drift by running terraform plan against the live infrastructure, then reapply the configuration to restore the intended state.
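Drift detection can be automated by running the plan on a schedule and inspecting its exit code. With the -detailed-exitcode flag, terraform plan exits with 0 when there are no changes, 1 on error, and 2 when changes are present. The wrapper below is a sketch under that assumption; the function names are hypothetical, and exit code 2 also covers configuration changes that simply have not been applied yet.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANING = {
    0: "in sync",
    1: "error",
    2: "drift or pending changes",
}


def interpret_plan_exit(code):
    """Translate a plan exit code into a short status string."""
    return PLAN_EXIT_MEANING.get(code, "unknown")


def check_drift(workdir):
    """Run terraform plan in workdir and report whether state matches reality."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit(result.returncode)
```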
14. Unauthorized access attempt detected.
Block the source IP, rotate any credentials that may be exposed, audit the logs, and tighten IAM rules.
15. A pod consumes too much CPU.
Set resource requests and limits, enable the Horizontal Pod Autoscaler (HPA), and optimize the application.
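It helps to know how the HPA decides on a replica count. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below applies that rule and clamps the result to the configured bounds; it deliberately ignores the tolerance band and stabilization windows that the real controller also applies, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Apply the basic HPA scaling rule and clamp to the min/max bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 pods averaging 90% CPU against a 60% target scale to 6 pods, and the same 4 pods at 30% scale down to 2.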
16. How do you deploy to multiple environments?
Use pipeline stages that deploy the same artifact to each environment with environment-specific configuration.
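The idea of one artifact with per-environment configuration can be sketched as data. Everything here is hypothetical (the environment names, settings, and artifact tag); the point is only that the artifact field never changes between stages while the configuration does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    artifact: str      # identical for every environment
    environment: str
    config: dict       # the only part that varies


# Hypothetical per-environment settings.
ENV_CONFIG = {
    "dev": {"replicas": 1, "log_level": "DEBUG"},
    "staging": {"replicas": 2, "log_level": "INFO"},
    "prod": {"replicas": 6, "log_level": "WARNING"},
}


def plan_deployments(artifact, env_config):
    """Pair one built artifact with each environment's configuration."""
    return [Deployment(artifact, env, cfg) for env, cfg in env_config.items()]
```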
17. Build artifacts are inconsistent.
Publish each build once to an artifact repository and promote that same artifact across environments instead of rebuilding it.
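One way to confirm that the promoted artifact is byte-for-byte the one that was built and tested is to compare checksums at each stage. The sketch below assumes the build step recorded a SHA-256 digest alongside the artifact; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return True if the file matches the digest recorded at build time."""
    return sha256_of(path) == expected_sha256.lower()
```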
18. Logs are missing in production.
Check the log agent, file permissions, disk space, and the logging pipeline configuration.
19. API is returning 500 errors.
Check application logs, database connectivity, and service health, and roll back if needed.
20. CI/CD pipeline is too slow.
Parallelize independent stages, cache dependencies, and speed up or split slow test suites.
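Dependency caching usually works by deriving a cache key from the lockfile, so the cache is reused until the dependencies actually change. Most CI systems offer this as built-in configuration; the sketch below only illustrates the principle, and the function name and key format are assumptions.

```python
import hashlib


def cache_key(prefix, lockfile_bytes):
    """Build a cache key that changes only when the lockfile content changes."""
    fingerprint = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{prefix}-{fingerprint}"
```

Two builds with an identical lockfile get the same key and restore the same cache, while any dependency change produces a new key and a fresh download.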
21. Kubernetes service is not accessible.
Check the service type, selector and endpoints, network policies, and ingress.
22. Secrets expired unexpectedly.
Enable automatic rotation in the secret manager and make sure applications pick up the updated secrets.
23. Memory leak detected.
Restart the pods as a short-term mitigation, analyze a heap dump to locate the leak, and fix it in code.
24. Cloud bill increased suddenly.
Analyze usage by service, remove unused resources, and enable autoscaling so capacity follows demand.
25. How do you handle disaster recovery?
Use backups, replication, tested recovery scripts, and regular DR drills.
26. How do you test new infra changes?
Apply the change in staging, run tests, and then promote it to production.
27. Service crashes during traffic spike.
Enable autoscaling and load balancing so traffic is spread across enough instances.
28. Pipeline triggered accidentally.
Stop the pipeline, restrict who can trigger it, and require approvals.
29. Cluster upgrade failed.
Roll back the upgrade and retry using a staged upgrade approach.
30. How do you ensure system reliability?
Use monitoring, alerts, redundancy, chaos testing, and automation.
Source: sureshtechlabs.com