Amazon
Capstone Software Engineer
Problem
Engineering teams often rely on multiple dashboards, metrics, and logs to understand system health. Investigating anomalies can require manually searching across several tools before engineers can determine what happened and where to begin troubleshooting.
Built / Contributed
Helped build an observability and anomaly triage platform for AWS environments that centralized monitoring and investigation workflows across resources deployed in multiple regions and accounts. The platform automated metric collection and machine learning–based anomaly detection, analyzed dashboard screenshots for visual anomalies, prioritized findings based on severity, and provided direct access to relevant logs and operational context, enabling engineers to identify and investigate potential issues from a single interface.
Key Metrics & Highlights
- Reduced manual investigation time by 2+ hours per day for engineering teams.
- Worked across user accounts and services in every AWS region worldwide.
- Reduced manual log searching from over 10,000 logs to roughly 50 relevant logs.