About the Job
Your Impact
- Design, implement, and optimize resilient infrastructure systems and tools that ensure performance, reliability, and scalability.
- Lead efforts to improve observability through logging, metrics, alerting, and tracing systems (e.g., Prometheus, Grafana, OpenTelemetry).
- Own the development of infrastructure-as-code (IaC) and CI/CD improvements that reduce deployment risks and time-to-resolution.
- Diagnose complex issues across distributed systems and proactively address architectural weaknesses.
- Participate in and lead incident response, postmortems, and root cause analysis with a focus on continuous learning.
- Drive adoption of SRE principles across engineering teams, championing reliability as a shared responsibility.
What You Bring to the Table
- 2 to 4 years of experience in software development, SRE, DevOps, or infrastructure engineering.
- Expertise in a modern programming language such as Golang, Python, Java, or similar.
- Hands-on experience with cloud platforms (e.g., AWS, GCP) and container orchestration (e.g., Kubernetes).
- Strong background in system design, with an emphasis on scalability, fault tolerance, and security.
- Deep knowledge of version control, CI/CD pipelines, and Git-based workflows.
- Experience with observability tooling (e.g., Grafana, Prometheus, Datadog, ELK).
- Proven track record of mentoring peers and influencing technical decisions across teams.
- Familiarity with service-level objectives (SLOs), error budgets, and incident management frameworks.
- Contributions to open-source projects or SRE community initiatives.
- A degree in Computer Science, Engineering, or a related field (or equivalent experience).
Perks
Vendasta