Senior Manager, Site Reliability Engineer

Eclipse Foundation, Inc. • Ottawa • 3w ago

About the Eclipse Foundation

The Eclipse Foundation is a globally recognized nonprofit organization that supports a vibrant community of open source projects and contributors. With a commitment to vendor neutrality and transparency, we provide a collaborative environment for innovation across industries including cloud, edge, AI, and developer tooling. Our team is remote-first, inclusive, and passionate about open source.

Position Summary

We are seeking a Senior Manager, Site Reliability Engineer to lead and evolve the infrastructure supporting critical services used by millions of community members, including the Open VSX Registry. Reporting to the Director of IT, you will lead the transformation of our services towards 24/7 high availability with strong security practices. You will also be responsible for planning, uptime, incident response, roadmap execution, and long-term sustainability.

This role is central to our mission of empowering developers, enabling collaboration, and ensuring user freedoms by delivering services that are secure, resilient, and aligned with the strategic goals of the Foundation.

Location: Ottawa, Ontario. You must be able to travel to a data centre when needed to assist with on-site physical work.

What You’ll Do

Architect and manage Kubernetes deployments for Open VSX in production environments
Oversee PostgreSQL and Elasticsearch clusters, ensuring data integrity, performance, and scalability
Implement and refine monitoring, alerting, and incident response systems to maintain high service reliability
Collaborate with development teams to improve CI/CD pipelines and deployment workflows
Partner with the Security team to implement and uphold organisational policies and secure-by-design practices
Lead root cause analysis and postmortems for service disruptions, driving continuous improvement
Provide technical leadership and mentorship to junior operations staff
Engage with the community and users to resolve support issues and gather feedback
Maintain documentation and contribute to operational playbooks
Define and report on service KPIs, SLOs, and operational health indicators
Provide strategic advice to leadership on platform operations and technology decisions
Contribute to annual planning cycles by informing resource needs, tooling requirements, and infrastructure budgeting

What You’ll Bring

5+ years of experience in site reliability engineering, DevOps, or IT operations
Deep expertise in Kubernetes, Helm, and container orchestration
Strong experience with PostgreSQL and Elasticsearch in production environments
Proficiency in monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack)
Solid scripting and automation skills (e.g., Bash, Python, Ansible)
Familiarity with GitHub Actions or similar CI/CD tools
Excellent troubleshooting skills and a proactive mindset
Ability to work independently in a remote, multicultural team
Excellent communication skills
Bonus: experience supporting open source infrastructure or registries

Why Join Us

Competitive compensation and benefits
Flexible work hours and remote-first culture
“Corporate Recharge” days and right-to-disconnect policy
Opportunity to shape the future of open source infrastructure

We offer competitive compensation along with a comprehensive benefits package. We thank all applicants for their interest; however, only those selected for an interview will be contacted. For more information about the Eclipse Foundation, please visit our website at eclipse.org.

The Eclipse Foundation respects the dignity and independence of people with disabilities and is committed to providing accommodation and support throughout any recruitment process. If you require any special accommodation or support, please let us know when applying.