Department: Solutions Consulting
Location: Canada
Description:
At Vitech, we believe in the power of technology to simplify complex business processes. Our mission is to bring better software solutions to market, addressing the intricacies of the insurance and retirement industries. We combine deep domain expertise with the latest technological advancements to deliver innovative, user-centric solutions that future-proof and empower our clients to thrive in an ever-changing landscape. With over 1,600 talented professionals on our team, our innovative solutions are recognized by industry leaders like Gartner, Celent, Aite-Novarica, and ISG.
We offer a competitive compensation package along with comprehensive benefits that support your health, well-being, and financial security.
Senior Site Reliability Engineer (SRE)
Location: Canada or United States (Remote Role)
Senior Site Reliability Engineer (SRE) -- Join Our Global Engineering Team
**About the Role: Senior SRE**
What you will do:
- Own and manage our AWS cloud-based technology stack, using native AWS services and top-tier SRE tools to support multiple client environments with Java-based applications and microservices architecture.
- Define SRE strategy, vision, and goals aligned to Vitech's overall objectives. Establish roadmaps and plans for improving system reliability, scalability, and efficiency.
- Collaborate with architecture review boards and solution architects, and take part in solution reviews and implementations.
- Design, refine, and implement SLIs and SLOs that cover a broad spectrum of SRE concerns: availability, performance, and error budgeting.
- Design, deploy, and manage AWS Aurora PostgreSQL clusters for high availability and scalability. Optimize SQL queries, indexes, and database parameters for performance tuning.
- Automate database operations using Terraform, Ansible, AWS Lambda, and AWS CLI. Manage Aurora's read replicas, auto-scaling, and failover mechanisms.
- Enhance infrastructure as code (IaC) patterns using technologies like Terraform, CloudFormation, Ansible, Python, and SDKs. Collaborate with DevOps teams to integrate Aurora with CI/CD pipelines.
- Provide full-stack support, per the assigned schedule, for applications across technologies such as Oracle WebLogic, AWS Aurora PostgreSQL, Oracle Database, Apache Tomcat, AWS Elastic Beanstalk, Docker/ECS, EC2, and S3.
- Troubleshoot database incidents, perform root cause analysis, and implement preventive measures. Document database architecture, configurations, and operational procedures.
- Ensure high availability, scalability, and performance of PostgreSQL databases on AWS Aurora. Monitor database health, troubleshoot issues, and perform root cause analysis for incidents.
- Embrace SRE principles such as chaos engineering, reliability, and toil reduction.
What We're Looking For:
- Proven hands-on experience as an SRE for critical, client-facing applications, with the ability to dive deep into daily SRE tasks, manage incidents, and oversee operational tools.
- 4+ years of experience developing and/or administering software in the AWS public cloud, with in-depth experience hosting applications in AWS (EC2, EBS, ECS/EKS, Elastic Beanstalk, RDS, CloudWatch).
- 3+ years of experience managing relational databases (Oracle and/or PostgreSQL) in both cloud and on-prem environments, including SRE tasks such as backup/restore, performance troubleshooting, and replication.
- Demonstrable cross-functional, full-stack knowledge of compute, storage, networking, security, and databases.
- Strong understanding of AWS networking concepts (VPC, VPN/DX/Endpoints, Route53, CloudFront, Load Balancers, WAF).
- Experience with containerized applications (Docker, Kubernetes, ECS). Leverage AWS Aurora features (e.g., read replicas, auto-scaling, multi-region deployments) to enhance database performance and reliability.
- Familiarity with data lake architecture, Elasticsearch, ZooKeeper, and DynamoDB is a plus.
- Familiarity with tools like pgAdmin, psql, or other database management utilities. Automate routine database maintenance tasks (e.g., vacuuming, reindexing, patching). Knowledge of backup and recovery strategies (e.g., pg_dump, PITR).
- Set up and maintain monitoring and alerting systems for database performance and availability (e.g., CloudWatch, Honeycomb, New Relic, Dynatrace).
- Work closely with development teams to optimize database schemas, queries, and application performance. Provide database support during application deployments and migrations.
- Hands-on experience with web/application layers (Oracle WebLogic, Apache Tomcat, AWS Elastic Beanstalk, SSL certificates, S3 buckets).
- Automation experience with Infrastructure as Code (Terraform, CloudFormation, Python, Jenkins, GitHub/Actions). Knowledge of multi-region Aurora Global Databases for disaster recovery.
- Scripting experience in Python, Bash, Java, JavaScript, Node.js.
- Oversee and streamline change management procedures, efficiently handling daily production change requests to ensure seamless operations.
- Excellent written and verbal communication and critical-thinking skills.
Join Us at Vitech! If you thrive in a dynamic environment and are eager to drive innovation in SRE practices, we want to hear from you!