About Us
The Infrastructure/SRE Developer will collaborate with organizational leads and cross-functional teams to ensure our infrastructure, automation, and reliability practices align with business priorities. You will lead reliability-focused infrastructure initiatives, implement observability best practices, and drive operational excellence. Keen attention to detail, strong problem-solving abilities, and deep expertise in cloud systems are essential. This role focuses on building resilient infrastructure, implementing configuration management, enhancing CI/CD pipelines, and improving system performance, scalability, and availability to meet SLOs. The Infrastructure/SRE Developer is also responsible for incident response and on-call management, driving root cause analysis, facilitating blameless postmortems, and implementing remediation plans to prevent recurrence. You will continuously improve monitoring, alerting, and automated recovery mechanisms to minimize downtime and ensure high service reliability.
Things You'll Do:
● Ensure high reliability and uptime of production systems through proactive monitoring, incident response, and capacity planning.
● Develop and maintain automated solutions for configuration management, deployment, monitoring, and alerting/self-healing.
● Participate in on-call rotations, lead incident response efforts, and drive root cause analysis to prevent recurrence.
● Define, measure, and track SLIs, SLOs, and SLAs, ensuring alignment with business and reliability goals.
● Collaborate with application and infrastructure teams to design resilient, scalable, and secure architectures.
● Adopt and leverage AI-powered solutions to optimize observability, anomaly detection, automated remediation, and operational forecasting. Implement and refine AI-assisted automation workflows to streamline incident management and reduce human intervention in repetitive tasks.
● Continuously improve system performance, scalability, cost efficiency, and observability across production and pre-production environments.
● Work closely with developers to integrate SRE and security practices into CI/CD pipelines and development workflows.
● Lead and contribute to blameless postmortems and implement action plans to strengthen future resilience.
● Document runbooks, operational workflows, and architectural decisions to ensure knowledge sharing and operational consistency.
● Drive a culture of reliability engineering, automation, and AI adoption to enhance operational excellence and accelerate business innovation.
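To make the SLI/SLO responsibility above concrete, here is a minimal sketch of how an availability SLI might be compared against an SLO target and its error budget. The function name, the report fields, and the request counts are hypothetical illustrations, not part of Perceptyx's tooling.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    error_budget: float     # allowed fraction of bad requests (1 - SLO target)
    budget_consumed: float  # share of the error budget already used (1.0 = exhausted)
    slo_met: bool           # True when the SLI is at or above the target


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    error_budget = 1.0 - slo_target
    budget_consumed = (1.0 - sli) / error_budget
    return SloReport(sli, error_budget, budget_consumed, sli >= slo_target)


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests in the window, 500 of them failed, 99.9% target.
    report = evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo_target=0.999)
    print(f"SLI={report.sli:.4%} budget used={report.budget_consumed:.0%} met={report.slo_met}")
```

In this sketch, an SLA would typically be a looser, contractual threshold layered on top of the same measurement.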
Things You'll Bring:
Years of Work Experience: 5-7 years
Education/Skills & Capabilities: Bachelor's degree (4-year) in information technology, computer science, engineering, or a relevant field preferred
● Strong expertise in Linux systems, networking, distributed architectures, and AWS cloud platforms.
● Hands-on experience with Infrastructure as Code (IaC) tools such as Terraform, AWS CDK, and configuration management (Ansible, SaltStack).
● Proven ability to build and maintain CI/CD pipelines and automate deployment workflows.
● Deep knowledge of monitoring, observability, and alerting tools (Datadog, Prometheus, ELK, Grafana) with experience implementing self-healing systems.
● Experience defining and tracking SLIs, SLOs, and SLAs, using data-driven insights to guide operational decisions.
● Proficiency in incident response, root cause analysis, and blameless postmortems. Expertise in capacity planning, cost optimization, and performance tuning for large-scale systems.
● Familiarity with AI-driven operational tools for anomaly detection, predictive scaling, and intelligent alerting.
● Experience integrating AI-assisted runbooks and automated remediation workflows to reduce MTTR.
● Strong understanding of cloud-native architecture patterns, container orchestration (e.g., Kubernetes, EKS), and service meshes.
● Ability to collaborate with developers to embed reliability, observability, and security best practices throughout the SDLC.
● Excellent analytical and problem-solving skills, capable of diagnosing complex distributed system issues.
● Effective communication and mentoring skills, fostering a culture of continuous learning and operational excellence.
● Proficiency in documenting architecture, operational processes, runbooks, and AI/automation workflows.
Compensation:
Perceptyx is focused on equitable pay for all our staff and aims for transparency with our pay practices. The annual salary range for the role is 110,000 to 140,000 CAD. This range represents the expected base salary for this position. The actual salary may vary based upon several factors, including, but not limited to, relevant skills/experience, time in the role, business line, and geographic/office location.
Benefits:
We Care About The Whole Person 🫶

- Healthy medical, dental, and vision insurance for you and your family
- Life insurance up to 1x your annual salary (with a cap) paid by Perceptyx
- Generous Maternity, Paternity, and Adopter leave benefits with flexibility on when you use this benefit
- Compassionate Care Program with paid time off to care for family members
- Generous Bereavement Leave that also supports Pet Parents
- For USA employees: 401(k) plan, along with a company match and immediate vesting upon hire
- For Canadian employees: you can contribute to a pension plan. Perceptyx will provide an employer match for the pension.
- As hard as we work, we also know how essential it is to take time away to rest and recharge. We offer flexible paid vacation with the expectation that every team member takes off at least 10 business days per calendar year.
- 16 paid holidays per calendar year.
- Mac or PC laptop options
- Materials for working at home

Perceptyx Equal Employment Opportunity Policy:
Perceptyx celebrates diversity and an inclusive environment. We focus on providing an environment of mutual respect where equal employment opportunities are available to all employees and applicants for employment. We prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
Perceptyx’s policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training. All aspects of employment are decided on the basis of qualifications, knowledge, merit, and business needs.
Experience Requirements: Mid Level