Dawn InfoTek Inc. is a professional IT consulting team that partners with major financial institutions, investment firms and government sectors. We are dedicated to delivering cutting-edge consulting services and recruiting for IT positions at all levels for our clients.
We are currently seeking competent individuals for the role of Manager, SRE & Service Delivery, to join our dynamic team for our client, one of the major banks.
Contract to Hire: 4-6 months (possibility of conversion to FTE after completion of the contract)
Location: 2-3 days on site, Downtown Toronto
This role requires deep technical expertise combined with team leadership skills to drive issue resolution, implement best practices, and optimize platform reliability.
The ideal candidate will have hands-on experience troubleshooting complex software and infrastructure issues and will serve as the primary escalation point for critical incidents affecting the banking applications. They will work closely with engineering, infrastructure, and security teams to ensure high availability, performance, and security of the systems.
Key Responsibilities
Technical Leadership & Incident Management
- Act as the final technical escalation point for on-call teams, assisting with diagnosing and resolving complex software, infrastructure, and performance issues.
- Lead major incident response efforts, ensuring quick resolution and root cause analysis (RCA).
- Work closely with developers, cloud engineers, and platform teams to troubleshoot issues across full-stack environments (frontend, backend, infrastructure).
- Maintain high availability and performance of Digital Banking applications running on various technologies including WebSphere, AWS, OpenShift, and Red Hat VMs.
- Ensure log monitoring, observability, and proactive alerting (Dynatrace, OpenSearch, or similar tools).
SRE & Reliability Engineering
- Define and implement reliability, scalability, and availability best practices for the online banking platform.
- Improve CI/CD pipelines, release engineering, and automated deployments to enhance system stability.
- Drive postmortem analysis and continuous improvement efforts to prevent repeat incidents.
- Optimize scalability, redundancy, and high availability of banking applications.
Infrastructure & Patching
- Oversee infrastructure patching and maintenance for Red Hat VMs, OpenShift containers, WebSphere, and AWS resources.
- Ensure zero-downtime patching strategies and automated updates to reduce operational risk.
- Collaborate with security teams to enforce compliance, harden infrastructure, and remediate vulnerabilities.
Team Leadership & Process Improvement
- Lead a high-performing SRE and Service Delivery team, fostering a culture of ownership and reliability.
- Establish and enforce best practices for incident management, operational playbooks, and documentation.
- Collaborate with development and infrastructure teams to enhance observability, performance monitoring, and proactive issue detection.
- Partner with product teams to ensure seamless deployments and reduce operational burden (toil).
Required Skills & Qualifications
Technical Expertise
- Strong hands-on troubleshooting skills across frontend, backend, and infrastructure.
- Experience managing applications built with and/or running on Angular, React, Java J2EE, Oracle, WebSphere, AWS, and OpenShift.
- Understanding of Linux administration, containerized applications (Docker, OpenShift), and cloud environments (AWS, Azure, or GCP).
- Proficiency in reliability engineering, monitoring, and automation best practices.
- Familiarity with infrastructure as code (Terraform, Ansible, or similar tools).
- Experience implementing CI/CD pipelines (Jenkins, GitHub Actions, or similar).
- Knowledge of log monitoring & observability tools (Splunk, Dynatrace, OpenSearch, Prometheus, Grafana, etc.).
Leadership & Management Skills
- Proven ability to lead technical teams, mentor engineers, and drive operational excellence.
- Strong problem-solving skills with the ability to resolve complex production incidents quickly.
- Experience implementing incident management processes and postmortem analysis.
- Strong ability to collaborate with cross-functional teams (engineering, security, infrastructure).
- Excellent communication skills, with the ability to explain technical concepts to leadership and stakeholders.
Preferred Qualifications
- Experience in banking, fintech, or highly regulated industries is a plus.
- Familiarity with modern reliability engineering frameworks and methodologies.
- Certifications such as AWS Certified Solutions Architect, Red Hat Certified Engineer (RHCE), Kubernetes (CKA), or ITIL are advantageous.
We thank all applicants for their interest and referrals. However, only qualified candidates selected for an interview will be contacted.