Location: Montreal, Quebec, Canada
Category: Engineering
Salary: 85,000 - 120,000 CAD per year
Full-time
As a Senior Site Reliability Engineer, you will lead the design, implementation, and maintenance of highly reliable, scalable, and efficient infrastructure and services.
Lead the design, deployment, and operation of large-scale, fault-tolerant systems to ensure high availability and performance.
Develop and implement automation to streamline deployment, monitoring, and incident response processes.
Monitor system health, analyze metrics, and proactively identify and resolve reliability, scalability, and performance issues.
Collaborate with software engineering teams to improve system design, deployment pipelines, and operational practices.
Manage incident response, conduct root cause analysis, and implement corrective actions to prevent recurrence.
Drive continuous improvement in infrastructure efficiency, reliability, and scalability through innovative solutions.
Document system architecture, operational procedures, and best practices to support knowledge sharing and operational consistency.
Mentor and provide technical leadership to junior SREs and cross-functional teams.
Participate in on-call rotations to ensure 24/7 system reliability and rapid incident resolution.
Engage with stakeholders to align SRE practices with business goals and technical strategies.
Extensive experience (typically 5+ years) in site reliability engineering, systems engineering, or a related role.
Strong proficiency with cloud platforms (AWS, Azure, Google Cloud) and with containerization and orchestration tools (Docker, Kubernetes).
Expertise in Linux system administration, networking, and security best practices.
Proficiency in programming and scripting languages used for automation, such as Python, Go, or Bash.
Experience with infrastructure as code (Terraform, Ansible, CloudFormation) and CI/CD pipelines.
Deep understanding of monitoring, logging, and alerting tools (Prometheus, Grafana, ELK stack).
Proven ability to design and maintain scalable, distributed systems and fault-tolerant architectures.
Strong problem-solving skills and ability to handle complex technical challenges independently.
Excellent communication skills and the ability to collaborate effectively across teams and with external vendors.
Familiarity with incident management frameworks, service-level objectives (SLOs), and service-level agreements (SLAs).