
Staff SRE Specialist
- Hybrid
- Montréal, Quebec, Canada
- Quebec, Quebec, Canada
- Infrastructure and IT
Job description
Petal is a leading Canadian healthcare orchestration and billing company. We revolutionize healthcare systems, making them agile, efficient, and resilient, and we help forecast and shape world-class healthcare through Healthcare BI, advanced analytics, and informed insights.
Our commitment to fostering an exceptional workplace culture has earned us notable recognitions, including being listed as a Great Place to Work in both the technology and healthcare sectors. Join us in our mission to empower healthcare innovators and improve healthcare differently.
What you can expect when joining the team
As a Staff SRE Specialist, you will play a crucial role in ensuring the reliability, performance, and scalability of our services. You will be responsible for improving and maintaining the resilience of our infrastructure through automation, monitoring, and incident management. In this role, you will lead the charge in bridging the gap between software development and operations to ensure efficient delivery and high availability of our applications.
Your daily life
In your day-to-day work, you will:
Define and drive SRE best practices (SLIs/SLOs, error budgets, blameless post-mortems) to ensure availability, reliability, and scalability of critical systems, working closely with the Principal Developer on technical vision and architecture;
Establish and maintain reliability metrics (SLIs, SLOs, recovery time) and design robust monitoring, alerting, and observability systems while optimizing infrastructure costs;
Eliminate toil through automation of critical operations such as incident response, auto-scaling, and CI/CD pipelines to improve service reliability and team productivity;
Architect technical implementation plans and support their delivery, establishing partnerships with teams to meet reliability, performance, and security standards;
Contribute to internal tooling and platform development (deployment tools, dashboards, monitoring frameworks) to improve operational efficiency and developer experience while maintaining security standards in systems and processes;
Lead resilience improvement efforts including capacity planning, disaster recovery, and system optimization through load balancing, failovers, and other high availability strategies;
Manage critical incident response by minimizing MTTR (Mean Time To Recovery), including intervention, resolution, and post-incident analysis with documentation and recommendations to prevent recurrence;
Proactively identify optimization opportunities for system performance and cost-effectiveness in cloud environments while contributing to strategic infrastructure planning;
Provide 24/7 production support through on-call rotation, maintain system availability, and manage internal communications during major incidents;
Mentor and coach team members and contribute to in-depth technical analyses to address strategic business needs.
Job requirements
Your profile
Are you a proactive technical leader with deep expertise in site reliability? Are you passionate about building resilient and high-performing systems and guiding teams toward excellence? The sky is the limit if you have:
A college diploma (DEC) or bachelor's degree in computer science or a related field;
More than 10 years of relevant professional experience, with at least 5 years focused on SRE or similar roles;
Deep knowledge of cloud infrastructure (AWS, GCP, or Azure), system architecture, orchestration tools, and automation frameworks;
Advanced knowledge of SRE tools and practices (monitoring, alerting, incident response, capacity planning), including proficiency with tools such as Prometheus, Grafana, Kubernetes, and Terraform;
Strong experience with infrastructure automation tools, scripting (Python, Go, or Bash), and CI/CD pipelines;
Proven ability to guide cross-functional teams, mentor junior engineers, and lead reliability initiatives that align with business objectives;
Strong problem-solving and analytical skills with the ability to handle complex technical issues;
Excellent verbal and written communication skills, with the ability to document and explain complex concepts to both technical and non-technical stakeholders;
Proficiency in English and French is preferred, as you will work with diverse teams and stakeholders.
Petal's position on remote working
In our opinion, a company cannot claim to be modern and innovative, or to have its team's well-being at heart, without attempting to integrate remote work to the extent that its business model allows. Petal employees continue to benefit from the option of teleworking, up to the maximum flexibility permitted by the nature of the position and the smooth running of operations.
Our benefits
A signing bonus of $1,000 for your remote work set-up;
Compensation that recognizes your contribution;
4 to 6 weeks of paid vacation per year;
5 paid personal days per year;
A group RRSP / DPSP plan with employer contribution;
A complete group insurance plan, from day 1;
An annual wellness allowance;
Access to the Lumino Health™ telehealth application;
Flexible work hours and more.
Petal is an active participant in the equal opportunity employment program, and members of the following target groups are encouraged to apply: women, people with disabilities, Aboriginal peoples, and visible minorities. If you are a person with a disability, assistance with the screening and selection process is available on request.
A quick important note: We've noticed that some external websites are posting our job openings under incorrect job titles. To find our real opportunities and join our team, please make sure to apply through our official careers page or our trusted partners. We can't wait to hear from you!