Site Reliability Engineer (Senior)
Join us as a Senior Site Reliability Engineer to help us run an industry-scale GPU cluster via Kubernetes. Together with senior members of our team, you will combine your strong understanding of system scaling and security practices with your cloud-native expertise to stand up and maintain Kubernetes clusters from scratch. Your role will also be pivotal in supporting our other service offerings, from full-stack development to AI integration, ensuring they are robust, scalable, and secure.
We need engineers on our team to be versatile, to display leadership qualities, and to be eager to take on new and interesting technology problems across the stack. As a senior member of the team, you will be relied upon to design robust solutions that solve client problems, drive consensus around technical solutions, and ultimately own the success of projects. In return, you can expect latitude in the way you choose to run projects and design systems, while receiving direct support, guidance, and coaching from Bit Complete’s management team.
What you'll be doing
- Develop and implement comprehensive infrastructure strategies that emphasize reliability, flexibility, and security.
- Manage and scale our cloud-native environments, including Kubernetes clusters and container orchestration.
- Oversee the deployment and maintenance of infrastructure tools.
- Lead initiatives on stateless architectures to enhance scalability and maintainability of our systems.
- Utilize your expertise in distributed systems using technologies like Kafka, Postgres, Redis, and Elasticsearch.
- Design and monitor CI/CD pipelines to streamline deployment processes using tools like Spinnaker.
- Implement and manage monitoring solutions using OpenTSDB, Prometheus, Grafana, and Envoy to ensure optimal performance and reliability.
- Provide leadership and direction to the infrastructure team, fostering a culture of continuous learning and improvement.
Your Background
- Strong experience coding in a modern programming language (JavaScript, Python, Go, or similar) is a must-have. We are all technology generalists by nature, and you will be expected to jump into application or service code at various stages.
- Relevant industry experience, specifically in Site Reliability Engineering or a similar role, with a proven track record in technical leadership and setting the direction for scalable systems.
- Strong background in managing and deploying infrastructure in cloud-native environments (AWS and GCP).
- Experience with container orchestration (Docker, Kubernetes), and infrastructure as code (Terraform, Pulumi).
- Experience with monitoring and logging tools, and a solid understanding of network metrics.
- Familiarity with Linux, and excellent problem-solving, debugging, and troubleshooting skills.
- Proficiency in system design and a solid understanding of distributed systems and of DevOps tools and practices, particularly in developing and maintaining CI/CD systems for fully automated deployment, testing, and monitoring of applications.
- Familiarity with MLOps practices, including automation and orchestration of machine learning models.
- Experience with database technologies and designing infrastructure to support both traditional and AI-driven applications.
- Excellent communication skills with the ability to engage and influence both technical and non-technical stakeholders.
About Us
At Bit Complete, we craft software solutions that make a difference, backed by tech veterans from YouTube, Slack and Thumbtack. With a team of 30 engineers, we tackle tough client challenges and run experiments through side projects.
We’re growing but staying true to our roots. Our focus is on creating a sustainable, profitable company that lets us do what we love while taking on projects that are challenging and interesting and that avoid harming the world. If you’re looking for work that you can genuinely care about, with a team that truly has your back, you’re in the right place. Learn more about our culture and how we see ourselves in the software services industry.
Benefits
- Work-life balance and the set-up to do your best work: We believe in work that fits into your life, not the other way around. Enjoy four weeks of paid vacation, flexible hours, a MacBook Pro, $75/month internet reimbursement, and a $500/year stipend for your home office setup.
- No VC strings attached: We're profitable, bootstrapped, and committed to sharing that success with our team. Expect generous profit-sharing bonuses tied to the company’s performance.
- Comprehensive group benefits: Including drugs, paramedical practitioners, dental, vision care, virtual health care, virtual mental health care, and travel insurance.
- Top-up for parental leave
Compensation
CAD $148,988 - $200,644 annually.
Our ranges include base salary and a conservative bonus target.
Interested?
We're excited about working with you, so get in touch! Send us your CV - or the link to it - to [email protected]
The world of work today is overflowing with systems, processes, tools, and assumptions that are flawed and that can push directly against our ability to express what is unique about each of us in the work we do every day. We believe people from diverse backgrounds, with different identities and experiences, make our company better. No matter your background, we'd love to hear from you! Alignment with our values is just as important as experience. Also, please let us know if there are ways we can make our interview process better for you - we're always happy to listen and accommodate where possible.