Senior Site Reliability Engineer

Job Posted 4/18/2025

MetaRouter

Denver, CO

United States

Category Information Technology

Job Description

Salary: $170,000 - $210,000

About Us

MetaRouter provides highly reliable and robust Customer Data Infrastructure via Software-as-a-Service and Self-Hosted deployment options. Our platform allows organizations to tailor their digital data collection and processing pipelines to their unique needs. MetaRouter is designed to improve how organizations unify real-time data collection and processing while maintaining control over data privacy and security. As a result, our customers gain deeper insights into their consumers, optimize their marketing and advertising operations, mitigate data compliance and security risks, and make data-driven decisions with confidence.

We believe organizations that harness first-party data build trust with their audiences by meeting their consumers where they are, with the specific products, services, and experiences they want, at the moment they need them most. Our purpose is to empower customers to take control of their data, unlocking differentiation, driving growth, and creating value for all stakeholders while meeting compliance regulations and respecting individual privacy rights.

About The Role

We are looking for a Senior Site Reliability Engineer (DevOps) with a minimum of 7-10 years of experience automating systems and advancing the ability to proactively monitor and resolve critical issues, and who is comfortable developing processes for a maturing SRE organization.

Core Responsibilities

Architect the creation, maintenance, and removal of cloud infrastructure that supports our applications and internal operations.
Manage deployment of our applications on cloud infrastructure.
Manage upgrades of infrastructure and a wide variety of intermediate software that supports our applications.
Set up and maintain dashboards, logs, metrics, and alerting mechanisms, with a focus on creating alerts that provide high signal and low noise.
Continuously improve observability by enhancing logging, metrics, and tracing systems to provide deeper insights into system performance, reduce time to resolution, and support proactive incident detection.
Lead the investigation and resolution of complex infrastructure and application issues, identifying root causes, driving systemic fixes, and mentoring others in effective troubleshooting practices.
Ensure that cloud infrastructure and our applications meet or exceed compliance requirements.
Establish and drive standards for infrastructure and process documentation, ensuring clarity, consistency, and long-term maintainability across teams and systems.
Drive best practices through code reviews by setting high standards for infrastructure, application, and service reliability, while mentoring engineers and influencing architecture and deployment patterns across teams.
Work with customers to determine and implement custom infrastructure requirements in a way that balances flexibility with repeatable, scalable patterns.
Lead design and architectural decisions for infrastructure and applications, driving improvements in automation, performance, reliability, and security at scale.
Provide technical leadership and mentorship to SRE team members, fostering a culture of growth, ownership, and continuous learning.
Partner cross-functionally with platform engineering and other stakeholders to define and deliver scalable infrastructure solutions for internal and customer-facing systems.
Apply business and technical acumen to prioritize and guide engineering efforts that maximize impact in resource-constrained environments.
Champion a culture of continuous improvement by identifying and implementing strategic process, system, and collaboration enhancements across teams.
Proactively identify and address technical and procedural risks before they escalate, exercising sound judgment and autonomy to drive long-term resilience and operational excellence.
Lead by example in the on-call rotation, setting standards for incident response, postmortems, and systemic resiliency improvements.
Design and implement scalable playbooks and alerting systems that reduce Mean Time To Repair (MTTR) by enabling rapid, consistent, and effective incident response.

Qualifications and Experience

8+ years of experience in SRE or DevOps roles, with a strong track record of owning and scaling infrastructure on at least one major cloud provider (preferably GCP).
Deep expertise in configuring, maintaining, and troubleshooting Kubernetes clusters in production environments, including cluster architecture, security, and performance tuning.
Advanced proficiency with infrastructure and automation tools such as Bash, CI/CD pipelines, Docker, Git, Helm, Prometheus, Terraform, and YAML, with the ability to evaluate and implement tooling at scale.
Demonstrated experience architecting and managing identity and access management (IAM) and single sign-on (SSO) across complex, multi-platform environments.
Operational expertise with observability platforms such as New Relic (including NRQL), using telemetry to guide performance optimization, reliability improvements, and incident response strategies.
Familiarity with the operational aspects of modern application stacks, including Go and React/Node.js, with the ability to collaborate effectively across application and infrastructure domains.
Strong understanding of agile methodologies, with experience leading infrastructure initiatives within iterative development cycles.
Proven ability to prioritize and execute across a diverse set of responsibilities in a fast-paced, evolving environment, balancing tactical needs with long-term technical strategy.

Employment Details

Job Type: Full Time

Location: Fully Remote

Benefits

Health/Dental/Vision Insurance
401(k)
Unlimited Vacation Policy
Fully Remote (US)
