Sr Site Reliability Engineer – Software Development
Location: Bellevue, WA, United States
Employment Type: Contract to Hire
We are seeking an experienced Senior Site Reliability Engineer to join our software development team on a contract-to-hire basis. In this critical role, you will architect, deploy, and operate robust AI Gateway infrastructure that supports OpenAI and internal large language model (LLM) services. Your expertise in Kubernetes, Helm, cloud-native networking, and service mesh technologies will drive reliability, scalability, and observability across multi-region and multi-environment deployments.
Key Responsibilities:
- Design, deploy, and manage enterprise AI Gateway infrastructure, including regional routing, failover strategies, and upstream host configurations, to optimize AI traffic management.
- Develop and maintain Kubernetes manifests, Helm charts, and templated configurations (Jinja) for multiple environments, such as development, pre-lab, and quality lab.
- Implement customizable per-API configurations, including rate limiting, security credentials, feature toggles, and regional host overrides.
- Collaborate with AI Gateway vendors, translating their solutions into strategic platform implementations.
- Build and maintain comprehensive monitoring and troubleshooting frameworks using Splunk, Grafana, and other observability tools.
- Produce and update detailed SRE support documentation covering incident response, alerting, escalation procedures, and proactive monitoring best practices.
- Analyze system telemetry data, including failure patterns, latency anomalies, and request flows, to continuously improve platform resilience.
- Automate CI/CD pipelines for environment-specific configuration generation, validation, and deployment with strict governance on naming, versioning, and cluster consistency.
- Serve as a technical liaison across application teams, security, platform engineering, and SRE groups to align on roadmap planning and architectural standards.
Required Qualifications:
- Extensive hands-on experience with Kubernetes, Helm, and cloud-native networking.
- Proficiency with Istio and other service mesh technologies, routing configurations, and AI traffic management best practices.
- Strong scripting abilities in Python and Bash, with expertise in Jinja templating for infrastructure automation.
- Demonstrated success operating production-grade APIs with high reliability and observability standards.
- Solid understanding of Site Reliability Engineering (SRE) principles, including monitoring, alerting, and incident management workflows.
- Experience with observability platforms such as Splunk and Grafana.
- Excellent written and verbal communication skills, with the ability to convey complex technical information to diverse stakeholders.
- Familiarity with AI/LLM APIs (e.g., OpenAI) and enterprise-scale deployments.
Preferred Qualifications:
- Knowledge of Model Context Protocol (MCP) server architecture and AI gateway design.
- Experience with security practices related to AI platforms, such as secrets management and token handling.
- Proven track record supporting large-scale, multi-region, and multi-environment infrastructure deployments.
- Strong documentation skills emphasizing clarity, operational enablement, and knowledge transfer.
Join us to play a pivotal role in building and maintaining the backbone infrastructure for enterprise AI platforms and large language model integrations.