Capgemini
This role is focused on AI Gateway reliability, Kubernetes platforms, and large-scale LLM integrations.
Role Snapshot
• Role: Senior Site Reliability Engineer
• Type: Contract to Hire
• Location: Bellevue, WA (US)
• Start Date: Feb 10, 2026
AI Platform & Gateway Engineering
• Design, deploy, and operate enterprise AI Gateway infrastructure supporting OpenAI and internal LLM-based services.
• Implement and manage regional routing (east/west), failover strategies, and upstream host configurations for AI traffic.
• Develop and maintain Helm charts, Kubernetes manifests, and Jinja templates for multi-environment deployments (dev, plab, qlab).
• Enable per-API configuration for rate limiting, AI feature toggles, security credentials, and regional host overrides.
• Stay current with industry best practices for:
  • AI Gateways and MCP servers
  • Secure LLM consumption patterns
  • Token handling, secrets management, and request isolation
  • Observability standards for AI platforms
Vendor & Stakeholder Management
• Lead bi-weekly technical and operational syncs with AI Gateway vendors.
• Translate vendor capabilities, limitations, and roadmaps into actionable platform strategies.
• Communicate clearly in both technical and business terms with engineering teams, SRE, security and compliance, and product and leadership stakeholders.
Reliability, Observability & Operations
• Build and maintain monitoring and troubleshooting frameworks for AI workloads using Splunk and Grafana.
• Author and evolve SRE support cookbooks for proactive monitoring, incident response, and escalation.
• Analyze failure rates, latency spikes, and request flows across distributed AI systems.
• Support on-call readiness through actionable dashboards, alerts, and operational runbooks.
CI/CD & Automation
• Build CI pipelines to generate and deploy environment-specific configurations at scale.
• Automate service registration, deployment validation, and environment promotion.
• Enforce consistent naming, versioning, and deployment standards across clusters and environments.
Cross-Functional Collaboration
• Act as a technical bridge between application teams, SRE, security, and platform engineering.
• Provide architectural guidance for teams onboarding to AI Gateway and Enterprise GPT platforms.
• Contribute to platform roadmaps, technical design reviews, and operational readiness planning.
Required Qualifications
• Strong experience with Kubernetes, Helm, and cloud-native networking.
• Hands-on experience with Istio / service mesh, routing rules, and traffic management.
• Proficiency in Python, Bash, and Jinja templating for infrastructure automation.
• Experience operating production-grade APIs with high reliability and observability standards.
• Deep understanding of SRE principles, monitoring, alerting, and incident management.
• Experience building observability frameworks using Splunk, Grafana, or similar tools.
• Strong ability to communicate complex technical issues in clear business terms.
• Experience working with AI/LLM APIs (OpenAI or similar) in an enterprise context.
Preferred Qualifications
• Knowledge of MCP servers, AI gateway patterns, and LLM security models.
• Familiarity with security controls for AI platforms (secrets management, token handling, access controls).
• Experience supporting multi-region, multi-environment deployments at scale.
• Strong documentation skills with a focus on operational clarity and enablement.