codeforce

Job description

Good knowledge of, and hands-on experience with, Linux operating systems and their administration
Good knowledge of GPU management on Linux and in containers
Good knowledge of Docker and Kubernetes platforms such as Rancher (RKE and Rancher Management Server); this is a key skill and a key advantage
Good knowledge and understanding of software-defined storage and networking
Good knowledge of monitoring tools such as Prometheus and Grafana
Knowledge of Large Language Model (LLM) deployment and monitoring
Knowledge of inference engines such as vLLM

RKE2 Cluster Maintenance and Administration Tasks

Regular System Updates and Upgrades

Kubernetes version upgrades (planning, testing, execution, rollback if needed)
RKE2 specific updates and patches
Host OS patching and upgrades on all 10 nodes
Container runtime (containerd) updates
GPU driver and NVIDIA GPU Operator updates
Monitoring software stack updates (Prometheus, Grafana, etc.)
CNI plugin updates
Security patches and CVE remediations

High Availability and Disaster Recovery

Regular testing of etcd backups and restoration procedures
Velero backup validation and test restores
Documentation and execution of DR procedures
Node replacement procedures when hardware fails
Cluster recovery exercises and documentation

Monitoring and Alerting

Implementing Prometheus alerting rules for the agreed alert list
Alert tuning and reduction of false positives
Creating and maintaining runbooks for each alert type
24/7 alert response (if required)
On-call rotation management
Creation of monitoring dashboards for different stakeholders

Performance Management

Regular cluster performance assessments
Resource utilization optimization
GPU utilization monitoring and optimization
Ceph storage performance monitoring and tuning
Database performance monitoring and tuning

Security Management

Regular security audits
Network policy reviews and updates
RBAC configuration maintenance
Certificate rotation and management
Security scanning of container images
Kyverno policy maintenance and updates
Audit log reviews
TLS certificate management and rotation

User Support and Access Management

Managing access for approximately 30 users (20 US-based, 10 offshore)
Namespace management for proper separation of US and global teams
User onboarding and offboarding procedures
Training and documentation for users
Troubleshooting application deployment issues

Capacity Planning and Resource Management

Regular assessment of resource utilization trends
Planning for additional capacity needs
Resource quota management and adjustments
Namespace resource limit management
Storage capacity planning and management

Documentation and Knowledge Transfer

Maintaining up-to-date cluster documentation
Creating and updating runbooks for common procedures
Knowledge transfer sessions with internal team members
Regular status reporting and metrics

MLOps Infrastructure Management

ClearML Enterprise (or alternative MLOps tool) maintenance and updates
ML model deployment pipeline maintenance
GPU resource allocation optimization
ML experiment tracking infrastructure maintenance

Logging and Observability

Fluent Bit configuration maintenance
Splunk integration maintenance
OpenTelemetry and Jaeger maintenance
Log retention policy implementation

Multi-tenancy and Compliance

Ensuring continued separation between US and global teams
Regular audits of ITAR, EAR, DFARS, NIST 800-171 compliance
Enforcement of data access controls
Periodic validation of namespace isolation

Incident Management

Root cause analysis for production incidents
Post-incident reviews and improvement plans
Incident documentation and knowledge sharing
SLA monitoring and reporting

Database Infrastructure Management (optional; may fall to application developers)

Management of Neo4j, Weaviate, Chroma, and Milvus instances
Database backup procedures
Database performance tuning
High-availability configuration for critical databases

Redhatlinux,ansible,redhatclus..
