
HCL America Inc [1590204BR]
Job Description:
Good knowledge of, and hands-on experience with, Linux operating systems and their management
Good knowledge of GPU management on Linux and in containers
Good knowledge of Docker and Kubernetes platforms such as Rancher RKE and Rancher Management Server (key skill and key advantage)
Good knowledge and understanding of software-defined storage and networking
Good knowledge of monitoring tools such as Prometheus and Grafana
Knowledge of large language model (LLM) deployment and monitoring
Knowledge of inference engines such as vLLM
RKE2 Cluster Maintenance and Administration Tasks
Regular System Updates and Upgrades
Kubernetes version upgrades (planning, testing, execution, rollback if needed)
RKE2 specific updates and patches
Host OS patching and upgrades on all 10 nodes
Container runtime (containerd) updates
GPU driver and NVIDIA GPU Operator updates
Monitoring software stack updates (Prometheus, Grafana, etc.)
CNI plugin updates
Security patches and CVE remediations
High Availability and Disaster Recovery
Regular testing of etcd backups and restoration procedures
Velero backup validation and test restores
Documentation and execution of DR procedures
Node replacement procedures when hardware fails
Cluster recovery exercises and documentation
Monitoring and Alerting
Implementing Prometheus rules for our brainstormed alert list
Alert tuning and reduction of false positives
Creating and maintaining runbooks for each alert type
24/7 alert response (if required)
On-call rotation management
Creation of monitoring dashboards for different stakeholders
Performance Management
Regular cluster performance assessments
Resource utilization optimization
GPU utilization monitoring and optimization
Ceph storage performance monitoring and tuning
Database performance monitoring and tuning
Security Management
Regular security audits
Network policy reviews and updates
RBAC configuration maintenance
Certificate rotation and management
Security scanning of container images
Kyverno policy maintenance and updates
Audit log reviews
TLS certificate management and rotation
User Support and Access Management
Managing access for approximately 30 users (20 US-based, 10 offshore)
Namespace management for proper separation of US and global teams
User onboarding and offboarding procedures
Training and documentation for users
Troubleshooting application deployment issues
Capacity Planning and Resource Management
Regular assessment of resource utilization trends
Planning for additional capacity needs
Resource quota management and adjustments
Namespace resource limit management
Storage capacity planning and management
Documentation and Knowledge Transfer
Maintaining up-to-date cluster documentation
Creating and updating runbooks for common procedures
Knowledge transfer sessions with internal team members
Regular status reporting and metrics
MLOps Infrastructure Management
ClearML Enterprise (or alternative MLOps tool) maintenance and updates
ML model deployment pipeline maintenance
GPU resource allocation optimization
ML experiment tracking infrastructure maintenance
Logging and Observability
FluentBit configuration maintenance
Splunk integration maintenance
OpenTelemetry and Jaeger maintenance
Log retention policy implementation
Multi-tenancy and Compliance
Ensuring continued separation between US and global teams
Regular audits of ITAR, EAR, DFARS, NIST 800-171 compliance
Enforcement of data access controls
Periodic validation of namespace isolation
Incident Management
Root cause analysis for production incidents
Post-incident reviews and improvement plans
Incident documentation and knowledge sharing
SLA monitoring and reporting
Database Infrastructure Management (optional; may fall to the application developers)
Management of Neo4j, Weaviate, Chroma, and Milvus instances
Database backup procedures
Database performance tuning
High-availability configuration for critical databases