Job Details
Main Responsibilities
- First line of contact
- Troubleshooting and resolution of problems
- Respond to availability incidents and provide support for service engineers
- Make monitoring and alerting fire on symptoms, and take appropriate action
- Document every action so your findings turn into repeatable actions and then into automation

Required skills
- A Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- Strong experience with large-scale infrastructure, distributed systems, and application performance
- 4+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL) or CentOS, whether hosted at a cloud provider such as Amazon Web Services (AWS), Google Kubernetes Engine (GKE), or Microsoft Azure (AKS), or on on-premises Kubernetes
- 2+ years of experience with enterprise storage systems or distributed systems; knowledge of Ceph Storage is a plus
- 3+ years of experience in an Operations/SRE role
- 3+ years of experience within a fast-paced site reliability function
- 3+ years of experience working in a continuous delivery environment with a proven track record
- 3+ years of experience with GitLab, Jenkins, CI/CD, and infrastructure as code
- 3+ years of experience with software design principles
- Experience with Python, shell scripting, or Go is a plus
- Logging experience using Graylog and the ELK stack
- Implementing highly available systems and disaster recovery
- Automation using Puppet, Ansible, Chef, Terraform, Helm, and Cobbler
- Core networking: Consul, DNS, load balancing, MetalLB, Weave, Cilium, Ingress Controller, Istio, Nginx, HAProxy
- Foreman/Katello, Red Hat Satellite
- FreeIPA, BIND, DNS
- Security: 2+ years of experience in container security
- Experience integrating different systems through their available APIs
- Extensive experience using monitoring tools such as Prometheus, Grafana, Nagios, and Zabbix
- Strong troubleshooting skills