We are looking for a site reliability engineer: an IT expert who uses automation tools to monitor and observe software reliability in the production environment. Candidates should also have experience finding problems in software and writing code to fix them. They are typically former system administrators or operations engineers with good coding skills. The following are some of the site reliability responsibilities we are looking for:
Emergency incident response
Change management
IT infrastructure management
In addition, we are looking for the following:
Management of EC2 Instances:
Understanding and working with ASGs
Load Balancers and target groups
CloudWatch
Monitoring of web applications:
Kubernetes
Advanced understanding of containers
Understanding of k8s concepts and deployment patterns
Good to Have:
Infra:
Terraform
OpenSearch / Kibana
EC2 management
Baking AMIs
Launch Templates
Messaging:
Kafka
ActiveMQ
Database:
MongoDB
Generalist Programmer:
Java
Experience with heap analysis
GC models in Java 8
Understanding of Concurrency in Java: Threading and Connection Pooling
PHP:
Some understanding of configuring Apache with Apache modules