
Senior Software Engineer

Location Hyderabad, India
Posted 28-October-2021
Description
Your Role and Responsibilities

As a Unix Administrator, you will be part of a global SRE team that delivers a high-quality SLA while operating a global solution running in multiple regions and projects. You will run the production environment by monitoring availability and taking a holistic view of system health, serve as a subject matter expert for service operations and as the second level of escalation for severity incidents, and build software and systems to manage platform infrastructure and applications.

Responsibilities

Improve reliability, quality, and time-to-market of our suite of software solutions
Define, monitor, and manage error budgets, along with Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs).
Measure and optimise system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
Provide primary operational support and engineering for multiple large, distributed software applications
Design and implement innovations that improve service reliability, infrastructure resiliency and security, and availability.
Develop and deploy tools, frameworks, and solutions for monitoring, alerting, scaling, automating, and managing our IT infrastructure and services to ensure high availability, low latency, and overall system health.

Required Technical and Professional Expertise

Understanding of the infrastructure project lifecycle (design, implementation, transition, and operation). In-depth understanding of public cloud (AWS/GCP/Azure) architecture and the UNIX operating system (preferably Red Hat/SUSE Linux). Troubleshooting experience, with the ability to analyze technical problems and prevent their recurrence.
Maintenance of our production infrastructure, tools and services.
Identify and fill gaps in the monitoring & alerting system
Gather and analyse metrics from operating systems, subsystems & cloud native technologies to assist in performance tuning, scale-up/scale-down, fault finding & operational excellence.
Participate in solution and platform design discussions with cross-functional teams to provide infrastructure insight, manage technology and business trade-offs, and ensure the infrastructure meets performance and capacity requirements
Handle incident response, troubleshooting, and postmortems; author incident reports in coordination with multiple engineering teams; and perform root cause analysis (RCA) for issues spanning code, network, database, and system components.
Design and automate emergency recovery procedures and other tool sets to reduce manual work.
Discover, implement, and automate all measures necessary to maintain SLAs and uptime and to drive down the burden of toil.
Design and implement innovations (systems, approaches, processes, tools) that improve service reliability, fault-tolerance, scalability, infrastructure resiliency, availability & security
Partner with development teams to improve services through rigorous testing and release procedures
Participate in system design consulting, platform management, and capacity planning
Create sustainable systems and services through automation and uplifts
Balance feature development speed and reliability with well-defined service level objectives

Preferred Technical and Professional Experience

Bachelor's degree in computer science or another highly technical or scientific discipline
Extensive hands-on knowledge of Unix operating systems (preferably Red Hat, SUSE Linux, AIX, Solaris, HP-UX)
Extensive hands-on knowledge of public cloud (AWS/GCP/IBM) and experience building infrastructure-level cloud applications on AWS/GCP
Experience with Kubernetes, OpenShift, Rancher, and related CNCF technologies and frameworks
Fluent in scripting and in at least one automation tool or programming language, e.g. Chef, Puppet, Ansible, Golang, or Python
Good understanding of logging, monitoring, and observability tools such as Prometheus, ELK/EFK, Graylog, Datadog, and Istio
Exceptional ability and deep interest in learning new technologies in this area and solving challenging problems, with ample energy to evangelise and implement appropriate solutions across global teams
Keen interest in growing and mentoring fellow team members.
Good understanding of large-scale data systems and data pipelines, including managing NoSQL, SQL, and HDFS/Hadoop clusters
Experience with distributed storage technologies like NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Previous success in technical engineering
Coding experience beyond simple scripts
Proven ability to interact with clients and deliver results, taking ideas to production

Required Education

Bachelor's Degree

Preferred Education

Master's Degree

 