Site Reliability Specialist

Location: Bengaluru, India
Posted: 03-July-2022

About Us

Morgan Stanley is a leading global financial services firm providing a wide range of investment banking, securities, wealth management and investment management services. With offices in more than 41 countries, the Firm's employees serve clients worldwide, including corporations, governments, institutions and individuals. For further information about Morgan Stanley, please visit


Enterprise Technology & Services (ETS) delivers shared technology services for Morgan Stanley supporting all business applications and end users. ETS provides capabilities for all stages of Morgan Stanley's software development lifecycle, enabling productive coding, functional and integration testing, application releases, and ongoing monitoring and support for over 3,000 production applications. ETS also delivers all workplace technologies (desktop, mobile, voice, video, productivity, intranet/internet) in integrated configurations that boost the personal productivity of employees. Application and end user functions are delivered on a scalable, secure, and reliable infrastructure composed of seamlessly integrated datacenter, network, compute, cloud, storage, and database functions.

The Application Infrastructure (AI) SRE Ops & Support department is seeking a Site Reliability Engineer to drive the reliability engineering, operations and customer support services for Morgan Stanley's suite of IT Service Management (ITSM) products. AI SRE & Ops Support is a cornerstone of the Application Infrastructure organization in Morgan Stanley's Technology Division.

This position specializes in IT Service Management (ITSM) products, including ServiceNow and proprietary Change Management tooling, as well as the Morgan Stanley IPSoft Amelia Chatbot platform. The Morgan Stanley implementation of ServiceNow is a vendor-hosted solution involving a number of internally hosted technologies (SQL databases, ServiceNow API, Unix, AFS, SSL, SAML and other Web Infrastructure, etc.) that extend the functionality of base applications (Incident, Problem, Change, Service Requests, etc.). This team is also responsible for a suite of change management products and tools used to ensure changes are properly documented and detected, to provide audit trails of changes, and to ensure changes are properly authorized before being deployed. These change management tools are critical for protecting the firm and ensuring it meets regulatory requirements. The team is also in charge of managing the firm's suite of IPSoft Amelia chatbots, deployed to Docker using Ansible and other tooling, as well as overseeing the chatbot platform's observability using the firm's strategic observability stack (Prometheus, Grafana, etc.).

The Site Reliability Engineering (SRE) team drives the reliability, recoverability and operational efficiency of this product portfolio. Reporting to the Global SRE Lead, the role's key features include implementing advanced observability, troubleshooting complex systems, task automation, and technical debt management.

Members of the SRE team align to an Agile squad and are expected to work closely with the ITSM user community on day-to-day usage of the products, as well as with our internal development and engineering squads and the offshore support team that provides first-line support.

Candidates will have the technical skills required to support these products on a Linux platform. Prior task automation experience in at least one programming language is expected, along with the familiarity with typical software development lifecycle toolchains that comes with it. Hands-on experience with at least one pillar of observability is required, ideally including experience in defining system monitoring rather than just reacting to alerts. Cloud experience is not necessary, as training can be provided, but it would be an advantage.


Responsibilities include:

Building and maintaining knowledge front to back of Application Infrastructure's IT Service Management products, and then specializing in one or two of them
Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with dev-side peers
Reducing the cost of support (hours of effort) by eliminating operational issues, optimizing and automating tasks, developing operational tools and driving client self-service to minimize constraints
Identifying and prioritizing technical debt that is impacting client developer productivity, reliability or the efficiency of the ops team
Complex troubleshooting in a Linux environment
Consulting with clients (the Firm's internal development community, IT service practitioners) to maximize their productivity, including troubleshooting the issues they have using the department's products
Minimizing the escalation rate to the dev-side product delivery team members to ensure the department has the greatest possible flow of feature delivery
Being operationally responsive, including sharing the on-call rotation with the rest of the global team (with a time-off-in-lieu system)

Required Qualifications / Skills

- Strong Linux troubleshooting skills
- Task automation experience in any programming language
- Practical experience of at least one pillar of observability (metrics, logs or traces)
- Working knowledge in at least ONE of the following areas:
  - REST services (API)
  - Load balancing and networking
  - Performance troubleshooting and resolution
- Confident collaboration skills

Desired Skills

- Python development for task automation
- Experience with site reliability engineering practices, such as service level objectives (SLOs), error budgets, blameless postmortems and toil reduction
- Prior experience creating operational dashboards (Splunk, Grafana, etc.)
- Experience administering and/or supporting ServiceNow
