Case Study - Standard Chartered Bank
From Pilot to Scale:
The Successful SRE Journey at a Large Financial Institution
A Case Study on Standard Chartered Bank
By Eveline Oehrlich, Chief Research & Content Officer, DevOps Institute,
with Karen Skiles, DevOps Institute
Standard Chartered Bank (SCB) offers banking services to individuals and companies and today has more than 85,000 employees and a presence in 59 markets serving customers in 150 countries worldwide. The company is listed on the London and Hong Kong Stock Exchanges. The company headquarters is in London. In 1969 the merger of Standard Bank and Chartered Bank resulted in today’s Standard Chartered Bank with a network across core emerging markets in Asia, Africa, the Middle East, and beyond.
The Technology and Innovation organization is spread across multiple geographic locations and supports a variety of divisions in building the bank of the future.
Technology & Innovation at SCB
Standard Chartered Bank's Technology & Innovation (T&I) organization is key to the success of building a bank of the future. T&I is responsible for systems development and technology infrastructure which underpin the Bank’s client services and defines and implements the Bank’s digital and innovation agenda.
Functions T&I, one of the domains within the overall T&I Organization, supports Group Functions such as Finance, Liquidity, Risk, Legal, Compliance, Treasury, Audit, Human Resources, Digital Foundations Technology, Corporate Affairs, Brand, Marketing, and Property businesses.
Changing How Work Gets Done
Functions T&I's goal is to innovate while providing operational excellence.
Over the last two years, the Functions T&I team has put a heavy emphasis on improving the stability and reliability of the functions applications. This has included a focus on several areas including improving monitoring and the reduction of low priority incidents. Day-to-day management of those incidents added no value to the bank. To eliminate the occurrence of these incidents the Functions T&I team embarked on several initiatives, which planted the early seeds of Site Reliability Engineering. Here is what was initially done:
- Improve monitoring to reduce noise and toil. Their first step was to reduce low-level incidents by introducing better monitoring, which allowed them to augment, correlate, and prioritize incidents.
- Address incident and change management to keep up with the pace of change. As Functions T&I started to embark on a DevOps journey with the initial goals to automate software and infrastructure delivery, they realized that they needed to adapt the incident and change management processes further to support the new velocity demands.
- Complexity neither stopped nor scared them. The question was whether the team could leverage the practices and methodology of SRE to manage applications. Within the existing environment, the team had already established a highly sophisticated DevOps pipeline. The complexity of high-performance compute power, different infrastructures within the bank's own data centers and across a variety of vendor platforms, a combination of modern and somewhat aged infrastructure, and the scale of supporting over 200 applications did not deter the team from implementing a new way of working.
- The conversation of SRE started. Functions T&I decided to investigate whether bringing SRE into the domain was a way to fundamentally change the way they supported their systems and businesses.
**How SRE is Becoming the Primary Support Model at Standard Chartered Bank**
The bank has focused heavily in recent years on building and enhancing its engineering culture and capabilities. Adopting a set of modern support practices was essential to this evolution in developer effectiveness and to the technical transformation needed at the bank. The T&I leadership team agreed on the move to SRE as the primary support model.
Each IT organization takes a unique approach to attaining future-ready reliability and initiating SRE best practices, as it must update and align talent, processes, technology, and best practices with business strategy to improve its IT operating model. The journey is influenced by leaders and individuals. The need for change within Standard Chartered Bank was encouraged by the Group CIO, Michael Gorriz. His encouragement resulted in SRE being adopted as the primary support model within T&I. The different domains within T&I are at different maturity levels of the SRE journey and interpret SRE slightly differently to best fit their customer requirements. Within Functions T&I, this is how the journey progressed.
- Initiated a pilot: Functions T&I formed a small team comprised of five SRE evangelists and made a conscious decision to target five different types of applications, with varying degrees of challenges as their pilot SRE platforms. The core SRE team members (the evangelists) were moved into a central team which was created to enable 100% focus on delivering SRE.
- The pilot team members were trained on Core SRE principles. Each team member received SRE training and was then embedded into a specific application team.
- Selected a diverse range of applications as part of the pilot. Each SRE Evangelist was responsible for a single application. One key decision was that the applications should have different footprints and architectures. They ranged from modern applications to ones hosted on a vendor platform, run within the bank's own data center, or deployed on a SaaS platform in a cloud-hosted environment.
- Adopted a 'get things done' attitude. The team decided to see how much of the SRE best practices and core principles could be deployed. In their first few months, the team initially worked on defining a framework for evaluating the suitability of an application for onboarding to SRE. This was extremely useful in identifying candidate applications that they felt would gain the most benefit from adopting the SRE model. The framework also laid out a set of stages that supported the progressive onboarding of an application onto the SRE model.
- Communication and collaboration to continuously improve. Over the first few months, the Evangelists consulted and collaborated with stakeholders and external market experts in order to build knowledge and understand the market best practices, along with the potential pitfalls of deploying SRE.
- Overcome fear, uncertainty, and doubt (FUD). One challenge the Evangelists had to overcome was potential resistance driven by the fear that this change might be a cost-cutting exercise. At this point of the pilot, the team only wanted to prove a better way of working. There was no goal of cost or headcount reductions but rather a “see what we can achieve” attitude.
- Best of breed approach on monitoring continued. Monitoring the performance and other aspects of applications is a critical automation task for SRE teams. Rather than prescribing a set approach to each pilot application, they took the pragmatic approach of leveraging what was already in place and building on the existing capabilities rather than starting from scratch.
- Supportability and reliability as an entire team objective. As the pilot included the upskilling of Delivery, Engineering, and Support teams, the mindset of reliability and supportability spread among all of them, with the SRE Evangelists functioning as the bridge between teams. To ensure the right focus, shared supportability and reliability objectives were made part of both the development and support teams’ objectives.
How to Overcome the Law of Nature that Any Change is Resisted by Humans
Introducing and leveraging the SRE methodology is difficult, largely because it involves changing a running organization. It requires changes to augment existing processes, functions, and tools or replace processes, functions, and tools while keeping the organization running. The Functions T&I team approached the challenges and achieved acceptance of SRE by doing the following:
Initiated conversations, education, training, and coaching sessions for key stakeholders. One key step to achieving acceptance is to overcome the initial resistance across a variety of stakeholders. Within the bank, many team members had heard about the concept of SRE but did not fully understand it. The team held open and consistent discussions on the fundamental concepts and principles of Operability, Reliability, Observability, and Scalability that underpin SRE. It was also crucial to introduce and reinforce the components of SRE such as the concept of “toil,” the error budget, and service-level concepts such as SLA, SLO, and SLI. With an improved understanding of the key principles of SRE and how it would impact each stakeholder, they were able to accelerate the embracing of SRE.
Ensuring business value and acceptance. Reaching out to the business stakeholders was essential and one of the first steps initiated by the SRE Head and his team. Upon explaining the concept of the error budget, the business stakeholders embraced the idea of leveraging SRE as a practice, as the stability of their applications and platforms is crucial.
Mindset changes through continuous learning. SRE focuses on increasing the velocity of delivery, improving the collaboration between Development and Support teams, and having a steady focus on System Reliability. Previously the Development and Support teams worked largely independently, with interactions only when application changes were being made or when something went wrong. SRE required a significant mindset change, in which development and SRE teams worked much more closely with shared objectives. The changes came from focusing on continuous learning, which is today at the center of the team’s program.
“Continuous Learning was a major factor in successfully changing the mindset within our team. We put education at the heart of our program, and we have gone from ‘What is SRE?’, to now being asked, when SRE can be deployed on their applications.”
– Richard Hall, Global Head of Transformation, Resilience, and Architecture for Functions T&I
Exposing an opportunity without force made a big difference. Buy-in and support via brute force was never the plan. The path to success was instead to show and explain the opportunities: reduced backlogs, help in defining SLIs and SLOs, and the flexibility of the SRE teams to align with different deployment and execution practices across the application teams.
Creating momentum by aligning on goals and vision. Capability assessments are one way to understand the current maturity levels of how work is executed. The team conducted DevOps maturity assessments, explored incident management processes, change management statistics, and other topics across key applications to identify what opportunities were available to improve. This enabled them to quickly identify improvement areas and then to start introducing SRE practices to achieve rapid improvements in areas that needed focus or lacked maturity.
**The Past of Standard Chartered Bank is Different to the Future**
Digital transformation is on every banking institution’s radar, but the effort requires vast and fast transformation across people and culture, processes, and the technology ecosystem, which must be replaced or updated. One of the key success factors of Standard Chartered Bank's ongoing digital transformation is the deployment of SRE across the T&I domains.
The Great Achievements, Learnings, and Accomplishments Speak for Themselves
Since the implementation of the pilot program in 2019, the Functions T&I team has developed and accomplished a significant amount. Today, the Functions T&I SRE team act as consultants and guide application owners from across Functions T&I on their SRE journey.
Here is what they have learned and accomplished so far:
- Bridge the gap between developers and support. SRE inherently encourages a culture of DevOps. Within the bank, the role of an SRE has rewritten how the Bank provides support: SREs work hand in hand with the Development team to improve automation, instill best practices in supportability by design, put reliability at the center of everything they do, and provide communication that benefits the entire organization. The possibilities for SRE are endless. For example, SRE can expose areas for improvement in the release pipeline while also creating rules around the culture of on-call availability and incident response that encourage everyone to be more accountable and less reactive.
- Moved the team into a modern state of support. However large or small an organization is, individuals and teams are required to respond to application and technology infrastructure alerts. The team reduced low-priority incidents and other low-value tasks by accelerating automation. Together with their deep understanding of each system’s operations, this allowed a shift towards a more proactive and contextual support model.
- Established a Learning model to support the SRE adoption. While the Bank had an extensive Technology Learning program, with the help of DevOps Institute the SRE team designed and implemented the Functions T&I SRE Academy. This was key not only to driving the change in mindset but also to ensuring the SRE teams, Delivery teams, and business had the required knowledge to successfully implement SRE in their applications. The Functions T&I SRE Academy provides a blend of internal modules along with DevOps SRE certifications provided by DevOps Institute. The learning programs are complemented by internal coaching and detailed SRE implementation application reviews over a 100-day period. On successfully completing the requirements of the SRE Academy program, participants graduate as SRE Practitioners.
- An ongoing effort to transition towards a learning organization. The Functions T&I Academy's upskilling efforts have been successful and are now being incorporated and expanded into the Standard Chartered Bank aXess Academy, the Learning platform managed by the T&I organization. Including the SRE Academy program in the aXess Academy portfolio of learning modules will allow a broader deployment across all domains within the Global T&I organization.
- A shift towards advisors. As the program has matured, the SRE Evangelists who were leading the SRE transformation have moved to providing SRE advisory services, and the teams themselves are taking on the work of deploying SRE in their own applications. This has created an advisory function and, as Richard Hall states, “One of the really pleasing outcomes of the work we did over the last 18 months is that it is now running itself. SRE is basically self-seeding across the team. This is hugely satisfying to see, as SRE has now been embraced across both the Functions and the broader T&I organization.”
- Putting education at the heart of their program. This was essential in creating momentum, as people were able to see a real benefit to not only the bank but also to them. The teams embraced the training and could see how a previous role as a support person could develop into an SRE Engineer. This is a real change, and key to the change in mindset. The technology capabilities and skills can be taught, but the mindset of working as an engineer and being seen as an engineer required a significant change.
- Developing SLOs and SLIs from top-down. Initially, the SRE team worked with the business and development teams to define high-level SLOs and SLIs. These were then fine-tuned and expanded as the business got more comfortable with using these measures. The team also focused on changing how the delivery teams thought about observability, getting them to think through the metrics required for measuring the performance of an application at design time rather than right before production go-live: Observability by design.
- The emphasis on making metrics visible for stability and reliability. The SRE team developed a variety of dashboards for the business they support. These provide real-time status on how the SLOs and SLIs are being met and, by definition, how much error budget is available. This allows the business to understand when they are getting close to a potential error budget breach, which can trigger the immediate action of the SRE team stopping further functional changes from being deployed until stability has been regained. This drives both the SRE and delivery teams to focus on quality, stability, and reliability, with the SRE functioning as a partner to both the business and delivery team.
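The error-budget arithmetic behind such a dashboard can be sketched in a few lines of Python. This is a minimal illustration, not the bank's implementation: the request-based availability SLI, the 99.9% target in the usage note, and the names `ErrorBudgetStatus` and `error_budget_status` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetStatus:
    sli: float               # observed fraction of good events in the window
    budget_total: float      # number of bad events the SLO tolerates
    budget_remaining: float  # tolerated bad events left (negative once breached)
    freeze_changes: bool     # True when the budget has been breached


def error_budget_status(good_events: int, total_events: int,
                        slo_target: float) -> ErrorBudgetStatus:
    """Compare a hypothetical availability SLI with its SLO for one window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    bad_events = total_events - good_events
    sli = good_events / total_events
    # The budget is the share of events that are allowed to fail.
    budget_total = (1.0 - slo_target) * total_events
    budget_remaining = budget_total - bad_events
    return ErrorBudgetStatus(
        sli=sli,
        budget_total=budget_total,
        budget_remaining=budget_remaining,
        freeze_changes=budget_remaining < 0,
    )
```

For instance, with an assumed 99.9% target and one million requests in the window, `error_budget_status(999_600, 1_000_000, 0.999)` reports roughly 600 of 1,000 tolerated failures still available, so functional changes can continue; with 1,500 failed requests the budget is breached and `freeze_changes` becomes `True`.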
The Recent and Next Steps
The future according to Richard Hall is looking bright as the Functions T&I team is targeting up to 100 applications to be fully supported with the SRE support model by the end of 2021.
As the evolution of SRE increased pace within Standard Chartered Bank, it became clear that several business domains were deploying SRE in a slightly different way. In order to bring more consistency to the way SRE is deployed across the bank, the team has established a bank-wide SRE community of practice (CoP).
The focus of this community is as follows:
- Bringing together best practices. The CoP, where Richard Hall is one of the senior sponsors, is focused on combining the best practices from each of the T&I domains into a community forum. Ongoing sharing, learning from each other, and getting the community to talk to each other are essential core practices of the CoP, which has been running since February 2021. Richard Hall says, “Seeing the teams sharing, the ongoing knowledge transfer has created a great community spirit. This is not about telling others how to do it but realizing 70% of what many of the teams are doing is common and 30% is different and we all learn from that organically.”
An example of how the CoP has combined best practices to define a set of Reliability standards is Observability by design. While existing application monitoring solutions such as AppDynamics, Prometheus, or ITRS are already in use, the tendency historically has been to plug in monitoring just prior to go-live and to measure only basic metrics. With the evolution of SRE, the principle of “Observability” is central to being able to properly understand and measure an application's performance and behavior. Getting the team to think about the metrics they want to capture at design time, and then building in observability by design, is an area of best practice the CoP is sharing across domains.
- Defining an SRE career path. The company previously had a well-defined engineering career path but no path for becoming an SRE Practitioner. The bank's aXess Academy, combined with the SRE CoP learning stream, is now taking best practices from each of the T&I domains and building an SRE career journey into the bank’s engineering career path program.
- Learning and Development. Building on the work done by the Functions T&I and CCIB T&I teams, along with DevOps Institute, the Standard Chartered Bank aXess Academy has developed a series of training pathways for different maturities of SRE engineers, including foundation, practitioner, and leader. This is now being rolled out across the entire T&I organization.
- Continue to focus on the T&I transformation. The technology and business teams are continually adopting a much more agile way of working and the SRE model fits well into that.
- Establishing SRE Standards. An additional step is to define SRE standards that are leveraged across the organization.
- Continue to drive cloud-first philosophy. The bank is adopting an aggressive cloud-first strategy with the focus on moving as much compute to the cloud as possible over the coming years. As the SRE team has evolved, engaging at the earliest point in the cloud migration design sessions has ensured that supportability by design is built into the approach the bank is taking, and ensures all aspects of the SRE principles are factored into the cloud migration at the design stage.
**What This Means**
Site Reliability Engineering (SRE) is a discipline that applies aspects of software engineering to infrastructure and operations problems to create ultra-scalable and exceptionally reliable distributed software systems. SRE complements DevOps by measuring and achieving the reliability of applications and services running on production and DevOps infrastructures in a prescribed manner, using error budgets, team relationships brokered by an error budget, Ops-as-code, and reliability control practices to ensure deployments meet Service Level Objectives (SLOs).
For a successful SRE journey, you must initiate changes by:
Embracing human capabilities to learn, apply and adopt. The ever-changing nature of technologies and ongoing customer demands in today’s fiercely competitive environment drive the need for more and more skills within all functions, including technology. Unfortunately, the number of skills required grows faster than team members can learn them. This is where a change in mindset needs to be introduced, which we call embracing the capabilities towards rapid learning. By nurturing and cultivating such a mindset, people will have the disposition to engage in the education and training required to adapt and accelerate in an ever-changing environment. DevOps Institute's mission is to support organizations in this journey by connecting and engaging with DevOps humans, providing insights and thought leadership, and equipping individuals and teams for success.
Showing courage, getting started, and taking the leap. Getting started with SRE at Standard Chartered Bank included setting up a pilot, which was followed by setting up a training program. As its first step, the team educated, communicated, and instilled learning within the group and among those within the application value streams. The team’s leader gave people time to explore without setting key expectations, but all pilot team members changed their way of working visibly and deliberately, working with application teams and others necessary to implement SRE practices. The leadership created an environment that developed and cultivated people’s core capabilities within the pilot. This approach takes courage, as it requires leaders to have faith that enduring human capabilities do indeed drive skills and business value.
About the Author
**Eveline Oehrlich, DevOps Institute**
Eveline Oehrlich is the Chief Research Officer at the DevOps Institute. She conducts research on topics focusing on DevOps as well as Business and IT Automation. She held the position of VP and Research Director at Forrester Research, where she led and conducted research around a variety of topics including DevOps, Digital Operational Excellence, IT and Enterprise Service Management, Cognitive Intelligence and Application Performance Management for 13 years. She has advised leaders and teams across small and large enterprises in the world on challenges and possible changes to people, process, and technology. She is the author of many research papers and thought leadership pieces and a well-known presenter and speaker within the IT industry. Eveline has more than 25 years of experience in IT.
About DevOps Institute
DevOps Institute is a professional member association. Our mission is to advance the human elements of DevOps. We create a safe and interactive ecosystem where members can network, gain knowledge, grow their careers, lead and initiate, and celebrate professional achievements. We inspire thought leadership and knowledge by connecting and enabling the global member community to drive human transformation in the digital age.
Error Budget: The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a pre-defined period. Once the error budget is breached, the SRE team is able to prevent further functional change from being deployed until an acceptable level of stability is re-established. In certain situations, resources from the functional change teams can be redeployed to the SRE team to assist in re-establishing stability.
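As a worked illustration of "how unreliable the service is allowed to be within a pre-defined period", a time-based error budget can be computed directly from an availability target. This is a sketch under stated assumptions: the function name `allowed_downtime`, the 99.9% target, and the 30-day window are illustrative and do not come from the case study.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO tolerates over a window.

    slo_target is a fraction, e.g. 0.999 for a hypothetical 99.9% target.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The error budget is the unavailable share of the window.
    return window * (1.0 - slo_target)


if __name__ == "__main__":
    budget = allowed_downtime(0.999, timedelta(days=30))
    print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of tolerated downtime; tightening the target to 99.99% cuts that to a little over 4 minutes, which is why the choice of SLO is a business conversation as much as a technical one.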