
The introduction of Site Reliability Engineering (SRE)

Introducing Site Reliability Engineering (SRE) teams into an organization's structure is becoming more and more popular in the IT industry and in the DevOps domain.
In this article, let's explore the reasons for SRE's popularity and what the differences and common points between DevOps and SRE are.


Over the last two decades we have witnessed a huge transformation in the way software is built and delivered. Agile culture first, and the DevOps revolution later, have reshaped the structure of tech organizations and can be seen as de facto standards in the IT industry.

IT is a constantly evolving sector, and recently the Site Reliability Engineering discipline has been gaining popularity, especially in the DevOps domain. But what is SRE? And what are the differences and common points between SRE and DevOps?

DevOps

DevOps culture has helped tear down the wall between software development and software operations, providing a set of ideas and practices that has brought several benefits:

  • Better collaboration and communication between teams;
  • Shorter release cycles;
  • Faster innovation;
  • Better systems reliability;
  • Reduced IT costs;
  • Higher software delivery performance.

Even though this sounds amazing, quite a lot of companies still struggle to bring the DevOps culture into their organization. The reason is that DevOps is an ideology, not a methodology or a technology, which means it doesn't say anything about how to successfully implement a DevOps strategy.

SRE 

Site Reliability Engineering (SRE) is a discipline that was born at Google in the early 2000s to reduce the gap between software development and operations, completely independently of the DevOps movement. SRE uses software engineering approaches to solve operational problems.
SRE teams mainly focus on:

  • Reliability;
  • Automation.

Let's look at these aspects in more depth.

Reliability

One of the main goals of SRE is making and keeping systems up and running "no matter what". To achieve this, it is important to keep in mind that failures and faults will happen. The SRE discipline embraces them by focusing on:

  • Observability;
  • System performance;
  • High availability (HA);
  • Emergency response and disaster recovery;
  • Incident management;
  • Learning from past problems;
  • Disaster mitigation and prevention.

Automation

Automating activities that are traditionally performed manually is another main goal of SRE: automation and software engineering are used to solve operational problems.

Automation plays a fundamental role in SRE: it removes the human errors present in the processes and activities around the system. One could argue that automation introduces bugs of its own, and that is true, but there is one big difference: automated processes can be tested, while processes that involve human activities cannot.

DevOps vs. SRE

As we have seen, both the DevOps culture and the SRE discipline aim to reduce the gap between software development and operations. Below we summarize them, describing their common ground first and then where they differ the most.

class SRE implements DevOps

As mentioned earlier, DevOps is an ideology and therefore doesn't say anything about how to successfully bring that culture into an organization. SRE, on the other hand, can be seen as an implementation of the DevOps philosophy.
In fact, even though the origins of SRE are completely independent of DevOps, and the discipline provides additional practices that are not part of DevOps, SRE implements DevOps ideas in practice.

Responsibility and Code ownership

SRE can be considered the next stage of DevOps because of its focus on code ownership: an SRE accepts responsibility for owning the code they develop, all the way into production. This is a bit different from DevOps, where responsibilities are shared to achieve shorter release cycles and improve collaboration.

Conclusions

Introducing Site Reliability Engineering (SRE) teams into an organization's structure is becoming more and more popular in the IT industry and in the DevOps domain.
The reason for this popularity can be found in the benefits that the discipline brings:

  • Better collaboration and communication between teams;
  • Shorter release cycles;
  • Faster innovation;
  • Better systems reliability;
  • Reduced IT costs;
  • Higher software delivery performance;
  • Fewer incidents in production;
  • Code ownership;
  • Automation of processes.

As you may have noticed, some of these benefits are exactly the same ones you will experience when bringing DevOps into your organization.

SRE can be considered an implementation of DevOps culture that has the goal of making and keeping services reliable.


Continuous improvement – server access example

When you work with any of the big public cloud providers, one thing is for sure – there will be many changes and improvements to the services they provide to you. Some changes may change the way you would architect and design a solution entirely, while others may bring smaller, but still useful, improvements.

This post is a story of one of these smaller improvements. It is pretty technical, but the gist of it is that with a mindset of continuous improvement we can find nuggets that make life easier and better; it does not have to be completely new architectures and solutions.

A cloud journey

In 2012, before TIQQE existed, and when some of us at TIQQE started our journey in the public cloud, we created virtual machines in AWS to run the different solutions we built. It was a similar set-up to what we had used in on-premises data centres, and we used the EC2 service in AWS.

Using VPC (Virtual Private Cloud) we could set up different solutions isolated from each other. A common pattern back then was a single account per customer solution, with separate VPCs for test and production environments. These included both private and public (DMZ) subnets.

To log in to a server (required in most cases – there was not so much immutable infrastructure back then) you needed credentials for the VPN server solution, with appropriate permissions set up. To log in to an actual server, you also needed a private SSH key, such as the one you select or create for the ec2-user user when you create the virtual machine.

While this worked, it did present some challenges in terms of security and management: which persons or roles should be able to connect to the VPNs, and which VPCs should they be able to access? Of those users and roles, who should be able to SSH into a server, and into which servers?

There was a centrally managed secrets store for the ec2-user SSH keys, with different keys for different environments and purposes, but this was a challenge to maintain.

Serverless and Systems Manager

The serverless trend, which kind of started with AWS Lambda, removed some of these headaches, since there was no server access or logins to consider – at least not where solution components can be AWS Lambda implementations. That was great – and still is!
Going serverless brings other challenges, though, and it is not the answer to every problem. There is a lot to say about the benefits of serverless solutions, but this story focuses on when you still need to use servers.

AWS has another service, called Systems Manager, which is essentially a collection of tools and services to manage a fleet of servers. That service has steadily improved over the years, and a few years back it introduced a new feature called Session Manager. This feature allows a user to log in to a server via the AWS Console or via the AWS CLI – no SSH keys to maintain and no ports to open for SSH access. It also removes the need for a VPN solution for people who need direct access to a server for some reason.
Access control uses AWS Identity and Access Management (IAM) – no separate credential solution is needed.
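
As a rough sketch, starting a session from the AWS CLI can look like this (the instance ID is a placeholder, and the Session Manager plugin for the AWS CLI needs to be installed):

    # Open an interactive shell on the instance, authenticated through IAM
    # – no SSH key and no open inbound ports required
    aws ssm start-session --target i-0123456789abcdef0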

Some other major cloud providers already had similar features, so in this regard, AWS was doing some catch-up. It is good that they did!

A new solution to an old problem

For a solution that requires servers, there is a new access pattern to use. No VPN, no bastion hosts. Those who should have direct access to a server can now log in to it directly via the AWS Console in a browser tab. No VPN connections, no SSH keys to manage – simply choose to connect and you are logged in to the server in the browser. That is, assuming you have the IAM permissions to do so!
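
To sketch what those IAM permissions could look like, a minimal policy might allow sessions only towards a specific instance (the account ID, region and instance ID below are placeholders, and the exact statements depend on your setup):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "ssm:StartSession",
          "Resource": "arn:aws:ec2:eu-west-1:111122223333:instance/i-0123456789abcdef0"
        },
        {
          "Effect": "Allow",
          "Action": ["ssm:TerminateSession", "ssm:ResumeSession"],
          "Resource": "arn:aws:ssm:*:*:session/${aws:username}-*"
        }
      ]
    }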


For those cases where the browser solution is not good enough, it is still possible to perform an SSH login from a local machine. In this case, the AWS CLI can be used to tunnel the SSH connection to the server through Systems Manager Session Manager. The user can have their own SSH key, which can be authorized temporarily for access to a specific server.
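
As a sketch of how this can be set up (the key path, user and instance ID are placeholders), the SSH client can be told to proxy connections to instance IDs through Session Manager:

    # ~/.ssh/config – route SSH to EC2 instance IDs through Session Manager
    Host i-* mi-*
        ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"

    # Connect as usual, addressing the server by its instance ID
    ssh -i ~/.ssh/my-key ec2-user@i-0123456789abcdef0

The matching public key still needs to be authorized on the instance for that user; pushing it there temporarily, for example with EC2 Instance Connect, is one way to handle the short-lived access mentioned above.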

Since regular SSH software can then be used locally, it is also possible to do port forwarding, for example, so that the user can reach some other resource (e.g. a database) that is only accessible via that server. AWS Systems Manager also provides an audit trail of the access activities.
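
With the SSH configuration above in place, such a port forward could, as a sketch, look like this (host names and ports are placeholders):

    # Forward local port 15432 to a database that is only reachable from the server
    ssh -i ~/.ssh/my-key -L 15432:mydb.internal.example.com:5432 ec2-user@i-0123456789abcdef0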

Overall, I believe this approach is useful and helpful for situations where we need direct server access. The key point, though, is that with a mindset of continuous improvement we can pick up ways to do better, both big and small.


AWS re:Invent Online 2020

Usually at this time of the year we at TIQQE are getting prepared for re:Invent, traveling to Las Vegas and hosting our yearly "Reinvent comes to you", live streamed from our office in Örebro together with our friends, customers and employees.

This year will of course be a little different, but there is still the possibility to take part online!

You are well on your way to the best few weeks of the year for cloud. Make sure to join AWS re:Invent and learn about the latest trends, customers and partners, followed by many excellent keynotes, breakout sessions and tracks – not to forget all the possibilities to deepen your knowledge through training and certifications.

So, whether you are just getting started in the cloud or are an advanced user, come and learn something new at AWS re:Invent Online 2020.

Make sure to register via the link below and secure your place at re:Invent 2020!

https://reinvent.awsevents.com/

Want to get started with AWS? At TIQQE, we have loads of experience and are an AWS Advanced Partner. Contact us, we're here to help.