How DevOps creates value for web developers and their clients
Published 2019-04-23 by Jochen Lillich
Every web developer eventually finds themselves in the situation that a client complains about bad website performance. This is an important issue because performance problems will deter visitors and might even turn into availability problems. Resolving it quickly is important for retaining a healthy business relationship with your client. But it’s not easy to resolve a performance issue in a swift and satisfactory manner when there can be any number of causes.
I hope I’m not the first person to tell you that throwing hardware at a web performance problem is probably not going to be a good, let alone the most effective, solution, neither from a cost perspective nor from a technical point of view. (And that’s from someone who makes a living off operating IT infrastructure!) The reason the “dial up capacity” approach doesn’t work too well is that it will sooner rather than later yield diminishing returns, and neither you nor your client will be delighted about ever-increasing hosting expenses without much impact.
I’ve written this article to point you to a more effective approach. It counters the complexity of modern web software stacks via DevOps, the practice of having operations engineers and developers collaborate within a common value stream.
The complexity of modern web application stacks
If your website isn’t as fast as it’s supposed to be, there’s a myriad of possible reasons. The cause might hide in any layer of your technology stack. There might even be a combination of multiple causes conspiring against you!
Let’s start at the top and look at the application itself. That’s where a web developer will feel most at home. Modern web applications are complex pieces of software, and some of their parts might not have been written with performance optimisation in mind. Especially if you build your application on a framework like Laravel or on a CMS like Drupal or WordPress, there are thousands of lines of code that were written by someone else. Often it’s a third-party module or plugin that might have worked perfectly fine with the amount of data the author used for testing, but now that you’re throwing a bigger real-life use case at it, it may bog down.
Then, there’s the service stack operated by your infrastructure or hosting provider. The days when that was a decently simple LAMP stack are long gone. At the heart of everything, there’s usually still a database like MySQL, a piece of software so complex that just a list of its configuration settings fills many pages of paper. For managing data for which a relational database isn’t a good fit, “NoSQL” data stores like Redis, Solr and Elasticsearch come into play. To further improve website performance, a modern hosting platform such as freistilbox provides you with caching services like Memcached and Varnish that can substantially boost content generation and delivery. If your application interacts with all these services as intended, it’ll show top-notch performance. However, that’s not a small “if”!
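The caching services mentioned above typically follow the cache-aside pattern: the application checks the cache first and only does the expensive rendering work on a miss. Here is a minimal sketch of that pattern; a plain dictionary stands in for a Memcached client, and the function names are invented for illustration (with a real client such as pymemcache, the get/set calls look very similar):

```python
# Minimal cache-aside sketch. A plain dict stands in for a Memcached
# client; a real client would also take an expiry time on set().
cache = {}

def render_page(page_id):
    """Return cached HTML if present, otherwise generate and cache it."""
    key = f"page:{page_id}"
    html = cache.get(key)
    if html is None:          # cache miss: do the expensive work once
        html = expensive_render(page_id)
        cache[key] = html     # subsequent requests are served from memory
    return html

def expensive_render(page_id):
    # stand-in for template rendering plus database queries
    return f"<html><body>Page {page_id}</body></html>"
```

The whole point of the pattern is that the second request for the same page never touches the database, which is why a well-integrated cache can boost content generation so substantially.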
Finally, at the bottom end of the stack we find the computing and networking hardware, i.e. the servers that run your application stack and the data centre they’re in. Yes, even in times of “serverless computing”, there still are servers running your code. And I’m using the plural here because no serious web application stack runs on a single server.
Unfortunately, all this complexity makes investigating performance issues very hard. With most hosting services, especially managed hosting platforms, you as the customer don’t have deep access to all the components you’d need to analyse the problem. And hey, would you even have the necessary expertise or the time to do it yourself? The answer is probably “No”. After all, that’s why you decided to pay someone else to run your web applications instead of running your own web operations team. However, when things get out of whack, you might feel a tinge of regret for not having that ops team at arm’s length. Somewhere in this tangled mess is the clue you’re looking for. That’s where many hosting customers get stuck. I suspect you’ve had this frustrating experience yourself already.
Collaboration beats complexity
Finding and resolving the bottleneck in a complex technology stack is a challenging task that you shouldn’t have to take on alone. At freistil IT, we chose a very intentional approach to our customer relationships that differentiates us from most managed hosting providers. Instead of making our hosting platform as opaque as possible and hiding our experts behind a support ticket system, we actively seek contact with our customers. We want to be “The Ops to your Dev” because DevOps-style collaboration is the most effective way to resolve (or, even better, prevent) most of the issues that occur during the lifecycle of a website.
Our ops engineers resolve website issues by analysing the symptoms, correlating available data sources, and adding personal experience and efficient research. If we can’t eliminate the bottleneck ourselves with an infrastructure change, we can at least make a recommendation for how it can be solved on the application level. In almost every case, this recommendation will be much more sophisticated than “throw hardware at it”. And your client will appreciate you for that.
A well-equipped web operations team has access to many data sources that it can use to narrow down the many possible causes for performance or stability issues. By using infrastructure metrics, ops engineers can investigate if there are any abnormalities that might point to a reason.
Let’s take a real-life case from the early days of freistilbox. A customer contacted us because their website suffered from frequent outages. We saw immediately that the website was maxing out its available PHP workers (we call them “Processing Units”), which caused “Website unavailable” errors for many visitors. However, there wasn’t enough traffic to justify all PUs being busy for any extended period of time. We discovered that the PUs didn’t get freed up fast enough because it took the web application multiple seconds to respond to every web request. We dug deeper and identified database access as the reason that rendering pages was taking forever.

At this point, we saw an easy solution: migrate the website to a newly built database cluster, our first cluster using SSD storage instead of “spinning rust”. That would surely free up those PUs. Or so we thought. After the switch, the situation didn’t improve, it got worse! Even though database queries were much faster now, both the website’s performance and availability were still miserable. We realised that we hadn’t found the root cause yet.

We went back to our infrastructure metrics and found a change in behaviour that was clearly for the worse: the customer’s web application boxes were now operating at their network bandwidth limit, with database connections making up 99% of the traffic. In addition to infrastructure metrics, ops teams can also use many kinds of logs to investigate issues. By analysing information from the MySQL Slow Query Log, we finally discovered the root cause: some of the application’s database queries were pulling gigabytes of data from the database cluster. On the previous MySQL cluster, reading these huge result sets from disk had caused response times in the thousands of milliseconds.
The new SSD-based cluster didn’t have this particular bottleneck anymore, but its superior performance led to a new one: Delivering these monster result sets now happened so fast that it saturated the network interface of the customer’s application servers. Throwing hardware at the problem really didn’t do any good here. We went back to our customer with the database queries we had identified. Armed with this information, they were able to quickly resolve the issue themselves by reducing the amount of data they pulled from the database at a time.
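The customer’s fix, reducing the amount of data pulled from the database at a time, can be sketched as keyset pagination: instead of one monster result set, the application fetches a bounded chunk, remembers where it stopped, and continues from there. The following is a hypothetical illustration (the customer’s application was PHP-based, and the table and column names here are invented); it uses SQLite so the sketch is self-contained:

```python
import sqlite3

def fetch_in_chunks(conn, chunk_size=1000):
    """Stream rows via keyset pagination instead of one huge result set."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        yield from rows
        last_id = rows[-1][0]  # resume after the last id we saw

# demo with an in-memory database holding 2500 rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO items (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(2500)])
total = sum(1 for _ in fetch_in_chunks(conn, chunk_size=1000))
```

Each round trip now transfers at most `chunk_size` rows, which keeps both memory usage and network bandwidth bounded no matter how large the table grows.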
Now they were finally able to enjoy the improved performance of our new database cluster. They didn’t have to ask their client to buy additional hosting capacity. Because our ops team was able to point them in the right direction, it only took a few targeted changes to the application code to resolve a massive performance and availability problem. Everyone was happy.
DevOps as the core of our managed hosting
For our ops team, this example isn’t just an anecdote. It has become part of our shared experience that we can tap into in similar cases. With every issue we solve, our pool of knowledge grows and our resolution process gets more effective. Every issue that turns up more than once, we can resolve in a highly focused way based on prior experience. This continuous improvement process not only results in a shorter Mean Time To Resolution (MTTR) for our web development customers, it also puts them in a better position. Thanks to our DevOps philosophy, they’re able to work with the full set of options for how to solve a performance issue (or any other issue, actually). They can be confident that they’ll have a better response for their clients than just “let’s try throwing (more) money at the problem”.
Most importantly, our collaborative and transparent approach builds massive trust. And that’s where our DevOps philosophy turns into a strong business model.