For a long time, ever since IT became the heart of the business, the datacenter has been the heart of IT. Owning a large datacenter meant being able to satisfy the IT needs of a large business, and so the datacenter grew as the business grew and its demands increased.
Eventually, however, the demands of the business started to outstrip the ability of the datacenter to grow. If you have your datacenter in the basement of your headquarters building, that’s some very expensive real estate, and there is only so much additional space available to buy. So you move your datacenter out of town, or hire space in someone else’s datacenter, and buy yourself a bit more time. However, that is only a temporary fix.
The Problem(s) with Datacenters
I was talking recently to a company that operates one of the most state-of-the-art datacenters in Europe, if not the world, and the limit they have run up against is not space (they are surrounded by fields), nor connectivity (they are a good enough customer that telcos vie to run more fibre into their facility). It’s power. They literally cannot run any more power into their enormously expensive datacenter. For a while they got by because equipment kept becoming more power efficient, but the demand for IT capacity has outstripped the increase in efficiency.
This company is in a somewhat rare position: because of how their business operates, they have a small set of applications, all developed in-house, and they can be very efficient and agile in responding to their customers’ demands. IT is actually their business, and therefore a core differentiator for them.
Most companies are not in the business of IT — but they are still very much dependent on the performance and availability of their IT systems for their ability to do business. For many of these companies, IT is not agile or responsive, or not as agile and responsive as they might like. Part of the reason is that they are still dealing with physical datacenters, or at least thinking in those terms. They operate complex “stacks” of different technologies from different vendors or open-source projects, and need to integrate and customize them laboriously in order to satisfy demand from the business. All of that takes time and effort, and the result still requires constant attention just to keep operating, let alone to respond to changing requirements.
Shut Down the Datacenter!
For a while now, companies have been moving away from operating their own datacenters, toward various models of cloud computing. The reasons are not the same as for the earlier move to co-located hosting, or at least they should not be: treating the cloud as just a slightly different sort of virtualization will work, sure, but it leaves a lot of the value on the table.
What this means is that, where a few years ago companies were operating most of their technology stack themselves, and often by hand, now most companies have significant chunks of their applications running in the cloud. The datacenter has not gone away, it’s just shifted — as the T-shirt slogan has it, “There is no cloud, it’s just someone else’s computer.”
Conferences like Gartner’s upcoming IT Infrastructure, Operations Management & Data Center Summit are interesting because they span the entire spectrum of this transition. You get to talk to people who are running web-scale cloud services with ten thousand (or more!) systems in them, and also to people who are consuming resources from those cloud offerings.
Both parties need new models.
The Modern Datacenter: Build Versus Buy
For cloud operators, individual servers are almost irrelevant. This is the famous “cattle vs pets” dichotomy: if you have laboriously hand-assembled your application stack on an individual server, it is worth significant effort to try to fix it when anything goes wrong. But if the only role of your server is to be one interchangeable cog among ten thousand, where customers consume aggregated compute resources and you don’t know, or care, what they are doing with them, the rational choice is different: if anything gets even slightly out of whack, you drop that server, pave it over, and put it back into service.
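To make that concrete, here is a minimal sketch of the “pave it over” loop in Python. The helpers check_health, terminate and provision_from_image are hypothetical stand-ins for whatever health probe and provisioning API a given platform exposes; in practice an auto-scaling group or a Kubernetes controller runs this reconciliation for you.

```python
# A minimal sketch of the "cattle, not pets" reconciliation loop.
# check_health(), terminate() and provision_from_image() are hypothetical
# placeholders for whatever probe and provisioning API your platform exposes.

def check_health(instance_id: str) -> bool:
    """Hypothetical health probe, e.g. an HTTP check or an agent heartbeat."""
    raise NotImplementedError

def terminate(instance_id: str) -> None:
    """Hypothetical call to the provider's terminate API."""
    raise NotImplementedError

def provision_from_image(image_id: str) -> str:
    """Hypothetical call that boots a fresh instance from a known-good image."""
    raise NotImplementedError

def reconcile(fleet: set[str], image_id: str) -> set[str]:
    """Replace, rather than repair, any instance that fails its health check."""
    healthy = set()
    for instance_id in fleet:
        if check_health(instance_id):
            healthy.add(instance_id)
        else:
            terminate(instance_id)                       # pave it over...
            healthy.add(provision_from_image(image_id))  # ...and put a clone back into service
    return healthy
```

The point of the sketch is the asymmetry: there is no repair branch at all, because at this scale the cheapest fix is always a fresh instance.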
For cloud customers, all of this frantic activity is (or should be!) entirely invisible — they pay for a certain amount of compute resource and ancillary services, and do not care in the least how their contract is honoured. Note that this is the same for private cloud. Any number of organizations still want or need to own their own systems, but even then, they want to consume them as a service, in chunks that are determined by business value, not inflexible hardware sizing.
For both parties, the key is having visibility into what is going on with the service, and being able to deal with any issues as they arise. Because of the complexity and the handoffs involved in a cloud computing setup, there is no convenient “single source of truth” to refer to. Also, because of the rapid rate of change — both the constant churn of customer requests, and the intervention of automated systems culling the herd and bringing new resources online — any failure that becomes obvious and visible to more than a tiny fraction of end users is probably going to turn into a massive outage, the sort that makes the news.
This is what drives new trends like Site Reliability Engineering and the shift from monitoring to observability. By the time the threshold is crossed and the alert goes red, it’s already too late. What cloud operators and customers alike require is new ways to sift signal from noise and identify the faint signals of something beginning to go wrong, so that they can deal with it in a calm, relaxed manner instead of reacting in a rush to an emergency.
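To make “faint signals” slightly more concrete, here is one illustrative and deliberately simple technique: a rolling z-score over a metric stream, which flags a drift away from recent behaviour long before a fixed red-line threshold would fire. This is not a description of any particular product’s algorithm, just a sketch of the general idea; the names and numbers are made up for the example.

```python
import math
from collections import deque
from statistics import mean, stdev

def drift_scores(samples, window=60, min_points=10):
    """Yield (value, z): how far each new sample sits from its recent norm.

    A large |z| is an early hint that behaviour is changing, even while the
    raw value is still nowhere near a static alerting threshold.
    """
    recent = deque(maxlen=window)
    for value in samples:
        if len(recent) >= min_points:
            mu, sigma = mean(recent), stdev(recent)
            yield value, (value - mu) / sigma if sigma > 0 else 0.0
        recent.append(value)

# Example: latency steps up from ~100 ms to ~130 ms, still far below a 500 ms red line.
steady = [100 + 2 * math.sin(i / 5) for i in range(120)]
degraded = [130 + 2 * math.sin(i / 5) for i in range(120, 150)]
for value, z in drift_scores(steady + degraded):
    if abs(z) > 3:
        print(f"early warning: {value:.1f} ms is {z:.1f} sigma away from its recent norm")
        break
```

The same principle generalises from a single metric to the whole alert stream: the interesting question is rarely “did a line get crossed?” but “has the pattern changed?”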
Can You Predict Cloudy Weather?
To be clear, this is not prediction. The complexity of this sort of multi-technology, multi-actor environment makes it impossible to model to the degree of precision required for accurate and actionable predictions. The number of possible outcomes from any event is too large for the calculations to be made in useful time, even if sufficient data were available.
Instead, what we at Moogsoft talk about is being able to identify useful information in real time, correlating it across all of the different event sources and levels of the stack, and most importantly, showing it to the right people in time for them to be able to do something about it. In the cloud, events are not rare occurrences that can be evaluated one by one. There is a constant stream of alerts, and the trick is figuring out what is actually important, and to whom it is relevant. Both the technical environment and the business one change and evolve too rapidly for it to be possible to document them in any operationally useful detail, so that evaluation must be done based on patterns in the alert stream itself.
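As a rough illustration of what “correlating across sources” can mean in practice, the sketch below groups a raw alert stream into candidate incidents by time proximity and shared service tags. It is deliberately naive (real correlation also weighs topology, text similarity, and learned patterns), and the Alert and Situation structures are assumptions for the example, not any product’s schema.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    timestamp: float      # seconds since epoch
    source: str           # e.g. "aws/eu-west-1" or "k8s/payments" (illustrative)
    service: str          # business service the emitter is tagged with
    message: str

@dataclass
class Situation:
    alerts: list = field(default_factory=list)
    services: set = field(default_factory=set)

def correlate(alerts, window_seconds=300):
    """Group alerts into candidate situations: close in time AND sharing a service tag."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for situation in situations:
            recent = alert.timestamp - situation.alerts[-1].timestamp <= window_seconds
            related = alert.service in situation.services
            if recent and related:
                situation.alerts.append(alert)
                situation.services.add(alert.service)
                break
        else:
            situations.append(Situation(alerts=[alert], services={alert.service}))
    return situations
```

Even this toy version turns a stream of hundreds of alerts into a handful of situations that can be routed to the team that owns the affected service, which is the shape of the problem however sophisticated the correlation becomes.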
Finally, while it used to be possible for one person or a small group of people to understand and be responsible for an entire server, from the top of the stack to the bottom, the unbundling of that application stack has made it impossible to maintain that understanding. Instead, the full-stack picture must be assembled on the fly by bringing together different data streams and the different subject-matter experts who can understand them and act upon them.
Moogsoft will be at Gartner IT Infrastructure, Operations Management & Data Center Summit in London. Whether you run a cloud, or use one or several, we would love to talk to you about your experiences, and what Moogsoft customers have done in similar situations. See you there!