Data Center: Identifying Patterns and Bottlenecks
In a recent project, The Data Guild assisted a major Bay Area tech company in identifying and optimizing system performance bottlenecks within a multi-tiered, global Software as a Service (SaaS) product: a real-time, on-demand video service.
At first glance, the challenge of working in software/server systems alone seemed less daunting than intervening in the massive industrial systems of the example above. However, software systems harbor a paradox: infinite flexibility and scalability introduce the possibility of infinite complexity.
In industrial mechanics, the opportunity to alter a system is limited by the physical capabilities of the machinery itself. What controls are exposed? What changes can humans actually make to such a system? In software, these limitations vanish, enabling even novice coders to create complexity that makes fault detection and diagnosis challenging. A machine breakdown is easily noticed, but can the same be said for a data center?
In this project, we analyzed the service logs passed between each tier of the service to determine bottlenecks: which steps in the system were contributing most to limitations in performance? However, the location of an observed bottleneck could not reliably be considered its source. To improve the situation, we developed a systematic approach to understanding the relationships between different components of hardware and software. Through this covariance model, we could then make recommendations for system improvements.
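The first pass of such an analysis can be surprisingly simple. As a minimal sketch, assuming hypothetical per-request latency records already parsed from the logs (the record shape and tier names here are invented for illustration), aggregating latency by tier surfaces the step that contributes most to end-to-end time:

```python
from collections import defaultdict

# Hypothetical, simplified log records: (request_id, tier, latency_ms).
# Real service logs would first need parsing and joining across tiers.
records = [
    ("req1", "edge", 12), ("req1", "app", 45), ("req1", "storage", 140),
    ("req2", "edge", 10), ("req2", "app", 50), ("req2", "storage", 155),
    ("req3", "edge", 14), ("req3", "app", 48), ("req3", "storage", 130),
]

def latency_by_tier(records):
    """Mean latency per tier, to flag bottleneck candidates."""
    by_tier = defaultdict(list)
    for _, tier, ms in records:
        by_tier[tier].append(ms)
    return {tier: sum(v) / len(v) for tier, v in by_tier.items()}

means = latency_by_tier(records)
bottleneck = max(means, key=means.get)  # storage dominates in this toy data
```

Of course, as noted above, the tier where latency shows up is not necessarily its source; that distinction is what pushed us toward modeling relationships between components rather than ranking them in isolation.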
In this case, the industrial exhaust was the server logs. These logs represent billions of transactional records of the input/output of the servers, detailing the work (CPU) and communication (bandwidth) in the context of a much larger system. In isolation, these logs held limited value beyond local tuning, but by integrating these sources we were able to identify patterns, hierarchies, and relationships. Though a derivative of existing log data, the covariance data became the new source from which we could derive dependencies and, ultimately, recommendations.
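The core of a covariance model is straightforward. As a hedged sketch (the metric names and values below are invented; real inputs would be time-aligned series extracted from the logs), pairwise covariances reveal which components move together, hinting at hidden dependencies:

```python
# Hypothetical time-aligned metric samples derived from logs: each list is
# one component's utilization over the same sampling windows.
metrics = {
    "app_cpu":    [0.52, 0.61, 0.70, 0.66, 0.80],
    "db_iops":    [0.48, 0.58, 0.69, 0.63, 0.78],
    "cache_hits": [0.90, 0.88, 0.91, 0.89, 0.90],
}

def covariance(xs, ys):
    """Sample covariance between two equal-length metric series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Build the pairwise covariance table: a strong app_cpu/db_iops value
# suggests those components share a dependency, while a near-zero value
# (as with cache_hits here) suggests independence.
names = list(metrics)
pairs = {}
for i, a in enumerate(names):
    for b in names[i + 1:]:
        pairs[(a, b)] = covariance(metrics[a], metrics[b])
```

In practice one would normalize to correlations and work at far larger scale, but the principle is the same: the derived table, not any single log, becomes the object of analysis.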
It’s worth pointing out that the designers of the logging system likely had little in mind beyond fault detection and diagnosis or software performance tuning. However, in a broader, complex networked environment, each transaction, when taken in concert with the rest of the logs, began to illuminate the nature of the larger system. This in turn gave us a path to follow in order to boost software performance in some applications and improve hardware capacity in places requiring greater scale.
Another lesson from this project is the importance of data visualization. Evolution has trained humans to recognize patterns in nature and to act on them. In our history, those who understood the patterns of the seasons could optimize crop yields, for example, and those who observed the movements of wildlife were more successful hunters and thus more effective survivors. In the current age, pattern recognition improves outcomes in nearly every problem we choose to tackle. Unfortunately, unlike weather or herd movements, signals in data can be difficult to pin down.
In data science, we split activities broadly into supervised and unsupervised learning. In the former, we “know what we’re looking for” and hope to build models that approximate that outcome. In the latter, we’re not sure what we will learn as we set off on the journey, but we hope to identify new patterns that can help us understand our context in the form of meaningful classes or clusters. The utility of data visualization in this context cannot be overstated. Spreading data across a graph to show relationships, rendering multiple dimensions, or applying intelligent color-coding to isolate patterns can unlock value that would otherwise remain non-intuitive.
In this project, we went a step further: we developed a data sonification system to enable those in the network operations center (NOC) to hear their data center in the same way a factory floor manager might be able to hear her machines and understand when they are on or off, functioning smoothly, or having issues. Audio examples and more detail on this project can be heard in this podcast [insert link to Bruner/Turner podcast].
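The basic idea of sonification is a mapping from metric to sound. As a minimal illustration (this is not the project's actual system; the utilization values and pitch range are invented), one can map a utilization stream onto pitch, so that a load spike is heard as a rising tone:

```python
import math

# Hypothetical CPU-utilization samples (0.0-1.0), one per interval.
utilization = [0.20, 0.25, 0.30, 0.85, 0.90, 0.40]

def to_frequency(u, low_hz=220.0, high_hz=880.0):
    """Map a 0-1 metric linearly onto a two-octave pitch range."""
    return low_hz + u * (high_hz - low_hz)

def synthesize(values, sample_rate=8000, seconds_per_value=0.25):
    """Render each metric value as a short sine tone in [-1, 1]."""
    samples = []
    for u in values:
        freq = to_frequency(u)
        n = int(sample_rate * seconds_per_value)
        samples.extend(math.sin(2 * math.pi * freq * t / sample_rate)
                       for t in range(n))
    return samples

audio = synthesize(utilization)
# `audio` can be scaled to 16-bit integers and written out with the
# standard-library `wave` module, or streamed to NOC speakers.
```

A real system would layer many such streams and use timbre, panning, and rhythm as additional channels, but even this single mapping conveys the state of a machine without anyone watching a dashboard.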
We did not use IoT to generate new input, but rather used a form of IoT to extend the system with new output. Instead of measuring an ambient characteristic (as was done in the prior example), here we created an ambient characteristic, sound, through which information could be consumed. IoT need not be read-only.
The value created through this type of approach is twofold. Identifying opportunities for throughput optimization in a data center environment can immediately reduce capital expenditure on additional capacity (hardware). The approach is also extensible: deeper infrastructure insight can keep organizations from becoming overly dependent on any one platform.