28 March, 2019
Over the next decade, a new era of computing will fundamentally change how we interact with and think about computers, from smart watches to smart cities. Artificial Intelligence (AI) and its subset, Machine Learning (ML), already power many of the digital tools we use in daily life (which you can read more about from Google).
Many of these powerful technologies will be delivered to end users via cloud or edge data centers, and data center infrastructure will have to adapt accordingly. Operators will need to transform their existing legacy infrastructure into agile, high-density layouts while managing the high processing power and heat generated by AI hardware. For the AI hardware perspective on this problem, consider reading our blog Google TPU: Thermal Management in Machine Learning.
To keep up with rising power and cooling demands, as well as increasing space limitations, this change in data center infrastructure must start at the chip level. A row within a data center is built up as follows:
(1) A silicon chip is designed and mounted on a PCB.
(2) The PCB is placed inside a blade, which forms a card.
(3) The card is mounted, along with other cards, into a larger chassis.
(4) The chassis is mounted on a rack, and a group of these racks makes up a row within the data center.
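The build-up above is a nesting of containers in which each level must reject its own heat plus the heat of everything mounted inside it. The short Python sketch below models that roll-up. The class and function names, the component counts and the per-chip and per-board wattages are hypothetical values chosen only for illustration, not figures from this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of the chip-to-row hierarchy (chip, PCB, card, chassis, rack or row)."""
    name: str
    level: str
    own_heat_w: float = 0.0                      # heat dissipated by this item itself, in W
    children: list[Node] = field(default_factory=list)

    def total_heat_w(self) -> float:
        # Heat rolls up the hierarchy: each level must reject its own heat
        # plus the heat of everything mounted inside it.
        return self.own_heat_w + sum(c.total_heat_w() for c in self.children)


def build_row(racks: int, chassis_per_rack: int, cards_per_chassis: int,
              chip_w: float, board_w: float) -> Node:
    # Hypothetical, uniform build-out used only to illustrate the roll-up.
    def card(i: int) -> Node:
        chip = Node(f"chip{i}", "chip", chip_w)
        pcb = Node(f"pcb{i}", "pcb", board_w, [chip])
        return Node(f"card{i}", "card", 0.0, [pcb])

    def chassis(j: int) -> Node:
        return Node(f"chassis{j}", "chassis", 0.0,
                    [card(k) for k in range(cards_per_chassis)])

    def rack(r: int) -> Node:
        return Node(f"rack{r}", "rack", 0.0,
                    [chassis(j) for j in range(chassis_per_rack)])

    return Node("row", "row", 0.0, [rack(r) for r in range(racks)])


row = build_row(racks=10, chassis_per_rack=3, cards_per_chassis=8,
                chip_w=250.0, board_w=50.0)
print(f"Row heat load: {row.total_heat_w() / 1000:.1f} kW")
```

Modeling the hierarchy as a tree makes the "chip to chiller" point concrete: a change in chip power propagates to every level above it, up to the row the chillers must serve.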
From silicon chip to data center rack, the design process should take the all-inclusive “chip to chiller” view of thermal design.
AI and its subsets often require highly specialized hardware: powerful processors capable of running complicated algorithms with little to no latency. These high-powered processors generate significant amounts of heat that conventional air cooling alone cannot remove. As a result, engineers must turn to liquid cooling to overcome the many thermal design issues associated with them.
In the following example, the blade has two processors that are directly cooled by liquid cooling loops. The blade's other heat-generating components (DIMMs, memory boards and hard drives) are not directly cooled by the liquid cooling loop.
Figure 1. Blade with two processors directly cooled by liquid cooling loops
Once a thermally efficient board design is achieved, the individual blade/card is placed into a blade chassis that holds several cards. Thermal challenges at this level are complex, but an efficient card design lets the engineer focus on managing heat from the components that are not directly liquid cooled. At the blade chassis level, a hybrid cooling system is common, with a set of fan trays cooling the card cages.
In the example below, the blade chassis dissipates about 5796 W: 4473 W of this is liquid cooled, and the remainder (~1.3 kW) is cooled by air. The key considerations for air cooling are fan sizing, fan curves and fan controls. An efficient server design will also ramp fan speed up or down with the chassis internal temperature, tracking the power ramps on the chips.
Figure 2. A typical blade chassis with liquid cooling loops and fans installed
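The air-cooled share of the chassis above can be checked with a simple sensible-heat balance. This Python sketch computes that share and the airflow needed to carry it away, then applies a hypothetical linear fan curve and the fan affinity law (fan power scales roughly with the cube of speed). The 15 K air temperature rise, the air properties, the function names and the fan curve breakpoints are assumptions for illustration, not values from the example chassis.

```python
# Heat figures quoted for the example blade chassis.
TOTAL_W = 5796.0
LIQUID_W = 4473.0

# Assumptions (not from the example): air near 25 °C and a 15 K rise across the chassis.
AIR_DENSITY = 1.2      # kg/m^3
AIR_CP = 1005.0        # J/(kg*K)
DELTA_T_K = 15.0
M3S_TO_CFM = 2118.88   # 1 m^3/s is about 2118.88 ft^3/min


def required_airflow_m3s(heat_w: float, delta_t_k: float) -> float:
    # Sensible heat balance: Q = rho * V * cp * dT, so V = Q / (rho * cp * dT)
    return heat_w / (AIR_DENSITY * AIR_CP * delta_t_k)


def fan_speed_fraction(temp_c: float, t_low: float = 25.0, t_high: float = 45.0,
                       s_min: float = 0.3, s_max: float = 1.0) -> float:
    # Hypothetical linear fan curve: minimum speed at or below t_low,
    # full speed at or above t_high, linear in between.
    frac = (temp_c - t_low) / (t_high - t_low)
    frac = min(max(frac, 0.0), 1.0)
    return s_min + frac * (s_max - s_min)


def relative_fan_power(speed_fraction: float) -> float:
    # Fan affinity laws: airflow scales with speed, power with speed cubed.
    return speed_fraction ** 3


air_w = TOTAL_W - LIQUID_W
flow = required_airflow_m3s(air_w, DELTA_T_K)
print(f"Air-cooled load: {air_w:.0f} W ({air_w / TOTAL_W:.0%} of the chassis)")
print(f"Airflow for a {DELTA_T_K:.0f} K rise: {flow:.3f} m^3/s "
      f"(about {flow * M3S_TO_CFM:.0f} CFM)")
for temp in (25.0, 35.0, 45.0):
    s = fan_speed_fraction(temp)
    print(f"{temp:.0f} °C internal -> {s:.0%} speed, "
          f"{relative_fan_power(s):.0%} of full fan power")
```

Because fan power rises with the cube of speed, running at 65% speed needs only about 27% of full fan power, which is why temperature-driven fan control is worth the effort.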
The next step is to design a data center to house these high-powered, hybrid-cooled servers efficiently. However, ensuring that the data center delivers high-reliability cooling without compromising on rack power density can be quite a challenge.
Figure 3. A group of chimney cabinets containing hybrid-cooled servers within a data center
Whether you design the data center's liquid cooling loop in-house or use a readily available solution, no standard implementation will fit every requirement. It is therefore important to carefully design the system, evaluate it and understand the risks involved, to avoid expensive fixes in the future.
Furthermore, implementing liquid cooling systems requires a data center that is flexible enough to handle a variety of heat rejection systems. A CFD simulation platform lets you evaluate different scenarios accurately, enabling quick and effective decision making.
At the data center level, running CFD models with detailed representations of a distributed system of pumps, heat exchangers, cooling loops and airflow can be complicated and difficult. To this end, at Future Facilities we have built simple modeling tools that power a digital twin, or three-dimensional virtual copy, of your data center. The digital twin enables users to enjoy 6SigmaRoom’s unprecedented speed and detailed modeling, while also incorporating all the necessary boundary conditions to include the liquid cooling loop and overall airflow to generate accurate simulations.
The example below shows a custom-made cabinet incorporating the high-density servers discussed previously. The cabinet houses three 10U servers, three switches and, at the bottom of the cabinet, a heat exchanger that rejects heat away from the servers. Using the built-in 1D flow network, 6SigmaRoom links the heat exchangers and the three servers in the cabinet. Important parameters such as the coolant flow rate and the coolant inlet and outlet temperatures can also be parameterized and optimized, so the impact of changing them can be studied in detail.
Figure 4. Heat exchangers installed at the bottom of the cabinet connected to the servers through a flow network
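A 1D flow network of this kind rests on two relationships: parallel branches share the same pressure drop, and each branch obeys the energy balance Q = ṁ·cp·ΔT. The Python sketch below is a minimal illustration that assumes water coolant, the three servers' cold plates piped in parallel, a quadratic pressure-drop law (ΔP = R·Q²) and 4473 W of liquid-cooled load per server (borrowed from the chassis example). The branch resistances, supply temperature, flow rates and function names are hypothetical, and this is not how 6SigmaRoom implements its flow network.

```python
import math

WATER_CP = 4186.0    # J/(kg*K), assumed water coolant
WATER_RHO = 997.0    # kg/m^3, water near 25 °C


def split_parallel(total_lpm: float, resistances: list[float]) -> list[float]:
    # Parallel branches see the same pressure drop. With dP = R * Q^2 in each
    # branch, the flow in branch i is proportional to 1 / sqrt(R_i).
    weights = [1.0 / math.sqrt(r) for r in resistances]
    total_w = sum(weights)
    return [total_lpm * w / total_w for w in weights]


def mass_flow_kgs(lpm: float) -> float:
    # L/min -> m^3/s -> kg/s
    return lpm / 60_000.0 * WATER_RHO


def outlet_temp_c(inlet_c: float, heat_w: float, lpm: float) -> float:
    # Energy balance on one branch: Q = m_dot * cp * (T_out - T_in)
    return inlet_c + heat_w / (mass_flow_kgs(lpm) * WATER_CP)


def mixed_return_c(outlets: list[float], flows: list[float]) -> float:
    # Streams merging before the heat exchanger: flow-weighted average
    # (equal to the mass-weighted average at constant density).
    return sum(t * q for t, q in zip(outlets, flows)) / sum(flows)


SERVER_LIQUID_W = 4473.0          # liquid-cooled load per server (chassis example)
RESISTANCES = [1.0, 1.0, 1.5]     # hypothetical relative branch resistances
SUPPLY_C = 30.0                   # hypothetical coolant supply temperature, °C

for total_lpm in (15.0, 30.0, 45.0):
    flows = split_parallel(total_lpm, RESISTANCES)
    outlets = [outlet_temp_c(SUPPLY_C, SERVER_LIQUID_W, q) for q in flows]
    ret = mixed_return_c(outlets, flows)
    branch = ", ".join(f"{q:.1f} L/min -> {t:.1f} °C" for q, t in zip(flows, outlets))
    print(f"{total_lpm:.0f} L/min total: {branch}; return to HX {ret:.1f} °C")
```

Sweeping the total flow rate shows the trade-off that parameterizing coolant flow exposes: lower flow reduces pumping effort but raises the return temperature reaching the heat exchanger, and the branch with the highest resistance runs hottest.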
The additional air cooling required by all the other components is provided by perimeter cooling on the floor slab, with the return air path running through the false ceiling, aided by the chimney cabinets (as shown below). Even excluding the heat dissipated via liquid cooling, a significant amount of heat within the room still needs to be managed. This is where 6SigmaRoom works seamlessly, integrating different cooling systems within the same room and offering the visibility and flexibility needed to address key issues.
Figure 5. Distribution of heat load cooled using liquid cooling and conventional air cooling
Figure 6. Plots showing the ASHRAE recommended Max Inlet Temperature and % of Cooling Capacity used by the ACUs
In the example data center above, the power density reaches as high as 500 kW/sq. ft in some areas, which can pose a significant challenge when deploying liquid cooling. Approximately 66% of the total 2 MW heat load is handled by the liquid cooling systems, so carrying out cooling redundancy checks for the liquid cooling loops is of paramount importance in this case. Alongside the liquid cooling challenge, the data center operator must also carefully consider the remaining 34% of the heat load, which is air cooled. Knowing the exact air-cooling and liquid-cooling requirements, plus their redundancies, gives data center managers an easy way to optimize resources, reduce risk and increase confidence in the decision-making process.
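The 66/34 split turns directly into a redundancy sizing question for each cooling system. This Python sketch applies a simple N+1 check to both sides of the 2 MW load. The 100 kW perimeter air-cooling unit (ACU) and 250 kW coolant distribution unit (CDU) capacities and the function names are hypothetical, and the check only compares capacities; it does not capture airflow or coolant distribution, which is what the CFD and flow network models address.

```python
import math

TOTAL_LOAD_KW = 2000.0   # total heat load of the example data center
LIQUID_SHARE = 0.66      # share handled by the liquid cooling systems

ACU_KW = 100.0           # hypothetical capacity of one perimeter air-cooling unit
CDU_KW = 250.0           # hypothetical capacity of one coolant distribution unit


def units_for_n_plus_1(load_kw: float, unit_kw: float) -> int:
    # N units are needed to carry the load; one extra unit provides redundancy.
    n = math.ceil(load_kw / unit_kw)
    return n + 1


def survives_single_failure(load_kw: float, unit_kw: float, units: int) -> bool:
    # With any one unit out of service, the remaining units must carry the full load.
    return (units - 1) * unit_kw >= load_kw


liquid_kw = TOTAL_LOAD_KW * LIQUID_SHARE
air_kw = TOTAL_LOAD_KW - liquid_kw

for label, load, unit in (("Liquid (CDUs)", liquid_kw, CDU_KW),
                          ("Air (ACUs)", air_kw, ACU_KW)):
    count = units_for_n_plus_1(load, unit)
    ok = survives_single_failure(load, unit, count)
    print(f"{label}: {load:.0f} kW -> {count} units, survives one failure: {ok}")
```

Sizing each system separately in this way makes it visible that the air-cooled third of the load still needs its own redundancy, independent of the liquid loops.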
Liquid cooling in data centers is becoming ever more prevalent and can deliver significant operational cost savings, provided it is carefully designed, planned and managed. Flow network modeling in 6SigmaRoom offers visibility and certainty at every stage of this process: from choosing the correct equipment and testing failure resilience at the design stage, to optimizing coolant flow in the operational stages. Using 6SigmaRoom to simulate your liquid cooling network provides a clear framework for implementing the liquid cooling that the new age of computing demands.
Blog written by: Akhil Docca, Director of Marketing & Danielle Gibson, Technical Marketing Writer