Deep Learning (DL), a key technique driving artificial intelligence innovation in areas such as image recognition, chatbots, and self-driving cars, requires that algorithms be 'trained' using large data sets. Initially, this training can be done on a single node (server). However, as models and datasets grow larger and more complex, it becomes essential to scale out.
Within a single node, power, thermal, storage, and memory limits cap the scale of a training solution. When a workload reaches or exceeds the capabilities of a single node, scaling out becomes necessary. So far, the emphasis in deep learning has been on single-node runs, but as neural networks become more integrated into existing workloads such as Hadoop and HPC, multi-node scaling demands attention.
A single node (server) supporting one to eight GPUs or accelerators is relatively simple to deploy and operate, but a multi-node system can improve performance significantly. Multi-node systems let training operations exploit parallelism and increased capacity to handle larger models and datasets. Scale-out may also be required where the number of users on the system is high or growing.
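The parallelism that multi-node training exploits is most commonly data parallelism: each node computes gradients on its own shard of a batch, and the gradients are averaged before every node applies the same weight update. A minimal, framework-agnostic sketch of that averaging step (the function name and plain-list representation are illustrative, not tied to any specific framework):

```python
def average_gradients(per_node_grads):
    """Average gradients computed independently on each node.

    per_node_grads: one gradient vector (list of floats) per node.
    Returns the element-wise mean, which every node applies identically,
    keeping model replicas in sync after each step.
    """
    n_nodes = len(per_node_grads)
    n_params = len(per_node_grads[0])
    return [sum(g[i] for g in per_node_grads) / n_nodes
            for i in range(n_params)]

# Two nodes, each holding gradients for three parameters:
grads = average_gradients([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
# Every node now applies the same averaged update: [2.0, 3.0, 4.0]
```

In practice this reduction is performed by the framework's distributed runtime (e.g. an all-reduce over InfiniBand), but the arithmetic is exactly this mean.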
This is one of the areas where Dell EMC has been focusing: how can users add GPU nodes as their workload needs increase, and how can customers scale out at the rack level as the number of users accessing these resources grows? Based on these requirements, Dell EMC began evaluating Bitfusion, which provides software for elastic deep learning and makes development of deep learning applications faster and more economical. Running the Bitfusion stack on top of a Dell EMC hardware solution brings further benefits, as it addresses pain points such as:
Figure 1: Current infrastructure limitations
Figure 2: Today's complex software ecosystem
The need to simplify and scale solutions, as the usage for deep learning within an organization grows, is clear. Some of the benefits of a partnered Dell EMC and Bitfusion approach are:
Figure 3: Dell EMC converged rack solution with Bitfusion remote GPU virtualization
Figure 4: Bitfusion Flex - streamlined AI development
To address some of the pain points mentioned above, Dell EMC began working with Bitfusion to examine performance with locally and remotely attached GPUs. We targeted the Dell EMC PowerEdge C4130 server, a 1U form factor that supports 4x double-wide, full-height GPUs. Advantages of the PowerEdge C4130 include:
Figure 5 shows the PowerEdge C4130 and its internal feature set.
Figure 5: Dell EMC C4130 internal configuration
To demonstrate a variety of usage scenarios, we assembled a pair of Dell EMC PowerEdge R730 servers for CPU and four PowerEdge C4130 GPU-enabled servers, connected with a Mellanox FDR switch as shown in Figure 6. This is our baseline configuration, upon which upgrades are available, including Pascal P100s, EDR networking, and a variety of PCIe configurations to optimize data movement for the workload.
The table below shows the internal configurations of the R730 (client) and C4130 (GPU) nodes.
The software stack installed on the test configuration (Figure 6) is depicted in Figure 7. The 'client server', which may be a CPU-intensive server (PowerEdge R730) or a GPU-intensive server (PowerEdge C4130), runs the deep learning applications, such as TensorFlow and Caffe, together with the Bitfusion client library in a fully containerized deployment. The Bitfusion client components manage data transfers and expose the virtualization features to the client CUDA or OpenCL application. Each GPU server runs the Bitfusion service daemon, which handles requests from one or more Bitfusion client endpoints and provides process isolation and resource guarantees based on the configured SLA. In the current test configuration, we focused on the basic setup, allowing runtime attachment of GPU resources to evaluate performance and efficiency in the most common datacenter scenarios.
Figure 7: Software components
One important aspect of Bitfusion’s virtualization capabilities is the fine-grained control the user and administrator have over resource management. Not only can you scale up with disaggregated GPUs, but individual GPUs can also be partitioned into arbitrarily small virtual GPUs, as shown in Figure 8. Conventional device virtualization approaches are limited to common fractions (1/2, 1/4, 1/8, etc.) of a device and require a full system reboot. With Bitfusion’s virtualization layer, device partitions can be arbitrary fractions (e.g. 1/7, 5/20, 3/4) and can be assigned on a per-user or per-application basis without rebooting any system in the cluster. The result is dramatically increased control over resources, higher utilization, and a significantly better user experience, accelerating AI development.
Figure 8: Bitfusion partial GPUs
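To make the partial-GPU idea concrete, the bookkeeping can be pictured as granting each client an arbitrary fraction of a device rather than a fixed power-of-two slice. The toy allocator below is our own illustration of that model, not Bitfusion's implementation; the class and method names are invented:

```python
class FractionalGPU:
    """Toy allocator: hands out arbitrary fractions of one physical GPU."""

    def __init__(self):
        self.free = 1.0          # fraction of the device still unallocated
        self.assignments = {}    # client name -> fraction currently held

    def attach(self, client, fraction):
        """Grant `fraction` of the device to `client`, if enough remains."""
        if not 0.0 < fraction <= self.free + 1e-9:
            raise ValueError(
                f"cannot grant {fraction:.3f}, only {self.free:.3f} free")
        self.free -= fraction
        self.assignments[client] = self.assignments.get(client, 0.0) + fraction

    def detach(self, client):
        """Release the client's share back to the free pool, no reboot needed."""
        self.free += self.assignments.pop(client)

gpu = FractionalGPU()
gpu.attach("job-a", 1 / 7)   # arbitrary fractions, assigned per application
gpu.attach("job-b", 3 / 4)
```

The key property, mirrored from the text above, is that fractions like 1/7 or 3/4 are granted and released at runtime rather than being fixed at device-configuration time.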
1. Transport-level benchmarking
To evaluate the efficiency of remotely attached GPU resources, we evaluated the test setup from the ground up, focusing on data movement efficiency. First, we measured bandwidth and latency between all possible source and destination endpoints and created performance matrices as measured at the CUDA application level. NVIDIA includes bandwidth and latency tests with their CUDA SDK, which allowed us to quickly measure data movement overheads as seen by any CUDA application.
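The bandwidth figures in the matrices below are derived the way the CUDA SDK's bandwidth test derives them: bytes moved divided by elapsed time, typically averaged over many repetitions. A minimal sketch of that arithmetic (the numbers in the example are illustrative, not our measurements):

```python
def effective_bandwidth_gbs(bytes_moved, elapsed_s, repetitions=1):
    """Effective bandwidth in GB/s for `repetitions` transfers of
    `bytes_moved` bytes completing in `elapsed_s` seconds total."""
    return (bytes_moved * repetitions) / elapsed_s / 1e9

# e.g. 100 transfers of 32 MiB completing in 0.28 s total:
bw = effective_bandwidth_gbs(32 * 1024 * 1024, 0.28, repetitions=100)
# ~12 GB/s, in the neighborhood of practical PCIe Gen3 x16 throughput
```

Latency is measured the complementary way: time a minimal-size copy end to end, so the fixed per-operation overheads dominate rather than the wire rate.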
Table 1 shows the throughput between the host CPU and the GPUs in the GPU server (C4130). Host-to-GPU throughput is close to PCIe Gen3 x16 speeds, while GPU-to-GPU throughput is around 5-6 GB/s. Internal GPU bandwidth, highlighted by the green diagonal, nears 100 GB/s.
Table 2 shows low latency within a GPU of ~7 μs, and relatively higher latency between GPUs over PCIe of ~20-30 μs, including all of the associated CUDA memory-copy overheads.
Table 1: Single-node bandwidth matrix showing bandwidth from the host CPU (H) and all GPUs (0-3), in columns, to all other GPUs in the 4-GPU C4130 server.
Table 2: Single-node latency matrix
2. Bandwidth and latency data – Intra- and Inter-node
By combining the four PowerEdge C4130 servers, we effectively create a 16-GPU virtual server capable of running much larger workloads. The bandwidth and latency matrices below exhibit very good properties: minimal NUMA effects, uniform performance between all GPUs, and GPU-to-GPU latencies that are better than native.
How is this achieved?
The Bitfusion virtualization layer includes several runtime optimizations that automatically select the best combination of transports (PCIe, InfiniBand, GPUDirect RDMA, and host CPU copies) to achieve the best results.
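A simplified picture of that selection: given the source and destination of a copy, pick the fastest viable path. The rules below are our own illustrative approximation of such a policy, not Bitfusion's actual heuristics:

```python
def pick_transport(src_node, dst_node, src_is_gpu, dst_is_gpu,
                   rdma_available=True):
    """Return an illustrative transport choice for one memory copy."""
    if src_node == dst_node:
        # Intra-node: device and host transfers travel over the local
        # PCIe fabric (peer-to-peer between GPUs where supported).
        return "pcie"
    if src_is_gpu and dst_is_gpu and rdma_available:
        # Inter-node GPU-to-GPU: GPUDirect RDMA moves data NIC-to-GPU,
        # bypassing host memory entirely.
        return "gpudirect-rdma"
    if rdma_available:
        # Inter-node with a host endpoint: plain InfiniBand RDMA.
        return "infiniband"
    # Fallback: stage the transfer through host CPU memory.
    return "host-copy"

pick_transport("node0", "node1", True, True)   # -> "gpudirect-rdma"
```

The real runtime additionally weighs transfer size and current link load, but the endpoint-driven tiering sketched here is the core intuition.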
3. Relative performance of remote vs. native GPUs (intra- and inter-node) with Caffe and TensorFlow
The next step in our evaluation was to assess application performance, first by measuring how efficiently remotely attached GPUs perform relative to native GPUs, as shown in Figure 9. We compare training throughput, as measured by total training time, for four natively attached GPUs (N4), four locally attached GPUs through Bitfusion's virtualization layer (L4), two local and two remote GPUs (L2R2), and four remotely attached GPUs (R4).
The results are fairly impressive: even with the potential "virtualization" overhead, every scenario using virtualized or remote GPUs showed a slight increase in application performance relative to native. Bitfusion engineers attribute this to extensive runtime optimizations that make both data movement and the application's interface to CUDA as efficient as possible.
Figure 9: Several ways to create a 4-GPU virtual machine
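The comparison in Figure 9 reduces to a simple ratio: a variant's training throughput over native throughput, where throughput is inversely proportional to total training time. A sketch of that arithmetic with made-up times (the values below are hypothetical, not the measured data):

```python
def relative_performance(native_time_s, variant_time_s):
    """Throughput of a variant relative to native, from total training
    times; a value above 1.0 means the variant trained faster."""
    return native_time_s / variant_time_s

# Hypothetical total training times (seconds) for the four-GPU scenarios:
times = {"N4": 1000.0, "L4": 990.0, "L2R2": 995.0, "R4": 985.0}
rel = {name: relative_performance(times["N4"], t)
       for name, t in times.items()}
# In this hypothetical, each virtualized/remote scenario lands at or
# slightly above 1.0x native, mirroring the shape of the reported results.
```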
Extending testing across frameworks and batch sizes, we see that remotely attached GPUs indeed achieve native performance, as shown in Figure 10.
Figure 10: Remote vs. Natively attached GPU performance
Finally, we ran Caffe and TensorFlow across all available GPUs in the system and measured total training time; the results are shown in Figure 11. It is clear that Bitfusion offers a powerful new virtualization technology for elastically manipulating compute resources, while also enabling a highly streamlined AI development experience.
Figure 11: Demonstrated multi-node scaling
As the performance data above shows, we are able to achieve the same or better performance using remotely attached GPUs. And this is not limited to GPUs; it applies to other accelerators that can be used for deep learning.
Some key takeaways:
For more information, visit dell.com/hpc.