Next week NVIDIA will be releasing the first major update to their GPGPU programming toolchain since the Fermi-based Tesla series launched earlier this year. Specifically, they will be releasing Parallel Nsight 1.5, and version 3.2 of the CUDA Toolkit.
Parallel Nsight 1.5
As we’ve reiterated a number of times now, NVIDIA’s long-term goals require the company to expand their GPU market beyond video cards and in to the High Performance Computing (HPC) space, where the brute force applied by GPUs for gaming purposes can be applied to academic, industrial, and even consumer computing applications. The Fermi architecture was a big step towards this goal, providing a GPU much better suited for GPU computing than the previous GT200/G80 thanks in large part to the GPU’s unified address space, ECC support, and support for C++. With the Fermi architecture NVIDIA would have the hardware necessary to take the next step in to GPU computing.
However good hardware requires good software, and that’s where our discussion is going today. With support for higher level languages comes the need for better programming tools and better debugging tools. Furthermore NVIDIA ultimately wants to extend practical GPU programming to the more rank & file programmers, where Microsoft’s Visual Studio is by far and wide the IDE of choice for C and C++. Reaching these programmers would require extending CUDA programming to the IDE and toolsets they already use, and in the process expand the market by making development more accessible than it was with older toolsets.
In order to accomplish this goal, NVIDIA wrote a plugin for Visual Studio to enable the complete programming and debugging of CUDA and graphics code within Visual Studio – something that wasn’t previously possible – called Parallel Nsight. We first saw Parallel Nsight last year when Fermi was introduced under the name Nexus, and it finally shipped a few months ago along-side the first Fermi based Tesla cards. Since then roughly 8,000 developers have signed up to use the plugin.
On Wednesday the 22nd NVIDIA will be releasing the first major update to Parallel Nsight with release 1.5 of the plugin. Headlining this launch is the inclusion of Visual Studio 2010 support, bringing Parallel Nsight up to date with Microsoft’s IDE. Previously Parallel Nsight only supported VS 2008, which left Parallel Nsight behind as companies and developers began switching to Visual Studio 2010.
On the backend of things, Parallel Nsight 1.5 will be expanding the debugging capabilities of the plugin, particularly on single-system scenarios. Because the GPU needs to maintain the compute or graphics context being debugged, the only way to do debugging was to have a second system or a virtualized system to work with – one system to actually run the code, and a second system to debug from. While this is still (and likely always will be) the preferred method of debugging, it was impractical for single developers and smaller shops that couldn’t dedicate a second machine to the task.
For those programmers, Parallel Nsight 1.5 will be adding support for debugging compute code within a single system when multiple GPUs are present. This allows one GPU to run the application, and then a second GPU to actually drive the computer’s display. In this case the GPUs don’t even need to match so long as it's an NVIDIA GPU, allowing developers to use a high-end GPU for programming purposes while using a lesser GPU to drive the display, effectively halving the amount of hardware required for debugging. However at this time this single-system debugging support only applies to compute; shader debugging still requires a second system.
The second big feature coming to Parallel Nsight 1.5 will be the addition of support for debugging code on a GPU running in Tesla Compute Cluster (TCC) mode. Since we don’t regularly cover Tesla news we haven’t had the opportunity to discuss TCC before, but in a nutshell TCC is a separate driver mode for Tesla cards that bypasses Windows’ control of a GPU through WDDM. WDDM, which was the new GPU driver model released with Windows Vista, was geared towards better integrating a GPU in to the OS by giving the OS more direct control of the GPU. The OS provided a virtual address space for video memory (including paging), recovery capabilities if the GPU hung, and enforced a coarse level of task scheduling on the GPU so that multiple processes could share a GPU without unnecessarily dragging a system down.
Compared to the old XPDM, WDDM was a big step up for GPU usage on Windows, but only for graphical purposes. With Windows’ iron-fisted control over the GPU and a focus on task scheduling for responsiveness over performance, it wasn’t ideal for GPGPU purposes. Case in point, with a WDDM driver NVIDIA was finding it took 30μs for a kernel to be launched, but if they had Windows treat the GPU as a generic device by using a Windows Driver Model (WDM) driver, that launch time dropped to 2.5μs. This coupled with the fact that a WDM driver is necessary to use Tesla cards in a Windows Remote Desktop Protocol environment (as any Folding @Home junkie can tell you, RDP sessions can’t access the GPU through WDDM) resulted in the birth of TCC mode.
Previously GPUs running in TCC modecould run applications, but the system functioned somewhat in a black box manner. In order to debug that code the GPU would need to be running with a regular WDDM driver, which had a performance impact and also meant it wasn’t possible to do this debugging in conjunction with an RDP session. With Parallel Nsight 1.5, it’s now possible to debug code on a GPU running in TCC mode, finally allowing developers to debug code against the same driver that the production version of the program would be running on.
The third major addition is support for profiling DirectCompute code when it’s used in conjunction with Direct3D. This is primarily a feature for game developers, who previously did not have an effective way to profile the performance of DirectCompute code in their games. DirectCompute has been underutilized in games so far, so perhaps this will expand its use.
Last but not least, Parallel Nsight 1.5 is also adding support for the new 3.2 version of the CUDA Toolkit
CUDA Toolkit 3.2
The CUDA toolkit is the agglomeration of all of NVIDIA’s programming tools: compilers, drivers, and libraries. Version 3.0 of the toolkit added support for the Fermi architecture and its unique features like C++, and version 3.2 will be the first major expansion of the toolkit since then.
A big focus for 3.2 is going to be on libraries. Libraries are a critical component for efficient GPU and CPU programming, as they provide hand-optimized implementations of common functions. Generally speaking, if you can hand something off to a library you can speed up that portion of the task significantly, and if the library can find a way to do something faster than before then it will also speed things up. For 3.2 NVIDIA is adding libraries for sparse matricies (CUSPARSE), H.264 encoding, and random number generation (CURAND), while also improving the performance of the existing math libraries such as the FFT and BLAS libraries.
The other big change will be the full realization of the 64bit address space of Fermi family Tesla cards. The hardware has been capable of addressing better than 32bits (4GB) of memory, but the software had not fully caught up. With the impending release of 2Gb GDDR5 RAM, 24 chip 6GB Tesla cards will be made possible, requiring the software to catch up. As of version 3.2 the toolkit (and Parallel Nsight) will be able to address these cards.
The 3.2 version of the toolkit largely goes hand-in-hand with Parallel Nsight 1.5, however at the moment it’s lagging behind Parallel Nsight by a bit. The release candidate of the toolkit will be ready shortly (likely at or before the Parallel Nsight 1.5 release) but the final version will not be ready until November.
All of these items will be on display next week, when NVIDIA hosts their annual GPU Technology Conference. We’ll be there for the later part of the conference to get a pulse on the state of GPGPU usage in consumer applications and to do some reconnaissance for future GPGPU articles, so stay tuned for our reporter’s notebook from GTC next week.