Monday, November 1, 2010

Adobe CS5: 64-bit, CUDA-Accelerated, And Threaded Performance

Any knowledgeable PC user understands that there are many ways to skin a cat, including when that cat happens to be Adobe’s Creative Suite. Tools like Photoshop, Premiere Pro, and After Effects continue to be favorites for millions of professionals and prosumers. When time is money, the performance levels realized in Creative Suite can mean the difference between making or losing money on jobs. Even if you’re just a home video enthusiast who’s taken to Premiere and After Effects, would you rather spend minutes or hours on a task?

Potentially, this is no overstatement. With Adobe starting to build GPU acceleration into various facets of Creative Suite and better leveraging CPU multi-threading, a system running CS5 today could realize performance an order of magnitude or more better than, say, a five-year-old system running Creative Suite 2 (CS2). We’re not going to state the ridiculously obvious and benchmark just how much faster a new CS5 rig would be compared to CS2. Instead, we want to approach Adobe’s new CS5 from a hardware perspective and examine if and when it makes sense to upgrade from CS4.
After all, the move from the last-generation suite to CS5 is one of the most significant in Adobe's history. Beyond the feature expansion in each app, the company finally embraced 64-bit support, dramatically improving performance in workloads able to take advantage of extra memory. Additionally, there's a good bit of GPU acceleration in play--something we've not seen enough of from other media- and productivity-oriented titles.

So, here’s our scenario. Assume you have CS4 and are considering CS5 as a way to become more productive through getting the same tasks done more quickly. We’re going to examine three possible vectors that could be responsible for this performance increase:
  1. Upgrading from CS4 to CS5. This gives you the benefits of shifting from 32- to 64-bit code and addressing extra memory above the 4 GB threshold.
  2. Increasing CPU threads. This could be through the addition of cores as well as from leveraging Intel’s Hyper-Threading (HT) feature.
  3. Employing CUDA. At this early stage of the industry’s adoption of general purpose GPU acceleration, Adobe has started to weave in support for Nvidia’s CUDA platform. We hope that OpenCL and/or DirectCompute support follows soon, but for now we have to examine CUDA as a case study in what exists today and a harbinger of what will come.
Could it be that stepping up from CS4 to CS5 alone could yield enough benefit to make a hardware upgrade unnecessary? Or will an upgrade to CS5 plus bringing CUDA into play make a $500 processor overhaul mandatory? Let’s try to find out.

While we examine three different applications within Adobe’s Creative Suite (After Effects, Photoshop, and Premiere Pro), most of Adobe’s attention falls on Premiere Pro CS5 and its Mercury Playback Engine, the 64-bit, multi-threaded code base that can utilize Nvidia GPU (CUDA) hardware acceleration. Mercury acceleration is not global throughout the program, but it will accelerate a bunch of effects and operations. For example, the new Ultra keyer, proc amp, Gaussian blur, edge feathering, flips, sharpening, and color correction—in fact, most of the popular effects—are now Mercury-ready. So are three transitions: cross dissolve, dip to black, and dip to white.

Adobe boasts that very large projects can see up to a 10x performance gain from Mercury. Nvidia promises “performance gains of up to 70 times” for visual processing tasks. While we’re more inclined to lean toward Adobe’s number, given some of the GPGPU results we’ve seen in the past, such claims don’t sound infeasible.
Nvidia claims that because CUDA and the Mercury Playback Engine are doing so much of the visual computing work, “the CPU is free to continue to manage other system and application tasks, and to efficiently manage background processes.” On this point, we’ll remain skeptical until proven wrong by our data. We expect that CUDA will accelerate performance when it can, but we don’t expect miracles of CPU utilization reduction...yet.
Turning the Mercury Playback Engine on and off is a fairly simple matter. Navigate to Project -> Project Settings -> General. In the Video Rendering and Playback section, use the Renderer pull-down to select the desired Mercury setting.

Our first step in this article was to pick up where Chris Angelini left off in his July look at the Intel Xeon 5600-series. Chris started with 12 threads on a Gulftown chip and worked his way up to 24 threads on a pair of Xeon X5680s. Counter-intuitively, he found that workload completion performance decreased as processing capability increased.
“After Effects CS4 only has access to 4 GB of system memory—a third of what these Xeon boxes bring to bear,” he wrote at the time. “As you add execution resources to AE’s pool, less and less memory is available to each processor, be it logical or physical. The result is a lot more swapping to solid state storage, which is fast, but nowhere near as quick as three channels of DDR3.”
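To put rough numbers on Chris's point, here is a minimal sketch of how a fixed 4 GB budget thins out as threads are added. It assumes the memory is split evenly across threads, which is a simplification for illustration and not a description of how After Effects actually allocates memory; the function name is hypothetical.

```python
def memory_per_thread_mb(total_gb: float, threads: int) -> float:
    """Even share of a fixed memory budget per thread, in MB.

    Assumption: memory is divided equally across threads. Real
    applications allocate far less uniformly than this.
    """
    return total_gb * 1024 / threads


# The 4 GB ceiling quoted for After Effects CS4, at several thread counts.
for threads in (2, 4, 8, 12, 24):
    share = memory_per_thread_mb(4, threads)
    print(f"{threads:>2} threads: {share:7.1f} MB each")
```

Under that even-split assumption, 12 threads get roughly a third of a gigabyte apiece and 24 threads get half of that, which is consistent with the swapping Chris described.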
Rather than scale up the CPU chain into workstation configs, we scaled down from the consumer-class flagship, Intel’s Core i7-980X with all features enabled, to only two threads—two 980X cores with no Hyper-Threading. This lowest-end arrangement should more closely resemble some of AMD’s Athlon II processors.
In his story, Chris noted keeping the multiprocessing option in After Effects enabled, as this gave the fastest results in AE CS5. With multiprocessing, AE crunches on different frames with multiple cores. Without multiprocessing, every available core works on a single frame until it’s finished. We decided to run the tests both with and without multiprocessing to better see how much impact adding cores/threads would yield.
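The two modes can be pictured as two different ways of handing out work. The sketch below is only a scheduling illustration with hypothetical function names; it says nothing about how After Effects implements either mode internally.

```python
def frame_parallel_plan(num_frames: int, workers: int) -> dict:
    """Multiprocessing on: each worker renders whole frames.

    Frames are dealt out round-robin, so several frames are in
    flight at the same time.
    """
    plan = {worker: [] for worker in range(workers)}
    for frame in range(num_frames):
        plan[frame % workers].append(frame)
    return plan


def single_frame_plan(num_frames: int, workers: int) -> list:
    """Multiprocessing off: all workers share one frame at a time.

    Each entry pairs a frame with every worker that helps render it;
    the next frame starts only after the current one is finished.
    """
    return [(frame, list(range(workers))) for frame in range(num_frames)]


print(frame_parallel_plan(6, 3))
print(single_frame_plan(2, 3))
```

In the first plan, each worker also needs enough memory to hold its own frame, which is one reason the memory ceiling matters more when multiprocessing is enabled.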

As you’ll see in a moment, After Effects CS4 clearly dislikes Hyper-Threading. In all of our tests with HT and multiprocessing disabled, AE’s overall CPU utilization hovered in the teens and 20s, but the even-numbered threads (the logical cores created through HT) were barely touched. With only two active cores (four threads), there was a bit more activity on the even threads, but still nothing like the utilization seen with multiprocessing enabled in the application.

How does this processor utilization translate into real performance? The data is clear: After Effects CS4 performs much better with Hyper-Threading disabled, sometimes by a factor of 2-to-1. In everyday usage, it would be silly not to run the app with HT off and multiprocessing enabled. The exception would be if you’re multitasking, because running with multiprocessing and HT enabled will save about 20% to 30% in CPU utilization, leaving enough room to run something else concurrently.
Interestingly, Chris noted that “in CS4, we got our best results having all cores working on each frame,” meaning that having multiprocessing disabled yielded faster performance. That was not the case here. In all instances, using multiprocessing yielded much faster results, and the more threads we used, the wider that performance gap became.
So keeping multiprocessing enabled is a foregone conclusion. That decided, what can we observe about thread scaling? Without HT, we see only a moderate improvement as threads increase. (In fact, there is effectively no difference between four cores and six.) From two physical cores to six, we gain only 29 percent. The punch line here is that two physical cores actually outperform 12 logical threads by 7.5 percent. Hyper-Threading is just that bad under AE CS4.

Recall from the last page that the fastest time we had for our custom workload under After Effects CS4 was 2:55 with six cores active, multiprocessing enabled, and no Hyper-Threading. Achieving this effectively redlined the CPU. In After Effects CS5, this is nearly our slowest score (the difference between 2:55 and 3:00 being negligible), and it was achieved with only two cores, no multiprocessing, and no Hyper-Threading. Interestingly, the same settings under CS5 yielded a CPU utilization range of 66 to 80 percent.

We see that CS5 does not share CS4's revulsion for Hyper-Threading. Apparently, HT delivers more benefit when fewer physical cores are present, but at least there’s no significant negative impact from having it enabled. Neither do we see CS5 delivering that weird CS4 phenomenon of hammering performance when both HT and multiprocessing are enabled.
We have a hard time imagining many dual-core Intel owners rushing out to buy CS5, so if we set those results aside momentarily, we’re left with the fact that there’s no real benefit from Hyper-Threading here. Yes, there can be a slight gain from HT, but not enough to warrant a processor upgrade. This is clearly a case where increasing physical cores is what matters.
Weirdness starts to reappear when we examine CPU utilization. Time and again when working with the CS5 collection, we witnessed fewer cores working harder. It was like watching a track star say, “You know, I’ve got 12 threads, and I’m so far ahead, I think I’ll just take it easy.” At first, we thought the explanation must harken back to Chris’s earlier CS4 assessment in which more cores are being forced to work with smaller shares of the total RAM pool, even though we jumped from an effective 4 GB to 12 GB. However, we got another answer from Nvidia technical marketing manager Sean Kilbride:
“A lot of it is timing related. Video encoding in particular has a lot of serial operations. You need the result of one frame before you can move to the next. This is generally because encoders create a keyframe which has all the frame information. Frames that follow only record the portion of the frame that has changed since the last keyframe. If too much has changed, you have to create a new keyframe.
So the processors can never work too far ahead, since they rely on the keyframes to be created first. In theory, encoding to an uncompressed format could go very quickly on the CPU with multiple threads as long as each frame was a keyframe. In reality, you'd end up crippled by disk I/O because of the resulting massive file size.”
That all said, we’re glad to see multiprocessing making effective use of the CPU in order to bring work times down.
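Kilbride's description of keyframes can be illustrated with a toy encoder. This is a minimal sketch, assuming a frame is a flat list of pixel values and that a new keyframe is forced once more than half the pixels differ from the last keyframe; the names and the threshold are hypothetical, and real codecs are considerably more involved.

```python
from typing import Dict, List, Tuple, Union

Frame = List[int]  # toy model: a frame is a flat list of pixel values
Encoded = Tuple[str, Union[Frame, Dict[int, int]]]


def encode(frames: List[Frame], max_changed_fraction: float = 0.5) -> List[Encoded]:
    """Store full keyframes, plus per-pixel changes since the last keyframe."""
    encoded: List[Encoded] = []
    keyframe = None
    for frame in frames:
        if keyframe is not None:
            changes = {i: v for i, v in enumerate(frame) if v != keyframe[i]}
            if len(changes) <= max_changed_fraction * len(frame):
                # Small change: record only what differs from the keyframe.
                encoded.append(("delta", changes))
                continue
        # First frame, or too much has changed: emit a new keyframe.
        keyframe = list(frame)
        encoded.append(("key", keyframe))
    return encoded


def decode(encoded: List[Encoded]) -> List[Frame]:
    """Rebuild frames; a delta cannot be decoded before its keyframe."""
    frames: List[Frame] = []
    keyframe: Frame = []
    for kind, payload in encoded:
        if kind == "key":
            keyframe = list(payload)
            frames.append(list(keyframe))
        else:
            frame = list(keyframe)
            for index, value in payload.items():
                frame[index] = value
            frames.append(frame)
    return frames


clip = [[0, 0, 0, 0], [0, 1, 0, 0], [9, 9, 9, 0], [9, 9, 9, 1]]
stream = encode(clip)
print([kind for kind, _ in stream])
print(decode(stream) == clip)
```

The loop in `encode` shows the serial dependency: whether a frame becomes a delta or a keyframe depends on the keyframe chosen before it, so the frames cannot simply be handed to independent threads.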
Overall, we’d say that the best bang for the buck in AE CS5 is a quad-core with HT and multiprocessing enabled. This delivers most of the chip’s potential performance while still leaving loads of available processor bandwidth for other tasks.
We have not examined CUDA acceleration in these After Effects tests because Adobe has yet to code for it. The application does support OpenGL acceleration, which we used across the board, but that’s different than the GPGPU boosting we wanted to examine here.
Photoshop was one of the first applications to hop on the multi-threaded bandwagon during the shift from single- to dual-core CPUs. One of our biggest questions was whether Photoshop has scaled well on the desktop as core counts have gone up. Additionally, we wanted to see what impact enabling or disabling OpenGL acceleration in the GPU would have on our custom workload. Admittedly, this falls outside of our testing objectives, but it still keeps with the spirit of improving Adobe Creative Suite performance in hardware, and should provide an interesting comparison for those who still wonder if GPGPU acceleration is really “that much” better.
In testing Photoshop, we used a single-image photo collage measuring 20K x 20K (1.12 GB) and interpolated it to 50K x 50K (6.98 GB), figuring this would soak up most available RAM without spilling out into a swap file. We then took the interpolated file and rotated it 45 degrees. Because the CPU utilization observed during the rotations was representative of both test sets, those are the figures we kept and show below.

This is more about proving a concept than mimicking a real life workflow need. Photoshop may be relatively good at interpolation, but the dimensions we used were really aimed at creating run times meaningful enough to be measured. You’d be more likely to use smaller levels of interpolation across batch jobs—increasing dozens of image sizes by 20% with a single command, for example.
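The file sizes quoted for the test image can be sanity-checked with a little arithmetic. The sketch below assumes that "20K" means 20,000 pixels per side, that the image is uncompressed 8-bit RGB (three bytes per pixel), and that a gigabyte is counted as 2^30 bytes. The article states none of this; these are simply the assumptions under which the numbers line up.

```python
def uncompressed_size_gib(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> float:
    """Raw image size in GiB, assuming no compression and no extra layers."""
    return width_px * height_px * bytes_per_pixel / 2**30


print(round(uncompressed_size_gib(20_000, 20_000), 2))  # 1.12
print(round(uncompressed_size_gib(50_000, 50_000), 2))  # 6.98
```

The interpolated image has 6.25 times as many pixels as the original, and the file size grows by the same factor.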
In a situation somewhat similar to our HT and multiprocessing setup with After Effects CS4, we see several situations in which enabling OpenGL hardware acceleration slows down processing when Hyper-Threading is disabled.
Our four-core, non-HT test shows a strange performance spike with OpenGL enabled, but otherwise we see two- and four-core scores all within two seconds of each other. Only with 12 threads when HT is enabled do we see a respectable gain.

On rotation, Hyper-Threading reappears as an occasional enemy. Eight threads with OpenGL enabled wins this test on a bang-for-buck basis.

Again, we see frustratingly low CPU utilization during Photoshop rotation. Only with two active cores does the CPU get a fair workout.

We wanted to leverage as much of the work done by Chris Angelini in his “Can Your PC Use 24 Processors?” story as possible, so in addition to his custom After Effects load, we also replicated his Premiere Pro work set. Part of this job involved creating a custom setting scenario in which the 23.976 FPS default Blu-ray speed (24 FPS in CS5) was more than doubled to 59.94 FPS. As Chris did, we recorded both the render time and the export time with Premiere’s Adobe Media Encoder (AME) in order to assess two key parts of the video workflow process. The single point where our test data replicates Chris’s render and AME times within 20 seconds (980X with 12 threads and HT enabled) confirms that we’re on the same track and working with solid results.

In a workstation setup, Chris didn’t see much positive scaling when moving from the i7 into the Xeon line. Working solely with the i7 and modifying core counts, we see a much more obvious and rewarding progression. Once more, you don’t get as much kick in the move from four cores to six as from two to four, but the benefits of each core increase are clear. Moreover, we see the 10% to 20% benefit from enabling Hyper-Threading that we’ve been expecting all along. Without a doubt, the six-core Intel approach is the way to fly with Premiere Pro CS4.

Unlike our other two Creative Suite apps, Premiere Pro makes much more effective use of all available processor cores, including virtual ones. We don’t see any real breathing room with utilization appear until we hit 12 threads.

With Premiere Pro CS5, which is really the centerpiece of this story, we have to delve a little deeper. Here we finally have the ability to assess Adobe’s Mercury Playback Engine and see it handling 64-bit code, many threads, and CUDA acceleration all at once.
First, let’s compare render times between the two application versions. You’ll recall that with Hyper-Threading, CS4 scored times of 10:17, 5:21, and 3:33 with four, eight, and twelve threads, respectively. In a strange fit of coincidence, these are almost the exact times we saw under CS5 with HT disabled.

Re-enabling HT under CS5 again gives us that 10% to 20% boost. Thus, we can infer that the move from the earlier to the present Premiere Pro will only net you about a 15% improvement, give or take depending on your core count.
Now, when we turn on the Mercury Playback Engine, it’s like hitting a rocket’s launch button. Adobe’s 10x claim turns out to be spot on. With only two physical threads, Mercury and our GeForce GTX 480 are able to blast through our test in 1:36, less than half the time of our best score with 12 threads under CS4. With all threads and Mercury in play, CS5 sizzles to completion in just 29 seconds.
We didn’t record CPU utilization for this set because the usage patterns would have been meaningless. Most of the time, utilization hovered in the 98% to 100% range, but every so often there would be a downward spike into the teens or single digits. Noting the range would not have given an accurate representation of average resource use.
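For readers who want to turn the render times quoted in the text into ratios, here is a small helper. It uses only times that appear in this article (the CS4 Hyper-Threading results and the two Mercury results); the function names are hypothetical, and the ratios it prints are plain arithmetic on those figures, not additional measurements.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '3:33' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def speedup(baseline: str, candidate: str) -> float:
    """How many times faster the candidate time is than the baseline."""
    return to_seconds(baseline) / to_seconds(candidate)


cs4_best = "3:33"         # CS4, twelve threads with HT
mercury_two = "1:36"      # CS5 with Mercury, two threads
mercury_all = "0:29"      # CS5 with Mercury, all threads

print(f"{speedup(cs4_best, mercury_two):.1f}x")  # 2.2x
print(f"{speedup(cs4_best, mercury_all):.1f}x")  # 7.3x
```

These ratios are measured against the fastest CS4 time quoted here, so they describe the gain over the old suite's best case and not over CS5 running without Mercury.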

CUDA is not a cure-all. As our tests showed, there are clearly some tests that showcase CUDA’s benefits more than others. Depending on how apps and plug-ins are coded, CUDA can help with processing effects, compositing video clips, scaling, blending, and similar functions. In some areas, though, don’t expect CUDA to offer much help. For example, say you have a 1280 x 720 clip in WMV format that you want to transcode into 1280 x 720 MPEG-4 via Premiere Pro. The GPU will be of little if any benefit in this case. All of that heavy lifting is done in the CPU.
On the other hand, if you want to take three 1920 x 1080 clips, color correct them, perform luma adjustment, add drop shadows, composite into a single stream, then downsample into a 1280 x 720 MPEG-2 file for DVD export, the benefit from GPU-based acceleration can be enormous. If you’re stepping up from 30 to 60 frames per second through frame doubling, then CUDA won’t help, but if you get that doubling through tweening (Premiere’s method for automatically adding or modifying one or more frames between two existing frames), then CUDA can slice the generation time substantially. If you want one more example of this, check out these stand-alone results we got for Premiere Pro CS5 exporting done on Nvidia’s own “Paladin” workload, which uses many more GPU-based effects:

We mentioned Nvidia’s Sean Kilbride before, and we want to give him one last shout out, because his help at all hours, day and night, over the span of weeks, set a new bar in vendor patience. His assistance is what made this article possible. Interestingly, though, one of the questions we asked him was, “If the direct export and render queue functions in Premiere Pro CS5 do the same job, but direct export is faster, why do we still have the queue in CS5?” He replied, “Because you can’t continue working when doing a direct export.” With all of our test data in hand, we’d argue that you also can’t continue working during a render, at least not without 12 threads at your back and a not-too-demanding set of secondary apps in front.
Our original question was how best to accelerate Adobe’s Creative Suite. In After Effects, the move from CS4 to CS5 was a clear win, while scaling from four cores to six offered surprisingly little benefit. Photoshop CS5 is likewise mushy on its core scaling benefits, but if you can land a plug-in that supports CUDA, baby, hold on—the improvement is massive. Most of all, Premiere Pro performs exactly as we’d expect. The video editor scales well as threads increase, improves with Hyper-Threading, makes good use of the jump to CS5, and runs CUDA like nobody’s business in the Mercury Playback Engine. If Premiere Pro is your life, it’ll make good use of every improvement you can throw at it.
