2xPS3s run NFP aggregator 120x faster than quad-core Intel machine

After working out a couple annoying bugs in my Netflix Prize aggregator PS3 cluster app, unrolling a couple loops by hand, and enabling compiler optimizations, the cluster version is now 120 times faster than my old x86 implementation.

That all works out to a sustained 6Gflops per PS3, though it's also spending a significant amount of time in an expf4 operation, and I'm not sure how that's implemented, so the real Gflops could be a little higher. Not that I really care about flops, but with a peak performance of 150Gflops, it seems I should be able to push it even faster. Still, this is 120x faster than the version that was my workhorse for months.

Netflix Prize Aggregator @ PS3 Cluster

I finally have a (mostly) working application on my PS3 cluster.

I would wager that while programming for the PS3's SPUs is generally simpler than programming for nVidia GPUs, it is a little more trouble in my case, mostly because I have to write three separate programs that interact in a chain (head, PS3 node, and SPU). Furthermore, since I don't like to spare myself the complication (or education), I wrote my RPC library using bare sockets, and I also wrote a nice little "task dispatch" library to simplify dispatching tasks asynchronously to the SPUs. The kicker is that all this was needed just to transfer data to where it needed to be and to organize the execution: steps that are relatively trivial for GPGPU work, where you set up the data using large DMA transfers and set up and start all threads simultaneously. I could make my life simpler by having each SPU execute a single task before exiting, but that model loses a lot of the flexibility and efficiency that the PS3 can offer.

The RPC work is actually made far more complicated by the fact that my head is an x86 machine (little-endian, while the PS3's Cell is big-endian), so I need to do byte reordering on all head<->node transfers (using htonl(), etc.).
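
As a rough illustration of that byte reordering, here is a minimal sketch of how 32-bit integers and floats could be written to and read from a socket buffer in network (big-endian) order. The helper names put_u32, get_u32, put_f32 and get_f32 are hypothetical, not the ones from my RPC library, and the float helpers assume both machines use 32-bit IEEE 754 floats.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit integer in buf in network
   (big-endian) byte order. Returns the number of bytes written. */
size_t put_u32(unsigned char *buf, uint32_t value) {
    uint32_t wire = htonl(value);      /* no-op on big-endian hosts */
    memcpy(buf, &wire, sizeof wire);   /* memcpy avoids alignment issues */
    return sizeof wire;
}

/* Hypothetical helper: read a 32-bit integer stored in network order. */
uint32_t get_u32(const unsigned char *buf) {
    uint32_t wire;
    memcpy(&wire, buf, sizeof wire);
    return ntohl(wire);
}

/* Floats are sent as their raw 32-bit pattern. Assumption: both ends
   use IEEE 754 single precision. */
size_t put_f32(unsigned char *buf, float value) {
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &value, sizeof bits);
    return put_u32(buf, bits);
}

float get_f32(const unsigned char *buf) {
    uint32_t bits = get_u32(buf);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

On the PS3 side the htonl()/ntohl() calls compile down to nothing, so only the x86 head pays for the swap.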

I don't have any useful computational work done yet, but I do have a benchmark. My pair of PS3s crank through an iteration of the training algorithm for my blender in about 3 seconds, compared to 60 seconds for the same work on a quad-core Intel Q6600 (fully multithreaded).

That works out to 20x as fast, and though the PS3s do have 12 cores working, three times the 4 of the Intel machine, the PS3s are also a lot cheaper. Of course, we are just talking brute computational speed on data of a limited size, or data that can be easily partitioned between separate nodes. In my case, I was able to split 340MB of data at 170MB per node, which leaves only a little breathing room below the ceiling of 185MB or so, beyond which my tests showed the pagefile getting thrashed like mad.

UPDATE (June 12): I fixed the annoying bug from yesterday. It was a rather simple indexing issue: I have a vector of (vector float)s that represents an N*(M+1) matrix. The last ((M+1)th) column is different from the rest of the matrix, and I wanted to access it by using a simple offset, like w[offset + m]. However, I constructed this offset by multiplying the vector lengths of N and M (in other words, (N >> 2) * (M >> 2)), which would only be accurate if the short vectors were 4x4 float matrices instead of 1x4 float vectors. Since each vector holds only 4 floats, only one of the two dimensions should be divided by 4. In any case, the application really does work, both on the PS3s and in the test code on the cluster head.
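
To make the offset arithmetic concrete, here is a small sketch in plain C. It assumes a column-major layout in which the N dimension is the one packed four floats to a vector; that layout, the vec4f stand-in type and the function names are all assumptions for illustration, not the actual blender code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the SPU's `vector float`: four packed floats. */
typedef struct { float f[4]; } vec4f;

/* Assumed layout: column-major, each column holds n floats packed into
   n/4 vectors. Returns the index of the first vector of column `col`. */
size_t column_offset(size_t n, size_t col) {
    assert(n % 4 == 0);          /* n must be a whole number of vectors */
    return (n >> 2) * col;       /* only ONE dimension is divided by 4 */
}

/* The buggy computation described above: both dimensions divided by 4,
   which lands 4x too early in the array. */
size_t buggy_last_column_offset(size_t n, size_t m) {
    return (n >> 2) * (m >> 2);
}

/* Read element (row, col) of the matrix stored in w. */
float get_element(const vec4f *w, size_t n, size_t row, size_t col) {
    const vec4f *v = &w[column_offset(n, col) + (row >> 2)];
    return v->f[row & 3];
}
```

With N = 8 and M = 4, the extra column starts at vector 8 of 10, while the buggy formula gives 2, which points into the middle of the ordinary columns.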

Cell SPU 16-bit fixed-point load with conversion to floating point.

I can sometimes have a knack for really beating small problems into the ground. Today, I continued to work on converting my Netflix Prize blender to a PS3 cluster application, and a seemingly simple problem came up that took a long time to solve. I'm writing about it here such that the Google Overmind might become more aware.

Because of RAM limitations, I need to store my training data for the blender in 16-bit fixed point instead of 32-bit floating point. However, I still want to do computations with 32-bit floating point. The Cell's SPUs support a nice vector instruction to convert a vector of 32-bit fixed point numbers to a vector of 32-bit floating point numbers, but I searched back and forth in the manual for some semblance of an instruction or instructions I could use to unpack 4 16-bit integers into a 4x32-bit vector. Finally, I found my solution in a bizarre-looking instruction entitled "Shuffle Two Vectors of Bytes." This is a unique instruction that I had certainly never seen before, but its uses, I'm sure, are many. It allows you to build a vector by picking and choosing individual bytes, in any order, from two other vectors (plus, you can throw in bytes of 0x00, 0xFF, and 0x80 at will). It's not quite as simple as saying "Unpack my 16-bit integers into 32-bit integers," but it works. It does, however, take some time and energy to set up the "pattern" vector, which tells the SPU which bytes to put where.

Here, I'm quoting a code snippet that loads 8 16-bit fixed point numbers into 2 4x32-bit floating point vectors, using five SPU instructions. Enjoy!

  // Initialize our example data vector (16-bit fixed point)
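  // Must be 16-byte aligned: SPU quadword loads ignore the low 4 address bits.
  unsigned short load_data[8] __attribute__((aligned(16)));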
  for (int i = 0; i < 8; i++) {
    load_data[i] = 0x2000 + 0x1800 * i;
  }

  // These are tailor-made for the SPU's SHUFB instructions.
  // See "C/C++ Language Extensions for Cell Broadband Engine 
  // Architecture, section 2.8" for details
  // 0x80's mean: put a 0x00 in this byte.
  // 0x0n: put byte data[n] here.
  qword pattern0 = {
    0x80, 0x80, 0x00, 0x01, 0x80, 0x80, 0x02, 0x03, 
    0x80, 0x80, 0x04, 0x05, 0x80, 0x80, 0x06, 0x07 };
  qword pattern1 = {
    0x80, 0x80, 0x08, 0x09, 0x80, 0x80, 0x0a, 0x0b, 
    0x80, 0x80, 0x0c, 0x0d, 0x80, 0x80, 0x0e, 0x0f };

  // These five instructions load the data from local store and convert it to float vectors:
  qword data = *(qword*)&load_data;              // load 8x16-bit vector into register
  qword fixed0 = si_shufb(data, data, pattern0); // convert the first four to a 4x32-bit vector (zero-extended)
  qword fixed1 = si_shufb(data, data, pattern1); // convert the next four to a 4x32-bit vector
  qword float0 = si_cuflt(fixed0, 16);           // convert the first four to floats (dividing by 2^16)
  qword float1 = si_cuflt(fixed1, 16);           // convert the next four to floats

  // Print out results
  float* results0 = (float*)&float0;
  float* results1 = (float*)&float1;
  printf("results: %f %f %f %f %f %f %f %f\n", 
         results0[0], results0[1], results0[2], results0[3],
         results1[0], results1[1], results1[2], results1[3]);
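
For anyone without a Cell handy, the same unpack-and-convert steps can be emulated in portable C. This is only a sketch of my understanding of the byte selection rules (a pattern byte of the form 10xxxxxx yields 0x00, 110xxxxx yields 0xFF, 111xxxxx yields 0x80, and anything else uses its low 5 bits as an index into the 32 bytes of the two source vectors); the names shufb_emul, cuflt_emul and unpack_u16_fixed are made up for this example, and the bytes are laid out big-endian by hand to match the SPU.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates SHUFB byte selection for one 16-byte result. a supplies
   bytes 0..15 and b bytes 16..31; out must not alias a or b. */
void shufb_emul(const uint8_t a[16], const uint8_t b[16],
                const uint8_t pattern[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern[i];
        if ((p & 0xC0) == 0x80) {
            out[i] = 0x00;                      /* 10xxxxxx */
        } else if ((p & 0xE0) == 0xC0) {
            out[i] = 0xFF;                      /* 110xxxxx */
        } else if ((p & 0xE0) == 0xE0) {
            out[i] = 0x80;                      /* 111xxxxx */
        } else {
            uint8_t idx = p & 0x1F;             /* pick a source byte */
            out[i] = idx < 16 ? a[idx] : b[idx - 16];
        }
    }
}

/* Emulates the unsigned convert for one big-endian 32-bit word:
   interpret as unsigned and divide by 2^scale (0 <= scale <= 31 here). */
float cuflt_emul(const uint8_t word[4], int scale) {
    assert(scale >= 0 && scale <= 31);
    uint32_t u = ((uint32_t)word[0] << 24) | ((uint32_t)word[1] << 16) |
                 ((uint32_t)word[2] << 8)  |  (uint32_t)word[3];
    return (float)u / (float)(1u << scale);
}

/* Unpack eight 16-bit fixed-point values into eight floats using the
   same two patterns as the SPU snippet above. */
void unpack_u16_fixed(const uint16_t in[8], float out[8]) {
    static const uint8_t pattern0[16] = {
        0x80, 0x80, 0x00, 0x01, 0x80, 0x80, 0x02, 0x03,
        0x80, 0x80, 0x04, 0x05, 0x80, 0x80, 0x06, 0x07 };
    static const uint8_t pattern1[16] = {
        0x80, 0x80, 0x08, 0x09, 0x80, 0x80, 0x0a, 0x0b,
        0x80, 0x80, 0x0c, 0x0d, 0x80, 0x80, 0x0e, 0x0f };
    uint8_t data[16], fixed0[16], fixed1[16];

    /* Lay the halfwords out big-endian, as they sit in SPU local store. */
    for (int i = 0; i < 8; i++) {
        data[2 * i]     = (uint8_t)(in[i] >> 8);
        data[2 * i + 1] = (uint8_t)(in[i] & 0xFF);
    }
    shufb_emul(data, data, pattern0, fixed0);   /* zero-extend first four */
    shufb_emul(data, data, pattern1, fixed1);   /* zero-extend next four  */
    for (int i = 0; i < 4; i++) {
        out[i]     = cuflt_emul(&fixed0[4 * i], 16);
        out[4 + i] = cuflt_emul(&fixed1[4 * i], 16);
    }
}
```

Feeding it the same test data as above (0x2000 + 0x1800 * i) should give 0.125 for the first element, rising in steps of 0.09375 up to 0.78125 for the last.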

PS3 snafu

A few days ago, I began to get a little frustrated with the packaging of IBM's Cell SDK (which, despite being version "3.0" overall, still has some rough-edged alpha/beta components). I thought that maybe I would make my life easier and switch the PS3s from Yellow Dog Linux 6 to Fedora 9, seeing as IBM officially only supports Fedora and RHEL. This, truth be told, wasn't really a problem. Installation of Fedora on the PS3s was a little rougher than YDL (Fedora required a separate USB flash drive to install the bootloader, which itself is made easy by Sony's hypervisor); also, while YDL comes preloaded with PS3 goodies like ps3-boot-game-os and ps3-video-mode, you need to install them separately once Fedora's installed (yum install ps3-utils, iirc).

Aside from those very minor bumps, installation of Fedora on the Playstations was a breeze. My problem really only came when I restored the former configurations from the YDL installs.

I use NFS to mount program code and data on the Playstations (keeping no important files on the PS3s themselves... I just don't trust non-RAID drives anymore). So I had backed up my fstab files from YDL, and just copied them over the new ones Fedora made. Everything worked just fine for a while. That night, however, I turned the PS3s off, only to find in the morning that while they turned on, neither was responding. I hooked one up to a CRT to find that it had failed to mount any of its local drives and thus could not boot.

Ugh.

Everything had worked fine the day before only because "mount -a" only mounts what needs to be mounted, so each of the "new" lines in fstab mounted just fine, but everything that had been mounted from the original fstab was no longer there after the restart.

Anyway, long story short... The default config from one Linux distro is not the same as the default config from another. Always modify the defaults; don't just copy over final versions. Though, as they say, that should be pretty obvious advice.

Conclusion: rather than fix the problem, I just reinstalled YDL on both systems. Fedora didn't fix the problems I was having, and in fact YDL comes pre-installed with most of the goodies I needed, anyway, aside from what I could get easily with "yum install *ppu* *spu*".