AI calculations on a Mac cluster get a big boost from new RDMA support on Thunderbolt 5

Real-world testing of Apple's latest implementation of Mac cluster computing proves it can help AI researchers work with massive models, thanks to pooling memory resources over Thunderbolt 5.

In November, Apple teased inbound features in macOS Tahoe 26.2 that stand to considerably change how AI researchers perform machine learning processing. At the time, the headline improvement to MLX, Apple's machine learning framework, was support for GPU-based neural accelerators, but Thunderbolt 5 clustering support was also a big change.

One month later, the benefits of Thunderbolt 5 for clustering are finally being seen in a real-world environment.

YouTuber Jeff Geerling wrote a blog post and published a video on December 18, detailing the experience he had with a cluster of Mac Studios loaned to him by Apple. The set of four Macs cost just short of $40,000 in total, and were used to show off the Thunderbolt 5 connectivity in relation to cluster computing.

All models were M3 Ultra models, each equipped with a 32-core CPU, 80-core GPU, and a 32-core Neural Engine. Two of the models supplied had 512GB of unified memory and 8TB of storage, while the other two had 256GB of memory and 4TB of storage.

Put into a compact 10-inch rack, the collection of Mac Studios was described by Geerling as "almost whisper-quiet," running at under 250 watts apiece. However, the key is the combination of Thunderbolt 5 connectivity between the Mac Studios and the capability to pool their memory.

Massive memory resources

The MLX changes in macOS Tahoe 26.2 included a new driver with Thunderbolt 5 support. This matters because it can considerably speed up inter-Mac connections in small clusters such as this one.

Typical Ethernet-based cluster computing tops out at 10Gb/s, depending on the Mac's specification and assuming techniques such as link aggregation across multiple Ethernet ports aren't used. To improve on this, researchers have used Thunderbolt to connect Macs in a cluster, since it offers much higher bandwidth.

In previous efforts using Thunderbolt 4, the maximum bandwidth was 40Gb/s. Thunderbolt 5 doubles that to a maximum of 80Gb/s.
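As a rough illustration of what the link speeds above mean in practice, this sketch computes ideal transfer times for data exchanged between nodes. The 100 MB payload size is a purely illustrative assumption, and real-world throughput will be lower due to protocol overhead:

```python
# Back-of-the-envelope transfer times for the link speeds discussed above.
# Assumes an ideal link with zero protocol overhead, which overstates
# real-world performance; the payload size is an illustrative assumption.

LINKS_GBPS = {"10GbE": 10, "Thunderbolt 4": 40, "Thunderbolt 5": 80}

def transfer_time_ms(payload_mb: float, link_gbps: float) -> float:
    """Time to move payload_mb megabytes over an ideal link, in milliseconds."""
    bits = payload_mb * 8e6                 # megabytes -> bits
    return bits / (link_gbps * 1e9) * 1e3   # seconds -> milliseconds

for name, gbps in LINKS_GBPS.items():
    print(f"{name:14s}: {transfer_time_ms(100, gbps):.1f} ms per 100 MB")
```

On these idealized numbers, Thunderbolt 5 moves the same payload in one eighth the time of 10Gb Ethernet, which is why it changes the economics of inter-Mac traffic.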

The massive bandwidth is especially useful thanks to Apple's inclusion of RDMA (Remote Direct Memory Access) over Thunderbolt 5. With RDMA, one node in the cluster can directly read the memory of another, expanding its available memory pool to incorporate the others in the cluster.

Crucially, as the name indicates, the transfer happens directly, with barely any processing required from the remote Mac's CPU.

In short, the different processors have access to all of a cluster’s memory reserves at once. For the collection of four Mac Studios as loaned to Geerling, that’s a total of 1.5 terabytes of memory in use.

With Thunderbolt 5 improving the inter-Mac bandwidth, that access has now improved considerably.

The upshot for machine learning researchers is that it's a way to run huge Large Language Models (LLMs) that exceed the memory capacity of any single Mac.

Building a cluster this way does have a limit, imposed by Thunderbolt 5 itself. In the absence of any Thunderbolt 5 network switch, all of the Mac Studios have to be daisy-chained, severely limiting the number of units that could be clustered together before network latency hobbles performance.

Real-world testing

Geerling was able to run some benchmarks on the Mac Studio collection to determine how beneficial it actually can be. After running a command in recovery mode to enable RDMA, he used an open source tool called Exo, as well as llama.cpp, to run models across the cluster.

Both served as a way to test RDMA's effectiveness, since Exo supports RDMA while llama.cpp does not.

An initial benchmark using Qwen3 235B showed the system's promise. On a single node, meaning a single Mac from the cluster, llama.cpp was faster at 20.4 tokens per second versus 19.5 tokens per second for Exo.

But with two nodes in use, llama.cpp dropped to 17.2 tokens per second while Exo improved considerably to 26.2 tokens per second. At four nodes, llama.cpp fell again to 15.2 tokens per second while Exo climbed to 31.9 tokens per second.

Similar improvements were seen using DeepSeek V3.1 671B, with Exo’s performance going from 21.1 tokens per second on a single node to 27.8 tokens per second for two, and 32.5 tokens per second for four nodes.


Mac Studio cluster testing using DeepSeek V3.1 671B – Image Credit: Jeff Geerling

There was also a test of a one-trillion-parameter model, Kimi K2 Thinking 1T A32B, though only 32 billion parameters are active at any time. This is a model that is simply too big for a single Mac Studio with 512GB of memory to handle.

Over two nodes, llama.cpp reported a speed of 18.5 tokens per second, with Exo's RDMA bumping that up to 21.6 tokens per second. Over four nodes, Exo reached 28.3 tokens per second.

Across the clustering tests, Exo improved considerably as more nodes were available to use, thanks to RDMA.
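The scaling trend is easy to see when the quoted Qwen3 235B figures are expressed relative to a single node. The numbers below are taken directly from the benchmarks reported above:

```python
# Scaling computed from the Qwen3 235B benchmark figures quoted above:
# tokens per second at one, two, and four nodes for each tool.
qwen3_tps = {
    "llama.cpp":  {1: 20.4, 2: 17.2, 4: 15.2},
    "Exo (RDMA)": {1: 19.5, 2: 26.2, 4: 31.9},
}

for tool, results in qwen3_tps.items():
    base = results[1]  # single-node throughput as the baseline
    for nodes, tps in sorted(results.items()):
        print(f"{tool:10s} {nodes} node(s): {tps:5.1f} tok/s "
              f"({tps / base:.2f}x vs single node)")
```

Exo with RDMA ends up around 1.6x its single-node throughput at four nodes, while llama.cpp, which cannot use RDMA, actually regresses below 1.0x as nodes are added.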

Big potential, with asterisks

The big takeaway from Geerling’s testing is that there’s a lot of performance available for researchers working in machine learning, especially when it comes to handling massive LLMs. Apple has certainly demonstrated that it is possible, without sacrificing performance, thanks to RDMA and Thunderbolt 5’s available bandwidth.

Creating a cluster like this is still expensive for the typical user, and it may be a bit too expensive for hobbyists to undertake. However, a $40,000 setup similar to this is a fairly reasonably priced expense for teams at companies with a vested interest in AI development.

There are some reservations, though, such as reported stability issues from running HPL benchmarks over Thunderbolt, along with other bugs that surface in prerelease software. Geerling adds that he has trust issues with the secretive development team behind Exo, especially considering it's an open source project.

However, there's also some unrealized potential here. The cluster uses the M3 Ultra because it's the fastest Mac chip that supports Thunderbolt 5 rather than the slower Thunderbolt 4.

While an M4 Ultra chip appears to have been skipped, it's proposed that an M5 Ultra Mac Studio could be much better, thanks to the M5 generation's GPU neural accelerator support. That should give even more of a boost to machine learning research, if Apple gets around to releasing that chip.

Geerling also wonders whether Apple could extend inter-device Thunderbolt 5 connectivity even further to include SMB Direct. He reasons that network shares performing at speeds close to directly attached storage could be a big assist for people working with latency-sensitive, high-bandwidth applications.

Like video editing for YouTubers.

