Assumptions (kept explicit)
- Model: 90 GB (single model sharded across 4 × 24 GB Arc Pro B60).
- Arc Pro B60 per-GPU local memory bandwidth: 456 GB/s (GDDR6, per earlier numbers).
- Strix Halo SoC memory: 96 GB LPDDR5-8633 on a 256-bit bus → ≈276 GB/s (8633 MT/s × 32 bytes per transfer).
- PCIe 5.0 ×8 theoretical bandwidth per direction: ≈31.5 GB/s (Gen5 ×16 ≈63 GB/s → ×8 ≈31.5 GB/s). We'll treat this as the available P2P bandwidth for GPU↔GPU transfers.
- Typical transformer (example numbers for a realistic large model): hidden size = 12,288, #layers = 70, FP16 = 2 bytes / activation element. (These figures are representative of 50–100B-class models; you can plug your model’s precise hidden size / layers if different.)
If you want different hidden_size / layers / seq_len, tell me and I’ll recalc — but I’ll proceed with these numbers.
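For concreteness, here is a tiny Python sketch that encodes these assumptions as constants (the names are mine, purely illustrative), so the arithmetic below is easy to re-run with your own figures:

```python
# Hardware and model assumptions from above (illustrative constants, not from any framework).
MODEL_BYTES    = 90e9                 # 90 GB of weights, sharded across 4 GPUs
B60_MEM_BW     = 456e9                # Arc Pro B60 local memory bandwidth, B/s
HALO_MEM_BW    = 8633e6 * 256 / 8     # LPDDR5-8633 on a 256-bit bus -> ~276 GB/s
PCIE5_X16_BW   = 63e9                 # PCIe 5.0 x16, per direction, theoretical
PCIE5_X8_BW    = PCIE5_X16_BW / 2     # ~31.5 GB/s, treated as the available P2P bandwidth

# Example transformer shape (50-100B class); substitute your model's values.
HIDDEN_SIZE    = 12_288
NUM_LAYERS     = 70
BYTES_PER_ELT  = 2                    # FP16

print(f"Halo memory bandwidth ~ {HALO_MEM_BW / 1e9:.0f} GB/s")   # ~276 GB/s
print(f"PCIe 5.0 x8 bandwidth ~ {PCIE5_X8_BW / 1e9:.1f} GB/s")   # ~31.5 GB/s
```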
Useful intermediate numbers
- Activation bytes per token per layer = hidden_size × bytes_per_element = 12,288 × 2 = 24,576 bytes ≈ 24.576 KB.
- Activation bytes for a whole context (sequence length L): 24,576 × L bytes.
For L = 2048 → 24,576 × 2048 = 50,331,648 bytes ≈ 50.33 MB.
- PCIe5 ×8 bandwidth = 31.5 GB/s = 31,500 MB/s.
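A minimal sketch of the same intermediate quantities (constants repeated so the snippet stands alone; example shape only):

```python
# Activation sizes for the example shape (hidden_size = 12,288, FP16). Illustrative only.
hidden_size   = 12_288
bytes_per_elt = 2                                       # FP16
seq_len       = 2048

per_token_per_layer = hidden_size * bytes_per_elt       # 24,576 B (~24.6 KB)
per_layer_full_ctx  = per_token_per_layer * seq_len     # ~50.33 MB for L = 2048

print(f"per token, per layer : {per_token_per_layer:,} B")
print(f"per layer, L={seq_len}: {per_layer_full_ctx / 1e6:.2f} MB")
```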
Scenario A — full forward of context (L = 2048)
This models the case where each layer needs to exchange the entire sequence activations (common for some distributed kernels or large-batch processing).
- Bytes exchanged per layer (per GPU) ≈ 50.33 MB.
- Time to transfer that over PCIe5×8 (one direction) = 50.33 MB / 31,500 MB/s ≈ 0.001598 s = 1.598 ms per layer.
If this transfer must be done for each of 70 layers (and assuming it cannot be completely overlapped away or reduced by algorithmic tricks), total inter-GPU communication time ≈ 1.598 ms × 70 ≈ 111.9 ms just spent on PCIe transfers for one full forward.
Compare that to the Strix Halo: the Halo avoids any PCIe cross-device transfers because the whole 90 GB model can live in one unified 96 GB addressable memory, so there is no inter-GPU PCIe cost in this scenario. The Halo's lower local memory bandwidth (≈276 GB/s vs 456 GB/s on a B60) will make on-device memory-bound compute slower, but you avoid ~112 ms of PCIe communication per full forward.
So for full-context forward / scoring with long sequences (L ~ 2048), the Strix Halo is very likely to be faster overall for a single-instance run — unless you can restructure the workload to avoid the full-sequence transfers or hide them behind compute extremely well. The ~112 ms PCIe penalty is large and typically dominates.
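As a sanity check on that ~112 ms figure, here is a short sketch under the same assumptions (theoretical PCIe peak, no overlap, per-layer full-sequence exchange):

```python
# Scenario A: every layer exchanges the full-sequence activations over PCIe 5.0 x8.
hidden_size, n_layers, seq_len = 12_288, 70, 2048
bytes_per_elt = 2                                      # FP16
pcie5_x8_bw   = 31.5e9                                 # B/s, one direction, theoretical

per_layer_bytes = hidden_size * bytes_per_elt * seq_len    # ~50.33 MB
per_layer_s     = per_layer_bytes / pcie5_x8_bw            # ~1.6 ms
total_s         = per_layer_s * n_layers                   # ~112 ms per full forward

print(f"per layer: {per_layer_s * 1e3:.2f} ms, full forward: {total_s * 1e3:.1f} ms")
```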
Scenario B — autoregressive single-token generation (KV cache)
This models generation where you only compute the new token and append to the KV cache; per layer, the data you must exchange is only for the new token (much smaller).
- Bytes exchanged per layer for new token = 24.576 KB.
- Time per layer over PCIe5×8 = 24.576 KB / 31.5 GB/s = 24,576 B / 31.5e9 B/s ≈ 7.8e-7 s ≈ 0.00078 ms per layer.
- For 70 layers, total ≈ 0.055 ms of PCIe transfer time per generated token.
That 0.055 ms of communication overhead per token is essentially negligible. In this case:
- Each Arc B60 has much higher local memory bandwidth (456 GB/s) than the Strix Halo (276 GB/s) → roughly a 1.65× advantage in memory-bound per-GPU work.
- If your model is sharded such that each GPU does a big fraction of per-layer work locally and communication per new token is tiny (as shown), the 4× Arc Pro B60 setup can deliver higher per-token throughput / lower latency than the single Strix Halo (assuming the implementation overlaps compute and that the small PCIe exchanges are handled efficiently).
So for autoregressive generation with KV caching, one token at a time, the 4× Arc Pro B60 setup (PCIe5×8) is likely faster than the Strix Halo — because PCIe ×8 is plenty for the tiny per-token exchanges, and the B60s' significantly higher on-card bandwidth and compute win out.
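The per-token numbers can be sanity-checked the same way (again assuming theoretical PCIe peak and that only the new token's activations cross the link each layer):

```python
# Scenario B: only the new token's activations cross PCIe per layer.
hidden_size, n_layers = 12_288, 70
bytes_per_elt = 2                                      # FP16
pcie5_x8_bw   = 31.5e9                                 # B/s

per_layer_bytes = hidden_size * bytes_per_elt          # 24,576 B
per_token_s     = per_layer_bytes / pcie5_x8_bw * n_layers   # ~0.055 ms per token

print(f"PCIe comm per generated token ~ {per_token_s * 1e3:.3f} ms")
```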
Net takeaway (concise)
- If your workload processes long contexts / large batches where full-sequence activations are being exchanged across GPUs (e.g., evaluating or training over L=2048 or similar), PCIe 5.0 ×8 causes large inter-GPU transfer time (~112 ms in my example). That makes the Strix Halo (96 GB unified LPDDR5) the better choice for single-instance latency and simpler correctness — even though the Halo's local memory bandwidth is lower (≈276 GB/s vs 456 GB/s on a B60). The elimination of large PCIe transfers typically dominates in that case.
- If your workload is autoregressive generation (KV cache) or otherwise has only tiny per-step communication (new token exchanges), PCIe 5.0 ×8 is sufficient and the 4× Arc Pro B60 array will likely be faster thanks to higher local bandwidth and more aggregate compute — provided your software can efficiently shard and overlap communication/computation.
Quick sensitivity notes / caveats
- I used hidden_size=12,288, layers=70, L=2048 as a concrete example — change those and the numeric communication times scale linearly. (E.g., smaller hidden_size or shorter L reduces the PCIe penalty proportionally.)
- I assumed transfers cannot be fully hidden by overlapping — in real optimized systems you can often overlap some fraction of comm with compute (pipeline/tensor parallel overlap) which reduces the effective PCIe penalty. How much you can hide depends heavily on framework & kernels.
- PCIe real-world throughput is lower than theoretical peak due to protocol overheads, contention, and host CPU involvement; so the numbers above are optimistic for the multi-GPU case — real latency could be worse.
- If you can quantize/stream weights, or use activation/computation slicing to reduce transfer sizes, you can tilt the balance toward the multi-GPU setup.
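If it helps, here is a small sensitivity sketch along these lines; the overlap fraction and the PCIe efficiency derating are placeholder assumptions you would need to measure on your own stack:

```python
# Sensitivity sketch for the Scenario-A PCIe penalty. The overlap fraction and
# PCIe efficiency below are assumed placeholders, not measured values.
hidden_size, n_layers, bytes_per_elt = 12_288, 70, 2
pcie5_x8_bw = 31.5e9                                   # theoretical B/s, one direction

def scenario_a_comm_ms(seq_len, overlap=0.0, pcie_eff=1.0):
    """Comm time per full forward (ms), after hiding `overlap` of it behind
    compute and derating PCIe to `pcie_eff` of its theoretical peak."""
    per_layer_bytes = hidden_size * bytes_per_elt * seq_len
    raw_s = per_layer_bytes / (pcie5_x8_bw * pcie_eff) * n_layers
    return raw_s * (1.0 - overlap) * 1e3

for L in (512, 2048, 8192):
    ideal   = scenario_a_comm_ms(L)
    derated = scenario_a_comm_ms(L, overlap=0.5, pcie_eff=0.8)
    print(f"L={L:5d}: {ideal:6.1f} ms ideal, {derated:6.1f} ms with 50% overlap / 80% efficiency")
```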