6.5.1 Compression. As described in Section 5.1, we use a custom
CUDA implementation to optimize our compressed representation.
Figure 11 shows our compression times for a single 4k material tex-
ture set with 9 channels, compared to a reference implementation in
PyTorch [58]. Both implementations were evaluated on an NVIDIA
RTX 4090 GPU for two different compression profiles.
Our custom implementation is approximately 10× faster than
PyTorch, which is crucial for achieving practical compression times.
We can generate a preview quality result in just under one minute
for both configurations, with a difference of less than 1.5 dB com-
pared to the maximum length of optimization (320k steps) in the
0.2 BPPC case. Moreover, for compression with the 1.0 BPPC profile,
our implementation uses less than 2 GB of GPU memory, whereas
PyTorch requires close to 18 GB, which is infeasible for many GPUs.
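
To put the two profiles in perspective, the arithmetic behind the bitrates can be sketched as follows. This is an illustrative calculation only, not part of our implementation: it assumes that BPPC denotes bits per pixel per channel, that a 4k texture is 4096 × 4096 texels, that mip levels and metadata are ignored, and that the dB figure quoted above is a PSNR difference. The function names are hypothetical.

```python
# Illustrative arithmetic only. Assumptions (not stated in the text above):
# BPPC = bits per pixel per channel, 4k = 4096 x 4096 texels,
# no mip chain or metadata, and the dB gap is a PSNR difference.


def compressed_size_bytes(width, height, channels, bppc):
    """Payload size in bytes at a given bits-per-pixel-per-channel rate."""
    return width * height * channels * bppc / 8


def mse_ratio_from_db(delta_db):
    """Mean-squared-error ratio implied by a PSNR gap of delta_db decibels."""
    return 10 ** (delta_db / 10)


if __name__ == "__main__":
    # Both compression profiles for a 9-channel 4k texture set.
    for bppc in (0.2, 1.0):
        size_mib = compressed_size_bytes(4096, 4096, 9, bppc) / 2**20
        print(f"{bppc} BPPC: {size_mib:.1f} MiB")

    # Uncompressed reference, assuming 8 bits per channel.
    raw_mib = compressed_size_bytes(4096, 4096, 9, 8.0) / 2**20
    print(f"uncompressed (8 bits per channel): {raw_mib:.0f} MiB")

    # Error increase implied by stopping optimization early.
    print(f"1.5 dB gap: {mse_ratio_from_db(1.5):.2f}x MSE")
```

Under these assumptions, the 9-channel set occupies roughly 3.6 MiB at 0.2 BPPC and 18 MiB at 1.0 BPPC, against 144 MiB uncompressed at 8 bits per channel, and a 1.5 dB PSNR gap corresponds to about 1.4× the mean squared error.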
Traditional BCx compressors vary in speed, ranging from fractions
of a second to tens of minutes to compress a single 4096 × 4096
texture [60], depending on quality settings. The median compression
time for BC7 textures is a few seconds, while it is a fraction of
a second for BC1 textures. This makes our method approximately
an order of magnitude slower than a median BC7 compressor, but
still faster than the slowest BCx compression profiles.
6.5.2 Decompression. We evaluate real-time performance of our
method by rendering a full-screen quad at 3840 × 2160 resolution
textured with the Paving Stone set, which has eight 4k channels: diffuse
albedo, normals, roughness, and ambient occlusion. The quad is lit
by a directional light and shaded using a physically-based BRDF
model [10] based on the Trowbridge–Reitz (GGX) microfacet
distribution [76]. Results in Table 4 indicate that rendering with NTC
via stochastic filtering (see Section 5.3) costs between 1.15 ms and
1.92 ms on an NVIDIA RTX 4090, while the cost decreases to 0.49 ms
with traditional trilinear filtered BC7 textures. T