Accelerating Client-Side Cryptography with WebGPU

Penumbra Labs participated in this year’s ZPrize, an annual competition focused on promoting the use and development of zero-knowledge cryptography. This was a collaborative effort involving Tal Derei, a client-side proving engineer at Penumbra, and Koh Wei Jie, a zero-knowledge researcher from Geometry Research. Our submission achieved the fastest WebGPU multi-scalar multiplication (MSM) solution in the browser, paving the way for the next frontier of client-side compute!

The implementation is based on the sparse-matrix construction adapted from the cuZK paper (Lu et al., 2022). It extensively uses WebGPU, a modern browser API, and makes unique memory management and threading optimizations catered to the WebGPU environment. WebGPU exposes the capabilities of a GPU (Graphics Processing Unit) on the web, enabling developers to accelerate computationally heavy tasks using GPU parallelism.

We hope that our work can contribute to client-side proving libraries such as webgpu-crypto and help speed up client-side proof generation, particularly for privacy-centric use cases.

Harnessing GPU Power for zkSNARK Efficiency

Zero-knowledge proofs are cryptographic primitives that enable a prover to convince a verifier that a statement about some secret is true without leaking information about the secret. For instance, the prover may claim to know confidential information like the preimage $w$ of a public hash function $H$ that evaluates to $y$, such that $H(w, x) = y$, where $x$ and $w$ are public and private inputs respectively. The prover generates a succinct zero-knowledge proof $\pi$ that convinces the verifier of this computation, such that $\pi$ reveals nothing about the secret input $w$.

Generating the proof $\pi$ is computationally intensive and can introduce significant overhead for proving systems. Fortunately, the majority of the computational load is dominated by cryptographic operations, particularly multi-scalar multiplication, which is highly parallelizable. This makes it an ideal candidate for GPU acceleration.

GPUs are massively parallel hardware architectures that can significantly accelerate zkSNARK workloads by leveraging thousands of simultaneous computing cores. This advantage is particularly pronounced for client-side computations, where proof generation is often constrained by the limited physical CPU cores and memory capacity of a mobile device.

Introducing WebGPU: A Browser GPU API

WebGPU, the successor to WebGL, is a game-changing web API that introduces general-purpose compute (GPGPU) as a first-class primitive, unlike WebGL which primarily focuses on graphics processing. WebGPU virtualizes the GPU hardware with a three-dimensional hierarchy of thread blocks that can be indexed to span the vectors and matrices encoding the computation. Specifically, WebGPU enables the browser to access the physical hardware resources of the GPU device to accelerate computations.

The WebGPU Programming Model

GPUs execute code in a fundamentally parallel manner: they use streaming multiprocessors (SMs) to perform the actual computation, where each SM is logically composed of many physical GPU cores running the same instruction set. Before proceeding, it's important to understand three key concepts: shaders, workgroups, and buffers.

A GPU shader is responsible for each stage of our process. A shader is essentially code that runs on the GPU, and the way it operates on data in parallel is orchestrated via the WebGPU API. Shaders are written in the WebGPU Shading Language (WGSL), which at the time of writing is the only available language. The browser maps WebGPU to native APIs and compiles WGSL (in Chromium, via the Dawn library) to whatever intermediate target the underlying system expects: HLSL for DirectX 12, MSL for Metal, or SPIR-V for Vulkan.

To control parallelism, the programmer specifies the workgroup size and the number of workgroups to dispatch per shader invocation. Essentially, the number of threads is the product of the workgroup size and the number of workgroups dispatched. Each thread executes an instance of a shader, and the shader logic determines which pieces of data are read, manipulated, and stored in the allocated storage buffers.

Data is stored in buffers, which can be read from and written to by the CPU, as well as by different invocations of the same shader or by different shaders. During execution, buffers can be accessed by any thread in any workgroup, and since WebGPU provides no guarantees on the order of access, race conditions are possible and must be avoided using atomic operations and barriers.
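To make these concepts concrete, here is a minimal, self-contained sketch of the WebGPU compute flow: one WGSL shader, one storage buffer, and a single dispatch. It assumes a WebGPU-capable browser (and WebGPU type definitions for TypeScript); the shader, buffer size, and workgroup size are purely illustrative and are not taken from our MSM implementation.

```typescript
// Minimal WebGPU compute sketch: one shader, one storage buffer, one dispatch.
// Everything here (sizes, names, the doubling shader) is illustrative only.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

// WGSL shader: each thread doubles one u32 element of the storage buffer.
const shaderCode = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> data: array<u32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    data[gid.x] = data[gid.x] * 2u;
  }
`;

const module = device.createShaderModule({ code: shaderCode });
const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: { module, entryPoint: "main" },
});

// A storage buffer holding 1024 u32 values (4 bytes each), accessible to the CPU and GPU.
const buffer = device.createBuffer({
  size: 1024 * 4,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer } }],
});

// Total threads = workgroup size (64) x workgroups dispatched (16) = 1024.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(16);
pass.end();
device.queue.submit([encoder.finish()]);
```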

It is also important to note that WebGPU terminology differs somewhat from that of other GPU programming frameworks and platforms, such as CUDA or Metal, which use terms like thread blocks or kernels for what WebGPU calls workgroups and shaders. WebGPU also differs in its capabilities; certain operations that are possible in CUDA are not available in WebGPU, such as setting workgroup sizes dynamically or using subgroup functionality.

Readers may refer to “WebGPU — All of the Cores, None of the Canvas” for a more in-depth exploration of WebGPU concepts.

Constructing Multi-Scalar Multiplication using Sparse Matrices

Given an elliptic curve group $\mathbb{G}$ of prime order $p$, computing a multi-scalar multiplication (MSM) involves calculating a linear combination of an array of $N$ elliptic curve points and scalars

$$\mathsf{MSM}(G, x) = \sum_{i=0}^{N-1} x_i \cdot G_i$$

where $N$ ranges inclusively over $[2^{16}, \dots, 2^{20}]$, $G$ is an $N$-element vector of elliptic curve points, and $x$ is an $N$-element vector of corresponding scalars.

$$G = \left[ G_0, G_1, \dots, G_{N-1} \right] \in \mathbb{G}^N$$
$$x = \left[ x_0, x_1, \dots, x_{N-1} \right] \in \mathbb{F}_p^N$$

Performing a single scalar multiplication, $x_i \cdot G_i$, in the summation requires numerous elliptic curve additions. These arithmetic operations are computationally expensive because elliptic curve points consist of large finite field elements, typically around 254 or 377 bits. Curve additions further decompose into several costly modular field multiplications.

The naive method for computing MSMs is the Double-and-Add algorithm, in which the number of group operations scales as roughly $384 \cdot N$, where $N$ is the number of curve points. The goal of MSM optimization is to minimize the number of group operations as a function of the problem size $N$, especially for larger circuits that approach $2^{26}$ constraints. Pippenger's Bucket Method, detailed in the next section, reduces this scaling factor to $16 \cdot N$, achieving a roughly 24x performance improvement.
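As a point of reference, here is what the naive approach looks like in code. This is a sketch over a toy group (integers modulo a prime under addition) standing in for the elliptic curve group so that the snippet is runnable; `add` and `double` are stand-ins for the expensive curve operations described above.

```typescript
// Double-and-add over an abstract group. For illustration, the "group" is just integers
// mod a toy modulus under addition; in the real setting, add and double are elliptic
// curve point operations, which is what makes them expensive.
const P = 0x1fffffffffffffffn; // toy modulus, not a real curve parameter
type Point = bigint;
const IDENTITY: Point = 0n;
const add = (a: Point, b: Point): Point => (a + b) % P;
const double = (a: Point): Point => (a * 2n) % P;

function scalarMul(x: bigint, g: Point): Point {
  let acc = IDENTITY;
  let base = g;
  while (x > 0n) {
    if (x & 1n) acc = add(acc, base); // group addition when the current bit is set
    base = double(base);              // one doubling per scalar bit
    x >>= 1n;
  }
  return acc;
}

// Naive MSM: N independent double-and-add scalar multiplications, summed together.
function naiveMsm(scalars: bigint[], points: Point[]): Point {
  return scalars.reduce((sum, x, i) => add(sum, scalarMul(x, points[i])), IDENTITY);
}
```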

Pippenger’s Bucket Method

The state-of-the-art algorithm for computing MSMs is Pippenger’s Bucket Method. Pippenger introduces a windowing technique that breaks up each multiplication

$$x_i \cdot G_i = x_i^{[0]} G_i + x_i^{[1]} 2^{c} G_i + x_i^{[2]} 2^{2c} G_i + \dots + x_i^{[K-1]} 2^{c(K-1)} G_i$$

into $K$ windows of size $c$ bits. This can be more formally defined as

$$\mathsf{MSM}(G, x) = \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} 2^{kc} \, x_i^{[k]} \cdot G_i$$

where each scalar $x_i$, a finite-field element of $b$ bits, is partitioned into $K$ smaller sub-scalars of $c$ bits each, yielding $K = \lceil b / c \rceil$ windows, or subtasks. The algorithm leverages this partitioning to enable an inherently more parallelizable computation.
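The snippet below sketches the basic (unsigned) bucket method over the toy group from the previous example: scalars are split into $K$ windows of $c$ bits, points are accumulated into buckets per window, each window is collapsed with a running sum, and the windows are combined with doublings. Our actual implementation uses the signed bucket variant and sparse matrices, described later; this is only the textbook form of the algorithm.

```typescript
// Bucket-method MSM sketch (unsigned buckets) over the toy group defined above.
function pippengerMsm(scalars: bigint[], points: Point[], b: number, c: number): Point {
  const K = Math.ceil(b / c);           // number of windows
  const mask = (1n << BigInt(c)) - 1n;  // extracts one c-bit chunk
  let result = IDENTITY;

  for (let k = K - 1; k >= 0; k--) {
    // Shift the accumulated result left by c bits (c doublings).
    for (let d = 0; d < c; d++) result = double(result);

    // Bucket accumulation: bucket j collects every point whose k-th chunk equals j.
    const buckets: Point[] = new Array(Number(mask)).fill(IDENTITY);
    for (let i = 0; i < scalars.length; i++) {
      const chunk = Number((scalars[i] >> BigInt(k * c)) & mask);
      if (chunk !== 0) buckets[chunk - 1] = add(buckets[chunk - 1], points[i]);
    }

    // Running-sum reduction: computes sum_j j * B_j using only additions.
    let running = IDENTITY;
    let windowSum = IDENTITY;
    for (let j = buckets.length - 1; j >= 0; j--) {
      running = add(running, buckets[j]);
      windowSum = add(windowSum, running);
    }
    result = add(result, windowSum);
  }
  return result;
}
```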

Sparse Matrix Design

Sparse matrices are two-dimensional data structures with $m$ rows and $n$ columns, where most elements in the matrix are zero. They naturally adapt to the high parallelism afforded by GPU architectures. GPUs process a sparse matrix in parallel by spawning many independent threads, organized into blocks and arranged in a grid, with each thread assigned to a separate cell in the matrix.

Heatmap visualization of the sparse-matrix structure.

Heatmap visualization of a dense-matrix structure.

Our implementation uses sparse matrices, rather than classic Pippenger, as the basis of our multi-scalar multiplication scheme. We reference the cuZK paper, which showed that sparse-matrix designs are generally more parallelizable and incur lower computational costs than traditional Pippenger-based designs. Its analysis suggests that as the number of constraints increases, the former achieves a nearly perfect linear speedup over Pippenger's algorithm, where the speedup ratio equals the number of execution threads available on the GPU.

The MSM algorithm we designed transforms the main operations of Pippenger's algorithm into a series of basic sparse matrix operations, such as Sparse Matrix Transposition and Sparse Matrix-Vector Multiplication (SpMV). The data elements of the MSM, the elliptic curve points and scalars, are stored in sparse matrices using a memory-efficient storage format called Compressed Sparse Row (CSR).
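For readers unfamiliar with CSR, the sketch below shows the layout: three flat arrays describing only the non-zero entries of a matrix. The field names and the use of point indices as values are illustrative; they are not the exact structures in our codebase.

```typescript
// Compressed Sparse Row (CSR) layout sketch. A sparse m x n matrix is stored as three
// flat arrays: rowPtr (where each row starts), colIdx (column of each non-zero entry),
// and val (the entry itself; in the MSM setting this would be a point index).
interface CsrMatrix {
  rowPtr: Uint32Array; // length m + 1; row i spans [rowPtr[i], rowPtr[i + 1])
  colIdx: Uint32Array; // length nnz (number of non-zero entries)
  val: Uint32Array;    // length nnz
}

// Example: the 3 x 4 matrix
//   [ 0 5 0 0 ]
//   [ 0 0 0 7 ]
//   [ 9 0 0 0 ]
const example: CsrMatrix = {
  rowPtr: new Uint32Array([0, 1, 2, 3]),
  colIdx: new Uint32Array([1, 3, 0]),
  val:    new Uint32Array([5, 7, 9]),
};
```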

Implementation Design and Considerations

We produced submissions for both the Twisted Edwards BLS12 and Short Weierstrass BLS12-377 elliptic curves.

The BLS12-377 elliptic curve originated in the Zexe paper (BCGMMW20), featuring a 377-bit base field and a 253-bit scalar field, with the G1 equation $Y^2 = X^3 + 1$ in Weierstrass form. The Twisted Edwards BLS12 curve is defined over a 253-bit base field, which is the scalar field of the BLS12-377 curve.

Our implementations for the Twisted Edwards BLS12 and BLS12-377 curves were structurally identical, and only differed in the following areas:

1. The number of limbs per point coordinate: Twisted Edwards BLS12 used twenty 13-bit limbs for each point coordinate since the base field order is 253 bits, and BLS12-377 used thirteen 30-bit limbs for each point coordinate since the base field order is 377 bits. The Montgomery multiplication algorithm requires the product of the limb size and the number of limbs to be greater than the modulus's bit-length (see the sketch after this list).

2. The curve addition and doubling algorithms: Twisted Edwards BLS12 used the Extended Twisted Edwards (ETE) coordinate system and implemented extended twisted Edwards point addition and doubling algorithms. For BLS12-377, we used projective point addition and doubling algorithms.
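The limb decomposition itself is straightforward; the sketch below shows the idea, with the limb widths and counts taken from the parameters above. The function is an illustrative stand-in, not our shader code.

```typescript
// Splitting a field element into fixed-width limbs for 32-bit GPU arithmetic.
// Illustrative sketch; limb width x limb count must exceed the field's bit-length.
function toLimbs(x: bigint, limbBits: number, numLimbs: number): Uint32Array {
  const limbs = new Uint32Array(numLimbs);
  const mask = (1n << BigInt(limbBits)) - 1n;
  for (let i = 0; i < numLimbs; i++) {
    limbs[i] = Number(x & mask); // least significant limb first
    x >>= BigInt(limbBits);
  }
  return limbs;
}

// Twisted Edwards BLS12 coordinate: 20 limbs of 13 bits (260 bits >= 253-bit field).
// toLimbs(coordinate, 13, 20);
// BLS12-377 coordinate: 13 limbs of 30 bits (390 bits >= 377-bit field).
// toLimbs(coordinate, 30, 13);
```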

At a high level, the implementation is composed of the following stages, the vast majority of which run inside the GPU. This avoids relatively costly data transfers in and out of the GPU device: data crosses the CPU-GPU boundary only when absolutely necessary, at the start of the process, where points and scalars are written to GPU buffers, and near the end, where a relatively small number of reduced bucket sums are read back so that the final result can be computed on the CPU.


1. Point Conversion and Scalar Decomposition

Firstly, the elliptic curve points and scalars are converted into byte representations amenable to GPU processing. The internal representations are stored as buffer objects that are written directly to GPU memory. In Twisted Edwards BLS12, the scalars are buffer objects of $32n$ bytes, and the curve points (composed of X and Y coordinates) are $64n$ bytes, where $n$ represents the input size. In memory, the point coordinates are laid out in an alternating fashion, in little-endian form: [x[0][0], ..., x[0][31], y[0][0], ..., y[0][31], x[1][0], ...].
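The sketch below illustrates this packing for the point buffer, assuming each coordinate is available as a bigint; the helper is hypothetical and only mirrors the layout described above.

```typescript
// Pack n affine points into a flat byte buffer: 32 little-endian bytes for X, then 32
// for Y, repeated per point, matching the layout described above. Illustrative only.
function packPoints(points: { x: bigint; y: bigint }[]): Uint8Array {
  const out = new Uint8Array(points.length * 64);
  const writeCoord = (value: bigint, offset: number) => {
    for (let b = 0; b < 32; b++) {
      out[offset + b] = Number(value & 0xffn); // little-endian: low byte first
      value >>= 8n;
    }
  };
  points.forEach((p, i) => {
    writeCoord(p.x, i * 64);      // X occupies bytes [64i, 64i + 32)
    writeCoord(p.y, i * 64 + 32); // Y follows immediately after
  });
  return out;
}
```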

The buffers are subsequently fed into the GPU shader, kicking off the first round of the computation:

1. Convert elliptic curve points into Montgomery form: transform each point coordinate into 13-bit limbs, organized into BigInt structures of 20 limbs each, and represent it in Montgomery form by multiplying the BigInt by the Montgomery radix $R$ using big-integer multiplication and Barrett reduction. The smaller limb size was specifically chosen to speed up field multiplications on 32-bit GPU architectures, adapting a technique described by Gregor Mitscha-Baude.

Montgomery Multiplication makes modular field multiplications more efficient by transforming the representation into Montgomery Form. Improving the efficiency of field multiplications is critical since they consume a majority of the entire MSM runtime, an estimated 70-80%. Additionally, our calculations suggest that carry operations comprise an estimated 50% of a single field multiplication.

In WASM, the maximum unsigned integer type is 64 bits wide, so the maximum limb size is 32 bits. In WebGPU, the maximum unsigned integer type is 32 bits wide, so the maximum limb size is 16 bits. Our calculations suggest that 13 bits is the limb size that minimizes the number of carry operations and produces the most efficient multiplications in a WebGPU context.

2. Scalar decomposition: decompose each $b$-bit (251-bit) scalar into $c$-bit (16-bit) chunks using the signed bucket index technique, where the number of chunks is $K = \lceil b / c \rceil$ (16 chunks) per scalar; a sketch of this recoding follows below.
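Here is a sketch of that decomposition, serialized in TypeScript for readability (the real work happens in a GPU shader). The plain $c$-bit split is followed by a signed recoding that maps each chunk into the range $(-2^{c-1}, 2^{c-1}]$ by borrowing from the next chunk; negative chunks are later handled by negating the corresponding points. The handling of a final carry out of the top chunk is omitted here.

```typescript
// Signed scalar decomposition sketch: split a b-bit scalar into K = ceil(b / c) chunks
// of c bits (least significant first), then recode chunks above 2^(c-1) as negative
// values that borrow from the next chunk. Parameter names are illustrative.
function decomposeSigned(scalar: bigint, b: number, c: number): number[] {
  const K = Math.ceil(b / c);
  const base = 1 << c;
  const half = base >> 1;
  const chunks: number[] = [];

  // Plain c-bit decomposition.
  let s = scalar;
  for (let k = 0; k < K; k++) {
    chunks.push(Number(s & BigInt(base - 1)));
    s >>= BigInt(c);
  }

  // Signed recoding: a chunk above 2^(c-1) becomes negative and carries 1 upward.
  for (let k = 0; k < K; k++) {
    if (chunks[k] > half) {
      chunks[k] -= base;
      if (k + 1 < K) chunks[k + 1] += 1; // carry; a carry out of the top chunk is omitted
    }
  }
  return chunks;
}
```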


2. Sparse Matrix Transposition and Vector Multiplication

The next step is to compute the indices of the elliptic curve points which share the same scalar chunks, enabling the parallel accumulation of curve points into buckets. The points will be partitioned into $K$ larger groups called subtasks. Each subtask represents an independent grouping of the data and can be processed in parallel. Each subtask will have $2^c$ buckets, and the total number of buckets across all subtasks will be $2^c \cdot K$. These buckets act as logical groupings of curve points that will later be accumulated together.

Importantly, each subtask is internally represented as a distinct CSR (compressed sparse row) sparse matrix. Recall that we're still treating everything as sparse matrices, stored in a memory-efficient format that maps matrix cells to curve points and their associated buckets.

The transpose step generates the CSR sparse matrices and transposes them simultaneously, resulting in a compact, wide, flat matrix whose width is $n = 2^{\text{chunk size}}$ and whose height is $m = 1$. Our transpose implementation modifies the serial transpose algorithm adapted from Wang et al., 2016, “Parallel Transposition of Sparse Data Structures”, using atomic operations. This step comes from the cuZK paper, where it is dubbed sparse matrix transposition. Our implementation is slightly different, however: while the original cuZK approach stores the points in ELL sparse matrix form before converting them into CSR form, we skip the ELL step altogether and directly generate the CSR matrices.

After the transposition, each thread is assigned two buckets and performs a sparse matrix-vector multiplication to accumulate the points in each bucket.
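Conceptually, this transpose-then-multiply step is a grouping of point indices by bucket id. The snippet below is a single-threaded sketch of that grouping (a counting sort producing CSR-style row pointers); the GPU version performs the counting and scattering with atomic operations across many threads, as described above.

```typescript
// Serial sketch of grouping point indices by bucket: a histogram, an exclusive prefix
// sum (the CSR row pointers), and a scatter. The GPU implementation does the same work
// in parallel using atomics; this single-threaded version is for illustration only.
function groupPointsByBucket(bucketIds: Uint32Array, numBuckets: number) {
  // 1. Histogram: how many points fall into each bucket.
  const counts = new Uint32Array(numBuckets);
  for (const b of bucketIds) counts[b]++;

  // 2. Exclusive prefix sum over the counts gives the CSR row pointers.
  const rowPtr = new Uint32Array(numBuckets + 1);
  for (let i = 0; i < numBuckets; i++) rowPtr[i + 1] = rowPtr[i] + counts[i];

  // 3. Scatter: place each point index into its bucket's slice.
  const pointIdx = new Uint32Array(bucketIds.length);
  const cursor = rowPtr.slice(0, numBuckets); // working copy of each bucket's start
  bucketIds.forEach((b, i) => {
    pointIdx[cursor[b]++] = i;
  });

  // Bucket b's point indices are pointIdx[rowPtr[b] .. rowPtr[b + 1]).
  return { rowPtr, pointIdx };
}
```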


3. Bucket Reduction

Now that we have the aggregated buckets per subtask, we need to compute the bucket reduction:

$$G = \sum_{l=1}^{2^s - 1} l \cdot B_l$$

We follow the pBucketPointsReduction algorithm from the cuZK paper (Algorithm 4), which states that for each subtask $G$, we split the buckets $B_l$ across multiple threads, compute a running sum of the points per thread, and multiply each running sum by its bucket index $l$. Here's an illustration for 8 buckets arranged in a half-pyramid:
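In code, a serialized sketch of the same reduction looks like the following (reusing the toy group and scalarMul from the earlier sketches). Each chunk of buckets corresponds to one thread in the GPU version: a running sum and a weighted sum are built with additions only, and the chunk's bucket-index offset is then applied with a single small scalar multiplication.

```typescript
// Bucket reduction sketch, serialized: buckets[i] holds B_{i+1}. The buckets are split
// into contiguous chunks (one per GPU thread in the real implementation); each chunk
// computes running and weighted sums, then its offset is applied via scalarMul.
function reduceBuckets(buckets: Point[], numChunks: number): Point {
  const m = Math.ceil(buckets.length / numChunks);
  let total = IDENTITY;

  for (let j = 0; j < numChunks; j++) {
    const start = j * m; // bucket-index offset for this chunk
    const slice = buckets.slice(start, start + m);

    // Running-sum trick within the chunk: weighted = sum_i (i + 1) * slice[i].
    let running = IDENTITY;
    let weighted = IDENTITY;
    for (let i = slice.length - 1; i >= 0; i--) {
      running = add(running, slice[i]);
      weighted = add(weighted, running);
    }

    // Chunk contribution: sum_i (start + i + 1) * slice[i] = weighted + start * running.
    total = add(total, add(weighted, scalarMul(BigInt(start), running)));
  }
  return total;
}
```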


4. Result Aggregation

The final stage of our MSM procedure uses Horner's method (cuZK Formula 3). The final result $Q$ is computed as follows:

$$Q = \sum_{j=1}^{\lceil \lambda / s \rceil} 2^{(j-1)s} \, G_j$$

To compute the result, we read each subtask sum from GPU memory and perform the computation on the CPU. Our benchmarks show that doing this on the CPU is faster than on the GPU because the number of points is low (only $\lceil \lambda / c \rceil$ points), and the time taken for WebGPU to compile a shader that performs this computation is far longer than the time taken to read the points and calculate the result.
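A sketch of this final step, using the toy group from the earlier examples, is shown below: Horner's rule turns the weighted sum into $s$ doublings and one addition per subtask sum read back from the GPU.

```typescript
// Final aggregation via Horner's method on the CPU:
// Q = G_1 + 2^s (G_2 + 2^s (G_3 + ... )), where subtaskSums[j - 1] = G_j.
function aggregate(subtaskSums: Point[], s: number): Point {
  let acc = subtaskSums[subtaskSums.length - 1]; // start from the most significant window
  for (let j = subtaskSums.length - 2; j >= 0; j--) {
    for (let d = 0; d < s; d++) acc = double(acc); // multiply the accumulator by 2^s
    acc = add(acc, subtaskSums[j]);                // then add the next subtask sum
  }
  return acc;
}
```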

Performance Metrics

For the Twisted Edwards curve, $A, B, C$ represent the sponsors' WASM and WebGPU competition baselines, and $D, E, F, G$ represent our WebGPU results on the Apple M1, Apple M2 Max, Apple M3 Pro, and Nvidia A1000.

Twisted Edwards BLS12 Elliptic Curve

For BLS12-377, $H$ represents the sponsors' single-threaded WASM baseline (there was no WebGPU baseline), and $I, J, K, L$ represent our WebGPU results on the Apple M1, Apple M2 Max, Apple M3 Pro, and Nvidia A1000.

BLS12-377 Elliptic Curve

Future Optimizations

We expect the WebGPU API to evolve over the coming years, and we hope to see its improvements benefit client-side cryptography. Areas which have not yet been explored include:

  • Multi-device proof generation, where multiple WebGPU instances across separate devices work together via peer-to-peer networking to share a workload.
  • New low-level features, such as storing big integer limbs in 16-bit floating-point variables.
  • More low-level access to the features of the underlying GPU device, such as the ability to set workgroup sizes dynamically (which is possible in CUDA), or lower-level language features such as an assembly-like dialect for WGSL.

Additionally, we explored the use of floating-point arithmetic in our implementation, but our error bounds weren't quite correct. We're working on tweaking the limb sizes and using Fused Multiply-Add (FMA) instructions to enable efficient floating-point arithmetic in WebGPU. The use of floating point is particularly promising for zero-knowledge proof systems when deployed on hardware like Google's TPUs.

Conclusion

While we believe that WASM is currently a more battle-tested and production-ready platform than WebGPU, primarily due to its access to low-level SIMD instructions, WebGPU shows significant potential to outperform WASM at larger circuit sizes. We anticipate that continued development in this area will bring WebGPU closer to widespread adoption.

We are grateful to the authors of the cuZK paper, last year's ZPrize participants, and other authors whose work we referenced. We hope that the outcome of this year's ZPrize can continue to drive innovation forward in the industry. Stay tuned for our upcoming, in-depth research paper on WebGPU!