Faster Client-Side Proving with Parallelism

Shielded blockchains achieve privacy by moving state and execution off-chain to the end-user device, where proofs of that execution are generated and then submitted to the chain. Compared with relying on third-party proving clusters, generating proofs locally mitigates the risks of de-anonymization and potential loss of funds. Unlike on transparent blockchains, client-side proving is essential on privacy-oriented blockchains, because it prevents sensitive data such as public keys or private witnesses from being shared with external cloud servers.

Most privacy-preserving blockchains are designed to safeguard transaction confidentiality, but this often comes at the expense of performance. The computationally intensive zero-knowledge proofs that underpin these transactions tend to be generated on the end-user device within the web browser, an environment that is traditionally slow and resource-constrained. Consequently, most privacy-enabled applications like shielded wallets offer a poorer user experience than their centralized counterparts.

With this in mind, over the past few months the Penumbra Labs team has focused on refining and optimizing our web wallet extension to change this user experience by enabling a higher degree of transaction-level parallelism, dramatically improving client-side performance in modern web browsers. These improvements ensure that users can reap the benefits of private transactions without compromising on the quality of their experience.

Web performance

Penumbra’s Rust-based prover code is compiled to WebAssembly and executed in-browser, where it runs on the order of 25-30x slower than the same code compiled natively. There are two major contributors to this slowdown. First, the native Rust code can use Rayon to parallelize computations across multiple threads, while the WebAssembly build cannot. Second, the native Rust code can perform those computations more efficiently, because it has access to a 128-bit multiplier, while WebAssembly only supports a 64-bit multiplier. Penumbra uses Arkworks for generating proofs, which internally represents field elements with 64-bit limbs and therefore relies on 64x64-bit multiplications that produce 128-bit results. Emulating these wide multiplications adds substantial overhead when the code is lowered to Wasm.
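To make the multiplier point concrete, here is a minimal sketch of what a single limb multiplication costs when no 128-bit result is available natively. It is not Arkworks code and not what the Wasm toolchain literally emits; the function name `wideningMul64` is hypothetical, and BigInt is used only to keep the illustration exact. The sketch shows one 64x64-bit product being rebuilt from four 32x32-bit partial products plus carry handling.

```typescript
// Illustrative only: rebuilds a 64x64 -> 128-bit product from 32-bit halves,
// similar in spirit to the decomposition needed on a target that has no
// widening 64-bit multiply. `wideningMul64` is a hypothetical name.

const MASK32 = 0xffffffffn;

/** Returns [low, high]: the two 64-bit halves of a * b for unsigned 64-bit a and b. */
function wideningMul64(a: bigint, b: bigint): [bigint, bigint] {
  const aLo = a & MASK32;
  const aHi = a >> 32n;
  const bLo = b & MASK32;
  const bHi = b >> 32n;

  // Four partial products replace what is a single instruction natively.
  const ll = aLo * bLo;
  const lh = aLo * bHi;
  const hl = aHi * bLo;
  const hh = aHi * bHi;

  // Sum the middle column and propagate its carry into the high half.
  const mid = (ll >> 32n) + (lh & MASK32) + (hl & MASK32);
  const low = ((mid & MASK32) << 32n) | (ll & MASK32);
  const high = hh + (lh >> 32n) + (hl >> 32n) + (mid >> 32n);
  return [low, high];
}
```

A field multiplication over several limbs repeats this pattern many times, which is one way to see why the lack of a wide multiply matters so much for proving.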

On a 2020 MacBook with an M1 chip (equipped with an 8-core CPU and GPU), a typical transfer transaction took roughly 1 second to build with the command-line client, while constructing the same transaction in the browser took roughly 30 seconds. By parallelizing transaction building and distributing the workload across web worker threads, we reduced transaction build times in our web frontend to 10 seconds, a 3x improvement over the sequential implementation. This dramatically enhances the user experience!

Technical design considerations

Unlike a transparent chain, which uses mutable, global state modified by each transaction, a shielded chain like Penumbra tracks state in individual fragments, similar to a UTXO model. This enables privacy by migrating the user state off-chain: instead of maintaining the user state itself, the chain only needs to maintain cryptographic commitments to the state. The state is represented by a tree of these commitments, effectively concealing the data being committed to. To update their state, the user generates zero-knowledge proofs (proving they have honestly formed a valid state transition that consumes some input states and produces output states) and publishes commitments to their new state fragments, without revealing their content.

Penumbra’s view service is our solution for managing client state, providing users with their own private views into their relevant state fragments on-chain. It operates as a personal indexer that provides read access to all private data visible to a given wallet. The view service is responsible for scanning the chain for transactions relevant to the user, downloading and synchronizing them, then decrypting and indexing them locally.
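The scan, decrypt, and index loop described above can be sketched roughly as follows. This is a toy model, not Penumbra's view service: every type and name here (`Block`, `EncryptedNote`, `ViewIndex`, `TrialDecrypt`) is hypothetical, and the decryption function is supplied by the caller as a stand-in for real trial decryption with the wallet's keys.

```typescript
// Hypothetical shapes for illustration; none of these names come from Penumbra's code.
interface EncryptedNote { commitment: string; ciphertext: string; }
interface Block { height: number; notes: EncryptedNote[]; }
interface DecryptedNote { commitment: string; height: number; plaintext: string; }

// Returns the plaintext if the note belongs to this wallet, otherwise undefined.
type TrialDecrypt = (ciphertext: string) => string | undefined;

class ViewIndex {
  private notes = new Map<string, DecryptedNote>();
  private syncedHeight = 0;

  /** Scans blocks in order, keeping only the notes this wallet can decrypt. */
  scan(blocks: Block[], tryDecrypt: TrialDecrypt): void {
    for (const block of blocks) {
      for (const note of block.notes) {
        const plaintext = tryDecrypt(note.ciphertext);
        if (plaintext !== undefined) {
          // Index the decrypted note locally, keyed by its on-chain commitment.
          this.notes.set(note.commitment, {
            commitment: note.commitment,
            height: block.height,
            plaintext,
          });
        }
      }
      this.syncedHeight = Math.max(this.syncedHeight, block.height);
    }
  }

  /** Highest block height scanned so far. */
  height(): number {
    return this.syncedHeight;
  }

  /** All notes visible to this wallet. */
  visibleNotes(): DecryptedNote[] {
    return Array.from(this.notes.values());
  }
}
```

The point of the sketch is the division of labor: the chain only ever exposes commitments and ciphertexts, while everything readable lives in the local index.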

The view service is designed as a common interface used by all software in the ecosystem, and there are two independent implementations of the view service:

  1. TypeScript / WebAssembly implementation in the in-browser wallet extension
  2. Pure Rust implementation in the core monorepo for command-line tasks

Our web wallet architecture in the browser leverages this design. Our local command-line client, pcli, uses the view service as a Rust library and relies on the Tokio async runtime to spawn tasks that construct a transaction in parallel. Exposing the same fine-grained parallelism to TypeScript / WebAssembly was the primary challenge: it required redesigning our Rust API to eliminate its tightly coupled dependency on Tokio. This involved breaking the top-level build method for generating transactions into its lower-level components, then stitching those components together in parallel in TypeScript. This unlocked a greater degree of parallelism, enabling a single transaction to be parallelized across dedicated web worker threads, rather than executed serially on the browser's main thread.

Parallelizing transactions with web workers

Transactions are modeled by a TransactionPlan, a declarative and complete plaintext description of the proposed transaction, allowing the user to review a summary of the transaction details prior to signing. The transaction-building process traverses the TransactionPlan, generates a zero-knowledge proof for each Action, performs the corresponding state changes, and constructs a fully formed Transaction.

We restructured the internals for building transactions, specifically decoupling the individual Actions (i.e., state changes performed by a transaction) from the TransactionPlan. This separation allows each Action within a TransactionPlan to be processed by its own web worker in TypeScript, so that each worker generates one of the distinct zero-knowledge proofs that make up the transaction. We subsequently deprecated the serial transaction implementation in favor of the multi-threaded version.
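As a rough sketch of this fan-out, the snippet below starts one build per Action and then reassembles the results in plan order. The types and the `buildTransactionParallel` function are hypothetical simplifications, not Penumbra's actual API. The `buildAction` callback stands in for "send this action to a dedicated web worker and wait for its proof", so the sketch stays independent of any particular worker setup.

```typescript
// Hypothetical, simplified types; the real TransactionPlan and Action are richer.
interface ActionPlan { kind: string; payload: string; }
interface TransactionPlan { actions: ActionPlan[]; }
interface Action { kind: string; proof: string; }
interface Transaction { actions: Action[]; }

// Stand-in for dispatching one action to its own web worker.
type BuildAction = (plan: ActionPlan) => Promise<Action>;

async function buildTransactionParallel(
  plan: TransactionPlan,
  buildAction: BuildAction,
): Promise<Transaction> {
  // Start every action build before awaiting any of them, so the builds overlap.
  const pending = plan.actions.map((actionPlan) => buildAction(actionPlan));

  // Promise.all resolves in input order, so actions keep their position in the plan.
  const actions = await Promise.all(pending);
  return { actions };
}
```

In an extension, `buildAction` would typically wrap a `postMessage` round trip to a worker; a serial implementation would instead await each build inside a loop.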

[Figure: diagram of the Offscreen API]

In our extension, the heavy lifting of frontend requests is carried out by the service worker, an event-driven script that runs in the background, separately from any web page. It intercepts network requests and handles events triggered by the browser. Unfortunately, service workers run in a constrained environment without standard DOM access, which poses a significant challenge: spawning web workers from within an extension service worker is currently not supported in Chrome.

We opted for a workaround recommended by Google engineers: the Offscreen API. It works by opening an invisible window (an offscreen document), transmitting messages to it, and using that window to spawn the dedicated workers. This lets the service worker drive web workers indirectly, and it was the key element that enabled multi-threading in our web extension.
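A minimal sketch of the setup step might look like the following. It assumes Chrome's `chrome.offscreen` API, specifically `hasDocument()` and `createDocument()` with the `WORKERS` reason. The `OffscreenApi` interface below is a hand-written structural stand-in so the logic can be exercised outside the browser, and `ensureOffscreenDocument` and the URL passed to it are hypothetical names, not taken from the Penumbra extension.

```typescript
// Structural stand-in for the two chrome.offscreen calls this sketch relies on.
// In an extension service worker you would pass `chrome.offscreen` itself.
interface OffscreenApi {
  hasDocument(): Promise<boolean>;
  createDocument(params: {
    url: string;
    reasons: string[];
    justification: string;
  }): Promise<void>;
}

// An extension may only have one offscreen document at a time, so concurrent
// callers share a single in-flight creation instead of racing each other.
let creating: Promise<void> | undefined;

async function ensureOffscreenDocument(api: OffscreenApi, url: string): Promise<void> {
  if (await api.hasDocument()) return;

  if (creating === undefined) {
    creating = api
      .createDocument({
        url,
        reasons: ["WORKERS"], // the document exists in order to spawn web workers
        justification: "Spawn web workers to build transactions in parallel",
      })
      .finally(() => {
        creating = undefined;
      });
  }
  await creating;
}
```

Once the document exists, the service worker can message it (for example with `chrome.runtime.sendMessage`), and the offscreen page can call `new Worker(...)` on its behalf.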

Optimistic proving

The objective is to reduce the client-side ZK proving time below the time it takes for a user to review the transaction approval dialogue. We carefully designed Penumbra’s cryptography to ensure that proving and signing can happen in parallel. This means we can start optimistically generating the computationally intensive zero-knowledge proofs while the user reviews the transaction, and submit instantly on approval.

Traditionally, the user would need to include the proof in their signed transaction, which prevents proving from running concurrently with the approval screen. In Penumbra, we designed the authentication mechanism so that the proofs are not covered by the signatures, but the commitments they prove statements about are. This ensures the user is signing the complete effects of the transaction, while allowing proving to run concurrently with authorization, or even long after it.
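The control flow this enables can be sketched as follows. All names here (`buildOptimistically`, `prove`, `approveAndSign`) are hypothetical, and strings stand in for real commitments, proofs, and signatures. The key assumption, taken from the paragraph above, is that the signature covers only the commitments, so the signing step never needs to wait for a proof.

```typescript
// Hypothetical, simplified shapes; strings stand in for real cryptographic objects.
interface Proof { commitment: string; bytes: string; }
interface BuiltTransaction { signature: string; proofs: Proof[]; }

async function buildOptimistically(
  commitments: string[],
  prove: (commitment: string) => Promise<Proof>,
  // Shows the approval dialog; resolves to a signature over the commitments,
  // or to undefined if the user rejects the transaction.
  approveAndSign: (commitments: string[]) => Promise<string | undefined>,
): Promise<BuiltTransaction | undefined> {
  // Kick off proving immediately, without awaiting it.
  const proving = Promise.all(commitments.map((c) => prove(c)));
  // Avoid an unhandled rejection if proving fails after the user has rejected.
  proving.catch(() => undefined);

  // The user reviews and signs while the proofs are being generated.
  const signature = await approveAndSign(commitments);
  if (signature === undefined) return undefined; // rejected: discard the proofs

  // If proving finished during review, this resolves right away.
  const proofs = await proving;
  return { signature, proofs };
}
```

When proving finishes before the user clicks approve, the final `await` adds no perceptible delay, which is the effect the optimistic design is aiming for.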

This small change to how data is hashed for signing provides a huge UX benefit: as long as proving is faster than the time it takes the user to review the transaction, the perceived proving speed is instant! While we’re not there yet, we’re confident we’ll be able to achieve this goal soon.

Groth16 proving keys

The web workers are responsible for loading the WebAssembly binary and invoking the relevant Wasm functions to construct a fully formed transaction. We observed that loading a larger binary file from within web workers significantly deteriorated performance. Initially, our Groth16 proving keys were bundled into the binary, which meant that workers would take longer to load and initialize that file. To minimize the loading times, we stripped the proving keys from the binary, reducing the size by 90% from 90 MB to 9 MB.

Downloading the proving keys is a one-time fixed cost, paid when the user downloads the extension rather than at transaction time. In addition, each transaction incurs a fixed cost in which each web worker performs a two-step process: (1) fetch the binary proving keys from disk, and (2) dynamically load the proving keys into memory at runtime. Each worker fetches only the proving keys relevant to that specific transaction, rather than all of them. Through these optimizations, we've managed to reduce key loading times by about 95%, from ~3500 ms to 150 ms.
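The selective loading step can be sketched as below. This is an illustration under assumptions, not the extension's code: `loadProvingKeys` and `KeyFetcher` are hypothetical names, and the fetcher is passed in as a stand-in for reading a key file from the extension's bundled assets.

```typescript
// Stand-in for reading one proving key from disk, keyed by action kind.
type KeyFetcher = (actionKind: string) => Promise<Uint8Array>;

/**
 * Loads only the proving keys needed by the given action kinds.
 * Duplicate kinds are fetched once, and unrelated keys are never touched.
 */
async function loadProvingKeys(
  actionKinds: string[],
  fetchKey: KeyFetcher,
): Promise<Map<string, Uint8Array>> {
  // Deduplicate so that two actions of the same kind share one fetch.
  const needed = Array.from(new Set(actionKinds));

  // Fetch the required keys concurrently.
  const entries = await Promise.all(
    needed.map(async (kind): Promise<[string, Uint8Array]> => [kind, await fetchKey(kind)]),
  );
  return new Map(entries);
}
```

A worker that handles a single action would call this with just that action's kind, so its startup cost scales with one key rather than with the full key set.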

Getting started

Penumbra's web extension is now published to the Chrome Web Store. The frontend site is available at https://app.testnet.penumbra.zone. Both are still a work in progress, and we appreciate any and all feedback in Discord. To stay up to date with technical developments, follow Penumbra Labs on X.

Stay tuned for a future post on how we’ll accelerate the user and client-side experience even more with major WebGPU improvements as we progress on our path to mainnet.