Acceleration of ZK Proof

We adopt the open-source zero-knowledge proof library bellman_ce for circuit development and zero-knowledge proof generation. However, it is very slow to generate a plonk proof by using the native bellman_ce library as it takes 12 minutes to generate a proof for the circuit of 67 million gates, which is far from meeting our business needs. Therefore, it is imperative to accelerate zero-knowledge proof.
We have a pool of FPGA talents, so our initial thought was to use FPGA acceleration. The Plonk proof process is very cumbersome, while the FPGA capacity is limited and cannot perform a large number of calculations. Therefore, only the tasks that take a long time are outsourced to the FPGA for calculation. In the entire Plonk proof process, the most time-consuming operation is to use FFT to find the value of the polynomial or use iFFT to solve the polynomial, and to calculate the Kate commitment of the polynomial. The operations involved are FFT and multi-exponentiation, which account for 70% of the time consumed by the entire plonk-proof process. According to initial estimates, for a circuit of 67 million gates, without considering the FPGA bandwidth, it takes six minutes to calculate FFT and Multi-Exponentiation with a single FPGA board. If four boards are employed at the same time, it takes 1.5 minutes in theory. However, a lot of pre-calculations are needed, and the results need to be stored on TB-level hard disks. With a large number of FPGA boards come the issues of poor stability and subsequent maintenance. In addition, the most important reason is that FPGA development takes a long time. In order to launch the project as soon as possible, the FPGA solution has to be implemented later.

GPU acceleration

GPU acceleration is our second option. Similar to the FPGA solution, FFT and Multi-Exponentiation are accelerated by GPU, while other operations are still performed by CPU. Compared with the FPGA solution, the engineering implementation is less difficult. In order to speed up the implementation of the project, we use a cloud GPU server for debugging.
Firstly, by borrowing filecoin's GPU optimization for groth16, we implemented the GPU optimization for plonk by using multiple GPUs to accelerate FFT and Multi-Exp. With multiple GPUs, the processing methods are different when computing FFT and Multi-Exp in parallel. For Multi-Exp calculation, multiple GPUs complete a Multi-Exp in parallel; for FFT calculation, each GPU completes an FFT. Since up to four FFTs need to be calculated at the same time during the proof process, four graphics cards are supported at most. If more graphics cards are used, FFT calculation time will not be shortened, and Multi-Exp calculation will be slightly shortened (no obvious improvement). Given the tradeoff between the cost and the acceleration effect, we finally adopted two graphics cards to perform two FFTs at the same time, or to accelerate one Multi-Exp at one time.
For the circuit of 67 million gates, after implementing the acceleration of FFT and Multi-Exp, the proving time was reduced from 12 minutes to 7 minutes. Two problems were involved here. The first problem is that the video memory of a single T4 graphics card is 16GB, which is insufficient for calculating the FFT with a length of 2^28, and hence GPU calculation cannot be conducted. As a result, CPU calculation was carried out, resulting in a noticeable increase in proving time. For this reason, we split an FFT with a length of 2^28 into two FFTs with a length of 2^27, and then perform a butterfly transform on the results to obtain the result of a 2^28 FFT. The second problem is that FFTs are not implemented in parallel, which means that only one FFT is done at a time and only one graphics card is working. For this reason, we continued to optimize and modify the implementation of FFT to support simultaneous calculation of multiple FFTs.
The acceleration effect is not so ideal by solely using GPU to calculate FFT and Multi-Exp, as the overall performance is improved only by 40% or so. For this reason, we continue to do software optimization by optimizing each round of Plonk proof. Specifically, we replaced the less efficient rust statement, and removed useless calculation logic. In addition, some common operations are moved to the pre-calculation part. After a series of optimizations, for the circuit of 67 million gates, our proving time is controlled within 3.5 minutes.
The following table shows the time required for Plonk proof of different numbers of gates.
un-optimized results
(in seconds)
optimized results
(in seconds)
At present, we are still working on the optimization of Plonk proof. We are going to go deep into the underlying libraries pairing_ce and ff_ce to optimize basic operations such as point multiplication, point addition and scalar multiplication of BN254, in a hope to further improve the proving performance.