pablogf-uma/chacha20-128-vec


ChaCha20 128-Bit Vectorized Implementation

This is an optimized implementation of the ChaCha20 algorithm. It uses SIMD (Single Instruction, Multiple Data) instructions to process several state words at once, instead of the word-by-word sequential processing of a regular (scalar) ChaCha20 implementation.

This version utilizes Intel Intrinsics through the header immintrin.h.

It is called 128-bit vectorization because the state is held in four 128-bit vectors. Each vector is one complete row of the state matrix (4 words x 4 bytes per word x 8 bits per byte = 128 bits).
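As a minimal sketch of that layout, the function below packs the 4x4 state matrix of RFC 8439 into four 128-bit row vectors. The name load_state_rows and its signature are assumptions made for illustration and are not taken from this repository.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the repository): fill rows[0..3] with the
 * initial ChaCha20 state, one 128-bit vector per row of the 4x4 matrix. */
void load_state_rows(__m128i rows[4], const uint8_t key[32],
                     uint32_t counter, const uint8_t nonce[12])
{
    uint32_t n[3];

    /* Row 0: the constants "expand 32-byte k". _mm_set_epi32 takes the
     * highest lane first, so word 0 is the last argument. */
    rows[0] = _mm_set_epi32(0x6b206574, 0x79622d32, 0x3320646e, 0x61707865);

    /* Rows 1 and 2: the 256-bit key, read as little-endian 32-bit words
     * (x86 is little-endian, so a plain unaligned load is enough). */
    rows[1] = _mm_loadu_si128((const __m128i *)key);
    rows[2] = _mm_loadu_si128((const __m128i *)(key + 16));

    /* Row 3: block counter followed by the 96-bit nonce. */
    memcpy(n, nonce, sizeof n);
    rows[3] = _mm_set_epi32((int)n[2], (int)n[1], (int)n[0], (int)counter);
}
```

With this layout, lane i of every row vector belongs to column i of the matrix, which is what lets one vector instruction act on all four columns at once.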

This vectorization collapses 4 consecutive quarter-round operations of the non-vectorized ChaCha20 into a single vectorized operation:

  • One vectorized operation (a "whole round") runs 4 quarter rounds in parallel, one per column of the state.
  • The operation I called a "double whole round" concatenates the column round and the diagonal round one after the other, so it needs just two vectorized operations instead of eight scalar quarter rounds.
  • The name reads as: Double (columns + diagonals) Whole (4 parallel quarter rounds) Round.
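The double whole round described above can be sketched with SSE2 intrinsics as follows. This is an illustration under stated assumptions, not the repository's code: the names ROTL32, whole_round and double_whole_round are hypothetical, and the diagonals are lined up into columns with lane shuffles, which is one common way to do it.

```c
#include <immintrin.h>
#include <stdint.h>

/* Rotate every 32-bit lane left by n bits (n must be a constant). */
#define ROTL32(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/* Hypothetical name: one "whole round", i.e. 4 quarter rounds in parallel.
 * Lane i of a, b, c, d holds the four inputs of quarter round i. */
void whole_round(__m128i *a, __m128i *b, __m128i *c, __m128i *d)
{
    *a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL32(*d, 16);
    *c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32(*b, 12);
    *a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL32(*d, 8);
    *c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32(*b, 7);
}

/* Hypothetical name: column round followed by diagonal round. */
void double_whole_round(__m128i r[4])
{
    /* Column round: rows are already aligned column-wise. */
    whole_round(&r[0], &r[1], &r[2], &r[3]);

    /* Rotate rows 1, 2, 3 left by 1, 2, 3 lanes so that each diagonal
     * of the matrix becomes a column. */
    r[1] = _mm_shuffle_epi32(r[1], _MM_SHUFFLE(0, 3, 2, 1));
    r[2] = _mm_shuffle_epi32(r[2], _MM_SHUFFLE(1, 0, 3, 2));
    r[3] = _mm_shuffle_epi32(r[3], _MM_SHUFFLE(2, 1, 0, 3));

    /* Diagonal round: same code, now operating on the diagonals. */
    whole_round(&r[0], &r[1], &r[2], &r[3]);

    /* Rotate the rows back to restore the original column order. */
    r[1] = _mm_shuffle_epi32(r[1], _MM_SHUFFLE(2, 1, 0, 3));
    r[2] = _mm_shuffle_epi32(r[2], _MM_SHUFFLE(1, 0, 3, 2));
    r[3] = _mm_shuffle_epi32(r[3], _MM_SHUFFLE(0, 3, 2, 1));
}
```

Ten calls to double_whole_round give the 20 rounds of ChaCha20. The block function then adds the original state to the result, a step this sketch leaves out.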

Resources:

These resources were used throughout the development of the project outlined above.

  • Paper: Original research paper on ChaCha20 by Daniel J. Bernstein.
  • RFC 8439: Standardized specification of ChaCha20 and its use in internet protocols.
  • SIMD Programming Blog entry: In-depth explanation on how to program with SIMD instructions.

Tests

The compiled executable, chacha20, supports the following options for testing and usage:

  • Run all available tests: test vectors (encryption and decryption) and clock cycle tests.
./chacha20 --all-tests
  • Encrypt test vector N (where N is from 1 to 5, one for each test vector in RFC 8439).
./chacha20 --enc-tv N
  • Decrypt test vector N (where N is from 1 to 5).
./chacha20 --dec-tv N
  • Run clock cycles test N (where N is from 1 to 5, one for each plaintext length used by ECRYPT).
./chacha20 --clock-ct N
  • Encrypt a custom input provided by the user.
./chacha20 --enc-ci
  • Decrypt a custom input provided by the user.
./chacha20 --dec-ci

Performance

This implementation of ChaCha20 has been optimized for performance and includes benchmarking scripts to measure its efficiency. The results are comparable to established benchmarks, such as those available at ECRYPT. To measure the cycles per byte, use the provided automation scripts:

Windows

Run the clock_cycles_tests.bat script:

clock_cycles_tests.bat

Linux

Run the clock_cycles_tests.sh script:

clock_cycles_tests.sh

These scripts perform automated benchmarking of the ChaCha20 implementation, similar to how benchmarks are conducted for cryptographic algorithms at ECRYPT. The average and median results in cycles per byte are as follows:

Plaintext length   Average (cycles/byte)   Median (cycles/byte)
8 bytes            167.80                  158.88
64 bytes           15.32                   14.96
576 bytes          10.73                   9.65
1536 bytes         10.78                   9.22
4096 bytes         10.30                   9.39
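For readers who want to see how a cycles-per-byte figure and its average and median can be derived, here is a rough sketch. It assumes GCC or Clang on x86 (for __rdtsc from x86intrin.h) and uses hypothetical names throughout (encrypt_fn, cycles_per_byte, sample_average, sample_median). The repository's own scripts may measure differently.

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h> /* __rdtsc on GCC/Clang; MSVC provides it in <intrin.h> */

/* Hypothetical signature for the routine being timed. */
typedef void (*encrypt_fn)(uint8_t *buf, size_t len);

/* One sample: elapsed time-stamp-counter ticks divided by the input length. */
double cycles_per_byte(encrypt_fn fn, uint8_t *buf, size_t len)
{
    uint64_t start = __rdtsc();
    fn(buf, len);
    uint64_t end = __rdtsc();
    return (double)(end - start) / (double)len;
}

/* Comparison callback for qsort over doubles. */
int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
double sample_average(const double *s, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i];
    return sum / (double)n;
}

/* Median of n samples; sorts the array in place. */
double sample_median(double *s, size_t n)
{
    qsort(s, n, sizeof s[0], cmp_double);
    return (n % 2) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
}
```

Reporting the median next to the average is useful because a few samples disturbed by interrupts or frequency changes pull the average up much more than the median, which matches the gap between the two columns above.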
