[Gnuk-users] Gnuk on a faster MCU

Aurelien Jarno aurelien at aurel32.net
Sat Sep 9 23:16:41 UTC 2017


Hi all,

Antoine Beaupré showed a few days ago that Gnuk running on FST-01 is not
really fast, at least for RSA. I know that ED25519 is quite fast on
FST-01, but I would personally still like to have a faster GPG
token using Free Software for RSA algorithms.

I therefore started to prototype things a bit, and I "ported" Gnuk to an
STM32L432 MCU. I say "ported" because I have done things quick and dirty
and the keys are not even stored in flash, but in RAM. This MCU has a
Cortex-M4 CPU running at 80MHz and tiny caches (1kB for instructions,
256B for data). It's available in a QFN32 case, even smaller than the
STM32F103. It's also able to do crystal-less USB (I haven't tried yet).

On such a CPU, Gnuk is able to do an RSA2048 decryption in 0.84s and
an RSA4096 decryption in 5.18s (vs 1.27s and 8.22s on FST-01). The gains
are mainly due to the instruction cache, as it hides the wait states of
the flash memory. The remaining gain comes from the single-cycle
multiply-and-add instructions. I have been able to get these down to
0.65s and 3.87s respectively by using the UMAAL DSP instruction in
MULADDC and mpi_montsqr.
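
To illustrate why UMAAL fits this kind of code, here is a minimal sketch
in portable C of what the instruction computes and how a word-by-word
multiply-accumulate loop can use it. This is not the actual Gnuk code:
the names umaal() and mul_add_words() are hypothetical, and on a
Cortex-M4 the helper would map to a single UMAAL instruction instead of
the 64-bit arithmetic shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable model of UMAAL: (hi:lo) = a * b + lo + hi.
   The result always fits in 64 bits:
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static inline void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t r = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}

/* Hypothetical helper: d[0..n-1] += s[0..n-1] * b.
   Returns the carry word that falls out of d[n-1]. */
uint32_t mul_add_words(uint32_t *d, const uint32_t *s, size_t n, uint32_t b)
{
    uint32_t carry = 0;

    assert(d != NULL && s != NULL);
    for (size_t i = 0; i < n; i++) {
        /* One step: d[i] receives the low word, carry the high word. */
        umaal(&d[i], &carry, s[i], b);
    }
    return carry;
}
```

Each loop iteration folds the multiply, the accumulate into d[i] and the
carry propagation into one operation, which is where the saving over
separate multiply and add-with-carry instructions comes from.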

I am still pondering whether to try an even faster MCU, like an STM32F4
at 168MHz, even if it comes in a bigger LQFP64 case. I would consider
a < 2s signature / decryption for RSA4096 to be something
acceptable.
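
As a back-of-the-envelope check of that target, the sketch below scales a
measured time by the clock ratio. It assumes that runtime is inversely
proportional to the clock frequency, which ignores flash wait states and
cache differences between MCU families, so the result is only a rough
estimate and not a measurement. The function name scale_time() is
hypothetical.

```c
#include <assert.h>

/* Estimate the runtime at a new clock, assuming time scales as 1/f. */
double scale_time(double seconds, double from_mhz, double to_mhz)
{
    assert(from_mhz > 0.0 && to_mhz > 0.0);
    return seconds * from_mhz / to_mhz;
}
```

Under that assumption, scale_time(3.87, 80.0, 168.0) gives about 1.84s
for RSA4096, which would be just under the 2s mark.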

Anyway, to move things forward, I need to port things more cleanly.
Chopstx and Gnuk seem to have been designed with portability in mind.
That said, over time it seems many STM32F103 assumptions have been
added. I believe most of them can be removed relatively easily, even if
it implies moving some code around. It seems the biggest portability issue
concerns the flash. The current code assumes that the pages are small
(1 or 2kB) and that the writes are done 2 bytes by 2 bytes. These
assumptions are used in src/flash.c, but also define the format of the
data in src/openpgp-do.c. The flash is quite different on newer STM32
families:
- The STM32L4 family has an ECC flash, which requires writes to be done
  8 bytes by 8 bytes. The pages are 2kB long.
- The STM32F4 family has a flash organized by sectors instead of pages,
  with 4 sectors of 16kB, 1 sector of 64kB and many sectors of 128kB.
  This is not compatible with the current segmentation which requires
  6 pages for the data pool, the keystore pool and the ch_certificate.
  The write size can be chosen dynamically from 1 to 8 bytes.
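
One way to picture these differences is a small geometry description
that the flash code could consult instead of hard-coding page and write
sizes. The sketch below is only an illustration: none of these types or
functions exist in Gnuk or Chopstx, and the unit counts in the two
example tables are illustrative values, not taken from a datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a run of equally sized erase units. */
struct flash_region {
    uint32_t unit_size;   /* erase unit (page or sector) size in bytes */
    uint32_t unit_count;  /* number of consecutive units of that size */
};

struct flash_geometry {
    uint32_t write_size;  /* smallest programmable chunk in bytes */
    const struct flash_region *regions;
    size_t region_count;
};

/* STM32L4-like layout: uniform 2kB pages, 8-byte writes. */
const struct flash_region l4_regions[] = { { 2 * 1024, 128 } };
const struct flash_geometry l4_geom = { 8, l4_regions, 1 };

/* STM32F4-like layout: 4 x 16kB, 1 x 64kB, then 128kB sectors.
   The write size is selectable on this family; 1 is chosen here. */
const struct flash_region f4_regions[] = {
    { 16 * 1024, 4 }, { 64 * 1024, 1 }, { 128 * 1024, 7 },
};
const struct flash_geometry f4_geom = { 1, f4_regions, 3 };

/* Round a record length up to a whole number of write chunks. */
uint32_t flash_padded_len(const struct flash_geometry *g, uint32_t len)
{
    assert(g->write_size != 0);
    return (len + g->write_size - 1) / g->write_size * g->write_size;
}

/* Size of the erase unit containing byte offset off, or 0 if the
   offset lies beyond the described flash. */
uint32_t flash_erase_unit_at(const struct flash_geometry *g, uint32_t off)
{
    uint32_t base = 0;

    for (size_t i = 0; i < g->region_count; i++) {
        uint32_t span = g->regions[i].unit_size * g->regions[i].unit_count;
        if (off - base < span)
            return g->regions[i].unit_size;
        base += span;
    }
    return 0;
}
```

With such a table, a 13-byte object would occupy 16 bytes on the
L4-like layout, and the data format in src/openpgp-do.c would no longer
need to assume 2-byte writes.
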
I wonder if one way to fix that would be to use a single data pool,
with the possibility to store longer objects like keys or certificates.
It would mean triggering the garbage collector each time sensitive
data like a private key is removed or replaced. This is however a
significant change to the current code.
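
The single-pool idea can be sketched as an append-only log of tagged
records with a compacting garbage collector. The code below models the
pool in RAM only; the record layout, the tag values and all function
names are assumptions made for illustration and do not reflect the
existing format. On real flash the collector would have to copy live
records to a spare erase unit and then erase the old one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4096u
#define TAG_FREE  0xFFu   /* erased flash reads as 0xFF */
#define TAG_DEAD  0x00u   /* record removed, awaiting collection */

/* RAM stand-in for a single flash data pool (hypothetical). */
struct pool {
    uint8_t mem[POOL_SIZE];
    uint32_t used;
};

void pool_init(struct pool *p)
{
    memset(p->mem, 0xFF, POOL_SIZE);
    p->used = 0;
}

/* Record layout: tag (1 byte), length (2 bytes, little endian), payload. */
int pool_put(struct pool *p, uint8_t tag, const void *data, uint16_t len)
{
    assert(tag != TAG_FREE && tag != TAG_DEAD);
    if (p->used + 3u + len > POOL_SIZE)
        return -1;                      /* pool full */
    uint8_t *r = p->mem + p->used;
    r[0] = tag;
    r[1] = (uint8_t)(len & 0xFF);
    r[2] = (uint8_t)(len >> 8);
    memcpy(r + 3, data, len);
    p->used += 3u + len;
    return 0;
}

/* Compact live records to the front and erase everything after them,
   so that no byte of a removed record survives. */
void pool_gc(struct pool *p)
{
    uint32_t src = 0, dst = 0;

    while (src < p->used) {
        uint32_t size = 3u + (p->mem[src + 1] | ((uint32_t)p->mem[src + 2] << 8));
        if (p->mem[src] != TAG_DEAD) {
            memmove(p->mem + dst, p->mem + src, size);
            dst += size;
        }
        src += size;
    }
    memset(p->mem + dst, 0xFF, POOL_SIZE - dst);
    p->used = dst;
}

/* Mark every record with this tag dead; collect at once if sensitive. */
void pool_remove(struct pool *p, uint8_t tag, int sensitive)
{
    for (uint32_t off = 0; off < p->used; ) {
        uint32_t len = p->mem[off + 1] | ((uint32_t)p->mem[off + 2] << 8);
        if (p->mem[off] == tag)
            p->mem[off] = TAG_DEAD;
        off += 3u + len;
    }
    if (sensitive)
        pool_gc(p);
}
```

Non-sensitive objects can stay marked dead until the pool fills up,
while removing a key forces an immediate collection, which is the extra
cost mentioned above.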

Any comments or suggestions are welcome. Following that, I'll try to
clean up my changes and submit them.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien at aurel32.net                 http://www.aurel32.net