Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STM32 Hash Accelerator #2528

Merged
merged 18 commits into from
Feb 13, 2024
Merged

STM32 Hash Accelerator #2528

merged 18 commits into from
Feb 13, 2024

Conversation

caleb-garrett
Copy link
Contributor

@caleb-garrett caleb-garrett commented Feb 3, 2024

Adds STM32 Hash accelerator HAL to embassy-stm32. This includes support for the MD5, SHA1, SHA224, SHA256, SHA-512/224, SHA-512/256, and SHA-512 algorithms (depending on device). HMAC support is the only feature of this peripheral that is not yet implemented.

The API mimics the start/update/finish of other popular crypto libraries, so integration is theoretically straightforward. This includes use of the context switching feature where one digest calculation may be paused for another, then later resumed. This is particularly useful when digests are computed in a concurrent TX/RX scenario.

An example for the STM32F7 series is included that demonstrates use and compares execution time of the SHA256 algorithm with the sha2 crate. The hardware calculation runs >3x faster than the sha2 crate's implementation in my tests.

@caleb-garrett caleb-garrett marked this pull request as draft February 4, 2024 11:55
@caleb-garrett
Copy link
Contributor Author

Note that the hash_v1 implementation differs from the hash_v2 and hash_v3 implementations in that it does not integrate DMA. While the peripheral supports DMA, it does not support multiple DMA transfers (MDMAT). This makes it impractical for DMA to be used in a context-switching scenario. Thus, DMA is only implemented for hash_v2 and hash_v3.

@caleb-garrett caleb-garrett marked this pull request as ready for review February 4, 2024 22:32
@Dirbaio
Copy link
Member

Dirbaio commented Feb 4, 2024

Very cool!

Can you add on-hardware tests for HASH here? This is a prime candidate for testing since it needs no connections to external hardware.

@caleb-garrett
Copy link
Contributor Author

Will do!

@Dirbaio Dirbaio added the trusted label Feb 5, 2024
@Dirbaio
Copy link
Member

Dirbaio commented Feb 5, 2024

the trusted label makes CI run the tests on hardware every time you push :)

@caleb-garrett
Copy link
Contributor Author

50% of the 8 supported boards failed on the first HIL round :( Two of the boards failed due to an apparent error in the quantity of context registers that needs to be corrected in the stm32-data repo. I will open a PR for that.

The other two failures were the two STM32U5 boards. I don't currently have this hardware on hand and the log is a little vague:

ERROR teleprobe::client > === nucleo-stm32u5a5zj out/tests/stm32u5a5zj/hash: FAILED: HTTP request failed with status code: 400: Bad Request
ERROR teleprobe::client > INFO - run_from_ram: true
INFO - flashing program...
INFO - flashing done!
TRACE - 0.000000 BDCR ok: 0c00a200
DEBUG - 0.000000 rcc: Clocks { audioclk: None, hclk1: Some(Hertz(48000000)), hclk2: Some(Hertz(48000000)), hclk3: Some(Hertz(48000000)), hse: None, hsi: None, hsi48: Some(Hertz(48000000)), hsi48_div_2: None, lse: None, lsi: None, msik: None, pclk1: Some(Hertz(48000000)), pclk1_tim: Some(Hertz(48000000)), pclk2: Some(Hertz(48000000)), pclk2_tim: Some(Hertz(48000000)), pclk3: Some(Hertz(48000000)), pll1_p: None, pll1_q: None, pll2_p: None, pll3_q: None, rtc: Some(Hertz(32000)), sys: Some(Hertz(48000000)) }
WARN - Deadline exceeded!
INFO - R0: 00000001 R1: 00000000 R2: 00000004 R3: 00000000
INFO - R4: 20007fe0 R5: 20007fe0 R6: 20004db0 R7: 20007fd8
INFO - R8: ffffffff R9: 00000000 R10: 00000001 R11: 00000101
INFO - R12: 4000000c SP: 20007fd0 LR: 20002df5 PC: 20001710
INFO - XPSR: 29000000
INFO -
INFO - Stack:
INFO - 20007fd0: 20007fe0 20007fe8 20007ff0 2000181b
INFO - 20007fe0: 00000000 ffffffff 00000000 5cc6e200
INFO - 20007ff0: 20007ff8 20001805 00000000 20000277
INFO - 20008000: c1a55e6c f19f6939 77c5c992 c338d887
INFO - 20008010: 24103718 3f7f34f3 2f9a8695 fbda6252
INFO - 20008020: d0c8dbc0 01121e5f 304fb27a d6f8d21d
INFO - 20008030: 1e0a47db 6af3cd36 2739c5c0 6f964eba
INFO - 20008040: 488f00be 7cdfab77 adeb4b11 e7790f73
INFO -
INFO - Backtrace:
WARN - UNWIND: Error while checking for exception context. The stack trace will not include the calling frames. : Register error: No value for Return Address register. Please report this as a bug.
INFO - Frame 0: <unknown function @ 0x20001710> @ 0x20001710
/ci/code/embassy-executor/src/arch/cortex_m.rs:106:21
INFO - Frame 1: __cortex_m_rt_main @ 0x2000181a
/ci/code/tests/stm32/src/bin/hash.rs:34:1
INFO - Frame 2: __cortex_m_rt_main_trampoline @ 0x20001804
/ci/code/tests/stm32/src/bin/hash.rs:34:1
INFO - Frame 3: <unknown function @ 0x20000276> @ 0x20000276
INFO - Frame 4: <unknown function @ 0x20000276> @ 0x20000276
ERROR - Run failed: Deadline exceeded

The affected line in source is:

let dma = peri!(p, HASH_DMA); // HASH_DMA set to GPDMA1_CH0
let mut hw_hasher = Hash::new(p.HASH, dma);

I assume it has something to do with the DMA configuration. I can work on getting hold of an STM32U5 Nucleo board to debug. In the meantime, if anyone has any insight I would appreciate it :)

@Dirbaio
Copy link
Member

Dirbaio commented Feb 5, 2024

the "WARN - UNWIND: Error while checking for exception context" is a red herring, it's an error trying to print the stacktrace. by then the test has already failed. The "real" error is " WARN - Deadline exceeded!". Tests have a timeout of 10 seconds by default, "deadline exceeded" means the test took longer than that (i.e. it's hanging)

@Dirbaio
Copy link
Member

Dirbaio commented Feb 6, 2024

if you want, I can give you a teleprobe token so you can locally cargo run remotely into the test farm, so you don't have to wait for CI. Helps with fast troubleshooting if you want to e.g. add more log printing. Ping me on matrix if you're interested!

@caleb-garrett
Copy link
Contributor Author

I was able to debug the HIL build issue on a NUCLEO-U5A5ZJ and uncovered a very frustrating issue. There is in fact a 4th variant of this peripheral that is used on the STM32WBA, STM32H5, and STM32U5 series. From the STM32 reference manual, there is virtually no difference in this peripheral from the v2 variant besides the memory mapping of ALGO bits. (I suspect the real difference is under the hood to support GPDMA on these chips.)

The HIL test is timing out on these chips because when multiple DMA transfers are enabled (MDMAT=1), the data input interrupt status bit (DINIS) is never asserted after a DMA transfer. This prevents context switching because the CSRx registers are never loaded. Note that the peripheral otherwise works exactly as expected as long as you don't attempt to context switch between DMA transfers. I wonder if this is a silicon issue and I'm just the first person to try this particular feature on these chips...

My current plan is to add a v4 variant of this peripheral without DMA support. Not ideal, but still considerably faster than a software implementation.

@Dirbaio
Copy link
Member

Dirbaio commented Feb 9, 2024

the L0 ADC failures should be fixed in git main, could you rebase?


/// Computes a digest for the given context. A slice of the provided digest buffer is returned.
/// The length of the returned slice is dependent on the digest length of the selected algorithm.
pub async fn finish<'a>(&mut self, mut ctx: Context, digest: &'a mut [u8; MAX_DIGEST_SIZE]) -> &'a [u8] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this signature could be improved. We shouldn't require the user to allocate MAX_DIGEST_SIZE bytes if they're using an algorithm with a smaller digest.

Also, returning a subslice of the passed-in slice sometimes run into trouble with the borrow checker. In the rest of Embassy we follow the other convention, which is returning the written length as usize.

so I suggest changing it to:

pub async fn finish(&mut self, mut ctx: Context, digest: &mut [u8]) -> usize {

it could either panic or return an error if the buffer is too short.

/// Instantiates, resets, and enables the HASH peripheral.
pub fn new(
peripheral: impl Peripheral<P = T> + 'd,
dma: impl Peripheral<P = D> + 'd,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm somewhat worried about code duplication here, also the fact that the Hash::new() signature changes between chip families which is something we try to avoid, since ot helps writing portable code.

I think it'd be better to have 2 constructors: new with DMA and new_blocking without (which sets the D generic param to NoDma). So the same driver can both be used with and without DMA. Then you can cfg the DMA one to only v2.

This is what other drivers like UART, SPI etc already do, you might want to look at their code to see how it works.

The advantage is now you're not forced to use DMA on v2, so you can write code that works on all families as long as you don't use DMA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like other STM32 drivers are only implementing new and not new_blocking. Could you clarify how this should look? For instance this doesn't compile:

pub struct Hash<'d, T: Instance, D = NoDma> {
    _peripheral: PeripheralRef<'d, T>,
    dma: PeripheralRef<'d, D>,
}

impl<'d, T: Instance, D> Hash<'d, T, D> {
    pub fn new_blocking(
        peripheral: impl Peripheral<P = T> + 'd,
        irq: impl interrupt::typelevel::Binding<T::Interrupt, InterruptHandler<T>> + 'd,
    ) -> Self {
        HASH::enable_and_reset();
        let no_dma = NoDma;
        into_ref!(peripheral, no_dma);
        let instance = Self {
            _peripheral: peripheral,
            dma: no_dma, // ERROR: expected struct `PeripheralRef<'d, D>', found struct `PeripheralRef<'_, NoDma>`
        };

        T::Interrupt::unpend();
        unsafe { T::Interrupt::enable() };

        instance
    }
}

Forgive me if this is a dumb question, I'm still new to Rust.

T::regs().str().write(|w| w.set_dcal(true));

// Wait for completion.
poll_fn(|cx| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO these should just do a blockingbusy-loop wait. The manual says these computations take ~60 cycles. The cost of an async context switch is ~100 cycles in the very best case, likely more. So using async waits here is likely to make things slower for no benefit. (same for the poll_fn in store_context).

Another benefit of doing blocking waits is now the entire hash computation can be blocking (when not using DMA), which is required in some cases, for example when implementing the RustCrypto traits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I will make that adjustment.

I was actually looking at the RustCrypto traits today. If implementing a trait such as Digest is desired or appropriate to include within this project I can certainly contribute towards that.

@caleb-garrett
Copy link
Contributor Author

I removed the hash test from the STM32U5A5ZJ. The application won't fit due to incorrect memory size for this chip noted here: embassy-rs/stm32-data#301

Copy link
Member

@Dirbaio Dirbaio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice, thank you! :)

@Dirbaio Dirbaio added this pull request to the merge queue Feb 13, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 13, 2024
@Dirbaio
Copy link
Member

Dirbaio commented Feb 13, 2024

f103 failure is unrelated (I bricked it refactoring RCC 😭 )

@Dirbaio Dirbaio merged commit 8c82d1b into embassy-rs:main Feb 13, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants