Have a Move-In-Move-Out DMA transaction API in addition (again) #1512

Closed

bjoernQ opened this issue Apr 26, 2024 · 19 comments
Labels: peripheral:dma DMA Peripheral, status:needs-attention This should be prioritized

@bjoernQ (Contributor) commented Apr 26, 2024

In #1238 using DMA was made easier, but it turned out that the new API doesn't support all the use cases the previous API supported.

In particular, something like https://github.com/georgik/esp-display-interface-spi-dma isn't possible this way.

(Using Polonius helps somewhat but still doesn't fully solve the issue.)

@MabezDev (Member)

Could you briefly explain the lifetime issue you have run into and how the old API made it viable?

@bjoernQ (Contributor, Author) commented Apr 27, 2024

Sure! It's probably easiest to explain with some code that (hopefully) resembles the problem:

use std::{marker::PhantomData, ptr::addr_of_mut};

fn main() {
    let dma = Dma {
        phantom: PhantomData,
    };

    let mut foo = Foobar {
        dma,
        inflight: None,
    };

    foo.process();
    // foo.process(); // foo is still borrowed mutably
}

struct Foobar<'d, 't> {
    dma: Dma<'d>,
    inflight: Option<Transaction<'t>>,
}

impl<'d, 't> Foobar<'d, 't> {

    fn process(&'d mut self) where 'd: 't {
        if let Some(previous) = self.inflight.take() {
            previous.wait();
        }

        let mut count = 5;
        let transaction = loop {
            let transaction = self.dma.transaction(dma_buffer()); // without Polonius: `self.dma` was mutably borrowed here in the previous iteration of the loop
            if count > 0 {
                transaction.wait();
                count -= 1;
            } else {
                break transaction;
            }
        };
        self.inflight.replace(transaction);
    }
}

fn dma_buffer() -> &'static mut [u8; 1024] {
    static mut BUFFER: [u8; 1024] = [0u8; 1024];
    unsafe { &mut *addr_of_mut!(BUFFER) }
}

struct Dma<'d> {
    phantom: PhantomData<&'d ()>,
}

impl<'d> Dma<'d> {
    pub fn transaction<'t>(&'t mut self, _buffer: &'t mut [u8]) -> Transaction<'t> {
        Transaction {
            phantom: PhantomData,
        }
    }
}

struct Transaction<'t> {
    phantom: PhantomData<&'t ()>,
}

impl<'t> Transaction<'t> {
    pub fn wait(self) {}
}

With the second call to process commented out, this compiles with -Zpolonius but not without it.
Even with Polonius, it doesn't compile once the second call to process is present.

The old API moved the SPI instance and the buffers into the transaction, so the "inflight" transaction could be stored in a struct field and finished in the next iteration.
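
Roughly, that old shape looked something like this (a simplified sketch, not the exact old esp-hal signatures):

impl<'d> SpiDma<'d> {
    // moves the driver and the buffer into the returned transfer
    pub fn dma_write(self, buffer: &'static mut [u8]) -> SpiDmaTransfer<'d>;
}

impl<'d> SpiDmaTransfer<'d> {
    // blocks until the DMA is done, then hands the driver and buffer back
    pub fn wait(self) -> (SpiDma<'d>, &'static mut [u8]);
}

Because the transfer owns everything it needs, it can be stored in a struct field without keeping a borrow of self alive.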

@bjoernQ (Contributor, Author) commented Apr 27, 2024

I might have an alternative idea. I need to try it out.

bjoernQ mentioned this issue Apr 29, 2024
bjoernQ self-assigned this Apr 29, 2024
@Dominaezzz (Collaborator)

I happen to also have a use case for Move-In-Move-Out (DMA) transactions, where I want to schedule semi-continuous transfers using interrupts, with minimal involvement from the application code.

I'm writing a driver for the MAX3421E USB chip, and with the current async SPI driver in esp-hal I find the chip is constantly waiting for the ESP32-S3 to send the next 64-byte buffer. If the interrupt could simply start the next SPI transfer, rather than go through a waker that eventually schedules the next transfer, I could really improve the throughput of my USB application.

@bjoernQ (Contributor, Author) commented May 2, 2024

Would my proposed unsafe SPI master API help?

@Dominaezzz (Collaborator)

It does, but having to use unsafe is a bit unfortunate.
Instead of an additional unsafe API, what about an additional move-in-move-out API? 😄
It would be rather convenient for managing a state machine, or really any off-stack use case.

@bjoernQ (Contributor, Author) commented May 2, 2024

Yes, the move-in-move-out API was my first thought (and actually how it was before) 😄

@MabezDev (Member) commented May 2, 2024

Sorry, I haven't had much time to think about this recently. I'm not against the Move-In-Out API, but I'd like to avoid two APIs that do the same thing.

I think fundamentally, what we might want is the ability to queue multiple DMA transactions. If you squint a little, the Move-In-Out API gives us a queue of depth 1. So I think to achieve that, we probably want to split out the "preparation" part from the actual DMA transaction.

All of what I've mentioned above is very hand-wavy though; I don't know much about the DMA impl, and this may be impossible to achieve, or a bad idea :D. I would appreciate your input on this.

@Dominaezzz (Collaborator) commented May 2, 2024

Just some thoughts here and there about the topic. Apologies for the wall of text, I've been thinking about this for 2 days.

I think the current ergonomic (but technically unsound) API can be built on top of a move-based API; it just means we need to live with some memcpys (and thanks to #1255 they should run at top speed).
The idea being that the user provides some 'static buffer (probably in a c'tor?), which the driver then uses for its actual transfers, memcpying the results into the stack-allocated buffer the user passes to write, read, transfer, etc.
For particularly large transfers it doesn't have to be one big memcpy, which would make the total latency "time to do SPI transfer" + "time to memcpy it all"; it could be a series of memcpys that follows the progress of the transfer. So if the transfer is to read 10,000 bytes, a waker can be woken every 1000 bytes (in reality it'll probably be once per DMA descriptor) so that Future::poll can memcpy the next available 1000 bytes into the stack-borrowed buffer. The total latency then becomes "time to do SPI transfer" + "time to copy the last chunk", which is IMO acceptable for ergonomics and soundness. Latency-sensitive applications can take matters into their own hands (using the move-based API directly) if they want to skip the memcpys.
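
A minimal sketch of that poll-time chunked copy (the InFlightRead type and the completed_bytes method below are hypothetical, not an existing esp-hal API):

fn copy_progress(transfer: &InFlightRead, dma_buf: &[u8], user_buf: &mut [u8], copied: &mut usize) {
    // copy whatever the DMA has already written since the last poll,
    // so only the final chunk is left to copy once the transfer completes
    let done = transfer.completed_bytes();
    user_buf[*copied..done].copy_from_slice(&dma_buf[*copied..done]);
    *copied = done;
}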

I would really love to be able to queue multiple (DMA or not) transactions like esp-idf allows you to (but slightly better).
The SPI peripheral on the ESP32-S3 (and all of the others, I think) supports multiple CS pins (CS0, CS1, CS2, etc). It would be really fantastic if I could have a shared SPI bus situation where each device on the bus could queue up transactions for the Spi driver, which it would gradually execute, using the interrupt to schedule the next transfers and then waking the application's Wakers to consume the result of each transaction in its own time. The important bit here is that the transactions keep getting executed in the background even if the application doesn't Future::poll often, which can happen if the executor is too busy: maybe a CPU-bound task, running in debug mode, a task with a big tree of futures, or simply too many tasks to poll.
ESP-IDF allows you to do this queueing for a single device on the bus, but queueing for a different device forces you to drain the queue beforehand... which is not ideal. I wrote the wrapper for this in esp-idf-hal, which is where I discovered the limitations.
The SPI driver docs currently say to use embedded-hal-bus for bus sharing, which works, but it means each device has to wait for the other devices to finish transacting before even getting to schedule its transfer.
My immediate use case is only one device, which is a great starting point haha, but having multiple would be great! (Once I maximise single-device throughput I'll be setting up multiple MAX3421Es on the same SPI bus to have multiple USB ports, so I'll have a use case for that too.)

I don't really know if all of this belongs in the hal, with a move based API, users could just DIY it on top of the driver. Or perhaps it's useful enough for everyone to keep in here, idk. Can decide this later I suppose.

So I think to achieve that, we probably want to split out the "preparation" part from the actual DMA transaction.

It's important that one is able to queue up more transactions whilst there are already transactions in flight. Splitting out the "preparation" part from the transaction itself would mean you can queue multiple transactions at once but once they start you have to wait for them to finish to queue some more, killing throughput.

To achieve this, I think we just need a 'static enum representing the state (idle vs transfer in progress) of the Spi driver and an interrupt to read off a 'static shared queue of transactions and manage the state transitions between idle/busy.
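
A rough sketch of that shape (everything below is illustrative; none of these types or functions exist in esp-hal today):

enum SpiState {
    Idle,
    Busy, // a transfer is currently in flight
}

struct TransactionQueue<const N: usize> {
    slots: [Option<QueuedTransfer>; N], // filled by the application, drained by the interrupt
    head: usize,
    len: usize,
}

// Pseudocode for the transfer-done interrupt: it keeps the hardware busy even
// when the executor is slow to poll the application's futures.
// QueuedTransfer, pop() and start_transfer() are placeholders.
fn on_transfer_done(state: &mut SpiState, queue: &mut TransactionQueue<8>) {
    // wake the waker of the transaction that just finished, then...
    match queue.pop() {
        Some(next) => start_transfer(next), // ...immediately start the next one
        None => *state = SpiState::Idle,    // ...or go idle if nothing is queued
    }
}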

and this may be impossible to achieve

If ESP-IDF can do it, then esp-hal can do it too. 😛

A more interesting implementation could also take advantage of the segmented transfer feature that the ESP32-S3 has (idk about other chips). It basically lets you queue up multiple transactions to different devices in one go. It appears to use GDMA to configure the SPI registers.

@bjoernQ (Contributor, Author) commented May 3, 2024

Thanks for sharing your thoughts - I think segmented transfers are definitely something we need to explore. I avoided them before because of the added complexity, but they definitely have some advantages.

I think the current ergonomic (but technically unsound) API can be built on top of a move-based API

I also thought about that - I was a bit concerned about the need for memcpy, but maybe it's not a real problem in most cases, and when it is, users could fall back to the "unpleasant" API.

Just to explain the original motivation for this issue a little bit:

That linked repository in the first comment basically starts a DMA transfer to push pixels to a display and returns, so that the user code can prepare the next frame. (It does that in chunks to avoid needing a 300x240x2-byte buffer, which wouldn't be possible to transfer in one go anyway - but that's not important here.)

When the next frame is ready to push to the display, we need to make sure the previous transfer is done before starting the next one.

With the old API it was possible to store the Transaction as a struct member and wait for its completion when the next frame is ready - getting back the SPI instance (and the buffers) to push new pixels to the display.
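
As a hypothetical sketch of that pattern (the type and method names here are illustrative, not the actual old API):

enum DisplayState<'d> {
    Idle(SpiDma<'d>, &'static mut [u8]),
    Busy(SpiDmaTransfer<'d>),
}

struct Display<'d> {
    state: Option<DisplayState<'d>>,
}

impl<'d> Display<'d> {
    fn push_frame(&mut self, pixels: &[u8]) {
        // finish the previous transfer (if any) and take the SPI driver and buffer back
        let (spi, buffer) = match self.state.take().unwrap() {
            DisplayState::Idle(spi, buffer) => (spi, buffer),
            DisplayState::Busy(transfer) => transfer.wait(),
        };
        buffer[..pixels.len()].copy_from_slice(pixels);
        // start the next transfer and return immediately, so the caller can
        // prepare the next frame while the DMA pushes this one out
        self.state = Some(DisplayState::Busy(spi.dma_write(buffer)));
    }
}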

With the unsafe API, doing all this is very easy and convenient - but unsafe.

With the current API it's unfortunately impossible.

A queue might help, but I agree with:

It's important that one is able to queue up more transactions whilst there are already transactions in flight.

I guess we'd need some memcpy operations for that? Maybe we could build an API where the user can get access to a chunk of the DMA buffer to prepare the transaction.

Currently the DMA transaction API is quite simple and duplicated for each peripheral (because it's almost the same, but a little bit different, for each one). Creating something more advanced ideally wouldn't mean duplicating code for each driver (and it would be great to get rid of the duplicated code we already have).

@Dominaezzz (Collaborator)

Currently the DMA transaction API is quite simple and duplicated for each peripheral (because it's almost the same, but a little bit different, for each one). Creating something more advanced ideally wouldn't mean duplicating code for each driver (and it would be great to get rid of the duplicated code we already have).

Yeah, this has always bugged me about the DMA code in each driver; it always feels like there's room for dedup but no obvious way to do it.

I guess we'd need some memcpy operations for that? Maybe we could build an API where the user can get access to a chunk of the DMA buffer to prepare the transaction.

Yeah, probably. It'd work for SPI at least, which has the concept of transactions. It may be tricky to fill up the DMA buffer without accidentally feeding data into the previous transactions? I guess this would need separate descriptor lists.

Or perhaps users can just provide a 'static buffer per transaction and manage it themselves.

but maybe it's not a real problem in most cases, and when it is, users could fall back to the "unpleasant" API

Well, if you're not unhappy with the idea, I'd be happy to draft something. Though my initial use case is for SPI without DMA, which at the moment isn't async (for understandable reasons). My transactions never go above 64 bytes.

@Dominaezzz (Collaborator) commented May 19, 2024

I've come back to wanting this again and have some more concrete ideas after reading the TRM (of the ESP32-S3).

To lay out some hardware facts (feel free to skip if you already know how the hardware works):

  1. A single SPI transfer consists of either a write, read or both (full duplex). (Command, address and dummy phases aren't important)
  2. The write and/or read could come from either DMA or FIFO. i.e. You can do something like FIFO write and DMA read in the same transfer.
  3. The length of the transfer is always specified in the SPI hardware, regardless of FIFO/DMA size. No such thing as an infinite transfer. (Though you can work around this with a configurable segmented transfer)
  4. In FIFO transfers, if the transfer is larger than the FIFO, the hardware's FIFO pointer wraps around.
  5. In DMA transfers, if the transfer is larger than what the channel can provide, the excess received data is discarded, and garbage (whatever the last bit/byte was) is sent for the remaining bits that the TX channel didn't provide. There are special interrupts for when this happens.
  6. The SPI hardware sends an RX EOF to the RX channel (either when data is received or when transfer is done) and doesn't care about TX EOF from the TX channel.
  7. The SPI hardware can be asked to only use the upper half of the FIFO, meaning you can (sort of) split the FIFO in two. See SPI_USR_MOSI_HIGHPART/SPI_USR_MISO_HIGHPART in the TRM.

Having laid that out, my ideas build on top of @MabezDev's idea about splitting the preparation from the transaction.


The first idea I have is basically the Move API but with an important change: the driver no longer takes ownership of the DMA descriptors.
At the moment the WithDma driver takes a configured GDMA channel, which contains a fixed number of descriptors. The descriptors are then configured (set length, clear EOF, set buf pointer, etc) for the buffer(s) provided in every transfer. This unfortunately means the next transfer can't be prepared while a transfer is in progress, because the descriptors are in use.
My proposal is to have the driver take a descriptor-less GDMA channel in the c'tor; each transfer method (write, read, etc) that currently takes a buffer would then take a descriptors+buffer combo struct.

To illustrate with pseudo code, the current API looks like this:

struct SpiDriver {
    gdma: Channel, // with descriptors
    // other things....
}

impl SpiDriver {
    pub fn write(&mut self, buffer: &[u8]) -> DmaTransfer;
}

I want to change it to something like this:

struct SpiDriver {
    gdma: Channel, // without descriptors
    // other things....
}

pub struct DmaTxBuf(buffer: &[u8], descriptors: &[DmaDescriptor]);

impl SpiDriver {
    pub fn write(self, buffer: DmaTxBuf) -> DmaTransfer;
}

With this change, users can prepare several DmaTxBufs and/or DmaRxBufs in advance, or while some transfer is in flight, then pass one to write (or read, etc) for the next transfer, which just needs to point the DMA channel at the new buffer address. When the transfer is done, the user gets their bufs and the driver back from wait().
(I think this also eliminates the DMA setup latency that esp-idf speaks of)
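
A hypothetical usage sketch of that shape (new/fill are illustrative names, not part of the proposal's pseudo code above):

let mut buf_a = DmaTxBuf::new(descriptors_a, buffer_a);
let mut buf_b = DmaTxBuf::new(descriptors_b, buffer_b);

buf_a.fill(&frame_0);
let transfer = spi.write(buf_a);    // driver and buf move into the transfer

buf_b.fill(&frame_1);               // prepare the next buffer while the DMA is running

let (spi, buf_a) = transfer.wait(); // driver and buf move back out
let transfer = spi.write(buf_b);    // start the next transfer right away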

This is quite a big change, especially for the DMA driver, which now needs to be descriptor-free...
It might be easiest to first move the descriptors from the channel to the individual drivers, then for the SPI driver (to begin with) move the descriptors from the driver to the DmaTxBuf.
Unfortunately I won't be able to test most of the drivers I'll need to modify to make this change 😅.
(Though I'd be able to do the SPI bit just fine 😄)

Having said all that, I think it's best that any queuing should be left to the user if possible, and the hal should just make it safe to write a queuing implementation outside of it. The main reason is that some folks may want to perform some actions between each transaction, like toggling an out-of-band GPIO pin (think of mipidsi D/C pins) or a software CS pin. Also, locking implementations would depend on what the user is doing, so it's best to leave users to their own devices haha.


The second idea is to make the FIFO a separate object from the driver.
Picture something like this:

let (mut spi, mut fifo) = Spi::new(peripherals.SPI3, .....);

fifo.fill(&[0x01, 0x02, .....]); // Assert buf < 64 (FIFO size)

spi.write(&mut fifo);

I have three reasons for this.

The first is that users will now be able to fill the FIFO whilst a DMA transfer is happening. This works well for devices like SD cards, where you have one big DMA transfer to write some blocks to the device, then a small transfer to simply check the status of the previous transfer.

The second is that one can now "split" the FIFO in two (see hardware fact 7), then use one half for preparation and the other for the active transfer. This allows for a safe queuing implementation based on the FIFO.

Picture something like this:

let (mut spi, fifo) = Spi::new(peripherals.SPI3, .....);

let (mut fifo_first_half, mut fifo_second_half) = fifo.split();

fifo_first_half.fill(&[0x01, 0x02, .....]); // Assert buf < 32 (Half of FIFO size)
spi.write(&mut fifo_first_half); // Can also take ownership of the FIFO for async transfers.

fifo_second_half.fill(&[0x05, 0x0A, .....]); // Assert buf < 32 (Half of FIFO size)
spi.write(&mut fifo_second_half); // Can also take ownership of the FIFO for async transfers.

let fifo = fifo_first_half.join(fifo_second_half); // Can undo the split for bigger transfers.

The third is that users will be able to take advantage of hardware fact 2: use the FIFO for write and DMA for read, or vice versa.

Another nice side effect of this is that, in a full-duplex transfer, one can use one half of the FIFO for writing and the other half for reading. This is useful for polling, where you're basically writing the same command over and over again until you get a certain result.
For example, SD card status checks, which require you to send 0xFF and read 0x00 while the card is busy. You can avoid rewriting the FIFO half used for writing, since the reads go into a different part of the FIFO.
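
A hypothetical sketch of that polling pattern (the transfer/read methods here are illustrative, going beyond the pseudo code above):

let (mut spi, fifo) = Spi::new(peripherals.SPI3, .....);
let (mut tx_half, mut rx_half) = fifo.split();

tx_half.fill(&[0xFF]); // the write half is filled once and never rewritten

loop {
    // full duplex: send 0xFF from one half, receive the status byte into the other
    spi.transfer(&mut tx_half, &mut rx_half);
    if rx_half.read()[0] != 0x00 {
        break; // the card is no longer busy
    }
}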

Of course, there will need to be some kind of type safety to make sure you cannot use SPI2's FIFO with SPI3 or vice versa.


Now to address the aftermath of these changes: the base SPI driver will no longer be able to implement the embedded-hal SpiBus as is (which is acceptable based on the API-GUIDELINES.md). We'd need some kind of container struct that provides the implementation.
Picture something like this:

let (spi, fifo) = Spi::new(peripherals.SPI3, .....);

// e-hal implementation that does memcpy's to the FIFO like the current one does
let mut driver = SpiBusDriver::new(spi, fifo);

driver.write(&[0, 0, 0, 0]);
driver.transfer_in_place(&mut [0, 0, 0, 0]);

or in the DMA case.

let (spi, _) = Spi::new(peripherals.SPI3, .....);

let spi_dma = spi.with_dma(dma_channel);

// e-hal implementation that does memcpy's to the DMA buf like FlashSafeDma does.
let mut driver = SpiBusDmaDriver::new(spi_dma, dma_tx_buf, dma_rx_buf);

This could probably have some helper functions to make construction shorter.


For a future improvement on this, the DmaBufs from my first idea could be enhanced to contain data for several transactions, and the SPI driver would simply be restarted to handle each chunk in the buf, without having to reset all the buffers or do more Move-In-Move-Outs. Still a hand-wavy idea here.

Once all this is done, I also have semi concrete ideas for configurable segmented transfers that go on top of this DmaBuf idea.

I also need to check the other TRMs to confirm, but I think the bulk of this should apply to all the supported devices - all the chips supported by this HAL can do what I've proposed. (The base ESP32 can't do segmented transfers, but that's out of scope here.)

Thoughts? 🙂 I quite like how this came out.

@ProfFan (Contributor) commented May 25, 2024

I like this idea; a lot of the operations we do with (especially interrupt-less) SPI devices are polling, which can be made much less costly with this.

@MabezDev (Member) commented Jun 5, 2024

I really like the first proposal, it's exactly what I had in mind! One thing to think about is if we can still have the "easy" way to use the API by default. IMO it's not a dealbreaker if we can't, but it would be nice if, when a user just wants to do one transaction at a time, we make it as easy as it is now, with the more complex queuing as an opt-in API. Either way, I'd really like to see something like this implemented; I think it's a great proposal.

Having said all that, I think it's best that any queuing should be left to the user if possible, and the hal should just make it safe to write a queuing implementation outside of it. The main reason is that some folks may want to perform some actions between each transaction, like toggling an out-of-band GPIO pin (think of mipidsi D/C pins) or a software CS pin. Also, locking implementations would depend on what the user is doing, so it's best to leave users to their own devices haha.

I totally agree with this too. In the future we could provide some opt-in wrapper type that could handle the queuing, but I would say for now it's quite an advanced API and it should be used manually when the performance is needed.

The second idea is to make the FIFO a separate object from the driver.

This part I'm not so sure about yet. I like the idea in principle, but in reality, when would you use this? You mentioned SD cards over SPI, but these are inherently quite slow; if you want performance you would use SDIO. I'm not sure adding this complexity is worth it, especially since it changes the most common use of SPI with the embedded-hal traits. We could flip the functionality of your proposed PR on its head, where by default it's an e-hal-compatible SPI device, which could then be split into the two parts; again though, I'm not sure it's worth the effort unless I'm missing something.

(also sorry again for not getting to this sooner! I really appreciate the effort required to come up with these proposals :).)

@bjoernQ (Contributor, Author) commented Jun 5, 2024

I agree with @MabezDev (and I'm also sorry - I was initially scared away by the wall of text)

One thing to think about is if we can still have the "easy" way to use the API by default

I might have an idea which sounds great in my head but maybe isn't in code - that could be done in a follow-up PR (I'd volunteer for that). I.e. I guess it's fine to start with the move approach, so as not to make the initial PR even more complex than it will already be.

Unfortunately I won't be able to test most of the drivers I'll need to modify to make this change 😅

I can support testing those. And at least we now have a test for I2S (breaking I2S was the main reason I was often scared away from touching DMA code before 😆)

@Dominaezzz (Collaborator)

I'm glad you both like the proposal!

This part I'm not so sure about yet. I like the idea in principle, but in reality, when would you use this? You mentioned SD cards over SPI, but these are inherently quite slow; if you want performance you would use SDIO.

It's precisely because they're slow that you'd want to do this. If the SPI clock rate is low, then pulling out DMA and faffing around with 'static may not be worth it. You'd just want to fill up the FIFO, start the transfer, and then have an interrupt wake the scheduler when the transfer is finally done.

I've personally started on a queueing implementation in my project based on core::ops::Coroutine, and for tiny transfers it'd be really handy to just use the FIFO. Crucially, it's important that my interrupt isn't spending time filling or emptying the FIFO; that's left to the application.

We could flip the functionality of your proposed PR on its head, where by default it's an e-hal-compatible SPI device, which could then be split into the two parts; again though, I'm not sure it's worth the effort unless I'm missing something.

I did consider this initially, but it gets in the way of a future proposal of mine, which is to expose "app interrupts". The SPI peripheral (on the ESP32-S3 at least) has app interrupts which do pretty much the same thing as struct SoftwareInterrupt<N> but share the SPI interrupt handler (eliminating the need for locking between them).
So it would look something like this: let (spi, buf, app_intr0, app_intr1) = Spi::new(.....);.
Having the easy/e-hal driver hold on to the app interrupts (for a later split) feels wrong to me, especially since undoing the split means giving back the app interrupts to fill in the struct.

Oh actually! Maybe two separate c'tors make sense here. The existing Spi::new and a new Spi::new_split (pending bike shed)?

Unfortunately I won't be able to test most of the drivers I'll need to modify to make this change 😅

I can support testing those. And at least we now have a test for I2S (breaking I2S was the main reason I was often scared away from touching DMA code before 😆)

Fantastic!

jessebraham added this to the 0.19.0 milestone Jun 5, 2024
@ProfFan (Contributor) commented Jun 5, 2024

This would be great, since the SPI already has a 64-byte FIFO and it would be similar to the "interrupt-based" transactions in the official docs. For example, my application mainly uses SPI to read IMU packets (20 bytes); at 20 MHz that's about 8 µs per transfer, i.e. roughly 2000 CPU cycles wasted busy-waiting on a 240 MHz core. Setting up DMA for this application is also quite wasteful.

@Dominaezzz (Collaborator)

Oh actually! Maybe two separate c'tors make sense here. The existing Spi::new and a new Spi::new_split (pending bike shed)?

Turns out this doesn't work well, particularly because of the with_sck, with_mosi, with_cs methods, as they'd have to be duplicated. (Whilst checking this out I found an awkward "bug" in the way pins are handled; I'll open a separate issue.)

In addition to the "app interrupts", I was also hoping (in a future PR) to return some kind of "interrupt as a resource" thing from Spi::new. Picture something like this:

let (spi, buf, app_intr, intr_handler) = Spi::new(....);
let async_bus = AsyncBusDriver::new(spi, buf, intr_handler);

or

unsafe fn spi_handler() {}

let (spi, buf, app_intr, intr_handler) = Spi::new(....);
intr_handler.set(spi_handler);

@MabezDev (Member)

I'd consider the original issue closed via #1672, but I'm very keen to discuss the DMA queueing API further, so I've opened #1785 to do so.
