Add HIP 0001 ring buffer proposal for Hyperlight I/O #1112
Conversation
proposals/0001-rng-buf/README.md
| ### Risks and Mitigations
|
| **Risk**: Malicious guest corrupting the queue **Mitigation**: Do not expose low level queue API to
I don't know that not exposing it to the guest does anything. The guest is fully untrusted, so we need to do the mitigation suggested after this: serialize all data as known types and do assertions as suggested.
| example `ioeventfd` for kvm). This is especially useful for streaming scenarios where the guest can
| continue processing while the host consumes data asynchronously.
|
| **3. Inline Descriptors**
Is this proposed to be implemented now, or as a possible improvement in the future?
I would like to give it a try in the first draft, but whether we keep it really depends on two factors: 1) can we actually fit any meaningful data into an inline slot, e.g. can a serialized function call with a single argument fit? 2) is the performance improvement for such calls substantial? Both are hard to assess without an actual implementation.
| 2. Device writes up to buffer length
| 3. Device sets actual written length in descriptor
| 4. If `actual_length > buffer_length`, device sets a `TRUNCATED` flag,
| 5. Driver can re-submit with larger buffer if needed
I can imagine this could be a performance issue? Can we provide a metric for when this occurs and a way to improve the initial buffer size?
That's a good point. I would imagine this only happens when you try to transport a "big" chunk of data, for which the base case should be using a stream rather than pushing everything at once. That being said, the protocol should support the full round trip, but really the mitigation for this problem should be to use streams.
| ### Snapshotting
|
| Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any
| attempt to snapshot a sandbox with such pending requests will result in a snapshot failure.
Can we ensure this?
Yes, we can query the queue for the number of in-flight messages during snapshotting and fail if it reports a positive number. Because we have to track the number of free slots in the queue, getting the number of in-flight messages is just the length of the queue minus the free slots.
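For illustration, that check could be as small as the following (the queue accessors and error type here are hypothetical, not part of the proposed API):

```rust
/// Hypothetical view of the queue state; the real accessors will come from
/// the ring implementation described in the HIP.
struct QueueState {
    queue_size: u16,
    free_slots: u16,
}

#[derive(Debug)]
enum SnapshotError {
    InFlightRequests(u16),
}

/// Refuse to snapshot while any guest-to-host request is still in flight.
fn check_snapshot_allowed(q: &QueueState) -> Result<(), SnapshotError> {
    // In-flight count is simply total capacity minus currently free slots.
    let inflight = q.queue_size.saturating_sub(q.free_slots);
    if inflight > 0 {
        Err(SnapshotError::InFlightRequests(inflight))
    } else {
        Ok(())
    }
}
```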
jsturtevant
left a comment
This looks great, thanks for such a detailed write-up. Left a few minor comments, but otherwise this seems like it would set the project up to work well for the future.
Pull request overview
This pull request adds HIP 0001, a comprehensive proposal document for implementing a virtio-inspired ring buffer mechanism for Hyperlight I/O. The proposal aims to replace the current stack-based communication model with a more efficient ring buffer approach to reduce VM exits and enable streaming communication patterns.
Changes:
- Adds a detailed technical proposal document describing the ring buffer design, API, and implementation plan
- Updates typos.toml to allow "readables" and "writables" as valid terms used in the proposal
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| proposals/0001-rng-buf/README.md | Comprehensive HIP document describing virtio-inspired ring buffer design with memory layout, API examples, and implementation plan |
| typos.toml | Adds exceptions for "readables" and "writables" terms used in the proposal document |
Really great work on this HIP, @andreiltd! Loved seeing the diagrams, extensive testing plan, code, and whatnot. A few thoughts:

1. Side-by-side comparison suggestion

I think the HIP could benefit from a more explicit side-by-side comparison in the "Comparison with current implementation" section. Something like a table showing:

I feel like this would help readers quickly grasp the key difference in the two approaches.

2. Related work in Nanvix fork

I've done some similar work in the Nanvix hyperlight fork that might be relevant: 76c9e7c There, I added a

3. Single-threaded vs multi-threaded phasing

For the multi-threaded case (guest and host on separate threads), the host would need a way to interrupt the running guest when responses are ready--yes? I've also done related work for that in Nanvix's Hyperlight fork: a897cad for KVM. My hardware interrupts work enables PIC/APIC interrupt injection, which could support the ioeventfd-style notifications mentioned in the HIP. Happy to coordinate on this when the time comes :)

The backward compatibility goal (current function call model portable without public API changes) is really cool to see--great for downstream consumers like Nanvix. Looking forward to seeing this land!
syntactically
left a comment
I mostly really like this! Left a few minor nits, and a couple of bigger comments clarifying the question of where buffers get allocated.
I didn't review the Rust type design in detail, as I understand that is going to be updated shortly.
proposals/0001-rng-buf/README.md
| ### Risks and Mitigations
|
| **Risk**: Malicious guest corrupting the queue **Mitigation**: Do not expose low level queue API to
I agree with @jsturtevant here---the raw state of the queue is fundamentally exposed to the guest. The thing that we need to do on the host for safety is figure out what that means for us: making sure that the buffers are always in guest memory, ensuring that we do not access them in racy ways (even if the guest tries to trick us into having the same buffer both read from and written to at the same time), etc.
Agreed! I think all the precautions we are currently using remain valid. What I could add here to "mitigation" (rough sketch below) is:
- validate that all addresses point into permitted guest memory regions,
- use volatile/atomic operations for shared memory access,
- never assume buffer contents are stable between reads, which often means volatile-copying memory out first.
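A rough sketch of what those host-side checks could look like, with made-up region and helper names (the real code would sit behind the existing shared-memory abstractions):

```rust
use std::ops::Range;

/// Hypothetical view of the guest memory region the host is allowed to touch.
struct GuestRegion {
    base: *const u8,        // host mapping of gpa_range.start
    gpa_range: Range<u64>,  // permitted guest-physical addresses
}

/// Validate a descriptor's (addr, len) pair and volatile-copy the payload out,
/// so later reads cannot be affected by a guest racing on the same buffer.
unsafe fn copy_out(region: &GuestRegion, addr: u64, len: u32) -> Option<Vec<u8>> {
    let end = addr.checked_add(len as u64)?;
    if addr < region.gpa_range.start || end > region.gpa_range.end {
        return None; // descriptor points outside permitted guest memory
    }
    let offset = (addr - region.gpa_range.start) as usize;
    let src = region.base.add(offset);
    let mut out = vec![0u8; len as usize];
    for i in 0..len as usize {
        // Volatile reads: never assume guest memory is stable between accesses.
        out[i] = std::ptr::read_volatile(src.add(i));
    }
    Some(out)
}
```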
I've rewritten this section.
| ## Design Details
|
| ### Packed Virtqueue Overview
I don't love the producer/consumer terminology here. I think it added to the confusion before about how buffer allocation responsibility should be divided. How about something like "the driver is the side that can allocate buffers accessible by both sides (i.e. the guest), and the device is the side that reads from/writes to these buffers on request (i.e. the host)". The host is essentially providing a specialised paravirtualised device to the guest, and so always takes on the device role.
That's fair, especially since both sides, device and driver, involve producing and consuming descriptors, and even more so if we want to maintain that only the guest allocates shared buffers. I will change this.
| can act as driver or device depending on the direction of communication.
|
| #### Bidirectional Communication
How about something like:
Hyperlight needs to support both device-initiated (for host->guest function calls) and driver-initiated (for guest->host function calls) communication. This is generally accomplished by having the guest create, during early initialisation, and keep populated a set of available buffers into which the host should next write incoming work for the guest.
I am not fundamentally opposed to having split rx/tx queues, like e.g. virtio-net, but I'm unsure if they actually make sense here, at least in the single-threaded case?
Single queue might be a cleaner solution, but I will need to think harder about what it means for backpressure. I will do that here.
Just to not add to the confusion: by "split" we mean two separate queues, one for each direction; it has nothing to do with virtio terms like split vs packed queue.
In the split-queue scenario the backpressure is simple because it is independent in each direction. For a single queue, on the other hand, the backpressure is shared because both directions compete for the same descriptor ring capacity. So, for example, if the guest fills all its available slots with requests, and the host needs to send something to process those requests, this will result in deadlock - so we need a mechanism to avoid this.
Maintaining a 50/50 split is simple, but it's going to be very wasteful for asymmetric traffic. We could maintain a minimum number of slots that are always available for the device, and whenever the host marks them used, the driver "immediately" resubmits fresh writable buffers. The explicit pool still introduces a bias -- so should the device be allowed to "request more credits" if it needs to increase capacity? This quickly becomes quite complicated.
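To make the reserved-slot idea concrete, a purely illustrative credit scheme might look like this (names, numbers, and policy are not a committed design):

```rust
/// Illustrative bookkeeping for a single shared queue where the driver must
/// never consume the slots reserved for device-initiated work.
struct SharedQueueCredits {
    capacity: u16,
    device_reserved: u16, // slots the guest keeps populated with writable buffers
    in_use: u16,
}

impl SharedQueueCredits {
    /// Can the driver (guest) submit a new chain without risking deadlock?
    fn driver_may_submit(&self, chain_len: u16) -> bool {
        let driver_budget = self.capacity.saturating_sub(self.device_reserved);
        self.in_use.saturating_add(chain_len) <= driver_budget
    }

    /// Called when the device marks reserved buffers used: the driver should
    /// "immediately" resubmit fresh writable buffers to keep the pool topped up.
    fn on_device_completion(&mut self, freed: u16) {
        self.in_use = self.in_use.saturating_sub(freed);
    }
}
```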
| #### Memory Layout
|
| The queue structures reside in shared memory and are accessible to both host and guest. The layout
Should this be an offset from the mapped region base, or a GPA? I would tend to think GPA, even though in most cases the host will want to ensure that it is in the scratch region and then convert to an offset from there.
Yup, GPA is cleaner.
| <img width="1324" height="726" alt="submit" src="https://github.com/user-attachments/assets/a3ee2dea-a55b-4d50-8b96-1702617a21f0" />
|
| **Step 2:** Device processes and writes response
I assume that in this picture there is implicitly some header inside the buffers passed where some kind of call ID allocated by the caller can be used to link up calls/responses?
So there is an id in the descriptor, but it identifies the buffer, not the semantic call. I don't think there is a need for a payload-level id, though: we can just return an opaque submission token that wraps the descriptor id and is used to correlate responses with requests.
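In other words, something along these lines (type names are only illustrative):

```rust
/// Opaque token returned by submit; it wraps the descriptor id so callers can
/// correlate a completion with the request they submitted, without needing
/// any payload-level id.
#[derive(Copy, Clone, PartialEq, Eq, Hash, Debug)]
pub struct SubmissionToken(u16);

/// Completion as seen by the caller: the token links it back to the submit call.
pub struct Completion {
    pub token: SubmissionToken,
    pub bytes_written: u32,
}
```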
| descriptor layer or in the flatbuffer schema:
|
| 1. Driver allocates buffer of estimated size
| 2. Device writes up to buffer length
Is this flag set in the queue flags, or in the header protocol inside of the buffers? The latter I assume needs to keep track of both a request id and an offset from the beginning of the response to that request for this buffer, for this to work?
Good point, I would imagine this has to be protocol level, as this is not really about the descriptor state but the multi-part payload state. As you pointed out, it will also require additional context that cannot be expressed with a single flag.
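For example, a protocol-level header for multi-part responses might carry fields like these (purely illustrative, not the actual flatbuffer schema):

```rust
/// Illustrative header prepended to each response fragment when a reply does
/// not fit in the buffer the driver provided.
#[repr(C)]
pub struct FragmentHeader {
    /// Correlates all fragments of one logical response.
    pub request_id: u64,
    /// Offset of this fragment within the full response payload.
    pub offset: u64,
    /// Total size of the full response, so the driver can size a re-submission.
    pub total_len: u64,
    /// Non-zero when more fragments follow (the TRUNCATED case).
    pub more: u32,
}
```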
| ### Snapshotting
|
| Snapshotting requires that the descriptor table has no in-flight guest-to-host requests and any
I guess there is a slightly more complex issue with snapshotting as well: the state of the buffer pool in the driver will need to be considered/updated, because (a) the physical addresses of the inflight "write the next request for me here" buffers will need to be updated to the new scratch region and (b) something will need to be done (perhaps the same update? but slightly unclear, in case the guest is currently using data in those buffers) with the buffers in the allocator pool that are not currently available in the queue.
Sorry, some kind of glitch seems to have resulted in my comments getting posted twice. I can't delete this duplicate review comment, apparently.
syntactically
left a comment
I have some minor comments on the Rust definitions as well.
| This mechanism ensures that no locks are required for synchronization, only memory barriers combined
| with atomic publishing of flags ensure that the other side will never observe a partial update:
|
| - Driver: Write descriptor fields → memory barrier → atomic Release-store flags
What barrier were you thinking we might need here? I would think that the synchronising release store and acquire load should always be enough.
Separately-ish, there is the issue of how to ensure that the host does not have data race UB, even in the face of a guest that maliciously violates the synchronisation protocol. This is all a bit complicated by the fact that the reads/writes on the other side are outside the Rust abstract machine. We have some notes about this in shared_mem.rs, which were written for the future thinking about this kind of concurrent update situation. There was also a recentish discussion in unsafe-code-guidelines about this topic, which came to a similar conclusion to our notes: what we really want is a relaxed atomic-per-byte memcpy to get data in/out of the guest (after the acquire atomic load), but unfortunately this operation does not exist in Rust yet. So, I think that the present approach of having an acquire load and then volatile non-atomic reads/writes is still the "least bad" option, even though it is not exactly correct w.r.t. the memory model---hopefully we can eventually switch to the per-byte atomic memcpy.
> I would think that the synchronising release store and acquire load should always be enough.
That was my assumption, yes. I guess the usual "TOCTOU" issue where the guest modifies memory between host reads remains and we should probably acknowledge this in the proposal?
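A condensed sketch of the access pattern under discussion: acquire-load the flags, then volatile-copy the payload out exactly once before parsing it (constants and signatures here are placeholders, not the proposed API):

```rust
use std::sync::atomic::{AtomicU16, Ordering};

const AVAIL: u16 = 1 << 7; // placeholder for the descriptor's AVAIL flag bit

/// Device-side read of one descriptor slot: the acquire load synchronises with
/// the driver's release store, then the payload is volatile-copied out so the
/// host never re-reads guest memory that may have changed (the TOCTOU concern).
unsafe fn try_read_descriptor(
    flags: &AtomicU16,
    payload: *const u8,
    len: usize,
) -> Option<Vec<u8>> {
    if flags.load(Ordering::Acquire) & AVAIL == 0 {
        return None; // nothing published yet
    }
    let mut copy = vec![0u8; len];
    for i in 0..len {
        copy[i] = std::ptr::read_volatile(payload.add(i));
    }
    // All further validation/parsing happens on `copy`, never on guest memory.
    Some(copy)
}
```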
| pub struct DescFlags: u16 {
|     const NEXT = 1 << 0;     // Buffer continues via next descriptor
|     const WRITE = 1 << 1;    // Device write-only (otherwise read-only)
|     const INDIRECT = 1 << 2; // Buffer contains list of descriptors
The text mentioned not supporting indirect, at least for now---do we want this flag to be allocated even though we aren't supporting it yet?
Can change this to RESERVED maybe?
| /// Event suppression structure for controlling notifications.
| /// Both sides can control when they want to be notified.
| #[repr(C)]
| pub struct EventSuppression {
Where do these end up?
Yeah, typically these live at the end of the descriptor ring or at fixed offsets in the queue region. I should probably add this to the layout diagram.
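For instance, the queue region could be laid out roughly like this, with both suppression structures at fixed offsets right after the ring (field shapes follow the virtio packed-queue convention; the exact placement and diagram belong in the HIP):

```rust
// Illustrative layout only, not the final memory map.
#[repr(C)]
pub struct Descriptor {
    pub addr: u64,  // guest-physical address of the buffer
    pub len: u32,
    pub id: u16,
    pub flags: u16,
}

#[repr(C)]
pub struct EventSuppression {
    pub desc: u16,  // descriptor offset (plus wrap counter) to be notified at
    pub flags: u16, // enable / disable / use `desc`
}

#[repr(C)]
pub struct QueueRegion<const N: usize> {
    pub descriptors: [Descriptor; N],
    pub driver_event: EventSuppression, // read by the device
    pub device_event: EventSuppression, // read by the driver
}
```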
|     desc_table: DescTable,         // Descriptor table in shared memory
|     id_free: SmallVec<[u16; 256]>, // Stack of free IDs (allows out-of-order completion)
|     id_num: SmallVec<[u16; 256]>,  // Chain length per ID
|     drv_evt_addr: u64,             // Controls when device notifies about used buffers
What does _addr mean here? Are these meant to be bitpacked versions of the suppression structure above?
drv_evt_addr: u64 and dev_evt_addr: u64 are addresses pointing to the EventSuppression structures in shared memory.
|     /// Length in bytes (for used: bytes written by device)
|     pub len: u32,
|     /// Buffer ID for correlating completions with submissions
|     pub id: u16,
Are these meant to identify an individual buffer, or are they the higher-level semantic protocol thing used to correlate e.g. calls and returns? If the former, do they gain us anything on just using the address as the buffer ID?
Per spec, the id is for identifying descriptors/chains for out-of-order completion. I think using addresses as ids is problematic because: (a) addresses may be reused after deallocation, and (b) chains span multiple buffers whose descriptors share one id.
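The free-ID stack from the struct above makes this concrete; a simplified sketch (plain Vec instead of SmallVec, names only mirroring the proposal):

```rust
/// Simplified ID allocator: a stack of free IDs supports out-of-order
/// completion, and one ID covers a whole chain regardless of its length.
struct IdAllocator {
    free: Vec<u16>,      // stack of free chain IDs
    chain_len: Vec<u16>, // chain length per ID, indexed by ID
}

impl IdAllocator {
    fn new(queue_size: u16) -> Self {
        Self {
            free: (0..queue_size).rev().collect(),
            chain_len: vec![0; queue_size as usize],
        }
    }

    /// Reserve one ID for a chain of `len` descriptors.
    fn alloc(&mut self, len: u16) -> Option<u16> {
        let id = self.free.pop()?;
        self.chain_len[id as usize] = len;
        Some(id)
    }

    /// Completion may arrive in any order; the ID simply goes back on the stack.
    fn complete(&mut self, id: u16) -> u16 {
        let len = self.chain_len[id as usize];
        self.free.push(id);
        len
    }
}
```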
| impl<M: MemOps> RingProducer<M> {
|     /// Submit a buffer chain to the ring.
|     /// Returns the descriptor ID assigned to this chain.
Why do we need this as well as the _notify variant?
The _with_notify variant references the "other" side's EventSuppression configuration and returns whether it should be notified, while the plain variant omits this step. It's used internally as an optimization: for batching, you want to read the suppression config once and then submit without rechecking.
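A toy model of that batching pattern, with names that only mirror the shape of the proposed API (nothing here is the real type):

```rust
struct ToyRing {
    pending: Vec<u16>,         // descriptor ids published but not yet notified
    device_wants_notify: bool, // stands in for reading the device's EventSuppression
}

impl ToyRing {
    /// Plain submit: publish the chain, no notification decision.
    fn submit_available(&mut self, id: u16) {
        self.pending.push(id);
    }

    /// Submit and report whether the device asked to be notified.
    fn submit_available_with_notify(&mut self, id: u16) -> bool {
        self.submit_available(id);
        self.device_wants_notify
    }

    /// Batching: read the suppression state once, push the whole batch with the
    /// cheap variant, then issue at most one notification (e.g. one VM exit).
    fn submit_batch(&mut self, ids: &[u16]) -> bool {
        let notify = self.device_wants_notify;
        for &id in ids {
            self.submit_available(id);
        }
        notify
    }
}
```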
|     /// Submit with notification check based on device's event suppression settings.
|     pub fn submit_available_with_notify(&mut self, chain: &BufferChain) -> Result<SubmitResult, RingError>;
|
|     /// Poll for a used buffer. Returns WouldBlock if no completions available.
We may need to think a little bit about whether there is a way to actually block/return control to the host, and whether that should be integrated into or distinct from the host->guest notification mechanism, which, at least right now, will also exit to the host in the calling thread (but perhaps in the future could be some faster IPI?)
That's a good point. One way of encoding this information would be to make should_notify an enum rather than a bool, so we could express the reason why a notification should happen.
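For example (variant names are only illustrative):

```rust
/// If `should_notify` became an enum instead of a bool, the caller could see
/// *why* a notification (and possibly a block/yield back to the host) is needed.
pub enum NotifyReason {
    /// No notification required; keep polling or batching.
    None,
    /// The other side asked to be notified about newly available buffers.
    BuffersAvailable,
    /// Nothing to do locally; the caller may block / return control to the host
    /// until completions arrive.
    WouldBlock,
}
```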
| #### Buffer Chain Builder
|
| Type-state pattern enforces that readable buffers are added before writable buffers, preventing
Why do we care about this "readable before writable" invariant?
This is strictly a virtio spec requirement: A buffer consists of zero or more device-readable physically-contiguous elements followed by zero or more physically-contiguous device-writable elements (each buffer has at least one element).
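A minimal type-state sketch showing how the builder can make the readable-before-writable rule unrepresentable (names are illustrative, not the proposed API):

```rust
use std::marker::PhantomData;

/// Type-state markers: once a writable buffer has been added, the builder can
/// no longer accept readable ones, matching the virtio ordering rule.
pub struct Readables;
pub struct Writables;

pub struct ChainBuilder<State> {
    readable: Vec<(u64, u32)>, // (gpa, len) pairs, purely illustrative
    writable: Vec<(u64, u32)>,
    _state: PhantomData<State>,
}

impl ChainBuilder<Readables> {
    pub fn new() -> Self {
        Self { readable: Vec::new(), writable: Vec::new(), _state: PhantomData }
    }

    pub fn add_readable(mut self, gpa: u64, len: u32) -> Self {
        self.readable.push((gpa, len));
        self
    }

    /// Adding the first writable buffer transitions the type state; readable
    /// buffers can no longer be appended after this point.
    pub fn add_writable(mut self, gpa: u64, len: u32) -> ChainBuilder<Writables> {
        self.writable.push((gpa, len));
        ChainBuilder { readable: self.readable, writable: self.writable, _state: PhantomData }
    }
}

impl ChainBuilder<Writables> {
    pub fn add_writable(mut self, gpa: u64, len: u32) -> Self {
        self.writable.push((gpa, len));
        self
    }
}
```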
|     inner: RingProducer<M>,
|     notifier: N,
|     pool: P,
|     inflight: Vec<Option<ProducerInflight>>,
What's this? I don't see the ProducerInflight definition here
This tracks buffer allocations for in-flight requests, so they can be freed when the response arrives.
| ```rust
| /// Trait for buffer providers.
| pub trait BufferProvider {
In case these eventually end up passed around through a lot of parsing code with indeterminate lifetimes, would it be useful for alloc/dealloc here to be more like get/put and keep a refcount for the buffer, returning it to the free pool when it's unreferenced? I have no idea if this is true, just thought it was worth throwing out there.
Good point, I wanted to use Bytes::from_owner specifically for this:
https://docs.rs/bytes/latest/bytes/struct.Bytes.html#method.from_owner
```rust
/// The owner of a mapped buffer, ensuring its lifetime.
#[derive(Debug, Clone)]
pub struct BufferOwner<P: BufferProvider, M: MemOps> {
    pub pool: Arc<P>,
    pub mem: Arc<M>,
    pub alloc: Allocation,
    pub written: usize,
}

impl<P: BufferProvider, M: MemOps> Drop for BufferOwner<P, M> {
    fn drop(&mut self) {
        let _ = self.pool.dealloc(self.alloc);
    }
}

impl<P: BufferProvider, M: MemOps> AsRef<[u8]> for BufferOwner<P, M> {
    fn as_ref(&self) -> &[u8] {
        // MemOps should give us a pointer to the stored allocation
        todo!()
    }
}
```

This way you can just use Bytes' cheap cloning and it will reuse Bytes' reference counting. My only concern is about the safety of moving around a pointer to the guest memory region.
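A usage sketch of that idea, assuming the bounds work out (`Bytes::from_owner` needs at least `AsRef<[u8]> + Send + 'static`, so `P`, `M`, and `Allocation` would all have to satisfy the corresponding bounds; this is not confirmed against the actual trait definitions):

```rust
// Hypothetical helper: wrap the mapped allocation once, then hand out cheap
// reference-counted views; the pool slot is returned when the last clone drops.
fn into_bytes<P, M>(owner: BufferOwner<P, M>) -> bytes::Bytes
where
    P: BufferProvider + Send + Sync + 'static,
    M: MemOps + Send + Sync + 'static,
{
    // `Bytes::from_owner` keeps `owner` alive (and thus the allocation mapped)
    // until every clone of the returned `Bytes` has been dropped.
    bytes::Bytes::from_owner(owner)
}
```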