fs/macos: fix DirStream cache coherence on directory mutations#693
fs/macos: fix DirStream cache coherence on directory mutations#693jgowdy-godaddy wants to merge 3 commits into
Conversation
531fae1 to
2437ef3
Compare
The macOS virtiofs passthrough caches directory entries once per open handle (DirStream) and never refreshes them. Files created by guest operations are written to the host correctly but are invisible to subsequent readdir calls on the same handle. The Linux passthrough does not have this problem — it calls getdents64 on every readdir. macOS lacks getdents64, so the DirStream cache is structurally necessary. Add a per-inode generation counter (dir_gen) that is bumped on every directory mutation (create, mkdir, mknod, symlink, link, unlink, rmdir, rename). do_readdir compares the stored generation against the current inode generation and re-reads from the host fd when stale. Signed-off-by: Jay Gowdy <jgowdy@godaddy.com>
2437ef3 to
1f349a6
Compare
Add 7 subtests that exercise readdir visibility after directory mutations on an already-open handle: create, unlink, mkdir/rmdir, symlink, link, rename (same-dir), and rename (cross-dir). Each test opens a directory, reads entries to populate the cache, mutates the directory, calls rewinddir, and asserts the mutation is visible. These reproduce the macOS virtiofs DirStream cache coherence bug fixed in the previous commit (PR containers#693). Assisted-by: OpenCode:claude-opus-4.6 Signed-off-by: Matej Hrica <mhrica@redhat.com>
mtjhrc
left a comment
There was a problem hiding this comment.
Thanks for the PR!
This fixes the described issue, but introduces a new one: if the guest iterates an opened directory while modifying its contents, duplicate files may be returned.
I quickly wrote some tests for this - specifically, adding files while iterating a directory causes duplicates:
https://github.com/mtjhrc/libkrun/blob/a7a14cbe/tests/test_cases/src/test_virtiofs_misc.rs#L470-L578
You can run it via make test TEST=virtiofs-misc:
main...mtjhrc:libkrun:dir-iter-test
Step-by-step AI root cause analysis by Claude Opus 4.6
(This seems to check out but I haven't debugged it extensively personally):
What happens (from host-side logging):
- Guest calls readdir() — FUSE sends all 10 entries in one response (offset 0→10). The guest's libc buffers them, but user code only reads 7, leaving 3 (m_08, m_01, m_06) in libc's internal buffer.
- Guest creates 3 new files — dir_gen bumps from 10 to 13.
- Guest resumes readdir() — libc returns the 3 buffered entries (m_08, m_01, m_06), then asks FUSE for more at offset 10.
- FUSE sees gen mismatch, rebuilds the cache (now 13 entries). The 3 new entries land before position 10, so m_08, m_01, m_06 shifted to positions 10-12.
- FUSE returns positions 10-12: m_08, m_01, m_06 again. Duplicates.
On Linux this doesn't happen because getdents64 uses opaque kernel cookies, not Vec indices — they're stable across directory mutations.
Test failure on macOS:
before mutation (7 entries): ["m_03", "m_04", "m_05", "m_02", "m_07", "m_00", "m_09"]
after mutation (6 entries): ["m_08", "m_01", "m_06", "m_08", "m_01", "m_06"]
DUPLICATES (3): ["m_08", "m_01", "m_06"]
all seen (13 total): ["m_03", "m_04", "m_05", "m_02", "m_07", "m_00", "m_09", "m_08", "m_01", "m_06", "m_08", "m_01", "m_06"]
The generation-counter rebuild in do_readdir fired on any stale generation, including mid-iteration. Rebuilding the entries Vec while the guest is still indexing into it by offset reorders the entries: a file created mid-scan that sorts before already-returned entries shifts those entries to higher offsets, so the guest's next readdir at its saved offset lands on an entry it has already seen and returns it again. Restrict the rebuild to offset == 0, which is only ever sent at the start of a stream (fresh opendir or after rewinddir). For the rest of an in-progress scan we keep serving the snapshot taken when iteration began. POSIX leaves readdir unspecified when the directory is mutated during iteration, and Linux getdents64 likewise does not surface entries inserted before the current cursor, so this matches existing behavior and avoids the duplicates. Signed-off-by: Jay Gowdy <jgowdy@godaddy.com>
|
Good catch, and thanks for writing the test — that's a real bug and you nailed the root cause. The mistake was rebuilding the cache on any generation mismatch, including mid-iteration. Once the I pushed It's a one-line change to the guard plus a comment explaining why. The original fix is untouched for the Mind re-running |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces directory generation tracking to safely rebuild the directory stream cache at the start of iteration (offset 0) when mutations occur, preventing duplicate entries. However, a critical issue was identified where the write lock on the file is released prematurely, potentially causing race conditions on the shared file offset during concurrent directory reads. Additionally, the return value of libc::lseek is unchecked, which could lead to silent failures if the seek operation fails.
| if !ds.ready || (ds.gen != current_gen && offset == 0) { | ||
| let fd = data.file.write().unwrap().as_raw_fd(); | ||
| unsafe { libc::lseek(fd, 0, libc::SEEK_SET) }; | ||
| ds.entries.clear(); | ||
| ds.ready = false; | ||
| if let Err(e) = ds.fill_from_fd(fd) { | ||
| if ds.entries.is_empty() { | ||
| return Err(e); | ||
| } | ||
| // If we got some valid entries before error happened, | ||
| // treat this partial read as success and just log | ||
| // the error. | ||
| warn!("virtio-fs: error in readdir {}: {:?}", inode, e); | ||
| } | ||
| ds.gen = current_gen; | ||
| ds.ready = true; | ||
| } |
There was a problem hiding this comment.
Premature Lock Release & Unchecked lseek Return Value
- Premature Lock Release: The write lock guard returned by
data.file.write().unwrap()is a temporary value and is dropped at the end of thelet fd = ...statement. This means the lock is not held during the subsequentlibc::lseekandds.fill_from_fd(fd)calls. If another operation (such aslseekfrom a concurrent request) is executed, it could race on the shared file offset of the underlying open file description, leading to incorrect directory iteration or empty caches. - Unchecked
lseekReturn Value: The return value oflibc::lseekis not checked. Iflseekfails (returning-1), the directory stream offset will not be reset to0, resulting in an incomplete or empty directory cache.
Solution
Keep the lock active by binding the write guard to a variable (file_guard) for the duration of the cache rebuilding block, and check the return value of libc::lseek to handle errors robustly.
if !ds.ready || (ds.gen != current_gen && offset == 0) {
let file_guard = data.file.write().unwrap();
let fd = file_guard.as_raw_fd();
if unsafe { libc::lseek(fd, 0, libc::SEEK_SET) } < 0 {
return Err(linux_error(io::Error::last_os_error()));
}
ds.entries.clear();
ds.ready = false;
if let Err(e) = ds.fill_from_fd(fd) {
if ds.entries.is_empty() {
return Err(e);
}
warn!("virtio-fs: error in readdir {}: {:?}", inode, e);
}
ds.gen = current_gen;
ds.ready = true;
}There was a problem hiding this comment.
The lseek return check is fair - pushed that in bcbe338.
On the lock guard: the temporary drop is intentional and safe here. do_readdir holds data.dirstream.lock() (the DirStream mutex) for its entire scope, and that's the actual serialization point. The only
operation that seeks on a directory handle's fd is do_readdir itself - lseek/read/write are dispatched to file handles from open, not directory handles from opendir. fsyncdir and releasedir are the
only other operations that touch a directory handle, and neither seeks or reads the fd.
The pattern of letting the RwLockWriteGuard drop after as_raw_fd() is used consistently throughout the file (fsync, fallocate, flush, guest lseek) - it's a codebase idiom, not an oversight.
Add 7 subtests that exercise readdir visibility after directory mutations on an already-open handle: create, unlink, mkdir/rmdir, symlink, link, rename (same-dir), and rename (cross-dir). Each test opens a directory, reads entries to populate the cache, mutates the directory, calls rewinddir, and asserts the mutation is visible. These reproduce the macOS virtiofs DirStream cache coherence bug fixed in the previous commit (PR containers#693). Assisted-by: OpenCode:claude-opus-4.6 Signed-off-by: Matej Hrica <mhrica@redhat.com>
Add 7 subtests that exercise readdir visibility after directory mutations on an already-open handle: create, unlink, mkdir/rmdir, symlink, link, rename (same-dir), and rename (cross-dir). Each test opens a directory, reads entries to populate the cache, mutates the directory, calls rewinddir, and asserts the mutation is visible. These reproduce the macOS virtiofs DirStream cache coherence bug fixed in the previous commit (PR containers#693). Assisted-by: OpenCode:claude-opus-4.6 Signed-off-by: Matej Hrica <mhrica@redhat.com>
Add 7 subtests that exercise readdir visibility after directory mutations on an already-open handle: create, unlink, mkdir/rmdir, symlink, link, rename (same-dir), and rename (cross-dir). Each test opens a directory, reads entries to populate the cache, mutates the directory, calls rewinddir, and asserts the mutation is visible. These reproduce the macOS virtiofs DirStream cache coherence bug fixed in PR containers#693. Assisted-by: OpenCode:claude-opus-4.6 Signed-off-by: Matej Hrica <mhrica@redhat.com>
lseek can technically fail; check the return value and propagate the error rather than silently proceeding with an unreset offset. Signed-off-by: Jay Gowdy <jgowdy@godaddy.com>
The macOS virtiofs passthrough caches directory entries once per open
handle (
DirStream) and never refreshes them. Files created by guestoperations are written to the host correctly but are invisible to
subsequent
readdircalls on the same handle. The Linux passthroughdoes not have this problem — it calls
getdents64on every readdir.macOS lacks
getdents64, so theDirStreamcache is structurallynecessary. This patch adds a per-inode generation counter (
dir_gen)that is bumped on every directory mutation (create, mkdir, mknod,
symlink, link, unlink, rmdir, rename).
do_readdircompares thestored generation against the current inode generation and re-reads
from the host fd when stale.
The generation counter is a
u64, so rollover requires ~2^64mutations on a single inode and is not a practical concern. In the
theoretical wraparound case, the only consequence is a single stale
readdir until the next mutation — no corruption or data loss.
The counter uses
Ordering::Relaxedwhich is sufficient here: theMutexonDirStreamalready provides the necessarysynchronization barrier, ensuring the bumped generation is visible
before the next
do_readdiron the same inode.