Skip to content

fs/macos: fix DirStream cache coherence on directory mutations#693

Open
jgowdy-godaddy wants to merge 3 commits into
containers:mainfrom
jgowdy-godaddy:fix/macos-virtiofs-dirstream-cache-coherence
Open

fs/macos: fix DirStream cache coherence on directory mutations#693
jgowdy-godaddy wants to merge 3 commits into
containers:mainfrom
jgowdy-godaddy:fix/macos-virtiofs-dirstream-cache-coherence

Conversation

@jgowdy-godaddy
Copy link
Copy Markdown

The macOS virtiofs passthrough caches directory entries once per open
handle (DirStream) and never refreshes them. Files created by guest
operations are written to the host correctly but are invisible to
subsequent readdir calls on the same handle. The Linux passthrough
does not have this problem — it calls getdents64 on every readdir.

macOS lacks getdents64, so the DirStream cache is structurally
necessary. This patch adds a per-inode generation counter (dir_gen)
that is bumped on every directory mutation (create, mkdir, mknod,
symlink, link, unlink, rmdir, rename). do_readdir compares the
stored generation against the current inode generation and re-reads
from the host fd when stale.

The generation counter is a u64, so rollover requires ~2^64
mutations on a single inode and is not a practical concern. In the
theoretical wraparound case, the only consequence is a single stale
readdir until the next mutation — no corruption or data loss.

The counter uses Ordering::Relaxed which is sufficient here: the
Mutex on DirStream already provides the necessary
synchronization barrier, ensuring the bumped generation is visible
before the next do_readdir on the same inode.

@jgowdy-godaddy jgowdy-godaddy force-pushed the fix/macos-virtiofs-dirstream-cache-coherence branch from 531fae1 to 2437ef3 Compare May 27, 2026 20:36
The macOS virtiofs passthrough caches directory entries once per open
handle (DirStream) and never refreshes them. Files created by guest
operations are written to the host correctly but are invisible to
subsequent readdir calls on the same handle. The Linux passthrough
does not have this problem — it calls getdents64 on every readdir.

macOS lacks getdents64, so the DirStream cache is structurally
necessary. Add a per-inode generation counter (dir_gen) that is
bumped on every directory mutation (create, mkdir, mknod, symlink,
link, unlink, rmdir, rename). do_readdir compares the stored
generation against the current inode generation and re-reads from
the host fd when stale.

Signed-off-by: Jay Gowdy <jgowdy@godaddy.com>
@jgowdy-godaddy jgowdy-godaddy force-pushed the fix/macos-virtiofs-dirstream-cache-coherence branch from 2437ef3 to 1f349a6 Compare May 28, 2026 15:53
mtjhrc added a commit to mtjhrc/libkrun that referenced this pull request May 28, 2026
Add 7 subtests that exercise readdir visibility after directory
mutations on an already-open handle: create, unlink, mkdir/rmdir,
symlink, link, rename (same-dir), and rename (cross-dir). Each test
opens a directory, reads entries to populate the cache, mutates the
directory, calls rewinddir, and asserts the mutation is visible.

These reproduce the macOS virtiofs DirStream cache coherence bug
fixed in the previous commit (PR containers#693).

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
Copy link
Copy Markdown
Collaborator

@mtjhrc mtjhrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

This fixes the described issue, but introduces a new one: if the guest iterates an opened directory while modifying its contents, duplicate files may be returned.

I quickly wrote some tests for this - specifically, adding files while iterating a directory causes duplicates:
https://github.com/mtjhrc/libkrun/blob/a7a14cbe/tests/test_cases/src/test_virtiofs_misc.rs#L470-L578

You can run it via make test TEST=virtiofs-misc:
main...mtjhrc:libkrun:dir-iter-test


Step-by-step AI root cause analysis by Claude Opus 4.6

(This seems to check out but I haven't debugged it extensively personally):

What happens (from host-side logging):

  1. Guest calls readdir() — FUSE sends all 10 entries in one response (offset 0→10). The guest's libc buffers them, but user code only reads 7, leaving 3 (m_08, m_01, m_06) in libc's internal buffer.
  2. Guest creates 3 new files — dir_gen bumps from 10 to 13.
  3. Guest resumes readdir() — libc returns the 3 buffered entries (m_08, m_01, m_06), then asks FUSE for more at offset 10.
  4. FUSE sees gen mismatch, rebuilds the cache (now 13 entries). The 3 new entries land before position 10, so m_08, m_01, m_06 shifted to positions 10-12.
  5. FUSE returns positions 10-12: m_08, m_01, m_06 again. Duplicates.

On Linux this doesn't happen because getdents64 uses opaque kernel cookies, not Vec indices — they're stable across directory mutations.

Test failure on macOS:

before mutation (7 entries): ["m_03", "m_04", "m_05", "m_02", "m_07", "m_00", "m_09"]
after mutation (6 entries): ["m_08", "m_01", "m_06", "m_08", "m_01", "m_06"]
DUPLICATES (3): ["m_08", "m_01", "m_06"]
all seen (13 total): ["m_03", "m_04", "m_05", "m_02", "m_07", "m_00", "m_09", "m_08", "m_01", "m_06", "m_08", "m_01", "m_06"]

The generation-counter rebuild in do_readdir fired on any stale
generation, including mid-iteration. Rebuilding the entries Vec while
the guest is still indexing into it by offset reorders the entries: a
file created mid-scan that sorts before already-returned entries shifts
those entries to higher offsets, so the guest's next readdir at its
saved offset lands on an entry it has already seen and returns it again.

Restrict the rebuild to offset == 0, which is only ever sent at the
start of a stream (fresh opendir or after rewinddir). For the rest of an
in-progress scan we keep serving the snapshot taken when iteration
began. POSIX leaves readdir unspecified when the directory is mutated
during iteration, and Linux getdents64 likewise does not surface entries
inserted before the current cursor, so this matches existing behavior
and avoids the duplicates.

Signed-off-by: Jay Gowdy <jgowdy@godaddy.com>
@jgowdy-godaddy
Copy link
Copy Markdown
Author

Good catch, and thanks for writing the test — that's a real bug and you nailed the root cause.

The mistake was rebuilding the cache on any generation mismatch, including mid-iteration. Once the entries Vec gets rebuilt out from under a scan that's already in progress, the guest's saved offset no longer means what it meant when iteration started: a file inserted mid-scan that sorts ahead of already-returned entries shifts those entries to higher offsets, so the next readdir at the saved offset lands right back on something it already handed out. Hence the duplicates.

I pushed 3404e51 which restricts the rebuild to offset == 0. That's the only offset the guest ever sends at the start of a stream — a fresh opendir, or after rewinddir. For the rest of an in-progress scan we just keep serving the snapshot taken when iteration began. POSIX leaves readdir unspecified once the directory is mutated during iteration, and getdents64 on Linux won't surface entries inserted before the current cursor either, so this lines up with the behavior you'd get on ext4/btrfs/tmpfs.

It's a one-line change to the guard plus a comment explaining why. The original fix is untouched for the offset == 0 path, so the rewinddir-based cases (create/unlink/mkdir/rmdir/symlink/link/rename) still rebuild and pick up the mutation as before — your duplicates test is the only behavior that changes.

Mind re-running make test TEST=virtiofs-misc against this? I worked through your test by hand and it should pass now, but I can't exercise the full guest path on my end, so I'd like the confirmation from your harness before this goes any further.

@mtjhrc
Copy link
Copy Markdown
Collaborator

mtjhrc commented May 29, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces directory generation tracking to safely rebuild the directory stream cache at the start of iteration (offset 0) when mutations occur, preventing duplicate entries. However, a critical issue was identified where the write lock on the file is released prematurely, potentially causing race conditions on the shared file offset during concurrent directory reads. Additionally, the return value of libc::lseek is unchecked, which could lead to silent failures if the seek operation fails.

Comment on lines +721 to 734
if !ds.ready || (ds.gen != current_gen && offset == 0) {
let fd = data.file.write().unwrap().as_raw_fd();
unsafe { libc::lseek(fd, 0, libc::SEEK_SET) };
ds.entries.clear();
ds.ready = false;
if let Err(e) = ds.fill_from_fd(fd) {
if ds.entries.is_empty() {
return Err(e);
}
// If we got some valid entries before error happened,
// treat this partial read as success and just log
// the error.
warn!("virtio-fs: error in readdir {}: {:?}", inode, e);
}
ds.gen = current_gen;
ds.ready = true;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Premature Lock Release & Unchecked lseek Return Value

  1. Premature Lock Release: The write lock guard returned by data.file.write().unwrap() is a temporary value and is dropped at the end of the let fd = ... statement. This means the lock is not held during the subsequent libc::lseek and ds.fill_from_fd(fd) calls. If another operation (such as lseek from a concurrent request) is executed, it could race on the shared file offset of the underlying open file description, leading to incorrect directory iteration or empty caches.
  2. Unchecked lseek Return Value: The return value of libc::lseek is not checked. If lseek fails (returning -1), the directory stream offset will not be reset to 0, resulting in an incomplete or empty directory cache.

Solution

Keep the lock active by binding the write guard to a variable (file_guard) for the duration of the cache rebuilding block, and check the return value of libc::lseek to handle errors robustly.

        if !ds.ready || (ds.gen != current_gen && offset == 0) {
            let file_guard = data.file.write().unwrap();
            let fd = file_guard.as_raw_fd();
            if unsafe { libc::lseek(fd, 0, libc::SEEK_SET) } < 0 {
                return Err(linux_error(io::Error::last_os_error()));
            }
            ds.entries.clear();
            ds.ready = false;
            if let Err(e) = ds.fill_from_fd(fd) {
                if ds.entries.is_empty() {
                    return Err(e);
                }
                warn!("virtio-fs: error in readdir {}: {:?}", inode, e);
            }
            ds.gen = current_gen;
            ds.ready = true;
        }

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lseek return check is fair - pushed that in bcbe338.

On the lock guard: the temporary drop is intentional and safe here. do_readdir holds data.dirstream.lock() (the DirStream mutex) for its entire scope, and that's the actual serialization point. The only
operation that seeks on a directory handle's fd is do_readdir itself - lseek/read/write are dispatched to file handles from open, not directory handles from opendir. fsyncdir and releasedir are the
only other operations that touch a directory handle, and neither seeks or reads the fd.

The pattern of letting the RwLockWriteGuard drop after as_raw_fd() is used consistently throughout the file (fsync, fallocate, flush, guest lseek) - it's a codebase idiom, not an oversight.

mtjhrc added a commit to mtjhrc/libkrun that referenced this pull request May 29, 2026
Add 7 subtests that exercise readdir visibility after directory
mutations on an already-open handle: create, unlink, mkdir/rmdir,
symlink, link, rename (same-dir), and rename (cross-dir). Each test
opens a directory, reads entries to populate the cache, mutates the
directory, calls rewinddir, and asserts the mutation is visible.

These reproduce the macOS virtiofs DirStream cache coherence bug
fixed in the previous commit (PR containers#693).

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
mtjhrc added a commit to mtjhrc/libkrun that referenced this pull request May 29, 2026
Add 7 subtests that exercise readdir visibility after directory
mutations on an already-open handle: create, unlink, mkdir/rmdir,
symlink, link, rename (same-dir), and rename (cross-dir). Each test
opens a directory, reads entries to populate the cache, mutates the
directory, calls rewinddir, and asserts the mutation is visible.

These reproduce the macOS virtiofs DirStream cache coherence bug
fixed in the previous commit (PR containers#693).

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
mtjhrc added a commit to mtjhrc/libkrun that referenced this pull request May 29, 2026
Add 7 subtests that exercise readdir visibility after directory
mutations on an already-open handle: create, unlink, mkdir/rmdir,
symlink, link, rename (same-dir), and rename (cross-dir). Each test
opens a directory, reads entries to populate the cache, mutates the
directory, calls rewinddir, and asserts the mutation is visible.

These reproduce the macOS virtiofs DirStream cache coherence bug
fixed in PR containers#693.

Assisted-by: OpenCode:claude-opus-4.6
Signed-off-by: Matej Hrica <mhrica@redhat.com>
lseek can technically fail; check the return value and propagate the
error rather than silently proceeding with an unreset offset.

Signed-off-by: Jay Gowdy <jgowdy@godaddy.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants