Fix RB8 overlay audio rerun failures by making PipeWire overlay setup idempotent #295

Open
smuppand wants to merge 2 commits into qualcomm-linux:main from smuppand:audio

Conversation

@smuppand
Contributor

Problem

On overlay builds (audioreach modules present), repeated runs of AudioRecord can FAIL on RB8 because the overlay setup path restarts PipeWire on every run. After the first successful setup (until reboot), subsequent systemctl restart pipewire attempts can fail or hang, causing the test case to report FAIL even though the audio stack is otherwise usable.

What this PR changes

This PR fixes issue #291, reported on RB8.

Runner/utils/audio_common.sh

  • Make setup_overlay_audio_environment() idempotent for overlay builds (a sketch of the intended flow follows this list):
    • Avoid an unconditional PipeWire restart on every invocation (prevents the RB8 “frozen” rerun behavior).
    • Keep systemctl/wpctl calls guarded by the existing timeout wrappers to avoid control-plane hangs.
    • Preserve overlay requirements (DMA heap permissions) while failing only on real errors.
    • Keep readiness polling to confirm PipeWire is usable when a restart is actually needed.
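
A minimal sketch of the idempotent flow, assuming POSIX sh; the helper names (log_info, log_error) and timeout values are illustrative, not the exact code in this PR:

# Only restart PipeWire when it is not already running and responsive.
setup_overlay_audio_environment() {
    # If PipeWire is active and its control plane answers, there is nothing to do.
    if timeout 10 systemctl is-active --quiet pipewire &&
       timeout 10 wpctl status >/dev/null 2>&1; then
        log_info "PipeWire already running; skipping restart (idempotent path)"
        return 0
    fi

    # Restart under a timeout so a wedged control plane cannot hang the test.
    if ! timeout 30 systemctl restart pipewire; then
        log_error "PipeWire restart failed or timed out"
        return 1
    fi

    # Poll for readiness instead of assuming the restart worked.
    i=0
    while [ "$i" -lt 10 ]; do
        timeout 5 wpctl status >/dev/null 2>&1 && return 0
        sleep 1
        i=$((i + 1))
    done
    log_error "PipeWire did not become ready after restart"
    return 1
}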

Runner/suites/Multimedia/Audio/AudioRecord/run.sh

  • Keep the existing run.sh structure/behavior, but align it with the shared helpers (see the sketch after this list):
    • Use helpers from audio_common.sh (e.g., the PipeWire default-source helper where applicable).
    • Remove duplicate helper definitions (run.sh should not redefine helpers that already live in audio_common.sh).
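
For illustration, the alignment in run.sh might look like this; pw_set_default_source is the shared helper named in this PR, while the sourcing path and error handling are assumptions:

# Shared helpers come from audio_common.sh; run.sh keeps no local copies.
. "$(dirname "$0")/../../../../utils/audio_common.sh"   # path is illustrative

# Before (local logic): wpctl set-default "$source_id"
# After (shared helper):
pw_set_default_source "$source_id" || {
    log_error "Failed to set default PipeWire source"
    exit 1
}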

@bhargav0610 left a comment

LGTM

@lumag left a comment

Restarting PipeWire is a valid code path which must work. Please add an explicit test that restarts PipeWire and makes sure that it works. Not restarting PipeWire is not a way to solve the issue.
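
A minimal sketch of the kind of explicit check being requested here; the test and logging helper names are hypothetical:

# Hypothetical explicit test: restart PipeWire and verify it comes back.
test_pipewire_restart() {
    if ! timeout 30 systemctl restart pipewire; then
        log_fail "pipewire restart timed out or failed"
        return 1
    fi
    # Verify the control plane answers after the restart.
    if ! timeout 10 wpctl status >/dev/null 2>&1; then
        log_fail "wpctl did not respond after restart"
        return 1
    fi
    log_pass "pipewire restart verified"
}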

fmt="$1"; dur="$2"
base="${AUDIO_CLIPS_BASE_DIR:-AudioClips}"


Too many unrelated changes. Please clean up your commit

else
log_error "No downloader (wget/curl) available to fetch $url"
return 1
fi

Don't mix style cleanups and actual changes. It makes it much harder to review your PR.

@smuppand
Contributor Author

  1. /proc/asound/pcm is hanging in-kernel
    cat /proc/asound/pcm timed out (rc=124). That’s not PipeWire anymore — it’s ALSA/ASoC stuck inside the kernel.
  2. The wedge is happening during APR/PDR/remoteproc teardown
    D-state stacks show a very specific chain:
    PID 11 (kworker) stuck in: snd_pcm_dev_disconnect → snd_card_disconnect_sync → soc_cleanup_card_resources → ... → apr_remove_device → apr_pd_status → pdr_notifier_work
    That is the kernel disconnecting the sound card / components as part of an APR/PDR event (typically triggered by ADSP crash/SSR or remoteproc stop/start).

Adjusting the order of the tests may temporarily resolve the freezing issue. qualcomm-linux/lava-test-plans#23

[<0>] snd_pcm_dev_disconnect+0x44/0x1e0 [snd_pcm]
[<0>] snd_device_disconnect_all+0x5c/0xb0 [snd]
[<0>] snd_card_disconnect.part.0+0x13c/0x2b8 [snd]
[<0>] snd_card_disconnect_sync+0x34/0x110 [snd]
[<0>] soc_cleanup_card_resources+0x28/0x2a0 [snd_soc_core]
[<0>] snd_soc_del_component_unlocked+0xc0/0x128 [snd_soc_core]
[<0>] snd_soc_unregister_component_by_driver+0x3c/0x68 [snd_soc_core]
[<0>] devm_component_release+0x14/0x20 [snd_soc_core]
[<0>] devres_release_all+0xa0/0x120
[<0>] device_unbind_cleanup+0x18/0x70
[<0>] device_release_driver_internal+0x1e4/0x21c
[<0>] device_release_driver+0x18/0x24
[<0>] bus_remove_device+0xc4/0x104
[<0>] device_del+0x148/0x40c
[<0>] device_unregister+0x14/0x34
[<0>] apr_remove_device+0x44/0x60 [apr]
[<0>] device_for_each_child+0x64/0xc0
[<0>] apr_pd_status+0x58/0x70 [apr]
[<0>] pdr_notifier_work+0x90/0xdc [pdr_interface]
[<0>] process_one_work+0x150/0x290
[<0>] worker_thread+0x2d0/0x3ec
[<0>] kthread+0x12c/0x204
[<0>] ret_from_fork+0x10/0x20
[rc=0]
--- /proc/1268/stack ---
$ timeout 1 sh -c cat /proc/1268/stack 2>/dev/null || echo 'NO STACK'
[<0>] pdr_handle_release+0x30/0xf0 [pdr_interface]
[<0>] apr_remove+0x20/0x58 [apr]
[<0>] rpmsg_dev_remove+0x38/0x60
[<0>] device_remove+0x4c/0x80
[<0>] device_release_driver_internal+0x1c4/0x21c
[<0>] device_release_driver+0x18/0x24
[<0>] bus_remove_device+0xc4/0x104
[<0>] device_del+0x148/0x40c
[<0>] device_unregister+0x14/0x34
[<0>] qcom_glink_remove_device+0x10/0x20
[<0>] device_for_each_child+0x64/0xc0
[<0>] qcom_glink_native_remove+0x104/0x270
[<0>] qcom_glink_smem_unregister+0x28/0x54 [qcom_glink_smem]
[<0>] glink_subdev_stop+0x1c/0x3c [qcom_common]
[<0>] rproc_stop_subdevices+0x3c/0x60
[<0>] rproc_stop+0x34/0x11c
[<0>] rproc_shutdown+0x58/0x140
[<0>] state_store+0xb4/0xfc
[<0>] dev_attr_store+0x18/0x2c
[<0>] sysfs_kf_write+0x7c/0x94
[<0>] kernfs_fop_write_iter+0x12c/0x200
[<0>] vfs_write+0x240/0x380
[<0>] ksys_write+0x64/0x100
[<0>] __arm64_sys_write+0x18/0x24
[<0>] invoke_syscall.constprop.0+0x40/0xf0
[<0>] el0_svc_common.constprop.0+0xb8/0xd8
[<0>] do_el0_svc+0x1c/0x28
[<0>] el0_svc+0x34/0xe8
[<0>] el0t_64_sync_handler+0xa0/0xe4
[<0>] el0t_64_sync+0x19c/0x1a0
[rc=0]
--- /proc/1283/stack ---
$ timeout 1 sh -c cat /proc/1283/stack 2>/dev/null || echo 'NO STACK'
[<0>] snd_pcm_substream_proc_status_read+0x58/0x1e8 [snd_pcm]
[<0>] snd_info_seq_show+0x34/0x4c [snd]
[<0>] seq_read_iter+0x100/0x478
[<0>] seq_read+0xec/0x12c
[<0>] proc_reg_read+0x74/0xe0
[<0>] vfs_read+0xc4/0x33c
[<0>] ksys_read+0x64/0x100
[<0>] __arm64_sys_read+0x18/0x24
[<0>] invoke_syscall.constprop.0+0x40/0xf0
[<0>] el0_svc_common.constprop.0+0xb8/0xd8
[<0>] do_el0_svc+0x1c/0x28
[<0>] el0_svc+0x34/0xe8
[<0>] el0t_64_sync_handler+0xa0/0xe4
[<0>] el0t_64_sync+0x19c/0x1a0
[rc=0]
--- /proc/2080/stack ---
$ timeout 1 sh -c cat /proc/2080/stack 2>/dev/null || echo 'NO STACK'
[<0>] snd_pcm_proc_read+0x30/0x104 [snd_pcm]
[<0>] snd_info_seq_show+0x34/0x4c [snd]
[<0>] seq_read_iter+0x100/0x478
[<0>] seq_read+0xec/0x12c
[<0>] proc_reg_read+0x74/0xe0
[<0>] vfs_read+0xc4/0x33c
[<0>] ksys_read+0x64/0x100
[<0>] __arm64_sys_read+0x18/0x24
[<0>] invoke_syscall.constprop.0+0x40/0xf0
[<0>] el0_svc_common.constprop.0+0xb8/0xd8
[<0>] do_el0_svc+0x1c/0x28
[<0>] el0_svc+0x34/0xe8
[<0>] el0t_64_sync_handler+0xa0/0xe4
[<0>] el0t_64_sync+0x19c/0x1a0
[rc=0]
--- /proc/2179/stack ---
$ timeout 1 sh -c cat /proc/2179/stack 2>/dev/null || echo 'NO STACK'
[<0>] snd_pcm_proc_read+0x30/0x104 [snd_pcm]
[<0>] snd_info_seq_show+0x34/0x4c [snd]
[<0>] seq_read_iter+0x100/0x478
[<0>] seq_read+0xec/0x12c
[<0>] proc_reg_read+0x74/0xe0
[<0>] vfs_read+0xc4/0x33c
[<0>] ksys_read+0x64/0x100
[<0>] __arm64_sys_read+0x18/0x24
[<0>] invoke_syscall.constprop.0+0x40/0xf0
[<0>] el0_svc_common.constprop.0+0xb8/0xd8
[<0>] do_el0_svc+0x1c/0x28
[<0>] el0_svc+0x34/0xe8
[<0>] el0t_64_sync_handler+0xa0/0xe4
[<0>] el0t_64_sync+0x19c/0x1a0
[rc=0]
--- /proc/2311/stack ---
$ timeout 1 sh -c cat /proc/2311/stack 2>/dev/null || echo 'NO STACK'
[<0>] snd_pcm_proc_read+0x30/0x104 [snd_pcm]
[<0>] snd_info_seq_show+0x34/0x4c [snd]
[<0>] seq_read_iter+0x100/0x478
[<0>] seq_read+0xec/0x12c
[<0>] proc_reg_read+0x74/0xe0
[<0>] vfs_read+0xc4/0x33c
[<0>] ksys_read+0x64/0x100
[<0>] __arm64_sys_read+0x18/0x24
[<0>] invoke_syscall.constprop.0+0x40/0xf0
[<0>] el0_svc_common.constprop.0+0xb8/0xd8
[<0>] do_el0_svc+0x1c/0x28
[<0>] el0_svc+0x34/0xe8
[<0>] el0t_64_sync_handler+0xa0/0xe4
[<0>] el0t_64_sync+0x19c/0x1a0

@lumag

lumag commented Feb 16, 2026

  1. /proc/asound/pcm is hanging in-kernel
    cat /proc/asound/pcm timed out (rc=124). That’s not PipeWire anymore — it’s ALSA/ASoC stuck inside the kernel.
  2. The wedge is happening during APR/PDR/remoteproc teardown
    D-state stacks show a very specific chain:
    PID 11 (kworker) stuck in: snd_pcm_dev_disconnect → snd_card_disconnect_sync → soc_cleanup_card_resources → ... → apr_remove_device → apr_pd_status → pdr_notifier_work
    That is the kernel disconnecting the sound card / components as part of an APR/PDR event (typically triggered by ADSP crash/SSR or remoteproc stop/start).

So, is it an issue in the kernel itself or in the AudioReach drivers?

Adjusting the order of the tests may temporarily resolve the freezing issue. qualcomm-linux/lava-test-plans#23

Working around the issue would mean that we would not be able to test whether the issue is actually fixed or not.

… freezes

On overlay builds (audioreach modules present), setup_overlay_audio_environment()
was restarting pipewire every run, which can fail/hang on RB8 after the first
successful setup until reboot.

Make overlay setup idempotent:
- avoid unconditional pipewire restart on subsequent runs
- guard systemctl/wpctl calls with timeouts to prevent freezes
- keep DMA heap permission setup but fail only on real errors
- add readiness polling to confirm PipeWire is usable

This removes flaky FAILs on repeated AudioRecord runs on RB8 overlay images.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
…runtime

Align AudioRecord with shared audio_common/functestlib helpers and reduce
local logic that can drift.

- use pw_set_default_source helper instead of raw wpctl set-default
- ensure alsa_pick_virtual_pcm comes from audio_common.sh (no local copy)
- replace expr-based counters with POSIX arithmetic expansion (one-line example after this commit message)
- keep existing CLI/behavior and result/log layout unchanged

No functional change to the recording matrix/config logic beyond robustness.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
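
For context, the counter change referenced in this commit message is the standard POSIX replacement for expr; the variable name is illustrative:

# Before: spawns a subprocess per increment.
pass_count=$(expr "$pass_count" + 1)
# After: POSIX arithmetic expansion, no subprocess.
pass_count=$((pass_count + 1))
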
@smuppand
Contributor Author

smuppand commented Feb 17, 2026

  1. /proc/asound/pcm is hanging in-kernel
    cat /proc/asound/pcm timed out (rc=124). That’s not PipeWire anymore — it’s ALSA/ASoC stuck inside the kernel.
  2. The wedge is happening during APR/PDR/remoteproc teardown
    D-state stacks show a very specific chain:
    PID 11 (kworker) stuck in: snd_pcm_dev_disconnect → snd_card_disconnect_sync → soc_cleanup_card_resources → ... → apr_remove_device → apr_pd_status → pdr_notifier_work
    That is the kernel disconnecting the sound card / components as part of an APR/PDR event (typically triggered by ADSP crash/SSR or remoteproc stop/start).

So, is it an issue in the kernel itself or in the AudioReach drivers?

From your logs, this looks like a kernel/DSP-audio stack wedge (q6apm/apr/gpr/remoteproc/glink), not a PipeWire/AudioReach userspace hang.

  1. wpctl status hangs only for /run/user/0 (root)
    That’s a userspace/control-plane issue: the PipeWire instance rooted at /run/user/0 is not responding, so wpctl blocks waiting for PipeWire. When you point XDG_RUNTIME_DIR to /run/user/1000, wpctl works immediately (even though it shows mostly “Dummy Output”). That tells us the scripts should not assume root’s runtime dir; they must probe for a working XDG_RUNTIME_DIR and keep the timeouts (see the sketch after this list).

  2. Hard wedge: cat /proc/asound/pcm hangs + D-state tasks in snd_pcm_proc_read / snd_pcm_substream_proc_status_read
    This is kernel-side (ALSA/ASoC) getting stuck, not PipeWire/WirePlumber. Your wedge capture shows multiple tasks in uninterruptible sleep (D) inside ALSA proc read paths and card disconnect paths, and your dmesg shows Q6/APR/GPR/Q6APM timeouts + ADSP crash/recovery assertions.
    That points to the kernel ↔ DSP audio stack (q6apm/q6prm/apr/pdr/glink/remoteproc) being left in a bad state after ADSP restart/crash.
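
A minimal sketch of that probing, assuming POSIX sh; the candidate directory list and timeout are illustrative:

# Probe candidate runtime dirs and pick the first whose PipeWire answers.
pick_runtime_dir() {
    for dir in /run/user/0 /run/user/1000; do
        [ -d "$dir" ] || continue
        if XDG_RUNTIME_DIR="$dir" timeout 5 wpctl status >/dev/null 2>&1; then
            XDG_RUNTIME_DIR="$dir"
            export XDG_RUNTIME_DIR
            return 0
        fi
    done
    return 1   # no responsive PipeWire instance found
}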

Adjusting the order of the tests may temporarily resolve the freezing issue. qualcomm-linux/lava-test-plans#23

Working around the issue would mean that we would not be able to test whether the issue is actually fixed or not.

You’re right; reordering can hide the bug. With this patch, we check whether userspace responds. If there is no response, we attempt a restart and retry before marking the test as SKIP (sketch below).
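
A sketch of that check-restart-retry flow, assuming POSIX sh; log_skip is a hypothetical reporting helper:

# Check userspace first; restart and retry once before declaring SKIP.
if ! timeout 10 wpctl status >/dev/null 2>&1; then
    timeout 30 systemctl restart pipewire
    if ! timeout 10 wpctl status >/dev/null 2>&1; then
        log_skip "PipeWire unresponsive after restart; likely a kernel-side wedge"
        exit 0
    fi
fi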

@lumag

lumag commented Feb 17, 2026

  1. /proc/asound/pcm is hanging in-kernel
    cat /proc/asound/pcm timed out (rc=124). That’s not PipeWire anymore — it’s ALSA/ASoC stuck inside the kernel.
  2. The wedge is happening during APR/PDR/remoteproc teardown
    D-state stacks show a very specific chain:
    PID 11 (kworker) stuck in: snd_pcm_dev_disconnect → snd_card_disconnect_sync → soc_cleanup_card_resources → ... → apr_remove_device → apr_pd_status → pdr_notifier_work
    That is the kernel disconnecting the sound card / components as part of an APR/PDR event (typically triggered by ADSP crash/SSR or remoteproc stop/start).

So, is it an issue in the kernel itself or in the AudioReach drivers?

From your logs, this looks like a kernel/DSP-audio stack wedge (q6apm/apr/gpr/remoteproc/glink), not a PipeWire/AudioReach userspace hang.

AudioReach also provides its own set of kernel drivers; that's why I am asking whether the issue is on the upstream side or in the AR kernel driver. In the latter case I'd prefer to temporarily disable the AR in-kernel driver until it is fixed.

@smuppand
Contributor Author

  1. /proc/asound/pcm is hanging in-kernel
    cat /proc/asound/pcm timed out (rc=124). That’s not PipeWire anymore — it’s ALSA/ASoC stuck inside the kernel.
  2. The wedge is happening during APR/PDR/remoteproc teardown
    D-state stacks show a very specific chain:
    PID 11 (kworker) stuck in: snd_pcm_dev_disconnect → snd_card_disconnect_sync → soc_cleanup_card_resources → ... → apr_remove_device → apr_pd_status → pdr_notifier_work
    That is the kernel disconnecting the sound card / components as part of an APR/PDR event (typically triggered by ADSP crash/SSR or remoteproc stop/start).

So, is it an issue in the kernel itself or in the AudioReach drivers?

From your logs, this looks like a kernel/DSP-audio stack wedge (q6apm/apr/gpr/remoteproc/glink), not a PipeWire/AudioReach userspace hang.

AudioReach also provides its own set of kernel drivers; that's why I am asking whether the issue is on the upstream side or in the AR kernel driver. In the latter case I'd prefer to temporarily disable the AR in-kernel driver until it is fixed.

After discussing with the audio team, we received an update that SSR should not be performed for ADSP, since it is not supported upstream. Therefore, we need to adjust the remoteproc test to ensure it does not trigger SSR.
