
KLAUS-378: latency_to_first_prediction uses start instead of adjusted_start #57

@michberger

Description


Many of us who regularly use the Nextflow pipeline to generate behavior data have noticed that the values in the latency_to_first_prediction and latency_to_last_prediction columns don't make sense: they report behaviors at frame numbers greater than the number of frames in the video, and they disagree with the values in the merged_bouts tables produced by the NF pipeline.

Michelle asked Claude to analyze the code in the repository. It discovered a bug in the code and then confirmed the error against example data I provided.

The Bug: latency_to_first_prediction Uses start Instead of adjusted_start

What the column means (per the README)

latency_to_first_prediction: Frame number of first behavior prediction in the time bin. Frame is relative to the experiment start, not the time bin.

So the intent is: find the frame number (relative to the experiment start) of the first behavior event within that bin.


What the code actually does

At line 1083:

```python
results["latency_to_first_prediction"] = behavior_bins["start"].min()
```

The key distinction is between two columns that live on each bout:

Column | Meaning
-- | --
`start` | Frame number within the video file where the bout begins
`adjusted_start` | Frame number relative to the experiment start (`= time_to_frame(video_timestamp, experiment_start) + start`)

adjusted_start is what correctly represents the frame offset from experiment start; start resets to zero at the beginning of each individual video file.
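To make the relationship concrete, here is a minimal sketch of how the two columns relate. The fps, timestamps, and the `time_to_frame` implementation below are illustrative stand-ins, not the pipeline's actual code:

```python
# Hypothetical sketch; FPS and timestamps are invented, and time_to_frame
# is a stand-in for the pipeline's own helper.
FPS = 30

def time_to_frame(video_timestamp: float, experiment_start: float) -> int:
    """Frames elapsed between experiment start and this video's start."""
    return int(round((video_timestamp - experiment_start) * FPS))

# A bout starting at frame 120 of a video that began 300 s into the experiment:
experiment_start = 0.0
video_timestamp = 300.0
start = 120                                                       # within-video frame
adjusted_start = time_to_frame(video_timestamp, experiment_start) + start
print(adjusted_start)  # 300 s * 30 fps + 120 = 9120
```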


Why it works for the 0–5 minute bin but fails for later bins

For the first bin (0–5 min), the first video starts at time zero, so time_to_frame(video_timestamp, experiment_start) = 0. This means adjusted_start == start, so the result is accidentally correct.

For later bins (5–20 min and 20–55 min), bouts come from video segments that started later in the experiment. Their adjusted_start correctly reflects the offset from experiment start, but start only holds the within-video frame number — which could be any small number relative to that video's beginning. So behavior_bins["start"].min() returns a frame number that is meaningless in the context of the full experiment timeline.

There's a second compounding issue: when a bout is split at a bin boundary (lines 1011–1019), the second_half["start"] is updated to:

```python
second_half["start"] = second_half["start"] + cur_cut - second_half["adjusted_start"]
```

This calculation adjusts start by the cut offset, but start is a within-video frame number while cur_cut is an experiment-relative frame number — so the arithmetic mixes two different reference frames, producing a corrupted start value on split bouts.


The fix

Line 1083 should use adjusted_start instead of start:

```python
# Current (wrong):
results["latency_to_first_prediction"] = behavior_bins["start"].min()

# Fixed:
results["latency_to_first_prediction"] = behavior_bins["adjusted_start"].min()
```

Similarly, lines 1084–1086 for latency_to_last_prediction should use adjusted_start + duration (which is just adjusted_end):

```python
# Current (wrong):
results["latency_to_last_prediction"] = (
    behavior_bins["start"] + behavior_bins["duration"]
).max()

# Fixed:
results["latency_to_last_prediction"] = behavior_bins["adjusted_end"].max()
```

This ensures both latency values are always expressed in experiment-relative frames, matching what the documentation describes and what you actually want.
