
Race condition between VIO_update_in_place and the garbage collector #8986

Open
MochalovAlexey wants to merge 2 commits into FirebirdSQL:master from MochalovAlexey:update_in_place_and_gc_race_fix

Conversation

@MochalovAlexey
Contributor

Hello,
I have discovered a race condition between the garbage collector and VIO_update_in_place, which is invoked in the system transaction during restore.

A race condition occurs when the record has a delta and VIO_update_in_place needs to read the back version to unpack it. However, if the garbage collector is active, it may delete that back version before it is fetched for reading, because it is not protected by anything at that point. This leads to BUGCHECK(291).

My proposal is to acquire the write lock on the record before fetching the back version. This would prevent the background garbage collector from deleting its back versions within VIO_chase_record_version.

The issue is intermittent and was reproduced only on our fork using the firebird-qa test ./tests/functional/tabloid/test_eqc_136030.py. I was unable to create a reproducible case for Firebird itself because the background garbage collector is triggered less frequently there. However, even from analysis alone, this potential problem is evident.

Thanks,
Alexey Mochalov.

Member

@hvlad hvlad left a comment


I've got this bugcheck during restore too, but haven't investigated it yet, as it is not reproducible.
Also, interesting - why does it happen on v6 only?

Comment thread: src/jrd/vio.cpp

if (!DPM_get(tdbb, org_rpb, LCK_write))
    BUGCHECK(186); // msg 186 record disappeared

Member


Your change makes it possible for two pages to be locked at the same time, which is not good.
When one has the primary record version locked and needs to get the secondary version, one should call DPM_fetch_back().
Of course, after VIO_data(), the primary record should be re-acquired.

@AlexPeshkoff
Member

Also, interesting - why does it happen on v6 only?

Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems, but it was seriously modified...

@hvlad
Member

hvlad commented Apr 14, 2026

Also, interesting - why does it happen on v6 only?

Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems, but it was seriously modified...

I'll try, thanks for the hint

@hvlad
Member

hvlad commented Apr 14, 2026

Also, interesting - why does it happen on v6 only?

Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems, but it was seriously modified...

I'll try, thanks for the hint

GCLock doesn't matter here, especially during restore.

@hvlad
Member

hvlad commented Apr 14, 2026

@MochalovAlexey do you have a call stack from when the bugcheck happens?
Also, do you know which relation was affected?

@MochalovAlexey
Contributor Author

MochalovAlexey commented Apr 14, 2026

@MochalovAlexey do you have a call stack from when the bugcheck happens? Also, do you know which relation was affected?

Yes, the table affected is RDB$DATABASE.
I've attached two stack traces from our fork: one showing a BUGCHECK(291) that occurred, and the other from a few milliseconds earlier, at the moment when GC and VIO_update_in_place are competing for the same record (see threads 18 and 22). During the race condition, rpb in VIO_chase_record_version before the purge function contains a reference to the same record as org_rpb in VIO_update_in_place. Purge deletes the backversion with the delta right before DPM_fetch is called on that delta in VIO_update_in_place.
vio_update_in_place_bugcheck.txt
vio_update_in_place_vs_gc.txt

I reproduce the stack trace leading up to the BUGCHECK by setting a breakpoint on these two lines:

  1. if (!gcGuard.gcEnabled())
  2. temp2.rpb_page = org_rpb->rpb_b_page;

@hvlad
Member

hvlad commented Apr 14, 2026

Thanks, now I think I see why it happens with v6 only.

Before v6, RDB$DATABASE was modified using a user transaction, so GC saw the record as "active" and didn't attempt to change it. In v6 the system transaction is used to modify RDB$DATABASE, so a concurrent reader sees the record as "committed" and thus calls purge/expunge for it.

I think the decision to use the system transaction to modify records was not well thought out, at least from the VIO perspective.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 14, 2026 via email

@MochalovAlexey
Contributor Author

MochalovAlexey commented Apr 15, 2026

I think the decision to use the system transaction to modify records was not well thought out, at least from the VIO perspective.

I tend to agree.

My experiments showed that VIO_update_in_place in system transactions is not safe in terms of preserving data consistency for changes committed by user transactions.

For example, if a CREATE TABLE statement is running in the system transaction and its thread is paused at DPM_get inside VIO_update_in_place, then a parallel user transaction changes the database default character set, and the system thread resumes afterwards, the primary record version gets overwritten. As a result, the data committed by the user transaction (RDB$CHARACTER_SET_NAME) is lost without any conflict being detected. The probability of such a race seems low in practice, but it can be reproduced under a debugger.

Therefore, the proper fix seems to be to avoid calling VIO_update_in_place from VIO_modify solely because the transaction has the system flag.

@hvlad
Member

hvlad commented Apr 15, 2026

I'm afraid there could be more bad cases, and a change in VIO_update_in_place is not the correct way to fix the issue.

First, the system transaction can't create back versions, as its number is always 0 and it is impossible to find the correct record version in a chain where all versions are labeled by tx 0. Thus, the system transaction uses update in place for UPDATE and backout for DELETE.

Next, system transaction changes are "immediately committed", so the garbage collector should take that into account and explicitly avoid handling the record chain when it sees a version by tx 0. Before v6 there was no such requirement, because:

there must be a strong rule that user and system transactions can't modify the same record.
This is the real reason for the issue, and it must be fixed.

@hvlad
Member

hvlad commented Apr 15, 2026

On 4/14/26 18:40, Vlad Khorsun wrote:
Thanks, now I think I see why it happens with v6 only.

Before v6, RDB$DATABASE was modified using a user transaction, so GC saw the record as "active" and didn't attempt to change it. In v6 the system transaction is used to modify RDB$DATABASE, so a concurrent reader sees the record as "committed" and thus calls purge/expunge for it.

I think the decision to use the system transaction to modify records was not well thought out, at least from the VIO perspective.

RDB$DATABASE is modified in the system transaction in order to make relation ID generation better fit new needs. With a user transaction we would, for example, be unable to create 2 independent relations in 2 parallel transactions.

We can use a separate (autonomous) transaction to modify RDB$DATABASE and commit immediately; in this case

On the other hand (after reviewing the code just now), I do not understand what prevents the use of DYN_UTIL_gen_unique_id() for relations, like we do for procedures and functions? This would solve both problems. (With the system transaction, the generator behavior needed here was more or less emulated.) The only side effect I see is that the field RDB$RELATION_ID will remain unused.

AFAIU, changing RDB$DATABASE in a user transaction allows the increment of RDB$RELATION_ID to be undone automatically in case of rollback. The generator approach doesn't allow that so easily.
Of course, "modify in a separate transaction and commit" also has this drawback.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@hvlad
Member

hvlad commented Apr 15, 2026

AFAIU, changing RDB$DATABASE in a user transaction allows the increment of RDB$RELATION_ID to be undone automatically in case of rollback. The generator approach doesn't allow that so easily.
Of course, "modify in a separate transaction and commit" also has this drawback.

Yes. But I've looked at it as 'nothing bad happens if IDs are not exactly sequential'.

Originally, there was no code that allowed reusing unused relation IDs. It was added not so long ago, in 2.5, IIRC.
I.e. it is not about strict sequentiality, but more about usage of the limited number space.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@hvlad
Member

hvlad commented Apr 15, 2026

Originally, there was no code that allowed reusing unused relation IDs.

Not sure about pre-FB versions, but all FB versions are able to search for holes in the ID space. BTW, in FB1 the related code is executed in the system transaction, like all dfw (if the GPRE rule regarding the transaction used by default in engine mode did not change).

Yes, indeed. My memory was wrong here.

I.e. it is not about strict sequentiality, but more about usage of the limited number space.

Yes, and this is not a problem here.

Good. One problem less ;)

