
Race condition between VIO_update_in_place and the garbage collector #8986

Open
MochalovAlexey wants to merge 2 commits into FirebirdSQL:master from MochalovAlexey:update_in_place_and_gc_race_fix

Conversation

@MochalovAlexey
Contributor

Hello,
I have discovered a race condition between the garbage collector and VIO_update_in_place, which is invoked in the system transaction during restore.

A race condition occurs when the record has a delta and VIO_update_in_place needs to read the back version to unpack it. However, if the garbage collector is active, it may delete that back version before it is fetched for reading, because it is not protected by anything at that point. This leads to BUGCHECK(291).

My proposal is to acquire the write lock on the record before fetching the back version. This would prevent the background garbage collector from deleting its back versions within VIO_chase_record_version.

The issue is intermittent and was reproduced only on our fork using the firebird-qa test ./tests/functional/tabloid/test_eqc_136030.py. I was unable to create a reproducible case for Firebird itself because the background garbage collector is triggered less frequently there. However, even from analysis alone, this potential problem is evident.

Thanks,
Alexey Mochalov.

Member

@hvlad hvlad left a comment


I've got this bugcheck during restore too, but haven't investigated it yet, as it is not reproducible.
Also, interesting - why does it happen on v6 only?

Comment thread: src/jrd/vio.cpp

if (!DPM_get(tdbb, org_rpb, LCK_write))
    BUGCHECK(186); // msg 186 record disappeared

Member


Your change makes it possible for two pages to be locked at the same time, which is not good.
When one has the primary record version locked and needs to get the secondary version, one should call DPM_fetch_back().
Of course, after VIO_data(), the primary record should be re-acquired.

@AlexPeshkoff
Member

Also, interesting - why does it happen on v6 only?

Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems, but it was seriously modified...

@hvlad
Member

hvlad commented Apr 14, 2026

Also, interesting - why does it happen on v6 only?

Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems, but it was seriously modified...

I'll try, thanks for the hint

@hvlad
Member

hvlad commented Apr 14, 2026

Also, interesting - why does it happen on v6 only?

Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems, but it was seriously modified...

I'll try, thanks for the hint

GCLock doesn't matter here, especially during restore.

@hvlad
Member

hvlad commented Apr 14, 2026

@MochalovAlexey do you have a call stack from when the bugcheck happens?
Also, do you know which relation was affected?

@MochalovAlexey
Contributor Author

MochalovAlexey commented Apr 14, 2026

@MochalovAlexey do you have a call stack from when the bugcheck happens? Also, do you know which relation was affected?

Yes, the table affected is RDB$DATABASE.
I've attached two stack traces from our fork: one showing a BUGCHECK(291) that occurred, and the other from a few milliseconds earlier, at the moment when GC and VIO_update_in_place are competing for the same record (see threads 18 and 22). During the race condition, rpb in VIO_chase_record_version before the purge function contains a reference to the same record as org_rpb in VIO_update_in_place. Purge deletes the backversion with the delta right before DPM_fetch is called on that delta in VIO_update_in_place.
vio_update_in_place_bugcheck.txt
vio_update_in_place_vs_gc.txt

I reproduce the stack trace leading up to the BUGCHECK by setting a breakpoint on these two lines:

  1. if (!gcGuard.gcEnabled())
  2. temp2.rpb_page = org_rpb->rpb_b_page;

@hvlad
Member

hvlad commented Apr 14, 2026

Thanks, now I think I see why it happens with v6 only.

Before v6, RDB$DATABASE was modified using a user transaction, so GC saw the record as "active" and didn't attempt to change it. In v6 the system transaction is used to modify RDB$DATABASE, so a concurrent reader sees the record as "committed" and thus calls purge/expunge for it.

I think the decision to use the system transaction to modify records was not well thought out, at least from the VIO perspective.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 14, 2026 via email

@MochalovAlexey
Contributor Author

MochalovAlexey commented Apr 15, 2026

I think the decision to use the system transaction to modify records was not well thought out, at least from the VIO perspective.

I tend to agree.

My experiments showed that VIO_update_in_place in system transactions is not safe in terms of preserving data consistency for changes committed by user transactions.

For example, if a CREATE TABLE statement is running in the system transaction and its thread is paused at DPM_get inside VIO_update_in_place, then a parallel user transaction changes the database default character set, and the system thread resumes afterwards, the primary record version gets overwritten. As a result, the data committed by the user transaction (RDB$CHARACTER_SET_NAME) is lost without any conflict being detected. The probability of such a race seems low in practice, but it can be reproduced under a debugger.

Therefore, the proper fix seems to be to avoid calling VIO_update_in_place from VIO_modify solely because the transaction has the system flag.

@hvlad
Member

hvlad commented Apr 15, 2026

I'm afraid there could be more bad cases, and a change in VIO_update_in_place is not the correct way to fix the issue.

First, the system transaction can't create back versions, as its number is always 0 and it is impossible to find the correct record version in a chain where all versions are labeled by tx 0. Thus, the system transaction uses update in place for UPDATE and backout for DELETE.

Next, system transaction changes are "immediately committed", so the garbage collector should take that into account and explicitly avoid handling the record chain when it sees a version by tx 0. Before v6 there was no such requirement, because:

there must be a strong rule that user and system transactions can't modify the same record.
This is the real reason for the issue, and it must be fixed.

@hvlad
Member

hvlad commented Apr 15, 2026

On 4/14/26 18:40, Vlad Khorsun wrote:
Thanks, now I think I see why it happens with v6 only.

Before v6, RDB$DATABASE was modified using a user transaction, so GC saw the record as "active" and didn't attempt to change it. In v6 the system transaction is used to modify RDB$DATABASE, so a concurrent reader sees the record as "committed" and thus calls purge/expunge for it.

I think the decision to use the system transaction to modify records was not well thought out, at least from the VIO perspective.

RDB$DATABASE is modified in the system transaction in order to make relation ID generation better fit new needs. With a user transaction we would, for example, be unable to create 2 independent relations in 2 parallel transactions.

We can use a separate (autonomous) transaction to modify RDB$DATABASE and commit immediately; in this case

On the other hand (after reviewing the code just now), I do not understand what prevents the use of DYN_UTIL_gen_unique_id() for relations, like we do for procedures and functions? This would solve both problems. (With the system transaction, the generator behavior needed here was more or less emulated.) The only side effect I see is that the field RDB$RELATION_ID will remain unused.

AFAIU, changing RDB$DATABASE in a user transaction allows the increment of RDB$RELATION_ID to be undone automatically in case of rollback. The generator approach doesn't allow that so easily.
Of course, "modify in a separate transaction and commit" also has this drawback.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@hvlad
Member

hvlad commented Apr 15, 2026

AFAIU, changing RDB$DATABASE in a user transaction allows the increment of RDB$RELATION_ID to be undone automatically in case of rollback. The generator approach doesn't allow that so easily.
Of course, "modify in a separate transaction and commit" also has this drawback.

Yes. But I've looked at it as 'nothing bad happens if IDs are not exactly sequential'.

Originally, there was no code that allowed reusing unused relation IDs. It was added not so long ago, in 2.5, IIRC.
I.e. it is not about strict sequentiality, but more about usage of the limited number space.

@AlexPeshkoff
Member

AlexPeshkoff commented Apr 15, 2026 via email

@hvlad
Member

hvlad commented Apr 15, 2026

Originally, there was no code that allowed reusing unused relation IDs.

Not sure about pre-FB versions, but all FB versions are able to search for holes in the ID space. BTW, in FB1 the related code is executed in the system transaction, like all dfw (if the GPRE rule regarding the transaction used by default in engine mode did not change).

Yes, indeed. My memory was wrong here.

I.e. it is not about strict sequentiality, but more about usage of the limited number space.

Yes, and this is not a problem here.

Good. One problem less ;)

