Race condition between VIO_update_in_place and the garbage collector#8986
Race condition between VIO_update_in_place and the garbage collector#8986MochalovAlexey wants to merge 2 commits intoFirebirdSQL:masterfrom
Conversation
hvlad
left a comment
There was a problem hiding this comment.
I've got this bugcheck during restore also but didn't investigated it yet as it is not reproducible.
Also, interesting - why it happens on v6 only ?
|
|
||
| if (!DPM_get(tdbb, org_rpb, LCK_write)) | ||
| BUGCHECK(186); // msg 186 record disappeared | ||
|
|
There was a problem hiding this comment.
Your change makes possible that two pages will be locked at the same time, this is not good.
When one have primary record version locked and need to get seconary version, he should call DPM_fetch_back().
Of course, after VIO_data(), primary record should be re-acquired again.
Guys, can you take a look - what does GCLock do when the bug happens? I've tested it and found no problems but - it was seriously modified... |
I'll try, thanks for the hint |
GCLock doesn't matters here, especially during restore. |
|
@MochalovAlexey do you have call stack when bugcheck happens ? |
Yes, the table affected is I reproduce the stack trace leading up to the BUGCHECK by setting a breakpoint on these two lines: |
|
Thanks, now I think I see why it happens with v6 only. Before v6, RDB$DATABASE was modified using user transaction, therefore GC see record as "active" and don't attempt to change it. In v6 system transaction is used to modify RDB$DATABASE and concurrent reader see record as "committed" and thus call purge\expunge for it. I think, decision to use system transaction to modify records is not very good thought. At least from VIO perspective. |
|
On 4/14/26 18:40, Vlad Khorsun wrote:
*hvlad* left a comment (FirebirdSQL/firebird#8986)
<#8986 (comment)>
Thanks, now I think I see why it happens with v6 only.
Before v6, RDB$DATABASE was modified using user transaction, therefore
GC see record as "active" and don't attempt to change it. In v6 system
transaction is used to modify RDB$DATABASE and concurrent reader see
record as "committed" and thus call purge\expunge for it.
I think, decision to use system transaction to modify records is not
very good thought. At least from VIO perspective.
RDB$DATABASE is modified is system transaction in order to make relation
ID generation better fit new needs. With user transaction it we will be
for example unable to create 2 independent relations in 2 transactions
in parallel.
On the other hand (after reviewing code right now) I do not understand
what prevents from use of DYN_UTIL_gen_unique_id() like we do for
procedures and functions also for relations? This will solve both
problems. (With system transaction behavior of generator, needed here,
was more or less emulated.) The only side effect I see is that field
RDB$RELATION_ID will remain unused.
|
I tend to agree. My experiments showed that For example, if a Therefore, the proper fix seems to be to avoid calling |
|
Afraid, there could be more bad cases and change in First, system transaction can't create backversions as its number is always 0 and it is impossible to find correct record version in a chain where all versions labeled by tx0. Thus, system transaction uses update in place for UPDATE and backout for DELETE. Next, system transaction changes are "immediately committed" thus garbage collector should take it into account and explicitly avoid to handle record chain when see version by tx0. Before v6 there was no such requirement, because of: there must be strong rule that user and system transactions can't modify the same record. |
We can use separate (autonomous) transaction to modify RDB$DATABASE and immediately commit, in this case
AFAIU, change of |
|
On 4/15/26 14:28, Vlad Khorsun wrote:
*hvlad* left a comment (FirebirdSQL/firebird#8986)
<#8986 (comment)>
Afraid, there could be more bad cases and change in
|VIO_update_in_place| is not correct way to fix the issue.
First, system transaction can't create backversions as its number is
always 0 and it is impossible to find correct record version in a
chain where all versions labeled by tx0. Thus, system transaction uses
update in place for UPDATE and backout for DELETE.
Next, system transaction changes are "immediately committed" thus
garbage collector should take it into account and explicitly avoid to
handle record chain when see version by tx0. Before v6 there was no
such requirement, because of:
there must be strong rule that user and system transactions can't
modify the same record.
This is the real reason of the issue and it must be fixed.
I agree with Vlad.
|
|
On 4/15/26 14:36, Vlad Khorsun wrote:
*hvlad* left a comment (FirebirdSQL/firebird#8986)
<#8986 (comment)>
On 4/14/26 18:40, Vlad Khorsun wrote: /hvlad/ left a comment
(FirebirdSQL/firebird#8986
<#8986>) <#8986
(comment)
<#8986 (comment)>
Thanks, now I think I see why it happens with v6 only.
Before v6, RDB$DATABASE was modified using user transaction,
therefore GC see record as "active" and don't attempt to
change it. In v6 system transaction is used to modify
RDB$DATABASE and concurrent reader see record as "committed"
and thus call purge\expunge for it.
I think, decision to use system transaction to modify records
is not very good thought. At least from VIO perspective.
RDB$DATABASE is modified is system transaction in order to make
relation ID generation better fit new needs. With user transaction
it we will be for example unable to create 2 independent relations
in 2 transactions in parallel.
We can use separate (autonomous) transaction to modify RDB$DATABASE
and immediately commit, in this case
May be this is the best possible solution. In addition it solves
backward compatibility issue with next RELATION_ID stored in
rdb$database. If no serious arguments against it I will make a change
tomorrow.
On the other hand (after reviewing code right now) I do not
understand what prevents from use of DYN_UTIL_gen_unique_id() like
we do for procedures and functions also for relations? This will
solve both problems. (With system transaction behavior of
generator, needed here, was more or less emulated.) The only side
effect I see is that field RDB$RELATION_ID will remain unused.
AFAIU, change of |RDB$DATABASE| in user transaction allows to
automatically undo increment of |RDB$RELATION_ID| in case of rollback.
Approach with generator doesn't allow to do it with such easy.
Of course, "modify in separate transaction and commit" also have this
drawback.
Yes. But I've looked at it as 'nothing bad happens if IDs are not
exactly sequential'.
|
Originally, there was no code that allows to reuse unused relation IDs. It was added not so long ago, in 2.5, IIRC |
|
On 4/15/26 16:21, Vlad Khorsun wrote:
*hvlad* left a comment (FirebirdSQL/firebird#8986)
<#8986 (comment)>
AFAIU, change of |RDB$DATABASE| in user transaction allows to
automatically undo increment of |RDB$RELATION_ID| in case of
rollback. Approach with generator doesn't allow to do it with
such easy.
Of course, "modify in separate transaction and commit" also
have this drawback.
Yes. But I've looked at it as 'nothing bad happens if IDs are not
exactly sequential'.
Originally, there was no code that allows to reuse unused relation IDs.
Not sure about pre-FB, all FB versions are able to search for the holes
in ID's space. BTW, in FB1 related code is executed in system
transaction, like all dfw (if GPRE rule regarding transaction used by
default in engine mode did not change).
I.e. it is not about strong sequentiality, but more about usage of
limited number space.
Yes, and this is not a problem here.
|
Yes, indeed. My memory was wrong here.
Good. One problem less ;) |
Hello,
I have discovered a race condition between the garbage collector and
VIO_update_in_place, which is invoked in the system transaction during restore.A race condition occurs when the record has a delta and
VIO_update_in_placeneeds to read the back version to unpack it. However, if the garbage collector is active, it may delete that back version before it is fetched for reading, because it is not protected by anything at that point. This leads toBUGCHECK(291).My proposal is to acquire the write lock on the record before fetching the back version. This would prevent the background garbage collector from deleting its back versions within
VIO_chase_record_version.The issue is intermittent and was reproduced only on our fork using the firebird-qa test
./tests/functional/tabloid/test_eqc_136030.py. I was unable to create a reproducible case for Firebird itself because the background garbage collector is triggered less frequently there. However, even from analysis alone, this potential problem is evident.Thanks,
Alexey Mochalov.