Skip to content

Bug 2007416 - wipe individual logins on DecryptionErrors #7343

Open
bendk wants to merge 1 commit into
mozilla:mainfrom
bendk:push-myztqrztlsuu
Open

Bug 2007416 - wipe individual logins on DecryptionErrors #7343
bendk wants to merge 1 commit into
mozilla:mainfrom
bendk:push-myztqrztlsuu

Conversation

@bendk
Copy link
Copy Markdown
Contributor

@bendk bendk commented Apr 28, 2026

First commit is prep work, without much changes.
Second commit is the actual change.

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

@bendk bendk force-pushed the push-myztqrztlsuu branch from 6c5a0c6 to e8513bf Compare April 28, 2026 16:41
Comment thread components/logins/src/db.rs Outdated
named_params! {
":now_millis": now_ms,
":guid": id,
":sync_status": SyncStatus::Changed as u8,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

touch() now also updates the sync_status column. I needed this to make the tests pass, but it seems like the code should have always done that to me. Maybe we were just not using this method enough to notice?

Comment thread components/logins/src/sync/engine.rs Outdated
Ok(Some(ServerTimestamp(millis)))
Ok(db
.get_meta::<i64>(schema::LAST_SYNC_META_KEY)?
.map(ServerTimestamp))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needed changes to make the tests pass, because it seemed like the old code was not correct. Before this would panic if there was no metadata for LAST_SYNC_META_KEY, now it returns None.

Comment thread components/logins/src/db.rs Outdated
self.execute("DELETE FROM loginsM WHERE guid=?", [guid])?;
// Delete the LAST_SYNC_META_KEY so that we resync the logins data next time
self.delete_meta(schema::LAST_SYNC_META_KEY)?;
tx.commit()?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's lots of incidental changes in this PR, but this is the main change.

@bendk bendk requested a review from mhammond April 28, 2026 16:46
@bendk bendk force-pushed the push-myztqrztlsuu branch 2 times, most recently from 99137dd to aac821a Compare May 6, 2026 20:49
Comment thread testing/sync-test/src/logins.rs Outdated
username: "cool_username".into(),
password: "hunter2".into(),
..Default::default()
})
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I probably broke this test and I'm not sure exactly how to fix it or even run it (do I need to create a burner sync account for this?).

Is this still worthwhile to keep? It seems a bit redundant with the TPS tests. I can definitely fix this if it's providing value, but I want to make sure someone's using it before I do that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's redundant with TPS because that's primarily just testing the existing desktop stuff, so the Rust components don't get much coverage - ie, this was intended to kinda be "TPS but for rust components" :) It uses the same auth as that fxa example, so should start an oauth flow and remember the account in .fxa-cli-credentials.json (although I don't recall how it manages 2 accounts!). It doesn't have much value as it's not run automatically etc, so in that regard it is even more like TPS (which also doesn't run in CI, so has questionable value, but we're also reluctant to just kill that because there is some value there)

I'd actually be surprised if it hasn't already gone stale, so I'd be fine with even just leaving a comment here explaining you think it might be broken and a reference to this PR, and we can have a longer discussion about it later.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right it was already stale before this. I tested this on main and it's also failing with:

thread 'main' (3767285) panicked at testing/sync-test/src/logins.rs:278:5:
assertion failed: c1.logins_store.delete(&l0id).expect("Delete should work")

I'm thinking we should leave the code as-is, create a ticket for it, and update the readme.

Copy link
Copy Markdown
Member

@mhammond mhammond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not entirely convinced wiping is the right thing to do here - I recall we started a discussion but don't remember it concluding - I think we agreed to rope in the credentials management team - did they say we should take this route? I'm fine with this if they did.

Comment thread testing/sync-test/src/logins.rs Outdated
username: "cool_username".into(),
password: "hunter2".into(),
..Default::default()
})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's redundant with TPS because that's primarily just testing the existing desktop stuff, so the Rust components don't get much coverage - ie, this was intended to kinda be "TPS but for rust components" :) It uses the same auth as that fxa example, so should start an oauth flow and remember the account in .fxa-cli-credentials.json (although I don't recall how it manages 2 accounts!). It doesn't have much value as it's not run automatically etc, so in that regard it is even more like TPS (which also doesn't run in CI, so has questionable value, but we're also reluctant to just kill that because there is some value there)

I'd actually be surprised if it hasn't already gone stale, so I'd be fine with even just leaving a comment here explaining you think it might be broken and a reference to this PR, and we can have a longer discussion about it later.

@bendk
Copy link
Copy Markdown
Contributor Author

bendk commented May 7, 2026

I'm still not entirely convinced wiping is the right thing to do here - I recall we started a discussion but don't remember it concluding - I think we agreed to rope in the credentials management team - did they say we should take this route? I'm fine with this if they did.

I'm not sure either, but my memory is that we agreed this was an okay short-term step since it's the essentially the same behavior as when the canary check fails. Long-term, we should have the discussion with more teams and we may decide to go the other route: keep the DB row around and present it to the user as a login with missing data (or maybe some 3rd route). There actually is some urgency to this since the logins success rates are pretty low and most of the failures are decryption errors.

That said, I do think it's a good idea to get the OK from credential management on this. WYDT @jo?

BTW, here's the bugzilla bug for the longer-term decision.

@bendk bendk requested a review from jo May 7, 2026 13:03
@bendk bendk force-pushed the push-myztqrztlsuu branch from aac821a to aa3939e Compare May 7, 2026 13:11
@jo
Copy link
Copy Markdown
Contributor

jo commented May 7, 2026

Oh, I think that's a tricky issue.

I'd prefer to control that via a feature flag so we can enable it selectively. I'd be reluctant to have that enabled on the desktop for now.

We often have problems getting the key from the OSKeyStore as well. Here, with form autofill data, we had actually implemented a similar mechanism and had just removed it when we discovered it. In this case, if no key was returned, a new one was created and stored immediately, which actively reset the credit card data. With OSKeyStore, there are many different reasons why the key might not be available at the moment. https://sql.telemetry.mozilla.org/queries/113887 lists 29 different errors only for Darwin. While this all pertains to OSKeyStore, I can imagine that similar issues exist on iOS or Android, and that the keys may simply be temporarily unavailable under certain circumstances.

For the desktop, we have at least roughly considered various future backup strategies, partly for other reasons as well. One option is via FxA, with manual user interaction. Another possibility is backup codes (as with GH or 1Password) or the primary password.

In this context, I could also imagine that we’d at least wait to wipe the data until we’re syncing, right? That way, there’s a high probability the data will be restored immediately.

By the way, thanks a lot for the groundwork - I didn’t dare touch that back then :) It’s much cleaner this way.

So, to cut a long story short: if we enable this with a feature flag, which we’d initially activate only for Android and/or iOS, I’d be okay with that as an interim solution.

@bendk bendk force-pushed the push-myztqrztlsuu branch from aa3939e to fed1593 Compare May 8, 2026 14:31
@bendk
Copy link
Copy Markdown
Contributor Author

bendk commented May 8, 2026

Thanks for the thoughtful review. I wasn't thinking about desktop at all when I implemented this. Added a feature flag for this one, how does it look now?

Copy link
Copy Markdown
Contributor

@jo jo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To take a step back: I’m not opposed to a pragmatic short-term workaround for mobile devices. I understand the urgency and have no problem with us opting for an interim approach, provided we explicitly state this compromise.

However, I want us to make sure that the short-term solution doesn’t create a situation that will be difficult to maintain.

My two main concerns are:

  1. decrypt_fields now has a permanent side effect. This makes the function harder to understand, since it no longer just decrypts fields and reports errors, but also modifies local storage and affects synchronization behavior.
  2. I’m not sure if a restore per login is the right approach. If the error is actually caused by an unavailable/corrupted key, it could be a systemic error rather than an error in a single data record. In this case, deleting one login at a time could leave the local storage in a partially restored state and potentially lead to repeated synchronization behavior. I may be misinterpreting this part, so I would like to clarify the expected behavior.

Who else should be involved in deciding on the recovery strategy here? What do the mobile team members say about this?

Comment thread components/logins/Cargo.toml Outdated
Comment thread components/logins/src/db.rs Outdated
self.execute("DELETE FROM loginsL WHERE guid=?", [guid])?;
self.execute("DELETE FROM loginsM WHERE guid=?", [guid])?;
// Delete the LAST_SYNC_META_KEY so that we resync the logins data next time
self.delete_meta(schema::LAST_SYNC_META_KEY)?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we re-sync all logins after a wipe of a single login?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, unfortunately. I don't think this can be improved that much though, since the only tool we have here is to pass the server a last_modified time. We could try to set that to the last modified of the deleted record, but I was a bit worried that would have some subtle bugs and we're still fetching half the total records on average.

Comment thread components/logins/src/login.rs Outdated
Comment on lines 676 to 687
pub fn decrypt_fields(&self, db: &LoginDb) -> Result<SecureLoginFields> {
SecureLoginFields::decrypt(&self.sec_fields, db.encdec.as_ref(), &self.meta.id)
SecureLoginFields::decrypt(&self.sec_fields, db.encdec.as_ref(), &self.meta.id).inspect_err(
|e| {
#[cfg(feature = "wipe-on-decryption-error")]
if matches!(e, Error::DecryptionFailed(_)) {
if let Err(e) = db.wipe_local_login_data(&self.meta.id) {
warn!("Failed to wipe login on decryption failure: {e}");
}
}
},
)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if deleting individual entries during decrypt_fields is the right place to do this.

If the decryption error is caused by a corrupted or unavailable key, I would expect this to affect all entries, not just the current one. Since this happens implicitly during reading and the first decryption throws an error, a “Fetch-All” operation would result in only one entry being deleted at first, then all of them being synchronized. And that would happen again on the next Fetch-All until all affected logins had been repaired. And throw an error (to the user?) every time.

It also seems as though this leaves the database in an inconsistent intermediate state: some logins have already been deleted/repaired, while others still exist but cannot be decrypted, even though the underlying problem is global.

Regardless, decrypt_fields now has a persistent side effect, which is surprising given the method’s name. I would expect this method to simply decrypt fields and report errors. Perhaps the caller should decide how to handle decryption errors, and if we want to delete data, this should be done in a more explicit recovery path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that it's just one login. I wrote this in the bugzilla ticket:

It seems like it's just 1 login generating many errors. When I filter by user ID I see lots of errors, but they all list the same user ID. I can't be sure that this is always the case, but I've sampled quite a few errors at this point. It's seems unlikely to me the users have multiple logins with undecryptable data, but only 1 shows up in the error reports because I believe there are on Android at least there are startup functions that iterate over all logins.

Also, we have the check_canary function that should detect global errors when we fail to fetch the key.

However, you're right that this does introduce the possibility that we just detect/wipe one login at a time when all the logins are undecryptable. Also, it is a bit weird that decrypt_fields can delete data and it's not so easy to fix that because it's called by lots of functions and many of those don't feel like they should delete data either.

What if we moved this code to run_maintenence()? That seems like a decent place to do a bulk search for undecryptable logins. It might be a bit slower to catch errors, but it seems like these errors are pretty rare anyways.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are startup functions that iterate over all logins.

I assumed we exit on the first decryption error in Store::list() but I'm not 100% sure. This would then always return the same error login uuid, even if more logins are affected.

Yes, I think run_maintenence() would be a perfect fit!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was really easy since @lougeniaC64 already wrote this code. What do you both think of the new version? Also, I can split out the first refactoring commit as a separate PR if that would help.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is indeed much simpler and clearer. And yes, the refactoring commit would really be useful!

@bendk bendk force-pushed the push-myztqrztlsuu branch 5 times, most recently from 28d18fc to 712f4ef Compare May 11, 2026 20:32
Also:
  * Set sync_status in `LoginsDb::touch()`
  * Handle missing row for LAST_SYNC_META_KEY in `get_last_sync()`
@bendk bendk force-pushed the push-myztqrztlsuu branch from 712f4ef to 6e79a56 Compare May 12, 2026 13:13
dictionary RunMaintenanceOptions {
// Wipe un-decryptable logins. These will hopefully come back on the next sync.
// Defaults to true.
boolean delete_undecryptable_records_for_remote_replacement;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this default to true since I think it's what we want on iOS and Android. Desktop might want false, but it's not calling runMaintenance yet AFAICT.

Copy link
Copy Markdown
Member

@mhammond mhammond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

pub delete_undecryptable_records_for_remote_replacement: bool,
}

impl Default for RunMaintenanceOptions {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also want a default in the udl? Foreign code passing an empty value created on the foreign side will need to specify this value. No big deal today, but if that grows more values it will make some sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants