Bug 2007416 - wipe individual logins on DecryptionErrors #7343
Conversation
```rust
named_params! {
    ":now_millis": now_ms,
    ":guid": id,
    ":sync_status": SyncStatus::Changed as u8,
```
touch() now also updates the sync_status column. I needed this to make the tests pass, but it seems to me like the code should have always done that. Maybe we just weren't using this method enough to notice?
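A minimal sketch of the semantics being described, with the DB reduced to an in-memory record (the real code runs a SQL UPDATE via rusqlite's `named_params!`; all names here are illustrative, not the actual API):

```rust
// Hypothetical in-memory model of touch(): bump the usage bookkeeping
// and also mark the row as locally changed so the next sync uploads it.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SyncStatus {
    Synced,
    Changed,
}

struct LoginRecord {
    time_last_used: i64,
    times_used: i64,
    sync_status: SyncStatus,
}

fn touch(record: &mut LoginRecord, now_ms: i64) {
    record.time_last_used = now_ms;
    record.times_used += 1;
    // The change discussed above: also flag the row for sync.
    record.sync_status = SyncStatus::Changed;
}

fn main() {
    let mut rec = LoginRecord {
        time_last_used: 0,
        times_used: 0,
        sync_status: SyncStatus::Synced,
    };
    touch(&mut rec, 1_700_000_000_000);
    assert_eq!(rec.sync_status, SyncStatus::Changed);
    assert_eq!(rec.times_used, 1);
    println!("touched, status now {:?}", rec.sync_status);
}
```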
```diff
-Ok(Some(ServerTimestamp(millis)))
+Ok(db
+    .get_meta::<i64>(schema::LAST_SYNC_META_KEY)?
+    .map(ServerTimestamp))
```
This also needed changes to make the tests pass, because the old code didn't seem correct. Previously this would panic if there was no metadata row for LAST_SYNC_META_KEY; now it returns None.
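A sketch of the behavior change, with the DB read reduced to an `Option<i64>` parameter (the real code calls `db.get_meta::<i64>(schema::LAST_SYNC_META_KEY)`; the simplified signature here is illustrative):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct ServerTimestamp(i64);

// New behavior: a missing LAST_SYNC_META_KEY row maps cleanly to None
// via Option::map, instead of panicking on an unwrap of the absent row.
fn get_last_sync(meta_row: Option<i64>) -> Option<ServerTimestamp> {
    meta_row.map(ServerTimestamp)
}

fn main() {
    // Missing metadata is no longer a panic, just None.
    assert_eq!(get_last_sync(None), None);
    assert_eq!(get_last_sync(Some(123)), Some(ServerTimestamp(123)));
    println!("missing meta handled: {:?}", get_last_sync(None));
}
```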
```rust
self.execute("DELETE FROM loginsM WHERE guid=?", [guid])?;
// Delete the LAST_SYNC_META_KEY so that we resync the logins data next time
self.delete_meta(schema::LAST_SYNC_META_KEY)?;
tx.commit()?;
```
There are lots of incidental changes in this PR, but this is the main one.
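The main change can be sketched with an in-memory stand-in for the logins DB (loginsL holds local records, loginsM the server mirror, and meta the sync bookkeeping; the real code issues the DELETEs inside one transaction, and these struct/field names are hypothetical):

```rust
use std::collections::HashMap;

const LAST_SYNC_META_KEY: &str = "last_sync_time";

// Illustrative in-memory model of wiping one login's local data.
struct FakeLoginsDb {
    logins_l: HashMap<String, String>, // loginsL: guid -> local record
    logins_m: HashMap<String, String>, // loginsM: guid -> mirror record
    meta: HashMap<String, i64>,        // sync metadata
}

impl FakeLoginsDb {
    fn wipe_local_login_data(&mut self, guid: &str) {
        self.logins_l.remove(guid);
        self.logins_m.remove(guid);
        // Dropping the last-sync timestamp forces a full fetch on the next
        // sync, which hopefully restores a server copy of the wiped login.
        self.meta.remove(LAST_SYNC_META_KEY);
    }
}

fn main() {
    let mut db = FakeLoginsDb {
        logins_l: HashMap::from([("guid-1".into(), "undecryptable".into())]),
        logins_m: HashMap::from([("guid-1".into(), "mirror copy".into())]),
        meta: HashMap::from([(LAST_SYNC_META_KEY.into(), 1_700_000_000_000)]),
    };
    db.wipe_local_login_data("guid-1");
    assert!(db.logins_l.is_empty());
    assert!(db.logins_m.is_empty());
    assert!(!db.meta.contains_key(LAST_SYNC_META_KEY));
    println!("login wiped; next sync will be a full fetch");
}
```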
```rust
username: "cool_username".into(),
password: "hunter2".into(),
..Default::default()
})
```
I think I probably broke this test and I'm not sure exactly how to fix it or even run it (do I need to create a burner sync account for this?).
Is this still worthwhile to keep? It seems a bit redundant with the TPS tests. I can definitely fix this if it's providing value, but I want to make sure someone's using it before I do that.
I don't think it's redundant with TPS, because TPS primarily tests the existing desktop code, so the Rust components don't get much coverage there; i.e., this was intended to be kind of a "TPS but for Rust components" :) It uses the same auth as the fxa example, so it should start an OAuth flow and remember the account in .fxa-cli-credentials.json (although I don't recall how it manages 2 accounts!). It doesn't have much value since it's not run automatically, which in that regard makes it even more like TPS (which also doesn't run in CI, so has questionable value, but we're reluctant to just kill it because there is some value there).

I'd actually be surprised if it hadn't already gone stale, so I'd be fine with just leaving a comment here explaining that you think it might be broken, with a reference to this PR, and we can have a longer discussion about it later.
You're right, it was already stale before this change. I tested on main and it also fails with:

```
thread 'main' (3767285) panicked at testing/sync-test/src/logins.rs:278:5:
assertion failed: c1.logins_store.delete(&l0id).expect("Delete should work")
```

I'm thinking we should leave the code as-is, create a ticket for it, and update the README.
I'm still not entirely convinced wiping is the right thing to do here. I recall we started a discussion but don't remember it concluding; I think we agreed to rope in the credentials management team. Did they say we should take this route? I'm fine with this if they did.
I'm not sure either, but my memory is that we agreed this was an okay short-term step, since it's essentially the same behavior as when the canary check fails. Long-term, we should have the discussion with more teams, and we may decide to go the other route: keep the DB row around and present it to the user as a login with missing data (or maybe some third route). There is actually some urgency here, since the logins success rates are pretty low and most of the failures are decryption errors. That said, I do think it's a good idea to get the OK from credential management on this. WYDT @jo? BTW, here's the bugzilla bug for the longer-term decision.
Oh, I think that's a tricky issue. I'd prefer to control this via a feature flag so we can enable it selectively; I'd be reluctant to have it enabled on desktop for now. We often have problems getting the key from the OSKeyStore as well. For form autofill data we had actually implemented a similar mechanism, and we removed it once we discovered what it did: if no key was returned, a new one was created and stored immediately, which effectively reset the credit card data. With OSKeyStore there are many different reasons why the key might not be available at the moment; https://sql.telemetry.mozilla.org/queries/113887 lists 29 different errors for Darwin alone. While this all pertains to OSKeyStore, I can imagine that similar issues exist on iOS or Android, and that keys may simply be temporarily unavailable under certain circumstances.

For desktop, we have at least roughly considered various future backup strategies, partly for other reasons as well. One option is via FxA, with manual user interaction. Another possibility is backup codes (as with GitHub or 1Password) or the primary password. In this context, I could also imagine that we'd at least wait to wipe the data until we're syncing, right? That way, there's a high probability the data will be restored immediately.

By the way, thanks a lot for the groundwork; I didn't dare touch that back then :) It's much cleaner this way.

So, to cut a long story short: if we put this behind a feature flag that we initially activate only for Android and/or iOS, I'd be okay with that as an interim solution.
Thanks for the thoughtful review. I wasn't thinking about desktop at all when I implemented this. I've added a feature flag for it; how does it look now?
jo left a comment
To take a step back: I’m not opposed to a pragmatic short-term workaround for mobile devices. I understand the urgency and have no problem with us opting for an interim approach, provided we explicitly state this compromise.
However, I want us to make sure that the short-term solution doesn’t create a situation that will be difficult to maintain.
My two main concerns are:
- `decrypt_fields` now has a permanent side effect. This makes the function harder to understand, since it no longer just decrypts fields and reports errors, but also modifies local storage and affects synchronization behavior.
- I'm not sure a per-login restore is the right approach. If the error is actually caused by an unavailable/corrupted key, it could be a systemic error rather than an error in a single record. In that case, deleting one login at a time could leave the local storage in a partially restored state and potentially lead to repeated synchronization behavior. I may be misinterpreting this part, so I would like to clarify the expected behavior.
Who else should be involved in deciding on the recovery strategy here? What do the mobile team members say about this?
```rust
self.execute("DELETE FROM loginsL WHERE guid=?", [guid])?;
self.execute("DELETE FROM loginsM WHERE guid=?", [guid])?;
// Delete the LAST_SYNC_META_KEY so that we resync the logins data next time
self.delete_meta(schema::LAST_SYNC_META_KEY)?;
```
Does this mean we re-sync all logins after a wipe of a single login?
Yeah, unfortunately. I don't think this can be improved much though, since the only tool we have here is passing the server a last_modified time. We could try setting that to the last-modified time of the deleted record, but I was a bit worried that would have some subtle bugs, and we'd still be fetching half the total records on average.
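A sketch of why deleting LAST_SYNC_META_KEY means refetching everything: the only lever the client has is "records modified since T", and a missing timestamp degrades to T = 0 (names here are illustrative, not the actual sync client API):

```rust
// Minimal model of an incremental fetch against a timestamp watermark.
struct ServerRecord {
    modified_ms: i64,
    guid: &'static str,
}

fn fetch_since(server: &[ServerRecord], last_sync_ms: Option<i64>) -> Vec<&'static str> {
    // No stored timestamp (e.g. after delete_meta) means "since the epoch".
    let since = last_sync_ms.unwrap_or(0);
    server
        .iter()
        .filter(|r| r.modified_ms > since)
        .map(|r| r.guid)
        .collect()
}

fn main() {
    let server = [
        ServerRecord { modified_ms: 100, guid: "a" },
        ServerRecord { modified_ms: 200, guid: "b" },
        ServerRecord { modified_ms: 300, guid: "c" },
    ];
    // Normal incremental sync: only records newer than the stored timestamp.
    assert_eq!(fetch_since(&server, Some(200)), vec!["c"]);
    // After wiping the last-sync meta: everything comes back down.
    assert_eq!(fetch_since(&server, None), vec!["a", "b", "c"]);
    println!("full resync fetched {} records", fetch_since(&server, None).len());
}
```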
```diff
 pub fn decrypt_fields(&self, db: &LoginDb) -> Result<SecureLoginFields> {
-    SecureLoginFields::decrypt(&self.sec_fields, db.encdec.as_ref(), &self.meta.id)
+    SecureLoginFields::decrypt(&self.sec_fields, db.encdec.as_ref(), &self.meta.id).inspect_err(
+        |e| {
+            #[cfg(feature = "wipe-on-decryption-error")]
+            if matches!(e, Error::DecryptionFailed(_)) {
+                if let Err(e) = db.wipe_local_login_data(&self.meta.id) {
+                    warn!("Failed to wipe login on decryption failure: {e}");
+                }
+            }
+        },
+    )
 }
```
I'm not sure if deleting individual entries during decrypt_fields is the right place to do this.
If the decryption error is caused by a corrupted or unavailable key, I would expect it to affect all entries, not just the current one. Since this happens implicitly during reading and the first decryption throws an error, a "Fetch-All" operation would delete only one entry at first and then sync all of them, and that would happen again on the next Fetch-All until all affected logins had been repaired, throwing an error (to the user?) every time.
It also seems as though this leaves the database in an inconsistent intermediate state: some logins have already been deleted/repaired, while others still exist but cannot be decrypted, even though the underlying problem is global.
Regardless, decrypt_fields now has a persistent side effect, which is surprising given the method’s name. I would expect this method to simply decrypt fields and report errors. Perhaps the caller should decide how to handle decryption errors, and if we want to delete data, this should be done in a more explicit recovery path.
I'm pretty sure that it's just one login. I wrote this in the bugzilla ticket:
> It seems like it's just 1 login generating many errors. When I filter by user ID I see lots of errors, but they all list the same user ID. I can't be sure that this is always the case, but I've sampled quite a few errors at this point. It seems unlikely to me that users have multiple logins with undecryptable data but only 1 shows up in the error reports, because I believe that, on Android at least, there are startup functions that iterate over all logins.
Also, we have the check_canary function that should detect global errors when we fail to fetch the key.
However, you're right that this does introduce the possibility that we just detect/wipe one login at a time when all the logins are undecryptable. Also, it is a bit weird that decrypt_fields can delete data and it's not so easy to fix that because it's called by lots of functions and many of those don't feel like they should delete data either.
What if we moved this code to run_maintenance()? That seems like a decent place to do a bulk search for undecryptable logins. It might be a bit slower to catch errors, but these errors seem pretty rare anyway.
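The run_maintenance() idea above could be sketched as one bulk pass that finds every undecryptable login and wipes them together, instead of deleting as a side effect of decrypt_fields() (all names and the `decryptable` flag are illustrative stand-ins for the real decryption call):

```rust
// Illustrative bulk-scan maintenance pass over an in-memory login list.
struct StoredLogin {
    guid: String,
    // Stand-in for "SecureLoginFields::decrypt succeeds for this row".
    decryptable: bool,
}

fn run_maintenance(logins: &mut Vec<StoredLogin>, delete_undecryptable: bool) -> usize {
    if !delete_undecryptable {
        return 0;
    }
    let before = logins.len();
    // One pass over every login: a systemic key problem is caught all at
    // once, rather than one login per read as with the decrypt_fields hook.
    logins.retain(|l| l.decryptable);
    before - logins.len()
}

fn main() {
    let mut logins = vec![
        StoredLogin { guid: "good-1".into(), decryptable: true },
        StoredLogin { guid: "bad-1".into(), decryptable: false },
        StoredLogin { guid: "bad-2".into(), decryptable: false },
    ];
    let wiped = run_maintenance(&mut logins, true);
    assert_eq!(wiped, 2);
    assert_eq!(logins.len(), 1);
    assert_eq!(logins[0].guid, "good-1");
    println!("wiped {wiped} undecryptable logins in one pass");
}
```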
> there are startup functions that iterate over all logins.
I assumed we exit on the first decryption error in Store::list(), but I'm not 100% sure. That would mean we always return the same failing login's uuid, even if more logins are affected.
Yes, I think run_maintenance() would be a perfect fit!
This was really easy since @lougeniaC64 already wrote this code. What do you both think of the new version? Also, I can split out the first refactoring commit as a separate PR if that would help.
That is indeed much simpler and clearer. And yes, the refactoring commit would really be useful!
Also:
* Set sync_status in `LoginsDb::touch()`
* Handle missing row for LAST_SYNC_META_KEY in `get_last_sync()`
```
dictionary RunMaintenanceOptions {
    // Wipe un-decryptable logins. These will hopefully come back on the next sync.
    // Defaults to true.
    boolean delete_undecryptable_records_for_remote_replacement;
```
I made this default to true since I think it's what we want on iOS and Android. Desktop might want false, but it's not calling runMaintenance yet AFAICT.
```rust
    pub delete_undecryptable_records_for_remote_replacement: bool,
}

impl Default for RunMaintenanceOptions {
```
Do you also want a default in the UDL? Foreign code that constructs this value on the foreign side will need to specify the field. No big deal today, but if this dictionary grows more fields, a UDL default will start to make sense.
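For reference, the Rust-side default under discussion looks essentially like this (matching the diff above; the desktop-style override in `main` is an illustrative usage, not code from the PR):

```rust
// Options for run_maintenance; the flag is opt-out on the Rust side,
// so Rust callers get wiping by default while desktop can pass false.
pub struct RunMaintenanceOptions {
    pub delete_undecryptable_records_for_remote_replacement: bool,
}

impl Default for RunMaintenanceOptions {
    fn default() -> Self {
        Self {
            delete_undecryptable_records_for_remote_replacement: true,
        }
    }
}

fn main() {
    // Mobile-style caller: takes the default (wipe on next maintenance run).
    let opts = RunMaintenanceOptions::default();
    assert!(opts.delete_undecryptable_records_for_remote_replacement);

    // Desktop-style caller: overrides explicitly.
    let desktop = RunMaintenanceOptions {
        delete_undecryptable_records_for_remote_replacement: false,
    };
    assert!(!desktop.delete_undecryptable_records_for_remote_replacement);
    println!("default wipes: {}", opts.delete_undecryptable_records_for_remote_replacement);
}
```

Note that this only covers values constructed in Rust; foreign code building the dictionary still has to supply the field unless the UDL grows its own default.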
First commit is prep work, without many changes.
Second commit is the actual change.