Fix of iOS data corruption issues, and retiring of necessary infrastructure #360

Open
biblicabeebli opened this issue Mar 6, 2024 · 1 comment
Labels
ANNOUNCEMENT Listen up, devs and sysadmins should probably watch.

Comments

@biblicabeebli
Member

This post is a notification of our recovery process and of the (thank gawd) final fix for the iOS data corruption issue, which means a complicated section of the backend codebase can be retired in the near future. This work is currently being developed on our staged-updates branch, and the real documentation, for those who want to know, is inside this script:
https://github.com/onnela-lab/beiwe-backend/blob/staged-updates/scripts/script_that_recovers_some_ios_data.py

TL;DR of what has been going on and what led to this fix

  • iOS has had a data corruption issue for a long time.
  • After removing a certain library (see beiwe-ios#56, "Beiwe 2.5, Background App Persistence, the Removal of PromiseKit, and Notification of Full iOS File Corruption Fix") I was finally able to resolve the issue.
  • For quite some time now there has been a component of beiwe-backend, running on the data processing server, that checked uploaded malformed files and made some attempts to decrypt them. The underlying issue was that the decryption keys were not included in all files, but we had a heuristic we could use to find missing keys. Think of it as the files getting cut in half.
  • There were still undecryptable files that looked well formed, and I did not understand why until specific development on Beiwe-iOS finally led me to observe the cause directly.
  • Part of this discovery is that some keys had been stuck in the wrong spots of uploaded files, and are therefore recoverable!
  • I then identified that one of my attempted fixes from quite a while ago (over a year ago, possibly even some 2022 work) had fixed one class of data issue, where files were junk, white-noise binary data, and had transformed it into this issue where the key is present but in the wrong place.
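
The "key in the wrong spot" case can be sketched roughly as follows. This is a minimal illustration, not the recovery script itself. It assumes a hypothetical layout in which data lines contain a colon separator and the key line is a single url-safe base64 token that decodes to exactly KEY_BYTES bytes. The names KEY_BYTES, looks_like_key_line, and relocate_key_line are invented for this example.

```python
import base64
import binascii
from typing import Optional

# Assumption for illustration only: the decoded key line is exactly this long.
KEY_BYTES = 256


def looks_like_key_line(line: bytes) -> bool:
    """Return True if `line` matches the assumed shape of a key line."""
    # Assumed: data lines always contain a colon, key lines never do.
    if not line or b":" in line:
        return False
    try:
        decoded = base64.b64decode(line, altchars=b"-_", validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(decoded) == KEY_BYTES


def relocate_key_line(file_contents: bytes) -> Optional[bytes]:
    """Move the first misplaced key line to the top of the file.

    Returns None when no line looks like a key, i.e. the file is not
    recoverable by this approach.
    """
    lines = file_contents.split(b"\n")
    for index, line in enumerate(lines):
        if looks_like_key_line(line):
            reordered = [line] + lines[:index] + lines[index + 1:]
            return b"\n".join(reordered)
    return None
```

A file that comes back as None here would stay in the unrecoverable pile. Anything else could, under these assumptions, be handed to the normal decryption path.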

We have more recoverable data than anticipated!

I'm still working on this. I don't have a timeline, because this isn't trivial and it came up in the middle of other work that I had pushed off and now have to attend to.

I will try and update this issue with my progress.

Here is the initial report on what this work consists of, copied directly from that file at the time of original posting. It is necessarily technical in nature:

####################################################################################################
# This script will process all the files in the PROBLEM_UPLOADS folder of AWS S3, which contains all
# uploaded files that we were unable to decrypt. The existence of these files was due to a bug
# limited to iOS devices only. There was a race-condition that affected all .csv files and it could
# result in encryption keys failing to be present in uploaded files.
#
# 1) Sometimes the files were just junk, binary noise. I don't know when it was squashed, but it
#    was. These files are completely unparseable and are never expected to be recovered.
#
# 2) Sometimes the encryption key was not present - but two files with the same name, only one
#    containing the key, were uploaded. We added a capability to the backend to stash all decryption
#    keys in the database associated with those file names. We could then use them to decrypt files
#    lacking keys but matching the name. This required both an at-upload-time check and a periodic
#    script that checks the recently uploaded bad files and checks for any keys. Code for this can be
#    found in /scripts/process_ios_no_decryption_key.py. The task for this script runs hourly.
#
# 3) Sometimes, and only observed as present in 2024 (after substantially rewriting iOS file-writing
#    code + thorough testing) the encryption key WAS present _but on the wrong line in the file_.
#    These files are fully recoverable, as are any instances of 2) that lost their keys to these
#    malformatted files.
####################################################################################################
#
# THIS SCRIPT...
# - Finds, decrypts, and sets up for processing all files affected by issue 3.
#
# - RE-processes all uploaded files that experienced issue 2.
#   - This is incredibly wasteful.
#   - This is because I can't work out how to determine if any given file has been processed at an
#     unknown time in the past.
#   - we could look at the created_on timestamp of the decryption key and make a heuristic guess???
#
# - Has been written with the intention of removing the architecture over in
#   /scripts/process_ios_no_decryption_key.py because We Have Fixed The Bug.
# 
# - Iterates over So Many Files that we can't even cache file names in memory.
#
# - At this point I'm considering making it a distributed celery task.
#
# - IS NOT FINISHED. I have been working on it on our staging server, it's just so complex and gross.
#
# - Should probably delete the files in PROBLEM_UPLOADS after we are definitely completely done
#   processing them.
#
# - I don't know what the payoff actually is for this. It "seems like a lot" to me watching test
#   versions of the script execute on staging, but I would be astonished if it is 10%.
#   More data more better though, so it's worth doing.
####################################################################################################
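
The key-stash approach described in 2) could be sketched roughly like this. A plain dict stands in for the database table, and stash_key and recoverable_files are hypothetical names, not functions from beiwe-backend. The generator yields matches lazily, so the full list of file names never has to be held in memory at once.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical stand-in for the database table of stashed keys: file name -> key.
KeyStash = Dict[str, bytes]


def stash_key(stash: KeyStash, file_name: str, key: Optional[bytes]) -> None:
    """Record a key the first time one is seen for a given file name."""
    if key is not None and file_name not in stash:
        stash[file_name] = key


def recoverable_files(
    stash: KeyStash, problem_uploads: Iterable[Tuple[str, bytes]]
) -> Iterator[Tuple[str, bytes, bytes]]:
    """Lazily yield (file_name, key, contents) for keyless files with a stashed key.

    `problem_uploads` is any iterable of (file_name, contents) pairs, for
    example a generator that streams objects out of a storage bucket.
    """
    for file_name, contents in problem_uploads:
        key = stash.get(file_name)
        if key is not None:
            yield file_name, key, contents
```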
@biblicabeebli
Member Author

Final Comments, Analysis, Post-Mortem

No further work on this item is planned. The script exists as-is and may be run manually.

The update:

  • Virtually all data encryption issues were resolved in the 2.5.0 release of the Beiwe iOS app ~one year ago.
  • The fundamental cause was eventually discovered and resolved in the followup 2.5.2 release (we skipped 2.5.1).
  • The exact reason the 2.5.0 fix worked is likely due to the thread-safety additions to file io in 2.5.0.
    • TL;DR: a positive change in the way memory allocations (probably) occurred at encryption-key-generation time solved all known issues.

Therefore:

Due to the now-understood mechanism of this failure mode I can finally make some assertions about the results of running this proposed script:

  • This script reconstructs a fraction of data from a separate ~corruption issue where files got split in half (or thirds, etc.). This is by far the minority of data.
  • The overwhelming majority of corrupted data cannot be recovered.
  • The underlying bug fully destroys data across an entire file in a fully unpredictable fashion.

Technical details:

  • The bug was caused by a use of Swift's raw pointers, due to an inappropriate inclusion of specific code in the app.

  • This code was *siiiiigh* copied out of an ancient alpha version of a cryptography library, by a LONG gone developer, instead of sourcing a real actually-supported library.

    • I did the deep dive and found the exact range of commits it could have been sourced from to confirm this.
    • This was probably done because of the then-early state of Swift as a programming language.
    • Despite the fact that there were long-standing Objective-C cryptography libraries that could easily be called into.
    • This was Never acceptable, but I also failed to identify that this had occurred.
  • Swift's raw memory mechanisms are unstable-by-design. They are subject to meaningful changes as the language (and compilers!) evolve.

    • This would probably have been fine even if we had never updated an old version of the library, because a library can be pegged to a specific version of Swift. But this code wasn't.
    • Swift has an odd quality of being highly connected to the evolution of the LLVM compiler in a way that is unusual. Strong knowledge of Swift requires a strong knowledge of LLVM in particular, not just compilers in general.
    • Swift is designed to provide guardrails and safety mechanisms related to raw memory access. It is an expertise that I lack. The correct place for this code is in a highly used, reviewed, time-tested, public, open-source library. NOT in app code with one non-specializing maintainer and developer.
  • The code in the iOS app repo continued to compile fine, but eventually required some changes after an update, as far as I could determine due to an updated version of LLVM. This applied starting in roughly June 2024 with a new version of Xcode.

    • With the release of iOS 18 the app started to reliably crash...
    • but only when deployed by TestFlight / the App Store...
    • not when run out of Xcode...
    • and not on iOS 18 with the existing version of the app in the App Store, and not on earlier versions of iOS with those new TestFlight versions
    • I assume that there is some difference - perhaps debugging symbols? - maybe an extra Apple bitcode compilation step? - but I simply don't know.
  • This Code In Question generated the encryption key for the given file, via some call that generated random bits into a buffer.

    • At some point, possibly even from the beginning, this operation became memory-unsafe. The buffer would get cleared. It had an unknowable chance to silently fail, eliding the line containing the en/decryption key from being written to the file.
    • It has now been replaced with a call to a modern encryption library.
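
To illustrate the failure mode in the abstract, here is a rough Python sketch of the defensive pattern. It is not the app's Swift code. The key size, the function names, and the idea that the key line is a base64 encoding of an already-wrapped key are all assumptions made for this example. The point is only that key material comes from a supported source and that a missing key raises an error instead of silently producing a file without its key line.

```python
import base64
import secrets

# Assumption for illustration; the app's real key size is not stated here.
AES_KEY_BYTES = 16


def generate_key() -> bytes:
    """Ask the operating system's CSPRNG for key material and sanity-check it."""
    key = secrets.token_bytes(AES_KEY_BYTES)
    if len(key) != AES_KEY_BYTES:
        raise RuntimeError("key generation returned the wrong number of bytes")
    return key


def key_header_line(wrapped_key: bytes) -> bytes:
    """Build the first line of a file from an already-wrapped (encrypted) key.

    Raises instead of returning an empty line, so a missing key can never be
    written out silently.
    """
    if not wrapped_key:
        raise ValueError("refusing to write a file without its key line")
    return base64.urlsafe_b64encode(wrapped_key) + b"\n"
```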

Post Mortem and conclusions:

  • Use real crypto libraries? No, that's too stupid and obvious. That's what they tell you in CS class, podcasts, Stack Overflow, I bet ChatGPT would even comment if blithely queried.
  • Don't be the developer that copies mission-critical code out of an obviously alpha-quality library.
  • *insert swear words as appropriate*

This is the script developed 18+ months ago and referenced above.
https://github.com/onnela-lab/beiwe-backend/blob/main/scripts/script_that_recovers_some_ios_data.py
This script requires review. I would trust my comments above, but I no longer recall any of the particulars.

If a system administrator wants to run this script:

  • ssh onto your data processing server
  • cd into beiwe-backend; if you ls you will see a run_script.py file.
  • While you are SSH'd into the server you should see (beiwe) in the command prompt. This indicates that you are using the correct Python virtual environment.
  • run a command like this: nohup python run_script.py script_that_recovers_some_ios_data > script_log_1.txt &
    • this will run the script in the background and not exit it when you close the ssh session
  • When it is done, run it again, but change script_log_1 to script_log_2 in the command so that you keep both logs.
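
If you end up running it more than twice, a small helper along these lines can pick the next unused log name. This is an optional convenience sketch. next_log_name is a hypothetical function, not part of beiwe-backend, and it assumes the logs follow the script_log_N.txt naming used in the command above.

```python
import re
from pathlib import Path


def next_log_name(directory: Path, prefix: str = "script_log_") -> str:
    """Return the first log file name numbered above every existing log."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.txt$")
    numbers = []
    for path in directory.iterdir():
        match = pattern.match(path.name)
        if match:
            numbers.append(int(match.group(1)))
    # With no existing logs this returns script_log_1.txt.
    return f"{prefix}{max(numbers, default=0) + 1}.txt"
```

Calling next_log_name(Path(".")) from inside beiwe-backend gives the name to use after the > in the nohup command.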

@onnela-lab onnela-lab locked as resolved and limited conversation to collaborators Jan 30, 2025