Log and resolve errors in uploading data due to Firebase API changes #1008
Conversation
We recently encountered an issue in which `usercache/put` calls were failing because of an error with the message format. It was not easy to determine the object that was generating the error, since we only log successful updates. This change catches all mongo errors and logs the full backtrace, including the document that generated the error.

Testing done:
- Added additional debugging statements to `sync_phone_to_server` in `emission/net/api/usercache.py`

```
  update_query = {'user_id': uuid, 'metadata.type': data["metadata"]["type"], 'metadata.write_ts': data["metadata"]["write_ts"], 'metadata.key': data["metadata"]["key"]}
+ time.sleep(2)
+ logging.debug("After sleep, continuing to processing")
```

- Started up a local server, and logged in from the emulator
- Started a trip, started location tracking, and ended the trip
- While the `usercache/put` call was processing the entries (slowed down because of the sleep), killed the local DB
- The put failed with an error message highlighting the document that was being saved (although it did not matter for this error, since it was a connection error and not a document format error)

```
2025-01-03 12:11:54,717:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:11:54,720:DEBUG:12994228224:Updated result for user = 34da08c9-e7a7-4f91-bf65-bf6b6d970c32, key = stats/client_time, write_ts = 1735933996.229703 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('6778448afacd6df071652448'), 'ok': 1.0, 'updatedExisting': False}
2025-01-03 12:11:56,726:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:11:56,728:DEBUG:12994228224:Updated result for user = 34da08c9-e7a7-4f91-bf65-bf6b6d970c32, key = stats/client_time, write_ts = 1735933996.2422519 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('6778448cfacd6df07165244a'), 'ok': 1.0, 'updatedExisting': False}
2025-01-03 12:11:58,732:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:12:29,131:ERROR:12994228224:In sync_phone_to_server, while executing update_query={'user_id': UUID('34da08c9-e7a7-4f91-bf65-bf6b6d970c32'), 'metadata.type': 'message', 'metadata.write_ts': 1735933996.3793979, 'metadata.key': 'stats/client_time'} on document={'$set': {'data': {'ts': 1735933996.369, 'client_app_version': '1.9.6', 'name': 'onboarding_state', 'client_os_version': '18.1', 'reading': {'route': 1, 'opcode': 'nrelop_dev-emulator-study_default_testdbfail'}}, 'metadata': {'time_zone': 'America/Los_Angeles', 'plugin': 'none', 'write_ts': 1735933996.3793979, 'platform': 'ios', 'read_ts': 0, 'key': 'stats/client_time', 'type': 'message'}, 'user_id': UUID('34da08c9-e7a7-4f91-bf65-bf6b6d970c32')}}
2025-01-03 12:12:29,133:ERROR:12994228224:localhost:27017: [Errno 61] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 6778444d81d477e59b10bb3a, topology_type: Single, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 61] Connection refused')>]>
```
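For reference, a minimal sketch of the shape of this change: wrap the per-entry upsert in a try/except, and log the query and the offending document before re-raising. The function and variable names below are assumptions for illustration, not the exact `usercache.py` code.

```python
import logging
import pymongo

def save_entry(usercache_db, uuid, entry):
    # Sketch only: build the same kind of update_query as sync_phone_to_server
    update_query = {'user_id': uuid,
                    'metadata.type': entry["metadata"]["type"],
                    'metadata.write_ts': entry["metadata"]["write_ts"],
                    'metadata.key': entry["metadata"]["key"]}
    document = {'$set': entry}
    try:
        result = usercache_db.update_one(update_query, document, upsert=True)
        logging.debug("Updated result for user = %s, key = %s, write_ts = %s = %s",
                      uuid, entry["metadata"]["key"], entry["metadata"]["write_ts"],
                      result.raw_result)
    except pymongo.errors.PyMongoError:
        # Log the full backtrace along with the query and the document that
        # triggered the failure, so the problematic entry shows up in the logs
        logging.exception("In sync_phone_to_server, while executing update_query=%s on document=%s",
                          update_query, document)
        raise
```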
It's annoying that we got a bunch of calls at the same time: we would not have deleted entries between calls, so we will end up pushing the same data over and over. But we do handle duplicates on the server, so it is not the end of the world. This time, the network was fine, and we were able to pull data successfully from the server. We are no longer creating any cached values, so we pull 0 documents, but there is no error.
However, the push seems to have generated a 500 error on the server
Did it generate the error for all the pushes?
I tried converting the HTTP response from hex to UTF-8 characters using
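The specific tool used for the conversion is elided above; purely as an illustration, one way to do this in Python (the hex string below is made up, not the actual response):

```python
# Illustrative only: decode a hex-encoded HTTP response body into UTF-8 text
hex_body = "7b226572726f72223a2022696e7465726e616c20736572766572206572726f72227d"
print(bytes.fromhex(hex_body).decode("utf-8"))  # {"error": "internal server error"}
```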
The 500 error was due to this backtrace
This would clearly result in a 500 error. The only match that I can find for this error, which appears to be DocumentDB specific, suggests that it may be generated if there are field names with dots in them.
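A minimal repro sketch of that hypothesis; the connection URI, collection names, and the dotted key are placeholders for illustration, not the actual deployment configuration.

```python
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")  # placeholder URI
coll = client["test_db"]["test_coll"]                       # placeholder collection

doc = {"data": {"google.c.sender.id": "1234567890"}}        # illustrative dotted key

try:
    # On DocumentDB, this fails with code 163, 'Name is not valid for storage';
    # recent MongoDB versions accept dotted field names, so it succeeds there.
    coll.update_one({"_id": "repro"}, {"$set": doc}, upsert=True)
except pymongo.errors.WriteError as e:
    print(e.code, e.details)
```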
The error is from
The server API time is logged in
So the error is not with storing the server API time. Instead, the error is with storing some other data; we then bail on the call and store the times as part of the post-hook. So the error is specifically related to the data that is on this particular phone (or similar data on other phones) and is not reproducible via the simulator.
Exporting the
So looking for the entry after that in the DB, we see
Bingo! We have a field name with a dot in it.
This is not a one-off; there are 44 such entries, and they range from
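The exact range values are elided above; as a rough sketch of how such entries can be found (the export file name and structure are assumptions for illustration):

```python
import json

def has_dotted_keys(obj):
    """Recursively check whether any dict key in obj contains a dot."""
    if isinstance(obj, dict):
        return any("." in k or has_dotted_keys(v) for k, v in obj.items())
    if isinstance(obj, list):
        return any(has_dotted_keys(v) for v in obj)
    return False

# Assumed: the phone usercache was exported as a JSON array of entries
with open("phone_usercache_export.json") as f:
    entries = json.load(f)

bad = [e for e in entries if has_dotted_keys(e.get("data", {}))]
print(len(bad), "entries with dotted keys")
print("range:", min(e["metadata"]["write_ts"] for e in bad),
      "to", max(e["metadata"]["write_ts"] for e in bad))
```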
So my guess is that Firebase changed the format of their additional data on the 20th (sounds familiar?). It seems like we have three main tasks going forward:
The stat is added here:
This code hasn't changed for 2 years.
The message that we send is generated using this trace:
So that is confirmation that the `google` and `gcm` fields are automatically added by Firebase. We don't add them, so we can't strip them out on the server, and must munge them on the phone side.
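For context, the Firebase-injected fields are in the `google.*`/`gcm.*` family; the snippet below is only an illustrative example of what such an `additionalData` payload might look like, not data copied from the affected phone.

```python
# Illustrative example only: Firebase-injected keys with dots alongside our payload
additional_data = {
    "google.c.sender.id": "1234567890",    # example Firebase-added key
    "gcm.message_id": "1735790000000000",  # example Firebase-added key
    # ... plus our own custom survey/popup payload ...
}
```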
From @JGreenlee
Actually, once we have patched the server, we don't need to munge on the phone at all. We can just handle all such entries, which also makes this more robust to additional breaking integration changes in the future.
Around Dec 21st 2024, it looks like Firebase changed the format of their push notifications to add some metadata to the `additionalData` field. This metadata has keys with dots. Since we also use the `additionalData` field to pass in the survey or popup message for custom push notifications, we store the entire `additionalData` into the notification stat. When this notification is pushed up to the server, it cannot be stored in the database, since MongoDB/DocumentDB do not support keys with dots.

https://stackoverflow.com/questions/66369545/documentdb-updateone-fails-with-163-name-is-not-valid-for-storage

While trying to save the entry, we get the error

```
Traceback (most recent call last):
  File "/usr/src/app/emission/net/api/bottle.py", line 997, in _handle
    out = route.call(**args)
  File "/usr/src/app/emission/net/api/bottle.py", line 1998, in wrapper
    rv = callback(*a, **ka)
  File "/usr/src/app/emission/net/api/cfc_webapp.py", line 249, in putIntoCache
    return usercache.sync_phone_to_server(user_uuid, from_phone)
  File "/usr/src/app/emission/net/api/usercache.py", line 54, in sync_phone_to_server
    result = usercache_db.update_one(update_query,
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 1041, in update_one
    self._update_retryable(
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 836, in _update_retryable
    return self.__database.client._retryable_write(
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1476, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1349, in _retry_with_session
    return self._retry_internal(retryable, func, session, bulk)
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1390, in _retry_internal
    return func(session, sock_info, retryable)
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 817, in _update
    return self._update(
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 782, in _update
    _check_write_command_response(result)
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 190, in _raise_last_write_error
    raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Name is not valid for storage, full error: {'index': 0, 'code': 163, 'errmsg': 'Name is not valid for storage'}
```

This is bad because this error interrupts the processing of the incoming data, and causes the `/usercache/put` call to fail. The phone keeps trying to upload this data over and over, and failing over and over, so the pipeline never makes progress, and deployers are not able to see newly processed data in their admin dashboards.

To fix this, and to make the ingestion code more robust in general, we check the incoming data for keys with dots and munge them.
This will fix this immediately, and will also ensure that we don't

Testing done:
- Added a new unit test that invokes the function directly
- Added a new integration test that creates entries and calls `sync_phone_to_server` on them

Both tests pass
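A minimal sketch of the kind of munging described above; the helper name, the replacement character, and where it is invoked are assumptions for illustration, not the actual implementation.

```python
def munge_dotted_keys(obj, replacement="_"):
    """Recursively rewrite dict keys containing dots so the document can be
    stored in MongoDB/DocumentDB. Sketch only; the real fix may differ."""
    if isinstance(obj, dict):
        return {k.replace(".", replacement): munge_dotted_keys(v, replacement)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [munge_dotted_keys(v, replacement) for v in obj]
    return obj

# Example: Firebase-injected keys become storable
print(munge_dotted_keys({"gcm.message_id": "123", "nested": {"google.c.a.e": "1"}}))
# {'gcm_message_id': '123', 'nested': {'google_c_a_e': '1'}}
```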
While investigating user-reported issues about trips not visible in the admin dashboard, we found
We got several `T_RECEIVED_SILENT_PUSH` transitions on Jan 1, but 1154 transitions are still stuck on the phone and none on the server.
We investigated this in an internal issue; I'm copying over the relevant messages here.