
Log and resolve errors in uploading data due to Firebase API changes #1008

Merged (2 commits) on Jan 4, 2025

Conversation

shankari (Contributor) commented Jan 4, 2025

While investigating user-reported issues about trips not being visible in the admin dashboard, we found the following.

We got several T_RECEIVED_SILENT_PUSH on Jan 1.

2025-01-01T09:52:46.052520-08:00	In TripDiaryStateMachine, received transition T_RECEIVED_SILENT_PUSH in state STATE_WAITING_FOR_TRIP_START

but 1154 transitions were still stuck on the phone, and none were on the server

2025-01-02T09:48:21.513590-08:00	"getRawEntries, args: [[""statemachine/transition""],1735275600,1735880399.999]; got 0 entries"
2025-01-02T09:48:21.524160-08:00	Found 1154 transitions. yay!

We investigated this in an internal issue; I'm copying over the relevant messages here

We recently encountered an issue in which `usercache/put` calls were failing
because of an error with the message format. It was not easy to determine
the object that was generating the error, since we only log successful
updates.

This change catches all mongo errors and logs the full backtrace, including the document that generated the error.
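The shape of that change can be sketched roughly as follows. This is an illustrative sketch, not the literal patch; the wrapper function name is mine, and the real change wraps the existing `update_one` call inside `sync_phone_to_server` in `emission/net/api/usercache.py`:

```python
import logging

def update_with_logging(collection, update_query, document):
    """Attempt the update; on any error, log the query and the document that
    triggered it (logging.exception includes the full backtrace), then
    re-raise so the caller still sees the failure."""
    try:
        return collection.update_one(update_query, document, upsert=True)
    except Exception:
        logging.exception(
            "In sync_phone_to_server, while executing update_query=%s on document=%s",
            update_query, document)
        raise
```

Re-raising keeps the existing error behavior intact; the only difference is that the offending document is now visible in the logs.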

Testing done:

- Added additional debugging statements to `sync_phone_to_server` in `emission/net/api/usercache.py`

```
        update_query = {'user_id': uuid,
                        'metadata.type': data["metadata"]["type"],
                        'metadata.write_ts': data["metadata"]["write_ts"],
                        'metadata.key': data["metadata"]["key"]}
+        time.sleep(2)
+        logging.debug("After sleep, continuing to processing")
```

- Started up a local server, and logged in from the emulator
- Started a trip, started location tracking and ended the trip
- While the `usercache/put` was processing the entries (slowed down because of the sleep), killed the local DB
- put failed with an error message highlighting the document that was being saved (although it did not matter for this error, since it was a connection error and not a document format error)

```
2025-01-03 12:11:54,717:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:11:54,720:DEBUG:12994228224:Updated result for user = 34da08c9-e7a7-4f91-bf65-bf6b6d970c32, key = stats/client_time, write_ts = 1735933996.229703 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('6778448afacd6df071652448'), 'ok': 1.0, 'updatedExisting': False}
2025-01-03 12:11:56,726:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:11:56,728:DEBUG:12994228224:Updated result for user = 34da08c9-e7a7-4f91-bf65-bf6b6d970c32, key = stats/client_time, write_ts = 1735933996.2422519 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('6778448cfacd6df07165244a'), 'ok': 1.0, 'updatedExisting': False}
2025-01-03 12:11:58,732:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:12:29,131:ERROR:12994228224:In sync_phone_to_server, while executing update_query={'user_id': UUID('34da08c9-e7a7-4f91-bf65-bf6b6d970c32'), 'metadata.type': 'message', 'metadata.write_ts': 1735933996.3793979, 'metadata.key': 'stats/client_time'} on document={'$set': {'data': {'ts': 1735933996.369, 'client_app_version': '1.9.6', 'name': 'onboarding_state', 'client_os_version': '18.1', 'reading': {'route': 1, 'opcode': 'nrelop_dev-emulator-study_default_testdbfail'}}, 'metadata': {'time_zone': 'America/Los_Angeles', 'plugin': 'none', 'write_ts': 1735933996.3793979, 'platform': 'ios', 'read_ts': 0, 'key': 'stats/client_time', 'type': 'message'}, 'user_id': UUID('34da08c9-e7a7-4f91-bf65-bf6b6d970c32')}}
2025-01-03 12:12:29,133:ERROR:12994228224:localhost:27017: [Errno 61] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 6778444d81d477e59b10bb3a, topology_type: Single, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 61] Connection refused')>]>
```
shankari changed the title from "Run user stats profile always" to "Log and resolve errors in uploading data due to Firebase API changes" on Jan 4, 2025
shankari (Contributor, Author) commented Jan 4, 2025

It's annoying that we got a bunch of calls at the same time: since we would not have deleted entries between calls, we will end up pushing the same data over and over. But we do handle duplicates on the server, so it is not the end of the world.

This time, the network was fine, and we were able to pull data successfully from the server. We are no longer creating any cached values, so we pull 0 documents, but there is no error

23576,1735753966.81084,2025-01-01T09:52:46.810840-08:00,received response for server_to_phone
23577,1735753966.81613,2025-01-01T09:52:46.816130-08:00,Got non NULL data while retrieving data
23578,1735753966.8171,2025-01-01T09:52:46.817100-08:00,Retrieved 0 documents

However, the push seems to have generated a 500 error on the server

24029,1735754047.0036,2025-01-01T09:54:07.003600-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"
24030,1735754047.17737,2025-01-01T09:54:07.177370-08:00,"After clear complete, cache has entries stats/client_nav_event, stats/client_time, config/app_ui_config, CONFIG_PHONE_UI, connection_settings, statemachine/transition, background/battery, background/bluetooth_ble, prompted-auth, config/consent, data_collection_consented_protocol, intro_done, background/location, background/filtered_location, stats/client_error, CURR_GEOFENCE_LOCATION"
24031,1735754047.18681,2025-01-01T09:54:07.186810-08:00,"In TripDiaryStateMachine, received transition T_DATA_PUSHED in state STATE_WAITING_FOR_TRIP_START"
24032,1735754047.2,2025-01-01T09:54:07.200000-08:00,Received notification T_DATA_PUSHED while processing silent push notification
24033,1735754047.20429,2025-01-01T09:54:07.204290-08:00,"Data pushed, fetch result = new data"

Did it generate the error for all the pushes?

$ grep --before-context=5 "2025-01-01.*T_DATA_PUSHED" ~/Downloads/loggerDB.txt.jan2-draft-buildup-withdate.csv
24029,1735754047.0036,2025-01-01T09:54:07.003600-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"
24055,1735754056.88317,2025-01-01T09:54:16.883170-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"
24073,1735754057.08409,2025-01-01T09:54:17.084090-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"

shankari (Contributor, Author) commented Jan 4, 2025

I tried converting the HTTP response from hex to UTF-8 characters using
https://www.rapidtables.com/convert/number/hex-to-ascii.html
The tool rendered 0x0a202020 203c2144 4f435459 50452048 as DYH, which is garbled; the bytes actually decode to `\n    <!DOCTYPE H`, the start of a generic HTML error page, which still doesn't help us very much.
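The conversion is easy to check with the Python standard library; `bytes.fromhex` ignores the embedded spaces, and the result is the start of an HTML document, consistent with the `text/html` content type in the error:

```python
# Hex prefix copied from the error payload above; fromhex skips the spaces.
raw = "0a202020 203c2144 4f435459 50452048"
decoded = bytes.fromhex(raw).decode("utf-8")
print(repr(decoded))  # '\n    <!DOCTYPE H'
```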

shankari (Contributor, Author) commented Jan 4, 2025

The 500 error was due to this backtrace

2025-01-03 03:00:13,342:DEBUG:140708218398272:Inserting entry Entry({'_id': ObjectId('677752bd0e474abce20eaa53'), 'user_id': UUID('d1e49700-633d-4c4a-9684-622adeed4e23'), 'metadata': Metadata({'key': 'stats/server_api_time', 'platform': 'server', 'write_ts': 1735873213.3418846, 'time_zone': 'America/Los_Angeles', 'write_local_dt': LocalDate({'year': 2025, 'month': 1, 'day': 2, 'hour': 19, 'minute': 0, 'second': 13, 'weekday': 3, 'timezone': 'America/Los_Angeles'}), 'write_fmt_time': '2025-01-02T19:00:13.341885-08:00'}), 'data': {'name': 'POST_/usercache/put_cputime', 'ts': 1735873213.3211474, 'reading': 45.678044127998874}}) into timeseries
Traceback (most recent call last):
File "/usr/src/app/emission/net/api/bottle.py", line 997, in _handle
out = route.call(**args)
File "/usr/src/app/emission/net/api/bottle.py", line 1998, in wrapper
rv = callback(*a, **ka)
File "/usr/src/app/emission/net/api/cfc_webapp.py", line 249, in putIntoCache
return usercache.sync_phone_to_server(user_uuid, from_phone)
File "/usr/src/app/emission/net/api/usercache.py", line 54, in sync_phone_to_server
result = usercache_db.update_one(update_query,
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 1041, in update_one
self._update_retryable(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 836, in _update_retryable
return self.__database.client._retryable_write(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1476, in _retryable_write
return self._retry_with_session(retryable, func, s, None)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1349, in _retry_with_session
return self._retry_internal(retryable, func, session, bulk)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
return func(self, *args, **kwargs)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1390, in _retry_internal
return func(session, sock_info, retryable)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 817, in _update
return self._update(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 782, in _update
_check_write_command_response(result)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
_raise_last_write_error(write_errors)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 190, in _raise_last_write_error
raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Name is not valid for storage, full error: {'index': 0, 'code': 163, 'errmsg': 'Name is not valid for storage'}

This would clearly result in a 500 error.

The only match that I can find for this error, which appears to be DocumentDB-specific, suggests that it may be generated when there are field names containing dots: https://stackoverflow.com/questions/66369545/documentdb-updateone-fails-with-163-name-is-not-valid-for-storage

shankari (Contributor, Author) commented Jan 4, 2025

The error is from

243 @post('/usercache/put')
244 def putIntoCache():
245   logging.debug("Called userCache.put")
246   user_uuid=getUUID(request)
247   logging.debug("user_uuid %s" % user_uuid)
248   from_phone = request.json['phone_to_server']
249   return usercache.sync_phone_to_server(user_uuid, from_phone)

The server API time is logged in

480   # Keep track of the time and duration for each call
481   stats.store_server_api_time(request.params.user_uuid, "%s_%s" % (request.method, request.path),
482         msTimeNow, duration)
483   stats.store_server_api_time(request.params.user_uuid, "%s_%s_cputime" % (request.method, request.path),
484         msTimeNow, new_duration)

So the error is not with storing the server API time. Instead, the error is with storing some other data; we then bail on the call and store the times as part of the post-hook.

So the error is specifically related to the data that is on this particular phone (or similar data on other phones) and is not reproducible via the simulator.

shankari (Contributor, Author) commented Jan 4, 2025

Exporting the userCacheDB from the phone, there are a ton of entries, which makes it hard to figure out which entries are incorrect. But fortunately, we process the entries in order of write_ts, and we can see the last successful entry

2025-01-03 03:00:13,320:DEBUG:140708218398272:Updated result for user = d1e49700-633d-4c4a-9684-622adeed4e23, key = background/battery, write_ts = 1734728132.8328 = {'n': 1, 'nModified': 0, 'ok': 1.0, 'operationTime': Timestamp(1735873213, 1), 'updatedExisting': True}
2025-01-03 03:00:13,321:DEBUG:140708218398272:END POST /usercache/put d1e49700-633d-4c4a-9684-622adeed4e23 45.69412040710449 

So looking for the entry after that in the DB, we see

sqlite> select * from userCache;
...
1734728132.8328||America/New_York|message|background/battery||{"battery_status":1,"battery_level_ratio":0.70999997854232788,"ts":1734728132.832566}
1734728363.79963||America/New_York|message|stats/client_time||{"ts":1734728363.789,"client_app_version":"1.9.6","name":"open_notification","client_os_version":"15.5","reading":{"additionalData":{"google.c.sender.id":"...","coldstart":false,"notId":"...","payload":...,"content-available":1,"foreground":false,"google.c.fid":"...","gcm.message_id":"..."}}}
...

Bingo! We have a stats/client_time with the name open_notification and additionalData that includes fields like google.c.sender.id and google.c.fid and gcm.message_id
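A quick way to confirm which entries would trip up MongoDB is to walk the parsed JSON and collect every key containing a dot. This is a minimal sketch with a helper name of my own invention, not something from the codebase:

```python
def find_dotted_keys(obj, path=""):
    """Recursively collect dict keys containing dots (illegal as MongoDB
    field names), along with the path at which each one appears."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if "." in k:
                found.append(f"{path}{k}")
            found += find_dotted_keys(v, f"{path}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            found += find_dotted_keys(v, f"{path}{i}.")
    return found

# Abbreviated version of the offending entry above
entry = {"name": "open_notification",
         "reading": {"additionalData": {"google.c.sender.id": "...",
                                        "gcm.message_id": "..."}}}
print(find_dotted_keys(entry))
# ['reading.additionalData.google.c.sender.id', 'reading.additionalData.gcm.message_id']
```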

shankari (Contributor, Author) commented Jan 4, 2025

This is not a one-off; there are 44 such entries

sqlite> select count(*) from userCache where data like '%"name":"open_notification"%';
44

and they range from 1734728363.79963 (2024-12-20T20:59:23.799630+00:00) to 1735934360.26561 (2025-01-03T19:59:20.265610+00:00)

1734728363.79963||America/New_York|message|stats/client_time||{"ts":1734728363.789,"client_app_version":"1.9.6","name":"open_notification","client_os_version":"15.5","reading":{"additionalData":{"google.c.sender.id":"","coldstart":false,"notId":"","payload":,"content-available":1,"foreground":false,"google.c.fid":"","gcm.message_id":""}}}
...
1735934360.26561||America/New_York|message|stats/client_time||{"ts":1735934360.256,"client_app_version":"1.9.6","name":"open_notification","client_os_version":"15.5","reading":{"additionalData":{"google.c.sender.id":"","coldstart":false,"notId":"","payload":,"content-available":1,"foreground":false,"google.c.fid":"","gcm.message_id":""}}}
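The epoch-to-ISO conversions above can be reproduced with the standard library (the cache stores `write_ts` as Unix seconds):

```python
from datetime import datetime, timezone

# write_ts values of the first and last malformed entries
for ts in (1734728363.79963, 1735934360.26561):
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```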

So my guess is that firebase changed the format of their additional data on the 20th (sounds familiar?)
And once this "bad data" is in there, no pushes will work, even those generated at the ends of trips on phones with a data connection.

It seems like we have three main tasks going forward:

  • we need better logging on the server
  • we need to change the open notification stat code to munge the dots
  • we need a backwards compatibility fix to find and munge any existing client stats, or if that is too difficult, just delete them since they are just stats and not crucial to the operation of OpenPATH

shankari (Contributor, Author) commented Jan 4, 2025

The stat is added here:
https://github.com/e-mission/e-mission-phone/blob/89127bca0ae3f4074ab1fc9d7be0581ad62ea24f/www/js/splash/remoteNotifyHandler.ts#L34

This code hasn't changed for 2 years
https://github.com/e-mission/e-mission-phone/blame/89127bca0ae3f4074ab1fc9d7be0581ad62ea24f/www/js/splash/remoteNotifyHandler.ts#L34

The message that we send is generated using this trace:

response = pnu.send_silent_notification_to_ios_with_interval(args.interval, dev=args.dev)
- return __get_default_interface__().send_silent_notification(token_map, {}, dev)
  - def send_silent_notification(self, token_map, json_data, dev=False):
    - ios_raw_data = copy.copy(json_data)
    - ios_raw_data.update({"notId": str(int(time.time() * 10**6))})
    - ios_raw_data.update({"payload": ios_raw_data["notId"]})
    - response["ios"] = self.notify_multiple_devices(push_service,
          fcm_token_map["ios"], data_payload=ios_raw_data)
      - def notify_multiple_devices(self, push_service, fcm_token_list,
            notification_title=None, notification_body=None, data_payload=None):
            result = push_service.notify(fcm_token=t,
                                         notification_title=notification_title,
                                         notification_body=notification_body,
                                         data_payload=data_payload)

So that is confirmation that the google and gcm fields are automatically added by firebase. We don't add them, so we can't strip them out on the server, and must munge them on the phone side.

shankari (Contributor, Author) commented Jan 4, 2025

From @JGreenlee

But for the existing entries that are already malformed, can't we just patch the server (in usercache.py) to skip bad entries (i.e. not store them)?
This way, users who are on 1.9.6 will be able to sync once the server is updated, without having to wait for their phones to update to 1.9.7.

Actually, once we have patched the server, we don't need to munge on the phone at all. We can just handle all such entries, which also makes this more robust to additional breaking integration changes in the future.

On around Dec 21st 2024, it looks like firebase changed the format of their push
notifications to add in some metadata into the `additionalData` field. This
metadata has keys with dots.

Since we also use the `additionalData` field to pass in the survey or popup
message for custom push notifications, we store the entire `additionalData` into the notification stat.

When this notification is pushed up to the server, it cannot be stored in
the database, since MongoDB/DocumentDB do not support keys with dots.
https://stackoverflow.com/questions/66369545/documentdb-updateone-fails-with-163-name-is-not-valid-for-storage

While trying to save the entry, we get the error

```
Traceback (most recent call last):
File "/usr/src/app/emission/net/api/bottle.py", line 997, in _handle
out = route.call(**args)
File "/usr/src/app/emission/net/api/bottle.py", line 1998, in wrapper
rv = callback(*a, **ka)
File "/usr/src/app/emission/net/api/cfc_webapp.py", line 249, in putIntoCache
return usercache.sync_phone_to_server(user_uuid, from_phone)
File "/usr/src/app/emission/net/api/usercache.py", line 54, in sync_phone_to_server
result = usercache_db.update_one(update_query,
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 1041, in update_one
self._update_retryable(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 836, in _update_retryable
return self.__database.client._retryable_write(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1476, in _retryable_write
return self._retry_with_session(retryable, func, s, None)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1349, in _retry_with_session
return self._retry_internal(retryable, func, session, bulk)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
return func(self, *args, **kwargs)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1390, in _retry_internal
return func(session, sock_info, retryable)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 817, in _update
return self._update(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 782, in _update
_check_write_command_response(result)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
_raise_last_write_error(write_errors)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 190, in _raise_last_write_error
raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Name is not valid for storage, full error: {'index': 0, 'code': 163, 'errmsg': 'Name is not valid for storage'}
```

This is bad because this error interrupts the processing of the incoming
data, and causes the `/usercache/put` call to fail. The phone keeps trying
to upload this data over and over, and failing over and over, so the
pipeline never makes progress, and deployers are not able to see newly
processed data in their admin dashboards.

To fix this, and to make the ingestion code more robust in general, we check
the incoming data for keys with dots and munge them. This fixes the immediate
problem, and will also ensure that we don't break again if integrations make
similar format changes in the future.
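A minimal sketch of that munging step, assuming dots are simply replaced with underscores (the function name is hypothetical; the real fix lives in the server's usercache handling):

```python
def munge_dotted_keys(obj, replacement="_"):
    """Recursively replace dots in dict keys so the document becomes
    storable in MongoDB/DocumentDB, which reject dotted field names."""
    if isinstance(obj, dict):
        return {k.replace(".", replacement): munge_dotted_keys(v, replacement)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [munge_dotted_keys(v, replacement) for v in obj]
    return obj

bad = {"additionalData": {"google.c.sender.id": "123", "gcm.message_id": "abc"}}
print(munge_dotted_keys(bad))
# {'additionalData': {'google_c_sender_id': '123', 'gcm_message_id': 'abc'}}
```

Munging (rather than dropping) keeps the stat contents available for later analysis, at the cost of slightly mangled key names.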

Testing done:
- Added a new unit test that invokes the function directly
- Added a new integration test that creates entries and calls
  `sync_phone_to_server` on them

Both tests pass
shankari force-pushed the run_user_stats_profile_always branch from 7ae93dd to 1ec706e on January 4, 2025 06:29
@shankari shankari merged commit 6ab5a5c into e-mission:master Jan 4, 2025
5 checks passed