
Log and resolve errors in uploading data due to Firebase API changes #1008

Merged (2 commits) on Jan 4, 2025

Conversation

shankari (Contributor) commented Jan 4, 2025

While investigating user-reported issues about trips not being visible in the admin dashboard, we found the following.

We got several T_RECEIVED_SILENT_PUSH on Jan 1.

2025-01-01T09:52:46.052520-08:00	In TripDiaryStateMachine, received transition T_RECEIVED_SILENT_PUSH in state STATE_WAITING_FOR_TRIP_START

but 1154 transitions were still stuck on the phone, and none were on the server

2025-01-02T09:48:21.513590-08:00	"getRawEntries, args: [[""statemachine/transition""],1735275600,1735880399.999]; got 0 entries"
2025-01-02T09:48:21.524160-08:00	Found 1154 transitions. yay!

We investigated this in an internal issue; I'm copying over the relevant messages here

We recently encountered an issue in which `usercache/put` calls were failing
because of an error with the message format. It was not easy to determine
the object that was generating the error, since we only log successful
updates.

This change catches all mongo errors and logs the full backtrace, including the document that generated the error.
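The shape of that change can be sketched roughly as follows. This is an illustrative sketch, not the literal patch; the wrapper function name is mine, and the real change wraps the existing `update_one` call inside `sync_phone_to_server` in `emission/net/api/usercache.py`:

```python
import logging

def update_with_logging(collection, update_query, document):
    """Attempt the update; on any error, log the query and the document that
    triggered it (logging.exception includes the full backtrace), then
    re-raise so the caller still sees the failure."""
    try:
        return collection.update_one(update_query, document, upsert=True)
    except Exception:
        logging.exception(
            "In sync_phone_to_server, while executing update_query=%s on document=%s",
            update_query, document)
        raise
```

Re-raising keeps the existing error behavior intact; the only difference is that the offending document is now visible in the logs.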

Testing done:

- Added additional debugging statements to `sync_phone_to_server` in `emission/net/api/usercache.py`

```
        update_query = {'user_id': uuid,
                        'metadata.type': data["metadata"]["type"],
                        'metadata.write_ts': data["metadata"]["write_ts"],
                        'metadata.key': data["metadata"]["key"]}
+        time.sleep(2)
+        logging.debug("After sleep, continuing to processing")
```

- Started up a local server, and logged in from the emulator
- Started a trip, started location tracking and ended the trip
- While the `usercache/put` was processing the entries (slowed down because of the sleep), killed the local DB
- put failed with an error message highlighting the document that was being saved (although it did not matter for this error, since it was a connection error and not a document format error)

```
2025-01-03 12:11:54,717:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:11:54,720:DEBUG:12994228224:Updated result for user = 34da08c9-e7a7-4f91-bf65-bf6b6d970c32, key = stats/client_time, write_ts = 1735933996.229703 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('6778448afacd6df071652448'), 'ok': 1.0, 'updatedExisting': False}
2025-01-03 12:11:56,726:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:11:56,728:DEBUG:12994228224:Updated result for user = 34da08c9-e7a7-4f91-bf65-bf6b6d970c32, key = stats/client_time, write_ts = 1735933996.2422519 = {'n': 1, 'nModified': 0, 'upserted': ObjectId('6778448cfacd6df07165244a'), 'ok': 1.0, 'updatedExisting': False}
2025-01-03 12:11:58,732:DEBUG:12994228224:After sleep, continuing to processing
2025-01-03 12:12:29,131:ERROR:12994228224:In sync_phone_to_server, while executing update_query={'user_id': UUID('34da08c9-e7a7-4f91-bf65-bf6b6d970c32'), 'metadata.type': 'message', 'metadata.write_ts': 1735933996.3793979, 'metadata.key': 'stats/client_time'} on document={'$set': {'data': {'ts': 1735933996.369, 'client_app_version': '1.9.6', 'name': 'onboarding_state', 'client_os_version': '18.1', 'reading': {'route': 1, 'opcode': 'nrelop_dev-emulator-study_default_testdbfail'}}, 'metadata': {'time_zone': 'America/Los_Angeles', 'plugin': 'none', 'write_ts': 1735933996.3793979, 'platform': 'ios', 'read_ts': 0, 'key': 'stats/client_time', 'type': 'message'}, 'user_id': UUID('34da08c9-e7a7-4f91-bf65-bf6b6d970c32')}}
2025-01-03 12:12:29,133:ERROR:12994228224:localhost:27017: [Errno 61] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 6778444d81d477e59b10bb3a, topology_type: Single, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 61] Connection refused')>]>
```
shankari changed the title from "Run user stats profile always" to "Log and resolve errors in uploading data due to Firebase API changes" on Jan 4, 2025
shankari (Contributor, Author) commented Jan 4, 2025

It's annoying that we got a bunch of calls at the same time: since we would not have deleted entries between calls, we will end up pushing the same data over and over. But we do handle duplicates on the server, so it is not the end of the world.

This time, the network was fine, and we were able to pull data successfully from the server. We are no longer creating any cached values, so we pull 0 documents, but there is no error

23576,1735753966.81084,2025-01-01T09:52:46.810840-08:00,received response for server_to_phone
23577,1735753966.81613,2025-01-01T09:52:46.816130-08:00,Got non NULL data while retrieving data
23578,1735753966.8171,2025-01-01T09:52:46.817100-08:00,Retrieved 0 documents

However, the push seems to have generated a 500 error on the server

24029,1735754047.0036,2025-01-01T09:54:07.003600-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"
24030,1735754047.17737,2025-01-01T09:54:07.177370-08:00,"After clear complete, cache has entries stats/client_nav_event, stats/client_time, config/app_ui_config, CONFIG_PHONE_UI, connection_settings, statemachine/transition, background/battery, background/bluetooth_ble, prompted-auth, config/consent, data_collection_consented_protocol, intro_done, background/location, background/filtered_location, stats/client_error, CURR_GEOFENCE_LOCATION"
24031,1735754047.18681,2025-01-01T09:54:07.186810-08:00,"In TripDiaryStateMachine, received transition T_DATA_PUSHED in state STATE_WAITING_FOR_TRIP_START"
24032,1735754047.2,2025-01-01T09:54:07.200000-08:00,Received notification T_DATA_PUSHED while processing silent push notification
24033,1735754047.20429,2025-01-01T09:54:07.204290-08:00,"Data pushed, fetch result = new data"

Did it generate the error for all the pushes?

$ grep --before-context=5 "2025-01-01.*T_DATA_PUSHED" ~/Downloads/loggerDB.txt.jan2-draft-buildup-withdate.csv
24029,1735754047.0036,2025-01-01T09:54:07.003600-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"
24055,1735754056.88317,2025-01-01T09:54:16.883170-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"
24073,1735754057.08409,2025-01-01T09:54:17.084090-08:00,"Got error Error Domain=com.google.HTTPStatus Code=500 ""(null)"" UserInfo={data={length = 3654, bytes = 0x0a202020 203c2144 4f435459 50452048 ... 3c2f6874 6d6c3e0a }, data_content_type=text/html; charset=UTF-8} while pushing changes to server, retaining data"

shankari (Contributor, Author) commented Jan 4, 2025

I tried converting the HTTP response from hex to UTF-8 characters using
https://www.rapidtables.com/convert/number/hex-to-ascii.html
The tool rendered 0x0a202020 203c2144 4f435459 50452048 as DYH, which is garbled; the bytes actually decode to `\n    <!DOCTYPE H`, the start of a generic HTML error page, which still doesn't help us very much.
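The conversion is easy to check with the Python standard library; `bytes.fromhex` ignores the embedded spaces, and the result is the start of an HTML document, consistent with the `text/html` content type in the error:

```python
# Hex prefix copied from the error payload above; fromhex skips the spaces.
raw = "0a202020 203c2144 4f435459 50452048"
decoded = bytes.fromhex(raw).decode("utf-8")
print(repr(decoded))  # '\n    <!DOCTYPE H'
```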

shankari (Contributor, Author) commented Jan 4, 2025

The 500 error was due to this backtrace

2025-01-03 03:00:13,342:DEBUG:140708218398272:Inserting entry Entry({'_id': ObjectId('677752bd0e474abce20eaa53'), 'user_id': UUID('d1e49700-633d-4c4a-9684-622adeed4e23'), 'metadata': Metadata({'key': 'stats/server_api_time', 'platform': 'server', 'write_ts': 1735873213.3418846, 'time_zone': 'America/Los_Angeles', 'write_local_dt': LocalDate({'year': 2025, 'month': 1, 'day': 2, 'hour': 19, 'minute': 0, 'second': 13, 'weekday': 3, 'timezone': 'America/Los_Angeles'}), 'write_fmt_time': '2025-01-02T19:00:13.341885-08:00'}), 'data': {'name': 'POST_/usercache/put_cputime', 'ts': 1735873213.3211474, 'reading': 45.678044127998874}}) into timeseries
Traceback (most recent call last):
File "/usr/src/app/emission/net/api/bottle.py", line 997, in _handle
out = route.call(**args)
File "/usr/src/app/emission/net/api/bottle.py", line 1998, in wrapper
rv = callback(*a, **ka)
File "/usr/src/app/emission/net/api/cfc_webapp.py", line 249, in putIntoCache
return usercache.sync_phone_to_server(user_uuid, from_phone)
File "/usr/src/app/emission/net/api/usercache.py", line 54, in sync_phone_to_server
result = usercache_db.update_one(update_query,
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 1041, in update_one
self._update_retryable(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 836, in _update_retryable
return self.__database.client._retryable_write(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1476, in _retryable_write
return self._retry_with_session(retryable, func, s, None)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1349, in _retry_with_session
return self._retry_internal(retryable, func, session, bulk)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
return func(self, *args, **kwargs)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1390, in _retry_internal
return func(session, sock_info, retryable)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 817, in _update
return self._update(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 782, in _update
_check_write_command_response(result)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
_raise_last_write_error(write_errors)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 190, in _raise_last_write_error
raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Name is not valid for storage, full error: {'index': 0, 'code': 163, 'errmsg': 'Name is not valid for storage'}

This would clearly result in a 500 error.

The only match that I can find for this error, which appears to be DocumentDB-specific, suggests that it may be generated when there are field names containing dots: https://stackoverflow.com/questions/66369545/documentdb-updateone-fails-with-163-name-is-not-valid-for-storage

shankari (Contributor, Author) commented Jan 4, 2025

The error is from

243 @post('/usercache/put')
244 def putIntoCache():
245   logging.debug("Called userCache.put")
246   user_uuid=getUUID(request)
247   logging.debug("user_uuid %s" % user_uuid)
248   from_phone = request.json['phone_to_server']
249   return usercache.sync_phone_to_server(user_uuid, from_phone)

The server API time is logged in

480   # Keep track of the time and duration for each call
481   stats.store_server_api_time(request.params.user_uuid, "%s_%s" % (request.method, request.path),
482         msTimeNow, duration)
483   stats.store_server_api_time(request.params.user_uuid, "%s_%s_cputime" % (request.method, request.path),
484         msTimeNow, new_duration)

So the error is not with storing the server API time. Instead, the error is with storing some other data; we then bail on the call and store the times as part of the post-hook.

So the error is specifically related to the data that is on this particular phone (or similar data on other phones) and is not reproducible via the simulator.

shankari (Contributor, Author) commented Jan 4, 2025

Exporting the userCacheDB from the phone, there are a ton of entries, which makes it hard to figure out which entries are incorrect. But fortunately, we process the entries in order of write_ts, and we can see the last successful entry

2025-01-03 03:00:13,320:DEBUG:140708218398272:Updated result for user = d1e49700-633d-4c4a-9684-622adeed4e23, key = background/battery, write_ts = 1734728132.8328 = {'n': 1, 'nModified': 0, 'ok': 1.0, 'operationTime': Timestamp(1735873213, 1), 'updatedExisting': True}
2025-01-03 03:00:13,321:DEBUG:140708218398272:END POST /usercache/put d1e49700-633d-4c4a-9684-622adeed4e23 45.69412040710449 

So looking for the entry after that in the DB, we see

sqlite> select * from userCache;
...
1734728132.8328||America/New_York|message|background/battery||{"battery_status":1,"battery_level_ratio":0.70999997854232788,"ts":1734728132.832566}
1734728363.79963||America/New_York|message|stats/client_time||{"ts":1734728363.789,"client_app_version":"1.9.6","name":"open_notification","client_os_version":"15.5","reading":{"additionalData":{"google.c.sender.id":"...","coldstart":false,"notId":"...","payload":...,"content-available":1,"foreground":false,"google.c.fid":"...","gcm.message_id":"..."}}}
...

Bingo! We have a stats/client_time with the name open_notification and additionalData that includes fields like google.c.sender.id and google.c.fid and gcm.message_id
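A quick way to confirm which entries would trip up MongoDB is to walk the parsed JSON and collect every key containing a dot. This is a minimal sketch with a helper name of my own invention, not something from the codebase:

```python
def find_dotted_keys(obj, path=""):
    """Recursively collect dict keys containing dots (illegal as MongoDB
    field names), along with the path at which each one appears."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if "." in k:
                found.append(f"{path}{k}")
            found += find_dotted_keys(v, f"{path}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            found += find_dotted_keys(v, f"{path}{i}.")
    return found

# Abbreviated version of the offending entry above
entry = {"name": "open_notification",
         "reading": {"additionalData": {"google.c.sender.id": "...",
                                        "gcm.message_id": "..."}}}
print(find_dotted_keys(entry))
# ['reading.additionalData.google.c.sender.id', 'reading.additionalData.gcm.message_id']
```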

shankari (Contributor, Author) commented Jan 4, 2025

This is not a one-off; there are 44 such entries

sqlite> select count(*) from userCache where data like '%"name":"open_notification"%';
44

and they range from 1734728363.79963 (2024-12-20T20:59:23.799630+00:00) to 1735934360.26561 (2025-01-03T19:59:20.265610+00:00)

1734728363.79963||America/New_York|message|stats/client_time||{"ts":1734728363.789,"client_app_version":"1.9.6","name":"open_notification","client_os_version":"15.5","reading":{"additionalData":{"google.c.sender.id":"","coldstart":false,"notId":"","payload":,"content-available":1,"foreground":false,"google.c.fid":"","gcm.message_id":""}}}
...
1735934360.26561||America/New_York|message|stats/client_time||{"ts":1735934360.256,"client_app_version":"1.9.6","name":"open_notification","client_os_version":"15.5","reading":{"additionalData":{"google.c.sender.id":"","coldstart":false,"notId":"","payload":,"content-available":1,"foreground":false,"google.c.fid":"","gcm.message_id":""}}}
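The epoch-to-ISO conversions above can be reproduced with the standard library (the cache stores `write_ts` as Unix seconds):

```python
from datetime import datetime, timezone

# write_ts values of the first and last malformed entries
for ts in (1734728363.79963, 1735934360.26561):
    print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```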

So my guess is that firebase changed the format of their additional data on the 20th (sounds familiar?)
And once this "bad data" is in there, no pushes will work, even those generated at the ends of trips on phones with a data connection.

It seems like we have three main tasks going forward:

  • we need better logging on the server
  • we need to change the open notification stat code to munge the dots
  • we need a backwards compatibility fix to find and munge any existing client stats, or if that is too difficult, just delete them since they are just stats and not crucial to the operation of OpenPATH

shankari (Contributor, Author) commented Jan 4, 2025

The stat is added here:
https://github.com/e-mission/e-mission-phone/blob/89127bca0ae3f4074ab1fc9d7be0581ad62ea24f/www/js/splash/remoteNotifyHandler.ts#L34

This code hasn't changed for 2 years
https://github.com/e-mission/e-mission-phone/blame/89127bca0ae3f4074ab1fc9d7be0581ad62ea24f/www/js/splash/remoteNotifyHandler.ts#L34

The message that we send is generated using this trace:

response = pnu.send_silent_notification_to_ios_with_interval(args.interval, dev=args.dev)
- return __get_default_interface__().send_silent_notification(token_map, {}, dev)
  - def send_silent_notification(self, token_map, json_data, dev=False):
    - ios_raw_data = copy.copy(json_data)
    - ios_raw_data.update({"notId": str(int(time.time() * 10**6))})
    - ios_raw_data.update({"payload": ios_raw_data["notId"]})
    - response["ios"] = self.notify_multiple_devices(push_service,
          fcm_token_map["ios"], data_payload=ios_raw_data)
      - def notify_multiple_devices(self, push_service, fcm_token_list,
            notification_title=None, notification_body=None, data_payload=None):
            result = push_service.notify(fcm_token=t,
                                         notification_title=notification_title,
                                         notification_body=notification_body,
                                         data_payload=data_payload)

So that is confirmation that the google and gcm fields are automatically added by firebase. We don't add them, so we can't strip them out on the server, and must munge them on the phone side.

shankari (Contributor, Author) commented Jan 4, 2025

From @JGreenlee

But for the existing entries that are already malformed, can't we just patch the server (in usercache.py) to skip bad entries (i.e. not store them)?
This way, users who are on 1.9.6 will be able to sync once the server is updated, without having to wait for their phones to update to 1.9.7.

Actually, once we have patched the server, we don't need to munge on the phone at all. We can just handle all such entries, which also makes this more robust to additional breaking integration changes in the future.

On around Dec 21st 2024, it looks like firebase changed the format of their push
notifications to add in some metadata into the `additionalData` field. This
metadata has keys with dots.

Since we also use the `additionalData` field to pass in the survey or popup
message for custom push notifications, we store the entire `additionalData` into the notification stat.

When this notification is pushed up to the server, it cannot be stored in
the database, since MongoDB/DocumentDB do not support keys with dots.
https://stackoverflow.com/questions/66369545/documentdb-updateone-fails-with-163-name-is-not-valid-for-storage

While trying to save the entry, we get the error

```
Traceback (most recent call last):
File "/usr/src/app/emission/net/api/bottle.py", line 997, in _handle
out = route.call(**args)
File "/usr/src/app/emission/net/api/bottle.py", line 1998, in wrapper
rv = callback(*a, **ka)
File "/usr/src/app/emission/net/api/cfc_webapp.py", line 249, in putIntoCache
return usercache.sync_phone_to_server(user_uuid, from_phone)
File "/usr/src/app/emission/net/api/usercache.py", line 54, in sync_phone_to_server
result = usercache_db.update_one(update_query,
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 1041, in update_one
self._update_retryable(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 836, in _update_retryable
return self.__database.client._retryable_write(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1476, in _retryable_write
return self._retry_with_session(retryable, func, s, None)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1349, in _retry_with_session
return self._retry_internal(retryable, func, session, bulk)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
return func(self, *args, **kwargs)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1390, in _retry_internal
return func(session, sock_info, retryable)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 817, in _update
return self._update(
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/collection.py", line 782, in _update
_check_write_command_response(result)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 217, in _check_write_command_response
_raise_last_write_error(write_errors)
File "/root/miniconda-23.5.2/envs/emission/lib/python3.9/site-packages/pymongo/helpers.py", line 190, in _raise_last_write_error
raise WriteError(error.get("errmsg"), error.get("code"), error)
pymongo.errors.WriteError: Name is not valid for storage, full error: {'index': 0, 'code': 163, 'errmsg': 'Name is not valid for storage'}
```

This is bad because this error interrupts the processing of the incoming
data, and causes the `/usercache/put` call to fail. The phone keeps trying
to upload this data over and over, and failing over and over, so the
pipeline never makes progress, and deployers are not able to see newly
processed data in their admin dashboards.

To fix this, and to make the ingestion code more robust in general, we check
the incoming data for keys with dots and munge them. This fixes the immediate
problem, and will also ensure that we don't break again if integrations make
similar format changes in the future.
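A minimal sketch of that munging step, assuming dots are simply replaced with underscores (the function name is hypothetical; the real fix lives in the server's usercache handling):

```python
def munge_dotted_keys(obj, replacement="_"):
    """Recursively replace dots in dict keys so the document becomes
    storable in MongoDB/DocumentDB, which reject dotted field names."""
    if isinstance(obj, dict):
        return {k.replace(".", replacement): munge_dotted_keys(v, replacement)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [munge_dotted_keys(v, replacement) for v in obj]
    return obj

bad = {"additionalData": {"google.c.sender.id": "123", "gcm.message_id": "abc"}}
print(munge_dotted_keys(bad))
# {'additionalData': {'google_c_sender_id': '123', 'gcm_message_id': 'abc'}}
```

Munging (rather than dropping) keeps the stat contents available for later analysis, at the cost of slightly mangled key names.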

Testing done:
- Added a new unit test that invokes the function directly
- Added a new integration test that creates entries and calls
  `sync_phone_to_server` on them

Both tests pass
shankari force-pushed the run_user_stats_profile_always branch from 7ae93dd to 1ec706e on January 4, 2025 06:29
@shankari shankari merged commit 6ab5a5c into e-mission:master Jan 4, 2025
5 checks passed