Fix frontend alarms #1276

codemonkey800 · 2023-09-29T12:18:40Z

Caught Error Alarms

These are errors that are caught and logged in the frontend. The logs for these errors can be surfaced from CloudWatch Logs.

Error fetching spdx license data

This log can be found in CloudWatch when filtering using the following query:

This is related to the error log that is written when fetching the SPDX license data in the browser throws an error:

napari-hub/frontend/src/components/MetadataList/MetadataListMetadataItem.tsx

Lines 126 to 132 in 41ae700

    onError(err) {
      logger.error({
        message:
          'Error fetching spdx license data for MetadataListMetadataItem',
        error: getErrorMessage(err),
      });
    },

Fetching this data in the browser is probably inefficient and more error-prone because it depends on the user’s environment. We can move this fetch to the server side to improve the reliability of this API call. If this doesn’t reduce the number of errors, we can look into reducing the log level of this message.
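As a rough illustration of the server-side idea, here is a minimal sketch of a loader that could be called from server-side code (for example a Next.js data-fetching function) and that falls back to an empty list instead of throwing. Everything named here is an assumption for illustration: `loadSpdxLicenses`, the `JsonFetcher` and `Logger` interfaces, and the `SpdxLicense` fields are hypothetical, and the SPDX URL is left to the caller rather than guessed.

```typescript
// Hypothetical shape of one SPDX license entry; the real response may carry more fields.
interface SpdxLicense {
  licenseId: string;
  name: string;
  isOsiApproved: boolean;
}

// Hypothetical abstractions so the sketch does not depend on a specific HTTP client or logger.
type JsonFetcher = (url: string) => Promise<unknown>;

interface Logger {
  error(payload: Record<string, unknown>): void;
}

// Fetches the license list on the server and returns [] on any failure,
// so the page can still render without license details.
async function loadSpdxLicenses(
  url: string,
  fetchJson: JsonFetcher,
  logger: Logger,
): Promise<SpdxLicense[]> {
  try {
    const data = (await fetchJson(url)) as { licenses?: SpdxLicense[] } | null;
    const licenses = data?.licenses;
    return Array.isArray(licenses) ? licenses : [];
  } catch (err) {
    logger.error({
      message: 'Error fetching spdx license data',
      error: err instanceof Error ? err.message : String(err),
    });
    return [];
  }
}
```

The result would then be passed to the page as props, so the component no longer needs its own `onError` handler for this request.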

Error loading route

This log can be found in CloudWatch when filtering using the following query:

This is related to some code for logging when an error occurs while a page is transitioning:

napari-hub/frontend/src/hooks/usePageTransitions.ts

Lines 66 to 69 in 41ae700

    logger.error({
      message: 'Error loading route',
      error: getErrorMessage(error),
    });

According to the docs, this event fires if the route transition is cancelled or if an error is thrown, but the code above doesn’t distinguish between the two when logging the error message. We can refactor the code to use a different log level depending on whether the transition was cancelled:

const level = error.cancelled ? 'info' : 'error'

logger[level]({
  message: 'Error loading route',
  error: getErrorMessage(error),
  cancelled: error.cancelled,
})

Ideally this should reduce the number of actual errors we record, but if not, we can look into excluding this error in the logs metric filter if it’s something we can’t easily fix.

Uncaught Error Alarms

These are errors that are not handled within a try / catch block. Currently RUM has reported the following errors:

CWR: Failed to retrieve credentials from STS: TypeError: Failed to fetch

This error is thrown when a network failure interrupts fetching credentials from AWS STS. The stack trace for this message looks like:

Error: CWR: Failed to retrieve credentials from STS: TypeError: Failed to fetch
    at nS.<anonymous> (www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:375226)
    at www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:373992
    at Object.throw (www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:374097)
    at s (www.napari-hub.org/_next/static/chunks/pages/_app-ab7d999ffe90ca01.js:165:375420)

Unfortunately we can't really fix this error since we can't control user network conditions. Instead, we can try filtering this event from being tracked by the alarm.

To do this, we will need to refactor the alarm infrastructure to:

Export RUM events to a log stream
Create a logs metric filter that filters out STS fetch errors
Update the frontend alarm to use data from the logs metric filter

Error details: CWR: Failed to retrieve Cognito OpenId token: TypeError: Failed to fetch

Similar to the above error, this is out of our control due to user network conditions. We can remove this from the frontend alarm by ignoring this specific error message.

The provided `href` (/plugins/[name]) value is missing query values

According to the docs, this error occurs when the UI tries to open a URL that does not have the provided variable in the pathname.

This error is a bit complex to debug because it happens intermittently and is not easy to reproduce. The frequency appears to be 1-2 instances per week:

The plugin page also does not link to itself or to other plugin pages, so it seems technically impossible for this error to occur.

One thing we can try is updating all references to /plugins/[name] to check that name is defined before creating a link or navigating to a route.
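A minimal sketch of that guard, assuming a small helper that call sites use before rendering a link or navigating. The name `getPluginHref` is hypothetical and not part of the codebase; it returns `null` when there is no usable name, so callers can skip the link instead of producing an incomplete href.

```typescript
// Hypothetical helper: builds the plugin page path only when a usable name exists.
function getPluginHref(name: string | null | undefined): string | null {
  const trimmed = name?.trim();

  // Reject missing or empty names, and the unresolved route placeholder itself.
  if (!trimmed || trimmed === '[name]') {
    return null;
  }

  return `/plugins/${encodeURIComponent(trimmed)}`;
}
```

A caller would render the link only when the result is not `null`, which also makes the intermittent loading state visible as an explicit branch instead of a runtime error.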

If this does not reduce the errors, we could reduce the log level since this type of error doesn't have a huge impact on the functionality of the page. It's possible this error could be a result of an intermittent loading state since some of the errors happen in the loading state for the plugin page.

Script error

These are unknown errors that happen during JavaScript execution that seemingly only happen on Desktop Safari browsers:

This error may occur when the frontend tries to load JavaScript from another domain. Based on this article, we can possibly fix this by updating references to external JavaScript to include the crossorigin attribute in the <script> tag.

The only reference to this is the script we use for HubSpot:

napari-hub/frontend/src/pages/_app.tsx

Lines 88 to 93 in 41ae700

    <Script
      onLoad={() => {
        hubspotStore.ready = true;
      }}
      src="//js.hsforms.net/forms/v2.js?pre=1"
    />

If this does not reduce the errors, we can look into filtering out this message for this specific error.

Request aborted

This error occurs when a request is cancelled, which may happen if the user navigates away from a page with an in-progress request, so it should be safe to filter out.

ResizeObserver loop completed with undelivered notifications.

This error occurs when ResizeObserver is trying to notify subscribers of a recent resize. It may occur if the user's page resizes during a notification. Unfortunately we can't control this because of differences in the user's environment, such as viewport and browser, so this is something we can look into filtering out.
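To make the proposed filtering concrete, here is a sketch of the matching logic in TypeScript. This only illustrates which messages would be excluded; the actual filtering is expected to live in a CloudWatch logs metric filter, whose pattern syntax is different. The names `IGNORED_ERROR_PATTERNS` and `shouldCountTowardAlarm` are hypothetical.

```typescript
// Messages this issue proposes excluding because they stem from user
// network conditions or browser behavior that we cannot control.
const IGNORED_ERROR_PATTERNS: readonly RegExp[] = [
  /CWR: Failed to retrieve credentials from STS: TypeError: Failed to fetch/,
  /CWR: Failed to retrieve Cognito OpenId token: TypeError: Failed to fetch/,
  /Request aborted/,
  /ResizeObserver loop completed with undelivered notifications/,
];

// Returns true when an error message should still contribute to the alarm metric.
function shouldCountTowardAlarm(message: string): boolean {
  return !IGNORED_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}
```

Keeping the list explicit means anything not named here, including "Script error" and the `href` error, still counts toward the alarm until it is deliberately excluded.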

Action Items


codemonkey800 · 2023-10-06T01:49:05Z

recently got some 400 errors today related to a user somehow accessing the plugins page using the template variable [name]:

this would mean they accessed /plugins/[name] somehow. overall this isn't necessarily an error we have to worry about since it's a client error, so we can filter these out by reducing the log level to warning. I've added the task Assign lower log level to 4xx errors to capture this 🫡
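a rough sketch of what picking the level could look like, assuming the logger exposes a `warn` method alongside `error` (that, and the helper name `getLogLevelForStatus`, are assumptions and not taken from the codebase):

```typescript
type LogLevel = 'warn' | 'error';

// Client errors (4xx) are logged as warnings; everything else,
// including errors without a status code, stays at the error level.
function getLogLevelForStatus(status: number | undefined): LogLevel {
  const isClientError = status !== undefined && status >= 400 && status < 500;
  return isClientError ? 'warn' : 'error';
}
```

the call site would then use `logger[getLogLevelForStatus(status)](...)`, similar to the route-transition change above.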

codemonkey800 added the frontend label Sep 29, 2023

codemonkey800 self-assigned this Sep 29, 2023

codemonkey800 added this to napari hub backlog Sep 29, 2023

codemonkey800 moved this to Backlog in napari hub backlog Sep 29, 2023

codemonkey800 mentioned this issue Sep 29, 2023

fix frontend alarm bugs #1277

Merged

codemonkey800 moved this from Backlog to In Progress in napari hub backlog Sep 29, 2023

codemonkey800 mentioned this issue Sep 29, 2023

Frontend monitoring + alerts in AWS #1064

Closed

9 tasks
