Merge pull request #8 from tme-osx/main
Merge Back
Showing 27 changed files with 101,882 additions and 50 deletions.
@@ -0,0 +1,50 @@
OpenAI Analysis:

Root Cause Analysis:
Based on the provided metrics and system logs, the anomalies appear to be related to a significant drop in call success rate starting at 2024-09-04 11:23:00. The call success rate dropped from 98.24% at 07:30:00 to 23.491% at 11:23:00, and then gradually recovered to 61.491% at 11:26:00.
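As an illustration (not part of the original analysis), here is a minimal Python sketch of the kind of threshold check that would surface this drop; the sample values echo the figures quoted above, while the baseline and threshold are assumptions:

```python
# Minimal sketch: flag samples whose call_success_rate falls far below a healthy baseline.
# Sample values echo the figures quoted above; baseline/threshold are assumed.
samples = [
    ("2024-09-04 07:30:00", 98.240),
    ("2024-09-04 11:23:00", 23.491),
    ("2024-09-04 11:26:00", 61.491),
]

BASELINE = 98.24    # assumed healthy success rate from earlier samples
THRESHOLD = 10.0    # flag drops of more than 10 percentage points

for ts, rate in samples:
    drop = BASELINE - rate
    if drop > THRESHOLD:
        print(f"{ts}: anomalous call_success_rate {rate:.3f}% ({drop:.1f} points below baseline)")
```

In practice the baseline would come from a longer window of healthy samples rather than a single hard-coded value.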
The system logs indicate that several activities were happening around the time the anomalies started: the OpenStack Compute Service, Identity Service, Image Service, Block Storage Service, Networking Service, Database Service, Bare Metal Service, and Data Processing Service were all started successfully.

However, at 11:22:13 there was an error in the Open vSwitch daemon (openvswitch-vswitchd), which resulted in the service being killed and restarted. The error message indicates an assertion failure in dp_packet_set_l2_pad_size(). Open vSwitch is a key component for managing network traffic in a virtualized environment, and any disruption to it could impact network-dependent services, such as call attempts and successes.
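For context, a hedged Python sketch of scanning a syslog-style file for the openvswitch-vswitchd assertion failure and restart messages described above; the log path and the exact message patterns are assumptions:

```python
import re

# Assumed log location and message fragments; adjust to the actual environment.
LOG_PATH = "/var/log/syslog"
PATTERNS = [
    re.compile(r"openvswitch-vswitchd.*dp_packet_set_l2_pad_size", re.IGNORECASE),
    re.compile(r"openvswitch-vswitchd.*(assertion|killed|restart)", re.IGNORECASE),
]

with open(LOG_PATH, errors="replace") as fh:
    for line in fh:
        if any(p.search(line) for p in PATTERNS):
            print(line.rstrip())
```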
In addition, several instances were being migrated, rebooted, created, and shut down around the same time. These activities could have put additional load on the system, further exacerbating the impact of the Open vSwitch failure.

In conclusion, the root cause of the anomalies appears to be a combination of the Open vSwitch failure and the high system load from instance management activities. The Open vSwitch failure could have disrupted network traffic, leading to the drop in call success rate, while the high system load could have delayed its recovery.

----
MaaS Mistral Analysis:

Root Cause Analysis:
To perform a root cause analysis on the provided anomalies, let's first understand the metrics and logs.

Metrics:
1. time: Timestamp of the metric measurement.
2. call_attempt: The number of call attempts made during that interval.
3. call_success: The number of successful calls during that interval.
4. call_failure: The number of failed calls during that interval.
5. total_registered_subs: The total number of registered subscribers.
6. call_success_rate: The percentage of successful calls out of the total number of call attempts (see the sketch after this list).
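To make the relationship between these fields concrete, here is a minimal Python sketch that derives call_success_rate and sanity-checks that attempts equal successes plus failures; the row values are invented for illustration:

```python
# Illustrative rows shaped like the metrics described above; the values are invented.
rows = [
    {"time": "2024-09-04 11:20:00", "call_attempt": 100, "call_success": 98, "call_failure": 2},
    {"time": "2024-09-04 11:23:00", "call_attempt": 100, "call_success": 23, "call_failure": 77},
]

for row in rows:
    attempts = row["call_attempt"]
    rate = 100.0 * row["call_success"] / attempts if attempts else 0.0
    consistent = attempts == row["call_success"] + row["call_failure"]
    print(f'{row["time"]}: call_success_rate={rate:.3f}%  attempts==successes+failures: {consistent}')
```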
Logs:
The logs show the startup process of the OpenStack Compute Service (nova-compute) on the server 'openstack-5'. The service loads a hypervisor driver and starts instance management successfully.

Now, let's analyze the anomalies in the metrics:

1. At '2024-09-04 11:25:00', call_attempt increased from 112 to 114, but call_success and call_failure remained the same. This could indicate an increased load on the system, leading to a higher number of call attempts that the system was unable to handle, resulting in no change in successful or failed calls (see the delta sketch after this list).

2. At '2024-09-04 11:26:00', call_attempt remained the same, but call_success dropped significantly from 40 to 70, and there was a single call_failure. This could indicate a sudden spike in call attempts that the system was unable to handle efficiently, leading to a decrease in the success rate.

3. Comparing the anomalies with the logs, the logs end at 'Sep 04 ', which is before the first anomaly at '2024-09-04 11:23:00'. The logs do not show any signs of issues during normal operation.

4. The call_success_rate was consistently high (99.11%) before the anomalies, but it dropped significantly afterward. This could suggest a system degradation or a configuration issue that was not present earlier.
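A minimal Python sketch of the delta check behind observations 1 and 2, using values that echo the figures quoted above rather than the real dataset:

```python
# Compare consecutive samples and report which counters moved.
# Values echo the figures quoted in observations 1 and 2; they are not the real dataset.
samples = [
    {"time": "11:24:00", "call_attempt": 112, "call_success": 40, "call_failure": 0},
    {"time": "11:25:00", "call_attempt": 114, "call_success": 40, "call_failure": 0},
    {"time": "11:26:00", "call_attempt": 114, "call_success": 70, "call_failure": 1},
]

for prev, cur in zip(samples, samples[1:]):
    deltas = {k: cur[k] - prev[k] for k in ("call_attempt", "call_success", "call_failure")}
    moved = {k: v for k, v in deltas.items() if v != 0}
    print(f'{cur["time"]}: changed counters -> {moved or "none"}')
```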
To further investigate, you could:

1. Check the system load at the time of the anomalies.
2. Investigate the network connection for any potential issues.
3. Check the server logs for any error messages or warnings around the time of the anomalies (see the sketch after this list).
4. Analyze the call logs to identify any patterns or commonalities in the failed calls.
5. Investigate the hypervisor driver and instance management for any configuration issues or bugs.
6. Compare the system configuration before and after the anomalies to identify any changes that could have caused the degradation.
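For step 3, a hedged sketch of pulling journald entries around the anomaly window with Python; the time bounds come from the metrics discussed above, and this assumes a systemd-based host where journalctl is available:

```python
import subprocess

# Pull warning-and-above journal entries around the anomaly window (assumes systemd/journald).
cmd = [
    "journalctl",
    "--since", "2024-09-04 11:20:00",
    "--until", "2024-09-04 11:30:00",
    "--priority", "warning",
    "--no-pager",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
```

The same window could be narrowed to a specific unit (for example, the Open vSwitch service unit) once a suspect service is identified.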