Merge pull request #8 from tme-osx/main
 Merge Back
fenar authored Nov 19, 2024
2 parents 16406e7 + e31bf73 commit ae92f69
Showing 27 changed files with 101,882 additions and 50 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -8,20 +8,20 @@ This repository is dedicated to exploring various TME use-cases built around ope

## Article
Our Project Summary: [TrueAI4Telco](https://medium.com/open-5g-hypercore/episode-xxiii-trueai4telco-3e372898ce06) <br>
Short AI Native Telco Recording: [AINativeTelco-Video](https://drive.google.com/file/d/1AuImD3qL28GGekMuSQSFP7aYV5kMxmPj/view?usp=sharing)

## Projects
1. [Revenue Assurance and Fraud Management (RAFM)](https://github.com/fenar/TME-AIX/tree/main/revenueassurance) & [ShortDemoVideo](https://drive.google.com/file/d/19XQey7w_q6xW79fwUebDXJbCDQWXQcFs/view?usp=drive_link)
2. [Service Assurance Latency & NPS Predictions](https://github.com/fenar/TME-AIX/tree/main/serviceassurance) & [ShortDemoVideo](https://drive.google.com/file/d/1QIm2rmIkZOiozLVoDhKVgUimPJH3WUi7/view?usp=drive_link)
1. [Revenue Assurance and Fraud Management (RAFM)](https://github.com/fenar/TME-AIX/tree/main/revenueassurance)
2. [Service Assurance Latency & NPS Predictions](https://github.com/fenar/TME-AIX/tree/main/serviceassurance)
3. [5G Network Operation Fault Predictions](https://github.com/fenar/TME-AIX/tree/main/5gnetops)
4. [Sustainability & Energy Efficiency](https://github.com/fenar/TME-AIX/tree/main/sustainability)
5. [SecOps-AI for Networking](https://github.com/fenar/TME-AIX/tree/main/secops)
6. [AI Powered SmartGrid](https://github.com/fenar/TME-AIX/tree/main/smartgrid)
7. [IoT Perimeter Security](https://github.com/fenar/TME-AIX/tree/main/iot-sec)
8. [5G CNF RCA with LLM](https://github.com/ansonmez/5g_llm_ilab_demo)
9. [Customer Relation Management Voice App](https://github.com/tme-osx/TME-AIX/tree/main/crm) & [ShortDemoVideo](https://drive.google.com/file/d/1SwPuo9_eCwWnHfqjMYXAgCoTEeFhnB_H/view?usp=drive_link)
10. [RootCauseAnalysis & Resolution with GenAI + RAG](https://github.com/tme-osx/TME-AIX/tree/main/llm-rca)
11. ... {Please Reach Us for Interesting Usecases and Interesting Big Data & Automation Projects}
9. [Customer Relation Management Voice App](https://github.com/tme-osx/TME-AIX/tree/main/crm)
10. [Anomaly Detection & RootCauseAnalysis with Model Chaining + Use of RAG for DataMesh](https://github.com/tme-osx/TME-AIX/tree/main/llm-rca)
11. [Starlink -Satellite ISP- Quality of Experience Predictions](https://github.com/tme-osx/TME-AIX/tree/main/starlink)
12. [Autonomous 5G Core with AI : WIP]

## Explore More on HuggingFace
Discover our models and datasets on HuggingFace:
50 changes: 50 additions & 0 deletions llm-rca/LLM-RCA-TestResult.txt
@@ -0,0 +1,50 @@

OpenAI Analysis:

Root Cause Analysis:
Based on the provided metrics and system logs, the anomalies seem to be related to a significant drop in call success rate starting from 2024-09-04 11:23:00. The call success rate dropped from 98.24% at 07:30:00 to 23.491% at 11:23:00, and then gradually increased to 61.491% at 11:26:00.

The system logs indicate that there were several activities happening around the time when the anomalies started. The OpenStack Compute Service, Identity Service, Image Service, Block Storage Service, Networking Service, Database Service, Bare Metal Service, and Data Processing Service were all started successfully.

However, at 11:22:13, there was an error with the Open vSwitch Daemon (openvswitch-vswitchd), which resulted in the service being killed and restarted. The error message indicates an assertion failure in dp_packet_set_l2_pad_size(). Open vSwitch is a key component in managing network traffic in a virtualized environment, and any disruption in its service could potentially impact network-dependent services, such as call attempts and successes.

In addition, there were several instances being migrated, rebooted, created, and shut down around the same time. These activities could have put additional load on the system, further exacerbating the impact of the Open vSwitch failure.

In conclusion, the root cause of the anomalies appears to be a combination of the Open vSwitch failure and the high system load due to instance management activities. The Open vSwitch failure could have disrupted network traffic, leading to a drop in call success rate, while the high system load could have delayed the recovery of the call success rate.

----

MaaS Mistral Analysis:

Root Cause Analysis:
To perform a root cause analysis on the provided anomalies, let's first understand the metrics and logs.

Metrics:
1. time: Timestamp of the metric measurement.
2. call_attempt: The number of call attempts made during that time.
3. call_success: The number of successful calls during that time.
4. call_failure: The number of failed calls during that time.
5. total_registered_subs: The total number of registered subscribers.
6. call_success_rate: The percentage of successful calls out of the total number of call attempts.

Logs:
The logs show the startup process of the OpenStack Compute Service (Nova Compute service) on the server 'openstack-5'. The service loads a hypervisor driver and starts instance management successfully.

Now, let's analyze the anomalies in the metrics:

1. At '2024-09-04 11:25:00', the call_attempt increased from 112 to 114, but the call_success and call_failure remained the same. This could indicate an increased load on the system, leading to a higher number of call attempts, but the system was unable to handle the additional load, resulting in no change in successful or failed calls.

2. At '2024-09-04 11:26:00', the call_attempt remained the same, but the call_success dropped significantly from 40 to 70, and there was a single call_failure. This could indicate a sudden spike in call attempts that the system was unable to handle efficiently, leading to a decrease in the success rate.

3. Comparing the anomalies with the logs, we notice that the logs end at 'Sep 04 ', which is before the first anomaly at '2024-09-04 11:23:00'. However, the logs don't show any signs of issues during the normal operation time.

4. The call_success_rate was consistently high (99.11%) before the anomalies, but it dropped significantly afterward. This could suggest a system degradation or a configuration issue that was not present earlier.

To further investigate, you could:

1. Check the system load at the time of the anomalies.
2. Investigate the network connection for any potential issues.
3. Check the server logs for any error messages or warnings around the time of the anomalies.
4. Analyze the call logs to identify any patterns or commonalities in the failed calls.
5. Investigate the hypervisor driver and instance management for any configuration issues or bugs.
6. Compare the system configuration before and after the anomalies to identify any changes that could have caused the degradation.
32 changes: 15 additions & 17 deletions llm-rca/ReadMe.md
@@ -1,14 +1,18 @@

# AI Model Chaining Driven Root Cause Analysis (RCA) with GenAI and RAG Use<br>
# 5G Operations (OSS) Root Cause Analysis (RCA) with GenAI and RAG Use<br>

This project delivers multi-source data analysis with Model Chaining to detect and resolve anomalies, break down :<br>
- Process a csv files containing time-series telecom metrics<br>
- Find anomalies within the data<br>
- Employs RAG for the systemd log files to find the correlation between the logs and anomalies<br>
- Provides an root cause analysis and the end of the execution<br>
This project delivers multi-source data analysis with Model Chaining (Classic-AI -> GenAI) leveraging Vector Store for Log Data Association.

<div align="center">
<img src="https://raw.githubusercontent.com/tme-osx/TME-AIX/refs/heads/main/llm-rca/images/flow.png"/>
</div>

**Options:**<br>
[1] Use of OpenAI backend for GenAI part: Select -> [llm-ml-rca.ipynb](https://github.com/tme-osx/TME-AIX/blob/main/llm-rca/llm-ml-rca.ipynb) <br>
[2] Option-A: OpenAI's ChatGPT, or Option-B: Red Hat OpenShift AI Model as a Service, as the backend for the GenAI part: Select -> [maas-rca.ipynb](https://github.com/tme-osx/TME-AIX/blob/main/llm-rca/maas-rca.ipynb)

## Metric file processing
It starts with a processing a telecom metric file. The follwoing is an example of the metrics:<br>
It starts by processing a telecom metric file. The following is an example set of the metric data:<br>

|time | call_attempt | call_success | call_failure | total_registered_subs | call_success_rate |
---------------------|--------------|--------------|--------------|-----------------------|--------------------|
@@ -18,12 +22,8 @@ It starts with a processing a telecom metric file. The follwoing is an example o
3 2024-09-04 00:03:00| 113| 111| 1| 9035| 98.23|
4 2024-09-04 00:04:00| 112| 111| 1| 9092| 99.10|


It contains a machine learning model for anomaly detection by using simple isolation forest algorithm.<br>


## Anomaly detection
Anomalies found:<br>
It uses a simple Isolation Forest machine learning model for anomaly detection.<br>Anomalies found:<br>

|time | call_attempt | call_success | call_failure | total_registered_subs | call_success_rate |is_anomaly|
-----------------------|--------------|--------------|--------------|-----------------------|--------------------|----------|
@@ -32,12 +32,10 @@ Anomalies found:<br>
685 2024-09-04 11:25:00| 112| 40| 0| 9089| 35.73| -1|
686 2024-09-04 11:26:00| 114| 70| 2| 9035| 61.49| -1|
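
The anomaly detection step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the synthetic data, the feature list, and the `contamination` and `random_state` values are all assumptions chosen for the example; only the use of scikit-learn's `IsolationForest` and the `-1` anomaly label come from the README.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the metrics CSV (assumption): 120 healthy minutes
# followed by 4 minutes with a sharply degraded call success count.
rng = np.random.default_rng(0)
n_normal, n_bad = 120, 4
attempts = rng.integers(110, 116, size=n_normal + n_bad)
success = np.concatenate([
    attempts[:n_normal] - rng.integers(0, 3, size=n_normal),  # 0-2 failed calls
    np.array([26, 30, 40, 70]),                               # degraded window
])
df = pd.DataFrame({
    "time": pd.date_range("2024-09-04 00:00:00", periods=n_normal + n_bad, freq="min"),
    "call_attempt": attempts,
    "call_success": success,
})
df["call_success_rate"] = (100 * df["call_success"] / df["call_attempt"]).round(2)

# Fit the Isolation Forest on the numeric columns (feature choice is an assumption).
features = ["call_attempt", "call_success", "call_success_rate"]
model = IsolationForest(contamination=0.05, random_state=42)
df["is_anomaly"] = model.fit_predict(df[features])  # -1 = anomaly, 1 = normal

anomalies = df[df["is_anomaly"] == -1]
print(anomalies)
```

Because `contamination` fixes the share of rows flagged, a few healthy rows may be labelled `-1` alongside the degraded window; tuning that value against real data is left open here.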

## Root Cause Analysis
After the anomalies are detected, the pipeline builds a VectorDB from the logs, finds the associated data pieces inside it, and passes them to a GenAI model that provides an RCA accordingly.<br>
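
The retrieval idea behind this step can be sketched without any external services. Everything below is an assumption made for illustration: the log lines are invented, the bag-of-words `embed` function stands in for a real embedding model, the NumPy matrix stands in for the vector store, and the hard-coded `query` stands in for whatever query the notebook derives from the anomalies. No LLM is called; the sketch stops at assembling the prompt.

```python
import re
from collections import Counter

import numpy as np

# Invented log lines for this sketch (not taken from the project's log file).
logs = [
    "Sep 04 07:30:02 openstack-5 nova-compute: service started successfully",
    "Sep 04 11:22:13 openstack-5 ovs-vswitchd: assertion failed in dp_packet_set_l2_pad_size, service killed",
    "Sep 04 11:22:15 openstack-5 ovs-vswitchd: service restarted",
    "Sep 04 11:23:01 openstack-5 nova-compute: starting migration of instance",
]

def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

vocab = sorted({tok for line in logs for tok in tokenize(line)})
index = {tok: i for i, tok in enumerate(vocab)}

def embed(text):
    """Toy L2-normalised bag-of-words vector; a stand-in for a real embedding model."""
    vec = np.zeros(len(vocab))
    for tok, count in Counter(tokenize(text)).items():
        if tok in index:
            vec[index[tok]] = count
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# The "vector store": one embedding row per log line.
store = np.stack([embed(line) for line in logs])

def retrieve(query, k=2):
    """Return the k log lines most similar to the query (cosine similarity)."""
    scores = store @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [logs[i] for i in top]

anomaly = "call_success_rate dropped sharply at 2024-09-04 11:23:00"
query = "service assertion failed killed restarted"  # hypothetical query
context = retrieve(query)

# Prompt that would be handed to the GenAI model for the RCA.
prompt = (
    "Anomaly:\n" + anomaly + "\n\n"
    "Related log lines:\n" + "\n".join(context) + "\n\n"
    "Provide a root cause analysis."
)
print(prompt)
```

A real embedding model would match on meaning rather than shared tokens, which is what lets the log search tolerate wording differences between metrics and logs.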

## LLM with RAG
After processing the log file via RAG, it provides and RCA accrodingly:<br>


## RCA (Root Cause Analysis)
## Example Test Output
Root Cause Analysis:<br>
Based on the provided logs and metrics, the anomalies in the metrics seem to be related to the OpenStack services, specifically the Open vSwitch service and the Nova Compute service.<br>

Binary file added llm-rca/images/flow.png
50 changes: 24 additions & 26 deletions llm-rca/llm-ml-rca.ipynb
@@ -13,7 +13,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -23,7 +23,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 4,
"metadata": {},
"outputs": [
{
@@ -32,15 +32,17 @@
"False"
]
},
"execution_count": 1,
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import os,sys\n",
"import os\n",
"import sys\n",
"import pandas as pd\n",
"import numpy as np\n",
"import faiss\n",
"from sklearn.ensemble import IsolationForest\n",
"from dotenv import load_dotenv\n",
"from langchain_openai import ChatOpenAI\n",
@@ -59,18 +61,9 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading Language Model...\n",
"Language Model gpt-4 loaded.\n"
]
}
],
"outputs": [],
"source": [
"# LLM Configuration\n",
"def get_llm(model):\n",
@@ -82,7 +75,10 @@
" Returns:\n",
" ChatOpenAI: An instance of the ChatOpenAI class representing the Language Model (LLM) for root cause analysis.\n",
" \"\"\"\n",
" openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
" #If you are having issues with api key entry via embedded input, you can uncomment the line below and replace 'put_your_key_here' with your actual key\n",
" #os.environ[\"OPENAI_API_KEY\"] = 'put_your_key_here'\n",
" openai_api_key = os.getenv('OPENAI_API_KEY')\n",
" print(\"OPENAI_API_KEY is set:\", bool(openai_api_key))\n",
" if not openai_api_key:\n",
" openai_api_key = input(\"Please enter your OpenAI API key: \")\n",
" os.environ[\"OPENAI_API_KEY\"] = openai_api_key\n",
@@ -91,14 +87,14 @@
" return ChatOpenAI(temperature=0, model_name=model)\n",
"\n",
"print(\"Loading Language Model...\")\n",
"model=\"gpt-4\"\n",
"model='gpt-4'\n",
"llm = get_llm(model)\n",
"print(f\"Language Model {model} loaded.\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@@ -153,7 +149,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 7,
"metadata": {},
"outputs": [
{
@@ -208,7 +204,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@@ -249,23 +245,25 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Root Cause Analysis:\n",
"Based on the provided logs and metrics, the anomalies in the metrics seem to be related to the OpenStack services, specifically the Open vSwitch service and the Nova Compute service.\n",
"Based on the provided metrics and system logs, the anomalies in the metrics seem to be related to the OpenStack services, specifically the Open vSwitch service. \n",
"\n",
"The metrics show a significant drop in call success rate starting from 2024-09-04 11:23:00. This coincides with the system logs which show that the Open vSwitch service encountered an error and crashed at 2024-09-04 11:22:13. The error message \"assertion pad_size <= dp_packet_size(b) failed in dp_packet_set_l2_pad_size()\" indicates a problem with packet padding size, which could potentially disrupt network traffic.\n",
"\n",
"At 11:22:13, there is a log entry indicating an assertion failure in the Open vSwitch service, which leads to the service being killed and restarted. This could potentially disrupt network connectivity for the OpenStack services, affecting call attempts and successes.\n",
"The Open vSwitch service is a key component of the OpenStack platform, providing network connectivity for virtual machines. If this service fails, it could disrupt the network traffic, leading to call failures. \n",
"\n",
"The Nova Compute service logs show several instances being migrated, rebooted, created, and shut down around the same time. This could potentially cause disruptions in the service, affecting the call success rate. Specifically, at 11:23:01, there is a log entry indicating the start of a migration for an instance, which could potentially disrupt the service.\n",
"The system logs also show that the Open vSwitch service was restarted at 2024-09-04 11:22:15, but the call success rate did not recover immediately, possibly due to ongoing network disruptions or other issues caused by the service crash.\n",
"\n",
"In addition, the total number of registered subscribers increases from 9033 to 9157 between 11:25:00 and 11:26:00. This sudden increase could potentially overload the system, leading to a decrease in the call success rate.\n",
"In addition, the logs show that there were several instances being migrated, rebooted, and created around the same time. These operations could also contribute to the network load and potentially exacerbate the impact of the Open vSwitch service crash.\n",
"\n",
"In conclusion, the anomalies in the metrics could be caused by disruptions in the OpenStack services due to the Open vSwitch service failure and the Nova Compute service operations, as well as a sudden increase in the number of registered subscribers.\n"
"In conclusion, the root cause of the anomalies in the metrics is likely the crash of the Open vSwitch service, possibly exacerbated by high network load due to instance operations. Further investigation would be needed to determine why the Open vSwitch service crashed and how to prevent such issues in the future.\n"
]
}
],