To create new RCA datasets, follow these steps:
- System Setup: Deploy the target microservice system in a controlled environment, such as a Kubernetes cluster, and configure it to generate telemetry data (metrics, logs, traces).
- Fault Injection: Identify the fault types to include (e.g., resource, network, code-level faults). Use tools like `stress-ng` for resource faults, `tc` for network faults, and manual code modifications for code-level faults.
- Data Collection:
  - Metrics: Use tools like Prometheus and cAdvisor to gather system metrics.
  - Logs: Employ log aggregators like Fluent Bit or Loki to collect and structure logs.
  - Traces: Use tracing tools like Jaeger to capture distributed traces.
- Fault Annotation: Annotate the collected data with labels for the injected faults, including:
  - The time of fault injection.
  - The root cause service.
  - Specific root cause indicators (e.g., a metric, log entry, or trace span).
- Data Processing: Convert the telemetry data into a structured format such as CSV or JSON. Keep the schema consistent across files by including columns for timestamps, service names, and telemetry values.
- Validation: Engage domain experts to validate the dataset for accuracy and completeness.
- Documentation: Provide a README or similar file with details about the dataset, including:
  - The systems used.
  - Fault types included.
  - Instructions for downloading and using the dataset.
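
The annotation and processing steps above can be sketched as follows. This is a minimal illustration, not a schema that RCAEval requires: the field names in `fault_label`, the column names in the CSV header, and the sample values are all hypothetical and chosen only to show one consistent layout.

```python
import csv
import io
import json

# Hypothetical annotation for one injected fault. The keys are illustrative
# assumptions, not a format mandated by RCAEval.
fault_label = {
    "inject_time": 1700000300,             # Unix seconds when the fault was injected
    "fault_type": "cpu",                   # e.g. a resource fault
    "root_cause_service": "cartservice",
    "root_cause_indicator": "cartservice_cpu",
}

# Hypothetical metric samples: (timestamp, service, metric name, value).
samples = [
    (1700000299, "cartservice", "cpu", 0.21),
    (1700000300, "cartservice", "cpu", 0.97),
    (1700000300, "frontend", "latency_ms", 480.0),
]


def to_csv(rows):
    """Serialize samples as CSV text with a single, consistent header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["time", "service", "metric", "value"])
    writer.writerows(rows)
    return buf.getvalue()


csv_text = to_csv(samples)                       # telemetry in a structured format
label_json = json.dumps(fault_label, indent=2)   # annotation stored alongside it
```

Keeping the label in a separate JSON document is one possible design choice: the telemetry files stay unchanged, and the ground truth can be revised independently after expert validation.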
To develop new RCA methods and integrate them into RCAEval, follow these steps:
- Define the Approach:
  - Decide on the type of RCA method (metric-based, trace-based, multi-source).
  - Determine the algorithm or technique to use (e.g., statistical analysis, causal inference, machine learning).
- Implement the Method:
  - Create a new Python file in the `RCAEval/e2e/` directory, naming it appropriately (e.g., `new_method.py`).
  - Implement the method as a Python function with the following signature:

        def new_method(data, inject_time=None, dataset=None, sli=None, anomalies=None, **kwargs):
            # Method logic here; the placeholder below stands in for the real ranking
            ranked_root_causes = []  # candidates ordered from most to least likely
            return {
                "ranks": ranked_root_causes,
            }
- Preprocess the Data:
  - Use existing utilities from `RCAEval.io.time_series` to preprocess the input telemetry data, such as `preprocess`, `drop_constant`, or `select_useful_cols`.
- Analyze the Data:
  - Implement the core logic for root cause analysis.
  - Rank the root cause candidates based on their likelihood of causing the failure.
- Test the Method:
  - Write unit tests in `tests/test_new_method.py` to ensure correctness and reproducibility.
  - Use sample datasets available in RCAEval to validate the method.
- Integrate with RCAEval:
  - Add the method to `RCAEval/e2e/__init__.py` for seamless import.
  - Update the `main.py` evaluation script to include the new method by adding it to the `--method` options.
- Document the Method:
  - Provide usage examples in the README or a dedicated tutorial notebook in the `docs/` folder.
  - Include a description of the method, its assumptions, and limitations.
- Contribute Back:
  - Submit a pull request to the RCAEval repository with the new method and associated documentation.
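
To make the steps above concrete, here is a minimal sketch of a metric-based method that follows the signature given earlier. It is a simple illustrative baseline, not a method shipped with RCAEval. The function name `zscore_shift`, the presence of a `time` column, and the `service_metric` column naming are assumptions made for this example. The input is assumed to be a mapping from column name to a sequence of numbers, so a plain dict works here; a pandas DataFrame exposes the same access pattern, but check the actual input format before relying on this.

```python
from statistics import mean, pstdev


def zscore_shift(data, inject_time=None, dataset=None, sli=None, anomalies=None, **kwargs):
    """Rank columns by how far their post-injection mean moves away from the
    pre-injection mean, measured in pre-injection standard deviations."""
    if inject_time is None:
        raise ValueError("inject_time is required for this sketch")

    times = list(data["time"])  # assumption: timestamps live in a 'time' column
    scores = {}
    for col in data:
        if col == "time":
            continue
        values = list(data[col])
        before = [v for t, v in zip(times, values) if t < inject_time]
        after = [v for t, v in zip(times, values) if t >= inject_time]
        if not before or not after:
            continue  # not enough data on one side of the injection time
        spread = pstdev(before) or 1e-9  # avoid dividing by zero on flat baselines
        scores[col] = abs(mean(after) - mean(before)) / spread

    # Highest score first: the most shifted column is the top candidate.
    ranked_root_causes = sorted(scores, key=scores.get, reverse=True)
    return {"ranks": ranked_root_causes}


# Hypothetical sample data: the CPU metric jumps at time 4, latency barely moves.
sample = {
    "time": [1, 2, 3, 4, 5, 6],
    "frontend_latency": [10, 11, 10, 11, 12, 11],
    "cartservice_cpu": [0.2, 0.3, 0.2, 0.9, 0.95, 0.9],
}
result = zscore_shift(sample, inject_time=4)
```

A unit test in `tests/test_new_method.py` could assert that `result["ranks"][0]` is the column where the fault was injected, which is the kind of reproducible check the testing step calls for.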