Creating new RCA datasets and Developing new RCA methods

Creating new RCA datasets

To create new RCA datasets, follow these steps:

System Setup: Deploy the target microservice system in a controlled environment, such as a Kubernetes cluster, and configure it to generate telemetry data (metrics, logs, traces).
Fault Injection: Identify the fault types to include (e.g., resource, network, code-level faults). Use tools like stress-ng for resource faults, tc for network faults, and manual code modifications for code-level faults.
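The fault injection step can be scripted so that every run records its own injection time. The following is a minimal sketch, not part of RCAEval: the function names, the fault-type keys, the worker counts, and the default interface `eth0` are assumptions chosen for illustration. Only the `stress-ng` and `tc` invocations are standard usages of those tools.

```python
import shlex
import subprocess
import time


def build_fault_command(fault_type, duration_s=60, interface="eth0"):
    """Return the argv list for one fault type (hypothetical mapping)."""
    timeout = f"{duration_s}s"
    if fault_type == "cpu":
        # Two CPU stress workers for the given duration.
        return ["stress-ng", "--cpu", "2", "--timeout", timeout]
    if fault_type == "mem":
        # One virtual-memory worker allocating 512 MB.
        return ["stress-ng", "--vm", "1", "--vm-bytes", "512M", "--timeout", timeout]
    if fault_type == "delay":
        # Add 200 ms of latency to outgoing packets on the interface.
        return ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", "200ms"]
    if fault_type == "loss":
        # Drop 10% of outgoing packets on the interface.
        return ["tc", "qdisc", "add", "dev", interface, "root", "netem", "loss", "10%"]
    raise ValueError(f"unknown fault type: {fault_type}")


def inject_fault(fault_type, dry_run=True, **kwargs):
    """Record the injection time, then run (or only print) the command."""
    cmd = build_fault_command(fault_type, **kwargs)
    inject_time = int(time.time())  # Unix seconds, kept for later annotation
    if dry_run:
        print("would run:", shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return {"fault_type": fault_type, "inject_time": inject_time, "command": cmd}
```

Unlike the `stress-ng` faults, a `netem` rule has no timeout and stays in place until it is removed with `tc qdisc del dev <interface> root`. In a Kubernetes deployment these commands would typically have to run inside the target container or its network namespace.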
Data Collection:
- Metrics: Use tools like Prometheus and cAdvisor to gather system metrics.
- Logs: Employ log aggregators like Fluent Bit or Loki to collect and structure logs.
- Traces: Use tracing tools like Jaeger to capture distributed traces.
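For metrics, one possible way to pull a time window around the fault is the Prometheus HTTP API range query endpoint (`/api/v1/query_range`). The sketch below only builds the request URL and flattens a response of result type `matrix`. The helper names and the base URL in the usage note are assumptions, and fetching the URL is left to whichever HTTP client you prefer.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, promql, start_ts, end_ts, step_s=1):
    """Build a Prometheus range-query URL for the window [start_ts, end_ts]."""
    params = {"query": promql, "start": start_ts, "end": end_ts, "step": f"{step_s}s"}
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def flatten_matrix(response_json):
    """Turn a 'matrix' result into (timestamp, labels, value) rows."""
    rows = []
    for series in response_json["data"]["result"]:
        labels = series["metric"]
        # Each sample is a [timestamp, "value-as-string"] pair.
        for ts, value in series["values"]:
            rows.append((int(float(ts)), labels, float(value)))
    return rows
```

For example, `build_range_query_url("http://localhost:9090", "rate(container_cpu_usage_seconds_total[1m])", start, end)` would request the per-container CPU usage rate exported by cAdvisor, assuming Prometheus is reachable at that address.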
Fault Annotation: Annotate the collected data with labels for the injected faults, including:
- The time of fault injection.
- The root cause service.
- Specific root cause indicators (e.g., a metric, log entry, or trace span).
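The three annotation items above can be stored as one small label record per fault case. This is a sketch under assumed names: the `FaultLabel` class and its field names are hypothetical and are not a schema defined by RCAEval.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FaultLabel:
    inject_time: int           # Unix timestamp (seconds) of the fault injection
    root_cause_service: str    # service into which the fault was injected
    fault_type: str            # e.g. "cpu", "delay", "loss"
    root_cause_indicator: str  # a metric name, log pattern, or trace span


def dump_label(label):
    """Serialize a label to JSON text with stable key order."""
    return json.dumps(asdict(label), indent=2, sort_keys=True)


def load_label(text):
    """Parse JSON text produced by dump_label back into a FaultLabel."""
    return FaultLabel(**json.loads(text))
```

Keeping the label in a separate file next to the telemetry data makes it easy to check later that the annotated injection time falls inside the collected time range.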
Data Processing: Convert the telemetry data to a structured format such as CSV or JSON. Keep the schema consistent across files by including columns for timestamps, service names, and telemetry values.
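As an illustration of the processing step, the sketch below pivots long-form metric records into a wide CSV with one row per timestamp. The `time` column name and the `service_metric` column naming are assumptions made for this example, so check them against the layout of the dataset you are extending.

```python
import csv
import io


def to_wide_csv(records):
    """Pivot (timestamp, service, metric, value) records into CSV text."""
    columns = sorted({f"{service}_{metric}" for _, service, metric, _ in records})
    by_time = {}
    for ts, service, metric, value in records:
        by_time.setdefault(ts, {})[f"{service}_{metric}"] = value

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["time"] + columns,
        restval="",            # leave missing samples empty
        lineterminator="\n",
    )
    writer.writeheader()
    for ts in sorted(by_time):  # rows in chronological order
        writer.writerow({"time": ts, **by_time[ts]})
    return buffer.getvalue()
```

Missing samples are written as empty cells rather than zeros, so a later preprocessing step can decide how to fill or drop them.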
Validation: Engage domain experts to validate the dataset for accuracy and completeness.
Documentation: Provide a README or similar file with details about the dataset, including:
- The systems used.
- Fault types included.
- Instructions for downloading and using the dataset.

Developing new RCA methods

To develop new RCA methods and integrate them into RCAEval, follow these steps:

Define the Approach:
- Decide on the type of RCA method (metric-based, trace-based, multi-source).
- Determine the algorithm or technique to use (e.g., statistical analysis, causal inference, machine learning).

Implement the Method:

Create a new Python file in the RCAEval/e2e/ directory, naming it appropriately (e.g., new_method.py).

Implement the method as a Python function with the following signature:

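def new_method(data, inject_time=None, dataset=None, sli=None, anomalies=None, **kwargs):
    # Method logic here: score each root cause candidate using the data
    # around inject_time, then order candidates from most to least likely.
    ranked_root_causes = []  # placeholder: fill with the ordered candidates
    return {
        "ranks": ranked_root_causes,
    }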

Preprocess the Data:
- Use existing utilities from RCAEval.io.time_series, such as preprocess, drop_constant, or select_useful_cols, to preprocess the input telemetry data.
Analyze the Data:
- Implement the core logic for root cause analysis.
- Rank the root cause candidates based on their likelihood of causing the failure.
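To make the ranking step concrete, here is a deliberately simple baseline that follows the function signature shown earlier. It is an illustrative sketch, not one of RCAEval's methods: the name `mean_shift_baseline` and the scoring rule are assumptions, and the code expects `data` to be a column-indexed table with a `time` column. It is written against plain dicts of lists, and it should also work with a pandas DataFrame, since iterating a DataFrame yields its column names.

```python
from statistics import mean, pstdev


def mean_shift_baseline(data, inject_time=None, dataset=None, sli=None, anomalies=None, **kwargs):
    """Rank columns by how far their mean moves after the injection time."""
    if inject_time is None:
        raise ValueError("this baseline needs inject_time to split the data")

    times = list(data["time"])
    scores = {}
    for column in data:
        if column == "time":
            continue
        values = list(data[column])
        before = [v for t, v in zip(times, values) if t < inject_time]
        after = [v for t, v in zip(times, values) if t >= inject_time]
        if not before or not after:
            continue  # cannot compare without samples on both sides
        spread = pstdev(before) or 1e-9  # avoid dividing by zero for flat series
        scores[column] = abs(mean(after) - mean(before)) / spread

    # Highest score first: the most likely root cause candidate leads the list.
    ranked_root_causes = sorted(scores, key=scores.get, reverse=True)
    return {"ranks": ranked_root_causes}
```

A baseline like this is mainly useful as a sanity check: a new method that cannot beat a plain mean-shift ranking on the sample datasets probably has a bug or a mismatched assumption.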
Test the Method:
- Write unit tests in tests/test_new_method.py to ensure correctness and reproducibility.
- Use sample datasets available in RCAEval to validate the method.
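A test file for the new method might start from the output contract alone: the method returns a dict whose "ranks" entry is an ordered list of candidates. The sketch below uses a stand-in `new_method` so that it runs on its own. In a real test you would import your method from the file created under RCAEval/e2e/ instead, and the toy data and column names here are assumptions.

```python
import unittest


def new_method(data, inject_time=None, dataset=None, sli=None, anomalies=None, **kwargs):
    # Stand-in so this sketch is runnable; replace with an import of your method.
    candidates = [name for name in data if name != "time"]
    return {"ranks": sorted(candidates)}


class TestNewMethod(unittest.TestCase):
    def setUp(self):
        # Tiny synthetic dataset with a visible shift in cart_cpu at time 3.
        self.data = {
            "time": [1, 2, 3, 4],
            "cart_cpu": [1.0, 1.0, 5.0, 5.0],
            "db_mem": [10, 10, 10, 10],
        }

    def test_returns_ranks_list(self):
        result = new_method(self.data, inject_time=3)
        self.assertIn("ranks", result)
        self.assertIsInstance(result["ranks"], list)

    def test_ranks_contain_only_known_candidates(self):
        result = new_method(self.data, inject_time=3)
        self.assertTrue(set(result["ranks"]) <= {"cart_cpu", "db_mem"})

    def test_is_reproducible(self):
        first = new_method(self.data, inject_time=3)
        second = new_method(self.data, inject_time=3)
        self.assertEqual(first, second)
```

Saved as tests/test_new_method.py, the tests can be run with `python -m unittest tests.test_new_method` from the repository root if tests/ is an importable package, or with `python -m unittest discover -s tests` otherwise.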
Integrate with RCAEval:
- Add the method to RCAEval/e2e/__init__.py for seamless import.
- Update the main.py evaluation script to include the new method by adding it to the --method options.
Document the Method:
- Provide usage examples in the README or a dedicated tutorial notebook in the docs/ folder.
- Include a description of the method, its assumptions, and limitations.
Contribute Back:
- Submit a pull request to the RCAEval repository with the new method and associated documentation.
