[BUG] Missing Input Validator in `validate_and_prep_inputs` #273

gmchueh · 2025-01-02T12:54:34Z

Bug: Missing Input Validator in validate_and_prep_inputs

Expected Behavior

The validate_and_prep_inputs method should include an input validator to ensure that all required columns in the input DataFrame are present and contain valid data types before proceeding with further processing. If an invalid column name or data type is detected, a descriptive error message should be raised, specifying the issue and potential solution.

Current Behavior

Currently, the validate_and_prep_inputs method only checks for the existence of the action_id column and adds it if missing. It does not perform comprehensive validation on the other required columns or their data types, so a DataFrame that is not correctly formatted can cause unexpected behavior or errors in later processing steps. An error is raised only when required columns are missing, not when their values are empty. The method also coerces the required columns to the object dtype; this should be changed so that they always have a str type.

Possible Solution

Implement an input validator within the validate_and_prep_inputs method. This validator should check for the presence of all required columns, as defined by the class, and also validate the data types of the entries within those columns. If validation fails, raise a ValueError with a detailed message explaining the error, the problematic column(s)/rows, and suggested corrective actions.
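
A minimal sketch of what such a validator could look like, assuming the input is a pandas DataFrame. The function name validate_required_columns and the REQUIRED_COLUMNS mapping are hypothetical and not part of the existing code; the column names are the ones mentioned in this issue.

```python
import pandas as pd

# Hypothetical mapping of required column -> expected Python type.
REQUIRED_COLUMNS = {"eval_id": str, "action_type": str, "action_input": str}


def validate_required_columns(df: pd.DataFrame) -> None:
    """Raise a single ValueError that lists every problem found."""
    problems = []
    for col, expected in REQUIRED_COLUMNS.items():
        if col not in df.columns:
            problems.append(f"column '{col}' is missing; add it to the DataFrame")
            continue
        # Collect the index labels of rows whose value has the wrong type.
        bad_rows = [
            idx for idx, val in df[col].items() if not isinstance(val, expected)
        ]
        if bad_rows:
            problems.append(
                f"column '{col}' has non-{expected.__name__} values "
                f"at rows {bad_rows}; convert them to {expected.__name__}"
            )
    if problems:
        raise ValueError("Invalid input DataFrame:\n- " + "\n- ".join(problems))
```

Collecting all problems before raising means the user sees every affected column and row in one pass instead of fixing them one error at a time.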

Steps to Reproduce

Create a Pandas DataFrame with missing or incorrect data types for required columns, such as 'eval_id', 'action_type', 'action_input', etc. in the evaluations.py tool.
Instantiate the Evaluations class and call the from_dataframe method, passing the DataFrame.
Observe that no validation error is raised even when the required columns contain empty or invalid data, unexpected rows exist, or the data types or values do not match what is expected.
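
For illustration, a DataFrame along the following lines should exercise the problem. The cell values are made-up placeholders, and the from_dataframe call is left as a comment because its exact signature is not shown in this issue.

```python
import numpy as np
import pandas as pd

# Malformed input: an integer eval_id, an empty action_type,
# and a NaN action_input. All values are placeholders.
df = pd.DataFrame(
    {
        "eval_id": ["eval_001", 2, "eval_003"],
        "action_type": ["placeholder", "", "placeholder"],
        "action_input": ["hello", "world", np.nan],
    }
)

# Every required column is present, so a presence-only check passes.
print(df.dtypes)
print(df.isna().sum())

# Passing `df` to Evaluations.from_dataframe (call not shown here) is
# reported above to raise no validation error despite the bad cells.
```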

Context (Environment)

This issue can cause problems when processing evaluation datasets. Without proper input validation, incorrect data is frequently passed in and causes errors downstream. Users can't troubleshoot on their own because it is not clear that the way the DataFrame was filled in is the actual issue. Adding validation ensures data integrity and helps prevent issues later in the evaluation pipeline.

Detailed Description

The missing input validation makes the validate_and_prep_inputs method vulnerable to malformed input DataFrames. This can lead to unexpected behavior during evaluation runs. A robust validation step is needed to check not only for the presence of required columns, but also for empty or invalid values and correct data types within those columns. This will improve the reliability and robustness of the Evaluations class and the validate_and_prep_inputs method, and allow users to be more self-sufficient while troubleshooting.

Possible Implementation

Within the validate_and_prep_inputs method, before any other processing, implement a validation check. This check should:

Verify presence of required columns: Ensure all columns listed in self.required_columns are present in the input DataFrame.
Check for empty values: Validate that the entries in the required columns are not empty ("", None, np.nan). If they are empty, raise a ValueError with a message specifying the affected column(s).
Validate data types/values: Confirm that the data types of values in required columns match the expected types and formatting. If a mismatch is found, raise a ValueError specifying the column and expected type. Also, ensure the data type of the required columns is changed from object to str.
Validate rows: Confirm that the rows don't violate the expected standards (e.g., a double tool call, a playbook call when there is only one playbook, etc.).
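
The empty-value check above could be sketched as follows. The helper names find_empty_cells and check_no_empty_values are hypothetical; only pandas calls are used, and "empty" is taken to mean exactly the three cases listed ("", None, np.nan).

```python
import numpy as np
import pandas as pd


def find_empty_cells(df: pd.DataFrame, columns: list[str]) -> dict[str, list]:
    """Map each column to the index labels of rows holding "", None, or NaN."""
    empty = {}
    for col in columns:
        series = df[col]
        # isna() catches None and np.nan; the map catches empty strings.
        is_empty_str = series.map(lambda v: isinstance(v, str) and v == "")
        is_blank = series.isna() | is_empty_str.astype(bool)
        rows = series[is_blank].index.tolist()
        if rows:
            empty[col] = rows
    return empty


def check_no_empty_values(df: pd.DataFrame, columns: list[str]) -> None:
    """Raise a ValueError naming each affected column and its rows."""
    empty = find_empty_cells(df, columns)
    if empty:
        details = "; ".join(
            f"'{col}' at rows {rows}" for col, rows in empty.items()
        )
        raise ValueError(f"Required columns contain empty values: {details}")
```

Running this check before any cast to str matters, because a cast can turn missing values into ordinary-looking strings and hide them from later checks.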

Example code snippet for implementing dtype validation:

REQUIRED_COLUMNS_TYPES = {
    'eval_id': str,
    'action_type': str,
    'action_input': str,
    # ... other required columns and types
}

# `df` is the input pandas DataFrame.
for col, expected_type in REQUIRED_COLUMNS_TYPES.items():
    # Check presence first so a missing column gives a clear error, not a KeyError.
    if col not in df.columns:
        raise ValueError(f"Missing required column '{col}'")
    if not all(isinstance(val, expected_type) for val in df[col]):
        raise ValueError(
            f"Column '{col}' must contain values of type {expected_type.__name__}"
        )

# Convert the required columns from `object` to pandas' dedicated string dtype.
for col in REQUIRED_COLUMNS_TYPES:
    df[col] = df[col].astype("string")

gmchueh added the bug label on Jan 2, 2025