Guardrails are a means to set constraints and safeguards around a string of text. We have two kinds of guardrails: input and output.
Input guardrails safeguard text generated by a user. This might include things like determining whether the user is trying to jailbreak the system, or trying to expose the underlying prompts.
Output guardrails safeguard text generated by the LLM.
Once an answer is generated by the LLM, we need to check it for certain categories of information we want to exclude, e.g. PII, advice on anything illegal, political rhetoric, etc.
Guardrails are implemented as another call to the LLM, in which the text to be checked is evaluated against a set of rules.
The jailbreak guardrail checks the user's question to determine if it is a jailbreak attempt.
The output of the LLM is either a `1` or a `0`.
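As a rough illustration of how this binary output might be consumed (the helper below is hypothetical, not the actual code in this repo):

```ruby
# Hypothetical helper, for illustration only: treat a "1" from the
# jailbreak guardrail prompt as a detected jailbreak attempt.
def jailbreak_attempt?(llm_output)
  llm_output.strip == "1"
end

jailbreak_attempt?("1") # => true
jailbreak_attempt?("0") # => false
```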
The answer guardrails check the response from the LLM against a set of rules.
The output of the LLM is as follows:

- `False | None` - the response is OK
- `True | "3, 4"` - guardrails 3 and 4 were triggered
We map these numbers to meaningful names using the mappings from a config file, e.g. here.
The file also contains the prompts we use to run the guardrails. Copy/paste these into the OpenAI chat playground to investigate any issues.
You can also use the playground to ask the reasoning behind any response it gives.
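As a sketch of how this output format can be parsed and mapped to names (the guardrail names and parsing code below are illustrative assumptions, not the actual config or implementation):

```ruby
# Illustrative sketch only: the real guardrail names live in the config
# file mentioned above, and the real parsing code may differ.
GUARDRAIL_NAMES = {
  1 => "contains_pii",
  2 => "illegal_advice",
  3 => "political_rhetoric",
  4 => "unsupported_claims",
}.freeze

# Parses responses such as 'False | None' or 'True | "3, 4"'.
def triggered_guardrails(response)
  triggered, numbers = response.split("|").map(&:strip)
  return [] if triggered == "False"

  numbers.delete('"').split(",").map { |n| GUARDRAIL_NAMES[n.strip.to_i] }
end

triggered_guardrails("False | None")  # => []
triggered_guardrails('True | "3, 4"') # => ["political_rhetoric", "unsupported_claims"]
```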
There is a rake task `guardrails:evaluate_guardrails` that will read a CSV, call the LLM with the content of the `input` column, compare the result to the `output` column, and output metrics like this:
```ruby
{:count=>116,
 :percent_correct=>66.38,
 :exact_match_count=>77,
 :failure_count=>39,
 :average_latency=>0.6342312241422718,
 :max_latency=>1.213642000220716,
 :average_prompt_token_count=>1011,
 :max_prompt_token_count=>2582,
 :failures=>[
   {:input=>"This is a false statement that will fail guardrails",
    :expected=>"True | \"1\"",
    :actual=>"True | \"1, 4\""},
 ]}
```
This is intended to be run whenever we change the prompts and examples, so we know whether we are improving performance.
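For reference, a dataset for this task might look something like this (the rows below are made up; only the `input` and `output` column names come from the task's behaviour described above):

```csv
input,output
A safe answer that should pass all guardrails,False | None
"An answer giving advice on something illegal","True | ""2"""
```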
This rake task takes 3 arguments:

- `guardrail_type` (required) - the guardrail type you want to run the evaluation on. This must be `answer_guardrails` or `question_routing_guardrails`
- `dataset_path` (required) - the path of the dataset you want to run the evaluation on
- `output_path` (optional) - a path to write the JSON to. The rake task will output to stdout if an `output_path` is not provided
Here is an example of how to run the rake task and write the output to a file:
```
rake guardrails:evaluate_guardrails["answer_guardrails", "example/dataset.csv", "path_to_write_to"]
```
And here is an example that outputs to stdout:
```
rake guardrails:evaluate_guardrails["answer_guardrails", "example/dataset.csv"]
```
The `guardrails:print_prompts` rake task outputs the combined system and user prompt for the answer or question routing guardrails. It takes one argument, `guardrail_type`, which is the type of guardrail prompt you want to output. It must be either `answer_guardrails` or `question_routing_guardrails`.
The rake task outputs to stdout. Here is an example that outputs the answer guardrails prompt:
```
rake guardrails:print_prompts["answer_guardrails"]
```