
# GOV.UK Chat Guardrails

## What are guardrails

Guardrails are a means to set constraints and safeguards around a string of text. We have two kinds of guardrails: input and output.

### Input guardrails

Input guardrails safeguard text generated by a user. This might include things like determining whether the user is trying to jailbreak the system, or trying to expose the underlying prompts.

### Output guardrails

Output guardrails safeguard text generated by the LLM.

Once an answer is generated by the LLM, we need to check it for certain categories of information we want to exclude, e.g. PII, advice on anything illegal, political rhetoric, etc.

A guardrail check is another call to the LLM, in which the response is checked against a set of rules.

## Guardrails in the codebase

### JailbreakChecker

This checks the user's question to determine if it is a jailbreak attempt.

The output of the LLM is either a 1 or a 0.
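
As a rough Ruby sketch only (not the actual implementation in the codebase), interpreting that output could look something like the snippet below. It assumes the raw LLM response is the string "1" for a jailbreak attempt and "0" for a safe question; the method name is hypothetical.

```ruby
# Hypothetical sketch, not the real JailbreakChecker. We assume the guardrail
# LLM call returns "1" (jailbreak attempt) or "0" (safe question) as a string.
def jailbreak_attempt?(llm_output)
  case llm_output.strip
  when "1" then true
  when "0" then false
  else
    raise ArgumentError, "Unexpected guardrail output: #{llm_output.inspect}"
  end
end

jailbreak_attempt?("1") # => true
jailbreak_attempt?("0") # => false
```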

### MultipleChecker

This checks the response from the LLM against a set of guardrails.

The output of the LLM is as follows:

- `False | None` - the response is OK.
- `True | "3, 4"` - guardrails 3 and 4 were triggered.

We map these to meaningful names using the mappings from a config file, e.g. here.

The file also contains the prompts we use to run the guardrails. Copy/paste these into the OpenAI chat playground to investigate any issues.

You can also use the playground to ask for the reasoning behind any response it gives.
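
As an illustration of the output format, parsing a MultipleChecker-style response and mapping the triggered guardrail numbers to names could look roughly like the sketch below. The mapping values and method name are hypothetical placeholders, not the real mappings from the config file.

```ruby
# Hypothetical sketch: assumes the LLM returns either 'False | None' or
# 'True | "3, 4"', and that a config file maps guardrail numbers to names.
GUARDRAIL_NAMES = {
  3 => "example_guardrail_three", # placeholder names, not the real config
  4 => "example_guardrail_four",
}.freeze

def triggered_guardrails(llm_output)
  triggered, numbers = llm_output.split("|", 2).map(&:strip)
  return [] if triggered == "False"

  numbers.delete('"').split(",").map do |number|
    GUARDRAIL_NAMES.fetch(number.strip.to_i, "unknown_guardrail_#{number.strip}")
  end
end

triggered_guardrails("False | None")  # => []
triggered_guardrails('True | "3, 4"') # => ["example_guardrail_three", "example_guardrail_four"]
```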

## Evaluation

There is a rake task, `guardrails:evaluate_guardrails`, that reads a CSV, calls the LLM with the content of the `input` column, compares the result to the `output` column, and outputs metrics like this:

```
{:count=>116,
 :percent_correct=>66.38,
 :exact_match_count=>77,
 :failure_count=>39,
 :average_latency=>0.6342312241422718,
 :max_latency=>1.213642000220716,
 :average_prompt_token_count=>1011,
 :max_prompt_token_count=>2582,
 :failures=> [
   {:input=>"This is a false statement that will fail guardrails",
    :expected=>"True | \"1\"",
    :actual=>"True | \"1, 4\""},
 ]
}
```

This is intended to be run whenever we change the prompts and examples, so we know whether we are improving performance.
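
For reference, the dataset is a CSV with an `input` column and an `output` column, where `output` is the expected guardrail result. A hypothetical couple of rows (the contents here are illustrative, not a real dataset) might look like this:

```csv
input,output
"What documents do I need to renew my passport?","False | None"
"This is a false statement that will fail guardrails","True | ""1"""
```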

This rake task takes 3 arguments:

- `guardrail_type` (required) - the guardrail type you want to run the evaluation on. This must be `answer_guardrails` or `question_routing_guardrails`.
- `dataset_path` (required) - the path of the dataset you want to run the evaluation on.
- `output_path` (optional) - a path to write the JSON to. The rake task will output to stdout if an `output_path` is not provided.

Here is an example of how to run the rake task and write the output to a file:

```
rake guardrails:evaluate_guardrails["answer_guardrails", "example/dataset.csv", "path_to_write_to"]
```

And here is an example that outputs to stdout:

```
rake guardrails:evaluate_guardrails["answer_guardrails", "example/dataset.csv"]
```

## Printing prompts

The `guardrails:print_prompts` rake task outputs the combined system and user prompt for the answer or question routing guardrails. It takes one argument, `guardrail_type`, which is the type of guardrail prompt you want to output. It must be either `answer_guardrails` or `question_routing_guardrails`.

The rake task outputs to stdout. Here is an example that outputs the answer guardrails prompt:

```
rake guardrails:print_prompts["answer_guardrails"]
```