Differences between model prompts #89
---
Hi @jacob-mink! Indeed, the behaviour depends on the LLM model. Some models, like …
---
Hi @jacob-mink! We have run several experiments to improve the prompts for the various LLMs we have explored. Of course, this is a continuous process and the prompts can still be improved; in the last month we have put more energy into improving the prompts and assessing the performance of open-source models. Some results are available here, and the nemoguardrails/eval package contains various tools to test topical rails as well as execution rails (each tool has its own detailed README on how to use it). It would be great for people from the community to improve the existing prompts, especially now that we have provided this initial set of tools and datasets for assessing model performance with different prompts. One thing we have observed is that LLMs run in chat mode vs. completion mode behave differently.
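For reference, a rough sketch of invoking the evaluation tooling from the CLI is below; the exact subcommand and option are assumptions here, so check the per-tool READMEs under nemoguardrails/eval for the authoritative usage.

```sh
# Hedged sketch: evaluating topical rails against a guardrails config.
# The subcommand name and flag are assumptions; see the READMEs in
# nemoguardrails/eval for the exact invocation of each tool.
nemoguardrails evaluate topical --config=path/to/config
```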
Hope this helps.
---
tl;dr: The bot spits out intents as responses instead of generating an actual reply. I'm curious whether it has to do with the `generate_user_intent` prompt.
Hi NeMo-Guardrails team. As I've been playing with the tools you've provided and perusing the repository, I've noticed a behavior that led me to the llm/prompts/ folder. I haven't filed an issue yet because I'm not really sure that it *is* an issue; it may just be a normal LLM limitation.
Let me describe what I've observed first. The behavior is a failure to generate and parse an appropriate intent, given intent definitions along the lines of the sketch below (a hypothetical Colang snippet; the intent names and utterances are illustrative, not my actual config):
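```colang
# Hypothetical Colang definitions, made up to illustrate the setup;
# they are not the actual config from the original post.
define user express greeting
  "hello"
  "hi"
  "what's up?"

define bot express greeting
  "Hey there! How can I help?"

define flow greeting
  user express greeting
  bot express greeting
```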
Then a prompt like "What's up?" might end with the bot's final output being something like "user ask status". Obviously, that isn't even one of the intents I defined... so where did it come from?
Now, the research. I found a couple of key differences in the prompts used to determine user intent. Comparing https://github.com/NVIDIA/NeMo-Guardrails/blob/main/nemoguardrails/llm/prompts/openai-chatgpt.yml#L7 to https://github.com/NVIDIA/NeMo-Guardrails/blob/main/nemoguardrails/llm/prompts/cohere.yml#L24, the key difference is that the Cohere intent prompt *specifically* asks the LLM to generate an intent, while the OpenAI ChatGPT one does not.
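If that difference is the culprit, one way to experiment is to override the prompt in a config. The snippet below is a rough sketch rather than the actual contents of either file: the task/models/content keys mirror the format of the files in nemoguardrails/llm/prompts/, and the explicit final instruction is my own wording.

```yaml
# Rough sketch of a prompt override in a config's prompts.yml.
# The structure mirrors nemoguardrails/llm/prompts/*.yml; the final
# instruction line is illustrative wording, not copied from cohere.yml.
prompts:
  - task: generate_user_intent
    models:
      - openai/gpt-3.5-turbo
    content: |-
      """
      {{ general_instructions }}
      """

      # This is how a conversation between a user and the bot can go:
      {{ sample_conversation }}

      # This is the current conversation between the user and the bot:
      {{ history | colang }}
      # Write only the canonical form of the user intent, e.g. "user express greeting".
```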
So is this a conscious choice to not have that specific instruction for ChatGPT, or is it an issue? Has anyone else noticed the bot spitting out some weird intent messages instead of an actual response?