
github-actions[bot] edited this page Sep 25, 2024 · 3 revisions

prompt flow evaluator

Models in this category


  • Bleu-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1] |
    | What is this metric? | Measures how closely the generated text matches a reference text based on n-gram overlap. |
    | How does it work? | The BLEU score calculates the geometric mean of the precision of n-grams between the model-generated text and ... |
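
    The n-gram-precision idea above can be sketched in plain Python. This is a simplified single-reference BLEU (clipped n-gram precisions, geometric mean, brevity penalty, no smoothing), not the SDK's implementation:

    ```python
    import math
    from collections import Counter

    def ngrams(tokens, n):
        """Multiset of contiguous n-grams of a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu(candidate, reference, max_n=4):
        """Simplified BLEU: geometric mean of clipped n-gram precisions
        times a brevity penalty (single reference, no smoothing)."""
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
            overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total = max(sum(cand_counts.values()), 1)
            precisions.append(overlap / total)
        if min(precisions) == 0:
            return 0.0  # any empty n-gram overlap zeroes the geometric mean
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
        # brevity penalty discourages overly short candidates
        bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
        return bp * geo_mean
    ```

    An identical candidate and reference score 1.0; a candidate sharing no n-grams with the reference scores 0.0.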

  • Coherence-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: where 1 is bad and 5 is good |
    | What is this metric? | Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. |
    | How does it work? | The coherence measure assesses the abi... |
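
    Coherence, like the other integer [1-5] metrics on this page, is produced by prompting a judge model for a rating. One practical detail such evaluators need is turning the judge's free-text reply into a bounded integer. A minimal sketch of that parsing step, with a hypothetical helper name; the real evaluator's prompt and parsing logic may differ:

    ```python
    import re

    def parse_judge_score(raw_response, lo=1, hi=5):
        """Extract the first integer from an LLM judge's reply and clamp
        it to the valid [lo, hi] range; return None if no score is found.
        (Hypothetical helper -- illustrative only.)"""
        match = re.search(r"\d+", raw_response)
        if match is None:
            return None
        return max(lo, min(hi, int(match.group())))
    ```

    For example, a reply like "Score: 4 - the response flows naturally" parses to 4, and an out-of-range number is clamped into [1, 5].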

  • ECI-Evaluator

    Definition

Election Critical Information (ECI) refers to any content related to elections, including voting processes, candidate information, and election results. The ECI evaluator uses the Azure AI Safety Evaluation service to assess the generated responses for ECI without a disclaimer.

#...

  • F1Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1] |
    | What is this metric? | Measures the ratio of the number of shared words between the model generation and the ground truth answers. |
    | How does it work? | The F1-score computes the ratio of the number of shared words between the model generation ... |
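
    The shared-word ratio above is the standard word-overlap F1: precision and recall over the multiset of shared words, combined as 2PR/(P+R). A minimal sketch with whitespace tokenization:

    ```python
    from collections import Counter

    def f1_score(generation, ground_truth):
        """Word-overlap F1: precision and recall over shared words
        (multiset intersection), combined as 2PR/(P+R)."""
        gen, ref = generation.split(), ground_truth.split()
        common = Counter(gen) & Counter(ref)  # multiset intersection
        shared = sum(common.values())
        if shared == 0:
            return 0.0
        precision = shared / len(gen)
        recall = shared / len(ref)
        return 2 * precision * recall / (precision + recall)
    ```

    Identical texts score 1.0; texts with no words in common score 0.0.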

  • Fluency-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: where 1 is bad and 5 is good |
    | What is this metric? | Measures the grammatical proficiency of a generative AI's predicted answer. |
    | How does it work? | The fluency measure assesses the extent to which the generated text conforms to grammatical... |

  • Gleu-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1] |
    | What is this metric? | Measures the degree of overlap between the generated text and both the reference text and source text, balancing between precision and recall. |
    | How does it work? | The GLEU score is computed by averaging the precision and... |

  • Groundedness-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: where 1 is bad and 5 is good |
    | What is this metric? | Measures how well the model's generated answers align with information from the source data (user-defined context). |
    | How does it work? | The groundedness measure assesses the correspondenc... |

  • Hate-and-Unfairness-Evaluator

    Definition

Hateful and unfair content refers to any language pertaining to hate toward or unfair representations of individuals and social groups along factors including but not limited to race, ethnicity, nationality, gender, sexual orientation, religion, immigration status, ability, persona...

  • Indirect-Attack-Evaluator

    Definition

Indirect attacks, also known as cross-domain prompt injected attacks (XPIA), are jailbreak attacks injected into the context of a document or source, which may result in altered, unexpected behavior.

Indirect attack evaluations are broken down into three subcategories: ...

  • Meteor-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1] |
    | What is this metric? | Evaluates the quality of the generated text by considering precision, recall, and a range of linguistic features like synonyms, stemming, and word order. |
    | How does it work? | The METEOR score is calculated based on the ha... |
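
    The core of METEOR is a recall-weighted harmonic mean of unigram precision and recall, F = 10PR/(R + 9P). A deliberately simplified sketch that uses exact matches only, omitting the stemming, synonym matching, and word-order (fragmentation) penalty of the full metric:

    ```python
    from collections import Counter

    def meteor_simplified(candidate, reference):
        """Very simplified METEOR core: exact-match unigram precision and
        recall combined with the recall-weighted harmonic mean
        F = 10*P*R / (R + 9*P). Omits stemming, synonym matching, and the
        fragmentation penalty of the full metric."""
        cand, ref = candidate.split(), reference.split()
        matches = sum((Counter(cand) & Counter(ref)).values())
        if matches == 0:
            return 0.0
        p = matches / len(cand)
        r = matches / len(ref)
        return 10 * p * r / (r + 9 * p)
    ```

    Because the mean is weighted 9:1 toward recall, a candidate that misses reference words is penalized more heavily than one that adds extra words.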

  • Protected-Material-Evaluator

    Definition

Protected material is any text that is under copyright, including song lyrics, recipes, and articles. Protected material evaluation leverages the Azure AI Content Safety Protected Material for Text service to perform the classification.

Labeling

Protected Material evaluations ...

  • Relevance-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: where 1 is bad and 5 is good |
    | What is this metric? | Measures the extent to which the model's generated responses are pertinent and directly related to the given questions. |
    | How does it work? | The relevance measure assesses the ability of a... |

  • Rouge-Score-Evaluator

    | | |
    | -- | -- |
    | Score range | Float [0-1] |
    | What is this metric? | Measures the quality of the generated text by comparing it to a reference text using n-gram recall, precision, and F1-score. |
    | How does it work? | The ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) evaluat... |
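
    The recall/precision/F1 trio above can be sketched for the unigram case (ROUGE-1) with whitespace tokenization; the full metric family also covers higher-order n-grams and longest common subsequence (ROUGE-L), which this sketch omits:

    ```python
    from collections import Counter

    def rouge_1(candidate, reference):
        """ROUGE-1: unigram recall, precision, and F1 between the
        generated text and a reference (simplified tokenization)."""
        cand, ref = candidate.split(), reference.split()
        overlap = sum((Counter(cand) & Counter(ref)).values())
        recall = overlap / len(ref) if ref else 0.0
        precision = overlap / len(cand) if cand else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}
    ```

    Recall dominates ROUGE's intent: it asks how much of the reference the generation covers, while precision guards against padding the output with extra text.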

  • Self-Harm-Related-Content-Evaluator

    Definition

Self-harm-related content includes language pertaining to actions intended to hurt, injure, or damage one's body or kill oneself.

Severity scale

Safety evaluations annotate self-harm-related content using a 0-7 scale.

Very Low (0-1) refers to:

  • Content that contains self-...
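
The 0-7 annotation maps onto named severity bands. A minimal sketch of that mapping; only the Very Low (0-1) band is stated above, so the remaining cut-offs here (Low 2-3, Medium 4-5, High 6-7) are the conventional split and should be verified against the service documentation:

```python
def severity_band(score):
    """Map a 0-7 safety annotation to a severity band.
    Band boundaries beyond Very Low (0-1) are assumed, not quoted."""
    if not 0 <= score <= 7:
        raise ValueError("severity score must be in [0, 7]")
    return ["Very Low", "Low", "Medium", "High"][score // 2]
```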

  • Sexual-Content-Evaluator

    Definition

Sexual content includes language pertaining to anatomical organs and genitals, romantic relationships, acts portrayed in erotic terms, pregnancy, physical sexual acts (including assault or sexual violence), prostitution, pornography, and sexual abuse.

Severity scale

Safety eva...

  • Similarity-Evaluator

    | | |
    | -- | -- |
    | Score range | Integer [1-5]: where 1 is bad and 5 is good |
    | What is this metric? | Measures the similarity between a source data (ground truth) sentence and the generated response by an AI model. |
    | How does it work? | The GPT-similarity measure evaluates the likeness betw... |

  • Violent-Content-Evaluator

    Definition

Violent content includes language pertaining to physical actions intended to hurt, injure, damage, or kill someone or something. It also includes descriptions of weapons and guns (and related entities such as manufacturers and associations).

Severity scale

Safety evaluations ...
