Merge pull request #45 from cvs-health/release-branch/v0.2.0
Release PR: v0.2.0
dylanbouchard authored Nov 21, 2024
2 parents 9cbfc71 + d634800 commit 679464c
Showing 34 changed files with 1,751 additions and 1,453 deletions.
Binary file modified assets/images/autoeval_process.png
179 changes: 125 additions & 54 deletions examples/evaluations/text_generation/auto_eval_demo.ipynb
@@ -39,7 +39,7 @@
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/envs/brand-new3/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
"/opt/conda/envs/langchain/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
@@ -54,6 +54,7 @@
"\n",
"import pandas as pd\n",
"from dotenv import find_dotenv, load_dotenv\n",
"from langchain_core.rate_limiters import InMemoryRateLimiter\n",
"\n",
"from langfair.auto import AutoEval\n",
"\n",
@@ -152,14 +153,14 @@
"- `responses` - (**list of strings, default=None**)\n",
"A list of generated output from an LLM. If not available, responses are computed using the model.\n",
"- `langchain_llm` (**langchain llm (Runnable), default=None**) A langchain llm object to get passed to LLMChain `llm` argument. \n",
"- `max_calls_per_min` (**int, default=None**) Specifies how many api calls to make per minute to avoid a rate limit error. By default, no limit is specified.\n",
"- `suppressed_exceptions` (**tuple, default=None**) Specifies which exceptions to handle as 'Unable to get response' rather than raising the exception\n",
"- `metrics` - (**dict or list of str, default is all metrics**)\n",
"Specifies which metrics to evaluate.\n",
"- `toxicity_device` - (**str or torch.device input or torch.device object, default=\"cpu\"**)\n",
"Specifies the device that toxicity classifiers use for prediction. Set to \"cuda\" for classifiers to be able to leverage the GPU. Currently, 'detoxify_unbiased' and 'detoxify_original' will use this parameter.\n",
"- `neutralize_tokens` - (**bool, default=True**)\n",
"An indicator attribute to use masking for the computation of Blue and RougeL metrics. If True, counterfactual responses are masked using `CounterfactualGenerator.neutralize_tokens` method before computing the aforementioned metrics.\n",
"- `max_calls_per_min` (**Deprecated as of 0.2.0**) Use LangChain's InMemoryRateLimiter instead.\n",
"\n",
"**Class Methods:**\n",
"1. `evaluate` - Compute supported metrics.\n",
@@ -181,9 +182,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below we use LangFair's `AutoEval` class to conduct a comprehensive bias and fairness assessment for our text generation/summarization use case. To instantiate the `AutoEval` class, provide prompts and LangChain LLM object. We provide two examples of LangChain LLMs below, but these can be replaced with a LangChain LLM of your choice.\n",
"Below we use LangFair's `AutoEval` class to conduct a comprehensive bias and fairness assessment for our text generation/summarization use case. To instantiate the `AutoEval` class, provide prompts and LangChain LLM object. \n",
"\n",
"**Important:** When installing community packages for LangChain, please ensure that the package version is compatible with `langchain<0.2.0`. Incompatibility may lead to unexpected errors or issues."
"**Important note: We provide three examples of LangChain LLMs below, but these can be replaced with a LangChain LLM of your choice.**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Use LangChain's InMemoryRateLimiter to avoid rate limit errors. Adjust parameters as necessary.\n",
"rate_limiter = InMemoryRateLimiter(\n",
" requests_per_second=10, \n",
" check_every_n_seconds=10, \n",
" max_bucket_size=1000, \n",
")"
]
},
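The limiter above behaves as a token bucket: `requests_per_second` sets the refill rate, `max_bucket_size` caps bursts, and blocked callers re-check for a token every `check_every_n_seconds`. A small standalone sketch of that behavior; the parameter values here are chosen only to make the throttling visible and are not taken from the notebook:

```python
import time

from langchain_core.rate_limiters import InMemoryRateLimiter

demo_limiter = InMemoryRateLimiter(
    requests_per_second=2,      # token-bucket refill rate
    check_every_n_seconds=0.1,  # polling interval while waiting for a token
    max_bucket_size=2,          # maximum burst size
)

start = time.time()
for i in range(5):
    demo_limiter.acquire()  # blocks until a token is available
    print(f"request {i} released at {time.time() - start:.2f}s")
```

With two requests per second, the five calls should spread over roughly two seconds after the initial burst.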
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###### Example 1: Gemini Pro with VertexAI"
]
},
{
@@ -194,13 +218,52 @@
},
"outputs": [],
"source": [
"# # Run if langchain-google-vertexai not installed (must be compatible with langchain<0.2.0). Note: kernel restart may be required.\n",
"# # Run if langchain-google-vertexai not installed. Note: kernel restart may be required.\n",
"# import sys\n",
"# !{sys.executable} -m pip install langchain-google-vertexai==0.1.3\n",
"# !{sys.executable} -m pip install langchain-google-vertexai\n",
"\n",
"# # Example with Gemini-Pro on VertexAI\n",
"# from langchain_google_vertexai import VertexAI\n",
"# llm = VertexAI(model_name='gemini-pro', temperature=1)"
"# llm = VertexAI(model_name='gemini-pro', temperature=1, rate_limiter=rate_limiter)\n",
"\n",
"# # Define exceptions to suppress\n",
"# suppressed_exceptions = (IndexError, ) # suppresses error when gemini refuses to answer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###### Example 2: Mistral AI"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# # Run if langchain-mistralai not installed. Note: kernel restart may be required.\n",
"# import sys\n",
"# !{sys.executable} -m pip install langchain-mistralai\n",
"\n",
"# os.environ[\"MISTRAL_API_KEY\"] = os.getenv('M_KEY')\n",
"# from langchain_mistralai import ChatMistralAI\n",
"\n",
"# llm = ChatMistralAI(\n",
"# model=\"mistral-large-latest\",\n",
"# temperature=1,\n",
"# rate_limiter=rate_limiter\n",
"# )\n",
"# suppressed_exceptions = None"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###### Example 3: OpenAI on Azure"
]
},
{
@@ -211,11 +274,10 @@
},
"outputs": [],
"source": [
"# Run if langchain-openai not installed (must be compatible with langchain<0.2.0)\n",
"# # Run if langchain-openai not installed\n",
"# import sys\n",
"# !{sys.executable} -m pip install langchain-openai==0.1.6\n",
"# !{sys.executable} -m pip install langchain-openai\n",
"\n",
"# Example with AzureChatOpenAI\n",
"import openai\n",
"from langchain_openai import AzureChatOpenAI\n",
"llm = AzureChatOpenAI(\n",
@@ -224,8 +286,19 @@
" azure_endpoint=API_BASE,\n",
" openai_api_type=API_TYPE,\n",
" openai_api_version=API_VERSION,\n",
" temperature=1 # User to set temperature\n",
")"
" temperature=1, # User to set temperature\n",
" rate_limiter=rate_limiter\n",
")\n",
"\n",
"# Define exceptions to suppress\n",
"suppressed_exceptions = (openai.BadRequestError, ValueError) # this suppresses content filtering errors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instantiate `AutoEval` class"
]
},
{
@@ -236,14 +309,13 @@
},
"outputs": [],
"source": [
"# import torch\n",
"# device = torch.device(\"cuda\") # use if GPU is available\n",
"auto_object = AutoEval(\n",
"# import torch # uncomment if GPU is available\n",
"# device = torch.device(\"cuda\") # uncomment if GPU is available\n",
"ae = AutoEval(\n",
" prompts=prompts, # small sample used as an example; in practice, a bigger sample should be used\n",
" langchain_llm=llm,\n",
" suppressed_exceptions=(openai.BadRequestError, ValueError), # this suppresses content filtering errors \n",
" neutralize_tokens=True\n",
" # toxicity_device=device # use if GPU is available\n",
" suppressed_exceptions=suppressed_exceptions,\n",
" # toxicity_device=device # uncomment if GPU is available\n",
")"
]
},
@@ -267,43 +339,42 @@
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[1mStep 1: Fairness Through Unawareness\u001b[0m\n",
"------------------------------------\n",
"langfair: Number of prompts containing race words: 0\n",
"- langfair: The prompts satisfy fairness through unawareness for race words, the recommended risk assessment only include Toxicity\n",
"langfair: Number of prompts containing gender words: 31\n",
"- langfair: The prompts do not satisfy fairness through unawareness for gender words, the recommended risk assessments include Toxicity, Stereotype, and Counterfactual Discrimination.\n",
"\u001b[1mStep 1: Fairness Through Unawareness Check\u001b[0m\n",
"------------------------------------------\n",
"Number of prompts containing race words: 0\n",
"Number of prompts containing gender words: 31\n",
"Fairness through unawareness is not satisfied. Toxicity, stereotype, and counterfactual fairness assessments will be conducted.\n",
"\n",
"\u001b[1mStep 2: Generate Counterfactual Dataset\u001b[0m\n",
"---------------------------------------\n",
"langfair: gender words found in 31 prompts.\n",
"Gender words found in 31 prompts.\n",
"Generating 25 responses for each gender prompt...\n",
"langfair: Responses successfully generated!\n",
"Responses successfully generated!\n",
"\n",
"\u001b[1mStep 3: Generating Model Responses\u001b[0m\n",
"----------------------------------\n",
"langfair: Generating 25 responses per prompt...\n",
"langfair: Responses successfully generated!\n",
"Generating 25 responses per prompt...\n",
"Responses successfully generated!\n",
"\n",
"\u001b[1mStep 4: Evaluate Toxicity Metrics\u001b[0m\n",
"---------------------------------\n",
"langfair: Computing toxicity scores...\n",
"langfair: Evaluating metrics...\n",
"Computing toxicity scores...\n",
"Evaluating metrics...\n",
"\n",
"\u001b[1mStep 5: Evaluate Stereotype Metrics\u001b[0m\n",
"-----------------------------------\n",
"langfair: Computing stereotype scores...\n",
"langfair: Evaluating metrics...\n",
"Computing stereotype scores...\n",
"Evaluating metrics...\n",
"\n",
"\u001b[1mStep 6: Evaluate Counterfactual Metrics\u001b[0m\n",
"---------------------------------------\n",
"langfair: Evaluating metrics...\n"
"Evaluating metrics...\n"
]
}
],
"source": [
"warnings.filterwarnings(\"ignore\")\n",
"results = await auto_object.evaluate()"
"results = await ae.evaluate()"
]
},
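`evaluate` is a coroutine, so the notebook awaits it on Jupyter's already-running event loop. In a plain Python script, where no loop is running, the equivalent call would be driven with `asyncio`; a minimal sketch, assuming `ae` is constructed as above:

```python
# Equivalent invocation outside Jupyter, where top-level await is unavailable.
import asyncio

results = asyncio.run(ae.evaluate())
```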
{
@@ -325,27 +396,27 @@
"output_type": "stream",
"text": [
"\u001b[1m1. Toxicity Assessment\u001b[0m \n",
"- Toxic Fraction 0.0024 \n",
"- Expected Maximum Toxicity 0.0131 \n",
"- Toxic Fraction 0.0052 \n",
"- Expected Maximum Toxicity 0.0151 \n",
"- Toxicity Probability 0.0100 \n",
"\u001b[1m2. Stereotype Assessment\u001b[0m \n",
"- Stereotype Association 0.3534 \n",
"- Cooccurrence Bias 0.5349 \n",
"- Stereotype Fraction - gender 0.1892 \n",
"- Expected Maximum Stereotype - gender 0.3604 \n",
"- Stereotype Probability - gender 0.5800 \n",
"- Stereotype Association 0.3349 \n",
"- Cooccurrence Bias 0.5677 \n",
"- Stereotype Fraction - gender 0.2092 \n",
"- Expected Maximum Stereotype - gender 0.4009 \n",
"- Stereotype Probability - gender 0.6300 \n",
"\u001b[1m3. Counterfactual Assessment\u001b[0m \n",
" male-female \n",
"- Cosine Similarity 0.8403 \n",
"- RougeL Similarity 0.5065 \n",
"- Bleu Similarity 0.2784 \n",
"- Sentiment Bias 0.0040 \n",
"- Cosine Similarity 0.8675 \n",
"- RougeL Similarity 0.5190 \n",
"- Bleu Similarity 0.2791 \n",
"- Sentiment Bias 0.0022 \n",
"\n"
]
}
],
"source": [
"auto_object.print_results()"
"ae.print_results()"
]
},
{
Expand All @@ -363,7 +434,7 @@
},
"outputs": [],
"source": [
"auto_object.export_results(file_name=\"final_metrics.txt\")"
"ae.export_results(file_name=\"final_metrics.txt\")"
]
},
{
Expand Down Expand Up @@ -463,7 +534,7 @@
}
],
"source": [
"toxicity_data = pd.DataFrame(auto_object.toxicity_data)\n",
"toxicity_data = pd.DataFrame(ae.toxicity_data)\n",
"toxicity_data.sort_values(by='score', ascending=False).head()"
]
},
@@ -564,7 +635,7 @@
}
],
"source": [
"stereotype_data = pd.DataFrame(auto_object.stereotype_data)\n",
"stereotype_data = pd.DataFrame(ae.stereotype_data)\n",
"stereotype_data.sort_values(by='stereotype_score_gender', ascending=False).head()"
]
},
@@ -662,15 +733,15 @@
],
"metadata": {
"environment": {
"kernel": "langfair0.1.2-beta",
"kernel": "langchain",
"name": "workbench-notebooks.m125",
"type": "gcloud",
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "langfair0.1.2-beta",
"display_name": "langchain",
"language": "python",
"name": "langfair0.1.2-beta"
"name": "langchain"
},
"language_info": {
"codemirror_mode": {
Expand All @@ -682,7 +753,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
"version": "3.11.10"
}
},
"nbformat": 4,
