Merge branch 'main' into langsmith

instructor-ai · Feb 18, 2024 · 8637ba8
2 parents 4354bd8 + 66a8285
commit 8637ba8

Showing 34 changed files with 2,836 additions and 162 deletions.
diff --git a/README.md b/README.md
@@ -1,27 +1,28 @@
-# Welcome to Instructor - Your Gateway to Structured Outputs with OpenAI
+# Instructor
 
-_Pythonic Structured Outputs powered by LLM function calling and tool calling APIs. Designed for simplicity, transparency, and control._
+_Structured outputs powered by LLMs. Designed for simplicity, transparency, and control._
 
 ---
 
-[Star us on Github!](https://www.github.com/jxnl/instructor)
-
 [![Twitter Follow](https://img.shields.io/twitter/follow/jxnlco?style=social)](https://twitter.com/jxnlco)
-[![Downloads](https://img.shields.io/pypi/dm/instructor.svg)](https://pypi.python.org/pypi/instructor)
-[![Documentation](https://img.shields.io/badge/docs-available-brightgreen)](https://jxnl.github.io/instructor)
-[![Coverage Status](https://coveralls.io/repos/github/jxnl/instructor/badge.svg?branch=add-coveralls)](https://coveralls.io/github/jxnl/instructor?branch=add-coveralls)
 [![Discord](https://img.shields.io/discord/1192334452110659664?label=discord)](https://discord.gg/CV8sPM5k5Y)
+[![Downloads](https://img.shields.io/pypi/dm/instructor.svg)](https://pypi.python.org/pypi/instructor)
+
+Instructor stands out for its simplicity, transparency, and user-centric design. We leverage Pydantic to do the heavy lifting, and we've built a simple, easy-to-use API on top of it by helping you manage [validation context](./concepts/reask_validation.md), retries with [Tenacity](./concepts/retrying.md), and streaming [Lists](./concepts/lists.md) and [Partial](./concepts/partial.md) responses.
 
-Dive into the world of Python-based structured extraction, empowered by OpenAI's cutting-edge function calling API. Instructor stands out for its simplicity, transparency, and user-centric design. Whether you're a seasoned developer or just starting out, you'll find Instructor's approach intuitive and its results insightful.
+Check us out in [Typescript](https://instructor-ai.github.io/instructor-js/) and [Elixir](https://github.com/thmsmlr/instructor_ex/).
 
-## Ports to other languages
+Instructor is not limited to the OpenAI API; we support many other backends via patching. Check out more on [patching](./concepts/patching.md).
 
-Check out ports to other languages below:
+1. Wrap OpenAI's SDK
+2. Wrap the create method
 
-- [Typescript / Javascript](https://www.github.com/jxnl/instructor-js)
-- [Elixir](https://github.com/thmsmlr/instructor_ex/)
+Including but not limited to:
 
-If you want to port Instructor to another language, please reach out to us on [Twitter](https://twitter.com/jxnlco) we'd love to help you get started!
+- [Together](./blog/posts/together.md)
+- [Ollama](./blog/posts/ollama.md)
+- [AnyScale](./blog/posts/anyscale.md)
+- [llama-cpp-python](./blog/posts/llama-cpp-python.md)
 
 ## Get Started in Moments
 

diff --git a/docs/concepts/patching.md b/docs/concepts/patching.md
@@ -6,11 +6,7 @@ Instructor enhances client functionality with three new keywords for backwards c
 - `max_retries`: Determines retry attempts for failed `chat.completions.create` validations.
 - `validation_context`: Provides extra context to the validation process.
 
-There are three methods for structured output:
-
-1. **Function Calling**: The primary method. Use this for stability and testing.
-2. **Tool Calling**: Useful in specific scenarios; lacks the reasking feature of OpenAI's tool calling API.
-3. **JSON Mode**: Offers closer adherence to JSON but with more potential validation errors. Suitable for specific non-function calling clients.
+The default mode is `instructor.Mode.TOOLS`, which is the most stable mode and the one recommended for OpenAI clients. The other modes exist for other clients and are not recommended with OpenAI.
 
 ## Tool Calling
 
@@ -30,6 +26,7 @@ Parallel tool calling is also an option but you must set `response_model` to be
 ```python
 import instructor
 from openai import OpenAI
+
 client = instructor.patch(OpenAI(), mode=instructor.Mode.PARALLEL_TOOLS)
 ```
 

diff --git a/docs/concepts/philosophy.md b/docs/concepts/philosophy.md
@@ -4,9 +4,20 @@ The instructor values [simplicity](https://eugeneyan.com/writing/simplicity/) an
 
 > “Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.” — Edsger Dijkstra
 
-## The Bridge to Object-Oriented Programming
+### Proof that it's simple
 
-`instructor` acts as a bridge converting text-based LLM interactions into a familiar object-oriented format. Its integration with Pydantic provides type hints, runtime validation, and robust IDE support; love and supported by many in the Python ecosystem. By treating LLMs as callable functions returning typed objects, instructor makes [language models backwards compatible with code](https://www.youtube.com/watch?v=yj-wSRJwrrc), making them practical for everyday use while being complex enough for advanced applications.
+1. Most users will only need to learn `response_model` and `patch` to get started.
+2. No new prompting language to learn, no new abstractions to learn.
+
+### Proof that it's transparent
+
+1. We write very few prompts, and we don't try to hide the prompts from you.
+2. In the future, we plan to let you configure the two prompts we do write: the reasking and JSON_MODE prompts.
+
+### Proof that it's flexible
+
+1. If you build a system with OpenAI directly, it is easy to incrementally adopt instructor.
+2. Add `response_model` and if you want to revert, just remove it.
 
 ## The zen of `instructor`
 

diff --git a/docs/examples/extract_slides.md b/docs/examples/extract_slides.md
@@ -0,0 +1,113 @@
+# Data extraction from slides
+
+In this guide, we demonstrate how to extract data from slides.
+
+!!! tip "Motivation"
+
+    When we want to translate key information from slides into structured data, simply isolating the text and running extraction might not be enough. Sometimes the important data is in the images on the slides, so we should consider including them in our extraction pipeline.
+
+## Defining the necessary Data Structures
+
+Let's say we want to extract the competitors from various presentations and categorize them according to their respective industries.
+
+Our data model will have `Industry`, which holds a list of `Competitor`s for a specific industry, and `Competition`, which aggregates the competitors across all industries.
+
+```python
+from openai import OpenAI
+from pydantic import BaseModel, Field
+from typing import Optional, List
+
+class Competitor(BaseModel):
+    name: str
+    features: Optional[List[str]]
+
+
+# Define models
+class Industry(BaseModel):
+    """
+    Represents competitors from a specific industry extracted from an image using AI.
+    """
+
+    name: str = Field(
+        description="The name of the industry"
+    )
+    competitor_list: List[Competitor] = Field(
+        description="A list of competitors for this industry"
+    )
+
+class Competition(BaseModel):
+    """
+    This class serves as a structured representation of 
+    competitors and their qualities.
+    """
+
+    industry_list: List[Industry] = Field(
+        description="A list of industries and their competitors"
+    )
+```
+
+## Competitors extraction
+
+To extract competitors from slides we will define a function which will read images from urls and extract the relevant information from them.
+
+```python
+import instructor
+from openai import OpenAI
+
+# Apply the patch to the OpenAI client
+# enables response_model keyword
+client = instructor.patch(
+    OpenAI(), mode=instructor.Mode.MD_JSON
+)
+
+# Define functions
+def read_images(image_urls: List[str]) -> Competition:
+    """
+    Given a list of image URLs, identify the competitors in the images.
+    """
+    return client.chat.completions.create(
+        model="gpt-4-vision-preview",
+        response_model=Competition,
+        max_tokens=2048,
+        temperature=0,
+        messages=[
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "text",
+                        "text": "Identify competitors and generate key features for each competitor.",
+                    },
+                    *[
+                        {"type": "image_url", "image_url": {"url": url}}
+                        for url in image_urls
+                    ],
+                ],
+            }
+        ],
+    )
+```
+
+## Execution
+
+Finally, we will run the previous function with a few sample slides to see the data extractor in action.
+
+As we can see, our model extracted the relevant information for each competitor regardless of how this information was formatted in the original presentations.
+
+```python
+url = [
+    'https://miro.medium.com/v2/resize:fit:1276/0*h1Rsv-fZWzQUyOkt', 
+    'https://earlygame.vc/wp-content/uploads/2020/06/startup-pitch-deck-5.jpg'
+    ]
+model = read_images(url)
+print(model.model_dump_json(indent=2))
+```
+
+```
+industry_list=[
+    Industry(name='Accommodation and Hospitality', competitor_list=[Competitor(name='CouchSurfing', features=['Affordable', 'Online Transaction']), Competitor(name='Craigslist', features=['Affordable', 'Offline Transaction']), Competitor(name='BedandBreakfast.com', features=['Affordable', 'Offline Transaction']), Competitor(name='AirBed&Breakfast', features=['Affordable', 'Online Transaction']), Competitor(name='Hostels.com', features=['Affordable', 'Online Transaction']), Competitor(name='VRBO', features=['Expensive', 'Offline Transaction']), Competitor(name='Rentahome', features=['Expensive', 'Online Transaction']), Competitor(name='Orbitz', features=['Expensive', 'Online Transaction']), Competitor(name='Hotels.com', features=['Expensive', 'Online Transaction'])]),
+
+    Industry(name='Wine E-commerce', competitor_list=[Competitor(name='WineSimple', features=['Ecommerce Retailers', 'True Personalized Selections', 'Brand Name Wine', 'No Inventory Cost', 'Target Mass Market']), Competitor(name='NakedWines', features=['Ecommerce Retailers', 'Target Mass Market']), Competitor(name='Club W', features=['Ecommerce Retailers', 'Brand Name Wine', 'Target Mass Market']), Competitor(name='Tasting Room', features=['Ecommerce Retailers', 'True Personalized Selections', 'Brand Name Wine']), Competitor(name='Drync', features=['Ecommerce Retailers', 'True Personalized Selections', 'No Inventory Cost']), Competitor(name='Hello Vino', features=['Ecommerce Retailers', 'Brand Name Wine', 'Target Mass Market'])])
+]
+```
+
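
Review note on `docs/examples/extract_slides.md`: the `messages` payload that `read_images` sends combines one text part with one `image_url` part per slide. The sketch below builds that same structure with the standard library only, so it can be inspected without calling the API. The helper name `build_image_message` and the `example.com` URLs are hypothetical and not part of this commit.

```python
from typing import Any, Dict, List


def build_image_message(prompt: str, image_urls: List[str]) -> Dict[str, Any]:
    """Build one user message: a text part followed by one image part per URL."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": prompt}]
    content.extend(
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    )
    return {"role": "user", "content": content}


# Placeholder URLs, for illustration only.
message = build_image_message(
    "Identify competitors and generate key features for each competitor.",
    ["https://example.com/slide-1.png", "https://example.com/slide-2.png"],
)
print(len(message["content"]))  # one text part plus two image parts
```

Keeping the payload construction in its own function makes it easy to check the number and order of parts before any tokens are spent on a vision model.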
diff --git a/docs/examples/ollama.md b/docs/examples/ollama.md
@@ -1,12 +1,12 @@
 # Structured Outputs with Ollama
 
-Open-source LLMS are gaining popularity, and the release of Ollama's OpenAI compatibility later it has made it possible to obtain structured outputs using JSON schema.
+Open-source LLMs are gaining popularity, and with the release of Ollama's OpenAI compatibility layer, it has become possible to obtain structured outputs using JSON schema.
 
-By the end of this blog post, you will learn how to effectively utilize instructor with ollama. But before we proceed, let's first explore the concept of patching.
+By the end of this blog post, you will learn how to effectively utilize instructor with Ollama. But before we proceed, let's first explore the concept of patching.
 
 ## Patching
 
-Instructor's patch enhances a openai api it with the following features:
+Instructor's patch enhances an OpenAI API client with the following features:
 
 - `response_model` in `create` calls that returns a pydantic model
 - `max_retries` in `create` calls that retries the call if it fails by using a backoff strategy

diff --git a/docs/hub/index.md b/docs/hub/index.md
@@ -0,0 +1,92 @@
+# Instructor Hub
+
+Welcome to instructor hub. The goal of this project is to provide a set of tutorials and examples to help you get started, and to let you pull in the code you need to start using `instructor`.
+
+Make sure you're using the latest version of `instructor` by running:
+
+```bash
+pip install -U instructor
+```
+
+## Contributing
+
+We welcome contributions to the instructor hub. If you have a tutorial or example you'd like to add, please open a pull request in `docs/hub` and we'll review it.
+
+1. The code must be in a single file.
+2. Make sure that it's referenced in `mkdocs.yml`.
+3. Make sure that the code is unit tested.
+
+### Using pytest_examples
+
+Running the following command runs the tests and updates the examples, which ensures that the examples stay up to date, are linted correctly, and still work.
+Make sure to include an `if __name__ == "__main__":` block in your code and add some asserts to verify that the code works.
+
+```bash
+poetry run pytest tests/openai/docs/test_hub.py --update-examples
+```
+
+## CLI Usage
+
+Instructor hub comes with a command line interface (CLI) that allows you to view and interact with the tutorials and examples and allows you to pull in the code you need to get started with the API.
+
+### List Cookbooks
+
+Run `instructor hub list` to see all the available tutorials and examples. Clicking (doc) opens the full tutorial back on this website.
+
+```bash
+$ instructor hub list --sort
+```
+
+| hub_id | slug                          | title                         | n_downloads |
+| ------ | ----------------------------- | ----------------------------- | ----------- |
+| 2      | multiple_classification (doc) | Multiple Classification Model | 24          |
+| 1      | single_classification (doc)   | Single Classification Model   | 2           |
+
+### Searching for Cookbooks
+
+You can search for a tutorial by running `instructor hub list -q <QUERY>`. This will return a list of tutorials that match the query.
+
+```bash
+$ instructor hub list -q multi
+```
+
+| hub_id | slug                          | title                         | n_downloads |
+| ------ | ----------------------------- | ----------------------------- | ----------- |
+| 2      | multiple_classification (doc) | Multiple Classification Model | 24          |
+
+### Reading a Cookbook
+
+To read a tutorial, you can run `instructor hub pull --id <hub_id> --page` to see the full tutorial in the terminal. You can use `j,k` to scroll up and down, and `q` to quit. You can also run it without `--page` to print the tutorial to the terminal.
+
+```bash
+$ instructor hub pull --id 2 --page
+```
+
+### Pulling in Code
+
+You can pull in the code with `--py --output=<filename>` to save the code to a file, or you can run it without `--output` to print the code to the terminal.
+
+```bash
+$ instructor hub pull --id 2 --py --output=run.py
+$ instructor hub pull --id 2 --py > run.py
+```
+
+You can run the code instantly if you `|` it to `python`:
+
+```bash
+$ instructor hub pull --id 2 --py | python
+```
+
+## Call for Contributions
+
+We're looking for many more hub examples. If you have a tutorial or example you'd like to add, please open a pull request in `docs/hub` and we'll review it.
+
+- [ ] Converting the cookbooks to the new format
+- [ ] Validator examples
+- [ ] Data extraction examples
+- [ ] Streaming examples (Iterable and Partial)
+- [ ] Batch Parsing examples
+- [ ] Open Examples, together, anyscale, ollama, llama-cpp, etc
+- [ ] Query Expansion examples
+- [ ] Batch Data Processing examples
+- [ ] Batch Data Processing examples with Cache
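
Review note on `docs/hub/index.md`: the contributing rules ask for a single file with an `if __name__ == "__main__":` block and asserts. A minimal sketch of that shape follows. The keyword-based `classify` function is a hypothetical stand-in for a real `instructor` call, chosen only so the sketch runs without an API key, and `KEYWORDS` is likewise an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword table; a real hub example would call an LLM instead.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "ACCOUNT": ("account", "login"),
    "BILLING": ("billing", "invoice"),
}


def classify(ticket: str) -> List[str]:
    """Return every label whose keywords appear in the ticket text."""
    text = ticket.lower()
    return [
        label
        for label, words in KEYWORDS.items()
        if any(word in text for word in words)
    ]


if __name__ == "__main__":
    labels = classify("My account is locked and I can't access my billing info.")
    # The assert lets the docs test fail loudly if behaviour drifts.
    assert labels == ["ACCOUNT", "BILLING"]
    print(labels)
    #> ['ACCOUNT', 'BILLING']
```

Because everything lives in one file and the checks sit under the main guard, the same script should work both when piped to `python` and when collected by the docs tests.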
diff --git a/docs/hub/multiple_classification.md b/docs/hub/multiple_classification.md
@@ -0,0 +1,51 @@
+For multi-label classification, we define the allowed labels as a `Literal` type and use a Pydantic model that holds a list of them.
+
+```python
+import openai
+import instructor
+
+from typing import List, Literal
+from pydantic import BaseModel, Field
+
+# Apply the patch to the OpenAI client
+# enables response_model keyword
+client = instructor.patch(openai.OpenAI())
+
+LABELS = Literal["ACCOUNT", "BILLING", "GENERAL_QUERY"]
+
+
+class MultiClassPrediction(BaseModel):
+    labels: List[LABELS] = Field(
+        ...,
+        description="Only select the labels that apply to the support ticket.",
+    )
+
+
+def multi_classify(data: str) -> MultiClassPrediction:
+    return client.chat.completions.create(
+        model="gpt-4-turbo-preview",  # gpt-3.5-turbo fails
+        response_model=MultiClassPrediction,
+        messages=[
+            {
+                "role": "system",
+                "content": "You are a support agent at a tech company. Only select the labels that apply to the support ticket.",
+            },
+            {
+                "role": "user",
+                "content": f"Classify the following support ticket: {data}",
+            },
+        ],
+    )  # type: ignore
+
+
+if __name__ == "__main__":
+    ticket = "My account is locked and I can't access my billing info."
+    prediction = multi_classify(ticket)
+    assert {"ACCOUNT", "BILLING"} == set(prediction.labels)
+    print("input:", ticket)
+    #> input: My account is locked and I can't access my billing info.
+    print("labels:", LABELS)
+    #> labels: typing.Literal['ACCOUNT', 'BILLING', 'GENERAL_QUERY']
+    print("prediction:", prediction)
+    #> prediction: labels=['ACCOUNT', 'BILLING']
+```
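
Review note on `docs/hub/multiple_classification.md`: the `LABELS` literal is what restricts the model's output to known labels. Pydantic enforces this automatically during validation; the standard-library sketch below only illustrates the idea with `typing.get_args`. The helper `validate_labels` is hypothetical and not part of the commit.

```python
from typing import List, Literal, get_args

LABELS = Literal["ACCOUNT", "BILLING", "GENERAL_QUERY"]


def validate_labels(raw: List[str]) -> List[str]:
    """Reject any label that is not one of the values declared in LABELS."""
    allowed = get_args(LABELS)  # tuple of the literal's string values
    unknown = [label for label in raw if label not in allowed]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return raw


print(validate_labels(["ACCOUNT", "BILLING"]))
#> ['ACCOUNT', 'BILLING']

try:
    validate_labels(["SHIPPING"])
except ValueError as err:
    print(err)
    #> unknown labels: ['SHIPPING']
```

With `instructor`, a validation failure of this kind is what the `max_retries` keyword reacts to, so a tight `Literal` gives the retry loop a concrete error to report back to the model.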