Add a setup.py #62

Draft pull request: wants to merge 10 commits into base branch main.
6 changes: 2 additions & 4 deletions Dockerfile
@@ -5,9 +5,7 @@ COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

-COPY ./tests /app/tests
-COPY ./lib /app/lib
-COPY ./src /app/src
+COPY . .

EXPOSE 80
-CMD ["waitress-serve", "--host=0.0.0.0", "--port=80", "--call", "src:create_app"]
+CMD ["waitress-serve", "--host=0.0.0.0", "--port=80", "--call", "aiproxy.app:create_app"]
9 changes: 4 additions & 5 deletions README.md
@@ -119,10 +119,9 @@ Install requirements to the virtual environment with pip:

Export the following environment variables (or add them once to your shell profile)
* `export OPENAI_API_KEY=<your API key>`
-* `export PYTHONPATH=<path to aiproxy root>`

See rubric tester options with:
-* `python lib/assessment/rubric_tester.py --help`
+* `bin/rubric_tester --help`

### example usage

@@ -132,7 +131,7 @@ GPT 3.5 Turbo is the default because a complete test run with that model costs o

A recommended first run is to use default experiment and dataset, limited to 1 lesson:
```
-(.venv) Dave-MBP:~/src/aiproxy (rt-recover-from-bad-llm-responses)$ python ./lib/assessment/rubric_tester.py --lesson-names csd3-2023-L11
+(.venv) Dave-MBP:~/src/aiproxy (rt-recover-from-bad-llm-responses)$ bin/rubric_tester --lesson-names csd3-2023-L11
2024-02-13 20:15:30,127: INFO: Evaluating lesson csd3-2023-L11 for dataset contractor-grades-batch-1-fall-2023 and experiment ai-rubrics-pilot-gpt-3.5-turbo...
```

@@ -150,7 +149,7 @@ The report that gets generated will contain a count of how many errors there wer
In order to rerun only the failed student projects, you can pass the `-c` (`--use-cached`) option:

```commandline
-(.venv) Dave-MBP:~/src/aiproxy (rt-recover-from-bad-llm-responses)$ python ./lib/assessment/rubric_tester.py --lesson-names csd3-2023-L11 -c
+(.venv) Dave-MBP:~/src/aiproxy (rt-recover-from-bad-llm-responses)$ bin/rubric_tester --lesson-names csd3-2023-L11 -c
```

![Screenshot 2024-02-13 at 8 24 31 PM](https://github.com/code-dot-org/aiproxy/assets/8001765/ff560302-94b9-4966-a5d6-7d9a9fa54892)
@@ -163,7 +162,7 @@ After enough reruns, you'll have a complete accuracy measurement for the lesson.

experiments run against GPT 4, GPT 4 Turbo and other pricey models should include report html and cached response data. this allows you to quickly view reports for these datasets either by looking directly at the `output/report*html` files or by regenerating the report against cached data via a command like:
```commandline
-python ./lib/assessment/rubric_tester.py --experiment-name ai-rubrics-pilot-baseline-gpt-4-turbo --use-cached
+bin/rubric_tester --experiment-name ai-rubrics-pilot-baseline-gpt-4-turbo --use-cached
```

#### smaller test runs
File renamed without changes.
6 changes: 3 additions & 3 deletions src/__init__.py → aiproxy/app/__init__.py
@@ -5,9 +5,9 @@
import logging

# Our modules
-from src.test import test_routes
-from src.openai import openai_routes
-from src.assessment import assessment_routes
+from .test import test_routes
+from .openai import openai_routes
+from .assessment import assessment_routes

# Flask
from flask import Flask
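For orientation, here is a minimal sketch of the application factory these package-relative imports feed into. The blueprint names come from the diff above; the body is illustrative only and omits whatever logging and configuration the real `create_app()` performs:

```python
# Illustrative sketch only, not the project's actual factory.
from flask import Flask

from .test import test_routes
from .openai import openai_routes
from .assessment import assessment_routes


def create_app():
    app = Flask(__name__)
    # Register the route blueprints imported package-relatively above.
    app.register_blueprint(test_routes)
    app.register_blueprint(openai_routes)
    app.register_blueprint(assessment_routes)
    return app
```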
9 changes: 5 additions & 4 deletions src/assessment.py → aiproxy/app/assessment.py
@@ -7,12 +7,13 @@
import openai
import json

-from lib.assessment.config import DEFAULT_MODEL
+from aiproxy.assessment.config import DEFAULT_MODEL

# Our assessment code
-from lib.assessment import assess
-from lib.assessment.assess import KeyConceptError
-from lib.assessment.label import InvalidResponseError
+from aiproxy.assessment import assess
+from aiproxy.assessment import assess
+from aiproxy.assessment.assess import KeyConceptError
+from aiproxy.assessment.label import InvalidResponseError

assessment_routes = Blueprint('assessment_routes', __name__)

File renamed without changes.
File renamed without changes.
File renamed without changes.
4 changes: 2 additions & 2 deletions lib/assessment/assess.py → aiproxy/assessment/assess.py
@@ -7,8 +7,8 @@
import logging

# Import our support classes
-from lib.assessment.config import SUPPORTED_MODELS, DEFAULT_MODEL, VALID_LABELS
-from lib.assessment.label import Label
+from .config import SUPPORTED_MODELS, DEFAULT_MODEL, VALID_LABELS
+from .label import Label

class KeyConceptError(Exception):
pass
File renamed without changes.
2 changes: 1 addition & 1 deletion lib/assessment/label.py → aiproxy/assessment/label.py
@@ -9,7 +9,7 @@
from threading import Lock

from typing import List, Dict, Any
-from lib.assessment.config import VALID_LABELS
+from .config import VALID_LABELS

from io import StringIO

2 changes: 1 addition & 1 deletion lib/assessment/report.py → aiproxy/assessment/report.py
@@ -4,7 +4,7 @@
import json
import math
from typing import List, Dict, Any
-from lib.assessment.config import VALID_LABELS
+from .config import VALID_LABELS

class Report:
def _compute_pass_fail_cell_color(self, actual, predicted, passing_labels):
lib/assessment/rubric_tester.py → aiproxy/assessment/rubric_tester.py
@@ -1,28 +1,30 @@
-#!/usr/bin/env python
-
-# Make sure the caller sees a helpful error message if they try to run this script with Python 2
-f"This script requires {'Python 3'}. Please be sure to activate your virtual environment via `source .venv/bin/activate`."
+#!/usr/bin/env python3

import argparse
+import boto3
+import concurrent.futures
import csv
import glob
-import json
-import time
-import os
-from multiprocessing import Pool
-import concurrent.futures
import io
+import json
import logging
+import os
import pprint
-import boto3
import subprocess
import sys
+import time

-from sklearn.metrics import accuracy_score, confusion_matrix
+from multiprocessing import Pool
from collections import defaultdict
+from sklearn.metrics import accuracy_score, confusion_matrix

-from lib.assessment.config import SUPPORTED_MODELS, DEFAULT_MODEL, VALID_LABELS, LESSONS, DEFAULT_DATASET_NAME, DEFAULT_EXPERIMENT_NAME
-from lib.assessment.label import Label, InvalidResponseError
-from lib.assessment.report import Report
+from .config import SUPPORTED_MODELS, DEFAULT_MODEL, VALID_LABELS, LESSONS, DEFAULT_DATASET_NAME, DEFAULT_EXPERIMENT_NAME
+from .label import Label, InvalidResponseError
+from .report import Report
+
+if 'OPEN_AI_KEY' not in os.environ:
+    print("Warning: OPEN_AI_KEY environment variable is not set.", file=sys.stderr)

#globals
prompt_file = 'system_prompt.txt'
2 changes: 1 addition & 1 deletion bin/assessment-test.rb
@@ -1,4 +1,4 @@
-#!/bin/env ruby
+#!/usr/bin/env ruby

require 'net/http'
require 'uri'
7 changes: 7 additions & 0 deletions bin/rubric_tester
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

# Set current working dir to ../
cd "$(dirname "$0")"/..

source .venv/bin/activate
python3 -m aiproxy.assessment.rubric_tester "$@"
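Running the tester via `python3 -m` only invokes anything if `aiproxy/assessment/rubric_tester.py` ends with the usual module-entry guard. The accuracy tests import `main` directly, so a guard along these lines is assumed at the bottom of that file:

```python
# Assumed entry guard; main() is known to exist because the tests import it.
if __name__ == '__main__':
    main()
```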
8 changes: 8 additions & 0 deletions run.py
@@ -0,0 +1,8 @@
#!/usr/bin/env python3

from aiproxy.app import create_app

app = create_app()

if __name__ == '__main__':
    app.run(debug=True)
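run.py provides a bare local entry point alongside the waitress command baked into the Dockerfile; running it starts Flask's debug server on its default port 5000, no container required:

```commandline
(.venv) $ python run.py
```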
14 changes: 14 additions & 0 deletions setup.py
@@ -0,0 +1,14 @@
from setuptools import setup, find_packages

setup(
    name='aiproxy',
    version='0.1',
    packages=find_packages(),
    install_requires=[line.strip() for line in open('requirements.txt')],
    entry_points={
        'console_scripts': [
            'rubric_tester=aiproxy.assessment.rubric_tester:main',
            'aiproxy=aiproxy.app:create_app',
        ]
    },
)
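With setup.py in place, an editable install puts both console scripts on the PATH, so the tester no longer needs a `PYTHONPATH` export or the `bin/` wrapper (a sketch, assuming the virtual environment is active):

```commandline
(.venv) $ pip install -e .
(.venv) $ rubric_tester --help
```

Note that the second entry point maps `aiproxy` to `create_app`, which builds and returns the Flask app rather than starting a server.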
2 changes: 1 addition & 1 deletion tests/accuracy/test_accuracy.py
@@ -3,7 +3,7 @@

from unittest import mock

-from lib.assessment.rubric_tester import (
+from aiproxy.assessment.rubric_tester import (
main,
)

2 changes: 1 addition & 1 deletion tests/conftest.py
@@ -4,7 +4,7 @@

import pytest

-from src import create_app
+from aiproxy.app import create_app

import contextlib
import os
30 changes: 15 additions & 15 deletions tests/routes/test_assessment_routes.py
@@ -45,7 +45,7 @@ def test_should_return_400_when_no_rubric(self, client, randomstring):
assert response.status_code == 400

def test_should_return_400_on_openai_error(self, mocker, client, randomstring):
-mocker.patch('lib.assessment.assess.label').side_effect = openai.error.InvalidRequestError('', '')
+mocker.patch('aiproxy.assessment.assess.label').side_effect = openai.error.InvalidRequestError('', '')
response = client.post('/assessment', data={
"code": randomstring(10),
"prompt": randomstring(10),
@@ -88,7 +88,7 @@ def test_should_return_400_when_passing_not_a_number_to_temperature(self, client
assert response.status_code == 400

def test_should_return_400_when_the_label_function_does_not_return_data(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
label_mock.return_value = []

response = client.post('/assessment', data={
@@ -106,7 +106,7 @@ def test_should_return_400_when_the_label_function_does_not_return_data(self, mo
assert response.status_code == 400

def test_should_return_400_when_the_label_function_does_not_return_the_right_structure(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
label_mock.return_value = {
'metadata': {},
'data': {}
@@ -127,7 +127,7 @@ def test_should_return_400_when_the_label_function_does_not_return_the_right_str
assert response.status_code == 400

def test_should_pass_arguments_to_label_function(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
data = {
"code": randomstring(10),
"prompt": randomstring(10),
@@ -155,7 +155,7 @@ def test_should_pass_arguments_to_label_function(self, mocker, client, randomstr
)

def test_should_return_the_result_from_label_function_when_valid(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
label_mock.return_value = {
'metadata': {},
'data': [
@@ -190,7 +190,7 @@ class TestPostTestAssessment:
"""

def test_should_return_400_on_openai_error(self, mocker, client, randomstring):
-mocker.patch('lib.assessment.assess.label').side_effect = openai.error.InvalidRequestError('', '')
+mocker.patch('aiproxy.assessment.assess.label').side_effect = openai.error.InvalidRequestError('', '')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
response = client.post('/test/assessment', data={
@@ -236,7 +236,7 @@ def test_should_return_400_when_passing_not_a_number_to_temperature(self, mocker
assert response.status_code == 400

def test_should_return_400_when_the_label_function_does_not_return_data(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
label_mock.return_value = []
@@ -255,7 +255,7 @@ def test_should_return_400_when_the_label_function_does_not_return_data(self, mo
assert response.status_code == 400

def test_should_return_400_when_the_label_function_does_not_return_the_right_structure(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
label_mock.return_value = {
@@ -277,7 +277,7 @@ def test_should_return_400_when_the_label_function_does_not_return_the_right_str
assert response.status_code == 400

def test_should_pass_arguments_to_label_function(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
data = {
@@ -305,7 +305,7 @@ def test_should_pass_arguments_to_label_function(self, mocker, client, randomstr
)

def test_should_return_the_result_from_label_function_when_valid(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
label_mock.return_value = {
@@ -341,7 +341,7 @@ class TestPostBlankAssessment:
"""

def test_should_return_400_on_openai_error(self, mocker, client, randomstring):
-mocker.patch('lib.assessment.assess.label').side_effect = openai.error.InvalidRequestError('', '')
+mocker.patch('aiproxy.assessment.assess.label').side_effect = openai.error.InvalidRequestError('', '')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
response = client.post('/test/assessment/blank', data={
@@ -384,7 +384,7 @@ def test_should_return_400_when_passing_not_a_number_to_temperature(self, mocker
assert response.status_code == 400

def test_should_return_400_when_the_label_function_does_not_return_data(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
label_mock.return_value = []
@@ -402,7 +402,7 @@ def test_should_return_400_when_the_label_function_does_not_return_data(self, mo
assert response.status_code == 400

def test_should_return_400_when_the_label_function_does_not_return_the_right_structure(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
label_mock.return_value = {
@@ -423,7 +423,7 @@ def test_should_return_400_when_the_label_function_does_not_return_the_right_str
assert response.status_code == 400

def test_should_pass_arguments_including_blank_code_to_label_function(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
data = {
@@ -450,7 +450,7 @@ def test_should_pass_arguments_including_blank_code_to_label_function(self, mock
)

def test_should_return_the_result_from_label_function_when_valid(self, mocker, client, randomstring):
-label_mock = mocker.patch('lib.assessment.assess.label')
+label_mock = mocker.patch('aiproxy.assessment.assess.label')
mock_open = mocker.mock_open(read_data='file data')
mock_file = mocker.patch('builtins.open', mock_open)
label_mock.return_value = {