@iloveitaly, @ryanstout and @kduraiswami built a product to convert natural language questions into SQL queries. We built this for a set of early users who were using Snowflake, although we designed the system to work with any SQL/columnar database.
After many user interviews we decided not to move forward with the product (it'll be a business, but not a big one). If you want more information about the user interviews we did, I'm (@iloveitaly) happy to share, just drop me an email.
Despite being a short hackathon-style project, there's some really interesting code in here that we wanted to publish for the community.
- Stack
  - Postgres / Prisma
  - Redis
  - Python
  - Node: Remix + Mantine
  - Snowflake
- DevProd
  - Trunk.io
  - ENV var structure
  - GH Actions
  - bin/
  - Formatters
- Python
  - Schema normalizer
  - Embeddings database
  - SQL Parsing
  - Prisma
  - OpenAI throttler
- Deployment
  - Fly.io
  - Docker images
  - Helper scripts
  - Prisma generation
All local installation is documented in:

```shell
bin/setup.sh
npm run dev
```
Autodeploys are configured with GitHub Actions. Essentially, they run:

```shell
fly deploy --config {python,node}.toml
```

It's better to let GH Actions do the deploy: if you deploy locally, *everything in your local directory* will be included in the build (including unstaged changes!).
Want to force an app to restart?
```shell
flyctl apps restart knolbe-python
```
Attach the python app to the database:
```shell
fly pg attach knolbe-db --config python.toml --database-name knolbe
```

Proxy to the prod DB for inspection:

```shell
fly proxy 5431:5432 -a knolbe-db
```

Want to open up a console?

```shell
fly ssh console -C /bin/bash --config python.toml
```
Upload stuff to prod ([helpful post](https://community.fly.io/t/scp-a-file-into-a-persistent-volume/2729)):

1. Auth your machine: `flyctl ssh issue --agent`
2. Get WireGuard set up and connected
3. Upload the trained model: `scp -r tmp/models [email protected]:/persisted-data/models`
4. Upload the indexes: `scp -r tmp/indexes/ [email protected]:/persisted-data/indexes`
Use `-o StrictHostKeyChecking=no`, or an equivalent `~/.ssh/config` entry, to get around host key errors:

```shell
Host knolbe-python.internal
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
```
Increase the persistent volume size (in GB):

```shell
fly volumes extend vol_okgj5451oe8vy2wz -s 10 --config python.toml
```
- All route logic is very thin: business logic should be pushed into `~/lib/`; view logic should validate params, redirect, hit lib logic, and then pass props to components for rendering.
- All `~/components` should not use "top-level state".
- All `action` and `loader` exports should be thin and pass to a `~/lib/*.server` file (this is the "frontend of the backend").
- `action` and `loader` exports should do all input validation and state retrieval before passing to the `*.server` functions.
Start the server:

```shell
poetry run python python/server.py
```
To run the server without the learned ranker, run:

```shell
ENABLE_LEARNED_RANKER=0 poetry run python python/server.py
```
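
For reference, here's a minimal sketch of how a flag like this is typically read; the helper name is hypothetical and the actual wiring lives in `python/server.py`.

```python
# Hypothetical helper showing one way ENABLE_LEARNED_RANKER could be interpreted;
# the real logic lives in python/server.py.
import os


def learned_ranker_enabled() -> bool:
    # ENABLE_LEARNED_RANKER=0 disables the learned ranker; anything else
    # (including leaving the variable unset) keeps it enabled.
    return os.environ.get("ENABLE_LEARNED_RANKER", "1") != "0"
```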
Ask a question:

```shell
http POST http://127.0.0.1:5000/ 'question:="my question"' data_source_id:=1
```
Get results from a datasource:

```shell
http POST http://0.0.0.0:5000/query data_source_id:=1 'sql:="SELECT * FROM CUSTOMER LIMIT 10"'
```
You can also hit production:

```shell
http POST http://knolbe-python.fly.dev/question 'question:="my question"' data_source_id:=1
```
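
If you'd rather call the API from Python than httpie, here's a minimal sketch using `requests`; it assumes a locally running server on port 5000, the same JSON bodies as the httpie examples above, and JSON responses.

```python
# Minimal sketch of calling the local API with requests; assumes the server is
# running on port 5000 and accepts the same JSON bodies as the httpie examples.
import requests

BASE_URL = "http://127.0.0.1:5000"

# Ask a natural-language question against data source 1
question_response = requests.post(
    f"{BASE_URL}/",
    json={"question": "What are the top orders of all time?", "data_source_id": 1},
)
print(question_response.json())

# Run raw SQL against the same data source
query_response = requests.post(
    f"{BASE_URL}/query",
    json={"data_source_id": 1, "sql": "SELECT * FROM CUSTOMER LIMIT 10"},
)
print(query_response.json())
```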
First, we need to import the table data:

```shell
poetry run nlp import-datasource \
  --data-source-id=1 \
  --table-limit=1000 \
  --column-limit=1000 \
  --column-value-limit=1000 \
  | tee user.log
```
Once everything is imported, we can ask questions:

```shell
poetry run python python/answer.py \
  --data-source-id 1 \
  --question "What are the top orders of all time?"
```
Then train the ranker:

```shell
# Build the dataset and train the models
python -m python.ranker.train
cp /tmp/datasets/ranker/values/validation_23.npz /tmp/datasets/ranker/values/tes1.npz

# or, to just train on the previous dataset
TRAIN_ONLY=1 python -m python.ranker.train
```
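
For orientation, this is roughly the shape of a training entrypoint like the one above: `TRAIN_ONLY` gates the dataset-building step, and the `.npz` files under `/tmp/datasets/ranker/values` hold the training arrays. The array keys, the dataset-building placeholder, and the scikit-learn model are illustrative assumptions, not the repo's actual ranker.

```python
# Illustrative sketch only; the real features and models live in python/ranker/.
import glob
import os

import numpy as np
from sklearn.linear_model import LogisticRegression

DATASET_DIR = "/tmp/datasets/ranker/values"


def build_dataset() -> None:
    # Placeholder: in the real project this regenerates the .npz files.
    pass


def load_dataset() -> tuple[np.ndarray, np.ndarray]:
    # Assumes each .npz file holds a feature matrix "X" and labels "y".
    xs, ys = [], []
    for path in sorted(glob.glob(os.path.join(DATASET_DIR, "*.npz"))):
        data = np.load(path)
        xs.append(data["X"])
        ys.append(data["y"])
    return np.concatenate(xs), np.concatenate(ys)


if __name__ == "__main__":
    if os.environ.get("TRAIN_ONLY") != "1":
        build_dataset()

    X, y = load_dataset()
    model = LogisticRegression().fit(X, y)
    print(f"trained ranker on {len(y)} examples")
```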
As more evaluation questions are added, the few-shot embeddings can be updated (and used to provide better few-shot examples):

```shell
python -m python.prompts.few_shot_embeddings.builder
```
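
For context, the general shape of that technique (not the repo's implementation) is: embed the evaluation questions once, persist the vectors, and pick the nearest neighbors of a new question as few-shot examples. The embedding model name and helper functions below are assumptions for illustration.

```python
# Illustrative sketch of few-shot retrieval via embeddings; not the actual
# builder in python/prompts/few_shot_embeddings/. Model name is an assumption.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


def top_few_shot_examples(
    question: str,
    eval_questions: list[str],
    eval_embeddings: np.ndarray,
    k: int = 3,
) -> list[str]:
    # Cosine similarity between the new question and every evaluation question.
    query = embed([question])[0]
    scores = (eval_embeddings @ query) / (
        np.linalg.norm(eval_embeddings, axis=1) * np.linalg.norm(query)
    )
    return [eval_questions[i] for i in np.argsort(scores)[::-1][:k]]


if __name__ == "__main__":
    questions = [
        "What are the top orders of all time?",
        "How many customers signed up last month?",
    ]
    embeddings = embed(questions)
    print(top_few_shot_examples("Who are our biggest customers?", questions, embeddings))
```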
- API responses do not return raw SQL results; everything comes back as a JSON object.
  - Unsure if this is just the specific API we are using or if it is a limitation of Snowflake.
- You'll get a `Table 'ABANDONED_CHECKOUTS' does not exist or not authorized.` error if you don't specify a schema when connecting (see the sketch after this list).
- The Staff role does not allow us to create API keys. API keys must be created by generating a private app, which requires a specific permission set. More info.
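
Here's a minimal sketch of connecting with an explicit schema so table lookups resolve; the environment variable names are assumptions, not the project's actual configuration.

```python
# Minimal sketch: connect to Snowflake with an explicit schema so queries like
# "SELECT * FROM CUSTOMER" resolve. Env var names are assumptions for the example.
import os

import snowflake.connector
from snowflake.connector import DictCursor

connection = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],  # omit this and table lookups fail
)

cursor = connection.cursor(DictCursor)
cursor.execute("SELECT * FROM CUSTOMER LIMIT 10")
for row in cursor.fetchall():
    print(row)  # rows come back as dicts, which serialize cleanly to JSON
cursor.close()
connection.close()
```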