AIDB

Analyze unstructured data blazingly fast with machine learning. Connect your own ML models to your own data sources and query away!

Quick Start

To start using AIDB, all you need to do is install the requirements, specify a configuration, and query! Setting up the environment is as simple as:

git clone https://github.com/ddkang/aidb.git
cd aidb
pip install -r requirements.txt

# Optional if you'd like to run the examples below
gdown https://drive.google.com/uc?id=1SyHRaJNvVa7V08mw-4_Vqj7tCynRRA3x
unzip data.zip -d tests/

Text Example (in CSV)

We've set up an example of analyzing product reviews with HuggingFace. First, set your HuggingFace API key. After this, all you need to do is run:

python launch.py --config=config.sentiment --setup-blob-table --setup-output-table

As an example query, you can run

SELECT AVG(score)
FROM sentiment
WHERE label = '5 stars'
ERROR_TARGET 10%
CONFIDENCE 95%;

You can see the mappings here. We use the HuggingFace API to generate sentiments from the reviews.

Image Example (local directory)

We've also set up another example that detects whether user-generated content is adult content, so that it can be filtered. To run this example, all you need to do is run:

python launch.py --config=config.nsfw_detect --setup-blob-table --setup-output-table

As an example query, you can run

SELECT *
FROM nsfw
WHERE racy LIKE 'POSSIBLE';

You can see the mappings here. We use the Google Vision API to generate the safety labels.

Key Features

AIDB focuses on keeping cost down and interoperability high.

We reduce costs with our optimizations:

  • First-class support for approximate queries, reducing the cost of aggregations by up to 350x.
  • Caching, which speeds up multiple queries over the same data.
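
As a rough illustration of why caching helps, here is a minimal sketch. It is not AIDB's actual cache: `run_model`, its labels, and the blob IDs are hypothetical stand-ins for an expensive ML call. Memoizing model outputs per input means a second query over the same data triggers no new inference calls.

```python
from functools import lru_cache

# Counts how many times the (hypothetical) expensive model is actually invoked.
call_count = 0


@lru_cache(maxsize=None)
def run_model(blob_id: int) -> str:
    """Hypothetical stand-in for an expensive ML inference call."""
    global call_count
    call_count += 1
    return "POSSIBLE" if blob_id % 2 == 0 else "UNLIKELY"


def query(blob_ids):
    """Return the model label for each blob, reusing cached results."""
    return [run_model(b) for b in blob_ids]


first = query([1, 2, 3])   # three model invocations
second = query([1, 2, 3])  # served entirely from the cache
```

In this sketch the second call returns the same labels without invoking `run_model` again, which is the effect caching is meant to have on repeated queries.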

We keep interoperability high by allowing you to bring your own data source, ML models, and vector databases!

Approximate Querying

One key feature of AIDB is first-class support for approximate queries. Currently, we support approximate AVG, COUNT, and SUM. We don't yet support GROUP BY or JOIN for approximate aggregations, but they're on our roadmap. Please reach out if you'd like us to support your queries!

In order to execute an approximate aggregation query, simply append ERROR_TARGET <error percent>% CONFIDENCE <confidence>% to your normal aggregation. As a full example, you can compute an approximate count by doing:

SELECT COUNT(xmin)
FROM objects
ERROR_TARGET 5%
CONFIDENCE 95%;
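
If you build queries programmatically, appending the two clauses can be sketched as below. The helper `approximate` is hypothetical and not part of AIDB; it only does string formatting, following the ERROR_TARGET and CONFIDENCE syntax shown above.

```python
def approximate(query: str, error_target: float, confidence: float) -> str:
    """Append ERROR_TARGET and CONFIDENCE clauses to an aggregation query.

    Hypothetical helper: both arguments are percentages, e.g. 5 and 95.
    """
    if not 0 < error_target < 100:
        raise ValueError("error_target must be a percentage between 0 and 100")
    if not 0 < confidence < 100:
        raise ValueError("confidence must be a percentage between 0 and 100")
    # Drop surrounding whitespace and any trailing semicolon before appending.
    base = query.strip().rstrip(";").rstrip()
    return f"{base}\nERROR_TARGET {error_target:g}%\nCONFIDENCE {confidence:g}%;"


sql = approximate("SELECT COUNT(xmin)\nFROM objects;", 5, 95)
print(sql)
```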

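The ERROR_TARGET specifies the maximum percent error relative to running the query exactly, and CONFIDENCE specifies how often that bound holds. For example, with ERROR_TARGET 5% and CONFIDENCE 95%, if the true answer is 100, you will get answers between 95 and 105 (95% of the time).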
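
The arithmetic behind that range can be sketched as follows. The function names are hypothetical and only illustrate the relative-error bound for an ERROR_TARGET of 5%; they say nothing about how AIDB computes its estimates.

```python
from typing import Tuple


def error_bounds(true_value: float, error_target_pct: float) -> Tuple[float, float]:
    """Return the (low, high) range allowed by a relative error target."""
    margin = abs(true_value) * error_target_pct / 100
    return true_value - margin, true_value + margin


def within_target(estimate: float, true_value: float, error_target_pct: float) -> bool:
    """Check whether an estimate falls inside the allowed range."""
    low, high = error_bounds(true_value, error_target_pct)
    return low <= estimate <= high


low, high = error_bounds(100, 5)
print(low, high)
```

With CONFIDENCE 95%, an estimate is expected to pass a check like `within_target` about 95% of the time.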

Useful Links

Contribute

We have many improvements we'd like to implement. Please help us! For the time being, please email us if you'd like to contribute.

Contact Us

Need help setting up AIDB for your specific dataset, or want a new feature? Please fill out this form.