Askem retrieval-augmented generation prototype
Repo: https://github.com/UW-Madison-DSI/ask-xDD
Demo: https://xdddev.chtc.io/ask-xdd-demo
API Base URL: http://cosmos0002.chtc.wisc.edu:4502/
The end users of our system are ASKEM performers, who access it through a REST API. You can also visit our demo to see how this system can power a traceable COVID-19 search engine.
- Enhance performance tailored to Hackathon scenarios
- Integrate ReAct for better handling of complex queries
- Implement hybrid search to refine keyword query results
The retriever uses an embedding-based search engine, specifically Dense Passage Retriever (DPR), to query relevant documents from the XDD database. Currently, it returns paragraphs as documents. Future updates may include figures, tables, and equations.
The API accepts POST requests and requires an API key. ASKEM performers can obtain an API key by contacting me.
Base URL: http://cosmos0002.chtc.wisc.edu:4502
There are 3 endpoints available:
- vector: Basic DPR vector search (not recommended).
- hybrid: Combines Elasticsearch pre-filtering with DPR vector search (recommended, better performance).
- react: Builds on the hybrid approach, integrating the ReAct agent for "reasoning" (via gpt-4 by default) and subsequent querying (via the hybrid endpoint by default) to generate better answers (experimental, slow, highest performance).
Both the vector and hybrid endpoints use a similar format for request and response data.
import requests

APIKEY = "insert_api_key_here"
BASE_URL = "http://cosmos0002.chtc.wisc.edu:4502"
ENDPOINT = f"{BASE_URL}/hybrid"

headers = {"Content-Type": "application/json", "Api-Key": APIKEY}
data = {
    "topic": "covid",
    "question": "What is SIDARTHE model?",
    "top_k": 3,
}

response = requests.post(ENDPOINT, headers=headers, json=data)
response.json()
{
    "question": str,
    "top_k": Optional[int] = 5, # Number of documents to return
    "distance": Optional[float] = None, # Max cosine distance between question and document
    "topic": Optional[str] = None, # Filter by topic, only "covid" is available now
    "doc_type": Optional[str] = None, # Filter by document type, only "paragraph" is available now
    "preprocessor_id": Optional[str] = None, # Filter by preprocessor_id, for developer use only
    "article_terms": Optional[List[str]] = None, # Obsolete, do not use
    "paragraph_terms": Optional[List[str]] = None, # Filter by capitalized terms (any word that has more than one capital letter) in the paragraph
    "paper_ids": Optional[List[str]] = None, # Filter by XDD paper ids
    "move_to": Optional[str] = None, # Move the answer to better match the context of the given string, like `mathematical equation`.
    "move_to_weight": Optional[float] = 0, # Weight of `move_to`: adjusts its influence on the original answer, with a range from 0 to 1. Higher values mean stronger augmentation.
    "move_away_from": Optional[str] = None, # Move the answer away from irrelevant topics, like `general commentary`.
    "move_away_from_weight": Optional[float] = 0, # Weight of `move_away_from`: adjusts its influence on the original answer, with a range from 0 to 1. Higher values mean stronger augmentation.
    "screening_top_k": Optional[int] = 100, # `hybrid` endpoint only. Number of documents to return from the Elasticsearch pre-filtering step.
}
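As a rough illustration of how the optional fields above might be combined, the sketch below assembles a request body for the `hybrid` endpoint. The helper name `build_hybrid_payload` is hypothetical and not part of the repo; only the field names come from the schema. The range check on `move_to_weight` mirrors the documented 0 to 1 range, but whether the server enforces it is an assumption.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_payload(
    question: str,
    top_k: int = 5,
    topic: Optional[str] = None,
    paragraph_terms: Optional[List[str]] = None,
    move_to: Optional[str] = None,
    move_to_weight: float = 0.0,
    screening_top_k: int = 100,
) -> Dict[str, Any]:
    """Hypothetical helper: build a request body from the documented fields."""
    # The schema documents a 0 to 1 range; checking it client-side is our choice.
    if not 0.0 <= move_to_weight <= 1.0:
        raise ValueError("move_to_weight must be between 0 and 1")

    payload: Dict[str, Any] = {
        "question": question,
        "top_k": top_k,
        "screening_top_k": screening_top_k,
    }
    # Only include optional filters that were actually set.
    if topic is not None:
        payload["topic"] = topic
    if paragraph_terms:
        payload["paragraph_terms"] = paragraph_terms
    if move_to is not None:
        payload["move_to"] = move_to
        payload["move_to_weight"] = move_to_weight
    return payload


payload = build_hybrid_payload(
    "What is SIDARTHE model?",
    top_k=3,
    topic="covid",
    paragraph_terms=["SIDARTHE"],
    move_to="mathematical equation",
    move_to_weight=0.5,
)
print(payload)
```

The resulting dictionary can be passed as `json=payload` to `requests.post`, in the same way as the `data` dictionary in the example request.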
[
    {
        "paper_id": str, # XDD paper id
        "doc_type": str, # only "paragraph" for now
        "text": str, # text content
        "distance": float, # distance to question
        "cosmos_object_id": str, # only available for doc_type="figure"
        "article_terms": List[str], # Obsolete, do not use
        "paragraph_terms": List[str], # Capitalized terms in the paragraph
    },
    ...
]
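One possible way to work with this response is to load each item into a small typed record and order the results by distance. This is a minimal sketch, not code from the repo: the `Document` dataclass and `parse_documents` function are hypothetical names, and the sample items use made-up placeholder values rather than real API output.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Hypothetical client-side record mirroring the response fields."""
    paper_id: str
    doc_type: str
    text: str
    distance: float
    cosmos_object_id: Optional[str] = None  # documented as figure-only
    paragraph_terms: List[str] = field(default_factory=list)


def parse_documents(raw: List[Dict[str, Any]]) -> List[Document]:
    """Convert response items to Document records, closest match first."""
    docs = [
        Document(
            paper_id=item["paper_id"],
            doc_type=item["doc_type"],
            text=item["text"],
            distance=float(item["distance"]),
            cosmos_object_id=item.get("cosmos_object_id"),
            paragraph_terms=list(item.get("paragraph_terms") or []),
        )
        for item in raw
    ]
    # A smaller distance means the document is closer to the question.
    return sorted(docs, key=lambda d: d.distance)


# Placeholder items shaped like the schema above (not real API output).
sample = [
    {"paper_id": "paper-b", "doc_type": "paragraph", "text": "second",
     "distance": 0.31, "paragraph_terms": ["SIDARTHE"]},
    {"paper_id": "paper-a", "doc_type": "paragraph", "text": "first",
     "distance": 0.12, "paragraph_terms": []},
]
docs = parse_documents(sample)
for doc in docs:
    print(doc.paper_id, doc.distance)
```

In real use, `sample` would be replaced by `response.json()` from a `vector` or `hybrid` request.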
import requests

APIKEY = "insert_api_key_here"
BASE_URL = "http://cosmos0002.chtc.wisc.edu:4502"
ENDPOINT = f"{BASE_URL}/react"

headers = {"Content-Type": "application/json", "Api-Key": APIKEY}
data = {
    "topic": "covid",
    "question": "What is SIDARTHE model?",
    "top_k": 3,
}

response = requests.post(ENDPOINT, headers=headers, json=data)
response.json()
{
    "question": str,
    ..., # Same as `hybrid` endpoint, see the request schema above
    "model_name": Optional[str] = "gpt-4", # OpenAI LLM model name
}
{
    "answer": str, # Final answer to the question
    "used_docs": list[Document], # Relevant documents used to generate the answer, with the same schema as the response of `hybrid` endpoint
}
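Since the `react` response pairs an answer with the documents behind it, a client may want to show both together. The sketch below is one way to do that, assuming only the `answer` and `used_docs` fields described above. The function name `summarize_react_result` and the example values are hypothetical placeholders.

```python
from typing import Any, Dict, List


def summarize_react_result(result: Dict[str, Any]) -> str:
    """Return the answer followed by the distinct paper ids that support it."""
    paper_ids: List[str] = []
    for doc in result.get("used_docs", []):
        paper_id = doc["paper_id"]
        # Keep first-seen order and skip repeated papers.
        if paper_id not in paper_ids:
            paper_ids.append(paper_id)
    sources = ", ".join(paper_ids) if paper_ids else "none"
    return f"{result['answer']}\nSources: {sources}"


# Placeholder result shaped like the schema above (not real API output).
example = {
    "answer": "placeholder answer",
    "used_docs": [
        {"paper_id": "paper-a", "doc_type": "paragraph", "text": "...", "distance": 0.1},
        {"paper_id": "paper-a", "doc_type": "paragraph", "text": "...", "distance": 0.2},
        {"paper_id": "paper-b", "doc_type": "paragraph", "text": "...", "distance": 0.3},
    ],
}
summary = summarize_react_result(example)
print(summary)
```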
For developers
- Make a .env file in the project root directory with these variables
  see example: .env.example
  see shared dotenv file for the actual values
- Run launch test
  bash ./scripts/launch_test.sh
- Ingest documents
  Put all text files in a folder, with file format as <ingest_dir>/<paper-id>.txt, then run this:
  python askem/ingest_docs.py --input-dir "data/debug_data/paragraph_test" --topic "covid-19" --doc-type "paragraph" --weaviate-url "url_to_weaviate"
- Ingest figures
  Put all text files in a folder, with file format as <ingest_dir>/<paper-id>.<cosmos_object_id>.txt, then run this:
  python askem/deploy.py --input-dir "data/debug_data/figure_test" --topic "covid-19" --doc-type "figure" --weaviate-url "url_to_weaviate"
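Before running an ingest, it can help to check that file names follow the two layouts described above. The sketch below is a hypothetical pre-check, not how the ingest scripts themselves parse names. It assumes that paper ids and cosmos object ids contain no dots.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_ingest_filename(path: str, doc_type: str) -> Tuple[str, Optional[str]]:
    """Split an ingest file name into (paper_id, cosmos_object_id or None).

    Assumes neither id contains a dot (an assumption, not a repo guarantee).
    """
    name = Path(path).name
    if not name.endswith(".txt"):
        raise ValueError(f"Expected a .txt file, got {name!r}")
    stem = name[: -len(".txt")]

    if doc_type == "paragraph":
        # Layout: <ingest_dir>/<paper-id>.txt
        if not stem or "." in stem:
            raise ValueError(f"Unexpected paragraph file name: {name!r}")
        return stem, None

    if doc_type == "figure":
        # Layout: <ingest_dir>/<paper-id>.<cosmos_object_id>.txt
        paper_id, sep, object_id = stem.partition(".")
        if not sep or not paper_id or not object_id or "." in object_id:
            raise ValueError(f"Unexpected figure file name: {name!r}")
        return paper_id, object_id

    raise ValueError(f"Unsupported doc_type: {doc_type!r}")


print(parse_ingest_filename("data/debug_data/paragraph_test/abc123.txt", "paragraph"))
print(parse_ingest_filename("data/debug_data/figure_test/abc123.42.txt", "figure"))
```

The ids `abc123` and `42` are made-up placeholders used only to show the two layouts.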
- In retriever data models, add new topic to the Topic enum class
- Ingest data with the new topic
- Make sure xDD articles API has the new topic dataset available; they use the same name without any translation layer.
- Add new preset demo questions to askem/demo/present_questions, e.g.: climate change preset questions
- Add Topic in demo.
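The first step in the list above could look roughly like the sketch below. This is an illustration, not the repo's actual data model: the member names and the `"climate-change"` value are hypothetical, and the real value must match the xDD dataset name exactly, since there is no translation layer.

```python
from enum import Enum


class Topic(str, Enum):
    """Hypothetical sketch of a topic enum; values are assumptions."""
    COVID = "covid"  # the only topic the API documents as available
    CLIMATE_CHANGE = "climate-change"  # hypothetical new topic and value


def parse_topic(value: str) -> Topic:
    """Look up a topic by its string value, with a readable error message."""
    try:
        return Topic(value)
    except ValueError:
        allowed = ", ".join(t.value for t in Topic)
        raise ValueError(
            f"Unknown topic {value!r}; expected one of: {allowed}"
        ) from None


print(parse_topic("covid"))
```

Because the enum subclasses `str`, its members compare equal to plain strings, so the same value can be reused when ingesting data and when querying the xDD articles API.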