Commit: results
slobentanzer committed Dec 28, 2023
1 parent 7faec31 commit ae47d17
Showing 3 changed files with 47 additions and 3 deletions.
2 changes: 1 addition & 1 deletion build/build.sh
@@ -18,7 +18,7 @@ DOCKER_RUNNING="$(docker info &> /dev/null && echo "true" || (true && echo "fals
# Set option defaults
CI="${CI:-false}"
BUILD_PDF="${BUILD_PDF:-true}"
BUILD_DOCX="${BUILD_DOCX:-false}"
BUILD_DOCX="${BUILD_DOCX:-true}"
BUILD_LATEX="${BUILD_LATEX:-false}"
SPELLCHECK="${SPELLCHECK:-false}"
MANUBOT_USE_DOCKER="${MANUBOT_USE_DOCKER:-$DOCKER_RUNNING}"
39 changes: 37 additions & 2 deletions content/20.results.md
@@ -13,7 +13,7 @@ Functionalities include:

- **knowledge graph querying** with automated synchronisation of BioChatter with any KG created in the BioCypher framework [@biocypher]

-- **retrieval augmented generation** using vector database embeddings of user-provided literature
+- **retrieval augmented generation** (RAG) using vector database embeddings of user-provided literature

- **model chaining** to orchestrate multiple LLMs and other models in a single conversation using the LangChain framework [@langchain]

@@ -43,4 +43,39 @@ In addition, to facilitate the scaling of prompt engineering, we integrate this

### Benchmarking

-To facilitate the reproducible evaluation of LLMs, we implement a benchmarking framework that allows the comparison of models, prompts, and other components of the pipeline.
+To facilitate the reproducible evaluation of LLMs, we implement a benchmarking framework that allows the comparison of models, prompt sets, and other components of the pipeline.
Built on the generic Pytest framework, it allows the automated evaluation of a matrix of all possible combinations of components.
The results are stored and displayed on our website for easy comparison, and the benchmark is updated upon the release of new models.
We create a bespoke biomedical benchmark for multiple reasons:
Firstly, the biomedical domain has its own tasks and requirements, and creating a bespoke benchmark allows us to be more precise in the evaluation of components.
Secondly, we aim to create benchmark datasets that are complementary to existing general-purpose benchmarks and leaderboards for LLMs [@{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}].
Thirdly, we aim to prevent leakage of the benchmark data into the training data of the models, a known issue with general-purpose benchmarks that is also referred to as memorisation [].
To achieve this goal, we implemented an encrypted pipeline that contains the benchmark datasets and is only accessible to the workflow that executes the benchmark (see Methods).
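
For illustration, a minimal sketch of such a matrix evaluation in Pytest could look as follows; the model names, prompt sets, test case, and the `query_model` helper are hypothetical placeholders, not the actual benchmark code:

```python
# A minimal sketch of a matrix benchmark in Pytest; model names, prompt sets,
# the test case, and `query_model` are hypothetical placeholders.
import itertools

import pytest

MODELS = ["model-a", "model-b"]                # hypothetical models
PROMPT_SETS = ["baseline", "kg-aware"]         # hypothetical prompt sets
CASES = [("Which gene encodes p53?", "TP53")]  # hypothetical test case


def query_model(model: str, prompt_set: str, question: str) -> str:
    """Placeholder for the actual LLM call; returns a canned answer here."""
    return "TP53"


@pytest.mark.parametrize(
    "model,prompt_set,case",
    list(itertools.product(MODELS, PROMPT_SETS, CASES)),
)
def test_benchmark_matrix(model, prompt_set, case):
    question, expected = case
    answer = query_model(model, prompt_set, question)
    assert expected in answer  # scores are aggregated across the whole matrix
```

Parametrising over the Cartesian product is what makes the benchmark extensible: adding a model or prompt set to the respective list automatically adds a full row or column of test results.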

**Figure?**

### Knowledge Graphs

Knowledge graphs (KGs) are a powerful tool to represent and query knowledge in a structured manner.
With BioCypher [@biocypher], we have developed a framework to create KGs from biomedical data in a user-friendly manner.
BioChatter is an extension of the BioCypher ecosystem, elevating its user-friendliness further by allowing natural language interactions with the data.
We use information generated in the build process of BioCypher KGs to tune BioChatter's understanding of the data structures and contents, thereby increasing the efficiency of LLM-based KG querying (see Methods).
In addition, the ability to connect to any KG allows the integration of prior knowledge into the LLM's reasoning, which can be used to ground the model's responses in the context of the KG via in-context learning / RAG (see below).
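
To sketch the schema-aware querying idea, assuming a Neo4j backend: the simplified schema hint and the `generate_cypher` helper (standing in for the LLM step) below are hypothetical placeholders, not BioChatter's actual API.

```python
# A sketch of schema-aware KG querying against a Neo4j backend; the schema
# hint and `generate_cypher` are hypothetical placeholders for information
# produced by the BioCypher build and the LLM translation step.
from neo4j import GraphDatabase

SCHEMA_HINT = "Nodes: Gene, Disease; Edges: (Gene)-[:ASSOCIATED_WITH]->(Disease)"


def generate_cypher(question: str, schema: str) -> str:
    """Placeholder: an LLM would translate the question into Cypher,
    constrained by the schema produced during the BioCypher build."""
    return (
        "MATCH (g:Gene)-[:ASSOCIATED_WITH]->(d:Disease {name: 'melanoma'}) "
        "RETURN g.symbol AS symbol"
    )


driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    cypher = generate_cypher("Which genes are associated with melanoma?", SCHEMA_HINT)
    for record in session.run(cypher):
        print(record["symbol"])
```

Constraining generation with the schema is the key step: without it, the model has to guess node labels and relationship types, which is a common source of invalid queries.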

### Retrieval Augmented Generation

LLMs have a well-known tendency to confabulate (also referred to as hallucinations), i.e., to generate responses that are not grounded in reality.
This is a major issue for biomedical applications, where the consequences of incorrect information can be severe.
One popular way of addressing this issue is to apply "in-context learning", more recently also referred to as "retrieval augmented generation" (RAG).
While this can be done by processing structured knowledge, for instance from KGs, it is often more efficient to use a semantic search engine to retrieve relevant information from unstructured data sources such as literature.
To this end, we allow the management and integration of vector databases in the BioChatter framework.
The user is able to connect to a vector database, embed an arbitrary number of documents, and then use semantic search to improve the model prompts by adding text fragments relevant to the given question (see Methods).
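
A self-contained sketch of this retrieval step follows; the character-histogram `embed` function is a toy stand-in for a real embedding model and vector database, used here only to make the example runnable.

```python
# A self-contained sketch of the RAG step: embed documents, retrieve the most
# similar fragment, and prepend it to the prompt. The toy `embed` function is
# a placeholder for a real embedding model backed by a vector database.
import numpy as np


def embed(text: str) -> np.ndarray:
    """Toy embedding: normalised character histogram (placeholder only)."""
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


documents = [
    "TP53 is a tumour suppressor gene frequently mutated in cancer.",
    "The Krebs cycle takes place in the mitochondrial matrix.",
]
index = [(doc, embed(doc)) for doc in documents]

question = "Which gene is a tumour suppressor?"
q_vec = embed(question)
context = max(index, key=lambda pair: float(pair[1] @ q_vec))[0]

prompt = f"Context: {context}\n\nQuestion: {question}"
print(prompt)  # the augmented prompt is then sent to the LLM
```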

### Model Chaining and Fact Checking

LLMs can not only seamlessly interact with human users, but also with other LLMs and many other types of models.
They understand API calls and can therefore theoretically orchestrate complex multi-step tasks [].
However, this is not trivial to implement, and the current generation of LLMs is not yet suited for unsupervised reasoning.
We aim to improve the stability of model chaining in biomedical applications by developing bespoke approaches for common biomedical tasks, such as interpretation and design of experiments, evaluating literature, and exploring web resources.
While we try to reuse existing functionality of model chaining frameworks such as LangChain [@langchain], we also develop bespoke solutions where necessary to provide stability for the given application.
As an example, we implement a fact-checking module that uses a second LLM to evaluate the factual correctness of the primary LLM's responses continuously during the conversation (see Methods).
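
The pattern can be sketched as follows; `call_llm` is a hypothetical placeholder for any chat-completion client and not BioChatter's interface:

```python
# A sketch of the fact-checking pattern: a second model judges the primary
# model's answer during the conversation. `call_llm` is a hypothetical
# placeholder returning canned text so the example runs standalone.
def call_llm(system: str, message: str) -> str:
    """Placeholder for a chat-completion call."""
    if "fact checker" in system:
        return "CORRECT: TP53 is a well-established tumour suppressor gene."
    return "Yes, TP53 encodes the tumour suppressor protein p53."


def answer_with_check(question: str) -> tuple[str, str]:
    # Primary model answers the user's question.
    answer = call_llm("You are a biomedical assistant.", question)
    # Secondary model evaluates the answer's factual correctness.
    verdict = call_llm(
        "You are a fact checker; reply CORRECT or INCORRECT with a reason.",
        f"Question: {question}\nAnswer: {answer}",
    )
    return answer, verdict


answer, verdict = answer_with_check("Does TP53 encode a tumour suppressor?")
print(answer)
print(verdict)
```

Running the check on every turn trades additional model calls for a continuous signal on response reliability, which the application can surface to the user.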
9 changes: 9 additions & 0 deletions content/30.discussion.md
@@ -0,0 +1,9 @@
## Discussion

Reuse of generic open-source libraries

Guarantee robustness and objective evaluation

Open and sustainable models

Complex logic and chaining: future
