Build dataset #2

Merged · 10 commits · Dec 6, 2023

4 changes: 4 additions & 0 deletions .gitignore
@@ -158,3 +158,7 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+notebooks/
+data/
+wandb/
36 changes: 23 additions & 13 deletions README.md
@@ -1,24 +1,34 @@
 # llm-stack
 
-End-to-end tech stack for the LLM data flywheel.
+This tutorial series will show you how to build an end-to-end data flywheel for Large Language Models (LLMs).
 
-## Chapters
+We will be summarising arXiv abstracts.
 
-- Building your training set with GPT-4
-- Fine-tuning an open-source LLM
-- Evaluation
-- Human feedback
-- Unit tests
-- Deployment
+## What you will learn
 
-## Installation
+How to:
 
-TODO
+- Build a training set with GPT-4 or GPT-3.5
+- Fine-tune an open-source LLM
+- Create a set of evals to evaluate the model
+- Collect human feedback to improve the model
+- Deploy the model to an inference endpoint
 
-## Fine-tuning
+## Software used
 
-### Data
+- [wandb](https://wandb.ai) for experiment tracking. This is where we will record all our artifacts (datasets, models, code) and metrics.
+- [modal](https://modal.com/) for running jobs on the cloud.
+- [huggingface](https://huggingface.co/) for all-things-LLM.
+- [argilla](https://docs.argilla.io/en/latest/) for labelling our data.
+
+## Tutorial 1 - Generating a training set with GPT-3.5
+
+In this tutorial, we will use GPT-3.5 to generate a training set for a summarisation task.
+
+```bash
+modal run src/llm_stack/scripts/build_dataset_summaries.py
+```
 
 ## Contributing
 
-TODO
+Found any mistakes or want to contribute? Feel free to open a PR or an issue.
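
For readers following along, here is a minimal sketch of what a script like `src/llm_stack/scripts/build_dataset_summaries.py` could look like. The app name, prompt, secret name, and fan-out logic are illustrative assumptions, not the code in this PR:

```python
# A hypothetical sketch, not the PR's actual script: summarise abstracts
# with GPT-3.5 on Modal. App name, prompt, and helpers are assumptions.
import modal

app = modal.App("build-dataset-summaries")
image = modal.Image.debian_slim().pip_install("openai")


@app.function(image=image, secrets=[modal.Secret.from_name("openai-secret")])
def summarise(abstract: str) -> str:
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY from the Modal secret
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarise this arXiv abstract in one sentence."},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content


@app.local_entrypoint()
def main():
    # In the real pipeline the abstracts would come from the arXiv API
    # and the pairs would be logged as a wandb artifact.
    abstracts = ["We present a new method for ..."]
    for abstract, summary in zip(abstracts, summarise.map(abstracts)):
        print({"abstract": abstract, "summary": summary})
```

`modal run` executes the `local_entrypoint` locally and runs each `summarise` call in a Modal container, which is what the README command above invokes.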
158 changes: 150 additions & 8 deletions poetry.lock

Some generated files are not rendered by default.

6 changes: 6 additions & 0 deletions pyproject.toml
@@ -25,6 +25,12 @@ bitsandbytes = "^0.41.2.post2"
 wandb = "^0.16.0"
 accelerate = "^0.24.1"
 torch = "2.0.1"
+feedparser = "^6.0.10"
+python-dotenv = "^1.0.0"
+httpx = "^0.25.2"
+openai = "^1.3.7"
+tenacity = "^8.2.3"
+tqdm = "^4.66.1"
 
 
 [tool.poetry.group.test]
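
The new dependencies point at the dataset-building pipeline: `feedparser` and `httpx` for fetching arXiv data, `openai` for GPT-3.5, `tenacity` for retries, `tqdm` for progress bars, and `python-dotenv` for local secrets. A hedged sketch of how three of them typically combine (the function name and prompt are assumptions, not code from this PR):

```python
# A hypothetical sketch, not code from this PR: python-dotenv, openai,
# and tenacity combined for retrying, rate-limit-tolerant GPT-3.5 calls.
from dotenv import load_dotenv
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

load_dotenv()  # read OPENAI_API_KEY from a local .env file
client = OpenAI()


@retry(wait=wait_exponential(min=1, max=60), stop=stop_after_attempt(6))
def complete(prompt: str) -> str:
    # Retries with exponential backoff on any exception (e.g. a rate limit error).
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```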
4 changes: 4 additions & 0 deletions src/llm_stack/build_dataset/__init__.py
@@ -0,0 +1,4 @@
+from .arxiv import ArxivAPI
+
+
+__all__ = ["ArxivAPI"]
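
The `arxiv` module itself is not shown in this rendered diff. Given the `feedparser` and `httpx` dependencies added above, a minimal `ArxivAPI` compatible with this export might look like the following sketch; the method names, defaults, and query handling are assumptions:

```python
# A hypothetical sketch of src/llm_stack/build_dataset/arxiv.py.
# The class name matches the export above; everything else is assumed.
import feedparser
import httpx

ARXIV_API_URL = "http://export.arxiv.org/api/query"


class ArxivAPI:
    """Fetch paper abstracts from the arXiv Atom API."""

    def __init__(self, timeout: float = 30.0) -> None:
        self.client = httpx.Client(timeout=timeout)

    def abstracts(self, query: str, max_results: int = 100) -> list[str]:
        """Return abstracts of the top `max_results` papers matching `query`."""
        response = self.client.get(
            ARXIV_API_URL,
            params={"search_query": query, "max_results": max_results},
        )
        response.raise_for_status()
        feed = feedparser.parse(response.text)
        # Each Atom <entry> carries the abstract in its <summary> element.
        return [entry.summary for entry in feed.entries]
```

Usage would then look like `ArxivAPI().abstracts("cat:cs.CL", max_results=10)`, since the arXiv API's `search_query` parameter accepts field-prefixed queries such as `cat:cs.CL`.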