A free, lightweight tool that streamlines the discovery of API documentation, policies, and community resources, and enhances LLMs with accurate, relevant context.
Like the project? Please give it a Star so it can reach more people!
⚠️ Under Construction
This project is in the early stages of development and may not function as intended yet. Contributions, feedback, and ideas are highly welcome!
`api-docs-urls.csv` contains a centralized collection of popular APIs with links to their official documentation and associated policies. The repository also includes tools to scrape, preprocess, and update the dataset for better usability and retrieval.
`api-docs-urls.csv`:

| API Name | Official Documentation URL | Privacy Policy URL | Terms of Service URL | Rate Limiting Policy URL | Changelog/Release Notes URL | Security Policy URL | Developer Community/Forum URL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI API | Documentation | Privacy | Terms | Rate Limits | Changelog | Security | Community |
| ... | ... | ... | ... | ... | ... | ... | ... |
⚠️ The URLs are auto-generated and require manual verification.
We aim to keep these URLs pointing to the current documents. (TODO: set up cron jobs/GitHub Actions to periodically re-run the scrapers and keep the dataset up to date.)
You can manually add new entries to `api-docs-urls.csv` with the following format:

```csv
API_Name,Official_Documentation_URL,Privacy_Policy_URL,Terms_of_Service_URL,Rate_Limiting_Policy_URL,Changelog_Release_Notes_URL,Security_Policy_URL,Developer_Community_Forum_URL
Example API,https://example.com/docs,https://example.com/privacy,https://example.com/tos,https://example.com/rate-limits,https://example.com/changelog,https://example.com/security,https://example.com/community
```
If you have additional entries in separate CSV files, use the provided Python utility script to merge them into the main dataset.
- Ensure you have Python installed.
- Run the script:

  ```bash
  python utils/combine_csv.py new_entries.csv api-docs-urls.csv combined_dataset.csv
  ```

- Replace the existing `api-docs-urls.csv` with the new `combined_dataset.csv`.
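For reference, here is a minimal sketch of what such a merge step could look like if you want to script it yourself; the bundled `utils/combine_csv.py` may differ in details such as how it de-duplicates rows:

```python
import csv
import sys

def combine(new_file: str, main_file: str, out_file: str) -> None:
    """Merge two CSVs with identical headers, de-duplicating by API name."""
    header, seen, rows = None, set(), []
    # Read the main dataset first so existing entries win on duplicates.
    for path in (main_file, new_file):
        with open(path, newline='', encoding='utf-8') as f:
            reader = csv.reader(f)
            file_header = next(reader)
            header = header or file_header
            for row in reader:
                if row and row[0] not in seen:  # column 0 is API_Name
                    seen.add(row[0])
                    rows.append(row)
    with open(out_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

if __name__ == '__main__':
    combine(sys.argv[1], sys.argv[2], sys.argv[3])
```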
Use case 1: Use the scrapers (`fast-scraper.js` or `accurate-scraper.js`) to extract content from API docs and enhance your LLM so it can provide specific, accurate answers about APIs.
Workflow Example:
- Retrieve relevant snippets: query the vector database (or use a custom script) with the user's question.
- Generate answers with an LLM: pass the retrieved snippets as context to the LLM (e.g., GPT-4 or LLaMA-2).
```python
import numpy as np
from faiss import read_index
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the vector index and an embedding model for queries
index = read_index('vector_index.faiss')
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the user question and retrieve the 5 most relevant chunks
user_query = "What are the rate limits for the OpenAI API?"
query_embedding = embedder.encode(user_query)
_, indices = index.search(np.array([query_embedding], dtype='float32'), 5)
context = " ".join(documents[i] for i in indices[0])  # documents: the indexed text chunks

# Use a local LLM to answer (GPT-4 is API-only; use an open model such as LLaMA-2)
model_name = 'meta-llama/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
prompt = f"Context: {context}\nQuestion: {user_query}\nAnswer:"
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
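The snippet above assumes an existing `vector_index.faiss` and a parallel `documents` list. A minimal sketch of that indexing step follows; the JSON schema and the embedding model name are assumptions, so adjust them to match your actual scraper output:

```python
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load scraper output. The schema (a list of objects with a 'content' field)
# is an assumption; inspect your actual JSON first.
with open('scraped_data_fast.json', encoding='utf-8') as f:
    pages = json.load(f)
documents = [chunk.strip()
             for page in pages
             for chunk in page.get('content', '').split('\n\n')
             if chunk.strip()]

# Embed the chunks and persist a flat L2 index alongside the documents.
embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = np.asarray(embedder.encode(documents), dtype='float32')
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, 'vector_index.faiss')
```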
Use case 2: Maintain offline copies of API documentation for scenarios where internet access is unavailable or restricted. Offline access ensures reliability and speed when querying API documentation.
How?
- Use the scrapers to generate offline copies of the documentation in JSON, HTML, or Markdown formats.
- Serve these copies locally or integrate them into a lightweight desktop or web application.
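As a minimal sketch, the offline copies can be served with Python's built-in HTTP server; the `offline_docs` directory name is illustrative, so point it at wherever your scraped files actually live:

```python
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve ./offline_docs at http://localhost:8000 (directory name is illustrative).
handler = partial(SimpleHTTPRequestHandler, directory='offline_docs')
HTTPServer(('localhost', 8000), handler).serve_forever()
```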
Use case 3: API documentation changes frequently, and outdated information can lead to bugs or misconfigurations. Automating change detection ensures your knowledge base remains up-to-date.
How?
- Compare the current version of a page with its previously saved version.
- Use hashing (e.g., MD5) or diff-checking tools to detect changes in content.
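A minimal change-detection sketch using MD5 hashes might look like the following; the hash-store file name is illustrative, and the repo does not ship this helper yet:

```python
import hashlib
import json
from pathlib import Path

HASH_STORE = Path('page_hashes.json')  # illustrative file name

def has_changed(url: str, page_text: str) -> bool:
    """Return True if page_text differs from the last version seen for url."""
    hashes = json.loads(HASH_STORE.read_text()) if HASH_STORE.exists() else {}
    new_hash = hashlib.md5(page_text.encode('utf-8')).hexdigest()
    changed = hashes.get(url) != new_hash
    hashes[url] = new_hash
    HASH_STORE.write_text(json.dumps(hashes, indent=2))
    return changed
```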
Recommended Python Versions: Python >=3.7 and <3.10
- Check your Python version:

  ```bash
  python --version
  ```
- If your Python version is incompatible, you can:
  - Install a compatible version (e.g., Python 3.9).
  - Use a virtual environment:

    ```bash
    python3.9 -m venv venv
    source venv/bin/activate  # or venv\Scripts\activate on Windows
    pip install -r requirements.txt
    ```
- Alternatively, use Conda to install PyTorch and its dependencies:

  ```bash
  conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
  ```
We provide two scraping tools to suit different needs:
- `fast-scraper.js`: a lightweight Cheerio-based scraper for fast retrieval of static content.
- `accurate-scraper.js`: a Playwright-based scraper for handling JavaScript-loaded pages and more dynamic content.
- Purpose: For quickly scraping static API documentation pages.
- Strengths:
  - Lightweight and fast.
  - Suitable for pages without JavaScript content.
- Limitations:
  - Does not handle JavaScript-loaded content.
- Install dependencies:

  ```bash
  npm install
  ```

- Run the script:

  ```bash
  node fast-scraper.js
  ```

- Results will be saved in `scraped_data_fast.json`.
- Purpose: For scraping API documentation pages that rely on JavaScript for rendering.
- Strengths:
  - Handles dynamic content and JavaScript-loaded pages.
  - More accurate for modern, interactive documentation sites.
- Limitations:
  - Slower compared to `fast-scraper.js`.
- Install Playwright:

  ```bash
  npm install playwright
  ```

- Run the script:

  ```bash
  node accurate-scraper.js
  ```

- Results will be saved in `scraped_data_accurate.json`.
For first-time contributors, I recommend checking out https://github.com/firstcontributions/first-contributions and https://www.youtube.com/watch?v=YaToH3s_-nQ.
Contributions are welcome! Here's how you can contribute:
- Add API Entries:
  - Add new API entries directly to `api-docs-urls.csv` or via pull request.
  - Ensure URLs point to the current version of the documentation and policies.
- Verify API Entries:
  - Is the URL up to date?
  - Is the URL root-level for the relevant page? (api.com/docs/, not api.com/docs/nested)
  - Is the API doc public, and does scraping it comply with its robots.txt? (See the sketch after this list.)
  - Does the URL provide all the expected information (changelogs, rate limits, etc.)?
  - Is there dynamically loaded page content, and is the scraper able to extract it?
- Improve Scrapers:
  - Enhance `fast-scraper.js` or `accurate-scraper.js` for better performance and compatibility.
  - Add features like advanced error handling or field-specific scraping.
- Submit Pull Requests:
  - Fork the repository.
  - Create a new branch for your changes.
  - Submit a pull request for review.
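For the robots.txt question in the verification checklist above, a minimal check can be done with Python's standard library; the helper below is illustrative and not part of the repo:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url: str, user_agent: str = '*') -> bool:
    """Check whether the site's robots.txt permits fetching the given URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

print(allowed_by_robots("https://platform.openai.com/docs/"))
```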
If you're using the scripts, first install dependencies:

```bash
npm install
pip install -r requirements.txt
```

This installs everything listed in `package.json` and `requirements.txt`.
- Search & Browse: Easily find APIs by keyword or category (e.g., "Machine Learning APIs," "Finance APIs").
- Latest API Metadata Retrieval: Retrieve up-to-date API endpoints and parameters directly from official documentation.
- VS Code Integration: Use the lightweight UpdAPI extension to search and retrieve APIs directly from your terminal.
This repository is licensed under the MIT License.
- Under Construction: We're building the core MVP features and testing functionality.
- Limited API support.
- Some features may not work as expected.
Check the Open Issues for more details.
- Basic search and browse functionality.
- JSON exports for select APIs.
- Direct links to official API documentation.
- IDE integrations (e.g., VS Code plugin).
- API update notifications via email/webhooks.
- Support for more APIs.
We thank all API providers for publishing robust documentation and fostering developer-friendly ecosystems. Your contributions make projects like this possible! Special thanks to: