Skip to content

Latest commit

 

History

History
272 lines (192 loc) · 10.7 KB

README.md

File metadata and controls

272 lines (192 loc) · 10.7 KB

UpdAPI 🔧

"Update your knowledge base with the latest API resources"

A free, lightweight tool to streamline the discovery of API documentation, policies, and community resources and enhancing LLMs with accurate, relevant context

updAPI

License
Contributions Welcome
Build Status
Open Issues


Like the project? Please give it a Star so it can reach more people >>>>> Star on GitHub

⚠️ Under Construction
This project is in the early stages of development and may not function as intended yet. Contributions, feedback, and ideas are highly welcome!

📋 Links to Public API DOCS

api-docs-urls.csv contains a centralized collection of popular APIs with links to their official documentation and associated policies. It includes tools to scrape, preprocess, and update the dataset for better usability and retrieval.

api-docs-url.csv:

API Name Official Documentation URL Privacy Policy URL Terms of Service URL Rate Limiting Policy URL Changelog/Release Notes URL Security Policy URL Developer Community/Forum URL
OpenAI API Documentation Privacy Terms Rate Limits Changelog Security Community
...

⚠️ the URLs are auto-generated and require manual verification
We aim to maintain these URLs to be pointing to the current document (TODO: Set up cron jobs/GitHub Actions to periodically re-run the scrapers and keep the dataset up-to-date)

🛠 Adding More APIs to the Dataset

Option 1: Manually Add to api-docs-urls.csv

You can manually add new entries to api-docs-urls.csv with the following format:

API_Name,Official_Documentation_URL,Privacy_Policy_URL,Terms_of_Service_URL,Rate_Limiting_Policy_URL,Changelog_Release_Notes_URL,Security_Policy_URL,Developer_Community_Forum_URL
Example API,https://example.com/docs,https://example.com/privacy,https://example.com/tos,https://example.com/rate-limits,https://example.com/changelog,https://example.com/security,https://example.com/community

Option 2: Combine Multiple CSV Files

If you have additional entries in separate CSV files, use the provided Python utility script to merge them into the main dataset.

Combine CSV Files

  1. Ensure you have Python installed.
  2. Run the script:
    python utils/combine_csv.py new_entries.csv api-docs-urls.csv combined_dataset.csv
  3. Replace the existing api-docs-urls.csv with the new combined_dataset.csv.

What can we do with the API URLs?

Use case 1: You can use the scrapers (fast-scraper.js or accurate-scraper.js) to extract content from API docs and enhance your LLM to provide specific and accurate answers about APIs

Workflow Example:

  1. Retrieve relevant snippets with a custom script / Query the vector database for a user question

  2. Generate Answers with an LLM: Pass the retrieved snippets as context to the LLM (e.g., GPT-4 or LLaMA-2)

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from faiss import read_index
    
    # Load vector index
    index = read_index('vector_index.faiss')
    
    # Query embeddings
    user_query = "What are the rate limits for the OpenAI API?"
    query_embedding = model.encode(user_query)
    _, indices = index.search(np.array([query_embedding]), k=5)
    
    # Retrieve relevant chunks
    context = " ".join([documents[i] for i in indices[0]])
    
    # Use an LLM to answer
    model = AutoModelForCausalLM.from_pretrained('gpt-4')
    tokenizer = AutoTokenizer.from_pretrained('gpt-4')
    
    prompt = f"Context: {context}\nQuestion: {user_query}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(**inputs, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Use case 2: Maintain offline copies of API documentation for scenarios where internet access is unavailable or restricted. Offline access ensures reliability and speed when querying API documentation.

How?

  • Use the scrapers to generate offline copies of the documentation in JSON, HTML, or Markdown formats.
  • Serve these copies locally or integrate them into a lightweight desktop or web application.

Use case 3: API documentation changes frequently, and outdated information can lead to bugs or misconfigurations. Automating change detection ensures your knowledge base remains up-to-date.

How?

  • Compare the current version of a page with its previously saved version.
  • Use hashing (e.g., MD5) or diff-checking tools to detect changes in content.

🚀 How to Use the Scrapers

Check Python Version

Recommended Python Versions: Python >=3.7 and <3.10

  1. Check your Python version:
    python --version
  2. If your Python version is incompatible, you can:
    • Install a compatible version (e.g., Python 3.9).
    • Use a virtual environment:
      python3.9 -m venv venv
      source venv/bin/activate  # Or venv\Scripts\activate on Windows
      pip install -r requirements.txt
  3. Alternatively, use Conda to install PyTorch and its dependencies:
    conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

We provide two scraping tools to suit different needs:

  • fast-scraper.js: A lightweight Cheerio-based scraper for fast retrieval of static content.
  • accurate-scraper.js: A Playwright-based scraper for handling JavaScript-loaded pages and more dynamic content.

1. fast-scraper.js (Cheerio-Based)

  • Purpose: For quickly scraping static API documentation pages.
  • Strengths:
    • Lightweight and fast.
    • Suitable for pages without JavaScript content.
  • Limitations:
    • Does not handle JavaScript-loaded content.

Run the Script

  1. Install dependencies:
    npm install
  2. Run the script:
    node fast-scraper.js
  3. Results will be saved in scraped_data_fast.json.

2. accurate-scraper.js (Playwright-Based)

  • Purpose: For scraping API documentation pages that rely on JavaScript for rendering.
  • Strengths:
    • Handles dynamic content and JavaScript-loaded pages.
    • More accurate for modern, interactive documentation sites.
  • Limitations:
    • Slower compared to fast-scraper.js.

Run the Script

  1. Install Playwright:
    npm install playwright
  2. Run the script:
    node accurate-scraper.js
  3. Results will be saved in scraped_data_accurate.json.

💡 How to Contribute

For first time contributors, I recommend you to check out https://github.com/firstcontributions/first-contributions and https://www.youtube.com/watch?v=YaToH3s_-nQ

Contributions are welcome! Here's how you can contribute:

  1. Add API Entries:

    • Add new API entries directly to api-docs-urls.csv or via pull request.
    • Ensure URLs point to the current version of the documentation and policies.
  2. Verify API Entries:

    • Is the URL up-to-date?
    • Is the URL root-level for the relevant page? (api.com/docs/, not api.com/docs/nested)
    • Is the API doc public and does it comply with "robots.txt"?
    • Does the URL provide all the expected information (changelogs, rate limits, etc) ?
    • Is there any dynamically loaded page content that the scraper is able to extract?
  3. Improve Scrapers:

    • Enhance fast-scraper.js or accurate-scraper.js for better performance and compatibility.
    • Add features like advanced error handling or field-specific scraping.
  4. Submit Pull Requests:

    • Fork the repository.
    • Create a new branch for your changes.
    • Submit a pull request for review.

If you're using the scripts, first install dependencies:

npm install
pip install -r requirements.txt

This installs everything listed in package.json and requirements.txt

🚀 Roadmap Features

  • 🔍 Search & Browse: Easily find APIs by keyword or category (e.g., "Machine Learning APIs," "Finance APIs")
  • 📄 Latest API Metadata Retrieval: Retrieve up-to-date API endpoints and parameters, directly from official documentation.
  • 🛠 VS Code Integration: Use the lightweight UpdAPI extension to search and retrieve APIs directly from your terminal.

📜 License

This repository is licensed under the MIT License.


Status

🛠 Current Phase:

  • Under Construction: We’re building the core MVP features and testing functionality.

📌 Known Issues:

  • Limited API support.
  • Some features may not work as expected.

Check the Open Issues for more details.


Roadmap

✅ MVP Goals

  • Basic search and browse functionality.
  • JSON exports for select APIs.
  • Direct links to official API documentation.

🔜 Future Enhancements

  • IDE integrations (e.g., VS Code plugin).
  • API update notifications via email/webhooks.
  • Support for more APIs.

Community


❤️ Acknowledgments

We thank all API providers for publishing robust documentation and fostering developer-friendly ecosystems. Your contributions make projects like this possible! Special thanks to:

  • Crawlee: A powerful web scraping and crawling library that simplifies the extraction of structured data from websites.
  • OpenAPI: For setting the standard in API specifications and enabling better interoperability and accessibility.