Home

Andreas edited this page Nov 19, 2024 · 9 revisions

Our data infrastructure aims to curate, process, and provide access to the depths of bitcoin's technical ecosystem.

The data pipeline can be broken down into the following stages that handle the complete lifecycle of data.

🌍 Discover: Identify relevant data sources
📥 Ingest: Bring data into the system
⚙️ Process: Make the data more useful
🗂️ Store: Persist and organize the data
🍽️ Serve: Provide access points and efficient search/query capabilities
💡 Consume: Retrieve and use the data to create value

Current Priorities

These are cross-cutting concerns and challenges that affect all stages of the data pipeline:

🤖 Flexible multi-source scraperV2 with modular architecture for streamlined data ingestion and processing.
- ✅ introduced with PR#81
- 🛠️ Add additional sources
- 📌 Assess support for different types of sources
🧩 Chunking Strategy to improve contextual retrieval.
- 🛠️ Researching semantic chunking and optimum chunking strategy
⚖️ Evaluation Framework to guide refinements across the data pipeline.
- 🛠️ Researching evaluation strategies
📡 Monitoring to ensure reliability, data quality and overall system health.
- 📌 Central logging for data collection events
- 📌 Automated alerts for scraper failures or data collection issues
🗣️ Terminology inconsistencies across the infrastructure complicate understanding and documentation.
- See Proposal for Terminology Standardization.
- 🛠️ scraperV2 is using this new terminology
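
As a point of reference for the chunking research listed above, here is a minimal sketch of a fixed-size, overlapping word-window chunker, a common baseline that semantic chunking approaches are compared against. It is not the strategy chosen by this project; the function name `chunk_words` and the default sizes are assumptions for illustration only.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps some shared context between neighbouring chunks, which is the property a semantic chunker would try to achieve by splitting at topic boundaries instead.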

Objectives

Centralized Access: Aggregate and organize data for efficient user and system access.
Discoverability: Leverage topics and metadata for advanced search and contextual exploration.
Automation: Automate ingestion, processing, and enrichment tasks, such as summarization and topic extraction, to ensure up-to-date, actionable data.
Scalability: Build adaptable systems to accommodate growing sources and evolving needs, while maintaining flexibility to integrate new tools and use cases.
Continuous Improvement: Monitor performance and iterate based on feedback and metrics.

Data Pipeline

An outline of the current state of the data infrastructure, ongoing work, and areas for improvement across the pipeline.

🌍 Discover

Focuses on WHAT data we need and where to find it.

CURRENT STATE

Sources are identified through human input.
Users can suggest sources for inclusion, but there is no established pipeline for handling those suggestions; adding them is manual work.

AREAS FOR IMPROVEMENT and EXPLORATION

📌 Automate source suggestion workflows.
📌 Automate source discovery using web crawlers, API monitors, and aggregated lists (e.g., awesome-x pages).
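
One way the aggregated-list idea above could work is sketched below: extract the links from an awesome-style Markdown list and keep only those that are not already known sources. This is an illustrative sketch, not an existing component; the names `normalize` and `suggest_sources` and the URL normalization rules are assumptions.

```python
import re
from typing import Iterable, List, Set
from urllib.parse import urlparse

# Matches inline Markdown links of the form [label](http...)
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")


def normalize(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def suggest_sources(markdown: str, known_urls: Iterable[str]) -> List[str]:
    """Return links found in `markdown` that are not already known sources."""
    known: Set[str] = {normalize(u) for u in known_urls}
    suggestions: List[str] = []
    for url in MARKDOWN_LINK.findall(markdown):
        key = normalize(url)
        if key not in known:
            known.add(key)  # also de-duplicates links repeated within the list
            suggestions.append(url)
    return suggestions
```

The output would still need human review before a suggestion is added to the source registry.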

📥 Ingest

Focuses on HOW to bring data into the system.

CURRENT STATE

Sources Management
- Sources are managed through a registry of predefined knowledge sources. See scraper/README.md.
- Scheduling is done using GitHub Actions with nightly cron jobs for periodic updates, while one-off scrapers collect data on an ad hoc basis.
Data Collection
- Supported sources via scraperV2
  - Documents in GitHub repositories (GitHub scraper)
  - Websites (scrapy scraper)
- Original content is stored in its native format in body_formatted and identified by type.
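
The registry and the native-format storage described above can be pictured with the following sketch. Only `body_formatted` comes from this page; the class names, the `body_type` field name, and the registry entries are hypothetical, and the actual registry format is the one documented in scraper/README.md.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Source:
    name: str
    url: str
    scraper: str  # "github" or "scrapy", the two scraper kinds listed above


@dataclass
class Document:
    source: str
    url: str
    body_formatted: str  # original content, kept in its native format
    body_type: str       # hypothetical field naming that format, e.g. "markdown"


# Hypothetical entries for illustration; not the project's real sources.
REGISTRY: List[Source] = [
    Source("example-docs", "https://github.com/example/docs", "github"),
    Source("example-blog", "https://blog.example.com", "scrapy"),
]


def sources_by_scraper(registry: List[Source]) -> Dict[str, List[Source]]:
    """Group registered sources by the scraper that should collect them."""
    grouped: Dict[str, List[Source]] = {}
    for source in registry:
        grouped.setdefault(source.scraper, []).append(source)
    return grouped
```

A scheduled job could then iterate over `sources_by_scraper(REGISTRY)` and dispatch each group to the matching scraper.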

AREAS FOR IMPROVEMENT and EXPLORATION

🛠️ Enable scraperV2
- introduced with PR#81
🛠️ Add support for Bitcoin Core PR Review Club to scraperV2
🛠️ Add StackExchange scraper to scraperV2
📌 Assess support for different types of sources: Research papers, Release docs, awesome-x pages, Twitter threads, Medium

⚙️ Process

Focuses on HOW to make the data more useful.

CURRENT STATE

Processing currently happens only on resources from the bitcoin-dev mailing list and Delving Bitcoin.
Summarization
- The summarizer creates individual post summaries and combined thread summaries daily using gpt-4-turbo-preview.
- Workflow inefficiencies: Duplicate storage (XML and Elasticsearch), questionable utility of individual summaries, and potential quality issues.
  - see Thread Summaries Workflow Analysis and Proposal
Topic Extraction
- Current topic extraction relies on an outdated list of Bitcoin-related topics.
- Primary and secondary topics generated by topic extraction are not used anywhere in our data infrastructure.
Embeddings
- Vector embeddings are generated from the document's title and summary using SentenceTransformer (intfloat/e5-large-v2). The model runs locally at no cost, but embeddings are limited to 1024 dimensions.
- Outputs are stored in Elasticsearch but currently unused.
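
The embedding step described above can be sketched as follows. How the title and summary are joined and the use of the `passage: ` prefix (which the E5 model card recommends for indexed documents) are assumptions, as are the function names; this is not a copy of the project's code.

```python
from typing import List, Sequence


def embedding_input(title: str, summary: str) -> str:
    """Build the text to embed from a document's title and summary."""
    parts = [title.strip(), summary.strip()]
    text = "\n".join(p for p in parts if p)
    # E5 models expect a "passage: " prefix for documents being indexed.
    return f"passage: {text}"


def embed(texts: Sequence[str]) -> List[List[float]]:
    """Encode texts with intfloat/e5-large-v2 (downloads the model on first use)."""
    # Imported lazily so embedding_input() works without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/e5-large-v2")
    vectors = model.encode(list(texts), normalize_embeddings=True)
    return vectors.tolist()
```

Calling `embed([embedding_input(title, summary)])` would return one 1024-dimensional vector per input text. Search queries would need the matching `query: ` prefix if these vectors were ever used for retrieval.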

AREAS FOR IMPROVEMENT and EXPLORATION

📌 Integrate summarization into scraperV2.
🛠️ Build a topics index to standardize topic extraction across all resources.
- see bitcoinsearch/topics-index
📌 Refine topic extraction using the new topics index and integrate into scraperV2.
📌 Refine the summarization logic for individual summaries
- see Summary Efficiency Analysis
📌 Consider additional enrichment tasks, such as named entity recognition or relationship mapping.
- Identify key entities (e.g., people, topics, locations) within the data
📌 Fix incorrect URL in Combined Summaries (issue#64)

🗂️ Store

Focuses on HOW to persist and organize the data.

CURRENT STATE

Elasticsearch stores all data and supports full-text search. Full-text search alone, however, may be less effective at capturing semantic similarities between concepts.
Bitcoin Transcripts are managed autonomously in their own GitHub repository, but also scraped for integration with Elasticsearch.
Summaries are duplicated as XML files in the summarizer.

AREAS FOR IMPROVEMENT and EXPLORATION

📌 Implement a central thread resource document to consolidate summaries.
📌 Consolidate all data in Elasticsearch to avoid duplications (e.g., between XML files and index).

🍽️ Serve

Focuses on HOW to make the data available for consumption.

CURRENT STATE

Access Methods
- The Bitcoin Search API serves as the primary interface for querying data, powering both Bitcoin Search and chat-btc.
- The API lacks documentation.
Relevancy Challenges
- Poor relevancy ranking affects the usability of both search results and chat-btc responses.

AREAS FOR IMPROVEMENT and EXPLORATION

🛠️ Research improved ranking methodologies for Elasticsearch queries.
- see Elastic search shortcomings
📌 Better documentation for the API
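
As one example of the kind of ranking adjustment that could be evaluated, the sketch below builds an Elasticsearch `multi_match` request body with per-field boosts. The field names, the weights, and the `build_query` helper are hypothetical; they do not describe the queries the Bitcoin Search API currently issues.

```python
from typing import Any, Dict, Optional


def build_query(
    text: str,
    boosts: Optional[Dict[str, float]] = None,
    size: int = 10,
) -> Dict[str, Any]:
    """Build an Elasticsearch request body that weights matches per field."""
    # Hypothetical field names and weights; adjust to the actual index mapping.
    boosts = boosts or {"title": 3.0, "summary": 2.0, "body": 1.0}
    fields = [f"{name}^{weight:g}" for name, weight in boosts.items()]
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "best_fields",
            }
        },
    }
```

Different boost settings could be compared on a fixed set of queries once an evaluation framework is in place.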

💡 Consume

CURRENT STATE

Bitcoin Search: Displays search results from Elasticsearch using the Bitcoin Search API.
Chat-BTC: ChatGPT wrapper using a RAG pipeline to fetch relevant documents for queries.
- Retrieval Strategy
  1. Prompt gpt-3.5-turbo to extract keywords from the user's last 10 questions
  2. Use keywords (or query as fallback) to get relevant documents from Bitcoin Search API
  3. Filter out StackExchange Questions
- Response Generation
  - 7000 tokens context window
  - Include the last 4 to 6 messages of chat history
Bitcoin TLDR: Displays Combined Summaries and individual summaries
- Currently in redesign
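
The Chat-BTC retrieval strategy and context limits listed above can be sketched roughly as follows. The keyword extractor and the search call are passed in as functions so the sketch stays independent of any real API. The document field names (`domain`, `type`, `body`), the whitespace-based token estimate, and the assumption that the 7000-token budget applies to the retrieved documents are all illustrative.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]


def count_tokens(text: str) -> int:
    """Rough token estimate (whitespace words); a real tokenizer would differ."""
    return len(text.split())


def retrieve_context(
    questions: List[str],
    extract_keywords: Callable[[List[str]], str],
    search: Callable[[str], List[Document]],
    max_tokens: int = 7000,
) -> List[Document]:
    """Follow the three retrieval steps and return documents that fit the budget."""
    if not questions:
        return []
    recent = questions[-10:]                 # step 1: the user's last 10 questions
    keywords = extract_keywords(recent).strip()
    query = keywords or recent[-1]           # step 2: fall back to the raw query
    results = search(query)

    # Step 3: drop StackExchange questions (field names are hypothetical).
    results = [
        d for d in results
        if not (
            d.get("domain", "").endswith("stackexchange.com")
            and d.get("type") == "question"
        )
    ]

    context: List[Document] = []
    used = 0
    for doc in results:
        cost = count_tokens(doc.get("body", ""))
        if used + cost > max_tokens:
            break
        context.append(doc)
        used += cost
    return context


def recent_history(messages: List[str], limit: int = 6) -> List[str]:
    """Keep the most recent chat messages (the list above says 4 to 6)."""
    return messages[-limit:]
```

In this sketch `extract_keywords` stands in for the gpt-3.5-turbo prompt and `search` for a call to the Bitcoin Search API.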