[epic] MarkdownDB Index and Library v1 #3

Open
29 of 43 tasks
rufuspollock opened this issue Mar 12, 2023 · 0 comments

rufuspollock commented Mar 12, 2023

A database of markdown files so that you can quickly access the metadata and content you want.

  • All metadata including frontmatter, links, tags, tasks etc
  • Auto-reloading
  • Super simple javascript API

Bonus

  • Can generate sqlite so you get full sql access (if you want)

Non-features

  • Does not index the full-text content

Re Flowershow: Use this to replace contentlayer.dev.

See https://datahub.io/notes/markdowndb

Acceptance aka Roadmap

Feature list

Marketing

Features

Index a folder of files - create a "DB" index from a folder of markdown files (and other files including images)

  • Index a folder and get JS/TS objects
  • Index a folder and get json output
  • BONUS Index multiple folders (with support for configuration such as path prefixing, e.g. when all my blog files live in a separate folder)
  • Command line tool for indexing: Create a markdowndb (index) on the command line
  • Index a folder and get SQLite
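
As a rough illustration of "index a folder and get JS/TS objects" and "get json output", the sketch below builds index records from an in-memory map of paths to file contents. Everything in it is hypothetical: `FileRecord`, `parseFrontmatter` and `indexFiles` are illustrative names, not MarkdownDB's API, and the frontmatter parser only handles flat `key: value` lines. A real indexer would read from disk and use a proper YAML parser.

```typescript
// Hypothetical shape of one index entry; not MarkdownDB's actual schema.
interface FileRecord {
  path: string;
  extension: string;
  metadata: Record<string, string>;
}

// Naive frontmatter parser: only flat `key: value` lines between a leading
// pair of `---` fences. Nested YAML and lists are NOT handled.
function parseFrontmatter(content: string): Record<string, string> {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(content);
  const body = match?.[1] ?? "";
  const metadata: Record<string, string> = {};
  for (const line of body.split(/\r?\n/)) {
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    const key = line.slice(0, separator).trim();
    if (key) metadata[key] = line.slice(separator + 1).trim();
  }
  return metadata;
}

// Build one record per file. Non-markdown files (e.g. images) are still
// listed in the index, just without frontmatter metadata.
function indexFiles(files: Record<string, string>): FileRecord[] {
  return Object.keys(files)
    .sort()
    .map((path): FileRecord => {
      const dot = path.lastIndexOf(".");
      const extension = dot === -1 ? "" : path.slice(dot + 1);
      const isMarkdown = extension === "md" || extension === "mdx";
      const metadata = isMarkdown ? parseFrontmatter(files[path] ?? "") : {};
      return { path, extension, metadata };
    });
}

// Sample input (made up for illustration).
const records = indexFiles({
  "blog/hello.md": "---\ntitle: Hello\nlayout: blog\n---\n\n# Hello",
  "assets/logo.png": "",
});

// JSON output is just a serialisation of the same objects.
const asJson = JSON.stringify(records, null, 2);
```

Keeping the records as plain objects means the JSON output and any later SQLite export can share the same intermediate shape.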

Extract structured data like:

Data types, data enhancement and validation


Inbox

Marketing

Sections on front page about major features

  • Have a section on front page about links feature
  • Have a section for tags
  • etc

💤

  • Refactor: improve our interfaces, do something similar to CachedMetadata and CachedFile
  • "multi-thread" support for fast indexing

Misc

  • ➕ 2023-03-15 Add layout e.g. layout: blog as a rule in markdown db loading rather than in getStaticPaths for rendering blogs (follow up to work in datopian/datahub-next#51) ⛔2023-03-17 on having markdowndb support for rules

Rufus random notes

  • how can we get type stuff like contentlayer has e.g. a given type in markdown frontmatter leads to use of X typescript type/interface
  • check out astro-build - how do they do type stuff?

Notes

Questions

  • What is a ContentBase / ContentDB? ✅2023-03-07 a database (index) of content e.g. of text files on disk, images etc. DB need not store content of files but it "indexes" them i.e. has a list of them, with associated metadata etc.
  • Why do we need one? ✅2023-03-07 a) to replace this (basic) functionality in ContentLayer.dev so we can replace ContentLayer.dev b) so we can do richer things like get files with all tags etc
    • What contentlayer.dev API calls do we need to replace? ✅2023-03-07 ~8 of them. Quite simple. See below.
  • What is the difference between a Content Layer (API) and a ContentBase?
  • What are the key technical components of a ContentBase ✅2023-03-07 see diagram
  • What is MarkdownDB? ✅2023-03-07 It is a ContentBase whose text files are in markdown format
  • What information do we index about markdown files in ContentBase? ✅2023-03-07
    • frontmatter
    • list of all blocks and their types?
    • tags?
  • What is the unique identifier for files?
  • What are the job stories that the MarkdownDB needs to support? 🔥
  • What about assets other than markdown files? e.g. images and pngs? ✅2023-03-07 these should also get processed.
  • Does something like this already exist and how does it work?
  • How big will the sqlite db get? (i.e. per 1k documents indexed) NB: we aren't storing the text ... (though perhaps we could ...) 🚧2023-03-07 guess metadata is ~1 KB per file, so 1k files = 1 MB and 100k files = 100 MB, which seems ok for memory
  • What happens if the sqlite file gets really big? ✅2023-03-07 we'd probably have to store it somewhere in cloud etc
  • What DB should we use e.g. IndexedDB or sqlite? ✅2023-03-07 propose sqlite3 b/c you get sql etc and now pretty much supported in browser if we ever need that
  • How do we handle the indexing of remote files, such as files in GitHub repos? ✅2023-03-07 ❌ kind of invalid question. we can index the remote files easily and then cache that locally. We aren't indexing on the fly.
    • Do we just store a reference to that file?
  • What's a minimal viable API? 🚧2023-03-08 see section below
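
The size guess in the list above (about 1 KB of metadata per file) can be turned into a quick back-of-envelope calculation. This is only a sketch of that arithmetic; `estimateIndexSizeMB` and the constant are hypothetical names, and the 1 KB figure is the guess from the notes, not a measurement.

```typescript
// Assumption taken from the notes: roughly 1 KB of metadata per indexed file.
const ASSUMED_BYTES_PER_FILE = 1024;

// Estimated index size in MB (1 MB = 1024 * 1024 bytes).
function estimateIndexSizeMB(
  fileCount: number,
  bytesPerFile: number = ASSUMED_BYTES_PER_FILE
): number {
  return (fileCount * bytesPerFile) / (1024 * 1024);
}

const smallSite = estimateIndexSizeMB(1000); // just under 1 MB
const largeSite = estimateIndexSizeMB(100000); // just under 100 MB
```

If full text were stored as well, the per-file figure would need to be replaced with a measured average document size.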

Notes on obsidian dataview API

blacksmithgu/obsidian-dataview#1811

How to handle document types 2023-03-09

I'm not sure how we want to handle types. Having the type as a frontmatter field might not be ideal: if we had a blog folder, we'd have to add the type metadata to every file individually.

On contentlayer.dev it uses a filePathPattern for that:

const Blog = defineDocumentType(() => ({
  name: "Blog",
  filePathPattern: `${siteConfig.blogDir}/!(index)*.md*`,
  contentType: "mdx",
  fields: {
  ...

I believe that's a good way of handling this. The caveat is that a file's path now determines its type, so folders with mixed types are impossible, although we could work around that by applying the pattern as something like *.blog.md*.

The use case I'm imagining is something like (there are probably better examples than blog):

blogs
  my-first-post.blog.mdx    // Blog type
  my-second-post.blog.mdx     // Blog type 
  index.mdx    // Generic page type 
  about-our-authors.mdx    // Generic page type
  write-for-us.contact.mdx    // Generic contact type                   
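
To make the `*.blog.md*` idea concrete, here is a minimal sketch of inferring a type from the filename. `inferDocumentType` is a hypothetical helper, and the `"page"` fallback name for untyped files is an assumption; neither is part of MarkdownDB or contentlayer.

```typescript
// Hypothetical helper: derive a document type from a `name.<type>.md(x)`
// filename. Files without a type segment fall back to a generic type
// (the default "page" is an assumption).
function inferDocumentType(filePath: string, fallback: string = "page"): string {
  // Keep only the last path segment so folder names cannot affect the result.
  const fileName = filePath.slice(filePath.lastIndexOf("/") + 1);
  // Exactly three dot-separated parts: name, type, and md/mdx.
  const match = /^[^.]+\.([^.]+)\.mdx?$/.exec(fileName);
  return match?.[1] ?? fallback;
}

const blogType = inferDocumentType("blogs/my-first-post.blog.mdx");
const pageType = inferDocumentType("blogs/index.mdx");
const contactType = inferDocumentType("blogs/write-for-us.contact.mdx");
```

With this convention a single folder can hold mixed types, at the cost of encoding the type in every filename.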

How could we index frontmatter into our db? 2023-03-09

My idea is to have another table for frontmatter, something like:

| file_id | field | value | (maybe) type: array or string |
| --- | --- | --- | --- |
| d9fc09 | title | My new post | string |

file_id should be a foreign key pointing to file._id.

To increase performance, since we are going to have many more rows now, we can create a DB index on this table (using the file_id field).
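
One possible way to produce rows for such a table is to flatten each file's frontmatter into one row per field. The sketch below assumes values are either strings or string arrays, and serialises arrays as JSON text; `FrontmatterRow` and `toFrontmatterRows` are hypothetical names used only for illustration.

```typescript
// Hypothetical row shape mirroring the proposed table columns.
interface FrontmatterRow {
  file_id: string;
  field: string;
  value: string;
  type: "array" | "string";
}

// Assumption: frontmatter values are plain strings or arrays of strings.
type FrontmatterValue = string | string[];

// Produce one row per frontmatter field for the given file.
function toFrontmatterRows(
  fileId: string,
  frontmatter: Record<string, FrontmatterValue>
): FrontmatterRow[] {
  return Object.keys(frontmatter).map((field): FrontmatterRow => {
    const raw = frontmatter[field] ?? "";
    if (Array.isArray(raw)) {
      // Arrays are stored as JSON so they fit in a single text column.
      return { file_id: fileId, field, value: JSON.stringify(raw), type: "array" };
    }
    return { file_id: fileId, field, value: raw, type: "string" };
  });
}

const exampleRows = toFrontmatterRows("d9fc09", {
  title: "My new post",
  tags: ["economy"],
});
```

Storing the original type alongside the text value lets a reader of the table decide whether to parse `value` back into an array.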

If done this way we are going to be able to query mdx files using frontmatter fields. E.g: (may not be exactly this)

MyMdDb.query({ tags: ['economy'], frontmatter: { author: 'João' } })
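
To show how a frontmatter query could resolve against rows of that kind, here is an in-memory sketch. It is not the proposed `MyMdDb.query` API: `IndexedField` and `queryByFrontmatter` are hypothetical, the sample rows are made up, and a real implementation would presumably run SQL against the table instead.

```typescript
// Hypothetical row shape: one row per (file, frontmatter field).
interface IndexedField {
  file_id: string;
  field: string;
  value: string;
}

// Return the ids of files whose rows match every field/value pair in `where`.
// Assumes each (file_id, field) pair appears at most once in `rows`.
function queryByFrontmatter(
  rows: IndexedField[],
  where: Record<string, string>
): string[] {
  const wantedCount = Object.keys(where).length;
  const hits: Record<string, number> = {};
  for (const row of rows) {
    if (where[row.field] === row.value) {
      hits[row.file_id] = (hits[row.file_id] ?? 0) + 1;
    }
  }
  // A file matches only if all requested fields matched.
  return Object.keys(hits)
    .filter((id) => hits[id] === wantedCount)
    .sort();
}

// Made-up sample rows for illustration.
const sampleRows: IndexedField[] = [
  { file_id: "d9fc09", field: "title", value: "My new post" },
  { file_id: "d9fc09", field: "author", value: "João" },
  { file_id: "a1b2c3", field: "author", value: "Ana" },
];

const byAuthor = queryByFrontmatter(sampleRows, { author: "João" });
```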
rufuspollock mentioned this issue Apr 28, 2023
rufuspollock changed the title from "[epic] MarkdownDB" to "[epic] MarkdownDB Index v0.1" Apr 28, 2023
rufuspollock transferred this issue from another repository Apr 28, 2023
rufuspollock changed the title from "[epic] MarkdownDB Index v0.1" to "[epic] MarkdownDB Index and Library v1" Apr 28, 2023
rufuspollock pinned this issue Sep 23, 2023