OWID in-browser semantic search

This is a prototype of a semantic search engine for the OWID website which runs entirely in the browser (from embedding to similarity search).

The goal is to provide a no-infrastructure solution for quick prototyping and experimentation with (semantic) search on the OWID website. It is also a test of how far we can get with only embedding the titles of the site's content (as opposed to the full content).

It uses the transformers.js library to generate embeddings of the site's content. The embeddings are stored alongside some metadata in PGlite, a WASM build of Postgres, with the pgvector extension providing similarity search.
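
As a rough illustration of what the similarity search step computes, here is a minimal sketch in plain TypeScript rather than SQL. It assumes cosine distance (one of the metrics pgvector supports) and uses toy three-dimensional vectors. IndexedEntry, cosineDistance and search are hypothetical names for illustration and are not part of this repo.

```typescript
// Hypothetical shape of an indexed row: a title plus its embedding vector.
interface IndexedEntry {
    title: string
    embedding: number[]
}

// Cosine distance = 1 - cosine similarity.
// 0 means the vectors point in the same direction, 1 means they are orthogonal.
function cosineDistance(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error("dimension mismatch")
    let dot = 0
    let normA = 0
    let normB = 0
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if (normA === 0 || normB === 0) throw new Error("zero-length vector")
    return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Return the `limit` entries closest to the query embedding.
function search(
    query: number[],
    entries: IndexedEntry[],
    limit: number
): IndexedEntry[] {
    return [...entries]
        .sort(
            (x, y) =>
                cosineDistance(query, x.embedding) -
                cosineDistance(query, y.embedding)
        )
        .slice(0, limit)
}
```

In the prototype this ranking is delegated to pgvector inside PGlite. The sketch only shows the idea behind it.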

You can run the search engine locally by following the quick start below, which will use a pre-generated coredump of the site's content.

Alternatively, you can generate a new coredump by following the instructions in "Refresh the coredump.json file".

Quick start

Run npm run dev (or npm run preview).
Visit http://localhost:5173/ (dev) or http://localhost:4173/ (preview).
Click "Generate embeddings". This will take a while.

You can run searches while the embedding generation is in progress, but the results will be incomplete.

Refresh the coredump.json file

Create a new route in mockSiteRouter.ts in the owid-grapher repo and wire it to the makeCoreDump() function in coredump.tsx (a new file; its content is given below). The route generates a lightweight dump of the site's content for the semantic search engine to index.

mockSiteRouter.ts

getPlainRouteWithROTransaction(
    mockSiteRouter,
    "/coredump.json",
    async (_, res, trx) => {
        const dump = await makeCoreDump(explorerAdminServer, trx)
        res.json(dump)
    }
)

coredump.tsx

import {
    BAKED_BASE_URL,
    BAKED_GRAPHER_URL,
} from "../settings/serverSettings.js"
import { dayjs, countries, DbPlainChart, Span } from "@ourworldindata/utils"
import * as db from "../db/db.js"
import urljoin from "url-join"
import { ExplorerAdminServer } from "../explorerAdminServer/ExplorerAdminServer.js"

import { GdocPost } from "../db/model/Gdoc/GdocPost.js"

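// One dump entry. The cast in makeCoreDump() hides title, type and content,
// so they are declared here as optional fields.
interface SitemapUrl {
    loc: string
    lastmod?: string
    title?: string
    type?: string
    content?: string
}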

// Borrowed from sitemap.ts

export const makeCoreDump = async (
    explorerAdminServer: ExplorerAdminServer,
    knex: db.KnexReadonlyTransaction
) => {
    const gdocPosts = await db.getPublishedGdocPosts(knex)

    const publishedDataInsights = await db.getPublishedDataInsights(knex)

    const charts = await db.knexRaw<
        Pick<DbPlainChart, "updatedAt"> & { slug: string; title: string }
    >(
        knex,
        `-- sql
            SELECT c.updatedAt, cc.slug, JSON_UNQUOTE(cc.full->"$.title") as title
            FROM charts c
            JOIN chart_configs cc ON cc.id = c.configId
            WHERE
                cc.full->"$.isPublished" = true
        `
    )

    const dods = await GdocPost.getDetailsOnDemandGdoc(knex)

    let urls = countries.map((c) => ({
        loc: urljoin(BAKED_BASE_URL, "country", c.slug),
        title: `${c.name}`,
        type: "country",
    })) as SitemapUrl[]

    urls = urls
        .concat(
            gdocPosts.map((p) => ({
                loc: urljoin(BAKED_BASE_URL, p.slug),
                title: p.content.title,
                type: "gdoc",
                lastmod: dayjs(p.updatedAt).format("YYYY-MM-DD"),
            }))
        )
        .concat(
            publishedDataInsights.map((d) => ({
                loc: urljoin(BAKED_BASE_URL, "data-insights", d.slug),
                title: `${d.title}`,
                type: "insight",
                lastmod: dayjs(d.updatedAt).format("YYYY-MM-DD"),
            }))
        )
        .concat(
            charts.map((c) => ({
                loc: urljoin(BAKED_GRAPHER_URL, c.slug),
                title: `${c.title}`,
                type: "chart",
                lastmod: dayjs(c.updatedAt).format("YYYY-MM-DD"),
            }))
        )
        .concat(
            Object.keys(dods.details)
                .filter((id) => {
                    // Hack until DoD parsing is fixed: skip entries that lack
                    // either a title block or a body block.
                    return dods.details[id].text.length >= 2
                })
                .map((id) => {
                    return {
                        title: extractTextFromSpans(
                            dods.details[id].text[0].value
                        ),
                        content: extractTextFromSpans(
                            dods.details[id].text[1].value
                        ),
                        type: "dod",
                        loc: urljoin(BAKED_BASE_URL, `dods/${id}`),
                    }
                })
        )

    return urls
}

export function extractTextFromSpans(spans: Span[]): string {
    return spans
        .map((span) => {
            switch (span.spanType) {
                case "span-simple-text":
                    return span.text
                case "span-link":
                case "span-ref":
                case "span-dod":
                case "span-italic":
                case "span-bold":
                case "span-underline":
                case "span-subscript":
                case "span-superscript":
                case "span-quote":
                case "span-fallback":
                    return extractTextFromSpans(span.children)
                case "span-newline":
                    return "\n"
                default:
                    return ""
            }
        })
        .join("")
}
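
To show what extractTextFromSpans() does with nested spans, here is a reduced, self-contained sketch. ToySpan and flattenToySpans are hypothetical stand-ins that cover only three span kinds; the real Span type from @ourworldindata/utils has more variants, as the switch statement above shows.

```typescript
// Hypothetical, reduced span type for illustration only.
type ToySpan =
    | { spanType: "span-simple-text"; text: string }
    | { spanType: "span-bold"; children: ToySpan[] }
    | { spanType: "span-newline" }

// Same recursive idea as extractTextFromSpans(): leaves yield text,
// wrapper spans recurse into their children, newlines become "\n".
function flattenToySpans(spans: ToySpan[]): string {
    return spans
        .map((span) => {
            switch (span.spanType) {
                case "span-simple-text":
                    return span.text
                case "span-bold":
                    return flattenToySpans(span.children)
                case "span-newline":
                    return "\n"
                default:
                    return ""
            }
        })
        .join("")
}

const example: ToySpan[] = [
    {
        spanType: "span-bold",
        children: [{ spanType: "span-simple-text", text: "GDP" }],
    },
    { spanType: "span-simple-text", text: " per capita" },
    { spanType: "span-newline" },
    { spanType: "span-simple-text", text: "adjusted for inflation" },
]

// Formatting is dropped and only the text survives:
// "GDP per capita\nadjusted for inflation"
const flattened = flattenToySpans(example)
```

Stripping the formatting this way leaves plain strings for the title and content fields of each "dod" entry.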

Update worker.ts to fetch the remote coredump.json file instead of the local one:

const response = await fetch("http://localhost:3030/coredump.json");

Run the owid-grapher site locally. This exposes the http://localhost:3030/coredump.json route for this repo to fetch and process.
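
If the fetch in worker.ts needs to be a bit more defensive, something along these lines could work. This is a sketch under assumptions, not the code in this repo: CoreDumpEntry, parseCoreDump and loadCoreDump are hypothetical names, and the entry shape simply mirrors the fields that makeCoreDump() emits.

```typescript
// Hypothetical entry shape, mirroring the fields makeCoreDump() emits.
interface CoreDumpEntry {
    loc: string
    title: string
    type: string
    lastmod?: string
    content?: string
}

// Keep only entries that have the fields needed for indexing.
// Entries without a string title are dropped, since the title is what gets embedded.
function parseCoreDump(json: unknown): CoreDumpEntry[] {
    if (!Array.isArray(json)) {
        throw new Error("coredump.json should contain an array")
    }
    return json.filter(
        (e): e is CoreDumpEntry =>
            typeof e === "object" &&
            e !== null &&
            typeof e.loc === "string" &&
            typeof e.title === "string" &&
            typeof e.type === "string"
    )
}

// Fetch the dump from the locally running owid-grapher site and validate it.
async function loadCoreDump(url: string): Promise<CoreDumpEntry[]> {
    const response = await fetch(url)
    if (!response.ok) {
        throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`)
    }
    return parseCoreDump(await response.json())
}
```

A call would then look like loadCoreDump("http://localhost:3030/coredump.json"). Failing early on a non-OK response makes it easier to notice that the owid-grapher site is not running.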

Explorers, standalone pages, topic country profiles, and author pages are not indexed in this experiment.

Inspired by https://github.com/huggingface/transformers.js-examples/tree/main/pglite-semantic-search
