Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: full text search for pubs #919

Merged
merged 15 commits into from
Feb 6, 2025
Merged

feat: full text search for pubs #919

merged 15 commits into from
Feb 6, 2025

Conversation

tefkah
Copy link
Member

@tefkah tefkah commented Feb 3, 2025

Issue(s) Resolved

High-level Explanation of PR

This PR adds, in order

  1. A searchVector TSVECTOR column on the pubs database
  2. A trigger which sets this column to a combination of the title and the pubvalues anytime the pubvalues are updated
  3. A function which uses this to look for pubs using ts_query
  4. An API endpoint which allows any logged in user to call this function
  5. A client component which is callable with CMD+K which calls this function on search.

Rough explanation of postgres FTS

I'd recommend this excellent video of the great Ben Awad, will do a much better job of explaining how it works than I or the postgres docs can: https://www.youtube.com/watch?v=szfUbzsKvtE&t=362s

But here's a shot at it:

Postgres can create a vectorized representation of a bunch of stuff with it's to_tsvector function/ TSVECTOR column.
It will basically create small, weighted "stubs" of words, which are easy to compare. eg it will unconjugate verbs and unpluralize nouns, so running->run: and ducks->duck:, roughly. You can then assign weights to certain parts of those vectors, which helps in ranking.

Then postgres does some magic and you can search it with this syntax

SELECT * FROM 'pubs'
WHERE to_tsvector('english', title, 'A') @@ $1 
-- 'some query'

very weird syntax, but it works really well.

Doing to_tsvector every query is very slow though, so I created a column on the pubs table called searchVector, and added an index and a trigger that updates it. This makes like a 500ms -> 1ms diff in perf.

Then I made some search thingy using shadcn's Command thing, which uses it.

That's basically it! Much easier than I thought it would be.

Downsides

Absolutely no permission filtering!
Should definitely be done!
But even if we filter permissions, we won't be find related pubs the user has permissions to through the parent pub!

Maybe kind of slow in the long term, but the query can definitely be optimized a bunch.

No typed return responses (should do that)

Test Plan

  1. Go to arcadia

  2. CMD+K

  3. Look up something latin-y, like vilis.

  4. See a lot of search results!

  5. Navigate to search result on keyboard

  6. Enter

  7. Be taken to pub!

  8. Modify a pub, ideally it's title

  9. Check you are able to find it with this new title

Screenshots (if applicable)

Screenshare.-.2025-02-03.7_00_20.PM.mp4
Screenshare.-.2025-02-03.7_05_33.PM.mp4

Notes

@tefkah tefkah marked this pull request as ready for review February 3, 2025 18:19
Comment on lines +58 to +69
useEffect(() => {
const down = (e: KeyboardEvent) => {
if (e.key === "k" && (e.metaKey || e.ctrlKey)) {
e.preventDefault();
onOpenChange?.(true);
setOpen(true);
}
};

document.addEventListener("keydown", down);
return () => document.removeEventListener("keydown", down);
}, [onOpenChange]);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may be done more elegantly, we should probably have some global thing for this, but for now this is fine i think

Comment on lines +83 to +91
<CommandEmpty className="flex w-full flex-col items-center p-6">
{isFetching ? (
<Loader2 className="h-4 w-4 animate-spin" />
) : isError ? (
<AlertCircle className="h-4 w-4" />
) : (
"No results found"
)}
</CommandEmpty>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not very elegant, lots of states not well represented

@@ -5,7 +5,7 @@ import type { GetPubsResult } from "./server";
export type PubTitleProps = {
title?: string | null;
createdAt: Date;
values:
values?:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need the values for this technically

Comment on lines +43 to +73
Title: "Machine Learning Fundamentals",
Description: "An introduction to basic ML concepts",
Content: "This article covers neural networks and deep learning basics.",
Tags: "AI, ML, Technology",
},
},
{
pubType: "Basic Pub",
values: {
Title: "Deep Learning Applications",
Description: "Advanced applications of deep learning",
Content: "Exploring real-world uses of neural networks in computer vision.",
Tags: "AI, Deep Learning, CV",
},
},
{
pubType: "Basic Pub",
values: {
Title: "Web Development Guide",
Description: "Complete guide to modern web development",
Content: "Learn about React, TypeScript, and Node.js fundamentals.",
Tags: "Web, JavaScript, Programming",
},
},
{
pubType: "Basic Pub",
values: {
Title: "Database Design Patterns",
Description: "Essential patterns for database architecture",
Content: "Understanding SQL and NoSQL database design principles.",
Tags: "Database, SQL, Architecture",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely wrote these myself

Comment on lines +2 to +5
ALTER TABLE "pubs" ADD COLUMN "searchVector" tsvector;

-- CreateIndex
CREATE INDEX "pubs_searchVector_idx" ON "pubs" USING GIN ("searchVector");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this adds the column and the idx (note: we should add more indices)

Comment on lines +57 to +71
WITH updates AS (
SELECT
affected."pubId",
CASE
WHEN tmp."pubId" IS NULL THEN pubs."title"
WHEN TG_OP = 'DELETE' OR tmp."value" IS NULL THEN NULL
ELSE tmp."value"
END AS new_title
FROM (
SELECT DISTINCT "pubId"
FROM inserted_updated_deleted_rows
) AS affected
LEFT JOIN tmp_affected_pubs tmp ON tmp."pubId" = affected."pubId"
JOIN pubs ON pubs.id = affected."pubId"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just slightly rewriting this as a CTE, st i can use that title in the search vector as well

UPDATE
"pubs"
SET
"updatedAt" = CURRENT_TIMESTAMP,
"title" = CASE
-- this is what's changed from the earlier function
WHEN tmp."pubId" IS NULL THEN
pubs."title"
WHEN TG_OP = 'DELETE' OR tmp."value" IS NULL THEN
NULL
ELSE
tmp."value"
END
FROM ( SELECT DISTINCT
"pubId"
FROM
inserted_updated_deleted_rows
) AS affected
LEFT JOIN tmp_affected_pubs tmp ON tmp."pubId" = affected."pubId"
WHERE
"pubs"."id" = affected."pubId";
RETURN NULL;

Comment on lines +72 to +87
UPDATE "pubs"
SET
"updatedAt" = CURRENT_TIMESTAMP,
"title" = updates.new_title,
-- we weight the searchVector based on the title and its values
"searchVector" = (
SELECT
setweight(to_tsvector('english', COALESCE(updates.new_title, '')), 'A') ||
setweight(to_tsvector('english', COALESCE(
(SELECT string_agg(CAST(value #>> '{}' AS TEXT), ' ')
FROM pub_values
WHERE "pubId" = updates."pubId"),
'')), 'B')
)
FROM updates
WHERE "pubs"."id" = updates."pubId";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this, very naively, just sets the search vector as

(with A priority) title
(with B priority) `${pub.Values.join(' ')}` 

we could do this more sophisticatedly, ignore all the values we don't want to look up (maybe only add stringlike fields), join with values we do want to look up (add the user info instead of the memberId for instance), or give certain fields different priorities (rich text fields should probably have different priorities in the future)

but for now, i think this is fiiine

"flex h-full w-full flex-col overflow-hidden rounded-md bg-white text-gray-950",
"flex h-full w-full flex-col overflow-hidden bg-white text-gray-950",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting: for some reason this "rounded" caused the whole command list to be blurry

hngngn/shadcn-solid#100

Comment on lines +58 to +61
<DialogTitle className={cn(showTitle || "sr-only")}>{title}</DialogTitle>
<DialogDescription className={cn(showDescription || "sr-only")}>
{description}
</DialogDescription>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are necessary for accessibilty, otherwise you get a bunch of errors in the console/in dev

Copy link
Member

@3mcd 3mcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels great to use. Thanks for all the code comments and great job on those tests too!

@3mcd 3mcd merged commit 1cbb099 into main Feb 6, 2025
6 checks passed
@3mcd 3mcd deleted the tfk/fts branch February 6, 2025 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants