Inverted index for keyword search within bigger text fields #7

Open
neeraj9 opened this issue Feb 15, 2019 · 0 comments

neeraj9 (Owner) commented Feb 15, 2019

Let's say you have an attribute within your data that is a blob of text. At present, searching through that text in egraphdb requires a full scan, which is impractical. To avoid that, how about creating a simple inverted index from the text and making it keyword-searchable?

Potential Steps:

  • Tokenize
  • Drop common words and retain only the useful ones. For example, create another table that holds such words, which egraphdb can then load into memory for quick access.
  • Simple spelling correction would be useful too.
  • Store multiple rows {keyword, sourceid} for the same data within the index table for a particular attribute. You could then run, say, "select count(keyword),sum(count),sourceid from xyz where keyword in ('a', 'b') group by sourceid limit 10000". This is just a suggestion and not a strong rule.
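
The steps above could be sketched roughly as follows. This is only an illustration in Python, not egraphdb code: the names `STOP_WORDS`, `VOCABULARY`, `tokenize`, `correct` and `index_rows` are hypothetical, the stop-word and vocabulary sets are tiny placeholders for what would be loaded from a table, and spelling correction is approximated with the standard library's `difflib.get_close_matches`.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical stop-word set; the proposal is to load such words from a table.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}

# Hypothetical list of known keywords used as the spelling-correction target.
VOCABULARY = ["database", "graph", "index", "search"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def correct(token):
    """Map a token to its closest vocabulary entry, or keep it unchanged."""
    if token in VOCABULARY:
        return token
    matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token


def index_rows(source_id, text):
    """Return sorted (keyword, source_id, count) rows for one text blob."""
    tokens = [correct(t) for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter(tokens)
    return sorted((keyword, source_id, n) for keyword, n in counts.items())


rows = index_rows("n1", "The graph databse is a graph index")
for row in rows:
    print(row)
```

Each resulting row corresponds to one entry in the index table, so a blob of text becomes as many rows as it has distinct useful keywords.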

Sample table:

create table `egraph_lookup_rindex_base` (
  `key_data` varbinary(255) NOT NULL,
  `id` binary(8) NOT NULL,
  `count` int NOT NULL COMMENT 'number of occurrences of keyword in id',
  CONSTRAINT pkey PRIMARY KEY (`id`, `key_data`),
  KEY `key_data` (`key_data`),
  KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
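
To show how a lookup against such a table might behave, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for MySQL. The column types are simplified to text, the `search` helper and the sample rows are hypothetical, and the `ORDER BY` clause is an assumption added so that sources matching more keywords come first; the suggested query itself does not specify an ordering.

```python
import sqlite3


def search(conn, keywords, limit=10000):
    """Return (id, matched_keywords, total_occurrences) per source id."""
    if not keywords:
        return []
    placeholders = ", ".join("?" for _ in keywords)
    sql = f'''
        SELECT id, COUNT(key_data) AS matched, SUM("count") AS occurrences
        FROM egraph_lookup_rindex_base
        WHERE key_data IN ({placeholders})
        GROUP BY id
        ORDER BY matched DESC, occurrences DESC, id
        LIMIT ?
    '''
    return conn.execute(sql, (*keywords, limit)).fetchall()


# In-memory table with simplified column types (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute('''
    CREATE TABLE egraph_lookup_rindex_base (
        key_data TEXT NOT NULL,
        id TEXT NOT NULL,
        "count" INTEGER NOT NULL,
        PRIMARY KEY (id, key_data)
    )
''')
conn.execute("CREATE INDEX key_data_idx ON egraph_lookup_rindex_base (key_data)")

# Made-up sample rows: (keyword, source id, occurrences).
conn.executemany(
    "INSERT INTO egraph_lookup_rindex_base VALUES (?, ?, ?)",
    [("graph", "n1", 2), ("index", "n1", 1), ("graph", "n2", 1), ("search", "n3", 4)],
)

results = search(conn, ["graph", "index"])
for row in results:
    print(row)
```

The secondary index on `key_data` is what lets the `WHERE key_data IN (...)` filter avoid scanning every row, which is the point of the proposal.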