Inverted index for keyword search within bigger text fields #7

neeraj9 · 2019-02-15T07:07:01Z

Let's say you have an attribute within your data which is a blob of text. Searching through that text currently requires a full scan in egraphdb, which is impractical. To avoid the full scan, how about creating a simple inverted index from that text and making it keyword-searchable?

Potential Steps:

1. Tokenize the text.
2. Drop common words and retain only the useful ones. The common words could live in another table, which egraphdb can then load into memory for quick access.
3. Simple spelling correction would be useful too.
4. Store one {keyword, sourceid} row per keyword in the index table for a particular attribute, so the same data yields multiple rows. You could then run something like "select count(keyword),sum(count),sourceid from xyz where keyword in ('a', 'b') group by sourceid limit 10000". This is just a suggestion and not a strong rule.
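
The steps above can be sketched roughly as follows. This is only an illustration in Python, not egraphdb code: the stop-word set, the `tokenize`, `correct` and `index_rows` helpers, and the small vocabulary used for spelling correction are all assumptions made for the example. Spelling correction here uses the standard library's `difflib.get_close_matches`, which is just one simple way to do it.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical stop-word list; the proposal is to load these from a table.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def correct(token, vocabulary):
    """Map a token to its closest known word, or return it unchanged."""
    matches = get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else token


def index_rows(source_id, text, vocabulary=()):
    """Return sorted (keyword, source_id, count) rows for one text blob."""
    tokens = (correct(t, vocabulary) for t in tokenize(text))
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return sorted((kw, source_id, n) for kw, n in counts.items())


# Assumed sample data: one document with a misspelled keyword ("serch").
rows = index_rows(
    "doc1",
    "The graph is a serch index for the graph",
    vocabulary=["search", "graph", "index"],
)
for row in rows:
    print(row)
```

Each resulting tuple corresponds to one row of the index table, with the count column holding the number of occurrences of the keyword in that source.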

Sample table:

create table `egraph_lookup_rindex_base` (
  `key_data` varbinary(255) NOT NULL,
  `id` binary(8) NOT NULL,
  `count` int NOT NULL COMMENT 'number of occurrences of the keyword in id',
  CONSTRAINT pkey PRIMARY KEY (`id`, `key_data`),
  KEY `key_data` (`key_data`),
  KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
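
To see how a keyword lookup against such a table might behave, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. The table name `rindex`, the text column types, the sample rows, the `search` helper, and the `ORDER BY` ranking are all assumptions for illustration; only the column names and the count/sum/group-by shape of the query follow the suggestion above.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL index table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rindex (
        key_data TEXT NOT NULL,
        id TEXT NOT NULL,
        "count" INTEGER NOT NULL,
        PRIMARY KEY (id, key_data)
    )
    """
)
# Secondary index so lookups by keyword do not scan the whole table.
conn.execute("CREATE INDEX rindex_key_data ON rindex (key_data)")

# Assumed sample rows: (keyword, source id, occurrences).
conn.executemany(
    "INSERT INTO rindex VALUES (?, ?, ?)",
    [
        ("graph", "doc1", 2),
        ("index", "doc1", 1),
        ("graph", "doc2", 1),
        ("erlang", "doc3", 4),
    ],
)


def search(keywords, limit=10000):
    """Return (id, matched_keywords, total_occurrences) per matching source."""
    placeholders = ", ".join("?" for _ in keywords)
    sql = (
        'SELECT id, COUNT(key_data) AS matched, SUM("count") AS hits '
        f"FROM rindex WHERE key_data IN ({placeholders}) "
        "GROUP BY id ORDER BY matched DESC, hits DESC, id LIMIT ?"
    )
    return conn.execute(sql, [*keywords, limit]).fetchall()


results = search(["graph", "index"])
for row in results:
    print(row)
```

Sorting by the number of matched keywords and then by total occurrences is one possible ranking, not part of the original suggestion; the keywords are passed as bound parameters rather than interpolated into the SQL string.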
