Releases: mikegoatly/lifti
v6.3.0
V6.2.0
Very minor update. I realised while writing some more detailed documentation to explain query plans that the execution plan node kind CompositePositionalIntersect was technically identical to PositionalIntersect. As such, I've obsoleted it for removal in the next major version.
V6.1.0
Adds support for obtaining query execution plans for queries (#110)
The Blazor demo application shows the query execution plans generated for queries executed against it.
Technically breaking
These are very unlikely to cause an issue (if they do, please let me know):
New method overloads on IFullTextIndex/FullTextIndex:
Search(IQuery! query, QueryExecutionOptions options = QueryExecutionOptions.None)
Search(string! searchText, QueryExecutionOptions options = QueryExecutionOptions.None)
New method on ISearchResults:
ISearchResults.GetExecutionPlan() -> Lifti.QueryExecutionPlan
V6.0.1
Note: v6.0.0 was only available for a few minutes due to a NuGet publishing error. v6.0.1 should be considered the first official v6 release.
There are a couple of breaking changes in this release, most of which are due to renaming of types. Some guidance can be found below for how to deal with them.
New features
- Score boosting!
  - Score boosting as part of a query - grand^3 will boost the score of words matching "grand".
  - Boosting of object fields - .WithField("Name", c => c.Name, scoreBoost: 1.5D).
  - Boosting object scores based on a freshness date, e.g. the date it was last updated.
  - Boosting object scores based on a magnitude value, e.g. a star rating.
- Custom stemmers
- Characters can now be escaped in LIFTI queries and field names in LIFTI queries can contain spaces.
- Enhanced query execution logic
- Removed dependency on System.Collections.Immutable - only the netstandard2 version of the library now pulls in any dependencies. For net6 to net8, only built-in types are used.
Performance increases
There was a significant amount of work done to improve performance and memory usage of building an index, index (de)serialization and searching.
All tests were run with BenchmarkDotNet:
BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22631.3007)
Intel Core i7-1065G7 CPU 1.30GHz, 1 CPU, 8 logical and 4 physical cores
The results below are a comparison of the previous v5 version of LIFTI against the code in the v6.0.0 branch, running on .NET 8.
Index construction
Populating an index with 200 Wikipedia entries in a single batch
v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
---|---|---|---|
1,134.2 | 567,623.8 | 952.6 | 286,617.6 |
Populating each of the 200 Wikipedia entries one at a time (i.e. a new snapshot created after each document)
v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
---|---|---|---|
4,284.4 | 1,370,649.9 | 1,212.4 | 613,540.2 |
Searching
Lots of individual optimisations including:
- Merge sorting results during unions and intersections for queries containing more than one part
- Optimised collection of affected results during wildcard and fuzzy match query parts
- Early application of field filters when matching results
- Weighting of query parts to analyse optimal execution order so that documents can be eliminated from collection in other parts of the query.
make for some nice gains for various query types.
Query | v5 Mean (μs) | v5 Allocated (KB) | v6 Mean (μs) | v6 Allocated (KB) |
---|---|---|---|---|
"also has a" | 169.74 | 379.19 | 52.71 | 122.97 |
(confiscation & th*) \| "and they" | 1,203.69 | 1,557.29 | 105.23 | 185.02 |
* | 193,333.07 | 103,612.99 | 62,298.80 | 13,152.30 |
?and ?they ?also | 1,725.66 | 1,658.12 | 439.60 | 243.45 |
and \| they | 417.70 | 819.98 | 104.23 | |
and ~ they | 132.89 | 294.22 | 42.20 | 95.61 |
and ~10> they | 132.64 | 297.67 | 43.34 | 97.04 |
and > they | 214.03 | 455.75 | 106.16 | 169.17 |
and they also | 283.82 | 565.34 | 56.02 | 109.51 |
co*on | 445.27 | 798.77 | 180.04 | 263.47 |
con??* | 2.21 | 2.30 | 1.96 | 1.97 |
confiscation | 4.03 | 2.70 | 3.66 | 2.29 |
th* | 2,277.00 | 2,914.76 | 569.76 | 412.60 |
Title=?great | 416.08 | 399.17 | 108.86 | 34.50 |
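To illustrate the first optimisation in the list above, here is a minimal sketch of a merge-style intersection over two ascending lists of document ids. It is not LIFTI's implementation: IntersectSorted and the sample ids are hypothetical, and the real code works on richer match structures than plain integers. The point is only that when both inputs are already sorted, an intersection needs a single linear pass and no hash set.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the LIFTI API.
// Intersects two ascending lists of document ids in a single pass.
static List<int> IntersectSorted(IReadOnlyList<int> left, IReadOnlyList<int> right)
{
    var result = new List<int>();
    int i = 0, j = 0;
    while (i < left.Count && j < right.Count)
    {
        if (left[i] == right[j])
        {
            // Present on both sides: keep it and advance both cursors.
            result.Add(left[i]);
            i++;
            j++;
        }
        else if (left[i] < right[j])
        {
            i++; // The left id cannot appear later on the right, so skip it.
        }
        else
        {
            j++; // Likewise for the right id.
        }
    }

    return result;
}

// Made-up document ids for the two halves of a query such as: and & they
var andDocs = new[] { 1, 3, 4, 7, 9 };
var theyDocs = new[] { 3, 4, 8, 9 };
Console.WriteLine(string.Join(", ", IntersectSorted(andDocs, theyDocs))); // 3, 4, 9
```

A union can be built with the same two-cursor walk by emitting the smaller id at each step instead of skipping it.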
Deprecated:
ItemMetadata.Item / DocumentMetadata.Item -> use Key property
IFullTextIndex.Items -> use Metadata property
FullTextIndexBuilder.WithDuplicateItemBehavior -> use WithDuplicateKeyBehavior method
IndexOptions.DuplicateItemBehavior -> use DuplicateKeyBehavior property
ScoredToken.ItemId -> use DocumentId property
QueryTokenMatch.ItemId -> use DocumentId property
ItemMetadata.Count
-> IndexMetadata.DocumentCount
ItemMetadata.GetMetadata
-> IndexMetadata.GetDocumentMetadata
Technically breaking
IdPool and IIdPool are now internal - these weren't really exposed before anyway
Removed interface IItemMetadata - just using DocumentMetadata going forwards
QueryContext no longer has ApplyTo method
IIndexNavigator: added Snapshot property
IIndexNavigator: added overloads for GetExactMatches and GetExactAndChildMatches that allow for the current QueryContext to be passed in so unnecessary results are not collected.
IIndexNavigator: new additional methods AddExactMatches and AddExactAndChildMatches that allow you to efficiently collect matches using a DocumentMatchCollector before converting it to an IntermediateQueryResult.
IQueryPart now has double CalculateWeighting(Func<IIndexNavigator> navigatorCreator) method to help the query processing logic evaluate the most efficient order of execution.
TItem generic type parameter name has been renamed to TObject.
All query part types are now sealed
New method IIndexNavigator.ExactMatchCount()
IntermediateQueryResult constructors are no longer public
Index serialization interfaces have been reworked. This shouldn't affect anyone because it was technically impossible to write your own serializers based upon them due to a lack of publicly accessible methods for rehydrating an index.
IIndexNavigatorBookmark now implements IDisposable - you don't technically have to dispose it, but doing so will return it to a pool and allow it to be reused.
Querying changes
ScoredFieldMatch is now quite different and no longer publicly constructable. The only place you would have encountered this is in a custom scorer, and that's no longer necessary.
Several types that are only likely to have been used internally are gone:
FieldMatch
QueryTokenMatch
CompositeTokenMatchLocation
SingleTokenMatchLocation
ITokenLocationMatch
TokenLocationMatch
Breaking
DuplicateItemBehavior enum -> renamed to DuplicateKeyBehavior
DuplicateItemBehavior.ReplaceItem -> use DuplicateKeyBehavior.Replace instead
IQueryContext -> Just use concrete QueryContext; this affects IQueryPart.Evaluate as it now takes QueryContext
IIndexNodeFactory.CreateNode now takes concrete types ChildNodeMap and DocumentTokenMatchMap instead of ImmutableDictionary and ImmutableList respectively.
A maximum of 31 different object types can now be configured against a single FullTextIndexBuilder (i.e. 31 distinct calls to WithObjectTokenization) - if anyone is actually indexing more than 31 object types, I'd be very interested to understand your scenario!
The rest of these will only affect you if you are explicitly referencing the type names in your code:
ItemPhrases -> renamed to DocumentPhrases
ItemMetadata -> renamed to DocumentMetadata
IItemStore -> renamed to IIndexMetadata
v5.0.0
New features in v5.0.0
- Dynamic fields
- More detailed field information
- Smaller binary serialized files
Acknowledgements
Thanks to @kampilan and @h0lg for their thoughts on the design for dynamic fields!
Dynamic fields
v5.0.0 introduces support for dynamic fields, where fields are dynamically registered with the index as it is populated:
var index = new FullTextIndexBuilder<int>()
    .WithObjectTokenization<Customer>(o => o
        .WithKey(c => c.Id)
        .WithDynamicFields("Tags", c => c.TagDictionary, "Tag_")
        .WithDynamicFields(
            "Questions",
            c => c.Questions,
            q => q.QuestionName,
            q => q.QuestionResponse,
            "Question_")
    )
    .Build();
Indexing this object against the index:
new Customer
{
    TagDictionary = new Dictionary<string, string>
    {
        { "Foo", "Some text here" }
    },
    Questions = new List<Question>
    {
        new Question { QuestionName = "FavoriteColor", QuestionResponse = "My favorite color is blue" }
    }
}
Will cause two fields to be registered with text:
Tag_Foo -> "Some text here"
Question_FavoriteColor -> "My favorite color is blue"
More detailed field information
The FieldLookup property of an index now provides additional information about fields.
Smaller binary serialized files
The binary serializer has been rewritten to support dynamic fields. In addition, it now writes integers in a variable-length encoding, using as few bytes as possible. Indexes serialized with the new approach are about 30-50% of the size produced by the old serializer.
Old serialized versions of the index can still be read, as long as the index builder definition remains unchanged.
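For readers unfamiliar with variable-length integers, the sketch below shows one common scheme: 7 data bits per byte, with the high bit flagging that more bytes follow. This is an assumption for illustration, since the release notes do not describe LIFTI's actual byte layout, and EncodeVarInt and DecodeVarInt are hypothetical names rather than library methods. It shows why small values, which dominate an index, end up taking one byte instead of four.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers for illustration; LIFTI's real wire format may differ.
// Emits the value 7 bits at a time, least significant group first.
static byte[] EncodeVarInt(uint value)
{
    var bytes = new List<byte>();
    while (value >= 0x80)
    {
        // Low 7 bits plus a continuation flag in the high bit.
        bytes.Add((byte)((value & 0x7F) | 0x80));
        value >>= 7;
    }

    bytes.Add((byte)value); // Final byte: high bit clear.
    return bytes.ToArray();
}

// Reverses EncodeVarInt by accumulating 7-bit groups until the flag is clear.
static uint DecodeVarInt(byte[] data)
{
    uint value = 0;
    int shift = 0;
    foreach (var b in data)
    {
        value |= (uint)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
        {
            break;
        }

        shift += 7;
    }

    return value;
}

foreach (var n in new uint[] { 5, 300, 70000 })
{
    var encoded = EncodeVarInt(n);
    Console.WriteLine($"{n} -> {encoded.Length} byte(s), round-trips to {DecodeVarInt(encoded)}");
}
```

With this scheme 5 takes one byte, 300 takes two and 70000 takes three, compared with four bytes each for a fixed-width 32-bit integer.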
Breaking Changes
None of these should affect you unless you're doing something really off-the-wall and unexpected.
IIndexedFieldLookup has new methods on it, IsKnownField and AllFieldNames.
IndexedFieldDetails has changed from being a struct to an abstract class and no longer implements IEquatable<IndexedFieldDetails>.
The IndexedFieldLookup class is now internal.
v4.0.1
v4.0.0
New features:
Phrase extraction from search result #57
Use the new CreateMatchPhrasesAsync methods on search results returned by the index to produce the set of matched phrases by combining them with the original source text:
foreach (var result in await results.CreateMatchPhrasesAsync(i => books.First(x => x.BookId == i)))
{
    Console.WriteLine($"{result.SearchResult.Key} ({result.SearchResult.Score})");
    foreach (var fieldPhrase in result.FieldPhrases)
    {
        Console.Write($" {fieldPhrase.FoundIn}: ");
        Console.WriteLine(string.Join(", ", fieldPhrase.Phrases.Select(x => $"\"{x}\"")));
    }
}
Index thesaurus #63
Define synonym, hyponym and hypernym relationships between words, so that searches can be performed against words that were not in the original source text.
var index = new FullTextIndexBuilder<int>()
    .WithDefaultThesaurus(o => o
        .WithSynonyms("big", "large")
        .WithHyponyms("dog", "poodle", "beagle"))
    .Build();
Ignoring characters during tokenization #59
Configures the tokenizer to ignore certain characters as it is parsing input.
var index = new FullTextIndexBuilder<int>()
    .WithDefaultTokenization(o => o
        .IgnoreCharacters('<', '>')
    )
    .Build();
Performance improvements
Enforcing the order of matched token locations while processing queries has allowed a couple of optimisations when merging the results of some query parts. You will notice a small perf bump when using operators that require positional matching of words, e.g. sequential words "word1 word2", preceding words word1 ~> word2 and near words word1 ~ word2.
And a new logo!
It looks a bit more professional to have a logo when looking for LIFTI in nuget, so here it is:
Multi-targeted platforms
From the v4 release, the LIFTI package will multi-target different platforms:
Behavioral changes
For fuzzy matching queries, maxEditDistance now defaults to termLength / 3 (it was termLength / 2). This provides better matches out of the box.
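As a quick worked example of what the new default means in practice, the sketch below evaluates both formulas for a few term lengths. It assumes plain integer division and no additional clamping, neither of which is stated in these notes, and OldDefaultMaxEdits and NewDefaultMaxEdits are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helpers for illustration; integer division is an assumption.
static int OldDefaultMaxEdits(string term) => term.Length / 2; // default before v4
static int NewDefaultMaxEdits(string term) => term.Length / 3; // default from v4

foreach (var term in new[] { "cat", "search", "confiscation" })
{
    Console.WriteLine(
        $"{term} ({term.Length} chars): was {OldDefaultMaxEdits(term)}, now {NewDefaultMaxEdits(term)}");
}
```

Under these assumptions a 12-character term such as "confiscation" drops from 6 permitted edits to 4, which is where the tighter matching comes from. An explicit maxEditDistance on the query is unaffected, as the change only concerns the default.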
Breaking changes
Most of these shouldn't affect the common usage patterns for LIFTI if you're just using out-of-the-box features. The ones to watch for are the change of return type from IFullTextIndex.Search and the change to the IQueryParser interface if you're implementing your own query parser.
Async methods
All async methods can now be passed an optional CancellationToken. This has primarily had an effect on the IFullTextIndex interface.
Overloads have been introduced where async delegates can optionally be provided with the CancellationToken during index building:
- Reading field text asynchronously
- Index modification actions (WithIndexModificationAction)
ITokenizer
- ITokenizer renamed to IIndexTokenizer to improve differentiation between tokenization for indexes and queries
- New method: IIndexTokenizer: IsSplitCharacter(char character)
- Return type of Process changed from IReadOnlyList<Token> to IReadOnlyCollection<Token>
IFullTextIndex
- IFullTextIndex implements new interface IIndexTokenizerProvider
- Search methods return a new interface ISearchResults<T>. The interface implements IEnumerable though, so impact should be limited. This allows for the new CreateMatchPhrases method.
- Added property IThesaurus DefaultThesaurus to expose the default thesaurus for the index.
- BeginBatchChange and CommitBatchChangeAsync were implemented by FullTextIndex - they've been added to the interface for parity.
IQueryParser
IQueryParser.Parse signature changed from:
IQuery Parse(IIndexedFieldLookup fieldLookup, string queryText, ITokenizer tokenizer)
to
IQuery Parse(IIndexedFieldLookup fieldLookup, string queryText, IIndexTokenizerProvider tokenizerProvider)
This change allows you to access the tokenizers for different fields as well as the default tokenizer for the index, which is all that was accessible previously. Custom query parsers can be fixed up by using var tokenizer = tokenizerProvider.DefaultTokenizer to get the default index tokenizer.
IIndexNavigator
- New method overload IIndexNavigator.Process(string)
v3.5.2
v3.5.1
v3.5.0
- Fixed an issue where fuzzy match search results weren't honoring field filters - queries such as title=?great will now only return fuzzy matched results on the required field (in this example, title), as would be expected.
- Added an extension method IFullTextIndex.ParseQuery as a convenience wrapper around IFullTextIndex.QueryParser.Parse to save passing in the required dependencies from the index itself.
- Added ToString on Query - this is useful when you want to get a textual representation of the query itself. Previously you had to call Query.Root.ToString(), which wasn't very discoverable.