This repository has been archived by the owner on Sep 3, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add simple fragmentation scheme (#7)
* Remove duplication of docs in database
* Add simple fragmentation scheme
* Update code for better clarity
* Add recombine function
* Add test suite for fragmentation scheme
* Add tests for recombine
* Fix pattern matching in HexClient.get_releases
* Update search.add mix task to work with fragmentation
* Add handling for fragmentation in web view
* Run formatter
* Add sorting for recombining doc fragments
* Use OptionParser instead of homebrew parsing
* Add a new edge case to fragmentation tests
* Overhaul FragmentationScheme.split to only build output binaries at the end of processing and remove Regex.run
* Run formatter
* Remove unnecessary tail recursion
* Overhaul the compute_splits function once more
* Add docs to FragmentationString.recombine/1

Co-authored-by: Jonatan Kłosko <[email protected]>

* Change formatting

Co-authored-by: Jonatan Kłosko <[email protected]>

---------

Co-authored-by: Jonatan Kłosko <[email protected]>
1 parent 6350441, commit 9da417c
Showing 12 changed files with 256 additions and 43 deletions.
@@ -0,0 +1,62 @@
+defmodule Search.FragmentationScheme do
+  @doc """
+  Splits a binary into multiple binaries that satisfy limitations specified by opts.
+  If possible, splits the text on whitespace to preserve words. If that is impossible, splits text in between graphemes.
+  Supported options:
+    * `:max_size` - maximum byte_size of the output binaries. The output binaries may have size less or equal to that
+      value, which also should guarantee the sequence length after tokenization will be bounded by this value.
+  """
+  def split(text, opts \\ [])
+  def split("", _opts), do: []
+
+  def split(text, opts) when is_binary(text) do
+    case Keyword.get(opts, :max_size) do
+      nil ->
+        [text]
+
+      max_size ->
+        text
+        |> compute_splits(max_size, 0, nil, [])
+        |> split_binary(text)
+    end
+  end
+
+  @doc """
+  Recreates the original text from a list of chunks.
+  """
+  def recombine(chunks), do: Enum.join(chunks)
+
+  defp split_binary([], ""), do: []
+
+  defp split_binary([split_size | splits_tail], string) do
+    <<chunk::binary-size(^split_size), rest::binary>> = string
+    [chunk | split_binary(splits_tail, rest)]
+  end
+
+  defp compute_splits("", _, size, _, sizes), do: Enum.reverse(sizes, [size])
+
+  defp compute_splits(string, max_size, size, size_until_word, sizes) do
+    {grapheme, string} = String.next_grapheme(string)
+    grapheme_size = byte_size(grapheme)
+
+    if size + grapheme_size > max_size do
+      if size_until_word do
+        # Split before the current unfinished word
+        next = size - size_until_word
+        compute_splits(string, max_size, next + grapheme_size, nil, [size_until_word | sizes])
+      else
+        # The current chunk has a single word, just split it
+        compute_splits(string, max_size, grapheme_size, nil, [size | sizes])
+      end
+    else
+      new_size = size + grapheme_size
+      size_until_word = if whitespace?(grapheme), do: new_size, else: size_until_word
+      compute_splits(string, max_size, new_size, size_until_word, sizes)
+    end
+  end
+
+  defp whitespace?(grapheme), do: grapheme =~ ~r/\s/
+end
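The `split/2` docstring promises chunks no larger than `:max_size` that are cut only between graphemes, and `recombine/1` implies that joining the chunks gives back the original text. A minimal sketch of a checker for that contract follows. `FragmentationCheck` and `valid_split?/3` are hypothetical names that are not part of this commit, and `String.valid?/1` only confirms that no code point was cut in half, which is weaker than a full grapheme-boundary check.

```elixir
defmodule FragmentationCheck do
  # Hypothetical helper for illustration; it is not part of the commit.
  # Returns true when `chunks` is an acceptable split of `text`:
  #   * joining the chunks reproduces the text exactly,
  #   * every chunk is at most `max_size` bytes,
  #   * every chunk is valid UTF-8 (no code point was cut in half).
  def valid_split?(text, chunks, max_size)
      when is_binary(text) and is_list(chunks) and is_integer(max_size) do
    lossless = Enum.join(chunks) == text
    bounded = Enum.all?(chunks, fn chunk -> byte_size(chunk) <= max_size end)
    valid_utf8 = Enum.all?(chunks, &String.valid?/1)

    lossless and bounded and valid_utf8
  end
end

# A word-preserving split that respects an 8-byte limit.
FragmentationCheck.valid_split?("hello world", ["hello ", "world"], 8)
# true

# The same chunks break a 5-byte limit ("hello " is 6 bytes).
FragmentationCheck.valid_split?("hello world", ["hello ", "world"], 5)
# false

# Dropping the space makes the split lossy.
FragmentationCheck.valid_split?("hello world", ["hello", "world"], 8)
# false
```

A helper along these lines could back property-style tests: for any input and any `:max_size`, the output of the split function should pass the check.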
@@ -1,6 +1,4 @@
-<div :for={item <- @items} class="bg-gray-100 p-4 m-4 rounded">
+<div :for={{item, doc_content} <- @items} class="bg-gray-100 p-4 m-4 rounded">
   <p class="text-lg font-bold"><%= item.title %></p>
-  <%= if item.doc do %>
-    <%= raw(Earmark.as_html!(item.doc)) %>
-  <% end %>
+  <%= raw(Earmark.as_html!(doc_content)) %>
 </div>
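The template now iterates over `{item, doc_content}` tuples, and the commit message mentions sorting when recombining doc fragments. The code that builds `@items` is not shown here, so the following is only a sketch of one way such tuples might be assembled. The module `RecombineSketch`, the function `pair_items_with_docs/2`, and the keys `:id`, `:item_id`, `:position`, and `:text` are all assumptions made for illustration.

```elixir
defmodule RecombineSketch do
  # Hypothetical illustration; module, function, and key names are assumed
  # and do not come from the commit.
  #
  # Pairs each item with the text rebuilt from its fragments. Fragments are
  # sorted by position before joining so the original order is restored.
  def pair_items_with_docs(items, fragments) do
    fragments_by_item = Enum.group_by(fragments, fn fragment -> fragment.item_id end)

    Enum.map(items, fn item ->
      doc_content =
        fragments_by_item
        |> Map.get(item.id, [])
        |> Enum.sort_by(fn fragment -> fragment.position end)
        |> Enum.map_join(fn fragment -> fragment.text end)

      {item, doc_content}
    end)
  end
end

items = [%{id: 1, title: "Example"}]

fragments = [
  %{item_id: 1, position: 1, text: "world"},
  %{item_id: 1, position: 0, text: "hello "}
]

RecombineSketch.pair_items_with_docs(items, fragments)
# [{%{id: 1, title: "Example"}, "hello world"}]
```

An item with no fragments gets an empty string here, so a template that renders `doc_content` unconditionally still receives a binary rather than `nil`.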