Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: extract bencode tokenizer #14

Merged
merged 13 commits into from
Dec 4, 2024

Conversation

josecelano
Copy link
Member

@josecelano josecelano commented Dec 3, 2024

This refactoring changes the current implementation to extract the tokenizer. It splits parser logic into two types:

  • Tokenizer: It returns bencoded tokens.
  • Generator: It iterates over bencoded tokens to generate the JSON.

NOTES

  • It keeps the custom recursivity (with explicit stack) for the time being, instead of using explicit recursivity like @da2ce7 did here. I guess that could be changed later if we think it increases readability and maintainability.

SUBTASKS

  • Separate logic for tokenizer.
  • Extract tokenizer.
  • Remove Writer from the tokenizer. It's not needed.

PERFORMANCE

In the current version, bencoded strings are cached in memory before starting writing to the output (because we nned the whole string to check if it's a valid UTF-8). In this PR, bencoded integers are also cached in memory because the whole integer value is a token. This should not be a problem since integers are short, unlike strings.

FUTURE PRs

We could:

  • Implement the Iterator trait for the tokenizer.
  • Use recursion for the generator like @da2ce7's proposal here.
  • Implement another generator for TOML, for example. Check if this design can be easily extended to other output formats.

Split parser logic into two types:

- Tokenizer: It returns bencoded tokens.
- Generator: It iterator over bencoded tokens to generate the JSON.
Copy link

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 93.89671% with 13 lines in your changes missing coverage. Please review.

Project coverage is 99.15%. Comparing base (a2eb63c) to head (ec6cc56).
Report is 14 commits behind head on develop.

Files with missing lines Patch % Lines
src/tokenizer/mod.rs 80.88% 13 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop      #14      +/-   ##
===========================================
- Coverage    99.23%   99.15%   -0.08%     
===========================================
  Files           11       12       +1     
  Lines         2749     2610     -139     
===========================================
- Hits          2728     2588     -140     
- Misses          21       22       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@josecelano josecelano marked this pull request as ready for review December 4, 2024 11:11
@josecelano
Copy link
Member Author

ACK ec6cc56

@josecelano josecelano linked an issue Dec 4, 2024 that may be closed by this pull request
@josecelano josecelano merged commit 9634037 into torrust:develop Dec 4, 2024
8 of 10 checks passed
@da2ce7
Copy link
Contributor

da2ce7 commented Dec 4, 2024

@josecelano Looks good.

Just wondering, should we really tolerate line brakes in a Bencode Input?

@josecelano
Copy link
Member Author

@josecelano Looks good.

Just wondering, should we really tolerate line brakes in a Bencode Input?

Hi @da2ce7 I think I only added it to tolerate the line break at the end of the bencode value because it makes more flexible to run the application like this:

echo "4:spam" | cargo run

If we don't tolerate line breaks you only can use this:

printf "4:spam" | cargo run

I don't like it either. Maybe we can "clean" the input stream only in the main app.

@josecelano
Copy link
Member Author

@josecelano Looks good.

Just wondering, should we really tolerate line brakes in a Bencode Input?

I've opened an issue: #19

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Find a better name for BencodeParser type
2 participants