Merge #14: Refactor: extract bencode tokenizer
ec6cc56 docs: update README (Jose Celano)
68d9915 refactor: rename json::BencodeParser to json::Generator (Jose Celano)
a3c7c4b refactor: remove parent parser mod (Jose Celano)
3052d6a refactor: rename BencodeTOkenizer to Tokenizer (Jose Celano)
331c76e refactor: reorganize modules (Jose Celano)
9e0db6c refactor: remove writer from tokenizer string parser (Jose Celano)
0a05544 refactor: remove old int and str parsers with writers (Jose Celano)
75ffdb4 refactor: remove writer from tokenizer integer parser (Jose Celano)
77ad5af refactor: remove writer from main tokenizer (Jose Celano)
f6a0584 refactor: duplicate integer and strig parser before removing writer (Jose Celano)
3a7ea5d refactor: extract mod tokenizer (Jose Celano)
63b9b73 refactor: extract struct BencodeTokenizer (Jose Celano)
83eeefd refactor: extract bencode tokenizer (Jose Celano)

Pull request description:

  This refactoring extracts the tokenizer from the current implementation, splitting the parser logic into two types:

  - **Tokenizer**: returns bencoded tokens read from the input.
  - **Generator**: iterates over the bencoded tokens to generate the JSON output.
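
  To make the split concrete, here is a minimal, hypothetical sketch of the tokenizer half (the `Token` variants and the one-shot `tokenize` function are illustrative, not the crate's actual API):

  ```rust
  // Illustrative token type and one-shot tokenizer (not the crate's real API).
  #[derive(Debug, PartialEq)]
  enum Token {
      Integer(String), // the digits between 'i' and 'e', e.g. "42" from "i42e"
      Str(Vec<u8>),    // raw string bytes; they may or may not be valid UTF-8
      ListStart,       // 'l'
      DictStart,       // 'd'
      End,             // 'e' closing a list or dictionary
  }

  fn tokenize(input: &[u8]) -> Vec<Token> {
      let mut tokens = Vec::new();
      let mut i = 0;
      while i < input.len() {
          match input[i] {
              b'i' => {
                  // Integer: everything between 'i' and the matching 'e'.
                  let end = i + input[i..].iter().position(|&b| b == b'e').expect("unterminated integer");
                  tokens.push(Token::Integer(String::from_utf8_lossy(&input[i + 1..end]).into_owned()));
                  i = end + 1;
              }
              b'l' => { tokens.push(Token::ListStart); i += 1; }
              b'd' => { tokens.push(Token::DictStart); i += 1; }
              b'e' => { tokens.push(Token::End); i += 1; }
              b'0'..=b'9' => {
                  // String: "<length>:<bytes>".
                  let colon = i + input[i..].iter().position(|&b| b == b':').expect("missing ':'");
                  let len: usize = std::str::from_utf8(&input[i..colon]).unwrap().parse().unwrap();
                  tokens.push(Token::Str(input[colon + 1..colon + 1 + len].to_vec()));
                  i = colon + 1 + len;
              }
              other => panic!("unrecognized first byte for bencoded value: {other}"),
          }
      }
      tokens
  }
  ```

  For `b"d4:spam4:eggse"` this yields `DictStart, Str(b"spam"), Str(b"eggs"), End`; the generator then only has to map that flat token stream to JSON.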

  **NOTES**

  - It keeps the custom iteration with an explicit stack for the time being, instead of using actual (call-stack) recursion like @da2ce7 did [here](#12 (comment)). That could be changed later if we think it improves readability and maintainability.
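
  The explicit-stack approach can be sketched as follows (all names are illustrative, not the crate's API; JSON string escaping and error handling are omitted for brevity):

  ```rust
  // Illustrative token and stack frame types (not the crate's real API).
  enum Token {
      Int(String),
      Str(String), // assumed to be already-validated UTF-8; escaping omitted
      ListStart,
      DictStart,
      End,
  }

  enum Frame {
      List { first: bool },
      Dict { first: bool, next_is_key: bool },
  }

  fn generate_json(tokens: &[Token]) -> String {
      let mut out = String::new();
      let mut stack: Vec<Frame> = Vec::new(); // explicit stack instead of recursion
      for token in tokens {
          // Emit the ',' or ':' required before this value and update the
          // bookkeeping of the enclosing container, if any.
          if !matches!(token, Token::End) {
              if let Some(frame) = stack.last_mut() {
                  match frame {
                      Frame::List { first } => {
                          if !*first { out.push(','); }
                          *first = false;
                      }
                      Frame::Dict { first, next_is_key } => {
                          if *next_is_key {
                              if !*first { out.push(','); }
                              *first = false;
                          } else {
                              out.push(':');
                          }
                          *next_is_key = !*next_is_key;
                      }
                  }
              }
          }
          match token {
              Token::Int(v) => out.push_str(v),
              Token::Str(s) => { out.push('"'); out.push_str(s); out.push('"'); }
              Token::ListStart => { out.push('['); stack.push(Frame::List { first: true }); }
              Token::DictStart => { out.push('{'); stack.push(Frame::Dict { first: true, next_is_key: true }); }
              Token::End => match stack.pop() {
                  Some(Frame::List { .. }) => out.push(']'),
                  Some(Frame::Dict { .. }) => out.push('}'),
                  None => panic!("end token without matching list or dict start"),
              },
          }
      }
      out
  }
  ```

  Nesting depth is bounded only by heap memory (the `Vec` of frames), not by the call stack, which is the main argument for keeping this form.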

  **SUBTASKS**

  - [x] Separate logic for tokenizer.
  - [x] Extract tokenizer.
  - [x] Remove `Writer` from the tokenizer. It's not needed.

  **PERFORMANCE**

  In the current version, bencoded strings are buffered in memory before being written to the output, because we need the whole string to check whether it is valid UTF-8. In this PR, bencoded integers are also buffered in memory, because the whole integer value is a single token. This should not be a problem, since integers are short, unlike strings.
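
  The reason string values must be buffered can be sketched like this (hypothetical helpers, not the crate's API; the hex fallback for non-UTF-8 bytes is an illustrative choice):

  ```rust
  use std::io::Read;

  // Read exactly `length` bytes of a bencoded string value into memory.
  // UTF-8 validity cannot be decided until every byte has been read.
  fn read_string_value(length: usize, input: &mut impl Read) -> std::io::Result<Vec<u8>> {
      let mut buf = vec![0u8; length];
      input.read_exact(&mut buf)?;
      Ok(buf)
  }

  // Convert the buffered bytes to a JSON string (escaping omitted for brevity).
  fn to_json_string(bytes: &[u8]) -> String {
      match std::str::from_utf8(bytes) {
          Ok(s) => format!("\"{s}\""),
          // Illustrative fallback: dump the bytes as hex.
          Err(_) => {
              let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
              format!("\"{hex}\"")
          }
      }
  }
  ```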

  **FUTURE PRs**

  We could:

  - [ ] Implement the `Iterator` trait for the tokenizer.
  - [ ] Use recursion for the generator, following @da2ce7's proposal [here](#12).
  - [ ] Implement another generator (for TOML, for example) to check whether this design can be easily extended to other output formats.
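
  A hypothetical sketch of the first item, an `Iterator` implementation for the tokenizer (an in-memory version for illustration; the real tokenizer reads from a `Read` stream and would presumably yield `Result<Token, Error>` items):

  ```rust
  #[derive(Debug, PartialEq)]
  enum Token {
      Integer(String),
      Str(Vec<u8>),
      ListStart,
      DictStart,
      End,
  }

  // Illustrative in-memory tokenizer state (not the crate's real API).
  struct Tokenizer<'a> {
      input: &'a [u8],
      pos: usize,
  }

  impl<'a> Iterator for Tokenizer<'a> {
      type Item = Token;

      fn next(&mut self) -> Option<Token> {
          let b = *self.input.get(self.pos)?;
          match b {
              b'i' => {
                  let end = self.pos + self.input[self.pos..].iter().position(|&b| b == b'e')?;
                  let token =
                      Token::Integer(String::from_utf8_lossy(&self.input[self.pos + 1..end]).into_owned());
                  self.pos = end + 1;
                  Some(token)
              }
              b'l' => { self.pos += 1; Some(Token::ListStart) }
              b'd' => { self.pos += 1; Some(Token::DictStart) }
              b'e' => { self.pos += 1; Some(Token::End) }
              b'0'..=b'9' => {
                  let colon = self.pos + self.input[self.pos..].iter().position(|&b| b == b':')?;
                  let len: usize = std::str::from_utf8(&self.input[self.pos..colon]).ok()?.parse().ok()?;
                  let token = Token::Str(self.input[colon + 1..colon + 1 + len].to_vec());
                  self.pos = colon + 1 + len;
                  Some(token)
              }
              _ => None, // a real implementation would return a proper error
          }
      }
  }
  ```

  The tokenizer could then be driven with ordinary iterator adapters (`collect`, `map`, and so on), which would also simplify the generator's main loop.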

ACKs for top commit:
  josecelano:
    ACK ec6cc56

Tree-SHA512: 9210211d802c8e19aef1f02f814b494c5919c7da81f299cf2c7f4d9fb12b4c63cbec4ac526996e6b1b3d69f75ca58894b9d64936bef2d9da851e70d51234c675
josecelano committed Dec 4, 2024
2 parents a2eb63c + ec6cc56 commit 9634037
Showing 17 changed files with 436 additions and 567 deletions.
42 changes: 7 additions & 35 deletions README.md
Original file line number Diff line number Diff line change
@@ -65,12 +65,12 @@ Error: Unexpected end of input parsing integer; read context: input pos 3, lates

```console
printf "3:ab" | cargo run
Error: Unexpected end of input parsing string value; read context: input pos 4, latest input bytes dump: [51, 58, 97, 98] (UTF-8 string: `3:ab`); write context: output pos 0, latest output bytes dump: [] (UTF-8 string: ``)
Error: Unexpected end of input parsing string value; read context: input pos 4, latest input bytes dump: [51, 58, 97, 98] (UTF-8 string: `3:ab`)
```

```console
echo "i00e" | cargo run
Error: Leading zeros in integers are not allowed, for example b'i00e'; read context: byte `48` (char: `0`), input pos 3, latest input bytes dump: [105, 48, 48] (UTF-8 string: `i00`); write context: byte `48` (char: `0`), output pos 2, latest output bytes dump: [48, 48] (UTF-8 string: `00`)
Error: Leading zeros in integers are not allowed, for example b'i00e'; read context: byte `48` (char: `0`), input pos 3, latest input bytes dump: [105, 48, 48] (UTF-8 string: `i00`)
```

Generating pretty JSON with [jq][jq]:
@@ -111,36 +111,10 @@ cargo add bencode2json

There are two ways of using the library:

- With high-level parser wrappers.
- With the low-level parsers.
- With high-level wrappers.
- With the low-level generators.

Example using the high-level parser wrappers:

```rust
use bencode2json::{try_bencode_to_json};

let result = try_bencode_to_json(b"d4:spam4:eggse").unwrap();

assert_eq!(result, r#"{"<string>spam</string>":"<string>eggs</string>"}"#);
```

Example using the low-level parser:

```rust
use bencode2json::parsers::{BencodeParser};

let mut output = String::new();

let mut parser = BencodeParser::new(&b"4:spam"[..]);

parser
.write_str(&mut output)
.expect("Bencode to JSON conversion failed");

println!("{output}"); // It prints the JSON string: "<string>spam</string>"
```

More [examples](./examples/).
See [examples](./examples/).

## Test

@@ -167,21 +141,19 @@ cargo cov
## Performance

In terms of memory usage this implementation consumes at least the size of the
biggest bencoded string. The string parser keeps all the string bytes in memory until
it parses the whole string, in order to convert it to UTF-8, when it's possible.
biggest bencoded integer or string. The string and integer parsers keep all the bytes in memory until
the whole value has been parsed.

The library also wraps the input and output streams in a [BufReader](https://doc.rust-lang.org/std/io/struct.BufReader.html)
and [BufWriter](https://doc.rust-lang.org/std/io/struct.BufWriter.html) because it can be excessively inefficient to work directly with something that implements [Read](https://doc.rust-lang.org/std/io/trait.Read.html) or [Write](https://doc.rust-lang.org/std/io/trait.Write.html).
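
A minimal sketch of that buffering, using only the standard library (the function name is illustrative):

```rust
use std::io::{BufReader, BufWriter, Read, Write};

// Wrap any `Read`/`Write` in `BufReader`/`BufWriter` so each underlying
// syscall moves a whole buffer instead of a few bytes at a time.
fn copy_buffered(input: impl Read, output: impl Write) -> std::io::Result<u64> {
    let mut reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);
    let n = std::io::copy(&mut reader, &mut writer)?;
    writer.flush()?; // BufWriter flushes on drop, but errors would be lost there
    Ok(n)
}
```

Calling `flush` explicitly matters: a `BufWriter` dropped without flushing silently discards any write error.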

## TODO

- [ ] More examples of using the library.
- [ ] Counter for number of items in a list for debugging and errors.
- [ ] Fuzz testing: Generate random valid bencoded values.
- [ ] Install tracing crate. Add verbose mode that enables debugging.
- [ ] Option to check whether the final JSON is valid at the end of the process.
- [ ] Benchmarking for this implementation and the original C implementation.
- [ ] Optimize the string parser. We can stop trying to convert the string to UTF-8 as soon as we find an invalid UTF-8 byte.

## Alternatives

4 changes: 2 additions & 2 deletions examples/parser_file_in_file_out.rs
@@ -10,7 +10,7 @@ use std::{
io::{Read, Write},
};

use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;
use clap::{Arg, Command};

fn main() {
@@ -61,7 +61,7 @@ fn main() {
std::process::exit(1);
};

if let Err(e) = BencodeParser::new(input).write_bytes(&mut output) {
if let Err(e) = Generator::new(input).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_stdin_stdout.rs
@@ -7,13 +7,13 @@
//! It prints "spam".
use std::io;

use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = Box::new(io::stdin());
let mut output = Box::new(io::stdout());

if let Err(e) = BencodeParser::new(input).write_bytes(&mut output) {
if let Err(e) = Generator::new(input).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_string_in_string_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = "4:spam".to_string();
let mut output = String::new();

if let Err(e) = BencodeParser::new(input.as_bytes()).write_str(&mut output) {
if let Err(e) = Generator::new(input.as_bytes()).write_str(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_string_in_vec_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = "4:spam".to_string();
let mut output = Vec::new();

if let Err(e) = BencodeParser::new(input.as_bytes()).write_bytes(&mut output) {
if let Err(e) = Generator::new(input.as_bytes()).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_vec_in_string_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = b"4:spam".to_vec();
let mut output = String::new();

if let Err(e) = BencodeParser::new(&input[..]).write_str(&mut output) {
if let Err(e) = Generator::new(&input[..]).write_str(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_vec_in_vec_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = b"4:spam".to_vec();
let mut output = Vec::new();

if let Err(e) = BencodeParser::new(&input[..]).write_bytes(&mut output) {
if let Err(e) = Generator::new(&input[..]).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
42 changes: 21 additions & 21 deletions src/parsers/error.rs → src/error.rs
@@ -9,7 +9,7 @@ use thiserror::Error;

use crate::rw;

use super::BencodeType;
use super::generators::BencodeType;

/// Errors that can occur while parsing a bencoded value.
#[derive(Debug, Error)]
@@ -27,55 +27,55 @@ pub enum Error {
/// The main parser peeks one byte ahead to know what kind of bencoded value
/// is being parsed. If the byte read after peeking does not match the
/// peeked byte, it means the input is being consumed somewhere else.
#[error("Read byte after peeking does match peeked byte; {0}; {1}")]
ReadByteAfterPeekingDoesMatchPeekedByte(ReadContext, WriteContext),
#[error("Read byte after peeking does not match peeked byte; {0}")]
ReadByteAfterPeekingDoesMatchPeekedByte(ReadContext),

/// Unrecognized first byte for new bencoded value.
///
/// The main parser peeks one byte ahead to know what kind of bencoded value
/// is being parsed. This error is raised when the peeked byte is not a
/// valid first byte for a bencoded value.
#[error("Unrecognized first byte for new bencoded value; {0}; {1}")]
UnrecognizedFirstBencodeValueByte(ReadContext, WriteContext),
#[error("Unrecognized first byte for new bencoded value; {0}")]
UnrecognizedFirstBencodeValueByte(ReadContext),

// Integers
/// Unexpected byte parsing integer.
///
/// The main parser parses integers by reading bytes until it finds the
/// end of the integer. This error is raised when the byte read is not a
/// valid byte for an integer bencoded value.
#[error("Unexpected byte parsing integer; {0}; {1}")]
UnexpectedByteParsingInteger(ReadContext, WriteContext),
#[error("Unexpected byte parsing integer; {0}")]
UnexpectedByteParsingInteger(ReadContext),

/// Unexpected end of input parsing integer.
///
/// The input ends before the integer ends.
#[error("Unexpected end of input parsing integer; {0}; {1}")]
UnexpectedEndOfInputParsingInteger(ReadContext, WriteContext),
#[error("Unexpected end of input parsing integer; {0}")]
UnexpectedEndOfInputParsingInteger(ReadContext),

/// Leading zeros in integers are not allowed, for example b'i00e'.
#[error("Leading zeros in integers are not allowed, for example b'i00e'; {0}; {1}")]
LeadingZerosInIntegersNotAllowed(ReadContext, WriteContext),
#[error("Leading zeros in integers are not allowed, for example b'i00e'; {0}")]
LeadingZerosInIntegersNotAllowed(ReadContext),

// Strings
/// Invalid string length byte, expected a digit.
///
/// The string parser found an invalid byte for the string length. The
/// length can only be made of digits (0-9).
#[error("Invalid string length byte, expected a digit; {0}; {1}")]
InvalidStringLengthByte(ReadContext, WriteContext),
#[error("Invalid string length byte, expected a digit; {0}")]
InvalidStringLengthByte(ReadContext),

/// Unexpected end of input parsing string length.
///
/// The input ends before the string length ends.
#[error("Unexpected end of input parsing string length; {0}; {1}")]
UnexpectedEndOfInputParsingStringLength(ReadContext, WriteContext),
#[error("Unexpected end of input parsing string length; {0}")]
UnexpectedEndOfInputParsingStringLength(ReadContext),

/// Unexpected end of input parsing string value.
///
/// The input ends before the string value ends.
#[error("Unexpected end of input parsing string value; {0}; {1}")]
UnexpectedEndOfInputParsingStringValue(ReadContext, WriteContext),
#[error("Unexpected end of input parsing string value; {0}")]
UnexpectedEndOfInputParsingStringValue(ReadContext),

// Lists
/// Unexpected end of input parsing list. Expecting first list item or list end.
@@ -121,7 +121,7 @@
NoMatchingStartForListOrDictEnd(ReadContext, WriteContext),
}

/// The reader context when the error ocurred.
/// The reader context when the error occurred.
#[derive(Debug)]
pub struct ReadContext {
/// The read byte that caused the error if any.
@@ -157,7 +157,7 @@ impl fmt::Display for ReadContext {
}
}

/// The writer context when the error ocurred.
/// The writer context when the error occurred.
#[derive(Debug)]
pub struct WriteContext {
/// The written byte that caused the error if any.
@@ -197,7 +197,7 @@
mod tests {

mod for_read_context {
use crate::parsers::error::ReadContext;
use crate::error::ReadContext;

#[test]
fn it_should_display_the_read_context() {
@@ -237,7 +237,7 @@
}

mod for_write_context {
use crate::parsers::error::WriteContext;
use crate::error::WriteContext;

#[test]
fn it_should_display_the_read_context() {