Merge #14: Refactor: extract bencode tokenizer
ec6cc56 docs: update README (Jose Celano)
68d9915 refactor: rename json::BencodeParser to json::Generator (Jose Celano)
a3c7c4b refactor: remove parent parser mod (Jose Celano)
3052d6a refactor: rename BencodeTOkenizer to Tokenizer (Jose Celano)
331c76e refactor: reorganize modules (Jose Celano)
9e0db6c refactor: remove writer from tokenizer string parser (Jose Celano)
0a05544 refactor: remove old int and str parsers with writers (Jose Celano)
75ffdb4 refactor: remove writer from tokenizer integer parser (Jose Celano)
77ad5af refactor: remove writer from main tokenizer (Jose Celano)
f6a0584 refactor: duplicate integer and strig parser before removing writer (Jose Celano)
3a7ea5d refactor: extract mod tokenizer (Jose Celano)
63b9b73 refactor: extract struct BencodeTokenizer (Jose Celano)
83eeefd refactor: extract bencode tokenizer (Jose Celano)

Pull request description:

  This refactoring extracts the tokenizer from the current implementation, splitting the parser logic into two types:

  - **Tokenizer**: returns bencoded tokens read from the input.
  - **Generator**: iterates over the bencoded tokens to generate the JSON output.
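
  To make the split concrete, here is a minimal, hypothetical sketch of the tokenizer half (the `Token` variants and the one-shot `tokenize` function are illustrative, not the crate's actual API):

  ```rust
  // Illustrative token type and one-shot tokenizer (not the crate's real API).
  #[derive(Debug, PartialEq)]
  enum Token {
      Integer(String), // the digits between 'i' and 'e', e.g. "42" from "i42e"
      Str(Vec<u8>),    // raw string bytes; they may or may not be valid UTF-8
      ListStart,       // 'l'
      DictStart,       // 'd'
      End,             // 'e' closing a list or dictionary
  }

  fn tokenize(input: &[u8]) -> Vec<Token> {
      let mut tokens = Vec::new();
      let mut i = 0;
      while i < input.len() {
          match input[i] {
              b'i' => {
                  // Integer: everything between 'i' and the matching 'e'.
                  let end = i + input[i..].iter().position(|&b| b == b'e').expect("unterminated integer");
                  tokens.push(Token::Integer(String::from_utf8_lossy(&input[i + 1..end]).into_owned()));
                  i = end + 1;
              }
              b'l' => { tokens.push(Token::ListStart); i += 1; }
              b'd' => { tokens.push(Token::DictStart); i += 1; }
              b'e' => { tokens.push(Token::End); i += 1; }
              b'0'..=b'9' => {
                  // String: "<length>:<bytes>".
                  let colon = i + input[i..].iter().position(|&b| b == b':').expect("missing ':'");
                  let len: usize = std::str::from_utf8(&input[i..colon]).unwrap().parse().unwrap();
                  tokens.push(Token::Str(input[colon + 1..colon + 1 + len].to_vec()));
                  i = colon + 1 + len;
              }
              other => panic!("unrecognized first byte for bencoded value: {other}"),
          }
      }
      tokens
  }
  ```

  For `b"d4:spam4:eggse"` this yields `DictStart, Str(b"spam"), Str(b"eggs"), End`; the generator then only has to map that flat token stream to JSON.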

  **NOTES**

  - It keeps the custom iteration with an explicit stack for the time being, instead of using actual (call-stack) recursion like @da2ce7 did [here](#12 (comment)). That could be changed later if we think it improves readability and maintainability.
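
  The explicit-stack approach can be sketched as follows (all names are illustrative, not the crate's API; JSON string escaping and error handling are omitted for brevity):

  ```rust
  // Illustrative token and stack frame types (not the crate's real API).
  enum Token {
      Int(String),
      Str(String), // assumed to be already-validated UTF-8; escaping omitted
      ListStart,
      DictStart,
      End,
  }

  enum Frame {
      List { first: bool },
      Dict { first: bool, next_is_key: bool },
  }

  fn generate_json(tokens: &[Token]) -> String {
      let mut out = String::new();
      let mut stack: Vec<Frame> = Vec::new(); // explicit stack instead of recursion
      for token in tokens {
          // Emit the ',' or ':' required before this value and update the
          // bookkeeping of the enclosing container, if any.
          if !matches!(token, Token::End) {
              if let Some(frame) = stack.last_mut() {
                  match frame {
                      Frame::List { first } => {
                          if !*first { out.push(','); }
                          *first = false;
                      }
                      Frame::Dict { first, next_is_key } => {
                          if *next_is_key {
                              if !*first { out.push(','); }
                              *first = false;
                          } else {
                              out.push(':');
                          }
                          *next_is_key = !*next_is_key;
                      }
                  }
              }
          }
          match token {
              Token::Int(v) => out.push_str(v),
              Token::Str(s) => { out.push('"'); out.push_str(s); out.push('"'); }
              Token::ListStart => { out.push('['); stack.push(Frame::List { first: true }); }
              Token::DictStart => { out.push('{'); stack.push(Frame::Dict { first: true, next_is_key: true }); }
              Token::End => match stack.pop() {
                  Some(Frame::List { .. }) => out.push(']'),
                  Some(Frame::Dict { .. }) => out.push('}'),
                  None => panic!("end token without matching list or dict start"),
              },
          }
      }
      out
  }
  ```

  Nesting depth is bounded only by heap memory (the `Vec` of frames), not by the call stack, which is the main argument for keeping this form.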

  **SUBTASKS**

  - [x] Separate logic for tokenizer.
  - [x] Extract tokenizer.
  - [x] Remove `Writer` from the tokenizer. It's not needed.

  **PERFORMANCE**

  In the current version, bencoded strings are buffered in memory before being written to the output, because we need the whole string to check whether it is valid UTF-8. In this PR, bencoded integers are also buffered in memory, because the whole integer value is a single token. This should not be a problem, since integers are short, unlike strings.
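
  The reason string values must be buffered can be sketched like this (hypothetical helpers, not the crate's API; the hex fallback for non-UTF-8 bytes is an illustrative choice):

  ```rust
  use std::io::Read;

  // Read exactly `length` bytes of a bencoded string value into memory.
  // UTF-8 validity cannot be decided until every byte has been read.
  fn read_string_value(length: usize, input: &mut impl Read) -> std::io::Result<Vec<u8>> {
      let mut buf = vec![0u8; length];
      input.read_exact(&mut buf)?;
      Ok(buf)
  }

  // Convert the buffered bytes to a JSON string (escaping omitted for brevity).
  fn to_json_string(bytes: &[u8]) -> String {
      match std::str::from_utf8(bytes) {
          Ok(s) => format!("\"{s}\""),
          // Illustrative fallback: dump the bytes as hex.
          Err(_) => {
              let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
              format!("\"{hex}\"")
          }
      }
  }
  ```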

  **FUTURE PRs**

  We could:

  - [ ] Implement the `Iterator` trait for the tokenizer.
  - [ ] Use recursion for the generator, following @da2ce7's proposal [here](#12).
  - [ ] Implement another generator (for TOML, for example) to check whether this design can be easily extended to other output formats.
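
  A hypothetical sketch of the first item, an `Iterator` implementation for the tokenizer (an in-memory version for illustration; the real tokenizer reads from a `Read` stream and would presumably yield `Result<Token, Error>` items):

  ```rust
  #[derive(Debug, PartialEq)]
  enum Token {
      Integer(String),
      Str(Vec<u8>),
      ListStart,
      DictStart,
      End,
  }

  // Illustrative in-memory tokenizer state (not the crate's real API).
  struct Tokenizer<'a> {
      input: &'a [u8],
      pos: usize,
  }

  impl<'a> Iterator for Tokenizer<'a> {
      type Item = Token;

      fn next(&mut self) -> Option<Token> {
          let b = *self.input.get(self.pos)?;
          match b {
              b'i' => {
                  let end = self.pos + self.input[self.pos..].iter().position(|&b| b == b'e')?;
                  let token =
                      Token::Integer(String::from_utf8_lossy(&self.input[self.pos + 1..end]).into_owned());
                  self.pos = end + 1;
                  Some(token)
              }
              b'l' => { self.pos += 1; Some(Token::ListStart) }
              b'd' => { self.pos += 1; Some(Token::DictStart) }
              b'e' => { self.pos += 1; Some(Token::End) }
              b'0'..=b'9' => {
                  let colon = self.pos + self.input[self.pos..].iter().position(|&b| b == b':')?;
                  let len: usize = std::str::from_utf8(&self.input[self.pos..colon]).ok()?.parse().ok()?;
                  let token = Token::Str(self.input[colon + 1..colon + 1 + len].to_vec());
                  self.pos = colon + 1 + len;
                  Some(token)
              }
              _ => None, // a real implementation would return a proper error
          }
      }
  }
  ```

  The tokenizer could then be driven with ordinary iterator adapters (`collect`, `map`, and so on), which would also simplify the generator's main loop.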

ACKs for top commit:
  josecelano:
    ACK ec6cc56

Tree-SHA512: 9210211d802c8e19aef1f02f814b494c5919c7da81f299cf2c7f4d9fb12b4c63cbec4ac526996e6b1b3d69f75ca58894b9d64936bef2d9da851e70d51234c675
josecelano committed Dec 4, 2024
2 parents a2eb63c + ec6cc56 commit 9634037
Showing 17 changed files with 436 additions and 567 deletions.
42 changes: 7 additions & 35 deletions README.md
Original file line number Diff line number Diff line change
@@ -65,12 +65,12 @@ Error: Unexpected end of input parsing integer; read context: input pos 3, lates

```console
printf "3:ab" | cargo run
Error: Unexpected end of input parsing string value; read context: input pos 4, latest input bytes dump: [51, 58, 97, 98] (UTF-8 string: `3:ab`); write context: output pos 0, latest output bytes dump: [] (UTF-8 string: ``)
Error: Unexpected end of input parsing string value; read context: input pos 4, latest input bytes dump: [51, 58, 97, 98] (UTF-8 string: `3:ab`)
```

```console
echo "i00e" | cargo run
Error: Leading zeros in integers are not allowed, for example b'i00e'; read context: byte `48` (char: `0`), input pos 3, latest input bytes dump: [105, 48, 48] (UTF-8 string: `i00`); write context: byte `48` (char: `0`), output pos 2, latest output bytes dump: [48, 48] (UTF-8 string: `00`)
Error: Leading zeros in integers are not allowed, for example b'i00e'; read context: byte `48` (char: `0`), input pos 3, latest input bytes dump: [105, 48, 48] (UTF-8 string: `i00`)
```

Generating pretty JSON with [jq][jq]:
@@ -111,36 +111,10 @@ cargo add bencode2json

There are two ways of using the library:

- With high-level parser wrappers.
- With the low-level parsers.
- With high-level wrappers.
- With the low-level generators.

Example using the high-level parser wrappers:

```rust
use bencode2json::{try_bencode_to_json};

let result = try_bencode_to_json(b"d4:spam4:eggse").unwrap();

assert_eq!(result, r#"{"<string>spam</string>":"<string>eggs</string>"}"#);
```

Example using the low-level parser:

```rust
use bencode2json::parsers::{BencodeParser};

let mut output = String::new();

let mut parser = BencodeParser::new(&b"4:spam"[..]);

parser
.write_str(&mut output)
.expect("Bencode to JSON conversion failed");

println!("{output}"); // It prints the JSON string: "<string>spam</string>"
```

More [examples](./examples/).
See [examples](./examples/).

## Test

@@ -167,21 +141,19 @@ cargo cov
## Performance

In terms of memory usage this implementation consumes at least the size of the
biggest bencoded string. The string parser keeps all the string bytes in memory until
it parses the whole string, in order to convert it to UTF-8, when it's possible.
biggest bencoded integer or string. The string and integer parsers keep all the bytes in memory until
the whole value has been parsed.

The library also wraps the input and output streams in a [BufReader](https://doc.rust-lang.org/std/io/struct.BufReader.html)
and [BufWriter](https://doc.rust-lang.org/std/io/struct.BufWriter.html) because it can be excessively inefficient to work directly with something that implements [Read](https://doc.rust-lang.org/std/io/trait.Read.html) or [Write](https://doc.rust-lang.org/std/io/trait.Write.html).
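
A minimal sketch of that buffering, using only the standard library (the function name is illustrative):

```rust
use std::io::{BufReader, BufWriter, Read, Write};

// Wrap any `Read`/`Write` in `BufReader`/`BufWriter` so each underlying
// syscall moves a whole buffer instead of a few bytes at a time.
fn copy_buffered(input: impl Read, output: impl Write) -> std::io::Result<u64> {
    let mut reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);
    let n = std::io::copy(&mut reader, &mut writer)?;
    writer.flush()?; // BufWriter flushes on drop, but errors would be lost there
    Ok(n)
}
```

Calling `flush` explicitly matters: a `BufWriter` dropped without flushing silently discards any write error.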

## TODO

- [ ] More examples of using the library.
- [ ] Counter for number of items in a list for debugging and errors.
- [ ] Fuzz testing: Generate random valid bencoded values.
- [ ] Install tracing crate. Add verbose mode that enables debugging.
- [ ] Option to check whether the final JSON is valid at the end of the process.
- [ ] Benchmarking for this implementation and the original C implementation.
- [ ] Optimize the string parser. We can stop trying to convert the string to UTF-8 as soon as we find an invalid UTF-8 byte.

## Alternatives

4 changes: 2 additions & 2 deletions examples/parser_file_in_file_out.rs
@@ -10,7 +10,7 @@ use std::{
io::{Read, Write},
};

use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;
use clap::{Arg, Command};

fn main() {
@@ -61,7 +61,7 @@ fn main() {
std::process::exit(1);
};

if let Err(e) = BencodeParser::new(input).write_bytes(&mut output) {
if let Err(e) = Generator::new(input).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_stdin_stdout.rs
@@ -7,13 +7,13 @@
//! It prints "spam".
use std::io;

use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = Box::new(io::stdin());
let mut output = Box::new(io::stdout());

if let Err(e) = BencodeParser::new(input).write_bytes(&mut output) {
if let Err(e) = Generator::new(input).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_string_in_string_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = "4:spam".to_string();
let mut output = String::new();

if let Err(e) = BencodeParser::new(input.as_bytes()).write_str(&mut output) {
if let Err(e) = Generator::new(input.as_bytes()).write_str(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_string_in_vec_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = "4:spam".to_string();
let mut output = Vec::new();

if let Err(e) = BencodeParser::new(input.as_bytes()).write_bytes(&mut output) {
if let Err(e) = Generator::new(input.as_bytes()).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_vec_in_string_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = b"4:spam".to_vec();
let mut output = String::new();

if let Err(e) = BencodeParser::new(&input[..]).write_str(&mut output) {
if let Err(e) = Generator::new(&input[..]).write_str(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
4 changes: 2 additions & 2 deletions examples/parser_vec_in_vec_out.rs
@@ -5,13 +5,13 @@
//! ```
//!
//! It prints "spam".
use bencode2json::parsers::BencodeParser;
use bencode2json::generators::json::Generator;

fn main() {
let input = b"4:spam".to_vec();
let mut output = Vec::new();

if let Err(e) = BencodeParser::new(&input[..]).write_bytes(&mut output) {
if let Err(e) = Generator::new(&input[..]).write_bytes(&mut output) {
eprintln!("Error: {e}");
std::process::exit(1);
}
42 changes: 21 additions & 21 deletions src/parsers/error.rs → src/error.rs
@@ -9,7 +9,7 @@ use thiserror::Error;

use crate::rw;

use super::BencodeType;
use super::generators::BencodeType;

/// Errors that can occur while parsing a bencoded value.
#[derive(Debug, Error)]
@@ -27,55 +27,55 @@ pub enum Error {
/// The main parser peeks one byte ahead to know what kind of bencoded value
/// is being parsed. If the byte read after peeking does not match the
/// peeked byte, it means the input is being consumed somewhere else.
#[error("Read byte after peeking does match peeked byte; {0}; {1}")]
ReadByteAfterPeekingDoesMatchPeekedByte(ReadContext, WriteContext),
#[error("Read byte after peeking does not match peeked byte; {0}")]
ReadByteAfterPeekingDoesMatchPeekedByte(ReadContext),

/// Unrecognized first byte for new bencoded value.
///
/// The main parser peeks one byte ahead to know what kind of bencoded value
/// is being parsed. This error is raised when the peeked byte is not a
/// valid first byte for a bencoded value.
#[error("Unrecognized first byte for new bencoded value; {0}; {1}")]
UnrecognizedFirstBencodeValueByte(ReadContext, WriteContext),
#[error("Unrecognized first byte for new bencoded value; {0}")]
UnrecognizedFirstBencodeValueByte(ReadContext),

// Integers
/// Unexpected byte parsing integer.
///
/// The main parser parses integers by reading bytes until it finds the
/// end of the integer. This error is raised when the byte read is not a
/// valid byte for an integer bencoded value.
#[error("Unexpected byte parsing integer; {0}; {1}")]
UnexpectedByteParsingInteger(ReadContext, WriteContext),
#[error("Unexpected byte parsing integer; {0}")]
UnexpectedByteParsingInteger(ReadContext),

/// Unexpected end of input parsing integer.
///
/// The input ends before the integer ends.
#[error("Unexpected end of input parsing integer; {0}; {1}")]
UnexpectedEndOfInputParsingInteger(ReadContext, WriteContext),
#[error("Unexpected end of input parsing integer; {0}")]
UnexpectedEndOfInputParsingInteger(ReadContext),

/// Leading zeros in integers are not allowed, for example b'i00e'.
#[error("Leading zeros in integers are not allowed, for example b'i00e'; {0}; {1}")]
LeadingZerosInIntegersNotAllowed(ReadContext, WriteContext),
#[error("Leading zeros in integers are not allowed, for example b'i00e'; {0}")]
LeadingZerosInIntegersNotAllowed(ReadContext),

// Strings
/// Invalid string length byte, expected a digit.
///
/// The string parser found an invalid byte for the string length. The
/// length can only be made of digits (0-9).
#[error("Invalid string length byte, expected a digit; {0}; {1}")]
InvalidStringLengthByte(ReadContext, WriteContext),
#[error("Invalid string length byte, expected a digit; {0}")]
InvalidStringLengthByte(ReadContext),

/// Unexpected end of input parsing string length.
///
/// The input ends before the string length ends.
#[error("Unexpected end of input parsing string length; {0}; {1}")]
UnexpectedEndOfInputParsingStringLength(ReadContext, WriteContext),
#[error("Unexpected end of input parsing string length; {0}")]
UnexpectedEndOfInputParsingStringLength(ReadContext),

/// Unexpected end of input parsing string value.
///
/// The input ends before the string value ends.
#[error("Unexpected end of input parsing string value; {0}; {1}")]
UnexpectedEndOfInputParsingStringValue(ReadContext, WriteContext),
#[error("Unexpected end of input parsing string value; {0}")]
UnexpectedEndOfInputParsingStringValue(ReadContext),

// Lists
/// Unexpected end of input parsing list. Expecting first list item or list end.
@@ -121,7 +121,7 @@
NoMatchingStartForListOrDictEnd(ReadContext, WriteContext),
}

/// The reader context when the error ocurred.
/// The reader context when the error occurred.
#[derive(Debug)]
pub struct ReadContext {
/// The read byte that caused the error if any.
@@ -157,7 +157,7 @@ impl fmt::Display for ReadContext {
}
}

/// The writer context when the error ocurred.
/// The writer context when the error occurred.
#[derive(Debug)]
pub struct WriteContext {
/// The written byte that caused the error if any.
@@ -197,7 +197,7 @@
mod tests {

mod for_read_context {
use crate::parsers::error::ReadContext;
use crate::error::ReadContext;

#[test]
fn it_should_display_the_read_context() {
@@ -237,7 +237,7 @@
}

mod for_write_context {
use crate::parsers::error::WriteContext;
use crate::error::WriteContext;

#[test]
fn it_should_display_the_read_context() {