Find a better name for BencodeParser type #12

Closed · josecelano opened this issue Nov 19, 2024 · 15 comments · Fixed by #14
Labels: question (Further information is requested)

@josecelano (Member) commented Nov 19, 2024

Yesterday, I discussed the main type name with @magecnion. It might not be a good name.

Formal Definition of a Parser

A parser is a component of a computer program that takes an input string (or byte stream) and produces a structured representation, often a tree-like format, based on a formal grammar. Parsers are typically used in the context of programming languages or data formats, where they validate and transform raw input into a more structured form for further processing.

Since the library converts bencoded data directly to JSON without creating an intermediary structured representation, the term "parser" might not entirely reflect its purpose. It implies more extensive processing than what the library does.
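For example (my own illustration, not taken from the project), the library's whole job is a direct mapping like this, leaving aside how non-UTF-8 byte strings are escaped:

bencoded input: d3:bar4:spam3:fooi42ee
JSON output:    {"bar":"spam","foo":42}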

Alternative Names (ChatGPT)

  1. BencodeConverter: Highlights the conversion process from bencode to JSON.
  2. BencodeToJson: A clear and direct name describing the functionality.
  3. BencodeTransformer: Suggests transforming the data from one format to another.
  4. BencodeSerializer: Focuses on serializing bencoded data into JSON.
  5. BencodeStreamConverter: Emphasizes that it works on streams and performs conversion.
  6. BencodeTranscoder: Suggests translating data from one encoded format to another.
  7. BencodeJsonDumper: Reflects the "dumping" nature of the process directly to JSON.
  8. BencodeInterpreter: Suggests interpreting the bencoded data directly into JSON output.
  9. BencodeDecoder: Focuses on decoding bencode and outputting JSON.
  10. BencodeEmitter: Implies emitting JSON data directly from the bencode input.

cc @da2ce7 @magecnion

@josecelano added the question label Nov 19, 2024
@josecelano (Member, Author)

@da2ce7's suggestion:

  • BencodeIterator

@josecelano (Member, Author)

I like these (in this order):

  • BencodeToJsonConverter
  • BencodeJsonDumper
  • BencodeToJson

@magecnion (Collaborator)

At the beginning I liked BencodeConverter; it reflects what it does. For me, BencodeToJsonConverter is a bit long and also redundant, since the project is named bencode2json. Going further, we could even disregard Bencode and name the struct simply Converter, although that might be too drastic.

But @da2ce7's suggestion made me think. I guess you suggested BencodeIterator because it not only converts but also iterates. Could it be a hint that another type is needed instead of just renaming the current one?

@josecelano (Member, Author)

> At the beginning I liked BencodeConverter; it reflects what it does. For me, BencodeToJsonConverter is a bit long and also redundant, since the project is named bencode2json. Going further, we could even disregard Bencode and name the struct simply Converter, although that might be too drastic.
>
> But @da2ce7's suggestion made me think. I guess you suggested BencodeIterator because it not only converts but also iterates. Could it be a hint that another type is needed instead of just renaming the current one?

Hi @magecnion, when I started working on the library, I wanted to implement it like a compiler with a tokenizer and a parser, or like serde, separating the phase where we parse bencode tokens from the phase where we build the output.

bencoded bytes -> bencoded tokens -> generate JSON

That would allow us to add new extensions in the future to write to other formats (yaml, xml, toml, ...). However, I decided to make it as simple as possible in the first iteration, only for the JSON case, following the C implementation.

The main difference with serde is we don't build an intermediary data structure.

Maybe we can explore that path now. The simplest way I see to change the package is:

  1. Rename BencodeParser to BencodeTokenizer

The tokenizer would return only the parsed bencoded tokens:

  • Integer
  • String
  • List begin
  • Dict begin
  • List or Dict end

The BencodeTokenizer would return an iterator so you can iterate over tokens in the input. In fact, it would be like a nested iterator that wraps the Read trait.

  2. Implement an iterator

The BencodeTokenizer would implement an iterator that returns the tokens.

I think the easiest way to refactor the current state is to reuse the current writer: wherever we use the "writer", we change it to just return the token. This could be added in parallel without changing the main "parse" function.

  3. Create a new "smarter" writer for JSON

Considerations

I think that would be a big change, and we would also need to rename the crate :-).

Example generated by ChatGPT


Here is a hypothetical implementation and example usage for a Rust crate that provides a BEncodeTokenizer and supports output generation to JSON and YAML formats. This will showcase the traits and how the components interact:

Traits

Tokenizer Trait

pub trait Tokenizer {
    type Token;

    fn next_token(&mut self) -> Option<Self::Token>;
}

Writer Trait

pub trait Writer<W: std::io::Write> {
    fn write_token(&mut self, token: Token) -> Result<(), std::io::Error>;
}

Example Structs

BEncodeTokenizer

use std::io::{self, Read};

// `Clone` is needed because the example usage below writes each token twice.
#[derive(Clone)]
pub enum Token {
    Integer(i64),
    String(String),
    ListBegin,
    DictBegin,
    ListOrDictEnd,
}

pub struct BEncodeTokenizer<R: Read> {
    reader: R,
    buffer: Vec<u8>,
    pos: usize,
}

impl<R: Read> BEncodeTokenizer<R> {
    pub fn new(reader: R) -> Self {
        Self {
            reader,
            buffer: Vec::new(),
            pos: 0,
        }
    }

    fn read_next(&mut self) -> io::Result<()> {
        let mut buf = [0; 1024];
        let bytes_read = self.reader.read(&mut buf)?;
        self.buffer.extend_from_slice(&buf[..bytes_read]);
        Ok(())
    }
}

impl<R: Read> Iterator for BEncodeTokenizer<R> {
    type Item = Token;

    fn next(&mut self) -> Option<Self::Item> {
        // Implement tokenization logic based on the BEncode format.
        // This is just stubbed logic; a real implementation would parse `self.buffer`.
        if self.pos >= self.buffer.len() {
            self.read_next().ok()?;
        }
        // Parse tokens from `self.buffer` here
        None
    }
}

JsonWriter

use std::io::Write;

pub struct JsonWriter<W: Write> {
    writer: W,
}

impl<W: Write> JsonWriter<W> {
    pub fn new(writer: W) -> Self {
        Self { writer }
    }
}

impl<W: Write> Writer<W> for JsonWriter<W> {
    fn write_token(&mut self, token: Token) -> Result<(), std::io::Error> {
        match token {
            Token::Integer(val) => write!(self.writer, "{}", val)?,
            Token::String(val) => write!(self.writer, "\"{}\"", val)?,
            Token::ListBegin => write!(self.writer, "[")?,
            Token::DictBegin => write!(self.writer, "{{")?,
            Token::ListOrDictEnd => write!(self.writer, "]")?, // or "}" if dictionary
        }
        Ok(())
    }
}

YamlWriter

use std::io::Write;

pub struct YamlWriter<W: Write> {
    writer: W,
}

impl<W: Write> YamlWriter<W> {
    pub fn new(writer: W) -> Self {
        Self { writer }
    }
}

impl<W: Write> Writer<W> for YamlWriter<W> {
    fn write_token(&mut self, token: Token) -> Result<(), std::io::Error> {
        match token {
            Token::Integer(val) => write!(self.writer, "{}", val)?,
            Token::String(val) => write!(self.writer, "\"{}\"", val)?,
            Token::ListBegin => write!(self.writer, "-")?,
            Token::DictBegin => write!(self.writer, "")?,
            Token::ListOrDictEnd => write!(self.writer, "")?,
        }
        Ok(())
    }
}

Example Usage

use std::fs::File;
use std::io::{self, BufReader, BufWriter};

fn main() -> io::Result<()> {
    // Input BEncode file
    let input = File::open("example.bencode")?;
    let reader = BufReader::new(input);

    // Output JSON file
    let output_json = File::create("output.json")?;
    let writer_json = BufWriter::new(output_json);

    // Output YAML file
    let output_yaml = File::create("output.yaml")?;
    let writer_yaml = BufWriter::new(output_yaml);

    // Tokenizer
    let tokenizer = BEncodeTokenizer::new(reader);

    // JSON writer
    let mut json_writer = JsonWriter::new(writer_json);

    // YAML writer
    let mut yaml_writer = YamlWriter::new(writer_yaml);

    // Process tokens
    for token in tokenizer {
        json_writer.write_token(token.clone())?;
        yaml_writer.write_token(token)?;
    }

    Ok(())
}

Explanation

  • BEncodeTokenizer reads the input stream and breaks it into tokens based on the BEncode format.
  • JsonWriter and YamlWriter implement the Writer trait to write the tokens in the desired format.
  • The main function demonstrates how to use the tokenizer and writers to convert BEncode data into JSON and YAML outputs.

This design makes it easy to extend the library with additional writers for other formats, like XML or custom formats.
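As an illustration of that claim (my sketch, not part of the proposal), one more backend under the same Writer trait could look like this; the element names are invented:

use std::io::Write;

pub struct XmlWriter<W: Write> {
    writer: W,
}

impl<W: Write> XmlWriter<W> {
    pub fn new(writer: W) -> Self {
        Self { writer }
    }
}

impl<W: Write> Writer<W> for XmlWriter<W> {
    fn write_token(&mut self, token: Token) -> Result<(), std::io::Error> {
        match token {
            Token::Integer(val) => write!(self.writer, "<int>{}</int>", val)?,
            Token::String(val) => write!(self.writer, "<str>{}</str>", val)?,
            Token::ListBegin => write!(self.writer, "<list>")?,
            Token::DictBegin => write!(self.writer, "<dict>")?,
            Token::ListOrDictEnd => write!(self.writer, "</list>")?, // or "</dict>" if dictionary
        }
        Ok(())
    }
}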

If you want to generate JSON, you have to combine the tokenizer with the writer or generator you want to use; in this case, the JSON one.

cc @da2ce7

@magecnion (Collaborator) commented Nov 30, 2024

Thank you for sharing your proposal. I would like to clarify a few points before deciding on the renaming.

It’s not entirely clear to me who is responsible for decoding (aka parsing), which in the case of JSON includes handling non-UTF8 bytes by writing them as hexadecimal.

On one hand, the proposal states that the smarter writer is responsible for writing the tokens in the desired format. On the other hand, the following statement regarding the tokenizer:

“Reuse the current writer. Wherever we use the 'writer,' we change it to just return the token. This could be added in parallel without changing the main 'parse' function.”

suggests that decoding/parsing will remain part of the tokenizing phase because it involves the parse function, which includes all the decoding/parsing code.

Therefore, I am unsure whether the proposal intends for the tokenizer to handle decoding/parsing or to move that code to the writer so that each writer manages the decoding/parsing phase.
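For reference, a minimal sketch (mine, not the crate's code) of what that decoding step could look like if it lived in the JSON writer; the bare hex fallback is illustrative, since the actual library wraps such bytes in tags, and JSON string escaping is elided:

use std::io::{self, Write};

// Hypothetical helper: write a bencoded byte string as a JSON string,
// falling back to hexadecimal when the bytes are not valid UTF-8.
fn write_json_string<W: Write>(writer: &mut W, bytes: &[u8]) -> io::Result<()> {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8: write as a JSON string (escaping elided here).
        Ok(s) => write!(writer, "\"{}\"", s),
        // Non-UTF-8: emit the raw bytes as hexadecimal digits instead.
        Err(_) => {
            write!(writer, "\"")?;
            for byte in bytes {
                write!(writer, "{byte:02x}")?;
            }
            write!(writer, "\"")
        }
    }
}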

@da2ce7 (Contributor) commented Nov 30, 2024

use std::vec::IntoIter;

/// A recursive nested data structure that can contain items or lists of nested items.
#[derive(Debug)]
pub enum Nested<T> {
    Item(T),
    List(Vec<Nested<T>>),
}

impl<T> IntoIterator for Nested<T> {
    type Item = T;
    type IntoIter = NestedIntoIterator<T>;

    fn into_iter(self) -> Self::IntoIter {
        NestedIntoIterator::new(self)
    }
}

/// An iterator over a `Nested<T>` that consumes the nested structure and yields items of type `T`.
pub struct NestedIntoIterator<T> {
    stack: Vec<IntoIter<Nested<T>>>,
}

impl<T> NestedIntoIterator<T> {
    /// Creates a new `NestedIntoIterator` from a `Nested<T>`.
    pub fn new(nested: Nested<T>) -> Self {
        let mut iter = Self { stack: Vec::new() };
        iter.push_nested(nested);
        iter
    }

    /// Helper method to push nested elements onto the stack.
    fn push_nested(&mut self, nested: Nested<T>) {
        match nested {
            Nested::Item(item) => {
                // Wrap the single item in a one-element Vec so both variants
                // push the same iterator type onto the stack.
                let single_iter = vec![Nested::Item(item)].into_iter();
                self.stack.push(single_iter);
            }
            Nested::List(list) => {
                self.stack.push(list.into_iter());
            }
        }
    }
}

impl<T> Iterator for NestedIntoIterator<T> {
    type Item = T;

    fn next(&mut self) -> Option<Self::Item> {
        while let Some(top_iter) = self.stack.last_mut() {
            match top_iter.next() {
                Some(Nested::Item(item)) => return Some(item),
                Some(Nested::List(list)) => {
                    self.stack.push(list.into_iter());
                }
                None => {
                    self.stack.pop();
                }
            }
        }
        None
    }
}

impl<'a, T> IntoIterator for &'a Nested<T> {
    type Item = &'a T;
    type IntoIter = NestedRefIterator<'a, T>;

    fn into_iter(self) -> Self::IntoIter {
        NestedRefIterator::new(self)
    }
}

/// An iterator over references to items in a `Nested<T>`.
pub struct NestedRefIterator<'a, T> {
    stack: Vec<std::slice::Iter<'a, Nested<T>>>,
}

impl<'a, T> NestedRefIterator<'a, T> {
    /// Creates a new `NestedRefIterator` from a reference to a `Nested<T>`.
    pub fn new(nested: &'a Nested<T>) -> Self {
        let mut iter = Self { stack: Vec::new() };
        iter.push_nested(nested);
        iter
    }

    /// Helper method to push nested elements onto the stack.
    fn push_nested(&mut self, nested: &'a Nested<T>) {
        match nested {
            Nested::Item(_) => {
                // Create an iterator over a single item without allocating a slice.
                let single_iter = std::slice::from_ref(nested).iter();
                self.stack.push(single_iter);
            }
            Nested::List(list) => {
                self.stack.push(list.iter());
            }
        }
    }
}

impl<'a, T> Iterator for NestedRefIterator<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<Self::Item> {
        while let Some(top_iter) = self.stack.last_mut() {
            match top_iter.next() {
                Some(Nested::Item(item)) => return Some(item),
                Some(Nested::List(list)) => {
                    self.stack.push(list.iter());
                }
                None => {
                    self.stack.pop();
                }
            }
        }
        None
    }
}
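A small usage sketch for these iterators (my addition, assuming the types above are in scope): the nested structure flattens to its leaves in order, first by reference and then by value.

fn main() {
    let nested = Nested::List(vec![
        Nested::Item(1),
        Nested::List(vec![Nested::Item(2), Nested::Item(3)]),
        Nested::Item(4),
    ]);

    // Iterate by reference first, then consume the structure.
    let by_ref: Vec<&i32> = (&nested).into_iter().collect();
    assert_eq!(by_ref, vec![&1, &2, &3, &4]);

    let owned: Vec<i32> = nested.into_iter().collect();
    assert_eq!(owned, vec![1, 2, 3, 4]);
}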

@da2ce7 (Contributor) commented Nov 30, 2024

use std::cell::RefCell;
use std::rc::Rc;

pub enum Nested<'a, T> {
    Item(T),
    List(Box<dyn Iterator<Item = Nested<'a, T>> + 'a>),
}


pub struct NestedTokenizer<'a> {
    data: Rc<RefCell<&'a [u8]>>,
}

impl<'a> NestedTokenizer<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        NestedTokenizer {
            data: Rc::new(RefCell::new(data)),
        }
    }
}

impl<'a> Iterator for NestedTokenizer<'a> {
    type Item = Nested<'a, u8>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut data = self.data.borrow_mut();
        if data.is_empty() {
            return None;
        }

        let byte = data[0];
        *data = &data[1..]; // Advance the slice

        match byte {
            b'[' => {
                // Start of a nested list
                let nested_iter = NestedListIterator::new(Rc::clone(&self.data));
                Some(Nested::List(Box::new(nested_iter)))
            }
            b']' => {
                // A top-level ']' has no matching '[' in this toy format.
                panic!("Should not happen here");
            }
            _ => {
                Some(Nested::Item(byte))
            }
        }
    }
}

struct NestedListIterator<'a> {
    data: Rc<RefCell<&'a [u8]>>,
}

impl<'a> NestedListIterator<'a> {
    fn new(data: Rc<RefCell<&'a [u8]>>) -> Self {
        NestedListIterator { data }
    }
}

impl<'a> Iterator for NestedListIterator<'a> {
    type Item = Nested<'a, u8>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut data = self.data.borrow_mut();
        while !data.is_empty() {
            let byte = data[0];
            *data = &data[1..]; // Advance the slice

            match byte {
                b'[' => {
                    // Start of a new nested list
                    let nested_iter = NestedListIterator::new(Rc::clone(&self.data));
                    return Some(Nested::List(Box::new(nested_iter)));
                }
                b']' => {
                    // End of the current list
                    return None;
                }
                _ => {
                    return Some(Nested::Item(byte));
                }
            }
        }
        None
    }
}
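A possible driver for this sketch (my addition): the input uses the toy '[' / ']' bracket format above, not bencode, and consuming the inner iterator is what advances the shared cursor past the closing bracket.

fn main() {
    let tokenizer = NestedTokenizer::new(b"a[bc]d");
    for token in tokenizer {
        match token {
            Nested::Item(byte) => println!("item: {}", byte as char),
            Nested::List(list) => {
                // The inner iterator shares the cursor with the outer one,
                // so we must consume it before asking for the next token.
                for inner in list {
                    if let Nested::Item(byte) = inner {
                        println!("  nested item: {}", byte as char);
                    }
                }
            }
        }
    }
}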

@da2ce7 (Contributor) commented Nov 30, 2024

use std::cell::RefCell;
use std::rc::Rc;

pub enum Bencode<'a> {
    Integer(i64),
    ByteString(&'a [u8]),
    List(Box<dyn Iterator<Item = Bencode<'a>> + 'a>),
    Dictionary(Box<dyn Iterator<Item = (&'a [u8], Bencode<'a>)> + 'a>),
}

pub struct BencodeTokenizer<'a> {
    data: Rc<RefCell<&'a [u8]>>,
}

impl<'a> BencodeTokenizer<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        BencodeTokenizer {
            data: Rc::new(RefCell::new(data)),
        }
    }
}

impl<'a> Iterator for BencodeTokenizer<'a> {
    type Item = Bencode<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut data = self.data.borrow_mut();

        if data.is_empty() {
            return None;
        }

        match data[0] {
            b'i' => {
                *data = &data[1..];
                let end = data.iter().position(|&b| b == b'e')?;
                let num = std::str::from_utf8(&data[..end]).ok()?.parse().ok()?;
                *data = &data[end + 1..];
                Some(Bencode::Integer(num))
            }
            b'l' => {
                *data = &data[1..];
                let iter = BencodeListIterator::new(Rc::clone(&self.data));
                Some(Bencode::List(Box::new(iter)))
            }
            b'd' => {
                *data = &data[1..];
                let iter = BencodeDictIterator::new(Rc::clone(&self.data));
                Some(Bencode::Dictionary(Box::new(iter)))
            }
            b'0'..=b'9' => {
                let colon = data.iter().position(|&b| b == b':')?;
                let len = std::str::from_utf8(&data[..colon]).ok()?.parse::<usize>().ok()?;
                *data = &data[colon + 1..];
                let bytes = &data[..len];
                *data = &data[len..];
                Some(Bencode::ByteString(bytes))
            }
            _ => None,
        }
    }
}

struct BencodeListIterator<'a> {
    data: Rc<RefCell<&'a [u8]>>,
}

impl<'a> BencodeListIterator<'a> {
    fn new(data: Rc<RefCell<&'a [u8]>>) -> Self {
        BencodeListIterator { data }
    }
}

impl<'a> Iterator for BencodeListIterator<'a> {
    type Item = Bencode<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut data = self.data.borrow_mut();

        if data.is_empty() || data[0] == b'e' {
            if !data.is_empty() {
                *data = &data[1..];
            }
            return None;
        }

        drop(data);
        BencodeTokenizer {
            data: Rc::clone(&self.data),
        }
        .next()
    }
}

struct BencodeDictIterator<'a> {
    data: Rc<RefCell<&'a [u8]>>,
}

impl<'a> BencodeDictIterator<'a> {
    fn new(data: Rc<RefCell<&'a [u8]>>) -> Self {
        BencodeDictIterator { data }
    }
}

impl<'a> Iterator for BencodeDictIterator<'a> {
    type Item = (&'a [u8], Bencode<'a>);

    fn next(&mut self) -> Option<Self::Item> {
        let mut data = self.data.borrow_mut();

        if data.is_empty() || data[0] == b'e' {
            if !data.is_empty() {
                *data = &data[1..];
            }
            return None;
        }

        let colon = data.iter().position(|&b| b == b':')?;
        let len = std::str::from_utf8(&data[..colon]).ok()?.parse::<usize>().ok()?;
        *data = &data[colon + 1..];
        let key = &data[..len];
        *data = &data[len..];

        drop(data);
        let value = BencodeTokenizer {
            data: Rc::clone(&self.data),
        }
        .next()?;

        Some((key, value))
    }
}
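A possible driver for this version (my addition): lazily iterate a bencoded list and print each element.

fn main() {
    let mut tokenizer = BencodeTokenizer::new(b"l4:spami42ee");
    if let Some(Bencode::List(items)) = tokenizer.next() {
        for item in items {
            match item {
                Bencode::Integer(i) => println!("integer: {i}"),
                Bencode::ByteString(s) => println!("string: {}", String::from_utf8_lossy(s)),
                // A nested list or dict would have to be consumed here
                // before asking the outer iterator for the next item.
                _ => println!("nested container"),
            }
        }
    }
}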

@da2ce7 (Contributor) commented Dec 1, 2024

use num_bigint::BigInt;
use num_traits::Num;
use std::cell::RefCell;
use std::io::{self, BufRead};
use std::rc::Rc;

const MAX_RECURSION_DEPTH: usize = 100;
const MAX_BYTESTRING_LENGTH: usize = 10 * 1024 * 1024;
const MAX_INTEGER_DIGITS: usize = 309;
const MAX_INTEGER_BITS: u64 = 1024;

#[derive(Debug, PartialEq, Clone)]
pub enum BencodeData {
    Integer(BigInt),
    ByteString(Vec<u8>),
    List(Vec<BencodeData>),
    Dictionary(Vec<(Vec<u8>, BencodeData)>),
}

pub struct BencodeTokenizer<'a, R: BufRead + 'a> {
    data: Rc<RefCell<R>>,
    eof: bool,
    _marker: std::marker::PhantomData<&'a ()>,
}

impl<'a, R: BufRead + 'a> BencodeTokenizer<'a, R> {
    pub fn new(reader: R) -> Self {
        BencodeTokenizer {
            data: Rc::new(RefCell::new(reader)),
            eof: false,
            _marker: std::marker::PhantomData,
        }
    }

    fn next_with_depth(&mut self, depth: usize) -> Option<io::Result<Bencode<'a, R>>> {
        if self.eof {
            return None;
        }

        if depth > MAX_RECURSION_DEPTH {
            self.eof = true;
            return Some(Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "Maximum recursion depth exceeded",
            )));
        }

        let mut data = self.data.borrow_mut();
        let byte = match peek_byte(&mut *data) {
            Ok(b) => b,
            Err(e) => {
                self.eof = true;
                if e.kind() == io::ErrorKind::UnexpectedEof {
                    return None;
                } else {
                    return Some(Err(e));
                }
            }
        };

        match byte {
            b'i' => {
                data.consume(1);
                drop(data);
                let num = match read_integer(&self.data) {
                    Ok(num) => num,
                    Err(e) => {
                        self.eof = true;
                        return Some(Err(e));
                    }
                };
                Some(Ok(Bencode::Integer(num)))
            }
            b'l' => {
                data.consume(1);
                drop(data);
                let iter = BencodeListIterator {
                    data: Rc::clone(&self.data),
                    eof: false,
                    depth: depth + 1,
                    _marker: std::marker::PhantomData,
                };
                Some(Ok(Bencode::List(iter)))
            }
            b'd' => {
                data.consume(1);
                drop(data);
                let iter = BencodeDictIterator {
                    data: Rc::clone(&self.data),
                    eof: false,
                    last_key: None,
                    depth: depth + 1,
                    _marker: std::marker::PhantomData,
                };
                Some(Ok(Bencode::Dictionary(iter)))
            }
            b'0'..=b'9' => {
                drop(data);
                let bytes = match read_bytestring(&self.data) {
                    Ok(bytes) => bytes,
                    Err(e) => {
                        self.eof = true;
                        return Some(Err(e));
                    }
                };
                Some(Ok(Bencode::ByteString(bytes)))
            }
            _ => {
                self.eof = true;
                Some(Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("Invalid bencode prefix: '{}'", byte as char),
                )))
            }
        }
    }
}

impl<'a, R: BufRead + 'a> Iterator for BencodeTokenizer<'a, R> {
    type Item = io::Result<Bencode<'a, R>>;

    fn next(&mut self) -> Option<Self::Item> {
        self.next_with_depth(0)
    }
}

#[derive(Debug)]
pub enum Bencode<'a, R: BufRead + 'a> {
    Integer(BigInt),
    ByteString(Vec<u8>),
    List(BencodeListIterator<'a, R>),
    Dictionary(BencodeDictIterator<'a, R>),
}

#[derive(Debug)]
pub struct BencodeListIterator<'a, R: BufRead + 'a> {
    data: Rc<RefCell<R>>,
    eof: bool,
    depth: usize,
    _marker: std::marker::PhantomData<&'a ()>,
}

impl<'a, R: BufRead + 'a> Iterator for BencodeListIterator<'a, R> {
    type Item = io::Result<Bencode<'a, R>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.eof {
            return None;
        }

        if self.depth > MAX_RECURSION_DEPTH {
            self.eof = true;
            return Some(Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "Maximum recursion depth exceeded",
            )));
        }

        let mut data = self.data.borrow_mut();
        let peek_result = peek_byte(&mut *data);

        match peek_result {
            Ok(byte) => {
                if byte == b'e' {
                    data.consume(1);
                    self.eof = true;
                    return None;
                }
            }
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => {
                self.eof = true;
                return Some(Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "Unexpected EOF in list",
                )));
            }
            Err(e) => {
                self.eof = true;
                return Some(Err(e));
            }
        }

        drop(data);

        let mut tokenizer = BencodeTokenizer {
            data: Rc::clone(&self.data),
            eof: false,
            _marker: std::marker::PhantomData,
        };

        tokenizer.next_with_depth(self.depth)
    }
}

#[derive(Debug)]
pub struct BencodeDictIterator<'a, R: BufRead + 'a> {
    data: Rc<RefCell<R>>,
    eof: bool,
    last_key: Option<Vec<u8>>,
    depth: usize,
    _marker: std::marker::PhantomData<&'a ()>,
}

impl<'a, R: BufRead + 'a> Iterator for BencodeDictIterator<'a, R> {
    type Item = io::Result<(Vec<u8>, Bencode<'a, R>)>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.eof {
            return None;
        }

        if self.depth > MAX_RECURSION_DEPTH {
            self.eof = true;
            return Some(Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "Maximum recursion depth exceeded",
            )));
        }

        let mut data = self.data.borrow_mut();
        let peek_result = peek_byte(&mut *data);

        match peek_result {
            Ok(byte) => {
                if byte == b'e' {
                    data.consume(1);
                    self.eof = true;
                    return None;
                }
            }
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => {
                self.eof = true;
                return Some(Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "Unexpected EOF in dictionary",
                )));
            }
            Err(e) => {
                self.eof = true;
                return Some(Err(e));
            }
        }

        drop(data);

        let key = match read_bytestring(&self.data) {
            Ok(bytes) => bytes,
            Err(e) => {
                self.eof = true;
                return Some(Err(e));
            }
        };

        if let Some(ref last_key) = self.last_key {
            match key.cmp(last_key) {
                std::cmp::Ordering::Less => {
                    self.eof = true;
                    return Some(Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "Dictionary keys are not sorted lexicographically",
                    )));
                }
                std::cmp::Ordering::Equal => {
                    self.eof = true;
                    return Some(Err(io::Error::new(
                        io::ErrorKind::InvalidData,
                        "Duplicate dictionary key",
                    )));
                }
                std::cmp::Ordering::Greater => {}
            }
        }
        self.last_key = Some(key.clone());

        let mut tokenizer = BencodeTokenizer {
            data: Rc::clone(&self.data),
            eof: false,
            _marker: std::marker::PhantomData,
        };

        let value = match tokenizer.next_with_depth(self.depth) {
            Some(Ok(v)) => v,
            Some(Err(e)) => {
                self.eof = true;
                return Some(Err(e));
            }
            None => {
                self.eof = true;
                return Some(Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "Expected value after key",
                )));
            }
        };

        Some(Ok((key, value)))
    }
}

fn read_byte<R: BufRead>(data: &mut R) -> io::Result<u8> {
    let mut buf = [0u8; 1];
    data.read_exact(&mut buf)?;
    Ok(buf[0])
}

fn peek_byte<R: BufRead>(data: &mut R) -> io::Result<u8> {
    let buf = data.fill_buf()?;
    if buf.is_empty() {
        Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "Unexpected EOF",
        ))
    } else {
        Ok(buf[0])
    }
}

fn read_until<R: BufRead>(data: &mut R, delim: u8, max_length: usize) -> io::Result<Vec<u8>> {
    let mut bytes = Vec::with_capacity(max_length);
    for _ in 0..max_length {
        let byte = read_byte(data)?;
        if byte == delim {
            return Ok(bytes);
        }
        bytes.push(byte);
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "Exceeded maximum read length",
    ))
}

const MAX_LENGTH_DIGITS: usize = 10; // Maximum digits in length string

fn read_bytestring<R: BufRead>(data: &Rc<RefCell<R>>) -> io::Result<Vec<u8>> {
    let mut data_mut = data.borrow_mut();
    let length_str = read_until(&mut *data_mut, b':', MAX_LENGTH_DIGITS)?;
    let length_str = std::str::from_utf8(&length_str).map_err(|_| {
        io::Error::new(
            io::ErrorKind::InvalidData,
            "Invalid UTF-8 in byte string length",
        )
    })?;
    if !length_str.chars().all(|c| c.is_ascii_digit()) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "Non-numeric byte string length",
        ));
    }
    if length_str == "0" {
        return Ok(vec![0u8; 0]);
    }
    if length_str.starts_with('0') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "Leading zero in length",
        ));
    }
    let length = length_str
        .parse::<usize>()
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "Invalid string length"))?;
    if length > MAX_BYTESTRING_LENGTH {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "Byte string length exceeds maximum allowed size",
        ));
    }
    let mut buffer = vec![0u8; length];
    data_mut.read_exact(&mut buffer)?;
    Ok(buffer)
}

fn read_integer<R: BufRead>(data: &Rc<RefCell<R>>) -> io::Result<BigInt> {
    let mut data_mut = data.borrow_mut();
    let mut int_bytes = Vec::new();

    loop {
        let byte = match read_byte(&mut *data_mut) {
            Ok(b) => b,
            Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => {
                return Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "Unexpected EOF in integer",
                ));
            }
            Err(e) => return Err(e),
        };

        if byte == b'e' {
            break;
        }
        if int_bytes.len() >= MAX_INTEGER_DIGITS {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "Integer exceeds maximum allowed digits for creating a 1024-bit signed integer",
            ));
        }
        int_bytes.push(byte);
    }

    let int_str = std::str::from_utf8(&int_bytes)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "Invalid UTF-8 in integer"))?;

    if int_str == "0" {
        return Ok(BigInt::ZERO);
    }

    if !int_str.chars().all(|c| c.is_ascii_digit() || c == '-') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "non-signed-numeric in integer",
        ));
    }

    if int_str.is_empty() || int_str == "-" {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing integer value",
        ));
    }

    if int_str.starts_with("-0") || int_str.starts_with('0') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "integer should not start with zero",
        ));
    }

    let value = BigInt::from_str_radix(int_str, 10)
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "Invalid integer format"))?;

    if value.bits() > MAX_INTEGER_BITS {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "Integer created exceeds maximum allowed bit length",
        ));
    }

    Ok(value)
}

pub fn collect_bencode_with_depth<'a, R: BufRead + 'a>(
    bencode: Bencode<'a, R>,
    depth: usize,
) -> io::Result<BencodeData> {
    if depth > MAX_RECURSION_DEPTH {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "Maximum recursion depth exceeded",
        ));
    }

    match bencode {
        Bencode::Integer(i) => Ok(BencodeData::Integer(i)),
        Bencode::ByteString(s) => Ok(BencodeData::ByteString(s)),
        Bencode::List(mut iter) => {
            let mut items = Vec::new();
            while let Some(item) = iter.next() {
                let value = collect_bencode_with_depth(item?, depth + 1)?;
                items.push(value);
            }
            Ok(BencodeData::List(items))
        }
        Bencode::Dictionary(mut iter) => {
            let mut items = Vec::new();
            while let Some(item) = iter.next() {
                let (key, value) = item?;
                let value = collect_bencode_with_depth(value, depth + 1)?;
                items.push((key, value));
            }
            Ok(BencodeData::Dictionary(items))
        }
    }
}

pub fn collect_bencode<'a, R: BufRead + 'a>(bencode: Bencode<'a, R>) -> io::Result<BencodeData> {
    collect_bencode_with_depth(bencode, 0)
}

#[cfg(test)]
mod tests {
    use super::*;
    use num_bigint::BigInt;
    use std::io::Cursor;

    #[test]
    fn test_bencode_parser() {
        use BencodeData::*;

        let test_cases = vec![
            // Valid integers
            (b"i42e".to_vec(), Ok(Integer(42.into())), "Valid integer"),
            (
                b"i-42e".to_vec(),
                Ok(Integer((-42).into())),
                "Valid negative integer",
            ),
            (b"i0e".to_vec(), Ok(Integer(0.into())), "Valid zero integer"),
            (
                b"i-1e".to_vec(),
                Ok(Integer((-1).into())),
                "Valid negative one",
            ),
            (
                b"i999999999999999999999999999999999999e".to_vec(),
                Ok(Integer(
                    "999999999999999999999999999999999999".parse().unwrap(),
                )),
                "Valid large integer",
            ),
            (
                b"i-999999999999999999999999999999999999e".to_vec(),
                Ok(Integer(
                    "-999999999999999999999999999999999999".parse().unwrap(),
                )),
                "Valid negative large integer",
            ),
            // Invalid integers
            (
                b"i-e".to_vec(),
                Err("missing integer value"),
                "Missing digits after minus sign",
            ),
            (
                b"i-0e".to_vec(),
                Err("integer should not start with zero"),
                "Negative zero",
            ),
            (
                b"i042e".to_vec(),
                Err("integer should not start with zero"),
                "Leading zero",
            ),
            (
                b"i+42e".to_vec(),
                Err("non-signed-numeric in integer"),
                "Plus sign",
            ),
            (
                b"i4.2e".to_vec(),
                Err("non-signed-numeric in integer"),
                "Non-digit character",
            ),
            (
                b"i4 2e".to_vec(),
                Err("non-signed-numeric in integer"),
                "Whitespace character",
            ),
            (
                b"ie".to_vec(),
                Err("missing integer value"),
                "Empty integer",
            ),
            (
                b"i-42".to_vec(),
                Err("Unexpected EOF in integer"),
                "Missing 'e' terminator",
            ),
            // Valid byte strings
            (b"0:".to_vec(), Ok(ByteString(vec![])), "Empty string"),
            (
                b"3:abc".to_vec(),
                Ok(ByteString(b"abc".to_vec())),
                "Simple string",
            ),
            (
                b"5:hello".to_vec(),
                Ok(ByteString(b"hello".to_vec())),
                "Hello string",
            ),
            (
                b"11:hello world".to_vec(),
                Ok(ByteString(b"hello world".to_vec())),
                "String with spaces",
            ),
            (
                b"5:helloextra".to_vec(),
                Ok(ByteString(b"hello".to_vec())),
                "Extra data after string",
            ),
            // Invalid byte strings
            (
                b"03:abc".to_vec(),
                Err("Leading zero in length"),
                "Leading zero in length",
            ),
            (
                b"0abc:hello".to_vec(),
                Err("Non-numeric byte string length"),
                "Non-numeric length",
            ),
            (
                b"-5:hello".to_vec(),
                Err("Invalid bencode prefix: '-'"),
                "Negative length",
            ),
            // Valid lists
            (b"le".to_vec(), Ok(List(vec![])), "Empty list"),
            (
                b"l4:spami42ee".to_vec(),
                Ok(List(vec![ByteString(b"spam".to_vec()), Integer(42.into())])),
                "List with string and integer",
            ),
            (
                b"l4:spam4:eggse".to_vec(),
                Ok(List(vec![
                    ByteString(b"spam".to_vec()),
                    ByteString(b"eggs".to_vec()),
                ])),
                "List of strings",
            ),
            // Invalid lists
            (
                b"l4:spam".to_vec(),
                Err("Unexpected EOF in list"),
                "Unterminated list",
            ),
            (
                b"l".to_vec(),
                Err("Unexpected EOF in list"),
                "Empty unterminated list",
            ),
            // Valid dictionaries
            (b"de".to_vec(), Ok(Dictionary(vec![])), "Empty dictionary"),
            (
                b"d3:cow3:mooe".to_vec(),
                Ok(Dictionary(vec![(
                    b"cow".to_vec(),
                    ByteString(b"moo".to_vec()),
                )])),
                "Simple dictionary",
            ),
            (
                b"d4:spaml1:a1:bee".to_vec(),
                Ok(Dictionary(vec![(
                    b"spam".to_vec(),
                    List(vec![ByteString(b"a".to_vec()), ByteString(b"b".to_vec())]),
                )])),
                "Dictionary with list",
            ),
            // Invalid dictionaries
            (
                b"d3:key5:value".to_vec(),
                Err("Unexpected EOF in dictionary"),
                "Unterminated dictionary",
            ),
            // Unsorted keys
            (
                b"d3:baa1:13:aaa1:2e".to_vec(),
                Err("Dictionary keys are not sorted lexicographically"),
                "Unsorted dictionary keys",
            ),
            // Duplicate keys
            (
                b"d3:key5:value3:key3:bbbe".to_vec(),
                Err("Duplicate dictionary key"),
                "Duplicate dictionary keys",
            ),
            // Non-UTF8 keys
            (
                b"d3:aaa3:bbb2:\xFF\xFE5:valuee".to_vec(),
                Ok(Dictionary(vec![
                    (b"aaa".to_vec(), ByteString(b"bbb".to_vec())),
                    (vec![0xFF, 0xFE], ByteString(b"value".to_vec())),
                ])),
                "Non-UTF8 dictionary keys",
            ),
            // Invalid prefixes
            (
                b"x42e".to_vec(),
                Err("Invalid bencode prefix: 'x'"),
                "Invalid prefix 'x'",
            ),
            (
                b"f3.14e".to_vec(),
                Err("Invalid bencode prefix: 'f'"),
                "Invalid prefix 'f'",
            ),
            (
                b"n123e".to_vec(),
                Err("Invalid bencode prefix: 'n'"),
                "Invalid prefix 'n'",
            ),
            // Extra data after valid bencode
            (
                b"i42ex".to_vec(),
                Ok(Integer(42.into())),
                "Integer with extra data",
            ),
        ];

        for (data, expected, description) in test_cases {
            let cursor = Cursor::new(data);
            let mut tokenizer = BencodeTokenizer::new(cursor);
            let result = tokenizer.next();

            match (result, expected) {
                (Some(Ok(bencode)), Ok(ref expected_bencode)) => {
                    let collected = collect_bencode(bencode).expect("Failed to collect bencode");
                    assert_eq!(&collected, expected_bencode, "{}", description);
                }
                (Some(Err(e)), Err(expected_error)) => {
                    assert_eq!(e.to_string(), expected_error, "{}", description);
                }
                (Some(Ok(bencode)), Err(expected_error)) => {
                    let collected = collect_bencode(bencode);
                    if let Err(e) = collected {
                        assert_eq!(e.to_string(), expected_error, "{}", description);
                    } else {
                        panic!(
                            "{}: Expected error '{}', but got success",
                            description, expected_error
                        );
                    }
                }
                (Some(Err(e)), Ok(_)) => {
                    panic!("{}: Expected success, but got error '{}'", description, e);
                }
                (None, _) => {
                    panic!("{}: No result returned", description);
                }
            }
        }
    }

    #[test]
    fn test_recursion_depth_limit() {
        let mut data = Vec::new();
        for _ in 0..MAX_RECURSION_DEPTH {
            data.push(b'l');
        }
        data.extend(vec![b'e'; MAX_RECURSION_DEPTH]);

        let cursor = Cursor::new(&data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap();
        let bencode = result.unwrap();
        assert!(
            collect_bencode(bencode).is_ok(),
            "Should succeed at max depth"
        );

        data.insert(0, b'l');
        data.push(b'e');

        let cursor = Cursor::new(&data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap();
        let bencode = result.unwrap();
        let collected = collect_bencode(bencode);
        assert!(collected.is_err(), "Should fail exceeding max depth");
        assert_eq!(
            collected.unwrap_err().to_string(),
            "Maximum recursion depth exceeded",
            "Error message should indicate depth exceeded"
        );
    }

    #[test]
    fn test_large_integer_limits() {
        // Test integer with MAX_INTEGER_DIGITS digits
        let digits = "1".repeat(MAX_INTEGER_DIGITS);
        let data = format!("i{}e", digits);
        let cursor = Cursor::new(data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap().unwrap();
        let collected = collect_bencode(result).unwrap();

        if let BencodeData::Integer(value) = collected {
            assert_eq!(value.to_string(), digits);
        } else {
            panic!("Expected integer");
        }

        // Test integer with MAX_INTEGER_DIGITS + 1 digits (should error)
        let digits = "1".repeat(MAX_INTEGER_DIGITS + 1);
        let data = format!("i{}e", digits);
        let cursor = Cursor::new(data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap();
        let error = result.unwrap_err();
        assert_eq!(
            error.to_string(),
            "Integer exceeds maximum allowed digits for creating a 1024-bit signed integer"
        );

        // Test integer with MAX_INTEGER_BITS bits
        let max_value = (BigInt::from(1) << MAX_INTEGER_BITS) - 1;
        let data = format!("i{}e", max_value);
        let cursor = Cursor::new(data.to_string());
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap().unwrap();
        let collected = collect_bencode(result).unwrap();

        if let BencodeData::Integer(value) = collected {
            assert_eq!(value, max_value);
        } else {
            panic!("Expected integer");
        }

        // Test integer exceeding MAX_INTEGER_BITS (should error)
        let over_max_value = BigInt::from(1) << MAX_INTEGER_BITS;
        let data = format!("i{}e", over_max_value);
        let cursor = Cursor::new(data.to_string());
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap();
        let error = result.unwrap_err();
        assert_eq!(
            error.to_string(),
            "Integer created exceeds maximum allowed bit length"
        );

        // Test integer exceeding MAX_INTEGER_BITS but not MAX_INTEGER_DIGITS (should error)
        let digits = "9".repeat(MAX_INTEGER_DIGITS);
        let data = format!("i{}e", digits);
        let cursor = Cursor::new(data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap();
        let error = result.unwrap_err();
        assert_eq!(
            error.to_string(),
            "Integer created exceeds maximum allowed bit length"
        );
    }

    #[test]
    fn test_large_bytestring_limits() {
        let data = format!(
            "{}:{}",
            MAX_BYTESTRING_LENGTH,
            "a".repeat(MAX_BYTESTRING_LENGTH)
        );
        let cursor = Cursor::new(data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap().unwrap();
        let collected = collect_bencode(result).unwrap();

        if let BencodeData::ByteString(bytes) = collected {
            assert_eq!(bytes.len(), MAX_BYTESTRING_LENGTH);
        } else {
            panic!("Expected byte string");
        }

        let length = MAX_BYTESTRING_LENGTH + 1;
        let data = format!("{}:{}", length, "a".repeat(10));
        let cursor = Cursor::new(data);
        let mut tokenizer = BencodeTokenizer::new(cursor);
        let result = tokenizer.next().unwrap();
        let error = result.unwrap_err();
        assert_eq!(
            error.to_string(),
            "Byte string length exceeds maximum allowed size"
        );
    }
}
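For completeness, a small streaming driver on top of this implementation (my addition, not part of the snippet): it reads one dictionary entry at a time without collecting into BencodeData. Printing the value with {:?} is only safe here because it is a leaf; an unconsumed List or Dictionary value would have to be drained before reading the next entry.

fn example() -> io::Result<()> {
    use std::io::Cursor;

    let mut tokenizer = BencodeTokenizer::new(Cursor::new(b"d3:cow3:mooe".to_vec()));
    if let Some(Ok(Bencode::Dictionary(entries))) = tokenizer.next() {
        for entry in entries {
            let (key, value) = entry?;
            println!("{} => {value:?}", String::from_utf8_lossy(&key));
        }
    }
    Ok(())
}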

@da2ce7 (Contributor) commented Dec 1, 2024

@josecelano @magecnion In the last posts, with the help of https://github.com/SillyTavern/SillyTavern, I tried to implement a single-page BencodeParser, to learn about the concepts.

@josecelano (Member, Author)

Hi @magecnion @da2ce7

> Thank you for sharing your proposal. I would like to clarify a few points before deciding on the renaming.
>
> It’s not entirely clear to me who is responsible for decoding (aka parsing), which in the case of JSON includes handling non-UTF8 bytes by writing them as hexadecimal.

In my proposal, the "generator" would be responsible for converting from the byte sequence to the "" or "" tags, because that's only relevant for JSON. Other future formats might support byte sequences.

> On one hand, the proposal states that the smarter writer is responsible for writing the tokens in the desired format. On the other hand, the following statement regarding the tokenizer:
>
> “Reuse the current writer. Wherever we use the 'writer,' we change it to just return the token. This could be added in parallel without changing the main 'parse' function.”
>
> suggests that decoding/parsing will remain part of the tokenizing phase because it involves the parse function, which includes all the decoding/parsing code.
>
> Therefore, I am unsure whether the proposal intends for the tokenizer to handle decoding/parsing or to move that code to the writer so that each writer manages the decoding/parsing phase.

Sorry @magecnion, my explanation was confusing because I mixed the design with the implementation.

Design

  • Two pieces: Iterator and Generator.
  • The Iterator returns only bencode tokens (not bencoded values).
  • The Generator iterates over the bencode tokens, producing the final JSON output.
  • The Iterator only returns these tokens:
    • Integer
    • String
    • List begin
    • Dict begin
    • List or Dict end
  • For the String token, the value is always a byte sequence.
  • The Generator reads tokens and writes JSON.
  • The JSON generator would be responsible for converting strings into the format we are using with html-styled tags.

Implementation

What I would do is:

  • I would refactor the current solution towards the new design.
  • I think the easiest way is to reuse the current writer to return tokens and create a new writer to produce JSON. Of course, we would rename the reused writer to something different.

Alternatively, we could start from scratch with the new design since the lib is small.

Conclusion

I think we all agree that it is worth exploring this new design because:

  • It could easily be adapted for more output formats.
  • The Iterator (tokenizer) could be used for other things.
  • It could simplify the implementation by splitting the responsibilities.

I would create a new branch to test this new proposal.

@da2ce7 I think your example would be a parser like Serde, if I'm not wrong. That means we would need to store the whole bencoded value in memory before writing JSON for lists and dicts, while with the tokenizer we only need to keep one token. That would consume more memory than the current implementation, but since we already store whole strings, I don't think it would be a big change. Anyway, we have to be sure that performance remains more or less the same. The goal for this lib was to be fast and have low memory consumption; otherwise, we could have used Serde.

@da2ce7 (Contributor) commented Dec 2, 2024

@josecelano Hmm... You misread the implementation I provided.

BencodeData is the eager, recursive type that collects all the Bencode into a single data structure. It is built by the collect_bencode function to make it convenient to test the implementation.

The Bencode type itself is lazy and recursive: it streams the bencode via nested iterators.

To stream JSON, a JsonTokenizer would have a (something like) from_bencode(bencode: &mut Bencode) method that uses the Bencode data type and recursively iterates over the lazy structure, producing another iterator that emits JSON tokens.
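To make that concrete, here is one way such a recursive JSON emitter could look (my sketch against the Bencode type from the snippet above; string escaping and the hex handling for non-UTF-8 bytes are glossed over with from_utf8_lossy):

use std::io::{self, BufRead, Write};

fn write_json<R: BufRead, W: Write>(bencode: Bencode<'_, R>, out: &mut W) -> io::Result<()> {
    match bencode {
        Bencode::Integer(i) => write!(out, "{}", i),
        Bencode::ByteString(bytes) => write!(out, "\"{}\"", String::from_utf8_lossy(&bytes)),
        Bencode::List(iter) => {
            write!(out, "[")?;
            for (i, item) in iter.enumerate() {
                if i > 0 {
                    write!(out, ",")?;
                }
                // Recurse into the lazily produced element.
                write_json(item?, out)?;
            }
            write!(out, "]")
        }
        Bencode::Dictionary(iter) => {
            write!(out, "{{")?;
            for (i, entry) in iter.enumerate() {
                let (key, value) = entry?;
                if i > 0 {
                    write!(out, ",")?;
                }
                write!(out, "\"{}\":", String::from_utf8_lossy(&key))?;
                write_json(value, out)?;
            }
            write!(out, "}}")
        }
    }
}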

@josecelano (Member, Author)

> @josecelano Hmm... You misread the implementation I provided.
>
> BencodeData is the eager, recursive type that collects all the Bencode into a single data structure. It is built by the collect_bencode function to make it convenient to test the implementation.
>
> The Bencode type itself is lazy and recursive: it streams the bencode via nested iterators.
>
> To stream JSON, a JsonTokenizer would have a (something like) from_bencode(bencode: &mut Bencode) method that uses the Bencode data type and recursively iterates over the lazy structure, producing another iterator that emits JSON tokens.

Hey @da2ce7, sorry, I had only checked the latest post because I thought you had posted different versions of the same thing. I will check it carefully. It looks good (and better) to me. I think the recursive solution is more declarative than the current version. I would definitely try this solution. I hope we don't find a performance problem, but I don't see any reason to expect one.

I propose:

  • To finish your new implementation (if it's not finished).
  • Compare performance with the current one, although I think they should be similar.
  • Extract the tokenizer into a new package.
  • Leave this package only for Bencode to JSON conversion?

@da2ce7, are you going to continue working on it and open a PR? I'm just wondering if you are actively working on it or if it was just a proposal we can follow.

@da2ce7 (Contributor) commented Dec 2, 2024

@josecelano I wanted to write a bencode parser myself, to learn the format a bit better. I think this code is complete.

You may wish to change the implementation a little:

#[derive(Debug)]
pub enum Bencode<'a, R: BufRead + 'a> {
    Integer(BigInt),
    ByteString(Vec<u8>),
    List(BencodeListIterator<'a, R>),
    Dictionary(BencodeDictIterator<'a, R>),
}

to

#[derive(Debug)]
pub enum Bencode<'a, R: BufRead + 'a> {
    Integer(BigInt),
    ByteString(BencodeStringIterator<'a, R>),
    List(BencodeListIterator<'a, R>),
    Dictionary(BencodeDictIterator<'a, R>),
}

so that the strings may also be tokenized progressively.

Note: my implementation limits integers to 1024 bits.
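For what it's worth, one possible shape for that BencodeStringIterator (my sketch, hypothetical: it reuses the Rc<RefCell<R>> plumbing and imports from the snippet above, and the 4096-byte chunk size is arbitrary):

pub struct BencodeStringIterator<'a, R: BufRead + 'a> {
    data: Rc<RefCell<R>>,
    remaining: usize,
    _marker: std::marker::PhantomData<&'a ()>,
}

impl<'a, R: BufRead + 'a> Iterator for BencodeStringIterator<'a, R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None;
        }
        // Read the next chunk of the byte string, up to 4096 bytes.
        let chunk_len = self.remaining.min(4096);
        let mut buf = vec![0u8; chunk_len];
        match self.data.borrow_mut().read_exact(&mut buf) {
            Ok(()) => {
                self.remaining -= chunk_len;
                Some(Ok(buf))
            }
            Err(e) => {
                self.remaining = 0;
                Some(Err(e))
            }
        }
    }
}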

@josecelano linked a pull request Dec 4, 2024 that will close this issue
@josecelano self-assigned this Dec 4, 2024
josecelano added a commit that referenced this issue Dec 4, 2024
ec6cc56 docs: update README (Jose Celano)
68d9915 refactor: rename json::BencodeParser to json::Generator (Jose Celano)
a3c7c4b refactor: remove parent parser mod (Jose Celano)
3052d6a refactor: rename BencodeTOkenizer to Tokenizer (Jose Celano)
331c76e refactor: reorganize modules (Jose Celano)
9e0db6c refactor: remove writer from tokenizer string parser (Jose Celano)
0a05544 refactor: remove old int and str parsers with writers (Jose Celano)
75ffdb4 refactor: remove writer from tokenizer integer parser (Jose Celano)
77ad5af refactor: remove writer from main tokenizer (Jose Celano)
f6a0584 refactor: duplicate integer and strig parser before removing writer (Jose Celano)
3a7ea5d refactor: extract mod tokenizer (Jose Celano)
63b9b73 refactor: extract struct BencodeTokenizer (Jose Celano)
83eeefd refactor: extract bencode tokenizer (Jose Celano)

Pull request description:

  This refactoring changes the current implementation to extract the tokenizer. It splits parser logic into two types:

  - **Tokenizer**: It returns bencoded tokens.
  - **Generator**: It iterates over bencoded tokens to generate the JSON.

  **NOTES**

  - It keeps the custom recursion (with an explicit stack) for the time being, instead of using native recursion like @da2ce7 did [here](#12 (comment)). I guess that could be changed later if we think it increases readability and maintainability.

  **SUBTASKS**

  - [x] Separate logic for tokenizer.
  - [x] Extract tokenizer.
  - [x] Remove `Writer` from the tokenizer. It's not needed.

  **PERFORMANCE**

  In the current version, bencoded strings are cached in memory before we start writing to the output (because we need the whole string to check whether it's valid UTF-8). In this PR, bencoded integers are also cached in memory because the whole integer value is a token. This should not be a problem since integers are short, unlike strings.

  **FUTURE PRs**

  We could:

  - [ ] Implement the `Iterator` trait for the tokenizer.
  - [ ] Use recursion for the generator like @da2ce7's proposal [here](#12).
  - [ ] Implement another generator for TOML, for example. Check if this design can be easily extended to other output formats.

ACKs for top commit:
  josecelano:
    ACK ec6cc56

Tree-SHA512: 9210211d802c8e19aef1f02f814b494c5919c7da81f299cf2c7f4d9fb12b4c63cbec4ac526996e6b1b3d69f75ca58894b9d64936bef2d9da851e70d51234c675
@josecelano (Member, Author) commented Dec 4, 2024

Hi @da2ce7 @magecnion, I've closed this issue because I merged a PR where I've extracted the tokenizer. I've also renamed the parser to generator. Now we have:

  • Tokenizer
  • Generator

Therefore, this discussion doesn't make sense anymore. However, @da2ce7 added a lot of feedback and alternatives that would be good to explore. I've opened new issues:

My idea is to progressively refactor towards @da2ce7's proposal if it makes sense to you.
