Find a better name for BencodeParser type #12
Comments
@da2ce7 suggestion:
|
I like these (in this order):
|
At the beginning I liked But @da2ce7's suggestion made me think. I guess you suggested |
Hi @magecnion, when I started working on the library I wanted to implement it like a compiler, with a tokenizer and a parser: bencoded bytes -> bencoded tokens -> generated JSON. That would allow us to add new extensions in the future to write to other formats (yaml, xml, toml, ...). However, I decided to make it as simple as possible in the first iteration, only for the JSON case, following the C implementation. The main difference with serde is that we don't build an intermediary data structure. Maybe we can explore that path now. The simplest way I see to change the package is:
The tokenizer would return only the parsed bencoded tokens:
I think the easiest way to refactor the current state is to reuse the current writer: wherever we use the "writer", we change it to just return the token. This could be added in parallel without changing the main "parse" function.
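As an illustration of what "just return the token" could mean, here is a minimal sketch of a hypothetical token type plus an output-format trait. The names (Token, TokenSink) are made up for this sketch and are not the crate's actual types; the point is only that the tokenizer emits flat tokens and each output format (JSON today, maybe YAML/TOML later) consumes them.

/// One bencoded token. Structural markers (`l`, `d`, `e`) become Begin/End
/// tokens, so the tokenizer never builds a nested data structure and can
/// stream arbitrarily deep input.
#[derive(Debug, PartialEq)]
pub enum Token {
    Integer(i64),
    ByteString(Vec<u8>),
    BeginList,
    BeginDict,
    End,
}

/// Each output format implements its own consumer of the token stream.
/// The JSON generator would be one implementation; YAML, XML or TOML
/// generators could be added later without touching the tokenizer.
pub trait TokenSink {
    fn write_token(&mut self, token: &Token) -> std::io::Result<()>;
}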
Considerations

I think that would be a big change and we would also need to rename the crate :-).

Example generated by ChatGPT

Here is a hypothetical implementation and example usage for a Rust crate that provides a

Traits
|
Thank you for sharing your proposal. I would like to clarify a few points before deciding on the renaming. It’s not entirely clear to me who is responsible for decoding (aka parsing), which in the case of JSON includes handling non-UTF8 bytes by writing them as hexadecimal. On one hand, the proposal states that the smarter writer is responsible for writing the tokens in the desired format. On the other hand, the following statement regarding the tokenizer:
suggests that decoding/parsing will remain part of the tokenizing phase because it involves the

Therefore, I am unsure whether the proposal intends for the tokenizer to handle decoding/parsing or to move that code to the writer so that each writer manages the decoding/parsing phase. |
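To illustrate the decoding question above, this is roughly what "writing non-UTF-8 bytes as hexadecimal" could look like in a JSON writer. The exact output format is an assumption for the example (a bare hex string, no JSON escaping), not necessarily what the library produces.

/// Render a bencoded byte string for JSON output: valid UTF-8 text is kept
/// as-is, anything else is replaced by its hexadecimal representation.
fn byte_string_to_json_text(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.to_string(),
        Err(_) => bytes.iter().map(|b| format!("{b:02x}")).collect(),
    }
}

fn main() {
    assert_eq!(byte_string_to_json_text(b"spam"), "spam");
    // 0xFF 0xFE is not valid UTF-8, so it is hex-encoded instead.
    assert_eq!(byte_string_to_json_text(&[0xFF, 0xFE]), "fffe");
}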
use std::vec::IntoIter;
/// A recursive nested data structure that can contain items or lists of nested items.
#[derive(Debug)]
pub enum Nested<T> {
Item(T),
List(Vec<Nested<T>>),
}
impl<T> IntoIterator for Nested<T> {
type Item = T;
type IntoIter = NestedIntoIterator<T>;
fn into_iter(self) -> Self::IntoIter {
NestedIntoIterator::new(self)
}
}
/// An iterator over a `Nested<T>` that consumes the nested structure and yields items of type `T`.
pub struct NestedIntoIterator<T> {
stack: Vec<IntoIter<Nested<T>>>,
}
impl<T> NestedIntoIterator<T> {
/// Creates a new `NestedIntoIterator` from a `Nested<T>`.
pub fn new(nested: Nested<T>) -> Self {
let mut iter = Self { stack: Vec::new() };
iter.push_nested(nested);
iter
}
/// Helper method to push nested elements onto the stack.
fn push_nested(&mut self, nested: Nested<T>) {
match nested {
Nested::Item(item) => {
// Wrap the single item in a one-element Vec so it can be driven by the
// same `vec::IntoIter` machinery as a list.
let single_iter = vec![Nested::Item(item)].into_iter();
self.stack.push(single_iter);
}
Nested::List(list) => {
self.stack.push(list.into_iter());
}
}
}
}
impl<T> Iterator for NestedIntoIterator<T> {
type Item = T;
fn next(&mut self) -> Option<Self::Item> {
while let Some(top_iter) = self.stack.last_mut() {
match top_iter.next() {
Some(Nested::Item(item)) => return Some(item),
Some(Nested::List(list)) => {
self.stack.push(list.into_iter());
}
None => {
self.stack.pop();
}
}
}
None
}
}
impl<'a, T> IntoIterator for &'a Nested<T> {
type Item = &'a T;
type IntoIter = NestedRefIterator<'a, T>;
fn into_iter(self) -> Self::IntoIter {
NestedRefIterator::new(self)
}
}
/// An iterator over references to items in a `Nested<T>`.
pub struct NestedRefIterator<'a, T> {
stack: Vec<std::slice::Iter<'a, Nested<T>>>,
}
impl<'a, T> NestedRefIterator<'a, T> {
/// Creates a new `NestedRefIterator` from a reference to a `Nested<T>`.
pub fn new(nested: &'a Nested<T>) -> Self {
let mut iter = Self { stack: Vec::new() };
iter.push_nested(nested);
iter
}
/// Helper method to push nested elements onto the stack.
fn push_nested(&mut self, nested: &'a Nested<T>) {
match nested {
Nested::Item(_) => {
// Create an iterator over a single item without allocating a slice.
let single_iter = std::slice::from_ref(nested).iter();
self.stack.push(single_iter);
}
Nested::List(list) => {
self.stack.push(list.iter());
}
}
}
}
impl<'a, T> Iterator for NestedRefIterator<'a, T> {
type Item = &'a T;
fn next(&mut self) -> Option<Self::Item> {
while let Some(top_iter) = self.stack.last_mut() {
match top_iter.next() {
Some(Nested::Item(item)) => return Some(item),
Some(Nested::List(list)) => {
self.stack.push(list.iter());
}
None => {
self.stack.pop();
}
}
}
None
}
} |
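A small usage sketch for the `Nested<T>` iterators above (assuming the code compiles as given); both iterators flatten the structure depth-first:

fn main() {
    let nested = Nested::List(vec![
        Nested::Item(1),
        Nested::List(vec![Nested::Item(2), Nested::Item(3)]),
        Nested::Item(4),
    ]);

    // Borrowing iteration yields references in depth-first order.
    let by_ref: Vec<&i32> = (&nested).into_iter().collect();
    assert_eq!(by_ref, vec![&1, &2, &3, &4]);

    // Consuming iteration yields the owned items in the same order.
    let owned: Vec<i32> = nested.into_iter().collect();
    assert_eq!(owned, vec![1, 2, 3, 4]);
}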
use std::cell::RefCell;
use std::rc::Rc;
pub enum Nested<'a, T> {
Item(T),
List(Box<dyn Iterator<Item = Nested<'a, T>> + 'a>),
}
pub struct NestedTokenizer<'a> {
data: Rc<RefCell<&'a [u8]>>,
}
impl<'a> NestedTokenizer<'a> {
pub fn new(data: &'a [u8]) -> Self {
NestedTokenizer {
data: Rc::new(RefCell::new(data)),
}
}
}
impl<'a> Iterator for NestedTokenizer<'a> {
type Item = Nested<'a, u8>;
fn next(&mut self) -> Option<Self::Item> {
let mut data = self.data.borrow_mut();
if data.is_empty() {
return None;
}
let byte = data[0];
*data = &data[1..]; // Advance the slice
match byte {
b'[' => {
// Start of a nested list
let nested_iter = NestedListIterator::new(Rc::clone(&self.data));
Some(Nested::List(Box::new(nested_iter)))
}
b']' => {
panic!("Should not happen here");
}
_ => {
Some(Nested::Item(byte))
}
}
}
}
struct NestedListIterator<'a> {
data: Rc<RefCell<&'a [u8]>>,
}
impl<'a> NestedListIterator<'a> {
fn new(data: Rc<RefCell<&'a [u8]>>) -> Self {
NestedListIterator { data }
}
}
impl<'a> Iterator for NestedListIterator<'a> {
type Item = Nested<'a, u8>;
fn next(&mut self) -> Option<Self::Item> {
let mut data = self.data.borrow_mut();
while !data.is_empty() {
let byte = data[0];
*data = &data[1..]; // Advance the slice
match byte {
b'[' => {
// Start of a new nested list
let nested_iter = NestedListIterator::new(Rc::clone(&self.data));
return Some(Nested::List(Box::new(nested_iter)));
}
b']' => {
// End of the current list
return None;
}
_ => {
return Some(Nested::Item(byte));
}
}
}
None
}
} |
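A hypothetical usage sketch for the byte tokenizer above. Because every nested iterator shares the same cursor (the `Rc<RefCell<&[u8]>>`), an inner list has to be drained before asking the outer iterator for its next element:

fn main() {
    let tokenizer = NestedTokenizer::new(b"ab[cd]e");
    for node in tokenizer {
        match node {
            Nested::Item(byte) => println!("item: {}", byte as char),
            Nested::List(inner) => {
                // Drain the inner list before continuing with the outer level.
                for inner_node in inner {
                    if let Nested::Item(byte) = inner_node {
                        println!("  nested item: {}", byte as char);
                    }
                }
            }
        }
    }
}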
use std::cell::RefCell;
use std::rc::Rc;
pub enum Bencode<'a> {
Integer(i64),
ByteString(&'a [u8]),
List(Box<dyn Iterator<Item = Bencode<'a>> + 'a>),
Dictionary(Box<dyn Iterator<Item = (&'a [u8], Bencode<'a>)> + 'a>),
}
pub struct BencodeTokenizer<'a> {
data: Rc<RefCell<&'a [u8]>>,
}
impl<'a> BencodeTokenizer<'a> {
pub fn new(data: &'a [u8]) -> Self {
BencodeTokenizer {
data: Rc::new(RefCell::new(data)),
}
}
}
impl<'a> Iterator for BencodeTokenizer<'a> {
type Item = Bencode<'a>;
fn next(&mut self) -> Option<Self::Item> {
let mut data = self.data.borrow_mut();
if data.is_empty() {
return None;
}
match data[0] {
b'i' => {
*data = &data[1..];
let end = data.iter().position(|&b| b == b'e')?;
let num = std::str::from_utf8(&data[..end]).ok()?.parse().ok()?;
*data = &data[end + 1..];
Some(Bencode::Integer(num))
}
b'l' => {
*data = &data[1..];
let iter = BencodeListIterator::new(Rc::clone(&self.data));
Some(Bencode::List(Box::new(iter)))
}
b'd' => {
*data = &data[1..];
let iter = BencodeDictIterator::new(Rc::clone(&self.data));
Some(Bencode::Dictionary(Box::new(iter)))
}
b'0'..=b'9' => {
let colon = data.iter().position(|&b| b == b':')?;
let len = std::str::from_utf8(&data[..colon]).ok()?.parse::<usize>().ok()?;
*data = &data[colon + 1..];
let bytes = &data[..len];
*data = &data[len..];
Some(Bencode::ByteString(bytes))
}
_ => None,
}
}
}
struct BencodeListIterator<'a> {
data: Rc<RefCell<&'a [u8]>>,
}
impl<'a> BencodeListIterator<'a> {
fn new(data: Rc<RefCell<&'a [u8]>>) -> Self {
BencodeListIterator { data }
}
}
impl<'a> Iterator for BencodeListIterator<'a> {
type Item = Bencode<'a>;
fn next(&mut self) -> Option<Self::Item> {
let mut data = self.data.borrow_mut();
if data.is_empty() || data[0] == b'e' {
if !data.is_empty() {
*data = &data[1..];
}
return None;
}
drop(data);
BencodeTokenizer {
data: Rc::clone(&self.data),
}
.next()
}
}
struct BencodeDictIterator<'a> {
data: Rc<RefCell<&'a [u8]>>,
}
impl<'a> BencodeDictIterator<'a> {
fn new(data: Rc<RefCell<&'a [u8]>>) -> Self {
BencodeDictIterator { data }
}
}
impl<'a> Iterator for BencodeDictIterator<'a> {
type Item = (&'a [u8], Bencode<'a>);
fn next(&mut self) -> Option<Self::Item> {
let mut data = self.data.borrow_mut();
if data.is_empty() || data[0] == b'e' {
if !data.is_empty() {
*data = &data[1..];
}
return None;
}
let colon = data.iter().position(|&b| b == b':')?;
let len = std::str::from_utf8(&data[..colon]).ok()?.parse::<usize>().ok()?;
*data = &data[colon + 1..];
let key = &data[..len];
*data = &data[len..];
drop(data);
let value = BencodeTokenizer {
data: Rc::clone(&self.data),
}
.next()?;
Some((key, value))
}
} |
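A usage sketch for the borrowed-slice tokenizer above (again assuming it compiles as posted). The list and dictionary iterators are lazy and share the cursor, so entries must be consumed in document order:

fn main() {
    let mut tokenizer = BencodeTokenizer::new(b"d3:bar4:spam3:fooi42ee");
    if let Some(Bencode::Dictionary(entries)) = tokenizer.next() {
        for (key, value) in entries {
            match value {
                Bencode::Integer(n) => {
                    println!("{} = {}", String::from_utf8_lossy(key), n);
                }
                Bencode::ByteString(bytes) => {
                    println!(
                        "{} = {}",
                        String::from_utf8_lossy(key),
                        String::from_utf8_lossy(bytes)
                    );
                }
                // A nested list/dict value would have to be drained here
                // before moving on to the next entry.
                Bencode::List(_) | Bencode::Dictionary(_) => println!("nested value"),
            }
        }
    }
}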
use num_bigint::BigInt;
use num_traits::Num;
use std::cell::RefCell;
use std::io::{self, BufRead};
use std::rc::Rc;
const MAX_RECURSION_DEPTH: usize = 100;
const MAX_BYTESTRING_LENGTH: usize = 10 * 1024 * 1024;
const MAX_INTEGER_DIGITS: usize = 309;
const MAX_INTEGER_BITS: u64 = 1024;
#[derive(Debug, PartialEq, Clone)]
pub enum BencodeData {
Integer(BigInt),
ByteString(Vec<u8>),
List(Vec<BencodeData>),
Dictionary(Vec<(Vec<u8>, BencodeData)>),
}
pub struct BencodeTokenizer<'a, R: BufRead + 'a> {
data: Rc<RefCell<R>>,
eof: bool,
_marker: std::marker::PhantomData<&'a ()>,
}
impl<'a, R: BufRead + 'a> BencodeTokenizer<'a, R> {
pub fn new(reader: R) -> Self {
BencodeTokenizer {
data: Rc::new(RefCell::new(reader)),
eof: false,
_marker: std::marker::PhantomData,
}
}
fn next_with_depth(&mut self, depth: usize) -> Option<io::Result<Bencode<'a, R>>> {
if self.eof {
return None;
}
if depth > MAX_RECURSION_DEPTH {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::InvalidData,
"Maximum recursion depth exceeded",
)));
}
let mut data = self.data.borrow_mut();
let byte = match peek_byte(&mut *data) {
Ok(b) => b,
Err(e) => {
self.eof = true;
if e.kind() == io::ErrorKind::UnexpectedEof {
return None;
} else {
return Some(Err(e));
}
}
};
match byte {
b'i' => {
data.consume(1);
drop(data);
let num = match read_integer(&self.data) {
Ok(num) => num,
Err(e) => {
self.eof = true;
return Some(Err(e));
}
};
Some(Ok(Bencode::Integer(num)))
}
b'l' => {
data.consume(1);
drop(data);
let iter = BencodeListIterator {
data: Rc::clone(&self.data),
eof: false,
depth: depth + 1,
_marker: std::marker::PhantomData,
};
Some(Ok(Bencode::List(iter)))
}
b'd' => {
data.consume(1);
drop(data);
let iter = BencodeDictIterator {
data: Rc::clone(&self.data),
eof: false,
last_key: None,
depth: depth + 1,
_marker: std::marker::PhantomData,
};
Some(Ok(Bencode::Dictionary(iter)))
}
b'0'..=b'9' => {
drop(data);
let bytes = match read_bytestring(&self.data) {
Ok(bytes) => bytes,
Err(e) => {
self.eof = true;
return Some(Err(e));
}
};
Some(Ok(Bencode::ByteString(bytes)))
}
_ => {
self.eof = true;
Some(Err(io::Error::new(
io::ErrorKind::InvalidData,
format!("Invalid bencode prefix: '{}'", byte as char),
)))
}
}
}
}
impl<'a, R: BufRead + 'a> Iterator for BencodeTokenizer<'a, R> {
type Item = io::Result<Bencode<'a, R>>;
fn next(&mut self) -> Option<Self::Item> {
self.next_with_depth(0)
}
}
#[derive(Debug)]
pub enum Bencode<'a, R: BufRead + 'a> {
Integer(BigInt),
ByteString(Vec<u8>),
List(BencodeListIterator<'a, R>),
Dictionary(BencodeDictIterator<'a, R>),
}
#[derive(Debug)]
pub struct BencodeListIterator<'a, R: BufRead + 'a> {
data: Rc<RefCell<R>>,
eof: bool,
depth: usize,
_marker: std::marker::PhantomData<&'a ()>,
}
impl<'a, R: BufRead + 'a> Iterator for BencodeListIterator<'a, R> {
type Item = io::Result<Bencode<'a, R>>;
fn next(&mut self) -> Option<Self::Item> {
if self.eof {
return None;
}
if self.depth > MAX_RECURSION_DEPTH {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::InvalidData,
"Maximum recursion depth exceeded",
)));
}
let mut data = self.data.borrow_mut();
let peek_result = peek_byte(&mut *data);
match peek_result {
Ok(byte) => {
if byte == b'e' {
data.consume(1);
self.eof = true;
return None;
}
}
Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"Unexpected EOF in list",
)));
}
Err(e) => {
self.eof = true;
return Some(Err(e));
}
}
drop(data);
let mut tokenizer = BencodeTokenizer {
data: Rc::clone(&self.data),
eof: false,
_marker: std::marker::PhantomData,
};
tokenizer.next_with_depth(self.depth)
}
}
#[derive(Debug)]
pub struct BencodeDictIterator<'a, R: BufRead + 'a> {
data: Rc<RefCell<R>>,
eof: bool,
last_key: Option<Vec<u8>>,
depth: usize,
_marker: std::marker::PhantomData<&'a ()>,
}
impl<'a, R: BufRead + 'a> Iterator for BencodeDictIterator<'a, R> {
type Item = io::Result<(Vec<u8>, Bencode<'a, R>)>;
fn next(&mut self) -> Option<Self::Item> {
if self.eof {
return None;
}
if self.depth > MAX_RECURSION_DEPTH {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::InvalidData,
"Maximum recursion depth exceeded",
)));
}
let mut data = self.data.borrow_mut();
let peek_result = peek_byte(&mut *data);
match peek_result {
Ok(byte) => {
if byte == b'e' {
data.consume(1);
self.eof = true;
return None;
}
}
Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"Unexpected EOF in dictionary",
)));
}
Err(e) => {
self.eof = true;
return Some(Err(e));
}
}
drop(data);
let key = match read_bytestring(&self.data) {
Ok(bytes) => bytes,
Err(e) => {
self.eof = true;
return Some(Err(e));
}
};
if let Some(ref last_key) = self.last_key {
match key.cmp(last_key) {
std::cmp::Ordering::Less => {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::InvalidData,
"Dictionary keys are not sorted lexicographically",
)));
}
std::cmp::Ordering::Equal => {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::InvalidData,
"Duplicate dictionary key",
)));
}
std::cmp::Ordering::Greater => {}
}
}
self.last_key = Some(key.clone());
let mut tokenizer = BencodeTokenizer {
data: Rc::clone(&self.data),
eof: false,
_marker: std::marker::PhantomData,
};
let value = match tokenizer.next_with_depth(self.depth) {
Some(Ok(v)) => v,
Some(Err(e)) => {
self.eof = true;
return Some(Err(e));
}
None => {
self.eof = true;
return Some(Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"Expected value after key",
)));
}
};
Some(Ok((key, value)))
}
}
fn read_byte<R: BufRead>(data: &mut R) -> io::Result<u8> {
let mut buf = [0u8; 1];
data.read_exact(&mut buf)?;
Ok(buf[0])
}
fn peek_byte<R: BufRead>(data: &mut R) -> io::Result<u8> {
let buf = data.fill_buf()?;
if buf.is_empty() {
Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"Unexpected EOF",
))
} else {
Ok(buf[0])
}
}
fn read_until<R: BufRead>(data: &mut R, delim: u8, max_length: usize) -> io::Result<Vec<u8>> {
let mut bytes = Vec::with_capacity(max_length);
for _ in 0..max_length {
let byte = read_byte(data)?;
if byte == delim {
return Ok(bytes);
}
bytes.push(byte);
}
Err(io::Error::new(
io::ErrorKind::InvalidData,
"Exceeded maximum read length",
))
}
const MAX_LENGTH_DIGITS: usize = 10; // Maximum digits in length string
fn read_bytestring<R: BufRead>(data: &Rc<RefCell<R>>) -> io::Result<Vec<u8>> {
let mut data_mut = data.borrow_mut();
let length_str = read_until(&mut *data_mut, b':', MAX_LENGTH_DIGITS)?;
let length_str = std::str::from_utf8(&length_str).map_err(|_| {
io::Error::new(
io::ErrorKind::InvalidData,
"Invalid UTF-8 in byte string length",
)
})?;
if !length_str.chars().all(|c| c.is_ascii_digit()) {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Non-numeric byte string length",
));
}
if length_str == "0" {
return Ok(vec![0u8; 0]);
}
if length_str.starts_with('0') {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Leading zero in length",
));
}
let length = length_str
.parse::<usize>()
.map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "Invalid string length"))?;
if length > MAX_BYTESTRING_LENGTH {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Byte string length exceeds maximum allowed size",
));
}
let mut buffer = vec![0u8; length];
data_mut.read_exact(&mut buffer)?;
Ok(buffer)
}
fn read_integer<R: BufRead>(data: &Rc<RefCell<R>>) -> io::Result<BigInt> {
let mut data_mut = data.borrow_mut();
let mut int_bytes = Vec::new();
loop {
let byte = match read_byte(&mut *data_mut) {
Ok(b) => b,
Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => {
return Err(io::Error::new(
io::ErrorKind::UnexpectedEof,
"Unexpected EOF in integer",
));
}
Err(e) => return Err(e),
};
if byte == b'e' {
break;
}
if int_bytes.len() >= MAX_INTEGER_DIGITS {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Integer exceeds maximum allowed digits for creating a 1024-bit signed integer",
));
}
int_bytes.push(byte);
}
let int_str = std::str::from_utf8(&int_bytes)
.map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "Invalid UTF-8 in integer"))?;
if int_str == "0" {
return Ok(BigInt::ZERO);
}
if !int_str.chars().all(|c| c.is_ascii_digit() || c == '-') {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"non-signed-numeric in integer",
));
}
if int_str.is_empty() || int_str == "-" {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"missing integer value",
));
}
if int_str.starts_with("-0") || int_str.starts_with('0') {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"integer should not start with zero",
));
}
let value = BigInt::from_str_radix(int_str, 10)
.map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "Invalid integer format"))?;
if value.bits() > MAX_INTEGER_BITS {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Integer created exceeds maximum allowed bit length",
));
}
Ok(value)
}
pub fn collect_bencode_with_depth<'a, R: BufRead + 'a>(
bencode: Bencode<'a, R>,
depth: usize,
) -> io::Result<BencodeData> {
if depth > MAX_RECURSION_DEPTH {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Maximum recursion depth exceeded",
));
}
match bencode {
Bencode::Integer(i) => Ok(BencodeData::Integer(i)),
Bencode::ByteString(s) => Ok(BencodeData::ByteString(s)),
Bencode::List(mut iter) => {
let mut items = Vec::new();
while let Some(item) = iter.next() {
let value = collect_bencode_with_depth(item?, depth + 1)?;
items.push(value);
}
Ok(BencodeData::List(items))
}
Bencode::Dictionary(mut iter) => {
let mut items = Vec::new();
while let Some(item) = iter.next() {
let (key, value) = item?;
let value = collect_bencode_with_depth(value, depth + 1)?;
items.push((key, value));
}
Ok(BencodeData::Dictionary(items))
}
}
}
pub fn collect_bencode<'a, R: BufRead + 'a>(bencode: Bencode<'a, R>) -> io::Result<BencodeData> {
collect_bencode_with_depth(bencode, 0)
}
#[cfg(test)]
mod tests {
use super::*;
use num_bigint::BigInt;
use std::io::Cursor;
#[test]
fn test_bencode_parser() {
use BencodeData::*;
let test_cases = vec![
// Valid integers
(b"i42e".to_vec(), Ok(Integer(42.into())), "Valid integer"),
(
b"i-42e".to_vec(),
Ok(Integer((-42).into())),
"Valid negative integer",
),
(b"i0e".to_vec(), Ok(Integer(0.into())), "Valid zero integer"),
(
b"i-1e".to_vec(),
Ok(Integer((-1).into())),
"Valid negative one",
),
(
b"i999999999999999999999999999999999999e".to_vec(),
Ok(Integer(
"999999999999999999999999999999999999".parse().unwrap(),
)),
"Valid large integer",
),
(
b"i-999999999999999999999999999999999999e".to_vec(),
Ok(Integer(
"-999999999999999999999999999999999999".parse().unwrap(),
)),
"Valid negative large integer",
),
// Invalid integers
(
b"i-e".to_vec(),
Err("missing integer value"),
"Missing digits after minus sign",
),
(
b"i-0e".to_vec(),
Err("integer should not start with zero"),
"Negative zero",
),
(
b"i042e".to_vec(),
Err("integer should not start with zero"),
"Leading zero",
),
(
b"i+42e".to_vec(),
Err("non-signed-numeric in integer"),
"Plus sign",
),
(
b"i4.2e".to_vec(),
Err("non-signed-numeric in integer"),
"Non-digit character",
),
(
b"i4 2e".to_vec(),
Err("non-signed-numeric in integer"),
"Whitespace character",
),
(
b"ie".to_vec(),
Err("missing integer value"),
"Empty integer",
),
(
b"i-42".to_vec(),
Err("Unexpected EOF in integer"),
"Missing 'e' terminator",
),
// Valid byte strings
(b"0:".to_vec(), Ok(ByteString(vec![])), "Empty string"),
(
b"3:abc".to_vec(),
Ok(ByteString(b"abc".to_vec())),
"Simple string",
),
(
b"5:hello".to_vec(),
Ok(ByteString(b"hello".to_vec())),
"Hello string",
),
(
b"11:hello world".to_vec(),
Ok(ByteString(b"hello world".to_vec())),
"String with spaces",
),
(
b"5:helloextra".to_vec(),
Ok(ByteString(b"hello".to_vec())),
"Extra data after string",
),
// Invalid byte strings
(
b"03:abc".to_vec(),
Err("Leading zero in length"),
"Leading zero in length",
),
(
b"0abc:hello".to_vec(),
Err("Non-numeric byte string length"),
"Non-numeric length",
),
(
b"-5:hello".to_vec(),
Err("Invalid bencode prefix: '-'"),
"Negative length",
),
// Valid lists
(b"le".to_vec(), Ok(List(vec![])), "Empty list"),
(
b"l4:spami42ee".to_vec(),
Ok(List(vec![ByteString(b"spam".to_vec()), Integer(42.into())])),
"List with string and integer",
),
(
b"l4:spam4:eggse".to_vec(),
Ok(List(vec![
ByteString(b"spam".to_vec()),
ByteString(b"eggs".to_vec()),
])),
"List of strings",
),
// Invalid lists
(
b"l4:spam".to_vec(),
Err("Unexpected EOF in list"),
"Unterminated list",
),
(
b"l".to_vec(),
Err("Unexpected EOF in list"),
"Empty unterminated list",
),
// Valid dictionaries
(b"de".to_vec(), Ok(Dictionary(vec![])), "Empty dictionary"),
(
b"d3:cow3:mooe".to_vec(),
Ok(Dictionary(vec![(
b"cow".to_vec(),
ByteString(b"moo".to_vec()),
)])),
"Simple dictionary",
),
(
b"d4:spaml1:a1:bee".to_vec(),
Ok(Dictionary(vec![(
b"spam".to_vec(),
List(vec![ByteString(b"a".to_vec()), ByteString(b"b".to_vec())]),
)])),
"Dictionary with list",
),
// Invalid dictionaries
(
b"d3:key5:value".to_vec(),
Err("Unexpected EOF in dictionary"),
"Unterminated dictionary",
),
// Unsorted keys
(
b"d3:baa1:13:aaa1:2e".to_vec(),
Err("Dictionary keys are not sorted lexicographically"),
"Unsorted dictionary keys",
),
// Duplicate keys
(
b"d3:key5:value3:key3:bbbe".to_vec(),
Err("Duplicate dictionary key"),
"Duplicate dictionary keys",
),
// Non-UTF8 keys
(
b"d3:aaa3:bbb2:\xFF\xFE5:valuee".to_vec(),
Ok(Dictionary(vec![
(b"aaa".to_vec(), ByteString(b"bbb".to_vec())),
(vec![0xFF, 0xFE], ByteString(b"value".to_vec())),
])),
"Non-UTF8 dictionary keys",
),
// Invalid prefixes
(
b"x42e".to_vec(),
Err("Invalid bencode prefix: 'x'"),
"Invalid prefix 'x'",
),
(
b"f3.14e".to_vec(),
Err("Invalid bencode prefix: 'f'"),
"Invalid prefix 'f'",
),
(
b"n123e".to_vec(),
Err("Invalid bencode prefix: 'n'"),
"Invalid prefix 'n'",
),
// Extra data after valid bencode
(
b"i42ex".to_vec(),
Ok(Integer(42.into())),
"Integer with extra data",
),
];
for (data, expected, description) in test_cases {
let cursor = Cursor::new(data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next();
match (result, expected) {
(Some(Ok(bencode)), Ok(ref expected_bencode)) => {
let collected = collect_bencode(bencode).expect("Failed to collect bencode");
assert_eq!(&collected, expected_bencode, "{}", description);
}
(Some(Err(e)), Err(expected_error)) => {
assert_eq!(e.to_string(), expected_error, "{}", description);
}
(Some(Ok(bencode)), Err(expected_error)) => {
let collected = collect_bencode(bencode);
if let Err(e) = collected {
assert_eq!(e.to_string(), expected_error, "{}", description);
} else {
panic!(
"{}: Expected error '{}', but got success",
description, expected_error
);
}
}
(Some(Err(e)), Ok(_)) => {
panic!("{}: Expected success, but got error '{}'", description, e);
}
(None, _) => {
panic!("{}: No result returned", description);
}
}
}
}
#[test]
fn test_recursion_depth_limit() {
let mut data = Vec::new();
for _ in 0..MAX_RECURSION_DEPTH {
data.push(b'l');
}
data.extend(vec![b'e'; MAX_RECURSION_DEPTH]);
let cursor = Cursor::new(&data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap();
let bencode = result.unwrap();
assert!(
collect_bencode(bencode).is_ok(),
"Should succeed at max depth"
);
data.insert(0, b'l');
data.push(b'e');
let cursor = Cursor::new(&data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap();
let bencode = result.unwrap();
let collected = collect_bencode(bencode);
assert!(collected.is_err(), "Should fail exceeding max depth");
assert_eq!(
collected.unwrap_err().to_string(),
"Maximum recursion depth exceeded",
"Error message should indicate depth exceeded"
);
}
#[test]
fn test_large_integer_limits() {
// Test integer with MAX_INTEGER_DIGITS digits
let digits = "1".repeat(MAX_INTEGER_DIGITS);
let data = format!("i{}e", digits);
let cursor = Cursor::new(data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap().unwrap();
let collected = collect_bencode(result).unwrap();
if let BencodeData::Integer(value) = collected {
assert_eq!(value.to_string(), digits);
} else {
panic!("Expected integer");
}
// Test integer with MAX_INTEGER_DIGITS + 1 digits (should error)
let digits = "1".repeat(MAX_INTEGER_DIGITS + 1);
let data = format!("i{}e", digits);
let cursor = Cursor::new(data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap();
let error = result.unwrap_err();
assert_eq!(
error.to_string(),
"Integer exceeds maximum allowed digits for creating a 1024-bit signed integer"
);
// Test integer with MAX_INTEGER_BITS bits
let max_value = (BigInt::from(1) << MAX_INTEGER_BITS) - 1;
let data = format!("i{}e", max_value);
let cursor = Cursor::new(data.to_string());
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap().unwrap();
let collected = collect_bencode(result).unwrap();
if let BencodeData::Integer(value) = collected {
assert_eq!(value, max_value);
} else {
panic!("Expected integer");
}
// Test integer exceeding MAX_INTEGER_BITS (should error)
let over_max_value = BigInt::from(1) << MAX_INTEGER_BITS;
let data = format!("i{}e", over_max_value);
let cursor = Cursor::new(data.to_string());
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap();
let error = result.unwrap_err();
assert_eq!(
error.to_string(),
"Integer created exceeds maximum allowed bit length"
);
// Test integer exceeding MAX_INTEGER_BITS but not MAX_INTEGER_DIGITS (should error)
let digits = "9".repeat(MAX_INTEGER_DIGITS);
let data = format!("i{}e", digits);
let cursor = Cursor::new(data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap();
let error = result.unwrap_err();
assert_eq!(
error.to_string(),
"Integer created exceeds maximum allowed bit length"
);
}
#[test]
fn test_large_bytestring_limits() {
let data = format!(
"{}:{}",
MAX_BYTESTRING_LENGTH,
"a".repeat(MAX_BYTESTRING_LENGTH)
);
let cursor = Cursor::new(data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap().unwrap();
let collected = collect_bencode(result).unwrap();
if let BencodeData::ByteString(bytes) = collected {
assert_eq!(bytes.len(), MAX_BYTESTRING_LENGTH);
} else {
panic!("Expected byte string");
}
let length = MAX_BYTESTRING_LENGTH + 1;
let data = format!("{}:{}", length, "a".repeat(10));
let cursor = Cursor::new(data);
let mut tokenizer = BencodeTokenizer::new(cursor);
let result = tokenizer.next().unwrap();
let error = result.unwrap_err();
assert_eq!(
error.to_string(),
"Byte string length exceeds maximum allowed size"
);
}
} |
@josecelano @magecnion In the last posts, with the help of https://github.com/SillyTavern/SillyTavern , I tried to implement a single page |
In my proposal, the "generator" would be responsible for converting from the byte sequence to the "" or "" tags, because that's only relevant for JSON. Other future formats might support byte sequences.
Sorry @magecnion, my explanation was confusing because I mixed the design with the implementation.

Design
Implementation

What I would do is:
Alternatively, we could start from scratch with the new design since the lib is small.

Conclusion

I think we all agree that it is worth exploring this new design because:
I would create a new branch to test this new proposal. @da2ce7 I think your example would be a parser like Serde, if I'm not wrong. That means we would need to store the whole bencoded value in memory before writing the JSON for lists and dicts. I think with the tokenizer we only need to keep one token. That would consume more memory than the current implementation, but since we already store the whole string, I don't think it would be a big change. Anyway, we have to be sure that performance remains more or less the same. The goal for this lib was to be fast and have low memory consumption; otherwise, we could have used Serde. |
@josecelano Hmm... You misread the implementation I provided. The The To stream JSON, A |
Hey @da2ce7, sorry, I had only checked the latest post because I thought you had posted different versions. I will check it carefully. It looks good (and better) to me. I think the recursive solution is more declarative than the current version. I would definitely try this solution. I hope we don't find a performance problem, but I don't see any reason why we would. I propose:
@da2ce7, are you going to continue working on it and open a PR? I'm just wondering if you are actively working on it or if it was just a proposal we can follow. |
@josecelano I wanted to write a bencode parser myself, to learn the format a bit better. I think this code is complete. You may wish to change the implementation a little, from:

#[derive(Debug)]
pub enum Bencode<'a, R: BufRead + 'a> {
Integer(BigInt),
ByteString(Vec<u8>),
List(BencodeListIterator<'a, R>),
Dictionary(BencodeDictIterator<'a, R>),
}

to

#[derive(Debug)]
pub enum Bencode<'a, R: BufRead + 'a> {
Integer(BigInt),
ByteString(BencodeStringIterator<'a, R>),
List(BencodeListIterator<'a, R>),
Dictionary(BencodeDictIterator<'a, R>),
}

so that the strings may also be tokenized progressively. |
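One possible shape for that change, reusing the imports and the `Rc<RefCell<R>>` reader from the full example above (this `BencodeStringIterator` is hypothetical; it is not part of the posted code): the iterator remembers how many bytes of the string are still unread and pulls them from the shared reader on demand, so large byte strings never have to be buffered. Dictionary keys would probably stay as `Vec<u8>`, since the sorted/duplicate-key checks need the whole key in memory.

/// Streams the bytes of one bencoded string from the shared reader.
/// The `<length>:` prefix is assumed to have been consumed already by the
/// tokenizer, which sets `remaining` to the parsed length.
#[derive(Debug)]
pub struct BencodeStringIterator<'a, R: BufRead + 'a> {
    data: Rc<RefCell<R>>,
    remaining: usize,
    _marker: std::marker::PhantomData<&'a ()>,
}

impl<'a, R: BufRead + 'a> Iterator for BencodeStringIterator<'a, R> {
    type Item = io::Result<u8>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        let mut data = self.data.borrow_mut();
        let mut buf = [0u8; 1];
        // Propagate read errors (including unexpected EOF) to the caller.
        match data.read_exact(&mut buf) {
            Ok(()) => Some(Ok(buf[0])),
            Err(e) => Some(Err(e)),
        }
    }
}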
ec6cc56 docs: update README (Jose Celano)
68d9915 refactor: rename json::BencodeParser to json::Generator (Jose Celano)
a3c7c4b refactor: remove parent parser mod (Jose Celano)
3052d6a refactor: rename BencodeTOkenizer to Tokenizer (Jose Celano)
331c76e refactor: reorganize modules (Jose Celano)
9e0db6c refactor: remove writer from tokenizer string parser (Jose Celano)
0a05544 refactor: remove old int and str parsers with writers (Jose Celano)
75ffdb4 refactor: remove writer from tokenizer integer parser (Jose Celano)
77ad5af refactor: remove writer from main tokenizer (Jose Celano)
f6a0584 refactor: duplicate integer and strig parser before removing writer (Jose Celano)
3a7ea5d refactor: extract mod tokenizer (Jose Celano)
63b9b73 refactor: extract struct BencodeTokenizer (Jose Celano)
83eeefd refactor: extract bencode tokenizer (Jose Celano)

Pull request description:

This refactoring changes the current implementation to extract the tokenizer. It splits the parser logic into two types:

- **Tokenizer**: It returns bencoded tokens.
- **Generator**: It iterates over bencoded tokens to generate the JSON.

**NOTES**

- It keeps the custom recursivity (with an explicit stack) for the time being, instead of using actual recursion like @da2ce7 did [here](#12 (comment)). I guess that could be changed later if we think it increases readability and maintainability.

**SUBTASKS**

- [x] Separate logic for tokenizer.
- [x] Extract tokenizer.
- [x] Remove `Writer` from the tokenizer. It's not needed.

**PERFORMANCE**

In the current version, bencoded strings are cached in memory before we start writing to the output (because we need the whole string to check whether it's valid UTF-8). In this PR, bencoded integers are also cached in memory because the whole integer value is a token. This should not be a problem since integers are short, unlike strings.

**FUTURE PRs**

We could:

- [ ] Implement the `Iterator` trait for the tokenizer.
- [ ] Use recursion for the generator like @da2ce7's proposal [here](#12).
- [ ] Implement another generator for TOML, for example. Check if this design can be easily extended to other output formats.

ACKs for top commit:
josecelano:
ACK ec6cc56

Tree-SHA512: 9210211d802c8e19aef1f02f814b494c5919c7da81f299cf2c7f4d9fb12b4c63cbec4ac526996e6b1b3d69f75ca58894b9d64936bef2d9da851e70d51234c675
Hi @da2ce7 @magecnion, I've closed this issue because I merged a PR where I've extracted the tokenizer. I've also renamed the parser to generator. Now we have:
Therefore, this discussion doesn't make sense anymore. However, @da2ce7 added a lot of feedback and alternatives that would be good to explore. I've opened new issues:
My idea is to progressively refactor towards @da2ce7's proposal if it makes sense to you. |
Yesterday, I discussed the main type name with @magecnion. It might not be a good name.
Formal Definition of a Parser
Since the library converts bencoded data directly to JSON without creating an intermediary structured representation, the term "parser" might not entirely reflect its purpose. It implies more extensive processing than what the library does.
Alternative Names (ChatGPT)
cc @da2ce7 @magecnion