Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(rust): replace lexical by atoi to parse integer #10655

Closed
wants to merge 4 commits into from

Conversation

reswqa
Copy link
Collaborator

@reswqa reswqa commented Aug 21, 2023

This fixes #10635.

@reswqa reswqa marked this pull request as draft August 21, 2023 16:38
@github-actions github-actions bot added internal An internal refactor or improvement rust Related to Rust Polars labels Aug 21, 2023
@reswqa
Copy link
Collaborator Author

reswqa commented Aug 21, 2023

The failed test test_error_message imply that atoi::atoi::<i64>(b"1e5") will successfully parsed to Some(1). But we assume that it would fail to parse before.

Furthermore, the following is also true:

assert_eq!(Some(42), atoi::<u32>(b"42 is the answer to life, the universe and everything"));

It doesn't seem to provide options to prevent this(at least I didn't find it), and I'm a bit unclear about what our guiding principles are, more lenient or more strict? 🤔

@ritchie46
Copy link
Member

ritchie46 commented Aug 21, 2023

https://docs.rs/atoi/latest/atoi/trait.FromRadix10Checked.html#tymethod.from_radix_10_checked returns also the index in the byte slice. If index != bytes.len() && !ignore_errors we could raise.

Btw, this is the code from that function:

    fn from_radix_10_checked(text: &[u8]) -> (Option<I>, usize) {
        let max_safe_digits = max(1, I::max_num_digits_negative(nth(10))) - 1;
        let (number, mut index) = I::from_radix_10(&text[..min(text.len(), max_safe_digits)]);
        let mut number = Some(number);
        // We parsed the digits, which do not need checking now lets see the next one:
        while index != text.len() {
            if let Some(digit) = ascii_to_digit(text[index]) {
                number = number.and_then(|n| n.checked_mul(&nth(10)));
                number = number.and_then(|n| n.checked_add(&digit));
                index += 1;
            } else {
                break;
            }
        }
        (number, index)
    }

I think we can inline that and raise on the else { break }. Maybe we can also cache max_safe_digits.

@reswqa
Copy link
Collaborator Author

reswqa commented Aug 21, 2023

I think we can inline that and raise on the else { break }

Ah, thanks for the so quick reply! This is really a helpful suggestion 👍 , I will take a walk along this road tomorrow.

@reswqa
Copy link
Collaborator Author

reswqa commented Aug 22, 2023

Hi @ritchie46, I just inline(almost copy) this part of the code from atoi.

The current implement does not support max_safe_digits caching, but is still necessary if there are exactly performance issues here. And we use from_radix_10_signed_checked instead of from_radix_10_checked as we should take care of the sign character (+ or -).

If index != bytes.len() && !ignore_errors we could raise.

We used to directly return None when dealing with something like overflow, but retry and handle this in the outer layer(parse_bytes). I didn't think too much about whether to maintain the status quo or change the return value of parse to PolarsResult? If it's the former, it doesn't seem like we need to copy and maintain the code from atoi? Or is this for performance reasons?

Additionally, do we have some benchmark to cover this change?

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, do we have some benchmark to cover this change?

We don't have benchmarks. But I am definitely interested in some.

Probably hoist the parsing code into a generic function and benchmark that against lexical. This can be done independent of the csv parsing of course.

fn parse(bytes: &[u8]) -> Option<i32> {
lexical::parse(bytes).ok()
}
macro_rules! impl_integer_parser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this work with generics and then a macro that dispatches to the generics?

// `number += digit * signum`.
let value = match sign {
Sign::Plus => {
let max_safe_digits =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me if this will have constant time. It should be constant computable if I am not mistaken. If we are creating generics, we could also set it on the type?.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are creating generics, we could also set it on the type?

Intuitively, I like this idea. But when it came to implementation, I was a bit scratching my ears and cheeks, and I probably didn't find the trick.

Say we want pre-compute this value: i32::max_num_digits(nth(10)) and set it on the first const generics:

fn parse_from_radix_10<const MAX_DIGITS: usize, other_generics>{
      // ........balabalabala
}

Due to neither max_num_digits nor nth being const fn, the compiler is unhappy and it also preventing me from implementing nth as const fn. I don't have much experience in this area, I would like to hear your advice. 🙇

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going on holiday today, but I will see if I can find time to tinker with this.

I think you are correct about the const not working, but the generic part should.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will refactor it into a generic function, then macro only used for dispatching.

Most importantly: Have a pleasant holiday! 😄

@reswqa reswqa marked this pull request as ready for review August 24, 2023 05:44
@orlp
Copy link
Collaborator

orlp commented Aug 24, 2023

I don't think we should merge this. This approach will be much slower than lexical's much more optimized integer parsing. Furthermore I don't think lexical was the source of the problem in the first place, if I try lexical::parse locally it correctly gives errors:

use anyhow::Result;

fn main() -> Result<()> {
    let x: i32 = lexical::parse("2147483647")?;
    let y: i32 = lexical::parse("2147483648")?;
    dbg!(x, y);
    Ok(())
}
Error: lexical parse error: 'numeric overflow occurred' at index 9

@orlp
Copy link
Collaborator

orlp commented Aug 24, 2023

Actually, lexical does have some errors it seems. This is handled incorrectly:

let y: i32 = lexical::parse("9589934592")?;
dbg!(y);
1000000000

Perhaps we can fix it upstream? If that is not viable, we should implement an optimized integer parser ourselves or find another dependency, because radix-by-radix multiplying is rather slow.

@orlp
Copy link
Collaborator

orlp commented Aug 24, 2023

I think fixing lexical upstream is a non-starter, seems to be unmaintained with unsoundness bugs left open. I think we should find an alternative or develop one ourselves that is as fast without the bugs, because digit-by-digit integer parsing is rather slow.

@ritchie46
Copy link
Member

Maybe we should benchmark atoi vs std.

@reswqa
Copy link
Collaborator Author

reswqa commented Aug 24, 2023

lexical does have some errors it seems. This is handled incorrectly

Yes, it sometimes warp overflow value instead of throw error. That's why we want to replace it by other lib.

I don't think we should merge this. This approach will be much slower than lexical's much more optimized integer parsing.

This is just a PoC to demonstrate whether this is feasible. A reasonable solution is to find a way to benchmark it. If there is indeed a significant regression, feel free to close this. :)

@orlp
Copy link
Collaborator

orlp commented Aug 24, 2023

Also, looking at the code, @reswqa you copy/pasted significant chunks of code from the atoi source code without mentioning it: https://docs.rs/atoi/2.0.0/src/atoi/lib.rs.html#421

That's not ok!

@reswqa
Copy link
Collaborator Author

reswqa commented Aug 24, 2023

Also, looking at the code, @reswqa you copy/pasted significant chunks of code from the atoi source code without mentioning it:

Sorry, I accidentally mark it as can be review. It's actually just a Draft. In my opinion, we need to benchmark before merging it, so I did not add more tests as well as the copyright.

@reswqa reswqa marked this pull request as draft August 24, 2023 09:30
@ritchie46
Copy link
Member

Also, looking at the code, @reswqa you copy/pasted significant chunks of code from the atoi source code without mentioning it: https://docs.rs/atoi/2.0.0/src/atoi/lib.rs.html#421

That's not ok!

My bad. I asked to inline part of it to cache some part. We do need to mention the original code.

@orlp
Copy link
Collaborator

orlp commented Aug 24, 2023

Ah, I did not see that you asked for it. Either way we should mention it, but I think we should have a more optimized implementation regardless. Perhaps we can accept a simplified implementation knowing it's a speed regression for now and then find/develop a faster one later. I know I can do that if needed.

@reswqa
Copy link
Collaborator Author

reswqa commented Aug 24, 2023

My bad. I asked to inline part of it to cache some part.

Ah, That's not your fault. TBH, I realized this copyright problem at the beginning, but considering that it's only draft at the moment, so I'm slacking off(this is definitely not a good habit) :) You guys are doing something very meaningful (Polars are really great), which is also why I started to be active here.

At first, I posted this on Discord to see if anyone knew the best alternative to lexical. When I heard about atoi from Ritchie, I felt that giving it a try was better than doing nothing. But I admit that in the long run, we must find a more optimized solution, and I will take some time to see if I can be helpful 😄

@ritchie46
Copy link
Member

You guys are doing something very meaningful (Polars are really great), which is also why I started to be active here.

Your help is very welcome. You contributed a lot of great patches in the last days. Really great to have your help! :)

@reswqa reswqa closed this Nov 16, 2023
@reswqa
Copy link
Collaborator Author

reswqa commented Nov 16, 2023

Closed as completed via #12512.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv() inconsistently handles out-of-range integers when dtypes are Int8/16/32
3 participants