Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust): Move transpose naming to Rust #10009

Merged
merged 9 commits into from
Jul 26, 2023

Conversation

magarick
Copy link
Contributor

Moves the functionality to customize the names of a transposed DataFrame and keep its column names as a new column into Rust.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature rust Related to Rust Polars labels Jul 21, 2023
&self,
dtype: &DataType,
keep_names_as: Option<&str>,
colnames: &[String],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col_names or column_names

.collect::<Vec<_>>();
Ok(DataFrame::new_no_checks(cols))
}
}?;
out.set_column_names(colnames)?;
match keep_names_as {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this column before we transpose/ generate the columns.

Inserting at the start of a Vec is O(n) where it is free if we add it before and then extend that vec with the transposed columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the cleanest way to do this? Right now it collects an iterator into a vec of series before creating the dataframe. Is it best to preallocate a Vec and dump everything into that? Or create a dataframe up front and have the iterator generating the series modify it in place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might get a little awkward and we should only revisit if it seems like there are performance issues. I'm trying to create a vector of series upfront and then push to it, which seemed reasonable. But the numeric_transpose versions use parallel iterators, so I think I'd have to take the tail of this pre-allocated new Series vector, pass it mutably to numeric_transpose and do collect_into_vec. Not sure if there's a cleaner way to do this.

Copy link
Member

@ritchie46 ritchie46 Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-numeric one collects into a vec, so that iterator can extend to an existing buffer.

The numeric one, can reallocate by appending to the prepared vec. That has the same cost we have now.

So this would still make the second branch cheaper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's fine. Easy enough. It seems like this has been optimized pretty thoroughly since it's a slow operation, but that makes it tricky to change. Our of curiosity, do you know if there's a reasonable way to handle this with the parallel iterators? Could I use unsafe code to have the parallel iterator collect into a mutable alias of a prepared buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently there's par_extend so I think there should be no reallocations anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently there's par_extend so I think there should be no reallocations anymore.

Even that one reallocates :) But that is definitely what we need in this case, as it is a relallocation less.

Either::Left(cname) => {
let new_names = self.column(&cname).and_then(|x| x.utf8())?;
polars_ensure!(!new_names.has_validity(), ComputeError: "Column with new names can't have null values");
df = self.drop(&cname)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: c_name or col_name

keep_names_as: Option<&str>,
column_names: Option<Either<String, Vec<String>>>,
) -> PolarsResult<DataFrame> {
let mut df = self.clone(); // Must be owned so we get the same type if dropping a column.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put it behind a Cow then we don't need to clone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is cloning actually doing much here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You heap allocate a new Vec and atomically increment every series and follow the indirections. DataFrames can be very wide.

Putting it behind a Cow is almost free in comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I was working under the assumption no one would be transposing something with more than, say, a million columns, which would make this small in comparison to the rest of the operation. But it sounds easy enough to avoid moooooving around data we don't need to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the Cow is almost free, so it is good practice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tell that to people from cultures where the cow is a symbol of wealth and prestige :-)

let new_names = self.column(&cname).and_then(|x| x.utf8())?;
polars_ensure!(!new_names.has_validity(), ComputeError: "Column with new names can't have null values");
df = self.drop(&cname)?;
new_names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have new_names scoped above, we can collect into a Vec<&str> and we don't need to heap allocate the strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I make that work with the other branch that generates names by formatting? I couldn't figure out how to end up with &str instead of String because of that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Yeah, from_dtypes accepts &[String]. Let's leave that one for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I had initially tried to do was return an iterator in these blocks that could be any AsRef<str> so we wouldn't have to allocate when either generating column names or taking them from an existing column. But I wasn't able to get that to work. If you have any ideas that would be helpful. I assumed that this operation is so expensive you'd generally be allocating a few thousand strings at most, but who knows.

ritchie46
ritchie46 previously approved these changes Jul 21, 2023
if let Some(cn) = keep_names_as {
// Check that the column name we're using for the original column names is unique before
// wasting time transposing
polars_ensure!(!colnames_t.contains(&cn.to_owned()), Duplicate: "{} is already in output column names", cn)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't cn.as_ref() or as_str() work here? That saves an allocation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I tried that and it didn't work because colnames_t contains owned strings. I'm sure there's a better way to do it. I'm no expert in Rust strings, and they're not easy to work with to begin with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, contains doesn't accept the borrow trait. In that case I think we better use !iter().any(|a| a.as_ref() == b.as_ref()).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work but comparing a.as_str() to cn does.
What's the importance of one seemingly small allocation here? Does it increase the chance of cache issues later?

&self,
dtype: &DataType,
keep_names_as: Option<&str>,
names_t: &[String],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this argument? What is meant by t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know it's a universal symbol for transpose and seemed reasonable here. I thought it would be super clear and useful in this context since at times we refer to the names of the original dataframe and at other times the names of the new transposed one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's name it names_transposed, then it is clear for everyone. I assume it are the names after transposing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they don't know t stands for transpose nothing's gonna help them. I changed it to names_out which will hopefully be clear even to people who don't know what a transpose is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had this discussion before, I don't want single letter identifier in non-local code.

I am trying to have the code in a consistent style and having discussions on this isn't worth both our time I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a. I know, I changed it!
b. This function is a helper used only in this file. How non-local are we talking?
c. Is the policy that no names can contain single-letter sub-tokens? Even where it's clear from context like n_something or hstack?

No need to answer b and c, just something to think about for the guide. Like I've said before, I'm not trying to give you a hard time, just trying to help test the policies to lead to maximal clarity. And I'm happy to help writing up guidelines if you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I asked initially was because when I read names_t, I found myself thinking; "names for what?" -> "t" -> "transpose" or "transposed"?

Do you mean the names of the columns you want to transpose or rename them after transposing. All this mental arithmetic can be prevented with slightly more explicit names. This helps many readers.

So I think names_out is a good parameter name now. 👍

Regarding c. n_foo is very clear when we talk about the amount of something. It is not a verb that can be in present or past tense. I think that verbs never should be changed into a single letter. We have a hstack for legacy reasons (we should at least name it h_stack), but on the python side we all renamed sum to sum_horizontal, etc. Now, for non-local code I think it is fine to use hstack, n, i etc.

I want things to be more clear when we get into function arguments and even more when we get in public domain. This isn't math, so it will not always be clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I asked initially was because when I read names_t, I found myself thinking; "names for what?" -> "t" -> "transpose" or "transposed"?

Do you mean the names of the columns you want to transpose or rename them after transposing. All this mental arithmetic can be prevented with slightly more explicit names. This helps many readers.

Well, in this case, I think even writing out "transpose{d}" would have been wrong, since the names aren't a transpose of anything. It's the names of a thing that's transposed.

Regarding c. n_foo is very clear when we talk about the amount of something. It is not a verb that can be in present or past tense. I think that verbs never should be changed into a single letter. We have a hstack for legacy reasons (we should at least name it h_stack), but on the python side we all renamed sum to sum_horizontal, etc. Now, for non-local code I think it is fine to use hstack, n, i etc.

This may be a topic for another time, but "horizontal", "vertical" and "diagonal" are terms I'd rather nip in the bud. It's not that they're long, though I'm not a fan either, but I find myself constantly confused by them. We have perfectly good, short terms like "row", "col", "by_row", "by_col", "row_wise", "col_wise" etc that are both shorter and more natural for the domain.

I want things to be more clear when we get into function arguments and even more when we get in public domain. This isn't math, so it will not always be clear.

Have you seen the variation and inconsistency in mathematical notation? Anyway, clarity is highly subjective, so the question is always "clear to whom" and "how do you make sure"? But you know what they say, the two hardest problems in computer science are cache invalidation, naming things, and off by one errors.

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @magarick. Almost there.

py-polars/src/dataframe.rs Outdated Show resolved Hide resolved
Remove the comment I guess.

Co-authored-by: Ritchie Vink <[email protected]>
@ritchie46 ritchie46 merged commit c52e70c into pola-rs:main Jul 26, 2023
24 checks passed
@magarick magarick deleted the name-transpose-in-rust branch July 26, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants