feat(rust): Move transpose naming to Rust #10009

magarick · 2023-07-21T01:31:07Z

Moves into Rust the functionality for customizing the column names of a transposed DataFrame and for keeping the original column names as a new column.

ritchie46 · 2023-07-21T07:30:17Z

polars/polars-core/src/frame/row/transpose.rs

+        &self,
+        dtype: &DataType,
+        keep_names_as: Option<&str>,
+        colnames: &[String],


col_names or column_names

ritchie46 · 2023-07-21T07:37:39Z

polars/polars-core/src/frame/row/transpose.rs

                    .collect::<Vec<_>>();
                Ok(DataFrame::new_no_checks(cols))
            }
+        }?;
+        out.set_column_names(colnames)?;
+        match keep_names_as {


Can we add this column before we transpose/generate the columns?

Inserting at the start of a Vec is O(n), whereas it is free if we add it first and then extend that Vec with the transposed columns.

What's the cleanest way to do this? Right now it collects an iterator into a vec of series before creating the dataframe. Is it best to preallocate a Vec and dump everything into that? Or create a dataframe up front and have the iterator generating the series modify it in place?

I think this might get a little awkward and we should only revisit if it seems like there are performance issues. I'm trying to create a vector of series upfront and then push to it, which seemed reasonable. But the numeric_transpose versions use parallel iterators, so I think I'd have to take the tail of this pre-allocated new Series vector, pass it mutably to numeric_transpose and do collect_into_vec. Not sure if there's a cleaner way to do this.

The non-numeric one collects into a vec, so that iterator can extend to an existing buffer.

The numeric one can reallocate by appending to the prepared vec. That has the same cost we have now.

So this would still make the second branch cheaper.

Ok, that's fine. Easy enough. It seems like this has been optimized pretty thoroughly since it's a slow operation, but that makes it tricky to change. Out of curiosity, do you know if there's a reasonable way to handle this with the parallel iterators? Could I use unsafe code to have the parallel iterator collect into a mutable alias of a prepared buffer?

Apparently there's par_extend so I think there should be no reallocations anymore.

> Apparently there's par_extend so I think there should be no reallocations anymore.

Even that one reallocates :) But that is definitely what we need in this case, as it is one reallocation less.
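The ordering suggested in this thread (put the names column into the buffer first, then extend it with the transposed columns) can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: `Column` and `build_columns` are hypothetical stand-ins for polars' `Series` and the transpose helper, and the parallel `par_extend` path is left out.

```rust
// Hypothetical stand-in for a column; the real code works with polars `Series`.
#[derive(Debug)]
struct Column {
    name: String,
    values: Vec<String>,
}

/// Build the output columns with the optional names column in front.
fn build_columns(
    names_col: Option<Column>,
    transposed: impl ExactSizeIterator<Item = Column>,
) -> Vec<Column> {
    // Reserve room for everything up front: the optional names column
    // plus all transposed columns.
    let mut cols = Vec::with_capacity(transposed.len() + usize::from(names_col.is_some()));
    // Pushing onto the still-empty vector is cheap. Calling `cols.insert(0, ..)`
    // after collecting would shift every column already stored.
    if let Some(col) = names_col {
        cols.push(col);
    }
    cols.extend(transposed);
    cols
}

fn main() {
    let names = Column { name: "column".into(), values: vec!["a".into(), "b".into()] };
    let transposed = vec![
        Column { name: "column_0".into(), values: vec!["1".into(), "3".into()] },
        Column { name: "column_1".into(), values: vec!["2".into(), "4".into()] },
    ];
    for col in build_columns(Some(names), transposed.into_iter()) {
        println!("{} has {} values", col.name, col.values.len());
    }
}
```

Because the capacity is reserved before the first push, the sequential `extend` here does not need to grow the buffer.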

ritchie46 · 2023-07-21T07:38:52Z

polars/polars-core/src/frame/row/transpose.rs

+                Either::Left(cname) => {
+                    let new_names = self.column(&cname).and_then(|x| x.utf8())?;
+                    polars_ensure!(!new_names.has_validity(), ComputeError: "Column with new names can't have null values");
+                    df = self.drop(&cname)?;


nit: c_name or col_name

ritchie46 · 2023-07-21T07:40:21Z

polars/polars-core/src/frame/row/transpose.rs

+        keep_names_as: Option<&str>,
+        column_names: Option<Either<String, Vec<String>>>,
+    ) -> PolarsResult<DataFrame> {
+        let mut df = self.clone(); // Must be owned so we get the same type if dropping a column.


You can put it behind a Cow then we don't need to clone.

Is cloning actually doing much here?

You heap-allocate a new Vec, atomically increment the reference count of every Series, and follow the indirections. DataFrames can be very wide.

Putting it behind a Cow is almost free in comparison.

I guess I was working under the assumption no one would be transposing something with more than, say, a million columns, which would make this small in comparison to the rest of the operation. But it sounds easy enough to avoid moooooving around data we don't need to.

Well the Cow is almost free, so it is good practice.

Tell that to people from cultures where the cow is a symbol of wealth and prestige :-)
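The `Cow` suggestion from this thread could look roughly like the sketch below: borrow the frame when no column has to be dropped, and only produce an owned frame when one does. `Frame`, `without_column`, and `frame_to_transpose` are hypothetical names used for illustration; they are not the polars API.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for a DataFrame: just a list of column names.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    columns: Vec<String>,
}

impl Frame {
    /// Return a new frame without the column `name` (allocates a new Vec).
    fn without_column(&self, name: &str) -> Frame {
        Frame {
            columns: self
                .columns
                .iter()
                .filter(|c| c.as_str() != name)
                .cloned()
                .collect(),
        }
    }
}

/// Borrow the input unless a header column actually has to be dropped.
fn frame_to_transpose<'a>(df: &'a Frame, header_col: Option<&str>) -> Cow<'a, Frame> {
    match header_col {
        // Dropping a column produces a new frame, so we have to own it.
        Some(name) => Cow::Owned(df.without_column(name)),
        // Nothing changes: hand back a reference, no clone needed.
        None => Cow::Borrowed(df),
    }
}

fn main() {
    let df = Frame { columns: vec!["id".to_string(), "x".to_string(), "y".to_string()] };
    let untouched = frame_to_transpose(&df, None);
    let dropped = frame_to_transpose(&df, Some("id"));
    println!("borrowed: {}", matches!(untouched, Cow::Borrowed(_)));
    println!("columns after drop: {:?}", dropped.columns);
}
```

Both branches yield the same type, which is the problem the original `self.clone()` was solving, but the no-drop path no longer copies anything.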

ritchie46 · 2023-07-21T07:41:13Z

polars/polars-core/src/frame/row/transpose.rs

+                    let new_names = self.column(&cname).and_then(|x| x.utf8())?;
+                    polars_ensure!(!new_names.has_validity(), ComputeError: "Column with new names can't have null values");
+                    df = self.drop(&cname)?;
+                    new_names


If we have new_names scoped above, we can collect into a Vec<&str> and we don't need to heap allocate the strings.

How do I make that work with the other branch that generates names by formatting? I couldn't figure out how to end up with &str instead of String because of that.

Ah, right. Yeah, from_dtypes accepts &[String]. Let's leave that one for now.

What I had initially tried to do was return an iterator in these blocks that could be any AsRef<str> so we wouldn't have to allocate when either generating column names or taking them from an existing column. But I wasn't able to get that to work. If you have any ideas that would be helpful. I assumed that this operation is so expensive you'd generally be allocating a few thousand strings at most, but who knows.
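One possible way to give both branches a common type without always allocating is `Cow<str>`: borrowed when the names come from an existing column, owned when they are generated by formatting. This is only a sketch of that idea and not what the PR does; `output_names` and the `column_{i}` pattern are illustrative assumptions, and the result would still have to be converted wherever `&[String]` is required, as noted above for `from_dtypes`.

```rust
use std::borrow::Cow;

/// Names for the transposed frame: borrowed when taken from existing
/// strings, owned when generated by formatting.
fn output_names<'a>(existing: Option<&'a [String]>, n: usize) -> Vec<Cow<'a, str>> {
    match existing {
        // No new strings are allocated here, only the Vec of borrows.
        Some(names) => names.iter().map(|s| Cow::Borrowed(s.as_str())).collect(),
        // Generated names have to be owned because nothing else holds them.
        None => (0..n).map(|i| Cow::Owned(format!("column_{i}"))).collect(),
    }
}

fn main() {
    let existing = vec!["a".to_string(), "b".to_string()];
    println!("{:?}", output_names(Some(existing.as_slice()), 2));
    println!("{:?}", output_names(None, 3));
}
```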

ritchie46 · 2023-07-21T12:43:31Z

polars/polars-core/src/frame/row/transpose.rs

+        if let Some(cn) = keep_names_as {
+            // Check that the column name we're using for the original column names is unique before
+            // wasting time transposing
+            polars_ensure!(!colnames_t.contains(&cn.to_owned()), Duplicate: "{} is already in output column names", cn)


Doesn't cn.as_ref() or as_str() work here? That saves an allocation.

I think I tried that and it didn't work because colnames_t contains owned strings. I'm sure there's a better way to do it. I'm no expert in Rust strings, and they're not easy to work with to begin with.

Right, contains doesn't accept the borrow trait. In that case I think we'd better use !iter().any(|a| a.as_ref() == b.as_ref()).

That doesn't work but comparing a.as_str() to cn does.
What's the importance of one seemingly small allocation here? Does it increase the chance of cache issues later?
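The allocation-free check the thread converges on (comparing `a.as_str()` to `cn`) can be shown in isolation. The function name `name_taken` is a hypothetical helper for illustration; in the PR the comparison sits inside a `polars_ensure!` call.

```rust
/// Returns true if `candidate` already appears in `names`, without allocating.
fn name_taken(names: &[String], candidate: &str) -> bool {
    // `names.contains(..)` expects a `&String`, which would force
    // `candidate.to_owned()`. Comparing through `as_str()` keeps
    // everything borrowed.
    names.iter().any(|name| name.as_str() == candidate)
}

fn main() {
    let names = vec!["column_0".to_string(), "column_1".to_string()];
    println!("column_1 taken: {}", name_taken(&names, "column_1"));
    println!("column taken: {}", name_taken(&names, "column"));
}
```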

ritchie46 · 2023-07-25T06:19:27Z

polars/polars-core/src/frame/row/transpose.rs

+        &self,
+        dtype: &DataType,
+        keep_names_as: Option<&str>,
+        names_t: &[String],


Can you rename this argument? What is meant by t?

As far as I know it's a universal symbol for transpose and seemed reasonable here. I thought it would be super clear and useful in this context since at times we refer to the names of the original dataframe and at other times the names of the new transposed one.

Let's name it names_transposed, then it is clear for everyone. I assume these are the names after transposing?

If they don't know t stands for transpose nothing's gonna help them. I changed it to names_out which will hopefully be clear even to people who don't know what a transpose is.

We had this discussion before, I don't want single-letter identifiers in non-local code.

I am trying to keep the code in a consistent style, and having discussions on this isn't worth either of our time, I think.

a. I know, I changed it!
b. This function is a helper used only in this file. How non-local are we talking?
c. Is the policy that no names can contain single-letter sub-tokens? Even where it's clear from context like n_something or hstack?

No need to answer b and c, just something to think about for the guide. Like I've said before, I'm not trying to give you a hard time, just trying to help test the policies to lead to maximal clarity. And I'm happy to help writing up guidelines if you want.

The reason I asked initially was because when I read names_t, I found myself thinking: "names for what?" -> "t" -> "transpose" or "transposed"?

Do you mean the names of the columns you want to transpose, or the names to give them after transposing? All this mental arithmetic can be prevented with slightly more explicit names. This helps many readers.

So I think names_out is a good parameter name now. 👍

Regarding c. n_foo is very clear when we talk about the amount of something. It is not a verb that can be in present or past tense. I think that verbs should never be shortened to a single letter. We have hstack for legacy reasons (we should at least name it h_stack), but on the Python side we renamed them all: sum to sum_horizontal, etc. Now, for local code I think it is fine to use hstack, n, i etc.

I want things to be clearer when we get into function arguments, and even more so when we get into the public domain. This isn't math, so it will not always be clear.

> The reason I asked initially was because when I read names_t, I found myself thinking: "names for what?" -> "t" -> "transpose" or "transposed"?
>
> Do you mean the names of the columns you want to transpose, or the names to give them after transposing? All this mental arithmetic can be prevented with slightly more explicit names. This helps many readers.

Well, in this case, I think even writing out "transpose{d}" would have been wrong, since the names aren't a transpose of anything. It's the names of a thing that's transposed.

> Regarding c. n_foo is very clear when we talk about the amount of something. It is not a verb that can be in present or past tense. I think that verbs should never be shortened to a single letter. We have hstack for legacy reasons (we should at least name it h_stack), but on the Python side we renamed them all: sum to sum_horizontal, etc. Now, for local code I think it is fine to use hstack, n, i etc.

This may be a topic for another time, but "horizontal", "vertical" and "diagonal" are terms I'd rather nip in the bud. It's not that they're long, though I'm not a fan either, but I find myself constantly confused by them. We have perfectly good, short terms like "row", "col", "by_row", "by_col", "row_wise", "col_wise" etc that are both shorter and more natural for the domain.

> I want things to be clearer when we get into function arguments, and even more so when we get into the public domain. This isn't math, so it will not always be clear.

Have you seen the variation and inconsistency in mathematical notation? Anyway, clarity is highly subjective, so the question is always "clear to whom" and "how do you make sure"? But you know what they say, the two hardest problems in computer science are cache invalidation, naming things, and off by one errors.

ritchie46

Thanks @magarick. Almost there.

py-polars/src/dataframe.rs

Remove the comment I guess. Co-authored-by: Ritchie Vink <[email protected]>

magarick added 2 commits July 20, 2023 18:28

Move transpose naming to Rust

a9fcc38

dependencies

3792352

magarick requested review from ritchie46, stinodego and alexander-beedie as code owners July 21, 2023 01:31

github-actions bot added the enhancement and rust labels Jul 21, 2023

clippy

3050688

ritchie46 reviewed Jul 21, 2023

ritchie46 previously approved these changes Jul 21, 2023

ritchie46 self-requested a review July 21, 2023 13:52

magarick added 4 commits July 23, 2023 22:33

checkpoint

4fb9273

more efficient

73e3330

name

fa91891

format

b6aae79

ritchie46 reviewed Jul 25, 2023

rename

504d9dc

ritchie46 reviewed Jul 26, 2023

Update py-polars/src/dataframe.rs

09af567

Remove the comment I guess. Co-authored-by: Ritchie Vink <[email protected]>

ritchie46 merged commit c52e70c into pola-rs:main Jul 26, 2023
24 checks passed

magarick deleted the name-transpose-in-rust branch July 26, 2023 17:35
