Fix: Slide ids turned into floats in split csv when names consist of only number #228

ff98li · 2024-02-26T20:59:51Z

Summary of the Issue

Slide IDs consisting solely of numeric characters are inadvertently converted to floats in the split CSV files:
- The train, val, and test splits have unequal lengths, so NaN values are introduced when save_splits() concatenates them into a single dataframe.
- pandas converts any all-numeric column that contains NaN to float, because its default integer dtypes cannot represent NaN.
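
The mechanism can be reproduced with a minimal sketch. The slide IDs and the three series below are made-up stand-ins for the per-split slide_id columns, and the concatenation only approximates what save_splits() does; it is not the repository's code.

```python
import io

import pandas as pd

# Hypothetical all-numeric slide IDs; the three splits have unequal lengths.
train = pd.Series([1001, 1002, 1003], name="train")
val = pd.Series([1004], name="val")
test = pd.Series([1005, 1006], name="test")

# Column-wise concatenation pads the shorter columns with NaN,
# which forces those integer columns to float64.
splits = pd.concat([train, val, test], axis=1)
print(splits.dtypes)

# Writing the frame to CSV therefore stores "1004.0" rather than "1004".
buf = io.StringIO()
splits.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```

Only the columns that received NaN padding change dtype; the longest split keeps its integer dtype, which is why the problem can look inconsistent across columns.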

Loading the split CSV with the following line then raises the ValueError shown in the screenshot:

Line 247 in 3f875f7

    
           all_splits = pd.read_csv(csv_path, dtype=self.slide_data['slide_id'].dtype)  # Without "dtype=self.slide_data['slide_id'].dtype", read_csv() will convert all-number columns to a numerical type. Even if we convert numerical columns back to objects later, we may lose zero-padding in the process; the columns must be correctly read in from the get-go. When we compare the individual train/val/test columns to self.slide_data['slide_id'] in the get_split_from_df() method, we cannot compare objects (strings) to numbers or even to incorrectly zero-padded objects/strings. An example of this breaking is shown in https://github.com/andrew-weisman/clam_analysis/tree/main/datatype_comparison_bug-2021-12-01.
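
A hedged sketch of how this read can go wrong, assuming the dataset's slide_id column has an integer dtype (the screenshot is not reproduced here, so the exact error the author saw is not confirmed). The CSV text is invented for illustration.

```python
import io

import pandas as pd

# Hypothetical split CSV as it might look after the float conversion.
csv_text = "train,val\n1001,1004.0\n1002,\n"

# Forcing an integer dtype fails: the "val" column holds float-formatted
# text and a blank (NaN) cell, neither of which fits int64.
error = None
try:
    pd.read_csv(io.StringIO(csv_text), dtype="int64")
except ValueError as exc:
    error = exc
print(type(error).__name__)

# Reading as text avoids the error, but the IDs keep their float
# formatting and no longer equal the original "1004".
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)
first_val = as_text["val"].iloc[0]
print(first_val)

# The zero-padding concern from the code comment: inference drops leading
# zeros, while an explicit text dtype keeps them.
padded = "slide_id\n0012\n"
inferred = pd.read_csv(io.StringIO(padded))["slide_id"].iloc[0]
kept = pd.read_csv(io.StringIO(padded), dtype=str)["slide_id"].iloc[0]
print(inferred, kept)
```

Either failure mode, the hard error or the silent "1004.0" mismatch, traces back to the CSV contents, which is why correcting the values at write time matters.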

Proposed fix

- Cast slide IDs to strings in save_splits() before they are written to CSV, preventing the unintended type conversion.
  - Result: [screenshot]
- Continue to read the dataset CSV with dtype=object in Generic_WSI_Classification_Dataset.
  - In get_split_from_df(), cast the corresponding split column to the dtype of self.slide_data['slide_id'].
  - This fix relates to "Datatype comparison bug 2021-12-01" (#90).
  - Result: [screenshot]
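
The two steps of the proposed fix can be sketched roughly as follows. The helper save_splits_as_text() and the sample IDs are hypothetical illustrations, not the code in this PR; only the idea (cast to string before writing, cast the split column to the slide_id dtype before comparing) comes from the description above.

```python
import io

import pandas as pd


def save_splits_as_text(splits, column_keys):
    """Hypothetical stand-in for save_splits(): cast IDs to str, then concatenate."""
    as_text = [s.astype(str).reset_index(drop=True) for s in splits]
    df = pd.concat(as_text, ignore_index=True, axis=1)
    df.columns = column_keys
    return df


# Hypothetical dataset column with all-numeric slide IDs.
slide_ids = pd.Series([1001, 1002, 1003, 1004, 1005, 1006], name="slide_id")
splits = [slide_ids.iloc[:3], slide_ids.iloc[3:4], slide_ids.iloc[4:]]

df = save_splits_as_text(splits, ["train", "val", "test"])
buf = io.StringIO()
df.to_csv(buf, index=False)
csv_text = buf.getvalue()  # IDs are written as "1004", not "1004.0"

# Read the splits back as text, then cast one split column to the dtype of
# the dataset's slide_id column before comparing.
all_splits = pd.read_csv(io.StringIO(csv_text), dtype=str)
val_ids = all_splits["val"].dropna().astype(slide_ids.dtype)
mask = slide_ids.isin(val_ids)
print(mask.tolist())
```

Casting at write time keeps the CSV free of float formatting, and casting again at comparison time means the check works whether the dataset column holds integers or strings.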

This happened while I was working with the dataset CSV for my own task. I can provide the CSV file to reproduce the bug if need be.

Fix:slide ids turned into floats in split csv when names consist of only numerical characters

17c85fb
