Surprising increases in file size, possibly related to origin in SAS or being manipulated in Google Sheets #1110
-
As I pondered this issue, I realized that there were at least three more test files worth creating. They are:
#=============================================================
library(openxlsx2) # Required for manipulating rows and columns of Excel
#' Sets the working directory to the folder where the script, the template
#' and the variable list reside. The new file will also be created here.
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
options(openxlsx2.na.strings = "") # Makes an empty string the default for missing data.
#=============================================================
#' First, specify the list of files to update.
target <- c("Size_issue_A_original_as_csv.csv")
#=============================================================
#' Define the function that writes the data starting at row 4 (header row)
wb_move_data <- function(df) {
  wb_new <- wb_workbook()
  wb_new$add_worksheet("Data")
  wb_new$add_data(x = df, sheet = "Data", start_row = 4, start_col = 1,
                  col_names = TRUE)
  wb_new$set_sheetview(sheet = "Data", top_left_cell = "A1", zoom_scale = 100)
  return(wb_new)
}
#=============================================================
#' Process the target dataset (a single file here; wrap the lines below in a loop for several files)
path_for_updated_wb <- paste0("Moved_rows_",target)
df <- read.csv(target)
#' Call function to move and reformat
target_wb <- wb_move_data(df)
print(paste("Completed processing the wb: ", path_for_updated_wb))
# Before saving set selected and active sheet to first in workbook
target_wb$set_selected(sheet = 1) # Necessary to avoid multiple sheets being selected.
target_wb$set_bookview(active_tab = 0, first_sheet = 0)
# Save the target workbook
wb_save(wb = target_wb, file = path_for_updated_wb)
#' End of script
-
I ran the script as:

fl <- "Size_issue_A_original_data.xlsx"
wb <- wb_load(fl)
new_wb <- wb_workbook()$add_worksheet()$add_data(
  x = df,
  dims = wb_dims(x = df, from_row = 4),
  na.strings = NULL
)
fs::file_size(tp)
if (interactive()) xl_open(tp)

But the reported size remains quite large, 5.64 MB. However, if I save the temporary xlsx file, the size drops to 1.2 MB, close to, but still larger than, your value of 971 KB. Please let me know if there are settings or options I might test to help with further diagnostics. You also mentioned the possibility of SAS exporting data in multiple flavors of Excel. I'd offer to test this, but I stopped using SAS when the University of Florida, my employer, began requiring all users to use a cloud version.
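For anyone reproducing this: the snippet above elides the definitions of df and tp. A self-contained sketch, assuming df is the sheet read back from the loaded workbook and tp is a temporary output path (both assumptions, not part of the quoted code):

library(openxlsx2)

fl <- "Size_issue_A_original_data.xlsx"
wb <- wb_load(fl)
df <- wb_to_df(wb)                  # assumed: df comes from the loaded workbook
tp <- tempfile(fileext = ".xlsx")   # assumed: tp is a temporary output path

new_wb <- wb_workbook()$add_worksheet()$add_data(
  x = df,
  dims = wb_dims(x = df, from_row = 4),
  na.strings = NULL
)
new_wb$save(tp)   # the on-disk size is what the numbers above compare
fs::file_size(tp)
if (interactive()) xl_open(tp)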
-
I installed the package (after learning a bit about installing and compiling) and can confirm that I now get the 971 KB file size that you obtained.

Yes, storage is so cheap that large files are no longer as big an issue. We do work with people at remote locations with sub-optimal Internet access, however, so I like to avoid exchanging unnecessarily large files.
We are trying to avoid using Excel as a database. Our current work involves finding a middle ground for data capture among multiple research groups. One clear need is a way to provide quality assurance tools that not only run some basic tests for correct values, but also test whether identifiers are used correctly and whether all variables are defined. Our hope is that users can run simple R scripts to check their data before submitting it to a repository, where it might be loaded into a database or used for training AI. So far, R and openxlsx2 have worked very well, plus or minus a few issues like the ones you have helped us resolve.
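A minimal sketch of the kind of check described, in case it helps others; every file, column, and pattern name below is hypothetical:

library(openxlsx2)

df <- wb_to_df(wb_load("submission.xlsx"))        # hypothetical submission file
codebook <- c("plot_id", "species", "height_cm")  # hypothetical variable list

# 1. Every column must be defined in the codebook
undefined <- setdiff(names(df), codebook)
if (length(undefined) > 0)
  warning("Undefined variables: ", toString(undefined))

# 2. Identifiers must be unique and well-formed
if (anyDuplicated(df$plot_id) > 0)
  warning("Duplicate identifiers found")
bad_ids <- df$plot_id[!grepl("^P[0-9]{4}$", df$plot_id)]
if (length(bad_ids) > 0)
  warning("Malformed identifiers: ", toString(unique(bad_ids)))

# 3. Basic range test on a measured value
if (any(df$height_cm < 0, na.rm = TRUE))
  warning("Negative heights found")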
No future plans to use SAS. I've been a happy SAS user since 1977, when I used punch cards and my programs and data were transmitted from Cali, Colombia to Bogotá over a modem and run on the only IBM mainframe with SAS in the country (with DANE, the national office for statistics).
On Sun, Aug 18, 2024 at 11:06 AM Jan Marvin Garbuszus wrote:
You'd have to install the package from the development branch as linked above. But obviously, there might be hidden issues I'm currently not aware of. Therefore you're a bit of a guinea pig if you want to test this.

Meanwhile, this confirms my suspicion that I should advise people to avoid sparse matrices of data. [...]
Hm, dunno, any file < 100 MB is nothing I'm losing sleep over. But generally speaking you're obviously correct. A large file containing mostly missings probably isn't the best idea. What I generally advise people is the obvious: do not use xlsx files as a database. With openxlsx2 there are known memory limitations, and unless it is absolutely mandatory to export lots of data to xlsx - because management wants a report - in most cases exporting via some other binary data format is the way to go, and you can even use external data sources as the data reference in pivot tables. Especially since, as you mention, there are many quirks in the Office Open XML format that are not always obvious.
If you still use SAS and have a working directory where you can access your sas7bdat files, it's often much quicker to simply import these files with readsas instead of exporting them to some xlsx file.
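A short sketch of the two alternatives mentioned; file names are placeholders, and the SAS-import lines are commented out since they need a real sas7bdat file (check the readsas docs for its exact interface):

df <- data.frame(id = 1:3, value = c(2.5, NA, 7.1))  # toy data

# Keep working data in a binary format; export to xlsx only for reports
saveRDS(df, "working_data.rds")
df2 <- readRDS("working_data.rds")

# Import a SAS dataset directly instead of round-tripping through xlsx:
# readsas::read.sas("mydata.sas7bdat")  # the package mentioned above
# haven::read_sas("mydata.sas7bdat")    # a widely used alternative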
-
Thanks for your support on these issues, even while you are out for a well-earned walk. I will continue with the GitHub pre-release version of openxlsx2 and let you know if anything strange happens.

Best regards!
On Sun, Aug 18, 2024 at 2:33 PM Jan Marvin Garbuszus wrote:
You definitely have your reasons and I'm happy to help where I can. I just have to make sure that it works as expected. A slim file size but a broken file would be a bad tradeoff. Unfortunately this requires testing with many different files and spreadsheet programs, and the openxlsx2 QA department is lazy and goes for a walk instead of testing ;)

You definitely must have many interesting stories to tell!
-
I'm wondering what people's experience has been with sizes of Excel files. I recently downloaded a workbook from Google Drive and was surprised to see it grow from 3.3 MB to over 8 MB after running an openxlsx2 script that basically just read the data from each sheet into a data frame and then wrote that data frame to a new workbook, with the data starting at row 4 (a sketch of the round trip follows below).
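For context, the script was essentially a round trip like this (a sketch; both file names are placeholders):

library(openxlsx2)

wb_in  <- wb_load("downloaded_workbook.xlsx")
wb_out <- wb_workbook()
for (sh in wb_get_sheet_names(wb_in)) {
  df <- wb_to_df(wb_in, sheet = sh)                        # read each sheet
  wb_out$add_worksheet(sh)$add_data(x = df, start_row = 4) # rewrite from row 4
}
wb_out$save("rewritten_workbook.xlsx")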
Saving the new file as a binary (xlsb) gave a file size of 2.2 MB, but then re-saving that file in XML (xlsx) format increased the size to over 10 MB.
To provide a shareable and simpler test case, I am attaching the single sheet with the most data: Size_issue_A_original_data.xlsx
Similar processing with this file gave sizes as follows:
As far as I can determine, there are no hidden rows/columns, no formatting, etc. However, in the sample file there are very large numbers of empty cells (no data). My working hypothesis is that the cells may have some unusual attributes set by SAS (the statistical package that was used to create the original xlsx file).
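One way to test the sparse-data side of that hypothesis independently of SAS is to write a dense and a mostly-missing sheet of the same dimensions and compare the resulting file sizes (a rough sketch; the dimensions are arbitrary):

library(openxlsx2)

dense  <- as.data.frame(matrix(runif(10000 * 20), nrow = 10000))
sparse <- dense
sparse[sparse > 0.01] <- NA   # roughly 99% of cells become missing

size_of <- function(x) {
  f <- tempfile(fileext = ".xlsx")
  wb_workbook()$add_worksheet()$add_data(x = x)$save(f)
  fs::file_size(f)
}
size_of(dense)
size_of(sparse)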
The size change from binary back to xlsx also seems surprising, although it may have nothing to do with openxlsx2.