Surprising increases in file size, possibly related to origin in SAS or being manipulated in Google Sheets #1110
-
As I pondered this issue, I realized that there were at least three more test files worth creating. They are:
#=============================================================
library(openxlsx2) # Required for manipulating rows and columns of Excel
#' Sets the working directory to the folder where the script, the template
#' and the variable list reside. The new file will also be created here.
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
options(openxlsx2.na.strings = "") # Makes an empty string the default for missing data.
#=============================================================
#' First, specify the list of files to update.
target <- c("Size_issue_A_original_as_csv.csv")
#=============================================================
#' Define the function that writes the data starting at row 4 (header row)
wb_move_data <- function(df) {
  wb_new <- wb_workbook()
  wb_new$add_worksheet("Data")
  wb_new$add_data(x = df, sheet = "Data", start_row = 4, start_col = 1,
                  col_names = TRUE)
  wb_new$set_sheetview(sheet = "Data", top_left_cell = "A1", zoom_scale = 100)
  return(wb_new)
}
#=============================================================
#' Process the target dataset (a single file here; wrap the lines below in a loop for several files)
path_for_updated_wb <- paste0("Moved_rows_",target)
df <- read.csv(target)
#' Call function to move and reformat
target_wb <- wb_move_data(df)
print(paste("Completed processing the wb: ", path_for_updated_wb))
# Before saving set selected and active sheet to first in workbook
target_wb$set_selected(sheet = 1) # Necessary to avoid multiple sheets being selected.
target_wb$set_bookview(active_tab = 0, first_sheet = 0)
# Save the target workbook
wb_save(wb = target_wb, file = path_for_updated_wb)
#' End of script
-
I ran the script as:

fl <- "Size_issue_A_original_data.xlsx"
wb <- wb_load(fl)
new_wb <- wb_workbook()$add_worksheet()$add_data(
  x = df,
  dims = wb_dims(x = df, from_row = 4),
  na.strings = NULL
)
fs::file_size(tp)
if (interactive()) xl_open(tp)

But the reported size remains quite large, 5.64 MB. However, if I save the temporary xlsx file, the size drops to 1.2 MB, close to, but still larger than, your value of 971 KB. Please let me know if there are settings or options I might test to help with further diagnostics. You also mentioned the possibility of SAS exporting data in multiple flavors of Excel. I'd offer to test this, but I stopped using SAS when the University of Florida, my employer, began requiring all users to use a cloud version.
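For anyone reproducing this: the snippet above elides the definitions of df and tp. A self-contained sketch, assuming df is the sheet read back from the loaded workbook and tp is a temporary output path (both assumptions, not part of the quoted code):

library(openxlsx2)

fl <- "Size_issue_A_original_data.xlsx"
wb <- wb_load(fl)
df <- wb_to_df(wb)                  # assumed: df comes from the loaded workbook
tp <- tempfile(fileext = ".xlsx")   # assumed: tp is a temporary output path

new_wb <- wb_workbook()$add_worksheet()$add_data(
  x = df,
  dims = wb_dims(x = df, from_row = 4),
  na.strings = NULL
)
new_wb$save(tp)   # the on-disk size is what the numbers above compare
fs::file_size(tp)
if (interactive()) xl_open(tp)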
-
I installed the package (after learning a bit about installing and compiling) and can confirm that I now get the 971 KB file size that you obtained.

Yes, storage is so cheap that large files are no longer as big an issue. We do work with people at remote locations with sub-optimal Internet access, however, so I like to avoid exchanging unnecessarily large files.
We are trying to avoid using Excel as a database. Our current work involves finding a middle ground for data capture among multiple research groups. One clear need is a way to provide quality assurance tools that not only run some basic tests for correct values, but also test whether identifiers are used correctly and whether all variables are defined. Our hope is that users can run simple R scripts to check their data before submitting it to a repository, where it might be loaded into a database or used for training AI. So far, R and openxlsx2 have worked very well, plus or minus a few issues like the ones you have helped us resolve.
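A minimal sketch of the kind of check described, in case it helps others; every file, column, and pattern name below is hypothetical:

library(openxlsx2)

df <- wb_to_df(wb_load("submission.xlsx"))        # hypothetical submission file
codebook <- c("plot_id", "species", "height_cm")  # hypothetical variable list

# 1. Every column must be defined in the codebook
undefined <- setdiff(names(df), codebook)
if (length(undefined) > 0)
  warning("Undefined variables: ", toString(undefined))

# 2. Identifiers must be unique and well-formed
if (anyDuplicated(df$plot_id) > 0)
  warning("Duplicate identifiers found")
bad_ids <- df$plot_id[!grepl("^P[0-9]{4}$", df$plot_id)]
if (length(bad_ids) > 0)
  warning("Malformed identifiers: ", toString(unique(bad_ids)))

# 3. Basic range test on a measured value
if (any(df$height_cm < 0, na.rm = TRUE))
  warning("Negative heights found")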
No future plans to use SAS. I've been a happy SAS user since 1977, when I used punch cards and my programs and data were transmitted from Cali, Colombia to Bogotá over a modem and run on the only IBM mainframe with SAS in the country (with DANE, the national office for statistics).
On Sun, Aug 18, 2024 at 11:06 AM Jan Marvin Garbuszus wrote:
You'd have to install the package from the development branch as linked above. But obviously, there might be hidden issues I'm currently not aware of. Therefore you're a bit of a guinea pig if you want to test this.

Meanwhile, this confirms my suspicion that I should advise people to avoid sparse matrices of data. [...]
Hm, dunno, any file < 100 MB is nothing I'm losing sleep over. But generally speaking you're obviously correct. A large file containing mostly missings probably isn't the best idea. What I generally advise people is the obvious: do not use xlsx files as a database. With openxlsx2 there are known memory limitations, and unless it is absolutely mandatory to export lots of data to xlsx - because management wants a report - in most cases exporting via some other binary data format is the way to go, and you can even use external data sources as the data reference in pivot tables. Especially since, as you mention, there are many quirks in the Office Open XML format that are not always obvious.
If you still use SAS and have a working directory where you can access your sas7bdat files, it's often much quicker to simply import these files with readsas instead of exporting them to some xlsx file.
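A short sketch of the two alternatives mentioned; file names are placeholders, and the SAS-import lines are commented out since they need a real sas7bdat file (check the readsas docs for its exact interface):

df <- data.frame(id = 1:3, value = c(2.5, NA, 7.1))  # toy data

# Keep working data in a binary format; export to xlsx only for reports
saveRDS(df, "working_data.rds")
df2 <- readRDS("working_data.rds")

# Import a SAS dataset directly instead of round-tripping through xlsx:
# readsas::read.sas("mydata.sas7bdat")  # the package mentioned above
# haven::read_sas("mydata.sas7bdat")    # a widely used alternative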
-
Thanks for your support on these issues, even while you are out for a well-earned walk. I will continue with the GitHub pre-release version of openxlsx2 and let you know if anything strange happens.

Best regards!
On Sun, Aug 18, 2024 at 2:33 PM Jan Marvin Garbuszus wrote:
You definitely have your reasons and I'm happy to help where I can. I just have to make sure that it works as expected. A slim file size but a broken file would be a bad tradeoff. Unfortunately this requires testing with many different files and spreadsheet programs, and the openxlsx2 QA department is lazy and goes for a walk instead of testing ;)

You definitely must have many interesting stories to tell!
-
I'm wondering what people's experience has been with sizes of Excel files. I recently downloaded a workbook from Google Drive and was surprised to see it grow from 3.3 MB to over 8 MB after running an openxlsx2 script that basically just read the data from each sheet into a data frame and then wrote that data frame to a new workbook, with the data starting at row 4 (a sketch of the round trip follows below).
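For context, the script was essentially a round trip like this (a sketch; both file names are placeholders):

library(openxlsx2)

wb_in  <- wb_load("downloaded_workbook.xlsx")
wb_out <- wb_workbook()
for (sh in wb_get_sheet_names(wb_in)) {
  df <- wb_to_df(wb_in, sheet = sh)                        # read each sheet
  wb_out$add_worksheet(sh)$add_data(x = df, start_row = 4) # rewrite from row 4
}
wb_out$save("rewritten_workbook.xlsx")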
Saving the new file as a binary (xlsb) gave a file size of 2.2 MB, but then re-saving that file in XML (xlsx) format increased the size to over 10 MB.
To provide a shareable and simpler test case, I am attaching the single sheet with the most data: Size_issue_A_original_data.xlsx
Similar processing with this file gave sizes as follows:
As far as I can determine, there are no hidden rows/columns, no formatting, etc. However, in the sample file there are very large numbers of empty cells (no data). My working hypothesis is that the cells may have some unusual attributes set by SAS (the statistical package that was used to create the original xlsx file).
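One way to test the sparse-data side of that hypothesis independently of SAS is to write a dense and a mostly-missing sheet of the same dimensions and compare the resulting file sizes (a rough sketch; the dimensions are arbitrary):

library(openxlsx2)

dense  <- as.data.frame(matrix(runif(10000 * 20), nrow = 10000))
sparse <- dense
sparse[sparse > 0.01] <- NA   # roughly 99% of cells become missing

size_of <- function(x) {
  f <- tempfile(fileext = ".xlsx")
  wb_workbook()$add_worksheet()$add_data(x = x)$save(f)
  fs::file_size(f)
}
size_of(dense)
size_of(sparse)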
The size change from binary back to xlsx also seems surprising, although it may have nothing to do with openxlsx2.