Welcome to the basic manual to openxlsx2
. In this manual
you will learn how to use openxlsx2
to import data from
xlsx-files to R as well as how to export data from R to xlsx, and how to
import and modify these openxml workbooks in R. This package is based on
the work of many contributors to openxlsx
. It was mostly
rewritten using pugixml
and R6
making use of
modern technology, providing a fresh and easy to use R package.
Over the years many people have worked on the tricky task to handle
xls and xlsx files. Notably openxlsx
, but there are
countless other R-packages as well as third party libraries or
calculation software capable of handling such files. Please feel free to
use and test your files with other software and or let us know about
your experience. Open an issue on github or write us a mail.
Coming from openxlsx
you might know about
read.xlsx()
(two functions, one for files and one for
workbooks) and readWorkbook()
. Functions that do different
things, but mostly the same. In openxlsx2
we tried our best
to reduce the complexity under the hood and for the user as well. In
openxlsx2
they are replaced with read_xlsx()
,
wb_read()
and they share the same underlying function
wb_to_df()
.
For this example we will use example data provided by the package. You can locate it in our “inst/extdata” folder. The files are included with the package source and you can open them in any calculation software as well.
We begin with the readTest.xlsx
file by telling R where
to find this file on our system
<- system.file("extdata", "readTest.xlsx", package = "openxlsx2") xlsxFile
The object contains a path to the xlsx file and we pass this file to our function to read the workbook into R
# import workbook
wb_to_df(xlsxFile)
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 NA 1 a 2015-02-07 3209324 This #DIV/0!
#> 3 TRUE NA NA #NUM! b 2015-02-06 <NA> <NA>
#> 4 TRUE 2 NA 1.34 c 2015-02-05 <NA> #NUM!
#> 5 FALSE 2 NA <NA> #NUM! <NA> <NA> <NA>
#> 6 FALSE 3 NA 1.56 e <NA> <NA> <NA>
#> 7 FALSE 1 NA 1.7 f 2015-02-02 <NA> <NA>
#> 8 NA NA NA <NA> <NA> 2015-02-01 <NA> <NA>
#> 9 FALSE 2 NA 23 h 2015-01-31 <NA> <NA>
#> 10 FALSE 3 NA 67.3 i 2015-01-30 <NA> <NA>
#> 11 NA 1 NA 123 <NA> 2015-01-29 <NA> <NA>
The output is created as a data frame and contains data types date,
logical, numeric and character. The function to import the file to R,
wb_to_df()
provides similar options as the
openxlsx
functions read.xlsx()
and
readWorkbook()
and a few new functions we will go through
the options. As you might have noticed, we return the column of the xlsx
file as the row name of the data frame returned. Per default the first
sheet in the workbook is imported. If you want to switch this, either
provide the sheet
parameter with the correct index or
provide the sheet name.
In the previous example the first imported row was used as column name for the data frame. This is the default behavior, but not always wanted or expected. Therefore this behavior can be disabled by the user.
# do not convert first row to colNames
wb_to_df(xlsxFile, colNames = FALSE)
#> A B C D E F G H
#> 1 NA Var2 NA Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 NA 1 a 2015-02-07 3209324 This #DIV/0!
#> 3 TRUE <NA> NA #NUM! b 2015-02-06 <NA> <NA>
#> 4 TRUE 2 NA 1.34 c 2015-02-05 <NA> #NUM!
#> 5 FALSE 2 NA <NA> #NUM! <NA> <NA> <NA>
#> 6 FALSE 3 NA 1.56 e <NA> <NA> <NA>
#> 7 FALSE 1 NA 1.7 f 2015-02-02 <NA> <NA>
#> 8 NA <NA> NA <NA> <NA> 2015-02-01 <NA> <NA>
#> 9 FALSE 2 NA 23 h 2015-01-31 <NA> <NA>
#> 10 FALSE 3 NA 67.3 i 2015-01-30 <NA> <NA>
#> 11 NA 1 NA 123 <NA> 2015-01-29 <NA> <NA>
The creators of the openxml standard are well known for mistakenly
treating something as a date and openxlsx2
has built in
ways to identify a cell as a date and will try to convert the value for
you, but unfortunately this is not always a trivial task and might fail.
In such a case we provide an option to disable the date conversion
entirely. In this case the underlying numerical value will be
returned.
# do not try to identify dates in the data
wb_to_df(xlsxFile, detectDates = FALSE)
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 NA 1 a 42042 3209324 This #DIV/0!
#> 3 TRUE NA NA #NUM! b 42041 <NA> <NA>
#> 4 TRUE 2 NA 1.34 c 42040 <NA> #NUM!
#> 5 FALSE 2 NA <NA> #NUM! NA <NA> <NA>
#> 6 FALSE 3 NA 1.56 e NA <NA> <NA>
#> 7 FALSE 1 NA 1.7 f 42037 <NA> <NA>
#> 8 NA NA NA <NA> <NA> 42036 <NA> <NA>
#> 9 FALSE 2 NA 23 h 42035 <NA> <NA>
#> 10 FALSE 3 NA 67.3 i 42034 <NA> <NA>
#> 11 NA 1 NA 123 <NA> 42033 <NA> <NA>
Sometimes things might feel off. This can be because the openxml
files are not updating formula results in the sheets unless they are
opened in software that provides such functionality as certain tabular
calculation software. Therefore the user might be interested in the
underlying functions to see what is going on in the sheet. Using
showFormula
this is possible
# return the underlying Excel formula instead of their values
wb_to_df(xlsxFile, showFormula = TRUE)
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 NA 1 a 2015-02-07 "3209324" & " This" 1/0
#> 3 TRUE NA NA #NUM! b 2015-02-06 <NA> <NA>
#> 4 TRUE 2 NA 1.34 c 2015-02-05 <NA> #NUM!
#> 5 FALSE 2 NA <NA> #NUM! <NA> <NA> <NA>
#> 6 FALSE 3 NA 1.56 e <NA> <NA> <NA>
#> 7 FALSE 1 NA 1.7 f 2015-02-02 <NA> <NA>
#> 8 NA NA NA <NA> <NA> 2015-02-01 <NA> <NA>
#> 9 FALSE 2 NA 23 h 2015-01-31 <NA> <NA>
#> 10 FALSE 3 NA 67.3 i 2015-01-30 <NA> <NA>
#> 11 NA 1 NA 123 <NA> 2015-01-29 <NA> <NA>
Sometimes the entire worksheet contains to much data, in such case we provide functions to read only a selected dimension range. Such a range consists of either a specific cell like “A1” or a cell range in the notion used in the openxml standard
# read dimension withot colNames
wb_to_df(xlsxFile, dims = "A2:C5", colNames = FALSE)
#> A B C
#> 2 TRUE 1 NA
#> 3 TRUE NA NA
#> 4 TRUE 2 NA
#> 5 FALSE 2 NA
If you do not want to read a specific cell, but a cell range you can use the column attribute. This attribute takes a numeric vector as argument
# read selected cols
wb_to_df(xlsxFile, cols = c(1:2, 7))
#> Var1 Var2 Var6
#> 2 TRUE 1 3209324 This
#> 3 TRUE NA <NA>
#> 4 TRUE 2 <NA>
#> 5 FALSE 2 <NA>
#> 6 FALSE 3 <NA>
#> 7 FALSE 1 <NA>
#> 8 NA NA <NA>
#> 9 FALSE 2 <NA>
#> 10 FALSE 3 <NA>
#> 11 NA 1 <NA>
The same goes with rows. You can select them using numeric vectors
# read selected rows
wb_to_df(xlsxFile, rows = c(1, 4, 6))
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7
#> 4 TRUE 2 NA 1.34 c 2015-02-05 NA #NUM!
#> 6 FALSE 3 NA 1.56 e <NA> NA <NA>
In xml exists no difference between value types. All values are per
default characters. To provide these as numerics, logicals or dates,
openxlsx2
and every other software dealing with xlsx files
has to make assumptions about the cell type. This is especially tricky
due to the notion of worksheets. Unlike in a data frame, a worksheet can
have a wild mix of all types of data. Even though the conversion process
from character to date or numeric is rather solid, sometimes the user
might want to see the data without any conversion applied. This might be
useful in cases where something unexpected happened or the import
created warnings. In such a case you can look at the raw input data. If
you want to disable date detection as well, please see the entry
above.
# convert characters to numerics and date (logical too?)
wb_to_df(xlsxFile, convert = FALSE)
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 <NA> 1 a 2015-02-07 3209324 This #DIV/0!
#> 3 TRUE <NA> <NA> #NUM! b 2015-02-06 <NA> <NA>
#> 4 TRUE 2 <NA> 1.34 c 2015-02-05 <NA> #NUM!
#> 5 FALSE 2 <NA> <NA> #NUM! <NA> <NA> <NA>
#> 6 FALSE 3 <NA> 1.56 e <NA> <NA> <NA>
#> 7 FALSE 1 <NA> 1.7 f 2015-02-02 <NA> <NA>
#> 8 <NA> <NA> <NA> <NA> <NA> 2015-02-01 <NA> <NA>
#> 9 FALSE 2 <NA> 23 h 2015-01-31 <NA> <NA>
#> 10 FALSE 3 <NA> 67.3 i 2015-01-30 <NA> <NA>
#> 11 <NA> 1 <NA> 123 <NA> 2015-01-29 <NA> <NA>
Even though openxlsx2
imports everything as requested,
sometimes it might be helpful to remove empty lines from the data. These
might be either left empty intentional or empty because they are were
formatted, but the cell value was removed afterwards. This was added
mostly for backward comparability, but the default has been changed to
FALSE
. The behavior has changed a bit as well. Previously
empty cells were removed prior to the conversion to R data frames, now
they are removed after the conversion and are removed only if they are
completely empty
# erase empty Rows from dataset
wb_to_df(xlsxFile, sheet = 3, skipEmptyRows = TRUE) |> head()
#> Date value word bool wordZ2
#> 7 2014-04-28 0.8390764 N-U-B-R-A FALSE FALSE-Z
#> 8 2014-04-27 0.8863800 N-Z-P-S-Y TRUE TRUE-Z
#> 9 2014-04-26 0.5741314 C-G-D-X-H TRUE TRUE-Z
#> 10 2014-04-25 0.1366065 <NA> FALSE FALSE-Z
#> 11 2014-04-24 0.3692582 B-K-A-O-W TRUE TRUE-Z
#> 12 2014-04-23 NA H-P-G-O-K TRUE TRUE-Z
The same for columns
# erase empty Cols from dataset
wb_to_df(xlsxFile, skipEmptyCols = TRUE)
#> Var1 Var2 Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 1 a 2015-02-07 3209324 This #DIV/0!
#> 3 TRUE NA #NUM! b 2015-02-06 <NA> <NA>
#> 4 TRUE 2 1.34 c 2015-02-05 <NA> #NUM!
#> 5 FALSE 2 <NA> #NUM! <NA> <NA> <NA>
#> 6 FALSE 3 1.56 e <NA> <NA> <NA>
#> 7 FALSE 1 1.7 f 2015-02-02 <NA> <NA>
#> 8 NA NA <NA> <NA> 2015-02-01 <NA> <NA>
#> 9 FALSE 2 23 h 2015-01-31 <NA> <NA>
#> 10 FALSE 3 67.3 i 2015-01-30 <NA> <NA>
#> 11 NA 1 123 <NA> 2015-01-29 <NA> <NA>
Sometimes the data source might provide rownames as well. In such a
case you can openxlsx2
to treat the first column as
rowname
# convert first row to rownames
wb_to_df(xlsxFile, sheet = 3, dims = "C6:G9", rowNames = TRUE)
#> value word bool wordZ2
#> 2014-04-28 0.8390764 N-U-B-R-A FALSE FALSE-Z
#> 2014-04-27 0.8863800 N-Z-P-S-Y TRUE TRUE-Z
#> 2014-04-26 0.5741314 C-G-D-X-H TRUE TRUE-Z
If the user know better than the software what type to expect in a
worksheet, this can be provided via types. This parameter takes a named
numeric. 0
is character, 1
is numeric and
2
is date
# define type of the data.frame
wb_to_df(xlsxFile, cols = c(1, 4), types = c("Var1" = 0, "Var3" = 1))
#> Var1 Var3
#> 2 TRUE 1.00
#> 3 TRUE NaN
#> 4 TRUE 1.34
#> 5 FALSE NA
#> 6 FALSE 1.56
#> 7 FALSE 1.70
#> 8 <NA> NA
#> 9 FALSE 23.00
#> 10 FALSE 67.30
#> 11 <NA> 123.00
Often the creator of the worksheet has used a lot of creativity and
the data does not begin in the first row, instead it begins somewhere
else. To define the row where to begin reading, define it via the
startRow
parameter
# start in row 5
wb_to_df(xlsxFile, startRow = 5, colNames = FALSE)
#> A B C D E F G H
#> 5 FALSE 2 NA NA #NUM! <NA> NA NA
#> 6 FALSE 3 NA 1.56 e <NA> NA NA
#> 7 FALSE 1 NA 1.70 f 2015-02-02 NA NA
#> 8 NA NA NA NA <NA> 2015-02-01 NA NA
#> 9 FALSE 2 NA 23.00 h 2015-01-31 NA NA
#> 10 FALSE 3 NA 67.30 i 2015-01-30 NA NA
#> 11 NA 1 NA 123.00 <NA> 2015-01-29 NA NA
There is the “#N/A” string, but often the user will be faced with
custom missing values and other values we are not interested. Such
strings can be passed as character vector via
na.strings
# na string
wb_to_df(xlsxFile, na.strings = "")
#> Var1 Var2 NA Var3 Var4 Var5 Var6 Var7
#> 2 TRUE 1 NA 1 a 2015-02-07 3209324 This #DIV/0!
#> 3 TRUE NA NA #NUM! b 2015-02-06 <NA> #N/A
#> 4 TRUE 2 NA 1.34 c 2015-02-05 <NA> #NUM!
#> 5 FALSE 2 NA <NA> #NUM! <NA> <NA> <NA>
#> 6 FALSE 3 NA 1.56 e <NA> <NA> <NA>
#> 7 FALSE 1 NA 1.7 f 2015-02-02 <NA> <NA>
#> 8 NA NA NA <NA> <NA> 2015-02-01 <NA> <NA>
#> 9 FALSE 2 NA 23 h 2015-01-31 <NA> <NA>
#> 10 FALSE 3 NA 67.3 i 2015-01-30 <NA> <NA>
#> 11 NA 1 NA 123 <NA> 2015-01-29 <NA> <NA>
In addition to importing directly from xlsx or xlsm files,
openxlsx2
provides the wbWorkbook
class used
for importing and modifying entire the openxml files in R
.
This workbook
class is the heart of openxlsx2
and probably the reason why you are reading this manual in the first
place.
Importing a file into a workbook looks like this:
# the file we are going to load
<- system.file("extdata", "loadExample.xlsx", package = "openxlsx2")
xlsxFile # loading the file into the workbook
<- wb_load(file = xlsxFile) wb
The additional options wb_load()
provides are for
internal use: sheet
loads only a selected sheet from the
workbook and data_only
reads only the data parts from a
workbook and ignores any additional graphics or pivot tables. Both
functions create workbook objects that can only be used to read data,
and we do not recommend end users to use them. Especially not if they
intend to re-export the workbook afterwards.
Once a workbook is imported, we provide several functions to interact
with and modify it (the wb_to_df()
function mentioned above
works the same way for an imported workbook). It is possible to add new
sheets and remove sheets, as well as to add or remove data. R-plots can
be inserted and also the style of the workbook can be changed, new
fonts, background colors and number formats. There is a wealth of
options explained in the man pages and the additional style vignette
(more vignettes to follow).
If you want to export a data frame from R, you can use
write_xlsx()
which will create an xlsx file. This file can
be tweaked further. The man page provides various options (further
explanation and examples will follow).
write_xlsx(mtcars, "mtcars.xlsx")
wbWorkbooks
Imported workbooks can be saved as xlsx or xlsm files with the
wrapper wb_save()
or with wb$save()
. Both
functions take the filename and an optional Overwrite
option. If the latter is set, an optional guard is provided to check if
the file you want to write already exists. But be careful, this is
optional. The default is to save the file and replace an existing file.
Of course, in Windows, files that are locked (for example, if they were
opened by another process) will not be replaced.
# replace the existing file
$save("mtcars.xlsx")
wb
# do not overwrite the exisisting file
try(wb$save("mtcars.xlsx", overwrite = FALSE))