readr will guess column types from the data if the user does not specify the types. The guess_max parameter controls how many rows of the input file are used to form these guesses. Ideally, the column types would be completely obvious from the first non-header row and we could use guess_max = 1. That would be very efficient! But the situation is rarely so clear-cut.
By default, readr consults 1000 rows when type-guessing, i.e. guess_max = 1000. Note that readr never consults rows that won’t be part of the import, so the actual default is guess_max = min(1000, n_max).
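For example, if you only import the first 500 rows, at most those 500 rows can inform the guesses. A minimal sketch, assuming a placeholder file path:

read_csv("path/to/your/file", n_max = 500)  # effective default: guess_max = min(1000, 500)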
Sometimes you want to convey “use all of the data we’re going to import to guess the column types”, often without even knowing or specifying n_max. How should you say that? It’s also worth discussing the possible downsides of such a request.
library(readr)
readr got a new parsing engine in version 2.0.0 (released July 2021). In this so-called second edition, readr calls vroom::vroom() by default. The vroom package, and therefore the second edition of readr, supports a very natural expression of “use all the data to guess”: namely guess_max = Inf.
read_csv("path/to/your/file", ..., guess_max = Inf)
Why isn’t this the default? Why not do this all the time? Because column type guessing basically adds another pass through the data, in addition to the main parsing. If you routinely use guess_max = Inf, you’re effectively processing every file twice, in its entirety. If you only work with small files, this is fine. But for larger files, this can be very costly, for relatively little benefit. Often the column types guessed from a subset of the file are “good enough”.
Note also that guess_max = n, for finite n, works better in the second edition parser. Due to its different design, vroom is able to sub-sample n rows throughout the file, and it always includes the last row, whereas earlier versions of readr just consulted the first n rows. In practice, the result is that the default of guess_max = min(1000, n_max) produces better guessed column types than it used to. It should feel less necessary to fiddle with guess_max now.
As always, remember that the best strategy is to provide explicit column types as any data analysis project matures past the exploratory phase.
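For a file where, say, x should be read as double and y as character, an explicit specification might look like this. A minimal sketch, assuming a placeholder path; cols(), col_double(), and col_character() are readr’s standard column specification helpers:

read_csv(
  "path/to/your/file",
  col_types = cols(x = col_double(), y = col_character())
)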
The parsing engine in readr versions prior to 2.0.0 is now called the first edition. If you’re using readr >= 2.0.0, you can still access first edition parsing via the functions with_edition() and local_edition(). And, obviously, if you’re using readr < 2.0.0, you will get first edition parsing, by definition, because that’s all there is.
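For example, local_edition() can pin first edition parsing for the remainder of a function body. A minimal sketch; read_csv_first_edition() is a made-up wrapper, not part of readr:

read_csv_first_edition <- function(path, ...) {
  local_edition(1)    # first edition parsing until this function exits
  read_csv(path, ...)
}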
The first edition parser doesn’t have a perfect way to convey “use all of the data to guess the column types”. (This is one of several reasons to prefer readr >= 2.0.0.)
Let’s set up a slightly tricky file, so we can demonstrate different
approaches. The column x
is mostly empty, but has some
numeric data at the very end, in row 1001.
tricky_dat <- tibble::tibble(
  x = rep(c("", "2"), c(1000, 1)),
  y = "y"
)
tfile <- tempfile("tricky-column-type-guessing-", fileext = ".csv")
write_csv(tricky_dat, tfile)
First, note that the second edition parser guesses the right type for x, even with the default guess_max behaviour.
tail(read_csv(tfile))
#> Rows: 1001 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): y
#> dbl (1): x
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 6 × 2
#> x y
#> <dbl> <chr>
#> 1 NA y
#> 2 NA y
#> 3 NA y
#> 4 NA y
#> 5 NA y
#> 6 2 y
In contrast, the first edition parser doesn’t guess the right type for x with the guess_max default. x is imported as logical and the 2 becomes an NA.
with_edition(
  1,
  tail(read_csv(tfile))
)
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> x = col_logical(),
#> y = col_character()
#> )
#> Warning: 1 parsing failure.
#> row col expected actual file
#> 1001 x 1/0/T/F/TRUE/FALSE 2 '/tmp/RtmpbSXTvb/tricky-column-type-guessing-17f3dad05ee4.csv'
#> # A tibble: 6 × 2
#> x y
#> <lgl> <chr>
#> 1 NA y
#> 2 NA y
#> 3 NA y
#> 4 NA y
#> 5 NA y
#> 6 NA y
There are three ways to proceed, each of which has some downside:

1. Specify guess_max = Inf, just like we do for the second edition parser.
Since readr does not know how much data it will be processing, the
first edition engine pre-allocates a large amount of memory in the face
of this uncertainty. This means that reading with
guess_max = Inf
can be extremely slow and might even crash
your R session.
with_edition(
1,
tail(read_csv(tfile, guess_max = Inf))
)
#> Warning: `guess_max` is a very large value, setting to `21474836` to avoid
#> exhausting memory
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> # A tibble: 6 × 2
#> x y
#> <dbl> <chr>
#> 1 NA y
#> 2 NA y
#> 3 NA y
#> 4 NA y
#> 5 NA y
#> 6 2 y
2. Specify an actual, non-infinite value for guess_max.
This is an awkward suggestion, because if you knew how many rows
there were, we wouldn’t be having this conversation in the first place.
But sometimes you have a decent estimate and can choose a value of
guess_max
that is “big enough”. This usually results in
much better performance than guess_max = Inf.
with_edition(
1,
tail(read_csv(tfile, guess_max = 1200))
)
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
#> # A tibble: 6 × 2
#> x y
#> <dbl> <chr>
#> 1 NA y
#> 2 NA y
#> 3 NA y
#> 4 NA y
#> 5 NA y
#> 6 2 y
3. Read all columns as character, then use type_convert(). This is a bit clunky, since it obligates you to do some post-processing once you’ve brought your data into R.
dat_chr <- with_edition(
  1,
  read_csv(tfile, col_types = cols(.default = col_character()))
)
tail(dat_chr)
#> # A tibble: 6 × 2
#> x y
#> <chr> <chr>
#> 1 <NA> y
#> 2 <NA> y
#> 3 <NA> y
#> 4 <NA> y
#> 5 <NA> y
#> 6 2 y
dat <- type_convert(dat_chr)
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> x = col_double(),
#> y = col_character()
#> )
tail(dat)
#> # A tibble: 6 × 2
#> x y
#> <dbl> <chr>
#> 1 NA y
#> 2 NA y
#> 3 NA y
#> 4 NA y
#> 5 NA y
#> 6 2 y
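Finally, note that for this particular file all of the guessing gymnastics disappear once we state the column types explicitly, which works the same way in either edition. A minimal sketch, reusing the tricky file from above (output not shown):

with_edition(
  1,
  tail(read_csv(tfile, col_types = cols(x = col_double(), y = col_character())))
)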
Clean up the temporary tricky csv file.
file.remove(tfile)
#> [1] TRUE