How fast does a compressed file in Part 2

rtip
benchmark
arrow
duckdb
datatable
readr
Author

Steven P. Sanderson II, MPH

Published

March 28, 2023

Introduction

Yesterday I posted on performing a benchmark on reading in a compressed .csv.gz file of a 2,000 by 2,000 data.frame. It was brought to my attention by someone on Mastadon (@mariviere@fediscience.org - https://fediscience.org/@mariviere) that I should also use {duckdb} and {arrow} so I will perform the same analysis as yesterday but I will also add in the two aforementioned packages.

Function

The functions that we will be using this time around are as follows:

  • fread() - data.table
  • duckdb_read_csv() - duckdb
  • read_csv() - readr
  • read.csv() - base
  • read.table() - base
  • vroom() with altrep = FALSE - vroom
  • vroom() with altrep = TRUE - vroom
  • oprn_csv_dataset() - arrow

Example

Make the Data

Let’s make that dataset again:

library(R.utils)

# create a 1000 x 1000 matrix of random numbers
df <- matrix(rnorm(2000000), nrow = 2000, ncol = 2000) |>
  as.data.frame()

# Make and save gzipped file
write.csv(df, "df.csv")
gzip(
  filename = "df.csv", 
  destname = "df.csv.gz",
  overwrite = FALSE, 
  remove = TRUE
)

Benchmarking

Time to benchmark

library(rbenchmark)
library(data.table)
library(readr)
library(duckdb)
library(arrow)
library(vroom)
library(dplyr)
library(DBI)

n <- 30

benchmark(
  # Base R
  "read.table" = {
    a <- read.table(
      "df.csv.gz", 
      sep = ",", 
      colClasses = list(numeric = 1:2000)
    )
  },
  "read.csv" = {
    b <- read.csv(
      "df.csv.gz", 
      sep = ",", 
      colClasses = list(numeric = 1:2000)
    )
  },
  
  # data.table
  "fread" = {
    c <- fread(
      "df.csv.gz", 
      sep = ",", 
      colClasses = list(numeric = 1:2000)
    )
  },
  
  # vroom
  "vroom alltrep false" = {
    d <- vroom("df.csv.gz", delim = ",", col_types = "d")
  },
  "vroom alltrep true" = {
    e <- vroom("df.csv.gz", delim = ",", altrep = TRUE, col_types = "d")
  },
  
  # readr
  "readr" = {
    f <- read_csv("df.csv.gz", col_types = "d")
  },
  
  # Arrow
  "arrow" = {
    g <- open_csv_dataset("df.csv.gz")
  },
  
  # DuckDB
  "duckdb" = {
    con <- dbConnect(duckdb())
    h <- duckdb_read_csv(
      conn = con,
      name = "df",
      files = "C:\\Users\\ssanders\\Documents\\GitHub\\steveondata\\posts\\rtip-2023-03-28\\df.csv.gz"
    )
    dbDisconnect(con)
  },
  
  # Replications
  replications = n,
  
  # Columns
  columns = c(
    "test","replications","elapsed","relative","user.self","sys.self")
) |>
  arrange(relative)
                 test replications elapsed relative user.self sys.self
1               arrow           30    3.01    1.000      5.04     0.25
2               fread           30   28.28    9.395     19.56     4.30
3 vroom alltrep false           30   31.89   10.595     26.25    10.75
4  vroom alltrep true           30   33.72   11.203     25.75    10.67
5              duckdb           30   94.09   31.259     90.70     2.77
6               readr           30   98.28   32.651    113.05    45.12
7          read.table           30  109.97   36.535    107.78     1.24
8            read.csv           30  153.79   51.093    152.44     0.56

Important note is the session info on the pc I am using to write this:

sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.1.1       vroom_1.6.1       arrow_11.0.0.3    duckdb_0.7.1-1   
 [5] DBI_1.1.3         readr_2.1.4       data.table_1.14.8 rbenchmark_1.0.0 
 [9] R.utils_2.12.2    R.oo_1.25.0       R.methodsS3_1.8.2

loaded via a namespace (and not attached):
 [1] pillar_1.9.0      compiler_4.2.3    tools_4.2.3       digest_0.6.31    
 [5] bit_4.0.5         jsonlite_1.8.4    evaluate_0.20     lifecycle_1.0.3  
 [9] tibble_3.2.1      pkgconfig_2.0.3   rlang_1.1.0       cli_3.6.1        
[13] rstudioapi_0.14   parallel_4.2.3    yaml_2.3.7        xfun_0.38        
[17] fastmap_1.1.1     knitr_1.42        generics_0.1.3    vctrs_0.6.1      
[21] htmlwidgets_1.6.2 hms_1.1.3         bit64_4.0.5       tidyselect_1.2.0 
[25] glue_1.6.2        R6_2.5.1          fansi_1.0.4       rmarkdown_2.21   
[29] tzdb_0.3.0        purrr_1.0.1       magrittr_2.0.3    htmltools_0.5.5  
[33] assertthat_0.2.1  utf8_1.2.3        crayon_1.5.2     
 Sys.info() |> 
   as.data.frame() |> 
   tibble::rownames_to_column() |> 
   as_tibble() |> 
   slice(1,2,3,5)
# A tibble: 4 × 2
  rowname `Sys.info()`
  <chr>   <chr>       
1 sysname Windows     
2 release 10 x64      
3 version build 19045 
4 machine x86-64      
 memory.profile() |>
   as.data.frame()
            memory.profile()
NULL                       1
symbol                 24303
pairlist              642504
closure                11189
environment             4009
promise                22963
language              189766
special                   47
builtin                  701
char                 2039073
logical                18866
integer               108132
double                 20060
complex                    5
character             160381
...                       21
any                        0
list                   58500
expression                 5
bytecode               41555
externalptr            12382
weakref                13860
raw                    10113
S4                      1362
 gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  3363479 179.7    5830931 311.5  5830931 311.5
Vcells 32950395 251.4   81254422 620.0 81254324 620.0