
Feature request: Support for ODBC bulk insert and update operations #34

Closed · aryoda opened this issue Mar 6, 2017 · 8 comments

@aryoda (Contributor)
aryoda commented Mar 6, 2017

I know that the nanodbc lib supports bulk inserts and updates via batches of n rows that are sent to the server in one (or a minimum number of) round trip(s):

https://github.com/lexicalunit/nanodbc/issues/242

Could you support this in the odbc package too please?

A simple test with a prepared statement and parameter binding indicates that only a single row can be bound and sent, not a whole batch of rows at once (tested with MySQL version 5.5.53-MariaDB and an Ubuntu client with unixODBC + DBI_0.5-1 + odbc_1.0.1):

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), dsn="testDB", uid = "user", pwd = "password")

dbWriteTable(con, "iris", iris)      # create and populate a table

# Test bulk inserts
insert <- dbSendQuery(con, "INSERT INTO iris(`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) VALUES (?, ?, ?, ?, ?)")
dbBind(insert, list(1, 2, 3, 4, "blue"))   # works
dbBind(insert, list(1:2, 2:3, 3:4, 4:5, c("blue", "red")))  # Error: `params` elements can only be of length 1
dbClearResult(insert)

dbDisconnect(con)

Bulk inserts and updates are the last missing link for, e.g., Microsoft SQL Server, since Microsoft does not currently provide this for R even though R has been integrated into MS SQL Server 2016.

Thanks for this great new package!

@jimhester (Contributor)

dbWriteTable() does batch insertion; you can call it with append = TRUE to append to an existing table (a minimal sketch follows below). dbBind() currently fails explicitly to ensure DBI compatibility (r-dbi/DBItest#96), but perhaps the situation there has settled and we can relax that constraint.
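
A minimal sketch of the append workflow, assuming a DSN named "testDB" and that the iris table does not exist yet (all names are placeholders):

library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "testDB")         # placeholder DSN

dbWriteTable(con, "iris", head(iris))                  # create and populate the table
dbWriteTable(con, "iris", tail(iris), append = TRUE)   # batch-append further rows in one call

dbDisconnect(con)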

@jimhester (Contributor)

This should now work the way you were expecting:

library(DBI)
con <- dbConnect(odbc::odbc(), "MySQL")
dbWriteTable(con, "iris", head(iris))
insert <- dbSendQuery(con, "INSERT INTO iris(`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) VALUES (?, ?, ?, ?, ?)")
dbBind(insert, list(1, 2, 3, 4, "blue"))
dbBind(insert, list(1:2, 2:3, 3:4, 4:5, c("blue", "red")))
dbClearResult(insert)
dbReadTable(con, "iris")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
#> 7          1.0         2.0          3.0         4.0    blue
#> 8          1.0         2.0          3.0         4.0    blue
#> 9          2.0         3.0          4.0         5.0     red
dbDisconnect(con)

@aryoda (Contributor, Author)

aryoda commented Apr 24, 2017

@jimhester Many thanks for your work on my FR, and forgive me for answering so late (a mix of too much work and holidays).

I have now tested the implementation (see my results below).

Unfortunately I mixed up two terms with different meanings in my FR, "bulk" and "batch", and did not make clear enough that my intention was a fast insert of many rows, no matter how it is achieved.

Nevertheless the results are promising:

  1. I have tested the above code against three different databases, and it worked:
db.info <- dbGetInfo(con)
# MySQL:
# $db.version
# [1] "5.5.53-MariaDB"
#
# SQLite:
# $db.version
# [1] "3.7.16.2"
#
# MS SQL Server:
# $dbms.name
# [1] "Microsoft SQL Server"
# $db.version
# [1] "11.00.6579"
  2. I did an indicative performance test using the following code against a Microsoft SQL Server 2012 (on a virtual server machine in the local network), with 150,000 and 450,000 rows to be inserted.
bulk.iris <- iris[rep(1:NROW(iris), 1000), ]   # 1,000 repetitions generate 150,000 rows

system.time(dbWriteTable(con, "iris", bulk.iris))                 # 6.31 seconds for 150,000 rows, 17.99 for 450,000 rows
system.time(dbWriteTable(con, "iris", bulk.iris, append = TRUE))  # 5.96 seconds for 150,000 rows, 17.38 for 450,000 rows
insert <- dbSendQuery(con, 'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (?, ?, ?, ?, ?)')
system.time(dbBind(insert, bulk.iris))                            # 5.38 seconds for 150,000 rows, 17.24 for 450,000 rows
dbClearResult(insert)

Surprisingly, the performance was very good in all three cases, but it did not differ significantly between them!

So I started the DB tracer and captured the SQL server traffic with the following findings:

  1. All three methods (even dbSendQuery with dbBind and multiple rows) end up with the same SQL statements, sent to the SQL Server in the same order: basically one call to prepare the statement, one call per row to be inserted, and one call to finalize the prepared statement:
exec sp_describe_undeclared_parameters N'INSERT INTO "iris" ("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (@P1, @P2, @P3, @P4, @P5)'
declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 float,@P2 float,@P3 float,@P4 float,@P5 varchar(255)',N'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (@P1, @P2, @P3, @P4, @P5)',1
select @p1
exec sp_execute 1, 5.0999999999999996, 3.5, 1.3999999999999999, 0.20000000000000001, 'setosa'
exec sp_execute 1, 4.9000000000000004, 3, 1.3999999999999999, 0.20000000000000001, 'setosa'
exec sp_execute 1, 4.7000000000000002, 3.2000000000000002, 1.3, 0.20000000000000001, 'setosa'
...
exec sp_unprepare 1

-- For details on "sp_execute" see:
-- https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-execute-transact-sql
-- -> Executes a prepared Transact-SQL statement...

I then compared the performance with pure RODBC (without DBI/odbc):

library(RODBC)
con.String <- "Driver={SQL Server Native Client 11.0};Server=sqlserver1.domain.com\\default,1234;Trusted_Connection=yes;database=test"
con <- odbcDriverConnect(con.String)
system.time(RODBC::sqlSave(con, bulk.iris, tablename = "iris1", fast = TRUE))  # 233 seconds for 150,000 rows, 725 seconds for 450,000 rows
odbcClose(con)

The performance of pure RODBC was dramatically slower (reproducibly!), about 40 times slower in my (non-representative) test setup!

Looking at the captured SQL tracing data I could only discover one major difference:

The RODBC package does NOT run within a transaction (no TransactionID is set in the trace), so each insert (of every single row!) possibly uses a separate implicit transaction, while the DBI/odbc version runs within one and the same TransactionID!

This could explain the big performance difference (but I did not dig deeper).
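
One way to test this hypothesis would be to switch off autocommit in RODBC before sqlSave() and commit once at the end. A rough sketch, using RODBC's odbcSetAutoCommit() and odbcEndTran() with the connection string from above:

library(RODBC)

con <- odbcDriverConnect(con.String)          # same connection string as above
odbcSetAutoCommit(con, autoCommit = FALSE)    # one explicit transaction instead of per-row autocommit
system.time(sqlSave(con, bulk.iris, tablename = "iris2", fast = TRUE))
odbcEndTran(con, commit = TRUE)               # commit all inserted rows at once
odbcClose(con)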

On top of that, the client-side overhead of DBI/odbc must be lower than that of RODBC (as indicated by the benchmark in the odbc README file on GitHub).

Summary

  • Binding multiple rows to a prepared query works as expected (fast + saves an explicit loop)
  • DBI/odbc is significantly faster than RODBC with sqlSave
  • It doesn't matter which DBI/odbc function you use to insert rows; the performance is always similar
  • A real bulk insert (which inserts many rows at once while reducing network round trips and trigger, constraint and transaction overhead) could bring even more performance, but it remains unclear how this could be implemented in a database-agnostic way.

PS: For a fast update of many rows (e.g. in a data-warehousing scenario), one "best practice" is to insert the updated data into a new (temp or staging) table and merge-join it with the original data table, so I have ignored the update scenario in my performance tests (a rough sketch of this approach follows below).
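
A rough sketch of that staging-table approach on MS SQL Server; the staging table, the key column "id" and the updated.rows data frame are made up for illustration, and the MERGE is plain T-SQL sent via DBI::dbExecute():

dbWriteTable(con, "iris_staging", updated.rows, overwrite = TRUE)   # bulk-load the changed rows

dbExecute(con, '
  MERGE INTO iris AS target
  USING iris_staging AS source
    ON target.id = source.id                              -- assumes a key column "id"
  WHEN MATCHED THEN
    UPDATE SET target."Sepal.Length" = source."Sepal.Length",
               target."Species"      = source."Species"
  WHEN NOT MATCHED THEN
    INSERT ("Sepal.Length", "Species")
    VALUES (source."Sepal.Length", source."Species");
')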

@jimhester (Contributor)

dbWriteTable() is essentially just doing an INSERT INTO with dbBind() internally, so it is no surprise that the performance is the same. We are using the ODBC function SQLBindParameter() to do array binding, so the parameters should be sent in batches. I think the behavior you are seeing is driver-specific, and I am not sure there is anything we can do through the ODBC interface to improve it.

@jimhester (Contributor)

Also, to be clear, I am using the same nanodbc interface you link to; specifically, here is the code for binding integers.

@aryoda (Contributor, Author)

aryoda commented Apr 25, 2017

Thanks for these clarifications; I am absolutely happy with the performance (I didn't know this before I tested it). Another test with 1.25 million rows and about 50 columns was also very fast (2-3 minutes). If I have new ideas for a bulk insert I will write a new FR.

BTW: I get errors when inserting UTF-8 R strings into the database (this requires nvarchar columns on MS SQL Server). I will investigate that and open a bug report if required.

Update (May 20th, 2017): I have opened the issue #56 for the UTF-8 encoding problem...

@andrewsali

We have observed that increasing the batch_size parameter can have a dramatic impact on large inserts into Snowflake (it is currently hard-coded in the source as int batch_size = 1024;).

Setting it to around 10,000 speeds up inserts by a factor of 10 even for medium-sized datasets (100,000 rows).

Would it be possible to make this parameter user-configurable?
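
For illustration only, a hypothetical sketch of how such a knob could look from R; neither the batch_rows argument nor the odbc.batch_rows option exists in the package, they are just placeholders for whatever interface gets chosen:

# Hypothetical interface, not existing odbc API:
dbWriteTable(con, "snowflake_table", big.df, batch_rows = 10000)   # per-call override

options(odbc.batch_rows = 10000)   # or a package-wide default picked up internally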

@jimhester (Contributor)

@andrewsali if you are seeing that large a performance difference, we should probably expose the parameter. Can you open a new issue with the feature request?
