
Feature request: Support for ODBC bulk insert and update operations #34

Closed · aryoda opened this issue Mar 6, 2017 · 8 comments

@aryoda (Contributor)
aryoda commented Mar 6, 2017

I know that the nanodbc lib supports bulk inserts and updates via batches of n rows that are sent to the server in one (or a minimum number of) round trip(s):

https://github.com/lexicalunit/nanodbc/issues/242

Could you support this in the odbc package too please?

A simple test with a prepared statement and parameter binding indicates that only a single row can be bound and sent, not a whole batch of rows at once (tested with MySQL version 5.5.53-MariaDB and an Ubuntu client with unixODBC + DBI_0.5-1 + odbc_1.0.1):

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(), dsn="testDB", uid = "user", pwd = "password")

dbWriteTable(con, "iris", iris)      # create and populate a table

# Test bulk inserts
insert <- dbSendQuery(con, "INSERT INTO iris(`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) VALUES (?, ?, ?, ?, ?)")
dbBind(insert, list(1, 2, 3, 4, "blue"))   # works
dbBind(insert, list(1:2, 2:3, 3:4, 4:5, c("blue", "red")))  # Error: `params` elements can only be of length 1
dbClearResult(insert)

dbDisconnect(con)

Bulk inserts and updates are the last missing link for, e.g., Microsoft SQL Server, since Microsoft does not currently provide this for R even though R has been integrated into MS SQL Server 2016.

Thanks for this great new package!

@jimhester (Contributor)

dbWriteTable() does batch insertion; you can call it with append = TRUE to append to an existing table (a minimal sketch follows below). dbBind() currently fails explicitly to ensure DBI compatibility (r-dbi/DBItest#96), but perhaps the situation there has settled and we can relax that constraint.
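
A minimal sketch of the append workflow, assuming a DSN named "testDB" and that the iris table does not exist yet (all names are placeholders):

library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "testDB")         # placeholder DSN

dbWriteTable(con, "iris", head(iris))                  # create and populate the table
dbWriteTable(con, "iris", tail(iris), append = TRUE)   # batch-append further rows in one call

dbDisconnect(con)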

@jimhester (Contributor)

This should now work the way you were expecting:

library(DBI)
con <- dbConnect(odbc::odbc(), "MySQL")
dbWriteTable(con, "iris", head(iris))
insert <- dbSendQuery(con, "INSERT INTO iris(`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) VALUES (?, ?, ?, ?, ?)")
dbBind(insert, list(1, 2, 3, 4, "blue"))
dbBind(insert, list(1:2, 2:3, 3:4, 4:5, c("blue", "red")))
dbClearResult(insert)
dbReadTable(con, "iris")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
#> 7          1.0         2.0          3.0         4.0    blue
#> 8          1.0         2.0          3.0         4.0    blue
#> 9          2.0         3.0          4.0         5.0     red
dbDisconnect(con)

@aryoda (Contributor, Author)

aryoda commented Apr 24, 2017

@jimhester Many thanks for your work on my FR, and forgive me for answering so late (a mix of too much work and holidays).

I have now tested the implementation (see my results below).

Unfortunately I mixed up two terms with different meanings in my FR, "bulk" and "batch", and did not make clear enough that my intention was a fast insert of many rows, no matter how it is achieved.

Nevertheless the results are promising:

  1. I have tested the above code against three different databases, and it worked:
db.info <- dbGetInfo(con)
# MySQL:
# $db.version
# [1] "5.5.53-MariaDB"
#
# SQLite:
# $db.version
# [1] "3.7.16.2"
#
# MS SQL Server:
# $dbms.name
# [1] "Microsoft SQL Server"
# $db.version
# [1] "11.00.6579"
  2. I did an indicative performance test using the following code against a Microsoft SQL Server 2012 (on a virtual server machine in the local network), with 150,000 and 450,000 rows to be inserted.
bulk.iris <- iris[rep(1:NROW(iris), 1000), ]   # 1,000 repetitions generate 150,000 rows

system.time(dbWriteTable(con, "iris", bulk.iris))                 # 6.31 seconds for 150,000 rows, 17.99 for 450,000 rows
system.time(dbWriteTable(con, "iris", bulk.iris, append = TRUE))  # 5.96 seconds for 150,000 rows, 17.38 for 450,000 rows
insert <- dbSendQuery(con, 'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (?, ?, ?, ?, ?)')
system.time(dbBind(insert, bulk.iris))                            # 5.38 seconds for 150,000 rows, 17.24 for 450,000 rows
dbClearResult(insert)

Surprisingly, the performance was very good in all three cases, but it did not differ significantly between them!

So I started the DB tracer and captured the SQL server traffic with the following findings:

  1. All three methods (even dbSendQuery with dbBind and multiple rows) end up with the same SQL statements, sent to the SQL Server in the same order: basically one call to prepare the statement, one call per row to be inserted, and one call to finalize the prepared statement:
exec sp_describe_undeclared_parameters N'INSERT INTO "iris" ("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (@P1, @P2, @P3, @P4, @P5)'
declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 float,@P2 float,@P3 float,@P4 float,@P5 varchar(255)',N'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (@P1, @P2, @P3, @P4, @P5)',1
select @p1
exec sp_execute 1, 5.0999999999999996, 3.5, 1.3999999999999999, 0.20000000000000001, 'setosa'
exec sp_execute 1, 4.9000000000000004, 3, 1.3999999999999999, 0.20000000000000001, 'setosa'
exec sp_execute 1, 4.7000000000000002, 3.2000000000000002, 1.3, 0.20000000000000001, 'setosa'
...
exec sp_unprepare 1

-- For details on "sp_execute" see:
-- https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-execute-transact-sql
-- -> Executes a prepared Transact-SQL statement...

I then compared the performance with pure RODBC (without DBI/odbc):

library(RODBC)
con.String <- "Driver={SQL Server Native Client 11.0};Server=sqlserver1.domain.com\\default,1234;Trusted_Connection=yes;database=test"
con <- odbcDriverConnect(con.String)
system.time(RODBC::sqlSave(con, bulk.iris, tablename = "iris1", fast = TRUE))  # 233 seconds for 150,000 rows, 725 seconds for 450,000 rows
odbcClose(con)

The performance of pure RODBC was dramatically slower (reproducibly!), about 40 times slower in my (non-representative) test setup!

Looking at the captured SQL tracing data I could only discover one major difference:

The RODBC package does NOT run within a transaction (no TransactionID is set in the trace), so each insert (of every single row!) possibly uses a separate implicit transaction, while the DBI/odbc version runs within one and the same TransactionID!

This could explain the big performance difference (but I did not dig deeper).
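
One way to test this hypothesis would be to switch off autocommit in RODBC before sqlSave() and commit once at the end. A rough sketch, using RODBC's odbcSetAutoCommit() and odbcEndTran() with the connection string from above:

library(RODBC)

con <- odbcDriverConnect(con.String)          # same connection string as above
odbcSetAutoCommit(con, autoCommit = FALSE)    # one explicit transaction instead of per-row autocommit
system.time(sqlSave(con, bulk.iris, tablename = "iris2", fast = TRUE))
odbcEndTran(con, commit = TRUE)               # commit all inserted rows at once
odbcClose(con)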

On top of that, the client-side overhead of DBI/odbc must be lower than that of RODBC (as indicated by the benchmark in the odbc README file on GitHub).

Summary

  • Binding multiple rows to a prepared query works as expected (fast + saves an explicit loop)
  • DBI/odbc is significantly faster than RODBC with sqlSave
  • It doesn't matter which DBI/odbc function you use to insert rows; the performance is always similar
  • A real bulk insert (which inserts many rows at once while reducing network round trips and trigger, constraint and transaction overhead) could bring even more performance, but it remains unclear how this could be implemented in a database-agnostic way.

PS: For a fast update of many rows (e.g. in a data-warehousing scenario), one "best practice" is to insert the updated data into a new (temp or staging) table and merge-join it with the original data table, so I have ignored the update scenario in my performance tests (a rough sketch of this approach follows below).
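
A rough sketch of that staging-table approach on MS SQL Server; the staging table, the key column "id" and the updated.rows data frame are made up for illustration, and the MERGE is plain T-SQL sent via DBI::dbExecute():

dbWriteTable(con, "iris_staging", updated.rows, overwrite = TRUE)   # bulk-load the changed rows

dbExecute(con, '
  MERGE INTO iris AS target
  USING iris_staging AS source
    ON target.id = source.id                              -- assumes a key column "id"
  WHEN MATCHED THEN
    UPDATE SET target."Sepal.Length" = source."Sepal.Length",
               target."Species"      = source."Species"
  WHEN NOT MATCHED THEN
    INSERT ("Sepal.Length", "Species")
    VALUES (source."Sepal.Length", source."Species");
')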

@jimhester (Contributor)

dbWriteTable() is essentially just doing an INSERT INTO with dbBind() internally, so it is no surprise that the performance is the same. We are using the ODBC function SQLBindParameter() to do array binding, so the parameters should be sent in batches. I think the behavior you are seeing is driver-specific, and I am not sure there is anything we can do through the ODBC interface to improve it.

@jimhester (Contributor)

Also, to be clear, I am using the same nanodbc interface you link to; specifically, here is the code for binding integers.

@aryoda (Contributor, Author)

aryoda commented Apr 25, 2017

Thanks for these clarifications; I am absolutely happy with the performance (I didn't know this before I tested it). Another test with 1.25 million rows and about 50 columns was also very fast (2-3 minutes). If I have new ideas for a bulk insert I will write a new FR.

BTW: I get errors when inserting UTF-8 R strings into the database (this requires nvarchar columns on MS SQL Server). I will investigate that and open a bug report if required.

Update (May 20th, 2017): I have opened the issue #56 for the UTF-8 encoding problem...

@andrewsali

We have observed that increasing the batch_size parameter can have a dramatic impact on large inserts into Snowflake (it is currently hard-coded in the source as int batch_size = 1024;).

Setting it to around 10,000 speeds up inserts by a factor of 10 even for medium-sized datasets (100,000 rows).

Would it be possible to make this parameter user-configurable?
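
For illustration only, a hypothetical sketch of how such a knob could look from R; neither the batch_rows argument nor the odbc.batch_rows option exists in the package, they are just placeholders for whatever interface gets chosen:

# Hypothetical interface, not existing odbc API:
dbWriteTable(con, "snowflake_table", big.df, batch_rows = 10000)   # per-call override

options(odbc.batch_rows = 10000)   # or a package-wide default picked up internally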

@jimhester (Contributor)

@andrewsali if you are seeing that large a performance difference, we should probably expose the parameter. Can you open a new issue with the feature request?
