Feature request: Support for ODBC bulk insert and update operations #34
I know that the nanodbc lib supports bulk inserts and updates via batch sizes of n rows that are sent to the server in one (or a minimum number of) server round trip(s): https://github.com/lexicalunit/nanodbc/issues/242

Could you support this in the odbc package too, please? A simple test with a prepared statement and parameter binding indicates that only one row can be sent, not a whole batch of rows at once (tested with MySQL version 5.5.53-MariaDB and an Ubuntu client with unixODBC + DBI 0.5-1 + odbc 1.0.1).

Bulk inserts and updates are the last missing link for, e.g., Microsoft SQL Server, since Microsoft does not currently provide this for R even though R is integrated into MS SQL Server 2016.

Thanks for this great new package!

Comments
This should now be working how you were expecting:

library(DBI)
con <- dbConnect(odbc::odbc(), "MySQL")
dbWriteTable(con, "iris", head(iris))
insert <- dbSendQuery(con, "INSERT INTO iris(`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) VALUES (?, ?, ?, ?, ?)")
dbBind(insert, list(1, 2, 3, 4, "blue"))
dbBind(insert, list(1:2, 2:3, 3:4, 4:5, c("blue", "red")))
dbClearResult(insert)
dbReadTable(con, "iris")
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 1.0 2.0 3.0 4.0 blue
#> 8 1.0 2.0 3.0 4.0 blue
#> 9 2.0 3.0 4.0 5.0 red
dbDisconnect(con)
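Since a data frame is itself a list of column vectors, the same prepared statement can also be bound against many rows in a single call (the next comment relies on exactly this). A minimal sketch along the lines of the example above, not from the original thread; Species is a factor and is assumed to be bound as character:

con <- dbConnect(odbc::odbc(), "MySQL")
insert <- dbSendQuery(con, "INSERT INTO iris(`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) VALUES (?, ?, ?, ?, ?)")
dbBind(insert, iris)   # one column vector per parameter: all 150 rows in one bind
dbClearResult(insert)
dbDisconnect(con)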
@jimhester Many thanks for your work on my FR, and forgive me for answering so late (a mix of much work and holidays). I have now tested the implementation (see my results below). Unfortunately, I was mixing two terms in my FR that have different meanings, bulk and batch, and did not make it clear enough that my intention was a fast insert of many rows, no matter how. Nevertheless, the results are promising:
bulk.iris <- iris[rep(1:NROW(iris), 1000), ] # 1000 reps generate 150,000 rows
system.time(dbWriteTable(con, "iris", bulk.iris)) # 6.31 seconds for 150,000 rows, 17.99 for 450,000 rows
system.time(dbWriteTable(con, "iris", bulk.iris, append = TRUE)) # 5.96 seconds for 150,000 rows, 17.38 for 450,000 rows
insert <- dbSendQuery(con, 'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (?, ?, ?, ?, ?)')
system.time(dbBind(insert, bulk.iris)) # 5.38 seconds for 150,000 rows, 17.24 for 450,000 rows

Surprisingly, the performance was very good but did not differ significantly between the three approaches! So I started the DB tracer and captured the SQL Server traffic, with the following findings:
exec sp_describe_undeclared_parameters N'INSERT INTO "iris" ("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (@P1, @P2, @P3, @P4, @P5)'
declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 float,@P2 float,@P3 float,@P4 float,@P5 varchar(255)',N'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (@P1, @P2, @P3, @P4, @P5)',1
select @p1
exec sp_execute 1, 5.0999999999999996, 3.5, 1.3999999999999999, 0.20000000000000001, 'setosa'
exec sp_execute 1, 4.9000000000000004, 3, 1.3999999999999999, 0.20000000000000001, 'setosa'
exec sp_execute 1, 4.7000000000000002, 3.2000000000000002, 1.3, 0.20000000000000001, 'setosa'
...
exec sp_unprepare 1
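-- Mapping the trace back to the R calls (our reading of the trace above,
-- not documented behavior):
--   dbSendQuery()   -> sp_describe_undeclared_parameters + sp_prepare
--   dbBind()        -> one sp_execute per bound row, reusing handle @p1 = 1
--   dbClearResult() -> sp_unprepare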
-- For details on "sp_execute" see:
-- https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-execute-transact-sql
-- -> Executes a prepared Transact-SQL statement...

I then compared the performance with pure RODBC (without DBI/odbc):

library(RODBC)
con.String <- "Driver={SQL Server Native Client 11.0};Server=sqlserver1.domain.com\\default,1234;Trusted_Connection=yes;database=test"
con <- odbcDriverConnect(con.String)
system.time(RODBC::sqlSave(con, bulk.iris, tablename = "iris1", fast = TRUE)) # 233 seconds for 150,000 rows, 725 seconds for 450,000 rows
odbcClose(con)

The performance of pure RODBC was dramatically slower (reproducibly!), about 40 times slower in my (non-representative) test setup! Looking at the captured SQL tracing data, I could discover only one major difference: the RODBC package does NOT run within a transaction (no TransactionID set in the trace), so each insert (of each single row!) possibly uses a separate implicit transaction, while the DBI/odbc version runs within the same TransactionID. This could explain the big performance difference (but I did not dig deeper). Beyond that, the client-side overhead of DBI/odbc compared to RODBC must be lower (as indicated by the benchmark in the odbc README file on GitHub).
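One way to test the transaction hypothesis would be to run RODBC's row-by-row save inside a single explicit transaction; a minimal sketch (untested, reusing con.String and bulk.iris from above; iris2 is an arbitrary target table):

library(RODBC)
con <- odbcDriverConnect(con.String)
odbcSetAutoCommit(con, autoCommit = FALSE)  # one transaction instead of one implicit commit per row
system.time(RODBC::sqlSave(con, bulk.iris, tablename = "iris2", fast = TRUE))
odbcEndTran(con, commit = TRUE)
odbcClose(con)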
Summary

PS: For a fast update of many rows (e.g. in data warehousing), one "best practice" is to insert the updated data into a new (temp or staging) table and merge-join it with the original data table, so I have ignored the update scenario in my performance tests (a sketch of this staging/merge pattern follows)...
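A hedged sketch of that staging/merge pattern via DBI with T-SQL MERGE; the id key column, the iris_keyed target table, and the column list are all illustrative (iris itself has no key):

updated.iris <- transform(iris, id = seq_len(nrow(iris)))  # hypothetical keyed data
dbWriteTable(con, "iris_stage", updated.iris, overwrite = TRUE)  # fast bulk insert into staging
dbExecute(con, '
  MERGE INTO iris_keyed AS t
  USING iris_stage AS s
    ON t.id = s.id
  WHEN MATCHED THEN
    UPDATE SET t."Sepal.Length" = s."Sepal.Length", t."Species" = s."Species"
  WHEN NOT MATCHED THEN
    INSERT (id, "Sepal.Length", "Species")
    VALUES (s.id, s."Sepal.Length", s."Species");
')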
dbWriteTable() is essentially just doing the same prepared INSERT plus dbBind() shown above, which is why the timings are so close.
Also, to be clear, I am using the same nanodbc interface you link to; specifically, here is the code for binding integers.
Thanks for these clarifications; I am absolutely happy with the performance (I didn't know this before I tested it). Another test with 1.25 million rows and about 50 columns was also very fast (2-3 minutes). If I have new ideas for a bulk insert I will write a new FR. BTW: I get errors when inserting UTF-8 R strings into the database (requires ...). Update (May 20th, 2017): I have opened issue #56 for the UTF-8 encoding problem...
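For databases that store text in a non-UTF-8 character set, the encoding argument of odbc's dbConnect() declares the database encoding so strings are converted on the way in and out; a minimal sketch (whether this addresses the errors behind #56 depends on the driver and database configuration):

library(DBI)
# Assumes the database stores latin1; odbc then converts to/from UTF-8 on the R side.
con <- dbConnect(odbc::odbc(), "MySQL", encoding = "latin1")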
We have observed that increasing the batch_size parameter can have a dramatic impact on large inserts for Snowflake (line 107 in 24bf73b).
Setting it to around 10,000 speeds up inserts by a factor of 10 even for medium-sized datasets (100,000 rows). Would it be possible to make this parameter user-configurable?
@andrewsali, if you are seeing that large a performance difference, we should probably expose the parameter. Can you open a new issue with the feature request?
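Until such a parameter is exposed, one user-side workaround is to chunk the data and call dbBind() once per chunk; a sketch (batch_size and the chunking are our own, not an odbc parameter; con and bulk.iris as in the earlier comments):

batch_size <- 10000
insert <- dbSendQuery(con, 'INSERT INTO iris("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species") VALUES (?, ?, ?, ?, ?)')
# Split row indices into consecutive chunks of at most batch_size rows.
chunks <- split(seq_len(nrow(bulk.iris)), (seq_len(nrow(bulk.iris)) - 1) %/% batch_size)
for (idx in chunks) dbBind(insert, bulk.iris[idx, ])
dbClearResult(insert)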