Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new ADO.NET batching API #650

Closed
bgrainger opened this issue Jun 11, 2019 · 30 comments
Closed

Implement new ADO.NET batching API #650

bgrainger opened this issue Jun 11, 2019 · 30 comments

Comments

@bgrainger
Copy link
Member

bgrainger commented Jun 11, 2019

From comment:

MySqlBatch shipped in 0.57.0; keeping this issue open to track implementing the abstract base classes and virtual methods.


The new API is defined in https://github.com/dotnet/corefx/issues/35135.

It seems like the general approach should be to concatenate all SQL from all commands in the batch into one string, substituting parameter values as we go. (That is, this doesn't provide any advantage over what the user can manually do already, but ensures that MySqlDbBatch isn't needlessly inefficient for database-independent code.)

Special attention will need to be taken if the user is mixing CommandType.Text and StoredProcedure in one batch, particularly if the stored procedure has out parameters.

  • Do any MySQL Servers support batched prepared (binary) statements, or will the SQL still need to be parsed and split into individual statements if Prepare is called?

A WIP PR is available in #637.

@bgrainger
Copy link
Member Author

Do any MySQL Servers support batched prepared (binary) statements, or will the SQL still need to be parsed and split into individual statements if Prepare is called?

Tested MySQL 8.0.16 and MariaDB 10.4.5 and neither supports it. (They both return an ERR packet with the message "You have an error in your SQL syntax".) Moreover, it seems extremely unlikely to ever be supported based on the protocol definition: COM_STMT_PREPARE_OK contains just one num_columns field. Preparing multiple commands would imply sending multiple result sets, each of which might have a different number of columns; a single num_columns field can't represent this.

Alternatively, perhaps the commands would have to be prepared individually (once), then could be executed (multiple times) as a batch. However, COM_STMT_EXECUTE doesn't permit this, because it has a single stmt-id field.

Thus, for prepared statements it appears that they need to be prepared individually, then executed individually.

@bgrainger
Copy link
Member Author

MariaDB invented a COM_MULTI protocol extension, which looks as if it could be used to support batching, including for prepared statements.

@roji
Copy link

roji commented Jun 17, 2019

That's great! Do you think you're going to be able to attempt an implementation over this anytime soon?

@bgrainger
Copy link
Member Author

I'd like to, but I suspect it will need a fairly significant overhaul of the current command execution and MySqlDataReader processing subsystem. (I'm glad I didn't start yet, because the discovery of COM_MULTI is changing how I will approach it.)

@roji
Copy link

roji commented Jun 18, 2019

That makes sense, batching isn't just a feature you add easily. FWIW the combination of batching and prepared binary statements could be a big step forward for perf, especially if one day you do decide to persist prepared statements. Although the real question is what kind perf increase preparation delivers on MySQL (on PG it's quite decisive).

@bgrainger
Copy link
Member Author

Simple prepared commands are already supported and persisted. I also added it to TFB benchmarks: TechEmpower/FrameworkBenchmarks#4015

However, although local profiling of MySQL Server shows that 17% of statement execution time is spent in SQL parsing (and preparing statements should theoretically eliminate this), there was no appreciable throughput increase in TFB. (And I don't think that any of the top MySQL entrants are using prepared commands to achieve their speed.)

@roji
Copy link

roji commented Jun 18, 2019

OK, thanks for the info - I'm guessing more investigation/experimentation would be needed to understand exactly what's going on. It would be quite strange if prepared statements were supported by MySQL but had no (noticeable) perf advantages.

@bgrainger
Copy link
Member Author

I've got a simplistic version of batching (using MariaDB's COM_MULTI) working now in #637.

Some benchmark code and results are at https://gist.github.com/bgrainger/0504a52065eeade9e4b30e88b6363f2c.

The DbBatch API hasn't caused any problems so far, although I haven't implemented CommandTimeout yet, which (from experience) will be complicated.

I'm not confident that I can implement DbException.BatchCommand in MySqlConnector. If a problem occurs, the MySQL Server will just put an ERR packet on the wire and I can't think of a good way to map it back to the DbBatchCommand object that caused the error.

CC @roji

@roji
Copy link

roji commented Jun 19, 2019

@bgrainger great to see this work! If I'm understanding correctly, at this point your batching implementation only does concatenation batching via text mode, right? And PreparedCommandsBatch would be the one actually batching in binary mode without any concatenation?

Timeouts are indeed notoriously tricky, but I'm curious if batching introduces any specific difficulties that you don't encounter with regular non-batched command execution.

Re DbException.BatchCommand, it's expected that not all providers will support all features. Just for comparison, PostgreSQL doesn't actually have a batch-specific protocol message: multiple Execute messages are simply pipelined one after the other. This means Npgsql receives the results of earlier Execute messages, and only then receives an error, allowing it to know which statement triggered the failure. PostgreSQL does have a useful protocol sync feature: after all Execute messages, Npgsql inserts a Sync messages, which tells PostgreSQL that in case of error all subsequent messages are discarded until that one.

In any case, am looking forward to seeing the results for PreparedCommandsBatch! And obviously post back if you have any feedback on the batching API shape.

@bgrainger
Copy link
Member Author

If I'm understanding correctly, at this point your batching implementation only does concatenation batching via text mode, right?

No; for MariaDB, MySqlBatch will send all the queries (as separate commands, not concatenated SQL) in one "batch" packet. (This is just for the text protocol, no prepared statements.)

I haven't yet written the code to concatenate SQL for a batch sent to MySQL Server, but that is my planned implementation for non-MariaDB servers.

And PreparedCommandsBatch would be the one actually batching in binary mode without any concatenation?

Yes, that's the plan.

BTW, MariaDB allows multiple kinds of commands to be batched; for example a "prepare statement" and "execute statement" can be grouped and executed in one batch. There's also a (separate) bulk insert command that allows one prepared insert statement to be executed with multiple sets of values. I don't see an easy way to express either of these with the DbBatch API so I might not end up supporting them through DbBatch.

@roji
Copy link

roji commented Jun 19, 2019

There's also a (separate) bulk insert command that allows one prepared insert statement to be executed with multiple sets of values. I don't see an easy way to express either of these with the DbBatch API so I might not end up supporting them through DbBatch.

@divega and I discussed this at some point, calling it "SIMD". In fact, IIRC JDBC's batching API (for prepared statements) allows only this - no way to send heterogeneous statements in a single roundtrip.

While "SIMD mode" may be a relatively common usage for batching, assuming prepared statements exist, there doesn't seem to be much extra value in it - one can simply execute the same prepared statement N times with different parameters within a single batch. How exactly that's exposed in a .NET API would be somewhat provider-specific; my expectation is that calling Prepare() on a batch would detect identical SQL and do the right thing.

@bgrainger
Copy link
Member Author

@roji Prepared Batched Commands are now implemented. Results updated here: https://gist.github.com/bgrainger/0504a52065eeade9e4b30e88b6363f2c

As expected, there is no benefit for MySQL Server, because its protocol requires that prepared statements be sent individually. But for MariaDB, all "Batch" methods are a performance win.

Even for MySQL Server, "BatchCommands" is faster than "Commands" because the driver (safely) concatenates multiple commands' statements together into as long a string as possible, and sends it all at once. (This is also new since the last update.)

@roji
Copy link

roji commented Aug 8, 2019

Thanks for the update @bgrainger, great to hear and even better to see those perf numbers! You're ahead of me, I haven't yet finished implementing this in Npgsql.

Unfortunately the batching API didn't make it into .NET Core 3.0 or .NET Standard 2.1 - but the API will definitely go in. My plan for Npgsql is to expose the API as provider-specific, and when it makes it into .NET Core/Standard to simply multitarget and override for that TFM. You can do the same - I think the API shape itself is done and there's very little chance we'll make any further changes.

/cc @divega @ajcvickers @Wraith2

@bgrainger
Copy link
Member Author

Yes, I saw that it had been omitted from .NET Core 3; I too am planning to ship MySqlBatch as a new public (but possibly still "experimental") API, then use a lot of #ifdefs when a TFM supports it. (Similar to DbColumn for .NET 4.5.)

@bgrainger
Copy link
Member Author

MySqlBatch shipped in 0.57.0; keeping this issue open to track implementing the abstract base classes and virtual methods.

@Wraith2
Copy link

Wraith2 commented Sep 24, 2019

Neat, any measured performance advantages?

It doesn't look like anything new is going into corefx SqlClient. Everything will go into Microsoft.Data.SqlClient. To start that process i ported over the internal RPC changes that make batch RPC easier/possible in dotnet/SqlClient#209 and once that's integrated i can look at integrating the batching work on top of it. The addition of always encrypted makes thing a little more complicated and has re-opened my consideration of a single-query-multiple-parameters mode

@roji
Copy link

roji commented Sep 24, 2019

@Wraith2 I think the best next step would be to set up a BenchmarkDotNet project in Microsoft.Data.SqlClient with some minimal infrastructure and proceed to measuring how your implementation fairs.

For MySQL perf results, scroll up and take a look at the comments.

@roji
Copy link

roji commented Sep 24, 2019

FYI I probably have some unrelated work for the next few weeks, but after that the plan is definitely to re-tackle bringing in the batching API into corefx. If we can have clear benchmarks from MySQL, SqlClient and Npgsql that would be absolutely great.

@bgrainger
Copy link
Member Author

@Wraith2 Yes, see https://gist.github.com/bgrainger/0504a52065eeade9e4b30e88b6363f2c for benchmarks.

(Note that for MySQL, there's no appreciable difference of batching over manually concatenating all the SQL and executing it with ExecuteNonQuery or ExecuteReader, but I expected that going in.)

@Wraith2
Copy link

Wraith2 commented Sep 24, 2019

@Wraith2 I think the best next step would be to set up a BenchmarkDotNet project in Microsoft.Data.SqlClient with some minimal infrastructure and proceed to measuring how your implementation fairs.

Agreed, as always i'm at the mercy of the sql team on the microsoft side. implementing batching without the rpc rework would be much more complicated and less performant. The RPC rework clashed with the introduction of AlwaysEncrypted and verifying that it's correct will need some careful review.

If you also want batching in corefx SqlClient i've still got that branch but with the component being marked as "Archived component - limited churn/contributions" i assumed it wouldn't be going in.

@roji
Copy link

roji commented Sep 24, 2019

Yeah, we should definitely forget about System.Data.SqlClient - anything new at this point will be going into Microsoft.Data.SqlClient.

Do you have an issue tracking the RPC rework?

@Wraith2
Copy link

Wraith2 commented Sep 24, 2019

Just the PR linked above. You're right, i should probably raise issues.

@roji
Copy link

roji commented Sep 24, 2019

Yeah, that could be helpful. Everyone is now in post-release mode, but things will soon probably quiet down and hopefully the SqlClient team will have the bandwidth to start merging this work.

@bgrainger
Copy link
Member Author

MariaDB has opened an issue to remove their COM_MULTI support: https://jira.mariadb.org/browse/MDEV-21612

They suggest using pipelining instead, but that can break other MySQL servers (e.g. Aurora: #486 (comment)). A new extended server capability feature flag (to indicate that pipelining is supported by the server), or a connection string option that forces pipelining on, might let MySqlConnector use pipelining when it's safe and would provide a speedup.

CC @roji

@roji
Copy link

roji commented Feb 4, 2020

Interesting... If I understand correctly, pipelining (unlike COM_MULTI) means not (necessarily?) coalescing multiple commands into a single TCP packet? If so then it could be quite significant, at least in the high-perf world of TechEmpower. I'm just now working on coalescing more commands into a single packet in Npgsql, so if there's any interest in competing at that level of performance I'd think twice about removing this feature. Let me know if you think it's worth it for me to write some thoughts on that issue.

Regardless, assuming COM_MULTI goes away, would you be implementing the ADO.NET batching API over pipelining? Or allowing pipelining without use of the batchin API (or both)?

@bgrainger
Copy link
Member Author

I think pipelining could be a viable approach to batching for non-COM_MULTI servers. It might need to be opt-in, since it's not an "official" part of the MySQL protocol and some servers or proxies don't support it.

RE: TechEmpower, the latest "ruling" I saw was that pipelining was allowed as long as it didn't require modifications to application code (https://www.techempower.com/blog/2018/10/30/framework-benchmarks-round-17/). I'm not sure if this can be implemented transparently for MySQL (I don't understand how it currently works for Postgres). But this seems like it might preclude the use of DbBatch in the ADO.NET benchmarks.

@roji
Copy link

roji commented Feb 5, 2020

Yeah, explicit usage of a batching API is definitely disallowed in TechEmpower - though that doesn't preclude the batching API from being implemented over pipelining. However, note that batching typically also implies some semantics as well, and not just a perf improvement, e.g. an earlier failing statement causes later ones to be skipped (could be done via transactions too).

I think that when people say pipelining, they usually mean that you can simply execute a new command on a connection before the previous one has completed (and its results consumed). Npgsql definitely doesn't support this, but I'm currently working on exactly that (I'm actually doing multiplexing, which would combine multiple commands from multiple producers/threads into the same physical connection, which is a step further. Some people call this pipelining I think, there's some terminological confusion...). The former version, as I understand it, would indeed require changes to application code: the application would execute a command, and then execute another without waiting for the first to complete; in that sense it's not far from an explicit batching API.

I still think there's a lot of perf value in reducing the number of packets (and system calls), and am working in this direction. Batching is one way to go in this direction (with something like COM_MULTI), but multiplexing definitely goes further.

@roji
Copy link

roji commented Aug 13, 2021

Great to see this @bgrainger. If you ran into any weirdness/difficulties with the API, don't hesitate to report - there's still time to amend it in ADO.NET.

For my understanding, does the implementation make use of the (removed) COM_MULTI, or with JDBC-style protocol pipelining (as per #946)? Or is that work still to be done at some point?

@bgrainger
Copy link
Member Author

If you ran into any weirdness/difficulties with the API

It seemed that the behaviour of ExecuteReader(CommandBehavior) wasn't fully specified in dotnet/runtime#28633 (perhaps since that was a late change). I ended up just reusing the existing code for my implementation, and so specifying CommandBehavior.SingleRow returns just the first row from the first command in the batch that returns a result set, ignoring all others. This seems congruent with the documentation ("The query is expected to return a single row of the first result set.") so I'm happy with it for now, but I don't know what user expectations will be.

The implementation does still make use of COM_MULTI; switching to pipelining is still to be done.

@roji
Copy link

roji commented Aug 13, 2021

Thanks @bgrainger. Yes, SingleRow is indeed expected to behave as described; in general, CommandBehavior is expecetd to behave in the same way as it already does when a semicolon-batching command is executed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants