Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization of trivial INSERT SELECT significantly slows down some queries #36792

Closed
aristeidismoustakas opened this issue Apr 29, 2022 · 5 comments · Fixed by #37047
Closed

Comments

@aristeidismoustakas
Copy link

The creation of table becomes too slow when I use CTEs within the query.

ClickHouse server version 22.4.3

How to reproduce
Trying to create a table based on the result of a query, I take very bad performance when I use a CTE:

--creation of table t and insertion of 10000000000 records
--exec time: ~260 second
drop table if exists t;
create table t(a Int64) engine = MergeTree() order by a;
insert into t SELECT * FROM system.numbers LIMIT 10000000000;


--scenario 1 
--exec time: ~200 seconds
drop table if exists t2;
create table t2 engine = MergeTree() order by c
as (

with

cte1 as (
  SELECT * FROM t WHERE modulo(a,2)=1
)

SELECT count(*) as c FROM cte1 

);
--end of scenario 1


--scenario 2 
--exec time: ~10 seconds
drop table if exists t3;
create table t3 engine = MergeTree() order by c
as (

  SELECT count(*) as c FROM t WHERE modulo(a,2)=1

);
--end of scenario 2

Although the two scenarios do exactly the same thing, with the difference that in the first scenario we use an intermediate CTE, there is a huge difference in the execution time.

The issue becomes even stranger if you execute the following queries and see that the execution time is the same, that in generally makes sense.

--query from scenario 1 
--exec time: ~9 seconds
with

cte1 as (
  SELECT * FROM t WHERE modulo(a,2)=1
)

SELECT count(*) as c FROM t WHERE modulo(a,2)=1


--query from scenario 2
--exec time: ~9 seconds
SELECT count(*) as c FROM t WHERE modulo(a,2)=1

Obviously, there is a performance issue when someone wants to CREATE TABLE AS and uses CTEs in the query.

Does anyone know why clickhouse behaves in that strange way?

@tavplubix
Copy link
Member

tavplubix commented May 2, 2022

The difference is in number of threads: https://gist.github.com/tavplubix/b2c9a7f101ccc7dbe21a844348d7757f
Actually it's related to INSERT ... SELECT query, not to CREATE ... SELECT. ClickHouse tries to apply "trivial insert select optimization" (#12195) to the query from scenario 1 and number of threads is determined incorrectly in this case:

if (is_trivial_insert_select)
{
/** When doing trivial INSERT INTO ... SELECT ... FROM table,
* don't need to process SELECT with more than max_insert_threads
* and it's reasonable to set block size for SELECT to the desired block size for INSERT
* to avoid unnecessary squashing.
*/
Settings new_settings = getContext()->getSettings();
new_settings.max_threads = std::max<UInt64>(1, settings.max_insert_threads);

But default value of max_insert_threads is 0, which means "choose automatically".

However, it's not clear why the first query with CTE is "trivial" and the second one is not...

@tavplubix
Copy link
Member

However, it's not clear why the first query with CTE is "trivial" and the second one is not...

It's because isTrivialSelect does not take into account CTEs and implicit GROUP BY:

dell9510 :) insert into t2 WITH cte1 AS
                (
                    SELECT *
                    FROM t
                    WHERE (a % 2) = 1
                )
            SELECT count(*) AS c
            FROM cte1 group by 1

Ok.

0 rows in set. Elapsed: 9.222 sec. Processed 1.00 billion rows, 8.00 GB (108.44 million rows/s., 867.49 MB/s.)

dell9510 :) insert into t2 WITH cte1 AS
                (
                    SELECT *
                    FROM t
                    WHERE (a % 2) = 1
                )
            SELECT count(*) AS c
            FROM cte1

Ok.

0 rows in set. Elapsed: 39.420 sec. Processed 1.00 billion rows, 8.00 GB (25.37 million rows/s., 202.94 MB/s.)

cc: @ucasfl

@tavplubix
Copy link
Member

Simpler example:

dell9510 :) insert into t2  SELECT sum(a) as c FROM t

0 rows in set. Elapsed: 29.498 sec. Processed 1.00 billion rows, 8.00 GB (33.90 million rows/s., 271.20 MB/s.)

dell9510 :) insert into t2  SELECT sum(a) as c FROM t group by 1

0 rows in set. Elapsed: 5.763 sec. Processed 1.00 billion rows, 8.00 GB (173.53 million rows/s., 1.39 GB/s.)

@tavplubix tavplubix changed the title CREATE TABLE AS takes too long when I use CTEs within the query Optimization of trivial INSERT SELECT significantly slows down some queries May 2, 2022
@ucasfl
Copy link
Collaborator

ucasfl commented May 6, 2022

70ae043

Hi, why this be reverted? @KochetovNicolai

@tavplubix
Copy link
Member

It was not reverted, it was never merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants