-
What is your question? After the script is submitted, there is no error in the output of the command console.
-
There are a lot of possibilities as to why it is running slower, and we'll need more information to help. Some initial questions:
The results shown in that image are from running the TpcxbbLikeSpark queries on two DGX-2 machines with the input and intermediate storage systems on fast NVMe drives. Without sufficiently fast I/O the query will become I/O bound before the GPU is fully utilized. Some of the queries are only the ETL portions of the original TPCx-BB query (e.g., query 5 also includes logistic regression, which is not included in the ETL-only version).

The Tuning Guide has tips on tuning the RAPIDS Accelerator. One item notably missing from the set of configs above is pinned memory. Having at least some pinned memory (e.g., between 2g and 8g) will significantly increase performance. You can also try reducing the shuffle partitions and apply other tips discussed in the tuning guide.
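For reference, a minimal sketch of how those two settings could be supplied, assuming the RAPIDS Accelerator jars are already on the classpath. The values here (4g of pinned memory, 48 shuffle partitions) are placeholders to tune for your own cluster, not recommendations from this thread, and in practice these configs are usually passed to spark-submit with --conf rather than set in code:

```scala
// Sketch only: illustrative config values for the tuning tips mentioned above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rapids-tuning-sketch")
  // Enable the RAPIDS Accelerator plugin.
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // Pinned host memory pool; somewhere between 2g and 8g is suggested above.
  .config("spark.rapids.memory.pinnedPool.size", "4g")
  // Fewer shuffle partitions than Spark's default of 200 often helps GPU queries.
  .config("spark.sql.shuffle.partitions", "48")
  .getOrCreate()
```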
-
I have a question: do different data types have a big impact on performance, like double vs. decimal? To test TPC-DS, when I generated the TPC-DS data set I set useDoubleForDecimal=true. If decimal types are supported in the future, will performance be improved?
-
Yes, it can have a very significant impact on performance. The RAPIDS Accelerator currently does not support Spark's decimal type, so operations on decimal columns will not be GPU-accelerated.
Yes, operations that need to deal directly with decimal values should benefit once decimal types are supported on the GPU. Note that if you are already removing all decimals from the inputs (e.g., via useDoubleForDecimal=true), adding decimal support would not change the performance of those runs.
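As an illustration of "removing all decimals from the inputs", here is a hedged sketch (not from the thread) that casts any DecimalType columns of a DataFrame to DoubleType before the dataset is written, similar in spirit to what useDoubleForDecimal=true does at data-generation time:

```scala
// Hypothetical helper: replace every decimal column with a double column.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

def decimalsToDoubles(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case _: DecimalType => acc.withColumn(field.name, col(field.name).cast(DoubleType))
      case _              => acc
    }
  }
```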
-
Hi, we have a Spark cluster composed of three nodes. 36 concurrent CPU cores were utilized in the CPU-only run (by setting --total-executor-cores=36 and --conf spark.task.cpus=2).
-
That implies the concurrency of your cluster is actually only 18 tasks at a time instead of 36 since you're specifying each task requires 2 CPU cores.
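For clarity, the arithmetic behind that statement, using the values from the submit flags quoted above:

```scala
// Spark schedules one task per spark.task.cpus cores, so with these settings:
val totalExecutorCores = 36  // --total-executor-cores=36
val cpusPerTask        = 2   // --conf spark.task.cpus=2
val concurrentTasks    = totalExecutorCores / cpusPerTask  // = 18 tasks running at a time
```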
This is a particularly small dataset, probably too small to be effective on GPUs. GPUs are not well suited for very small amounts of data. Note that the scale factor refers to the approximate size of the entire data set, not the amount of data that will be processed by any one query against that dataset. Often queries will hit only a small fraction of that dataset, and the first thing they'll do from there is filter the data down even further before it gets to significant processing like groupby aggregates or joins. I would recommend trying this with a 100G dataset or larger.
One of the first things the TPCx-BB benchmark does is perform a database load of the CSV data into Parquet, ORC, or some other columnar format that the queries are then run against. The problem with using CSV as your main dataset to query is that you'll likely be mostly I/O bound, because CSV forces the entire table data to be loaded even if the query only wants to see a few columns from the table. Columnar formats such as Parquet or ORC enable loading only the data associated with the columns being accessed by the query, drastically lowering the I/O requirements for a typical query. That places more of the performance of the query in the computation rather than I/O, which is where the GPU can shine. I recommend transcoding the data from CSV to Parquet before running the query. Note that the GPU can often write Parquet data much faster than the CPU, so I wouldn't be surprised if you see a nice speedup relative to the CPU just during the transcoding from CSV to Parquet (given a non-trivial amount of data to transcode). If you're already using the …
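For the transcoding step, a minimal sketch assuming CSV files with a header row; the paths and the inferSchema option are placeholders, not the benchmark's own loader:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Read the CSV table and rewrite it as Parquet; with the RAPIDS Accelerator enabled,
// both the read and the Parquet write can be accelerated on the GPU.
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/tpcxbb/csv/store_sales")
  .write
  .mode("overwrite")
  .parquet("/data/tpcxbb/parquet/store_sales")
```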
When using the Spark built-in shuffle, shuffle compression will still be performed by the CPU. Having fewer cores available to the GPU query than the CPU query can hurt performance as a result. Given a sufficient speedup in a query you can get away with running significantly fewer total cores in the system than the CPU version, but when using Spark's built-in shuffle it can be harmful if the query has a significant amount of data to shuffle (and thus process through the shuffle compression codec).
There could be a number of reasons. Is your driver running with sufficient resources (e.g., has at least a couple of free CPU cores dedicated to it, is not garbage collecting due to insufficient heap size, etc.)? It may be related to the relative speed at which stages are being executed as well. Also make sure you enable a pinned memory pool as I mentioned earlier. It can have a significant effect on performance.
-
@jlowe Yes, I have enabled the pinned memory pool. I followed your instructions to convert CSV to Parquet, but the running time was longer than before. The resources in the cluster are sufficient.
-
Just to be clear, @YeahNew, the timings you are seeing in your queries now start at Parquet, right? I.e., the csvToParquet function isn't being included in the timings.
I see Jason was wondering about the 100GB dataset. Are you still using 2GB in your case? It would be helpful to know what was tested since last time.
Interesting, the broadcasts you mention may be showing up in odd places with smaller datasets. This seems like something we need to investigate on our end (e.g., run with the same settings you did and see if we can reproduce).
-
@abellina, thank you for your help. Yes, the csvToParquet function isn't being included in the timings. I did not use the 100GB dataset; I used the 20GB dataset, and the GPU is still slower than the CPU.
-
Thank you very much for your help; the problem has been resolved. The test results show that the GPU acceleration is very significant.
-
Can you show me some test results compared to the CPU? Thanks a lot.