
[experimental]backend: add new oneDNN backend #855

Draft
wants to merge 1 commit into
base: master

rfsaliev

This PR is a proof of concept for integrating the oneDNN (DNNL) library into GGML.

I created this PR rather than an issue so that the discussion of a oneDNN backend can start from a working demo.

Motivation: oneDNN is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. The backend will allow GGML to use the latest Intel CPU/GPU instruction set features (e.g. AMX) out of the box.

Known issues and TODOs:

  • Functionality:

    • Only a limited set of operations is implemented, just enough to support the GPT-2 sample model.

    It would be great if a backend could delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

    • This PoC supports the CPU engine only.
    • By default, the backend uses the CPU buffer type; its own buffer type is under development.
  • Performance:

    • Operation fusing is not implemented. oneDNN can fuse several operations into one call, which significantly improves performance by reducing memory reads and writes.
    • The oneDNN MatMul and InnerProduct (Linear) primitives run in a non-optimal mode because the weights are provided in a plain memory layout. For maximum performance, it is recommended to 'reorder' at least the weights into a blocked layout, which is efficient for memory access and AI acceleration instructions. See oneDNN Memory Format Propagation (https://oneapi-src.github.io/oneDNN/page_memory_format_propagation_cpp.html).
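
To make the 'blocked layout' point concrete, here is a minimal sketch in plain C++. It does not use the oneDNN API; the function names, the block width B, and the zero padding are assumptions chosen for illustration, and real oneDNN blocked formats are selected by the library and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of element (k, n) of a K x N matrix inside the blocked buffer.
// Columns are grouped into blocks of width B; inside one block the B values
// that belong to the same row k are adjacent in memory.
std::size_t blocked_index(std::size_t k, std::size_t n, std::size_t K, std::size_t B) {
    return (n / B) * K * B + k * B + (n % B);
}

// Reorder a row-major K x N weight matrix into column blocks of width B.
// If N is not a multiple of B, the last block is padded with zeros.
std::vector<float> reorder_to_blocked(const std::vector<float>& plain,
                                      std::size_t K, std::size_t N, std::size_t B) {
    assert(B > 0 && plain.size() == K * N);
    const std::size_t n_blocks = (N + B - 1) / B;
    std::vector<float> blocked(n_blocks * K * B, 0.0f);
    for (std::size_t k = 0; k < K; ++k) {
        for (std::size_t n = 0; n < N; ++n) {
            blocked[blocked_index(k, n, K, B)] = plain[k * N + n];
        }
    }
    return blocked;
}
```

With this layout a kernel that processes B output columns at a time reads contiguous memory for every k, which is the access pattern a plain row-major layout does not give for large N.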

@ggerganov, @slaren, can you please advise on the proper way to implement operation fusing and weights pre-packing effectively?

Some technical details:

  • Added files: ggml-dnnl.h, ggml-dnnl.cpp. The backend reuses the CPU buffer type; a custom buffer type is under development and is wrapped in the USE_DNNL_BACKEND macro.
  • The CMake files are modified to support the GGML_DNNL configuration option.
  • gpt-2-sched is modified to convert model weights from FP16 to FP32 if the DNNL backend is enabled, because the current oneDNN release does not support MatMul with src_type=dst_type=f32 and weights_type=fp16.
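
For readers unfamiliar with the FP16 to FP32 weight conversion mentioned in the last item, the following standalone sketch shows what such a conversion does at the bit level. It is not the code from this PR, and ggml has its own FP16 conversion helpers; the names fp16_to_fp32 and convert_weights are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Convert one IEEE 754 binary16 value, given as raw bits, to a 32-bit float.
float fp16_to_fp32(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exp  = (h >> 10) & 0x1Fu;
    std::uint32_t mant = h & 0x3FFu;
    std::uint32_t bits;
    if (exp == 0x1Fu) {            // Inf or NaN
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {         // normal number: rebias exponent from 15 to 127
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {        // signed zero
        bits = sign;
    } else {                       // subnormal: shift until the implicit bit appears
        exp = 113;
        while ((mant & 0x400u) == 0) { mant <<= 1; --exp; }
        bits = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
    }
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Convert a whole FP16 weight array to FP32 (hypothetical helper).
std::vector<float> convert_weights(const std::vector<std::uint16_t>& src) {
    std::vector<float> dst;
    dst.reserve(src.size());
    for (std::uint16_t h : src) dst.push_back(fp16_to_fp32(h));
    assert(dst.size() == src.size());
    return dst;
}
```

The conversion is lossless, since every FP16 value is exactly representable in FP32, but it doubles the memory used by the weights.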

@slaren
Collaborator

slaren commented Jun 12, 2024

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

It would be great if a backend be able to delegate/offload non-supported operations to other backends: CPU, SYCL, OpenCL etc..

This will be supported through ggml_backend_sched after ggerganov/llama.cpp#6210 is merged.

@ggerganov, @slaren, can you please advice proper method to effectively implement operations fusing and weights pre-packing?

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.
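
As an illustration of the graph-analysis approach, here is a minimal sketch on a toy graph. It does not use ggml types: Op, Node, and find_matmul_add_pairs are hypothetical names, and a real backend would also have to check tensor shapes and types before treating an ADD as a bias-add.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Op { MulMat, Add, Other };

struct Node {
    Op  op;
    int src0;  // index of the producing node, or -1 for a graph input
    int src1;
};

// Find (MUL_MAT, ADD) index pairs that can be executed as one fused call:
// the ADD immediately follows the MUL_MAT, consumes its result exactly once,
// and no later node reads the intermediate MUL_MAT result.
std::vector<std::pair<std::size_t, std::size_t>>
find_matmul_add_pairs(const std::vector<Node>& graph) {
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t i = 0; i + 1 < graph.size(); ++i) {
        if (graph[i].op != Op::MulMat || graph[i + 1].op != Op::Add) continue;
        const int self = static_cast<int>(i);
        const Node& add = graph[i + 1];
        if ((add.src0 == self) == (add.src1 == self)) continue;
        bool used_elsewhere = false;
        for (std::size_t j = i + 2; j < graph.size(); ++j) {
            if (graph[j].src0 == self || graph[j].src1 == self) {
                used_elsewhere = true;
                break;
            }
        }
        if (!used_elsewhere) pairs.emplace_back(i, i + 1);
    }
    return pairs;
}
```

A compute loop could then run each returned pair as a single fused primitive and skip the ADD node when it is reached.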

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type. For example, the CUDA backend has a split buffer type that splits the tensors between multiple GPUs. Since this buffer type would only be used to store weights, in most cases it would be ok to leave some functionality unimplemented, such as support for creating views, or reading data back through get_tensor.
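
The idea of transforming data inside set_tensor can be sketched as follows. This is a toy class, not the ggml buffer interface: PackedWeightBuffer and its methods are hypothetical, and a plain transpose stands in for whatever packing a real backend would apply.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical weights-only buffer: data is transformed once, on upload.
class PackedWeightBuffer {
public:
    PackedWeightBuffer(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), packed_(rows * cols) {}

    // Analogue of set_tensor: takes row-major data and stores it transposed.
    void set_tensor(const float* data, std::size_t count) {
        assert(data != nullptr);
        if (count != rows_ * cols_) throw std::invalid_argument("size mismatch");
        for (std::size_t r = 0; r < rows_; ++r) {
            for (std::size_t c = 0; c < cols_; ++c) {
                packed_[c * rows_ + r] = data[r * cols_ + c];
            }
        }
    }

    // Analogue of get_tensor: deliberately left unimplemented, since the
    // buffer only ever holds weights that are written once and then consumed.
    void get_tensor(float*, std::size_t) const {
        throw std::logic_error("read-back is not supported for packed weights");
    }

    const std::vector<float>& packed() const { return packed_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<float> packed_;
};
```

The cost of the transformation is paid once at model load time, and compute kernels only ever see the packed form.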

@WilliamTambellini
Contributor

Very good idea, @rfsaliev.
Most recent Intel CPUs support bf16.
int8 support (VNNI) should also be easy to add.

@rfsaliev
Author

Thank you @slaren for your response.

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

oneDNN supports at least int8 quantization. Unfortunately, oneDNN's quantization granularity (per-tensor or per-dimension) differs from GGML's (per-block). In any case, I will look for ways to support quantized types.
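
The difference in granularity can be shown with a small sketch. It uses a simplified symmetric int8 scheme with one scale per block, which is an assumption for illustration and not the exact arithmetic of either library; quantize_dequantize is a hypothetical name. ggml's Q8_0 type, for instance, stores one scale per block of 32 values, whereas a per-tensor scheme corresponds to a single block that spans the whole tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantize to symmetric int8 with one scale per block of `block_size` values,
// then dequantize again, so the result shows what survives the round trip.
// block_size == x.size() corresponds to per-tensor quantization.
std::vector<float> quantize_dequantize(const std::vector<float>& x, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<float> out(x.size());
    for (std::size_t start = 0; start < x.size(); start += block_size) {
        const std::size_t end = std::min(start + block_size, x.size());
        float amax = 0.0f;
        for (std::size_t i = start; i < end; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float scale = amax / 127.0f;  // maps the block maximum to code 127
        for (std::size_t i = start; i < end; ++i) {
            const float q = scale > 0.0f ? std::round(x[i] / scale) : 0.0f;
            out[i] = q * scale;
        }
    }
    return out;
}
```

When a tensor mixes small and large values, a single per-tensor scale rounds the small values to zero, while per-block scales preserve them. This is why per-block data cannot simply be handed to a per-tensor primitive without re-quantizing or losing accuracy.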

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.

Thanks, it looks possible to do some fusing, such as MatMul+BiasAdd, in graph_compute. IMHO, full support for graph_plan_create + graph_plan_compute would give the best opportunities for backend-side optimizations.

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type.

In the case of oneDNN, the weights buffer layout depends on the type of operation that uses the weights. Can you please point me to a way to identify the consuming operation in the set_tensor call?
I found that buffer.init_tensor is called for every operation during model execution. Can I rely on this behavior, or will it change in the future? That is, can I expect init_tensor to be called for, e.g., all MatMul operations assigned to the backend? If so, are there any design rules that prevent me from changing or replacing op->src with my own buffer?

set(GGML_HEADERS_DNNL ggml-dnnl.h)
set(GGML_SOURCES_DNNL ggml-dnnl.cpp)

set(GGML_EXTRA_INCS ${GGML_EXTRA_INCS} ${CLBLAST_INC} ${OPENCL_INC})
Owner

CLBLAST vars look out of place here

Author

Thank you, it was a copy-paste mistake.
I've fixed it, along with some other parts of this file.

@rfsaliev
Author

@slaren, can you please help me understand how a backend is supposed to work in the gpt-2-sched sample?
I tried to enable the BLAS backend in the sample, but I do not see any calls to ggml_backend_blas_mul_mat.
I ran the sample with the following steps:

  1. Clone ggml:
(tf_env) rfsaliev:~$ git clone https://github.com/ggerganov/ggml.git
  2. Apply the patch with debug fprintf calls (see the patch below):
(tf_env) rfsaliev:~$ cd ggml
(tf_env) rfsaliev:~/ggml$ git apply < ~/ggml-blas-debug.patch
  3. Download and convert a model, keeping FP32 weights:
(tf_env) rfsaliev:~/ggml$ mkdir build && cd build
(tf_env) rfsaliev:~/ggml/build$ ../examples/gpt-2/download-model.sh 117M
(tf_env) rfsaliev:~/ggml/build$ python ../examples/gpt-2/convert-ckpt-to-ggml.py models/gpt-2-117M 0
  4. Build and run the sample:
(tf_env) rfsaliev:~/ggml/build$ cmake .. -DGGML_BLAS=ON && cmake --build . --target gpt-2-sched
(tf_env) rfsaliev:~/ggml/build$ ./bin/gpt-2-sched -m models/gpt-2-117M/ggml-model-f32.bin -p "This is an example of" -n 1 -ngl 32 -s 1

The sample's output contains a number of "BLAS: op supported" lines, but not a single "BLAS: MUL_MAT" or "BLAS: OUT_PROD":

main: seed = 1
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model-f32.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype   = 0
gpt2_model_load: qntvr   = 0
gpt2_model_load:     BLAS buffer size =   622.01 MB
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: backend_kv = BLAS
gpt2_model_load: model size  =   621.94 MB
gpt2_model_load: backend_in = BLAS (8192 bytes)
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
BLAS: op supported
BLAS: op supported
...
BLAS: op supported
BLAS: op supported
main:     BLAS compute buffer size =     6.32 MB
main: total compute buffer size: 6.32 MB
main: prompt: 'This is an example of'
main: number of tokens in prompt = 5, first 8 tokens: 1212 318 281 1672 286

This is an example of something

main:     load time =   743.46 ms
main:   sample time =     0.79 ms
main:  predict time =    33.71 ms / 6.74 ms per token
main:    total time =   781.95 ms

BLAS backend debug print patch (ggml-blas-debug.patch):

diff --git a/src/ggml-blas.cpp b/src/ggml-blas.cpp
index d709a35..7fff962 100644
--- a/src/ggml-blas.cpp
+++ b/src/ggml-blas.cpp
@@ -52,6 +52,8 @@ static void ggml_backend_blas_mul_mat(ggml_backend_blas_context * ctx, struct gg
     const struct ggml_tensor * src0 = dst->src[0];
     const struct ggml_tensor * src1 = dst->src[1];
 
+    fprintf(stderr, "BLAS: MUL_MAT");
+
     GGML_TENSOR_BINARY_OP_LOCALS
 
     const enum ggml_type type = src0->type;
@@ -170,6 +172,8 @@ static void ggml_backend_blas_out_prod(ggml_backend_blas_context * ctx, struct g
     const struct ggml_tensor * src0 = dst->src[0];
     const struct ggml_tensor * src1 = dst->src[1];
 
+    fprintf(stderr, "BLAS: OUT_PROD");
+
     GGML_TENSOR_BINARY_OP_LOCALS
 
     GGML_ASSERT(ne0  == ne00);
@@ -284,14 +288,15 @@ GGML_CALL static bool ggml_backend_blas_supports_op(ggml_backend_t backend, cons
     const struct ggml_tensor * src0 = op->src[0];
     const struct ggml_tensor * src1 = op->src[1];
 
-    return (op->op == GGML_OP_MUL_MAT  && ggml_backend_blas_use_blas(op)) ||
+    bool ok = (op->op == GGML_OP_MUL_MAT  && ggml_backend_blas_use_blas(op)) ||
            (op->op == GGML_OP_OUT_PROD && op->src[0]->type == GGML_TYPE_F32 &&
                                           op->src[1]->type == GGML_TYPE_F32 &&
                                           ggml_is_matrix(src0) &&
                                           ggml_is_matrix(src1) &&
                                           ggml_is_contiguous(src0) &&
                                           (ggml_is_contiguous(src1) || ggml_is_transposed(src1)));
-
+    if (ok) fprintf(stderr, "BLAS: op supported\n");
+    return ok;
     GGML_UNUSED(backend);
 }
 

@slaren
Collaborator

slaren commented Jun 20, 2024

BLAS is only used with batches of at least 32 tokens. The "OP supported" you are seeing are probably from the reserve run, which is never executed. Try a larger prompt, or always return true from ggml_backend_blas_use_blas. You can also view the backend assigned to each node in the graph by setting the environment variable GGML_SCHED_DEBUG.
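
The effect of such a size threshold can be sketched as below. This is a hypothetical stand-in, not the actual ggml_backend_blas_use_blas code, which may check additional conditions; MatMulShape, worth_offloading, and the default of 32 are taken only from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Dimensions of one matrix multiplication, in elements.
struct MatMulShape {
    std::int64_t rows_out;  // output rows, e.g. n_embd
    std::int64_t n_tokens;  // batch size: tokens processed at once
    std::int64_t inner;     // shared (reduction) dimension
};

// Offload only when every dimension reaches the minimum batch size, so that
// the library call overhead is amortized over enough work.
bool worth_offloading(const MatMulShape& s, std::int64_t min_batch = 32) {
    assert(min_batch > 0);
    return s.rows_out >= min_batch && s.n_tokens >= min_batch && s.inner >= min_batch;
}
```

Under this assumption, the 5-token prompt from the log above (with n_embd = 768) stays below the threshold, while a prompt of 32 or more tokens would cross it.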

@rfsaliev
Author

Thank you, using a longer prompt (main: number of tokens in prompt = 40) solved the issue.

* Backend logic is based on the BLAS backend
* Implemented support for the MUL_MAT operation
* Implemented fusing of MUL_MAT with a subsequent ADD as bias-add
* Implemented weights 'pre-packing' (reordering) for the MUL_MAT operation

Notes:
* This is the second version of the DNNL backend, based on the refactored ggml backend support implemented together with the BLAS backend
* It is recommended to enable GGML_OPENMP when oneDNN is compiled with DNNL_CPU_RUNTIME=OMP (the default)
@rfsaliev
Author

rfsaliev commented Jun 28, 2024

Hello,
I've published a new, simplified backend version based on the logic of the BLAS backend.

I also added simple MUL_MAT+ADD fusing and weights 'pre-packing' (reordering).
The pre-packing is executed at the scheduling stage, based on the oneDNN 'primitive descriptor' that is constructed in the supports_op() interface.

@qnixsynapse
Contributor

Is this PR dead?

@WilliamTambellini
Contributor

RFC please
