
How to go about utilizing MBART for conditional generation with beam search in ONNXRuntime with TensorRT/CUDA #15871

Closed
JeroendenBoef opened this issue Mar 1, 2022 · 6 comments

Comments

JeroendenBoef commented Mar 1, 2022

Hi HuggingFace team,

Last December I looked into exporting MBartForConditionalGeneration from "facebook/mbart-large-50-many-to-one-mmt" for multilingual machine translation. Originally I followed the approach described in this BART + beam search example, extending it to support MBART and to work around the 2 GB model size limit. While this approach worked with the CPUExecutionProvider in ORT sessions, it did not actually improve runtime, nor did it work with the TensorRT or CUDA execution providers (out-of-memory errors on CUDA and dynamic shape inference failures).
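
For context, an ORT session with those execution providers is set up roughly like this; the model path is a placeholder for the exported graph:

    import onnxruntime as ort

    # Model path is a placeholder for the exported beam search graph.
    # ORT tries the providers in order and falls back if one cannot be used.
    session = ort.InferenceSession(
        "onnx/mbart_beam_search.onnx",
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )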

Today I saw this issue and exported MBartForConditionalGeneration with python -m transformers.onnx --model=facebook/mbart-large-50-many-to-one-mmt --feature seq2seq-lm-with-past --atol=5e-5 onnx/. While the export itself worked (passing all validation checks), I couldn't run an actual ORT session because of missing inputs: the encoder/decoder past key values for seq2seq-lm-with-past, and decoder_input_ids and decoder_attention_mask for seq2seq-lm.
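
For anyone hitting the same mismatch, listing the graph inputs of the exported model makes the expected names explicit. A minimal inspection sketch, assuming the default model.onnx filename written by transformers.onnx:

    import onnxruntime as ort

    # Print every graph input the exported model expects (name, shape, dtype);
    # the past key values / decoder inputs mentioned above show up here.
    session = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)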

I could use some clarification on whether this is the implementation I am looking for (does the latter ONNX export support .generate() with beam search, or should I refocus my efforts on the BART + beam search modification?). If the newer command line ONNX export is what I need, which feature would be correct for the many-to-one-mmt MBartForConditionalGeneration checkpoint (seq2seq-lm or seq2seq-lm-with-past), and where can I find the additional inputs the model needs to run .generate() in an ORT session? The BART beam search implementation I mentioned earlier required input_ids, attention_mask, num_beams, max_length and decoder_start_token_id; the required inputs for the newer conversion are less clear to me.
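
For reference, I would expect to feed those five inputs roughly like this; the file path, beam settings, source language and the en_XX start token are assumptions on my part:

    import numpy as np
    import onnxruntime as ort
    from transformers import MBart50TokenizerFast

    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
    tokenizer.src_lang = "de_DE"  # assumed source language for the example
    enc = tokenizer("Ein kurzer Beispielsatz.", return_tensors="np")

    # Assumes the exported graph takes the five inputs named above, with the
    # scalar controls passed as 0-d int64 tensors; the file path is a placeholder.
    session = ort.InferenceSession("onnx/mbart_beam_search.onnx", providers=["CPUExecutionProvider"])
    output_ids = session.run(
        None,
        {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
            "num_beams": np.array(5, dtype=np.int64),
            "max_length": np.array(128, dtype=np.int64),
            "decoder_start_token_id": np.array(tokenizer.lang_code_to_id["en_XX"], dtype=np.int64),
        },
    )[0]
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))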

I assume @lewtun would be the person to ask for help here but I appreciate any pointers!

Environment info

  • transformers version: 4.16.2
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.10.1+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no
@HaukurPall

I was also adjusting the BART export to work for mBART in a translation setting, where you need to be able to set the decoder_start_token_id dynamically. I noticed that the ONNX conversion example you link to (thank you!) is not complete: it is missing the dynamic_axes configuration for attention_mask. The export call needs to be adjusted to:

    # bart_script_model, inputs, num_beams, max_length, decoder_start_token_id,
    # onnx_file_path and output_ids come from the original BART beam search example.
    torch.onnx.export(
        bart_script_model,
        (
            inputs["input_ids"],
            inputs["attention_mask"],
            num_beams,
            max_length,
            decoder_start_token_id,
        ),
        onnx_file_path,
        opset_version=14,
        input_names=[
            "input_ids",
            "attention_mask",
            "num_beams",
            "max_length",
            "decoder_start_token_id",
        ],
        output_names=["output_ids"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "seq"},
            # the missing piece: attention_mask also needs dynamic batch/seq axes
            "attention_mask": {0: "batch", 1: "seq"},
            "output_ids": {0: "batch", 1: "seq_out"},
        },
        example_outputs=output_ids,
    )
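
A quick way to verify that the dynamic axes actually ended up in the exported graph (the file name is a placeholder):

    import onnx

    # Symbolic names ("batch", "seq", ...) instead of fixed integers indicate
    # that the corresponding dimension is dynamic.
    model = onnx.load("mbart_beam_search.onnx")
    for graph_io in list(model.graph.input) + list(model.graph.output):
        dims = [d.dim_param or d.dim_value for d in graph_io.type.tensor_type.shape.dim]
        print(graph_io.name, dims)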

Hope this helps!

@JeroendenBoef (Author)

Thanks for jumping in with the tip @HaukurPall! Did this approach allow you to not only export the mBART model to ONNX but also run it (with increased speed) on the CUDA/TensorRT execution providers? Also, am I correct in assuming you made a custom version of convert.py in which you overrode the torch.onnx.export() call with the snippet you posted above? Thanks in advance!

@LysandreJik (Member)

Thanks for opening an issue @JeroendenBoef!

Pinging @lewtun, @mfuntowicz, should this issue be moved to optimum?

lewtun (Member) commented Mar 28, 2022

Thanks for the ping! Yes, I think it would make sense to move this issue to the optimum repo :)

@JeroendenBoef we currently have a PR in optimum that will enable simple inference / text-generation for ONNX Runtime: huggingface/optimum#113

Once that is completed, I think it should address most of the points raised in this issue!

@HaukurPall

Hey @JeroendenBoef. No, I was not able to get the inference working efficiently on the CUDA execution providers. I even attempted to use the IOBindings (as suggested by the ONNX team) but was not successful. I have put this endeavour aside until there is better support for autoregressive inference.
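
For anyone who wants to try, the IOBinding API looks roughly like this; the file path, input shapes and the output name are placeholders, and the remaining scalar inputs would be bound the same way:

    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("mbart_beam_search.onnx", providers=["CUDAExecutionProvider"])
    io_binding = session.io_binding()

    # Placeholder inputs just to show the binding calls.
    input_ids = np.ones((1, 16), dtype=np.int64)
    attention_mask = np.ones((1, 16), dtype=np.int64)

    io_binding.bind_cpu_input("input_ids", input_ids)
    io_binding.bind_cpu_input("attention_mask", attention_mask)
    # Let ORT allocate the output on the GPU and copy it back to CPU once at the end.
    io_binding.bind_output("output_ids", device_type="cuda", device_id=0)

    session.run_with_iobinding(io_binding)
    output_ids = io_binding.copy_outputs_to_cpu()[0]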

If you still want to try this, there is a different approach to exporting these models presented in https://github.com/Ki6an/fastT5/. It is for T5, but some of the code has been adjusted to work for mBART, see this issue: Ki6an/fastT5#7. I did not try running that model on CUDA, as it would require some work to get the IOBindings correct and efficient.

@JeroendenBoef (Author)

Thanks for the reply and the pointer to the new PR on optimum @lewtun. I will close this issue for now and keep an eye out for new developments regarding seq2seq models in optimum. I feel like this issue already covers the core of the problem, save for maybe the difficulty of actually achieving improved inference speed with the exported model, so unless it is preferred for documentation purposes, I will not open a new issue on optimum.

Thanks for the detailed response @HaukurPall, this saves me some headaches and time :). I was already afraid there would be no performance improvement, but now I have confirmation that I should also postpone my efforts until there is a better approach in place for ORT seq2seq models.
