
How to go about utilizing MBART for conditional generation with beam search in ONNXRuntime with TensorRT/CUDA #15871

Closed
JeroendenBoef opened this issue Mar 1, 2022 · 6 comments

Comments

JeroendenBoef commented Mar 1, 2022

Hi HuggingFace team,

Last December I looked into exporting MBartForConditionalGeneration from "facebook/mbart-large-50-many-to-one-mmt" for multilingual machine translation. Originally I followed the approach described in this BART + beam search example, extending it to support MBART and to work around the 2 GB model size limit. While this approach worked with the CPUExecutionProvider in ORT sessions, it did not actually improve runtime, nor did it work with the TensorRT or CUDA execution providers (out-of-memory errors on CUDA and dynamic shape inference failures).
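
For context, an ORT session with those execution providers is set up roughly like this; the model path is a placeholder for the exported graph:

    import onnxruntime as ort

    # Model path is a placeholder for the exported beam search graph.
    # ORT tries the providers in order and falls back if one cannot be used.
    session = ort.InferenceSession(
        "onnx/mbart_beam_search.onnx",
        providers=[
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )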

Today I saw this issue and exported MBartForConditionalGeneration with python -m transformers.onnx --model=facebook/mbart-large-50-many-to-one-mmt --feature seq2seq-lm-with-past --atol=5e-5 onnx/. While the export itself worked (passing all validation checks), I couldn't run an actual ORT session because of missing inputs: the encoder/decoder past key values for seq2seq-lm-with-past, and decoder_input_ids and decoder_attention_mask for seq2seq-lm.
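
For anyone hitting the same mismatch, listing the graph inputs of the exported model makes the expected names explicit. A minimal inspection sketch, assuming the default model.onnx filename written by transformers.onnx:

    import onnxruntime as ort

    # Print every graph input the exported model expects (name, shape, dtype);
    # the past key values / decoder inputs mentioned above show up here.
    session = ort.InferenceSession("onnx/model.onnx", providers=["CPUExecutionProvider"])
    for inp in session.get_inputs():
        print(inp.name, inp.shape, inp.type)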

I could use some clarification on whether this is the implementation I am looking for (does the latter ONNX export support .generate() with beam search, or should I refocus my efforts on the BART + beam search modification?). If the newer command line ONNX export is what I need, which feature would be correct for the many-to-one-mmt MBartForConditionalGeneration checkpoint (seq2seq-lm or seq2seq-lm-with-past), and where can I find the additional inputs the model needs to run .generate() in an ORT session? The BART beam search implementation I mentioned earlier required input_ids, attention_mask, num_beams, max_length and decoder_start_token_id; the required inputs for the newer conversion are less clear to me.
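
For reference, I would expect to feed those five inputs roughly like this; the file path, beam settings, source language and the en_XX start token are assumptions on my part:

    import numpy as np
    import onnxruntime as ort
    from transformers import MBart50TokenizerFast

    tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
    tokenizer.src_lang = "de_DE"  # assumed source language for the example
    enc = tokenizer("Ein kurzer Beispielsatz.", return_tensors="np")

    # Assumes the exported graph takes the five inputs named above, with the
    # scalar controls passed as 0-d int64 tensors; the file path is a placeholder.
    session = ort.InferenceSession("onnx/mbart_beam_search.onnx", providers=["CPUExecutionProvider"])
    output_ids = session.run(
        None,
        {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
            "num_beams": np.array(5, dtype=np.int64),
            "max_length": np.array(128, dtype=np.int64),
            "decoder_start_token_id": np.array(tokenizer.lang_code_to_id["en_XX"], dtype=np.int64),
        },
    )[0]
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))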

I assume @lewtun would be the person to ask for help here but I appreciate any pointers!

Environment info

  • transformers version: 4.16.2
  • Platform: Linux-5.4.0-96-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.10.1+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no
@HaukurPall

I was also adjusting the BART export to work for mBART in a translation setting, where you need to be able to set the decoder_start_token_id dynamically. I noticed that the ONNX conversion example you link to (thank you!) is not complete: it is missing the dynamic_axes configuration for attention_mask. The export call needs to be adjusted to:

    # bart_script_model, inputs, num_beams, max_length, decoder_start_token_id,
    # onnx_file_path and output_ids come from the original BART beam search example.
    torch.onnx.export(
        bart_script_model,
        (
            inputs["input_ids"],
            inputs["attention_mask"],
            num_beams,
            max_length,
            decoder_start_token_id,
        ),
        onnx_file_path,
        opset_version=14,
        input_names=[
            "input_ids",
            "attention_mask",
            "num_beams",
            "max_length",
            "decoder_start_token_id",
        ],
        output_names=["output_ids"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "seq"},
            # the missing piece: attention_mask also needs dynamic batch/seq axes
            "attention_mask": {0: "batch", 1: "seq"},
            "output_ids": {0: "batch", 1: "seq_out"},
        },
        example_outputs=output_ids,
    )
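
A quick way to verify that the dynamic axes actually ended up in the exported graph (the file name is a placeholder):

    import onnx

    # Symbolic names ("batch", "seq", ...) instead of fixed integers indicate
    # that the corresponding dimension is dynamic.
    model = onnx.load("mbart_beam_search.onnx")
    for graph_io in list(model.graph.input) + list(model.graph.output):
        dims = [d.dim_param or d.dim_value for d in graph_io.type.tensor_type.shape.dim]
        print(graph_io.name, dims)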

Hope this helps!

@JeroendenBoef (Author)

Thanks for jumping in with the tip @HaukurPall! Did this approach allow you to not only export the mBART model to ONNX but also run it (with increased speed) on the CUDA/TensorRT execution providers? Also, am I correct in assuming you made a custom version of convert.py in which you overrode the torch.onnx.export() call with the snippet you posted above? Thanks in advance!

@LysandreJik (Member)

Thanks for opening an issue @JeroendenBoef!

Pinging @lewtun, @mfuntowicz, should this issue be moved to optimum?

lewtun (Member) commented Mar 28, 2022

Thanks for the ping! Yes, I think it would make sense to move this issue to the optimum repo :)

@JeroendenBoef we currently have a PR in optimum that will enable simple inference / text-generation for ONNX Runtime: huggingface/optimum#113

Once that is completed, I think it should address most of the points raised in this issue!

@HaukurPall

Hey @JeroendenBoef. No, I was not able to get the inference working efficiently on the CUDA execution providers. I even attempted to use the IOBindings (as suggested by the ONNX team) but was not successful. I have put this endeavour aside until there is better support for autoregressive inference.
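
For anyone who wants to try, the IOBinding API looks roughly like this; the file path, input shapes and the output name are placeholders, and the remaining scalar inputs would be bound the same way:

    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("mbart_beam_search.onnx", providers=["CUDAExecutionProvider"])
    io_binding = session.io_binding()

    # Placeholder inputs just to show the binding calls.
    input_ids = np.ones((1, 16), dtype=np.int64)
    attention_mask = np.ones((1, 16), dtype=np.int64)

    io_binding.bind_cpu_input("input_ids", input_ids)
    io_binding.bind_cpu_input("attention_mask", attention_mask)
    # Let ORT allocate the output on the GPU and copy it back to CPU once at the end.
    io_binding.bind_output("output_ids", device_type="cuda", device_id=0)

    session.run_with_iobinding(io_binding)
    output_ids = io_binding.copy_outputs_to_cpu()[0]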

If you still want to try this, there is a different approach to exporting these models presented in https://github.com/Ki6an/fastT5/. It is for T5, but some of the code has been adjusted to work for mBART, see this issue: Ki6an/fastT5#7. I did not try running that model on CUDA, as it would require some work to get the IOBindings correct and efficient.

@JeroendenBoef (Author)

Thanks for the reply and the pointer to the new PR on optimum @lewtun. I will close this issue for now and keep an eye out for new developments regarding seq2seq models in optimum. I feel like this issue already covers the core of the problem, save for maybe the difficulty of actually achieving improved inference speed with the exported model, so unless it is preferred for documentation purposes, I will not open a new issue on optimum.

Thanks for the detailed response @HaukurPall, this saves me some headaches and time :). I was already afraid there would be no performance improvement, but now I have confirmation that I should also postpone my efforts until there is a better approach in place for ORT seq2seq models.
