[Python] Rename LLMEngine to MLCEngine #2210

Merged: 1 commit, Apr 24, 2024
README.md (10 changes: 5 additions & 5 deletions)
@@ -106,11 +106,11 @@ We can run the Llama-3 model with the chat completion Python API of MLC LLM.
You can save the code below into a Python file and run it.

```python
- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
- engine = LLMEngine(model)
+ engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
@@ -125,12 +125,12 @@ print("\n")
engine.terminate()
```

- **The Python API of `mlc_llm.LLMEngine` fully aligns with OpenAI API**.
- You can use LLMEngine in the same way of using
+ **The Python API of `mlc_llm.MLCEngine` fully aligns with OpenAI API**.
+ You can use MLCEngine in the same way of using
[OpenAI's Python package](https://github.com/openai/openai-python?tab=readme-ov-file#usage)
for both synchronous and asynchronous generation.

- If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncLLMEngine` instead.
+ If you would like to do concurrent asynchronous generation, you can use `mlc_llm.AsyncMLCEngine` instead.
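For reference, a minimal non-streaming sketch of the renamed API (illustrative, not part of this diff; it assumes the same model identifier as above and the OpenAI-style `stream=False` call form):

```python
from mlc_llm import MLCEngine

# Create engine (same 4-bit quantized Llama-3 model as in the example above)
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
engine = MLCEngine(model)

# Non-streaming chat completion: the full response object is returned at once
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],  # illustrative prompt
    model=model,
    stream=False,
)
print(response.choices[0].message.content)

engine.terminate()
```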

### REST Server

docs/deploy/python_engine.rst (72 changes: 36 additions & 36 deletions)
@@ -4,7 +4,7 @@ Python API
==========

.. note::
- This page introduces the Python API with LLMEngine in MLC LLM.
+ This page introduces the Python API with MLCEngine in MLC LLM.
If you want to check out the old Python API which uses :class:`mlc_llm.ChatModule`,
please go to :ref:`deploy-python-chat-module`

@@ -13,10 +13,10 @@ Python API
:depth: 2


- MLC LLM provides Python API through classes :class:`mlc_llm.LLMEngine` and :class:`mlc_llm.AsyncLLMEngine`
+ MLC LLM provides Python API through classes :class:`mlc_llm.MLCEngine` and :class:`mlc_llm.AsyncMLCEngine`
which **support full OpenAI API completeness** for easy integration into other Python projects.

- This page introduces how to use the LLM engines in MLC LLM.
+ This page introduces how to use the engines in MLC LLM.
The Python API is a part of the MLC-LLM package, which we have prepared pre-built pip wheels via
the :ref:`installation page <install-mlc-packages>`.

@@ -26,31 +26,31 @@ Verify Installation

.. code:: bash

- python -c "from mlc_llm import LLMEngine; print(LLMEngine)"
+ python -c "from mlc_llm import MLCEngine; print(MLCEngine)"

- You are expected to see the output of ``<class 'mlc_llm.serve.engine.LLMEngine'>``.
+ You are expected to see the output of ``<class 'mlc_llm.serve.engine.MLCEngine'>``.

If the command above results in error, follow :ref:`install-mlc-packages` to install prebuilt pip
packages or build MLC LLM from source.


- Run LLMEngine
+ Run MLCEngine
-------------

- :class:`mlc_llm.LLMEngine` provides the interface of OpenAI chat completion synchronously.
- :class:`mlc_llm.LLMEngine` does not batch concurrent request due to the synchronous design,
- and please use :ref:`AsyncLLMEngine <python-engine-async-llm-engine>` for request batching process.
+ :class:`mlc_llm.MLCEngine` provides the interface of OpenAI chat completion synchronously.
+ :class:`mlc_llm.MLCEngine` does not batch concurrent request due to the synchronous design,
+ and please use :ref:`AsyncMLCEngine <python-engine-async-llm-engine>` for request batching process.

**Stream Response.** In :ref:`quick-start` and :ref:`introduction-to-mlc-llm`,
- we introduced the basic use of :class:`mlc_llm.LLMEngine`.
+ we introduced the basic use of :class:`mlc_llm.MLCEngine`.

.. code:: python

- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
- engine = LLMEngine(model)
+ engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
@@ -64,9 +64,9 @@ we introduced the basic use of :class:`mlc_llm.LLMEngine`.

engine.terminate()

- This code example first creates an :class:`mlc_llm.LLMEngine` instance with the 8B Llama-3 model.
- **We design the Python API** :class:`mlc_llm.LLMEngine` **to align with OpenAI API**,
- which means you can use :class:`mlc_llm.LLMEngine` in the same way of using
+ This code example first creates an :class:`mlc_llm.MLCEngine` instance with the 8B Llama-3 model.
+ **We design the Python API** :class:`mlc_llm.MLCEngine` **to align with OpenAI API**,
+ which means you can use :class:`mlc_llm.MLCEngine` in the same way of using
`OpenAI's Python package <https://github.com/openai/openai-python?tab=readme-ov-file#usage>`_
for both synchronous and asynchronous generation.

@@ -90,14 +90,14 @@ for the complete chat completion interface.

.. _python-engine-async-llm-engine:

- Run AsyncLLMEngine
+ Run AsyncMLCEngine
------------------

- :class:`mlc_llm.AsyncLLMEngine` provides the interface of OpenAI chat completion with
+ :class:`mlc_llm.AsyncMLCEngine` provides the interface of OpenAI chat completion with
asynchronous features.
- **We recommend using** :class:`mlc_llm.AsyncLLMEngine` **to batch concurrent request for better throughput.**
+ **We recommend using** :class:`mlc_llm.AsyncMLCEngine` **to batch concurrent request for better throughput.**

- **Stream Response.** The core use of :class:`mlc_llm.AsyncLLMEngine` for stream responses is as follows.
+ **Stream Response.** The core use of :class:`mlc_llm.AsyncMLCEngine` for stream responses is as follows.

.. code:: python

@@ -109,14 +109,14 @@ asynchronous features.
for choice in response.choices:
print(choice.delta.content, end="", flush=True)

- .. collapse:: The collapsed is a complete runnable example of AsyncLLMEngine in Python.
+ .. collapse:: The collapsed is a complete runnable example of AsyncMLCEngine in Python.

.. code:: python

import asyncio
from typing import Dict

- from mlc_llm.serve import AsyncLLMEngine
+ from mlc_llm.serve import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
prompts = [
@@ -127,7 +127,7 @@

async def test_completion():
# Create engine
- async_engine = AsyncLLMEngine(model=model)
+ async_engine = AsyncMLCEngine(model=model)

num_requests = len(prompts)
output_texts: Dict[str, str] = {}
@@ -176,8 +176,8 @@ for the complete chat completion interface.
Engine Mode
-----------

- To ease the engine configuration, the constructors of :class:`mlc_llm.LLMEngine` and
- :class:`mlc_llm.AsyncLLMEngine` have an optional argument ``mode``,
+ To ease the engine configuration, the constructors of :class:`mlc_llm.MLCEngine` and
+ :class:`mlc_llm.AsyncMLCEngine` have an optional argument ``mode``,
which falls into one of the three options ``"local"``, ``"interactive"`` or ``"server"``.
The default mode is ``"local"``.
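A minimal sketch of passing ``mode`` at construction time (not part of this diff; it assumes the Llama-3 model identifier from the earlier examples, and only the three documented string values are valid):

.. code:: python

   from mlc_llm import MLCEngine

   model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"

   # Equivalent to MLCEngine(model): "local" is the default mode.
   engine = MLCEngine(model, mode="local")
   engine.terminate()

   # The other documented options, depending on the deployment scenario:
   # engine = MLCEngine(model, mode="interactive")
   # engine = MLCEngine(model, mode="server")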

@@ -203,59 +203,59 @@ Deploy Your Own Model with Python API
The :ref:`introduction page <introduction-deploy-your-own-model>` introduces how we can deploy our
own models with MLC LLM.
This section introduces how you can use the model weights you convert and the model library you build
- in :class:`mlc_llm.LLMEngine` and :class:`mlc_llm.AsyncLLMEngine`.
+ in :class:`mlc_llm.MLCEngine` and :class:`mlc_llm.AsyncMLCEngine`.

We use the `Phi-2 <https://huggingface.co/microsoft/phi-2>`_ as the example model.

**Specify Model Weight Path.** Assume you have converted the model weights for your own model,
- you can construct a :class:`mlc_llm.LLMEngine` as follows:
+ you can construct a :class:`mlc_llm.MLCEngine` as follows:

.. code:: python

- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

model = "models/phi-2" # Assuming the converted phi-2 model weights are under "models/phi-2"
- engine = LLMEngine(model)
+ engine = MLCEngine(model)


**Specify Model Library Path.** Further, if you build the model library on your own,
- you can use it in :class:`mlc_llm.LLMEngine` by passing the library path through argument ``model_lib_path``.
+ you can use it in :class:`mlc_llm.MLCEngine` by passing the library path through argument ``model_lib_path``.

.. code:: python

- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

model = "models/phi-2"
model_lib_path = "models/phi-2/lib.so" # Assuming the phi-2 model library is built at "models/phi-2/lib.so"
- engine = LLMEngine(model, model_lib_path=model_lib_path)
+ engine = MLCEngine(model, model_lib_path=model_lib_path)


- The same applies to :class:`mlc_llm.AsyncLLMEngine`.
+ The same applies to :class:`mlc_llm.AsyncMLCEngine`.


.. _python-engine-api-reference:

API Reference
-------------

- The :class:`mlc_llm.LLMEngine` and :class:`mlc_llm.AsyncLLMEngine` classes provide the following constructors.
+ The :class:`mlc_llm.MLCEngine` and :class:`mlc_llm.AsyncMLCEngine` classes provide the following constructors.

- The LLMEngine and AsyncLLMEngine have full OpenAI API completeness.
+ The MLCEngine and AsyncMLCEngine have full OpenAI API completeness.
Please refer to `OpenAI's Python package <https://github.com/openai/openai-python?tab=readme-ov-file#usage>`_
and `OpenAI chat completion API <https://platform.openai.com/docs/api-reference/chat/create>`_
for the complete chat completion interface.

.. currentmodule:: mlc_llm

- .. autoclass:: LLMEngine
+ .. autoclass:: MLCEngine
:members:
:exclude-members: evaluate
:undoc-members:
:show-inheritance:

.. automethod:: __init__

- .. autoclass:: AsyncLLMEngine
+ .. autoclass:: AsyncMLCEngine
:members:
:exclude-members: evaluate
:undoc-members:
docs/get_started/introduction.rst (18 changes: 9 additions & 9 deletions)
@@ -90,11 +90,11 @@ You can save the code below into a Python file and run it.

.. code:: python

- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
- engine = LLMEngine(model)
+ engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
@@ -114,9 +114,9 @@ You can save the code below into a Python file and run it.

MLC LLM Python API

- This code example first creates an :class:`mlc_llm.LLMEngine` instance with the 4-bit quantized Llama-3 model.
- **We design the Python API** :class:`mlc_llm.LLMEngine` **to align with OpenAI API**,
- which means you can use :class:`mlc_llm.LLMEngine` in the same way of using
+ This code example first creates an :class:`mlc_llm.MLCEngine` instance with the 4-bit quantized Llama-3 model.
+ **We design the Python API** :class:`mlc_llm.MLCEngine` **to align with OpenAI API**,
+ which means you can use :class:`mlc_llm.MLCEngine` in the same way of using
`OpenAI's Python package <https://github.com/openai/openai-python?tab=readme-ov-file#usage>`_
for both synchronous and asynchronous generation.

@@ -134,7 +134,7 @@ If you want to run without streaming, you can run
print(response)

You can also try different arguments supported in `OpenAI chat completion API <https://platform.openai.com/docs/api-reference/chat/create>`_.
- If you would like to do concurrent asynchronous generation, you can use :class:`mlc_llm.AsyncLLMEngine` instead.
+ If you would like to do concurrent asynchronous generation, you can use :class:`mlc_llm.AsyncMLCEngine` instead.

REST Server
-----------
@@ -229,7 +229,7 @@ You can also use this model in Python API, MLC serve and other use scenarios.
(Optional) Compile Model Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- In previous sections, model libraries are compiled when the :class:`mlc_llm.LLMEngine` launches,
+ In previous sections, model libraries are compiled when the :class:`mlc_llm.MLCEngine` launches,
which is what we call "JIT (Just-in-Time) model compilation".
In some cases, it is beneficial to explicitly compile the model libraries.
We can deploy LLMs with reduced dependencies by shipping the library for deployment without going through compilation.
@@ -257,12 +257,12 @@ At runtime, we need to specify this model library path to use it. For example,

.. code:: python

- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

# For Python API
model = "models/phi-2"
model_lib_path = "models/phi-2/lib.so"
- engine = LLMEngine(model, model_lib_path=model_lib_path)
+ engine = MLCEngine(model, model_lib_path=model_lib_path)

:ref:`compile-model-libraries` introduces the model compilation command in detail,
where you can find instructions and example commands to compile model to different
docs/get_started/quick_start.rst (4 changes: 2 additions & 2 deletions)
@@ -20,11 +20,11 @@ It is recommended to have at least 6GB free VRAM to run it.

.. code:: python

- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
- engine = LLMEngine(model)
+ engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
examples/python/sample_mlc_engine.py (4 changes: 2 additions & 2 deletions)
@@ -1,8 +1,8 @@
- from mlc_llm import LLMEngine
+ from mlc_llm import MLCEngine

# Create engine
model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"
- engine = LLMEngine(model)
+ engine = MLCEngine(model)

# Run chat completion in OpenAI API.
for response in engine.chat.completions.create(
python/mlc_llm/__init__.py (2 changes: 1 addition & 1 deletion)
@@ -6,4 +6,4 @@
from . import protocol, serve
from .chat_module import ChatConfig, ChatModule, ConvConfig, GenerationConfig
from .libinfo import __version__
- from .serve import AsyncLLMEngine, LLMEngine
+ from .serve import AsyncMLCEngine, MLCEngine
python/mlc_llm/help.py (2 changes: 1 addition & 1 deletion)
@@ -203,7 +203,7 @@
The number of draft tokens to generate in speculative proposal. The default values is 4.
""",
"engine_config_serve": """
- The LLMEngine execution configuration.
+ The MLCEngine execution configuration.
Currently speculative decoding mode is specified via engine config.
For example, you can use "--engine-config='spec_draft_length=4;speculative_mode=EAGLE'" to
specify the eagle-style speculative decoding.
python/mlc_llm/interface/serve.py (2 changes: 1 addition & 1 deletion)
@@ -35,7 +35,7 @@ def serve(
): # pylint: disable=too-many-arguments, too-many-locals
"""Serve the model with the specified configuration."""
# Create engine and start the background loop
- async_engine = engine.AsyncLLMEngine(
+ async_engine = engine.AsyncMLCEngine(
model=model,
device=device,
model_lib_path=model_lib_path,
python/mlc_llm/serve/__init__.py (2 changes: 1 addition & 1 deletion)
@@ -4,7 +4,7 @@
from .. import base
from .config import EngineConfig, GenerationConfig, SpeculativeMode
from .data import Data, ImageData, RequestStreamOutput, TextData, TokenData
- from .engine import AsyncLLMEngine, LLMEngine
+ from .engine import AsyncMLCEngine, MLCEngine
from .grammar import BNFGrammar, GrammarStateMatcher
from .radix_tree import PagedRadixTree
from .request import Request
python/mlc_llm/serve/config.py (2 changes: 1 addition & 1 deletion)
@@ -141,7 +141,7 @@ class SpeculativeMode(enum.IntEnum):

@tvm._ffi.register_object("mlc.serve.EngineConfig") # pylint: disable=protected-access
class EngineConfig(tvm.runtime.Object):
"""The class of LLMEngine execution configuration.
"""The class of MLCEngine execution configuration.

Parameters
----------