From 96b7dd3ad4b224143419c0f3c32151fb462aa466 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Sat, 27 Jan 2024 13:40:33 -0800 Subject: [PATCH 1/8] update peft doc Signed-off-by: Chen Cui --- .../nlp/nemo_megatron/peft/landing_page.rst | 16 ++++++++-------- .../nlp/nemo_megatron/peft/quick_start.rst | 6 ++++-- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/source/nlp/nemo_megatron/peft/landing_page.rst b/docs/source/nlp/nemo_megatron/peft/landing_page.rst index c90dcdfff1c5..6f14c189e0f6 100644 --- a/docs/source/nlp/nemo_megatron/peft/landing_page.rst +++ b/docs/source/nlp/nemo_megatron/peft/landing_page.rst @@ -12,14 +12,14 @@ fraction of the computational and storage costs. NeMo supports four PEFT methods which can be used with various transformer-based models. -==================== ===== ===== ========= == -\ GPT 3 NvGPT LLaMa 1/2 T5 -==================== ===== ===== ========= == -Adapters (Canonical) ✅ ✅ ✅ ✅ -LoRA ✅ ✅ ✅ ✅ -IA3 ✅ ✅ ✅ ✅ -P-Tuning ✅ ✅ ✅ ✅ -==================== ===== ===== ========= == +==================== ===== ======== ========= ====== == +\ GPT 3 Nemotron LLaMa 1/2 Falcon T5 +==================== ===== ======== ========= ====== == +Adapters (Canonical) ✅ ✅ ✅ ✅ ✅ +LoRA ✅ ✅ ✅ ✅ ✅ +IA3 ✅ ✅ ✅ ✅ ✅ +P-Tuning ✅ ✅ ✅ ✅ ✅ +==================== ===== ======== ========= ====== == Learn more about PEFT in NeMo with the :ref:`peftquickstart` which provides an overview on how PEFT works in NeMo. Read about the supported PEFT methods diff --git a/docs/source/nlp/nemo_megatron/peft/quick_start.rst b/docs/source/nlp/nemo_megatron/peft/quick_start.rst index 000e242b9508..fd46444eee54 100644 --- a/docs/source/nlp/nemo_megatron/peft/quick_start.rst +++ b/docs/source/nlp/nemo_megatron/peft/quick_start.rst @@ -62,7 +62,7 @@ Base model classes PEFT in NeMo is built with a mix-in class that does not belong to any model in particular. This means that the same interface is available to different NeMo models. Currently, NeMo supports PEFT for GPT-style -models such as GPT 3, NvGPT, LLaMa 1/2 (``MegatronGPTSFTModel``), as +models such as GPT 3, Nemotron, LLaMa 1/2 (``MegatronGPTSFTModel``), as well as T5 (``MegatronT5SFTModel``). Full finetuning vs PEFT @@ -78,11 +78,13 @@ PEFT. 
trainer = MegatronTrainerBuilder(config).create_trainer() model_cfg = MegatronGPTSFTModel.merge_cfg_with(config.model.restore_from_path, config) + ### Training API ### model = MegatronGPTSFTModel.restore_from(restore_path, model_cfg, trainer) # restore from pretrained ckpt - + peft_cfg = LoRAPEFTConfig(model_cfg) + + peft_cfg = LoraPEFTConfig(model_cfg) + model.add_adapter(peft_cfg) trainer.fit(model) # saves adapter weights only + ### Inference API ### # Restore from base then load adapter API model = MegatronGPTSFTModel.restore_from(restore_path, trainer, model_cfg) + model.load_adapters(adapter_save_path, peft_cfg) From 648629e64975d7f901b87faa22bf3b230ea4b6de Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Sat, 27 Jan 2024 13:45:25 -0800 Subject: [PATCH 2/8] remove old prompt learning doc and notebook Signed-off-by: Chen Cui --- README.rst | 2 +- .../nlp/nemo_megatron/prompt_learning.rst | 390 --------- .../nlp/Multitask_Prompt_and_PTuning.ipynb | 786 ------------------ 3 files changed, 1 insertion(+), 1177 deletions(-) delete mode 100644 docs/source/nlp/nemo_megatron/prompt_learning.rst delete mode 100644 tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb diff --git a/README.rst b/README.rst index 05cf7d8d7124..eb169f2cafc4 100644 --- a/README.rst +++ b/README.rst @@ -125,7 +125,7 @@ Key Features * `Information retrieval `_ * `Entity Linking `_ * `Dialogue State Tracking `_ - * `Prompt Learning `_ + * `Parameter Efficient Finetuning (PEFT) `_ * `NGC collection of pre-trained NLP models. `_ * `Synthetic Tabular Data Generation `_ * Text-to-Speech Synthesis (TTS): diff --git a/docs/source/nlp/nemo_megatron/prompt_learning.rst b/docs/source/nlp/nemo_megatron/prompt_learning.rst deleted file mode 100644 index 8fe481019a6f..000000000000 --- a/docs/source/nlp/nemo_megatron/prompt_learning.rst +++ /dev/null @@ -1,390 +0,0 @@ -.. _promptlearning: - -Prompt Learning ---------------- - -Within NeMo we refer to **p-tuning** and **prompt tuning** methods collectively as prompt learning. Both methods are parameter efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model's full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoids catastrophic forgetting issues often encountered when fine-tuning models. - -Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning utilize virtual prompt embeddings that can be optimized via gradient descent. The only difference between prompt tuning and p-tuning within NeMo-Megatron is the architecture used to tune the soft prompt tokens during training. - -- Our prompt tuning implementation is based off Lester et. al’s EMNLP 2021 paper "`The Power of Scale for Parameter-Efficient Prompt Tuning `_" -- Our p-tuning implementation is based off Liu et al's paper "`GPT Understands, Too `_" - -Our continuous learning capability for combined p-tuning and prompt tuning with GPT style models is a NeMo specific extension of the author's original work. - -Please also checkout our `prompt learning tutorial notebook. 
`_ - - -Terminology -^^^^^^^^^^^ -We will be using the terms ``continuous``, ``soft``, and ``virtual`` token interchangeably to refer to embeddings inserted into the model prompt that have no concrete mapping to strings or characters within the model’s vocabulary. These virtual token embeddings exist in contrast to the ``discrete``, ``hard``, or ``real`` tokens that do make up the model’s vocabulary. Virtual tokens are purely 1D vectors with dimensionality equal to that of each real token embedding, matching the ``hidden_size`` hyperparameter. In training and inference, continuous token embeddings are inserted among discrete token embeddings according to a template you provide in the model's config. We will demonstrate how to do this below. - -When referring to p-tuning and prompt tuning together, we will be using the phrase prompt learning for simplicity. - -Prompt Tuning -^^^^^^^^^^^^^ - -In prompt-tuning a pretrained GPT model, soft prompt embeddings are initialized as a 2D matrix of size ``total_virtual_tokens X hidden_size``. Each task the model is prompt-tuned to perform has its own 2D embedding matrix associated with it. Tasks do not share any parameters during training or inference. All GPT model parameters are frozen and only the embedding parameters for each task are updated during training. - -In prompt tuning you can specify how the embeddings are initialized for each task. You can either - -- Initialize embedding parameters according to some random distribution -- Initialize embedding parameters from existing vocabulary embeddings (recommended) - -If you choose to initialize virtual token embeddings from existing embedding weights, you can provide the string of words you want to use for initialization in the model's config. This string will be tokenized and tiled or truncated to match the specified number of virtual tokens you would like to use (``total_virtual_tokens``). Vocab embeddings are copied and used to initialize the soft prompt embedding matrix for each task. The vocab embeddings themselves are not updated or changed during prompt tuning. - -P-Tuning -^^^^^^^^ - -In p-tuning, an LSTM model is used to predict virtual token embeddings. We refer to this LSTM model as our ``prompt_encoder``. LSTM parameters are randomly initialized at the start of p-tuning. All GPT model parameters are frozen, and only the LSTM weights are updated at each training step. LSTM parameters are shared between all tasks that are p-tuned at the same time, but the LSTM model outputs unique virtual token embeddings for each task. The virtual tokens predicted by the LSTM are inserted among the discrete token input in the exact same manner as with prompt-tuning. You still specify the number of virtual tokens you want to use by setting ``total_virtual_tokens`` and each virtual token embedding is still a 1D vector of size ``hidden_size``. - -Using Both Prompt and P-Tuning -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -A single pretrained GPT model can use both p-tuning and prompt-tuning. While you must decide to use either p-tuning or prompt-tuning for each task you want your model to perform, you can p-tune your model on a set of tasks *A*, then prompt tune your same model on a different set of tasks *B*, then finally run inference on tasks from both *A* and *B* at the same time. During prompt-tuning or p-tuning, tasks tuned at the same time must use the same number of virtual tokens. During inference, tasks using differing amounts of virtual tokens can be run at the same time. 
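For illustration, the following minimal PyTorch-style sketch contrasts what each method actually trains. The module names and sizes are assumptions chosen for the example (hidden size 768, 100 virtual tokens, a 2-layer LSTM) and are not the NeMo implementation; the frozen GPT model itself is omitted.

.. code::

    import torch.nn as nn

    hidden_size = 768           # assumed hidden size (e.g. a small GPT model)
    total_virtual_tokens = 100  # assumed number of virtual tokens per task

    # Prompt tuning: a trainable [total_virtual_tokens x hidden_size] embedding
    # matrix per task; the GPT weights themselves stay frozen.
    per_task_prompt_embeddings = nn.Embedding(total_virtual_tokens, hidden_size)

    # P-tuning: a single trainable LSTM prompt encoder shared by all tasks tuned
    # together; its outputs supply the virtual token embeddings for each task.
    prompt_encoder = nn.LSTM(input_size=hidden_size, hidden_size=hidden_size,
                             num_layers=2, batch_first=True)

    print(sum(p.numel() for p in per_task_prompt_embeddings.parameters()))  # ~77K, per task
    print(sum(p.numel() for p in prompt_encoder.parameters()))              # several million, shared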
- -When p-tuning completes, prompt tuned virtual tokens from the p-tuning ``prompt_encoder`` are automatically moved to the ``prompt_table`` where all prompt tuned and p-tuned soft prompts are stored. The LSTM ``prompt_encoder`` is then removed from the model. This allows us to preserve previously p-tuned soft prompts while still maintaining the ability to add new p-tuned or prompt-tuned soft prompts in the future. The ``prompt_table`` uses the ``taskname`` as a key to look up the correct virtual tokens for a specified task. The ``prompt_table``'s hash table data structure also makes it possible for each task to flexibly use a different number of virtual tokens. - -P-tuning usually requires fewer virtual tokens per task to achieve good results but uses a higher number of parameters compared to prompt-tuning. For example, if you prompt tune a 125M parameter GPT model (with hidden size 768) on two tasks, using 100 virtual tokens per task, the total parameters tuned during prompt tuning would equal 153k (~.1% of the pre-trained model size). If you p-tune the same 125M GPT model on 2 tasks, using an LSTM with two layers and 10 tokens per task, you will be tuning 8.3M parameters (~6.6% of the pre-trained model size). The increased number of parameters used during p-tuning is mitigated by our ``prompt_table``. When p-tuned soft prompts are placed in the prompt table, only the parameters for the predicted virtual tokens are saved. This allows us to keep the benefit of tuning a larger number of parameters during training, while also preserving the parameter efficiency of prompt-tuning during inference and storing of the model. - -Because p-tuning shares parameters between tasks during training, p-tuning your model on multiple tasks that are similar might allow your model to share insight between tasks. In the same vein, p-tuning on many very different tasks at once might perform worse than prompt tuning, which tunes a distinct set of parameters per task. **Generally we recommend using p-tuning over prompt tuning.** - -Users can also optionally tune the model's full parameters in addition to the soft prompt parameters. See ``model.lm_finetune`` in the Prompt Learning Config section for details on how to configure this. - -Dataset Preprocessing -^^^^^^^^^^^^^^^^^^^^^ - -The prompt learning dataset accepts a list of json/dictionary objects or a list of json file names where each json file contains a collection of json objects. Each json object must include the field ``taskname`` which is a string identifier for the task the data example corresponds to. They should also include one or more fields corresponding to different sections of the discrete text prompt. The input data might look like: - -.. code:: - - [ - {"taskname": "squad", "context": [CONTEXT_PARAGRAPH_TEXT1], "question": [QUESTION_TEXT1], "answer": [ANSWER_TEXT1]}, - {"taskname": "squad", "context": [CONTEXT_PARAGRAPH_TEXT2], "question": [QUESTION_TEXT2], "answer": [ANSWER_TEXT2]}, - {"taskname": "intent_and_slot", "utterance": [UTTERANCE_TEXT1], "label": [INTENT_TEXT1][SLOT_TEXT1]}, - {"taskname": "intent_and_slot", "utterance": [UTTERANCE_TEXT2], "label": [INTENT_TEXT2][SLOT_TEXT2]}, - {"taskname": "sentiment", "sentence": [SENTENCE_TEXT1], "label": [SENTIMENT_LABEL1]}, - {"taskname": "sentiment", "sentence": [SENTENCE_TEXT2], "label": [SENTIMENT_LABEL2]}, - ] - -These additional fields can be unlimited in number and will be used to help map different parts of the discrete text input to a prompt template that you define. 
We show how this mapping works and how to construct your prompt template in the Prompt Formatting section. Data examples for each dataset can all be passed to the dataset class in one file, or in separate ``.jsonl`` files in a list. - -.. _data-example-label: - -Prompt Formatting -^^^^^^^^^^^^^^^^^ - -To customize different prompts for different tasks, we simply need to specify the prompt task template in the config file at ``model.task_templates``. The virtual token markers ``<|VIRTUAL_PROMPT_#|>`` signify where you want virtual tokens to be placed in the template string. ``<|VIRTUAL_PROMPT_0|>``, ``<|VIRTUAL_PROMPT_1|>``, and ``<|VIRTUAL_PROMPT_2|>`` indicate where a number of virtual tokens matching the values given at ``virtual_token_splits[0]``, ``virtual_token_splits[1]`` and ``virtual_token_splits[2]`` will be placed. The other variable fields ``{var}`` refer to the fields in the data json. - -For example, given: - -- the data json ``{"sentence1": "And he said, Mama, I'm home.", "sentence2": "He didn't say a word."}`` -- virtual token splits set to ``virtual_token_splits = [3, 3, 3]`` -- a prompt template set to ``prompt_template = "<|VIRTUAL_PROMPT_0|> Hypothesis: [sentence1], <|VIRTUAL_PROMPT_1|> Premise: [sentence2] <|VIRTUAL_PROMPT_2|> Answer:"`` - -the input will be translated into ``VVV Hypothesis: And he said, Mama, I'm home. VVV Premise: He didn't say a word. VVV Answer:``, where ``VVV`` are three virtual tokens. - -**We recommend you first try prompt learning by placing all virtual tokens at the very beginning of your prompt template** like we do with the ``sentiment`` task example below. We've found this gives strong performance. -.. code:: - - config.model.task_templates = [ - { - "taskname": "sentiment", - "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}", - "total_virtual_tokens": 10, - "virtual_token_splits": [10], - "truncate_field": "sentence", - "answer_only_loss": False, - }, - { - "taskname": "intent_and_slot", - "prompt_template": "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}", - "total_virtual_tokens": 10, - "virtual_token_splits": [7, 3], - "truncate_field": None, - "answer_only_loss": True, - "answer_field": "label" - } - ] - -.. _prompt-formatting-label: - -``model.task_templates`` Config Parameters -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -.. list-table:: - :widths: 15 15 25 - :header-rows: 1 - - * - **Parameter** - - **Data type** - - **Description** - * - **taskname** - - string - - Short string denoting the task, used to lookup task specific virtual tokens from the ``prompt_table``. Refers to the same ``taskname`` in the dataset json objects. - * - **prompt_template** - - string - - a string showing the model where to place virtual tokens and how to map dataset json fields to where they belong in the model prompt - * - **total_virtual_tokens** - - int - - specifies the total number of virtual tokens that will be inserted into the model prompt - * - **virtual_token_splits** - - list of ints - - specifies the number of virtual tokens that belong at each ``<|VIRTUAL_PROMPT_#|>`` marker. ``virtual_token_splits`` values should add up to ``total_virtual_tokens``. The number of ``virtual_token_splits`` should match the number of ``<|VIRTUAL_PROMPT_#|>`` markers. - * - **answer_only_loss** - - bool - - Whether to limit loss calculation to only the answer portion of the prompt during tuning. Strongly recommended for long prompts. 
- * - **answer_field** - - string - - The field in the data json corresponding to the answer. The loss will only be calculated on this portion of the prompt if ``answer_only_loss`` is ``True``. The answer field must be at the end of the prompt template. - * - **truncate_field** - - string - - specifies which field in the data json to truncate if the length of the input exceeds the maximum sequence length of the model. If ``truncate_field`` is set to ``None``, examples that are too long are simply dropped from the dataset. - -Prompt Learning Specific Config Values -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -.. list-table:: - :widths: 15 15 25 - :header-rows: 1 - - * - **Parameter** - - **Data type** - - **Description** - * - **model.nemo_path** - - string - - Path to where you want to save your model after prompt tuning/p-tuning, must end in `.nemo` - * - **model.virtual_prompt_style** - - string - - one of 'prompt-tuning', 'p-tuning', or 'inference' - * - **model.language_model_path** - - string - - Path to the GPT language model .nemo file you want to use for prompt learning, not needed if ``restore_path`` is set - * - **model.restore_path** - - string - - Path to a .nemo file of existing ``MegatronGPTPromptLearningModel`` that has already been prompt tuned or p-tuned on at least one task. P-tuned or prompt tuned in this training session will be added to this model's `prompt_table`. Should be set to ``null`` if none. - * - **model.new_tasks** - - list of strings - - List of new tasknames to be prompt or p-tuned, - * - **model.existing_tasks** - - list of strings - - List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given. Should be set to ``[]`` if None. - * - **model.task_templates** - - list - - See the ``model.task_templates`` Config Parameters Table above - * - **model.prompt_tuning.new_prompt_init_methods** - - list of strings - - List of 'text' or 'random', should correspond to the order of tasks listed in ``model.new_tasks``. Only needed if `virtual_prompt_style='prompt-tuning'` - * - **model.prompt_tuning.new_prompt_init_text** - - list of strings - - The text you want to use for soft prompt initalization if ``model.prompt_tuning.new_prompt_init_methods`` is set to 'text' for a task. Should correspond to the order of tasks listed in ``model.new_tasks``. The text is tokenized and clipped or tiled to match ``total_virtual_tokens`` in ``model.task_templates``. The vocab embeddings associated with each token are copied and use to initialize the soft prompts before tuning. - * - **model.p_tuning.dropout** - - float - - LSTM prompt encoder dropout prob - * - **model.p_tuning.num_layers** - - int - - Num layers in LSTM prompt encoder - * - **model.tensor_model_parallel_size** - - int - - intra-layer model parallelism, must match the ``tensor_model_parallel_size`` of the GPT model given at ``language_model_path`` - * - **model.batch_size** - - int - - global batch size - * - **model.data.train_ds** - - list of strings - - list of ``.json`` or ``.jsonl`` training dataset files with json ojects that have the dataset format described above - * - **model.data.validation_ds** - - list of strings - - list of ``.json`` or ``.jsonl`` validation dataset files with json ojects that have the dataset format described above - * - **model.data.add_eos** - - bool - - Whether to add an EOS token at the end of each training example (recommended). 
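To make the interaction between ``prompt_template``, ``virtual_token_splits``, and the data json fields concrete, here is a small string-level sketch. It is illustrative only and is not the NeMo dataset code: the real dataset class inserts learned virtual token *embeddings*, whereas the placeholder ``[V]`` strings below only mark where they would go, and the example data values are made up.

.. code::

    # Illustrative sketch only -- not the actual NeMo dataset implementation.
    def format_prompt(prompt_template, virtual_token_splits, example):
        prompt = prompt_template
        # Stand in for each <|VIRTUAL_PROMPT_#|> marker with the requested
        # number of placeholder tokens.
        for i, n in enumerate(virtual_token_splits):
            prompt = prompt.replace(f"<|VIRTUAL_PROMPT_{i}|>", " ".join(["[V]"] * n))
        # Fill the remaining {field} slots from the data json (unused keys such
        # as "taskname" are simply ignored by str.format).
        return prompt.format(**example)

    example = {"taskname": "sentiment", "sentence": "Stocks rallied today.", "label": "positive"}
    print(format_prompt("<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}", [10], example))
    # -> "[V] [V] [V] [V] [V] [V] [V] [V] [V] [V] Stocks rallied today. sentiment: positive"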
- -An example config file can be found at https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml - -Setting New Tasks -^^^^^^^^^^^^^^^^^ - -After you p-tune or prompt-tune your model, you can always go back and p-tune or prompt-tune your model on more tasks without over writing the virtual prompts who've trained already. You can also use a different number of ``total_virtual_tokens`` between each training session as long as tasks ptuned or prompt tuned at the same time have the same number of ``total_virtual_tokens``. For this reason, when you ptune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus you don't want to tune them). You do this by setting the ``new_tasks`` and ``existing_tasks`` values in the config file. - -Example Multi-Task Prompt Tuning Config and Command -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -First define a config called ``multitask-prompt-learning.yaml`` demonstrated below. **In the** ``exp_manager`` **portion of the config,** ``save_nemo_on_train_end`` **should be set to** ``False`` **to avoid unnecessarily saving the incorrect model weights.** This is already done in the example `megatron_gpt_prompt_learning_config.yaml config `_ that you should use as your starting point. The correct prompt learning model will be saved at the ``model.nemo_path`` you set. - -.. code:: - - name: multitask_prompt_tuning - trainer: ... - exp_manager: ... - model: - seed: 1234 - nemo_path: ${name}.nemo - virtual_prompt_style: "prompt-tuning" - encoder_seq_length: 2048 - tensor_model_parallel_size: 1 - pipeline_model_parallel_size: 1 - global_batch_size: 16 - micro_batch_size: 4 - - restore_path: null - language_model_path: models/megatron_125M_gpt.nemo - existing_tasks: [] - new_tasks: ["sentiment", "intent_and_slot"] - - task_templates: - - taskname: "sentiment" - prompt_template: "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}" - total_virtual_tokens: 100 - virtual_token_splits: [100] - truncate_field: null - answer_only_loss: False - - - taskname: "intent_and_slot" - prompt_template: "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}" - total_virtual_tokens: 100 - virtual_token_splits: [80, 20] - truncate_field: null - answer_only_loss: True - answer_field: "label" - - prompt_tuning: - new_prompt_init_methods: ["text", "text"] - new_prompt_init_text: ["financial sentiment analysis postive neutral negative", "intent and slot classification virtual assistant task bot please"] - - data: - train_ds: ["data/financial_phrase_bank_train.jsonl", "data/assistent_train.jsonl"] - validation_ds: ["data/financial_phrase_bank_val.jsonl", "data/assistent_val.jsonl"] - add_eos: True - shuffle: True - num_workers: 1 - pin_memory: True - - optim: ... - -(See https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml for what should go in the ``trainer``, ``exp_manager``, and ``optim`` sections.) - -Then run the command - -.. code:: - - python megatron_gpt_prompt_learning.py --config-name=multitask-prompt-learning.yaml - - -Example Multi-Task P-Tuning Config and Command After Prompt-Tuning -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Update ``multitask-prompt-learning.yaml`` from the example above with p-tuning parameters for the new task. 
Be sure to update ``model.existing_tasks`` with the tasknames from previous prompt learning runs and to use the ``.nemo`` file saved at the end of your last prompt learning session. Values different from the config above have stars commented next to them. - -In this example, the SQuAD task includes the question context as part of the prompt. Because the context is long, we recommend setting ``answer_only_loss`` to ``True`` for this task, and any task where a significant portion of the prompt is not a part of the answer. ``answer_only_loss`` tells the model to only calculate the cross-entropy loss on the answer portion of the training example. Though we recommend placing all virtual tokens at the beginning of the prompt, we place them throughout the prompt in this example to demonstrate how to do so. - -.. code:: - - name: multitask_p_tuning # *** - trainer: ... - exp_manager: ... - model: - seed: 1234 - nemo_path: ${name}.nemo - virtual_prompt_style: "p-tuning" # *** - encoder_seq_length: 2048 - tensor_model_parallel_size: 1 - pipeline_model_parallel_size: 1 - global_batch_size: 16 - micro_batch_size: 4 - - restore_path: multitask_prompt_tuning.nemo # *** - language_model_path: models/megatron_125M_gpt.nemo - existing_tasks: ["sentiment", "intent_and_slot"] # *** - new_tasks: ["squad"] - - task_templates: - - taskname: "sentiment" - prompt_template: "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}" - total_virtual_tokens: 100 - virtual_token_splits: [100] - truncate_field: null - answer_only_loss: False - - - taskname: "intent_and_slot" - prompt_template: "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}" - total_virtual_tokens: 100 - virtual_token_splits: [80, 20] - truncate_field: null - answer_only_loss: True - answer_field: "label" - - - taskname: "squad" # *** - prompt_template: "<|VIRTUAL_PROMPT_0|> Answer the question from the context {question} {context} Answer: {answer}" # *** - total_virtual_tokens: 9 # *** - virtual_token_splits: [9] # *** - truncate_field: context # *** - answer_only_loss: True # *** - answer_field: "answer" # *** - - p_tuning: # *** - dropout: 0.0 # *** - num_layers: 2 # *** - - data: - train_ds: ["data/squad_train.jsonl"] # *** - validation_ds: ["data/squad_val.jsonl"] # *** - add_eos: True - shuffle: True - num_workers: 1 - pin_memory: True - - optim: ... - -Then run the command again: - -.. code:: - - python megatron_gpt_prompt_learning.py --config-name=multitask-prompt-learning.yaml - - -Example Multi-Task Inference -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -The inference file can contain a mix of prompts from all the tasks the model has been prompt tuned on. - -.. code:: - - python megatron_gpt_prompt_learning_eval.py \ - virtual_prompt_model_file=PATH_TO_NEMO_PROMPT_LEARNING_MODEL_FILE \ - gpt_model_file=PATH_TO_FROZEN_GPT_MODEL_FILE \ - inference.greedy=True \ - inference.add_BOS=False \ - trainer.devices=1 \ - trainer.num_nodes=1 \ - tensor_model_parallel_size=1 \ - pipeline_model_parallel_size=1 \ - prompts=[prompt1,prompt2] - -``virtual_prompt_model_file`` should be a path to a .nemo file saved after p-tuning/prompt tuning and ``model_file`` is still the path to the gpt model's .nemo file. - -prompts in this case should be a list of .json or .jsonl files containing json objects similar to the ones used during prompt learning. They should have keys that match the fields specified in the prompt template. 
Fields can be dropped from the prompt dict and their corresponding section of the prompt template will be automatically removed. - -For example, say the prompt template during p-tuning/prompt-tuning looked like: - -.. code:: - - '<|VIRTUAL_PROMPT_0|> Context: {context} Question: {question} Answer: {answer}' - -but you don't want to include the answer field during inference. Just don't include the answer field in the prompt dict like below: - -.. code:: - - {"taskname": "squad", "context": "some paragraph", "question": "question related to paragraph"} - {"taskname": "squad", "context": "another paragraph", "question": "a different question related to paragraph"} - - -And the dataset class will automatically format your input to have the form: - -.. code:: - - [ - '<|VIRTUAL_PROMPT_0|> Context: some paragraph Question: question related to paragraph Answer: ', - '<|VIRTUAL_PROMPT_0|> Context: another paragraph Question: a different question related to paragraph Answer: ' - ] - -Generally prompt learning inference is just like running inference with a GPT model. The only difference is you need to add ``virtual_prompt_model_file=PATH_TO_NEMO_PROMPT_LEARNING_MODEL_FILE`` to your command if you're using a p-tuned/prompt-tuned model. - -Example prompt learning script: `NeMo/examples/nlp/language_modeling/megatron_gpt_prompt_learning.py.py `__. - -Example prompt tuned inference script: `NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py `__. diff --git a/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb b/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb deleted file mode 100644 index 076a8ffad3df..000000000000 --- a/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb +++ /dev/null @@ -1,786 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "b7a434f4", - "metadata": {}, - "outputs": [], - "source": [ - "BRANCH='main'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "developmental-gibraltar", - "metadata": {}, - "outputs": [], - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell\n", - "\n", - "# install NeMo\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]" - ] - }, - { - "cell_type": "markdown", - "id": "42daf8bf", - "metadata": {}, - "source": [ - "# Introduction\n", - "\n", - "In this notebook we demonstrate how to use p-tuning and prompt tuning within NeMo-Megatron. Both methods are parameter efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model’s full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. 
Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoid catastrophic forgetting issues often encountered when fine-tuning models.\n", - "\n", - "- Our prompt tuning implementation is based off Lester et. al’s EMNLP 2021 paper [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)\n", - "\n", - "- Our p-tuning implementation is based off Liu et al's paper [GPT Understands, Too](https://arxiv.org/abs/2103.10385).\n", - "\n", - "- Command line usage examples and API documentation can be found in [our user docs](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/prompt_learning.html). \n", - "\n", - "\"Prompt\n", - "\n", - "Our continuous learning capability for combined p-tuning and prompt tuning with GPT style models is a NeMo specific extension of the author’s original work.\n", - "\n", - "# The Plan\n", - "\n", - "We are going to show you how to:\n", - " \n", - " 1. P-Tune/Prompt Tune a model on multiple tasks at the same time\n", - " 2. Add a new task to a model that has already been P-Tuned/Prompt Tuned previously\n", - " \n", - "We will first p-tune a GPT model on sentiment analysis, and intent and slot classification tasks. Then we will show how to add the squad question answering task to the same model we already p-tuned once.\n", - "\n", - "\n", - "# Technical Overview\n", - "Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning utilize virtual prompt embeddings that can be optimized via gradient decent. The only difference between prompt tuning and p-tuning within NeMo-Megatron is the architecture used to tune the soft prompt tokens during training.\n", - "\n", - "### Terminology\n", - "We will be using the terms `continuous`, `soft`, and `virtual` token interchangeably to refer to embeddings inserted into the model prompt that have no concrete mapping to strings or characters within the model’s vocabulary. These virtual token embeddings exist in contrast to the `discrete`, `hard`, or `real` tokens that do make up the model’s vocabulary. Virtual tokens are purely 1D vectors with dimensionality equal to that of each real token embedding, matching the `hidden_size` hyperparameter. In training and inference, continuous token embeddings are inserted among discrete token embeddings according to a template you provide in the model’s config. We will demonstrate how to do this below.\n", - "\n", - "When referring to p-tuning and prompt tuning together, we will be using the phrase prompt learning for simplicity.\n", - "\n", - "### Prompt-Tuning\n", - "In prompt-tuning a pretrained GPT model, soft prompt embeddings are initialized as a 2D matrix of size `total_virtual_tokens X hidden_size`. Each task the model is prompt-tuned to perform has its own 2D embedding matrix associated with it. Tasks do not share any parameters during training or inference. All GPT model parameters are frozen and only the embedding parameters for each task are updated during training.\n", - "\n", - "In prompt tuning you can specify how the embeddings are initialized for each task. You can either\n", - "\n", - "1. Initialize embedding parameters according to some random distribution\n", - "2. 
Initialize embedding parameters from existing vocabulary embeddings (recommended)\n", - "\n", - "If you choose to initialize virtual token embeddings from existing embedding weights, you can provide the string of words you want to use for initialization in the model’s config. This string will be tokenized and tiled or truncated to match the specified number of virtual tokens you would like to use (`total_virtual_tokens`). Vocab embeddings are copied and used to initialize the soft prompt embedding matrix for each task. The vocab embeddings themselves are not updated or changed during prompt tuning.\n", - "\n", - "\n", - "### P-Tuning\n", - "In p-tuning, an LSTM model is used to predict virtual token embeddings. We refer to this LSTM model as our `prompt_encoder`. LSTM parameters are randomly initialized at the start of p-tuning. All GPT model parameters are frozen, and only the LSTM weights are updated at each training step. LSTM parameters are shared between all tasks that are p-tuned at the same time, but the LSTM model outputs unique virtual token embeddings for each task. The virtual tokens predicted by the LSTM are inserted among the discrete token input in the exact same manner as with prompt-tuning. You still specify the number of virtual tokens you want to use by setting `total_virtual_tokens` and each virtual token embedding is still a 1D vector of size `hidden_size`.\n", - "\n", - "\n", - "\n", - "# The Best of Both\n", - "A single pretrained GPT model can use both p-tuning and prompt-tuning. While you must decide to use either p-tuning or prompt-tuning for each task you want your model to perform, you can p-tune your model on a set of tasks A, then prompt tune your same model on a different set of tasks B, then finally run inference on tasks from both A and B at the same time. During prompt-tuning or p-tuning, tasks tuned at the same time must use the same number of virtual tokens. During inference, tasks using differing amounts of virtual tokens can be run at the same time.\n", - "\n", - "Please see our [docs for more comparisons between prompt and p-tuning](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/nemo_megatron/prompt_learning.html). \n", - "\n", - "With all that covered, let's get started!\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "31c27562", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import wget" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "0bfc7709", - "metadata": {}, - "source": [ - "# Tasks and Datasets\n", - "We will be using p-tuning to teach our GPT model to do **Question Answering**.\n", - "\n", - "We will be using the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) reading comprehension dataset, consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text. More information on [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) can be found on their website or in their paper by Rajpurkar et. al \"[Know What You Don’t Know: Unanswerable Questions for SQuAD](https://arxiv.org/pdf/1806.03822.pdf)\"." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "e0b0072a", - "metadata": {}, - "source": [ - "# Data Preparation\n", - "\n", - "The prompt learning dataset loader accepts a list of json/dictionary objects or a list of json file names where each json file contains a collection of json objects. 
Each json object must include the field `taskname` which is a string identifier for the task the data example corresponds to. They should also include one or more fields corresponding to different sections of the discrete text prompt. The input data might look like:\n", - "\n", - "```\n", - "[\n", - " {\"taskname\": \"squad\", \"context\": [CONTEXT_PARAGRAPH_TEXT1], \"question\": [QUESTION_TEXT1], \"answer\": [ANSWER_TEXT1]},\n", - " {\"taskname\": \"squad\", \"context\": [CONTEXT_PARAGRAPH_TEXT2], \"question\": [QUESTION_TEXT2], \"answer\": [ANSWER_TEXT2]},\n", - "]\n", - "```\n", - "\n", - "These additional fields can be unlimited in number and will be used to help map different parts of the discrete text input to a prompt template that you define. We will show how this mapping works and how to construct your prompt template in the `Prompt Formatting` section. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0dbd41fd", - "metadata": {}, - "outputs": [], - "source": [ - "# You can replace DATA_DIR and NEMO_DIR with your own locations\n", - "DATA_DIR = \"data\"\n", - "NEMO_DIR = '.'\n", - "\n", - "os.makedirs(DATA_DIR, exist_ok=True)" - ] - }, - { - "cell_type": "markdown", - "id": "504a7b40", - "metadata": {}, - "source": [ - "\n", - "For each dataset we have preprocessing scripts pre-written in NeMo's example directory located in `examples/nlp`. Let's download those now. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e72a1dc1", - "metadata": {}, - "outputs": [], - "source": [ - "# download the preprocessing scripts from github for the purpose of this tutorial\n", - "wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/scripts/dataset_processing/nlp/squad/prompt_learning_squad_preprocessing.py', NEMO_DIR)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "71813919", - "metadata": {}, - "source": [ - "Now let's down load and process the dataset." - ] - }, - { - "cell_type": "markdown", - "id": "816791de", - "metadata": {}, - "source": [ - "### SQuAD Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa16d8ac", - "metadata": {}, - "outputs": [], - "source": [ - "SQUAD_DIR = os.path.join(DATA_DIR, \"SQuAD\")\n", - "os.makedirs(SQUAD_DIR, exist_ok=True)\n", - "\n", - "# Download the SQuAD dataset\n", - "!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json\n", - "!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json\n", - "!mv train-v1.1.json {SQUAD_DIR}\n", - "!mv dev-v1.1.json {SQUAD_DIR}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "64e3e25b", - "metadata": {}, - "outputs": [], - "source": [ - "# Preprocess squad data\n", - "!python $NEMO_DIR/prompt_learning_squad_preprocessing.py --data-dir {SQUAD_DIR}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b562d1de", - "metadata": {}, - "outputs": [], - "source": [ - "# What the squad dataset looks like after processing\n", - "!head -4 $SQUAD_DIR/squad_train.jsonl" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "a385d319", - "metadata": {}, - "source": [ - "We made a `.jsonl` file for each of the train, validation, and testing splits of the squad data. Every `.jsonl` file contains json objects with the fields `taskname`, `context`, `question`, and `answer`. The preprocessing script is called `prompt_learning_squad_preprocessing.py`. 
It should be in your `NEMO_DIR` and at `scripts/dataset_processing/nlp/squad/prompt_learning_squad_preprocessing.py` in the NeMo repo. \n", - "\n", - "The SQuAD dataset consists of various topics like `Beyoncé`, `IPod`, and `Symbiosis`. Each topic has several paragraphs associated with it, and each paragraph has several questions and answers related to it. When we separated the train/validation/test splits, we separated them on the topic level. For example, if the training set contains paragraphs and questions about the topic `Beyoncé`, neither the validation nor test sets will contain any questions on this topic. All questions about a certain topic are isolated to one split of the data. \n", - "\n", - "Like the Financial PhraseBank Dataset, we randomly selected 80% of the questions for training, 10% for validation, and 10% for test. This resulted in `69125` test examples, `8952` validation examples, and `8744` testing examples. The `answer` field was removed from test examples.\n", - "\n", - "Training on the full train split could take a lot of time, so we are going to clip the train split to 2k examples for the sake of this tutorial, and limit the validation dataset to 200 samples." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0f1473ba", - "metadata": {}, - "outputs": [], - "source": [ - "! head -2000 $SQUAD_DIR/squad_train.jsonl > $SQUAD_DIR/squad_short_train.jsonl\n", - "! head -200 $SQUAD_DIR/squad_val.jsonl > $SQUAD_DIR/squad_short_val.jsonl\n" - ] - }, - { - "cell_type": "markdown", - "id": "2e19c8dc", - "metadata": {}, - "source": [ - "# P-Tuning Model Config Setup\n", - "\n", - "Now we will begin setting up the config file used for prompt/p-tuning our GPT models! GPT Prompt learning within NeMo uses a class called `MegatronGPTPromptLearningModel` which has its own config file. We will start by loading an example prompt learning config file, then make changes to it to fit our tasks and training plans. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5749c387", - "metadata": {}, - "outputs": [], - "source": [ - "from omegaconf import OmegaConf\n", - "\n", - "CONFIG_DIR = os.path.join(NEMO_DIR, \"conf\")\n", - "os.makedirs(CONFIG_DIR, exist_ok=True)\n", - "\n", - "# Download the example config file\n", - "wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml', CONFIG_DIR)\n", - "\n", - "# Load the example config file so we can start editing it\n", - "CONFIG_PATH = os.path.join(CONFIG_DIR, \"megatron_gpt_prompt_learning_config.yaml\")\n", - "config = OmegaConf.load(CONFIG_PATH)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ce966bcf", - "metadata": {}, - "source": [ - "First let's set the datasets we've created in the config. We are going to start by p-tuning a GPT model on a small subset of the **Squad** task. We do this by setting the following config params below: " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6bb1590f", - "metadata": {}, - "outputs": [], - "source": [ - "config.model.data.train_ds = [f\"{SQUAD_DIR}/squad_short_train.jsonl\"]\n", - "config.model.data.validation_ds = [f\"{SQUAD_DIR}/squad_short_val.jsonl\"]" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "4e021b24", - "metadata": {}, - "source": [ - "### Prompt Formatting\n", - "Now that we have our dataset, lets define what we want the prompt to look like. 
\n", - "\n", - "The squad dataset json files contain fields named \"context\", \"question\" and \"answer\". The prompt formatting template allows us to arrange these fields and decide where to insert virtual prompts. We can add the `<|VIRTUAL_PROMPT_0|>` token anywhere between the fields (although we recommend simply adding it in the leftmost position will be sufficient).\n", - "\n", - "For example, given a data jsonl file with examples like this: \n", - "\n", - "\n", - "**{\"taskname\": \"squad\", \"context\": \"Super Bowl 50 was an American football ga... numerals 50.\", \"question\": \"What does AFC stand for?\", \"answer\": \"American Football Conference\"}**. \n", - "\n", - "\n", - "We can create a prompt template set to `prompt_template = \"<|VIRTUAL_PROMPT_0|> Context: {context}\\n\\nquestion: {question}\\n\\nanswer: {answer}\"` other options are also possible, for example the `\\n` can be replaced with whitespace or the other of the context and question can be swapped. The answer however, should be at the end.\n", - "\n", - "Let's configure the prompt template for the task below:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f935b411", - "metadata": {}, - "outputs": [], - "source": [ - "config.model.task_templates = [\n", - " \n", - " {\n", - " \"taskname\": \"squad\",\n", - " \"prompt_template\": \"<|VIRTUAL_PROMPT_0|> Context: {context}\\n\\nQuestion: {question}\\n\\nAnswer:{answer}\",\n", - " \"total_virtual_tokens\": 15,\n", - " \"virtual_token_splits\": [15],\n", - " \"truncate_field\": \"context\",\n", - " \"answer_only_loss\": True,\n", - " \"answer_field\": \"answer\",\n", - " },\n", - " \n", - "]" - ] - }, - { - "cell_type": "markdown", - "id": "dcc438b5", - "metadata": {}, - "source": [ - "Note each `task_template` item has 5 fields. \n", - "\n", - "- **`prompt_template`** is a string showing the model where to place virtual tokens and how to map dataset json fields to where they belong in the model prompt. \n", - "\n", - "\n", - "- **`taskname`** refers to the same `taskname` in the dataset json objects. \n", - "\n", - "\n", - "- **`total_virtual_tokens`** specifies the total number of virtual tokens that will be inserted into the model prompt.\n", - "\n", - "\n", - "- **`virtual_token_splits`** specifies the number of virtual tokens that belong at each `<|VIRTUAL_PROMPT_#|>` marker. `virtual_token_splits` values should add up to `total_virtual_tokens`. The number of `virtual_token_splits` should match the number of `<|VIRTUAL_PROMPT_#|>` markers. \n", - "\n", - "\n", - "- **`truncate_field`** specifies which field in the data json to truncate if the length of the input exceeds the maximum sequence length of the model. If `truncate_field` is set to `None`, examples that are too long are simply dropped from the dataset.\n", - "\n", - "\n", - "- **`answer_only_loss`** Whether to limit loss calculation to only the answer portion of the prompt during tuning. `True` Strongly recommended for long prompts, but shorter prompts with single word answers seem to benefit from setting this to `False`. \n", - "\n", - "\n", - "- **`answer_field`** The field in the data json corresponding to the answer. The loss will only be calculated on this portion of the prompt if `answer_only_loss` is `True`. The answer field must be at the end of the prompt template.\n", - "\n", - "In the `task_templates` we set above, `squad` has a different number of virtual tokens than `sentiment` and `intent_and_slot`. 
This is because we will be p-tuning on `squad` after we p-tune on the other two tasks and **we do not need to use the same number of virtual tokens between sessions**. We also set the `truncate` field for squad because the context can sometimes be longer than the model's max sequence length, and we want that field to be truncated if the example is too long. Lastly, we set `answer_only_loss` to true for `squad` due to the longer prompt. We've found `answer_only_loss=True` to work significantly better for this task." - ] - }, - { - "cell_type": "markdown", - "id": "84579c7a", - "metadata": {}, - "source": [ - "### Setting New Tasks\n", - "After you p-tune your model this time, you can always go back and p-tune or prompt-tune your model on more tasks without over writing the virtual prompts who've trained this time. You can also use a different number of `total_virtual_tokens` between each training session as long as tasks p-tuned or prompt tuned at the same time have the same number of `total_virtual_tokens`. For this reason, when you p-tune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus you don't want to tune them). \n", - "\n", - "You do this by setting the `new_tasks` and `existing_tasks` values in the config file. Because we are p-tuning a model with no existing tasks, you should set `existing_tasks=[]` and `new_tasks=[\"sentiment\", \"intent_and_slot\"]` as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "57a73e01", - "metadata": {}, - "outputs": [], - "source": [ - "config.model.existing_tasks = []\n", - "config.model.new_tasks = [\"squad\"]" - ] - }, - { - "cell_type": "markdown", - "id": "3b77e88c", - "metadata": {}, - "source": [ - "After p-tuning and/or prompt tuning is complete, you can run inference on all tasks at the same time, regardless of their `total_virtual_tokens` value." - ] - }, - { - "cell_type": "markdown", - "id": "a0d5017e", - "metadata": {}, - "source": [ - "### Setting The Pre-Trained GPT Model\n", - "We still need to set which GPT model we want to p-tune/prompt tune. Prompt learning methods work best with large GPT language models (5B or above), but the purposes of this tutorial, we are going to download a 345M parameter GPT model from NVIDIA NGC." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "48cdf868", - "metadata": {}, - "outputs": [], - "source": [ - "# Check what GPT .nemo models we have available on NGC\n", - "from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel\n", - "MegatronGPTModel.list_available_models()" - ] - }, - { - "cell_type": "markdown", - "id": "ede350ed", - "metadata": {}, - "source": [ - "If we wanted to use the GPT model class directly, we could instantiate a trainer then download the model by calling running \n", - "`gpt_model = MegatronGPTModel.from_pretrained(model_name=\"megatron_gpt_345m\", trainer=trainer).cuda()`. But we just need the `.nemo` file in our working NeMo directory in this tutorial, so we will download it using `wget`. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "364439a1", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Download the model from NGC\n", - "gpt_file_name = \"megatron_gpt_345m.nemo\"\n", - "!wget -nc --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/megatron_gpt_345m/versions/1/files/megatron_gpt_345m.nemo -O {NEMO_DIR}/{gpt_file_name}" - ] - }, - { - "cell_type": "markdown", - "id": "1d6a8a67", - "metadata": {}, - "source": [ - "Now that we have a `.nemo` GPT file to work with. We need to add its path in our prompt learning config. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2778a5fa", - "metadata": {}, - "outputs": [], - "source": [ - "# Set GPT model path on prompt learning config\n", - "config.model.language_model_path = gpt_file_name" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "943a9c83", - "metadata": {}, - "source": [ - "We can also set where we want the final prompt tuned model to be saved by setting `model.nemo_path`. By default the tuned prompt learning model will be saved in your current working directory to a `.nemo` file with the same name as your experiment (`config.name`). Let's change the save name to be `p_tuned_gpt.nemo`. **Your model path must end in `.nemo`.**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a278cbdf", - "metadata": {}, - "outputs": [], - "source": [ - "config.exp_manager.checkpoint_callback_params.save_nemo_on_train_end= True\n", - "config.exp_manager.checkpoint_callback_params.always_save_nemo= True\n", - "config.exp_manager.checkpoint_callback_params.save_best_model= True" - ] - }, - { - "cell_type": "markdown", - "id": "378a73e7", - "metadata": {}, - "source": [ - "### Setting P-Tuning Specific Params\n", - "Within the config file, p-tuning and prompt-tuning each have a couple of hyperparameters specific to them. We first need to tell the model that we want to do p-tuning, not prompt-tuning. To do this, we set the **`model.virtual_prompt_style`** hyperparameter like this:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "68763763", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo.collections.nlp.modules.common import VirtualPromptStyle\n", - "config.model.virtual_prompt_style = VirtualPromptStyle.P_TUNING" - ] - }, - { - "cell_type": "markdown", - "id": "947dec63", - "metadata": {}, - "source": [ - "Then we can set the 2 p-tuning specific parameters. Reminder, p-tuning uses an LSTM prompt encoder to predict virtual tokens. \n", - "\n", - "- **`p_tuning.dropout`** the LSTM prompt encoder dropout probability \n", - "- **`p_tuning.num_layers`** the number of LSTM layers you want your p-tuning prompt encoder to have\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "03f893ef", - "metadata": {}, - "outputs": [], - "source": [ - "config.model.p_tuning.dropout = 0.0\n", - "config.model.p_tuning.num_layers = 2\n", - "config.model.global_batch_size = 2\n", - "config.model.micro_batch_size = 1" - ] - }, - { - "cell_type": "markdown", - "id": "a988d16e", - "metadata": {}, - "source": [ - "Let's have a look at all the values we've set in the model config. You can change any of these values in the same manner we've been using above. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "12a37ada", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Final model config\n", - "print(OmegaConf.to_yaml(config.model))" - ] - }, - { - "cell_type": "markdown", - "id": "6b4bc7f3", - "metadata": {}, - "source": [ - "### Setting Prompt-Tuning Specific Params\n", - "\n", - "Though we are not using prompt tuning in this training session, let's go over the prompt tuning specific parameters we would use if we were. \n", - "\n", - "- **`prompt_tuning.new_prompt_init_methods`** Whether you want to initialize virtual token embeddings from the embeddings of existing parts of the model's vocabulary (either 'text' or 'random')\n", - "- **`prompt_tuning.new_prompt_init_text`** The text you want to use if you have 'text' in the list above, should be None otherwise. \n", - "\n", - "Each of the above hyperparameters are a list of strings. \n", - "\n", - "`new_prompt_init_methods` would look like `[\"text\", \"random\", \"text\", \"text\"]` if you were prompt tuning on 4 tasks at once, and you wanted the second task in `new_tasks` to use random initialization. \n", - "\n", - "`new_prompt_init_text` might look like `[\"some text I want to use\", None, \"some other text\", \"task text goes here\"]` for those four new tasks. \n", - "\n", - "The order of both should correspond to the order of the tasks you have listed in `model.new_tasks`. " - ] - }, - { - "cell_type": "markdown", - "id": "4c048852", - "metadata": {}, - "source": [ - "# Building the PyTorch Lightning Trainer\n", - "NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.\n", - "\n", - "Let's first instantiate a Trainer object" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90f85b2a", - "metadata": {}, - "outputs": [], - "source": [ - "import torch\n", - "import pytorch_lightning as pl\n", - "from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategyNotebook\n", - "from pytorch_lightning.plugins.environments import TorchElasticEnvironment\n", - "\n", - "# let's modify some trainer configs\n", - "# check if we have GPU available and uses it\n", - "accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'\n", - "config.trainer.accelerator = accelerator\n", - "config.trainer.devices = 1\n", - "config.trainer.max_epochs = 4\n", - "config.trainer.val_check_interval = 1.0\n", - "\n", - "# for PyTorch Native AMP set precision=16\n", - "config.trainer.precision = 16 if torch.cuda.is_available() else 32\n", - "\n", - "# setup cluster environment parameters\"\n", - "# use torch elastic cluster environment so `create_process_externally` is True\n", - "# the launcher is set to None. 
It will not try to spawn new processes.\n", - "# It won't create the misconfiguration error because of the `interactive session`\n", - "os.environ[\"LOCAL_RANK\"] = '0'\n", - "os.environ[\"RANK\"] = '0'\n", - "os.environ[\"WORLD_SIZE\"] = '1'\n", - "\n", - "strategy = NLPDDPStrategyNotebook(find_unused_parameters=False, no_ddp_communication_hook=True)\n", - "plugins = [TorchElasticEnvironment()]\n", - "trainer = pl.Trainer(plugins= plugins, strategy=strategy, **config.trainer)\n", - "\n", - "print(\"Trainer config - \\n\")\n", - "print(OmegaConf.to_yaml(config.trainer))" - ] - }, - { - "cell_type": "markdown", - "id": "4d0124c1", - "metadata": {}, - "source": [ - "# Setting up a NeMo Experiment\n", - "\n", - "NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f2c943ba", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo.utils.exp_manager import exp_manager\n", - "\n", - "# Set name of the experiment \n", - "config.name = 'p_tuning'\n", - "config.exp_manager.resume_if_exists = False\n", - "\n", - "# Init the experiment manager and view the exp_dir\n", - "exp_dir = exp_manager(trainer, config.get(\"exp_manager\", None))\n", - "exp_dir = str(exp_dir)\n", - "print(exp_dir)" - ] - }, - { - "cell_type": "markdown", - "id": "5860bd90", - "metadata": {}, - "source": [ - "We can also set learning hyperparameters as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4c4ec542", - "metadata": {}, - "outputs": [], - "source": [ - "# Set some of the learning parameters\n", - "config.model.optim.lr = 1e-4\n", - "config.model.precision = config.trainer.precision" - ] - }, - { - "cell_type": "markdown", - "id": "298b3dce", - "metadata": {}, - "source": [ - "# First P-Tuning Session\n", - "The only thing left to do is load up the model and begin p-tuning!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b4bda19b", - "metadata": {}, - "outputs": [], - "source": [ - "from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import MegatronGPTPromptLearningModel\n", - "\n", - "model = MegatronGPTPromptLearningModel(cfg=config.model, trainer=trainer)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2d99f433", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# Training set to 2 epochs by default in a cell above\n", - "# Each epoch will take around 1min 15sec, but training time can vary\n", - "trainer.fit(model)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "6aab09d4", - "metadata": {}, - "source": [ - "# Inference After P-Tuning\n", - "One way to run inference after p-tuning or prompt-tuning your model is to call `model.generate()`. `model.generate()` takes in \n", - "\n", - "- `inputs` which can be either a list of dictionary objects or `.jsonl` files containing dictionary objects, \n", - "- `length_params`\n", - "- `sampling_params`\n", - "\n", - "as arguments. More information about the [text generation API can be found here](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/transformer/text_generation.py).\n", - "\n", - "If `length_params` and `sampling_params` are set to `None`, the model generates output with a greedy decoding strategy and generates up to `30` new tokens. Most predictive downstream tasks (not text generation tasks), use greedy sampling. 
To see other ways to run inference with your prompt learning model and more details on how to define various inference parameters, visit `examples/nlp/language_modeling/megatron_gpt_eval.py`.\n", - "\n", - "Below are some randomly selected test examples from the sentiment classification and intent and slot classification test files. Notice that the `label` field is dropped from all test examples. The `MegatronPromptLearningDataset` called within `.generate()` automatically leaves fields in the prompt template empty when they are not provided in the data json. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dc95e764", - "metadata": {}, - "outputs": [], - "source": [ - "test_examples = [\n", - " {\"taskname\": \"squad\", \"context\": \"The build was released for download later in the day in standard 32-bit and 64-bit versions, plus a special 64-bit version which included SDKs and developer tools (Visual Studio Express and Expression Blend) for developing Metro-style apps. The Windows Store was announced during the presentation, but was not available in this build. According to Microsoft, there were about 535,000 downloads of the developer preview within the first 12 hours of its release. Originally set to expire on March 11, 2012, in February 2012 the Developer Preview's expiry date was changed to January 15, 2013.\", \"question\": \"When was the Developer preview initially intended to expire?\"},\n", - " {\"taskname\": \"squad\", \"context\": \"The structures of most federal governments incorporate mechanisms to protect the rights of component states. One method, known as 'intrastate federalism', is to directly represent the governments of component states in federal political institutions. Where a federation has a bicameral legislature the upper house is often used to represent the component states while the lower house represents the people of the nation as a whole. A federal upper house may be based on a special scheme of apportionment, as is the case in the senates of the United States and Australia, where each state is represented by an equal number of senators irrespective of the size of its population.\", \"question\": \"What is a bicameral legislature?\"},\n", - " {\"taskname\": \"squad\", \"context\": \"Imported mystery religions, which offered initiates salvation in the afterlife, were a matter of personal choice for an individual, practiced in addition to carrying on one's family rites and participating in public religion. The mysteries, however, involved exclusive oaths and secrecy, conditions that conservative Romans viewed with suspicion as characteristic of \\\"magic\\\", conspiratorial (coniuratio), or subversive activity. 
Sporadic and sometimes brutal attempts were made to suppress religionists who seemed to threaten traditional morality and unity, as with the senate's efforts to restrict the Bacchanals in 186 BC.\", \"question\": \"What was the practice of religion to the Romans?\"}\n", - "]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "74a5a358", - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "response = model.generate(inputs=test_examples, length_params=None)\n", - "\n", - "print('The prediction results of some sample queries with the trained model:')\n", - "for result in response['sentences']:\n", - " print(result)\n", - " print(\"-\" * 30)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.16" - } - }, - "nbformat": 4, - "nbformat_minor": 5 - } From 392875ae5131af45345315cec9d8564f1ecce73a Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Sun, 28 Jan 2024 10:24:31 -0800 Subject: [PATCH 3/8] fix table Signed-off-by: Chen Cui --- docs/source/nlp/nemo_megatron/peft/landing_page.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/nlp/nemo_megatron/peft/landing_page.rst b/docs/source/nlp/nemo_megatron/peft/landing_page.rst index 6f14c189e0f6..0f33b6835874 100644 --- a/docs/source/nlp/nemo_megatron/peft/landing_page.rst +++ b/docs/source/nlp/nemo_megatron/peft/landing_page.rst @@ -15,10 +15,10 @@ transformer-based models. ==================== ===== ======== ========= ====== == \ GPT 3 Nemotron LLaMa 1/2 Falcon T5 ==================== ===== ======== ========= ====== == -Adapters (Canonical) ✅ ✅ ✅ ✅ ✅ -LoRA ✅ ✅ ✅ ✅ ✅ -IA3 ✅ ✅ ✅ ✅ ✅ -P-Tuning ✅ ✅ ✅ ✅ ✅ +Adapters (Canonical) ✅ ✅ ✅ ✅ ✅ +LoRA ✅ ✅ ✅ ✅ ✅ +IA3 ✅ ✅ ✅ ✅ ✅ +P-Tuning ✅ ✅ ✅ ✅ ✅ ==================== ===== ======== ========= ====== == Learn more about PEFT in NeMo with the :ref:`peftquickstart` which provides an overview on how PEFT works From e0434aee6bfa67ab265f7805e855f04374298922 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Sun, 28 Jan 2024 10:29:38 -0800 Subject: [PATCH 4/8] fix table Signed-off-by: Chen Cui --- docs/source/nlp/nemo_megatron/peft/landing_page.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/nlp/nemo_megatron/peft/landing_page.rst b/docs/source/nlp/nemo_megatron/peft/landing_page.rst index 0f33b6835874..bf9cb6907ab3 100644 --- a/docs/source/nlp/nemo_megatron/peft/landing_page.rst +++ b/docs/source/nlp/nemo_megatron/peft/landing_page.rst @@ -15,10 +15,10 @@ transformer-based models. 
==================== ===== ======== ========= ====== == \ GPT 3 Nemotron LLaMa 1/2 Falcon T5 ==================== ===== ======== ========= ====== == -Adapters (Canonical) ✅ ✅ ✅ ✅ ✅ -LoRA ✅ ✅ ✅ ✅ ✅ -IA3 ✅ ✅ ✅ ✅ ✅ -P-Tuning ✅ ✅ ✅ ✅ ✅ +Adapters (Canonical) ✅ ✅ ✅ ✅ ✅ +LoRA ✅ ✅ ✅ ✅ ✅ +IA3 ✅ ✅ ✅ ✅ ✅ +P-Tuning ✅ ✅ ✅ ✅ ✅ ==================== ===== ======== ========= ====== == Learn more about PEFT in NeMo with the :ref:`peftquickstart` which provides an overview on how PEFT works From c1939bc118cc5f57a93a4aba93cbdb26a8699180 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Sun, 28 Jan 2024 10:58:02 -0800 Subject: [PATCH 5/8] fix table Signed-off-by: Chen Cui --- docs/source/nlp/nemo_megatron/peft/landing_page.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/nlp/nemo_megatron/peft/landing_page.rst b/docs/source/nlp/nemo_megatron/peft/landing_page.rst index bf9cb6907ab3..4461feffb5db 100644 --- a/docs/source/nlp/nemo_megatron/peft/landing_page.rst +++ b/docs/source/nlp/nemo_megatron/peft/landing_page.rst @@ -15,10 +15,10 @@ transformer-based models. ==================== ===== ======== ========= ====== == \ GPT 3 Nemotron LLaMa 1/2 Falcon T5 ==================== ===== ======== ========= ====== == -Adapters (Canonical) ✅ ✅ ✅ ✅ ✅ LoRA ✅ ✅ ✅ ✅ ✅ -IA3 ✅ ✅ ✅ ✅ ✅ P-Tuning ✅ ✅ ✅ ✅ ✅ +Adapters (Canonical) ✅ ✅ ✅ ✅ +IA3 ✅ ✅ ✅ ✅ ==================== ===== ======== ========= ====== == Learn more about PEFT in NeMo with the :ref:`peftquickstart` which provides an overview on how PEFT works From 0c6e1217a7f6f3820aec8b5940cd043f34a3c364 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Tue, 30 Jan 2024 16:42:28 -0800 Subject: [PATCH 6/8] Merge branch 'r1.23.0' into chcui/update_peft_doc Signed-off-by: Chen Cui --- tutorials/00_NeMo_Primer.ipynb | 2 +- tutorials/01_NeMo_Models.ipynb | 5302 ++++++++--------- tutorials/02_NeMo_Adapters.ipynb | 2 +- tutorials/AudioTranslationSample.ipynb | 2 +- ...blish_NeMo_Model_On_Hugging_Face_Hub.ipynb | 2 +- tutorials/VoiceSwapSample.ipynb | 2 +- .../asr/ASR_CTC_Language_Finetuning.ipynb | 2 +- tutorials/asr/ASR_for_telephony_speech.ipynb | 682 +-- tutorials/asr/ASR_with_NeMo.ipynb | 2350 ++++---- 9 files changed, 4174 insertions(+), 4172 deletions(-) diff --git a/tutorials/00_NeMo_Primer.ipynb b/tutorials/00_NeMo_Primer.ipynb index 50aa60260b35..3bef999ae225 100644 --- a/tutorials/00_NeMo_Primer.ipynb +++ b/tutorials/00_NeMo_Primer.ipynb @@ -42,7 +42,7 @@ "!pip install text-unidecode\n", "\n", "# ## Install NeMo\n", - "BRANCH = 'main'\n", + "BRANCH = 'r1.23.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Install TorchAudio\n", diff --git a/tutorials/01_NeMo_Models.ipynb b/tutorials/01_NeMo_Models.ipynb index 4255a6656b8a..130518b39498 100644 --- a/tutorials/01_NeMo_Models.ipynb +++ b/tutorials/01_NeMo_Models.ipynb @@ -1,2654 +1,2652 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "01_NeMo_Models.ipynb", - "provenance": [], - "collapsed_sections": [], - "toc_visible": true - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "ASnx4b5jXsil" - }, - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. 
Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell.\n", - "\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install sox libsndfile1 ffmpeg\n", - "!pip install text-unidecode\n", - "\n", - "# ## Install NeMo\n", - "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "## Install TorchAudio\n", - "!pip install torchaudio>=0.10.0 -f https://download.pytorch.org/whl/torch_stable.html\n", - "\n", - "## Grab the config we'll use in this example\n", - "!mkdir configs" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "a0eAURFKXdFT" - }, - "source": [ - "# minGPT License\n", - "\n", - "*This notebook port's the [minGPT codebase](https://github.com/karpathy/minGPT) into equivalent NeMo code. The license for minGPT has therefore been attached here.*\n", - "\n", - "```\n", - "The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy\n", - "\n", - "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", - "\n", - "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", - "\n", - "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2b7Z064UZFH9" - }, - "source": [ - "# torch-rnn License\n", - "*This notebook utilizes the `tiny-shakespeare` dataset from the [torch-rnn](https://github.com/jcjohnson/torch-rnn) codebase. 
The license for torch-rnn has therefore been attached here.*\n", - "\n", - "```\n", - "The MIT License (MIT)\n", - "\n", - "Copyright (c) 2016 Justin Johnson\n", - "\n", - "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", - "\n", - "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", - "\n", - "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eKzK-Z7obCED" - }, - "source": [ - "-------\n", - "\n", - "***Note: This notebook will intentionally introduce some errors to show the power of Neural Types or model development concepts, inside the cells marked with `[ERROR CELL]`. The explanation of and resolution of such errors can be found in the subsequent cells.***\n", - "\n", - "-----" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "81qdv0mPee-j" - }, - "source": [ - "# The NeMo Model\n", - "\n", - "NeMo comes with several state-of-the-art pre-trained Conversational AI models for users to quickly be able to start training and fine-tuning on their own datasets. \n", - "\n", - "In the previous [NeMo Primer](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb) notebook, we learned how to download pretrained checkpoints with NeMo and we also discussed the fundamental concepts of the NeMo Model. The previous tutorial showed us how to use, modify, save, and restore NeMo Models.\n", - "\n", - "In this tutorial we will learn how to develop a non-trivial NeMo model from scratch. This helps us to understand the underlying components and how they interact with the overall PyTorch ecosystem.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nKNftwxzllth" - }, - "source": [ - "-------\n", - "At the heart of NeMo lies the concept of the \"Model\". For NeMo developers, a \"Model\" is the neural network(s) as well as all the infrastructure supporting those network(s), wrapped into a singular, cohesive unit. As such, most NeMo models are constructed to contain the following out of the box (note: some NeMo models support additional functionality specific to the domain/use case!) 
- \n", - "\n", - " - Neural Network architecture - all of the modules that are required for the model.\n", - "\n", - " - Dataset + Data Loaders - all of the components that prepare the data for consumption during training or evaluation.\n", - "\n", - " - Preprocessing + Postprocessing - any of the components that process the datasets so the modules can easily consume them.\n", - "\n", - " - Optimizer + Schedulers - basic defaults that work out of the box and allow further experimentation with ease.\n", - "\n", - " - Any other supporting infrastructure - tokenizers, language model configuration, data augmentation, etc." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5VOoAQT1mipO" - }, - "source": [ - "# Constructing a NeMo Model\n", - "\n", - "NeMo \"Models\" are comprised of a few key components, so let's tackle them one by one. We will attempt to go in the order that's stated above.\n", - "\n", - "To make this slightly challenging, let's port a model from the NLP domain this time. Transformers are all the rage, with BERT and his friends from Sesame Street forming the core infrastructure for many NLP tasks. \n", - "\n", - "An excellent (yet simple) implementation of one such model - GPT - can be found in the `minGPT` repository - https://github.com/karpathy/minGPT. While the script is short, it explains and succinctly explores all of the core components we expect in a NeMo model, so it's a prime candidate for NeMo! Sidenote: NeMo supports GPT in its NLP collection, and as such, this notebook aims to be an in-depth development walkthrough for such models.\n", - "\n", - "In the following notebook, we will attempt to port minGPT to NeMo, and along the way, discuss some core concepts of NeMo itself." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fOlQKsaRot1l" - }, - "source": [ - "# Constructing the Neural Network Architecture\n", - "\n", - "First, on the list - the neural network that forms the backbone of the NeMo Model.\n", - "\n", - "So how do we create such a model? Using PyTorch! As you'll see below, NeMo components are compatible with all of PyTorch, so you can augment your workflow without ever losing the flexibility of PyTorch itself!\n", - "\n", - "Let's start with a couple of imports - " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "piLOgwOPX1FS" - }, - "source": [ - "import torch\n", - "import nemo\n", - "from nemo.core import NeuralModule\n", - "from nemo.core import typecheck" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yySYjHgAqVvT" - }, - "source": [ - "## Neural Module\n", - "Wait, what's `NeuralModule`? Where is the wonderful `torch.nn.Module`? \n", - "\n", - "`NeuralModule` is a subclass of `torch.nn.Module`, and it brings with it a few additional functionalities.\n", - "\n", - "In addition to being a `torch.nn.Module`, thereby being entirely compatible with the PyTorch ecosystem, it has the following capabilities - \n", - "\n", - "1) `Typing` - It adds support for `Neural Type Checking` to the model. `Typing` is optional but quite useful, as we will discuss below!\n", - "\n", - "2) `Serialization` - Remember the `OmegaConf` config dict and YAML config files? Well, all `NeuralModules` inherently supports serialization/deserialization from such config dictionaries!\n", - "\n", - "3) `FileIO` - This is another entirely optional file serialization system. Does your `NeuralModule` require some way to preserve data that can't be saved into a PyTorch checkpoint? 
Write your serialization and deserialization logic in two handy methods! **Note**: When you create the final NeMo Model, this will be implemented for you! Automatic serialization and deserialization support of NeMo models!\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "bseLiNoqqQrE" - }, - "source": [ - "class MyEmptyModule(NeuralModule):\n", - "\n", - " def forward(self):\n", - " print(\"Neural Module ~ hello world!\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "j4Q36L5urdOQ" - }, - "source": [ - "x = MyEmptyModule()\n", - "x()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lHXAcn5Ot_1I" - }, - "source": [ - "## Neural Types\n", - "\n", - "Neural Types? You might be wondering what that term refers to.\n", - "\n", - "Almost all NeMo components inherit the class `Typing`. `Typing` is a simple class that adds two properties to the class that inherits it - `input_types` and `output_types`. A NeuralType, by its shortest definition, is simply a semantic tensor. It contains information regarding the semantic shape the tensor should hold, as well as the semantic information of what that tensor represents. That's it.\n", - "\n", - "So what semantic information does such a typed tensor contain? Let's take an example below.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ezOJERbVwG34" - }, - "source": [ - "------\n", - "Across the Deep Learning domain, we often encounter cases where tensor shapes may match, but the semantics don't match at all. For example take a look at the following rank 3 tensors - " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZvC57bbxwXxN" - }, - "source": [ - "# Case 1:\n", - "embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", - "x = torch.randint(high=10, size=(1, 5))\n", - "print(\"x :\", x)\n", - "print(\"embedding(x) :\", embedding(x).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "sMaqhMBgxe2C" - }, - "source": [ - "# Case 2\n", - "lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", - "x = torch.randn(1, 5, 1)\n", - "print(\"x :\", x)\n", - "print(\"lstm(x) :\", lstm(x)[0].shape) # Let's take all timestep outputs of the LSTM" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9IQHjki-yezX" - }, - "source": [ - "-------\n", - "As you can see, the output of Case 1 is an embedding of shape [1, 5, 30], and the output of Case 2 is an LSTM output (state `h` over all time steps), also of the same shape [1, 5, 30].\n", - "\n", - "Do they have the same shape? **Yes**.
If we compare the `.shape` of Case 1 with the `.shape` of Case 2 using `==`, will we get True as an output? **Yes**. <br>
\n", - "Do they represent the same concept? **No**.
\n", - "\n", - "\n", - "The ability to recognize that the two tensors do not represent the same semantic information is precisely why we utilize Neural Types. It contains the information of both the shape and the semantic concept of what that tensor represents. If we performed a neural type check between the two outputs of those tensors, it would raise an error saying semantically they were different things (more technically, it would say that they are `INCOMPATIBLE` with each other)!\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ucP0hNI7vWrU" - }, - "source": [ - "--------\n", - "\n", - "You may have read of concepts such as [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html). While conceptually similar, Neural Types attached by NeMo are not as tightly bound to the PyTorch ecosystem - practically any object of a class can be attached with a neural type!\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Uvf5oLt9zxSS" - }, - "source": [ - "## Neural Types - Usage\n", - "\n", - "Neural Types sound interesting, so how do we go about adding them? Let's take a few cases below. \n", - "\n", - "Neural Types are one of the core foundations of NeMo - you will find them in a vast majority of Neural Modules, and every NeMo Model will have its Neural Types defined. While they are entirely optional and not intrusive, NeMo takes great care to support it so that there is no semantic incompatibility between components being used by users." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eTizOBUg0qIB" - }, - "source": [ - "Let's start with a basic example of a type checked module." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "yp0FG8NJt1Jd" - }, - "source": [ - "from nemo.core.neural_types import NeuralType\n", - "from nemo.core.neural_types import *" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "3tsgs8Fp0-WV" - }, - "source": [ - "class EmbeddingModule(NeuralModule):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", - "\n", - " @typecheck()\n", - " def forward(self, x):\n", - " return self.embedding(x)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'x': NeuralType(axes=('B', 'T'), elements_type=Index())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EmbeddedTextType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sY9GYEoD3Yy0" - }, - "source": [ - "To show the benefit of Neural Types, we are going to replicate the above cases inside NeuralModules.\n", - "\n", - "Let's discuss how we added type checking support to the above class.\n", - "\n", - "1) `forward` has a decorator `@typecheck()` on it.\n", - "\n", - "2) `input_types` and `output_types` properties are defined.\n", - "\n", - "That's it!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "on268fAX4LLU" - }, - "source": [ - "-------\n", - "\n", - "Let's expand on each of the above steps.\n", - "\n", - "- `@typecheck()` is a simple decorator that takes any class that inherits `Typing` (NeuralModule does this for us) and adds the two default properties of `input_types` and `output_types`, which by default returns None.\n", - "\n", - "The `@typecheck()` decorator's explicit use ensures that, by default, neural type checking is **disabled**. NeMo does not wish to intrude on the development process of models. So users can \"opt-in\" to type checking by overriding the two properties. Therefore, the decorator ensures that users are not burdened with type checking before they wish to have it.\n", - "\n", - "So what is `@typecheck()`? Simply put, you can wrap **any** function of a class that inherits `Typing` with this decorator, and it will look up the definition of the types of that class and enforce them. Typically, `torch.nn.Module` subclasses only implement `forward()` so it is most common to wrap that method, but `@typecheck()` is a very flexible decorator. Inside NeMo, we will show some advanced use cases (which are quite crucial to particular domains such as TTS)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o9i1KugG5om7" - }, - "source": [ - "------\n", - "\n", - "As we see above, `@typecheck()` enforces the types. How then, do we provide this type of information to NeMo? \n", - "\n", - "By overriding `input_types` and `output_types` properties of the class, we can return a dictionary mapping a string name to a `NeuralType`.\n", - "\n", - "In the above case, we define a `NeuralType` as two components - \n", - "\n", - "- `axes`: This is the semantic information of the carried by the axes themselves. The most common axes information is from single character notation.\n", - "\n", - "> `B` = Batch
\n", - "> `C` / `D` - Channel / Dimension (treated the same)
\n", - "> `T` - Time
\n", - "> `H` / `W` - Height / Width
\n", - "\n", - "- `elements_type`: This is the semantic information of \"what the tensor represents\". All such types are derived from the basic `ElementType`, and merely subclassing `ElementType` allows us to build a hierarchy of custom semantic types that can be used by NeMo!\n", - "\n", - "Here, we declare that the input is an element_type of `Index` (index of the character in the vocabulary) and that the output is an element_type of `EmbeddedTextType` (the text embedding)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "boxxMniv27vi" - }, - "source": [ - "embedding_module = EmbeddingModule()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BgfDuBm27wiV" - }, - "source": [ - "Now let's construct the equivalent of the Case 2 above, but as a `NeuralModule`." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "SZZOOoCJ2-iV" - }, - "source": [ - "class LSTMModule(NeuralModule):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", - "\n", - " @typecheck()\n", - " def forward(self, x):\n", - " return self.lstm(x)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'x': NeuralType(axes=('B', 'T', 'C'), elements_type=SpectrogramType())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7iIWIunz8IQq" - }, - "source": [ - "------\n", - "Here, we define the LSTM module from the Case 2 above.\n", - "\n", - "We changed the input to be a rank three tensor, now representing a \"SpectrogramType\". We intentionally keep it generic - it can be a `MelSpectrogramType` or a `MFCCSpectrogramType` as its input!\n", - "\n", - "The output of an LSTM is now an `EncodedRepresentation`. Practically, this can be the output of a CNN layer, a Transformer block, or in this case, an LSTM layer. We can, of course, specialize by subclassing EncodedRepresentation and then using that!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "6LlOJf0C8GN4" - }, - "source": [ - "lstm_module = LSTMModule()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hj0wonSz8_0c" - }, - "source": [ - "------\n", - "Now for the test !" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "giLJlub78-Ja" - }, - "source": [ - "# Case 1 [ERROR CELL]\n", - "x1 = torch.randint(high=10, size=(1, 5))\n", - "print(\"x :\", x1)\n", - "print(\"embedding(x) :\", embedding_module(x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K-fhclja9WLr" - }, - "source": [ - "-----\n", - "You might be wondering why we get a `TypeError` right off the bat. This `TypeError` is raised by design.\n", - "\n", - "Positional arguments can cause significant issues during model development, mostly when the model/module design is not finalized. To reduce the potential for mistakes caused by wrong positional arguments and enforce the name of arguments provided to the function, `Typing` requires you to **call all of your type-checked functions by kwargs only**." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "2KUj_p6M9L-f" - }, - "source": [ - "# Case 1\n", - "print(\"x :\", x1)\n", - "print(\"embedding(x) :\", embedding_module(x=x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dirhWWvMRusx" - }, - "source": [ - "Now let's try the same for the `LSTMModule` in Case 2" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FMu3B0-9-CqE" - }, - "source": [ - "# Case 2 [ERROR CELL]\n", - "x2 = torch.randn(1, 5, 1) # Input = [B=1, T=5, C=1]\n", - "print(\"x :\", x2)\n", - "print(\"lstm(x) :\", lstm_module(x=x2)[0].shape) # Let's take all timestep outputs of the LSTM" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-OTLdR_4-isV" - }, - "source": [ - "-----\n", - "Now we get a type error stating that the number of output arguments provided does not match what is expected.\n", - "\n", - "What exactly is going on here? Well, inside our `LSTMModule` class, we declare the output types to be a single NeuralType - an `EncodedRepresentation` of shape [B, T, C].\n", - "\n", - "But the output of an LSTM layer is a tuple of \n", - "1) the encoded representation of shape [B, T, C]\n", - "2) another tuple containing two state values - the hidden state `h` and the cell state `c`, each of shape [num_layers * num_directions, B, C]!\n", - "\n", - "So the neural type system raises an error saying that the number of output arguments does not match what is expected.\n", - "\n", - "**NOTE**: The axis kind information of the two states will be represented by `D` to represent a general \"Dimension\" - since `num_layers` and `num_directions` are collapsed under a single axis. For NeMo, Axis types of `C` and `D` are equivalent and can be interchanged, so we will use `C` here to represent the hidden dimension of the LSTM and `D` to represent the merged axis `num_layers * num_directions`.\n", - "\n", - "Let's fix the above." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "q2u-keAM-d-B" - }, - "source": [ - "class CorrectLSTMModule(LSTMModule): # Let's inherit the wrong class to make it easy to override\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation()),\n", - " 'h_c': [NeuralType(axes=('D', 'B', 'C'), elements_type=EncodedRepresentation())],\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "a99NX0O8KMvW" - }, - "source": [ - "You should note that for the `h_c` neural type, we wrap it in a list - `[]`. NeMo, by default, assumes that each `NeuralType` corresponds to a single returned value. However, in the case of LSTMs, they produce a tuple of two state tensors.\n", - "\n", - "So we inform NeMo that this particular `NeuralType` is a single-dimensional list of items - and that each element of this list shares the same `NeuralType` and has the same shape.\n", - "\n", - "NeMo then ensures that the `h_c` is always a list of tensors. It will not check *how many* items are in the list, but will ensure that the returned value *must be a list containing zero or more items* - and that each of these items share the same `NeuralType`. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "GyPZH-fz_dG4" - }, - "source": [ - "lstm_module = CorrectLSTMModule()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "9whH50PE_Xyx" - }, - "source": [ - "# Case 2\n", - "x2 = torch.randn(1, 5, 1)\n", - "y2, (h, c) = lstm_module(x=x2)\n", - "print(\"x :\", x2)\n", - "print(\"lstm(x) :\", y2.shape) # The output of the LSTM RNN\n", - "print(\"hidden state (h) :\", h.shape) # The first hidden state of the LSTM RNN\n", - "print(\"hidden state (c) :\", c.shape) # The second hidden state of the LSTM RNN" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cRueNvNY_jI3" - }, - "source": [ - "------\n", - "Great! So now, the type checking system is happy.\n", - "\n", - "If you looked closely, the outputs were ordinary Torch Tensors (this is good news; we don't want to be incompatible with torch Tensors after all!). So, where exactly is the type of information stored?\n", - "\n", - "When the `output_types` is overridden, and valid torch tensors are returned as a result, these tensors are attached with the attribute `neural_type`. Let's inspect this -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "bGQ9XbWU_ffa" - }, - "source": [ - "emb_out = embedding_module(x=x1)\n", - "lstm_out = lstm_module(x=x2)[0]\n", - "\n", - "assert hasattr(emb_out, 'neural_type')\n", - "assert hasattr(lstm_out, 'neural_type')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "kEpBruSOScPJ" - }, - "source": [ - "print(\"Embedding tensor :\", emb_out.neural_type)\n", - "print(\"LSTM tensor :\", lstm_out.neural_type)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BWTsqiAHAony" - }, - "source": [ - "-------\n", - "So we see that these tensors now have this attribute called `neural_type` and are the same shape.\n", - "\n", - "This exercise's entire goal was to assert that the two outputs are semantically **not** the same object, even if they are the same shape. \n", - "\n", - "Let's test this!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "8AU9FMtdATIm" - }, - "source": [ - "emb_out.neural_type.compare(lstm_out.neural_type)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "2cqnqAGIBCjA" - }, - "source": [ - "emb_out.neural_type == lstm_out.neural_type" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HmH6B0mHDJqb" - }, - "source": [ - "## Neural Types - Limitations\n", - "\n", - "You might have noticed one interesting fact - our inputs were just `torch.Tensor` to both typed function calls, and they had no `neural_type` assigned to them.\n", - "\n", - "So why did the type check system not raise any error? \n", - "\n", - "This is to maintain compatibility - type checking is meant to work on a chain of function calls - and each of these functions should themselves be wrapped with the `@typecheck()` decorator. This is also done because we don't want to overtax the forward call with dozens of checks, and therefore we only type modules that perform some higher-order logical computation. \n", - "\n", - "------\n", - "\n", - "As an example, it is mostly unnecessary (but still possible) to type the input and output of every residual block of a ResNet model. 
However, it is practically important to type the encoder (no matter how many layers is inside it) and the decoder (the classification head) separately so that when one does fine-tuning, there is no semantic mismatch of the tensors input to the encoder and bound to the decoder." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6m28zSEKEjt_" - }, - "source": [ - "-------\n", - "For this case, since it would be impractical to extend a class to attach a type to the input tensor, we can take a shortcut and directly attach the neural type to the input!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "AGbKB4gJEzcU" - }, - "source": [ - "embedding_module = EmbeddingModule()\n", - "x1 = torch.randint(high=10, size=(1, 5))\n", - "\n", - "# Attach correct neural type\n", - "x1.neural_type = NeuralType(('B', 'T'), Index())\n", - "\n", - "print(\"embedding(x) :\", embedding_module(x=x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "F0j-evylFM5j" - }, - "source": [ - "# Attach wrong neural type [ERROR CELL]\n", - "x1.neural_type = NeuralType(('B', 'T'), LabelsType())\n", - "\n", - "print(\"embedding(x) :\", embedding_module(x=x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "StMPyg6oCC9B" - }, - "source": [ - "## Let's create the minGPT components\n", - "\n", - "Now that we have a somewhat firm grasp of neural type checking, let's begin porting the minGPT example code. Once again, most of the code will be a direct port from the [minGPT repository](https://github.com/karpathy/minGPT).\n", - "\n", - "Here, you will notice one thing. By just changing class imports, one `@typecheck()` on forward, and adding `input_types` and `output_types` (which are also entirely optional!), we are almost entirely done with the PyTorch Lightning port!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "raFkuSRaBAE0" - }, - "source": [ - "import math\n", - "from typing import List, Set, Dict, Tuple, Optional\n", - "\n", - "import torch\n", - "import torch.nn as nn\n", - "from torch.nn import functional as F" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yakGOXrzF1XW" - }, - "source": [ - "## Creating Element Types\n", - "\n", - "Till now, we have used the Neural Types provided by the NeMo core. But we need not be restricted to the pre-defined element types !\n", - "\n", - "Users have total flexibility in defining any hierarchy of element types as they please!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ybhLLVyUF0mo" - }, - "source": [ - "class AttentionType(EncodedRepresentation):\n", - " \"\"\"Basic Attention Element Type\"\"\"\n", - "\n", - "class SelfAttentionType(AttentionType):\n", - " \"\"\"Self Attention Element Type\"\"\"\n", - "\n", - "class CausalSelfAttentionType(SelfAttentionType):\n", - " \"\"\"Causal Self Attention Element Type\"\"\"" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mONJRMdbZNSE" - }, - "source": [ - "## Creating the modules\n", - "\n", - "Neural Modules are generally top-level modules but can be used at any level of the module hierarchy.\n", - "\n", - "For demonstration, we will treat an encoder comprising a block of Causal Self Attention modules as a typed Neural Module. 
Of course, we can also treat each Causal Self Attention layer itself as a neural module if we require it, but top-level modules are generally preferred." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "w4oXpAL_CoDp" - }, - "source": [ - "class CausalSelfAttention(nn.Module):\n", - " \"\"\"\n", - " A vanilla multi-head masked self-attention layer with a projection at the end.\n", - " It is possible to use torch.nn.MultiheadAttention here but I am including an\n", - " explicit implementation here to show that there is nothing too scary here.\n", - " \"\"\"\n", - "\n", - " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", - " super().__init__()\n", - " assert n_embd % n_head == 0\n", - " self.n_head = n_head\n", - " # key, query, value projections for all heads\n", - " self.key = nn.Linear(n_embd, n_embd)\n", - " self.query = nn.Linear(n_embd, n_embd)\n", - " self.value = nn.Linear(n_embd, n_embd)\n", - " # regularization\n", - " self.attn_drop = nn.Dropout(attn_pdrop)\n", - " self.resid_drop = nn.Dropout(resid_pdrop)\n", - " # output projection\n", - " self.proj = nn.Linear(n_embd, n_embd)\n", - " # causal mask to ensure that attention is only applied to the left in the input sequence\n", - " self.register_buffer(\"mask\", torch.tril(torch.ones(block_size, block_size))\n", - " .view(1, 1, block_size, block_size))\n", - " def forward(self, x, layer_past=None):\n", - " B, T, C = x.size()\n", - "\n", - " # calculate query, key, values for all heads in batch and move head forward to be the batch dim\n", - " k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", - " q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", - " v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", - "\n", - " # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)\n", - " att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))\n", - " att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))\n", - " att = F.softmax(att, dim=-1)\n", - " att = self.attn_drop(att)\n", - " y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)\n", - " y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side\n", - "\n", - " # output projection\n", - " y = self.resid_drop(self.proj(y))\n", - " return y\n", - " \n", - "\n", - "class Block(nn.Module):\n", - " \"\"\" an unassuming Transformer block \"\"\"\n", - "\n", - " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", - " super().__init__()\n", - " self.ln1 = nn.LayerNorm(n_embd)\n", - " self.ln2 = nn.LayerNorm(n_embd)\n", - " self.attn = CausalSelfAttention(n_embd, block_size, n_head, attn_pdrop, resid_pdrop)\n", - " self.mlp = nn.Sequential(\n", - " nn.Linear(n_embd, 4 * n_embd),\n", - " nn.GELU(),\n", - " nn.Linear(4 * n_embd, n_embd),\n", - " nn.Dropout(resid_pdrop),\n", - " )\n", - "\n", - " def forward(self, x):\n", - " x = x + self.attn(self.ln1(x))\n", - " x = x + self.mlp(self.ln2(x))\n", - " return x" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mv0dyrLifkw0" - }, - "source": [ - "## Building the NeMo Model\n", - "\n", - "Since a NeMo Model is comprised of various parts, we are going to iterate on the model step by step inside this notebook. 
As such, we will have multiple intermediate NeMo \"Models\", which will be partial implementations, and they will inherit each other iteratively.\n", - "\n", - "In a complete implementation of a NeMo Model (as found in the NeMo collections), all of these components will generally be found in a single class.\n", - "\n", - "Let's start by inheriting `ModelPT` - the core class of a PyTorch NeMo Model, which inherits the PyTorch Lightning Module." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TxeG-qMrRgNU" - }, - "source": [ - "-------\n", - "**Remember**:\n", - "\n", - " - The NeMo equivalent of `torch.nn.Module` is the `NeuralModule.\n", - " - The NeMo equivalent of the `LightningModule` is `ModelPT`.\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "0TsfmCYthMux" - }, - "source": [ - "import pytorch_lightning as ptl\n", - "from nemo.core import ModelPT\n", - "from omegaconf import OmegaConf" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_ib2rSz2hjaP" - }, - "source": [ - "------\n", - "Next, let's construct the bare minimum implementation of the NeMo Model - just the constructor, the initializer of weights, and the forward method.\n", - "\n", - "Initially, we will follow the steps followed by the minGPT implementation, and progressively refactor for NeMo " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "98x9-Fh-HVwj" - }, - "source": [ - "class PTLGPT(ptl.LightningModule):\n", - " def __init__(self,\n", - " # model definition args\n", - " vocab_size: int, # size of the vocabulary (number of possible tokens)\n", - " block_size: int, # length of the model's context window in time\n", - " n_layer: int, # depth of the model; number of Transformer blocks in sequence\n", - " n_embd: int, # the \"width\" of the model, number of channels in each Transformer\n", - " n_head: int, # number of heads in each multi-head attention inside each Transformer block\n", - " # model optimization args\n", - " learning_rate: float = 3e-4, # the base learning rate of the model\n", - " weight_decay: float = 0.1, # amount of regularizing L2 weight decay on MatMul ops\n", - " betas: Tuple[float, float] = (0.9, 0.95), # momentum terms (betas) for the Adam optimizer\n", - " embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", - " resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", - " attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix\n", - " ):\n", - " super().__init__()\n", - "\n", - " # save these for optimizer init later\n", - " self.learning_rate = learning_rate\n", - " self.weight_decay = weight_decay\n", - " self.betas = betas\n", - "\n", - " # input embedding stem: drop(content + position)\n", - " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", - " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", - " self.drop = nn.Dropout(embd_pdrop)\n", - " # deep transformer: just a sequence of transformer blocks\n", - " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)])\n", - " # decoder: at the end one more layernorm and decode the answers\n", - " self.ln_f = nn.LayerNorm(n_embd)\n", - " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", - "\n", - " self.block_size = block_size\n", - " self.apply(self._init_weights)\n", - "\n", - " print(\"number of parameters: %e\" % sum(p.numel() for p 
in self.parameters()))\n", - "\n", - " def forward(self, idx):\n", - " b, t = idx.size()\n", - " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", - "\n", - " # forward the GPT model\n", - " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", - " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", - " x = self.drop(token_embeddings + position_embeddings)\n", - " x = self.blocks(x)\n", - " x = self.ln_f(x)\n", - " logits = self.head(x)\n", - "\n", - " return logits\n", - "\n", - " def get_block_size(self):\n", - " return self.block_size\n", - "\n", - " def _init_weights(self, module):\n", - " \"\"\"\n", - " Vanilla model initialization:\n", - " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", - " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", - " \"\"\"\n", - " if isinstance(module, (nn.Linear, nn.Embedding)):\n", - " module.weight.data.normal_(mean=0.0, std=0.02)\n", - " if isinstance(module, nn.Linear) and module.bias is not None:\n", - " module.bias.data.zero_()\n", - " elif isinstance(module, nn.LayerNorm):\n", - " module.bias.data.zero_()\n", - " module.weight.data.fill_(1.0)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2bMf5SO7wmor" - }, - "source": [ - "------\n", - "Let's create a PyTorch Lightning Model above, just to make sure it works !" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "rrXIBzg4wutC" - }, - "source": [ - "m = PTLGPT(vocab_size=100, block_size=32, n_layer=1, n_embd=32, n_head=4)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZCcgn1bajPW8" - }, - "source": [ - "------\n", - "Now, let's convert the above easily into a NeMo Model.\n", - "\n", - "A NeMo Model constructor generally accepts only two things - \n", - "\n", - "1) `cfg`: An OmegaConf DictConfig object that defines precisely the components required by the model to define its neural network architecture, data loader setup, optimizer setup, and any additional components needed for the model itself.\n", - "\n", - "2) `trainer`: An optional Trainer from PyTorch Lightning if the NeMo model will be used for training. It can be set after construction (if required) using the `set_trainer` method. For this notebook, we will not be constructing the config for the Trainer object." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WQMTCB3kz0UA" - }, - "source": [ - "## Refactoring Neural Modules\n", - "\n", - "As we discussed above, Neural Modules are generally higher-level components of the Model and can potentially be replaced by equivalent Neural Modules.\n", - "\n", - "As we see above, the embedding modules, deep transformer decoder network, and final decoder layer have all been combined inside the PyTorch Lightning implementation constructor.\n", - "\n", - "------\n", - "\n", - "However, the final decoder module could have been an RNN instead of a simple Linear layer, or it could have been a 1D-CNN instead.\n", - "\n", - "Likewise, the deep transformer decoder could potentially have a different implementation of Self Attention modules.\n", - "\n", - "These changes cannot be easily implemented any more inside the above implementation. However, if we refactor these components into their respective NeuralModules, then we can easily replace them with equivalent modules we construct in the future!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EJj5sSkX0xHi" - }, - "source": [ - "### Refactoring the Embedding module\n", - "\n", - "Let's first refactor out the embedding module from the above implementation" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "uYwMyjqK05RL" - }, - "source": [ - "class GPTEmbedding(NeuralModule):\n", - " def __init__(self, vocab_size: int, n_embd: int, block_size: int, embd_pdrop: float = 0.0):\n", - " super().__init__()\n", - "\n", - " # input embedding stem: drop(content + position)\n", - " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", - " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", - " self.drop = nn.Dropout(embd_pdrop)\n", - "\n", - " @typecheck()\n", - " def forward(self, idx):\n", - " b, t = idx.size()\n", - " \n", - " # forward the GPT model\n", - " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", - " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", - " x = self.drop(token_embeddings + position_embeddings)\n", - " return x\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'idx': NeuralType(('B', 'T'), Index())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'embeddings': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "l5rOP6lyOyRt" - }, - "source": [ - "### Refactoring the Encoder\n", - "\n", - "Next, let's refactor the GPT Encoder - which is implemented as a multi layer Transformer (Decoder) network.\n", - "\n", - "------\n", - "It can be noted that we refer to the GPT \"Encoder\" module - but it is constructed by using Transformer \"Decoder\" blocks.\n", - "\n", - "***When we discuss Neural Modules - we are discussing an abstract module with a certain input neural type and a certain output neural type.***\n", - "\n", - "For us, the GPT \"Encoder\" neural module will accept any implementation, whose\n", - "\n", - "- input neural type is `NeuralType(('B', 'T', 'C'), EmbeddedTextType())`\n", - "\n", - "- output type is `NeuralType(('B', 'T', 'C'), EncodedRepresentation())`\n", - "\n", - "-----\n", - "One concrete implementation of such a GPT \"Encoder\" neural module is a Deep Transformer \"Decoder\" network." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "1QeQnQ_G2PwH" - }, - "source": [ - "class GPTTransformerEncoder(NeuralModule):\n", - " def __init__(self, n_embd: int, block_size: int, n_head: int, n_layer: int, attn_pdrop: float = 0.0, resid_pdrop: float = 0.0):\n", - " super().__init__()\n", - "\n", - " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) \n", - " for _ in range(n_layer)])\n", - " \n", - " @typecheck()\n", - " def forward(self, embed):\n", - " return self.blocks(embed)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'encoding': NeuralType(('B', 'T', 'C'), CausalSelfAttentionType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NmCR3LK3QHum" - }, - "source": [ - "### Refactoring the Decoder\n", - "\n", - "Finally, let's refactor the Decoder - the small one-layer feed-forward network to decode the answer.\n", - "\n", - "-------\n", - "\n", - "Note an interesting detail - The `input_types` of the Decoder accepts the generic `EncoderRepresentation()`, where as the `neural_type` of the `GPTTransformerEncoder` has the `output_type` of `CausalSelfAttentionType`.\n", - "\n", - "This is semantically *not* a mismatch! As you can see above in the inheritance chart, we declare `EncodedRepresentation` -> `AttentionType` -> `SelfAttentionType` -> `CausalSelfAttentionType`. \n", - "\n", - "Such an inheritance hierarchy for the `element_type` allows future encoders (which also have a neural output type of at least `EncodedRepresentation`) to be swapped in place of the current GPT Causal Self Attention Encoder while keeping the rest of the NeMo model working just fine!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "VCPUu0EWQIBX" - }, - "source": [ - "class GPTDecoder(NeuralModule):\n", - " def __init__(self, n_embd: int, vocab_size: int):\n", - " super().__init__()\n", - " self.ln_f = nn.LayerNorm(n_embd)\n", - " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", - "\n", - " @typecheck()\n", - " def forward(self, encoding):\n", - " x = self.ln_f(encoding)\n", - " logits = self.head(x)\n", - " return logits\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())\n", - " }\n", - " \n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", - " }\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nYLMjlW0Sdy1" - }, - "source": [ - "### Refactoring the NeMo GPT Model\n", - "\n", - "Now that we have 3 NeuralModules for the embedding, the encoder, and the decoder, let's refactor the NeMo model to take advantage of this refactor!\n", - "\n", - "This time, we inherit from `ModelPT` instead of the general `LightningModule`." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZQlmtYU6iDwi" - }, - "source": [ - "class AbstractNeMoGPT(ModelPT):\n", - " def __init__(self, cfg: OmegaConf, trainer: ptl.Trainer = None):\n", - " super().__init__(cfg=cfg, trainer=trainer)\n", - "\n", - " # input embedding stem: drop(content + position)\n", - " self.embedding = self.from_config_dict(self.cfg.embedding)\n", - " # deep transformer: just a sequence of transformer blocks\n", - " self.encoder = self.from_config_dict(self.cfg.encoder)\n", - " # decoder: at the end one more layernorm and decode the answers\n", - " self.decoder = self.from_config_dict(self.cfg.decoder)\n", - "\n", - " self.block_size = self.cfg.embedding.block_size\n", - " self.apply(self._init_weights)\n", - "\n", - " print(\"number of parameters: %e\" % self.num_weights)\n", - "\n", - " @typecheck()\n", - " def forward(self, idx):\n", - " b, t = idx.size()\n", - " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", - "\n", - " # forward the GPT model\n", - " # Remember: Only kwargs are allowed !\n", - " e = self.embedding(idx=idx)\n", - " x = self.encoder(embed=e)\n", - " logits = self.decoder(encoding=x)\n", - "\n", - " return logits\n", - "\n", - " def get_block_size(self):\n", - " return self.block_size\n", - "\n", - " def _init_weights(self, module):\n", - " \"\"\"\n", - " Vanilla model initialization:\n", - " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", - " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", - " \"\"\"\n", - " if isinstance(module, (nn.Linear, nn.Embedding)):\n", - " module.weight.data.normal_(mean=0.0, std=0.02)\n", - " if isinstance(module, nn.Linear) and module.bias is not None:\n", - " module.bias.data.zero_()\n", - " elif isinstance(module, nn.LayerNorm):\n", - " module.bias.data.zero_()\n", - " module.weight.data.fill_(1.0)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'idx': NeuralType(('B', 'T'), Index())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DFRmxWiSmdF3" - }, - "source": [ - "## Creating a config for a Model\n", - "\n", - "At first glance, not much changed compared to the PyTorch Lightning implementation above. Other than the constructor, which now accepts a config, nothing changed at all!\n", - "\n", - "NeMo operates on the concept of a NeMo Model being accompanied by a corresponding config dict (instantiated as an OmegaConf object). This enables us to prototype the model by utilizing Hydra rapidly. This includes various other benefits - such as hyperparameter optimization and serialization/deserialization of NeMo models.\n", - "\n", - "Let's look at how actually to construct such config objects!" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "uygo0BEYjKuj" - }, - "source": [ - "# model definition args (required)\n", - "# ================================\n", - "# vocab_size: int # size of the vocabulary (number of possible tokens)\n", - "# block_size: int # length of the model's context window in time\n", - "# n_layer: int # depth of the model; number of Transformer blocks in sequence\n", - "# n_embd: int # the \"width\" of the model, number of channels in each Transformer\n", - "# n_head: int # number of heads in each multi-head attention inside each Transformer block \n", - "\n", - "# model definition args (optional)\n", - "# ================================\n", - "# embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", - "# resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", - "# attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "s4sdqRAFop-n" - }, - "source": [ - "------\n", - "As we look at the required parameters above, we need a way to tell OmegaConf that these values are currently not set, but the user should set them before we use them.\n", - "\n", - "OmegaConf supports such behavior using the `MISSING` value. A similar effect can be achieved in YAML configs by using `???` as a placeholder." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "XqLSZq7Soo2j" - }, - "source": [ - "from omegaconf import MISSING" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "JTH-1vu8TO7o" - }, - "source": [ - "# Let's create a utility for building the class path\n", - "def get_class_path(cls):\n", - " return f'{cls.__module__}.{cls.__name__}'" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6xToaWAJUmtX" - }, - "source": [ - "### Structure of a Model config\n", - "\n", - "Let's first create a config for the common components of the model level config -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZCvLdOlMVLy_" - }, - "source": [ - "common_config = OmegaConf.create({\n", - " 'vocab_size': MISSING,\n", - " 'block_size': MISSING,\n", - " 'n_layer': MISSING,\n", - " 'n_embd': MISSING,\n", - " 'n_head': MISSING,\n", - "})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "j8hvdKa4VmCV" - }, - "source": [ - "-----\n", - "The model config right now is still being built - it needs to contain a lot more details!\n", - "\n", - "A complete Model Config should have the sub-configs of all of its top-level modules as well. 
This means the configs of the `embedding`, `encoder`, and the `decoder`.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v-2_QOZyVgrE" - }, - "source": [ - "### Structure of sub-module config\n", - "\n", - "For top-level models, we generally don't change the actual module very often, and instead, primarily change the hyperparameters of that model.\n", - "\n", - "So we will make use of `Hydra`'s Class instantiation method - which can easily be accessed via the class method `ModelPT.from_config_dict()`.\n", - "\n", - "Let's take a few examples below -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ntsxQKH0pDac" - }, - "source": [ - "embedding_config = OmegaConf.create({\n", - " '_target_': get_class_path(GPTEmbedding),\n", - " 'vocab_size': '${model.vocab_size}',\n", - " 'n_embd': '${model.n_embd}',\n", - " 'block_size': '${model.block_size}',\n", - " 'embd_pdrop': 0.1\n", - "})\n", - "\n", - "encoder_config = OmegaConf.create({\n", - " '_target_': get_class_path(GPTTransformerEncoder),\n", - " 'n_embd': '${model.n_embd}',\n", - " 'block_size': '${model.block_size}',\n", - " 'n_head': '${model.n_head}',\n", - " 'n_layer': '${model.n_layer}',\n", - " 'attn_pdrop': 0.1,\n", - " 'resid_pdrop': 0.1\n", - "})\n", - "\n", - "decoder_config = OmegaConf.create({\n", - " '_target_': get_class_path(GPTDecoder),\n", - " # n_embd: int, vocab_size: int\n", - " 'n_embd': '${model.n_embd}',\n", - " 'vocab_size': '${model.vocab_size}'\n", - "})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtloTqkqWhpl" - }, - "source": [ - "##### What is `_target_`?\n", - "--------\n", - "\n", - "In the above config, we see a `_target_` in the config. `_target_` is usually a full classpath to the actual class in the python package/user local directory. It is required for Hydra to locate and instantiate the model from its path correctly.\n", - "\n", - "So why do we want to set a classpath?\n", - "\n", - "In general, when developing models, we don't often change the encoder or the decoder, but we do change the hyperparameters of the encoder and decoder.\n", - "\n", - "This notation helps us keep the Model level declaration of the forward step neat and precise. It also logically helps us demark which parts of the model can be easily replaced - in the future, we can easily replace the encoder with some other type of self-attention block or the decoder with an RNN or 1D-CNN neural module (as long as they have the same Neural Type definition as the current blocks).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ASDmcgE4XtQ4" - }, - "source": [ - "##### What is the `${}` syntax?\n", - "-------\n", - "\n", - "OmegaConf, and by extension, Hydra, supports Variable Interpolation. As you can see in the `__init__` of embedding, encoder, and decoder neural modules, they often share many parameters between each other.\n", - "\n", - "It would become tedious and error-prone to set each of these constructors' values separately in each of the embedding, encoder, and decoder configs.\n", - "\n", - "So instead, we define standard keys inside of the `model` level config and then interpolate these values inside of the respective configs!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zXvEcXGhZi5I" - }, - "source": [ - "### Attaching the model and module-level configs\n", - "\n", - "So now, we have a Model level and per-module level configs for the core components. 
Sub-module configs generally fall under the \"model\" namespace, but you have the flexibility to define the structure as you require.\n", - "\n", - "Let's attach them!\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "c8hvNeB_aDgi" - }, - "source": [ - "model_config = OmegaConf.create({\n", - " 'model': common_config\n", - "})\n", - "\n", - "# Then let's attach the sub-module configs\n", - "model_config.model.embedding = embedding_config\n", - "model_config.model.encoder = encoder_config\n", - "model_config.model.decoder = decoder_config" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zIubuFcOpIB0" - }, - "source": [ - "-----\n", - "Let's print this config!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "2SyKNgp9pG0N" - }, - "source": [ - "print(OmegaConf.to_yaml(model_config))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4PAA07EAauCn" - }, - "source": [ - "-----\n", - "Wait, why did OmegaConf not fill in the value of the variable interpolation for the configs yet?\n", - "\n", - "This is because OmegaConf takes a deferred approach to variable interpolation. First, we fill in temporary values of the required fields (those marked by `???`). Then, to force resolution ahead of time, we can use the following snippet - " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "0X4C76JyOAnN" - }, - "source": [ - "import copy" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ugxA0TPtbHVZ" - }, - "source": [ - "temp_config = copy.deepcopy(model_config)\n", - "temp_config.model.vocab_size = 10\n", - "temp_config.model.block_size = 4\n", - "temp_config.model.n_layer = 1\n", - "temp_config.model.n_embd = 32\n", - "temp_config.model.n_head = 4\n", - "\n", - "temp_config = OmegaConf.create(OmegaConf.to_container(temp_config, resolve=True))\n", - "print(OmegaConf.to_yaml(temp_config))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V41RFIpEpiOu" - }, - "source": [ - "-----\n", - "Now that we have a config, let's try to create an object of the NeMo Model !" 
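(One quick aside before we do, on the resolution step above: newer versions of OmegaConf also ship an in-place `OmegaConf.resolve()` helper, so the round-trip through `to_container(..., resolve=True)` is not strictly required. A minimal sketch, assuming OmegaConf >= 2.1 is installed - )

```python
# Alternative to the to_container(..., resolve=True) round-trip shown above.
# Assumes OmegaConf >= 2.1, which provides the in-place OmegaConf.resolve() helper.
import copy
from omegaconf import OmegaConf

temp_config = copy.deepcopy(model_config)
temp_config.model.vocab_size = 10
temp_config.model.block_size = 4
temp_config.model.n_layer = 1
temp_config.model.n_embd = 32
temp_config.model.n_head = 4

OmegaConf.resolve(temp_config)  # resolves all ${...} interpolations in place
print(OmegaConf.to_yaml(temp_config))
```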
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "IIIVi2IfpsJ4" - }, - "source": [ - "# Let's work on a copy of the model config and update it before we send it into the Model.\n", - "cfg = copy.deepcopy(model_config)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "OllBhswPqQXq" - }, - "source": [ - "# Let's set the values of the config (for some plausible small model)\n", - "cfg.model.vocab_size = 100\n", - "cfg.model.block_size = 128\n", - "cfg.model.n_layer = 1\n", - "cfg.model.n_embd = 32\n", - "cfg.model.n_head = 4" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QJm2LnTqqcIM" - }, - "source": [ - "print(OmegaConf.to_yaml(cfg))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "E7tpB8BcqeBO" - }, - "source": [ - "# Try to create a model with this config [ERROR CELL]\n", - "m = AbstractNeMoGPT(cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cXOLhpxdq4Ni" - }, - "source": [ - "-----\n", - "\n", - "You will note that we added the `Abstract` tag for a reason to this NeMo Model and that when we try to instantiate it - it raises an error that we need to implement specific methods.\n", - "\n", - "1) `setup_training_data` & `setup_validation_data` - All NeMo models should implement two data loaders - the training data loader and the validation data loader. Optionally, they can go one step further and also implement the `setup_test_data` method to add support for evaluating the Model on its own.\n", - "\n", - "Why do we enforce this? NeMo Models are meant to be a unified, cohesive object containing the details about the neural network underlying that Model and the data loaders to train, validate, and optionally test those models.\n", - "\n", - "In doing so, once the Model is created/deserialized, it would take just a few more steps to train the Model from scratch / fine-tune/evaluate the Model on any data that the user provides, as long as this user-provided dataset is in a format supported by the Dataset / DataLoader that is used by this Model!\n", - "\n", - "2) `list_available_models` - This is a utility method to provide a list of pre-trained NeMo models to the user from the cloud.\n", - "\n", - "Typically, NeMo models can be easily packaged into a tar file (which we call a .nemo file in the earlier primer notebook). These tar files contain the model config + the pre-trained checkpoint weights of the Model, and can easily be downloaded from some cloud service. \n", - "\n", - "For this notebook, we will not be implementing this method.\n", - "\n", - "--------\n", - "Finally, let's create a concrete implementation of the above NeMo Model!" 
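(For reference, a populated `list_available_models` - which this notebook deliberately skips - might look roughly like the sketch below. The checkpoint name and URL are hypothetical placeholders, not a real hosted model, and the exact `PretrainedModelInfo` fields may vary slightly between NeMo versions.)

```python
# Hypothetical sketch only: the model name and URL below do not point to a real checkpoint.
from nemo.core.classes.common import PretrainedModelInfo

def example_list_available_models():
    # Each entry tells NeMo where a pretrained .nemo checkpoint could be downloaded from.
    return [
        PretrainedModelInfo(
            pretrained_model_name="minigpt_tiny_shakespeare",              # hypothetical name
            location="https://example.com/minigpt_tiny_shakespeare.nemo",  # hypothetical URL
            description="Character-level GPT trained on tiny-shakespeare (illustrative entry).",
        )
    ]
```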
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Vcwi1lO7t7Sm" - }, - "source": [ - "from nemo.core.classes.common import PretrainedModelInfo" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ckCxyVLYqrz0" - }, - "source": [ - "class BasicNeMoGPT(AbstractNeMoGPT):\n", - "\n", - " @classmethod\n", - " def list_available_models(cls) -> PretrainedModelInfo:\n", - " return None\n", - "\n", - " def setup_training_data(self, train_data_config: OmegaConf):\n", - " self._train_dl = None\n", - " \n", - " def setup_validation_data(self, val_data_config: OmegaConf):\n", - " self._validation_dl = None\n", - " \n", - " def setup_test_data(self, test_data_config: OmegaConf):\n", - " self._test_dl = None" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ofUoJ8DDvq_Y" - }, - "source": [ - "------\n", - "Now let's try to create an object of the `BasicNeMoGPT` model" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "G8iYQSC5vptU" - }, - "source": [ - "m = BasicNeMoGPT(cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "otvYW4TBxAju" - }, - "source": [ - "## Setting up train-val-test steps\n", - "\n", - "The above `BasicNeMoGPT` Model is a basic PyTorch Lightning Module, with some added functionality - \n", - "\n", - "1) Neural Type checks support - as defined in the Model as well as the internal modules.\n", - "\n", - "2) Save and restore of the Model (in the trivial case) to a tarfile.\n", - "\n", - "But as the Model is right now, it crucially does not support PyTorch Lightning's `Trainer`. As such, while this Model can be called manually, it cannot be easily trained or evaluated by using the PyTorch Lightning framework.\n", - "\n", - "------\n", - "\n", - "Let's begin adding support for this then -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "QU3oQAVovxRg" - }, - "source": [ - "class BasicNeMoGPTWithSteps(BasicNeMoGPT):\n", - "\n", - " def step_(self, split, batch, batch_idx=None):\n", - " idx, targets = batch\n", - " logits = self(idx=idx)\n", - " loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))\n", - " key = 'loss' if split == 'train' else f\"{split}_loss\"\n", - " self.log(key, loss)\n", - " return {key: loss}\n", - "\n", - " def training_step(self, *args, **kwargs):\n", - " return self.step_('train', *args, **kwargs)\n", - "\n", - " def validation_step(self, *args, **kwargs):\n", - " return self.step_('val', *args, **kwargs)\n", - "\n", - " def test_step(self, *args, **kwargs):\n", - " return self.step_('test', *args, **kwargs)\n", - " \n", - " # This is useful for multiple validation data loader setup\n", - " def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):\n", - " val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()\n", - " return {'val_loss': val_loss_mean}\n", - "\n", - " # This is useful for multiple test data loader setup\n", - " def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0):\n", - " test_loss_mean = torch.stack([x['test_loss'] for x in outputs]).mean()\n", - " return {'test_loss': test_loss_mean}" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "2Ki3kRxag511" - }, - "source": [ - "m = BasicNeMoGPTWithSteps(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": 
"f_7YziAw_Isu" - }, - "source": [ - "### Setup for Multi Validation and Multi Test data loaders\n", - "\n", - "As discussed in the NeMo Primer, NeMo has in-built support for multiple data loaders for validation and test steps. Therefore, as an example of how easy it is to add such support, we include the `multi_validation_epoch_end` and `multi_test_epoch_end` overrides.\n", - "\n", - "It is also practically essential to collate results from more than one distributed GPUs, and then aggregate results properly at the end of the epoch. NeMo strictly enforces the correct collation of results, even if you will work on only one device! Future-proofing is baked into the model design for this case!\n", - "\n", - "Therefore NeMo provides the above two generic methods to support aggregation and simultaneously support multiple datasets!\n", - "\n", - "**Please note, you can prepend your already existing `on_validation_epoch_end` and `on_test_epoch_end` implementations with the `multi_` in the name, and that alone is sufficient to enable multi-dataset and multi-GPU support!**\n", - "\n", - "------\n", - "**Note: To disable multi-dataset support, simply override `on_validation_epoch_end` and `on_test_epoch_end` instead of `multi_validation_epoch_end` and `multi_test_epoch_end`!**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QpfSn-YUh7GK" - }, - "source": [ - "## Setting up the optimizer / scheduler\n", - "\n", - "We are relatively close to reaching feature parity with the MinGPT Model! But we are missing a crucial piece - the optimizer.\n", - "\n", - "All NeMo Model's come with a default implementation of `setup_optimization()`, which will parse the provided model config to obtain the `optim` and `sched` sub-configs, and automatically configure the optimizer and scheduler.\n", - "\n", - "If training GPT was as simple as plugging in an Adam optimizer over all the parameters with a cosine weight decay schedule, we could do that from the config alone.\n", - "\n", - "-------\n", - "\n", - "But GPT is not such a trivial model - more specifically, it requires weight decay to be applied to the weight matrices but not to the biases, the embedding matrix, or the LayerNorm layers.\n", - "\n", - "We can drop the support that Nemo provides for such special cases and instead utilize the PyTorch Lightning method `configure_optimizers` to perform the same task.\n", - "\n", - "-------\n", - "\n", - "Note, for NeMo Models; the `configure_optimizers` is implemented as a trivial call to `setup_optimization()` followed by returning the generated optimizer and scheduler! So we can override the `configure_optimizer` method and manage the optimizer creation manually!\n", - "\n", - "NeMo's goal is to provide usable defaults for the general case and simply back off to either PyTorch Lightning or PyTorch nn.Module itself in cases when the additional flexibility becomes necessary!" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FgXkZQiVjnOv" - }, - "source": [ - "class BasicNeMoGPTWithOptim(BasicNeMoGPTWithSteps):\n", - "\n", - " def configure_optimizers(self):\n", - " \"\"\"\n", - " This long function is unfortunately doing something very simple and is being very defensive:\n", - " We are separating out all parameters of the model into two buckets: those that will experience\n", - " weight decay for regularization and those that won't (biases, and layernorm/embedding weights).\n", - " We are then returning the PyTorch optimizer object.\n", - " \"\"\"\n", - "\n", - " # separate out all parameters to those that will and won't experience weight decay\n", - " decay = set()\n", - " no_decay = set()\n", - " whitelist_weight_modules = (torch.nn.Linear, )\n", - " blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)\n", - " for mn, m in self.named_modules():\n", - " for pn, p in m.named_parameters():\n", - " fpn = '%s.%s' % (mn, pn) if mn else pn # full param name\n", - "\n", - " if pn.endswith('bias'):\n", - " # all biases will not be decayed\n", - " no_decay.add(fpn)\n", - " elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):\n", - " # weights of whitelist modules will be weight decayed\n", - " decay.add(fpn)\n", - " elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):\n", - " # weights of blacklist modules will NOT be weight decayed\n", - " no_decay.add(fpn)\n", - "\n", - " # special case the position embedding parameter in the root GPT module as not decayed\n", - " no_decay.add('embedding.pos_emb')\n", - "\n", - " # validate that we considered every parameter\n", - " param_dict = {pn: p for pn, p in self.named_parameters()}\n", - " inter_params = decay & no_decay\n", - " union_params = decay | no_decay\n", - " assert len(inter_params) == 0, \"parameters %s made it into both decay/no_decay sets!\" % (str(inter_params), )\n", - " assert len(param_dict.keys() - union_params) == 0, \"parameters %s were not separated into either decay/no_decay set!\" \\\n", - " % (str(param_dict.keys() - union_params), )\n", - "\n", - " # create the pytorch optimizer object\n", - " optim_groups = [\n", - " {\"params\": [param_dict[pn] for pn in sorted(list(decay))], \"weight_decay\": self.cfg.optim.weight_decay},\n", - " {\"params\": [param_dict[pn] for pn in sorted(list(no_decay))], \"weight_decay\": 0.0},\n", - " ]\n", - " optimizer = torch.optim.AdamW(optim_groups, lr=self.cfg.optim.lr, betas=self.cfg.optim.betas)\n", - " return optimizer\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "kARDwthakEQk" - }, - "source": [ - "m = BasicNeMoGPTWithOptim(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iB1kwctv2cYv" - }, - "source": [ - "-----\n", - "Now let's setup the config for the optimizer !" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "5K7zh9Cn2s2u" - }, - "source": [ - "OmegaConf.set_struct(cfg.model, False)\n", - "\n", - "optim_config = OmegaConf.create({\n", - " 'lr': 3e-4,\n", - " 'weight_decay': 0.1,\n", - " 'betas': [0.9, 0.95]\n", - "})\n", - "\n", - "cfg.model.optim = optim_config\n", - "\n", - "OmegaConf.set_struct(cfg.model, True)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "P31p8ABthsh0" - }, - "source": [ - "## Setting up the dataset / data loaders\n", - "\n", - "So we were able almost entirely to replicate the MinGPT implementation. \n", - "\n", - "Remember, NeMo models should contain all of the logic to load the Dataset and DataLoader for at least the train and validation step.\n", - "\n", - "We temporarily provided empty implementations to get around it till now, but let's fill that in now!\n", - "\n", - "-------\n", - "\n", - "**Note for datasets**: Below, we will show an example using a very small dataset called `tiny_shakespeare`, found at the original [char-rnn repository](https://github.com/karpathy/char-rnn), but practically you could use any text corpus. The one suggested in minGPT is available at http://mattmahoney.net/dc/textdata.html" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "q8dlOcZPkxM1" - }, - "source": [ - "### Creating the Dataset\n", - "\n", - "NeMo has Neural Type checking support, even for Datasets! It's just a minor change of the import in most cases and one difference in how we handle `collate_fn`.\n", - "\n", - "We could paste the dataset info from minGPT, and you'd only need to make 2 changes!\n", - "\n", - "-----\n", - "In this example, we will be writing a thin subclass over the datasets provided by `nlp` from HuggingFace!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "E-fswFkig9t4" - }, - "source": [ - "from nemo.core import Dataset\n", - "from torch.utils import data\n", - "from torch.utils.data.dataloader import DataLoader" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "-Z8XuPeClGNm" - }, - "source": [ - "class TinyShakespeareDataset(Dataset):\n", - "\n", - " def __init__(self, data_path, block_size, crop=None, override_vocab=None):\n", - "\n", - " # load the data and crop it appropriately\n", - " with open(data_path, 'r') as f:\n", - " if crop is None:\n", - " data = f.read()\n", - " else:\n", - " f.seek(crop[0])\n", - " data = f.read(crop[1])\n", - "\n", - " # build a vocabulary from data or inherit it\n", - " vocab = sorted(list(set(data))) if override_vocab is None else override_vocab\n", - "\n", - " # Add UNK\n", - " special_tokens = ['', ''] # We use just and in the call, but can add others.\n", - " if not override_vocab:\n", - " vocab = [*special_tokens, *vocab] # Update train vocab with special tokens\n", - "\n", - " data_size, vocab_size = len(data), len(vocab)\n", - " print('data of crop %s has %d characters, vocab of size %d.' 
% (str(crop), data_size, vocab_size))\n", - " print('Num samples in dataset : %d' % (data_size // block_size))\n", - "\n", - " self.stoi = { ch:i for i,ch in enumerate(vocab) }\n", - " self.itos = { i:ch for i,ch in enumerate(vocab) }\n", - " self.block_size = block_size\n", - " self.vocab_size = vocab_size\n", - " self.data = data\n", - " self.vocab = vocab\n", - " self.special_tokens = special_tokens\n", - "\n", - " def __len__(self):\n", - " return len(self.data) // self.block_size\n", - "\n", - " def __getitem__(self, idx):\n", - " # attempt to fetch a chunk of (block_size + 1) items, but (block_size) will work too\n", - " chunk = self.data[idx*self.block_size : min(len(self.data), (idx+1)*self.block_size + 1)]\n", - " # map the string into a sequence of integers\n", - " ixes = [self.stoi[s] if s in self.stoi else self.stoi[''] for s in chunk ]\n", - " # if stars align (last idx and len(self.data) % self.block_size == 0), pad with \n", - " if len(ixes) < self.block_size + 1:\n", - " assert len(ixes) == self.block_size # i believe this is the only way this could happen, make sure\n", - " ixes.append(self.stoi[''])\n", - " dix = torch.tensor(ixes, dtype=torch.long)\n", - " return dix[:-1], dix[1:]\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'input': NeuralType(('B', 'T'), Index()),\n", - " 'target': NeuralType(('B', 'T'), LabelsType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7MEMR4TcmP5K" - }, - "source": [ - "------\n", - "We didn't have to change anything until here. How then is type-checking done? \n", - "\n", - "NeMo does type-checking inside of the collate function implementation itself! In this case, it is not necessary to override the `collate_fn` inside the Dataset, but if we did need to override it, **NeMo requires that the private method `_collate_fn` be overridden instead**.\n", - "\n", - "We can then use data loaders with minor modifications!\n", - "\n", - "**Also, there is no need to implement the `input_types` for Dataset, as they are the ones generating the input for the model!**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZeKXAknenVch" - }, - "source": [ - "-----\n", - "Let's prepare the dataset that we are going to use - Tiny Shakespeare from the following codebase [char-rnn](https://github.com/karpathy/char-rnn)." 
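(One brief aside before we download the data: if the `_collate_fn` override mentioned above were ever needed - it is not for this dataset - a minimal sketch could look like the following. The subclass name is hypothetical and the logic purely illustrative; the public `collate_fn` wrapper is where NeMo applies its type checking, as described above.)

```python
# Hypothetical sketch: overriding the private _collate_fn instead of collate_fn.
import torch

class CollatedTinyShakespeareDataset(TinyShakespeareDataset):

    def _collate_fn(self, batch):
        # batch is a list of (input, target) pairs produced by __getitem__
        inputs, targets = zip(*batch)
        # All chunks here share the same block_size, so a simple stack suffices;
        # a real override might pad variable-length sequences instead.
        return torch.stack(inputs, dim=0), torch.stack(targets, dim=0)
```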
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "VwsdXtVzo--t" - }, - "source": [ - "import os" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QvKcDCvIl9-A" - }, - "source": [ - "if not os.path.exists('tiny-shakespeare.txt'):\n", - " !wget https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ynCwqDu6vK8P" - }, - "source": [ - "!head -n 5 tiny-shakespeare.txt" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "bfRL4t9_oS4C" - }, - "source": [ - "train_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(0, int(1e6)))\n", - "val_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1e6), int(50e3)), override_vocab=train_dataset.vocab)\n", - "test_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1.05e6), int(100e3)), override_vocab=train_dataset.vocab)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kIlCoZDksEDO" - }, - "source": [ - "### Setting up dataset/data loader support in the Model\n", - "\n", - "So we now know our data loader works. Let's integrate it as part of the Model itself!\n", - "\n", - "To do this, we use the three special attributes of the NeMo Model - `self._train_dl`, `self._validation_dl` and `self._test_dl`. Once you construct your DataLoader, place your data loader to these three variables. \n", - "\n", - "For multi-data loader support, the same applies! NeMo will automatically handle the management of multiple data loaders for you!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "SVSfIk_-rMSg" - }, - "source": [ - "class NeMoGPT(BasicNeMoGPTWithOptim):\n", - "\n", - " def _setup_data_loader(self, cfg):\n", - " if self.vocab is None:\n", - " override_vocab = None\n", - " else:\n", - " override_vocab = self.vocab\n", - "\n", - " dataset = TinyShakespeareDataset(\n", - " data_path=cfg.data_path,\n", - " block_size=cfg.block_size,\n", - " crop=tuple(cfg.crop) if 'crop' in cfg else None,\n", - " override_vocab=override_vocab\n", - " )\n", - "\n", - " if self.vocab is None:\n", - " self.vocab = dataset.vocab\n", - "\n", - " return DataLoader(\n", - " dataset=dataset,\n", - " batch_size=cfg.batch_size,\n", - " shuffle=cfg.shuffle,\n", - " collate_fn=dataset.collate_fn, # <-- this is necessary for type checking\n", - " pin_memory=cfg.pin_memory if 'pin_memory' in cfg else False,\n", - " num_workers=cfg.num_workers if 'num_workers' in cfg else 0\n", - " )\n", - " \n", - " def setup_training_data(self, train_data_config: OmegaConf):\n", - " self.vocab = None\n", - " self._train_dl = self._setup_data_loader(train_data_config)\n", - " \n", - " def setup_validation_data(self, val_data_config: OmegaConf):\n", - " self._validation_dl = self._setup_data_loader(val_data_config)\n", - " \n", - " def setup_test_data(self, test_data_config: OmegaConf):\n", - " self._test_dl = self._setup_data_loader(test_data_config)\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ait4nLtIxS96" - }, - "source": [ - "### Creating the dataset / dataloader config\n", - "\n", - "The final step to setup this model is to add the `train_ds`, `validation_ds` and `test_ds` configs inside the model config!" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "C6zcTqJixOOL" - }, - "source": [ - "OmegaConf.set_struct(cfg.model, False)\n", - "\n", - "# Set the data path and update vocabular size\n", - "cfg.model.data_path = 'tiny-shakespeare.txt'\n", - "cfg.model.vocab_size = train_dataset.vocab_size\n", - "\n", - "OmegaConf.set_struct(cfg.model, True)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "zlvThf7BysyT" - }, - "source": [ - "train_ds = OmegaConf.create({\n", - " 'data_path': '${model.data_path}',\n", - " 'block_size': '${model.block_size}',\n", - " 'crop': [0, int(1e6)],\n", - " 'batch_size': 64,\n", - " 'shuffle': True,\n", - "})\n", - "\n", - "validation_ds = OmegaConf.create({\n", - " 'data_path': '${model.data_path}',\n", - " 'block_size': '${model.block_size}',\n", - " 'crop': [int(1e6), int(50e3)],\n", - " 'batch_size': 4,\n", - " 'shuffle': False,\n", - "})\n", - "\n", - "test_ds = OmegaConf.create({\n", - " 'data_path': '${model.data_path}',\n", - " 'block_size': '${model.block_size}',\n", - " 'crop': [int(1.05e6), int(100e3)],\n", - " 'batch_size': 4,\n", - " 'shuffle': False,\n", - "})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QVVzR6WKyMT5" - }, - "source": [ - "# Attach to the model config\n", - "OmegaConf.set_struct(cfg.model, False)\n", - "\n", - "cfg.model.train_ds = train_ds\n", - "cfg.model.validation_ds = validation_ds\n", - "cfg.model.test_ds = test_ds\n", - "\n", - "OmegaConf.set_struct(cfg.model, True)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "nd_9_mxS0ET-" - }, - "source": [ - "# Let's see the config now !\n", - "print(OmegaConf.to_yaml(cfg))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "dlwSQENU0JxA" - }, - "source": [ - "# Let's try creating a model now !\n", - "model = NeMoGPT(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Q_Mp4bhH0tR1" - }, - "source": [ - "-----\n", - "All the data loaders load properly ! Yay!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CZHDqCyo6uWd" - }, - "source": [ - "# Evaluate the model - end to end!\n", - "\n", - "Now that the data loaders have been set up, all that's left is to train and test the model! We have most of the components required by this model - the train, val and test data loaders, the optimizer, and the type-checked forward step to perform the train-validation-test steps! \n", - "\n", - "But training a GPT model from scratch is not the goal of this primer, so instead, let's do a sanity check by merely testing the model for a few steps using random initial weights.\n", - "\n", - "The above will ensure that - \n", - "\n", - "1) Our data loaders work as intended\n", - "\n", - "2) The type checking system assures us that our Neural Modules are performing their forward step correctly.\n", - "\n", - "3) The loss is calculated, and therefore the model runs end to end, ultimately supporting PyTorch Lightning." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "johk6Z0e0WEm" - }, - "source": [ - "if torch.cuda.is_available():\n", - " accelerator = 'gpu'\n", - "else:\n", - " accelerator = 'cpu'\n", - "\n", - "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches=1.0)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "oqeeofEr1S8e" - }, - "source": [ - "trainer.test(model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pqJy7esrA-Ha" - }, - "source": [ - "# Saving and restoring models\n", - "\n", - "NeMo internally keeps track of the model configuration, as well as the model checkpoints and parameters.\n", - "\n", - "As long as your NeMo follows the above general guidelines, you can call the `save_to` and `restore_from` methods to save and restore your models!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "DksG_-7G1Vbe" - }, - "source": [ - "model.save_to('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "JhjoFdCnBWVh" - }, - "source": [ - "!ls -d -- *.nemo" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "567txSF0BYXN" - }, - "source": [ - "temp_model = NeMoGPT.restore_from('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "YvnfG0kxBfTt" - }, - "source": [ - "# [ERROR CELL]\n", - "temp_model.setup_test_data(temp_model.cfg.test_ds)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "N0ckN44YB-1K" - }, - "source": [ - "-----\n", - "\n", - "Hmm, it seems it wasn't so easy in this case. Non-trivial models have non-trivial issues!\n", - "\n", - "Remember, our NeMoGPT model sets its self.vocab inside the `setup_train_data` step. But that depends on the vocabulary generated by the train set... which is **not** restored during model restoration (unless you call `setup_train_data` explicitly!).\n", - "\n", - "We can quickly resolve this issue by constructing an external data file to enable save and restore support, and NeMo supports that too! We will use the `register_artifact` API in NeMo to support external files being attached to the .nemo checkpoint." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_Atyoc4NBjEV" - }, - "source": [ - "class NeMoGPTv2(NeMoGPT):\n", - " \n", - " def setup_training_data(self, train_data_config: OmegaConf):\n", - " self.vocab = None\n", - " self._train_dl = self._setup_data_loader(train_data_config)\n", - "\n", - " # Save the vocab into a text file for now\n", - " with open('vocab.txt', 'w') as f:\n", - " for token in self.vocab:\n", - " f.write(f\"{token}\")\n", - " \n", - " # This is going to register the file into .nemo!\n", - " # When you later use .save_to(), it will copy this file into the tar file.\n", - " self.register_artifact('vocab_file', 'vocab.txt')\n", - " \n", - " def setup_validation_data(self, val_data_config: OmegaConf):\n", - " # This is going to try to find the same file, and if it fails, \n", - " # it will use the copy in .nemo\n", - " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", - " \n", - " with open(vocab_file, 'r') as f:\n", - " vocab = []\n", - " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", - " self.vocab = vocab\n", - "\n", - " self._validation_dl = self._setup_data_loader(val_data_config)\n", - " \n", - " def setup_test_data(self, test_data_config: OmegaConf):\n", - " # This is going to try to find the same file, and if it fails, \n", - " # it will use the copy in .nemo\n", - " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", - "\n", - " with open(vocab_file, 'r') as f:\n", - " vocab = []\n", - " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", - " self.vocab = vocab\n", - "\n", - " self._test_dl = self._setup_data_loader(test_data_config)\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "mn09jsRZDusN" - }, - "source": [ - "# Let's try creating a model now !\n", - "model = NeMoGPTv2(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "sQPIPySDD1K0" - }, - "source": [ - "# Now let's try to save and restore !\n", - "model.save_to('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "0YwCJ4xaJ3bU" - }, - "source": [ - "temp_model = NeMoGPTv2.restore_from('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "tcxwDIIWKKCQ" - }, - "source": [ - "temp_model.setup_multiple_test_data(temp_model.cfg.test_ds)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "j3Olm6ZTKRbO" - }, - "source": [ - "if torch.cuda.is_available():\n", - " accelerator = 'gpu'\n", - "else:\n", - " accelerator = 'cpu'\n", - "\n", - "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches =1.0)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "_QE2SngCKV2p" - }, - "source": [ - "trainer.test(model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o2HpKzwKJ_MW" - }, - "source": [ - "------\n", - "There we go ! Now our models can be serialized and de-serialized without any issue, even with an external vocab file !" 
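(As an optional sanity check: a `.nemo` file is just a tar archive, as noted earlier, so the registered `vocab.txt` should be visible inside the saved checkpoint alongside the config and weights. A small sketch - )

```python
# Optional sanity check: list the contents of the saved .nemo archive.
import tarfile

with tarfile.open('gpt_model.nemo', 'r:*') as archive:  # 'r:*' handles plain or compressed tar
    for member in archive.getnames():
        print(member)
```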
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZjCV5u3_OO7a" - }, - "source": [ - "" - ], - "execution_count": null, - "outputs": [] - } - ] + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "01_NeMo_Models.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "ASnx4b5jXsil" + }, + "source": [ + "\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "\n", + "# ## Install NeMo\n", + "BRANCH = 'r1.23.0'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "## Install TorchAudio\n", + "!pip install torchaudio>=0.10.0 -f https://download.pytorch.org/whl/torch_stable.html\n", + "\n", + "## Grab the config we'll use in this example\n", + "!mkdir configs" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a0eAURFKXdFT" + }, + "source": [ + "# minGPT License\n", + "\n", + "*This notebook port's the [minGPT codebase](https://github.com/karpathy/minGPT) into equivalent NeMo code. The license for minGPT has therefore been attached here.*\n", + "\n", + "```\n", + "The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy\n", + "\n", + "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", + "\n", + "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", + "\n", + "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2b7Z064UZFH9" + }, + "source": [ + "# torch-rnn License\n", + "*This notebook utilizes the `tiny-shakespeare` dataset from the [torch-rnn](https://github.com/jcjohnson/torch-rnn) codebase. 
The license for torch-rnn has therefore been attached here.*\n", + "\n", + "```\n", + "The MIT License (MIT)\n", + "\n", + "Copyright (c) 2016 Justin Johnson\n", + "\n", + "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", + "\n", + "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", + "\n", + "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eKzK-Z7obCED" + }, + "source": [ + "-------\n", + "\n", + "***Note: This notebook will intentionally introduce some errors to show the power of Neural Types or model development concepts, inside the cells marked with `[ERROR CELL]`. The explanation of and resolution of such errors can be found in the subsequent cells.***\n", + "\n", + "-----" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "81qdv0mPee-j" + }, + "source": [ + "# The NeMo Model\n", + "\n", + "NeMo comes with several state-of-the-art pre-trained Conversational AI models for users to quickly be able to start training and fine-tuning on their own datasets. \n", + "\n", + "In the previous [NeMo Primer](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb) notebook, we learned how to download pretrained checkpoints with NeMo and we also discussed the fundamental concepts of the NeMo Model. The previous tutorial showed us how to use, modify, save, and restore NeMo Models.\n", + "\n", + "In this tutorial we will learn how to develop a non-trivial NeMo model from scratch. This helps us to understand the underlying components and how they interact with the overall PyTorch ecosystem.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nKNftwxzllth" + }, + "source": [ + "-------\n", + "At the heart of NeMo lies the concept of the \"Model\". For NeMo developers, a \"Model\" is the neural network(s) as well as all the infrastructure supporting those network(s), wrapped into a singular, cohesive unit. As such, most NeMo models are constructed to contain the following out of the box (note: some NeMo models support additional functionality specific to the domain/use case!) 
- \n", + "\n", + " - Neural Network architecture - all of the modules that are required for the model.\n", + "\n", + " - Dataset + Data Loaders - all of the components that prepare the data for consumption during training or evaluation.\n", + "\n", + " - Preprocessing + Postprocessing - any of the components that process the datasets so the modules can easily consume them.\n", + "\n", + " - Optimizer + Schedulers - basic defaults that work out of the box and allow further experimentation with ease.\n", + "\n", + " - Any other supporting infrastructure - tokenizers, language model configuration, data augmentation, etc." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5VOoAQT1mipO" + }, + "source": [ + "# Constructing a NeMo Model\n", + "\n", + "NeMo \"Models\" are comprised of a few key components, so let's tackle them one by one. We will attempt to go in the order that's stated above.\n", + "\n", + "To make this slightly challenging, let's port a model from the NLP domain this time. Transformers are all the rage, with BERT and his friends from Sesame Street forming the core infrastructure for many NLP tasks. \n", + "\n", + "An excellent (yet simple) implementation of one such model - GPT - can be found in the `minGPT` repository - https://github.com/karpathy/minGPT. While the script is short, it explains and succinctly explores all of the core components we expect in a NeMo model, so it's a prime candidate for NeMo! Sidenote: NeMo supports GPT in its NLP collection, and as such, this notebook aims to be an in-depth development walkthrough for such models.\n", + "\n", + "In the following notebook, we will attempt to port minGPT to NeMo, and along the way, discuss some core concepts of NeMo itself." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fOlQKsaRot1l" + }, + "source": [ + "# Constructing the Neural Network Architecture\n", + "\n", + "First, on the list - the neural network that forms the backbone of the NeMo Model.\n", + "\n", + "So how do we create such a model? Using PyTorch! As you'll see below, NeMo components are compatible with all of PyTorch, so you can augment your workflow without ever losing the flexibility of PyTorch itself!\n", + "\n", + "Let's start with a couple of imports - " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "piLOgwOPX1FS" + }, + "source": [ + "import torch\n", + "import nemo\n", + "from nemo.core import NeuralModule\n", + "from nemo.core import typecheck" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yySYjHgAqVvT" + }, + "source": [ + "## Neural Module\n", + "Wait, what's `NeuralModule`? Where is the wonderful `torch.nn.Module`? \n", + "\n", + "`NeuralModule` is a subclass of `torch.nn.Module`, and it brings with it a few additional functionalities.\n", + "\n", + "In addition to being a `torch.nn.Module`, thereby being entirely compatible with the PyTorch ecosystem, it has the following capabilities - \n", + "\n", + "1) `Typing` - It adds support for `Neural Type Checking` to the model. `Typing` is optional but quite useful, as we will discuss below!\n", + "\n", + "2) `Serialization` - Remember the `OmegaConf` config dict and YAML config files? Well, all `NeuralModules` inherently supports serialization/deserialization from such config dictionaries!\n", + "\n", + "3) `FileIO` - This is another entirely optional file serialization system. Does your `NeuralModule` require some way to preserve data that can't be saved into a PyTorch checkpoint? 
Write your serialization and deserialization logic in two handy methods! **Note**: When you create the final NeMo Model, this will be implemented for you! Automatic serialization and deserialization support of NeMo models!\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bseLiNoqqQrE" + }, + "source": [ + "class MyEmptyModule(NeuralModule):\n", + "\n", + " def forward(self):\n", + " print(\"Neural Module ~ hello world!\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j4Q36L5urdOQ" + }, + "source": [ + "x = MyEmptyModule()\n", + "x()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lHXAcn5Ot_1I" + }, + "source": [ + "## Neural Types\n", + "\n", + "Neural Types? You might be wondering what that term refers to.\n", + "\n", + "Almost all NeMo components inherit the class `Typing`. `Typing` is a simple class that adds two properties to the class that inherits it - `input_types` and `output_types`. A NeuralType, by its shortest definition, is simply a semantic tensor. It contains information regarding the semantic shape the tensor should hold, as well as the semantic information of what that tensor represents. That's it.\n", + "\n", + "So what semantic information does such a typed tensor contain? Let's take an example below.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ezOJERbVwG34" + }, + "source": [ + "------\n", + "Across the Deep Learning domain, we often encounter cases where tensor shapes may match, but the semantics don't match at all. For example take a look at the following rank 3 tensors - " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZvC57bbxwXxN" + }, + "source": [ + "# Case 1:\n", + "embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", + "x = torch.randint(high=10, size=(1, 5))\n", + "print(\"x :\", x)\n", + "print(\"embedding(x) :\", embedding(x).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sMaqhMBgxe2C" + }, + "source": [ + "# Case 2\n", + "lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", + "x = torch.randn(1, 5, 1)\n", + "print(\"x :\", x)\n", + "print(\"lstm(x) :\", lstm(x)[0].shape) # Let's take all timestep outputs of the LSTM" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9IQHjki-yezX" + }, + "source": [ + "-------\n", + "As you can see, the output of Case 1 is an embedding of shape [1, 5, 30], and the output of Case 2 is an LSTM output (state `h` over all time steps), also of the same shape [1, 5, 30].\n", + "\n", + "Do they have the same shape? **Yes**.
If we do a Case 1 .shape == Case 2 .shape, will we get True as an output? **Yes**.
\n", + "Do they represent the same concept? **No**.
\n", + "\n", + "\n", + "The ability to recognize that the two tensors do not represent the same semantic information is precisely why we utilize Neural Types. It contains the information of both the shape and the semantic concept of what that tensor represents. If we performed a neural type check between the two outputs of those tensors, it would raise an error saying semantically they were different things (more technically, it would say that they are `INCOMPATIBLE` with each other)!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ucP0hNI7vWrU" + }, + "source": [ + "--------\n", + "\n", + "You may have read of concepts such as [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html). While conceptually similar, Neural Types attached by NeMo are not as tightly bound to the PyTorch ecosystem - practically any object of a class can be attached with a neural type!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uvf5oLt9zxSS" + }, + "source": [ + "## Neural Types - Usage\n", + "\n", + "Neural Types sound interesting, so how do we go about adding them? Let's take a few cases below. \n", + "\n", + "Neural Types are one of the core foundations of NeMo - you will find them in a vast majority of Neural Modules, and every NeMo Model will have its Neural Types defined. While they are entirely optional and not intrusive, NeMo takes great care to support it so that there is no semantic incompatibility between components being used by users." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eTizOBUg0qIB" + }, + "source": [ + "Let's start with a basic example of a type checked module." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yp0FG8NJt1Jd" + }, + "source": [ + "from nemo.core.neural_types import NeuralType\n", + "from nemo.core.neural_types import *" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3tsgs8Fp0-WV" + }, + "source": [ + "class EmbeddingModule(NeuralModule):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", + "\n", + " @typecheck()\n", + " def forward(self, x):\n", + " return self.embedding(x)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'x': NeuralType(axes=('B', 'T'), elements_type=Index())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EmbeddedTextType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sY9GYEoD3Yy0" + }, + "source": [ + "To show the benefit of Neural Types, we are going to replicate the above cases inside NeuralModules.\n", + "\n", + "Let's discuss how we added type checking support to the above class.\n", + "\n", + "1) `forward` has a decorator `@typecheck()` on it.\n", + "\n", + "2) `input_types` and `output_types` properties are defined.\n", + "\n", + "That's it!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "on268fAX4LLU" + }, + "source": [ + "-------\n", + "\n", + "Let's expand on each of the above steps.\n", + "\n", + "- `@typecheck()` is a simple decorator that takes any class that inherits `Typing` (NeuralModule does this for us) and adds the two default properties of `input_types` and `output_types`, which by default returns None.\n", + "\n", + "The `@typecheck()` decorator's explicit use ensures that, by default, neural type checking is **disabled**. NeMo does not wish to intrude on the development process of models. So users can \"opt-in\" to type checking by overriding the two properties. Therefore, the decorator ensures that users are not burdened with type checking before they wish to have it.\n", + "\n", + "So what is `@typecheck()`? Simply put, you can wrap **any** function of a class that inherits `Typing` with this decorator, and it will look up the definition of the types of that class and enforce them. Typically, `torch.nn.Module` subclasses only implement `forward()` so it is most common to wrap that method, but `@typecheck()` is a very flexible decorator. Inside NeMo, we will show some advanced use cases (which are quite crucial to particular domains such as TTS)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o9i1KugG5om7" + }, + "source": [ + "------\n", + "\n", + "As we see above, `@typecheck()` enforces the types. How then, do we provide this type of information to NeMo? \n", + "\n", + "By overriding `input_types` and `output_types` properties of the class, we can return a dictionary mapping a string name to a `NeuralType`.\n", + "\n", + "In the above case, we define a `NeuralType` as two components - \n", + "\n", + "- `axes`: This is the semantic information of the carried by the axes themselves. The most common axes information is from single character notation.\n", + "\n", + "> `B` = Batch
\n", + "> `C` / `D` - Channel / Dimension (treated the same)
\n", + "> `T` - Time
\n", + "> `H` / `W` - Height / Width
\n", + "\n", + "- `elements_type`: This is the semantic information of \"what the tensor represents\". All such types are derived from the basic `ElementType`, and merely subclassing `ElementType` allows us to build a hierarchy of custom semantic types that can be used by NeMo!\n", + "\n", + "Here, we declare that the input is an element_type of `Index` (index of the character in the vocabulary) and that the output is an element_type of `EmbeddedTextType` (the text embedding)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "boxxMniv27vi" + }, + "source": [ + "embedding_module = EmbeddingModule()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BgfDuBm27wiV" + }, + "source": [ + "Now let's construct the equivalent of the Case 2 above, but as a `NeuralModule`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SZZOOoCJ2-iV" + }, + "source": [ + "class LSTMModule(NeuralModule):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", + "\n", + " @typecheck()\n", + " def forward(self, x):\n", + " return self.lstm(x)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'x': NeuralType(axes=('B', 'T', 'C'), elements_type=SpectrogramType())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7iIWIunz8IQq" + }, + "source": [ + "------\n", + "Here, we define the LSTM module from the Case 2 above.\n", + "\n", + "We changed the input to be a rank three tensor, now representing a \"SpectrogramType\". We intentionally keep it generic - it can be a `MelSpectrogramType` or a `MFCCSpectrogramType` as its input!\n", + "\n", + "The output of an LSTM is now an `EncodedRepresentation`. Practically, this can be the output of a CNN layer, a Transformer block, or in this case, an LSTM layer. We can, of course, specialize by subclassing EncodedRepresentation and then using that!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6LlOJf0C8GN4" + }, + "source": [ + "lstm_module = LSTMModule()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hj0wonSz8_0c" + }, + "source": [ + "------\n", + "Now for the test !" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "giLJlub78-Ja" + }, + "source": [ + "# Case 1 [ERROR CELL]\n", + "x1 = torch.randint(high=10, size=(1, 5))\n", + "print(\"x :\", x1)\n", + "print(\"embedding(x) :\", embedding_module(x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K-fhclja9WLr" + }, + "source": [ + "-----\n", + "You might be wondering why we get a `TypeError` right off the bat. This `TypeError` is raised by design.\n", + "\n", + "Positional arguments can cause significant issues during model development, mostly when the model/module design is not finalized. To reduce the potential for mistakes caused by wrong positional arguments and enforce the name of arguments provided to the function, `Typing` requires you to **call all of your type-checked functions by kwargs only**." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2KUj_p6M9L-f" + }, + "source": [ + "# Case 1\n", + "print(\"x :\", x1)\n", + "print(\"embedding(x) :\", embedding_module(x=x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dirhWWvMRusx" + }, + "source": [ + "Now let's try the same for the `LSTMModule` in Case 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FMu3B0-9-CqE" + }, + "source": [ + "# Case 2 [ERROR CELL]\n", + "x2 = torch.randn(1, 5, 1) # Input = [B=1, T=5, C=1]\n", + "print(\"x :\", x2)\n", + "print(\"lstm(x) :\", lstm_module(x=x2)[0].shape) # Let's take all timestep outputs of the LSTM" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-OTLdR_4-isV" + }, + "source": [ + "-----\n", + "Now we get a type error stating that the number of output arguments provided does not match what is expected.\n", + "\n", + "What exactly is going on here? Well, inside our `LSTMModule` class, we declare the output types to be a single NeuralType - an `EncodedRepresentation` of shape [B, T, C].\n", + "\n", + "But the output of an LSTM layer is a tuple of \n", + "1) the encoded representation of shape [B, T, C]\n", + "2) another tuple containing two state values - the hidden state `h` and the cell state `c`, each of shape [num_layers * num_directions, B, C]!\n", + "\n", + "So the neural type system raises an error saying that the number of output arguments does not match what is expected.\n", + "\n", + "**NOTE**: The axis kind information of the two states will be represented by `D` to represent a general \"Dimension\" - since `num_layers` and `num_directions` are collapsed under a single axis. For NeMo, Axis types of `C` and `D` are equivalent and can be interchanged, so we will use `C` here to represent the hidden dimension of the LSTM and `D` to represent the merged axis `num_layers * num_directions`.\n", + "\n", + "Let's fix the above." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q2u-keAM-d-B" + }, + "source": [ + "class CorrectLSTMModule(LSTMModule): # Let's inherit the wrong class to make it easy to override\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation()),\n", + " 'h_c': [NeuralType(axes=('D', 'B', 'C'), elements_type=EncodedRepresentation())],\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a99NX0O8KMvW" + }, + "source": [ + "You should note that for the `h_c` neural type, we wrap it in a list - `[]`. NeMo, by default, assumes that each `NeuralType` corresponds to a single returned value. However, in the case of LSTMs, they produce a tuple of two state tensors.\n", + "\n", + "So we inform NeMo that this particular `NeuralType` is a single-dimensional list of items - and that each element of this list shares the same `NeuralType` and has the same shape.\n", + "\n", + "NeMo then ensures that the `h_c` is always a list of tensors. It will not check *how many* items are in the list, but will ensure that the returned value *must be a list containing zero or more items* - and that each of these items share the same `NeuralType`. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GyPZH-fz_dG4" + }, + "source": [ + "lstm_module = CorrectLSTMModule()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9whH50PE_Xyx" + }, + "source": [ + "# Case 2\n", + "x2 = torch.randn(1, 5, 1)\n", + "y2, (h, c) = lstm_module(x=x2)\n", + "print(\"x :\", x2)\n", + "print(\"lstm(x) :\", y2.shape) # The output of the LSTM RNN\n", + "print(\"hidden state (h) :\", h.shape) # The first hidden state of the LSTM RNN\n", + "print(\"hidden state (c) :\", c.shape) # The second hidden state of the LSTM RNN" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cRueNvNY_jI3" + }, + "source": [ + "------\n", + "Great! So now, the type checking system is happy.\n", + "\n", + "If you looked closely, the outputs were ordinary Torch Tensors (this is good news; we don't want to be incompatible with torch Tensors after all!). So, where exactly is the type of information stored?\n", + "\n", + "When the `output_types` is overridden, and valid torch tensors are returned as a result, these tensors are attached with the attribute `neural_type`. Let's inspect this -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bGQ9XbWU_ffa" + }, + "source": [ + "emb_out = embedding_module(x=x1)\n", + "lstm_out = lstm_module(x=x2)[0]\n", + "\n", + "assert hasattr(emb_out, 'neural_type')\n", + "assert hasattr(lstm_out, 'neural_type')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kEpBruSOScPJ" + }, + "source": [ + "print(\"Embedding tensor :\", emb_out.neural_type)\n", + "print(\"LSTM tensor :\", lstm_out.neural_type)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BWTsqiAHAony" + }, + "source": [ + "-------\n", + "So we see that these tensors now have this attribute called `neural_type` and are the same shape.\n", + "\n", + "This exercise's entire goal was to assert that the two outputs are semantically **not** the same object, even if they are the same shape. \n", + "\n", + "Let's test this!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8AU9FMtdATIm" + }, + "source": [ + "emb_out.neural_type.compare(lstm_out.neural_type)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2cqnqAGIBCjA" + }, + "source": [ + "emb_out.neural_type == lstm_out.neural_type" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HmH6B0mHDJqb" + }, + "source": [ + "## Neural Types - Limitations\n", + "\n", + "You might have noticed one interesting fact - our inputs were just `torch.Tensor` to both typed function calls, and they had no `neural_type` assigned to them.\n", + "\n", + "So why did the type check system not raise any error? \n", + "\n", + "This is to maintain compatibility - type checking is meant to work on a chain of function calls - and each of these functions should themselves be wrapped with the `@typecheck()` decorator. This is also done because we don't want to overtax the forward call with dozens of checks, and therefore we only type modules that perform some higher-order logical computation. \n", + "\n", + "------\n", + "\n", + "As an example, it is mostly unnecessary (but still possible) to type the input and output of every residual block of a ResNet model. 
However, it is practically important to type the encoder (no matter how many layers is inside it) and the decoder (the classification head) separately so that when one does fine-tuning, there is no semantic mismatch of the tensors input to the encoder and bound to the decoder." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6m28zSEKEjt_" + }, + "source": [ + "-------\n", + "For this case, since it would be impractical to extend a class to attach a type to the input tensor, we can take a shortcut and directly attach the neural type to the input!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AGbKB4gJEzcU" + }, + "source": [ + "embedding_module = EmbeddingModule()\n", + "x1 = torch.randint(high=10, size=(1, 5))\n", + "\n", + "# Attach correct neural type\n", + "x1.neural_type = NeuralType(('B', 'T'), Index())\n", + "\n", + "print(\"embedding(x) :\", embedding_module(x=x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "F0j-evylFM5j" + }, + "source": [ + "# Attach wrong neural type [ERROR CELL]\n", + "x1.neural_type = NeuralType(('B', 'T'), LabelsType())\n", + "\n", + "print(\"embedding(x) :\", embedding_module(x=x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "StMPyg6oCC9B" + }, + "source": [ + "## Let's create the minGPT components\n", + "\n", + "Now that we have a somewhat firm grasp of neural type checking, let's begin porting the minGPT example code. Once again, most of the code will be a direct port from the [minGPT repository](https://github.com/karpathy/minGPT).\n", + "\n", + "Here, you will notice one thing. By just changing class imports, one `@typecheck()` on forward, and adding `input_types` and `output_types` (which are also entirely optional!), we are almost entirely done with the PyTorch Lightning port!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "raFkuSRaBAE0" + }, + "source": [ + "import math\n", + "from typing import List, Set, Dict, Tuple, Optional\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "from torch.nn import functional as F" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yakGOXrzF1XW" + }, + "source": [ + "## Creating Element Types\n", + "\n", + "Till now, we have used the Neural Types provided by the NeMo core. But we need not be restricted to the pre-defined element types !\n", + "\n", + "Users have total flexibility in defining any hierarchy of element types as they please!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ybhLLVyUF0mo" + }, + "source": [ + "class AttentionType(EncodedRepresentation):\n", + " \"\"\"Basic Attention Element Type\"\"\"\n", + "\n", + "class SelfAttentionType(AttentionType):\n", + " \"\"\"Self Attention Element Type\"\"\"\n", + "\n", + "class CausalSelfAttentionType(SelfAttentionType):\n", + " \"\"\"Causal Self Attention Element Type\"\"\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mONJRMdbZNSE" + }, + "source": [ + "## Creating the modules\n", + "\n", + "Neural Modules are generally top-level modules but can be used at any level of the module hierarchy.\n", + "\n", + "For demonstration, we will treat an encoder comprising a block of Causal Self Attention modules as a typed Neural Module. 
Of course, we can also treat each Causal Self Attention layer itself as a neural module if we require it, but top-level modules are generally preferred." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "w4oXpAL_CoDp" + }, + "source": [ + "class CausalSelfAttention(nn.Module):\n", + " \"\"\"\n", + " A vanilla multi-head masked self-attention layer with a projection at the end.\n", + " It is possible to use torch.nn.MultiheadAttention here but I am including an\n", + " explicit implementation here to show that there is nothing too scary here.\n", + " \"\"\"\n", + "\n", + " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", + " super().__init__()\n", + " assert n_embd % n_head == 0\n", + " self.n_head = n_head\n", + " # key, query, value projections for all heads\n", + " self.key = nn.Linear(n_embd, n_embd)\n", + " self.query = nn.Linear(n_embd, n_embd)\n", + " self.value = nn.Linear(n_embd, n_embd)\n", + " # regularization\n", + " self.attn_drop = nn.Dropout(attn_pdrop)\n", + " self.resid_drop = nn.Dropout(resid_pdrop)\n", + " # output projection\n", + " self.proj = nn.Linear(n_embd, n_embd)\n", + " # causal mask to ensure that attention is only applied to the left in the input sequence\n", + " self.register_buffer(\"mask\", torch.tril(torch.ones(block_size, block_size))\n", + " .view(1, 1, block_size, block_size))\n", + " def forward(self, x, layer_past=None):\n", + " B, T, C = x.size()\n", + "\n", + " # calculate query, key, values for all heads in batch and move head forward to be the batch dim\n", + " k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", + " q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", + " v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", + "\n", + " # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)\n", + " att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))\n", + " att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))\n", + " att = F.softmax(att, dim=-1)\n", + " att = self.attn_drop(att)\n", + " y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)\n", + " y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side\n", + "\n", + " # output projection\n", + " y = self.resid_drop(self.proj(y))\n", + " return y\n", + " \n", + "\n", + "class Block(nn.Module):\n", + " \"\"\" an unassuming Transformer block \"\"\"\n", + "\n", + " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", + " super().__init__()\n", + " self.ln1 = nn.LayerNorm(n_embd)\n", + " self.ln2 = nn.LayerNorm(n_embd)\n", + " self.attn = CausalSelfAttention(n_embd, block_size, n_head, attn_pdrop, resid_pdrop)\n", + " self.mlp = nn.Sequential(\n", + " nn.Linear(n_embd, 4 * n_embd),\n", + " nn.GELU(),\n", + " nn.Linear(4 * n_embd, n_embd),\n", + " nn.Dropout(resid_pdrop),\n", + " )\n", + "\n", + " def forward(self, x):\n", + " x = x + self.attn(self.ln1(x))\n", + " x = x + self.mlp(self.ln2(x))\n", + " return x" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mv0dyrLifkw0" + }, + "source": [ + "## Building the NeMo Model\n", + "\n", + "Since a NeMo Model is comprised of various parts, we are going to iterate on the model step by step inside this notebook. 
As such, we will have multiple intermediate NeMo \"Models\", which will be partial implementations, and they will inherit each other iteratively.\n", + "\n", + "In a complete implementation of a NeMo Model (as found in the NeMo collections), all of these components will generally be found in a single class.\n", + "\n", + "Let's start by inheriting `ModelPT` - the core class of a PyTorch NeMo Model, which inherits the PyTorch Lightning Module." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TxeG-qMrRgNU" + }, + "source": [ + "-------\n", + "**Remember**:\n", + "\n", + " - The NeMo equivalent of `torch.nn.Module` is the `NeuralModule.\n", + " - The NeMo equivalent of the `LightningModule` is `ModelPT`.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0TsfmCYthMux" + }, + "source": [ + "import pytorch_lightning as ptl\n", + "from nemo.core import ModelPT\n", + "from omegaconf import OmegaConf" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ib2rSz2hjaP" + }, + "source": [ + "------\n", + "Next, let's construct the bare minimum implementation of the NeMo Model - just the constructor, the initializer of weights, and the forward method.\n", + "\n", + "Initially, we will follow the steps followed by the minGPT implementation, and progressively refactor for NeMo " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "98x9-Fh-HVwj" + }, + "source": [ + "class PTLGPT(ptl.LightningModule):\n", + " def __init__(self,\n", + " # model definition args\n", + " vocab_size: int, # size of the vocabulary (number of possible tokens)\n", + " block_size: int, # length of the model's context window in time\n", + " n_layer: int, # depth of the model; number of Transformer blocks in sequence\n", + " n_embd: int, # the \"width\" of the model, number of channels in each Transformer\n", + " n_head: int, # number of heads in each multi-head attention inside each Transformer block\n", + " # model optimization args\n", + " learning_rate: float = 3e-4, # the base learning rate of the model\n", + " weight_decay: float = 0.1, # amount of regularizing L2 weight decay on MatMul ops\n", + " betas: Tuple[float, float] = (0.9, 0.95), # momentum terms (betas) for the Adam optimizer\n", + " embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", + " resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", + " attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix\n", + " ):\n", + " super().__init__()\n", + "\n", + " # save these for optimizer init later\n", + " self.learning_rate = learning_rate\n", + " self.weight_decay = weight_decay\n", + " self.betas = betas\n", + "\n", + " # input embedding stem: drop(content + position)\n", + " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", + " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", + " self.drop = nn.Dropout(embd_pdrop)\n", + " # deep transformer: just a sequence of transformer blocks\n", + " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)])\n", + " # decoder: at the end one more layernorm and decode the answers\n", + " self.ln_f = nn.LayerNorm(n_embd)\n", + " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", + "\n", + " self.block_size = block_size\n", + " self.apply(self._init_weights)\n", + "\n", + " print(\"number of parameters: %e\" % sum(p.numel() for p 
in self.parameters()))\n", + "\n", + " def forward(self, idx):\n", + " b, t = idx.size()\n", + " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", + "\n", + " # forward the GPT model\n", + " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", + " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", + " x = self.drop(token_embeddings + position_embeddings)\n", + " x = self.blocks(x)\n", + " x = self.ln_f(x)\n", + " logits = self.head(x)\n", + "\n", + " return logits\n", + "\n", + " def get_block_size(self):\n", + " return self.block_size\n", + "\n", + " def _init_weights(self, module):\n", + " \"\"\"\n", + " Vanilla model initialization:\n", + " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", + " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", + " \"\"\"\n", + " if isinstance(module, (nn.Linear, nn.Embedding)):\n", + " module.weight.data.normal_(mean=0.0, std=0.02)\n", + " if isinstance(module, nn.Linear) and module.bias is not None:\n", + " module.bias.data.zero_()\n", + " elif isinstance(module, nn.LayerNorm):\n", + " module.bias.data.zero_()\n", + " module.weight.data.fill_(1.0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2bMf5SO7wmor" + }, + "source": [ + "------\n", + "Let's create a PyTorch Lightning Model above, just to make sure it works !" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rrXIBzg4wutC" + }, + "source": [ + "m = PTLGPT(vocab_size=100, block_size=32, n_layer=1, n_embd=32, n_head=4)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZCcgn1bajPW8" + }, + "source": [ + "------\n", + "Now, let's convert the above easily into a NeMo Model.\n", + "\n", + "A NeMo Model constructor generally accepts only two things - \n", + "\n", + "1) `cfg`: An OmegaConf DictConfig object that defines precisely the components required by the model to define its neural network architecture, data loader setup, optimizer setup, and any additional components needed for the model itself.\n", + "\n", + "2) `trainer`: An optional Trainer from PyTorch Lightning if the NeMo model will be used for training. It can be set after construction (if required) using the `set_trainer` method. For this notebook, we will not be constructing the config for the Trainer object." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WQMTCB3kz0UA" + }, + "source": [ + "## Refactoring Neural Modules\n", + "\n", + "As we discussed above, Neural Modules are generally higher-level components of the Model and can potentially be replaced by equivalent Neural Modules.\n", + "\n", + "As we see above, the embedding modules, deep transformer decoder network, and final decoder layer have all been combined inside the PyTorch Lightning implementation constructor.\n", + "\n", + "------\n", + "\n", + "However, the final decoder module could have been an RNN instead of a simple Linear layer, or it could have been a 1D-CNN instead.\n", + "\n", + "Likewise, the deep transformer decoder could potentially have a different implementation of Self Attention modules.\n", + "\n", + "These changes cannot be easily implemented any more inside the above implementation. However, if we refactor these components into their respective NeuralModules, then we can easily replace them with equivalent modules we construct in the future!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EJj5sSkX0xHi" + }, + "source": [ + "### Refactoring the Embedding module\n", + "\n", + "Let's first refactor out the embedding module from the above implementation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uYwMyjqK05RL" + }, + "source": [ + "class GPTEmbedding(NeuralModule):\n", + " def __init__(self, vocab_size: int, n_embd: int, block_size: int, embd_pdrop: float = 0.0):\n", + " super().__init__()\n", + "\n", + " # input embedding stem: drop(content + position)\n", + " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", + " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", + " self.drop = nn.Dropout(embd_pdrop)\n", + "\n", + " @typecheck()\n", + " def forward(self, idx):\n", + " b, t = idx.size()\n", + " \n", + " # forward the GPT model\n", + " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", + " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", + " x = self.drop(token_embeddings + position_embeddings)\n", + " return x\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'idx': NeuralType(('B', 'T'), Index())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'embeddings': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l5rOP6lyOyRt" + }, + "source": [ + "### Refactoring the Encoder\n", + "\n", + "Next, let's refactor the GPT Encoder - which is implemented as a multi layer Transformer (Decoder) network.\n", + "\n", + "------\n", + "It can be noted that we refer to the GPT \"Encoder\" module - but it is constructed by using Transformer \"Decoder\" blocks.\n", + "\n", + "***When we discuss Neural Modules - we are discussing an abstract module with a certain input neural type and a certain output neural type.***\n", + "\n", + "For us, the GPT \"Encoder\" neural module will accept any implementation, whose\n", + "\n", + "- input neural type is `NeuralType(('B', 'T', 'C'), EmbeddedTextType())`\n", + "\n", + "- output type is `NeuralType(('B', 'T', 'C'), EncodedRepresentation())`\n", + "\n", + "-----\n", + "One concrete implementation of such a GPT \"Encoder\" neural module is a Deep Transformer \"Decoder\" network." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1QeQnQ_G2PwH" + }, + "source": [ + "class GPTTransformerEncoder(NeuralModule):\n", + " def __init__(self, n_embd: int, block_size: int, n_head: int, n_layer: int, attn_pdrop: float = 0.0, resid_pdrop: float = 0.0):\n", + " super().__init__()\n", + "\n", + " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) \n", + " for _ in range(n_layer)])\n", + " \n", + " @typecheck()\n", + " def forward(self, embed):\n", + " return self.blocks(embed)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'encoding': NeuralType(('B', 'T', 'C'), CausalSelfAttentionType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NmCR3LK3QHum" + }, + "source": [ + "### Refactoring the Decoder\n", + "\n", + "Finally, let's refactor the Decoder - the small one-layer feed-forward network to decode the answer.\n", + "\n", + "-------\n", + "\n", + "Note an interesting detail - The `input_types` of the Decoder accepts the generic `EncoderRepresentation()`, where as the `neural_type` of the `GPTTransformerEncoder` has the `output_type` of `CausalSelfAttentionType`.\n", + "\n", + "This is semantically *not* a mismatch! As you can see above in the inheritance chart, we declare `EncodedRepresentation` -> `AttentionType` -> `SelfAttentionType` -> `CausalSelfAttentionType`. \n", + "\n", + "Such an inheritance hierarchy for the `element_type` allows future encoders (which also have a neural output type of at least `EncodedRepresentation`) to be swapped in place of the current GPT Causal Self Attention Encoder while keeping the rest of the NeMo model working just fine!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VCPUu0EWQIBX" + }, + "source": [ + "class GPTDecoder(NeuralModule):\n", + " def __init__(self, n_embd: int, vocab_size: int):\n", + " super().__init__()\n", + " self.ln_f = nn.LayerNorm(n_embd)\n", + " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", + "\n", + " @typecheck()\n", + " def forward(self, encoding):\n", + " x = self.ln_f(encoding)\n", + " logits = self.head(x)\n", + " return logits\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())\n", + " }\n", + " \n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", + " }\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nYLMjlW0Sdy1" + }, + "source": [ + "### Refactoring the NeMo GPT Model\n", + "\n", + "Now that we have 3 NeuralModules for the embedding, the encoder, and the decoder, let's refactor the NeMo model to take advantage of this refactor!\n", + "\n", + "This time, we inherit from `ModelPT` instead of the general `LightningModule`." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZQlmtYU6iDwi" + }, + "source": [ + "class AbstractNeMoGPT(ModelPT):\n", + " def __init__(self, cfg: OmegaConf, trainer: ptl.Trainer = None):\n", + " super().__init__(cfg=cfg, trainer=trainer)\n", + "\n", + " # input embedding stem: drop(content + position)\n", + " self.embedding = self.from_config_dict(self.cfg.embedding)\n", + " # deep transformer: just a sequence of transformer blocks\n", + " self.encoder = self.from_config_dict(self.cfg.encoder)\n", + " # decoder: at the end one more layernorm and decode the answers\n", + " self.decoder = self.from_config_dict(self.cfg.decoder)\n", + "\n", + " self.block_size = self.cfg.embedding.block_size\n", + " self.apply(self._init_weights)\n", + "\n", + " print(\"number of parameters: %e\" % self.num_weights)\n", + "\n", + " @typecheck()\n", + " def forward(self, idx):\n", + " b, t = idx.size()\n", + " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", + "\n", + " # forward the GPT model\n", + " # Remember: Only kwargs are allowed !\n", + " e = self.embedding(idx=idx)\n", + " x = self.encoder(embed=e)\n", + " logits = self.decoder(encoding=x)\n", + "\n", + " return logits\n", + "\n", + " def get_block_size(self):\n", + " return self.block_size\n", + "\n", + " def _init_weights(self, module):\n", + " \"\"\"\n", + " Vanilla model initialization:\n", + " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", + " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", + " \"\"\"\n", + " if isinstance(module, (nn.Linear, nn.Embedding)):\n", + " module.weight.data.normal_(mean=0.0, std=0.02)\n", + " if isinstance(module, nn.Linear) and module.bias is not None:\n", + " module.bias.data.zero_()\n", + " elif isinstance(module, nn.LayerNorm):\n", + " module.bias.data.zero_()\n", + " module.weight.data.fill_(1.0)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'idx': NeuralType(('B', 'T'), Index())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DFRmxWiSmdF3" + }, + "source": [ + "## Creating a config for a Model\n", + "\n", + "At first glance, not much changed compared to the PyTorch Lightning implementation above. Other than the constructor, which now accepts a config, nothing changed at all!\n", + "\n", + "NeMo operates on the concept of a NeMo Model being accompanied by a corresponding config dict (instantiated as an OmegaConf object). This enables us to prototype the model by utilizing Hydra rapidly. This includes various other benefits - such as hyperparameter optimization and serialization/deserialization of NeMo models.\n", + "\n", + "Let's look at how actually to construct such config objects!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uygo0BEYjKuj" + }, + "source": [ + "# model definition args (required)\n", + "# ================================\n", + "# vocab_size: int # size of the vocabulary (number of possible tokens)\n", + "# block_size: int # length of the model's context window in time\n", + "# n_layer: int # depth of the model; number of Transformer blocks in sequence\n", + "# n_embd: int # the \"width\" of the model, number of channels in each Transformer\n", + "# n_head: int # number of heads in each multi-head attention inside each Transformer block \n", + "\n", + "# model definition args (optional)\n", + "# ================================\n", + "# embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", + "# resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", + "# attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s4sdqRAFop-n" + }, + "source": [ + "------\n", + "As we look at the required parameters above, we need a way to tell OmegaConf that these values are currently not set, but the user should set them before we use them.\n", + "\n", + "OmegaConf supports such behavior using the `MISSING` value. A similar effect can be achieved in YAML configs by using `???` as a placeholder." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XqLSZq7Soo2j" + }, + "source": [ + "from omegaconf import MISSING" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JTH-1vu8TO7o" + }, + "source": [ + "# Let's create a utility for building the class path\n", + "def get_class_path(cls):\n", + " return f'{cls.__module__}.{cls.__name__}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6xToaWAJUmtX" + }, + "source": [ + "### Structure of a Model config\n", + "\n", + "Let's first create a config for the common components of the model level config -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZCvLdOlMVLy_" + }, + "source": [ + "common_config = OmegaConf.create({\n", + " 'vocab_size': MISSING,\n", + " 'block_size': MISSING,\n", + " 'n_layer': MISSING,\n", + " 'n_embd': MISSING,\n", + " 'n_head': MISSING,\n", + "})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j8hvdKa4VmCV" + }, + "source": [ + "-----\n", + "The model config right now is still being built - it needs to contain a lot more details!\n", + "\n", + "A complete Model Config should have the sub-configs of all of its top-level modules as well. 
This means the configs of the `embedding`, `encoder`, and the `decoder`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v-2_QOZyVgrE" + }, + "source": [ + "### Structure of sub-module config\n", + "\n", + "For top-level models, we generally don't change the actual module very often, and instead, primarily change the hyperparameters of that model.\n", + "\n", + "So we will make use of `Hydra`'s Class instantiation method - which can easily be accessed via the class method `ModelPT.from_config_dict()`.\n", + "\n", + "Let's take a few examples below -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ntsxQKH0pDac" + }, + "source": [ + "embedding_config = OmegaConf.create({\n", + " '_target_': get_class_path(GPTEmbedding),\n", + " 'vocab_size': '${model.vocab_size}',\n", + " 'n_embd': '${model.n_embd}',\n", + " 'block_size': '${model.block_size}',\n", + " 'embd_pdrop': 0.1\n", + "})\n", + "\n", + "encoder_config = OmegaConf.create({\n", + " '_target_': get_class_path(GPTTransformerEncoder),\n", + " 'n_embd': '${model.n_embd}',\n", + " 'block_size': '${model.block_size}',\n", + " 'n_head': '${model.n_head}',\n", + " 'n_layer': '${model.n_layer}',\n", + " 'attn_pdrop': 0.1,\n", + " 'resid_pdrop': 0.1\n", + "})\n", + "\n", + "decoder_config = OmegaConf.create({\n", + " '_target_': get_class_path(GPTDecoder),\n", + " # n_embd: int, vocab_size: int\n", + " 'n_embd': '${model.n_embd}',\n", + " 'vocab_size': '${model.vocab_size}'\n", + "})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtloTqkqWhpl" + }, + "source": [ + "##### What is `_target_`?\n", + "--------\n", + "\n", + "In the above config, we see a `_target_` in the config. `_target_` is usually a full classpath to the actual class in the python package/user local directory. It is required for Hydra to locate and instantiate the model from its path correctly.\n", + "\n", + "So why do we want to set a classpath?\n", + "\n", + "In general, when developing models, we don't often change the encoder or the decoder, but we do change the hyperparameters of the encoder and decoder.\n", + "\n", + "This notation helps us keep the Model level declaration of the forward step neat and precise. It also logically helps us demark which parts of the model can be easily replaced - in the future, we can easily replace the encoder with some other type of self-attention block or the decoder with an RNN or 1D-CNN neural module (as long as they have the same Neural Type definition as the current blocks).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ASDmcgE4XtQ4" + }, + "source": [ + "##### What is the `${}` syntax?\n", + "-------\n", + "\n", + "OmegaConf, and by extension, Hydra, supports Variable Interpolation. As you can see in the `__init__` of embedding, encoder, and decoder neural modules, they often share many parameters between each other.\n", + "\n", + "It would become tedious and error-prone to set each of these constructors' values separately in each of the embedding, encoder, and decoder configs.\n", + "\n", + "So instead, we define standard keys inside of the `model` level config and then interpolate these values inside of the respective configs!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zXvEcXGhZi5I" + }, + "source": [ + "### Attaching the model and module-level configs\n", + "\n", + "So now, we have a Model level and per-module level configs for the core components. 
Sub-module configs generally fall under the \"model\" namespace, but you have the flexibility to define the structure as you require.\n", + "\n", + "Let's attach them!\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c8hvNeB_aDgi" + }, + "source": [ + "model_config = OmegaConf.create({\n", + " 'model': common_config\n", + "})\n", + "\n", + "# Then let's attach the sub-module configs\n", + "model_config.model.embedding = embedding_config\n", + "model_config.model.encoder = encoder_config\n", + "model_config.model.decoder = decoder_config" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zIubuFcOpIB0" + }, + "source": [ + "-----\n", + "Let's print this config!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2SyKNgp9pG0N" + }, + "source": [ + "print(OmegaConf.to_yaml(model_config))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4PAA07EAauCn" + }, + "source": [ + "-----\n", + "Wait, why did OmegaConf not fill in the value of the variable interpolation for the configs yet?\n", + "\n", + "This is because OmegaConf takes a deferred approach to variable interpolation. First, we fill in temporary values of the required fields (those marked by `???`). Then, to force resolution ahead of time, we can use the following snippet - " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0X4C76JyOAnN" + }, + "source": [ + "import copy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ugxA0TPtbHVZ" + }, + "source": [ + "temp_config = copy.deepcopy(model_config)\n", + "temp_config.model.vocab_size = 10\n", + "temp_config.model.block_size = 4\n", + "temp_config.model.n_layer = 1\n", + "temp_config.model.n_embd = 32\n", + "temp_config.model.n_head = 4\n", + "\n", + "temp_config = OmegaConf.create(OmegaConf.to_container(temp_config, resolve=True))\n", + "print(OmegaConf.to_yaml(temp_config))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V41RFIpEpiOu" + }, + "source": [ + "-----\n", + "Now that we have a config, let's try to create an object of the NeMo Model !" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IIIVi2IfpsJ4" + }, + "source": [ + "# Let's work on a copy of the model config and update it before we send it into the Model.\n", + "cfg = copy.deepcopy(model_config)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OllBhswPqQXq" + }, + "source": [ + "# Let's set the values of the config (for some plausible small model)\n", + "cfg.model.vocab_size = 100\n", + "cfg.model.block_size = 128\n", + "cfg.model.n_layer = 1\n", + "cfg.model.n_embd = 32\n", + "cfg.model.n_head = 4" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QJm2LnTqqcIM" + }, + "source": [ + "print(OmegaConf.to_yaml(cfg))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E7tpB8BcqeBO" + }, + "source": [ + "# Try to create a model with this config [ERROR CELL]\n", + "m = AbstractNeMoGPT(cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cXOLhpxdq4Ni" + }, + "source": [ + "-----\n", + "\n", + "You will note that we added the `Abstract` tag for a reason to this NeMo Model and that when we try to instantiate it - it raises an error that we need to implement specific methods.\n", + "\n", + "1) `setup_training_data` & `setup_validation_data` - All NeMo models should implement two data loaders - the training data loader and the validation data loader. Optionally, they can go one step further and also implement the `setup_test_data` method to add support for evaluating the Model on its own.\n", + "\n", + "Why do we enforce this? NeMo Models are meant to be a unified, cohesive object containing the details about the neural network underlying that Model and the data loaders to train, validate, and optionally test those models.\n", + "\n", + "In doing so, once the Model is created/deserialized, it would take just a few more steps to train the Model from scratch / fine-tune/evaluate the Model on any data that the user provides, as long as this user-provided dataset is in a format supported by the Dataset / DataLoader that is used by this Model!\n", + "\n", + "2) `list_available_models` - This is a utility method to provide a list of pre-trained NeMo models to the user from the cloud.\n", + "\n", + "Typically, NeMo models can be easily packaged into a tar file (which we call a .nemo file in the earlier primer notebook). These tar files contain the model config + the pre-trained checkpoint weights of the Model, and can easily be downloaded from some cloud service. \n", + "\n", + "For this notebook, we will not be implementing this method.\n", + "\n", + "--------\n", + "Finally, let's create a concrete implementation of the above NeMo Model!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vcwi1lO7t7Sm" + }, + "source": [ + "from nemo.core.classes.common import PretrainedModelInfo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ckCxyVLYqrz0" + }, + "source": [ + "class BasicNeMoGPT(AbstractNeMoGPT):\n", + "\n", + " @classmethod\n", + " def list_available_models(cls) -> PretrainedModelInfo:\n", + " return None\n", + "\n", + " def setup_training_data(self, train_data_config: OmegaConf):\n", + " self._train_dl = None\n", + " \n", + " def setup_validation_data(self, val_data_config: OmegaConf):\n", + " self._validation_dl = None\n", + " \n", + " def setup_test_data(self, test_data_config: OmegaConf):\n", + " self._test_dl = None" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ofUoJ8DDvq_Y" + }, + "source": [ + "------\n", + "Now let's try to create an object of the `BasicNeMoGPT` model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G8iYQSC5vptU" + }, + "source": [ + "m = BasicNeMoGPT(cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "otvYW4TBxAju" + }, + "source": [ + "## Setting up train-val-test steps\n", + "\n", + "The above `BasicNeMoGPT` Model is a basic PyTorch Lightning Module, with some added functionality - \n", + "\n", + "1) Neural Type checks support - as defined in the Model as well as the internal modules.\n", + "\n", + "2) Save and restore of the Model (in the trivial case) to a tarfile.\n", + "\n", + "But as the Model is right now, it crucially does not support PyTorch Lightning's `Trainer`. As such, while this Model can be called manually, it cannot be easily trained or evaluated by using the PyTorch Lightning framework.\n", + "\n", + "------\n", + "\n", + "Let's begin adding support for this then -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QU3oQAVovxRg" + }, + "source": [ + "class BasicNeMoGPTWithSteps(BasicNeMoGPT):\n", + "\n", + " def step_(self, split, batch, batch_idx=None):\n", + " idx, targets = batch\n", + " logits = self(idx=idx)\n", + " loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))\n", + " key = 'loss' if split == 'train' else f\"{split}_loss\"\n", + " self.log(key, loss)\n", + " return {key: loss}\n", + "\n", + " def training_step(self, *args, **kwargs):\n", + " return self.step_('train', *args, **kwargs)\n", + "\n", + " def validation_step(self, *args, **kwargs):\n", + " return self.step_('val', *args, **kwargs)\n", + "\n", + " def test_step(self, *args, **kwargs):\n", + " return self.step_('test', *args, **kwargs)\n", + " \n", + " # This is useful for multiple validation data loader setup\n", + " def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):\n", + " val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()\n", + " return {'val_loss': val_loss_mean}\n", + "\n", + " # This is useful for multiple test data loader setup\n", + " def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0):\n", + " test_loss_mean = torch.stack([x['test_loss'] for x in outputs]).mean()\n", + " return {'test_loss': test_loss_mean}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2Ki3kRxag511" + }, + "source": [ + "m = BasicNeMoGPTWithSteps(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"f_7YziAw_Isu" + }, + "source": [ + "### Setup for Multi Validation and Multi Test data loaders\n", + "\n", + "As discussed in the NeMo Primer, NeMo has in-built support for multiple data loaders for validation and test steps. Therefore, as an example of how easy it is to add such support, we include the `multi_validation_epoch_end` and `multi_test_epoch_end` overrides.\n", + "\n", + "It is also practically essential to collate results from more than one distributed GPUs, and then aggregate results properly at the end of the epoch. NeMo strictly enforces the correct collation of results, even if you will work on only one device! Future-proofing is baked into the model design for this case!\n", + "\n", + "Therefore NeMo provides the above two generic methods to support aggregation and simultaneously support multiple datasets!\n", + "\n", + "**Please note, you can prepend your already existing `on_validation_epoch_end` and `on_test_epoch_end` implementations with the `multi_` in the name, and that alone is sufficient to enable multi-dataset and multi-GPU support!**\n", + "\n", + "------\n", + "**Note: To disable multi-dataset support, simply override `on_validation_epoch_end` and `on_test_epoch_end` instead of `multi_validation_epoch_end` and `multi_test_epoch_end`!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QpfSn-YUh7GK" + }, + "source": [ + "## Setting up the optimizer / scheduler\n", + "\n", + "We are relatively close to reaching feature parity with the MinGPT Model! But we are missing a crucial piece - the optimizer.\n", + "\n", + "All NeMo Model's come with a default implementation of `setup_optimization()`, which will parse the provided model config to obtain the `optim` and `sched` sub-configs, and automatically configure the optimizer and scheduler.\n", + "\n", + "If training GPT was as simple as plugging in an Adam optimizer over all the parameters with a cosine weight decay schedule, we could do that from the config alone.\n", + "\n", + "-------\n", + "\n", + "But GPT is not such a trivial model - more specifically, it requires weight decay to be applied to the weight matrices but not to the biases, the embedding matrix, or the LayerNorm layers.\n", + "\n", + "We can drop the support that Nemo provides for such special cases and instead utilize the PyTorch Lightning method `configure_optimizers` to perform the same task.\n", + "\n", + "-------\n", + "\n", + "Note, for NeMo Models; the `configure_optimizers` is implemented as a trivial call to `setup_optimization()` followed by returning the generated optimizer and scheduler! So we can override the `configure_optimizer` method and manage the optimizer creation manually!\n", + "\n", + "NeMo's goal is to provide usable defaults for the general case and simply back off to either PyTorch Lightning or PyTorch nn.Module itself in cases when the additional flexibility becomes necessary!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FgXkZQiVjnOv" + }, + "source": [ + "class BasicNeMoGPTWithOptim(BasicNeMoGPTWithSteps):\n", + "\n", + " def configure_optimizers(self):\n", + " \"\"\"\n", + " This long function is unfortunately doing something very simple and is being very defensive:\n", + " We are separating out all parameters of the model into two buckets: those that will experience\n", + " weight decay for regularization and those that won't (biases, and layernorm/embedding weights).\n", + " We are then returning the PyTorch optimizer object.\n", + " \"\"\"\n", + "\n", + " # separate out all parameters to those that will and won't experience weight decay\n", + " decay = set()\n", + " no_decay = set()\n", + " whitelist_weight_modules = (torch.nn.Linear, )\n", + " blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)\n", + " for mn, m in self.named_modules():\n", + " for pn, p in m.named_parameters():\n", + " fpn = '%s.%s' % (mn, pn) if mn else pn # full param name\n", + "\n", + " if pn.endswith('bias'):\n", + " # all biases will not be decayed\n", + " no_decay.add(fpn)\n", + " elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):\n", + " # weights of whitelist modules will be weight decayed\n", + " decay.add(fpn)\n", + " elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):\n", + " # weights of blacklist modules will NOT be weight decayed\n", + " no_decay.add(fpn)\n", + "\n", + " # special case the position embedding parameter in the root GPT module as not decayed\n", + " no_decay.add('embedding.pos_emb')\n", + "\n", + " # validate that we considered every parameter\n", + " param_dict = {pn: p for pn, p in self.named_parameters()}\n", + " inter_params = decay & no_decay\n", + " union_params = decay | no_decay\n", + " assert len(inter_params) == 0, \"parameters %s made it into both decay/no_decay sets!\" % (str(inter_params), )\n", + " assert len(param_dict.keys() - union_params) == 0, \"parameters %s were not separated into either decay/no_decay set!\" \\\n", + " % (str(param_dict.keys() - union_params), )\n", + "\n", + " # create the pytorch optimizer object\n", + " optim_groups = [\n", + " {\"params\": [param_dict[pn] for pn in sorted(list(decay))], \"weight_decay\": self.cfg.optim.weight_decay},\n", + " {\"params\": [param_dict[pn] for pn in sorted(list(no_decay))], \"weight_decay\": 0.0},\n", + " ]\n", + " optimizer = torch.optim.AdamW(optim_groups, lr=self.cfg.optim.lr, betas=self.cfg.optim.betas)\n", + " return optimizer\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kARDwthakEQk" + }, + "source": [ + "m = BasicNeMoGPTWithOptim(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iB1kwctv2cYv" + }, + "source": [ + "-----\n", + "Now let's setup the config for the optimizer !" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5K7zh9Cn2s2u" + }, + "source": [ + "OmegaConf.set_struct(cfg.model, False)\n", + "\n", + "optim_config = OmegaConf.create({\n", + " 'lr': 3e-4,\n", + " 'weight_decay': 0.1,\n", + " 'betas': [0.9, 0.95]\n", + "})\n", + "\n", + "cfg.model.optim = optim_config\n", + "\n", + "OmegaConf.set_struct(cfg.model, True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P31p8ABthsh0" + }, + "source": [ + "## Setting up the dataset / data loaders\n", + "\n", + "So we were able almost entirely to replicate the MinGPT implementation. \n", + "\n", + "Remember, NeMo models should contain all of the logic to load the Dataset and DataLoader for at least the train and validation step.\n", + "\n", + "We temporarily provided empty implementations to get around it till now, but let's fill that in now!\n", + "\n", + "-------\n", + "\n", + "**Note for datasets**: Below, we will show an example using a very small dataset called `tiny_shakespeare`, found at the original [char-rnn repository](https://github.com/karpathy/char-rnn), but practically you could use any text corpus. The one suggested in minGPT is available at http://mattmahoney.net/dc/textdata.html" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q8dlOcZPkxM1" + }, + "source": [ + "### Creating the Dataset\n", + "\n", + "NeMo has Neural Type checking support, even for Datasets! It's just a minor change of the import in most cases and one difference in how we handle `collate_fn`.\n", + "\n", + "We could paste the dataset info from minGPT, and you'd only need to make 2 changes!\n", + "\n", + "-----\n", + "In this example, we will be writing a thin subclass over the datasets provided by `nlp` from HuggingFace!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E-fswFkig9t4" + }, + "source": [ + "from nemo.core import Dataset\n", + "from torch.utils import data\n", + "from torch.utils.data.dataloader import DataLoader" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-Z8XuPeClGNm" + }, + "source": [ + "class TinyShakespeareDataset(Dataset):\n", + "\n", + " def __init__(self, data_path, block_size, crop=None, override_vocab=None):\n", + "\n", + " # load the data and crop it appropriately\n", + " with open(data_path, 'r') as f:\n", + " if crop is None:\n", + " data = f.read()\n", + " else:\n", + " f.seek(crop[0])\n", + " data = f.read(crop[1])\n", + "\n", + " # build a vocabulary from data or inherit it\n", + " vocab = sorted(list(set(data))) if override_vocab is None else override_vocab\n", + "\n", + " # Add UNK\n", + " special_tokens = ['', ''] # We use just and in the call, but can add others.\n", + " if not override_vocab:\n", + " vocab = [*special_tokens, *vocab] # Update train vocab with special tokens\n", + "\n", + " data_size, vocab_size = len(data), len(vocab)\n", + " print('data of crop %s has %d characters, vocab of size %d.' 
% (str(crop), data_size, vocab_size))\n", + " print('Num samples in dataset : %d' % (data_size // block_size))\n", + "\n", + " self.stoi = { ch:i for i,ch in enumerate(vocab) }\n", + " self.itos = { i:ch for i,ch in enumerate(vocab) }\n", + " self.block_size = block_size\n", + " self.vocab_size = vocab_size\n", + " self.data = data\n", + " self.vocab = vocab\n", + " self.special_tokens = special_tokens\n", + "\n", + " def __len__(self):\n", + " return len(self.data) // self.block_size\n", + "\n", + " def __getitem__(self, idx):\n", + " # attempt to fetch a chunk of (block_size + 1) items, but (block_size) will work too\n", + " chunk = self.data[idx*self.block_size : min(len(self.data), (idx+1)*self.block_size + 1)]\n", + " # map the string into a sequence of integers\n", + " ixes = [self.stoi[s] if s in self.stoi else self.stoi[''] for s in chunk ]\n", + " # if stars align (last idx and len(self.data) % self.block_size == 0), pad with \n", + " if len(ixes) < self.block_size + 1:\n", + " assert len(ixes) == self.block_size # i believe this is the only way this could happen, make sure\n", + " ixes.append(self.stoi[''])\n", + " dix = torch.tensor(ixes, dtype=torch.long)\n", + " return dix[:-1], dix[1:]\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'input': NeuralType(('B', 'T'), Index()),\n", + " 'target': NeuralType(('B', 'T'), LabelsType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7MEMR4TcmP5K" + }, + "source": [ + "------\n", + "We didn't have to change anything until here. How then is type-checking done? \n", + "\n", + "NeMo does type-checking inside of the collate function implementation itself! In this case, it is not necessary to override the `collate_fn` inside the Dataset, but if we did need to override it, **NeMo requires that the private method `_collate_fn` be overridden instead**.\n", + "\n", + "We can then use data loaders with minor modifications!\n", + "\n", + "**Also, there is no need to implement the `input_types` for Dataset, as they are the ones generating the input for the model!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZeKXAknenVch" + }, + "source": [ + "-----\n", + "Let's prepare the dataset that we are going to use - Tiny Shakespeare from the following codebase [char-rnn](https://github.com/karpathy/char-rnn)." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VwsdXtVzo--t" + }, + "source": [ + "import os" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QvKcDCvIl9-A" + }, + "source": [ + "if not os.path.exists('tiny-shakespeare.txt'):\n", + " !wget https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ynCwqDu6vK8P" + }, + "source": [ + "!head -n 5 tiny-shakespeare.txt" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bfRL4t9_oS4C" + }, + "source": [ + "train_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(0, int(1e6)))\n", + "val_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1e6), int(50e3)), override_vocab=train_dataset.vocab)\n", + "test_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1.05e6), int(100e3)), override_vocab=train_dataset.vocab)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kIlCoZDksEDO" + }, + "source": [ + "### Setting up dataset/data loader support in the Model\n", + "\n", + "So we now know our data loader works. Let's integrate it as part of the Model itself!\n", + "\n", + "To do this, we use the three special attributes of the NeMo Model - `self._train_dl`, `self._validation_dl` and `self._test_dl`. Once you construct your DataLoader, place your data loader to these three variables. \n", + "\n", + "For multi-data loader support, the same applies! NeMo will automatically handle the management of multiple data loaders for you!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SVSfIk_-rMSg" + }, + "source": [ + "class NeMoGPT(BasicNeMoGPTWithOptim):\n", + "\n", + " def _setup_data_loader(self, cfg):\n", + " if self.vocab is None:\n", + " override_vocab = None\n", + " else:\n", + " override_vocab = self.vocab\n", + "\n", + " dataset = TinyShakespeareDataset(\n", + " data_path=cfg.data_path,\n", + " block_size=cfg.block_size,\n", + " crop=tuple(cfg.crop) if 'crop' in cfg else None,\n", + " override_vocab=override_vocab\n", + " )\n", + "\n", + " if self.vocab is None:\n", + " self.vocab = dataset.vocab\n", + "\n", + " return DataLoader(\n", + " dataset=dataset,\n", + " batch_size=cfg.batch_size,\n", + " shuffle=cfg.shuffle,\n", + " collate_fn=dataset.collate_fn, # <-- this is necessary for type checking\n", + " pin_memory=cfg.pin_memory if 'pin_memory' in cfg else False,\n", + " num_workers=cfg.num_workers if 'num_workers' in cfg else 0\n", + " )\n", + " \n", + " def setup_training_data(self, train_data_config: OmegaConf):\n", + " self.vocab = None\n", + " self._train_dl = self._setup_data_loader(train_data_config)\n", + " \n", + " def setup_validation_data(self, val_data_config: OmegaConf):\n", + " self._validation_dl = self._setup_data_loader(val_data_config)\n", + " \n", + " def setup_test_data(self, test_data_config: OmegaConf):\n", + " self._test_dl = self._setup_data_loader(test_data_config)\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ait4nLtIxS96" + }, + "source": [ + "### Creating the dataset / dataloader config\n", + "\n", + "The final step to setup this model is to add the `train_ds`, `validation_ds` and `test_ds` configs inside the model config!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C6zcTqJixOOL" + }, + "source": [ + "OmegaConf.set_struct(cfg.model, False)\n", + "\n", + "# Set the data path and update vocabular size\n", + "cfg.model.data_path = 'tiny-shakespeare.txt'\n", + "cfg.model.vocab_size = train_dataset.vocab_size\n", + "\n", + "OmegaConf.set_struct(cfg.model, True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zlvThf7BysyT" + }, + "source": [ + "train_ds = OmegaConf.create({\n", + " 'data_path': '${model.data_path}',\n", + " 'block_size': '${model.block_size}',\n", + " 'crop': [0, int(1e6)],\n", + " 'batch_size': 64,\n", + " 'shuffle': True,\n", + "})\n", + "\n", + "validation_ds = OmegaConf.create({\n", + " 'data_path': '${model.data_path}',\n", + " 'block_size': '${model.block_size}',\n", + " 'crop': [int(1e6), int(50e3)],\n", + " 'batch_size': 4,\n", + " 'shuffle': False,\n", + "})\n", + "\n", + "test_ds = OmegaConf.create({\n", + " 'data_path': '${model.data_path}',\n", + " 'block_size': '${model.block_size}',\n", + " 'crop': [int(1.05e6), int(100e3)],\n", + " 'batch_size': 4,\n", + " 'shuffle': False,\n", + "})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QVVzR6WKyMT5" + }, + "source": [ + "# Attach to the model config\n", + "OmegaConf.set_struct(cfg.model, False)\n", + "\n", + "cfg.model.train_ds = train_ds\n", + "cfg.model.validation_ds = validation_ds\n", + "cfg.model.test_ds = test_ds\n", + "\n", + "OmegaConf.set_struct(cfg.model, True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nd_9_mxS0ET-" + }, + "source": [ + "# Let's see the config now !\n", + "print(OmegaConf.to_yaml(cfg))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dlwSQENU0JxA" + }, + "source": [ + "# Let's try creating a model now !\n", + "model = NeMoGPT(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q_Mp4bhH0tR1" + }, + "source": [ + "-----\n", + "All the data loaders load properly ! Yay!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CZHDqCyo6uWd" + }, + "source": [ + "# Evaluate the model - end to end!\n", + "\n", + "Now that the data loaders have been set up, all that's left is to train and test the model! We have most of the components required by this model - the train, val and test data loaders, the optimizer, and the type-checked forward step to perform the train-validation-test steps! \n", + "\n", + "But training a GPT model from scratch is not the goal of this primer, so instead, let's do a sanity check by merely testing the model for a few steps using random initial weights.\n", + "\n", + "The above will ensure that - \n", + "\n", + "1) Our data loaders work as intended\n", + "\n", + "2) The type checking system assures us that our Neural Modules are performing their forward step correctly.\n", + "\n", + "3) The loss is calculated, and therefore the model runs end to end, ultimately supporting PyTorch Lightning." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "johk6Z0e0WEm" + }, + "source": [ + "if torch.cuda.is_available():\n", + " accelerator = 'gpu'\n", + "else:\n", + " accelerator = 'cpu'\n", + "\n", + "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches=1.0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oqeeofEr1S8e" + }, + "source": [ + "trainer.test(model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pqJy7esrA-Ha" + }, + "source": [ + "# Saving and restoring models\n", + "\n", + "NeMo internally keeps track of the model configuration, as well as the model checkpoints and parameters.\n", + "\n", + "As long as your NeMo follows the above general guidelines, you can call the `save_to` and `restore_from` methods to save and restore your models!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DksG_-7G1Vbe" + }, + "source": [ + "model.save_to('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JhjoFdCnBWVh" + }, + "source": [ + "!ls -d -- *.nemo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "567txSF0BYXN" + }, + "source": [ + "temp_model = NeMoGPT.restore_from('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YvnfG0kxBfTt" + }, + "source": [ + "# [ERROR CELL]\n", + "temp_model.setup_test_data(temp_model.cfg.test_ds)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N0ckN44YB-1K" + }, + "source": [ + "-----\n", + "\n", + "Hmm, it seems it wasn't so easy in this case. Non-trivial models have non-trivial issues!\n", + "\n", + "Remember, our NeMoGPT model sets its self.vocab inside the `setup_train_data` step. But that depends on the vocabulary generated by the train set... which is **not** restored during model restoration (unless you call `setup_train_data` explicitly!).\n", + "\n", + "We can quickly resolve this issue by constructing an external data file to enable save and restore support, and NeMo supports that too! We will use the `register_artifact` API in NeMo to support external files being attached to the .nemo checkpoint." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_Atyoc4NBjEV" + }, + "source": [ + "class NeMoGPTv2(NeMoGPT):\n", + " \n", + " def setup_training_data(self, train_data_config: OmegaConf):\n", + " self.vocab = None\n", + " self._train_dl = self._setup_data_loader(train_data_config)\n", + "\n", + " # Save the vocab into a text file for now\n", + " with open('vocab.txt', 'w') as f:\n", + " for token in self.vocab:\n", + " f.write(f\"{token}\")\n", + " \n", + " # This is going to register the file into .nemo!\n", + " # When you later use .save_to(), it will copy this file into the tar file.\n", + " self.register_artifact('vocab_file', 'vocab.txt')\n", + " \n", + " def setup_validation_data(self, val_data_config: OmegaConf):\n", + " # This is going to try to find the same file, and if it fails, \n", + " # it will use the copy in .nemo\n", + " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", + " \n", + " with open(vocab_file, 'r') as f:\n", + " vocab = []\n", + " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", + " self.vocab = vocab\n", + "\n", + " self._validation_dl = self._setup_data_loader(val_data_config)\n", + " \n", + " def setup_test_data(self, test_data_config: OmegaConf):\n", + " # This is going to try to find the same file, and if it fails, \n", + " # it will use the copy in .nemo\n", + " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", + "\n", + " with open(vocab_file, 'r') as f:\n", + " vocab = []\n", + " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", + " self.vocab = vocab\n", + "\n", + " self._test_dl = self._setup_data_loader(test_data_config)\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mn09jsRZDusN" + }, + "source": [ + "# Let's try creating a model now !\n", + "model = NeMoGPTv2(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sQPIPySDD1K0" + }, + "source": [ + "# Now let's try to save and restore !\n", + "model.save_to('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0YwCJ4xaJ3bU" + }, + "source": [ + "temp_model = NeMoGPTv2.restore_from('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tcxwDIIWKKCQ" + }, + "source": [ + "temp_model.setup_multiple_test_data(temp_model.cfg.test_ds)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j3Olm6ZTKRbO" + }, + "source": [ + "if torch.cuda.is_available():\n", + " accelerator = 'gpu'\n", + "else:\n", + " accelerator = 'cpu'\n", + "\n", + "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches =1.0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_QE2SngCKV2p" + }, + "source": [ + "trainer.test(model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o2HpKzwKJ_MW" + }, + "source": [ + "------\n", + "There we go ! Now our models can be serialized and de-serialized without any issue, even with an external vocab file !" 
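(One small follow-up on the cells above: `trainer.test(model)` exercises the freshly built `model`, not the checkpoint we just restored. To verify the save/restore round trip end to end, you would presumably run the restored instance instead — a one-line variation:)

```python
# Run the test loop on the restored model rather than the original instance
trainer.test(temp_model)
```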
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZjCV5u3_OO7a" + }, + "source": [], + "execution_count": null, + "outputs": [] + } + ] } diff --git a/tutorials/02_NeMo_Adapters.ipynb b/tutorials/02_NeMo_Adapters.ipynb index 289426f3bc2b..f6b848fcca9d 100644 --- a/tutorials/02_NeMo_Adapters.ipynb +++ b/tutorials/02_NeMo_Adapters.ipynb @@ -25,7 +25,7 @@ "!pip install text-unidecode\n", "\n", "# ## Install NeMo\n", - "BRANCH = 'main'\n", + "BRANCH = 'r1.23.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "## Grab the config we'll use in this example\n", diff --git a/tutorials/AudioTranslationSample.ipynb b/tutorials/AudioTranslationSample.ipynb index 0c34baacc953..27c223d58e5b 100644 --- a/tutorials/AudioTranslationSample.ipynb +++ b/tutorials/AudioTranslationSample.ipynb @@ -38,7 +38,7 @@ }, "outputs": [], "source": [ - "BRANCH = 'main'\n", + "BRANCH = 'r1.23.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n" ] }, diff --git a/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb b/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb index ae4f43867c8d..eda5631f1cb5 100644 --- a/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb +++ b/tutorials/Publish_NeMo_Model_On_Hugging_Face_Hub.ipynb @@ -41,7 +41,7 @@ "!pip install text-unidecode\n", "\n", "### Install NeMo\n", - "BRANCH = 'main'\n", + "BRANCH = 'r1.23.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" ] }, diff --git a/tutorials/VoiceSwapSample.ipynb b/tutorials/VoiceSwapSample.ipynb index addf19f3b236..3c3b48d9590c 100644 --- a/tutorials/VoiceSwapSample.ipynb +++ b/tutorials/VoiceSwapSample.ipynb @@ -39,7 +39,7 @@ }, "outputs": [], "source": [ - "BRANCH = 'main'\n", + "BRANCH = 'r1.23.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n" ] }, diff --git a/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb b/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb index 94e2caa17a58..8cf668a2a083 100644 --- a/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb +++ b/tutorials/asr/ASR_CTC_Language_Finetuning.ipynb @@ -39,7 +39,7 @@ "!pip install matplotlib>=3.3.2\n", "\n", "## Install NeMo\n", - "BRANCH = 'main'\n", + "BRANCH = 'r1.23.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", "\n", "\"\"\"\n", diff --git a/tutorials/asr/ASR_for_telephony_speech.ipynb b/tutorials/asr/ASR_for_telephony_speech.ipynb index 6133fdc9a8b9..fa572945a0ff 100644 --- a/tutorials/asr/ASR_for_telephony_speech.ipynb +++ b/tutorials/asr/ASR_for_telephony_speech.ipynb @@ -1,344 +1,346 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "lJz6FDU1lRzc" - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "5. 
Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell.\n", - "\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install sox libsndfile1 ffmpeg\n", - "!pip install text-unidecode\n", - "!pip install matplotlib>=3.3.2\n", - "\n", - "## Install NeMo\n", - "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "## Grab the config we'll use in this example\n", - "!mkdir configs\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", - "\n", - "\"\"\"\n", - "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", - "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", - "that you want to use the \"Run All Cells\" (or similar) option.\n", - "\"\"\"\n", - "# exit()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v1Jk9etFlRzf" - }, - "source": [ - "# Telephony speech (8 kHz)\n", - "This notebook covers general recommendations for using NeMo models with 8 kHz speech. All the pretrained models currently available through NeMo are trained with audio at 16 kHz. This means that if the original audio was sampled at a different rate, it's sampling rate was converted to 16 kHz through upsampling or downsampling. One of the common applications for ASR is to recognize telephony speech which typically consists of speech sampled at 8 kHz.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Mixed sample rate\n", - "Most of the pretrained English models distributed with NeMo are trained with mixed sample rate data, i.e. the training data typically consists of data sampled at both 8 kHz and 16 kHz. As an example pretrained Citrinet model \"stt_en_citrinet_1024\" was trained with the following datasets. \n", - "* Librispeech 960 hours of English speech\n", - "* Fisher Corpus\n", - "* Switchboard-1 Dataset\n", - "* WSJ-0 and WSJ-1\n", - "* National Speech Corpus - 1\n", - "* Mozilla Common Voice\n", - "\n", - "Among these, Fisher and Switchboard datasets are conversational telephone speech datasets with audio sampled at 8 kHz while the other datasets were originally sampled at least 16 kHz. Before training, all audio files from Fisher and Switchboard datasets were upsampled to 16 kHz. Because of this mixed sample rate training, our models can be used to recognize both narrowband (8kHz) and wideband speech (16kHz)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Inference with NeMo\n", - "NeMo ASR currently supports inference of audio in .wav format. Internally, the audio file is resampled to 16 kHz before inference is called on the model, so there is no difference running inference on 8 kHz audio compared to say 16 kHz or any other sampling rate audio with NeMo. Let's look at an example for running inference on 8 kHz audio. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# This is where the an4/ directory will be placed.\n", - "# Change this if you don't want the data to be extracted in the current directory.\n", - "data_dir = '.'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import glob\n", - "import os\n", - "import subprocess\n", - "import tarfile\n", - "import wget\n", - "\n", - "# Download the dataset. This will take a few moments...\n", - "print(\"******\")\n", - "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", - " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", - " an4_path = wget.download(an4_url, data_dir)\n", - " print(f\"Dataset downloaded at: {an4_path}\")\n", - "else:\n", - " print(\"Tarfile already exists.\")\n", - " an4_path = data_dir + '/an4_sphere.tar.gz'\n", - "\n", - "if not os.path.exists(data_dir + '/an4/'):\n", - " # Untar and convert .sph to .wav (using sox)\n", - " tar = tarfile.open(an4_path)\n", - " tar.extractall(path=data_dir)\n", - "\n", - " print(\"Converting .sph to .wav...\")\n", - " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", - " for sph_path in sph_list:\n", - " wav_path = sph_path[:-4] + '.wav'\n", - " cmd = [\"sox\", sph_path, wav_path]\n", - " subprocess.run(cmd)\n", - "print(\"Finished conversion.\\n******\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Audio in an4 dataset is sampled at 22 kHz. Let's first downsample an audio file to 16 kHz." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import librosa\n", - "import IPython.display as ipd\n", - "import librosa.display\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# Load and listen to the audio file\n", - "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", - "audio, sample_rate = librosa.load(example_file)\n", - "print(sample_rate)\n", - "audio_16kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=16000)\n", - "\n", - "import numpy as np\n", - "\n", - "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", - "spec = np.abs(librosa.stft(audio_16kHz))\n", - "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", - "\n", - "# Use log scale to view frequencies\n", - "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=16000)\n", - "plt.colorbar()\n", - "plt.title('Audio Spectrogram');\n", - "plt.ylim([0, 8000])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's downsample the audio to 8 kHz" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "audio_8kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=8000)\n", - "spec = np.abs(librosa.stft(audio_8kHz))\n", - "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", - "\n", - "# Use log scale to view frequencies\n", - "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=8000)\n", - "plt.colorbar()\n", - "plt.title('Audio Spectrogram');\n", - "plt.ylim([0, 8000])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import soundfile as sf\n", - "sf.write(data_dir + '/audio_16kHz.wav', audio_16kHz, 16000)\n", - "sample, sr = librosa.core.load(data_dir + '/audio_16kHz.wav')\n", - 
"ipd.Audio(sample, rate=sr)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sf.write(data_dir + '/audio_8kHz.wav', audio_8kHz, 8000)\n", - "sample, sr = librosa.core.load(data_dir + '/audio_8kHz.wav')\n", - "ipd.Audio(sample, rate=sr)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "Let's look at inference results using one of the pretrained models on the original, 16 kHz and 8 kHz versions of the example file we chose above." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from nemo.collections.asr.models import ASRModel\n", - "import torch\n", - "if torch.cuda.is_available():\n", - " device = torch.device(f'cuda:0')\n", - "asr_model = ASRModel.from_pretrained(model_name='stt_en_citrinet_1024', map_location=device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As discussed above, there are no changes required for inference based on the sampling rate of audio and as we see below the pretrained Citrinet model gives accurate transcription even for audio downsampled to 8 Khz." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(asr_model.transcribe(paths2audio_files=[example_file]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_16kHz.wav']))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_8kHz.wav']))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Training / fine-tuning with 8 kHz data\n", - "For training a model with new 8 kHz data, one could take two approaches. The first approach, **which is recommended**, is to finetune a pretrained 16 kHz model by upsampling all the data to 16 kHz. Note that upsampling offline before training is not necessary but recommended as online upsampling during training is very time consuming and may slow down training significantly. The second approach is to train an 8 kHz model from scratch. **Note**: For the second approach, in our experiments we saw that loading the weights of a 16 kHz model as initialization helps the model to converge faster with better accuracy.\n", - "\n", - "To upsample your 8 kHz data to 16 kHz command line tools like sox or ffmpeg are very useful. Here is the command to upsample and audio file using sox:\n", - "```shell\n", - "sox input_8k.wav -r 16000 -o output_16k.wav\n", - "```\n", - "Now to finetune a pre-trained model with this upsampled data, you can just restore the model weights from the pre-trained model and call trainer with the upsampled data. 
As an example, here is how one would fine-tune a Citrinet model:\n", - "```python\n", - "python examples/asr/script_to_bpe.py \\\n", - " --config-path=\"examples/asr/conf/citrinet\" \\\n", - " --config-name=\"citrinet_512.yaml\" \\\n", - " model.train_ds.manifest_filepath=\"\" \\\n", - " model.validation_ds.manifest_filepath=\"\" \\\n", - " trainer.devices=-1 \\\n", - " trainer.accelerator='gpu' \\\n", - " trainer.max_epochs=50 \\\n", - " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", - "```\n", - "\n", - "To train an 8 kHz model, just change the sample rate in the config to 8000 as follows:\n", - "\n", - "```python\n", - "python examples/asr/script_to_bpe.py \\\n", - " --config-path=\"examples/asr/conf/citrinet\" \\\n", - " --config-name=\"citrinet_512.yaml\" \\\n", - " model.sample_rate=8000 \\\n", - " model.train_ds.manifest_filepath=\"\" \\\n", - " model.validation_ds.manifest_filepath=\"\" \\\n", - " trainer.devices=-1 \\\n", - " trainer.accelerator='gpu' \\\n", - " trainer.max_epochs=50 \\\n", - " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", - "```" - ] - } - ], + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lJz6FDU1lRzc" + }, + "outputs": [], + "source": [ + "\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", + "\n", + "\n", + "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "!pip install matplotlib>=3.3.2\n", + "\n", + "## Install NeMo\n", + "BRANCH = 'r1.23.0'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "## Grab the config we'll use in this example\n", + "!mkdir configs\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", + "\n", + "\"\"\"\n", + "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", + "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", + "that you want to use the \"Run All Cells\" (or similar) option.\n", + "\"\"\"\n", + "# exit()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1Jk9etFlRzf" + }, + "source": [ + "# Telephony speech (8 kHz)\n", + "This notebook covers general recommendations for using NeMo models with 8 kHz speech. All the pretrained models currently available through NeMo are trained with audio at 16 kHz. This means that if the original audio was sampled at a different rate, it's sampling rate was converted to 16 kHz through upsampling or downsampling. 
One of the common applications for ASR is to recognize telephony speech which typically consists of speech sampled at 8 kHz.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mixed sample rate\n", + "Most of the pretrained English models distributed with NeMo are trained with mixed sample rate data, i.e. the training data typically consists of data sampled at both 8 kHz and 16 kHz. As an example pretrained Citrinet model \"stt_en_citrinet_1024\" was trained with the following datasets. \n", + "* Librispeech 960 hours of English speech\n", + "* Fisher Corpus\n", + "* Switchboard-1 Dataset\n", + "* WSJ-0 and WSJ-1\n", + "* National Speech Corpus - 1\n", + "* Mozilla Common Voice\n", + "\n", + "Among these, Fisher and Switchboard datasets are conversational telephone speech datasets with audio sampled at 8 kHz while the other datasets were originally sampled at least 16 kHz. Before training, all audio files from Fisher and Switchboard datasets were upsampled to 16 kHz. Because of this mixed sample rate training, our models can be used to recognize both narrowband (8kHz) and wideband speech (16kHz)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Inference with NeMo\n", + "NeMo ASR currently supports inference of audio in .wav format. Internally, the audio file is resampled to 16 kHz before inference is called on the model, so there is no difference running inference on 8 kHz audio compared to say 16 kHz or any other sampling rate audio with NeMo. Let's look at an example for running inference on 8 kHz audio. " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# This is where the an4/ directory will be placed.\n", + "# Change this if you don't want the data to be extracted in the current directory.\n", + "data_dir = '.'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "import os\n", + "import subprocess\n", + "import tarfile\n", + "import wget\n", + "\n", + "# Download the dataset. This will take a few moments...\n", + "print(\"******\")\n", + "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", + " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", + " an4_path = wget.download(an4_url, data_dir)\n", + " print(f\"Dataset downloaded at: {an4_path}\")\n", + "else:\n", + " print(\"Tarfile already exists.\")\n", + " an4_path = data_dir + '/an4_sphere.tar.gz'\n", + "\n", + "if not os.path.exists(data_dir + '/an4/'):\n", + " # Untar and convert .sph to .wav (using sox)\n", + " tar = tarfile.open(an4_path)\n", + " tar.extractall(path=data_dir)\n", + "\n", + " print(\"Converting .sph to .wav...\")\n", + " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", + " for sph_path in sph_list:\n", + " wav_path = sph_path[:-4] + '.wav'\n", + " cmd = [\"sox\", sph_path, wav_path]\n", + " subprocess.run(cmd)\n", + "print(\"Finished conversion.\\n******\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Audio in an4 dataset is sampled at 22 kHz. Let's first downsample an audio file to 16 kHz." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import librosa\n", + "import IPython.display as ipd\n", + "import librosa.display\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Load and listen to the audio file\n", + "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", + "audio, sample_rate = librosa.load(example_file)\n", + "print(sample_rate)\n", + "audio_16kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=16000)\n", + "\n", + "import numpy as np\n", + "\n", + "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", + "spec = np.abs(librosa.stft(audio_16kHz))\n", + "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", + "\n", + "# Use log scale to view frequencies\n", + "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=16000)\n", + "plt.colorbar()\n", + "plt.title('Audio Spectrogram');\n", + "plt.ylim([0, 8000])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's downsample the audio to 8 kHz" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "audio_8kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=8000)\n", + "spec = np.abs(librosa.stft(audio_8kHz))\n", + "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", + "\n", + "# Use log scale to view frequencies\n", + "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=8000)\n", + "plt.colorbar()\n", + "plt.title('Audio Spectrogram');\n", + "plt.ylim([0, 8000])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import soundfile as sf\n", + "sf.write(data_dir + '/audio_16kHz.wav', audio_16kHz, 16000)\n", + "sample, sr = librosa.core.load(data_dir + '/audio_16kHz.wav')\n", + "ipd.Audio(sample, rate=sr)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sf.write(data_dir + '/audio_8kHz.wav', audio_8kHz, 8000)\n", + "sample, sr = librosa.core.load(data_dir + '/audio_8kHz.wav')\n", + "ipd.Audio(sample, rate=sr)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Let's look at inference results using one of the pretrained models on the original, 16 kHz and 8 kHz versions of the example file we chose above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.collections.asr.models import ASRModel\n", + "import torch\n", + "if torch.cuda.is_available():\n", + " device = torch.device(f'cuda:0')\n", + "asr_model = ASRModel.from_pretrained(model_name='stt_en_citrinet_1024', map_location=device)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As discussed above, there are no changes required for inference based on the sampling rate of audio and as we see below the pretrained Citrinet model gives accurate transcription even for audio downsampled to 8 Khz." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(asr_model.transcribe(paths2audio_files=[example_file]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_16kHz.wav']))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_8kHz.wav']))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Training / fine-tuning with 8 kHz data\n", + "For training a model with new 8 kHz data, one could take two approaches. The first approach, **which is recommended**, is to finetune a pretrained 16 kHz model by upsampling all the data to 16 kHz. Note that upsampling offline before training is not necessary but recommended as online upsampling during training is very time consuming and may slow down training significantly. The second approach is to train an 8 kHz model from scratch. **Note**: For the second approach, in our experiments we saw that loading the weights of a 16 kHz model as initialization helps the model to converge faster with better accuracy.\n", + "\n", + "To upsample your 8 kHz data to 16 kHz command line tools like sox or ffmpeg are very useful. Here is the command to upsample and audio file using sox:\n", + "```shell\n", + "sox input_8k.wav -r 16000 -o output_16k.wav\n", + "```\n", + "Now to finetune a pre-trained model with this upsampled data, you can just restore the model weights from the pre-trained model and call trainer with the upsampled data. As an example, here is how one would fine-tune a Citrinet model:\n", + "```python\n", + "python examples/asr/script_to_bpe.py \\\n", + " --config-path=\"examples/asr/conf/citrinet\" \\\n", + " --config-name=\"citrinet_512.yaml\" \\\n", + " model.train_ds.manifest_filepath=\"\" \\\n", + " model.validation_ds.manifest_filepath=\"\" \\\n", + " trainer.devices=-1 \\\n", + " trainer.accelerator='gpu' \\\n", + " trainer.max_epochs=50 \\\n", + " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", + "```\n", + "\n", + "To train an 8 kHz model, just change the sample rate in the config to 8000 as follows:\n", + "\n", + "```python\n", + "python examples/asr/script_to_bpe.py \\\n", + " --config-path=\"examples/asr/conf/citrinet\" \\\n", + " --config-name=\"citrinet_512.yaml\" \\\n", + " model.sample_rate=8000 \\\n", + " model.train_ds.manifest_filepath=\"\" \\\n", + " model.validation_ds.manifest_filepath=\"\" \\\n", + " trainer.devices=-1 \\\n", + " trainer.accelerator='gpu' \\\n", + " trainer.max_epochs=50 \\\n", + " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", + "```" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "ASR_with_NeMo.ipynb", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + }, + "pycharm": { + "stem_cell": { + "cell_type": "raw", "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "ASR_with_NeMo.ipynb", - "provenance": [], - "toc_visible": true 
- }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - }, - "pycharm": { - "stem_cell": { - "cell_type": "raw", - "metadata": { - "collapsed": false - }, - "source": [] - } - } + "collapsed": false }, - "nbformat": 4, - "nbformat_minor": 4 + "source": [] + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 } diff --git a/tutorials/asr/ASR_with_NeMo.ipynb b/tutorials/asr/ASR_with_NeMo.ipynb index 1a5fd19b3e54..3d5c3d1a500d 100644 --- a/tutorials/asr/ASR_with_NeMo.ipynb +++ b/tutorials/asr/ASR_with_NeMo.ipynb @@ -1,1176 +1,1178 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "accelerator": "GPU", - "colab": { - "name": "ASR_with_NeMo.ipynb", - "provenance": [], - "collapsed_sections": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.7" - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "lJz6FDU1lRzc" - }, - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell.\n", - "\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install sox libsndfile1 ffmpeg\n", - "!pip install text-unidecode\n", - "!pip install matplotlib>=3.3.2\n", - "\n", - "## Install NeMo\n", - "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "\"\"\"\n", - "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", - "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", - "that you want to use the \"Run All Cells\" (or similar) option.\n", - "\"\"\"\n", - "# exit()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v1Jk9etFlRzf" - }, - "source": [ - "# Introduction to End-To-End Automatic Speech Recognition\n", - "\n", - "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", - "We will first introduce the basics of the main concepts behind speech recognition, then explore concrete examples of what the data looks like and walk through putting together a simple end-to-end ASR pipeline.\n", - "\n", - "We assume that you are familiar with general machine learning concepts and can follow Python code, and we'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YLln3U-IlRzg" - }, - "source": [ - "## Conceptual Overview: What is ASR?\n", - "\n", - "ASR, or **Automatic Speech Recognition**, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text). Our goal is usually to have a model that minimizes the **Word Error Rate (WER)** metric when transcribing speech input. In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with as few errors as possible?\n", - "\n", - "Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to evaluate a speech sample. We would start from a **language model** that encapsulates the most likely orderings of words that are generated (e.g. an n-gram model), to a **pronunciation model** for each word in that ordering (e.g. a pronunciation table), to an **acoustic model** that translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).\n", - "\n", - "Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model `Pr(audio|transcript)*Pr(transcript)`, and take the argmax of this over possible transcripts.\n", - "\n", - "Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a neural model that had better performance and that had a greater potential for generalization. For example, we could replace an n-gram model with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of these neural models need to be trained individually on different tasks, and errors in any model in the pipeline could throw off the whole prediction.\n", - "\n", - "Thus, we can see the appeal of **end-to-end ASR architectures**: discriminative models that simply take an audio input and give a textual output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. 
If desired, we could integrate a language model that would improve our predictions, as well.\n", - "\n", - "And the entire end-to-end ASR model can be trained at once--a much easier pipeline to handle! " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0S5iZPMSlRzg" - }, - "source": [ - "### End-To-End ASR\n", - "\n", - "With an end-to-end model, we want to directly learn `Pr(transcript|audio)` in order to predict the transcripts from the original audio. Since we are dealing with sequential information--audio data over time that corresponds to a sequence of letters--RNNs are the obvious choice. But now we have a pressing problem to deal with: since our input sequence (number of audio timesteps) is not the same length as our desired output (transcript length), how do we match each time step from the audio data to the correct output characters?\n", - "\n", - "Earlier speech recognition approaches relied on **temporally-aligned data**, in which each segment of time in an audio file was matched up to a corresponding speech sound such as a phoneme or word. However, if we would like to have the flexibility to predict letter-by-letter to prevent OOV (out of vocabulary) issues, then each time step in the data would have to be labeled with the letter sound that the speaker is making at that point in the audio file. With that information, it seems like we should simply be able to try to predict the correct letter for each time step and then collapse the repeated letters (e.g. the prediction output `LLLAAAAPPTOOOPPPP` would become `LAPTOP`). It turns out that this idea has some problems: not only does alignment make the dataset incredibly labor-intensive to label, but also, what do we do with words like \"book\" that contain consecutive repeated letters? Simply squashing repeated letters together would not work in that case!\n", - "\n", - "![Alignment example](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/tutorials/asr/images/alignment_example.png)\n", - "\n", - "Modern end-to-end approaches get around this using methods that don't require manual alignment at all, so that the input-output pairs are really just the raw audio and the transcript--no extra data or labeling required. Let's briefly go over two popular approaches that allow us to do this, Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention.\n", - "\n", - "#### Connectionist Temporal Classification (CTC)\n", - "\n", - "In normal speech recognition prediction output, we would expect to have characters such as the letters from A through Z, numbers 0 through 9, spaces (\"\\_\"), and so on. CTC introduces a new intermediate output token called the **blank token** (\"-\") that is useful for getting around the alignment issue.\n", - "\n", - "With CTC, we still predict one token per time segment of speech, but we use the blank token to figure out where we can and can't collapse the predictions. The appearance of a blank token helps separate repeating letters that should not be collapsed. For instance, with an audio snippet segmented into `T=11` time steps, we could get predictions that look like `BOO-OOO--KK`, which would then collapse to `\"BO-O-K\"`, and then we would remove the blank tokens to get our final output, `BOOK`.\n", - "\n", - "Now, we can predict one output token per time step, then collapse and clean to get sensible output without any fear of ambiguity from repeating letters! 
A simple way of getting predictions like this would be to apply a bidirectional RNN to the audio input, apply softmax over each time step's output, and then take the token with the highest probability. The method of always taking the best token at each time step is called **greedy decoding, or max decoding**.\n", - "\n", - "To calculate our loss for backprop, we would like to know the log probability of the model producing the correct transcript, `log(Pr(transcript|audio))`. We can get the log probability of a single intermediate output sequence (e.g. `BOO-OOO--KK`) by summing over the log probabilities we get from each token's softmax value, but note that the resulting sum is different from the log probability of the transcript itself (`BOOK`). This is because there are multiple possible output sequences of the same length that can be collapsed to get the same transcript (e.g. `BBO--OO-KKK` also results in `BOOK`), and so we need to **marginalize over every valid sequence of length `T` that collapses to the transcript**.\n", - "\n", - "Therefore, to get our transcript's log probability given our audio input, we must sum the log probabilities of every sequence of length `T` that collapses to the transcript (e.g. `log(Pr(output: \"BOOK\"|audio)) = log(Pr(BOO-OOO--KK|audio)) + log(Pr(BBO--OO-KKK|audio)) + ...`). In practice, we can use a dynamic programming approach to calculate this, accumulating our log probabilities over different \"paths\" through the softmax outputs at each time step.\n", - "\n", - "If you would like a more in-depth explanation of how CTC works, or how we can improve our results by using a modified beam search algorithm, feel free to check out the Further Reading section at the end of this notebook for more resources.\n", - "\n", - "#### Sequence-to-Sequence with Attention\n", - "\n", - "One problem with CTC is that predictions at different time steps are conditionally independent, which is an issue because the words in a continuous utterance tend to be related to each other in some sensible way. With this conditional independence assumption, we can't learn a language model that can represent such dependencies, though we can add a language model on top of the CTC output to mitigate this to some degree.\n", - "\n", - "A popular alternative is to use a sequence-to-sequence model with attention. A typical seq2seq model for ASR consists of some sort of **bidirectional RNN encoder** that consumes the audio sequence timestep-by-timestep, and where the outputs are then passed to an **attention-based decoder**. Each prediction from the decoder is based on attending to some parts of the entire encoded input, as well as the previously outputted tokens.\n", - "\n", - "The outputs of the decoder can be anything from word pieces to phonemes to letters, and since predictions are not directly tied to time steps of the input, we can just continue producing tokens one-by-one until an end token is given (or we reach a specified max output length). This way, we do not need to deal with audio alignment, and our predicted transcript is just the sequence of outputs given by our decoder.\n", - "\n", - "Now that we have an idea of what some popular end-to-end ASR models look like, let's take a look at the audio data we'll be working with for our example." 
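(To make the collapse-and-clean rule described above concrete, here is a small, self-contained sketch of greedy/max CTC decoding. The toy alphabet and helper name are invented for illustration; this is not the decoder used later in this notebook.)

```python
def ctc_greedy_decode(token_ids, blank_id, id_to_char):
    """Collapse repeated tokens, then drop blanks (greedy / max decoding)."""
    decoded = []
    prev = None
    for t in token_ids:
        if t != prev and t != blank_id:  # keep the first token of each run, skip blanks
            decoded.append(id_to_char[t])
        prev = t
    return ''.join(decoded)


# 'BOO-OOO--KK' with '-' as the blank token collapses to 'BOOK'
alphabet = ['-', 'B', 'O', 'K']
id_to_char = {i: c for i, c in enumerate(alphabet)}
char_to_id = {c: i for i, c in enumerate(alphabet)}
ids = [char_to_id[c] for c in 'BOO-OOO--KK']
print(ctc_greedy_decode(ids, blank_id=char_to_id['-'], id_to_char=id_to_char))  # BOOK
```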
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "38aYTCTIlRzh" - }, - "source": [ - "## Taking a Look at Our Data (AN4)\n", - "\n", - "The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\n", - "\n", - "Before we get started, let's download and prepare the dataset. The utterances are available as `.sph` files, so we will need to convert them to `.wav` for later processing. If you are not using Google Colab, please make sure you have [Sox](http://sox.sourceforge.net/) installed for this step--see the \"Downloads\" section of the linked Sox homepage. (If you are using Google Colab, Sox should have already been installed in the setup cell at the beginning.)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "gAhsmi6HlRzh" - }, - "source": [ - "import os\n", - "# This is where the an4/ directory will be placed.\n", - "# Change this if you don't want the data to be extracted in the current directory.\n", - "data_dir = '.'\n", - "\n", - "if not os.path.exists(data_dir):\n", - " os.makedirs(data_dir)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Yb4fuUvWlRzk", - "scrolled": true - }, - "source": [ - "import glob\n", - "import os\n", - "import subprocess\n", - "import tarfile\n", - "import wget\n", - "\n", - "# Download the dataset. This will take a few moments...\n", - "print(\"******\")\n", - "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", - " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", - " an4_path = wget.download(an4_url, data_dir)\n", - " print(f\"Dataset downloaded at: {an4_path}\")\n", - "else:\n", - " print(\"Tarfile already exists.\")\n", - " an4_path = data_dir + '/an4_sphere.tar.gz'\n", - "\n", - "if not os.path.exists(data_dir + '/an4/'):\n", - " # Untar and convert .sph to .wav (using sox)\n", - " tar = tarfile.open(an4_path)\n", - " tar.extractall(path=data_dir)\n", - "\n", - " print(\"Converting .sph to .wav...\")\n", - " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", - " for sph_path in sph_list:\n", - " wav_path = sph_path[:-4] + '.wav'\n", - " cmd = [\"sox\", sph_path, wav_path]\n", - " subprocess.run(cmd)\n", - "print(\"Finished conversion.\\n******\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m_LFeM0elRzm" - }, - "source": [ - "You should now have a folder called `an4` that contains `etc/an4_train.transcription`, `etc/an4_test.transcription`, audio files in `wav/an4_clstk` and `wav/an4test_clstk`, along with some other files we will not need.\n", - "\n", - "Now we can load and take a look at the data. As an example, file `cen2-mgah-b.wav` is a 2.6 second-long audio recording of a man saying the letters \"G L E N N\" one-by-one. 
To confirm this, we can listen to the file:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_M_bSs3MjQlz" - }, - "source": [ - "import librosa\n", - "import IPython.display as ipd\n", - "\n", - "# Load and listen to the audio file\n", - "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", - "audio, sample_rate = librosa.load(example_file)\n", - "\n", - "ipd.Audio(example_file, rate=sample_rate)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qZyElgPVjQl5" - }, - "source": [ - "In an ASR task, if this WAV file was our input, then \"G L E N N\" would be our desired output.\n", - "\n", - "Let's plot the waveform, which is simply a line plot of the sequence of values that we read from the file. This is a format of viewing audio that you are likely to be familiar with seeing in many audio editors and visualizers:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "MqIAKkqelRzm" - }, - "source": [ - "%matplotlib inline\n", - "import librosa.display\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# Plot our example audio file's waveform\n", - "plt.rcParams['figure.figsize'] = (15,7)\n", - "plt.title('Waveform of Audio Example')\n", - "plt.ylabel('Amplitude')\n", - "\n", - "_ = librosa.display.waveshow(audio, color='blue')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gg6RR_yolRzo" - }, - "source": [ - "We can see the activity in the waveform that corresponds to each letter in the audio, as our speaker here enunciates quite clearly!\n", - "You can kind of tell that each spoken letter has a different \"shape,\" and it's interesting to note that last two blobs look relatively similar, which is expected because they are both the letter \"N.\"\n", - "\n", - "### Spectrograms and Mel Spectrograms\n", - "\n", - "However, since audio information is more useful in the context of frequencies of sound over time, we can get a better representation than this raw sequence of 57,330 values.\n", - "We can apply a [Fourier Transform](https://en.wikipedia.org/wiki/Fourier_transform) on our audio signal to get something more useful: a **spectrogram**, which is a representation of the energy levels (i.e. amplitude, or \"loudness\") of each frequency (i.e. pitch) of the signal over the duration of the file.\n", - "A spectrogram (which can be viewed as a heat map) is a good way of seeing how the *strengths of various frequencies in the audio vary over time*, and is obtained by breaking up the signal into smaller, usually overlapping chunks and performing a Short-Time Fourier Transform (STFT) on each.\n", - "\n", - "Let's examine what the spectrogram of our sample looks like." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "oCFneEs1lRzp" - }, - "source": [ - "import numpy as np\n", - "\n", - "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", - "spec = np.abs(librosa.stft(audio))\n", - "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", - "\n", - "# Use log scale to view frequencies\n", - "librosa.display.specshow(spec_db, y_axis='log', x_axis='time')\n", - "plt.colorbar()\n", - "plt.title('Audio Spectrogram');" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9OPc4tcalRzs" - }, - "source": [ - "Again, we are able to see each letter being pronounced, and that the last two blobs that correspond to the \"N\"s are pretty similar-looking. 
But how do we interpret these shapes and colors? Just as in the waveform plot before, we see time passing on the x-axis (all 2.6s of audio). But now, the y-axis represents different frequencies (on a log scale), and *the color on the plot shows the strength of a frequency at a particular point in time*.\n", - "\n", - "We're still not done yet, as we can make one more potentially useful tweak: using the **Mel Spectrogram** instead of the normal spectrogram. This is simply a change in the frequency scale that we use from linear (or logarithmic) to the mel scale, which is \"a perceptual scale of pitches judged by listeners to be equal in distance from one another\" (from [Wikipedia](https://en.wikipedia.org/wiki/Mel_scale)).\n", - "\n", - "In other words, it's a transformation of the frequencies to be more aligned to what humans perceive; a change of +1000Hz from 2000Hz->3000Hz sounds like a larger difference to us than 9000Hz->10000Hz does, so the mel scale normalizes this such that equal distances sound like equal differences to the human ear. Intuitively, we use the mel spectrogram because in this case we are processing and transcribing human speech, such that transforming the scale to better match what we hear is a useful procedure." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7yQXVn-TlRzt" - }, - "source": [ - "# Plot the mel spectrogram of our sample\n", - "mel_spec = librosa.feature.melspectrogram(y=audio, sr=sample_rate)\n", - "mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)\n", - "\n", - "librosa.display.specshow(\n", - " mel_spec_db, x_axis='time', y_axis='mel')\n", - "plt.colorbar()\n", - "plt.title('Mel Spectrogram');" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RSCyVizDlRz1" - }, - "source": [ - "## Convolutional ASR Models\n", - "\n", - "Let's take a look at the model that we will be building, and how we specify its parameters.\n", - "\n", - "### The Jasper Model\n", - "\n", - "We will be training a small [Jasper (Just Another SPeech Recognizer) model](https://arxiv.org/abs/1904.03288) from scratch (e.g. initialized randomly). \n", - "In brief, Jasper architectures consist of a repeated block structure that utilizes 1D convolutions.\n", - "In a Jasper_KxR model, `R` sub-blocks (consisting of a 1D convolution, batch norm, ReLU, and dropout) are grouped into a single block, which is then repeated `K` times.\n", - "We also have a one extra block at the beginning and a few more at the end that are invariant of `K` and `R`, and we use CTC loss.\n", - "\n", - "### The QuartzNet Model\n", - "\n", - "The QuartzNet is better variant of Jasper with a key difference that it uses time-channel separable 1D convolutions. This allows it to dramatically reduce number of weights while keeping similar accuracy.\n", - "\n", - "A Jasper/QuartzNet models look like this (QuartzNet model is pictured):\n", - "\n", - "![QuartzNet with CTC](https://developer.nvidia.com/blog/wp-content/uploads/2020/05/quartznet-model-architecture-1-625x742.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gEpNci7slRzw" - }, - "source": [ - "# Using NeMo for Automatic Speech Recognition\n", - "\n", - "Now that we have an idea of what ASR is and how the audio data looks like, we can start using NeMo to do some ASR!\n", - "\n", - "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. 
To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", - "\n", - "NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "4_W0lhaQlRzx" - }, - "source": [ - "# NeMo's \"core\" package\n", - "import nemo\n", - "# NeMo's ASR collection - this collections contains complete ASR models and\n", - "# building blocks (modules) for ASR\n", - "import nemo.collections.asr as nemo_asr" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_W8EbYktZE3" - }, - "source": [ - "## Using an Out-of-the-Box Model\n", - "\n", - "NeMo's ASR collection comes with many building blocks and even complete models that we can use for training and evaluation. Moreover, several models come with pre-trained weights. Let's instantiate a complete QuartzNet15x5 model." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "KFZZpYult96G" - }, - "source": [ - "# This line will download pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud and instantiate it for you\n", - "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KucxoFJhum0i" - }, - "source": [ - "Next, we'll simply add paths to files we want to transcribe into the list and pass it to our model. Note that it will work for relatively short (<25 seconds) files. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3QCpR_93u1hp" - }, - "source": [ - "files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav')]\n", - "for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):\n", - " print(f\"Audio in {fname} was recognized as: {transcription}\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ppUm_kuavm_f" - }, - "source": [ - "That was easy! But there are plenty of scenarios where you would want to fine-tune the model on your own data or even train from scratch. For example, this out-of-the box model will obviously not work for Spanish and would likely perform poorly for telephone audio. So if you have collected your own data, you certainly should attempt to fine-tune or train on it!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ABUDaC5Js7AW" - }, - "source": [ - "## Training from Scratch\n", - "\n", - "To train from scratch, you need to prepare your training data in the right format and specify your models architecture." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RdNyw1b_zgtm" - }, - "source": [ - "### Creating Data Manifests\n", - "\n", - "The first thing we need to do now is to create manifests for our training and evaluation data, which will contain the metadata of our audio files. NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. 
A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.\n", - "\n", - "Here's an example of what one line in a NeMo-compatible manifest might look like:\n", - "```\n", - "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\n", - "```\n", - "\n", - "We can build our training and evaluation manifests using `an4/etc/an4_train.transcription` and `an4/etc/an4_test.transcription`, which have lines containing transcripts and their corresponding audio file IDs:\n", - "```\n", - "...\n", - " P I T T S B U R G H (cen5-fash-b)\n", - " TWO SIX EIGHT FOUR FOUR ONE EIGHT (cen7-fash-b)\n", - "...\n", - "```" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "lVB1sG1GlRzz" - }, - "source": [ - "# --- Building Manifest Files --- #\n", - "import json\n", - "\n", - "# Function to build a manifest\n", - "def build_manifest(transcripts_path, manifest_path, wav_path):\n", - " with open(transcripts_path, 'r') as fin:\n", - " with open(manifest_path, 'w') as fout:\n", - " for line in fin:\n", - " # Lines look like this:\n", - " # transcript (fileID)\n", - " transcript = line[: line.find('(')-1].lower()\n", - " transcript = transcript.replace('', '').replace('', '')\n", - " transcript = transcript.strip()\n", - "\n", - " file_id = line[line.find('(')+1 : -2] # e.g. \"cen4-fash-b\"\n", - " audio_path = os.path.join(\n", - " data_dir, wav_path,\n", - " file_id[file_id.find('-')+1 : file_id.rfind('-')],\n", - " file_id + '.wav')\n", - "\n", - " duration = librosa.core.get_duration(filename=audio_path)\n", - "\n", - " # Write the metadata to the manifest\n", - " metadata = {\n", - " \"audio_filepath\": audio_path,\n", - " \"duration\": duration,\n", - " \"text\": transcript\n", - " }\n", - " json.dump(metadata, fout)\n", - " fout.write('\\n')\n", - " \n", - "# Building Manifests\n", - "print(\"******\")\n", - "train_transcripts = data_dir + '/an4/etc/an4_train.transcription'\n", - "train_manifest = data_dir + '/an4/train_manifest.json'\n", - "if not os.path.isfile(train_manifest):\n", - " build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')\n", - " print(\"Training manifest created.\")\n", - "\n", - "test_transcripts = data_dir + '/an4/etc/an4_test.transcription'\n", - "test_manifest = data_dir + '/an4/test_manifest.json'\n", - "if not os.path.isfile(test_manifest):\n", - " build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\n", - " print(\"Test manifest created.\")\n", - "print(\"***Done***\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "W2fShQzRzo-M" - }, - "source": [ - "### Specifying Our Model with a YAML Config File\n", - "\n", - "For this tutorial, we'll build a *Jasper_4x1 model*, with `K=4` blocks of single (`R=1`) sub-blocks and a *greedy CTC decoder*, using the configuration found in `./configs/config.yaml`.\n", - "\n", - "If we open up this config file, we find model section which describes architecture of our model. A model contains an entry labeled `encoder`, with a field called `jasper` that contains a list with multiple entries. 
Each of the members in this list specifies one block in our model, and looks something like this:\n", - "```\n", - "- filters: 128\n", - " repeat: 1\n", - " kernel: [11]\n", - " stride: [2]\n", - " dilation: [1]\n", - " dropout: 0.2\n", - " residual: false\n", - " separable: true\n", - " se: true\n", - " se_context_size: -1\n", - "```\n", - "The first member of the list corresponds to the first block in the Jasper architecture diagram, which appears regardless of `K` and `R`.\n", - "Next, we have four entries that correspond to the `K=4` blocks, and each has `repeat: 1` since we are using `R=1`.\n", - "These are followed by two more entries for the blocks that appear at the end of our Jasper model before the CTC loss.\n", - "\n", - "There are also some entries at the top of the file that specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.\n", - "\n", - "Using a YAML config such as this is helpful for getting a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "PXVKBniMlRz5" - }, - "source": [ - "# --- Config Information ---#\n", - "try:\n", - " from ruamel.yaml import YAML\n", - "except ModuleNotFoundError:\n", - " from ruamel_yaml import YAML\n", - "config_path = './configs/config.yaml'\n", - "\n", - "if not os.path.exists(config_path):\n", - " # Grab the config we'll use in this example\n", - " BRANCH = 'main'\n", - " !mkdir configs\n", - " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", - "\n", - "yaml = YAML(typ='safe')\n", - "with open(config_path) as f:\n", - " params = yaml.load(f)\n", - "print(params)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wUmq3p2Aw_5N" - }, - "source": [ - "### Training with PyTorch Lightning\n", - "\n", - "NeMo models and modules can be used in any PyTorch code where torch.nn.Module is expected.\n", - "\n", - "However, NeMo's models are based on [PytorchLightning's](https://github.com/PyTorchLightning/pytorch-lightning) LightningModule and we recommend you use PytorchLightning for training and fine-tuning as it makes using mixed precision and distributed training very easy. So to start, let's create Trainer instance for training on GPU for 50 epochs" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "GUfR6tAK0k2u" - }, - "source": [ - "import pytorch_lightning as pl\n", - "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=50)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IEn2RyvgxxvO" - }, - "source": [ - "Next, we instantiate and ASR model based on our ``config.yaml`` file from the previous section.\n", - "Note that this is a stage during which we also tell the model where our training and validation manifests are." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Cbf0fsMK09lk" - }, - "source": [ - "from omegaconf import DictConfig\n", - "params['model']['train_ds']['manifest_filepath'] = train_manifest\n", - "params['model']['validation_ds']['manifest_filepath'] = test_manifest\n", - "first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(params['model']), trainer=trainer)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hWtzwL5qXTYq" - }, - "source": [ - "With that, we can start training with just one line!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "inRJsnrz1psq" - }, - "source": [ - "# Start training!!!\n", - "trainer.fit(first_asr_model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jpYXX-GslR0E" - }, - "source": [ - "There we go! We've put together a full training pipeline for the model and trained it for 50 epochs.\n", - "\n", - "If you'd like to save this model checkpoint for loading later (e.g. for fine-tuning, or for continuing training), you can simply call `first_asr_model.save_to()`. Then, to restore your weights, you can rebuild the model using the config (let's say you call it `first_asr_model_continued` this time) and call `first_asr_model_continued.restore_from()`.\n", - "\n", - "### After Training: Monitoring Progress and Changing Hyperparameters\n", - "We can now start Tensorboard to see how training went. Recall that WER stands for Word Error Rate and so the lower it is, the better." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "n_0y3stSXDX_" - }, - "source": [ - "try:\n", - " from google import colab\n", - " COLAB_ENV = True\n", - "except (ImportError, ModuleNotFoundError):\n", - " COLAB_ENV = False\n", - "\n", - "# Load the TensorBoard notebook extension\n", - "if COLAB_ENV:\n", - " %load_ext tensorboard\n", - " %tensorboard --logdir lightning_logs/\n", - "else:\n", - " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Z0h-BME7U8yb" - }, - "source": [ - "We could improve this model by playing with hyperparameters. We can look at the current hyperparameters with the following:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7kdQbpohXnEd" - }, - "source": [ - "print(params['model']['optim'])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sGZzRCvIW8kE" - }, - "source": [ - "Let's say we wanted to change the learning rate. To do so, we can create a `new_opt` dict and set our desired learning rate, then call `.setup_optimization()` with the new optimization parameters." 
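Before experimenting with hyperparameters, it can be worth saving the checkpoint we just trained so that we can always come back to it. Below is a minimal sketch of the `save_to()`/`restore_from()` flow mentioned above; the `.nemo` file name is only an example, and the snippet assumes `nemo_asr` is imported as in the earlier cells.

```python
# Save the full model (weights + config) into a single .nemo file.
# The file name here is just an example.
first_asr_model.save_to("first_asr_model.nemo")

# Later (even in a fresh session), rebuild the model from that file
# and continue training or run inference with it.
first_asr_model_continued = nemo_asr.models.EncDecCTCModel.restore_from("first_asr_model.nemo")
```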
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "AbigFKUtYgvn" - }, - "source": [ - "import copy\n", - "new_opt = copy.deepcopy(params['model']['optim'])\n", - "new_opt['lr'] = 0.001\n", - "first_asr_model.setup_optimization(optim_config=DictConfig(new_opt))\n", - "# And then you can invoke trainer.fit(first_asr_model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D5Kwg8Cz-aaO" - }, - "source": [ - "## Inference\n", - "\n", - "Let's have a quick look at how one could run inference with NeMo's ASR model.\n", - "\n", - "First, ``EncDecCTCModel`` and its subclasses contain a handy ``transcribe`` method which can be used to simply obtain audio files' transcriptions. It also has batch_size argument to improve performance." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3FT0klSV268p" - }, - "source": [ - "paths2audio_files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'),\n", - " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen7-fmjd-b.wav'),\n", - " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen8-fmjd-b.wav'),\n", - " os.path.join(data_dir, 'an4/wav/an4_clstk/fkai/cen8-fkai-b.wav')]\n", - "print(first_asr_model.transcribe(paths2audio_files=paths2audio_files,\n", - " batch_size=4))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6FiCfLX0D7py" - }, - "source": [ - "Below is an example of a simple inference loop in pure PyTorch. It also shows how one can compute Word Error Rate (WER) metric between predictions and references." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7mP4r1Gx_Ilt" - }, - "source": [ - "# Bigger batch-size = bigger throughput\n", - "params['model']['validation_ds']['batch_size'] = 16\n", - "\n", - "# Setup the test data loader and make sure the model is on GPU\n", - "first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])\n", - "first_asr_model.cuda()\n", - "first_asr_model.eval()\n", - "\n", - "# We will be computing Word Error Rate (WER) metric between our hypothesis and predictions.\n", - "# WER is computed as numerator/denominator.\n", - "# We'll gather all the test batches' numerators and denominators.\n", - "wer_nums = []\n", - "wer_denoms = []\n", - "\n", - "# Loop over all test batches.\n", - "# Iterating over the model's `test_dataloader` will give us:\n", - "# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)\n", - "# See the AudioToCharDataset for more details.\n", - "for test_batch in first_asr_model.test_dataloader():\n", - " test_batch = [x.cuda() for x in test_batch]\n", - " targets = test_batch[2]\n", - " targets_lengths = test_batch[3] \n", - " log_probs, encoded_len, greedy_predictions = first_asr_model(\n", - " input_signal=test_batch[0], input_signal_length=test_batch[1]\n", - " )\n", - " # Notice the model has a helper object to compute WER\n", - " first_asr_model.wer.update(predictions=greedy_predictions, predictions_lengths=None, targets=targets, targets_lengths=targets_lengths)\n", - " _, wer_num, wer_denom = first_asr_model.wer.compute()\n", - " first_asr_model.wer.reset()\n", - " wer_nums.append(wer_num.detach().cpu().numpy())\n", - " wer_denoms.append(wer_denom.detach().cpu().numpy())\n", - "\n", - " # Release tensors from GPU memory\n", - " del test_batch, log_probs, targets, targets_lengths, encoded_len, greedy_predictions\n", - "\n", - "# We need to sum all numerators and denominators first. 
Then divide.\n", - "print(f\"WER = {sum(wer_nums)/sum(wer_denoms)}\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0kM9kBNOCptf" - }, - "source": [ - "This WER is not particularly impressive and could be significantly improved. You could train longer (try 100 epochs) to get a better number. Check out the next section on how to improve it further." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RBcJtg5ulR0H" - }, - "source": [ - "## Model Improvements\n", - "\n", - "You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. In this section, we'll briefly cover a few possibilities for improving an ASR model.\n", - "\n", - "### Data Augmentation\n", - "\n", - "There exist several ASR data augmentation methods that can increase the size of our training set.\n", - "\n", - "For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments (\"frequency masking\") or time segments (\"time masking\") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding in a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)\n", - "\n", - "Our toy model does not do spectrogram augmentation. But the real one we got from cloud does:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "9glGogaPlR0H" - }, - "source": [ - "print(quartznet._cfg['spec_augment'])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LdwdcA_a640R" - }, - "source": [ - "If you want to enable SpecAugment in your model, make sure your .yaml config file contains 'model/spec_augment' section which looks like the one above." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2f142kIQc1Z2" - }, - "source": [ - "### Transfer learning\n", - "\n", - "Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to make it perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.\n", - "\n", - "In ASR you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or on accented speech. You can even transfer learn from one language to another! Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.\n", - "\n", - "Transfer learning with NeMo is simple. Let's demonstrate how the model we got from the cloud could be fine-tuned on AN4 data. (NOTE: this is a toy example). And, while we are at it, we will change model's vocabulary, just to demonstrate how it's done." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "hl320dsydWX0" - }, - "source": [ - "# Check what kind of vocabulary/alphabet the model has right now\n", - "print(quartznet.decoder.vocabulary)\n", - "\n", - "# Let's add \"!\" symbol there. Note that you can (and should!) 
change the vocabulary\n", - "# entirely when fine-tuning using a different language.\n", - "quartznet.change_vocabulary(\n", - " new_vocabulary=[\n", - " ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\n", - " 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', \"'\", \"!\"\n", - " ]\n", - ")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M7lvmiMSd3Aw" - }, - "source": [ - "After this, our decoder has completely changed, but our encoder (which is where most of the weights are) remained intact. Let's fine tune-this model for 2 epochs on AN4 dataset. We will also use the smaller learning rate from ``new_opt` (see the \"After Training\" section)`." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_PZJIso-eDl-" - }, - "source": [ - "# Use the smaller learning rate we set before\n", - "quartznet.setup_optimization(optim_config=DictConfig(new_opt))\n", - "\n", - "# Point to the data we'll use for fine-tuning as the training set\n", - "quartznet.setup_training_data(train_data_config=params['model']['train_ds'])\n", - "\n", - "# Point to the new validation data for fine-tuning\n", - "quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])\n", - "\n", - "# And now we can create a PyTorch Lightning trainer and call `fit` again.\n", - "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=2)\n", - "trainer.fit(quartznet)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VURa1NavlR0U" - }, - "source": [ - "### Fast Training\n", - "\n", - "Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.\n", - "\n", - "You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:\n", - "\n", - "```python\n", - "# Mixed precision:\n", - "trainer = pl.Trainer(amp_level='O1', precision=16)\n", - "\n", - "# Trainer with a distributed backend:\n", - "trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='ddp')\n", - "\n", - "# Of course, you can combine these flags as well.\n", - "```\n", - "\n", - "Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py) which can handle mixed precision and distributed training using command-line arguments." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d1ym8QT3jQnj" - }, - "source": [ - "### Deployment\n", - "\n", - "Note: It is recommended to run the deployment code from the NVIDIA PyTorch container.\n", - "\n", - "Let's get back to our pre-trained model and see how easy it can be exported to an ONNX file\n", - "in order to run it in an inference engine like TensorRT or ONNXRuntime.\n", - "\n", - "If you are running in an environment outside of the NVIDIA PyTorch container (like Google Colab for example) then you will have to build the onnxruntime and onnxruntime-gpu. The cell below gives an example of how to build those runtimes but the example may have to be adapted depending on your environment." 
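Whichever route you take, it helps to confirm which execution providers your installed `onnxruntime` build actually exposes before exporting the model; a GPU-enabled build should list `CUDAExecutionProvider`. A minimal check (assuming `onnxruntime` is already installed) looks like this:

```python
import onnxruntime

# List the execution providers registered in this onnxruntime build.
# A CPU-only wheel typically exposes just 'CPUExecutionProvider'.
print(onnxruntime.get_available_providers())
```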
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "I4WRcmakjQnj" - }, - "source": [ - "!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n", - "#!mkdir -p ort\n", - "#%cd ort\n", - "#!git clean -xfd\n", - "#!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n", - "#!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/#x86_64-linux-gnu --build_wheel\n", - "#!pip uninstall -y onnxruntime\n", - "#!pip uninstall -y onnxruntime-gpu\n", - "#!pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n", - "#%cd .." - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F9yO1BEbjQnm" - }, - "source": [ - "Then run" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "HZnyWxPyjQnm" - }, - "source": [ - "import json\n", - "import os\n", - "import tempfile\n", - "import onnxruntime\n", - "import torch\n", - "\n", - "import numpy as np\n", - "import nemo.collections.asr as nemo_asr\n", - "from nemo.collections.asr.data.audio_to_text import AudioToCharDataset\n", - "from nemo.collections.asr.metrics.wer import WER\n", - "\n", - "def to_numpy(tensor):\n", - " return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()\n", - "\n", - "def setup_transcribe_dataloader(cfg, vocabulary):\n", - " config = {\n", - " 'manifest_filepath': os.path.join(cfg['temp_dir'], 'manifest.json'),\n", - " 'sample_rate': 16000,\n", - " 'labels': vocabulary,\n", - " 'batch_size': min(cfg['batch_size'], len(cfg['paths2audio_files'])),\n", - " 'trim_silence': True,\n", - " 'shuffle': False,\n", - " }\n", - " dataset = AudioToCharDataset(\n", - " manifest_filepath=config['manifest_filepath'],\n", - " labels=config['labels'],\n", - " sample_rate=config['sample_rate'],\n", - " int_values=config.get('int_values', False),\n", - " augmentor=None,\n", - " max_duration=config.get('max_duration', None),\n", - " min_duration=config.get('min_duration', None),\n", - " max_utts=config.get('max_utts', 0),\n", - " blank_index=config.get('blank_index', -1),\n", - " unk_index=config.get('unk_index', -1),\n", - " normalize=config.get('normalize_transcripts', False),\n", - " trim=config.get('trim_silence', True),\n", - " parser=config.get('parser', 'en'),\n", - " )\n", - " return torch.utils.data.DataLoader(\n", - " dataset=dataset,\n", - " batch_size=config['batch_size'],\n", - " collate_fn=dataset.collate_fn,\n", - " drop_last=config.get('drop_last', False),\n", - " shuffle=False,\n", - " num_workers=config.get('num_workers', 0),\n", - " pin_memory=config.get('pin_memory', False),\n", - " )\n", - "\n", - "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")\n", - "\n", - "quartznet.export('qn.onnx')\n", - "\n", - "ort_session = onnxruntime.InferenceSession('qn.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])\n", - "\n", - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " with open(os.path.join(tmpdir, 'manifest.json'), 'w') as fp:\n", - " for audio_file in files:\n", - " entry = {'audio_filepath': audio_file, 'duration': 100000, 'text': 'nothing'}\n", - " fp.write(json.dumps(entry) + '\\n')\n", - "\n", - " config = {'paths2audio_files': files, 'batch_size': 4, 'temp_dir': tmpdir}\n", - " temporary_datalayer = setup_transcribe_dataloader(config, quartznet.decoder.vocabulary)\n", - " for test_batch in 
temporary_datalayer:\n", - " processed_signal, processed_signal_len = quartznet.preprocessor(\n", - " input_signal=test_batch[0].to(quartznet.device), length=test_batch[1].to(quartznet.device)\n", - " )\n", - " ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(processed_signal),}\n", - " ologits = ort_session.run(None, ort_inputs)\n", - " alogits = np.asarray(ologits)\n", - " logits = torch.from_numpy(alogits[0])\n", - " greedy_predictions = logits.argmax(dim=-1, keepdim=False)\n", - " wer = WER(decoding=quartznet.decoding, use_cer=False)\n", - " hypotheses, _ = wer.decoding.ctc_decoder_predictions_tensor(greedy_predictions)\n", - " print(hypotheses)\n", - " break\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wteGqroafWg1" - }, - "source": [ - "## Under the Hood\n", - "\n", - "NeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.\n", - "\n", - "In particular, ``nemo_asr.model.EncDecCTCModel`` is an encoder-decoder model which is constructed using several ``Neural Modules`` taken from ``nemo_asr.modules.`` Here is what its forward pass looks like:\n", - "```python\n", - "def forward(self, input_signal, input_signal_length):\n", - " processed_signal, processed_signal_len = self.preprocessor(\n", - " input_signal=input_signal, length=input_signal_length,\n", - " )\n", - " # Spec augment is not applied during evaluation/testing\n", - " if self.spec_augmentation is not None and self.training:\n", - " processed_signal = self.spec_augmentation(input_spec=processed_signal)\n", - " encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)\n", - " log_probs = self.decoder(encoder_output=encoded)\n", - " greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)\n", - " return log_probs, encoded_len, greedy_predictions\n", - "```\n", - "Here:\n", - "\n", - "* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, which is a neural module that takes audio signal and converts it into a Mel-Spectrogram\n", - "* ``self.spec_augmentation`` - is a neural module of type ```nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation. \n", - "* ``self.encoder`` - is a convolutional Jasper/QuartzNet-like encoder of type ``nemo_asr.modules.ConvASREncoder``\n", - "* ``self.decoder`` - is a ``nemo_asr.modules.ConvASRDecoder`` which simply projects into the target alphabet (vocabulary).\n", - "\n", - "Also, ``EncDecCTCModel`` uses the audio dataset class ``nemo_asr.data.AudioToCharDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.\n", - "\n", - "You can use these and other neural modules (or create new ones yourself!) to construct new ASR models." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "smzlvbhelR0U" - }, - "source": [ - "# Further Reading/Watching:\n", - "\n", - "That's all for now! 
If you'd like to learn more about the topics covered in this tutorial, here are some resources that may interest you:\n", - "- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)\n", - "- [\"An Intuitive Explanation of Connectionist Temporal Classification\"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n", - "- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)\n", - "- [Listen Attend and Spell Paper (seq2seq ASR model)](https://arxiv.org/abs/1508.01211)\n", - "- [Explanation of the mel spectrogram in more depth](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0)\n", - "- [Jasper Paper](https://arxiv.org/abs/1904.03288)\n", - "- [QuartzNet paper](https://arxiv.org/abs/1910.10261)\n", - "- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)\n", - "- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)\n", - "- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)\n", - "- [Transfer Learning Blogpost](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "V3ERGX86lR0V" - }, - "source": [], - "execution_count": null, - "outputs": [] - } - ] + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "accelerator": "GPU", + "colab": { + "name": "ASR_with_NeMo.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.7" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "lJz6FDU1lRzc" + }, + "source": [ + "\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", + "\n", + "\n", + "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "!pip install matplotlib>=3.3.2\n", + "\n", + "## Install NeMo\n", + "BRANCH = 'r1.23.0'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "\"\"\"\n", + "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", + "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", + "that you want to use the \"Run All Cells\" (or similar) option.\n", + "\"\"\"\n", + "# exit()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1Jk9etFlRzf" + }, + "source": [ + "# Introduction to End-To-End Automatic Speech Recognition\n", + "\n", + "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", + "We will first introduce the basics of the main concepts behind speech recognition, then explore concrete examples of what the data looks like and walk through putting together a simple end-to-end ASR pipeline.\n", + "\n", + "We assume that you are familiar with general machine learning concepts and can follow Python code, and we'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YLln3U-IlRzg" + }, + "source": [ + "## Conceptual Overview: What is ASR?\n", + "\n", + "ASR, or **Automatic Speech Recognition**, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text). Our goal is usually to have a model that minimizes the **Word Error Rate (WER)** metric when transcribing speech input. In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with as few errors as possible?\n", + "\n", + "Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to evaluate a speech sample. We would start from a **language model** that encapsulates the most likely orderings of words that are generated (e.g. an n-gram model), to a **pronunciation model** for each word in that ordering (e.g. a pronunciation table), to an **acoustic model** that translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).\n", + "\n", + "Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model `Pr(audio|transcript)*Pr(transcript)`, and take the argmax of this over possible transcripts.\n", + "\n", + "Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a neural model that had better performance and that had a greater potential for generalization. For example, we could replace an n-gram model with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of these neural models need to be trained individually on different tasks, and errors in any model in the pipeline could throw off the whole prediction.\n", + "\n", + "Thus, we can see the appeal of **end-to-end ASR architectures**: discriminative models that simply take an audio input and give a textual output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. 
If desired, we could integrate a language model that would improve our predictions, as well.\n", + "\n", + "And the entire end-to-end ASR model can be trained at once--a much easier pipeline to handle! " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0S5iZPMSlRzg" + }, + "source": [ + "### End-To-End ASR\n", + "\n", + "With an end-to-end model, we want to directly learn `Pr(transcript|audio)` in order to predict the transcripts from the original audio. Since we are dealing with sequential information--audio data over time that corresponds to a sequence of letters--RNNs are the obvious choice. But now we have a pressing problem to deal with: since our input sequence (number of audio timesteps) is not the same length as our desired output (transcript length), how do we match each time step from the audio data to the correct output characters?\n", + "\n", + "Earlier speech recognition approaches relied on **temporally-aligned data**, in which each segment of time in an audio file was matched up to a corresponding speech sound such as a phoneme or word. However, if we would like to have the flexibility to predict letter-by-letter to prevent OOV (out of vocabulary) issues, then each time step in the data would have to be labeled with the letter sound that the speaker is making at that point in the audio file. With that information, it seems like we should simply be able to try to predict the correct letter for each time step and then collapse the repeated letters (e.g. the prediction output `LLLAAAAPPTOOOPPPP` would become `LAPTOP`). It turns out that this idea has some problems: not only does alignment make the dataset incredibly labor-intensive to label, but also, what do we do with words like \"book\" that contain consecutive repeated letters? Simply squashing repeated letters together would not work in that case!\n", + "\n", + "![Alignment example](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/tutorials/asr/images/alignment_example.png)\n", + "\n", + "Modern end-to-end approaches get around this using methods that don't require manual alignment at all, so that the input-output pairs are really just the raw audio and the transcript--no extra data or labeling required. Let's briefly go over two popular approaches that allow us to do this, Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention.\n", + "\n", + "#### Connectionist Temporal Classification (CTC)\n", + "\n", + "In normal speech recognition prediction output, we would expect to have characters such as the letters from A through Z, numbers 0 through 9, spaces (\"\\_\"), and so on. CTC introduces a new intermediate output token called the **blank token** (\"-\") that is useful for getting around the alignment issue.\n", + "\n", + "With CTC, we still predict one token per time segment of speech, but we use the blank token to figure out where we can and can't collapse the predictions. The appearance of a blank token helps separate repeating letters that should not be collapsed. For instance, with an audio snippet segmented into `T=11` time steps, we could get predictions that look like `BOO-OOO--KK`, which would then collapse to `\"BO-O-K\"`, and then we would remove the blank tokens to get our final output, `BOOK`.\n", + "\n", + "Now, we can predict one output token per time step, then collapse and clean to get sensible output without any fear of ambiguity from repeating letters! 
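As a quick illustration of the collapse rule just described, here is a small character-level sketch (using "-" as the blank token); it is a toy function for intuition, not part of NeMo:

```python
def ctc_collapse(prediction: str, blank: str = "-") -> str:
    """Collapse runs of repeated tokens, then remove blank tokens."""
    collapsed = []
    prev = None
    for token in prediction:
        if token != prev:          # keep only the first token of each run
            collapsed.append(token)
        prev = token
    return "".join(t for t in collapsed if t != blank)

print(ctc_collapse("BOO-OOO--KK"))  # -> "BOOK": the blank keeps the two O's apart
print(ctc_collapse("BOOOO--KK"))    # -> "BOK": without a blank, the repeated O's merge
```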
A simple way of getting predictions like this would be to apply a bidirectional RNN to the audio input, apply softmax over each time step's output, and then take the token with the highest probability. The method of always taking the best token at each time step is called **greedy decoding, or max decoding**.\n", + "\n", + "To calculate our loss for backprop, we would like to know the log probability of the model producing the correct transcript, `log(Pr(transcript|audio))`. We can get the log probability of a single intermediate output sequence (e.g. `BOO-OOO--KK`) by summing over the log probabilities we get from each token's softmax value, but note that the resulting sum is different from the log probability of the transcript itself (`BOOK`). This is because there are multiple possible output sequences of the same length that can be collapsed to get the same transcript (e.g. `BBO--OO-KKK` also results in `BOOK`), and so we need to **marginalize over every valid sequence of length `T` that collapses to the transcript**.\n", + "\n", + "Therefore, to get our transcript's log probability given our audio input, we must sum the log probabilities of every sequence of length `T` that collapses to the transcript (e.g. `log(Pr(output: \"BOOK\"|audio)) = log(Pr(BOO-OOO--KK|audio)) + log(Pr(BBO--OO-KKK|audio)) + ...`). In practice, we can use a dynamic programming approach to calculate this, accumulating our log probabilities over different \"paths\" through the softmax outputs at each time step.\n", + "\n", + "If you would like a more in-depth explanation of how CTC works, or how we can improve our results by using a modified beam search algorithm, feel free to check out the Further Reading section at the end of this notebook for more resources.\n", + "\n", + "#### Sequence-to-Sequence with Attention\n", + "\n", + "One problem with CTC is that predictions at different time steps are conditionally independent, which is an issue because the words in a continuous utterance tend to be related to each other in some sensible way. With this conditional independence assumption, we can't learn a language model that can represent such dependencies, though we can add a language model on top of the CTC output to mitigate this to some degree.\n", + "\n", + "A popular alternative is to use a sequence-to-sequence model with attention. A typical seq2seq model for ASR consists of some sort of **bidirectional RNN encoder** that consumes the audio sequence timestep-by-timestep, and where the outputs are then passed to an **attention-based decoder**. Each prediction from the decoder is based on attending to some parts of the entire encoded input, as well as the previously outputted tokens.\n", + "\n", + "The outputs of the decoder can be anything from word pieces to phonemes to letters, and since predictions are not directly tied to time steps of the input, we can just continue producing tokens one-by-one until an end token is given (or we reach a specified max output length). This way, we do not need to deal with audio alignment, and our predicted transcript is just the sequence of outputs given by our decoder.\n", + "\n", + "Now that we have an idea of what some popular end-to-end ASR models look like, let's take a look at the audio data we'll be working with for our example." 
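To make the CTC marginalization above concrete before we move on to the data, here is a toy, brute-force sketch: it enumerates every length-`T` path over a tiny alphabet and sums the probability of those that collapse to the target transcript. The per-step probabilities are random placeholders rather than the output of a trained network, and real implementations accumulate the same sum with a dynamic-programming (forward) pass instead of enumeration.

```python
import itertools
import numpy as np

vocab = ["-", "B", "O", "K"]      # blank plus a tiny toy alphabet
T = 5                             # number of time steps
rng = np.random.default_rng(0)
probs = rng.random((T, len(vocab)))
probs /= probs.sum(axis=1, keepdims=True)   # rows act like per-step softmax outputs

def collapse(tokens):
    out, prev = [], None
    for tok in tokens:
        if tok != prev:
            out.append(tok)
        prev = tok
    return "".join(t for t in out if t != "-")

target = "BOK"
total = 0.0
for path in itertools.product(range(len(vocab)), repeat=T):
    if collapse([vocab[i] for i in path]) == target:
        total += np.prod([probs[t, i] for t, i in enumerate(path)])
print(f"Pr('{target}' | audio) = {total:.4f} (summed over all valid length-{T} paths)")

# Greedy (max) decoding for comparison: best token at each step, then collapse.
print("Greedy decode:", collapse([vocab[i] for i in probs.argmax(axis=1)]))
```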
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "38aYTCTIlRzh" + }, + "source": [ + "## Taking a Look at Our Data (AN4)\n", + "\n", + "The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\n", + "\n", + "Before we get started, let's download and prepare the dataset. The utterances are available as `.sph` files, so we will need to convert them to `.wav` for later processing. If you are not using Google Colab, please make sure you have [Sox](http://sox.sourceforge.net/) installed for this step--see the \"Downloads\" section of the linked Sox homepage. (If you are using Google Colab, Sox should have already been installed in the setup cell at the beginning.)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gAhsmi6HlRzh" + }, + "source": [ + "import os\n", + "# This is where the an4/ directory will be placed.\n", + "# Change this if you don't want the data to be extracted in the current directory.\n", + "data_dir = '.'\n", + "\n", + "if not os.path.exists(data_dir):\n", + " os.makedirs(data_dir)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yb4fuUvWlRzk", + "scrolled": true + }, + "source": [ + "import glob\n", + "import os\n", + "import subprocess\n", + "import tarfile\n", + "import wget\n", + "\n", + "# Download the dataset. This will take a few moments...\n", + "print(\"******\")\n", + "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", + " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", + " an4_path = wget.download(an4_url, data_dir)\n", + " print(f\"Dataset downloaded at: {an4_path}\")\n", + "else:\n", + " print(\"Tarfile already exists.\")\n", + " an4_path = data_dir + '/an4_sphere.tar.gz'\n", + "\n", + "if not os.path.exists(data_dir + '/an4/'):\n", + " # Untar and convert .sph to .wav (using sox)\n", + " tar = tarfile.open(an4_path)\n", + " tar.extractall(path=data_dir)\n", + "\n", + " print(\"Converting .sph to .wav...\")\n", + " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", + " for sph_path in sph_list:\n", + " wav_path = sph_path[:-4] + '.wav'\n", + " cmd = [\"sox\", sph_path, wav_path]\n", + " subprocess.run(cmd)\n", + "print(\"Finished conversion.\\n******\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m_LFeM0elRzm" + }, + "source": [ + "You should now have a folder called `an4` that contains `etc/an4_train.transcription`, `etc/an4_test.transcription`, audio files in `wav/an4_clstk` and `wav/an4test_clstk`, along with some other files we will not need.\n", + "\n", + "Now we can load and take a look at the data. As an example, file `cen2-mgah-b.wav` is a 2.6 second-long audio recording of a man saying the letters \"G L E N N\" one-by-one. 
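As a quick sanity check on the converted data (reusing `data_dir` from the cells above), we can count the `.wav` files and compute the duration of this example utterance directly from its samples; it should come out to roughly 2.6 seconds.

```python
import glob
import librosa

# Count the .wav files produced by the sox conversion step.
wav_files = glob.glob(data_dir + '/an4/**/*.wav', recursive=True)
print(f"Found {len(wav_files)} .wav files")

# Duration of the example utterance, computed from the loaded samples.
example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'
audio, sample_rate = librosa.load(example_file)
print(f"cen2-mgah-b.wav: {len(audio) / sample_rate:.2f} s "
      f"({len(audio)} samples at {sample_rate} Hz)")
```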
To confirm this, we can listen to the file:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_M_bSs3MjQlz" + }, + "source": [ + "import librosa\n", + "import IPython.display as ipd\n", + "\n", + "# Load and listen to the audio file\n", + "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", + "audio, sample_rate = librosa.load(example_file)\n", + "\n", + "ipd.Audio(example_file, rate=sample_rate)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZyElgPVjQl5" + }, + "source": [ + "In an ASR task, if this WAV file was our input, then \"G L E N N\" would be our desired output.\n", + "\n", + "Let's plot the waveform, which is simply a line plot of the sequence of values that we read from the file. This is a format of viewing audio that you are likely to be familiar with seeing in many audio editors and visualizers:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MqIAKkqelRzm" + }, + "source": [ + "%matplotlib inline\n", + "import librosa.display\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Plot our example audio file's waveform\n", + "plt.rcParams['figure.figsize'] = (15,7)\n", + "plt.title('Waveform of Audio Example')\n", + "plt.ylabel('Amplitude')\n", + "\n", + "_ = librosa.display.waveshow(audio, color='blue')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gg6RR_yolRzo" + }, + "source": [ + "We can see the activity in the waveform that corresponds to each letter in the audio, as our speaker here enunciates quite clearly!\n", + "You can kind of tell that each spoken letter has a different \"shape,\" and it's interesting to note that last two blobs look relatively similar, which is expected because they are both the letter \"N.\"\n", + "\n", + "### Spectrograms and Mel Spectrograms\n", + "\n", + "However, since audio information is more useful in the context of frequencies of sound over time, we can get a better representation than this raw sequence of 57,330 values.\n", + "We can apply a [Fourier Transform](https://en.wikipedia.org/wiki/Fourier_transform) on our audio signal to get something more useful: a **spectrogram**, which is a representation of the energy levels (i.e. amplitude, or \"loudness\") of each frequency (i.e. pitch) of the signal over the duration of the file.\n", + "A spectrogram (which can be viewed as a heat map) is a good way of seeing how the *strengths of various frequencies in the audio vary over time*, and is obtained by breaking up the signal into smaller, usually overlapping chunks and performing a Short-Time Fourier Transform (STFT) on each.\n", + "\n", + "Let's examine what the spectrogram of our sample looks like." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oCFneEs1lRzp" + }, + "source": [ + "import numpy as np\n", + "\n", + "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", + "spec = np.abs(librosa.stft(audio))\n", + "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", + "\n", + "# Use log scale to view frequencies\n", + "librosa.display.specshow(spec_db, y_axis='log', x_axis='time')\n", + "plt.colorbar()\n", + "plt.title('Audio Spectrogram');" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9OPc4tcalRzs" + }, + "source": [ + "Again, we are able to see each letter being pronounced, and that the last two blobs that correspond to the \"N\"s are pretty similar-looking. 
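It can also be instructive to look at the raw dimensions of the spectrogram we just computed. The sketch below reuses `spec_db` and `sample_rate` from the cells above and assumes librosa's default STFT settings (`n_fft=2048`, giving a hop length of 512 samples):

```python
# spec_db has shape (frequency_bins, time_frames); with n_fft=2048 there are
# 1 + 2048/2 = 1025 frequency bins, and a new frame starts every 512 samples.
print("Spectrogram shape:", spec_db.shape)
print(f"Window: {2048 / sample_rate * 1000:.1f} ms, hop: {512 / sample_rate * 1000:.1f} ms")
```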
But how do we interpret these shapes and colors? Just as in the waveform plot before, we see time passing on the x-axis (all 2.6s of audio). But now, the y-axis represents different frequencies (on a log scale), and *the color on the plot shows the strength of a frequency at a particular point in time*.\n", + "\n", + "We're still not done yet, as we can make one more potentially useful tweak: using the **Mel Spectrogram** instead of the normal spectrogram. This is simply a change in the frequency scale that we use from linear (or logarithmic) to the mel scale, which is \"a perceptual scale of pitches judged by listeners to be equal in distance from one another\" (from [Wikipedia](https://en.wikipedia.org/wiki/Mel_scale)).\n", + "\n", + "In other words, it's a transformation of the frequencies to be more aligned to what humans perceive; a change of +1000Hz from 2000Hz->3000Hz sounds like a larger difference to us than 9000Hz->10000Hz does, so the mel scale normalizes this such that equal distances sound like equal differences to the human ear. Intuitively, we use the mel spectrogram because in this case we are processing and transcribing human speech, such that transforming the scale to better match what we hear is a useful procedure." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7yQXVn-TlRzt" + }, + "source": [ + "# Plot the mel spectrogram of our sample\n", + "mel_spec = librosa.feature.melspectrogram(y=audio, sr=sample_rate)\n", + "mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)\n", + "\n", + "librosa.display.specshow(\n", + " mel_spec_db, x_axis='time', y_axis='mel')\n", + "plt.colorbar()\n", + "plt.title('Mel Spectrogram');" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSCyVizDlRz1" + }, + "source": [ + "## Convolutional ASR Models\n", + "\n", + "Let's take a look at the model that we will be building, and how we specify its parameters.\n", + "\n", + "### The Jasper Model\n", + "\n", + "We will be training a small [Jasper (Just Another SPeech Recognizer) model](https://arxiv.org/abs/1904.03288) from scratch (e.g. initialized randomly). \n", + "In brief, Jasper architectures consist of a repeated block structure that utilizes 1D convolutions.\n", + "In a Jasper_KxR model, `R` sub-blocks (consisting of a 1D convolution, batch norm, ReLU, and dropout) are grouped into a single block, which is then repeated `K` times.\n", + "We also have a one extra block at the beginning and a few more at the end that are invariant of `K` and `R`, and we use CTC loss.\n", + "\n", + "### The QuartzNet Model\n", + "\n", + "The QuartzNet is better variant of Jasper with a key difference that it uses time-channel separable 1D convolutions. This allows it to dramatically reduce number of weights while keeping similar accuracy.\n", + "\n", + "A Jasper/QuartzNet models look like this (QuartzNet model is pictured):\n", + "\n", + "![QuartzNet with CTC](https://developer.nvidia.com/blog/wp-content/uploads/2020/05/quartznet-model-architecture-1-625x742.png)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gEpNci7slRzw" + }, + "source": [ + "# Using NeMo for Automatic Speech Recognition\n", + "\n", + "Now that we have an idea of what ASR is and how the audio data looks like, we can start using NeMo to do some ASR!\n", + "\n", + "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. 
To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", + "\n", + "NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4_W0lhaQlRzx" + }, + "source": [ + "# NeMo's \"core\" package\n", + "import nemo\n", + "# NeMo's ASR collection - this collections contains complete ASR models and\n", + "# building blocks (modules) for ASR\n", + "import nemo.collections.asr as nemo_asr" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_W8EbYktZE3" + }, + "source": [ + "## Using an Out-of-the-Box Model\n", + "\n", + "NeMo's ASR collection comes with many building blocks and even complete models that we can use for training and evaluation. Moreover, several models come with pre-trained weights. Let's instantiate a complete QuartzNet15x5 model." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KFZZpYult96G" + }, + "source": [ + "# This line will download pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud and instantiate it for you\n", + "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KucxoFJhum0i" + }, + "source": [ + "Next, we'll simply add paths to files we want to transcribe into the list and pass it to our model. Note that it will work for relatively short (<25 seconds) files. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3QCpR_93u1hp" + }, + "source": [ + "files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav')]\n", + "for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):\n", + " print(f\"Audio in {fname} was recognized as: {transcription}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ppUm_kuavm_f" + }, + "source": [ + "That was easy! But there are plenty of scenarios where you would want to fine-tune the model on your own data or even train from scratch. For example, this out-of-the box model will obviously not work for Spanish and would likely perform poorly for telephone audio. So if you have collected your own data, you certainly should attempt to fine-tune or train on it!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ABUDaC5Js7AW" + }, + "source": [ + "## Training from Scratch\n", + "\n", + "To train from scratch, you need to prepare your training data in the right format and specify your models architecture." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RdNyw1b_zgtm" + }, + "source": [ + "### Creating Data Manifests\n", + "\n", + "The first thing we need to do now is to create manifests for our training and evaluation data, which will contain the metadata of our audio files. NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. 
A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.\n", + "\n", + "Here's an example of what one line in a NeMo-compatible manifest might look like:\n", + "```\n", + "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\n", + "```\n", + "\n", + "We can build our training and evaluation manifests using `an4/etc/an4_train.transcription` and `an4/etc/an4_test.transcription`, which have lines containing transcripts (wrapped in `<s>`/`</s>` markers) and their corresponding audio file IDs:\n", + "```\n", + "...\n", + "<s> P I T T S B U R G H </s> (cen5-fash-b)\n", + "<s> TWO SIX EIGHT FOUR FOUR ONE EIGHT </s> (cen7-fash-b)\n", + "...\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lVB1sG1GlRzz" + }, + "source": [ + "# --- Building Manifest Files --- #\n", + "import json\n", + "\n", + "# Function to build a manifest\n", + "def build_manifest(transcripts_path, manifest_path, wav_path):\n", + " with open(transcripts_path, 'r') as fin:\n", + " with open(manifest_path, 'w') as fout:\n", + " for line in fin:\n", + " # Lines look like this:\n", + " # <s> transcript </s> (fileID)\n", + " transcript = line[: line.find('(')-1].lower()\n", + " transcript = transcript.replace('<s>', '').replace('</s>', '') # strip the sentence markers\n", + " transcript = transcript.strip()\n", + "\n", + " file_id = line[line.find('(')+1 : -2] # e.g. \"cen4-fash-b\"\n", + " audio_path = os.path.join(\n", + " data_dir, wav_path,\n", + " file_id[file_id.find('-')+1 : file_id.rfind('-')],\n", + " file_id + '.wav')\n", + "\n", + " duration = librosa.core.get_duration(filename=audio_path)\n", + "\n", + " # Write the metadata to the manifest\n", + " metadata = {\n", + " \"audio_filepath\": audio_path,\n", + " \"duration\": duration,\n", + " \"text\": transcript\n", + " }\n", + " json.dump(metadata, fout)\n", + " fout.write('\\n')\n", + " \n", + "# Building Manifests\n", + "print(\"******\")\n", + "train_transcripts = data_dir + '/an4/etc/an4_train.transcription'\n", + "train_manifest = data_dir + '/an4/train_manifest.json'\n", + "if not os.path.isfile(train_manifest):\n", + " build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')\n", + " print(\"Training manifest created.\")\n", + "\n", + "test_transcripts = data_dir + '/an4/etc/an4_test.transcription'\n", + "test_manifest = data_dir + '/an4/test_manifest.json'\n", + "if not os.path.isfile(test_manifest):\n", + " build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\n", + " print(\"Test manifest created.\")\n", + "print(\"***Done***\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W2fShQzRzo-M" + }, + "source": [ + "### Specifying Our Model with a YAML Config File\n", + "\n", + "For this tutorial, we'll build a *Jasper_4x1 model*, with `K=4` blocks of single (`R=1`) sub-blocks and a *greedy CTC decoder*, using the configuration found in `./configs/config.yaml`.\n", + "\n", + "If we open up this config file, we find a `model` section which describes the architecture of our model. The `model` section contains an entry labeled `encoder`, with a field called `jasper` that contains a list with multiple entries. 
Each of the members in this list specifies one block in our model, and looks something like this:\n", + "```\n", + "- filters: 128\n", + " repeat: 1\n", + " kernel: [11]\n", + " stride: [2]\n", + " dilation: [1]\n", + " dropout: 0.2\n", + " residual: false\n", + " separable: true\n", + " se: true\n", + " se_context_size: -1\n", + "```\n", + "The first member of the list corresponds to the first block in the Jasper architecture diagram, which appears regardless of `K` and `R`.\n", + "Next, we have four entries that correspond to the `K=4` blocks, and each has `repeat: 1` since we are using `R=1`.\n", + "These are followed by two more entries for the blocks that appear at the end of our Jasper model before the CTC loss.\n", + "\n", + "There are also some entries at the top of the file that specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.\n", + "\n", + "Using a YAML config such as this is helpful for getting a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "PXVKBniMlRz5" + }, + "source": [ + "# --- Config Information ---#\n", + "try:\n", + " from ruamel.yaml import YAML\n", + "except ModuleNotFoundError:\n", + " from ruamel_yaml import YAML\n", + "config_path = './configs/config.yaml'\n", + "\n", + "if not os.path.exists(config_path):\n", + " # Grab the config we'll use in this example\n", + " BRANCH = 'r1.23.0'\n", + " !mkdir configs\n", + " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", + "\n", + "yaml = YAML(typ='safe')\n", + "with open(config_path) as f:\n", + " params = yaml.load(f)\n", + "print(params)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wUmq3p2Aw_5N" + }, + "source": [ + "### Training with PyTorch Lightning\n", + "\n", + "NeMo models and modules can be used in any PyTorch code where a torch.nn.Module is expected.\n", + "\n", + "However, NeMo's models are based on [PytorchLightning's](https://github.com/PyTorchLightning/pytorch-lightning) LightningModule and we recommend you use PytorchLightning for training and fine-tuning as it makes using mixed precision and distributed training very easy. So to start, let's create a Trainer instance for training on a GPU for 50 epochs." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GUfR6tAK0k2u" + }, + "source": [ + "import pytorch_lightning as pl\n", + "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=50)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IEn2RyvgxxvO" + }, + "source": [ + "Next, we instantiate an ASR model based on our ``config.yaml`` file from the previous section.\n", + "Note that this is a stage during which we also tell the model where our training and validation manifests are."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cbf0fsMK09lk" + }, + "source": [ + "from omegaconf import DictConfig\n", + "params['model']['train_ds']['manifest_filepath'] = train_manifest\n", + "params['model']['validation_ds']['manifest_filepath'] = test_manifest\n", + "first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(params['model']), trainer=trainer)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hWtzwL5qXTYq" + }, + "source": [ + "With that, we can start training with just one line!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "inRJsnrz1psq" + }, + "source": [ + "# Start training!!!\n", + "trainer.fit(first_asr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpYXX-GslR0E" + }, + "source": [ + "There we go! We've put together a full training pipeline for the model and trained it for 50 epochs.\n", + "\n", + "If you'd like to save this model checkpoint for loading later (e.g. for fine-tuning, or for continuing training), you can simply call `first_asr_model.save_to()`. Then, to restore your weights, you can rebuild the model using the config (let's say you call it `first_asr_model_continued` this time) and call `first_asr_model_continued.restore_from()`.\n", + "\n", + "### After Training: Monitoring Progress and Changing Hyperparameters\n", + "We can now start TensorBoard to see how training went. Recall that WER stands for Word Error Rate: it is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference, so the lower it is, the better." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n_0y3stSXDX_" + }, + "source": [ + "try:\n", + " from google import colab\n", + " COLAB_ENV = True\n", + "except (ImportError, ModuleNotFoundError):\n", + " COLAB_ENV = False\n", + "\n", + "# Load the TensorBoard notebook extension\n", + "if COLAB_ENV:\n", + " %load_ext tensorboard\n", + " %tensorboard --logdir lightning_logs/\n", + "else:\n", + " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z0h-BME7U8yb" + }, + "source": [ + "We could improve this model by playing with hyperparameters. We can look at the current hyperparameters with the following:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7kdQbpohXnEd" + }, + "source": [ + "print(params['model']['optim'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sGZzRCvIW8kE" + }, + "source": [ + "Let's say we wanted to change the learning rate. To do so, we can create a `new_opt` dict and set our desired learning rate, then call `.setup_optimization()` with the new optimization parameters.",
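+ "\n", + "As an aside, the `save_to()` / `restore_from()` calls mentioned above are not demonstrated anywhere else in this notebook, so here is a minimal sketch of how they could be used once you are happy with a trained model (the `.nemo` filename below is just an illustrative example, not a file created earlier in this tutorial):\n", + "```python\n", + "# Save the model (weights + config) to a single .nemo file (example filename)\n", + "first_asr_model.save_to(\"first_asr_model.nemo\")\n", + "\n", + "# Later, rebuild the model from that file and continue fine-tuning or run inference\n", + "first_asr_model_continued = nemo_asr.models.EncDecCTCModel.restore_from(\"first_asr_model.nemo\")\n", + "```"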
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AbigFKUtYgvn" + }, + "source": [ + "import copy\n", + "new_opt = copy.deepcopy(params['model']['optim'])\n", + "new_opt['lr'] = 0.001\n", + "first_asr_model.setup_optimization(optim_config=DictConfig(new_opt))\n", + "# And then you can invoke trainer.fit(first_asr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D5Kwg8Cz-aaO" + }, + "source": [ + "## Inference\n", + "\n", + "Let's have a quick look at how one could run inference with NeMo's ASR model.\n", + "\n", + "First, ``EncDecCTCModel`` and its subclasses contain a handy ``transcribe`` method which can be used to simply obtain audio files' transcriptions. It also has batch_size argument to improve performance." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3FT0klSV268p" + }, + "source": [ + "paths2audio_files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'),\n", + " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen7-fmjd-b.wav'),\n", + " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen8-fmjd-b.wav'),\n", + " os.path.join(data_dir, 'an4/wav/an4_clstk/fkai/cen8-fkai-b.wav')]\n", + "print(first_asr_model.transcribe(paths2audio_files=paths2audio_files,\n", + " batch_size=4))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6FiCfLX0D7py" + }, + "source": [ + "Below is an example of a simple inference loop in pure PyTorch. It also shows how one can compute Word Error Rate (WER) metric between predictions and references." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7mP4r1Gx_Ilt" + }, + "source": [ + "# Bigger batch-size = bigger throughput\n", + "params['model']['validation_ds']['batch_size'] = 16\n", + "\n", + "# Setup the test data loader and make sure the model is on GPU\n", + "first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])\n", + "first_asr_model.cuda()\n", + "first_asr_model.eval()\n", + "\n", + "# We will be computing Word Error Rate (WER) metric between our hypothesis and predictions.\n", + "# WER is computed as numerator/denominator.\n", + "# We'll gather all the test batches' numerators and denominators.\n", + "wer_nums = []\n", + "wer_denoms = []\n", + "\n", + "# Loop over all test batches.\n", + "# Iterating over the model's `test_dataloader` will give us:\n", + "# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)\n", + "# See the AudioToCharDataset for more details.\n", + "for test_batch in first_asr_model.test_dataloader():\n", + " test_batch = [x.cuda() for x in test_batch]\n", + " targets = test_batch[2]\n", + " targets_lengths = test_batch[3] \n", + " log_probs, encoded_len, greedy_predictions = first_asr_model(\n", + " input_signal=test_batch[0], input_signal_length=test_batch[1]\n", + " )\n", + " # Notice the model has a helper object to compute WER\n", + " first_asr_model.wer.update(predictions=greedy_predictions, predictions_lengths=None, targets=targets, targets_lengths=targets_lengths)\n", + " _, wer_num, wer_denom = first_asr_model.wer.compute()\n", + " first_asr_model.wer.reset()\n", + " wer_nums.append(wer_num.detach().cpu().numpy())\n", + " wer_denoms.append(wer_denom.detach().cpu().numpy())\n", + "\n", + " # Release tensors from GPU memory\n", + " del test_batch, log_probs, targets, targets_lengths, encoded_len, greedy_predictions\n", + "\n", + "# We need to sum all numerators and denominators first. 
Then divide.\n", + "print(f\"WER = {sum(wer_nums)/sum(wer_denoms)}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0kM9kBNOCptf" + }, + "source": [ + "This WER is not particularly impressive and could be significantly improved. You could train longer (try 100 epochs) to get a better number. Check out the next section on how to improve it further." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RBcJtg5ulR0H" + }, + "source": [ + "## Model Improvements\n", + "\n", + "You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. In this section, we'll briefly cover a few possibilities for improving an ASR model.\n", + "\n", + "### Data Augmentation\n", + "\n", + "There exist several ASR data augmentation methods that can increase the size of our training set.\n", + "\n", + "For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments (\"frequency masking\") or time segments (\"time masking\") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding in a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)\n", + "\n", + "Our toy model does not do spectrogram augmentation. But the real one we got from the cloud does:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9glGogaPlR0H" + }, + "source": [ + "print(quartznet._cfg['spec_augment'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LdwdcA_a640R" + }, + "source": [ + "If you want to enable SpecAugment in your model, make sure your .yaml config file contains a 'model/spec_augment' section which looks like the one above." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2f142kIQc1Z2" + }, + "source": [ + "### Transfer learning\n", + "\n", + "Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to make it perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.\n", + "\n", + "In ASR, you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or on accented speech. You can even transfer learn from one language to another! Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.\n", + "\n", + "Transfer learning with NeMo is simple. Let's demonstrate how the model we got from the cloud could be fine-tuned on AN4 data. (NOTE: this is a toy example). And, while we are at it, we will change the model's vocabulary, just to demonstrate how it's done." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hl320dsydWX0" + }, + "source": [ + "# Check what kind of vocabulary/alphabet the model has right now\n", + "print(quartznet.decoder.vocabulary)\n", + "\n", + "# Let's add \"!\" symbol there. Note that you can (and should!) 
change the vocabulary\n", + "# entirely when fine-tuning using a different language.\n", + "quartznet.change_vocabulary(\n", + " new_vocabulary=[\n", + " ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\n", + " 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', \"'\", \"!\"\n", + " ]\n", + ")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M7lvmiMSd3Aw" + }, + "source": [ + "After this, our decoder has completely changed, but our encoder (which is where most of the weights are) remained intact. Let's fine-tune this model for 2 epochs on the AN4 dataset. We will also use the smaller learning rate from ``new_opt`` (see the \"After Training\" section)." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_PZJIso-eDl-" + }, + "source": [ + "# Use the smaller learning rate we set before\n", + "quartznet.setup_optimization(optim_config=DictConfig(new_opt))\n", + "\n", + "# Point to the data we'll use for fine-tuning as the training set\n", + "quartznet.setup_training_data(train_data_config=params['model']['train_ds'])\n", + "\n", + "# Point to the new validation data for fine-tuning\n", + "quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])\n", + "\n", + "# And now we can create a PyTorch Lightning trainer and call `fit` again.\n", + "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=2)\n", + "trainer.fit(quartznet)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VURa1NavlR0U" + }, + "source": [ + "### Fast Training\n", + "\n", + "Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.\n", + "\n", + "You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:\n", + "\n", + "```python\n", + "# Mixed precision:\n", + "trainer = pl.Trainer(amp_level='O1', precision=16)\n", + "\n", + "# Trainer with a distributed backend:\n", + "trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='ddp')\n", + "\n", + "# Of course, you can combine these flags as well.\n", + "```\n", + "\n", + "Finally, have a look at [example scripts in the NeMo repository](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py) which can handle mixed precision and distributed training using command-line arguments." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d1ym8QT3jQnj" + }, + "source": [ + "### Deployment\n", + "\n", + "Note: It is recommended to run the deployment code from the NVIDIA PyTorch container.\n", + "\n", + "Let's get back to our pre-trained model and see how easy it can be exported to an ONNX file\n", + "in order to run it in an inference engine like TensorRT or ONNXRuntime.\n", + "\n", + "If you are running in an environment outside of the NVIDIA PyTorch container (like Google Colab for example) then you will have to build the onnxruntime and onnxruntime-gpu. The cell below gives an example of how to build those runtimes but the example may have to be adapted depending on your environment.",
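+ "\n", + "Once the install (or build) in the cell below has finished, you can optionally sanity-check the resulting onnxruntime installation with a minimal snippet along these lines (just a sketch; the reported version and the list of available execution providers will depend on your environment):\n", + "```python\n", + "import onnxruntime\n", + "\n", + "# Show which onnxruntime build is active and which execution providers it offers\n", + "print(onnxruntime.__version__)\n", + "print(onnxruntime.get_available_providers())\n", + "```"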
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I4WRcmakjQnj" + }, + "source": [ + "!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n", + "#!mkdir -p ort\n", + "#%cd ort\n", + "#!git clean -xfd\n", + "#!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n", + "#!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/#x86_64-linux-gnu --build_wheel\n", + "#!pip uninstall -y onnxruntime\n", + "#!pip uninstall -y onnxruntime-gpu\n", + "#!pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n", + "#%cd .." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F9yO1BEbjQnm" + }, + "source": [ + "Then run" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HZnyWxPyjQnm" + }, + "source": [ + "import json\n", + "import os\n", + "import tempfile\n", + "import onnxruntime\n", + "import torch\n", + "\n", + "import numpy as np\n", + "import nemo.collections.asr as nemo_asr\n", + "from nemo.collections.asr.data.audio_to_text import AudioToCharDataset\n", + "from nemo.collections.asr.metrics.wer import WER\n", + "\n", + "def to_numpy(tensor):\n", + " return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()\n", + "\n", + "def setup_transcribe_dataloader(cfg, vocabulary):\n", + " config = {\n", + " 'manifest_filepath': os.path.join(cfg['temp_dir'], 'manifest.json'),\n", + " 'sample_rate': 16000,\n", + " 'labels': vocabulary,\n", + " 'batch_size': min(cfg['batch_size'], len(cfg['paths2audio_files'])),\n", + " 'trim_silence': True,\n", + " 'shuffle': False,\n", + " }\n", + " dataset = AudioToCharDataset(\n", + " manifest_filepath=config['manifest_filepath'],\n", + " labels=config['labels'],\n", + " sample_rate=config['sample_rate'],\n", + " int_values=config.get('int_values', False),\n", + " augmentor=None,\n", + " max_duration=config.get('max_duration', None),\n", + " min_duration=config.get('min_duration', None),\n", + " max_utts=config.get('max_utts', 0),\n", + " blank_index=config.get('blank_index', -1),\n", + " unk_index=config.get('unk_index', -1),\n", + " normalize=config.get('normalize_transcripts', False),\n", + " trim=config.get('trim_silence', True),\n", + " parser=config.get('parser', 'en'),\n", + " )\n", + " return torch.utils.data.DataLoader(\n", + " dataset=dataset,\n", + " batch_size=config['batch_size'],\n", + " collate_fn=dataset.collate_fn,\n", + " drop_last=config.get('drop_last', False),\n", + " shuffle=False,\n", + " num_workers=config.get('num_workers', 0),\n", + " pin_memory=config.get('pin_memory', False),\n", + " )\n", + "\n", + "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")\n", + "\n", + "quartznet.export('qn.onnx')\n", + "\n", + "ort_session = onnxruntime.InferenceSession('qn.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])\n", + "\n", + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " with open(os.path.join(tmpdir, 'manifest.json'), 'w') as fp:\n", + " for audio_file in files:\n", + " entry = {'audio_filepath': audio_file, 'duration': 100000, 'text': 'nothing'}\n", + " fp.write(json.dumps(entry) + '\\n')\n", + "\n", + " config = {'paths2audio_files': files, 'batch_size': 4, 'temp_dir': tmpdir}\n", + " temporary_datalayer = setup_transcribe_dataloader(config, quartznet.decoder.vocabulary)\n", + " for test_batch in 
temporary_datalayer:\n", + " processed_signal, processed_signal_len = quartznet.preprocessor(\n", + " input_signal=test_batch[0].to(quartznet.device), length=test_batch[1].to(quartznet.device)\n", + " )\n", + " ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(processed_signal),}\n", + " ologits = ort_session.run(None, ort_inputs)\n", + " alogits = np.asarray(ologits)\n", + " logits = torch.from_numpy(alogits[0])\n", + " greedy_predictions = logits.argmax(dim=-1, keepdim=False)\n", + " wer = WER(decoding=quartznet.decoding, use_cer=False)\n", + " hypotheses, _ = wer.decoding.ctc_decoder_predictions_tensor(greedy_predictions)\n", + " print(hypotheses)\n", + " break\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wteGqroafWg1" + }, + "source": [ + "## Under the Hood\n", + "\n", + "NeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.\n", + "\n", + "In particular, ``nemo_asr.models.EncDecCTCModel`` is an encoder-decoder model which is constructed using several ``Neural Modules`` taken from ``nemo_asr.modules``. Here is what its forward pass looks like:\n", + "```python\n", + "def forward(self, input_signal, input_signal_length):\n", + " processed_signal, processed_signal_len = self.preprocessor(\n", + " input_signal=input_signal, length=input_signal_length,\n", + " )\n", + " # Spec augment is not applied during evaluation/testing\n", + " if self.spec_augmentation is not None and self.training:\n", + " processed_signal = self.spec_augmentation(input_spec=processed_signal)\n", + " encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)\n", + " log_probs = self.decoder(encoder_output=encoded)\n", + " greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)\n", + " return log_probs, encoded_len, greedy_predictions\n", + "```\n", + "Here:\n", + "\n", + "* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, which is a neural module that takes an audio signal and converts it into a Mel-Spectrogram\n", + "* ``self.spec_augmentation`` - is a neural module of type ``nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation. \n", + "* ``self.encoder`` - is a convolutional Jasper/QuartzNet-like encoder of type ``nemo_asr.modules.ConvASREncoder``\n", + "* ``self.decoder`` - is a ``nemo_asr.modules.ConvASRDecoder`` which simply projects into the target alphabet (vocabulary).\n", + "\n", + "Also, ``EncDecCTCModel`` uses the audio dataset class ``nemo_asr.data.AudioToCharDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.\n", + "\n", + "You can use these and other neural modules (or create new ones yourself!) to construct new ASR models." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "smzlvbhelR0U" + }, + "source": [ + "# Further Reading/Watching:\n", + "\n", + "That's all for now! 
If you'd like to learn more about the topics covered in this tutorial, here are some resources that may interest you:\n", + "- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)\n", + "- [\"An Intuitive Explanation of Connectionist Temporal Classification\"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n", + "- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)\n", + "- [Listen Attend and Spell Paper (seq2seq ASR model)](https://arxiv.org/abs/1508.01211)\n", + "- [Explanation of the mel spectrogram in more depth](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0)\n", + "- [Jasper Paper](https://arxiv.org/abs/1904.03288)\n", + "- [QuartzNet paper](https://arxiv.org/abs/1910.10261)\n", + "- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)\n", + "- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)\n", + "- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)\n", + "- [Transfer Learning Blogpost](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V3ERGX86lR0V" + }, + "source": [], + "execution_count": null, + "outputs": [] + } + ] } From ab333ec88ea24da6fdcfa62fa70b7e678b0598f2 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Tue, 30 Jan 2024 16:47:34 -0800 Subject: [PATCH 7/8] revert accidental changes Signed-off-by: Chen Cui --- tutorials/01_NeMo_Models.ipynb | 4 +++- tutorials/asr/ASR_for_telephony_speech.ipynb | 4 +--- tutorials/asr/ASR_with_NeMo.ipynb | 4 +--- tutorials/asr/Streaming_ASR.ipynb | 4 +--- 4 files changed, 6 insertions(+), 10 deletions(-) diff --git a/tutorials/01_NeMo_Models.ipynb b/tutorials/01_NeMo_Models.ipynb index 130518b39498..0c4e3c6aa683 100644 --- a/tutorials/01_NeMo_Models.ipynb +++ b/tutorials/01_NeMo_Models.ipynb @@ -2644,7 +2644,9 @@ "metadata": { "id": "ZjCV5u3_OO7a" }, - "source": [], + "source": [ + "" + ], "execution_count": null, "outputs": [] } diff --git a/tutorials/asr/ASR_for_telephony_speech.ipynb b/tutorials/asr/ASR_for_telephony_speech.ipynb index fa572945a0ff..5d10e50950dd 100644 --- a/tutorials/asr/ASR_for_telephony_speech.ipynb +++ b/tutorials/asr/ASR_for_telephony_speech.ipynb @@ -17,9 +17,7 @@ "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. Run this cell to set up dependencies.\n", "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n", - "\n", - "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", "\"\"\"\n", "# If you're using Google Colab and not running locally, run this cell.\n", "\n", diff --git a/tutorials/asr/ASR_with_NeMo.ipynb b/tutorials/asr/ASR_with_NeMo.ipynb index 3d5c3d1a500d..479a89ed8c2d 100644 --- a/tutorials/asr/ASR_with_NeMo.ipynb +++ b/tutorials/asr/ASR_with_NeMo.ipynb @@ -43,9 +43,7 @@ "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. 
Run this cell to set up dependencies.\n", "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n", - "\n", - "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", "\"\"\"\n", "# If you're using Google Colab and not running locally, run this cell.\n", "\n", diff --git a/tutorials/asr/Streaming_ASR.ipynb b/tutorials/asr/Streaming_ASR.ipynb index c65ce057033a..628908a523cf 100644 --- a/tutorials/asr/Streaming_ASR.ipynb +++ b/tutorials/asr/Streaming_ASR.ipynb @@ -17,9 +17,7 @@ "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. Run this cell to set up dependencies.\n", "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n", - "\n", - "NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", "\"\"\"\n", "# If you're using Google Colab and not running locally, run this cell.\n", "\n", From 97e12ff52511d846cef5b92570e9d022c5493af1 Mon Sep 17 00:00:00 2001 From: Chen Cui Date: Tue, 30 Jan 2024 16:50:56 -0800 Subject: [PATCH 8/8] revert accidental changes Signed-off-by: Chen Cui --- tutorials/01_NeMo_Models.ipynb | 5304 +++++++++--------- tutorials/asr/ASR_for_telephony_speech.ipynb | 680 +-- tutorials/asr/ASR_with_NeMo.ipynb | 2348 ++++---- 3 files changed, 4166 insertions(+), 4166 deletions(-) diff --git a/tutorials/01_NeMo_Models.ipynb b/tutorials/01_NeMo_Models.ipynb index 0c4e3c6aa683..d9ed6f22181d 100644 --- a/tutorials/01_NeMo_Models.ipynb +++ b/tutorials/01_NeMo_Models.ipynb @@ -1,2654 +1,2654 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "01_NeMo_Models.ipynb", - "provenance": [], - "collapsed_sections": [], - "toc_visible": true - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "ASnx4b5jXsil" - }, - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. 
Run this cell to set up dependencies.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell.\n", - "\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install sox libsndfile1 ffmpeg\n", - "!pip install text-unidecode\n", - "\n", - "# ## Install NeMo\n", - "BRANCH = 'r1.23.0'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "## Install TorchAudio\n", - "!pip install torchaudio>=0.10.0 -f https://download.pytorch.org/whl/torch_stable.html\n", - "\n", - "## Grab the config we'll use in this example\n", - "!mkdir configs" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "a0eAURFKXdFT" - }, - "source": [ - "# minGPT License\n", - "\n", - "*This notebook port's the [minGPT codebase](https://github.com/karpathy/minGPT) into equivalent NeMo code. The license for minGPT has therefore been attached here.*\n", - "\n", - "```\n", - "The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy\n", - "\n", - "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", - "\n", - "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", - "\n", - "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2b7Z064UZFH9" - }, - "source": [ - "# torch-rnn License\n", - "*This notebook utilizes the `tiny-shakespeare` dataset from the [torch-rnn](https://github.com/jcjohnson/torch-rnn) codebase. The license for torch-rnn has therefore been attached here.*\n", - "\n", - "```\n", - "The MIT License (MIT)\n", - "\n", - "Copyright (c) 2016 Justin Johnson\n", - "\n", - "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", - "\n", - "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", - "\n", - "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eKzK-Z7obCED" - }, - "source": [ - "-------\n", - "\n", - "***Note: This notebook will intentionally introduce some errors to show the power of Neural Types or model development concepts, inside the cells marked with `[ERROR CELL]`. The explanation of and resolution of such errors can be found in the subsequent cells.***\n", - "\n", - "-----" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "81qdv0mPee-j" - }, - "source": [ - "# The NeMo Model\n", - "\n", - "NeMo comes with several state-of-the-art pre-trained Conversational AI models for users to quickly be able to start training and fine-tuning on their own datasets. \n", - "\n", - "In the previous [NeMo Primer](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb) notebook, we learned how to download pretrained checkpoints with NeMo and we also discussed the fundamental concepts of the NeMo Model. The previous tutorial showed us how to use, modify, save, and restore NeMo Models.\n", - "\n", - "In this tutorial we will learn how to develop a non-trivial NeMo model from scratch. This helps us to understand the underlying components and how they interact with the overall PyTorch ecosystem.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nKNftwxzllth" - }, - "source": [ - "-------\n", - "At the heart of NeMo lies the concept of the \"Model\". For NeMo developers, a \"Model\" is the neural network(s) as well as all the infrastructure supporting those network(s), wrapped into a singular, cohesive unit. As such, most NeMo models are constructed to contain the following out of the box (note: some NeMo models support additional functionality specific to the domain/use case!) - \n", - "\n", - " - Neural Network architecture - all of the modules that are required for the model.\n", - "\n", - " - Dataset + Data Loaders - all of the components that prepare the data for consumption during training or evaluation.\n", - "\n", - " - Preprocessing + Postprocessing - any of the components that process the datasets so the modules can easily consume them.\n", - "\n", - " - Optimizer + Schedulers - basic defaults that work out of the box and allow further experimentation with ease.\n", - "\n", - " - Any other supporting infrastructure - tokenizers, language model configuration, data augmentation, etc." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5VOoAQT1mipO" - }, - "source": [ - "# Constructing a NeMo Model\n", - "\n", - "NeMo \"Models\" are comprised of a few key components, so let's tackle them one by one. We will attempt to go in the order that's stated above.\n", - "\n", - "To make this slightly challenging, let's port a model from the NLP domain this time. Transformers are all the rage, with BERT and his friends from Sesame Street forming the core infrastructure for many NLP tasks. \n", - "\n", - "An excellent (yet simple) implementation of one such model - GPT - can be found in the `minGPT` repository - https://github.com/karpathy/minGPT. While the script is short, it explains and succinctly explores all of the core components we expect in a NeMo model, so it's a prime candidate for NeMo! 
Sidenote: NeMo supports GPT in its NLP collection, and as such, this notebook aims to be an in-depth development walkthrough for such models.\n", - "\n", - "In the following notebook, we will attempt to port minGPT to NeMo, and along the way, discuss some core concepts of NeMo itself." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fOlQKsaRot1l" - }, - "source": [ - "# Constructing the Neural Network Architecture\n", - "\n", - "First, on the list - the neural network that forms the backbone of the NeMo Model.\n", - "\n", - "So how do we create such a model? Using PyTorch! As you'll see below, NeMo components are compatible with all of PyTorch, so you can augment your workflow without ever losing the flexibility of PyTorch itself!\n", - "\n", - "Let's start with a couple of imports - " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "piLOgwOPX1FS" - }, - "source": [ - "import torch\n", - "import nemo\n", - "from nemo.core import NeuralModule\n", - "from nemo.core import typecheck" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yySYjHgAqVvT" - }, - "source": [ - "## Neural Module\n", - "Wait, what's `NeuralModule`? Where is the wonderful `torch.nn.Module`? \n", - "\n", - "`NeuralModule` is a subclass of `torch.nn.Module`, and it brings with it a few additional functionalities.\n", - "\n", - "In addition to being a `torch.nn.Module`, thereby being entirely compatible with the PyTorch ecosystem, it has the following capabilities - \n", - "\n", - "1) `Typing` - It adds support for `Neural Type Checking` to the model. `Typing` is optional but quite useful, as we will discuss below!\n", - "\n", - "2) `Serialization` - Remember the `OmegaConf` config dict and YAML config files? Well, all `NeuralModules` inherently supports serialization/deserialization from such config dictionaries!\n", - "\n", - "3) `FileIO` - This is another entirely optional file serialization system. Does your `NeuralModule` require some way to preserve data that can't be saved into a PyTorch checkpoint? Write your serialization and deserialization logic in two handy methods! **Note**: When you create the final NeMo Model, this will be implemented for you! Automatic serialization and deserialization support of NeMo models!\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "bseLiNoqqQrE" - }, - "source": [ - "class MyEmptyModule(NeuralModule):\n", - "\n", - " def forward(self):\n", - " print(\"Neural Module ~ hello world!\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "j4Q36L5urdOQ" - }, - "source": [ - "x = MyEmptyModule()\n", - "x()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lHXAcn5Ot_1I" - }, - "source": [ - "## Neural Types\n", - "\n", - "Neural Types? You might be wondering what that term refers to.\n", - "\n", - "Almost all NeMo components inherit the class `Typing`. `Typing` is a simple class that adds two properties to the class that inherits it - `input_types` and `output_types`. A NeuralType, by its shortest definition, is simply a semantic tensor. It contains information regarding the semantic shape the tensor should hold, as well as the semantic information of what that tensor represents. That's it.\n", - "\n", - "So what semantic information does such a typed tensor contain? 
Let's take an example below.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ezOJERbVwG34" - }, - "source": [ - "------\n", - "Across the Deep Learning domain, we often encounter cases where tensor shapes may match, but the semantics don't match at all. For example take a look at the following rank 3 tensors - " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZvC57bbxwXxN" - }, - "source": [ - "# Case 1:\n", - "embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", - "x = torch.randint(high=10, size=(1, 5))\n", - "print(\"x :\", x)\n", - "print(\"embedding(x) :\", embedding(x).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "sMaqhMBgxe2C" - }, - "source": [ - "# Case 2\n", - "lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", - "x = torch.randn(1, 5, 1)\n", - "print(\"x :\", x)\n", - "print(\"lstm(x) :\", lstm(x)[0].shape) # Let's take all timestep outputs of the LSTM" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9IQHjki-yezX" - }, - "source": [ - "-------\n", - "As you can see, the output of Case 1 is an embedding of shape [1, 5, 30], and the output of Case 2 is an LSTM output (state `h` over all time steps), also of the same shape [1, 5, 30].\n", - "\n", - "Do they have the same shape? **Yes**.
If we do a Case 1 .shape == Case 2 .shape, will we get True as an output? **Yes**.
\n", - "Do they represent the same concept? **No**.
\n", - "\n", - "\n", - "The ability to recognize that the two tensors do not represent the same semantic information is precisely why we utilize Neural Types. It contains the information of both the shape and the semantic concept of what that tensor represents. If we performed a neural type check between the two outputs of those tensors, it would raise an error saying semantically they were different things (more technically, it would say that they are `INCOMPATIBLE` with each other)!\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ucP0hNI7vWrU" - }, - "source": [ - "--------\n", - "\n", - "You may have read of concepts such as [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html). While conceptually similar, Neural Types attached by NeMo are not as tightly bound to the PyTorch ecosystem - practically any object of a class can be attached with a neural type!\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Uvf5oLt9zxSS" - }, - "source": [ - "## Neural Types - Usage\n", - "\n", - "Neural Types sound interesting, so how do we go about adding them? Let's take a few cases below. \n", - "\n", - "Neural Types are one of the core foundations of NeMo - you will find them in a vast majority of Neural Modules, and every NeMo Model will have its Neural Types defined. While they are entirely optional and not intrusive, NeMo takes great care to support it so that there is no semantic incompatibility between components being used by users." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eTizOBUg0qIB" - }, - "source": [ - "Let's start with a basic example of a type checked module." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "yp0FG8NJt1Jd" - }, - "source": [ - "from nemo.core.neural_types import NeuralType\n", - "from nemo.core.neural_types import *" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "3tsgs8Fp0-WV" - }, - "source": [ - "class EmbeddingModule(NeuralModule):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", - "\n", - " @typecheck()\n", - " def forward(self, x):\n", - " return self.embedding(x)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'x': NeuralType(axes=('B', 'T'), elements_type=Index())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EmbeddedTextType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sY9GYEoD3Yy0" - }, - "source": [ - "To show the benefit of Neural Types, we are going to replicate the above cases inside NeuralModules.\n", - "\n", - "Let's discuss how we added type checking support to the above class.\n", - "\n", - "1) `forward` has a decorator `@typecheck()` on it.\n", - "\n", - "2) `input_types` and `output_types` properties are defined.\n", - "\n", - "That's it!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "on268fAX4LLU" - }, - "source": [ - "-------\n", - "\n", - "Let's expand on each of the above steps.\n", - "\n", - "- `@typecheck()` is a simple decorator that takes any class that inherits `Typing` (NeuralModule does this for us) and adds the two default properties of `input_types` and `output_types`, which by default returns None.\n", - "\n", - "The `@typecheck()` decorator's explicit use ensures that, by default, neural type checking is **disabled**. NeMo does not wish to intrude on the development process of models. So users can \"opt-in\" to type checking by overriding the two properties. Therefore, the decorator ensures that users are not burdened with type checking before they wish to have it.\n", - "\n", - "So what is `@typecheck()`? Simply put, you can wrap **any** function of a class that inherits `Typing` with this decorator, and it will look up the definition of the types of that class and enforce them. Typically, `torch.nn.Module` subclasses only implement `forward()` so it is most common to wrap that method, but `@typecheck()` is a very flexible decorator. Inside NeMo, we will show some advanced use cases (which are quite crucial to particular domains such as TTS)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o9i1KugG5om7" - }, - "source": [ - "------\n", - "\n", - "As we see above, `@typecheck()` enforces the types. How then, do we provide this type of information to NeMo? \n", - "\n", - "By overriding `input_types` and `output_types` properties of the class, we can return a dictionary mapping a string name to a `NeuralType`.\n", - "\n", - "In the above case, we define a `NeuralType` as two components - \n", - "\n", - "- `axes`: This is the semantic information of the carried by the axes themselves. The most common axes information is from single character notation.\n", - "\n", - "> `B` = Batch
\n", - "> `C` / `D` - Channel / Dimension (treated the same)
\n", - "> `T` - Time
\n", - "> `H` / `W` - Height / Width
\n", - "\n", - "- `elements_type`: This is the semantic information of \"what the tensor represents\". All such types are derived from the basic `ElementType`, and merely subclassing `ElementType` allows us to build a hierarchy of custom semantic types that can be used by NeMo!\n", - "\n", - "Here, we declare that the input is an element_type of `Index` (index of the character in the vocabulary) and that the output is an element_type of `EmbeddedTextType` (the text embedding)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "boxxMniv27vi" - }, - "source": [ - "embedding_module = EmbeddingModule()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BgfDuBm27wiV" - }, - "source": [ - "Now let's construct the equivalent of the Case 2 above, but as a `NeuralModule`." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "SZZOOoCJ2-iV" - }, - "source": [ - "class LSTMModule(NeuralModule):\n", - " def __init__(self):\n", - " super().__init__()\n", - " self.lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", - "\n", - " @typecheck()\n", - " def forward(self, x):\n", - " return self.lstm(x)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'x': NeuralType(axes=('B', 'T', 'C'), elements_type=SpectrogramType())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7iIWIunz8IQq" - }, - "source": [ - "------\n", - "Here, we define the LSTM module from the Case 2 above.\n", - "\n", - "We changed the input to be a rank three tensor, now representing a \"SpectrogramType\". We intentionally keep it generic - it can be a `MelSpectrogramType` or a `MFCCSpectrogramType` as its input!\n", - "\n", - "The output of an LSTM is now an `EncodedRepresentation`. Practically, this can be the output of a CNN layer, a Transformer block, or in this case, an LSTM layer. We can, of course, specialize by subclassing EncodedRepresentation and then using that!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "6LlOJf0C8GN4" - }, - "source": [ - "lstm_module = LSTMModule()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hj0wonSz8_0c" - }, - "source": [ - "------\n", - "Now for the test !" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "giLJlub78-Ja" - }, - "source": [ - "# Case 1 [ERROR CELL]\n", - "x1 = torch.randint(high=10, size=(1, 5))\n", - "print(\"x :\", x1)\n", - "print(\"embedding(x) :\", embedding_module(x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "K-fhclja9WLr" - }, - "source": [ - "-----\n", - "You might be wondering why we get a `TypeError` right off the bat. This `TypeError` is raised by design.\n", - "\n", - "Positional arguments can cause significant issues during model development, mostly when the model/module design is not finalized. To reduce the potential for mistakes caused by wrong positional arguments and enforce the name of arguments provided to the function, `Typing` requires you to **call all of your type-checked functions by kwargs only**." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "2KUj_p6M9L-f" - }, - "source": [ - "# Case 1\n", - "print(\"x :\", x1)\n", - "print(\"embedding(x) :\", embedding_module(x=x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dirhWWvMRusx" - }, - "source": [ - "Now let's try the same for the `LSTMModule` in Case 2" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FMu3B0-9-CqE" - }, - "source": [ - "# Case 2 [ERROR CELL]\n", - "x2 = torch.randn(1, 5, 1) # Input = [B=1, T=5, C=1]\n", - "print(\"x :\", x2)\n", - "print(\"lstm(x) :\", lstm_module(x=x2)[0].shape) # Let's take all timestep outputs of the LSTM" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-OTLdR_4-isV" - }, - "source": [ - "-----\n", - "Now we get a type error stating that the number of output arguments provided does not match what is expected.\n", - "\n", - "What exactly is going on here? Well, inside our `LSTMModule` class, we declare the output types to be a single NeuralType - an `EncodedRepresentation` of shape [B, T, C].\n", - "\n", - "But the output of an LSTM layer is a tuple of \n", - "1) the encoded representation of shape [B, T, C]\n", - "2) another tuple containing two state values - the hidden state `h` and the cell state `c`, each of shape [num_layers * num_directions, B, C]!\n", - "\n", - "So the neural type system raises an error saying that the number of output arguments does not match what is expected.\n", - "\n", - "**NOTE**: The axis kind information of the two states will be represented by `D` to represent a general \"Dimension\" - since `num_layers` and `num_directions` are collapsed under a single axis. For NeMo, Axis types of `C` and `D` are equivalent and can be interchanged, so we will use `C` here to represent the hidden dimension of the LSTM and `D` to represent the merged axis `num_layers * num_directions`.\n", - "\n", - "Let's fix the above." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "q2u-keAM-d-B" - }, - "source": [ - "class CorrectLSTMModule(LSTMModule): # Let's inherit the wrong class to make it easy to override\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation()),\n", - " 'h_c': [NeuralType(axes=('D', 'B', 'C'), elements_type=EncodedRepresentation())],\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "a99NX0O8KMvW" - }, - "source": [ - "You should note that for the `h_c` neural type, we wrap it in a list - `[]`. NeMo, by default, assumes that each `NeuralType` corresponds to a single returned value. However, in the case of LSTMs, they produce a tuple of two state tensors.\n", - "\n", - "So we inform NeMo that this particular `NeuralType` is a single-dimensional list of items - and that each element of this list shares the same `NeuralType` and has the same shape.\n", - "\n", - "NeMo then ensures that the `h_c` is always a list of tensors. It will not check *how many* items are in the list, but will ensure that the returned value *must be a list containing zero or more items* - and that each of these items share the same `NeuralType`. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "GyPZH-fz_dG4" - }, - "source": [ - "lstm_module = CorrectLSTMModule()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "9whH50PE_Xyx" - }, - "source": [ - "# Case 2\n", - "x2 = torch.randn(1, 5, 1)\n", - "y2, (h, c) = lstm_module(x=x2)\n", - "print(\"x :\", x2)\n", - "print(\"lstm(x) :\", y2.shape) # The output of the LSTM RNN\n", - "print(\"hidden state (h) :\", h.shape) # The first hidden state of the LSTM RNN\n", - "print(\"hidden state (c) :\", c.shape) # The second hidden state of the LSTM RNN" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cRueNvNY_jI3" - }, - "source": [ - "------\n", - "Great! So now, the type checking system is happy.\n", - "\n", - "If you looked closely, the outputs were ordinary Torch Tensors (this is good news; we don't want to be incompatible with torch Tensors after all!). So, where exactly is the type of information stored?\n", - "\n", - "When the `output_types` is overridden, and valid torch tensors are returned as a result, these tensors are attached with the attribute `neural_type`. Let's inspect this -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "bGQ9XbWU_ffa" - }, - "source": [ - "emb_out = embedding_module(x=x1)\n", - "lstm_out = lstm_module(x=x2)[0]\n", - "\n", - "assert hasattr(emb_out, 'neural_type')\n", - "assert hasattr(lstm_out, 'neural_type')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "kEpBruSOScPJ" - }, - "source": [ - "print(\"Embedding tensor :\", emb_out.neural_type)\n", - "print(\"LSTM tensor :\", lstm_out.neural_type)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BWTsqiAHAony" - }, - "source": [ - "-------\n", - "So we see that these tensors now have this attribute called `neural_type` and are the same shape.\n", - "\n", - "This exercise's entire goal was to assert that the two outputs are semantically **not** the same object, even if they are the same shape. \n", - "\n", - "Let's test this!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "8AU9FMtdATIm" - }, - "source": [ - "emb_out.neural_type.compare(lstm_out.neural_type)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "2cqnqAGIBCjA" - }, - "source": [ - "emb_out.neural_type == lstm_out.neural_type" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HmH6B0mHDJqb" - }, - "source": [ - "## Neural Types - Limitations\n", - "\n", - "You might have noticed one interesting fact - our inputs were just `torch.Tensor` to both typed function calls, and they had no `neural_type` assigned to them.\n", - "\n", - "So why did the type check system not raise any error? \n", - "\n", - "This is to maintain compatibility - type checking is meant to work on a chain of function calls - and each of these functions should themselves be wrapped with the `@typecheck()` decorator. This is also done because we don't want to overtax the forward call with dozens of checks, and therefore we only type modules that perform some higher-order logical computation. \n", - "\n", - "------\n", - "\n", - "As an example, it is mostly unnecessary (but still possible) to type the input and output of every residual block of a ResNet model. 
However, it is practically important to type the encoder (no matter how many layers is inside it) and the decoder (the classification head) separately so that when one does fine-tuning, there is no semantic mismatch of the tensors input to the encoder and bound to the decoder." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6m28zSEKEjt_" - }, - "source": [ - "-------\n", - "For this case, since it would be impractical to extend a class to attach a type to the input tensor, we can take a shortcut and directly attach the neural type to the input!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "AGbKB4gJEzcU" - }, - "source": [ - "embedding_module = EmbeddingModule()\n", - "x1 = torch.randint(high=10, size=(1, 5))\n", - "\n", - "# Attach correct neural type\n", - "x1.neural_type = NeuralType(('B', 'T'), Index())\n", - "\n", - "print(\"embedding(x) :\", embedding_module(x=x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "F0j-evylFM5j" - }, - "source": [ - "# Attach wrong neural type [ERROR CELL]\n", - "x1.neural_type = NeuralType(('B', 'T'), LabelsType())\n", - "\n", - "print(\"embedding(x) :\", embedding_module(x=x1).shape)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "StMPyg6oCC9B" - }, - "source": [ - "## Let's create the minGPT components\n", - "\n", - "Now that we have a somewhat firm grasp of neural type checking, let's begin porting the minGPT example code. Once again, most of the code will be a direct port from the [minGPT repository](https://github.com/karpathy/minGPT).\n", - "\n", - "Here, you will notice one thing. By just changing class imports, one `@typecheck()` on forward, and adding `input_types` and `output_types` (which are also entirely optional!), we are almost entirely done with the PyTorch Lightning port!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "raFkuSRaBAE0" - }, - "source": [ - "import math\n", - "from typing import List, Set, Dict, Tuple, Optional\n", - "\n", - "import torch\n", - "import torch.nn as nn\n", - "from torch.nn import functional as F" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yakGOXrzF1XW" - }, - "source": [ - "## Creating Element Types\n", - "\n", - "Till now, we have used the Neural Types provided by the NeMo core. But we need not be restricted to the pre-defined element types !\n", - "\n", - "Users have total flexibility in defining any hierarchy of element types as they please!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ybhLLVyUF0mo" - }, - "source": [ - "class AttentionType(EncodedRepresentation):\n", - " \"\"\"Basic Attention Element Type\"\"\"\n", - "\n", - "class SelfAttentionType(AttentionType):\n", - " \"\"\"Self Attention Element Type\"\"\"\n", - "\n", - "class CausalSelfAttentionType(SelfAttentionType):\n", - " \"\"\"Causal Self Attention Element Type\"\"\"" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mONJRMdbZNSE" - }, - "source": [ - "## Creating the modules\n", - "\n", - "Neural Modules are generally top-level modules but can be used at any level of the module hierarchy.\n", - "\n", - "For demonstration, we will treat an encoder comprising a block of Causal Self Attention modules as a typed Neural Module. 
Of course, we can also treat each Causal Self Attention layer itself as a neural module if we require it, but top-level modules are generally preferred." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "w4oXpAL_CoDp" - }, - "source": [ - "class CausalSelfAttention(nn.Module):\n", - " \"\"\"\n", - " A vanilla multi-head masked self-attention layer with a projection at the end.\n", - " It is possible to use torch.nn.MultiheadAttention here but I am including an\n", - " explicit implementation here to show that there is nothing too scary here.\n", - " \"\"\"\n", - "\n", - " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", - " super().__init__()\n", - " assert n_embd % n_head == 0\n", - " self.n_head = n_head\n", - " # key, query, value projections for all heads\n", - " self.key = nn.Linear(n_embd, n_embd)\n", - " self.query = nn.Linear(n_embd, n_embd)\n", - " self.value = nn.Linear(n_embd, n_embd)\n", - " # regularization\n", - " self.attn_drop = nn.Dropout(attn_pdrop)\n", - " self.resid_drop = nn.Dropout(resid_pdrop)\n", - " # output projection\n", - " self.proj = nn.Linear(n_embd, n_embd)\n", - " # causal mask to ensure that attention is only applied to the left in the input sequence\n", - " self.register_buffer(\"mask\", torch.tril(torch.ones(block_size, block_size))\n", - " .view(1, 1, block_size, block_size))\n", - " def forward(self, x, layer_past=None):\n", - " B, T, C = x.size()\n", - "\n", - " # calculate query, key, values for all heads in batch and move head forward to be the batch dim\n", - " k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", - " q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", - " v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", - "\n", - " # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)\n", - " att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))\n", - " att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))\n", - " att = F.softmax(att, dim=-1)\n", - " att = self.attn_drop(att)\n", - " y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)\n", - " y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side\n", - "\n", - " # output projection\n", - " y = self.resid_drop(self.proj(y))\n", - " return y\n", - " \n", - "\n", - "class Block(nn.Module):\n", - " \"\"\" an unassuming Transformer block \"\"\"\n", - "\n", - " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", - " super().__init__()\n", - " self.ln1 = nn.LayerNorm(n_embd)\n", - " self.ln2 = nn.LayerNorm(n_embd)\n", - " self.attn = CausalSelfAttention(n_embd, block_size, n_head, attn_pdrop, resid_pdrop)\n", - " self.mlp = nn.Sequential(\n", - " nn.Linear(n_embd, 4 * n_embd),\n", - " nn.GELU(),\n", - " nn.Linear(4 * n_embd, n_embd),\n", - " nn.Dropout(resid_pdrop),\n", - " )\n", - "\n", - " def forward(self, x):\n", - " x = x + self.attn(self.ln1(x))\n", - " x = x + self.mlp(self.ln2(x))\n", - " return x" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mv0dyrLifkw0" - }, - "source": [ - "## Building the NeMo Model\n", - "\n", - "Since a NeMo Model is comprised of various parts, we are going to iterate on the model step by step inside this notebook. 
As such, we will have multiple intermediate NeMo \"Models\", which will be partial implementations, and they will inherit each other iteratively.\n", - "\n", - "In a complete implementation of a NeMo Model (as found in the NeMo collections), all of these components will generally be found in a single class.\n", - "\n", - "Let's start by inheriting `ModelPT` - the core class of a PyTorch NeMo Model, which inherits the PyTorch Lightning Module." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TxeG-qMrRgNU" - }, - "source": [ - "-------\n", - "**Remember**:\n", - "\n", - " - The NeMo equivalent of `torch.nn.Module` is the `NeuralModule.\n", - " - The NeMo equivalent of the `LightningModule` is `ModelPT`.\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "0TsfmCYthMux" - }, - "source": [ - "import pytorch_lightning as ptl\n", - "from nemo.core import ModelPT\n", - "from omegaconf import OmegaConf" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_ib2rSz2hjaP" - }, - "source": [ - "------\n", - "Next, let's construct the bare minimum implementation of the NeMo Model - just the constructor, the initializer of weights, and the forward method.\n", - "\n", - "Initially, we will follow the steps followed by the minGPT implementation, and progressively refactor for NeMo " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "98x9-Fh-HVwj" - }, - "source": [ - "class PTLGPT(ptl.LightningModule):\n", - " def __init__(self,\n", - " # model definition args\n", - " vocab_size: int, # size of the vocabulary (number of possible tokens)\n", - " block_size: int, # length of the model's context window in time\n", - " n_layer: int, # depth of the model; number of Transformer blocks in sequence\n", - " n_embd: int, # the \"width\" of the model, number of channels in each Transformer\n", - " n_head: int, # number of heads in each multi-head attention inside each Transformer block\n", - " # model optimization args\n", - " learning_rate: float = 3e-4, # the base learning rate of the model\n", - " weight_decay: float = 0.1, # amount of regularizing L2 weight decay on MatMul ops\n", - " betas: Tuple[float, float] = (0.9, 0.95), # momentum terms (betas) for the Adam optimizer\n", - " embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", - " resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", - " attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix\n", - " ):\n", - " super().__init__()\n", - "\n", - " # save these for optimizer init later\n", - " self.learning_rate = learning_rate\n", - " self.weight_decay = weight_decay\n", - " self.betas = betas\n", - "\n", - " # input embedding stem: drop(content + position)\n", - " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", - " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", - " self.drop = nn.Dropout(embd_pdrop)\n", - " # deep transformer: just a sequence of transformer blocks\n", - " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)])\n", - " # decoder: at the end one more layernorm and decode the answers\n", - " self.ln_f = nn.LayerNorm(n_embd)\n", - " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", - "\n", - " self.block_size = block_size\n", - " self.apply(self._init_weights)\n", - "\n", - " print(\"number of parameters: %e\" % sum(p.numel() for p 
in self.parameters()))\n", - "\n", - " def forward(self, idx):\n", - " b, t = idx.size()\n", - " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", - "\n", - " # forward the GPT model\n", - " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", - " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", - " x = self.drop(token_embeddings + position_embeddings)\n", - " x = self.blocks(x)\n", - " x = self.ln_f(x)\n", - " logits = self.head(x)\n", - "\n", - " return logits\n", - "\n", - " def get_block_size(self):\n", - " return self.block_size\n", - "\n", - " def _init_weights(self, module):\n", - " \"\"\"\n", - " Vanilla model initialization:\n", - " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", - " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", - " \"\"\"\n", - " if isinstance(module, (nn.Linear, nn.Embedding)):\n", - " module.weight.data.normal_(mean=0.0, std=0.02)\n", - " if isinstance(module, nn.Linear) and module.bias is not None:\n", - " module.bias.data.zero_()\n", - " elif isinstance(module, nn.LayerNorm):\n", - " module.bias.data.zero_()\n", - " module.weight.data.fill_(1.0)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2bMf5SO7wmor" - }, - "source": [ - "------\n", - "Let's create a PyTorch Lightning Model above, just to make sure it works !" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "rrXIBzg4wutC" - }, - "source": [ - "m = PTLGPT(vocab_size=100, block_size=32, n_layer=1, n_embd=32, n_head=4)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZCcgn1bajPW8" - }, - "source": [ - "------\n", - "Now, let's convert the above easily into a NeMo Model.\n", - "\n", - "A NeMo Model constructor generally accepts only two things - \n", - "\n", - "1) `cfg`: An OmegaConf DictConfig object that defines precisely the components required by the model to define its neural network architecture, data loader setup, optimizer setup, and any additional components needed for the model itself.\n", - "\n", - "2) `trainer`: An optional Trainer from PyTorch Lightning if the NeMo model will be used for training. It can be set after construction (if required) using the `set_trainer` method. For this notebook, we will not be constructing the config for the Trainer object." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WQMTCB3kz0UA" - }, - "source": [ - "## Refactoring Neural Modules\n", - "\n", - "As we discussed above, Neural Modules are generally higher-level components of the Model and can potentially be replaced by equivalent Neural Modules.\n", - "\n", - "As we see above, the embedding modules, deep transformer decoder network, and final decoder layer have all been combined inside the PyTorch Lightning implementation constructor.\n", - "\n", - "------\n", - "\n", - "However, the final decoder module could have been an RNN instead of a simple Linear layer, or it could have been a 1D-CNN instead.\n", - "\n", - "Likewise, the deep transformer decoder could potentially have a different implementation of Self Attention modules.\n", - "\n", - "These changes cannot be easily implemented any more inside the above implementation. However, if we refactor these components into their respective NeuralModules, then we can easily replace them with equivalent modules we construct in the future!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EJj5sSkX0xHi" - }, - "source": [ - "### Refactoring the Embedding module\n", - "\n", - "Let's first refactor out the embedding module from the above implementation" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "uYwMyjqK05RL" - }, - "source": [ - "class GPTEmbedding(NeuralModule):\n", - " def __init__(self, vocab_size: int, n_embd: int, block_size: int, embd_pdrop: float = 0.0):\n", - " super().__init__()\n", - "\n", - " # input embedding stem: drop(content + position)\n", - " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", - " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", - " self.drop = nn.Dropout(embd_pdrop)\n", - "\n", - " @typecheck()\n", - " def forward(self, idx):\n", - " b, t = idx.size()\n", - " \n", - " # forward the GPT model\n", - " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", - " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", - " x = self.drop(token_embeddings + position_embeddings)\n", - " return x\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'idx': NeuralType(('B', 'T'), Index())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'embeddings': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "l5rOP6lyOyRt" - }, - "source": [ - "### Refactoring the Encoder\n", - "\n", - "Next, let's refactor the GPT Encoder - which is implemented as a multi layer Transformer (Decoder) network.\n", - "\n", - "------\n", - "It can be noted that we refer to the GPT \"Encoder\" module - but it is constructed by using Transformer \"Decoder\" blocks.\n", - "\n", - "***When we discuss Neural Modules - we are discussing an abstract module with a certain input neural type and a certain output neural type.***\n", - "\n", - "For us, the GPT \"Encoder\" neural module will accept any implementation, whose\n", - "\n", - "- input neural type is `NeuralType(('B', 'T', 'C'), EmbeddedTextType())`\n", - "\n", - "- output type is `NeuralType(('B', 'T', 'C'), EncodedRepresentation())`\n", - "\n", - "-----\n", - "One concrete implementation of such a GPT \"Encoder\" neural module is a Deep Transformer \"Decoder\" network." 
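Before looking at the concrete Transformer-based implementation in the next cell, here is a purely hypothetical sketch (not used anywhere in this notebook) of a *different* module that would satisfy the very same contract - for example, a unidirectional (and therefore still causal) LSTM encoder:

```python
# Hypothetical alternative encoder - illustrative only, not part of the original notebook.
# Because it declares the same input/output neural types, it could in principle be swapped
# in for the Transformer-based encoder defined below without touching the rest of the model.
class GPTLSTMEncoderSketch(NeuralModule):
    def __init__(self, n_embd: int, n_layer: int = 1):
        super().__init__()
        # batch_first=True keeps tensors in [B, T, C] layout, matching the declared axes
        self.rnn = nn.LSTM(n_embd, n_embd, num_layers=n_layer, batch_first=True)

    @typecheck()
    def forward(self, embed):
        encoded, _ = self.rnn(embed)  # drop the (h, c) states for this sketch
        return encoded

    @property
    def input_types(self):
        return {'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())}

    @property
    def output_types(self):
        return {'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())}
```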
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "1QeQnQ_G2PwH" - }, - "source": [ - "class GPTTransformerEncoder(NeuralModule):\n", - " def __init__(self, n_embd: int, block_size: int, n_head: int, n_layer: int, attn_pdrop: float = 0.0, resid_pdrop: float = 0.0):\n", - " super().__init__()\n", - "\n", - " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) \n", - " for _ in range(n_layer)])\n", - " \n", - " @typecheck()\n", - " def forward(self, embed):\n", - " return self.blocks(embed)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'encoding': NeuralType(('B', 'T', 'C'), CausalSelfAttentionType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NmCR3LK3QHum" - }, - "source": [ - "### Refactoring the Decoder\n", - "\n", - "Finally, let's refactor the Decoder - the small one-layer feed-forward network to decode the answer.\n", - "\n", - "-------\n", - "\n", - "Note an interesting detail - The `input_types` of the Decoder accepts the generic `EncoderRepresentation()`, where as the `neural_type` of the `GPTTransformerEncoder` has the `output_type` of `CausalSelfAttentionType`.\n", - "\n", - "This is semantically *not* a mismatch! As you can see above in the inheritance chart, we declare `EncodedRepresentation` -> `AttentionType` -> `SelfAttentionType` -> `CausalSelfAttentionType`. \n", - "\n", - "Such an inheritance hierarchy for the `element_type` allows future encoders (which also have a neural output type of at least `EncodedRepresentation`) to be swapped in place of the current GPT Causal Self Attention Encoder while keeping the rest of the NeMo model working just fine!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "VCPUu0EWQIBX" - }, - "source": [ - "class GPTDecoder(NeuralModule):\n", - " def __init__(self, n_embd: int, vocab_size: int):\n", - " super().__init__()\n", - " self.ln_f = nn.LayerNorm(n_embd)\n", - " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", - "\n", - " @typecheck()\n", - " def forward(self, encoding):\n", - " x = self.ln_f(encoding)\n", - " logits = self.head(x)\n", - " return logits\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())\n", - " }\n", - " \n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", - " }\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nYLMjlW0Sdy1" - }, - "source": [ - "### Refactoring the NeMo GPT Model\n", - "\n", - "Now that we have 3 NeuralModules for the embedding, the encoder, and the decoder, let's refactor the NeMo model to take advantage of this refactor!\n", - "\n", - "This time, we inherit from `ModelPT` instead of the general `LightningModule`." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZQlmtYU6iDwi" - }, - "source": [ - "class AbstractNeMoGPT(ModelPT):\n", - " def __init__(self, cfg: OmegaConf, trainer: ptl.Trainer = None):\n", - " super().__init__(cfg=cfg, trainer=trainer)\n", - "\n", - " # input embedding stem: drop(content + position)\n", - " self.embedding = self.from_config_dict(self.cfg.embedding)\n", - " # deep transformer: just a sequence of transformer blocks\n", - " self.encoder = self.from_config_dict(self.cfg.encoder)\n", - " # decoder: at the end one more layernorm and decode the answers\n", - " self.decoder = self.from_config_dict(self.cfg.decoder)\n", - "\n", - " self.block_size = self.cfg.embedding.block_size\n", - " self.apply(self._init_weights)\n", - "\n", - " print(\"number of parameters: %e\" % self.num_weights)\n", - "\n", - " @typecheck()\n", - " def forward(self, idx):\n", - " b, t = idx.size()\n", - " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", - "\n", - " # forward the GPT model\n", - " # Remember: Only kwargs are allowed !\n", - " e = self.embedding(idx=idx)\n", - " x = self.encoder(embed=e)\n", - " logits = self.decoder(encoding=x)\n", - "\n", - " return logits\n", - "\n", - " def get_block_size(self):\n", - " return self.block_size\n", - "\n", - " def _init_weights(self, module):\n", - " \"\"\"\n", - " Vanilla model initialization:\n", - " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", - " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", - " \"\"\"\n", - " if isinstance(module, (nn.Linear, nn.Embedding)):\n", - " module.weight.data.normal_(mean=0.0, std=0.02)\n", - " if isinstance(module, nn.Linear) and module.bias is not None:\n", - " module.bias.data.zero_()\n", - " elif isinstance(module, nn.LayerNorm):\n", - " module.bias.data.zero_()\n", - " module.weight.data.fill_(1.0)\n", - "\n", - " @property\n", - " def input_types(self):\n", - " return {\n", - " 'idx': NeuralType(('B', 'T'), Index())\n", - " }\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DFRmxWiSmdF3" - }, - "source": [ - "## Creating a config for a Model\n", - "\n", - "At first glance, not much changed compared to the PyTorch Lightning implementation above. Other than the constructor, which now accepts a config, nothing changed at all!\n", - "\n", - "NeMo operates on the concept of a NeMo Model being accompanied by a corresponding config dict (instantiated as an OmegaConf object). This enables us to prototype the model by utilizing Hydra rapidly. This includes various other benefits - such as hyperparameter optimization and serialization/deserialization of NeMo models.\n", - "\n", - "Let's look at how actually to construct such config objects!" 
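If OmegaConf itself is unfamiliar, this short aside (illustrative names, not part of the original flow) shows the only mechanics we rely on below - building a config from a plain dict, reading it back with attribute access, and dumping it to YAML:

```python
# Quick OmegaConf warm-up - the keys here are made up purely for illustration.
from omegaconf import OmegaConf

demo_cfg = OmegaConf.create({'model': {'hidden_size': 32, 'dropout': 0.1}})

print(demo_cfg.model.hidden_size)    # attribute-style access -> 32
print(demo_cfg['model']['dropout'])  # dict-style access also works -> 0.1
print(OmegaConf.to_yaml(demo_cfg))   # serialize the config back to YAML text
```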
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "uygo0BEYjKuj" - }, - "source": [ - "# model definition args (required)\n", - "# ================================\n", - "# vocab_size: int # size of the vocabulary (number of possible tokens)\n", - "# block_size: int # length of the model's context window in time\n", - "# n_layer: int # depth of the model; number of Transformer blocks in sequence\n", - "# n_embd: int # the \"width\" of the model, number of channels in each Transformer\n", - "# n_head: int # number of heads in each multi-head attention inside each Transformer block \n", - "\n", - "# model definition args (optional)\n", - "# ================================\n", - "# embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", - "# resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", - "# attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "s4sdqRAFop-n" - }, - "source": [ - "------\n", - "As we look at the required parameters above, we need a way to tell OmegaConf that these values are currently not set, but the user should set them before we use them.\n", - "\n", - "OmegaConf supports such behavior using the `MISSING` value. A similar effect can be achieved in YAML configs by using `???` as a placeholder." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "XqLSZq7Soo2j" - }, - "source": [ - "from omegaconf import MISSING" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "JTH-1vu8TO7o" - }, - "source": [ - "# Let's create a utility for building the class path\n", - "def get_class_path(cls):\n", - " return f'{cls.__module__}.{cls.__name__}'" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6xToaWAJUmtX" - }, - "source": [ - "### Structure of a Model config\n", - "\n", - "Let's first create a config for the common components of the model level config -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZCvLdOlMVLy_" - }, - "source": [ - "common_config = OmegaConf.create({\n", - " 'vocab_size': MISSING,\n", - " 'block_size': MISSING,\n", - " 'n_layer': MISSING,\n", - " 'n_embd': MISSING,\n", - " 'n_head': MISSING,\n", - "})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "j8hvdKa4VmCV" - }, - "source": [ - "-----\n", - "The model config right now is still being built - it needs to contain a lot more details!\n", - "\n", - "A complete Model Config should have the sub-configs of all of its top-level modules as well. 
This means the configs of the `embedding`, `encoder`, and the `decoder`.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v-2_QOZyVgrE" - }, - "source": [ - "### Structure of sub-module config\n", - "\n", - "For top-level models, we generally don't change the actual module very often, and instead, primarily change the hyperparameters of that model.\n", - "\n", - "So we will make use of `Hydra`'s Class instantiation method - which can easily be accessed via the class method `ModelPT.from_config_dict()`.\n", - "\n", - "Let's take a few examples below -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ntsxQKH0pDac" - }, - "source": [ - "embedding_config = OmegaConf.create({\n", - " '_target_': get_class_path(GPTEmbedding),\n", - " 'vocab_size': '${model.vocab_size}',\n", - " 'n_embd': '${model.n_embd}',\n", - " 'block_size': '${model.block_size}',\n", - " 'embd_pdrop': 0.1\n", - "})\n", - "\n", - "encoder_config = OmegaConf.create({\n", - " '_target_': get_class_path(GPTTransformerEncoder),\n", - " 'n_embd': '${model.n_embd}',\n", - " 'block_size': '${model.block_size}',\n", - " 'n_head': '${model.n_head}',\n", - " 'n_layer': '${model.n_layer}',\n", - " 'attn_pdrop': 0.1,\n", - " 'resid_pdrop': 0.1\n", - "})\n", - "\n", - "decoder_config = OmegaConf.create({\n", - " '_target_': get_class_path(GPTDecoder),\n", - " # n_embd: int, vocab_size: int\n", - " 'n_embd': '${model.n_embd}',\n", - " 'vocab_size': '${model.vocab_size}'\n", - "})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qtloTqkqWhpl" - }, - "source": [ - "##### What is `_target_`?\n", - "--------\n", - "\n", - "In the above config, we see a `_target_` in the config. `_target_` is usually a full classpath to the actual class in the python package/user local directory. It is required for Hydra to locate and instantiate the model from its path correctly.\n", - "\n", - "So why do we want to set a classpath?\n", - "\n", - "In general, when developing models, we don't often change the encoder or the decoder, but we do change the hyperparameters of the encoder and decoder.\n", - "\n", - "This notation helps us keep the Model level declaration of the forward step neat and precise. It also logically helps us demark which parts of the model can be easily replaced - in the future, we can easily replace the encoder with some other type of self-attention block or the decoder with an RNN or 1D-CNN neural module (as long as they have the same Neural Type definition as the current blocks).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ASDmcgE4XtQ4" - }, - "source": [ - "##### What is the `${}` syntax?\n", - "-------\n", - "\n", - "OmegaConf, and by extension, Hydra, supports Variable Interpolation. As you can see in the `__init__` of embedding, encoder, and decoder neural modules, they often share many parameters between each other.\n", - "\n", - "It would become tedious and error-prone to set each of these constructors' values separately in each of the embedding, encoder, and decoder configs.\n", - "\n", - "So instead, we define standard keys inside of the `model` level config and then interpolate these values inside of the respective configs!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zXvEcXGhZi5I" - }, - "source": [ - "### Attaching the model and module-level configs\n", - "\n", - "So now, we have a Model level and per-module level configs for the core components. 
Sub-module configs generally fall under the \"model\" namespace, but you have the flexibility to define the structure as you require.\n", - "\n", - "Let's attach them!\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "c8hvNeB_aDgi" - }, - "source": [ - "model_config = OmegaConf.create({\n", - " 'model': common_config\n", - "})\n", - "\n", - "# Then let's attach the sub-module configs\n", - "model_config.model.embedding = embedding_config\n", - "model_config.model.encoder = encoder_config\n", - "model_config.model.decoder = decoder_config" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "zIubuFcOpIB0" - }, - "source": [ - "-----\n", - "Let's print this config!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "2SyKNgp9pG0N" - }, - "source": [ - "print(OmegaConf.to_yaml(model_config))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4PAA07EAauCn" - }, - "source": [ - "-----\n", - "Wait, why did OmegaConf not fill in the value of the variable interpolation for the configs yet?\n", - "\n", - "This is because OmegaConf takes a deferred approach to variable interpolation. First, we fill in temporary values of the required fields (those marked by `???`). Then, to force resolution ahead of time, we can use the following snippet - " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "0X4C76JyOAnN" - }, - "source": [ - "import copy" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ugxA0TPtbHVZ" - }, - "source": [ - "temp_config = copy.deepcopy(model_config)\n", - "temp_config.model.vocab_size = 10\n", - "temp_config.model.block_size = 4\n", - "temp_config.model.n_layer = 1\n", - "temp_config.model.n_embd = 32\n", - "temp_config.model.n_head = 4\n", - "\n", - "temp_config = OmegaConf.create(OmegaConf.to_container(temp_config, resolve=True))\n", - "print(OmegaConf.to_yaml(temp_config))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V41RFIpEpiOu" - }, - "source": [ - "-----\n", - "Now that we have a config, let's try to create an object of the NeMo Model !" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "IIIVi2IfpsJ4" - }, - "source": [ - "# Let's work on a copy of the model config and update it before we send it into the Model.\n", - "cfg = copy.deepcopy(model_config)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "OllBhswPqQXq" - }, - "source": [ - "# Let's set the values of the config (for some plausible small model)\n", - "cfg.model.vocab_size = 100\n", - "cfg.model.block_size = 128\n", - "cfg.model.n_layer = 1\n", - "cfg.model.n_embd = 32\n", - "cfg.model.n_head = 4" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QJm2LnTqqcIM" - }, - "source": [ - "print(OmegaConf.to_yaml(cfg))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "E7tpB8BcqeBO" - }, - "source": [ - "# Try to create a model with this config [ERROR CELL]\n", - "m = AbstractNeMoGPT(cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "cXOLhpxdq4Ni" - }, - "source": [ - "-----\n", - "\n", - "You will note that we added the `Abstract` tag for a reason to this NeMo Model and that when we try to instantiate it - it raises an error that we need to implement specific methods.\n", - "\n", - "1) `setup_training_data` & `setup_validation_data` - All NeMo models should implement two data loaders - the training data loader and the validation data loader. Optionally, they can go one step further and also implement the `setup_test_data` method to add support for evaluating the Model on its own.\n", - "\n", - "Why do we enforce this? NeMo Models are meant to be a unified, cohesive object containing the details about the neural network underlying that Model and the data loaders to train, validate, and optionally test those models.\n", - "\n", - "In doing so, once the Model is created/deserialized, it would take just a few more steps to train the Model from scratch / fine-tune/evaluate the Model on any data that the user provides, as long as this user-provided dataset is in a format supported by the Dataset / DataLoader that is used by this Model!\n", - "\n", - "2) `list_available_models` - This is a utility method to provide a list of pre-trained NeMo models to the user from the cloud.\n", - "\n", - "Typically, NeMo models can be easily packaged into a tar file (which we call a .nemo file in the earlier primer notebook). These tar files contain the model config + the pre-trained checkpoint weights of the Model, and can easily be downloaded from some cloud service. \n", - "\n", - "For this notebook, we will not be implementing this method.\n", - "\n", - "--------\n", - "Finally, let's create a concrete implementation of the above NeMo Model!" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Vcwi1lO7t7Sm" - }, - "source": [ - "from nemo.core.classes.common import PretrainedModelInfo" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ckCxyVLYqrz0" - }, - "source": [ - "class BasicNeMoGPT(AbstractNeMoGPT):\n", - "\n", - " @classmethod\n", - " def list_available_models(cls) -> PretrainedModelInfo:\n", - " return None\n", - "\n", - " def setup_training_data(self, train_data_config: OmegaConf):\n", - " self._train_dl = None\n", - " \n", - " def setup_validation_data(self, val_data_config: OmegaConf):\n", - " self._validation_dl = None\n", - " \n", - " def setup_test_data(self, test_data_config: OmegaConf):\n", - " self._test_dl = None" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ofUoJ8DDvq_Y" - }, - "source": [ - "------\n", - "Now let's try to create an object of the `BasicNeMoGPT` model" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "G8iYQSC5vptU" - }, - "source": [ - "m = BasicNeMoGPT(cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "otvYW4TBxAju" - }, - "source": [ - "## Setting up train-val-test steps\n", - "\n", - "The above `BasicNeMoGPT` Model is a basic PyTorch Lightning Module, with some added functionality - \n", - "\n", - "1) Neural Type checks support - as defined in the Model as well as the internal modules.\n", - "\n", - "2) Save and restore of the Model (in the trivial case) to a tarfile.\n", - "\n", - "But as the Model is right now, it crucially does not support PyTorch Lightning's `Trainer`. As such, while this Model can be called manually, it cannot be easily trained or evaluated by using the PyTorch Lightning framework.\n", - "\n", - "------\n", - "\n", - "Let's begin adding support for this then -" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "QU3oQAVovxRg" - }, - "source": [ - "class BasicNeMoGPTWithSteps(BasicNeMoGPT):\n", - "\n", - " def step_(self, split, batch, batch_idx=None):\n", - " idx, targets = batch\n", - " logits = self(idx=idx)\n", - " loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))\n", - " key = 'loss' if split == 'train' else f\"{split}_loss\"\n", - " self.log(key, loss)\n", - " return {key: loss}\n", - "\n", - " def training_step(self, *args, **kwargs):\n", - " return self.step_('train', *args, **kwargs)\n", - "\n", - " def validation_step(self, *args, **kwargs):\n", - " return self.step_('val', *args, **kwargs)\n", - "\n", - " def test_step(self, *args, **kwargs):\n", - " return self.step_('test', *args, **kwargs)\n", - " \n", - " # This is useful for multiple validation data loader setup\n", - " def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):\n", - " val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()\n", - " return {'val_loss': val_loss_mean}\n", - "\n", - " # This is useful for multiple test data loader setup\n", - " def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0):\n", - " test_loss_mean = torch.stack([x['test_loss'] for x in outputs]).mean()\n", - " return {'test_loss': test_loss_mean}" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "2Ki3kRxag511" - }, - "source": [ - "m = BasicNeMoGPTWithSteps(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": 
"f_7YziAw_Isu" - }, - "source": [ - "### Setup for Multi Validation and Multi Test data loaders\n", - "\n", - "As discussed in the NeMo Primer, NeMo has in-built support for multiple data loaders for validation and test steps. Therefore, as an example of how easy it is to add such support, we include the `multi_validation_epoch_end` and `multi_test_epoch_end` overrides.\n", - "\n", - "It is also practically essential to collate results from more than one distributed GPUs, and then aggregate results properly at the end of the epoch. NeMo strictly enforces the correct collation of results, even if you will work on only one device! Future-proofing is baked into the model design for this case!\n", - "\n", - "Therefore NeMo provides the above two generic methods to support aggregation and simultaneously support multiple datasets!\n", - "\n", - "**Please note, you can prepend your already existing `on_validation_epoch_end` and `on_test_epoch_end` implementations with the `multi_` in the name, and that alone is sufficient to enable multi-dataset and multi-GPU support!**\n", - "\n", - "------\n", - "**Note: To disable multi-dataset support, simply override `on_validation_epoch_end` and `on_test_epoch_end` instead of `multi_validation_epoch_end` and `multi_test_epoch_end`!**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QpfSn-YUh7GK" - }, - "source": [ - "## Setting up the optimizer / scheduler\n", - "\n", - "We are relatively close to reaching feature parity with the MinGPT Model! But we are missing a crucial piece - the optimizer.\n", - "\n", - "All NeMo Model's come with a default implementation of `setup_optimization()`, which will parse the provided model config to obtain the `optim` and `sched` sub-configs, and automatically configure the optimizer and scheduler.\n", - "\n", - "If training GPT was as simple as plugging in an Adam optimizer over all the parameters with a cosine weight decay schedule, we could do that from the config alone.\n", - "\n", - "-------\n", - "\n", - "But GPT is not such a trivial model - more specifically, it requires weight decay to be applied to the weight matrices but not to the biases, the embedding matrix, or the LayerNorm layers.\n", - "\n", - "We can drop the support that Nemo provides for such special cases and instead utilize the PyTorch Lightning method `configure_optimizers` to perform the same task.\n", - "\n", - "-------\n", - "\n", - "Note, for NeMo Models; the `configure_optimizers` is implemented as a trivial call to `setup_optimization()` followed by returning the generated optimizer and scheduler! So we can override the `configure_optimizer` method and manage the optimizer creation manually!\n", - "\n", - "NeMo's goal is to provide usable defaults for the general case and simply back off to either PyTorch Lightning or PyTorch nn.Module itself in cases when the additional flexibility becomes necessary!" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "FgXkZQiVjnOv" - }, - "source": [ - "class BasicNeMoGPTWithOptim(BasicNeMoGPTWithSteps):\n", - "\n", - " def configure_optimizers(self):\n", - " \"\"\"\n", - " This long function is unfortunately doing something very simple and is being very defensive:\n", - " We are separating out all parameters of the model into two buckets: those that will experience\n", - " weight decay for regularization and those that won't (biases, and layernorm/embedding weights).\n", - " We are then returning the PyTorch optimizer object.\n", - " \"\"\"\n", - "\n", - " # separate out all parameters to those that will and won't experience weight decay\n", - " decay = set()\n", - " no_decay = set()\n", - " whitelist_weight_modules = (torch.nn.Linear, )\n", - " blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)\n", - " for mn, m in self.named_modules():\n", - " for pn, p in m.named_parameters():\n", - " fpn = '%s.%s' % (mn, pn) if mn else pn # full param name\n", - "\n", - " if pn.endswith('bias'):\n", - " # all biases will not be decayed\n", - " no_decay.add(fpn)\n", - " elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):\n", - " # weights of whitelist modules will be weight decayed\n", - " decay.add(fpn)\n", - " elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):\n", - " # weights of blacklist modules will NOT be weight decayed\n", - " no_decay.add(fpn)\n", - "\n", - " # special case the position embedding parameter in the root GPT module as not decayed\n", - " no_decay.add('embedding.pos_emb')\n", - "\n", - " # validate that we considered every parameter\n", - " param_dict = {pn: p for pn, p in self.named_parameters()}\n", - " inter_params = decay & no_decay\n", - " union_params = decay | no_decay\n", - " assert len(inter_params) == 0, \"parameters %s made it into both decay/no_decay sets!\" % (str(inter_params), )\n", - " assert len(param_dict.keys() - union_params) == 0, \"parameters %s were not separated into either decay/no_decay set!\" \\\n", - " % (str(param_dict.keys() - union_params), )\n", - "\n", - " # create the pytorch optimizer object\n", - " optim_groups = [\n", - " {\"params\": [param_dict[pn] for pn in sorted(list(decay))], \"weight_decay\": self.cfg.optim.weight_decay},\n", - " {\"params\": [param_dict[pn] for pn in sorted(list(no_decay))], \"weight_decay\": 0.0},\n", - " ]\n", - " optimizer = torch.optim.AdamW(optim_groups, lr=self.cfg.optim.lr, betas=self.cfg.optim.betas)\n", - " return optimizer\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "kARDwthakEQk" - }, - "source": [ - "m = BasicNeMoGPTWithOptim(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iB1kwctv2cYv" - }, - "source": [ - "-----\n", - "Now let's setup the config for the optimizer !" 
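One quick note before we do. In the simple case described earlier - uniform weight decay over all parameters - we would not need a `configure_optimizers` override at all: a richer `optim` sub-config (optionally with a nested `sched` section) placed under the model config could drive NeMo's default `setup_optimization()` directly. A rough, illustrative sketch of that unused path follows (exact keys may vary between NeMo versions); since our `configure_optimizers` override reads only `lr`, `weight_decay`, and `betas`, the actual config we attach in the next cell stays minimal.

```python
# Illustrative only - this notebook does NOT take this path.
# Something along these lines under cfg.model.optim would let the default
# setup_optimization() build the optimizer (and scheduler) without any override.
config_driven_optim = OmegaConf.create({
    'name': 'adamw',       # assumed to be an optimizer name registered with NeMo
    'lr': 3e-4,
    'betas': [0.9, 0.95],
    'weight_decay': 0.1,   # applied uniformly - exactly what GPT training wants to avoid
    'sched': {
        'name': 'CosineAnnealing',  # assumed scheduler name registered with NeMo
        'warmup_steps': 1000,
        'min_lr': 1e-5,
    },
})
```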
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "5K7zh9Cn2s2u" - }, - "source": [ - "OmegaConf.set_struct(cfg.model, False)\n", - "\n", - "optim_config = OmegaConf.create({\n", - " 'lr': 3e-4,\n", - " 'weight_decay': 0.1,\n", - " 'betas': [0.9, 0.95]\n", - "})\n", - "\n", - "cfg.model.optim = optim_config\n", - "\n", - "OmegaConf.set_struct(cfg.model, True)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "P31p8ABthsh0" - }, - "source": [ - "## Setting up the dataset / data loaders\n", - "\n", - "So we were able almost entirely to replicate the MinGPT implementation. \n", - "\n", - "Remember, NeMo models should contain all of the logic to load the Dataset and DataLoader for at least the train and validation step.\n", - "\n", - "We temporarily provided empty implementations to get around it till now, but let's fill that in now!\n", - "\n", - "-------\n", - "\n", - "**Note for datasets**: Below, we will show an example using a very small dataset called `tiny_shakespeare`, found at the original [char-rnn repository](https://github.com/karpathy/char-rnn), but practically you could use any text corpus. The one suggested in minGPT is available at http://mattmahoney.net/dc/textdata.html" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "q8dlOcZPkxM1" - }, - "source": [ - "### Creating the Dataset\n", - "\n", - "NeMo has Neural Type checking support, even for Datasets! It's just a minor change of the import in most cases and one difference in how we handle `collate_fn`.\n", - "\n", - "We could paste the dataset info from minGPT, and you'd only need to make 2 changes!\n", - "\n", - "-----\n", - "In this example, we will be writing a thin subclass over the datasets provided by `nlp` from HuggingFace!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "E-fswFkig9t4" - }, - "source": [ - "from nemo.core import Dataset\n", - "from torch.utils import data\n", - "from torch.utils.data.dataloader import DataLoader" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "-Z8XuPeClGNm" - }, - "source": [ - "class TinyShakespeareDataset(Dataset):\n", - "\n", - " def __init__(self, data_path, block_size, crop=None, override_vocab=None):\n", - "\n", - " # load the data and crop it appropriately\n", - " with open(data_path, 'r') as f:\n", - " if crop is None:\n", - " data = f.read()\n", - " else:\n", - " f.seek(crop[0])\n", - " data = f.read(crop[1])\n", - "\n", - " # build a vocabulary from data or inherit it\n", - " vocab = sorted(list(set(data))) if override_vocab is None else override_vocab\n", - "\n", - " # Add UNK\n", - " special_tokens = ['', ''] # We use just and in the call, but can add others.\n", - " if not override_vocab:\n", - " vocab = [*special_tokens, *vocab] # Update train vocab with special tokens\n", - "\n", - " data_size, vocab_size = len(data), len(vocab)\n", - " print('data of crop %s has %d characters, vocab of size %d.' 
% (str(crop), data_size, vocab_size))\n", - " print('Num samples in dataset : %d' % (data_size // block_size))\n", - "\n", - " self.stoi = { ch:i for i,ch in enumerate(vocab) }\n", - " self.itos = { i:ch for i,ch in enumerate(vocab) }\n", - " self.block_size = block_size\n", - " self.vocab_size = vocab_size\n", - " self.data = data\n", - " self.vocab = vocab\n", - " self.special_tokens = special_tokens\n", - "\n", - " def __len__(self):\n", - " return len(self.data) // self.block_size\n", - "\n", - " def __getitem__(self, idx):\n", - " # attempt to fetch a chunk of (block_size + 1) items, but (block_size) will work too\n", - " chunk = self.data[idx*self.block_size : min(len(self.data), (idx+1)*self.block_size + 1)]\n", - " # map the string into a sequence of integers\n", - " ixes = [self.stoi[s] if s in self.stoi else self.stoi[''] for s in chunk ]\n", - " # if stars align (last idx and len(self.data) % self.block_size == 0), pad with \n", - " if len(ixes) < self.block_size + 1:\n", - " assert len(ixes) == self.block_size # i believe this is the only way this could happen, make sure\n", - " ixes.append(self.stoi[''])\n", - " dix = torch.tensor(ixes, dtype=torch.long)\n", - " return dix[:-1], dix[1:]\n", - "\n", - " @property\n", - " def output_types(self):\n", - " return {\n", - " 'input': NeuralType(('B', 'T'), Index()),\n", - " 'target': NeuralType(('B', 'T'), LabelsType())\n", - " }" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7MEMR4TcmP5K" - }, - "source": [ - "------\n", - "We didn't have to change anything until here. How then is type-checking done? \n", - "\n", - "NeMo does type-checking inside of the collate function implementation itself! In this case, it is not necessary to override the `collate_fn` inside the Dataset, but if we did need to override it, **NeMo requires that the private method `_collate_fn` be overridden instead**.\n", - "\n", - "We can then use data loaders with minor modifications!\n", - "\n", - "**Also, there is no need to implement the `input_types` for Dataset, as they are the ones generating the input for the model!**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZeKXAknenVch" - }, - "source": [ - "-----\n", - "Let's prepare the dataset that we are going to use - Tiny Shakespeare from the following codebase [char-rnn](https://github.com/karpathy/char-rnn)." 
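Before downloading it, one short illustrative aside on the collate point above: if this dataset ever did need custom batching, the override would target the private `_collate_fn` hook rather than `collate_fn` itself. A sketch under that assumption (we do not actually need it here, since every sample already has the same length):

```python
# Hypothetical subclass - purely to illustrate overriding the private _collate_fn hook.
class PaddedTinyShakespeareDataset(TinyShakespeareDataset):

    def _collate_fn(self, batch):
        # `batch` is a list of (input, target) pairs produced by __getitem__ above.
        inputs, targets = zip(*batch)
        # All samples here share the same length, so stacking is sufficient;
        # a real custom implementation might pad ragged sequences instead.
        return torch.stack(inputs, dim=0), torch.stack(targets, dim=0)
```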
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "VwsdXtVzo--t" - }, - "source": [ - "import os" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QvKcDCvIl9-A" - }, - "source": [ - "if not os.path.exists('tiny-shakespeare.txt'):\n", - " !wget https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ynCwqDu6vK8P" - }, - "source": [ - "!head -n 5 tiny-shakespeare.txt" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "bfRL4t9_oS4C" - }, - "source": [ - "train_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(0, int(1e6)))\n", - "val_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1e6), int(50e3)), override_vocab=train_dataset.vocab)\n", - "test_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1.05e6), int(100e3)), override_vocab=train_dataset.vocab)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kIlCoZDksEDO" - }, - "source": [ - "### Setting up dataset/data loader support in the Model\n", - "\n", - "So we now know our data loader works. Let's integrate it as part of the Model itself!\n", - "\n", - "To do this, we use the three special attributes of the NeMo Model - `self._train_dl`, `self._validation_dl` and `self._test_dl`. Once you construct your DataLoader, place your data loader to these three variables. \n", - "\n", - "For multi-data loader support, the same applies! NeMo will automatically handle the management of multiple data loaders for you!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "SVSfIk_-rMSg" - }, - "source": [ - "class NeMoGPT(BasicNeMoGPTWithOptim):\n", - "\n", - " def _setup_data_loader(self, cfg):\n", - " if self.vocab is None:\n", - " override_vocab = None\n", - " else:\n", - " override_vocab = self.vocab\n", - "\n", - " dataset = TinyShakespeareDataset(\n", - " data_path=cfg.data_path,\n", - " block_size=cfg.block_size,\n", - " crop=tuple(cfg.crop) if 'crop' in cfg else None,\n", - " override_vocab=override_vocab\n", - " )\n", - "\n", - " if self.vocab is None:\n", - " self.vocab = dataset.vocab\n", - "\n", - " return DataLoader(\n", - " dataset=dataset,\n", - " batch_size=cfg.batch_size,\n", - " shuffle=cfg.shuffle,\n", - " collate_fn=dataset.collate_fn, # <-- this is necessary for type checking\n", - " pin_memory=cfg.pin_memory if 'pin_memory' in cfg else False,\n", - " num_workers=cfg.num_workers if 'num_workers' in cfg else 0\n", - " )\n", - " \n", - " def setup_training_data(self, train_data_config: OmegaConf):\n", - " self.vocab = None\n", - " self._train_dl = self._setup_data_loader(train_data_config)\n", - " \n", - " def setup_validation_data(self, val_data_config: OmegaConf):\n", - " self._validation_dl = self._setup_data_loader(val_data_config)\n", - " \n", - " def setup_test_data(self, test_data_config: OmegaConf):\n", - " self._test_dl = self._setup_data_loader(test_data_config)\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ait4nLtIxS96" - }, - "source": [ - "### Creating the dataset / dataloader config\n", - "\n", - "The final step to setup this model is to add the `train_ds`, `validation_ds` and `test_ds` configs inside the model config!" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "C6zcTqJixOOL" - }, - "source": [ - "OmegaConf.set_struct(cfg.model, False)\n", - "\n", - "# Set the data path and update vocabular size\n", - "cfg.model.data_path = 'tiny-shakespeare.txt'\n", - "cfg.model.vocab_size = train_dataset.vocab_size\n", - "\n", - "OmegaConf.set_struct(cfg.model, True)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "zlvThf7BysyT" - }, - "source": [ - "train_ds = OmegaConf.create({\n", - " 'data_path': '${model.data_path}',\n", - " 'block_size': '${model.block_size}',\n", - " 'crop': [0, int(1e6)],\n", - " 'batch_size': 64,\n", - " 'shuffle': True,\n", - "})\n", - "\n", - "validation_ds = OmegaConf.create({\n", - " 'data_path': '${model.data_path}',\n", - " 'block_size': '${model.block_size}',\n", - " 'crop': [int(1e6), int(50e3)],\n", - " 'batch_size': 4,\n", - " 'shuffle': False,\n", - "})\n", - "\n", - "test_ds = OmegaConf.create({\n", - " 'data_path': '${model.data_path}',\n", - " 'block_size': '${model.block_size}',\n", - " 'crop': [int(1.05e6), int(100e3)],\n", - " 'batch_size': 4,\n", - " 'shuffle': False,\n", - "})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QVVzR6WKyMT5" - }, - "source": [ - "# Attach to the model config\n", - "OmegaConf.set_struct(cfg.model, False)\n", - "\n", - "cfg.model.train_ds = train_ds\n", - "cfg.model.validation_ds = validation_ds\n", - "cfg.model.test_ds = test_ds\n", - "\n", - "OmegaConf.set_struct(cfg.model, True)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "nd_9_mxS0ET-" - }, - "source": [ - "# Let's see the config now !\n", - "print(OmegaConf.to_yaml(cfg))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "dlwSQENU0JxA" - }, - "source": [ - "# Let's try creating a model now !\n", - "model = NeMoGPT(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Q_Mp4bhH0tR1" - }, - "source": [ - "-----\n", - "All the data loaders load properly ! Yay!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "CZHDqCyo6uWd" - }, - "source": [ - "# Evaluate the model - end to end!\n", - "\n", - "Now that the data loaders have been set up, all that's left is to train and test the model! We have most of the components required by this model - the train, val and test data loaders, the optimizer, and the type-checked forward step to perform the train-validation-test steps! \n", - "\n", - "But training a GPT model from scratch is not the goal of this primer, so instead, let's do a sanity check by merely testing the model for a few steps using random initial weights.\n", - "\n", - "The above will ensure that - \n", - "\n", - "1) Our data loaders work as intended\n", - "\n", - "2) The type checking system assures us that our Neural Modules are performing their forward step correctly.\n", - "\n", - "3) The loss is calculated, and therefore the model runs end to end, ultimately supporting PyTorch Lightning." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "johk6Z0e0WEm" - }, - "source": [ - "if torch.cuda.is_available():\n", - " accelerator = 'gpu'\n", - "else:\n", - " accelerator = 'cpu'\n", - "\n", - "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches=1.0)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "oqeeofEr1S8e" - }, - "source": [ - "trainer.test(model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pqJy7esrA-Ha" - }, - "source": [ - "# Saving and restoring models\n", - "\n", - "NeMo internally keeps track of the model configuration, as well as the model checkpoints and parameters.\n", - "\n", - "As long as your NeMo follows the above general guidelines, you can call the `save_to` and `restore_from` methods to save and restore your models!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "DksG_-7G1Vbe" - }, - "source": [ - "model.save_to('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "JhjoFdCnBWVh" - }, - "source": [ - "!ls -d -- *.nemo" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "567txSF0BYXN" - }, - "source": [ - "temp_model = NeMoGPT.restore_from('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "YvnfG0kxBfTt" - }, - "source": [ - "# [ERROR CELL]\n", - "temp_model.setup_test_data(temp_model.cfg.test_ds)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "N0ckN44YB-1K" - }, - "source": [ - "-----\n", - "\n", - "Hmm, it seems it wasn't so easy in this case. Non-trivial models have non-trivial issues!\n", - "\n", - "Remember, our NeMoGPT model sets its self.vocab inside the `setup_train_data` step. But that depends on the vocabulary generated by the train set... which is **not** restored during model restoration (unless you call `setup_train_data` explicitly!).\n", - "\n", - "We can quickly resolve this issue by constructing an external data file to enable save and restore support, and NeMo supports that too! We will use the `register_artifact` API in NeMo to support external files being attached to the .nemo checkpoint." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_Atyoc4NBjEV" - }, - "source": [ - "class NeMoGPTv2(NeMoGPT):\n", - " \n", - " def setup_training_data(self, train_data_config: OmegaConf):\n", - " self.vocab = None\n", - " self._train_dl = self._setup_data_loader(train_data_config)\n", - "\n", - " # Save the vocab into a text file for now\n", - " with open('vocab.txt', 'w') as f:\n", - " for token in self.vocab:\n", - " f.write(f\"{token}\")\n", - " \n", - " # This is going to register the file into .nemo!\n", - " # When you later use .save_to(), it will copy this file into the tar file.\n", - " self.register_artifact('vocab_file', 'vocab.txt')\n", - " \n", - " def setup_validation_data(self, val_data_config: OmegaConf):\n", - " # This is going to try to find the same file, and if it fails, \n", - " # it will use the copy in .nemo\n", - " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", - " \n", - " with open(vocab_file, 'r') as f:\n", - " vocab = []\n", - " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", - " self.vocab = vocab\n", - "\n", - " self._validation_dl = self._setup_data_loader(val_data_config)\n", - " \n", - " def setup_test_data(self, test_data_config: OmegaConf):\n", - " # This is going to try to find the same file, and if it fails, \n", - " # it will use the copy in .nemo\n", - " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", - "\n", - " with open(vocab_file, 'r') as f:\n", - " vocab = []\n", - " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", - " self.vocab = vocab\n", - "\n", - " self._test_dl = self._setup_data_loader(test_data_config)\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "mn09jsRZDusN" - }, - "source": [ - "# Let's try creating a model now !\n", - "model = NeMoGPTv2(cfg=cfg.model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "sQPIPySDD1K0" - }, - "source": [ - "# Now let's try to save and restore !\n", - "model.save_to('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "0YwCJ4xaJ3bU" - }, - "source": [ - "temp_model = NeMoGPTv2.restore_from('gpt_model.nemo')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "tcxwDIIWKKCQ" - }, - "source": [ - "temp_model.setup_multiple_test_data(temp_model.cfg.test_ds)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "j3Olm6ZTKRbO" - }, - "source": [ - "if torch.cuda.is_available():\n", - " accelerator = 'gpu'\n", - "else:\n", - " accelerator = 'cpu'\n", - "\n", - "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches =1.0)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "_QE2SngCKV2p" - }, - "source": [ - "trainer.test(model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "o2HpKzwKJ_MW" - }, - "source": [ - "------\n", - "There we go ! Now our models can be serialized and de-serialized without any issue, even with an external vocab file !" 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZjCV5u3_OO7a" - }, - "source": [ - "" - ], - "execution_count": null, - "outputs": [] - } - ] + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "01_NeMo_Models.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "ASnx4b5jXsil" + }, + "source": [ + "\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "\n", + "# ## Install NeMo\n", + "BRANCH = 'r1.23.0'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "## Install TorchAudio\n", + "!pip install torchaudio>=0.10.0 -f https://download.pytorch.org/whl/torch_stable.html\n", + "\n", + "## Grab the config we'll use in this example\n", + "!mkdir configs" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a0eAURFKXdFT" + }, + "source": [ + "# minGPT License\n", + "\n", + "*This notebook port's the [minGPT codebase](https://github.com/karpathy/minGPT) into equivalent NeMo code. The license for minGPT has therefore been attached here.*\n", + "\n", + "```\n", + "The MIT License (MIT) Copyright (c) 2020 Andrej Karpathy\n", + "\n", + "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", + "\n", + "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", + "\n", + "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2b7Z064UZFH9" + }, + "source": [ + "# torch-rnn License\n", + "*This notebook utilizes the `tiny-shakespeare` dataset from the [torch-rnn](https://github.com/jcjohnson/torch-rnn) codebase. 
The license for torch-rnn has therefore been attached here.*\n", + "\n", + "```\n", + "The MIT License (MIT)\n", + "\n", + "Copyright (c) 2016 Justin Johnson\n", + "\n", + "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", + "\n", + "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", + "\n", + "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eKzK-Z7obCED" + }, + "source": [ + "-------\n", + "\n", + "***Note: This notebook will intentionally introduce some errors to show the power of Neural Types or model development concepts, inside the cells marked with `[ERROR CELL]`. The explanation of and resolution of such errors can be found in the subsequent cells.***\n", + "\n", + "-----" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "81qdv0mPee-j" + }, + "source": [ + "# The NeMo Model\n", + "\n", + "NeMo comes with several state-of-the-art pre-trained Conversational AI models for users to quickly be able to start training and fine-tuning on their own datasets. \n", + "\n", + "In the previous [NeMo Primer](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb) notebook, we learned how to download pretrained checkpoints with NeMo and we also discussed the fundamental concepts of the NeMo Model. The previous tutorial showed us how to use, modify, save, and restore NeMo Models.\n", + "\n", + "In this tutorial we will learn how to develop a non-trivial NeMo model from scratch. This helps us to understand the underlying components and how they interact with the overall PyTorch ecosystem.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nKNftwxzllth" + }, + "source": [ + "-------\n", + "At the heart of NeMo lies the concept of the \"Model\". For NeMo developers, a \"Model\" is the neural network(s) as well as all the infrastructure supporting those network(s), wrapped into a singular, cohesive unit. As such, most NeMo models are constructed to contain the following out of the box (note: some NeMo models support additional functionality specific to the domain/use case!) 
- \n", + "\n", + " - Neural Network architecture - all of the modules that are required for the model.\n", + "\n", + " - Dataset + Data Loaders - all of the components that prepare the data for consumption during training or evaluation.\n", + "\n", + " - Preprocessing + Postprocessing - any of the components that process the datasets so the modules can easily consume them.\n", + "\n", + " - Optimizer + Schedulers - basic defaults that work out of the box and allow further experimentation with ease.\n", + "\n", + " - Any other supporting infrastructure - tokenizers, language model configuration, data augmentation, etc." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5VOoAQT1mipO" + }, + "source": [ + "# Constructing a NeMo Model\n", + "\n", + "NeMo \"Models\" are comprised of a few key components, so let's tackle them one by one. We will attempt to go in the order that's stated above.\n", + "\n", + "To make this slightly challenging, let's port a model from the NLP domain this time. Transformers are all the rage, with BERT and his friends from Sesame Street forming the core infrastructure for many NLP tasks. \n", + "\n", + "An excellent (yet simple) implementation of one such model - GPT - can be found in the `minGPT` repository - https://github.com/karpathy/minGPT. While the script is short, it explains and succinctly explores all of the core components we expect in a NeMo model, so it's a prime candidate for NeMo! Sidenote: NeMo supports GPT in its NLP collection, and as such, this notebook aims to be an in-depth development walkthrough for such models.\n", + "\n", + "In the following notebook, we will attempt to port minGPT to NeMo, and along the way, discuss some core concepts of NeMo itself." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fOlQKsaRot1l" + }, + "source": [ + "# Constructing the Neural Network Architecture\n", + "\n", + "First, on the list - the neural network that forms the backbone of the NeMo Model.\n", + "\n", + "So how do we create such a model? Using PyTorch! As you'll see below, NeMo components are compatible with all of PyTorch, so you can augment your workflow without ever losing the flexibility of PyTorch itself!\n", + "\n", + "Let's start with a couple of imports - " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "piLOgwOPX1FS" + }, + "source": [ + "import torch\n", + "import nemo\n", + "from nemo.core import NeuralModule\n", + "from nemo.core import typecheck" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yySYjHgAqVvT" + }, + "source": [ + "## Neural Module\n", + "Wait, what's `NeuralModule`? Where is the wonderful `torch.nn.Module`? \n", + "\n", + "`NeuralModule` is a subclass of `torch.nn.Module`, and it brings with it a few additional functionalities.\n", + "\n", + "In addition to being a `torch.nn.Module`, thereby being entirely compatible with the PyTorch ecosystem, it has the following capabilities - \n", + "\n", + "1) `Typing` - It adds support for `Neural Type Checking` to the model. `Typing` is optional but quite useful, as we will discuss below!\n", + "\n", + "2) `Serialization` - Remember the `OmegaConf` config dict and YAML config files? Well, all `NeuralModules` inherently supports serialization/deserialization from such config dictionaries!\n", + "\n", + "3) `FileIO` - This is another entirely optional file serialization system. Does your `NeuralModule` require some way to preserve data that can't be saved into a PyTorch checkpoint? 
Write your serialization and deserialization logic in two handy methods! **Note**: When you create the final NeMo Model, this will be implemented for you! Automatic serialization and deserialization support of NeMo models!\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bseLiNoqqQrE" + }, + "source": [ + "class MyEmptyModule(NeuralModule):\n", + "\n", + " def forward(self):\n", + " print(\"Neural Module ~ hello world!\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j4Q36L5urdOQ" + }, + "source": [ + "x = MyEmptyModule()\n", + "x()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lHXAcn5Ot_1I" + }, + "source": [ + "## Neural Types\n", + "\n", + "Neural Types? You might be wondering what that term refers to.\n", + "\n", + "Almost all NeMo components inherit the class `Typing`. `Typing` is a simple class that adds two properties to the class that inherits it - `input_types` and `output_types`. A NeuralType, by its shortest definition, is simply a semantic tensor. It contains information regarding the semantic shape the tensor should hold, as well as the semantic information of what that tensor represents. That's it.\n", + "\n", + "So what semantic information does such a typed tensor contain? Let's take an example below.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ezOJERbVwG34" + }, + "source": [ + "------\n", + "Across the Deep Learning domain, we often encounter cases where tensor shapes may match, but the semantics don't match at all. For example take a look at the following rank 3 tensors - " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZvC57bbxwXxN" + }, + "source": [ + "# Case 1:\n", + "embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", + "x = torch.randint(high=10, size=(1, 5))\n", + "print(\"x :\", x)\n", + "print(\"embedding(x) :\", embedding(x).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sMaqhMBgxe2C" + }, + "source": [ + "# Case 2\n", + "lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", + "x = torch.randn(1, 5, 1)\n", + "print(\"x :\", x)\n", + "print(\"lstm(x) :\", lstm(x)[0].shape) # Let's take all timestep outputs of the LSTM" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9IQHjki-yezX" + }, + "source": [ + "-------\n", + "As you can see, the output of Case 1 is an embedding of shape [1, 5, 30], and the output of Case 2 is an LSTM output (state `h` over all time steps), also of the same shape [1, 5, 30].\n", + "\n", + "Do they have the same shape? **Yes**.
If we compare Case 1's `.shape` with Case 2's `.shape`, will we get `True`? **Yes**.
\n", + "Do they represent the same concept? **No**.
\n", + "\n", + "\n", + "The ability to recognize that the two tensors do not represent the same semantic information is precisely why we utilize Neural Types. It contains the information of both the shape and the semantic concept of what that tensor represents. If we performed a neural type check between the two outputs of those tensors, it would raise an error saying semantically they were different things (more technically, it would say that they are `INCOMPATIBLE` with each other)!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ucP0hNI7vWrU" + }, + "source": [ + "--------\n", + "\n", + "You may have read of concepts such as [Named Tensors](https://pytorch.org/docs/stable/named_tensor.html). While conceptually similar, Neural Types attached by NeMo are not as tightly bound to the PyTorch ecosystem - practically any object of a class can be attached with a neural type!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Uvf5oLt9zxSS" + }, + "source": [ + "## Neural Types - Usage\n", + "\n", + "Neural Types sound interesting, so how do we go about adding them? Let's take a few cases below. \n", + "\n", + "Neural Types are one of the core foundations of NeMo - you will find them in a vast majority of Neural Modules, and every NeMo Model will have its Neural Types defined. While they are entirely optional and not intrusive, NeMo takes great care to support it so that there is no semantic incompatibility between components being used by users." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eTizOBUg0qIB" + }, + "source": [ + "Let's start with a basic example of a type checked module." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "yp0FG8NJt1Jd" + }, + "source": [ + "from nemo.core.neural_types import NeuralType\n", + "from nemo.core.neural_types import *" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "3tsgs8Fp0-WV" + }, + "source": [ + "class EmbeddingModule(NeuralModule):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)\n", + "\n", + " @typecheck()\n", + " def forward(self, x):\n", + " return self.embedding(x)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'x': NeuralType(axes=('B', 'T'), elements_type=Index())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EmbeddedTextType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sY9GYEoD3Yy0" + }, + "source": [ + "To show the benefit of Neural Types, we are going to replicate the above cases inside NeuralModules.\n", + "\n", + "Let's discuss how we added type checking support to the above class.\n", + "\n", + "1) `forward` has a decorator `@typecheck()` on it.\n", + "\n", + "2) `input_types` and `output_types` properties are defined.\n", + "\n", + "That's it!" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "on268fAX4LLU" + }, + "source": [ + "-------\n", + "\n", + "Let's expand on each of the above steps.\n", + "\n", + "- `@typecheck()` is a simple decorator that takes any class that inherits `Typing` (NeuralModule does this for us) and adds the two default properties of `input_types` and `output_types`, which by default returns None.\n", + "\n", + "The `@typecheck()` decorator's explicit use ensures that, by default, neural type checking is **disabled**. NeMo does not wish to intrude on the development process of models. So users can \"opt-in\" to type checking by overriding the two properties. Therefore, the decorator ensures that users are not burdened with type checking before they wish to have it.\n", + "\n", + "So what is `@typecheck()`? Simply put, you can wrap **any** function of a class that inherits `Typing` with this decorator, and it will look up the definition of the types of that class and enforce them. Typically, `torch.nn.Module` subclasses only implement `forward()` so it is most common to wrap that method, but `@typecheck()` is a very flexible decorator. Inside NeMo, we will show some advanced use cases (which are quite crucial to particular domains such as TTS)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o9i1KugG5om7" + }, + "source": [ + "------\n", + "\n", + "As we see above, `@typecheck()` enforces the types. How then, do we provide this type of information to NeMo? \n", + "\n", + "By overriding `input_types` and `output_types` properties of the class, we can return a dictionary mapping a string name to a `NeuralType`.\n", + "\n", + "In the above case, we define a `NeuralType` as two components - \n", + "\n", + "- `axes`: This is the semantic information of the carried by the axes themselves. The most common axes information is from single character notation.\n", + "\n", + "> `B` = Batch
\n", + "> `C` / `D` - Channel / Dimension (treated the same)
\n", + "> `T` - Time
\n", + "> `H` / `W` - Height / Width
\n", + "\n", + "- `elements_type`: This is the semantic information of \"what the tensor represents\". All such types are derived from the basic `ElementType`, and merely subclassing `ElementType` allows us to build a hierarchy of custom semantic types that can be used by NeMo!\n", + "\n", + "Here, we declare that the input is an element_type of `Index` (index of the character in the vocabulary) and that the output is an element_type of `EmbeddedTextType` (the text embedding)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "boxxMniv27vi" + }, + "source": [ + "embedding_module = EmbeddingModule()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BgfDuBm27wiV" + }, + "source": [ + "Now let's construct the equivalent of the Case 2 above, but as a `NeuralModule`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SZZOOoCJ2-iV" + }, + "source": [ + "class LSTMModule(NeuralModule):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.lstm = torch.nn.LSTM(1, 30, batch_first=True)\n", + "\n", + " @typecheck()\n", + " def forward(self, x):\n", + " return self.lstm(x)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'x': NeuralType(axes=('B', 'T', 'C'), elements_type=SpectrogramType())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7iIWIunz8IQq" + }, + "source": [ + "------\n", + "Here, we define the LSTM module from the Case 2 above.\n", + "\n", + "We changed the input to be a rank three tensor, now representing a \"SpectrogramType\". We intentionally keep it generic - it can be a `MelSpectrogramType` or a `MFCCSpectrogramType` as its input!\n", + "\n", + "The output of an LSTM is now an `EncodedRepresentation`. Practically, this can be the output of a CNN layer, a Transformer block, or in this case, an LSTM layer. We can, of course, specialize by subclassing EncodedRepresentation and then using that!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6LlOJf0C8GN4" + }, + "source": [ + "lstm_module = LSTMModule()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hj0wonSz8_0c" + }, + "source": [ + "------\n", + "Now for the test !" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "giLJlub78-Ja" + }, + "source": [ + "# Case 1 [ERROR CELL]\n", + "x1 = torch.randint(high=10, size=(1, 5))\n", + "print(\"x :\", x1)\n", + "print(\"embedding(x) :\", embedding_module(x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K-fhclja9WLr" + }, + "source": [ + "-----\n", + "You might be wondering why we get a `TypeError` right off the bat. This `TypeError` is raised by design.\n", + "\n", + "Positional arguments can cause significant issues during model development, mostly when the model/module design is not finalized. To reduce the potential for mistakes caused by wrong positional arguments and enforce the name of arguments provided to the function, `Typing` requires you to **call all of your type-checked functions by kwargs only**." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2KUj_p6M9L-f" + }, + "source": [ + "# Case 1\n", + "print(\"x :\", x1)\n", + "print(\"embedding(x) :\", embedding_module(x=x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dirhWWvMRusx" + }, + "source": [ + "Now let's try the same for the `LSTMModule` in Case 2" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FMu3B0-9-CqE" + }, + "source": [ + "# Case 2 [ERROR CELL]\n", + "x2 = torch.randn(1, 5, 1) # Input = [B=1, T=5, C=1]\n", + "print(\"x :\", x2)\n", + "print(\"lstm(x) :\", lstm_module(x=x2)[0].shape) # Let's take all timestep outputs of the LSTM" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-OTLdR_4-isV" + }, + "source": [ + "-----\n", + "Now we get a type error stating that the number of output arguments provided does not match what is expected.\n", + "\n", + "What exactly is going on here? Well, inside our `LSTMModule` class, we declare the output types to be a single NeuralType - an `EncodedRepresentation` of shape [B, T, C].\n", + "\n", + "But the output of an LSTM layer is a tuple of \n", + "1) the encoded representation of shape [B, T, C]\n", + "2) another tuple containing two state values - the hidden state `h` and the cell state `c`, each of shape [num_layers * num_directions, B, C]!\n", + "\n", + "So the neural type system raises an error saying that the number of output arguments does not match what is expected.\n", + "\n", + "**NOTE**: The axis kind information of the two states will be represented by `D` to represent a general \"Dimension\" - since `num_layers` and `num_directions` are collapsed under a single axis. For NeMo, Axis types of `C` and `D` are equivalent and can be interchanged, so we will use `C` here to represent the hidden dimension of the LSTM and `D` to represent the merged axis `num_layers * num_directions`.\n", + "\n", + "Let's fix the above." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "q2u-keAM-d-B" + }, + "source": [ + "class CorrectLSTMModule(LSTMModule): # Let's inherit the wrong class to make it easy to override\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation()),\n", + " 'h_c': [NeuralType(axes=('D', 'B', 'C'), elements_type=EncodedRepresentation())],\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "a99NX0O8KMvW" + }, + "source": [ + "You should note that for the `h_c` neural type, we wrap it in a list - `[]`. NeMo, by default, assumes that each `NeuralType` corresponds to a single returned value. However, in the case of LSTMs, they produce a tuple of two state tensors.\n", + "\n", + "So we inform NeMo that this particular `NeuralType` is a single-dimensional list of items - and that each element of this list shares the same `NeuralType` and has the same shape.\n", + "\n", + "NeMo then ensures that the `h_c` is always a list of tensors. It will not check *how many* items are in the list, but will ensure that the returned value *must be a list containing zero or more items* - and that each of these items share the same `NeuralType`. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "GyPZH-fz_dG4" + }, + "source": [ + "lstm_module = CorrectLSTMModule()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "9whH50PE_Xyx" + }, + "source": [ + "# Case 2\n", + "x2 = torch.randn(1, 5, 1)\n", + "y2, (h, c) = lstm_module(x=x2)\n", + "print(\"x :\", x2)\n", + "print(\"lstm(x) :\", y2.shape) # The output of the LSTM RNN\n", + "print(\"hidden state (h) :\", h.shape) # The first hidden state of the LSTM RNN\n", + "print(\"hidden state (c) :\", c.shape) # The second hidden state of the LSTM RNN" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cRueNvNY_jI3" + }, + "source": [ + "------\n", + "Great! So now, the type checking system is happy.\n", + "\n", + "If you looked closely, the outputs were ordinary Torch Tensors (this is good news; we don't want to be incompatible with torch Tensors after all!). So, where exactly is the type of information stored?\n", + "\n", + "When the `output_types` is overridden, and valid torch tensors are returned as a result, these tensors are attached with the attribute `neural_type`. Let's inspect this -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bGQ9XbWU_ffa" + }, + "source": [ + "emb_out = embedding_module(x=x1)\n", + "lstm_out = lstm_module(x=x2)[0]\n", + "\n", + "assert hasattr(emb_out, 'neural_type')\n", + "assert hasattr(lstm_out, 'neural_type')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kEpBruSOScPJ" + }, + "source": [ + "print(\"Embedding tensor :\", emb_out.neural_type)\n", + "print(\"LSTM tensor :\", lstm_out.neural_type)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BWTsqiAHAony" + }, + "source": [ + "-------\n", + "So we see that these tensors now have this attribute called `neural_type` and are the same shape.\n", + "\n", + "This exercise's entire goal was to assert that the two outputs are semantically **not** the same object, even if they are the same shape. \n", + "\n", + "Let's test this!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "8AU9FMtdATIm" + }, + "source": [ + "emb_out.neural_type.compare(lstm_out.neural_type)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2cqnqAGIBCjA" + }, + "source": [ + "emb_out.neural_type == lstm_out.neural_type" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HmH6B0mHDJqb" + }, + "source": [ + "## Neural Types - Limitations\n", + "\n", + "You might have noticed one interesting fact - our inputs were just `torch.Tensor` to both typed function calls, and they had no `neural_type` assigned to them.\n", + "\n", + "So why did the type check system not raise any error? \n", + "\n", + "This is to maintain compatibility - type checking is meant to work on a chain of function calls - and each of these functions should themselves be wrapped with the `@typecheck()` decorator. This is also done because we don't want to overtax the forward call with dozens of checks, and therefore we only type modules that perform some higher-order logical computation. \n", + "\n", + "------\n", + "\n", + "As an example, it is mostly unnecessary (but still possible) to type the input and output of every residual block of a ResNet model. 
However, it is practically important to type the encoder (no matter how many layers is inside it) and the decoder (the classification head) separately so that when one does fine-tuning, there is no semantic mismatch of the tensors input to the encoder and bound to the decoder." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6m28zSEKEjt_" + }, + "source": [ + "-------\n", + "For this case, since it would be impractical to extend a class to attach a type to the input tensor, we can take a shortcut and directly attach the neural type to the input!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AGbKB4gJEzcU" + }, + "source": [ + "embedding_module = EmbeddingModule()\n", + "x1 = torch.randint(high=10, size=(1, 5))\n", + "\n", + "# Attach correct neural type\n", + "x1.neural_type = NeuralType(('B', 'T'), Index())\n", + "\n", + "print(\"embedding(x) :\", embedding_module(x=x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "F0j-evylFM5j" + }, + "source": [ + "# Attach wrong neural type [ERROR CELL]\n", + "x1.neural_type = NeuralType(('B', 'T'), LabelsType())\n", + "\n", + "print(\"embedding(x) :\", embedding_module(x=x1).shape)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "StMPyg6oCC9B" + }, + "source": [ + "## Let's create the minGPT components\n", + "\n", + "Now that we have a somewhat firm grasp of neural type checking, let's begin porting the minGPT example code. Once again, most of the code will be a direct port from the [minGPT repository](https://github.com/karpathy/minGPT).\n", + "\n", + "Here, you will notice one thing. By just changing class imports, one `@typecheck()` on forward, and adding `input_types` and `output_types` (which are also entirely optional!), we are almost entirely done with the PyTorch Lightning port!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "raFkuSRaBAE0" + }, + "source": [ + "import math\n", + "from typing import List, Set, Dict, Tuple, Optional\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "from torch.nn import functional as F" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yakGOXrzF1XW" + }, + "source": [ + "## Creating Element Types\n", + "\n", + "Till now, we have used the Neural Types provided by the NeMo core. But we need not be restricted to the pre-defined element types !\n", + "\n", + "Users have total flexibility in defining any hierarchy of element types as they please!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ybhLLVyUF0mo" + }, + "source": [ + "class AttentionType(EncodedRepresentation):\n", + " \"\"\"Basic Attention Element Type\"\"\"\n", + "\n", + "class SelfAttentionType(AttentionType):\n", + " \"\"\"Self Attention Element Type\"\"\"\n", + "\n", + "class CausalSelfAttentionType(SelfAttentionType):\n", + " \"\"\"Causal Self Attention Element Type\"\"\"" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mONJRMdbZNSE" + }, + "source": [ + "## Creating the modules\n", + "\n", + "Neural Modules are generally top-level modules but can be used at any level of the module hierarchy.\n", + "\n", + "For demonstration, we will treat an encoder comprising a block of Causal Self Attention modules as a typed Neural Module. 
Of course, we can also treat each Causal Self Attention layer itself as a neural module if we require it, but top-level modules are generally preferred." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "w4oXpAL_CoDp" + }, + "source": [ + "class CausalSelfAttention(nn.Module):\n", + " \"\"\"\n", + " A vanilla multi-head masked self-attention layer with a projection at the end.\n", + " It is possible to use torch.nn.MultiheadAttention here but I am including an\n", + " explicit implementation here to show that there is nothing too scary here.\n", + " \"\"\"\n", + "\n", + " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", + " super().__init__()\n", + " assert n_embd % n_head == 0\n", + " self.n_head = n_head\n", + " # key, query, value projections for all heads\n", + " self.key = nn.Linear(n_embd, n_embd)\n", + " self.query = nn.Linear(n_embd, n_embd)\n", + " self.value = nn.Linear(n_embd, n_embd)\n", + " # regularization\n", + " self.attn_drop = nn.Dropout(attn_pdrop)\n", + " self.resid_drop = nn.Dropout(resid_pdrop)\n", + " # output projection\n", + " self.proj = nn.Linear(n_embd, n_embd)\n", + " # causal mask to ensure that attention is only applied to the left in the input sequence\n", + " self.register_buffer(\"mask\", torch.tril(torch.ones(block_size, block_size))\n", + " .view(1, 1, block_size, block_size))\n", + " def forward(self, x, layer_past=None):\n", + " B, T, C = x.size()\n", + "\n", + " # calculate query, key, values for all heads in batch and move head forward to be the batch dim\n", + " k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", + " q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", + " v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)\n", + "\n", + " # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)\n", + " att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))\n", + " att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))\n", + " att = F.softmax(att, dim=-1)\n", + " att = self.attn_drop(att)\n", + " y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)\n", + " y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side\n", + "\n", + " # output projection\n", + " y = self.resid_drop(self.proj(y))\n", + " return y\n", + " \n", + "\n", + "class Block(nn.Module):\n", + " \"\"\" an unassuming Transformer block \"\"\"\n", + "\n", + " def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):\n", + " super().__init__()\n", + " self.ln1 = nn.LayerNorm(n_embd)\n", + " self.ln2 = nn.LayerNorm(n_embd)\n", + " self.attn = CausalSelfAttention(n_embd, block_size, n_head, attn_pdrop, resid_pdrop)\n", + " self.mlp = nn.Sequential(\n", + " nn.Linear(n_embd, 4 * n_embd),\n", + " nn.GELU(),\n", + " nn.Linear(4 * n_embd, n_embd),\n", + " nn.Dropout(resid_pdrop),\n", + " )\n", + "\n", + " def forward(self, x):\n", + " x = x + self.attn(self.ln1(x))\n", + " x = x + self.mlp(self.ln2(x))\n", + " return x" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mv0dyrLifkw0" + }, + "source": [ + "## Building the NeMo Model\n", + "\n", + "Since a NeMo Model is comprised of various parts, we are going to iterate on the model step by step inside this notebook. 
As such, we will have multiple intermediate NeMo \"Models\", which will be partial implementations, and they will inherit each other iteratively.\n", + "\n", + "In a complete implementation of a NeMo Model (as found in the NeMo collections), all of these components will generally be found in a single class.\n", + "\n", + "Let's start by inheriting `ModelPT` - the core class of a PyTorch NeMo Model, which inherits the PyTorch Lightning Module." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TxeG-qMrRgNU" + }, + "source": [ + "-------\n", + "**Remember**:\n", + "\n", + " - The NeMo equivalent of `torch.nn.Module` is the `NeuralModule.\n", + " - The NeMo equivalent of the `LightningModule` is `ModelPT`.\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0TsfmCYthMux" + }, + "source": [ + "import pytorch_lightning as ptl\n", + "from nemo.core import ModelPT\n", + "from omegaconf import OmegaConf" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_ib2rSz2hjaP" + }, + "source": [ + "------\n", + "Next, let's construct the bare minimum implementation of the NeMo Model - just the constructor, the initializer of weights, and the forward method.\n", + "\n", + "Initially, we will follow the steps followed by the minGPT implementation, and progressively refactor for NeMo " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "98x9-Fh-HVwj" + }, + "source": [ + "class PTLGPT(ptl.LightningModule):\n", + " def __init__(self,\n", + " # model definition args\n", + " vocab_size: int, # size of the vocabulary (number of possible tokens)\n", + " block_size: int, # length of the model's context window in time\n", + " n_layer: int, # depth of the model; number of Transformer blocks in sequence\n", + " n_embd: int, # the \"width\" of the model, number of channels in each Transformer\n", + " n_head: int, # number of heads in each multi-head attention inside each Transformer block\n", + " # model optimization args\n", + " learning_rate: float = 3e-4, # the base learning rate of the model\n", + " weight_decay: float = 0.1, # amount of regularizing L2 weight decay on MatMul ops\n", + " betas: Tuple[float, float] = (0.9, 0.95), # momentum terms (betas) for the Adam optimizer\n", + " embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", + " resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", + " attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix\n", + " ):\n", + " super().__init__()\n", + "\n", + " # save these for optimizer init later\n", + " self.learning_rate = learning_rate\n", + " self.weight_decay = weight_decay\n", + " self.betas = betas\n", + "\n", + " # input embedding stem: drop(content + position)\n", + " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", + " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", + " self.drop = nn.Dropout(embd_pdrop)\n", + " # deep transformer: just a sequence of transformer blocks\n", + " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)])\n", + " # decoder: at the end one more layernorm and decode the answers\n", + " self.ln_f = nn.LayerNorm(n_embd)\n", + " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", + "\n", + " self.block_size = block_size\n", + " self.apply(self._init_weights)\n", + "\n", + " print(\"number of parameters: %e\" % sum(p.numel() for p 
in self.parameters()))\n", + "\n", + " def forward(self, idx):\n", + " b, t = idx.size()\n", + " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", + "\n", + " # forward the GPT model\n", + " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", + " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", + " x = self.drop(token_embeddings + position_embeddings)\n", + " x = self.blocks(x)\n", + " x = self.ln_f(x)\n", + " logits = self.head(x)\n", + "\n", + " return logits\n", + "\n", + " def get_block_size(self):\n", + " return self.block_size\n", + "\n", + " def _init_weights(self, module):\n", + " \"\"\"\n", + " Vanilla model initialization:\n", + " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", + " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", + " \"\"\"\n", + " if isinstance(module, (nn.Linear, nn.Embedding)):\n", + " module.weight.data.normal_(mean=0.0, std=0.02)\n", + " if isinstance(module, nn.Linear) and module.bias is not None:\n", + " module.bias.data.zero_()\n", + " elif isinstance(module, nn.LayerNorm):\n", + " module.bias.data.zero_()\n", + " module.weight.data.fill_(1.0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2bMf5SO7wmor" + }, + "source": [ + "------\n", + "Let's create a PyTorch Lightning Model above, just to make sure it works !" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "rrXIBzg4wutC" + }, + "source": [ + "m = PTLGPT(vocab_size=100, block_size=32, n_layer=1, n_embd=32, n_head=4)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZCcgn1bajPW8" + }, + "source": [ + "------\n", + "Now, let's convert the above easily into a NeMo Model.\n", + "\n", + "A NeMo Model constructor generally accepts only two things - \n", + "\n", + "1) `cfg`: An OmegaConf DictConfig object that defines precisely the components required by the model to define its neural network architecture, data loader setup, optimizer setup, and any additional components needed for the model itself.\n", + "\n", + "2) `trainer`: An optional Trainer from PyTorch Lightning if the NeMo model will be used for training. It can be set after construction (if required) using the `set_trainer` method. For this notebook, we will not be constructing the config for the Trainer object." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WQMTCB3kz0UA" + }, + "source": [ + "## Refactoring Neural Modules\n", + "\n", + "As we discussed above, Neural Modules are generally higher-level components of the Model and can potentially be replaced by equivalent Neural Modules.\n", + "\n", + "As we see above, the embedding modules, deep transformer decoder network, and final decoder layer have all been combined inside the PyTorch Lightning implementation constructor.\n", + "\n", + "------\n", + "\n", + "However, the final decoder module could have been an RNN instead of a simple Linear layer, or it could have been a 1D-CNN instead.\n", + "\n", + "Likewise, the deep transformer decoder could potentially have a different implementation of Self Attention modules.\n", + "\n", + "These changes cannot be easily implemented any more inside the above implementation. However, if we refactor these components into their respective NeuralModules, then we can easily replace them with equivalent modules we construct in the future!" 
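As a preview of what that refactor buys us, here is a minimal sketch of an interchangeable decoder head built from a pointwise 1D convolution instead of a single `Linear` layer. This is illustrative only: the class name `ConvGPTDecoder` is an assumption, and the neural types chosen here (`EncodedRepresentation` in, `LogitsType` out, both imported earlier from `nemo.core.neural_types`) simply anticipate the contract the notebook's own refactored modules use below.

```python
class ConvGPTDecoder(NeuralModule):
    """Hypothetical drop-in decoder head: a 1D convolution instead of a Linear layer."""

    def __init__(self, n_embd: int, vocab_size: int):
        super().__init__()
        self.ln_f = nn.LayerNorm(n_embd)
        # kernel_size=1 keeps the head strictly per-position, so autoregressive causality is preserved
        self.conv = nn.Conv1d(n_embd, vocab_size, kernel_size=1)

    @typecheck()
    def forward(self, encoding):
        x = self.ln_f(encoding).transpose(1, 2)  # Conv1d expects [B, C, T]
        return self.conv(x).transpose(1, 2)      # back to [B, T, vocab_size]

    @property
    def input_types(self):
        return {'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())}

    @property
    def output_types(self):
        return {'logits': NeuralType(('B', 'T', 'C'), LogitsType())}
```

Because such a module only promises "encoded representation in, logits out", swapping it for the Linear head used above would require no change anywhere else in the model.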
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EJj5sSkX0xHi" + }, + "source": [ + "### Refactoring the Embedding module\n", + "\n", + "Let's first refactor out the embedding module from the above implementation" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uYwMyjqK05RL" + }, + "source": [ + "class GPTEmbedding(NeuralModule):\n", + " def __init__(self, vocab_size: int, n_embd: int, block_size: int, embd_pdrop: float = 0.0):\n", + " super().__init__()\n", + "\n", + " # input embedding stem: drop(content + position)\n", + " self.tok_emb = nn.Embedding(vocab_size, n_embd)\n", + " self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))\n", + " self.drop = nn.Dropout(embd_pdrop)\n", + "\n", + " @typecheck()\n", + " def forward(self, idx):\n", + " b, t = idx.size()\n", + " \n", + " # forward the GPT model\n", + " token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector\n", + " position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector\n", + " x = self.drop(token_embeddings + position_embeddings)\n", + " return x\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'idx': NeuralType(('B', 'T'), Index())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'embeddings': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l5rOP6lyOyRt" + }, + "source": [ + "### Refactoring the Encoder\n", + "\n", + "Next, let's refactor the GPT Encoder - which is implemented as a multi layer Transformer (Decoder) network.\n", + "\n", + "------\n", + "It can be noted that we refer to the GPT \"Encoder\" module - but it is constructed by using Transformer \"Decoder\" blocks.\n", + "\n", + "***When we discuss Neural Modules - we are discussing an abstract module with a certain input neural type and a certain output neural type.***\n", + "\n", + "For us, the GPT \"Encoder\" neural module will accept any implementation, whose\n", + "\n", + "- input neural type is `NeuralType(('B', 'T', 'C'), EmbeddedTextType())`\n", + "\n", + "- output type is `NeuralType(('B', 'T', 'C'), EncodedRepresentation())`\n", + "\n", + "-----\n", + "One concrete implementation of such a GPT \"Encoder\" neural module is a Deep Transformer \"Decoder\" network." 
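To make the "any implementation with these types" point concrete before looking at the transformer version below, here is a minimal sketch (illustrative only; the class name `GPTLSTMEncoder` is an assumption, not part of the tutorial) of a recurrent module that satisfies exactly the same contract:

```python
class GPTLSTMEncoder(NeuralModule):
    """Hypothetical alternative GPT 'encoder': a unidirectional LSTM with the same neural type contract."""

    def __init__(self, n_embd: int, n_layer: int = 1):
        super().__init__()
        # Unidirectional, so each timestep only sees the past - causality is preserved
        self.rnn = nn.LSTM(n_embd, n_embd, num_layers=n_layer, batch_first=True)

    @typecheck()
    def forward(self, embed):
        output, _ = self.rnn(embed)  # keep the per-timestep outputs, drop the (h, c) states
        return output

    @property
    def input_types(self):
        return {'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())}

    @property
    def output_types(self):
        return {'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())}
```

Nothing else in the model needs to know which implementation it is talking to; the neural types are the whole interface.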
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "1QeQnQ_G2PwH" + }, + "source": [ + "class GPTTransformerEncoder(NeuralModule):\n", + " def __init__(self, n_embd: int, block_size: int, n_head: int, n_layer: int, attn_pdrop: float = 0.0, resid_pdrop: float = 0.0):\n", + " super().__init__()\n", + "\n", + " self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) \n", + " for _ in range(n_layer)])\n", + " \n", + " @typecheck()\n", + " def forward(self, embed):\n", + " return self.blocks(embed)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'encoding': NeuralType(('B', 'T', 'C'), CausalSelfAttentionType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NmCR3LK3QHum" + }, + "source": [ + "### Refactoring the Decoder\n", + "\n", + "Finally, let's refactor the Decoder - the small one-layer feed-forward network to decode the answer.\n", + "\n", + "-------\n", + "\n", + "Note an interesting detail - The `input_types` of the Decoder accepts the generic `EncoderRepresentation()`, where as the `neural_type` of the `GPTTransformerEncoder` has the `output_type` of `CausalSelfAttentionType`.\n", + "\n", + "This is semantically *not* a mismatch! As you can see above in the inheritance chart, we declare `EncodedRepresentation` -> `AttentionType` -> `SelfAttentionType` -> `CausalSelfAttentionType`. \n", + "\n", + "Such an inheritance hierarchy for the `element_type` allows future encoders (which also have a neural output type of at least `EncodedRepresentation`) to be swapped in place of the current GPT Causal Self Attention Encoder while keeping the rest of the NeMo model working just fine!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VCPUu0EWQIBX" + }, + "source": [ + "class GPTDecoder(NeuralModule):\n", + " def __init__(self, n_embd: int, vocab_size: int):\n", + " super().__init__()\n", + " self.ln_f = nn.LayerNorm(n_embd)\n", + " self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f\n", + "\n", + " @typecheck()\n", + " def forward(self, encoding):\n", + " x = self.ln_f(encoding)\n", + " logits = self.head(x)\n", + " return logits\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())\n", + " }\n", + " \n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", + " }\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nYLMjlW0Sdy1" + }, + "source": [ + "### Refactoring the NeMo GPT Model\n", + "\n", + "Now that we have 3 NeuralModules for the embedding, the encoder, and the decoder, let's refactor the NeMo model to take advantage of this refactor!\n", + "\n", + "This time, we inherit from `ModelPT` instead of the general `LightningModule`." 
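Since the compatibility argument rests entirely on that inheritance chain, a quick check against the element types declared earlier in this notebook makes it concrete:

```python
# CausalSelfAttentionType was declared as a subclass of SelfAttentionType -> AttentionType -> EncodedRepresentation,
# so the encoder's specialized output still satisfies a decoder that merely asks for an EncodedRepresentation.
print(issubclass(CausalSelfAttentionType, EncodedRepresentation))  # True
print(issubclass(CausalSelfAttentionType, AttentionType))          # True
```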
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZQlmtYU6iDwi" + }, + "source": [ + "class AbstractNeMoGPT(ModelPT):\n", + " def __init__(self, cfg: OmegaConf, trainer: ptl.Trainer = None):\n", + " super().__init__(cfg=cfg, trainer=trainer)\n", + "\n", + " # input embedding stem: drop(content + position)\n", + " self.embedding = self.from_config_dict(self.cfg.embedding)\n", + " # deep transformer: just a sequence of transformer blocks\n", + " self.encoder = self.from_config_dict(self.cfg.encoder)\n", + " # decoder: at the end one more layernorm and decode the answers\n", + " self.decoder = self.from_config_dict(self.cfg.decoder)\n", + "\n", + " self.block_size = self.cfg.embedding.block_size\n", + " self.apply(self._init_weights)\n", + "\n", + " print(\"number of parameters: %e\" % self.num_weights)\n", + "\n", + " @typecheck()\n", + " def forward(self, idx):\n", + " b, t = idx.size()\n", + " assert t <= self.block_size, \"Cannot forward, model block size is exhausted.\"\n", + "\n", + " # forward the GPT model\n", + " # Remember: Only kwargs are allowed !\n", + " e = self.embedding(idx=idx)\n", + " x = self.encoder(embed=e)\n", + " logits = self.decoder(encoding=x)\n", + "\n", + " return logits\n", + "\n", + " def get_block_size(self):\n", + " return self.block_size\n", + "\n", + " def _init_weights(self, module):\n", + " \"\"\"\n", + " Vanilla model initialization:\n", + " - all MatMul weights \\in N(0, 0.02) and biases to zero\n", + " - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0\n", + " \"\"\"\n", + " if isinstance(module, (nn.Linear, nn.Embedding)):\n", + " module.weight.data.normal_(mean=0.0, std=0.02)\n", + " if isinstance(module, nn.Linear) and module.bias is not None:\n", + " module.bias.data.zero_()\n", + " elif isinstance(module, nn.LayerNorm):\n", + " module.bias.data.zero_()\n", + " module.weight.data.fill_(1.0)\n", + "\n", + " @property\n", + " def input_types(self):\n", + " return {\n", + " 'idx': NeuralType(('B', 'T'), Index())\n", + " }\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'logits': NeuralType(('B', 'T', 'C'), LogitsType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DFRmxWiSmdF3" + }, + "source": [ + "## Creating a config for a Model\n", + "\n", + "At first glance, not much changed compared to the PyTorch Lightning implementation above. Other than the constructor, which now accepts a config, nothing changed at all!\n", + "\n", + "NeMo operates on the concept of a NeMo Model being accompanied by a corresponding config dict (instantiated as an OmegaConf object). This enables us to prototype the model by utilizing Hydra rapidly. This includes various other benefits - such as hyperparameter optimization and serialization/deserialization of NeMo models.\n", + "\n", + "Let's look at how actually to construct such config objects!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "uygo0BEYjKuj" + }, + "source": [ + "# model definition args (required)\n", + "# ================================\n", + "# vocab_size: int # size of the vocabulary (number of possible tokens)\n", + "# block_size: int # length of the model's context window in time\n", + "# n_layer: int # depth of the model; number of Transformer blocks in sequence\n", + "# n_embd: int # the \"width\" of the model, number of channels in each Transformer\n", + "# n_head: int # number of heads in each multi-head attention inside each Transformer block \n", + "\n", + "# model definition args (optional)\n", + "# ================================\n", + "# embd_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on input embeddings\n", + "# resid_pdrop: float = 0.1, # \\in [0,1]: amount of dropout in each residual connection\n", + "# attn_pdrop: float = 0.1, # \\in [0,1]: amount of dropout on the attention matrix" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s4sdqRAFop-n" + }, + "source": [ + "------\n", + "As we look at the required parameters above, we need a way to tell OmegaConf that these values are currently not set, but the user should set them before we use them.\n", + "\n", + "OmegaConf supports such behavior using the `MISSING` value. A similar effect can be achieved in YAML configs by using `???` as a placeholder." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XqLSZq7Soo2j" + }, + "source": [ + "from omegaconf import MISSING" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JTH-1vu8TO7o" + }, + "source": [ + "# Let's create a utility for building the class path\n", + "def get_class_path(cls):\n", + " return f'{cls.__module__}.{cls.__name__}'" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6xToaWAJUmtX" + }, + "source": [ + "### Structure of a Model config\n", + "\n", + "Let's first create a config for the common components of the model level config -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZCvLdOlMVLy_" + }, + "source": [ + "common_config = OmegaConf.create({\n", + " 'vocab_size': MISSING,\n", + " 'block_size': MISSING,\n", + " 'n_layer': MISSING,\n", + " 'n_embd': MISSING,\n", + " 'n_head': MISSING,\n", + "})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j8hvdKa4VmCV" + }, + "source": [ + "-----\n", + "The model config right now is still being built - it needs to contain a lot more details!\n", + "\n", + "A complete Model Config should have the sub-configs of all of its top-level modules as well. 
This means the configs of the `embedding`, `encoder`, and the `decoder`.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v-2_QOZyVgrE" + }, + "source": [ + "### Structure of sub-module config\n", + "\n", + "For top-level models, we generally don't change the actual module very often, and instead, primarily change the hyperparameters of that model.\n", + "\n", + "So we will make use of `Hydra`'s Class instantiation method - which can easily be accessed via the class method `ModelPT.from_config_dict()`.\n", + "\n", + "Let's take a few examples below -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ntsxQKH0pDac" + }, + "source": [ + "embedding_config = OmegaConf.create({\n", + " '_target_': get_class_path(GPTEmbedding),\n", + " 'vocab_size': '${model.vocab_size}',\n", + " 'n_embd': '${model.n_embd}',\n", + " 'block_size': '${model.block_size}',\n", + " 'embd_pdrop': 0.1\n", + "})\n", + "\n", + "encoder_config = OmegaConf.create({\n", + " '_target_': get_class_path(GPTTransformerEncoder),\n", + " 'n_embd': '${model.n_embd}',\n", + " 'block_size': '${model.block_size}',\n", + " 'n_head': '${model.n_head}',\n", + " 'n_layer': '${model.n_layer}',\n", + " 'attn_pdrop': 0.1,\n", + " 'resid_pdrop': 0.1\n", + "})\n", + "\n", + "decoder_config = OmegaConf.create({\n", + " '_target_': get_class_path(GPTDecoder),\n", + " # n_embd: int, vocab_size: int\n", + " 'n_embd': '${model.n_embd}',\n", + " 'vocab_size': '${model.vocab_size}'\n", + "})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtloTqkqWhpl" + }, + "source": [ + "##### What is `_target_`?\n", + "--------\n", + "\n", + "In the above config, we see a `_target_` in the config. `_target_` is usually a full classpath to the actual class in the python package/user local directory. It is required for Hydra to locate and instantiate the model from its path correctly.\n", + "\n", + "So why do we want to set a classpath?\n", + "\n", + "In general, when developing models, we don't often change the encoder or the decoder, but we do change the hyperparameters of the encoder and decoder.\n", + "\n", + "This notation helps us keep the Model level declaration of the forward step neat and precise. It also logically helps us demark which parts of the model can be easily replaced - in the future, we can easily replace the encoder with some other type of self-attention block or the decoder with an RNN or 1D-CNN neural module (as long as they have the same Neural Type definition as the current blocks).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ASDmcgE4XtQ4" + }, + "source": [ + "##### What is the `${}` syntax?\n", + "-------\n", + "\n", + "OmegaConf, and by extension, Hydra, supports Variable Interpolation. As you can see in the `__init__` of embedding, encoder, and decoder neural modules, they often share many parameters between each other.\n", + "\n", + "It would become tedious and error-prone to set each of these constructors' values separately in each of the embedding, encoder, and decoder configs.\n", + "\n", + "So instead, we define standard keys inside of the `model` level config and then interpolate these values inside of the respective configs!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zXvEcXGhZi5I" + }, + "source": [ + "### Attaching the model and module-level configs\n", + "\n", + "So now, we have a Model level and per-module level configs for the core components. 
Sub-module configs generally fall under the \"model\" namespace, but you have the flexibility to define the structure as you require.\n", + "\n", + "Let's attach them!\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "c8hvNeB_aDgi" + }, + "source": [ + "model_config = OmegaConf.create({\n", + " 'model': common_config\n", + "})\n", + "\n", + "# Then let's attach the sub-module configs\n", + "model_config.model.embedding = embedding_config\n", + "model_config.model.encoder = encoder_config\n", + "model_config.model.decoder = decoder_config" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zIubuFcOpIB0" + }, + "source": [ + "-----\n", + "Let's print this config!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2SyKNgp9pG0N" + }, + "source": [ + "print(OmegaConf.to_yaml(model_config))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4PAA07EAauCn" + }, + "source": [ + "-----\n", + "Wait, why did OmegaConf not fill in the value of the variable interpolation for the configs yet?\n", + "\n", + "This is because OmegaConf takes a deferred approach to variable interpolation. First, we fill in temporary values of the required fields (those marked by `???`). Then, to force resolution ahead of time, we can use the following snippet - " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "0X4C76JyOAnN" + }, + "source": [ + "import copy" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ugxA0TPtbHVZ" + }, + "source": [ + "temp_config = copy.deepcopy(model_config)\n", + "temp_config.model.vocab_size = 10\n", + "temp_config.model.block_size = 4\n", + "temp_config.model.n_layer = 1\n", + "temp_config.model.n_embd = 32\n", + "temp_config.model.n_head = 4\n", + "\n", + "temp_config = OmegaConf.create(OmegaConf.to_container(temp_config, resolve=True))\n", + "print(OmegaConf.to_yaml(temp_config))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V41RFIpEpiOu" + }, + "source": [ + "-----\n", + "Now that we have a config, let's try to create an object of the NeMo Model !" 
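Before we do, here is a small, framework-agnostic sketch of what the `_target_` plus interpolation machinery boils down to. The helper name `instantiate_from_target` and the toy config are illustrative only (Hydra's real instantiation handles recursion, positional arguments, and more), and `torch.nn.Linear` merely stands in for one of our modules:

```python
import importlib

from omegaconf import OmegaConf


def instantiate_from_target(cfg):
    # Split the `_target_` classpath into module + class, import the module,
    # and construct the class using the remaining keys as keyword arguments.
    module_name, class_name = cfg['_target_'].rsplit('.', 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    kwargs = {k: v for k, v in cfg.items() if k != '_target_'}
    return cls(**kwargs)


# Toy config mirroring the structure above: `in_features` is interpolated
# from a shared `model.n_embd` value.
toy_cfg = OmegaConf.create({
    'model': {
        'n_embd': 32,
        'head': {
            '_target_': 'torch.nn.Linear',
            'in_features': '${model.n_embd}',
            'out_features': 10,
        },
    }
})

# Interpolations are resolved lazily; resolve=True forces concrete values.
head_cfg = OmegaConf.to_container(toy_cfg.model.head, resolve=True)
print(instantiate_from_target(head_cfg))  # Linear(in_features=32, out_features=10, bias=True)
```

`ModelPT.from_config_dict()` relies on the same idea, which is why a sub-module can be swapped out by editing only the config.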
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "IIIVi2IfpsJ4" + }, + "source": [ + "# Let's work on a copy of the model config and update it before we send it into the Model.\n", + "cfg = copy.deepcopy(model_config)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "OllBhswPqQXq" + }, + "source": [ + "# Let's set the values of the config (for some plausible small model)\n", + "cfg.model.vocab_size = 100\n", + "cfg.model.block_size = 128\n", + "cfg.model.n_layer = 1\n", + "cfg.model.n_embd = 32\n", + "cfg.model.n_head = 4" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QJm2LnTqqcIM" + }, + "source": [ + "print(OmegaConf.to_yaml(cfg))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "E7tpB8BcqeBO" + }, + "source": [ + "# Try to create a model with this config [ERROR CELL]\n", + "m = AbstractNeMoGPT(cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cXOLhpxdq4Ni" + }, + "source": [ + "-----\n", + "\n", + "You will note that we added the `Abstract` tag for a reason to this NeMo Model and that when we try to instantiate it - it raises an error that we need to implement specific methods.\n", + "\n", + "1) `setup_training_data` & `setup_validation_data` - All NeMo models should implement two data loaders - the training data loader and the validation data loader. Optionally, they can go one step further and also implement the `setup_test_data` method to add support for evaluating the Model on its own.\n", + "\n", + "Why do we enforce this? NeMo Models are meant to be a unified, cohesive object containing the details about the neural network underlying that Model and the data loaders to train, validate, and optionally test those models.\n", + "\n", + "In doing so, once the Model is created/deserialized, it would take just a few more steps to train the Model from scratch / fine-tune/evaluate the Model on any data that the user provides, as long as this user-provided dataset is in a format supported by the Dataset / DataLoader that is used by this Model!\n", + "\n", + "2) `list_available_models` - This is a utility method to provide a list of pre-trained NeMo models to the user from the cloud.\n", + "\n", + "Typically, NeMo models can be easily packaged into a tar file (which we call a .nemo file in the earlier primer notebook). These tar files contain the model config + the pre-trained checkpoint weights of the Model, and can easily be downloaded from some cloud service. \n", + "\n", + "For this notebook, we will not be implementing this method.\n", + "\n", + "--------\n", + "Finally, let's create a concrete implementation of the above NeMo Model!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Vcwi1lO7t7Sm" + }, + "source": [ + "from nemo.core.classes.common import PretrainedModelInfo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ckCxyVLYqrz0" + }, + "source": [ + "class BasicNeMoGPT(AbstractNeMoGPT):\n", + "\n", + " @classmethod\n", + " def list_available_models(cls) -> PretrainedModelInfo:\n", + " return None\n", + "\n", + " def setup_training_data(self, train_data_config: OmegaConf):\n", + " self._train_dl = None\n", + " \n", + " def setup_validation_data(self, val_data_config: OmegaConf):\n", + " self._validation_dl = None\n", + " \n", + " def setup_test_data(self, test_data_config: OmegaConf):\n", + " self._test_dl = None" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ofUoJ8DDvq_Y" + }, + "source": [ + "------\n", + "Now let's try to create an object of the `BasicNeMoGPT` model" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "G8iYQSC5vptU" + }, + "source": [ + "m = BasicNeMoGPT(cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "otvYW4TBxAju" + }, + "source": [ + "## Setting up train-val-test steps\n", + "\n", + "The above `BasicNeMoGPT` Model is a basic PyTorch Lightning Module, with some added functionality - \n", + "\n", + "1) Neural Type checks support - as defined in the Model as well as the internal modules.\n", + "\n", + "2) Save and restore of the Model (in the trivial case) to a tarfile.\n", + "\n", + "But as the Model is right now, it crucially does not support PyTorch Lightning's `Trainer`. As such, while this Model can be called manually, it cannot be easily trained or evaluated by using the PyTorch Lightning framework.\n", + "\n", + "------\n", + "\n", + "Let's begin adding support for this then -" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "QU3oQAVovxRg" + }, + "source": [ + "class BasicNeMoGPTWithSteps(BasicNeMoGPT):\n", + "\n", + " def step_(self, split, batch, batch_idx=None):\n", + " idx, targets = batch\n", + " logits = self(idx=idx)\n", + " loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))\n", + " key = 'loss' if split == 'train' else f\"{split}_loss\"\n", + " self.log(key, loss)\n", + " return {key: loss}\n", + "\n", + " def training_step(self, *args, **kwargs):\n", + " return self.step_('train', *args, **kwargs)\n", + "\n", + " def validation_step(self, *args, **kwargs):\n", + " return self.step_('val', *args, **kwargs)\n", + "\n", + " def test_step(self, *args, **kwargs):\n", + " return self.step_('test', *args, **kwargs)\n", + " \n", + " # This is useful for multiple validation data loader setup\n", + " def multi_validation_epoch_end(self, outputs, dataloader_idx: int = 0):\n", + " val_loss_mean = torch.stack([x['val_loss'] for x in outputs]).mean()\n", + " return {'val_loss': val_loss_mean}\n", + "\n", + " # This is useful for multiple test data loader setup\n", + " def multi_test_epoch_end(self, outputs, dataloader_idx: int = 0):\n", + " test_loss_mean = torch.stack([x['test_loss'] for x in outputs]).mean()\n", + " return {'test_loss': test_loss_mean}" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "2Ki3kRxag511" + }, + "source": [ + "m = BasicNeMoGPTWithSteps(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": 
"f_7YziAw_Isu" + }, + "source": [ + "### Setup for Multi Validation and Multi Test data loaders\n", + "\n", + "As discussed in the NeMo Primer, NeMo has in-built support for multiple data loaders for validation and test steps. Therefore, as an example of how easy it is to add such support, we include the `multi_validation_epoch_end` and `multi_test_epoch_end` overrides.\n", + "\n", + "It is also practically essential to collate results from more than one distributed GPUs, and then aggregate results properly at the end of the epoch. NeMo strictly enforces the correct collation of results, even if you will work on only one device! Future-proofing is baked into the model design for this case!\n", + "\n", + "Therefore NeMo provides the above two generic methods to support aggregation and simultaneously support multiple datasets!\n", + "\n", + "**Please note, you can prepend your already existing `on_validation_epoch_end` and `on_test_epoch_end` implementations with the `multi_` in the name, and that alone is sufficient to enable multi-dataset and multi-GPU support!**\n", + "\n", + "------\n", + "**Note: To disable multi-dataset support, simply override `on_validation_epoch_end` and `on_test_epoch_end` instead of `multi_validation_epoch_end` and `multi_test_epoch_end`!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QpfSn-YUh7GK" + }, + "source": [ + "## Setting up the optimizer / scheduler\n", + "\n", + "We are relatively close to reaching feature parity with the MinGPT Model! But we are missing a crucial piece - the optimizer.\n", + "\n", + "All NeMo Model's come with a default implementation of `setup_optimization()`, which will parse the provided model config to obtain the `optim` and `sched` sub-configs, and automatically configure the optimizer and scheduler.\n", + "\n", + "If training GPT was as simple as plugging in an Adam optimizer over all the parameters with a cosine weight decay schedule, we could do that from the config alone.\n", + "\n", + "-------\n", + "\n", + "But GPT is not such a trivial model - more specifically, it requires weight decay to be applied to the weight matrices but not to the biases, the embedding matrix, or the LayerNorm layers.\n", + "\n", + "We can drop the support that Nemo provides for such special cases and instead utilize the PyTorch Lightning method `configure_optimizers` to perform the same task.\n", + "\n", + "-------\n", + "\n", + "Note, for NeMo Models; the `configure_optimizers` is implemented as a trivial call to `setup_optimization()` followed by returning the generated optimizer and scheduler! So we can override the `configure_optimizer` method and manage the optimizer creation manually!\n", + "\n", + "NeMo's goal is to provide usable defaults for the general case and simply back off to either PyTorch Lightning or PyTorch nn.Module itself in cases when the additional flexibility becomes necessary!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "FgXkZQiVjnOv" + }, + "source": [ + "class BasicNeMoGPTWithOptim(BasicNeMoGPTWithSteps):\n", + "\n", + " def configure_optimizers(self):\n", + " \"\"\"\n", + " This long function is unfortunately doing something very simple and is being very defensive:\n", + " We are separating out all parameters of the model into two buckets: those that will experience\n", + " weight decay for regularization and those that won't (biases, and layernorm/embedding weights).\n", + " We are then returning the PyTorch optimizer object.\n", + " \"\"\"\n", + "\n", + " # separate out all parameters to those that will and won't experience weight decay\n", + " decay = set()\n", + " no_decay = set()\n", + " whitelist_weight_modules = (torch.nn.Linear, )\n", + " blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)\n", + " for mn, m in self.named_modules():\n", + " for pn, p in m.named_parameters():\n", + " fpn = '%s.%s' % (mn, pn) if mn else pn # full param name\n", + "\n", + " if pn.endswith('bias'):\n", + " # all biases will not be decayed\n", + " no_decay.add(fpn)\n", + " elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):\n", + " # weights of whitelist modules will be weight decayed\n", + " decay.add(fpn)\n", + " elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):\n", + " # weights of blacklist modules will NOT be weight decayed\n", + " no_decay.add(fpn)\n", + "\n", + " # special case the position embedding parameter in the root GPT module as not decayed\n", + " no_decay.add('embedding.pos_emb')\n", + "\n", + " # validate that we considered every parameter\n", + " param_dict = {pn: p for pn, p in self.named_parameters()}\n", + " inter_params = decay & no_decay\n", + " union_params = decay | no_decay\n", + " assert len(inter_params) == 0, \"parameters %s made it into both decay/no_decay sets!\" % (str(inter_params), )\n", + " assert len(param_dict.keys() - union_params) == 0, \"parameters %s were not separated into either decay/no_decay set!\" \\\n", + " % (str(param_dict.keys() - union_params), )\n", + "\n", + " # create the pytorch optimizer object\n", + " optim_groups = [\n", + " {\"params\": [param_dict[pn] for pn in sorted(list(decay))], \"weight_decay\": self.cfg.optim.weight_decay},\n", + " {\"params\": [param_dict[pn] for pn in sorted(list(no_decay))], \"weight_decay\": 0.0},\n", + " ]\n", + " optimizer = torch.optim.AdamW(optim_groups, lr=self.cfg.optim.lr, betas=self.cfg.optim.betas)\n", + " return optimizer\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "kARDwthakEQk" + }, + "source": [ + "m = BasicNeMoGPTWithOptim(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iB1kwctv2cYv" + }, + "source": [ + "-----\n", + "Now let's setup the config for the optimizer !" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "5K7zh9Cn2s2u" + }, + "source": [ + "OmegaConf.set_struct(cfg.model, False)\n", + "\n", + "optim_config = OmegaConf.create({\n", + " 'lr': 3e-4,\n", + " 'weight_decay': 0.1,\n", + " 'betas': [0.9, 0.95]\n", + "})\n", + "\n", + "cfg.model.optim = optim_config\n", + "\n", + "OmegaConf.set_struct(cfg.model, True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P31p8ABthsh0" + }, + "source": [ + "## Setting up the dataset / data loaders\n", + "\n", + "So we were able almost entirely to replicate the MinGPT implementation. \n", + "\n", + "Remember, NeMo models should contain all of the logic to load the Dataset and DataLoader for at least the train and validation step.\n", + "\n", + "We temporarily provided empty implementations to get around it till now, but let's fill that in now!\n", + "\n", + "-------\n", + "\n", + "**Note for datasets**: Below, we will show an example using a very small dataset called `tiny_shakespeare`, found at the original [char-rnn repository](https://github.com/karpathy/char-rnn), but practically you could use any text corpus. The one suggested in minGPT is available at http://mattmahoney.net/dc/textdata.html" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q8dlOcZPkxM1" + }, + "source": [ + "### Creating the Dataset\n", + "\n", + "NeMo has Neural Type checking support, even for Datasets! It's just a minor change of the import in most cases and one difference in how we handle `collate_fn`.\n", + "\n", + "We could paste the dataset info from minGPT, and you'd only need to make 2 changes!\n", + "\n", + "-----\n", + "In this example, we will be writing a thin subclass over the datasets provided by `nlp` from HuggingFace!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "E-fswFkig9t4" + }, + "source": [ + "from nemo.core import Dataset\n", + "from torch.utils import data\n", + "from torch.utils.data.dataloader import DataLoader" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-Z8XuPeClGNm" + }, + "source": [ + "class TinyShakespeareDataset(Dataset):\n", + "\n", + " def __init__(self, data_path, block_size, crop=None, override_vocab=None):\n", + "\n", + " # load the data and crop it appropriately\n", + " with open(data_path, 'r') as f:\n", + " if crop is None:\n", + " data = f.read()\n", + " else:\n", + " f.seek(crop[0])\n", + " data = f.read(crop[1])\n", + "\n", + " # build a vocabulary from data or inherit it\n", + " vocab = sorted(list(set(data))) if override_vocab is None else override_vocab\n", + "\n", + " # Add UNK\n", + " special_tokens = ['', ''] # We use just and in the call, but can add others.\n", + " if not override_vocab:\n", + " vocab = [*special_tokens, *vocab] # Update train vocab with special tokens\n", + "\n", + " data_size, vocab_size = len(data), len(vocab)\n", + " print('data of crop %s has %d characters, vocab of size %d.' 
% (str(crop), data_size, vocab_size))\n", + " print('Num samples in dataset : %d' % (data_size // block_size))\n", + "\n", + " self.stoi = { ch:i for i,ch in enumerate(vocab) }\n", + " self.itos = { i:ch for i,ch in enumerate(vocab) }\n", + " self.block_size = block_size\n", + " self.vocab_size = vocab_size\n", + " self.data = data\n", + " self.vocab = vocab\n", + " self.special_tokens = special_tokens\n", + "\n", + " def __len__(self):\n", + " return len(self.data) // self.block_size\n", + "\n", + " def __getitem__(self, idx):\n", + " # attempt to fetch a chunk of (block_size + 1) items, but (block_size) will work too\n", + " chunk = self.data[idx*self.block_size : min(len(self.data), (idx+1)*self.block_size + 1)]\n", + " # map the string into a sequence of integers\n", + " ixes = [self.stoi[s] if s in self.stoi else self.stoi[''] for s in chunk ]\n", + " # if stars align (last idx and len(self.data) % self.block_size == 0), pad with \n", + " if len(ixes) < self.block_size + 1:\n", + " assert len(ixes) == self.block_size # i believe this is the only way this could happen, make sure\n", + " ixes.append(self.stoi[''])\n", + " dix = torch.tensor(ixes, dtype=torch.long)\n", + " return dix[:-1], dix[1:]\n", + "\n", + " @property\n", + " def output_types(self):\n", + " return {\n", + " 'input': NeuralType(('B', 'T'), Index()),\n", + " 'target': NeuralType(('B', 'T'), LabelsType())\n", + " }" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7MEMR4TcmP5K" + }, + "source": [ + "------\n", + "We didn't have to change anything until here. How then is type-checking done? \n", + "\n", + "NeMo does type-checking inside of the collate function implementation itself! In this case, it is not necessary to override the `collate_fn` inside the Dataset, but if we did need to override it, **NeMo requires that the private method `_collate_fn` be overridden instead**.\n", + "\n", + "We can then use data loaders with minor modifications!\n", + "\n", + "**Also, there is no need to implement the `input_types` for Dataset, as they are the ones generating the input for the model!**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZeKXAknenVch" + }, + "source": [ + "-----\n", + "Let's prepare the dataset that we are going to use - Tiny Shakespeare from the following codebase [char-rnn](https://github.com/karpathy/char-rnn)." 
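As a quick illustration of the `_collate_fn` note above (before we move on to downloading the data), here is a minimal sketch of a NeMo `Dataset` that pads variable-length samples. The class name, the toy data, and the particular neural types chosen here are illustrative assumptions rather than part of the original notebook:

```python
import torch

from nemo.core import Dataset
from nemo.core.neural_types import LabelsType, LengthsType, NeuralType


class VariableLengthCharDataset(Dataset):
    """Toy dataset of variable-length token-id sequences that need padding at batch time."""

    def __init__(self, sequences, pad_id=0):
        self.sequences = sequences  # e.g. [[1, 2, 3], [4, 5]]
        self.pad_id = pad_id

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = torch.tensor(self.sequences[idx], dtype=torch.long)
        return seq, torch.tensor(seq.size(0), dtype=torch.long)

    # Override the *private* _collate_fn; NeMo's public collate_fn wraps it
    # and performs the neural-type checks on the batch it returns.
    def _collate_fn(self, batch):
        tokens, lengths = zip(*batch)
        max_len = int(max(lengths))
        padded = torch.full((len(tokens), max_len), self.pad_id, dtype=torch.long)
        for i, t in enumerate(tokens):
            padded[i, : t.size(0)] = t
        return padded, torch.stack(lengths)

    @property
    def output_types(self):
        # Only output_types are needed - the Dataset produces the model's inputs.
        return {
            'tokens': NeuralType(('B', 'T'), LabelsType()),
            'lengths': NeuralType(('B',), LengthsType()),
        }


# Usage sketch:
# ds = VariableLengthCharDataset([[1, 2, 3], [4, 5]])
# loader = torch.utils.data.DataLoader(ds, batch_size=2, collate_fn=ds.collate_fn)
```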
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VwsdXtVzo--t" + }, + "source": [ + "import os" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QvKcDCvIl9-A" + }, + "source": [ + "if not os.path.exists('tiny-shakespeare.txt'):\n", + " !wget https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "ynCwqDu6vK8P" + }, + "source": [ + "!head -n 5 tiny-shakespeare.txt" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "bfRL4t9_oS4C" + }, + "source": [ + "train_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(0, int(1e6)))\n", + "val_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1e6), int(50e3)), override_vocab=train_dataset.vocab)\n", + "test_dataset = TinyShakespeareDataset('tiny-shakespeare.txt', cfg.model.block_size, crop=(int(1.05e6), int(100e3)), override_vocab=train_dataset.vocab)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kIlCoZDksEDO" + }, + "source": [ + "### Setting up dataset/data loader support in the Model\n", + "\n", + "So we now know our data loader works. Let's integrate it as part of the Model itself!\n", + "\n", + "To do this, we use the three special attributes of the NeMo Model - `self._train_dl`, `self._validation_dl` and `self._test_dl`. Once you construct your DataLoader, place your data loader to these three variables. \n", + "\n", + "For multi-data loader support, the same applies! NeMo will automatically handle the management of multiple data loaders for you!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SVSfIk_-rMSg" + }, + "source": [ + "class NeMoGPT(BasicNeMoGPTWithOptim):\n", + "\n", + " def _setup_data_loader(self, cfg):\n", + " if self.vocab is None:\n", + " override_vocab = None\n", + " else:\n", + " override_vocab = self.vocab\n", + "\n", + " dataset = TinyShakespeareDataset(\n", + " data_path=cfg.data_path,\n", + " block_size=cfg.block_size,\n", + " crop=tuple(cfg.crop) if 'crop' in cfg else None,\n", + " override_vocab=override_vocab\n", + " )\n", + "\n", + " if self.vocab is None:\n", + " self.vocab = dataset.vocab\n", + "\n", + " return DataLoader(\n", + " dataset=dataset,\n", + " batch_size=cfg.batch_size,\n", + " shuffle=cfg.shuffle,\n", + " collate_fn=dataset.collate_fn, # <-- this is necessary for type checking\n", + " pin_memory=cfg.pin_memory if 'pin_memory' in cfg else False,\n", + " num_workers=cfg.num_workers if 'num_workers' in cfg else 0\n", + " )\n", + " \n", + " def setup_training_data(self, train_data_config: OmegaConf):\n", + " self.vocab = None\n", + " self._train_dl = self._setup_data_loader(train_data_config)\n", + " \n", + " def setup_validation_data(self, val_data_config: OmegaConf):\n", + " self._validation_dl = self._setup_data_loader(val_data_config)\n", + " \n", + " def setup_test_data(self, test_data_config: OmegaConf):\n", + " self._test_dl = self._setup_data_loader(test_data_config)\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ait4nLtIxS96" + }, + "source": [ + "### Creating the dataset / dataloader config\n", + "\n", + "The final step to setup this model is to add the `train_ds`, `validation_ds` and `test_ds` configs inside the model config!" 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "C6zcTqJixOOL" + }, + "source": [ + "OmegaConf.set_struct(cfg.model, False)\n", + "\n", + "# Set the data path and update vocabular size\n", + "cfg.model.data_path = 'tiny-shakespeare.txt'\n", + "cfg.model.vocab_size = train_dataset.vocab_size\n", + "\n", + "OmegaConf.set_struct(cfg.model, True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "zlvThf7BysyT" + }, + "source": [ + "train_ds = OmegaConf.create({\n", + " 'data_path': '${model.data_path}',\n", + " 'block_size': '${model.block_size}',\n", + " 'crop': [0, int(1e6)],\n", + " 'batch_size': 64,\n", + " 'shuffle': True,\n", + "})\n", + "\n", + "validation_ds = OmegaConf.create({\n", + " 'data_path': '${model.data_path}',\n", + " 'block_size': '${model.block_size}',\n", + " 'crop': [int(1e6), int(50e3)],\n", + " 'batch_size': 4,\n", + " 'shuffle': False,\n", + "})\n", + "\n", + "test_ds = OmegaConf.create({\n", + " 'data_path': '${model.data_path}',\n", + " 'block_size': '${model.block_size}',\n", + " 'crop': [int(1.05e6), int(100e3)],\n", + " 'batch_size': 4,\n", + " 'shuffle': False,\n", + "})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QVVzR6WKyMT5" + }, + "source": [ + "# Attach to the model config\n", + "OmegaConf.set_struct(cfg.model, False)\n", + "\n", + "cfg.model.train_ds = train_ds\n", + "cfg.model.validation_ds = validation_ds\n", + "cfg.model.test_ds = test_ds\n", + "\n", + "OmegaConf.set_struct(cfg.model, True)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "nd_9_mxS0ET-" + }, + "source": [ + "# Let's see the config now !\n", + "print(OmegaConf.to_yaml(cfg))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "dlwSQENU0JxA" + }, + "source": [ + "# Let's try creating a model now !\n", + "model = NeMoGPT(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q_Mp4bhH0tR1" + }, + "source": [ + "-----\n", + "All the data loaders load properly ! Yay!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CZHDqCyo6uWd" + }, + "source": [ + "# Evaluate the model - end to end!\n", + "\n", + "Now that the data loaders have been set up, all that's left is to train and test the model! We have most of the components required by this model - the train, val and test data loaders, the optimizer, and the type-checked forward step to perform the train-validation-test steps! \n", + "\n", + "But training a GPT model from scratch is not the goal of this primer, so instead, let's do a sanity check by merely testing the model for a few steps using random initial weights.\n", + "\n", + "The above will ensure that - \n", + "\n", + "1) Our data loaders work as intended\n", + "\n", + "2) The type checking system assures us that our Neural Modules are performing their forward step correctly.\n", + "\n", + "3) The loss is calculated, and therefore the model runs end to end, ultimately supporting PyTorch Lightning." 
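As an optional aside (not part of the original notebook flow): PyTorch Lightning's `fast_dev_run` flag provides a similar smoke test by pushing a single batch through each loop and then exiting, which surfaces shape, typing, or data-loader bugs in seconds:

```python
import torch
import pytorch_lightning as ptl

# Optional smoke test: fast_dev_run runs one batch per loop and exits,
# without touching checkpoints or loggers.
debug_trainer = ptl.Trainer(
    devices=1,
    accelerator='gpu' if torch.cuda.is_available() else 'cpu',
    fast_dev_run=True,
)
# debug_trainer.fit(model)   # uncomment to smoke-test the training loop too
```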
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "johk6Z0e0WEm" + }, + "source": [ + "if torch.cuda.is_available():\n", + " accelerator = 'gpu'\n", + "else:\n", + " accelerator = 'cpu'\n", + "\n", + "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches=1.0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "oqeeofEr1S8e" + }, + "source": [ + "trainer.test(model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pqJy7esrA-Ha" + }, + "source": [ + "# Saving and restoring models\n", + "\n", + "NeMo internally keeps track of the model configuration, as well as the model checkpoints and parameters.\n", + "\n", + "As long as your NeMo follows the above general guidelines, you can call the `save_to` and `restore_from` methods to save and restore your models!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "DksG_-7G1Vbe" + }, + "source": [ + "model.save_to('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "JhjoFdCnBWVh" + }, + "source": [ + "!ls -d -- *.nemo" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "567txSF0BYXN" + }, + "source": [ + "temp_model = NeMoGPT.restore_from('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "YvnfG0kxBfTt" + }, + "source": [ + "# [ERROR CELL]\n", + "temp_model.setup_test_data(temp_model.cfg.test_ds)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N0ckN44YB-1K" + }, + "source": [ + "-----\n", + "\n", + "Hmm, it seems it wasn't so easy in this case. Non-trivial models have non-trivial issues!\n", + "\n", + "Remember, our NeMoGPT model sets its self.vocab inside the `setup_train_data` step. But that depends on the vocabulary generated by the train set... which is **not** restored during model restoration (unless you call `setup_train_data` explicitly!).\n", + "\n", + "We can quickly resolve this issue by constructing an external data file to enable save and restore support, and NeMo supports that too! We will use the `register_artifact` API in NeMo to support external files being attached to the .nemo checkpoint." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_Atyoc4NBjEV" + }, + "source": [ + "class NeMoGPTv2(NeMoGPT):\n", + " \n", + " def setup_training_data(self, train_data_config: OmegaConf):\n", + " self.vocab = None\n", + " self._train_dl = self._setup_data_loader(train_data_config)\n", + "\n", + " # Save the vocab into a text file for now\n", + " with open('vocab.txt', 'w') as f:\n", + " for token in self.vocab:\n", + " f.write(f\"{token}\")\n", + " \n", + " # This is going to register the file into .nemo!\n", + " # When you later use .save_to(), it will copy this file into the tar file.\n", + " self.register_artifact('vocab_file', 'vocab.txt')\n", + " \n", + " def setup_validation_data(self, val_data_config: OmegaConf):\n", + " # This is going to try to find the same file, and if it fails, \n", + " # it will use the copy in .nemo\n", + " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", + " \n", + " with open(vocab_file, 'r') as f:\n", + " vocab = []\n", + " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", + " self.vocab = vocab\n", + "\n", + " self._validation_dl = self._setup_data_loader(val_data_config)\n", + " \n", + " def setup_test_data(self, test_data_config: OmegaConf):\n", + " # This is going to try to find the same file, and if it fails, \n", + " # it will use the copy in .nemo\n", + " vocab_file = self.register_artifact('vocab_file', 'vocab.txt')\n", + "\n", + " with open(vocab_file, 'r') as f:\n", + " vocab = []\n", + " vocab = f.read().split('')[:-1] # the -1 here is for the dangling token in the file\n", + " self.vocab = vocab\n", + "\n", + " self._test_dl = self._setup_data_loader(test_data_config)\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "mn09jsRZDusN" + }, + "source": [ + "# Let's try creating a model now !\n", + "model = NeMoGPTv2(cfg=cfg.model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "sQPIPySDD1K0" + }, + "source": [ + "# Now let's try to save and restore !\n", + "model.save_to('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "0YwCJ4xaJ3bU" + }, + "source": [ + "temp_model = NeMoGPTv2.restore_from('gpt_model.nemo')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tcxwDIIWKKCQ" + }, + "source": [ + "temp_model.setup_multiple_test_data(temp_model.cfg.test_ds)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "j3Olm6ZTKRbO" + }, + "source": [ + "if torch.cuda.is_available():\n", + " accelerator = 'gpu'\n", + "else:\n", + " accelerator = 'cpu'\n", + "\n", + "trainer = ptl.Trainer(devices=1, accelerator=accelerator, limit_test_batches =1.0)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "_QE2SngCKV2p" + }, + "source": [ + "trainer.test(model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o2HpKzwKJ_MW" + }, + "source": [ + "------\n", + "There we go ! Now our models can be serialized and de-serialized without any issue, even with an external vocab file !" 
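If you want to double-check that the vocab artifact really was packaged, a `.nemo` file is a tar archive, so you can list its contents directly. The exact member names inside the archive may differ between NeMo versions:

```python
import tarfile

# List the members of the .nemo archive: the model config, the checkpoint
# weights, and any registered artifacts (such as our vocab file) should appear.
with tarfile.open('gpt_model.nemo') as archive:
    for member in archive.getmembers():
        print(member.name)
```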
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "ZjCV5u3_OO7a" + }, + "source": [ + "" + ], + "execution_count": null, + "outputs": [] + } + ] } diff --git a/tutorials/asr/ASR_for_telephony_speech.ipynb b/tutorials/asr/ASR_for_telephony_speech.ipynb index 5d10e50950dd..47f752b57b48 100644 --- a/tutorials/asr/ASR_for_telephony_speech.ipynb +++ b/tutorials/asr/ASR_for_telephony_speech.ipynb @@ -1,344 +1,344 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "lJz6FDU1lRzc" - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell.\n", - "\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install sox libsndfile1 ffmpeg\n", - "!pip install text-unidecode\n", - "!pip install matplotlib>=3.3.2\n", - "\n", - "## Install NeMo\n", - "BRANCH = 'r1.23.0'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "## Grab the config we'll use in this example\n", - "!mkdir configs\n", - "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", - "\n", - "\"\"\"\n", - "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", - "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", - "that you want to use the \"Run All Cells\" (or similar) option.\n", - "\"\"\"\n", - "# exit()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v1Jk9etFlRzf" - }, - "source": [ - "# Telephony speech (8 kHz)\n", - "This notebook covers general recommendations for using NeMo models with 8 kHz speech. All the pretrained models currently available through NeMo are trained with audio at 16 kHz. This means that if the original audio was sampled at a different rate, it's sampling rate was converted to 16 kHz through upsampling or downsampling. One of the common applications for ASR is to recognize telephony speech which typically consists of speech sampled at 8 kHz.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Mixed sample rate\n", - "Most of the pretrained English models distributed with NeMo are trained with mixed sample rate data, i.e. the training data typically consists of data sampled at both 8 kHz and 16 kHz. As an example pretrained Citrinet model \"stt_en_citrinet_1024\" was trained with the following datasets. 
\n", - "* Librispeech 960 hours of English speech\n", - "* Fisher Corpus\n", - "* Switchboard-1 Dataset\n", - "* WSJ-0 and WSJ-1\n", - "* National Speech Corpus - 1\n", - "* Mozilla Common Voice\n", - "\n", - "Among these, Fisher and Switchboard datasets are conversational telephone speech datasets with audio sampled at 8 kHz while the other datasets were originally sampled at least 16 kHz. Before training, all audio files from Fisher and Switchboard datasets were upsampled to 16 kHz. Because of this mixed sample rate training, our models can be used to recognize both narrowband (8kHz) and wideband speech (16kHz)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Inference with NeMo\n", - "NeMo ASR currently supports inference of audio in .wav format. Internally, the audio file is resampled to 16 kHz before inference is called on the model, so there is no difference running inference on 8 kHz audio compared to say 16 kHz or any other sampling rate audio with NeMo. Let's look at an example for running inference on 8 kHz audio. " - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "# This is where the an4/ directory will be placed.\n", - "# Change this if you don't want the data to be extracted in the current directory.\n", - "data_dir = '.'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import glob\n", - "import os\n", - "import subprocess\n", - "import tarfile\n", - "import wget\n", - "\n", - "# Download the dataset. This will take a few moments...\n", - "print(\"******\")\n", - "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", - " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", - " an4_path = wget.download(an4_url, data_dir)\n", - " print(f\"Dataset downloaded at: {an4_path}\")\n", - "else:\n", - " print(\"Tarfile already exists.\")\n", - " an4_path = data_dir + '/an4_sphere.tar.gz'\n", - "\n", - "if not os.path.exists(data_dir + '/an4/'):\n", - " # Untar and convert .sph to .wav (using sox)\n", - " tar = tarfile.open(an4_path)\n", - " tar.extractall(path=data_dir)\n", - "\n", - " print(\"Converting .sph to .wav...\")\n", - " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", - " for sph_path in sph_list:\n", - " wav_path = sph_path[:-4] + '.wav'\n", - " cmd = [\"sox\", sph_path, wav_path]\n", - " subprocess.run(cmd)\n", - "print(\"Finished conversion.\\n******\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Audio in an4 dataset is sampled at 22 kHz. Let's first downsample an audio file to 16 kHz." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import librosa\n", - "import IPython.display as ipd\n", - "import librosa.display\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# Load and listen to the audio file\n", - "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", - "audio, sample_rate = librosa.load(example_file)\n", - "print(sample_rate)\n", - "audio_16kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=16000)\n", - "\n", - "import numpy as np\n", - "\n", - "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", - "spec = np.abs(librosa.stft(audio_16kHz))\n", - "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", - "\n", - "# Use log scale to view frequencies\n", - "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=16000)\n", - "plt.colorbar()\n", - "plt.title('Audio Spectrogram');\n", - "plt.ylim([0, 8000])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's downsample the audio to 8 kHz" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "audio_8kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=8000)\n", - "spec = np.abs(librosa.stft(audio_8kHz))\n", - "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", - "\n", - "# Use log scale to view frequencies\n", - "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=8000)\n", - "plt.colorbar()\n", - "plt.title('Audio Spectrogram');\n", - "plt.ylim([0, 8000])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import soundfile as sf\n", - "sf.write(data_dir + '/audio_16kHz.wav', audio_16kHz, 16000)\n", - "sample, sr = librosa.core.load(data_dir + '/audio_16kHz.wav')\n", - "ipd.Audio(sample, rate=sr)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sf.write(data_dir + '/audio_8kHz.wav', audio_8kHz, 8000)\n", - "sample, sr = librosa.core.load(data_dir + '/audio_8kHz.wav')\n", - "ipd.Audio(sample, rate=sr)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "Let's look at inference results using one of the pretrained models on the original, 16 kHz and 8 kHz versions of the example file we chose above." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from nemo.collections.asr.models import ASRModel\n", - "import torch\n", - "if torch.cuda.is_available():\n", - " device = torch.device(f'cuda:0')\n", - "asr_model = ASRModel.from_pretrained(model_name='stt_en_citrinet_1024', map_location=device)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As discussed above, there are no changes required for inference based on the sampling rate of audio and as we see below the pretrained Citrinet model gives accurate transcription even for audio downsampled to 8 Khz." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(asr_model.transcribe(paths2audio_files=[example_file]))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_16kHz.wav']))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_8kHz.wav']))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Training / fine-tuning with 8 kHz data\n", - "For training a model with new 8 kHz data, one could take two approaches. The first approach, **which is recommended**, is to finetune a pretrained 16 kHz model by upsampling all the data to 16 kHz. Note that upsampling offline before training is not necessary but recommended as online upsampling during training is very time consuming and may slow down training significantly. The second approach is to train an 8 kHz model from scratch. **Note**: For the second approach, in our experiments we saw that loading the weights of a 16 kHz model as initialization helps the model to converge faster with better accuracy.\n", - "\n", - "To upsample your 8 kHz data to 16 kHz command line tools like sox or ffmpeg are very useful. Here is the command to upsample and audio file using sox:\n", - "```shell\n", - "sox input_8k.wav -r 16000 -o output_16k.wav\n", - "```\n", - "Now to finetune a pre-trained model with this upsampled data, you can just restore the model weights from the pre-trained model and call trainer with the upsampled data. As an example, here is how one would fine-tune a Citrinet model:\n", - "```python\n", - "python examples/asr/script_to_bpe.py \\\n", - " --config-path=\"examples/asr/conf/citrinet\" \\\n", - " --config-name=\"citrinet_512.yaml\" \\\n", - " model.train_ds.manifest_filepath=\"\" \\\n", - " model.validation_ds.manifest_filepath=\"\" \\\n", - " trainer.devices=-1 \\\n", - " trainer.accelerator='gpu' \\\n", - " trainer.max_epochs=50 \\\n", - " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", - "```\n", - "\n", - "To train an 8 kHz model, just change the sample rate in the config to 8000 as follows:\n", - "\n", - "```python\n", - "python examples/asr/script_to_bpe.py \\\n", - " --config-path=\"examples/asr/conf/citrinet\" \\\n", - " --config-name=\"citrinet_512.yaml\" \\\n", - " model.sample_rate=8000 \\\n", - " model.train_ds.manifest_filepath=\"\" \\\n", - " model.validation_ds.manifest_filepath=\"\" \\\n", - " trainer.devices=-1 \\\n", - " trainer.accelerator='gpu' \\\n", - " trainer.max_epochs=50 \\\n", - " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", - "```" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "ASR_with_NeMo.ipynb", - "provenance": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.5" - }, - "pycharm": { - "stem_cell": { - "cell_type": "raw", + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lJz6FDU1lRzc" + }, + "outputs": [], + "source": [ + 
"\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", + "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "!pip install matplotlib>=3.3.2\n", + "\n", + "## Install NeMo\n", + "BRANCH = 'r1.23.0'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "## Grab the config we'll use in this example\n", + "!mkdir configs\n", + "!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", + "\n", + "\"\"\"\n", + "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", + "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", + "that you want to use the \"Run All Cells\" (or similar) option.\n", + "\"\"\"\n", + "# exit()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1Jk9etFlRzf" + }, + "source": [ + "# Telephony speech (8 kHz)\n", + "This notebook covers general recommendations for using NeMo models with 8 kHz speech. All the pretrained models currently available through NeMo are trained with audio at 16 kHz. This means that if the original audio was sampled at a different rate, it's sampling rate was converted to 16 kHz through upsampling or downsampling. One of the common applications for ASR is to recognize telephony speech which typically consists of speech sampled at 8 kHz.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mixed sample rate\n", + "Most of the pretrained English models distributed with NeMo are trained with mixed sample rate data, i.e. the training data typically consists of data sampled at both 8 kHz and 16 kHz. As an example pretrained Citrinet model \"stt_en_citrinet_1024\" was trained with the following datasets. \n", + "* Librispeech 960 hours of English speech\n", + "* Fisher Corpus\n", + "* Switchboard-1 Dataset\n", + "* WSJ-0 and WSJ-1\n", + "* National Speech Corpus - 1\n", + "* Mozilla Common Voice\n", + "\n", + "Among these, Fisher and Switchboard datasets are conversational telephone speech datasets with audio sampled at 8 kHz while the other datasets were originally sampled at least 16 kHz. Before training, all audio files from Fisher and Switchboard datasets were upsampled to 16 kHz. Because of this mixed sample rate training, our models can be used to recognize both narrowband (8kHz) and wideband speech (16kHz)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Inference with NeMo\n", + "NeMo ASR currently supports inference of audio in .wav format. 
Internally, the audio file is resampled to 16 kHz before inference is called on the model, so there is no difference running inference on 8 kHz audio compared to say 16 kHz or any other sampling rate audio with NeMo. Let's look at an example for running inference on 8 kHz audio. " + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# This is where the an4/ directory will be placed.\n", + "# Change this if you don't want the data to be extracted in the current directory.\n", + "data_dir = '.'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "import os\n", + "import subprocess\n", + "import tarfile\n", + "import wget\n", + "\n", + "# Download the dataset. This will take a few moments...\n", + "print(\"******\")\n", + "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", + " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", + " an4_path = wget.download(an4_url, data_dir)\n", + " print(f\"Dataset downloaded at: {an4_path}\")\n", + "else:\n", + " print(\"Tarfile already exists.\")\n", + " an4_path = data_dir + '/an4_sphere.tar.gz'\n", + "\n", + "if not os.path.exists(data_dir + '/an4/'):\n", + " # Untar and convert .sph to .wav (using sox)\n", + " tar = tarfile.open(an4_path)\n", + " tar.extractall(path=data_dir)\n", + "\n", + " print(\"Converting .sph to .wav...\")\n", + " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", + " for sph_path in sph_list:\n", + " wav_path = sph_path[:-4] + '.wav'\n", + " cmd = [\"sox\", sph_path, wav_path]\n", + " subprocess.run(cmd)\n", + "print(\"Finished conversion.\\n******\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Audio in an4 dataset is sampled at 22 kHz. Let's first downsample an audio file to 16 kHz." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import librosa\n", + "import IPython.display as ipd\n", + "import librosa.display\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Load and listen to the audio file\n", + "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", + "audio, sample_rate = librosa.load(example_file)\n", + "print(sample_rate)\n", + "audio_16kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=16000)\n", + "\n", + "import numpy as np\n", + "\n", + "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", + "spec = np.abs(librosa.stft(audio_16kHz))\n", + "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", + "\n", + "# Use log scale to view frequencies\n", + "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=16000)\n", + "plt.colorbar()\n", + "plt.title('Audio Spectrogram');\n", + "plt.ylim([0, 8000])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's downsample the audio to 8 kHz" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "audio_8kHz = librosa.core.resample(audio, orig_sr=sample_rate, target_sr=8000)\n", + "spec = np.abs(librosa.stft(audio_8kHz))\n", + "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", + "\n", + "# Use log scale to view frequencies\n", + "librosa.display.specshow(spec_db, y_axis='log', x_axis='time', sr=8000)\n", + "plt.colorbar()\n", + "plt.title('Audio Spectrogram');\n", + "plt.ylim([0, 8000])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import soundfile as sf\n", + "sf.write(data_dir + '/audio_16kHz.wav', audio_16kHz, 16000)\n", + "sample, sr = librosa.core.load(data_dir + '/audio_16kHz.wav')\n", + "ipd.Audio(sample, rate=sr)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sf.write(data_dir + '/audio_8kHz.wav', audio_8kHz, 8000)\n", + "sample, sr = librosa.core.load(data_dir + '/audio_8kHz.wav')\n", + "ipd.Audio(sample, rate=sr)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Let's look at inference results using one of the pretrained models on the original, 16 kHz and 8 kHz versions of the example file we chose above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo.collections.asr.models import ASRModel\n", + "import torch\n", + "if torch.cuda.is_available():\n", + " device = torch.device(f'cuda:0')\n", + "asr_model = ASRModel.from_pretrained(model_name='stt_en_citrinet_1024', map_location=device)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As discussed above, there are no changes required for inference based on the sampling rate of audio and as we see below the pretrained Citrinet model gives accurate transcription even for audio downsampled to 8 Khz." 
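The next three cells print the transcripts. As an optional follow-up sketch (not part of the original notebook), you could also quantify how closely the 8 kHz transcript matches the 16 kHz one with NeMo's `word_error_rate` helper, assuming `transcribe` returns plain strings for this model:

```python
from nemo.collections.asr.metrics.wer import word_error_rate

# Treat the 16 kHz transcript as the "reference" and measure how much the
# 8 kHz transcript diverges from it (0.0 means the two are identical).
hyp_16k = asr_model.transcribe(paths2audio_files=[data_dir + '/audio_16kHz.wav'])
hyp_8k = asr_model.transcribe(paths2audio_files=[data_dir + '/audio_8kHz.wav'])

print('16 kHz:', hyp_16k[0])
print(' 8 kHz:', hyp_8k[0])
print('WER between the two:', word_error_rate(hypotheses=hyp_8k, references=hyp_16k))
```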
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(asr_model.transcribe(paths2audio_files=[example_file]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_16kHz.wav']))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(asr_model.transcribe(paths2audio_files=[data_dir + '/audio_8kHz.wav']))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Training / fine-tuning with 8 kHz data\n", + "For training a model with new 8 kHz data, one could take two approaches. The first approach, **which is recommended**, is to finetune a pretrained 16 kHz model by upsampling all the data to 16 kHz. Note that upsampling offline before training is not necessary but recommended as online upsampling during training is very time consuming and may slow down training significantly. The second approach is to train an 8 kHz model from scratch. **Note**: For the second approach, in our experiments we saw that loading the weights of a 16 kHz model as initialization helps the model to converge faster with better accuracy.\n", + "\n", + "To upsample your 8 kHz data to 16 kHz command line tools like sox or ffmpeg are very useful. Here is the command to upsample and audio file using sox:\n", + "```shell\n", + "sox input_8k.wav -r 16000 -o output_16k.wav\n", + "```\n", + "Now to finetune a pre-trained model with this upsampled data, you can just restore the model weights from the pre-trained model and call trainer with the upsampled data. As an example, here is how one would fine-tune a Citrinet model:\n", + "```python\n", + "python examples/asr/script_to_bpe.py \\\n", + " --config-path=\"examples/asr/conf/citrinet\" \\\n", + " --config-name=\"citrinet_512.yaml\" \\\n", + " model.train_ds.manifest_filepath=\"\" \\\n", + " model.validation_ds.manifest_filepath=\"\" \\\n", + " trainer.devices=-1 \\\n", + " trainer.accelerator='gpu' \\\n", + " trainer.max_epochs=50 \\\n", + " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", + "```\n", + "\n", + "To train an 8 kHz model, just change the sample rate in the config to 8000 as follows:\n", + "\n", + "```python\n", + "python examples/asr/script_to_bpe.py \\\n", + " --config-path=\"examples/asr/conf/citrinet\" \\\n", + " --config-name=\"citrinet_512.yaml\" \\\n", + " model.sample_rate=8000 \\\n", + " model.train_ds.manifest_filepath=\"\" \\\n", + " model.validation_ds.manifest_filepath=\"\" \\\n", + " trainer.devices=-1 \\\n", + " trainer.accelerator='gpu' \\\n", + " trainer.max_epochs=50 \\\n", + " +init_from_pretrained_model=\"stt_en_citrinet_512\"\n", + "```" + ] + } + ], "metadata": { - "collapsed": false + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "ASR_with_NeMo.ipynb", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + }, + "pycharm": { + "stem_cell": { + "cell_type": "raw", + "metadata": { + "collapsed": false + }, + "source": [] + } + } }, - "source": [] - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 
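Returning to the recommendation above to upsample 8 kHz data offline before fine-tuning: a simple Python sketch using the same librosa/soundfile calls as earlier in this notebook is shown below. The directory names are placeholders, and for large corpora batch tools such as sox or ffmpeg (or parallel processing) will be considerably faster:

```python
import glob
import os

import librosa
import soundfile as sf


def upsample_dir_to_16k(in_dir, out_dir, target_sr=16000):
    """Offline-upsample every .wav under in_dir and write 16 kHz copies to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for path in glob.glob(os.path.join(in_dir, '**', '*.wav'), recursive=True):
        audio, sr = librosa.load(path, sr=None)  # keep the original sample rate
        if sr != target_sr:
            audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
        out_path = os.path.join(out_dir, os.path.relpath(path, in_dir))
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        sf.write(out_path, audio, target_sr)


# Example with hypothetical paths:
# upsample_dir_to_16k('my_8khz_corpus/', 'my_16khz_corpus/')
```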
+ "nbformat": 4, + "nbformat_minor": 4 } diff --git a/tutorials/asr/ASR_with_NeMo.ipynb b/tutorials/asr/ASR_with_NeMo.ipynb index 479a89ed8c2d..850f19171948 100644 --- a/tutorials/asr/ASR_with_NeMo.ipynb +++ b/tutorials/asr/ASR_with_NeMo.ipynb @@ -1,1176 +1,1176 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "accelerator": "GPU", - "colab": { - "name": "ASR_with_NeMo.ipynb", - "provenance": [], - "collapsed_sections": [], - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.7" - } - }, - "cells": [ - { - "cell_type": "code", - "metadata": { - "id": "lJz6FDU1lRzc" - }, - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell.\n", - "\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install sox libsndfile1 ffmpeg\n", - "!pip install text-unidecode\n", - "!pip install matplotlib>=3.3.2\n", - "\n", - "## Install NeMo\n", - "BRANCH = 'r1.23.0'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "\"\"\"\n", - "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!\n", - "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", - "that you want to use the \"Run All Cells\" (or similar) option.\n", - "\"\"\"\n", - "# exit()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v1Jk9etFlRzf" - }, - "source": [ - "# Introduction to End-To-End Automatic Speech Recognition\n", - "\n", - "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", - "We will first introduce the basics of the main concepts behind speech recognition, then explore concrete examples of what the data looks like and walk through putting together a simple end-to-end ASR pipeline.\n", - "\n", - "We assume that you are familiar with general machine learning concepts and can follow Python code, and we'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YLln3U-IlRzg" - }, - "source": [ - "## Conceptual Overview: What is ASR?\n", - "\n", - "ASR, or **Automatic Speech Recognition**, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text). Our goal is usually to have a model that minimizes the **Word Error Rate (WER)** metric when transcribing speech input. In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with as few errors as possible?\n", - "\n", - "Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to evaluate a speech sample. We would start from a **language model** that encapsulates the most likely orderings of words that are generated (e.g. an n-gram model), to a **pronunciation model** for each word in that ordering (e.g. a pronunciation table), to an **acoustic model** that translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).\n", - "\n", - "Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model `Pr(audio|transcript)*Pr(transcript)`, and take the argmax of this over possible transcripts.\n", - "\n", - "Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a neural model that had better performance and that had a greater potential for generalization. For example, we could replace an n-gram model with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of these neural models need to be trained individually on different tasks, and errors in any model in the pipeline could throw off the whole prediction.\n", - "\n", - "Thus, we can see the appeal of **end-to-end ASR architectures**: discriminative models that simply take an audio input and give a textual output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. If desired, we could integrate a language model that would improve our predictions, as well.\n", - "\n", - "And the entire end-to-end ASR model can be trained at once--a much easier pipeline to handle! " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0S5iZPMSlRzg" - }, - "source": [ - "### End-To-End ASR\n", - "\n", - "With an end-to-end model, we want to directly learn `Pr(transcript|audio)` in order to predict the transcripts from the original audio. Since we are dealing with sequential information--audio data over time that corresponds to a sequence of letters--RNNs are the obvious choice. But now we have a pressing problem to deal with: since our input sequence (number of audio timesteps) is not the same length as our desired output (transcript length), how do we match each time step from the audio data to the correct output characters?\n", - "\n", - "Earlier speech recognition approaches relied on **temporally-aligned data**, in which each segment of time in an audio file was matched up to a corresponding speech sound such as a phoneme or word. 
However, if we would like to have the flexibility to predict letter-by-letter to prevent OOV (out of vocabulary) issues, then each time step in the data would have to be labeled with the letter sound that the speaker is making at that point in the audio file. With that information, it seems like we should simply be able to try to predict the correct letter for each time step and then collapse the repeated letters (e.g. the prediction output `LLLAAAAPPTOOOPPPP` would become `LAPTOP`). It turns out that this idea has some problems: not only does alignment make the dataset incredibly labor-intensive to label, but also, what do we do with words like \"book\" that contain consecutive repeated letters? Simply squashing repeated letters together would not work in that case!\n", - "\n", - "![Alignment example](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/tutorials/asr/images/alignment_example.png)\n", - "\n", - "Modern end-to-end approaches get around this using methods that don't require manual alignment at all, so that the input-output pairs are really just the raw audio and the transcript--no extra data or labeling required. Let's briefly go over two popular approaches that allow us to do this, Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention.\n", - "\n", - "#### Connectionist Temporal Classification (CTC)\n", - "\n", - "In normal speech recognition prediction output, we would expect to have characters such as the letters from A through Z, numbers 0 through 9, spaces (\"\\_\"), and so on. CTC introduces a new intermediate output token called the **blank token** (\"-\") that is useful for getting around the alignment issue.\n", - "\n", - "With CTC, we still predict one token per time segment of speech, but we use the blank token to figure out where we can and can't collapse the predictions. The appearance of a blank token helps separate repeating letters that should not be collapsed. For instance, with an audio snippet segmented into `T=11` time steps, we could get predictions that look like `BOO-OOO--KK`, which would then collapse to `\"BO-O-K\"`, and then we would remove the blank tokens to get our final output, `BOOK`.\n", - "\n", - "Now, we can predict one output token per time step, then collapse and clean to get sensible output without any fear of ambiguity from repeating letters! A simple way of getting predictions like this would be to apply a bidirectional RNN to the audio input, apply softmax over each time step's output, and then take the token with the highest probability. The method of always taking the best token at each time step is called **greedy decoding, or max decoding**.\n", - "\n", - "To calculate our loss for backprop, we would like to know the log probability of the model producing the correct transcript, `log(Pr(transcript|audio))`. We can get the log probability of a single intermediate output sequence (e.g. `BOO-OOO--KK`) by summing over the log probabilities we get from each token's softmax value, but note that the resulting sum is different from the log probability of the transcript itself (`BOOK`). This is because there are multiple possible output sequences of the same length that can be collapsed to get the same transcript (e.g. 
`BBO--OO-KKK` also results in `BOOK`), and so we need to **marginalize over every valid sequence of length `T` that collapses to the transcript**.\n", - "\n", - "Therefore, to get our transcript's log probability given our audio input, we must sum the log probabilities of every sequence of length `T` that collapses to the transcript (e.g. `log(Pr(output: \"BOOK\"|audio)) = log(Pr(BOO-OOO--KK|audio)) + log(Pr(BBO--OO-KKK|audio)) + ...`). In practice, we can use a dynamic programming approach to calculate this, accumulating our log probabilities over different \"paths\" through the softmax outputs at each time step.\n", - "\n", - "If you would like a more in-depth explanation of how CTC works, or how we can improve our results by using a modified beam search algorithm, feel free to check out the Further Reading section at the end of this notebook for more resources.\n", - "\n", - "#### Sequence-to-Sequence with Attention\n", - "\n", - "One problem with CTC is that predictions at different time steps are conditionally independent, which is an issue because the words in a continuous utterance tend to be related to each other in some sensible way. With this conditional independence assumption, we can't learn a language model that can represent such dependencies, though we can add a language model on top of the CTC output to mitigate this to some degree.\n", - "\n", - "A popular alternative is to use a sequence-to-sequence model with attention. A typical seq2seq model for ASR consists of some sort of **bidirectional RNN encoder** that consumes the audio sequence timestep-by-timestep, and where the outputs are then passed to an **attention-based decoder**. Each prediction from the decoder is based on attending to some parts of the entire encoded input, as well as the previously outputted tokens.\n", - "\n", - "The outputs of the decoder can be anything from word pieces to phonemes to letters, and since predictions are not directly tied to time steps of the input, we can just continue producing tokens one-by-one until an end token is given (or we reach a specified max output length). This way, we do not need to deal with audio alignment, and our predicted transcript is just the sequence of outputs given by our decoder.\n", - "\n", - "Now that we have an idea of what some popular end-to-end ASR models look like, let's take a look at the audio data we'll be working with for our example." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "38aYTCTIlRzh" - }, - "source": [ - "## Taking a Look at Our Data (AN4)\n", - "\n", - "The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\n", - "\n", - "Before we get started, let's download and prepare the dataset. The utterances are available as `.sph` files, so we will need to convert them to `.wav` for later processing. If you are not using Google Colab, please make sure you have [Sox](http://sox.sourceforge.net/) installed for this step--see the \"Downloads\" section of the linked Sox homepage. 
(If you are using Google Colab, Sox should have already been installed in the setup cell at the beginning.)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "gAhsmi6HlRzh" - }, - "source": [ - "import os\n", - "# This is where the an4/ directory will be placed.\n", - "# Change this if you don't want the data to be extracted in the current directory.\n", - "data_dir = '.'\n", - "\n", - "if not os.path.exists(data_dir):\n", - " os.makedirs(data_dir)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "Yb4fuUvWlRzk", - "scrolled": true - }, - "source": [ - "import glob\n", - "import os\n", - "import subprocess\n", - "import tarfile\n", - "import wget\n", - "\n", - "# Download the dataset. This will take a few moments...\n", - "print(\"******\")\n", - "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", - " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", - " an4_path = wget.download(an4_url, data_dir)\n", - " print(f\"Dataset downloaded at: {an4_path}\")\n", - "else:\n", - " print(\"Tarfile already exists.\")\n", - " an4_path = data_dir + '/an4_sphere.tar.gz'\n", - "\n", - "if not os.path.exists(data_dir + '/an4/'):\n", - " # Untar and convert .sph to .wav (using sox)\n", - " tar = tarfile.open(an4_path)\n", - " tar.extractall(path=data_dir)\n", - "\n", - " print(\"Converting .sph to .wav...\")\n", - " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", - " for sph_path in sph_list:\n", - " wav_path = sph_path[:-4] + '.wav'\n", - " cmd = [\"sox\", sph_path, wav_path]\n", - " subprocess.run(cmd)\n", - "print(\"Finished conversion.\\n******\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "m_LFeM0elRzm" - }, - "source": [ - "You should now have a folder called `an4` that contains `etc/an4_train.transcription`, `etc/an4_test.transcription`, audio files in `wav/an4_clstk` and `wav/an4test_clstk`, along with some other files we will not need.\n", - "\n", - "Now we can load and take a look at the data. As an example, file `cen2-mgah-b.wav` is a 2.6 second-long audio recording of a man saying the letters \"G L E N N\" one-by-one. To confirm this, we can listen to the file:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_M_bSs3MjQlz" - }, - "source": [ - "import librosa\n", - "import IPython.display as ipd\n", - "\n", - "# Load and listen to the audio file\n", - "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", - "audio, sample_rate = librosa.load(example_file)\n", - "\n", - "ipd.Audio(example_file, rate=sample_rate)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qZyElgPVjQl5" - }, - "source": [ - "In an ASR task, if this WAV file was our input, then \"G L E N N\" would be our desired output.\n", - "\n", - "Let's plot the waveform, which is simply a line plot of the sequence of values that we read from the file. 
This is a format of viewing audio that you are likely to be familiar with seeing in many audio editors and visualizers:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "MqIAKkqelRzm" - }, - "source": [ - "%matplotlib inline\n", - "import librosa.display\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# Plot our example audio file's waveform\n", - "plt.rcParams['figure.figsize'] = (15,7)\n", - "plt.title('Waveform of Audio Example')\n", - "plt.ylabel('Amplitude')\n", - "\n", - "_ = librosa.display.waveshow(audio, color='blue')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gg6RR_yolRzo" - }, - "source": [ - "We can see the activity in the waveform that corresponds to each letter in the audio, as our speaker here enunciates quite clearly!\n", - "You can kind of tell that each spoken letter has a different \"shape,\" and it's interesting to note that last two blobs look relatively similar, which is expected because they are both the letter \"N.\"\n", - "\n", - "### Spectrograms and Mel Spectrograms\n", - "\n", - "However, since audio information is more useful in the context of frequencies of sound over time, we can get a better representation than this raw sequence of 57,330 values.\n", - "We can apply a [Fourier Transform](https://en.wikipedia.org/wiki/Fourier_transform) on our audio signal to get something more useful: a **spectrogram**, which is a representation of the energy levels (i.e. amplitude, or \"loudness\") of each frequency (i.e. pitch) of the signal over the duration of the file.\n", - "A spectrogram (which can be viewed as a heat map) is a good way of seeing how the *strengths of various frequencies in the audio vary over time*, and is obtained by breaking up the signal into smaller, usually overlapping chunks and performing a Short-Time Fourier Transform (STFT) on each.\n", - "\n", - "Let's examine what the spectrogram of our sample looks like." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "oCFneEs1lRzp" - }, - "source": [ - "import numpy as np\n", - "\n", - "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", - "spec = np.abs(librosa.stft(audio))\n", - "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", - "\n", - "# Use log scale to view frequencies\n", - "librosa.display.specshow(spec_db, y_axis='log', x_axis='time')\n", - "plt.colorbar()\n", - "plt.title('Audio Spectrogram');" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9OPc4tcalRzs" - }, - "source": [ - "Again, we are able to see each letter being pronounced, and that the last two blobs that correspond to the \"N\"s are pretty similar-looking. But how do we interpret these shapes and colors? Just as in the waveform plot before, we see time passing on the x-axis (all 2.6s of audio). But now, the y-axis represents different frequencies (on a log scale), and *the color on the plot shows the strength of a frequency at a particular point in time*.\n", - "\n", - "We're still not done yet, as we can make one more potentially useful tweak: using the **Mel Spectrogram** instead of the normal spectrogram. 
This is simply a change in the frequency scale that we use from linear (or logarithmic) to the mel scale, which is \"a perceptual scale of pitches judged by listeners to be equal in distance from one another\" (from [Wikipedia](https://en.wikipedia.org/wiki/Mel_scale)).\n", - "\n", - "In other words, it's a transformation of the frequencies to be more aligned to what humans perceive; a change of +1000Hz from 2000Hz->3000Hz sounds like a larger difference to us than 9000Hz->10000Hz does, so the mel scale normalizes this such that equal distances sound like equal differences to the human ear. Intuitively, we use the mel spectrogram because in this case we are processing and transcribing human speech, such that transforming the scale to better match what we hear is a useful procedure." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7yQXVn-TlRzt" - }, - "source": [ - "# Plot the mel spectrogram of our sample\n", - "mel_spec = librosa.feature.melspectrogram(y=audio, sr=sample_rate)\n", - "mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)\n", - "\n", - "librosa.display.specshow(\n", - " mel_spec_db, x_axis='time', y_axis='mel')\n", - "plt.colorbar()\n", - "plt.title('Mel Spectrogram');" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RSCyVizDlRz1" - }, - "source": [ - "## Convolutional ASR Models\n", - "\n", - "Let's take a look at the model that we will be building, and how we specify its parameters.\n", - "\n", - "### The Jasper Model\n", - "\n", - "We will be training a small [Jasper (Just Another SPeech Recognizer) model](https://arxiv.org/abs/1904.03288) from scratch (e.g. initialized randomly). \n", - "In brief, Jasper architectures consist of a repeated block structure that utilizes 1D convolutions.\n", - "In a Jasper_KxR model, `R` sub-blocks (consisting of a 1D convolution, batch norm, ReLU, and dropout) are grouped into a single block, which is then repeated `K` times.\n", - "We also have a one extra block at the beginning and a few more at the end that are invariant of `K` and `R`, and we use CTC loss.\n", - "\n", - "### The QuartzNet Model\n", - "\n", - "The QuartzNet is better variant of Jasper with a key difference that it uses time-channel separable 1D convolutions. This allows it to dramatically reduce number of weights while keeping similar accuracy.\n", - "\n", - "A Jasper/QuartzNet models look like this (QuartzNet model is pictured):\n", - "\n", - "![QuartzNet with CTC](https://developer.nvidia.com/blog/wp-content/uploads/2020/05/quartznet-model-architecture-1-625x742.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gEpNci7slRzw" - }, - "source": [ - "# Using NeMo for Automatic Speech Recognition\n", - "\n", - "Now that we have an idea of what ASR is and how the audio data looks like, we can start using NeMo to do some ASR!\n", - "\n", - "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", - "\n", - "NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. 
NeMo also comes with complete models which only require your data and hyperparameters for training." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "4_W0lhaQlRzx" - }, - "source": [ - "# NeMo's \"core\" package\n", - "import nemo\n", - "# NeMo's ASR collection - this collections contains complete ASR models and\n", - "# building blocks (modules) for ASR\n", - "import nemo.collections.asr as nemo_asr" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "v_W8EbYktZE3" - }, - "source": [ - "## Using an Out-of-the-Box Model\n", - "\n", - "NeMo's ASR collection comes with many building blocks and even complete models that we can use for training and evaluation. Moreover, several models come with pre-trained weights. Let's instantiate a complete QuartzNet15x5 model." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "KFZZpYult96G" - }, - "source": [ - "# This line will download pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud and instantiate it for you\n", - "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KucxoFJhum0i" - }, - "source": [ - "Next, we'll simply add paths to files we want to transcribe into the list and pass it to our model. Note that it will work for relatively short (<25 seconds) files. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3QCpR_93u1hp" - }, - "source": [ - "files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav')]\n", - "for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):\n", - " print(f\"Audio in {fname} was recognized as: {transcription}\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ppUm_kuavm_f" - }, - "source": [ - "That was easy! But there are plenty of scenarios where you would want to fine-tune the model on your own data or even train from scratch. For example, this out-of-the box model will obviously not work for Spanish and would likely perform poorly for telephone audio. So if you have collected your own data, you certainly should attempt to fine-tune or train on it!" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ABUDaC5Js7AW" - }, - "source": [ - "## Training from Scratch\n", - "\n", - "To train from scratch, you need to prepare your training data in the right format and specify your models architecture." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RdNyw1b_zgtm" - }, - "source": [ - "### Creating Data Manifests\n", - "\n", - "The first thing we need to do now is to create manifests for our training and evaluation data, which will contain the metadata of our audio files. NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. 
A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.\n", - "\n", - "Here's an example of what one line in a NeMo-compatible manifest might look like:\n", - "```\n", - "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\n", - "```\n", - "\n", - "We can build our training and evaluation manifests using `an4/etc/an4_train.transcription` and `an4/etc/an4_test.transcription`, which have lines containing transcripts and their corresponding audio file IDs:\n", - "```\n", - "...\n", - " P I T T S B U R G H (cen5-fash-b)\n", - " TWO SIX EIGHT FOUR FOUR ONE EIGHT (cen7-fash-b)\n", - "...\n", - "```" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "lVB1sG1GlRzz" - }, - "source": [ - "# --- Building Manifest Files --- #\n", - "import json\n", - "\n", - "# Function to build a manifest\n", - "def build_manifest(transcripts_path, manifest_path, wav_path):\n", - " with open(transcripts_path, 'r') as fin:\n", - " with open(manifest_path, 'w') as fout:\n", - " for line in fin:\n", - " # Lines look like this:\n", - " # transcript (fileID)\n", - " transcript = line[: line.find('(')-1].lower()\n", - " transcript = transcript.replace('', '').replace('', '')\n", - " transcript = transcript.strip()\n", - "\n", - " file_id = line[line.find('(')+1 : -2] # e.g. \"cen4-fash-b\"\n", - " audio_path = os.path.join(\n", - " data_dir, wav_path,\n", - " file_id[file_id.find('-')+1 : file_id.rfind('-')],\n", - " file_id + '.wav')\n", - "\n", - " duration = librosa.core.get_duration(filename=audio_path)\n", - "\n", - " # Write the metadata to the manifest\n", - " metadata = {\n", - " \"audio_filepath\": audio_path,\n", - " \"duration\": duration,\n", - " \"text\": transcript\n", - " }\n", - " json.dump(metadata, fout)\n", - " fout.write('\\n')\n", - " \n", - "# Building Manifests\n", - "print(\"******\")\n", - "train_transcripts = data_dir + '/an4/etc/an4_train.transcription'\n", - "train_manifest = data_dir + '/an4/train_manifest.json'\n", - "if not os.path.isfile(train_manifest):\n", - " build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')\n", - " print(\"Training manifest created.\")\n", - "\n", - "test_transcripts = data_dir + '/an4/etc/an4_test.transcription'\n", - "test_manifest = data_dir + '/an4/test_manifest.json'\n", - "if not os.path.isfile(test_manifest):\n", - " build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\n", - " print(\"Test manifest created.\")\n", - "print(\"***Done***\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "W2fShQzRzo-M" - }, - "source": [ - "### Specifying Our Model with a YAML Config File\n", - "\n", - "For this tutorial, we'll build a *Jasper_4x1 model*, with `K=4` blocks of single (`R=1`) sub-blocks and a *greedy CTC decoder*, using the configuration found in `./configs/config.yaml`.\n", - "\n", - "If we open up this config file, we find model section which describes architecture of our model. A model contains an entry labeled `encoder`, with a field called `jasper` that contains a list with multiple entries. 
Each of the members in this list specifies one block in our model, and looks something like this:\n", - "```\n", - "- filters: 128\n", - " repeat: 1\n", - " kernel: [11]\n", - " stride: [2]\n", - " dilation: [1]\n", - " dropout: 0.2\n", - " residual: false\n", - " separable: true\n", - " se: true\n", - " se_context_size: -1\n", - "```\n", - "The first member of the list corresponds to the first block in the Jasper architecture diagram, which appears regardless of `K` and `R`.\n", - "Next, we have four entries that correspond to the `K=4` blocks, and each has `repeat: 1` since we are using `R=1`.\n", - "These are followed by two more entries for the blocks that appear at the end of our Jasper model before the CTC loss.\n", - "\n", - "There are also some entries at the top of the file that specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.\n", - "\n", - "Using a YAML config such as this is helpful for getting a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "PXVKBniMlRz5" - }, - "source": [ - "# --- Config Information ---#\n", - "try:\n", - " from ruamel.yaml import YAML\n", - "except ModuleNotFoundError:\n", - " from ruamel_yaml import YAML\n", - "config_path = './configs/config.yaml'\n", - "\n", - "if not os.path.exists(config_path):\n", - " # Grab the config we'll use in this example\n", - " BRANCH = 'r1.23.0'\n", - " !mkdir configs\n", - " !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n", - "\n", - "yaml = YAML(typ='safe')\n", - "with open(config_path) as f:\n", - " params = yaml.load(f)\n", - "print(params)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wUmq3p2Aw_5N" - }, - "source": [ - "### Training with PyTorch Lightning\n", - "\n", - "NeMo models and modules can be used in any PyTorch code where torch.nn.Module is expected.\n", - "\n", - "However, NeMo's models are based on [PytorchLightning's](https://github.com/PyTorchLightning/pytorch-lightning) LightningModule and we recommend you use PytorchLightning for training and fine-tuning as it makes using mixed precision and distributed training very easy. So to start, let's create Trainer instance for training on GPU for 50 epochs" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "GUfR6tAK0k2u" - }, - "source": [ - "import pytorch_lightning as pl\n", - "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=50)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IEn2RyvgxxvO" - }, - "source": [ - "Next, we instantiate and ASR model based on our ``config.yaml`` file from the previous section.\n", - "Note that this is a stage during which we also tell the model where our training and validation manifests are." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Cbf0fsMK09lk" - }, - "source": [ - "from omegaconf import DictConfig\n", - "params['model']['train_ds']['manifest_filepath'] = train_manifest\n", - "params['model']['validation_ds']['manifest_filepath'] = test_manifest\n", - "first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(params['model']), trainer=trainer)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hWtzwL5qXTYq" - }, - "source": [ - "With that, we can start training with just one line!" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "inRJsnrz1psq" - }, - "source": [ - "# Start training!!!\n", - "trainer.fit(first_asr_model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jpYXX-GslR0E" - }, - "source": [ - "There we go! We've put together a full training pipeline for the model and trained it for 50 epochs.\n", - "\n", - "If you'd like to save this model checkpoint for loading later (e.g. for fine-tuning, or for continuing training), you can simply call `first_asr_model.save_to()`. Then, to restore your weights, you can rebuild the model using the config (let's say you call it `first_asr_model_continued` this time) and call `first_asr_model_continued.restore_from()`.\n", - "\n", - "### After Training: Monitoring Progress and Changing Hyperparameters\n", - "We can now start Tensorboard to see how training went. Recall that WER stands for Word Error Rate and so the lower it is, the better." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "n_0y3stSXDX_" - }, - "source": [ - "try:\n", - " from google import colab\n", - " COLAB_ENV = True\n", - "except (ImportError, ModuleNotFoundError):\n", - " COLAB_ENV = False\n", - "\n", - "# Load the TensorBoard notebook extension\n", - "if COLAB_ENV:\n", - " %load_ext tensorboard\n", - " %tensorboard --logdir lightning_logs/\n", - "else:\n", - " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Z0h-BME7U8yb" - }, - "source": [ - "We could improve this model by playing with hyperparameters. We can look at the current hyperparameters with the following:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7kdQbpohXnEd" - }, - "source": [ - "print(params['model']['optim'])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sGZzRCvIW8kE" - }, - "source": [ - "Let's say we wanted to change the learning rate. To do so, we can create a `new_opt` dict and set our desired learning rate, then call `.setup_optimization()` with the new optimization parameters." 
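Before changing the optimizer settings, here is a minimal sketch of the checkpoint save/restore round trip mentioned above. The `.nemo` filename is a hypothetical placeholder, and `first_asr_model` / `nemo_asr` are the names already defined earlier in this notebook; this uses the class-level `restore_from` as one common pattern.

```python
# Save the full model (weights + config) into a single .nemo archive.
first_asr_model.save_to('first_asr_model.nemo')

# Later (e.g. in a fresh session), rebuild the model from that archive
# to continue training or to fine-tune it.
first_asr_model_continued = nemo_asr.models.EncDecCTCModel.restore_from('first_asr_model.nemo')
```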
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "AbigFKUtYgvn" - }, - "source": [ - "import copy\n", - "new_opt = copy.deepcopy(params['model']['optim'])\n", - "new_opt['lr'] = 0.001\n", - "first_asr_model.setup_optimization(optim_config=DictConfig(new_opt))\n", - "# And then you can invoke trainer.fit(first_asr_model)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D5Kwg8Cz-aaO" - }, - "source": [ - "## Inference\n", - "\n", - "Let's have a quick look at how one could run inference with NeMo's ASR model.\n", - "\n", - "First, ``EncDecCTCModel`` and its subclasses contain a handy ``transcribe`` method which can be used to simply obtain audio files' transcriptions. It also has batch_size argument to improve performance." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3FT0klSV268p" - }, - "source": [ - "paths2audio_files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'),\n", - " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen7-fmjd-b.wav'),\n", - " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen8-fmjd-b.wav'),\n", - " os.path.join(data_dir, 'an4/wav/an4_clstk/fkai/cen8-fkai-b.wav')]\n", - "print(first_asr_model.transcribe(paths2audio_files=paths2audio_files,\n", - " batch_size=4))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6FiCfLX0D7py" - }, - "source": [ - "Below is an example of a simple inference loop in pure PyTorch. It also shows how one can compute Word Error Rate (WER) metric between predictions and references." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7mP4r1Gx_Ilt" - }, - "source": [ - "# Bigger batch-size = bigger throughput\n", - "params['model']['validation_ds']['batch_size'] = 16\n", - "\n", - "# Setup the test data loader and make sure the model is on GPU\n", - "first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])\n", - "first_asr_model.cuda()\n", - "first_asr_model.eval()\n", - "\n", - "# We will be computing Word Error Rate (WER) metric between our hypothesis and predictions.\n", - "# WER is computed as numerator/denominator.\n", - "# We'll gather all the test batches' numerators and denominators.\n", - "wer_nums = []\n", - "wer_denoms = []\n", - "\n", - "# Loop over all test batches.\n", - "# Iterating over the model's `test_dataloader` will give us:\n", - "# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)\n", - "# See the AudioToCharDataset for more details.\n", - "for test_batch in first_asr_model.test_dataloader():\n", - " test_batch = [x.cuda() for x in test_batch]\n", - " targets = test_batch[2]\n", - " targets_lengths = test_batch[3] \n", - " log_probs, encoded_len, greedy_predictions = first_asr_model(\n", - " input_signal=test_batch[0], input_signal_length=test_batch[1]\n", - " )\n", - " # Notice the model has a helper object to compute WER\n", - " first_asr_model.wer.update(predictions=greedy_predictions, predictions_lengths=None, targets=targets, targets_lengths=targets_lengths)\n", - " _, wer_num, wer_denom = first_asr_model.wer.compute()\n", - " first_asr_model.wer.reset()\n", - " wer_nums.append(wer_num.detach().cpu().numpy())\n", - " wer_denoms.append(wer_denom.detach().cpu().numpy())\n", - "\n", - " # Release tensors from GPU memory\n", - " del test_batch, log_probs, targets, targets_lengths, encoded_len, greedy_predictions\n", - "\n", - "# We need to sum all numerators and denominators first. 
Then divide.\n", - "print(f\"WER = {sum(wer_nums)/sum(wer_denoms)}\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0kM9kBNOCptf" - }, - "source": [ - "This WER is not particularly impressive and could be significantly improved. You could train longer (try 100 epochs) to get a better number. Check out the next section on how to improve it further." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RBcJtg5ulR0H" - }, - "source": [ - "## Model Improvements\n", - "\n", - "You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. In this section, we'll briefly cover a few possibilities for improving an ASR model.\n", - "\n", - "### Data Augmentation\n", - "\n", - "There exist several ASR data augmentation methods that can increase the size of our training set.\n", - "\n", - "For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments (\"frequency masking\") or time segments (\"time masking\") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding in a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)\n", - "\n", - "Our toy model does not do spectrogram augmentation. But the real one we got from cloud does:" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "9glGogaPlR0H" - }, - "source": [ - "print(quartznet._cfg['spec_augment'])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LdwdcA_a640R" - }, - "source": [ - "If you want to enable SpecAugment in your model, make sure your .yaml config file contains 'model/spec_augment' section which looks like the one above." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2f142kIQc1Z2" - }, - "source": [ - "### Transfer learning\n", - "\n", - "Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to make it perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.\n", - "\n", - "In ASR you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or on accented speech. You can even transfer learn from one language to another! Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.\n", - "\n", - "Transfer learning with NeMo is simple. Let's demonstrate how the model we got from the cloud could be fine-tuned on AN4 data. (NOTE: this is a toy example). And, while we are at it, we will change model's vocabulary, just to demonstrate how it's done." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "hl320dsydWX0" - }, - "source": [ - "# Check what kind of vocabulary/alphabet the model has right now\n", - "print(quartznet.decoder.vocabulary)\n", - "\n", - "# Let's add \"!\" symbol there. Note that you can (and should!) 
change the vocabulary\n", - "# entirely when fine-tuning using a different language.\n", - "quartznet.change_vocabulary(\n", - " new_vocabulary=[\n", - " ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\n", - " 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', \"'\", \"!\"\n", - " ]\n", - ")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "M7lvmiMSd3Aw" - }, - "source": [ - "After this, our decoder has completely changed, but our encoder (which is where most of the weights are) remained intact. Let's fine tune-this model for 2 epochs on AN4 dataset. We will also use the smaller learning rate from ``new_opt` (see the \"After Training\" section)`." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "_PZJIso-eDl-" - }, - "source": [ - "# Use the smaller learning rate we set before\n", - "quartznet.setup_optimization(optim_config=DictConfig(new_opt))\n", - "\n", - "# Point to the data we'll use for fine-tuning as the training set\n", - "quartznet.setup_training_data(train_data_config=params['model']['train_ds'])\n", - "\n", - "# Point to the new validation data for fine-tuning\n", - "quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])\n", - "\n", - "# And now we can create a PyTorch Lightning trainer and call `fit` again.\n", - "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=2)\n", - "trainer.fit(quartznet)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VURa1NavlR0U" - }, - "source": [ - "### Fast Training\n", - "\n", - "Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.\n", - "\n", - "You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:\n", - "\n", - "```python\n", - "# Mixed precision:\n", - "trainer = pl.Trainer(amp_level='O1', precision=16)\n", - "\n", - "# Trainer with a distributed backend:\n", - "trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='ddp')\n", - "\n", - "# Of course, you can combine these flags as well.\n", - "```\n", - "\n", - "Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py) which can handle mixed precision and distributed training using command-line arguments." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d1ym8QT3jQnj" - }, - "source": [ - "### Deployment\n", - "\n", - "Note: It is recommended to run the deployment code from the NVIDIA PyTorch container.\n", - "\n", - "Let's get back to our pre-trained model and see how easy it can be exported to an ONNX file\n", - "in order to run it in an inference engine like TensorRT or ONNXRuntime.\n", - "\n", - "If you are running in an environment outside of the NVIDIA PyTorch container (like Google Colab for example) then you will have to build the onnxruntime and onnxruntime-gpu. The cell below gives an example of how to build those runtimes but the example may have to be adapted depending on your environment." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "I4WRcmakjQnj" - }, - "source": [ - "!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n", - "#!mkdir -p ort\n", - "#%cd ort\n", - "#!git clean -xfd\n", - "#!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n", - "#!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/#x86_64-linux-gnu --build_wheel\n", - "#!pip uninstall -y onnxruntime\n", - "#!pip uninstall -y onnxruntime-gpu\n", - "#!pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n", - "#%cd .." - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F9yO1BEbjQnm" - }, - "source": [ - "Then run" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "HZnyWxPyjQnm" - }, - "source": [ - "import json\n", - "import os\n", - "import tempfile\n", - "import onnxruntime\n", - "import torch\n", - "\n", - "import numpy as np\n", - "import nemo.collections.asr as nemo_asr\n", - "from nemo.collections.asr.data.audio_to_text import AudioToCharDataset\n", - "from nemo.collections.asr.metrics.wer import WER\n", - "\n", - "def to_numpy(tensor):\n", - " return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()\n", - "\n", - "def setup_transcribe_dataloader(cfg, vocabulary):\n", - " config = {\n", - " 'manifest_filepath': os.path.join(cfg['temp_dir'], 'manifest.json'),\n", - " 'sample_rate': 16000,\n", - " 'labels': vocabulary,\n", - " 'batch_size': min(cfg['batch_size'], len(cfg['paths2audio_files'])),\n", - " 'trim_silence': True,\n", - " 'shuffle': False,\n", - " }\n", - " dataset = AudioToCharDataset(\n", - " manifest_filepath=config['manifest_filepath'],\n", - " labels=config['labels'],\n", - " sample_rate=config['sample_rate'],\n", - " int_values=config.get('int_values', False),\n", - " augmentor=None,\n", - " max_duration=config.get('max_duration', None),\n", - " min_duration=config.get('min_duration', None),\n", - " max_utts=config.get('max_utts', 0),\n", - " blank_index=config.get('blank_index', -1),\n", - " unk_index=config.get('unk_index', -1),\n", - " normalize=config.get('normalize_transcripts', False),\n", - " trim=config.get('trim_silence', True),\n", - " parser=config.get('parser', 'en'),\n", - " )\n", - " return torch.utils.data.DataLoader(\n", - " dataset=dataset,\n", - " batch_size=config['batch_size'],\n", - " collate_fn=dataset.collate_fn,\n", - " drop_last=config.get('drop_last', False),\n", - " shuffle=False,\n", - " num_workers=config.get('num_workers', 0),\n", - " pin_memory=config.get('pin_memory', False),\n", - " )\n", - "\n", - "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")\n", - "\n", - "quartznet.export('qn.onnx')\n", - "\n", - "ort_session = onnxruntime.InferenceSession('qn.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])\n", - "\n", - "with tempfile.TemporaryDirectory() as tmpdir:\n", - " with open(os.path.join(tmpdir, 'manifest.json'), 'w') as fp:\n", - " for audio_file in files:\n", - " entry = {'audio_filepath': audio_file, 'duration': 100000, 'text': 'nothing'}\n", - " fp.write(json.dumps(entry) + '\\n')\n", - "\n", - " config = {'paths2audio_files': files, 'batch_size': 4, 'temp_dir': tmpdir}\n", - " temporary_datalayer = setup_transcribe_dataloader(config, quartznet.decoder.vocabulary)\n", - " for test_batch in 
temporary_datalayer:\n", - " processed_signal, processed_signal_len = quartznet.preprocessor(\n", - " input_signal=test_batch[0].to(quartznet.device), length=test_batch[1].to(quartznet.device)\n", - " )\n", - " ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(processed_signal),}\n", - " ologits = ort_session.run(None, ort_inputs)\n", - " alogits = np.asarray(ologits)\n", - " logits = torch.from_numpy(alogits[0])\n", - " greedy_predictions = logits.argmax(dim=-1, keepdim=False)\n", - " wer = WER(decoding=quartznet.decoding, use_cer=False)\n", - " hypotheses, _ = wer.decoding.ctc_decoder_predictions_tensor(greedy_predictions)\n", - " print(hypotheses)\n", - " break\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wteGqroafWg1" - }, - "source": [ - "## Under the Hood\n", - "\n", - "NeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.\n", - "\n", - "In particular, ``nemo_asr.model.EncDecCTCModel`` is an encoder-decoder model which is constructed using several ``Neural Modules`` taken from ``nemo_asr.modules.`` Here is what its forward pass looks like:\n", - "```python\n", - "def forward(self, input_signal, input_signal_length):\n", - " processed_signal, processed_signal_len = self.preprocessor(\n", - " input_signal=input_signal, length=input_signal_length,\n", - " )\n", - " # Spec augment is not applied during evaluation/testing\n", - " if self.spec_augmentation is not None and self.training:\n", - " processed_signal = self.spec_augmentation(input_spec=processed_signal)\n", - " encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)\n", - " log_probs = self.decoder(encoder_output=encoded)\n", - " greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)\n", - " return log_probs, encoded_len, greedy_predictions\n", - "```\n", - "Here:\n", - "\n", - "* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, which is a neural module that takes audio signal and converts it into a Mel-Spectrogram\n", - "* ``self.spec_augmentation`` - is a neural module of type ```nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation. \n", - "* ``self.encoder`` - is a convolutional Jasper/QuartzNet-like encoder of type ``nemo_asr.modules.ConvASREncoder``\n", - "* ``self.decoder`` - is a ``nemo_asr.modules.ConvASRDecoder`` which simply projects into the target alphabet (vocabulary).\n", - "\n", - "Also, ``EncDecCTCModel`` uses the audio dataset class ``nemo_asr.data.AudioToCharDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.\n", - "\n", - "You can use these and other neural modules (or create new ones yourself!) to construct new ASR models." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "smzlvbhelR0U" - }, - "source": [ - "# Further Reading/Watching:\n", - "\n", - "That's all for now! 
If you'd like to learn more about the topics covered in this tutorial, here are some resources that may interest you:\n", - "- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)\n", - "- [\"An Intuitive Explanation of Connectionist Temporal Classification\"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n", - "- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)\n", - "- [Listen Attend and Spell Paper (seq2seq ASR model)](https://arxiv.org/abs/1508.01211)\n", - "- [Explanation of the mel spectrogram in more depth](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0)\n", - "- [Jasper Paper](https://arxiv.org/abs/1904.03288)\n", - "- [QuartzNet paper](https://arxiv.org/abs/1910.10261)\n", - "- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)\n", - "- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)\n", - "- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)\n", - "- [Transfer Learning Blogpost](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/)" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "V3ERGX86lR0V" - }, - "source": [], - "execution_count": null, - "outputs": [] - } - ] + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "accelerator": "GPU", + "colab": { + "name": "ASR_with_NeMo.ipynb", + "provenance": [], + "collapsed_sections": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.7" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "lJz6FDU1lRzc" + }, + "source": [ + "\"\"\"\n", + "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", + "\n\nNOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell.\n", + "\n", + "## Install dependencies\n", + "!pip install wget\n", + "!apt-get install sox libsndfile1 ffmpeg\n", + "!pip install text-unidecode\n", + "!pip install matplotlib>=3.3.2\n", + "\n", + "## Install NeMo\n", + "BRANCH = 'r1.23.0'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "\"\"\"\n", + "Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. 
matplotlib)!\n", + "Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case\n", + "that you want to use the \"Run All Cells\" (or similar) option.\n", + "\"\"\"\n", + "# exit()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v1Jk9etFlRzf" + }, + "source": [ + "# Introduction to End-To-End Automatic Speech Recognition\n", + "\n", + "This notebook contains a basic tutorial of Automatic Speech Recognition (ASR) concepts, introduced with code snippets using the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", + "We will first introduce the basics of the main concepts behind speech recognition, then explore concrete examples of what the data looks like and walk through putting together a simple end-to-end ASR pipeline.\n", + "\n", + "We assume that you are familiar with general machine learning concepts and can follow Python code, and we'll be using the [AN4 dataset from CMU](http://www.speech.cs.cmu.edu/databases/an4/) (with processing using `sox`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YLln3U-IlRzg" + }, + "source": [ + "## Conceptual Overview: What is ASR?\n", + "\n", + "ASR, or **Automatic Speech Recognition**, refers to the problem of getting a program to automatically transcribe spoken language (speech-to-text). Our goal is usually to have a model that minimizes the **Word Error Rate (WER)** metric when transcribing speech input. In other words, given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with as few errors as possible?\n", + "\n", + "Traditional speech recognition takes a generative approach, modeling the full pipeline of how speech sounds are produced in order to evaluate a speech sample. We would start from a **language model** that encapsulates the most likely orderings of words that are generated (e.g. an n-gram model), to a **pronunciation model** for each word in that ordering (e.g. a pronunciation table), to an **acoustic model** that translates those pronunciations to audio waveforms (e.g. a Gaussian Mixture Model).\n", + "\n", + "Then, if we receive some spoken input, our goal would be to find the most likely sequence of text that would result in the given audio according to our generative pipeline of models. Overall, with traditional speech recognition, we try to model `Pr(audio|transcript)*Pr(transcript)`, and take the argmax of this over possible transcripts.\n", + "\n", + "Over time, neural nets advanced to the point where each component of the traditional speech recognition model could be replaced by a neural model that had better performance and that had a greater potential for generalization. For example, we could replace an n-gram model with a neural language model, and replace a pronunciation table with a neural pronunciation model, and so on. However, each of these neural models need to be trained individually on different tasks, and errors in any model in the pipeline could throw off the whole prediction.\n", + "\n", + "Thus, we can see the appeal of **end-to-end ASR architectures**: discriminative models that simply take an audio input and give a textual output, and in which all components of the architecture are trained together towards the same goal. The model's encoder would be akin to an acoustic model for extracting speech features, which can then be directly piped to a decoder which outputs text. 
If desired, we could integrate a language model that would improve our predictions, as well.\n", + "\n", + "And the entire end-to-end ASR model can be trained at once--a much easier pipeline to handle! " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0S5iZPMSlRzg" + }, + "source": [ + "### End-To-End ASR\n", + "\n", + "With an end-to-end model, we want to directly learn `Pr(transcript|audio)` in order to predict the transcripts from the original audio. Since we are dealing with sequential information--audio data over time that corresponds to a sequence of letters--RNNs are the obvious choice. But now we have a pressing problem to deal with: since our input sequence (number of audio timesteps) is not the same length as our desired output (transcript length), how do we match each time step from the audio data to the correct output characters?\n", + "\n", + "Earlier speech recognition approaches relied on **temporally-aligned data**, in which each segment of time in an audio file was matched up to a corresponding speech sound such as a phoneme or word. However, if we would like to have the flexibility to predict letter-by-letter to prevent OOV (out of vocabulary) issues, then each time step in the data would have to be labeled with the letter sound that the speaker is making at that point in the audio file. With that information, it seems like we should simply be able to try to predict the correct letter for each time step and then collapse the repeated letters (e.g. the prediction output `LLLAAAAPPTOOOPPPP` would become `LAPTOP`). It turns out that this idea has some problems: not only does alignment make the dataset incredibly labor-intensive to label, but also, what do we do with words like \"book\" that contain consecutive repeated letters? Simply squashing repeated letters together would not work in that case!\n", + "\n", + "![Alignment example](https://raw.githubusercontent.com/NVIDIA/NeMo/stable/tutorials/asr/images/alignment_example.png)\n", + "\n", + "Modern end-to-end approaches get around this using methods that don't require manual alignment at all, so that the input-output pairs are really just the raw audio and the transcript--no extra data or labeling required. Let's briefly go over two popular approaches that allow us to do this, Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention.\n", + "\n", + "#### Connectionist Temporal Classification (CTC)\n", + "\n", + "In normal speech recognition prediction output, we would expect to have characters such as the letters from A through Z, numbers 0 through 9, spaces (\"\\_\"), and so on. CTC introduces a new intermediate output token called the **blank token** (\"-\") that is useful for getting around the alignment issue.\n", + "\n", + "With CTC, we still predict one token per time segment of speech, but we use the blank token to figure out where we can and can't collapse the predictions. The appearance of a blank token helps separate repeating letters that should not be collapsed. For instance, with an audio snippet segmented into `T=11` time steps, we could get predictions that look like `BOO-OOO--KK`, which would then collapse to `\"BO-O-K\"`, and then we would remove the blank tokens to get our final output, `BOOK`.\n", + "\n", + "Now, we can predict one output token per time step, then collapse and clean to get sensible output without any fear of ambiguity from repeating letters! 
A simple way of getting predictions like this would be to apply a bidirectional RNN to the audio input, apply softmax over each time step's output, and then take the token with the highest probability. The method of always taking the best token at each time step is called **greedy decoding, or max decoding**.\n", + "\n", + "To calculate our loss for backprop, we would like to know the log probability of the model producing the correct transcript, `log(Pr(transcript|audio))`. We can get the log probability of a single intermediate output sequence (e.g. `BOO-OOO--KK`) by summing over the log probabilities we get from each token's softmax value, but note that the resulting sum is different from the log probability of the transcript itself (`BOOK`). This is because there are multiple possible output sequences of the same length that can be collapsed to get the same transcript (e.g. `BBO--OO-KKK` also results in `BOOK`), and so we need to **marginalize over every valid sequence of length `T` that collapses to the transcript**.\n", + "\n", + "Therefore, to get our transcript's log probability given our audio input, we must sum the log probabilities of every sequence of length `T` that collapses to the transcript (e.g. `log(Pr(output: \"BOOK\"|audio)) = log(Pr(BOO-OOO--KK|audio)) + log(Pr(BBO--OO-KKK|audio)) + ...`). In practice, we can use a dynamic programming approach to calculate this, accumulating our log probabilities over different \"paths\" through the softmax outputs at each time step.\n", + "\n", + "If you would like a more in-depth explanation of how CTC works, or how we can improve our results by using a modified beam search algorithm, feel free to check out the Further Reading section at the end of this notebook for more resources.\n", + "\n", + "#### Sequence-to-Sequence with Attention\n", + "\n", + "One problem with CTC is that predictions at different time steps are conditionally independent, which is an issue because the words in a continuous utterance tend to be related to each other in some sensible way. With this conditional independence assumption, we can't learn a language model that can represent such dependencies, though we can add a language model on top of the CTC output to mitigate this to some degree.\n", + "\n", + "A popular alternative is to use a sequence-to-sequence model with attention. A typical seq2seq model for ASR consists of some sort of **bidirectional RNN encoder** that consumes the audio sequence timestep-by-timestep, and where the outputs are then passed to an **attention-based decoder**. Each prediction from the decoder is based on attending to some parts of the entire encoded input, as well as the previously outputted tokens.\n", + "\n", + "The outputs of the decoder can be anything from word pieces to phonemes to letters, and since predictions are not directly tied to time steps of the input, we can just continue producing tokens one-by-one until an end token is given (or we reach a specified max output length). This way, we do not need to deal with audio alignment, and our predicted transcript is just the sequence of outputs given by our decoder.\n", + "\n", + "Now that we have an idea of what some popular end-to-end ASR models look like, let's take a look at the audio data we'll be working with for our example." 
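+    "\n",
+    "Before we move on, here is a tiny illustration of the greedy CTC collapse rule described above. This is only a sketch of the decoding rule in plain Python (not NeMo code), using the toy token strings from this section:\n",
+    "```python\n",
+    "def ctc_collapse(tokens, blank=\"-\"):\n",
+    "    # First merge runs of repeated tokens, then drop the blank tokens.\n",
+    "    merged = []\n",
+    "    for t in tokens:\n",
+    "        if not merged or t != merged[-1]:\n",
+    "            merged.append(t)\n",
+    "    return \"\".join(t for t in merged if t != blank)\n",
+    "\n",
+    "print(ctc_collapse(\"BOO-OOO--KK\"))  # -> BOOK\n",
+    "print(ctc_collapse(\"BBO--OO-KKK\"))  # -> BOOK\n",
+    "```"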
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "38aYTCTIlRzh" + }, + "source": [ + "## Taking a Look at Our Data (AN4)\n", + "\n", + "The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.\n", + "\n", + "Before we get started, let's download and prepare the dataset. The utterances are available as `.sph` files, so we will need to convert them to `.wav` for later processing. If you are not using Google Colab, please make sure you have [Sox](http://sox.sourceforge.net/) installed for this step--see the \"Downloads\" section of the linked Sox homepage. (If you are using Google Colab, Sox should have already been installed in the setup cell at the beginning.)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gAhsmi6HlRzh" + }, + "source": [ + "import os\n", + "# This is where the an4/ directory will be placed.\n", + "# Change this if you don't want the data to be extracted in the current directory.\n", + "data_dir = '.'\n", + "\n", + "if not os.path.exists(data_dir):\n", + " os.makedirs(data_dir)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "Yb4fuUvWlRzk", + "scrolled": true + }, + "source": [ + "import glob\n", + "import os\n", + "import subprocess\n", + "import tarfile\n", + "import wget\n", + "\n", + "# Download the dataset. This will take a few moments...\n", + "print(\"******\")\n", + "if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):\n", + " an4_url = 'https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz'\n", + " an4_path = wget.download(an4_url, data_dir)\n", + " print(f\"Dataset downloaded at: {an4_path}\")\n", + "else:\n", + " print(\"Tarfile already exists.\")\n", + " an4_path = data_dir + '/an4_sphere.tar.gz'\n", + "\n", + "if not os.path.exists(data_dir + '/an4/'):\n", + " # Untar and convert .sph to .wav (using sox)\n", + " tar = tarfile.open(an4_path)\n", + " tar.extractall(path=data_dir)\n", + "\n", + " print(\"Converting .sph to .wav...\")\n", + " sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)\n", + " for sph_path in sph_list:\n", + " wav_path = sph_path[:-4] + '.wav'\n", + " cmd = [\"sox\", sph_path, wav_path]\n", + " subprocess.run(cmd)\n", + "print(\"Finished conversion.\\n******\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m_LFeM0elRzm" + }, + "source": [ + "You should now have a folder called `an4` that contains `etc/an4_train.transcription`, `etc/an4_test.transcription`, audio files in `wav/an4_clstk` and `wav/an4test_clstk`, along with some other files we will not need.\n", + "\n", + "Now we can load and take a look at the data. As an example, file `cen2-mgah-b.wav` is a 2.6 second-long audio recording of a man saying the letters \"G L E N N\" one-by-one. 
To confirm this, we can listen to the file:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "_M_bSs3MjQlz" + }, + "source": [ + "import librosa\n", + "import IPython.display as ipd\n", + "\n", + "# Load and listen to the audio file\n", + "example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'\n", + "audio, sample_rate = librosa.load(example_file)\n", + "\n", + "ipd.Audio(example_file, rate=sample_rate)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZyElgPVjQl5" + }, + "source": [ + "In an ASR task, if this WAV file was our input, then \"G L E N N\" would be our desired output.\n", + "\n", + "Let's plot the waveform, which is simply a line plot of the sequence of values that we read from the file. This is a format of viewing audio that you are likely to be familiar with seeing in many audio editors and visualizers:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "MqIAKkqelRzm" + }, + "source": [ + "%matplotlib inline\n", + "import librosa.display\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# Plot our example audio file's waveform\n", + "plt.rcParams['figure.figsize'] = (15,7)\n", + "plt.title('Waveform of Audio Example')\n", + "plt.ylabel('Amplitude')\n", + "\n", + "_ = librosa.display.waveshow(audio, color='blue')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gg6RR_yolRzo" + }, + "source": [ + "We can see the activity in the waveform that corresponds to each letter in the audio, as our speaker here enunciates quite clearly!\n", + "You can kind of tell that each spoken letter has a different \"shape,\" and it's interesting to note that last two blobs look relatively similar, which is expected because they are both the letter \"N.\"\n", + "\n", + "### Spectrograms and Mel Spectrograms\n", + "\n", + "However, since audio information is more useful in the context of frequencies of sound over time, we can get a better representation than this raw sequence of 57,330 values.\n", + "We can apply a [Fourier Transform](https://en.wikipedia.org/wiki/Fourier_transform) on our audio signal to get something more useful: a **spectrogram**, which is a representation of the energy levels (i.e. amplitude, or \"loudness\") of each frequency (i.e. pitch) of the signal over the duration of the file.\n", + "A spectrogram (which can be viewed as a heat map) is a good way of seeing how the *strengths of various frequencies in the audio vary over time*, and is obtained by breaking up the signal into smaller, usually overlapping chunks and performing a Short-Time Fourier Transform (STFT) on each.\n", + "\n", + "Let's examine what the spectrogram of our sample looks like." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "oCFneEs1lRzp" + }, + "source": [ + "import numpy as np\n", + "\n", + "# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)\n", + "spec = np.abs(librosa.stft(audio))\n", + "spec_db = librosa.amplitude_to_db(spec, ref=np.max) # Decibels\n", + "\n", + "# Use log scale to view frequencies\n", + "librosa.display.specshow(spec_db, y_axis='log', x_axis='time')\n", + "plt.colorbar()\n", + "plt.title('Audio Spectrogram');" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9OPc4tcalRzs" + }, + "source": [ + "Again, we are able to see each letter being pronounced, and that the last two blobs that correspond to the \"N\"s are pretty similar-looking. 
But how do we interpret these shapes and colors? Just as in the waveform plot before, we see time passing on the x-axis (all 2.6s of audio). But now, the y-axis represents different frequencies (on a log scale), and *the color on the plot shows the strength of a frequency at a particular point in time*.\n",
+    "\n",
+    "We're still not done yet, as we can make one more potentially useful tweak: using the **Mel Spectrogram** instead of the normal spectrogram. This is simply a change in the frequency scale that we use from linear (or logarithmic) to the mel scale, which is \"a perceptual scale of pitches judged by listeners to be equal in distance from one another\" (from [Wikipedia](https://en.wikipedia.org/wiki/Mel_scale)).\n",
+    "\n",
+    "In other words, it's a transformation of the frequencies to be more aligned to what humans perceive; a change of +1000Hz from 2000Hz->3000Hz sounds like a larger difference to us than 9000Hz->10000Hz does, so the mel scale normalizes this such that equal distances sound like equal differences to the human ear. Intuitively, we use the mel spectrogram because in this case we are processing and transcribing human speech, such that transforming the scale to better match what we hear is a useful procedure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "id": "7yQXVn-TlRzt"
+   },
+   "source": [
+    "# Plot the mel spectrogram of our sample\n",
+    "mel_spec = librosa.feature.melspectrogram(y=audio, sr=sample_rate)\n",
+    "mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)\n",
+    "\n",
+    "librosa.display.specshow(\n",
+    "    mel_spec_db, x_axis='time', y_axis='mel')\n",
+    "plt.colorbar()\n",
+    "plt.title('Mel Spectrogram');"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "RSCyVizDlRz1"
+   },
+   "source": [
+    "## Convolutional ASR Models\n",
+    "\n",
+    "Let's take a look at the model that we will be building, and how we specify its parameters.\n",
+    "\n",
+    "### The Jasper Model\n",
+    "\n",
+    "We will be training a small [Jasper (Just Another SPeech Recognizer) model](https://arxiv.org/abs/1904.03288) from scratch (i.e. initialized randomly). \n",
+    "In brief, Jasper architectures consist of a repeated block structure that utilizes 1D convolutions.\n",
+    "In a Jasper_KxR model, `R` sub-blocks (consisting of a 1D convolution, batch norm, ReLU, and dropout) are grouped into a single block, which is then repeated `K` times.\n",
+    "We also have one extra block at the beginning and a few more at the end that are invariant to `K` and `R`, and we use CTC loss.\n",
+    "\n",
+    "### The QuartzNet Model\n",
+    "\n",
+    "QuartzNet is a better variant of Jasper; the key difference is that it uses time-channel separable 1D convolutions. This allows it to dramatically reduce the number of weights while keeping similar accuracy.\n",
+    "\n",
+    "Jasper/QuartzNet models look like this (a QuartzNet model is pictured):\n",
+    "\n",
+    "![QuartzNet with CTC](https://developer.nvidia.com/blog/wp-content/uploads/2020/05/quartznet-model-architecture-1-625x742.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "gEpNci7slRzw"
+   },
+   "source": [
+    "# Using NeMo for Automatic Speech Recognition\n",
+    "\n",
+    "Now that we have an idea of what ASR is and what the audio data looks like, we can start using NeMo to do some ASR!\n",
+    "\n",
+    "We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. 
To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n", + "\n", + "NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "4_W0lhaQlRzx" + }, + "source": [ + "# NeMo's \"core\" package\n", + "import nemo\n", + "# NeMo's ASR collection - this collections contains complete ASR models and\n", + "# building blocks (modules) for ASR\n", + "import nemo.collections.asr as nemo_asr" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v_W8EbYktZE3" + }, + "source": [ + "## Using an Out-of-the-Box Model\n", + "\n", + "NeMo's ASR collection comes with many building blocks and even complete models that we can use for training and evaluation. Moreover, several models come with pre-trained weights. Let's instantiate a complete QuartzNet15x5 model." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KFZZpYult96G" + }, + "source": [ + "# This line will download pre-trained QuartzNet15x5 model from NVIDIA's NGC cloud and instantiate it for you\n", + "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KucxoFJhum0i" + }, + "source": [ + "Next, we'll simply add paths to files we want to transcribe into the list and pass it to our model. Note that it will work for relatively short (<25 seconds) files. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3QCpR_93u1hp" + }, + "source": [ + "files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav')]\n", + "for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):\n", + " print(f\"Audio in {fname} was recognized as: {transcription}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ppUm_kuavm_f" + }, + "source": [ + "That was easy! But there are plenty of scenarios where you would want to fine-tune the model on your own data or even train from scratch. For example, this out-of-the box model will obviously not work for Spanish and would likely perform poorly for telephone audio. So if you have collected your own data, you certainly should attempt to fine-tune or train on it!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ABUDaC5Js7AW" + }, + "source": [ + "## Training from Scratch\n", + "\n", + "To train from scratch, you need to prepare your training data in the right format and specify your models architecture." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RdNyw1b_zgtm" + }, + "source": [ + "### Creating Data Manifests\n", + "\n", + "The first thing we need to do now is to create manifests for our training and evaluation data, which will contain the metadata of our audio files. NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. 
A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.\n", + "\n", + "Here's an example of what one line in a NeMo-compatible manifest might look like:\n", + "```\n", + "{\"audio_filepath\": \"path/to/audio.wav\", \"duration\": 3.45, \"text\": \"this is a nemo tutorial\"}\n", + "```\n", + "\n", + "We can build our training and evaluation manifests using `an4/etc/an4_train.transcription` and `an4/etc/an4_test.transcription`, which have lines containing transcripts and their corresponding audio file IDs:\n", + "```\n", + "...\n", + " P I T T S B U R G H (cen5-fash-b)\n", + " TWO SIX EIGHT FOUR FOUR ONE EIGHT (cen7-fash-b)\n", + "...\n", + "```" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "lVB1sG1GlRzz" + }, + "source": [ + "# --- Building Manifest Files --- #\n", + "import json\n", + "\n", + "# Function to build a manifest\n", + "def build_manifest(transcripts_path, manifest_path, wav_path):\n", + " with open(transcripts_path, 'r') as fin:\n", + " with open(manifest_path, 'w') as fout:\n", + " for line in fin:\n", + " # Lines look like this:\n", + " # transcript (fileID)\n", + " transcript = line[: line.find('(')-1].lower()\n", + " transcript = transcript.replace('', '').replace('', '')\n", + " transcript = transcript.strip()\n", + "\n", + " file_id = line[line.find('(')+1 : -2] # e.g. \"cen4-fash-b\"\n", + " audio_path = os.path.join(\n", + " data_dir, wav_path,\n", + " file_id[file_id.find('-')+1 : file_id.rfind('-')],\n", + " file_id + '.wav')\n", + "\n", + " duration = librosa.core.get_duration(filename=audio_path)\n", + "\n", + " # Write the metadata to the manifest\n", + " metadata = {\n", + " \"audio_filepath\": audio_path,\n", + " \"duration\": duration,\n", + " \"text\": transcript\n", + " }\n", + " json.dump(metadata, fout)\n", + " fout.write('\\n')\n", + " \n", + "# Building Manifests\n", + "print(\"******\")\n", + "train_transcripts = data_dir + '/an4/etc/an4_train.transcription'\n", + "train_manifest = data_dir + '/an4/train_manifest.json'\n", + "if not os.path.isfile(train_manifest):\n", + " build_manifest(train_transcripts, train_manifest, 'an4/wav/an4_clstk')\n", + " print(\"Training manifest created.\")\n", + "\n", + "test_transcripts = data_dir + '/an4/etc/an4_test.transcription'\n", + "test_manifest = data_dir + '/an4/test_manifest.json'\n", + "if not os.path.isfile(test_manifest):\n", + " build_manifest(test_transcripts, test_manifest, 'an4/wav/an4test_clstk')\n", + " print(\"Test manifest created.\")\n", + "print(\"***Done***\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W2fShQzRzo-M" + }, + "source": [ + "### Specifying Our Model with a YAML Config File\n", + "\n", + "For this tutorial, we'll build a *Jasper_4x1 model*, with `K=4` blocks of single (`R=1`) sub-blocks and a *greedy CTC decoder*, using the configuration found in `./configs/config.yaml`.\n", + "\n", + "If we open up this config file, we find model section which describes architecture of our model. A model contains an entry labeled `encoder`, with a field called `jasper` that contains a list with multiple entries. 
Each of the members in this list specifies one block in our model, and looks something like this:\n",
+    "```\n",
+    "- filters: 128\n",
+    "  repeat: 1\n",
+    "  kernel: [11]\n",
+    "  stride: [2]\n",
+    "  dilation: [1]\n",
+    "  dropout: 0.2\n",
+    "  residual: false\n",
+    "  separable: true\n",
+    "  se: true\n",
+    "  se_context_size: -1\n",
+    "```\n",
+    "The first member of the list corresponds to the first block in the Jasper architecture diagram, which appears regardless of `K` and `R`.\n",
+    "Next, we have four entries that correspond to the `K=4` blocks, and each has `repeat: 1` since we are using `R=1`.\n",
+    "These are followed by two more entries for the blocks that appear at the end of our Jasper model before the CTC loss.\n",
+    "\n",
+    "There are also some entries at the top of the file that specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.\n",
+    "\n",
+    "Using a YAML config such as this is helpful for getting a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "id": "PXVKBniMlRz5"
+   },
+   "source": [
+    "# --- Config Information ---#\n",
+    "try:\n",
+    "    from ruamel.yaml import YAML\n",
+    "except ModuleNotFoundError:\n",
+    "    from ruamel_yaml import YAML\n",
+    "config_path = './configs/config.yaml'\n",
+    "\n",
+    "if not os.path.exists(config_path):\n",
+    "    # Grab the config we'll use in this example\n",
+    "    BRANCH = 'r1.23.0'\n",
+    "    !mkdir configs\n",
+    "    !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml\n",
+    "\n",
+    "yaml = YAML(typ='safe')\n",
+    "with open(config_path) as f:\n",
+    "    params = yaml.load(f)\n",
+    "print(params)"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wUmq3p2Aw_5N"
+   },
+   "source": [
+    "### Training with PyTorch Lightning\n",
+    "\n",
+    "NeMo models and modules can be used in any PyTorch code where torch.nn.Module is expected.\n",
+    "\n",
+    "However, NeMo's models are based on [PyTorch Lightning's](https://github.com/PyTorchLightning/pytorch-lightning) LightningModule, and we recommend you use PyTorch Lightning for training and fine-tuning, as it makes using mixed precision and distributed training very easy. So to start, let's create a Trainer instance for training on a GPU for 50 epochs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "id": "GUfR6tAK0k2u"
+   },
+   "source": [
+    "import pytorch_lightning as pl\n",
+    "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=50)"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "IEn2RyvgxxvO"
+   },
+   "source": [
+    "Next, we instantiate an ASR model based on our ``config.yaml`` file from the previous section.\n",
+    "Note that this is a stage during which we also tell the model where our training and validation manifests are."
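+    "\n",
+    "The next cell only overrides the manifest paths, but you can edit other fields of the loaded `params` dictionary in the same way before the model is constructed. For example (a sketch; the keys come from the example `config.yaml` downloaded above, and the values are only illustrative):\n",
+    "```python\n",
+    "# Optional tweaks applied to the config dictionary before building the model\n",
+    "params['model']['train_ds']['batch_size'] = 32       # larger batches if GPU memory allows\n",
+    "params['model']['validation_ds']['batch_size'] = 32\n",
+    "params['model']['optim']['lr'] = 0.01                # starting learning rate\n",
+    "```"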
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Cbf0fsMK09lk" + }, + "source": [ + "from omegaconf import DictConfig\n", + "params['model']['train_ds']['manifest_filepath'] = train_manifest\n", + "params['model']['validation_ds']['manifest_filepath'] = test_manifest\n", + "first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(params['model']), trainer=trainer)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hWtzwL5qXTYq" + }, + "source": [ + "With that, we can start training with just one line!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "inRJsnrz1psq" + }, + "source": [ + "# Start training!!!\n", + "trainer.fit(first_asr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jpYXX-GslR0E" + }, + "source": [ + "There we go! We've put together a full training pipeline for the model and trained it for 50 epochs.\n", + "\n", + "If you'd like to save this model checkpoint for loading later (e.g. for fine-tuning, or for continuing training), you can simply call `first_asr_model.save_to()`. Then, to restore your weights, you can rebuild the model using the config (let's say you call it `first_asr_model_continued` this time) and call `first_asr_model_continued.restore_from()`.\n", + "\n", + "### After Training: Monitoring Progress and Changing Hyperparameters\n", + "We can now start Tensorboard to see how training went. Recall that WER stands for Word Error Rate and so the lower it is, the better." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "n_0y3stSXDX_" + }, + "source": [ + "try:\n", + " from google import colab\n", + " COLAB_ENV = True\n", + "except (ImportError, ModuleNotFoundError):\n", + " COLAB_ENV = False\n", + "\n", + "# Load the TensorBoard notebook extension\n", + "if COLAB_ENV:\n", + " %load_ext tensorboard\n", + " %tensorboard --logdir lightning_logs/\n", + "else:\n", + " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z0h-BME7U8yb" + }, + "source": [ + "We could improve this model by playing with hyperparameters. We can look at the current hyperparameters with the following:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7kdQbpohXnEd" + }, + "source": [ + "print(params['model']['optim'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sGZzRCvIW8kE" + }, + "source": [ + "Let's say we wanted to change the learning rate. To do so, we can create a `new_opt` dict and set our desired learning rate, then call `.setup_optimization()` with the new optimization parameters." 
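+    "\n",
+    "As a quick aside before we do that: the checkpoint round trip mentioned earlier can be as simple as the sketch below (the `.nemo` filename is hypothetical):\n",
+    "```python\n",
+    "# Save the weights and config to a single .nemo file, then restore it later.\n",
+    "first_asr_model.save_to(\"first_asr_model.nemo\")\n",
+    "first_asr_model_continued = nemo_asr.models.EncDecCTCModel.restore_from(\"first_asr_model.nemo\")\n",
+    "```\n",
+    "\n",
+    "Now, back to the learning rate:"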
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "AbigFKUtYgvn" + }, + "source": [ + "import copy\n", + "new_opt = copy.deepcopy(params['model']['optim'])\n", + "new_opt['lr'] = 0.001\n", + "first_asr_model.setup_optimization(optim_config=DictConfig(new_opt))\n", + "# And then you can invoke trainer.fit(first_asr_model)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D5Kwg8Cz-aaO" + }, + "source": [ + "## Inference\n", + "\n", + "Let's have a quick look at how one could run inference with NeMo's ASR model.\n", + "\n", + "First, ``EncDecCTCModel`` and its subclasses contain a handy ``transcribe`` method which can be used to simply obtain audio files' transcriptions. It also has batch_size argument to improve performance." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3FT0klSV268p" + }, + "source": [ + "paths2audio_files = [os.path.join(data_dir, 'an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'),\n", + " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen7-fmjd-b.wav'),\n", + " os.path.join(data_dir, 'an4/wav/an4_clstk/fmjd/cen8-fmjd-b.wav'),\n", + " os.path.join(data_dir, 'an4/wav/an4_clstk/fkai/cen8-fkai-b.wav')]\n", + "print(first_asr_model.transcribe(paths2audio_files=paths2audio_files,\n", + " batch_size=4))" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6FiCfLX0D7py" + }, + "source": [ + "Below is an example of a simple inference loop in pure PyTorch. It also shows how one can compute Word Error Rate (WER) metric between predictions and references." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7mP4r1Gx_Ilt" + }, + "source": [ + "# Bigger batch-size = bigger throughput\n", + "params['model']['validation_ds']['batch_size'] = 16\n", + "\n", + "# Setup the test data loader and make sure the model is on GPU\n", + "first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])\n", + "first_asr_model.cuda()\n", + "first_asr_model.eval()\n", + "\n", + "# We will be computing Word Error Rate (WER) metric between our hypothesis and predictions.\n", + "# WER is computed as numerator/denominator.\n", + "# We'll gather all the test batches' numerators and denominators.\n", + "wer_nums = []\n", + "wer_denoms = []\n", + "\n", + "# Loop over all test batches.\n", + "# Iterating over the model's `test_dataloader` will give us:\n", + "# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)\n", + "# See the AudioToCharDataset for more details.\n", + "for test_batch in first_asr_model.test_dataloader():\n", + " test_batch = [x.cuda() for x in test_batch]\n", + " targets = test_batch[2]\n", + " targets_lengths = test_batch[3] \n", + " log_probs, encoded_len, greedy_predictions = first_asr_model(\n", + " input_signal=test_batch[0], input_signal_length=test_batch[1]\n", + " )\n", + " # Notice the model has a helper object to compute WER\n", + " first_asr_model.wer.update(predictions=greedy_predictions, predictions_lengths=None, targets=targets, targets_lengths=targets_lengths)\n", + " _, wer_num, wer_denom = first_asr_model.wer.compute()\n", + " first_asr_model.wer.reset()\n", + " wer_nums.append(wer_num.detach().cpu().numpy())\n", + " wer_denoms.append(wer_denom.detach().cpu().numpy())\n", + "\n", + " # Release tensors from GPU memory\n", + " del test_batch, log_probs, targets, targets_lengths, encoded_len, greedy_predictions\n", + "\n", + "# We need to sum all numerators and denominators first. 
Then divide.\n", + "print(f\"WER = {sum(wer_nums)/sum(wer_denoms)}\")" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0kM9kBNOCptf" + }, + "source": [ + "This WER is not particularly impressive and could be significantly improved. You could train longer (try 100 epochs) to get a better number. Check out the next section on how to improve it further." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RBcJtg5ulR0H" + }, + "source": [ + "## Model Improvements\n", + "\n", + "You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. In this section, we'll briefly cover a few possibilities for improving an ASR model.\n", + "\n", + "### Data Augmentation\n", + "\n", + "There exist several ASR data augmentation methods that can increase the size of our training set.\n", + "\n", + "For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments (\"frequency masking\") or time segments (\"time masking\") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding in a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)\n", + "\n", + "Our toy model does not do spectrogram augmentation. But the real one we got from cloud does:" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9glGogaPlR0H" + }, + "source": [ + "print(quartznet._cfg['spec_augment'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LdwdcA_a640R" + }, + "source": [ + "If you want to enable SpecAugment in your model, make sure your .yaml config file contains 'model/spec_augment' section which looks like the one above." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2f142kIQc1Z2" + }, + "source": [ + "### Transfer learning\n", + "\n", + "Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to make it perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.\n", + "\n", + "In ASR you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or on accented speech. You can even transfer learn from one language to another! Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.\n", + "\n", + "Transfer learning with NeMo is simple. Let's demonstrate how the model we got from the cloud could be fine-tuned on AN4 data. (NOTE: this is a toy example). And, while we are at it, we will change model's vocabulary, just to demonstrate how it's done." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "hl320dsydWX0" + }, + "source": [ + "# Check what kind of vocabulary/alphabet the model has right now\n", + "print(quartznet.decoder.vocabulary)\n", + "\n", + "# Let's add \"!\" symbol there. Note that you can (and should!) 
change the vocabulary\n",
+    "# entirely when fine-tuning using a different language.\n",
+    "quartznet.change_vocabulary(\n",
+    "    new_vocabulary=[\n",
+    "        ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',\n",
+    "        'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', \"'\", \"!\"\n",
+    "    ]\n",
+    ")"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "M7lvmiMSd3Aw"
+   },
+   "source": [
+    "After this, our decoder has completely changed, but our encoder (which is where most of the weights are) remained intact. Let's fine-tune this model for 2 epochs on the AN4 dataset. We will also use the smaller learning rate from ``new_opt`` (see the \"After Training\" section)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "id": "_PZJIso-eDl-"
+   },
+   "source": [
+    "# Use the smaller learning rate we set before\n",
+    "quartznet.setup_optimization(optim_config=DictConfig(new_opt))\n",
+    "\n",
+    "# Point to the data we'll use for fine-tuning as the training set\n",
+    "quartznet.setup_training_data(train_data_config=params['model']['train_ds'])\n",
+    "\n",
+    "# Point to the new validation data for fine-tuning\n",
+    "quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])\n",
+    "\n",
+    "# And now we can create a PyTorch Lightning trainer and call `fit` again.\n",
+    "trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=2)\n",
+    "trainer.fit(quartznet)"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "VURa1NavlR0U"
+   },
+   "source": [
+    "### Fast Training\n",
+    "\n",
+    "Last but not least, we could simply speed up the training of our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.\n",
+    "\n",
+    "You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:\n",
+    "\n",
+    "```python\n",
+    "# Mixed precision:\n",
+    "trainer = pl.Trainer(amp_level='O1', precision=16)\n",
+    "\n",
+    "# Trainer with a distributed backend:\n",
+    "trainer = pl.Trainer(devices=2, num_nodes=2, accelerator='gpu', strategy='ddp')\n",
+    "\n",
+    "# Of course, you can combine these flags as well.\n",
+    "```\n",
+    "\n",
+    "Finally, have a look at [example scripts in the NeMo repository](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_ctc/speech_to_text_ctc.py) which can handle mixed precision and distributed training using command-line arguments."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "d1ym8QT3jQnj"
+   },
+   "source": [
+    "### Deployment\n",
+    "\n",
+    "Note: It is recommended to run the deployment code from the NVIDIA PyTorch container.\n",
+    "\n",
+    "Let's get back to our pre-trained model and see how easily it can be exported to an ONNX file\n",
+    "in order to run it in an inference engine like TensorRT or ONNXRuntime.\n",
+    "\n",
+    "If you are running in an environment outside of the NVIDIA PyTorch container (like Google Colab, for example), then you will have to build the onnxruntime and onnxruntime-gpu packages. The cell below gives an example of how to build those runtimes, but the example may have to be adapted depending on your environment."
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "I4WRcmakjQnj" + }, + "source": [ + "!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n", + "#!mkdir -p ort\n", + "#%cd ort\n", + "#!git clean -xfd\n", + "#!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n", + "#!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/#x86_64-linux-gnu --build_wheel\n", + "#!pip uninstall -y onnxruntime\n", + "#!pip uninstall -y onnxruntime-gpu\n", + "#!pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n", + "#%cd .." + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F9yO1BEbjQnm" + }, + "source": [ + "Then run" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HZnyWxPyjQnm" + }, + "source": [ + "import json\n", + "import os\n", + "import tempfile\n", + "import onnxruntime\n", + "import torch\n", + "\n", + "import numpy as np\n", + "import nemo.collections.asr as nemo_asr\n", + "from nemo.collections.asr.data.audio_to_text import AudioToCharDataset\n", + "from nemo.collections.asr.metrics.wer import WER\n", + "\n", + "def to_numpy(tensor):\n", + " return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()\n", + "\n", + "def setup_transcribe_dataloader(cfg, vocabulary):\n", + " config = {\n", + " 'manifest_filepath': os.path.join(cfg['temp_dir'], 'manifest.json'),\n", + " 'sample_rate': 16000,\n", + " 'labels': vocabulary,\n", + " 'batch_size': min(cfg['batch_size'], len(cfg['paths2audio_files'])),\n", + " 'trim_silence': True,\n", + " 'shuffle': False,\n", + " }\n", + " dataset = AudioToCharDataset(\n", + " manifest_filepath=config['manifest_filepath'],\n", + " labels=config['labels'],\n", + " sample_rate=config['sample_rate'],\n", + " int_values=config.get('int_values', False),\n", + " augmentor=None,\n", + " max_duration=config.get('max_duration', None),\n", + " min_duration=config.get('min_duration', None),\n", + " max_utts=config.get('max_utts', 0),\n", + " blank_index=config.get('blank_index', -1),\n", + " unk_index=config.get('unk_index', -1),\n", + " normalize=config.get('normalize_transcripts', False),\n", + " trim=config.get('trim_silence', True),\n", + " parser=config.get('parser', 'en'),\n", + " )\n", + " return torch.utils.data.DataLoader(\n", + " dataset=dataset,\n", + " batch_size=config['batch_size'],\n", + " collate_fn=dataset.collate_fn,\n", + " drop_last=config.get('drop_last', False),\n", + " shuffle=False,\n", + " num_workers=config.get('num_workers', 0),\n", + " pin_memory=config.get('pin_memory', False),\n", + " )\n", + "\n", + "quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"QuartzNet15x5Base-En\")\n", + "\n", + "quartznet.export('qn.onnx')\n", + "\n", + "ort_session = onnxruntime.InferenceSession('qn.onnx', providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])\n", + "\n", + "with tempfile.TemporaryDirectory() as tmpdir:\n", + " with open(os.path.join(tmpdir, 'manifest.json'), 'w') as fp:\n", + " for audio_file in files:\n", + " entry = {'audio_filepath': audio_file, 'duration': 100000, 'text': 'nothing'}\n", + " fp.write(json.dumps(entry) + '\\n')\n", + "\n", + " config = {'paths2audio_files': files, 'batch_size': 4, 'temp_dir': tmpdir}\n", + " temporary_datalayer = setup_transcribe_dataloader(config, quartznet.decoder.vocabulary)\n", + " for test_batch in 
temporary_datalayer:\n", + " processed_signal, processed_signal_len = quartznet.preprocessor(\n", + " input_signal=test_batch[0].to(quartznet.device), length=test_batch[1].to(quartznet.device)\n", + " )\n", + " ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(processed_signal),}\n", + " ologits = ort_session.run(None, ort_inputs)\n", + " alogits = np.asarray(ologits)\n", + " logits = torch.from_numpy(alogits[0])\n", + " greedy_predictions = logits.argmax(dim=-1, keepdim=False)\n", + " wer = WER(decoding=quartznet.decoding, use_cer=False)\n", + " hypotheses, _ = wer.decoding.ctc_decoder_predictions_tensor(greedy_predictions)\n", + " print(hypotheses)\n", + " break\n" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wteGqroafWg1" + }, + "source": [ + "## Under the Hood\n", + "\n", + "NeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.\n", + "\n", + "In particular, ``nemo_asr.model.EncDecCTCModel`` is an encoder-decoder model which is constructed using several ``Neural Modules`` taken from ``nemo_asr.modules.`` Here is what its forward pass looks like:\n", + "```python\n", + "def forward(self, input_signal, input_signal_length):\n", + " processed_signal, processed_signal_len = self.preprocessor(\n", + " input_signal=input_signal, length=input_signal_length,\n", + " )\n", + " # Spec augment is not applied during evaluation/testing\n", + " if self.spec_augmentation is not None and self.training:\n", + " processed_signal = self.spec_augmentation(input_spec=processed_signal)\n", + " encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)\n", + " log_probs = self.decoder(encoder_output=encoded)\n", + " greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)\n", + " return log_probs, encoded_len, greedy_predictions\n", + "```\n", + "Here:\n", + "\n", + "* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, which is a neural module that takes audio signal and converts it into a Mel-Spectrogram\n", + "* ``self.spec_augmentation`` - is a neural module of type ```nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation. \n", + "* ``self.encoder`` - is a convolutional Jasper/QuartzNet-like encoder of type ``nemo_asr.modules.ConvASREncoder``\n", + "* ``self.decoder`` - is a ``nemo_asr.modules.ConvASRDecoder`` which simply projects into the target alphabet (vocabulary).\n", + "\n", + "Also, ``EncDecCTCModel`` uses the audio dataset class ``nemo_asr.data.AudioToCharDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.\n", + "\n", + "You can use these and other neural modules (or create new ones yourself!) to construct new ASR models." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "smzlvbhelR0U" + }, + "source": [ + "# Further Reading/Watching:\n", + "\n", + "That's all for now! 
If you'd like to learn more about the topics covered in this tutorial, here are some resources that may interest you:\n", + "- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)\n", + "- [\"An Intuitive Explanation of Connectionist Temporal Classification\"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)\n", + "- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)\n", + "- [Listen Attend and Spell Paper (seq2seq ASR model)](https://arxiv.org/abs/1508.01211)\n", + "- [Explanation of the mel spectrogram in more depth](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0)\n", + "- [Jasper Paper](https://arxiv.org/abs/1904.03288)\n", + "- [QuartzNet paper](https://arxiv.org/abs/1910.10261)\n", + "- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)\n", + "- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)\n", + "- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)\n", + "- [Transfer Learning Blogpost](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/)" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "V3ERGX86lR0V" + }, + "source": [], + "execution_count": null, + "outputs": [] + } + ] }