Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prompt Tuning returns low-quality results #103

Open
weidotwisc opened this issue Mar 28, 2024 · 33 comments
Open

Prompt Tuning returns low-quality results #103

weidotwisc opened this issue Mar 28, 2024 · 33 comments
Assignees

Comments

@weidotwisc
Copy link

weidotwisc commented Mar 28, 2024

Describe the bug

Prompt Tuning model generates low-quality output

Platform

Please provide details about the environment you are using, including the following:

  • Interpreter version: python 3.11.0
  • Library version: transformers 4.37.2 peft 0.8.2

Sample Code

My launching script is

export MODEL_PATH=/dccstor/ai4code-ansible/shared/ckpt/granite-20b-code-all-yaml-2k/
export DATA_PATH=/dccstor/weiz/peft/train.json
export OUTPUT_PATH=pt_ckpts
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=.
python tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--training_data_path $DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--peft_method "pt" \
--tokenizer_name_or_path $MODEL_PATH  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--per_device_eval_batch_size 1  \
--gradient_accumulation_steps 1  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 3e-2  \
--weight_decay 0.  \
--warmup_ratio 0.0  \
--lr_scheduler_type "linear"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--response_template "\n### Response:"  \
--dataset_text_field "output" \
--use_flash_attn True  \
--torch_dtype "bfloat16" \
       --num_virtual_tokens 100 \
       --max_seq_length 512 \
--prompt_tuning_init RANDOM

To repeat, one needs to define MODEL_PATH to the yaml-2k path (which is shared via COS bucket) and DATA_PATH to the training data

Expected behavior

Using the first training data item as the input to the model should show something close to the ground truth, i.e.,

community.docker.docker_login:                                                                                                                         
    registry: "{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}"                                                            
    username: "{{ registry_username }}"                                                                                                                  
    password: "{{ registry_password }}"     

Observed behavior

When running through inference, I got

.,.,.,.,.,.,.,.,.,.,........,.,.,.,.,...,.,.....,..,...,......,:.: -.:.:.:.:.:.:.............,........,..........,.:..........,.........,::::::::::::::::
::::::......:..............,.............,..............,.,..:.......:.........2.2......................      

Additional context

@weidotwisc
Copy link
Author

train.json
The attached is the training data.

@weidotwisc
Copy link
Author

weidotwisc commented Apr 2, 2024

Hi, I think I found the problem:

I used the flash attention flag during training, so in the inference code, we would need the extra flag
_attn_implementation="flash_attention_2" when loading model.

Thanks!

@weidotwisc
Copy link
Author

I feel it probably worths noting this down in the inference document so that the consumer of this model needs to be aware of this particular flash attention flag as it requires changes to downstream inference pipeline code.

@weidotwisc weidotwisc reopened this Apr 3, 2024
@fabianlim
Copy link
Collaborator

@weidotwisc on second thoughts im a little puzzled. FA2 is a drop-in replacement for the attention layer, which only speeds up the compute, but should not affect the result. As such, regardless if you had trained in FA2, during inference it should not matter if you had set FA2 or not.

@weidotwisc
Copy link
Author

weidotwisc commented Apr 4, 2024

@fabianlim
From what I have observed, if we use SFT trainer code and specify
--use_flash_attn False, then the inference model loading code should be:
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16)

If we use SFT trainer code and specify --use_flash_attn True, then the inference model loading code should be:
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16, _attn_implementation="flash_attention_2") Otherwise the model will predict mostly garbage tokens as I have documented above.

@Ssukriti
Copy link
Collaborator

Ssukriti commented Apr 4, 2024

@weidotwisc could you also share you test set and evaluation script if you have it?

@olson-ibm
Copy link
Contributor

olson-ibm commented Apr 5, 2024

EDIT: I now see the train.json in the above comment. I missed it the first time around.

Trying to reproduce this issue using the platform testing environment DevStage.

This process is being documented here.

@weidotwisc I don't have access to the data set you used to tune, export DATA_PATH=/dccstor/weiz/peft/train.json. How can I get it? Your directory on CCC is not accessible with my CCC credentials. Can you temporarily copy it to one of the public /dccstor/ai4code-ansible directories? I'll delete it as soon as I get it.

@weidotwisc
Copy link
Author

Thanks @olson-ibm !
@Ssukriti -- The evaluation pipeline is a little tough to set up -- it requires some artifacts credentials to install some specific ansible related libraries and then some set up for post processing before calling into a pytest-based environment.
However, to repeat this flash attention related issue, you can pick a sample from the input dataset and let pt-tuned model to generate output, the results will be looking gibberish like

::::::......:..............,.............,..............,.,..:.......:.........2.2......................     

instead of any reasonable yaml-like output such as

community.docker.docker_login:                                                                                                                         
    registry: "{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}"                                                            
    username: "{{ registry_username }}"                                                                                                                  
    password: "{{ registry_password }}"     

@Ssukriti
Copy link
Collaborator

Ssukriti commented Apr 5, 2024

@olson-ibm thank you! Some of the things we need to try :

  1. Using same set of parameters Wei provided , try tuning and then try inference with few examples from train set itself. Report inference results and loss function drop - just few values to see if its decreasing.
  2. increase max_seq_length to 2048 , increase epochs to 20, --num_virtual_tokens to 1000 , per_device_train_batch_size=8 and repeat and report inference results
  3. if above parameters in 2. dont result in OOM , keep them standard and tweak learning rate 1e-5 and 0.05
    b. if it results in OOM, then reduce max_seq_length to 1024

@chinghuachen
Copy link

chinghuachen commented Apr 23, 2024

@Ssukriti could we please assign this to someone if it is being worked on by the platform team? Or should we tracking this elsewhere?

@anhuong
Copy link
Collaborator

anhuong commented Apr 25, 2024

@weidotwisc Neither I nor @olson-ibm were able to reproduce your low quality inference results although we are running inference different than you. You can see Joe's configuration and inference run in the issue.

Here are the configurations I used:

{
  "model_name_or_path": "granite-20b-code-all-yaml-2k-v1.1",
  "tokenizer_name_or_path": "granite-20b-code-all-yaml-2k-v1.1",
  "training_data_path": "train.json",
  "num_train_epochs": 5.0,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 1,
  "learning_rate": 0.03,
  "response_template": "\n### Response:",
  "dataset_text_field": "output",
  "num_virtual_tokens": 100,
  "prompt_tuning_init": "RANDOM",
  "peft_method": "pt",
  "logging_steps": 1,
  "include_tokens_per_second": true,
  "max_seq_length": 512
}

This matches the same configurations you used above. I prompt tuned on 1 GPU with flash-attention with the accelerate_launch.py script which calls the sft_trainer.py script with the above configurations. This uses the same training data you linked to above.

Here is the loss for the tuning:

{'loss': 0.9866, 'grad_norm': 0.013051297515630722, 'learning_rate': 0.024, 'epoch': 1.0}
{'loss': 0.5044, 'grad_norm': 0.02981734648346901, 'learning_rate': 0.018, 'epoch': 2.0}
{'loss': 0.4916, 'grad_norm': 0.011600751429796219, 'learning_rate': 0.012, 'epoch': 3.0}
{'loss': 0.4751, 'grad_norm': 0.0, 'learning_rate': 0.006, 'epoch': 4.0}
{'loss': 0.4658, 'grad_norm': 0.010685372166335583, 'learning_rate': 0.0, 'epoch': 5.0}
{'train_runtime': 2525.3289, 'train_samples_per_second': 4.538, 'train_steps_per_second': 2.269, 'train_tokens_per_second': 2025.542, 'train_loss': 0.5846864923340696, 'epoch': 5.0}

I then ran inference using the run_inference.py script with changes I have in a PR to run with or without flash-attention.

Because the sequence length is so low, I got 1368 warning messages of This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the max_seq_length. I tried to increase the max_seq_length to 1024 and 2048 but ran into OOM failures.

An inference I ran with flash-attention:

{
        "formatted input": "### Input: \n- name: Ensure pip\n  community.general.easy_install:\n    name: pip\n    state: present\n\n- name: Ensure docker for python\n  ansible.builtin.pip:\n    name: docker\n    version: 2.0.0\n\n- name: Fetch registry certificate\n  ansible.builtin.fetch:\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n    dest: /tmp/docker_registry.crt\n    flat: true\n  run_once: true\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n  ansible.builtin.file:\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n    state: directory\n    mode: 493\n\n- name: Copy registry certificate file\n  ansible.builtin.copy:\n    src: /tmp/docker_registry.crt\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n  ansible.builtin.systemd:\n    name: docker\n    enabled: true\n    state: restarted\n\n- name: Login to the registry\n\n### Response:",
        "predicted target": [
            "community.docker.docker_login:\n    registry: \"{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\"\n    username: \"{{ registry_username }}\"\n    password: \"{{ registry_password }}\"\n    email: \"{{ registry_email }}\"\n    reauthorize: true\n\n- name: Pull the image\n  community.docker.docker_image:\n    name: \"{{ image_name }}\"\n    tag: \"{{ image_tag }}\"\n    source: pull"
        ],
        "ref target": [
            "community.docker.docker_login:\n    password: \"{{ registry_pass }}\"\n    username: \"{{ registry_user }}\"\n    registry_url: \"{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\"\n    reauthorize: true"
        ]
    },

Same without flash-attention enabled at inference time:

{
        "formatted input": "### Input: \n- name: Ensure pip\n  community.general.easy_install:\n    name: pip\n    state: present\n\n- name: Ensure docker for python\n  ansible.builtin.pip:\n    name: docker\n    version: 2.0.0\n\n- name: Fetch registry certificate\n  ansible.builtin.fetch:\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n    dest: /tmp/docker_registry.crt\n    flat: true\n  run_once: true\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n  ansible.builtin.file:\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n    state: directory\n    mode: 493\n\n- name: Copy registry certificate file\n  ansible.builtin.copy:\n    src: /tmp/docker_registry.crt\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n  ansible.builtin.systemd:\n    name: docker\n    enabled: true\n    state: restarted\n\n- name: Login to the registry\n\n### Response:",
        "predicted target": [
            "community.docker.docker_login:\n    registry: \"{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\"\n    username: \"{{ registry_username }}\"\n    password: \"{{ registry_password }}\"\n    email: \"{{ registry_email }}\"\n    reauthorize: true\n\n- name: Pull the image\n  community.docker.docker_image:\n    name: \"{{ image_name }}\"\n    tag: \"{{ image_tag }}\"\n    source: pull"
        ],
        "ref target": [
            "community.docker.docker_login:\n    password: \"{{ registry_pass }}\"\n    username: \"{{ registry_user }}\"\n    registry_url: \"{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\"\n    reauthorize: true"
        ]
    }

As you can see the outputs are the same and both are formatting in correct ansible.

Here is another one where the results are slightly different but both still look good:
With flash-attn:

{
        "formatted input": "### Input: \n- name: Create {{ docker_compose_dir }} directory\n  ansible.builtin.file:\n    path: \"{{ docker_compose_dir }}\"\n    state: directory\n\n- name: Create Redis socket directory\n  ansible.builtin.file:\n    path: \"{{ docker_compose_dir }}/redis_socket\"\n    state: directory\n    mode: 511\n\n- name: Create Docker Compose Configuration\n  ansible.builtin.template:\n    src: \"{{ item.file }}.j2\"\n    dest: \"{{ docker_compose_dir }}/{{ item.file }}\"\n    mode: \"{{ item.mode }}\"\n  loop:\n    - file: environment.sh\n      mode: \"0600\"\n    - file: credentials.py\n      mode: \"0600\"\n    - file: docker-compose.yml\n      mode: \"0600\"\n    - file: nginx.conf\n      mode: \"0600\"\n    - file: redis.conf\n      mode: \"0664\"\n  register: awx_compose_config\n\n- name: Render SECRET_KEY file\n  ansible.builtin.copy:\n    content: \"{{ secret_key }}\"\n    dest: \"{{ docker_compose_dir }}/SECRET_KEY\"\n    mode: 384\n  register: awx_secret_key\n\n- name: Remove AWX containers before migrating postgres so that the old postgres container does not get used\n  community.docker.docker_compose:\n    project_src: \"{{ docker_compose_dir }}\"\n    state: absent\n  ignore_errors: true\n\n- name: Start the containers\n\n### Response:",
        "predicted target": [
            "community.docker.docker_compose:\n    project_src: \"{{ docker_compose_dir }}\"\n    state: present\n\n- name: Wait for AWX to start\n  ansible.builtin.wait_for:\n    port: 8052\n    delay: 10\n\n- name: Migrate the database\n  ansible.builtin.shell: docker exec -i awx_task awx-manage migrate --noinput\n  register: migrate_result\n  until: migrate"
        ],
        "ref target": [
            "community.docker.docker_compose:\n    project_src: \"{{ docker_compose_dir }}\"\n    restarted: \"{{ awx_compose_config is changed or awx_secret_key is changed }}\"\n  register: awx_compose_start"
        ]
    }

Without flash-attn:

{
        "formatted input": "### Input: \n- name: Create {{ docker_compose_dir }} directory\n  ansible.builtin.file:\n    path: \"{{ docker_compose_dir }}\"\n    state: directory\n\n- name: Create Redis socket directory\n  ansible.builtin.file:\n    path: \"{{ docker_compose_dir }}/redis_socket\"\n    state: directory\n    mode: 511\n\n- name: Create Docker Compose Configuration\n  ansible.builtin.template:\n    src: \"{{ item.file }}.j2\"\n    dest: \"{{ docker_compose_dir }}/{{ item.file }}\"\n    mode: \"{{ item.mode }}\"\n  loop:\n    - file: environment.sh\n      mode: \"0600\"\n    - file: credentials.py\n      mode: \"0600\"\n    - file: docker-compose.yml\n      mode: \"0600\"\n    - file: nginx.conf\n      mode: \"0600\"\n    - file: redis.conf\n      mode: \"0664\"\n  register: awx_compose_config\n\n- name: Render SECRET_KEY file\n  ansible.builtin.copy:\n    content: \"{{ secret_key }}\"\n    dest: \"{{ docker_compose_dir }}/SECRET_KEY\"\n    mode: 384\n  register: awx_secret_key\n\n- name: Remove AWX containers before migrating postgres so that the old postgres container does not get used\n  community.docker.docker_compose:\n    project_src: \"{{ docker_compose_dir }}\"\n    state: absent\n  ignore_errors: true\n\n- name: Start the containers\n\n### Response:",
        "predicted target": [
            "community.docker.docker_compose:\n    project_src: \"{{ docker_compose_dir }}\"\n    state: present\n\n- name: Wait for AWX to start\n  ansible.builtin.wait_for:\n    host: \"{{ awx_host }}\"\n    port: 8052\n    delay: 10\n\n- name: Migrate the database\n  ansible.builtin.shell: docker exec -i awx_task awx-manage migrate --noinput"
        ],
        "ref target": [
            "community.docker.docker_compose:\n    project_src: \"{{ docker_compose_dir }}\"\n    restarted: \"{{ awx_compose_config is changed or awx_secret_key is changed }}\"\n  register: awx_compose_start"
        ]
    }

@weidotwisc
Copy link
Author

weidotwisc commented Apr 25, 2024

@anhuong @olson-ibm Thanks Anh and Joe! To reproduce what I have observed, I am attaching two files here:
(1) an input json file (train_tiny.json)
(2) my inference script (ansible_pt_inference_shared.py)

To run the script, first change two lines in the ansible_pt_inference_shared.py script: line 6 to point to your checkpoint (SFT flash_attn flag enabled) path and line 7 to point to your train_tiny.json

Then run "python ansible_pt_inference_shared.py", in my case, I see

output_str: 
.,.,.,.,.,.,.,.,.,.,........,.,.,.,.,...,.,.....,..,...,......,:.: -.:.:.:.:.:.:.............,........,..........,.:..........,.........,::::::::::::::::::::::......:..............,.............,..............,.,..:.......:.........2.2......................

Thanks!

Wei
ansible_pt_inference_shared.txt

train_tiny.json

P.S.: for some reason I cannot attach .py file, so I renamed ansible_pt_inference_shared.py to ansible_pt_inference_shared.txt, you would need to rename it back.

@anhuong
Copy link
Collaborator

anhuong commented Apr 26, 2024

Also wanted to note some testing I did with different max sequence lengths:

  • Out of memory issues when using max_seq_length>512 and batch size>=8
  • We suspect the training data is getting truncated, which can remove the response template that is needed and removes training data samples which also decreases the quality of the model. With increased max_seq_length, results in more training data included and less warnings
    • max_seq_length=512, 1368 instances of warning This instance will be ignored in loss calculation. and Could not find response key, epoch=5 final loss=0.4658
    • max_seq_length=1024, 900 warnings, epoch=10 final loss=0.4198
    • max_seq_length=2048, 210 warnings, epoch=10 final loss=0.3774

@weidotwisc
Copy link
Author

Thanks, Anh! But the problem that I was reporting , in my opinion, has less to do with training. When I ran inference (see ansible_pt_inference_shared.py that I had shared previously) without the attn flag against a flash attn model returns gibberish results.I wonder could you give it a try to run the inference script that I had attached against one of your flash attention enabled trained model and see if you have seen the same problem ? Thanks!

@Ssukriti
Copy link
Collaborator

Official inference stack by platform is TGIS . We could not reproduce any such problem with TGIS . Hence there is no bug for platform to debug .

We give as ab example local inferencing using HF APIs in our tuning library , which is the official way from HF to infer models , we did not see any issue with that either !

So @weidotwisc i would suggest you to try the inference script in our repo and debug your script accordingly.
While Anh can take a look at your script in a timeboxed manner , it is really not part of platform stack and not a bug for us to dig into .

The issue reported was that quality is poor - we have demonstrated quality is not poor using platform stack

@weidotwisc
Copy link
Author

@Ssukriti @anhuong Could you show me what is the right process to use "TGIS" to inference on one testing example ? I can try follow and see if I can get the desired result ?

Thanks!
Wei

@weidotwisc
Copy link
Author

@anhuong You mentioned that "I then ran inference using the run_inference.py script with changes I have in a PR to run with or without flash-attention." --> what is right process to run it and where is the document to run it ?

Thanks!

Wei

@anhuong
Copy link
Collaborator

anhuong commented Apr 29, 2024

@weidotwisc Are you able to access this issue that walks through running the model on TGIS?

This is what I ran for inference:
To run without flash-attn:

# given file of input data
$ python scripts/run_inference.py  --model <path-to-model> --text_file <path-to-text-file> --max_new_tokens 100 

# single text input
$ python scripts/run_inference.py  --model <path-to-model> --text "### Input: - name: Ensure pip\n  community.general.easy_install:\n    name: pip\n    state: present\n\n- name: Ensure docker for python\n  ansible.builtin.pip:\n    name: docker\n    version: 2.0.0\n\n- name: Fetch registry certificate\n  ansible.builtin.fetch:\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n    dest: /tmp/docker_registry.crt\n    flat: true\n  run_once: true\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n  ansible.builtin.file:\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n    state: directory\n    mode: 493\n\n- name: Copy registry certificate file\n  ansible.builtin.copy:\n    src: /tmp/docker_registry.crt\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n  ansible.builtin.systemd:\n    name: docker\n    enabled: true\n    state: restarted\n\n- name: Login to the registry ### Response:" --max_new_tokens 100 

I also ran the script with the training dataset you have attached and was not able to get the same garbled output you mentioned. Running against my prompt tuned model, this is the output I got:

[root@sft-trainer-test-anh-eval app]# python ansible_pt_inference_shared.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 9/9 [2:32:08<00:00, 1014.23s/it]
- name: Ensure pip
  community.general.easy_install:
    name: pip
    state: present

- name: Ensure docker for python
  ansible.builtin.pip:
    name: docker
    version: 2.0.0

- name: Fetch registry certificate
  ansible.builtin.fetch:
    src: "{{ docker_storage_dir }}/certs/docker.crt"
    dest: /tmp/docker_registry.crt
    flat: true
  run_once: true
  delegate_to: "{{ groups['docker_registry'][0] }}"

- name: Create registry certificate directory
  ansible.builtin.file:
    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}
    state: directory
    mode: 493

- name: Copy registry certificate file
  ansible.builtin.copy:
    src: /tmp/docker_registry.crt
    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt

- name: Enable service docker, restart to pickup registry certificate
  ansible.builtin.systemd:
    name: docker
    enabled: true
    state: restarted

- name: Login to the registry
input_len:  284
/usr/local/lib/python3.11/site-packages/peft/peft_model.py:1232: UserWarning: Position ids are not supported for parameter efficient tuning. Ignoring position ids.
  warnings.warn("Position ids are not supported for parameter efficient tuning. Ignoring position ids.")
len_outputs:  540
output_str: 

  community.docker.docker_login:
    registry: "{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}"
    username: "{{ registry_username }}"
    password: "{{ registry_password }}"
    email: "{{ registry_email }}"
    reauthorize: true
- name: Install packages
  ansible.builtin.apt:
    name: "{{ item }}"
    state: present
  with_items:
    - git
    - vim
    - curl
    - htop
    - tree
    - tmux
    - python-software-properties
    - software-properties-common
    - build-essential
    - libssl-dev
    - libffi-dev
    - python-dev
    - python-pip
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-dev
    - python-

You can see the input string printed out as well as the output string, which although has - python-dev repeated at the end, is providing ansible formatted results.

@weidotwisc
Copy link
Author

weidotwisc commented Apr 30, 2024

@anhuong Thanks a lot for the help! I wonder when you ran ansible_pt_inference_shared.py, did you run against a flash_attn enabled prompt tuned model ?

Thanks!
Wei

@anhuong
Copy link
Collaborator

anhuong commented Apr 30, 2024

Yes I ran against the same code model enabling flash_attn during tuning time and then did not use flash_attn via your script during inference time.

@weidotwisc
Copy link
Author

@anhuong I used the single text input example that you listed by

model_path=/dccstor/weiz/bikinie/peft/pt/sft_debugging/fms-hf-tuning/pt_ckpts/checkpoint-5730
python scripts/run_inference.py  --model $model_path --text "- name: Ensure pip\n  community.general.easy_install:\n    name: pip\n    state: present\n\n- name: Ensure docker for python\n  ansible.builtin.pip:\n    name: docker\n    version: 2.0.0\n\n- name: Fetch registry certificate\n  ansible.builtin.fetch:\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n    dest: /tmp/docker_registry.crt\n    flat: true\n  run_once: true\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n  ansible.builtin.file:\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n    state: directory\n    mode: 493\n\n- name: Copy registry certificate file\n  ansible.builtin.copy:\n    src: /tmp/docker_registry.crt\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n  ansible.builtin.systemd:\n    name: docker\n    enabled: true\n    state: restarted\n\n- name: Login to the registry" --max_new_tokens 100 

Then I get the inference_result.json that looks like this:

[
    {
        "input": "- name: Ensure pip\\n  community.general.easy_install:\\n    name: pip\\n    state: present\\n\\n- name: Ensure docker for python\\n  ansible.builtin.pip:\\n    name: docker\\n    version: 2.0.0\\n\\n- name: Fetch registry certificate\\n  ansible.builtin.fetch:\\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\\n    dest: /tmp/docker_registry.crt\\n    flat: true\\n  run_once: true\\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\\n\\n- name: Create registry certificate directory\\n  ansible.builtin.file:\\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\\n    state: directory\\n    mode: 493\\n\\n- name: Copy registry certificate file\\n  ansible.builtin.copy:\\n    src: /tmp/docker_registry.crt\\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\\n\\n- name: Enable service docker, restart to pickup registry certificate\\n  ansible.builtin.systemd:\\n    name: docker\\n    enabled: true\\n    state: restarted\\n\\n- name: Login to the registry",
        "output": "- name: Ensure pip\\n  community.general.easy_install:\\n    name: pip\\n    state: present\\n\\n- name: Ensure docker for python\\n  ansible.builtin.pip:\\n    name: docker\\n    version: 2.0.0\\n\\n- name: Fetch registry certificate\\n  ansible.builtin.fetch:\\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\\n    dest: /tmp/docker_registry.crt\\n    flat: true\\n  run_once: true\\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\\n\\n- name: Create registry certificate directory\\n  ansible.builtin.file:\\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\\n    state: directory\\n    mode: 493\\n\\n- name: Copy registry certificate file\\n  ansible.builtin.copy:\\n    src: /tmp/docker_registry.crt\\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\\n\\n- name: Enable service docker, restart to pickup registry certificate\\n  ansible.builtin.systemd:\\n    name: docker\\n    enabled: true\\n    state: restarted\\n\\n- name: Login to the registry. -... -.2.2. -.:.:.:.:.:.:.............,.,.,...,..,.,.,.,.,.,.,....,.,..,::::::::::::::::.:.:........."
    }
]

If you scroll down to the end of the inference result, you can see my flash-attn-enabled PT-tuned model generates gibberish results.

I have two questions:
(1) What was your PT training launching script that enabled flash attention ?
(2) The model that I trained was on March 26, 2024 -- do you think there could be some changes between then and now in the PT training process that could have altered the results (i.e., how the model was checkpoint-ed in the end) ?

Thanks!

Wei

@anhuong
Copy link
Collaborator

anhuong commented Apr 30, 2024

(1) What was your PT training launching script that enabled flash attention ?

The inference script that uses flash attention is on a branch here: https://github.com/anhuong/fms-hf-tuning/blob/flash-attn-inference/scripts/run_inference.py and requires adding flag --use_flash_attn. This is currently in PR #132

I also updated my above comment as the input text used for inference should include the alpaca formatting with ### Input: and ### Response:, please try it with that.

(2) The model that I trained was on March 26, 2024 -- do you think there could be some changes between then and now in the PT training process that could have altered the results (i.e., how the model was checkpoint-ed in the end) ?

I'm unsure, this seems to be the most relevant hange but if the previous use_flash_attention_2 flag is not fully deprecated yet then the functionality should be the same.

@weidotwisc
Copy link
Author

@anhuong Thanks! I was asking "What was your PT training launching script that enabled flash attention ?" --> Training not inference script.

Thanks!

Wei

@anhuong
Copy link
Collaborator

anhuong commented Apr 30, 2024

I used the script in this repo sft_trainer.py but triggered with the build scripts

@weidotwisc
Copy link
Author

weidotwisc commented Apr 30, 2024

@anhuong Thanks!

This is the script that I used to launch the PT training job

export MODEL_PATH=/dccstor/ai4code-ansible/shared/ckpt/granite-20b-code-all-yaml-2k/
export DATA_PATH=/dccstor/weiz/peft/train.json
export OUTPUT_PATH=pt_ckpts
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=.
python tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--training_data_path $DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--peft_method "pt" \
--tokenizer_name_or_path $MODEL_PATH  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--per_device_eval_batch_size 1  \
--gradient_accumulation_steps 1  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 3e-2  \
--weight_decay 0.  \
--warmup_ratio 0.0  \
--lr_scheduler_type "linear"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--response_template "\n### Response:"  \
--dataset_text_field "output" \
--use_flash_attn True  \
--torch_dtype "bfloat16" \
       --num_virtual_tokens 100 \
       --max_seq_length 512 \
--prompt_tuning_init RANDOM

as mentioned at the top of this issue thread.

I followed the README at https://github.com/foundation-model-stack/fms-hf-tuning#prompt-tuning- and modified accordingly (in particular to have flash attn enabled by --use_flash_attn True).

I wonder how should I run your training script in a similar fashion ?

Thanks!
Wei

@anhuong
Copy link
Collaborator

anhuong commented Apr 30, 2024

As noted above, I ran with the same configurations you had set. This required creating the JSON config, setting env var SFT_TRAINER_CONFIG_JSON_PATH to the config and then running python accelerate_launch.py on a Pod with single-GPU.

By default when running this script, use_flash_attn is set to true. You can see the full set of params set:

INFO:root:Parameters used to launch training:     
ModelArguments(model_name_or_path='granite-20b-code', use_flash_attn=True, torch_dtype=torch.bfloat16),

DataArguments(training_data_path='wisdom-train.json', response_template='\n### Response:', dataset_text_field='output', validation_data_path=None), 

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=True,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.03,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/runs/Apr24_18-02-41_sft-trainer-test-anh-wisdom-match-params,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1,
logging_strategy=IntervalStrategy.EPOCH,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_seq_length=512,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=5.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=granite-20b-code-tuned,
overwrite_output_dir=False,
packing=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=2,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=granite-20b-code,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
), 

PromptTuningConfig(prompt_tuning_init='RANDOM', num_virtual_tokens=100, prompt_tuning_init_text='Classify if the tweet is a complaint or not:', tokenizer_name_or_path='granite-20b-code')

@weidotwisc
Copy link
Author

weidotwisc commented May 1, 2024

@anhuong

To reproduce the bug, this is what I did:

(1) checkout the latest fms-hf-tuning at commit c2f2f8c (Apr 29,2024)

(2) Run the training launching script, as documented in https://github.com/foundation-model-stack/fms-hf-tuning#prompt-tuning-, with the following modification (in particular, --use_flash_attn True), the same as documented at the top of this issue thread:

export MODEL_PATH=/dccstor/ai4code-ansible/shared/ckpt/granite-20b-code-all-yaml-2k/
export DATA_PATH=/dccstor/weiz/irene/fms-hf-tuning/examples/prompt_tuning_ans/containers_infra-ent-train-fms.json
export OUTPUT_PATH=pt_ckpts
export CUDA_VISIBLE_DEVICES=0
export PYTHONPATH=.
python tuning/sft_trainer.py  \
--model_name_or_path $MODEL_PATH  \
--training_data_path $DATA_PATH  \
--output_dir $OUTPUT_PATH  \
--peft_method "pt" \
--tokenizer_name_or_path $MODEL_PATH  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--per_device_eval_batch_size 1  \
--gradient_accumulation_steps 1  \
--evaluation_strategy "no"  \
--save_strategy "epoch"  \
--learning_rate 3e-2  \
--weight_decay 0.  \
--warmup_ratio 0.0  \
--lr_scheduler_type "linear"  \
--logging_steps 1  \
--include_tokens_per_second  \
--packing False  \
--response_template "\n### Response:"  \
--dataset_text_field "output" \
--use_flash_attn True  \
--torch_dtype "bfloat16" \
       --num_virtual_tokens 100 \
       --max_seq_length 512 \
--prompt_tuning_init RANDOM

(3) Run the single text example inference script as you proposed, pointing to the checkpoint from step (2)

model_path=/dccstor/weiz/bikinie/peft/pt/sft_debugging/fms-hf-tuning/pt_ckpts/checkpoint-5730
python scripts/run_inference.py  --model $model_path --text "- name: Ensure pip\n  community.general.easy_install:\n    name: pip\n    state: present\n\n- name: Ensure docker for python\n  ansible.builtin.pip:\n    name: docker\n    version: 2.0.0\n\n- name: Fetch registry certificate\n  ansible.builtin.fetch:\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n    dest: /tmp/docker_registry.crt\n    flat: true\n  run_once: true\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n  ansible.builtin.file:\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n    state: directory\n    mode: 493\n\n- name: Copy registry certificate file\n  ansible.builtin.copy:\n    src: /tmp/docker_registry.crt\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n  ansible.builtin.systemd:\n    name: docker\n    enabled: true\n    state: restarted\n\n- name: Login to the registry" --max_new_tokens 100 

The inference output (aka after the input repeating part) of the model, stored in inference_result.json (attached below) looks like this:

.0:::: '0: - '0: - 0: - 1: '0.0.0.0.0: '0: '0: '0: '0: - '0: - '0: - '0: - '0:0:0:0:0:0 -0:8000:1.0.0.0.00:0.0.0.0:0:00

inference_result.json

Thanks!

Wei

@anhuong
Copy link
Collaborator

anhuong commented May 1, 2024

@weidotwisc instead of me tuning on an old commit, you should tune the model using the latest repo main branch. In addition, I noted to try inference with the ### Input: and ### Response: alpaca formatting to see if this affects the prediction result. Please make these changes

@weidotwisc
Copy link
Author

@anhuong

(1) It is the latest commit that I tested against -- please note the commit hash!

(2) I have updated my inference script to be:

python scripts/run_inference.py  --model $model_path  --text "### Input: - name: Ensure pip\n  community.general.easy_install:\n    name: pip\n    state: present\n\n- name: Ensure docker for python\n  ansible.builtin.pip:\n    name: docker\n    version: 2.0.0\n\n- name: Fetch registry certificate\n  ansible.builtin.fetch:\n    src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n    dest: /tmp/docker_registry.crt\n    flat: true\n  run_once: true\n  delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n  ansible.builtin.file:\n    path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n    state: directory\n    mode: 493\n\n- name: Copy registry certificate file\n  ansible.builtin.copy:\n    src: /tmp/docker_registry.crt\n    dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n  ansible.builtin.systemd:\n    name: docker\n    enabled: true\n    state: restarted\n\n- name: Login to the registry ### Response:" --max_new_tokens 100 

I still get garbage towards the end see attached json file.
inference_result.json

Thanks!
Wei

@anhuong
Copy link
Collaborator

anhuong commented May 2, 2024

Sorry for missing that it is the latest commit hash, thanks for tuning on latest commit. Sorry for being pedantic, but could you try with text "### Input: - name: Ensure pip\n community.general.easy_install:\n name: pip\n state: present\n\n- name: Ensure docker for python\n ansible.builtin.pip:\n name: docker\n version: 2.0.0\n\n- name: Fetch registry certificate\n ansible.builtin.fetch:\n src: \"{{ docker_storage_dir }}/certs/docker.crt\"\n dest: /tmp/docker_registry.crt\n flat: true\n run_once: true\n delegate_to: \"{{ groups['docker_registry'][0] }}\"\n\n- name: Create registry certificate directory\n ansible.builtin.file:\n path: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}\n state: directory\n mode: 493\n\n- name: Copy registry certificate file\n ansible.builtin.copy:\n src: /tmp/docker_registry.crt\n dest: /etc/docker/certs.d/{{ hostvars[groups['docker_registry'][0]].ansible_host }}:{{ registry_port }}/ca.crt\n\n- name: Enable service docker, restart to pickup registry certificate\n ansible.builtin.systemd:\n name: docker\n enabled: true\n state: restarted\n\n- name: Login to the registry\n\n### Response:" --max_new_tokens 100 I know this is a minor difference to include newlines \n\n### Response: but I want to be sure the format is correct.

Also it looks like we tuned the same way so unless the base models we are using differently, not sure on why inference results are different. Also if you are trying to use flash attention with the inference script, make sure you are using my fork. You should not need to pass in True --use_flash_attn True but only the flag --use_flash_attn.

Finally, potentially since we are running on different devices with potentially different dependencies this could also be causing problems. Can you please let me know what versions of these dependencies you have: transformers, accelerate, peft, trl (all can get with pip show) and finally cuda (can get with /usr/local/cuda/bin/nvcc --version)
I have versions:
peft: 0.10.0
transformers: 4.39.3
trl: 0.8.1
accelerate: 0.29.2
cuda:

Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@weidotwisc
Copy link
Author

weidotwisc commented May 6, 2024

@anhuong Thanks!

(1) The attached is the inference_result.json that uses the "\n\n" before "### Response:". It is still gibberish towards the end.

(2) I am not sure if I understand your comments about (i) your fork and (ii) pass the --use_flash_attn flag when using inference script. When I was using inference script, (i) I used scripts/run_inference.py in the main branch (ii) I didn't pass in any flag, I was simply doing python scripts/run_inference.py --model $model_path --text ... --max_new_tokens 100, where ... is the text field.

(3) My environments are the following

peft: 0.8.2
transformers: 4.37.2
trl: 0.7.10
accelerate: 0.27.0
cuda: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

Thanks!

Wei

@anhuong
Copy link
Collaborator

anhuong commented May 7, 2024

Thanks Wei, my change has been merged so you can now use the latest main branch and add flag --use_flash_attn to see if that changes your results with your given model. You can run python scripts/run_inference.py --model $model_path --text ... --max_new_tokens 100 --use_flash_attn.

I do wonder if the different dependency versions are making a difference in the tuning and inference since you are on slightly older dependency versions but on a newer cuda version than me. As we are not able to reproduce your results.

@weidotwisc
Copy link
Author

weidotwisc commented May 7, 2024

Thanks, Anh! Yeah, now it works :) The attached is the
inference_result.json that appears normal in the end.

I see your solution is also to ask the user the pass a flag that indicates that is a flash attention enabled training model, then the inference code would load the model by including the option attn_implementation="flash_attention_2" . This looks like the same solution that I proposed at #103 (comment)

Now, two things are on my mind:
(1) Maybe have folks like @fabianlim to have a look, as our solution indicates the SDPA (granite default attn mechanism) and Flash Attn are not equivalent mechanisms thus cannot be replaced transparently, at least for our granite models. And that seems to be contrary to the conventional wisdom, per Fabian ?

(2) If (1) is indeed the case, then the consumers of platform SFT trained models need to be aware of this, unless you have a control over how all the users are going to use the model. For example, in our group (AI4Ansible) 's evaluation pipeline, we didn't have this attn_implementation="flash_attention_2" flag, because we didn't use flash attention mechanism to train our model. My feeling is the research teams that are using platform team's SFT pipeline would need to modify your inference code somehow to fit their existing scenario.

Thanks!

Wei

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants