feat: Add support for jinja based template rendering of the dataset #438

Abhishek-TAMU · 2025-01-15T23:17:53Z

Description of the change

Added a data handler, apply_custom_data_formatting_jinja_template, which renders each dataset element through a Jinja template.

Handling of an edge case:
Example template: "### Input: {{Tweet text}} \n\n ### Response: {{text_label}}"
By default, Jinja2 does not support placeholder variable names that contain spaces (e.g., {{Tweet text}}); such a template raises an error.

Hence an additional preprocessing step (function: transform_placeholders) has been added. It checks whether a placeholder variable name contains a space and, if so, rewrites the placeholder as a lookup on the dataset element, e.g. {{element["Tweet text"]}}.
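To make the placeholder rewrite concrete, here is a minimal sketch of what such a preprocessing function could look like. It is an illustration under assumptions, not the code merged in this PR: the regular expressions and the rule "two or more words separated by whitespace" are choices made for this example, and it assumes the row is passed to the template under the name element.

```python
import re

# Matches any {{ ... }} placeholder and captures the text between the braces.
PLACEHOLDER = re.compile(r"\{\{([^{}]+)\}\}")
# A "spaced name" here means two or more words separated by whitespace (assumption).
SPACED_NAME = re.compile(r"\w+(?:\s+\w+)+")


def transform_placeholders(template: str) -> str:
    """Rewrite placeholders whose names contain spaces into element[...] lookups."""

    def rewrite(match: re.Match) -> str:
        name = match.group(1).strip()
        if SPACED_NAME.fullmatch(name):
            # "Tweet text" is not a valid Jinja identifier, so index the row instead.
            return '{{element["' + name + '"]}}'
        # Plain identifiers and already-rewritten lookups are left untouched.
        return match.group(0)

    return PLACEHOLDER.sub(rewrite, template)


template = "### Input: {{Tweet text}} \n\n ### Response: {{text_label}}"
print(transform_placeholders(template))
```

One limitation of this sketch: a genuine Jinja expression made only of words, such as {{ a if b }}, would also be rewritten, so a production version would need a stricter rule.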

Related issue number

Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1470

How to verify the PR

Verify added test cases.

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

Signed-off-by: Abhishek <[email protected]>

github-actions · 2025-01-15T23:18:05Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Abhishek-TAMU · 2025-01-21T23:32:25Z

tuning/data/data_handlers.py

+    except Exception as e:
+        raise KeyError(f"Dataset does not contain field in template. {e}") from e
+
+    rendered_text += tokenizer.eos_token


1- @ashokponkumar I just wanted to confirm the removal of eos_token from dataset samples in this handler. In the other handlers we add eos_token ourselves and don't expect users to add it. So in this handler, where the user passes a Jinja template, are we expecting the user to include eos_token too? I suspect that, for a non-pretokenized dataset, omitting eos_token when using DataCollatorForCompletionOnlyLM might affect the F1 score of tuned models.

2- @dushyantbehl Can I ask how Jinja templating could be used with a pre-tokenized dataset (one that has input_ids and labels as columns)?

I think we need proper documentation for now, plus a patch that lets users choose whether the data handlers append an eos_token, via a single argument: e.g. add a kwarg such as add_eos_token to the data handlers, so that users can choose what they want inside a data config.
For a data config, we should not make assumptions about what users want to do.
For our data args, we can add this inside our code at the last data handler, whichever one we choose, so that our data args use cases remain the same.

If you feel up to it, can you take this up with this patch and add the kwarg for eos_token to clean up the interface for users? Otherwise we can park this for a later patch.

For pre-tokenized datasets we can ignore the Jinja template; IMO it should be applied only to non-tokenized datasets.
We can add all of this to the documentation, and I request that you please add documentation with this patch.

As per offline discussion, the add_eos_token kwarg will be added as part of this issue, and its documentation will be taken care of there as well.

Handler documentation, however, is added in this PR.
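The add_eos_token idea discussed above could look roughly like the following sketch. Everything here is hypothetical: the handler name, the default value of the flag, and the use of str.format as a stand-in for Jinja rendering are assumptions for illustration, and the tokenizer is a stub that only carries an eos_token attribute.

```python
from types import SimpleNamespace


def apply_template_handler(
    element, tokenizer, dataset_text_field, template, add_eos_token=True
):
    """Hypothetical handler: render one row and optionally append the EOS token."""
    # Stand-in for Jinja rendering; str.format keeps the sketch dependency-free.
    rendered_text = template.format(**element)
    if add_eos_token:
        # Mirrors the behaviour under review: the handler appends EOS itself.
        rendered_text += tokenizer.eos_token
    return {dataset_text_field: rendered_text}


# Stub tokenizer and row, used only to exercise the flag.
tokenizer = SimpleNamespace(eos_token="</s>")
row = {"tweet": "Great service!", "label": "no complaint"}
template = "### Input: {tweet}\n\n### Response: {label}"

with_eos = apply_template_handler(row, tokenizer, "formatted_data_field", template)
without_eos = apply_template_handler(
    row, tokenizer, "formatted_data_field", template, add_eos_token=False
)
print(with_eos["formatted_data_field"].endswith("</s>"))
print(without_eos["formatted_data_field"].endswith("</s>"))
```

With a flag like this, a data config could leave EOS handling to the user's template, while the data args path could keep appending it by default.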

Abhishek-TAMU · 2025-01-21T23:35:38Z

tuning/data/data_handlers.py

+    return {dataset_text_field: rendered_text}
+
+
+def transform_placeholders(template: str) -> str:


@dushyantbehl @ashokponkumar Are we also handling the nested dataset use case? I see that every other handler expects a dataset element of type Dict[str, str], not Dict[str, Dict].

I think we were only handling non-nested datasets, apart from chat templates. Can we test things out with this patch? If our code works for nested datasets, can we change the argument type here?

Also, could you move this to utils, as we discussed in our last call? Thanks.

As per offline discussion, handling of nested datasets will be checked and implemented for all handlers as part of this issue.

Also, could you move this to utils, as we discussed in our last call? Thanks.

Done

dushyantbehl · 2025-01-28T14:18:51Z

tests/data/test_data_handlers.py

+    template = "### Input: {{not found}} \n\n ### Response: {{text_label}}"
+    formatted_dataset_field = "formatted_data_field"
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+    with pytest.raises((KeyError, TemplateSyntaxError)):


Can we catch this error inside our code and give users a simple text error?

TemplateSyntaxError is no longer needed: that error is raised when a placeholder variable in the template contains a space, and we now handle spaces with the transform_placeholders utils function.

For KeyError, the error text shown to the user is defined in the handler apply_custom_data_formatting_jinja_template.
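The error-wrapping pattern referred to here can be sketched without Jinja. The renderer below is a hypothetical regex-based stand-in, not the handler from this PR; only the error message text and the raise ... from e pattern are taken from the diff quoted earlier in this thread.

```python
import re


def render_simple(template: str, element: dict) -> str:
    """Stand-in renderer: replace each {{name}} with str(element[name])."""
    try:
        return re.sub(
            r"\{\{\s*(\w+)\s*\}\}",
            lambda m: str(element[m.group(1)]),  # raises KeyError if the column is missing
            template,
        )
    except Exception as e:
        # Re-raise with a plain message, chaining the original cause for debugging.
        raise KeyError(f"Dataset does not contain field in template. {e}") from e


row = {"text_label": "no complaint"}
print(render_simple("### Response: {{text_label}}", row))

try:
    render_simple("### Input: {{not_found}}", row)
except KeyError as err:
    print(err.args[0])
```

A test can then assert on KeyError alone, which is what makes the TemplateSyntaxError branch in pytest.raises unnecessary.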

Signed-off-by: Abhishek <[email protected]>

dushyantbehl · 2025-02-04T12:03:12Z

tests/data/test_data_handlers.py

@@ -16,6 +16,7 @@
 # https://spdx.dev/learn/handling-license-info/

 # Third Party
+from jinja2.exceptions import TemplateSyntaxError


Can you remove this import, as it's not used?

dushyantbehl · 2025-02-04T12:05:00Z

tuning/data/data_handlers.py

+       Expects to be run as a HF Map API function.
+    Args:
+        element: the HF Dataset element loaded from a JSON or DatasetDict object.
+        dataset_text_field: formatted_dataset_field.


Please add tokenizer to the args here.

Also, I know this is not on you, but can you please fix the docstring at line 104 as well?

dushyantbehl · 2025-02-04T12:08:36Z

tuning/utils/config_utils.py

@@ -135,3 +136,34 @@ def txt_to_obj(txt):
    except UnicodeDecodeError:
        # Otherwise the bytes are a pickled python dictionary
        return pickle.loads(message_bytes)
+
+
+def transform_placeholders(template: str) -> str:


Could we please rename this function to be more descriptive?

sanitise jinja placeholders?

dushyantbehl · 2025-02-04T12:10:08Z

Suggested minor changes; barring those, LGTM.
@willmj, please share your review; after that you are good to merge, @Abhishek-TAMU.

Signed-off-by: Abhishek <[email protected]>

Abhishek-TAMU · 2025-02-04T21:40:18Z

Made the suggested changes. @willmj Feel free to give a final review.

Add support for jinja template

902af4f

Signed-off-by: Abhishek <[email protected]>

github-actions bot added the feat label Jan 15, 2025

foundation-model-stack deleted a comment from github-actions bot Jan 15, 2025

Abhishek-TAMU marked this pull request as ready for review January 21, 2025 14:33

Abhishek-TAMU requested review from anhuong, Ssukriti, aluu317, fabianlim and kmehant as code owners January 21, 2025 14:33

Abhishek-TAMU commented Jan 21, 2025

Merge remote-tracking branch 'upstream/main' into jinja_template

d65f759

dushyantbehl reviewed Jan 28, 2025

Abhishek-TAMU added 2 commits February 3, 2025 11:42

Merge remote-tracking branch 'upstream/main' into jinja_template

1a4ef2e

Signed-off-by: Abhishek <[email protected]>

Suggested PR changes

0e9ad3f

Signed-off-by: Abhishek <[email protected]>

dushyantbehl reviewed Feb 4, 2025

PR Changes

8d3e77f

Signed-off-by: Abhishek <[email protected]>
