feat: Add custom embedder #2236

vonodiripsa · 2024-06-12T18:52:47Z

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Briefly describe the changes included in this Pull Request.

How is this patch tested?

I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

No. You can skip this section.
Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

No. You can skip this section.
Yes. Make sure you have added samples following the steps below.

Find the corresponding markdown file for your new feature in website/docs/documentation folder.
Make sure you choose the correct class estimators/transformers and namespace.
Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
Make sure the DocTable points to correct API link.
Navigate to website folder, and run yarn run start to make sure the website renders correctly.
Don't forget to add  before each python code blocks to enable auto-tests for python samples.
Make sure the WebsiteSamplesTests job passes in the pipeline.

mhamilton723 · 2024-06-12T18:55:02Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+        self,
+        inputCol=None,
+        outputCol=None,
+        useTRTFlag=None,


nit: replace useTRTFlag with a runtime parameter that accepts "cpu", "gpu", or "tensorrt" and defaults to "cpu"
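
One way to read this suggestion, as a plain-Python sketch rather than the PR's actual code: a single string option replaces the boolean flag and is validated against the three values named in the comment. The helper name resolve_runtime and the constant SUPPORTED_RUNTIMES are hypothetical, and wiring the value into a PySpark Param is left out.

```python
# Hypothetical helper, not part of the PR: one `runtime` option in place of
# the boolean useTRTFlag, restricted to the values named in the review comment.
SUPPORTED_RUNTIMES = ("cpu", "gpu", "tensorrt")


def resolve_runtime(runtime: str = "cpu") -> str:
    """Return the normalized runtime name, or raise on an unsupported value."""
    normalized = runtime.strip().lower()
    if normalized not in SUPPORTED_RUNTIMES:
        raise ValueError(
            f"Unsupported runtime {runtime!r}; expected one of {SUPPORTED_RUNTIMES}"
        )
    return normalized
```

A string option like this leaves room for further backends later without adding another boolean per backend.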

mhamilton723 · 2024-06-12T18:55:27Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+
+    # Define additional parameters
+    useTRT = Param(Params._dummy(), "useTRT", "True if use TRT acceleration")
+    driverOnly = Param(


nit: remove the driverOnly code

mhamilton723 · 2024-06-12T18:57:55Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+            inputCol="combined",
+            outputCol="embeddings",


Look at other examples in the library for the proper defaults for these columns.

mhamilton723 · 2024-06-12T18:58:53Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+            for batch_size in [64, 32, 16, 8, 4, 2, 1]:
+                for sentence_length in [20, 300, 512]:
+                    yield (batch_size, sentence_length)


Make these magic numbers parameters with defaults.
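
A minimal sketch of what that could look like, assuming the generator is pulled out into a standalone function. The default values are copied from the hunk above; the names iter_batch_shapes, DEFAULT_BATCH_SIZES, and DEFAULT_SENTENCE_LENGTHS are hypothetical.

```python
from typing import Iterator, Sequence, Tuple

# Defaults copied from the reviewed hunk; the names here are hypothetical.
DEFAULT_BATCH_SIZES = (64, 32, 16, 8, 4, 2, 1)
DEFAULT_SENTENCE_LENGTHS = (20, 300, 512)


def iter_batch_shapes(
    batch_sizes: Sequence[int] = DEFAULT_BATCH_SIZES,
    sentence_lengths: Sequence[int] = DEFAULT_SENTENCE_LENGTHS,
) -> Iterator[Tuple[int, int]]:
    """Yield every (batch_size, sentence_length) pair, batch size varying slowest."""
    for batch_size in batch_sizes:
        for sentence_length in sentence_lengths:
            yield (batch_size, sentence_length)
```

Callers that need a smaller sweep can then pass their own sequences, for example iter_batch_shapes([8, 1], [128]).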

mhamilton723 · 2024-06-12T18:59:52Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+            """
+            Create a data loader with synthetic data using Faker.
+            """
+            faker = Faker()


nit: let's try to remove this dependency
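
One possible replacement, sketched with only the standard library: build synthetic sentences from a small fixed vocabulary instead of calling Faker. This assumes the data only needs the right batch size and length, not realistic text, and that sentence length is counted in words; the function name synthetic_sentences and the vocabulary are hypothetical.

```python
import random
from typing import List, Optional

# Small fixed vocabulary; any word list works when only the shape of the data matters.
_VOCAB = ("data", "model", "spark", "vector", "query", "text", "batch", "embed")


def synthetic_sentences(
    batch_size: int, sentence_length: int, seed: Optional[int] = None
) -> List[str]:
    """Return batch_size strings, each made of sentence_length space-separated words."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [
        " ".join(rng.choice(_VOCAB) for _ in range(sentence_length))
        for _ in range(batch_size)
    ]
```

Passing a seed makes the generated batches reproducible between runs.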

mhamilton723 · 2024-06-12T19:00:21Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+                for sentence_length in [20, 300, 512]:
+                    yield (batch_size, sentence_length)
+
+        def get_dataloader(repeat_times: int = 2):


nit: rename to _get_dataloader

mhamilton723 · 2024-06-12T19:00:41Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+            func, dataloader=tqdm(get_dataloader(), total=total_batches), config=conf
+        )
+
+    def run_on_driver(self, queries, spark):


mhamilton723 · 2024-06-12T19:01:34Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+        """
+        return self._defaultCopy(extra)
+
+    def load_data_food_reviews(self, spark, path=None, limit=1000):


Move this code into the demo.

mhamilton723 · 2024-06-12T19:02:11Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+class SuppressLogging:
+    def __init__(self):
+        self._original_stderr = None
+
+    def start(self):
+        """Start suppressing logging by redirecting sys.stderr to /dev/null."""
+        if self._original_stderr is None:
+            self._original_stderr = sys.stderr
+            sys.stderr = open('/dev/null', 'w')
+
+    def stop(self):
+        """Stop suppressing logging and restore sys.stderr."""
+        if self._original_stderr is not None:
+            sys.stderr.close()
+            sys.stderr = self._original_stderr
+            self._original_stderr = None


mhamilton723 · 2024-06-12T19:04:18Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+    FloatType,
+)
+
+class EmbeddingTransformer(Transformer, HasInputCol, HasOutputCol):


nit: rename the class to HuggingFaceSentenceEmbedder.

Also name the file HuggingFaceSentenceEmbedder.py.

mhamilton723 · 2024-06-12T19:18:12Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+        modelName="intfloat/e5-large-v2",
+        moduleName="e5-large-v2",


nit: no defaults here, and try to make the moduleName parameter go away
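
In the hunk above, the default moduleName is exactly the last path segment of the default modelName. If that relationship holds in general (an assumption, not something the thread confirms), the second parameter could be derived instead of passed. The helper name derive_module_name is hypothetical.

```python
def derive_module_name(model_name: str) -> str:
    """Return the last path segment of a model id such as 'org/model'."""
    name = model_name.strip().rstrip("/")
    if not name:
        raise ValueError("model_name must be a non-empty string")
    # rsplit with maxsplit=1 also handles ids that contain no slash.
    return name.rsplit("/", 1)[-1]
```

With this, callers would supply only modelName, and the module name would follow from it.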

mhamilton723 · 2024-06-12T19:18:40Z

deep-learning/src/main/python/synapse/ml/dl/sentence_embedding_transformer.py

+        Initialize the EmbeddingTransformer with input/output columns and optional TRT flag.
+        """
+        super(EmbeddingTransformer, self).__init__()
+        self._setDefault(


Try it on some other models from https://sbert.net/docs/sentence_transformer/pretrained_models.html

mhamilton723 · 2024-06-12T19:32:45Z

tools/init_scripts/init_retriever.sh

+/databricks/python/bin/pip install --extra-index-url https://pypi.nvidia.com cudf-cu11~=${RAPIDS_VERSION} cuml-cu11~=${RAPIDS_VERSION} pylibraft-cu11~=${RAPIDS_VERSION} rmm-cu11~=${RAPIDS_VERSION} 
+
+# install model navigator
+/databricks/python/bin/pip install --extra-index-url https://pypi.nvidia.com onnxruntime-gpu==1.16.3 "tensorrt==9.3.0.post12.dev1" "triton-model-navigator<1" "sentence_transformers~=2.2.2" "faker" "urllib3<2" 


nit: remove faker

azure-pipelines · 2024-08-01T05:10:59Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T05:24:52Z

/azp run

azure-pipelines · 2024-08-01T05:25:06Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T06:46:28Z

/azp run

azure-pipelines · 2024-08-01T06:46:42Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T06:57:37Z

/azp run

azure-pipelines · 2024-08-01T06:57:49Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T17:02:32Z

/azp run

azure-pipelines · 2024-08-01T17:02:48Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T18:37:09Z

/azp run

azure-pipelines · 2024-08-01T18:37:24Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T18:37:44Z

/azp run

azure-pipelines · 2024-08-01T18:37:54Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-01T20:50:03Z

/azp run

azure-pipelines · 2024-08-01T20:50:15Z

Azure Pipelines successfully started running 1 pipeline(s).

mhamilton723 · 2024-08-02T14:54:31Z

/azp run

azure-pipelines · 2024-08-02T14:54:44Z

Azure Pipelines successfully started running 1 pipeline(s).

bvonodiripsa · 2024-08-06T19:54:53Z

/azp run

azure-pipelines · 2024-08-06T19:55:07Z

Azure Pipelines successfully started running 1 pipeline(s).

Feat: Add custom embedder

0d76e67

vonodiripsa requested a review from mhamilton723 as a code owner June 12, 2024 18:52

mhamilton723 reviewed Jun 12, 2024

bvonodiripsa and others added 15 commits June 13, 2024 23:02

Corrected Names and file location

9afa431

Code style corrections

ee8f6f9

Source temp fixes

09af0e0

Formating

a485010

First test

eae2aee

Name changes

2cdfb59

With two models

88d34b8

Source style corrections

052f39b

Name change

ac7bc67

Name change

52a1581

Merge init scripts

1bd45ab

Removed extra file

cce365a

Added result output and _ correction

8d2ed06

Formatted

6b7798d

Merge branch 'microsoft:master' into add-demo

a8b4dc9

Reverse style change

72b4595

change data size

575f020

comment a line

fc66d0d

Corrected init_spark()

e8d308d

Style and added SQLContext

07a67d3

Corrected result_df and remove old image

2599ae0

bvonodiripsa and others added 2 commits August 2, 2024 04:42

Corrected sidebars.js

2b3f3d4

Merge branch 'master' into add-demo

be95231

match web names

b19a895

Merge branch 'master' into add-demo

5a993cf

mhamilton723 merged commit 6da5f57 into microsoft:master Aug 7, 2024
6 checks passed
