[DOCS] Gives more details to the load data step of the semantic search tutorials #113088

Merged · 3 commits · Sep 18, 2024
@@ -120,12 +120,12 @@ IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
You can use a different data set to test the workflow and become familiar with it.

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI.
Assign the name `id` to the first column and `content` to the second column.
The index name is `test-data`.
Once the upload is complete, you can see an index named `test-data` with 182469 documents.
Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
After your data is analyzed, click **Override settings**.
Under **Edit field names**, assign `id` to the first column and `content` to the second.
Click **Apply**, then **Import**.
Name the index `test-data`, and click **Import**.
Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
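
To verify the document count from the API instead of the UI, you can, for example, run a count request against the new index:

[source,console]
----
GET test-data/_count
----
// TEST[skip:TBD]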

[discrete]
[[reindexing-data-elser]]
@@ -161,6 +161,17 @@ GET _tasks/<task_id>

You can also open the Trained Models UI and select the Pipelines tab under ELSER to follow the progress.
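
Alternatively, the trained models statistics API exposes inference counts for the deployment, which can serve as a rough progress indicator; replace `<model_id>` with the ID of the ELSER model you deployed (for example, `.elser_model_2`):

[source,console]
----
GET _ml/trained_models/<model_id>/_stats
----
// TEST[skip:TBD]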

While following this tutorial, you can also cancel the reindexing process if you don't want to wait until it is fully complete, which might take hours for large data sets.
You can test the feature even if you only reindex a subset of the data set - a few thousand data points, for example - and generate embeddings for that subset.
The following API request cancels the reindexing task:

[source,console]
----
POST _tasks/<task_id>/_cancel
----
// TEST[skip:TBD]
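
Alternatively, if you only want embeddings for a subset from the start, you can limit the reindex request itself with `max_docs` instead of cancelling it later.
The sketch below uses placeholder names for the destination index and ingest pipeline; substitute the ones you created earlier in this tutorial:

[source,console]
----
POST _reindex?wait_for_completion=false
{
  "max_docs": 5000,
  "source": {
    "index": "test-data",
    "size": 50
  },
  "dest": {
    "index": "<your-destination-index>",
    "pipeline": "<your-elser-pipeline>"
  }
}
----
// TEST[skip:TBD]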


[discrete]
[[text-expansion-query]]
==== Semantic search by using the `sparse_vector` query
@@ -68,12 +68,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages.
All unique passages, along with their IDs, have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI.
Assign the name `id` to the first column and `content` to the second column.
The index name is `test-data`.
Once the upload is complete, you can see an index named `test-data` with 182469 documents.
Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
After your data is analyzed, click **Override settings**.
Under **Edit field names**, assign `id` to the first column and `content` to the second.
Click **Apply**, then **Import**.
Name the index `test-data`, and click **Import**.
Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
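
To spot-check that the `id` and `content` columns were mapped as expected, you can optionally retrieve a single document from the new index:

[source,console]
----
GET test-data/_search?size=1
----
// TEST[skip:TBD]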

[discrete]
[[reindexing-data-infer]]
@@ -92,7 +92,9 @@
GET _tasks/<task_id>
----
// TEST[skip:TBD]

You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
While following this tutorial, you can also cancel the reindexing process if you don't want to wait until it is fully complete, which might take hours for large data sets.
You can test the feature even if you only reindex a subset of the data set - a few thousand data points, for example - and generate embeddings for that subset.
The following API request cancels the reindexing task:

[source,console]
----
POST _tasks/<task_id>/_cancel
----
// TEST[skip:TBD]
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI. Assign the name `id` to the first column and `content` to
the second column. The index name is `test-data`. Once the upload is complete,
you can see an index named `test-data` with 182469 documents.
Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
After your data is analyzed, click **Override settings**.
Under **Edit field names**, assign `id` to the first column and `content` to the second.
Click **Apply**, then **Import**.
Name the index `test-data`, and click **Import**.
Once the upload is complete, you will see an index named `test-data` with 182,469 documents.
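
If you want to confirm how the uploaded columns were mapped before reindexing, you can optionally inspect the mapping of the new index:

[source,console]
------------------------------------------------------------
GET test-data/_mapping
------------------------------------------------------------
// TEST[skip:TBD]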


[discrete]
@@ -137,8 +138,9 @@
GET _tasks/<task_id>
------------------------------------------------------------
// TEST[skip:TBD]

It is recommended to cancel the reindexing process if you don't want to wait
until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
While following this tutorial, it is recommended to cancel the reindexing process if you don't want to wait until it is fully complete, which might take a long time for an inference endpoint with few assigned resources.
You can test the feature even if you only reindex a subset of the data set - a few thousand data points, for example - and generate embeddings for that subset.
The following API request cancels the reindexing task:

[source,console]
------------------------------------------------------------
POST _tasks/<task_id>/_cancel
------------------------------------------------------------
// TEST[skip:TBD]