feat(docs): Improve documentation on Search (#5889)
szalai1 authored Oct 3, 2022
1 parent e688dd5 commit 5fb875a
Showing 1 changed file with 127 additions and 14 deletions.
141 changes: 127 additions & 14 deletions docs/how/search.md
@@ -1,43 +1,145 @@
# Search Guide
import FeatureAvailability from '@site/src/components/FeatureAvailability';

## Introduction
# About DataHub Search

The search bar is one of the means of finding data in DataHub. In this document, we discuss more effective ways of finding information beyond doing a standard keyword search. This is because keyword searches can return results from almost any part of an entity.
<!-- All Feature Guides should begin with `About DataHub ` to improve SEO -->

### Search in Specific Fields
<!--
Update feature availability; by default, feature availability is Self-Hosted and Managed DataHub
The following examples are in the format of
X: *typical question* :
Add in `saasOnly` for Managed DataHub-only features
-->

<FeatureAvailability/>

The **search bar** is an important mechanism for discovering data assets in DataHub. From the search bar, you can find Datasets, Columns, Dashboards, Charts, Data Pipelines, and more. Simply type in a term and press 'enter'.

<p align="center">
<img width="70%" src="https://github.com/datahub-project/static-assets/blob/main/imgs/search-landingpage.png?raw=true" />
</p>

**Advanced queries** and the **filter sidebar** help fine-tune queries. For programmatic users, DataHub provides a **GraphQL API** as well.

## Search Setup, Prerequisites, and Permissions

Search is available for all users. Although Search works out of the box, the more relevant data you ingest, the better the results are.

## Using Search

Searching is as easy as typing in relevant business terms and pressing 'enter' to view matching data assets.

By default, search terms will match against different aspects of a data asset. This includes asset names, descriptions, tags, terms, owners, and even specific attributes like the names of columns in a table.


### Filters

The filters sidebar sits on the left-hand side of the search results and lets users find assets by drilling down. You can quickly filter by Data Platform (e.g. Snowflake), Tags, Glossary Terms, Domain, Owners, and more with a single click.

<p align="center">
<img width="70%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/filters_highlighted.png" />
</p>

### Results

Search results appear ranked by their relevance. In self-hosted DataHub, ranking is based on how closely the query matched textual fields of an asset and its metadata. In Managed DataHub, ranking is based on a combination of textual relevance, usage (queries / views), and change frequency.

With better metadata comes better results. Learn more about ingesting technical metadata in the [metadata ingestion](../../metadata-ingestion/README.md) guide.

### Advanced queries

The search bar supports advanced queries with pattern matching, logical expressions and filtering by specific field matches.

The following examples are in the format of
X: *typical question* :
```what to key in search bar```. [sample url](https://example.com)
Wildcard characters can be added to the search terms as well. These examples are non-exhaustive and use Datasets as a reference.

I want to:
1. *Find a dataset with the word **mask** in the name* :
If you want to:
1. Find a dataset with the word **mask** in the name:
```name: *mask*``` [Sample results](https://demo.datahubproject.io/search?page=1&query=name%3A%20%2Amask%2A)
This will return entities with **mask** in the name.
Names tend to be connected by other symbols, hence the wildcard symbols before and after the word.

2. *Find a dataset with a property, **encoding***
2. Find a dataset with a property, **encoding**
```customProperties: encoding*``` [Sample results](https://demo.datahubproject.io/search?page=1&query=customProperties%3A%20encoding%2A)
Dataset Properties are indexed in Elasticsearch in the form key=value. Hence, if you know the precise key-value pair, you can search using ```key=value```. However, if you only know the key, you can use a wildcard in place of the value, which is what is being done here.

3. *Find a dataset with a column name, **latitude***
3. Find a dataset with a column name, **latitude**
```fieldPaths: latitude``` [Sample results](https://demo.datahubproject.io/search?page=1&query=fieldPaths%3A%20latitude)
fieldPaths is the name of the attribute that holds the column name in Datasets.

4. *Find a dataset with the term **latitude** in the field description*
4. Find a dataset with the term **latitude** in the field description
```editedFieldDescriptions: latitude OR fieldDescriptions: latitude``` [Sample results](https://demo.datahubproject.io/search?page=1&query=editedFieldDescriptions%3A%20latitude%20OR%20fieldDescriptions%3A%20latitude)
Datasets have 2 attributes that contain field descriptions. fieldDescriptions comes from the SchemaMetadata aspect, while editedFieldDescriptions comes from the EditableSchemaMetadata aspect. EditableSchemaMetadata holds information that comes from UI edits, while SchemaMetadata holds data from ingestion of the dataset.

5. *Find a dataset with the term **logical** in the dataset description*
5. Find a dataset with the term **logical** in the dataset description
```editedDescription: *logical* OR description: *logical*``` [Sample results](https://demo.datahubproject.io/search?page=1&query=editedDescription%3A%20%2Alogical%2A%20OR%20description%3A%20%2Alogical%2A)
Similar to field descriptions, dataset descriptions can be found in 2 aspects, hence the need to search 2 attributes.

6. *Find a dataset which resides in one of the browsing folders, for instance, the **hive** folder*
6. Find a dataset which resides in one of the browsing folders, for instance, the **hive** folder
```browsePaths: *hive*``` [Sample results](https://demo.datahubproject.io/search?page=1&query=browsePaths%3A%20%2Ahive%2A)
BrowsePath is stored as a complete string, for instance ```/datasets/prod/hive/SampleKafkaDataset```, hence the need for wildcards on both ends of the term to return a result.
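
The advanced queries above can also be composed programmatically. The following is a minimal Python sketch, not part of DataHub itself: the helper names `field_query`, `any_of`, and `search_url` are hypothetical, and the base URL is simply the demo instance used by the sample links. It builds the same `field: term` strings and percent-encodes them the way the sample result links do.

```python
from urllib.parse import quote

# Base URL taken from the sample links above; swap in your own instance.
DEMO_SEARCH_URL = "https://demo.datahubproject.io/search"


def field_query(field: str, term: str, wildcard: bool = False) -> str:
    """Build a `field: term` clause, optionally wrapping the term in `*`."""
    if wildcard:
        term = f"*{term}*"
    return f"{field}: {term}"


def any_of(*clauses: str) -> str:
    """Join clauses with OR, as in the field description example."""
    return " OR ".join(clauses)


def search_url(query: str, page: int = 1, base: str = DEMO_SEARCH_URL) -> str:
    """Percent-encode the query string and append it to the search page URL."""
    return f"{base}?page={page}&query={quote(query, safe='')}"


if __name__ == "__main__":
    # Example 1: datasets with "mask" anywhere in the name.
    print(search_url(field_query("name", "mask", wildcard=True)))
    # Example 4: "latitude" in either field description attribute.
    print(
        search_url(
            any_of(
                field_query("editedFieldDescriptions", "latitude"),
                field_query("fieldDescriptions", "latitude"),
            )
        )
    )
```

Pasting the printed URLs into a browser should open the same result pages as the sample links, assuming the demo instance is reachable.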

## Where to find more information?
<!--
## Additional Resources
Comment out any irrelevant or empty sections -->

### Videos

**What can you do with DataHub?**

<p align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/dubrKIcv37c" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</p>


### GraphQL

* [searchAcrossEntities](https://datahubproject.io/docs/graphql/queries/#searchacrossentities)
* You can try out the API on the demo instance's public GraphQL interface: [here](https://demo.datahubproject.io/api/graphiql)

The same GraphQL API that powers the Search UI can be used
for integrations and programmatic use-cases.

```
# Example query
{
  searchAcrossEntities(
    input: {types: [], query: "*", start: 0, count: 10, filters: [{field: "fieldTags", value: "urn:li:tag:Dimension"}]}
  ) {
    start
    count
    total
    searchResults {
      entity {
        type
        ... on Dataset {
          urn
          type
          platform {
            name
          }
          name
        }
      }
    }
  }
}
```
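
For programmatic use, a query like the one above is sent as a JSON body in an HTTP POST. The following Python sketch shows one way this might look using only the standard library. Several details are assumptions rather than facts from this guide: the `/api/graphql` endpoint path (inferred from the GraphiQL link above), the optional bearer-token header, and the helper names `build_request` and `run_search`, which are hypothetical.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

# Assumption: the GraphQL API is served at /api/graphql next to the GraphiQL
# page linked above; adjust this for your own deployment.
GRAPHQL_ENDPOINT = "https://demo.datahubproject.io/api/graphql"

# Same query shape as the example above, trimmed to a few fields.
SEARCH_QUERY = """
{
  searchAcrossEntities(
    input: {types: [], query: "*", start: 0, count: 10, filters: [{field: "fieldTags", value: "urn:li:tag:Dimension"}]}
  ) {
    total
    searchResults {
      entity {
        type
        ... on Dataset {
          urn
          name
        }
      }
    }
  }
}
"""


def build_request(
    query: str, endpoint: str = GRAPHQL_ENDPOINT, token: Optional[str] = None
) -> Request:
    """Wrap a GraphQL query string in a JSON POST request (nothing is sent yet)."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: the instance accepts a bearer access token.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(endpoint, data=body, headers=headers, method="POST")


def run_search(request: Request) -> dict:
    """Send the request and decode the JSON response."""
    with urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    result = run_search(build_request(SEARCH_QUERY))
    print(result["data"]["searchAcrossEntities"]["total"])
```

Keeping request construction separate from sending makes it easy to inspect the payload before running it against a live instance.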


### DataHub Blog
* [Using DataHub for Search & Discovery](https://blog.datahubproject.io/using-datahub-for-search-discovery-fa309089be22)

## FAQ and Troubleshooting

**How are the results ordered?**

The order of the search results is based on the weight that DataHub assigns to each result using its search algorithm. The current algorithm in OSS DataHub is based on a text-match score from Elasticsearch.

**Where to find more information?**

The sample queries here are non-exhaustive. [The link here](https://demo.datahubproject.io/tag/urn:li:tag:Searchable) shows the current list of indexed fields for each entity inside DataHub. Click on the fields inside each entity and see which field has the tag ```Searchable```.
However, it does not tell you the specific attribute name to use for specialized searches. One way to do so is to inspect the Elasticsearch indices, for example:
```curl http://localhost:9200/_cat/indices``` returns all the ES indices in the ElasticSearch container.
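
The same inspection can be scripted. Below is a rough Python sketch under a few assumptions: Elasticsearch is reachable at the `localhost:9200` address from the curl example, the cluster returns JSON when `format=json` is passed to the cat API, and mappings use the typeless layout of Elasticsearch 7 or later. The function names are hypothetical and only illustrate the idea of listing indices and then reading the attribute names from an index mapping.

```python
import json
from urllib.request import urlopen

ES_BASE_URL = "http://localhost:9200"  # same host as the curl example


def cat_indices_url(base_url: str = ES_BASE_URL) -> str:
    """URL of the cat indices API, asking for JSON instead of plain text."""
    return f"{base_url.rstrip('/')}/_cat/indices?format=json"


def mapping_url(index: str, base_url: str = ES_BASE_URL) -> str:
    """URL of the mapping API for a single index."""
    return f"{base_url.rstrip('/')}/{index}/_mapping"


def index_names(cat_response_text: str) -> list:
    """Extract the sorted index names from a JSON cat indices response."""
    return sorted(row["index"] for row in json.loads(cat_response_text))


def mapped_fields(mapping_response_text: str, index: str) -> list:
    """Extract the sorted top-level attribute names from a mapping response.

    Assumes the typeless mapping layout (Elasticsearch 7 or later).
    """
    mapping = json.loads(mapping_response_text)
    return sorted(mapping[index]["mappings"]["properties"])


def fetch_text(url: str) -> str:
    """Fetch a URL and return the body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    for name in index_names(fetch_text(cat_indices_url())):
        print(name, mapped_fields(fetch_text(mapping_url(name)), name))
```

The attribute names printed for an entity's index are candidates for the `field: term` syntax described earlier.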
@@ -121,5 +223,16 @@ example information of a dataset:
},
```

<!-- Use the following format:
**Question in bold text**
Response in plain text
-->

*Need more help? Join the conversation in [Slack](http://slack.datahubproject.io)!*

### Related Features

* [Metadata ingestion framework](../../metadata-ingestion/README.md)
