feat(ingest): Glue jobs #2687

Merged · 60 commits · Jun 22, 2021

Commits:
334a485
Update README
kevinhu Jun 9, 2021
8580c54
Merge branch 'master' of github.com:kevinhu/datahub into glue-etl
kevinhu Jun 9, 2021
499d308
Merge branch 'linkedin:master' into glue-etl
kevinhu Jun 10, 2021
520b099
Merge branch 'glue-etl' of github.com:kevinhu/datahub into glue-etl
kevinhu Jun 10, 2021
1941e93
Merge branch 'linkedin:master' into glue-etl
kevinhu Jun 10, 2021
5270fac
Read transformation DAGs
kevinhu Jun 11, 2021
67c0807
Extract node sources
kevinhu Jun 11, 2021
3ffdb1a
Init glue MCEs
kevinhu Jun 11, 2021
bdae7c1
Refactor job and flow wus
kevinhu Jun 11, 2021
4acc825
Resolve source and sink datasets
kevinhu Jun 11, 2021
68ed8e5
Merge branch 'linkedin:master' into glue-etl
kevinhu Jun 11, 2021
2fd692b
Set URNs correctly
kevinhu Jun 11, 2021
a0bc357
Isort and update snapshot JSONs
kevinhu Jun 11, 2021
e4b4d64
Successful ingestion
kevinhu Jun 11, 2021
7f0eb42
Refactor job listing
kevinhu Jun 11, 2021
fe63ce6
Glue ETL comments
kevinhu Jun 11, 2021
218338e
Clean up s3 naming
kevinhu Jun 11, 2021
5c1c9f2
Add job properties
kevinhu Jun 12, 2021
5e4873a
Fix lint errors
kevinhu Jun 14, 2021
4739367
Temp disable extract_transform in tests
kevinhu Jun 14, 2021
ce58f7b
Fix S3 URN
kevinhu Jun 14, 2021
a37b1db
Stubs for S3
kevinhu Jun 14, 2021
5050b8f
Fix lint errors
kevinhu Jun 14, 2021
ff682a8
Create Glue golden MCE json
kevinhu Jun 15, 2021
5144ffc
Trim Glue golden MCE
kevinhu Jun 15, 2021
ab419a4
Reapply freeze to Glue files
kevinhu Jun 15, 2021
ecd89a6
Fix golden path
kevinhu Jun 15, 2021
06d6c8a
Merge
kevinhu Jun 15, 2021
51ae22a
Fix duplicate MCEs
kevinhu Jun 15, 2021
108a203
Fix outputDatasets
kevinhu Jun 15, 2021
d9d8bc5
Remove S3 URIs
kevinhu Jun 15, 2021
e4a96b4
Expand job names
kevinhu Jun 15, 2021
db9d8ca
Expand job custom props
kevinhu Jun 15, 2021
7e6333b
Update golden
kevinhu Jun 15, 2021
4026917
Remove ownership classes
kevinhu Jun 15, 2021
c436149
Clean up redundant properties
kevinhu Jun 15, 2021
8619918
Merge branch 'linkedin:master' into glue-etl
kevinhu Jun 15, 2021
33dd0ea
Fix topological sort
kevinhu Jun 15, 2021
e0eaf0c
Fix S3 browse paths
kevinhu Jun 15, 2021
6f7b74a
Restore feast
kevinhu Jun 15, 2021
c738a74
Smaller stubs
kevinhu Jun 15, 2021
1d89927
Update README
kevinhu Jun 16, 2021
fe5b087
Resolve golden script conflict
kevinhu Jun 17, 2021
948dd24
Regenerate snapshot JSONs
kevinhu Jun 17, 2021
8a681cd
Merge
kevinhu Jun 17, 2021
95efa7e
Rebuild
kevinhu Jun 17, 2021
55e6619
Refactor node processing
kevinhu Jun 17, 2021
51bdb88
Add links to boto docs
kevinhu Jun 17, 2021
59bbf15
Fix sequence type error
kevinhu Jun 18, 2021
2a4501b
Fix Id typo
kevinhu Jun 18, 2021
8feb874
Merge branch 'linkedin:master' into glue-etl
kevinhu Jun 18, 2021
d173d6c
Types for process_dataflow_graph
kevinhu Jun 18, 2021
1b44baf
Include extension type in glue imports
kevinhu Jun 18, 2021
d80f529
S3 deduplication logic
kevinhu Jun 19, 2021
db1c638
Fix type annotation
kevinhu Jun 19, 2021
8eb3d50
Add comments for deduplication
kevinhu Jun 19, 2021
5b66074
Fix dataset IDs for Glue
kevinhu Jun 21, 2021
9955e06
Merge branch 'linkedin:master' into glue-etl
kevinhu Jun 21, 2021
6335f95
Update golden files
kevinhu Jun 21, 2021
09b9b60
Merge
kevinhu Jun 22, 2021
10 changes: 7 additions & 3 deletions datahub-web-react/src/utils/sort/topologicalSort.ts
@@ -6,6 +6,7 @@ function topologicalSortHelper(
explored: Set<string>,
result: Array<EntityRelationship>,
urnsArray: Array<string>,
nodes: Array<EntityRelationship>,
) {
if (!node.entity?.urn) {
return;
@@ -16,11 +17,14 @@
.filter((entity) => entity?.entity?.urn && urnsArray.includes(entity?.entity?.urn))
.forEach((n) => {
if (n?.entity?.urn && !explored.has(n?.entity?.urn)) {
topologicalSortHelper(n, explored, result, urnsArray);
topologicalSortHelper(n, explored, result, urnsArray, nodes);
}
});
if (urnsArray.includes(node?.entity?.urn)) {
result.push(node);
const fullyFetchedEntity = nodes.find((n) => n?.entity?.urn === node?.entity?.urn);
Collaborator: What's the rationale for this change?

Contributor Author: The previous implementation was causing a bug where nodes only had URNs and types specified, since they came from upstreamLineage – @gabe-lyons can elaborate!

if (fullyFetchedEntity) {
result.push(fullyFetchedEntity);
}
}
}

@@ -34,7 +38,7 @@ export function topologicalSort(input: Array<EntityRelationship | null>) {
.map((node) => node.entity?.urn) as Array<string>;
nodes.forEach((node) => {
if (node.entity?.urn && !explored.has(node.entity?.urn)) {
topologicalSortHelper(node, explored, result, urnsArray);
topologicalSortHelper(node, explored, result, urnsArray, nodes);
}
});

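The exchange above is easier to see with a minimal sketch of the failure mode (the EntityRelationship shape here is simplified and hypothetical, not the real generated GraphQL type): relationship entries coming from upstreamLineage carry only a urn and a type, so pushing them straight into the sort result drops the fields the UI needs; looking the URN up in the fully fetched nodes array restores the complete entity.

// Simplified, hypothetical shapes for illustration only.
interface Entity {
    urn: string;
    type: string;
    name?: string; // populated only on fully fetched entities
}
interface EntityRelationship {
    entity?: Entity;
}

// A node as it arrives via upstreamLineage: urn and type only.
const partial: EntityRelationship = {
    entity: { urn: 'urn:li:dataset:foo', type: 'DATASET' },
};

// The same entity as it appears in the fully fetched top-level nodes array.
const nodes: Array<EntityRelationship> = [
    { entity: { urn: 'urn:li:dataset:foo', type: 'DATASET', name: 'foo' } },
];

// Before the fix, result.push(partial) would lose `name`.
// After the fix, the fully fetched entity is pushed instead when one exists.
const fullyFetchedEntity = nodes.find((n) => n?.entity?.urn === partial.entity?.urn);
console.log(fullyFetchedEntity?.entity?.name); // => 'foo'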
4 changes: 2 additions & 2 deletions gms/api/src/main/pegasus/com/linkedin/datajob/DataJob.pdl
@@ -7,7 +7,7 @@ import com.linkedin.common.Status
import com.linkedin.common.GlobalTags

/**
* Metadata bout DataJob
* Metadata about DataJob
*/
record DataJob includes DataJobKey, ChangeAuditStamps {
/**
@@ -28,7 +28,7 @@ record DataJob includes DataJobKey, ChangeAuditStamps {
/**
* Input and output datasets of job
*/
inputOutput: optional DataJobInputOutput
inputOutput: optional DataJobInputOutput

/**
* Status information for the chart such as removed or not
(generated snapshot JSON; file name not captured)
@@ -158,7 +158,13 @@
"items" : "ChartDataSourceType"
},
"doc" : "Data sources for the chart",
"optional" : true
"optional" : true,
"Relationship" : {
"/*/string" : {
"entityTypes" : [ "dataset" ],
"name" : "Consumes"
}
}
}, {
"name" : "type",
"type" : {
(generated snapshot JSON; file name not captured)
@@ -378,7 +378,7 @@
"type" : "record",
"name" : "DataJob",
"namespace" : "com.linkedin.datajob",
"doc" : "Metadata bout DataJob",
"doc" : "Metadata about DataJob",
"include" : [ {
"type" : "record",
"name" : "DataJobKey",
@@ -438,9 +438,10 @@
"name" : "AzkabanJobType",
"namespace" : "com.linkedin.datajob.azkaban",
"doc" : "The various types of support azkaban jobs",
"symbols" : [ "COMMAND", "HADOOP_JAVA", "HADOOP_SHELL", "HIVE", "PIG", "SQL" ],
Collaborator: Let's not keep piling onto the AzkabanJobType. cc @jjoyce0510 is the plan still to add a free form string?

Collaborator: Yeah. This can either be a freeform string or another enum with a better name. I don't have a strong preference for adding a freeform string vs a better enum.
"symbols" : [ "COMMAND", "HADOOP_JAVA", "HADOOP_SHELL", "HIVE", "PIG", "SQL", "GLUE" ],
"symbolDocs" : {
"COMMAND" : "The command job type is one of the basic built-in types. It runs multiple UNIX commands using java processbuilder.\nUpon execution, Azkaban spawns off a process to run the command.",
"GLUE" : "Glue type is for running AWS Glue job transforms.",
"HADOOP_JAVA" : "Runs a java program with ability to access Hadoop cluster.\nhttps://azkaban.readthedocs.io/en/latest/jobTypes.html#java-job-type",
"HADOOP_SHELL" : "In large part, this is the same Command type. The difference is its ability to talk to a Hadoop cluster\nsecurely, via Hadoop tokens.",
"HIVE" : "Hive type is for running Hive jobs.",
(generated snapshot JSON; file name not captured)
@@ -158,7 +158,13 @@
"items" : "ChartDataSourceType"
},
"doc" : "Data sources for the chart",
"optional" : true
"optional" : true,
"Relationship" : {
"/*/string" : {
"entityTypes" : [ "dataset" ],
"name" : "Consumes"
}
}
}, {
"name" : "type",
"type" : {
@@ -1187,9 +1193,10 @@
"name" : "AzkabanJobType",
"namespace" : "com.linkedin.datajob.azkaban",
"doc" : "The various types of support azkaban jobs",
"symbols" : [ "COMMAND", "HADOOP_JAVA", "HADOOP_SHELL", "HIVE", "PIG", "SQL" ],
"symbols" : [ "COMMAND", "HADOOP_JAVA", "HADOOP_SHELL", "HIVE", "PIG", "SQL", "GLUE" ],
"symbolDocs" : {
"COMMAND" : "The command job type is one of the basic built-in types. It runs multiple UNIX commands using java processbuilder.\nUpon execution, Azkaban spawns off a process to run the command.",
"GLUE" : "Glue type is for running AWS Glue job transforms.",
"HADOOP_JAVA" : "Runs a java program with ability to access Hadoop cluster.\nhttps://azkaban.readthedocs.io/en/latest/jobTypes.html#java-job-type",
"HADOOP_SHELL" : "In large part, this is the same Command type. The difference is its ability to talk to a Hadoop cluster\nsecurely, via Hadoop tokens.",
"HIVE" : "Hive type is for running Hive jobs.",
4 changes: 2 additions & 2 deletions metadata-ingestion/README.md
@@ -485,11 +485,11 @@ source:
config:
aws_region: # aws_region_name, i.e. "eu-west-1"
env: # environment for the DatasetSnapshot URN, one of "DEV", "EI", "PROD" or "CORP". Defaults to "PROD".

# Filtering patterns for databases and tables to scan
database_pattern: # Optional, to filter databases scanned, same as schema_pattern above.
table_pattern: # Optional, to filter tables scanned, same as table_pattern above.

# Credentials. If not specified here, these are picked up according to boto3 rules.
# (see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html)
aws_access_key_id: # Optional.
5 changes: 3 additions & 2 deletions metadata-ingestion/examples/recipes/glue_to_datahub.yml
@@ -1,9 +1,10 @@
source:
type: glue
config:
aws_region: "us-east-1"
aws_region: "us-west-2"
extract_transforms: true

sink:
type: "datahub-rest"
config:
server: 'http://localhost:8080'
server: "http://localhost:8080"
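For context, a recipe like the one above would normally be executed with the DataHub CLI after installing the relevant source and sink plugins; a minimal sketch, assuming the plugin extras are published under these names:

# Extras names are assumptions; check the metadata-ingestion README for the exact ones.
pip install 'acryl-datahub[glue,datahub-rest]'
datahub ingest -c metadata-ingestion/examples/recipes/glue_to_datahub.yml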
1 change: 1 addition & 0 deletions metadata-ingestion/scripts/update_golden_files.sh
@@ -13,6 +13,7 @@ cp tmp/test_mysql_ingest0/mysql_mces.json tests/integration/mysql/mysql_mce_gold
cp tmp/test_mssql_ingest0/mssql_mces.json tests/integration/sql_server/mssql_mce_golden.json
cp tmp/test_mongodb_ingest0/mongodb_mces.json tests/integration/mongodb/mongodb_mce_golden.json
cp tmp/test_feast_ingest0/feast_mces.json tests/integration/feast/feast_mce_golden.json
cp tmp/test_glue_ingest0/glue_mce.json tests/unit/glue/glue_mce_golden.json

# Print success message.
set +x