Replies: 1 comment
-
I like your evaluation of Frictionless.
-
Frictionless is a framework for common Data Engineering problems. It boils down to a set of specifications that are housed in their GitHub repo. Several projects implement the specifications, including a Python package.
As of now the basic workflow we have developed for MESSES is Extract -> Validate -> Convert, which is very similar to the generalized workflow described in Data Engineering, Extract -> Transform -> Load (ETL). The issues Data Engineering deals with are very similar to the issues we are dealing with, but Data Engineering usually has much better starting sources and ending destinations. By "better" I mean that the sources are often existing SQL databases with some level of cleanliness and structure, and the destinations are also usually SQL databases. In our case we usually have to create tables out of rawer data, and the repositories we are loading into require a variety of data formats, none of which are SQL tables. (It might actually simplify things a lot if repositories exposed their internal data structures, because those are likely SQL tables, and common Data Engineering tools and techniques could then be used for deposition.)
The more I learn about Data Engineering, the more tools and techniques I find that could be applied in MESSES. So far, though, every time I investigate them they turn out not to be flexible enough to use wholesale in MESSES. These tools always assume a relatively clean tabular structure as input and a clean tabular structure as output. By "clean" input I mean an assumption of 1 data table per file. They don't expect the values in each column to be clean (there may be numbers and string representations of numbers mixed in 1 column, for example), but they do expect that only 1 table is in the file/sheet. We can't and don't assume that. Frictionless makes this assumption, so it can't help at all for extraction.
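
To make the "1 table per file" problem concrete, here is a minimal sketch of the kind of input that breaks that assumption. This is not how MESSES does extraction; the file contents, the `split_tables` helper, and the "blank row separates tables" rule are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical export: two unrelated tables in one file, separated by a blank row.
RAW = """\
sample_id,weight
s1,1.5
s2,2.0

protocol_id,type
p1,extraction
"""

def split_tables(text):
    """Split CSV text into tables wherever a fully empty row appears."""
    blocks, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            # An empty row closes the table collected so far.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    # Treat the first row of each block as that table's header.
    return [
        [dict(zip(block[0], values)) for values in block[1:]]
        for block in blocks
    ]

tables = split_tables(RAW)
print(len(tables))
```

A tool that reads the whole file as a single table would take `sample_id,weight` as the header for every row, including the protocol rows, which is why the splitting has to happen before any tabular tooling can be applied.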
For us "transformation" would refer more to going from one format to another, but in Data Engineering it refers more to type casting, column splitting/combining, joining, etc. These types of transformations can be done in MESSES through the extract -> validate loop and in the conversion step, but I think of the conversion step more as the transformation into the final format. Taking entity records in a table and dispersing them through more deeply nested JSON objects, for example, feels categorically different to me from what is typically described as "transformation" in the Data Engineering context. Frictionless has no support for complex transformations into formats other than tables, and thus isn't helpful for conversion.
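
As a rough illustration of what I mean by dispersing table records into nested JSON, here is a small sketch. The field names (`subject`, `sample`, `measurement`, `value`), the `nest` helper, and the target layout are hypothetical and do not correspond to any particular repository format or to the actual MESSES conversion code.

```python
import json

# Hypothetical flat entity records, as they might come out of a table.
rows = [
    {"sample": "s1", "subject": "mouse1", "measurement": "glucose", "value": 1.2},
    {"sample": "s1", "subject": "mouse1", "measurement": "lactate", "value": 0.4},
    {"sample": "s2", "subject": "mouse2", "measurement": "glucose", "value": 1.9},
]

def nest(rows):
    """Disperse flat rows into subject -> sample -> measurement objects."""
    out = {}
    for row in rows:
        subject = out.setdefault(row["subject"], {"samples": {}})
        sample = subject["samples"].setdefault(row["sample"], {"measurements": {}})
        sample["measurements"][row["measurement"]] = row["value"]
    return out

nested = nest(rows)
print(json.dumps(nested, indent=2))
```

The output is no longer a table at all, which is the part that table-to-table transformation tools do not cover.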
Validation is the piece that is largely the same for us and in the Data Engineering context. For some reason, though, in the Data Engineering context it is not given its own letter; validation and cleaning are largely folded into the extraction and transformation parts of ETL. Once again, though, we generally have more needs than Data Engineering does, namely our Protocol Dependent Schema. There are ways of doing what we do with Protocol Dependent Schema in SQL through CONSTRAINT and CHECK clauses, but that is a more advanced database design concept and is typically not talked about in Data Engineering. Currently, we accomplish our validation needs through a combination of JSON Schema and Python code.
Frictionless has validation very similar to JSON Schema and drew inspiration from it, but it does not implement everything JSON Schema does. They basically took their tabular data assumption and created a JSON Schema-like validation that expands some functionality, such as adding foreign keys and primary keys, but drops other functionality, such as if/then/else clauses. Their Table Schema can be viewed here.
There is an important distinction to be made here about validation with Frictionless. The Table Schema they specify allows only a limited set of optional constraints, declared per table in a JSON or YAML file, but the Python Frictionless package gives an interface for more checks and custom checks here. Basically, this means that an implementation built on Frictionless would look very similar to the current one, since we would still not be able to create 1 Schema that specifies all the validation. Ultimately, it doesn't look like Frictionless would be worth the effort of switching implementations for validation either.
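
For reference, the CONSTRAINT/CHECK approach mentioned above can be sketched with Python's built-in `sqlite3` module. This is only an illustration of a conditional rule in SQL, not our Protocol Dependent Schema: the `protocol` table, its columns, and the rule "a treatment protocol must have a dose" are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: which fields are required depends on the protocol type.
conn.execute(
    """
    CREATE TABLE protocol (
        id TEXT PRIMARY KEY,
        type TEXT NOT NULL,
        dose REAL,
        -- "if type is 'treatment' then dose is required", written as a CHECK
        CONSTRAINT treatment_needs_dose
            CHECK (type <> 'treatment' OR dose IS NOT NULL)
    )
    """
)

def try_insert(record):
    """Return True if the row satisfies the constraints, False otherwise."""
    try:
        with conn:
            conn.execute("INSERT INTO protocol VALUES (?, ?, ?)", record)
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert(("p1", "collection", None)),  # dose not required for this type
    try_insert(("p2", "treatment", 5.0)),    # treatment with a dose
    try_insert(("p3", "treatment", None)),   # treatment without a dose
]
print(results)
```

The conditional has to be rewritten as "not A or B", which is part of why this style of constraint tends to be treated as advanced database design. JSON Schema expresses the same kind of rule with if/then/else, which is the functionality that Table Schema leaves out.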
In summary, Frictionless looked quite promising, but it is not flexible enough for our use in MESSES. It could possibly be used for validation, but it would not significantly improve the implementation.