Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(explore): Postgres datatype conversion #13294

Merged
merged 36 commits into from
Mar 12, 2021
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
70572c8
test
nikolagigic Feb 23, 2021
8e330db
unnecessary import
nikolagigic Feb 23, 2021
83f3996
fix lint
nikolagigic Feb 23, 2021
d6c4c1c
changes
nikolagigic Feb 24, 2021
35fa495
fix lint
nikolagigic Feb 25, 2021
c00e4c3
changes
nikolagigic Feb 26, 2021
488a840
changes
nikolagigic Feb 26, 2021
c894b90
changes
nikolagigic Feb 26, 2021
2fb975b
Merge branch 'master' into postgres_type_conversion
nikolagigic Feb 26, 2021
82a8c9d
changes
nikolagigic Mar 2, 2021
4b8d0ec
answering comments & changes
nikolagigic Mar 3, 2021
ddcc14a
answering comments
nikolagigic Mar 3, 2021
fcb5edc
answering comments
nikolagigic Mar 3, 2021
010e50e
changes
nikolagigic Mar 3, 2021
32f58a8
changes
nikolagigic Mar 4, 2021
ebcbb53
changes
nikolagigic Mar 4, 2021
1bdebbf
fix tests
nikolagigic Mar 4, 2021
2f341b9
fix tests
nikolagigic Mar 4, 2021
39cc3b3
fix tests
nikolagigic Mar 5, 2021
d9afba7
fix tests
nikolagigic Mar 5, 2021
1050974
fix tests
nikolagigic Mar 5, 2021
1a799df
Merge branch 'master' into postgres_type_conversion
nikolagigic Mar 5, 2021
b92e2ac
fix tests
nikolagigic Mar 8, 2021
e630335
fix tests
nikolagigic Mar 8, 2021
c9b5e56
fix tests
nikolagigic Mar 8, 2021
276c820
fix tests
nikolagigic Mar 8, 2021
8430ef3
fix tests
nikolagigic Mar 9, 2021
0189072
fix tests
nikolagigic Mar 9, 2021
bfdc994
fix tests
nikolagigic Mar 10, 2021
8e1d813
Merge branch 'master' into postgres_type_conversion
nikolagigic Mar 10, 2021
f089e9a
fix tests
nikolagigic Mar 10, 2021
7292e37
fix tests
nikolagigic Mar 10, 2021
7e0d4d1
fix tests
nikolagigic Mar 10, 2021
b5f6244
fix tests
nikolagigic Mar 11, 2021
1e06266
fix tests
nikolagigic Mar 11, 2021
4336ae5
fix tests
nikolagigic Mar 12, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 49 additions & 11 deletions superset/db_engine_specs/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@
import sqlparse
from flask import g
from flask_babel import lazy_gettext as _
from sqlalchemy import column, DateTime, select
from sqlalchemy import column, DateTime, select, types
from sqlalchemy.engine.base import Engine
from sqlalchemy.engine.interfaces import Compiled, Dialect
from sqlalchemy.engine.reflection import Inspector
Expand All @@ -57,6 +57,7 @@
from superset.models.sql_lab import Query
from superset.sql_parse import ParsedQuery, Table
from superset.utils import core as utils
from superset.utils.core import ColumnSpec

if TYPE_CHECKING:
# prevent circular imports
Expand Down Expand Up @@ -145,7 +146,12 @@ class BaseEngineSpec: # pylint: disable=too-many-public-methods
_date_trunc_functions: Dict[str, str] = {}
_time_grain_expressions: Dict[Optional[str], str] = {}
column_type_mappings: Tuple[
Tuple[Pattern[str], Union[TypeEngine, Callable[[Match[str]], TypeEngine]]], ...,
Tuple[
Pattern[str],
Union[TypeEngine, Callable[[Match[str]], TypeEngine]],
utils.GenericDataType,
],
...,
] = ()
time_groupby_inline = False
limit_method = LimitMethod.FORCE_LIMIT
Expand Down Expand Up @@ -179,6 +185,13 @@ class BaseEngineSpec: # pylint: disable=too-many-public-methods
),
}

dttm_types = [
types.TIME,
types.TIMESTAMP,
types.TIMESTAMP(timezone=True),
types.Interval,
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this is needed anymore.


@classmethod
def get_dbapi_exception_mapping(cls) -> Dict[Type[Exception], Type[Exception]]:
"""
Expand Down Expand Up @@ -967,25 +980,27 @@ def make_label_compatible(cls, label: str) -> Union[str, quoted_name]:
return label_mutated

@classmethod
def get_sqla_column_type(cls, type_: Optional[str]) -> Optional[TypeEngine]:
def get_sqla_column_type(
cls, column_type: Optional[str]
) -> Tuple[Union[TypeEngine, utils.GenericDataType, None]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider just leaving the signature of this method unchanged for now; otherwise I'd just remove this one and start using the new get_column_spec method and implement that wherever get_sqla_column_type is being used, as that's the end state we want to aim for.

"""
Return a sqlalchemy native column type that corresponds to the column type
defined in the data source (return None to use default type inferred by
SQLAlchemy). Override `column_type_mappings` for specific needs
(see MSSQL for example of NCHAR/NVARCHAR handling).

:param type_: Column type returned by inspector
:param column_type: Column type returned by inspector
:return: SqlAlchemy column type
"""
if not type_:
return None
for regex, sqla_type in cls.column_type_mappings:
match = regex.match(type_)
if not column_type:
return None, None
for regex, sqla_type, generic_type in cls.column_type_mappings:
match = regex.match(column_type)
if match:
if callable(sqla_type):
return sqla_type(match)
return sqla_type
return None
return sqla_type(match), generic_type
return sqla_type, generic_type
return None, None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For cases where we want to return an empty value (=no match was made), I'd perhaps prefer returning a pure None instead of a Tuple with Nones.


@staticmethod
def _mutate_label(label: str) -> str:
Expand Down Expand Up @@ -1097,3 +1112,26 @@ def get_extra_params(database: "Database") -> Dict[str, Any]:
def is_readonly_query(cls, parsed_query: ParsedQuery) -> bool:
"""Pessimistic readonly, 100% sure statement won't mutate anything"""
return parsed_query.is_select() or parsed_query.is_explain()

def get_column_spec(
self,
column_name: Optional[str],
native_type: str,
source: utils.ColumnTypeSource = utils.ColumnTypeSource.GET_TABLE,
) -> utils.ColumnSpec:

column_type, generic_type = self.get_sqla_column_type(native_type)
is_dttm = generic_type == utils.GenericDataType.TEMPORAL

if column_name: # Further logic to be implemented
pass
if (
source == utils.ColumnTypeSource.CURSOR_DESCRIPION
): # Further logic to be implemented
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we can assume GET_TABLE and CURSOR_DESCRIPTION is handled with the same matching logic - this can later be refined for engines where this makes a difference. Also, we might consider keeping the base implementation as simple as possible, and leaving the more complex logic to the individual db engine specs.


column_spec = ColumnSpec(
sqla_type=column_type, generic_type=generic_type, is_dttm=is_dttm
)

return column_spec
12 changes: 10 additions & 2 deletions superset/db_engine_specs/mssql.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,8 +78,16 @@ def fetch_data(
return cls.pyodbc_rows_to_tuples(data)

column_type_mappings = (
(re.compile(r"^N((VAR)?CHAR|TEXT)", re.IGNORECASE), UnicodeText()),
(re.compile(r"^((VAR)?CHAR|TEXT|STRING)", re.IGNORECASE), String()),
(
re.compile(r"^N((VAR)?CHAR|TEXT)", re.IGNORECASE),
UnicodeText(),
utils.GenericDataType.STRING,
),
(
re.compile(r"^((VAR)?CHAR|TEXT|STRING)", re.IGNORECASE),
String(),
utils.GenericDataType.STRING,
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably design this so that an engine spec can extend the base mapping. In the case of MSSQL, I believe the base mapping is a good fallback. Also, we might consider incorporating these types into the base spec, as I assume fairly many engines support N-prefixed character types, and some of those engines might also benefit from the UnicodeText SQLA type over the regular String one.

)

@classmethod
Expand Down
92 changes: 92 additions & 0 deletions superset/db_engine_specs/postgres.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@
from typing import Any, Dict, List, Optional, Tuple, TYPE_CHECKING

from pytz import _FixedOffset # type: ignore
from sqlalchemy import types
from sqlalchemy.dialects.postgresql import DOUBLE_PRECISION
from sqlalchemy.dialects.postgresql.base import PGInspector

from superset.db_engine_specs.base import BaseEngineSpec
Expand All @@ -45,6 +47,96 @@ class PostgresBaseEngineSpec(BaseEngineSpec):
engine = ""
engine_name = "PostgreSQL"

column_type_mappings = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to consolidate the structure of db_column_types and column_type_mappings as they are basically doing very similar things (one to convert db column type to generic types; one to convert to SQLA types)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this is the ultimate objective. However, we want to leave the existing structures untouched until we're certain this doesn't break existing functionality. We'll be including other types in this mapping that are missing in db_column_types, like DbColumnType, PyArrow types etc to make sure this is a one-stop-shop for making available all necessary types in one place.

(
re.compile(r"^smallint", re.IGNORECASE),
types.SMALLINT,
utils.GenericDataType.NUMERIC,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we import GenericDataType directly and get rid of the utils. prefix? Just want the code the look cleaner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, is adding GenericDataType here really necessary, though? I'd imagine all SQLA types could be definitively mapped to a GenericDataType, wouldn't they? Maybe ColumnSpec can just have an instance @property to map all known SQLA types to GenericDataType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktmud Yes, it could've been mapped as you proposed. The reason behind the current way is that if we get multiple matches we can prioritise those at the start of the list without necessarily sorting the results for the most part.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the need to use a list to have priority for RegExp matching, but will there be a case where two SQLA types may be matched to different GenericDataType? If not, isn't GenericDataType already inferred by SQLA types?

),
(
re.compile(r"^integer", re.IGNORECASE),
types.INTEGER,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^bigint", re.IGNORECASE),
types.BIGINT,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^decimal", re.IGNORECASE),
types.DECIMAL,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^numeric", re.IGNORECASE),
types.NUMERIC,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^real", re.IGNORECASE),
types.REAL,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^double precision", re.IGNORECASE),
DOUBLE_PRECISION,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^smallserial", re.IGNORECASE),
types.SMALLINT,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^serial", re.IGNORECASE),
types.INTEGER,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^bigserial", re.IGNORECASE),
types.BIGINT,
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^varchar", re.IGNORECASE),
types.VARCHAR,
utils.GenericDataType.STRING,
),
(re.compile(r"^char", re.IGNORECASE), types.CHAR, utils.GenericDataType.STRING),
(re.compile(r"^text", re.IGNORECASE), types.TEXT, utils.GenericDataType.STRING),
(
re.compile(r"^date", re.IGNORECASE),
types.DATE,
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^time", re.IGNORECASE),
types.TIME,
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^timestamp", re.IGNORECASE),
types.TIMESTAMP,
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^timestamptz", re.IGNORECASE),
types.TIMESTAMP(timezone=True),
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^interval", re.IGNORECASE),
types.Interval,
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^boolean", re.IGNORECASE),
types.BOOLEAN,
utils.GenericDataType.BOOLEAN,
),
)

_time_grain_expressions = {
None: "{col}",
"PT1S": "DATE_TRUNC('second', {col})",
Expand Down
92 changes: 75 additions & 17 deletions superset/db_engine_specs/presto.py
Original file line number Diff line number Diff line change
Expand Up @@ -356,31 +356,89 @@ def _show_columns(
return columns

column_type_mappings = (
(re.compile(r"^boolean.*", re.IGNORECASE), types.Boolean()),
(re.compile(r"^tinyint.*", re.IGNORECASE), TinyInteger()),
(re.compile(r"^smallint.*", re.IGNORECASE), types.SmallInteger()),
(re.compile(r"^integer.*", re.IGNORECASE), types.Integer()),
(re.compile(r"^bigint.*", re.IGNORECASE), types.BigInteger()),
(re.compile(r"^real.*", re.IGNORECASE), types.Float()),
(re.compile(r"^double.*", re.IGNORECASE), types.Float()),
(re.compile(r"^decimal.*", re.IGNORECASE), types.DECIMAL()),
(
re.compile(r"^boolean.*", re.IGNORECASE),
types.Boolean(),
utils.GenericDataType.BOOLEAN,
),
(
re.compile(r"^tinyint.*", re.IGNORECASE),
TinyInteger(),
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^smallint.*", re.IGNORECASE),
types.SmallInteger(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are types.SMALLINT and types.SmallInteger() interchangeable? If yes, can we stick to just one of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to keep it consistent

utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^integer.*", re.IGNORECASE),
types.Integer(),
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^bigint.*", re.IGNORECASE),
types.BigInteger(),
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^real.*", re.IGNORECASE),
types.Float(),
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^double.*", re.IGNORECASE),
types.Float(),
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^decimal.*", re.IGNORECASE),
types.DECIMAL(),
utils.GenericDataType.NUMERIC,
),
(
re.compile(r"^varchar(\((\d+)\))*$", re.IGNORECASE),
lambda match: types.VARCHAR(int(match[2])) if match[2] else types.String(),
utils.GenericDataType.STRING,
),
(
re.compile(r"^char(\((\d+)\))*$", re.IGNORECASE),
lambda match: types.CHAR(int(match[2])) if match[2] else types.CHAR(),
utils.GenericDataType.STRING,
),
(
re.compile(r"^varbinary.*", re.IGNORECASE),
types.VARBINARY(),
utils.GenericDataType.STRING,
),
(
re.compile(r"^json.*", re.IGNORECASE),
types.JSON(),
utils.GenericDataType.JSON,
),
(
re.compile(r"^date.*", re.IGNORECASE),
types.DATE(),
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^timestamp.*", re.IGNORECASE),
types.TIMESTAMP(),
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^time.*", re.IGNORECASE),
types.Time(),
utils.GenericDataType.TEMPORAL,
),
(
re.compile(r"^interval.*", re.IGNORECASE),
Interval(),
utils.GenericDataType.TEMPORAL,
),
(re.compile(r"^varbinary.*", re.IGNORECASE), types.VARBINARY()),
(re.compile(r"^json.*", re.IGNORECASE), types.JSON()),
(re.compile(r"^date.*", re.IGNORECASE), types.DATE()),
(re.compile(r"^timestamp.*", re.IGNORECASE), types.TIMESTAMP()),
(re.compile(r"^time.*", re.IGNORECASE), types.Time()),
(re.compile(r"^interval.*", re.IGNORECASE), Interval()),
(re.compile(r"^array.*", re.IGNORECASE), Array()),
(re.compile(r"^map.*", re.IGNORECASE), Map()),
(re.compile(r"^row.*", re.IGNORECASE), Row()),
(re.compile(r"^array.*", re.IGNORECASE), Array(), utils.GenericDataType.ARRAY),
(re.compile(r"^map.*", re.IGNORECASE), Map(), utils.GenericDataType.MAP),
(re.compile(r"^row.*", re.IGNORECASE), Row(), utils.GenericDataType.ROW),
)

@classmethod
Expand Down
19 changes: 18 additions & 1 deletion superset/utils/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@
from sqlalchemy.engine import Connection, Engine
from sqlalchemy.engine.reflection import Inspector
from sqlalchemy.sql.type_api import Variant
from sqlalchemy.types import TEXT, TypeDecorator
from sqlalchemy.types import TEXT, TypeDecorator, TypeEngine
from typing_extensions import TypedDict

import _thread # pylint: disable=C0411
Expand Down Expand Up @@ -148,6 +148,10 @@ class GenericDataType(IntEnum):
STRING = 1
TEMPORAL = 2
BOOLEAN = 3
ARRAY = 4
JSON = 5
MAP = 6
ROW = 7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure this logic is in synced with superset-ui/core - for now it might be a good idea to just map all complex types to STRING as has previously been done, and later introduce more complex generic types once we add proper support for them.



class ChartDataResultFormat(str, Enum):
Expand Down Expand Up @@ -306,6 +310,19 @@ class TemporalType(str, Enum):
TIMESTAMP = "TIMESTAMP"


class ColumnTypeSource(Enum):
GET_TABLE = 1
CURSOR_DESCRIPION = 2


class ColumnSpec(NamedTuple):
sqla_type: Union[TypeEngine, str]
generic_type: GenericDataType
is_dttm: bool
normalized_column_name: Optional[str] = None
python_date_format: Optional[str] = None


try:
# Having might not have been imported.
class DimSelector(Having):
Expand Down