Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type for heterogeneous dictionaries with string keys #28

Closed
JukkaL opened this issue Nov 21, 2014 · 56 comments
Closed

Type for heterogeneous dictionaries with string keys #28

JukkaL opened this issue Nov 21, 2014 · 56 comments

Comments

@JukkaL
Copy link
Contributor

JukkaL commented Nov 21, 2014

I've recently been reading Python code where heterogeneous dictionary objects are used a lot. I mean cases like this:

foo({'x': 1, 'y': 'z'})

The value for key 'x' must be an integer and the value for key 'z' must be a string. Currently there is no precise way of specifying the type of the argument to foo in the above example. In general, we'd have to fall back to Dict[str, Any], Mapping[str, Union[int, str]] or similar. This loses a lot of information.

However, we could support such types. Here is a potential syntax:

def foo(arg: Dict[{'x': int, 'y': str}]) -> ...: ...

Of course, we could also allow Dict[dict(x=int, y=str)] as an equivalent. I don't really love either syntax, though.

Alternatively, we could omit Dict[...] as redundant:

def f(arg: dict(x=int, y=str)) -> ...

Using type aliases would often be preferred:

ArgType = Dict[{'x': int, 'y': str}]

def f(arg: ArgType) -> ...

These types would use structural subtyping, and missing keys could plausibly be okay. So Dict[dict(x=int, y=str)] could be a subtype of Dict[dict(x=int)], and vice versa (!).

Maybe there should also be a way of deriving subtypes of heterogeneous dictionary types (similar to inheritance) to avoid repetition.

Maybe we'd also want to support Mapping[...] variants (for read-only access and covariance).

Some existing languages have types resembling these (at least Hack and TypeScript, I think).

@JukkaL
Copy link
Contributor Author

JukkaL commented Apr 14, 2015

Even though something like this might be useful, after discussing this with several people it seems that everybody agrees that this should be left out from the PEP, but this is a potential feature to add in a later Python release (after experimenting with an actual prototype implementation, etc.).

@markshannon
Copy link
Member

We (@JukkaL, @vlasovskikh and I) agreed that this is best left until 3.6, or dropped all together.

@gvanrossum
Copy link
Member

gvanrossum commented Sep 25, 2016

Riffing on the right syntax, I wonder if the way to spell such types shouldn't look more like NamedTuple than like Dict. How about:

Foo = Struct('Foo', x=int, y=Optional[List[str]])

Using PEP 526 we could write it like this:

class Foo(Struct):
    x: int
    y: Optional[List[str]]

but that syntax remains a pipe dream for most of us (since it only works in Python 3.6).

Also I propose not to bother with initial values and not to express in the type whether a given key is mandatory or can be left out. (Note that Optional means it can be None, not that it can be left out.)

@davidfstr
Copy link
Contributor

I was going to mention a NamedTuple-style syntax myself, since the original semantics I had in mind were nearly equivalent to NamedTuple:

  • all keys are required
  • accesses must be performed with a string literal
  • accesses using a non-literal are treated the same by the typechecker as a getattr() or setattr() applied to a NamedTuple

These semantics are more restrictive than those originally proposed by @JukkaL.

Maybe there should also be a way of deriving subtypes of heterogeneous dictionary types (similar to inheritance) to avoid repetition.

In case it's a useful data point, my current Django codebase wouldn't make use of such subtypes.

@davidfstr
Copy link
Contributor

davidfstr commented Sep 25, 2016

Consider:

Point = DictStruct(x=int, y=int)
point = dict(x=42, y=1337)  # type: Point
  • The type declaration and value definition look similar and in alignment.
  • There's no need to repeat the type name as the first argument to DictStruct since the type name is not actually needed at runtime.

Sometimes code defines the actual dict value using {}, so perhaps the following might also be useful to support in case an existing codebase uses {} extensively.

Point = DictStruct({'x': int, 'y': int})
point = {'x': int, 'y': int}

Edit: Added secondary syntax that uses {}.

@gvanrossum
Copy link
Member

gvanrossum commented Sep 25, 2016 via email

@davidfstr
Copy link
Contributor

That's true. If the type object were to remain available at runtime (as NamedTuple, TypeVar, and NewType do) that would make perfect sense. (In a previous iteration of my thinking, the DictStruct would actually erase to a plain dict at runtime. But upon reflection I don't see any good reason for that erasure.)

So I'm onboard with providing the type name as a constructor argument.

@gvanrossum
Copy link
Member

gvanrossum commented Sep 25, 2016 via email

@davidfstr
Copy link
Contributor

Yes indeed. I've spent a good chunk of today on reverse-engineering how mypy works in general and how it handles NamedTuple. Once I have a more detailed implementation proposal, I intend to bring it up in python/mypy#985 since this would be a mypy implementation proposal rather than a typing proposal per-se.

@gvanrossum
Copy link
Member

@JukkaL and I spent some time offline brainstorming about this. Jukka has a lot of additional ideas.

  • Anonymous Structs, e.g. def f(d: Struct(x=int)) -> None: ...
  • A way to infer a Struct instead of a regular Dict for a dictionary expression, e.g. x = {'x': 1, 'y': 'foo'} # type: Struct might infer Struct(x=int, y=str) as the type for x instead of Dict[str, object] (that's what it currently does and I think it ought to remain the default).
  • From these two we infer that Struct type checking should be structural, i.e. two Structs defined with the same keys and value types ought to be considered equivalent.
  • People will want to subclass their Structs, either to define other structs with additional fields or to add methods (we've seen this for NamedTuple too, the latter is already possible).
  • What to do about heterogeneous **kwds? Some people will want to use Structs here. It's tricky though, since the current syntax assumes the dict is homogeneous. Also, people really should use "lone star" syntax if they have keyword-only arguments, e.g. def f(self, *, x: int) -> None: ... -- but this syntax only works in Python 3.
  • Sometimes you have a dict that's mostly homogeneous except for a few keys. Maybe it should be possible to give a Struct a default type for additional keys. (It's tricky though, unless the key type is different.)
  • Some other rules we may borrow from NamedTuple, e.g. no partially defined Structs. (@JukkaL: I'm not sure what you meant by that? That all keys should be present? I think I've seen code that checks whether a given key is present, but if it is, assumes it has a certain type.)
  • JSON is probably one of the main use cases.
  • Various automatically inferred uses of this could be supported by the conditional type binder in mypy (hmm, that's probably too mypy-specific for this tracker).
  • There's some potential confusion with the struct module, which defines struct.Struct quite differently (of course the etymology is the same, from C structs).

@ilevkivskyi
Copy link
Member

@gvanrossum

but that syntax remains a pipe dream for most of us (since it only works in Python 3.6)

There is no need to chose. Struct could be implemented to support both syntax: the 3.6+ one and the backward-compatible one. This could be done the same way it is done for NamedTuple -- the same code is called on instantiation and subclassing.

@gvanrossum
Copy link
Member

There is no need to cho[o]se.

Agreed, but the pre-3.6 syntax needs to be reasonable since that's what almost everybody will see for the forseeable future.

@JukkaL
Copy link
Contributor Author

JukkaL commented Sep 26, 2016

By partially defined structs, I meant struct with some missing keys. We have at least these options:

  1. Make all keys required. If a key is missing, a type checker will have to be shut up explicitly, such as by using a cast. Checking whether a key exists using in is still possible but it won't be enforced by the type checker. For example, you could write x = cast(Struct(x=int), {}), but x = {} # type: Struct(x=int) would be an error.
  2. Make all keys optional (this is different from Optional[t]). Thus {} would be compatible with all struct types. Some legitimate errors would be silently ignored by type checkers.
  3. Allow optionality to be specified individually for keys. For example, something like Struct(x=int, y=Something[str]), where Something specifies an optional key. We can't use Optional, as it already means something quite different. Not sure what would be a good name.
  4. Like (3), but the default would be optional, and required keys would have to marked explicitly.

@JukkaL
Copy link
Contributor Author

JukkaL commented Sep 26, 2016

Structural subtyping would likely also imply that dict(x=1, y='a') is compatible with Struct(x=int).

Using this to support adding new keys and automatically inferring a new type is problematic:

def f(x: Struct(x=int)) -> None:
    x['y'] = 2  # Is this okay? Would we infer type Struct(x=int, y=int) from this point on?

d = dict(x=1, y='a')  # type: Struct
f(d)  # Probably okay?
d[y] + 'b'  # Runtime error!

The latter example might be beyond the scope of PEP 484, but it's perhaps worth at least considering here. Maybe struct 'extension' as in f() above would require a cast to make it explicit that it is potentially unsafe. For example:

def f(x: Struct(x=int)) -> None:
    x = cast(Struct(x=int, y=int), x)
    x['y'] = 2  # Now okay

@ilevkivskyi
Copy link
Member

ilevkivskyi commented Sep 26, 2016

@gvanrossum

Agreed, but the pre-3.6 syntax needs to be reasonable since that's what almost everybody will see for the forseeable future.

In addition to the proposed syntax, I think something even more similar to NamedTuple could be considered:

Foo = Struct('Foo', [('x', Optional[List[str]]), ('y', int)])

Although it is a bit verbose, new features could be easily added, e.g. default values:

Foo = Struct('Foo', [('x', Optional[List[str]]), ('y', int, 42)])

class Foo(Struct):
    x: Optional[List[str]]
    y: int = 42

Then also specifying a default value to something special (ellipsis ...) means that this key could be omitted, etc.

@gvanrossum
Copy link
Member

gvanrossum commented Sep 26, 2016 via email

@ilevkivskyi
Copy link
Member

I am not sure. On one hand the Struct('A', x=int) looks cleaner, on other hand the NamedTuple-like syntax is more flexible. Having both would cover more situations, but there should be one right way to do it. Taking into account that we already have this syntax for NamedTuple I would probably say "instead of".

@gvanrossum
Copy link
Member

gvanrossum commented Sep 26, 2016 via email

@ilevkivskyi
Copy link
Member

It is more flexible in the sense that we can accept 3-tuples (for example with default values) in the list of fields. Such option could be added later as a minimal change. While I don't see how this could be done with the dict-like syntax.

@gvanrossum
Copy link
Member

gvanrossum commented Sep 26, 2016 via email

@JukkaL
Copy link
Contributor Author

JukkaL commented Sep 27, 2016

Default values don't seem to fit in seamlessly with our philosophy, as I don't see how they would be useful without a runtime effect, and elsewhere we try to avoid type annotations having a runtime effect. Removing typing-related stuff from a program generally shouldn't affect behavior.

@gvanrossum
Copy link
Member

gvanrossum commented Sep 27, 2016 via email

@ilevkivskyi
Copy link
Member

OK, then I agree that we should go with Struct('A', x=int) syntax. Concerning the partially defined Struct I think the 3rd option proposed by @JukkaL is convenient while still safe. One could write:

A = Struct('A', x=int, y=Facultative[int])
d = {'x': 1}  # type: A # OK
if 'y' in d:
    d['y'] += 1  # Also OK

@davidfstr
Copy link
Contributor

davidfstr commented Oct 17, 2016

class Point2D(TypedDict):
    x: int
    y: int

That a nice syntax with variable annotations. One slight difficulty is
that making the TypedDict type constructor a superclass is that it would
imply that TypedDict has a runtime presence, when it is intended as a
zero-cost abstraction over dictionaries. This is difference between it
and NamedTuple.

Here is a similar syntax that could be implemented with zero-cost semantics:

@TypedDict
class Point2D:
    x: int
    y: int

The preceding would be equivalent to Point2D = TypedDict('Point2D', dict(x: int, y: int)) (and ultimately Point2D = dict) at runtime.


Concerning subclassing, I think that |TypedDict| should be subclassable,
in analogy with |NamedTuple|.

I could potentially see type-instances of TypedDict (ex: Point2D) being
subclassable in the same way that dict is subclassable. Under those
circumstances the subclass would in fact attain a runtime presence.

However with a runtime presence you lose the advantages of being able to
do zero-cost casts to the type, since that is no longer possible. The
only advantage you retain is that a type checker will recognize certain
keys. With only those advantages you might as well use a NamedTuple or a
subclass thereof.

@gvanrossum
Copy link
Member

gvanrossum commented Oct 17, 2016 via email

@davidfstr
Copy link
Contributor

davidfstr commented Oct 17, 2016

maybe it's possible to have
the class structure as suggested but make instantiations just return plain
dicts?

I could see that.


It occurs to me that my "Should TypedDict support subtyping?" question above was ambiguous.

Originally I was only considering the question of structural subtyping. That is, whether or not the following would be accepted:

Point2D = TypedDict('Point2D', dict(x=int, y=int))
Point3D = TypedDict('Point2D', dict(x=int, y=int, z=int))

def narrow(p: Point3D) -> Point2D:
    return p

Again, prior discussion suggests yes.

But here are a lot of other crazy ideas that also come to mind in the realm of "subtyping":

(1) Extending with an "extends" keyword to TypedDict

Point1D = TypedDict('Point1D', dict(
    x=int
))
Point2D = TypedDict('Point2D', extends=Point1D, fields=dict(
    y=int
))
Point3D = TypedDict('Point3D', extends=Point2D, fields=dict(
    z=int
))

(2) Extending with class-based syntax

class Point1D(TypedDict):
    x: int
class Point2D(Point1D):
    y: int
class Point3D(Point2D):
    z: int

Although the preceding syntaxes have the feel of nominal subtyping, it is just syntactic suger for fully spelling out all fields. The type system would still use the more general structural subtyping when checking type compatibility.

@ilevkivskyi
Copy link
Member

@davidfstr

it is intended as a zero-cost abstraction over dictionaries

As Guido already mentioned, there are various kinds of manipulations with __new__ all having different speed and possibilities. Probably the fastest one would be something similar to NewType: metaclass's __new__ will return a function that returns its argument:

class TDMeta(type):
    def __new__(cls, name, bases, ns, *, _root=False):
        if _root:
            return super().__new__(cls, name, bases, ns)
        return lambda x: x
class TypedDict(metaclass=TDMeta, _root=True): ...

class Point2D(TypedDict):
    x: int
    y: int

assert Point2D({'x': 1, 'y': 2}) == {'x': 1, 'y': 2}

However, one will be not able to subclass this (as one cannot subclass NewType). I would prefer exactly what Guido proposes, it will be a bit slower, but will support your example (2).

@gvanrossum
Copy link
Member

(Also see the next stage of the implementation, WIP here: python/mypy#2342)

@gvanrossum
Copy link
Member

In regards to subclassing, I don't like the idea of an "extends" keyword or a class decorator, but I think we could rig a metaclass so that at runtime, after

class Point1D(TypedDict):
    x: int
class Point2D(Point1D):
    y: int

both Point1D and Point2D are just aliases for dict, but mypy can still type-check code that uses these. However it should still use structural checking, so that Point2D would be exactly equivalent (both at runtime and in the type checker) as this definition:

class Point2D(TypedDict):
    x: int
    y: int

The type checker should reject isinstance(x, cls) calls using a TypedDict subclass, since at runtime those would all be equivalent to isinstance(x, dict)...

@davidfstr
Copy link
Contributor

The type checker should reject isinstance(x, cls) calls using a TypedDict subclass, since at runtime those would all be equivalent to isinstance(x, dict)...

+1

@shoyer
Copy link

shoyer commented Dec 12, 2017

Would a generic version of TypedDict be feasible? There are some strong use cases for this for scientific computing / data science applications:

  • NumPy has structured data types, where each array element is struct (xref https://github.com/numpy/numpy_stubs/issues/7).
  • pandas.DataFrame probably the most popular data structure when using Python for data science. Essentially, it is a dictionary where indexing returns a vector of all entries in a column.

In both cases, it is quite common to define ad-hoc "types" in applications analogous to TypedDict, e.g., a "user dataframe" which is defined to have a fixed set of column names with particular data types. Type checking would be quite valuable for checking such code.

@JukkaL
Copy link
Contributor Author

JukkaL commented Dec 12, 2017

@shoyer Can you give a few concrete code examples of where this could be useful?

@ilevkivskyi
Copy link
Member

@shoyer Note that you can already have generic protocols. Together with literal types (that are proposed in another issue) you can just overload __getitem__. For example:

class MyFrame(Protocol[T]):
    @overload
    def __getitem__(self, item: Literal['name']) -> str:
    @overload
    def __getitem__(self, item: Literal['value']) -> T:

or similar. In principle TypedDict (as well as NamedTuple) can be made generic and there are several requests for NamedTuple, there are however not so many for TypedDict.

@shoyer
Copy link

shoyer commented Dec 13, 2017

The use cases for TypedDataFrame (my hypothetical TypedDict/pandas.DataFrame combination) line up very closely with TypedDict itself. Basically, any time you would use an ad-hoc record type, but need performance or interoperability with the Python for data stack.

To build off the example in mypy's docs, you would use TypedDataFrame if you wanted to build an application to analyze a database of movies. Various functions could create Movies (e.g., from a CSV file or relational database) and other function could transform (e.g., by filtering entries) or consume them (e.g., to compute statistics or make plots).

As is the case for TypedDict, most of these use cases would also work fine with a dataclass or namedtuple (in this case, where the entries are 1-dimensional arrays), but there are advantages to standardizing on common types and APIs, and using types that can be defined dynamically when desired. In the PyData ecosystem, pandas.DataFrame fills a similar niche to dict: if you want to pass around an adhoc namespace (of data), it's the idiomatic way to do it.

@ilevkivskyi Yes, I suppose protocols with literals would work for some use cases, but that wouldn't be a very satisfying solution. There are a long list of methods beyond indexing that only take column names found in the DataFrame as valid entries, e.g., to group by a column, plot a column, set an index on a column, data, rename columns, etc.

I only have a vague idea what support for writing custom key-value types would look like, but perhaps it would pay dividends, because in some sense this is a generalized version of typing for TypeDict, NamedTuple and dataclasses. There would need to be some way of defining paired key/value TypeVar instances, and then you could define methods in any desired fashion, e.g., perhaps

K = TypeVar('K')
V = TypeVar('V')

class TypedDict(Enumerated[K, V], Dict[str, Any]):
    def __getitem__(self, key: K) -> V: ...

class NamedTuple(Enumerated[K, V], namedtuple):
    def __getattr__(self, name: K) -> V: ...

(Feel free to declare this out of scope for now or push it to another issue -- I don't want to pull TypedDict off track!)

@JukkaL
Copy link
Contributor Author

JukkaL commented Dec 13, 2017

@shoyer Generalizing TypedDict in the way you are suggesting seems out of scope, unfortunately. TypedDict and NamedTuple are currently very special-purpose constructs, and I don't expect this to change. However, perhaps it's possible to special case pandas.DataFrame in a similar way? Experimenting with that could happen as a mypy plugin, for example. This still wouldn't be trivial, and the current mypy plugin system would have to be extended.

@ilevkivskyi
Copy link
Member

@shoyer I agree with Jukka here. This should be done via a plugin to mypy. Note that there is a PR python/mypy#4328 that extends the current plugin system (to allow special casing of decorators/base classes/metaclasses). With this new plugin system, a user will be able to write something like this (very approximately):

class MyTable(pandas.DataFrame):
    id: int
    name: str

table: MyTable
table.id  # OK, inferred type is 'array[int]'
table['name']  # Also OK, inferred type is 'array[str]'

Currently, the author of the mentioned PR is also working on the plugin for attrs, so you can keep an eye on this.

@DrPyser
Copy link

DrPyser commented Sep 20, 2018

Hi. Anybody knows what's happening with this(TypedDict/DictStruct)? Is it ever going to be in the standard library? Looking forward to this feature...
Thanks!

@ilevkivskyi
Copy link
Member

@DrPyser it is supported by mypy for a year and half, but it leaves in mypy_extensions, until there is a PEP. After it is accepted TypedDict can move to typing.

@JukkaL
Copy link
Contributor Author

JukkaL commented Sep 20, 2018

Maybe we should close this issue now? We can create follow-up issues about remaining work that isn't covered by other issues.

@ilevkivskyi
Copy link
Member

OK, let us close this. I think we have issues about missing features on mypy tracker and the pandas question is unrelated (btw @shoyer the plugin hook I mentioned was added to mypy and attrs successfully uses it, I think we can write a similar plugin for data frames as I described above in #28 (comment)).

@davidfstr
Copy link
Contributor

Hi @DrPyser, I was originally providing a lot of the organizational energy around the original TypedDict design and implementation but my life has gotten super busy over the past year or so.

I've been out of the loop long enough that I don't know if anyone else has stepped up to lead the charge on polishing TypedDict to a state that's solid enough to standardize. If not, that would be a valuable role for someone to take on that cares and has time.

@iddan
Copy link

iddan commented Oct 8, 2018

Is there a plan for some required fields on TypedDict? Is there already a dedicated issue?

@gvanrossum
Copy link
Member

@iddan

Is there a plan for some required fields on TypedDict? Is there already a dedicated issue?

We did give this some thought and decided to punt on it. Maybe a volunteer can implement this. The main problem is how to spell it -- we currently have the total flag that can be set to require all fields or cleared to require none. The problem with requiring some but not all fields is that you have to invent a spelling for it. And you can't reuse Optional because that already has a meaning (the field is required but the value may be None).

If you're interested in pursuing this idea I recommend filing a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants