- Feature Name:
union
- Start Date: 2015-12-29
- RFC PR: rust-lang/rfcs#1444
- Rust Issue: rust-lang/rust#32836
Provide native support for C-compatible unions, defined via a new "contextual
keyword" union
, without breaking any existing code that uses union
as an
identifier.
Note: This RFC has been partially superseded by unions-and-drop
.
Many FFI interfaces include unions. Rust does not currently have any native
representation for unions, so users of these FFI interfaces must define
multiple structs and transmute between them via std::mem::transmute
. The
resulting FFI code must carefully understand platform-specific size and
alignment requirements for structure fields. Such code has little in common
with how a C client would invoke the same interfaces.
Introducing native syntax for unions makes many FFI interfaces much simpler and less error-prone to write, simplifying the creation of bindings to native libraries, and enriching the Rust/Cargo ecosystem.
A native union mechanism would also simplify Rust implementations of space-efficient or cache-efficient structures relying on value representation, such as machine-word-sized unions using the least-significant bits of aligned pointers to distinguish cases.
The syntax proposed here recognizes union
as though it were a keyword when
used to introduce a union declaration, without breaking any existing code
that uses union
as an identifier. Experiments by Niko Matsakis demonstrate
that recognizing union
in this manner works unambiguously with zero conflicts
in the Rust grammar.
To preserve memory safety, accesses to union fields may only occur in unsafe code. Commonly, code using unions will provide safe wrappers around unsafe union field accesses.
A union declaration uses the same field declaration syntax as a struct
declaration, except with union
in place of struct
.
union MyUnion {
f1: u32,
f2: f32,
}
By default, a union uses an unspecified binary layout. A union declared with
the #[repr(C)]
attribute will have the same layout as an equivalent C union.
A union must have at least one field; an empty union declaration produces a syntax error.
Rust normally prevents the use of a keyword as an identifier; for instance, a
declaration fn struct() {}
will produce an error "expected identifier, found
keyword struct
". However, to avoid breaking existing declarations that use
union
as an identifier, Rust will only recognize union
as a keyword when
used to introduce a union declaration. A declaration fn union() {}
will not
produce such an error.
A union instantiation uses the same syntax as a struct instantiation, except that it must specify exactly one field:
let u = MyUnion { f1: 1 };
Specifying multiple fields in a union instantiation results in a compiler error.
Safe code may instantiate a union, as no unsafe behavior can occur until accessing a field of the union. Code that wishes to maintain invariants about the union fields should make the union fields private and provide public functions that maintain the invariants.
Unsafe code may read from union fields, using the same dotted syntax as a struct:
fn f(u: MyUnion) -> f32 {
unsafe { u.f2 }
}
Unsafe code may write to fields in a mutable union, using the same syntax as a struct:
fn f(u: &mut MyUnion) {
unsafe {
u.f1 = 2;
}
}
If a union contains multiple fields of different sizes, assigning to a field smaller than the entire union must not change the memory of the union outside that field.
Union fields will normally not implement Drop
, and by default, declaring a
union with a field type that implements Drop
will produce a lint warning.
Assigning to a field with a type that implements Drop
will call drop()
on
the previous value of that field. This matches the behavior of struct
fields
that implement Drop
. To avoid this, such as if interpreting the union's
value via that field and dropping it would produce incorrect behavior, Rust
code can assign to the entire union instead of the field. A union does not
implicitly implement Drop
even if its field types do.
The lint warning produced when declaring a union field of a type that
implements Drop
should document this caveat in its explanatory text.
Unsafe code may pattern match on union fields, using the same syntax as a
struct, without the requirement to mention every field of the union in a match
or use ..
:
fn f(u: MyUnion) {
unsafe {
match u {
MyUnion { f1: 10 } => { println!("ten"); }
MyUnion { f2 } => { println!("{}", f2); }
}
}
}
Matching a specific value from a union field makes a refutable pattern; naming a union field without matching a specific value makes an irrefutable pattern. Both require unsafe code.
Pattern matching may match a union as a field of a larger structure. In particular, when using a Rust union to implement a C tagged union via FFI, this allows matching on the tag and the corresponding field simultaneously:
#[repr(u32)]
enum Tag { I, F }
#[repr(C)]
union U {
i: i32,
f: f32,
}
#[repr(C)]
struct Value {
tag: Tag,
u: U,
}
fn is_zero(v: Value) -> bool {
unsafe {
match v {
Value { tag: I, u: U { i: 0 } } => true,
Value { tag: F, u: U { f: 0.0 } } => true,
_ => false,
}
}
}
Note that a pattern match on a union field that has a smaller size than the
entire union must not make any assumptions about the value of the union's
memory outside that field. For example, if a union contains a u8
and a
u32
, matching on the u8
may not perform a u32
-sized comparison over the
entire union.
Unsafe code may borrow a reference to a field of a union; doing so borrows the entire union, such that any borrow conflicting with a borrow of the union (including a borrow of another union field or a borrow of a structure containing the union) will produce an error.
union U {
f1: u32,
f2: f32,
}
#[test]
fn test() {
let mut u = U { f1: 1 };
unsafe {
let b1 = &mut u.f1;
// let b2 = &mut u.f2; // This would produce an error
*b1 = 5;
}
assert_eq!(unsafe { u.f1 }, 5);
}
Simultaneous borrows of multiple fields of a struct contained within a union do not conflict:
struct S {
x: u32,
y: u32,
}
union U {
s: S,
both: u64,
}
#[test]
fn test() {
let mut u = U { s: S { x: 1, y: 2 } };
unsafe {
let bx = &mut u.s.x;
// let bboth = &mut u.both; // This would fail
let by = &mut u.s.y;
*bx = 5;
*by = 10;
}
assert_eq!(unsafe { u.s.x }, 5);
assert_eq!(unsafe { u.s.y }, 10);
}
The pub
keyword works on the union and on its fields, as with a struct. The
union and its fields default to private. Using a private field in a union
instantiation, field access, or pattern match produces an error.
The compiler should consider a union uninitialized if declared without an initializer. However, providing a field during instantiation, or assigning to a field, should cause the compiler to treat the entire union as initialized.
A union may have trait implementations, using the same impl
syntax as a
struct.
The compiler should provide a lint if a union field has a type that implements
the Drop
trait. The explanation for that lint should include an explanation
of the caveat documented in the section "Writing fields". The compiler should
allow disabling that lint with #[allow(union_field_drop)]
, for code that
intentionally stores a type with Drop in a union. The compiler must never
implicitly generate a Drop implementation for the union itself, though Rust
code may explicitly implement Drop for a union type.
A union may have a generic type, with one or more type parameters or lifetime parameters. As with a generic enum, the types within the union must make use of all the parameters; however, not all fields within the union must use all parameters.
Type inference works on generic union types. In some cases, the compiler may not have enough information to infer the parameters of a generic type, and may require explicitly specifying them.
Rust code must not use unions to invoke undefined behavior. In particular, Rust code must not use unions to break the pointer aliasing rules with raw pointers, or access a field containing a primitive type with an invalid value.
In addition, since a union declared without #[repr(C)]
uses an unspecified
binary layout, code reading fields of such a union or pattern-matching such a
union must not read from a field other than the one written to. This includes
pattern-matching a specific value in a union field.
A union declared with #[repr(C)]
must have the same size and alignment as an
equivalent C union declaration for the target platform. Typically, a union
would have the maximum size of any of its fields, and the maximum alignment of
any of its fields. Note that those maximums may come from different fields;
for instance:
#[repr(C)]
union U {
f1: u16,
f2: [u8; 4],
}
#[test]
fn test() {
assert_eq!(std::mem::size_of<U>(), 4);
assert_eq!(std::mem::align_of<U>(), 2);
}
Adding a new type of data structure would increase the complexity of the language and the compiler implementation, albeit marginally. However, this change seems likely to provide a net reduction in the quantity and complexity of unsafe code.
Proposals for unions in Rust have a substantial history, with many variants and
alternatives prior to the syntax proposed here with a union
pseudo-keyword.
Thanks to many people in the Rust community for helping to refine this RFC.
The most obvious path to introducing unions in Rust would introduce union
as
a new keyword. However, any introduction of a new keyword will necessarily
break some code that previously compiled, such as code using the keyword as an
identifier. Making union
a keyword in the standard way would break the
substantial volume of existing Rust code using union
for other purposes,
including multiple functions in the standard
library. The approach proposed
here, recognizing union
to introduce a union declaration without prohibiting
union
as an identifier, provides the most natural declaration syntax and
avoids breaking any existing code.
Proposals for unions in Rust have extensively explored possible variations on
declaration syntax, including longer keywords (untagged_union
), built-in
syntax macros (union!
), compound keywords (unsafe union
), pragmas
(#[repr(union)] struct
), and combinations of existing keywords (unsafe enum
).
In the absence of a new keyword, since unions represent unsafe, untagged sum
types, and enum represents safe, tagged sum types, Rust could base unions on
enum instead. The unsafe enum
proposal took this approach, introducing unsafe, untagged enums, identified
with unsafe enum
; further discussion around that proposal led to the
suggestion of extending it with struct-like field access syntax. Such a
proposal would similarly eliminate explicit use of std::mem::transmute
, and
avoid the need to handle platform-specific size and alignment requirements for
fields.
The standard pattern-matching syntax of enums would make field accesses
significantly more verbose than struct-like syntax, and in particular would
typically require more code inside unsafe blocks. Adding struct-like field
access syntax would avoid that; however, pairing an enum-like definition with
struct-like usage seems confusing for developers. A declaration using enum
leads users to expect enum-like syntax; a new construct distinct from both
enum
and struct
avoids leading users to expect any particular syntax or
semantics. Furthermore, developers used to C unions will expect struct-like
field access for unions.
Since this proposal uses struct-like syntax for declaration, initialization,
pattern matching, and field access, the original version of this RFC used a
pragma modifying the struct
keyword: #[repr(union)] struct
. However, while
the proposed unions match struct syntax, they do not share the semantics of
struct; most notably, unions represent a sum type, while structs represent a
product type. The new construct union
avoids the semantics attached to
existing keywords.
In the absence of any native support for unions, developers of existing Rust code have resorted to either complex platform-specific transmute code, or complex union-definition macros. In the latter case, such macros make field accesses and pattern matching look more cumbersome and less structure-like, and still require detailed platform-specific knowledge of structure layout and field sizes. The implementation and use of such macros provides strong motivation to seek a better solution, and indeed existing writers and users of such macros have specifically requested native syntax in Rust.
Finally, to call more attention to reads and writes of union fields, field
access could use a new access operator, rather than the same .
operator used
for struct fields. This would make union fields more obvious at the time of
access, rather than making them look syntactically identical to struct fields
despite the semantic difference in storage representation. However, this does
not seem worth the additional syntactic complexity and divergence from other
languages. Union field accesses already require unsafe blocks, which calls
attention to them. Calls to unsafe functions use the same syntax as calls to
safe functions.
Much discussion in the tracking issue for unions debated whether assigning to a union field that implements Drop should drop the previous value of the field. This produces potentially surprising behavior if that field doesn't currently contain a valid value of that type. However, that behavior maintains consistency with assignments to struct fields and mutable variables, which writers of unsafe code must already take into account; the alternative would add an additional special case for writers of unsafe code. This does provide further motivation for the lint for union fields implementing Drop; code that explicitly overrides that lint will need to take this into account.
Can the borrow checker support the rule that "simultaneous borrows of multiple fields of a struct contained within a union do not conflict"? If not, omitting that rule would only marginally increase the verbosity of such code, by requiring an explicit borrow of the entire struct first.
Can a pattern match match multiple fields of a union at once? For rationale, consider a union using the low bits of an aligned pointer as a tag; a pattern match may match the tag using one field and a value identified by that tag using another field. However, if this complicates the implementation, omitting it would not significantly complicate code using unions.
C APIs using unions often also make use of anonymous unions and anonymous structs. For instance, a union may contain anonymous structs to define non-overlapping fields, and a struct may contain an anonymous union to define overlapping fields. This RFC does not define anonymous unions or structs, but a subsequent RFC may wish to do so.
- This RFC was amended in #1663
to clarify the behavior when an individual field whose type
implements
Drop
.