Status | Accepted |
---|---|
Author(s) | Dero Gharibian ([email protected]) |
Sponsor | Gunhan Gulsoy ([email protected]) |
Updated | 2019-04-11 |
To unify and define the byte interface of a string tensor across TensorFlow’s C
API (`TF_STRING`), TF Lite (`kTfLiteString`), and TF-Core/C++ (`DT_STRING`), with
the purpose of enabling modular TensorFlow and mitigating the performance
overhead of string tensor conversions.
C++ string tensors (`DT_STRING`) in TensorFlow are defined as a contiguous array
of `std::string`s.
In contrast, C (`TF_STRING`) and TFLite (`kTfLiteString`) string tensors have a
different public byte layout. In C, string tensors are defined as a list of
uint64 offsets to varint-prefixed char strings (where the varint defines the
length of the string). Unlike C++ string tensors, which can allocate larger
strings on the heap, C string tensors are defined in a single block of
contiguous memory. Similarly, in TFLite, string tensors are defined as a list of
integer offsets to character strings. The offset list is prefixed by the total
string count and suffixed by the total buffer size. TFLite strings do not
explicitly specify the length of each string; instead, lengths are inferred
from the offset table. Unlike C strings, TFLite string tensors contain explicit
string counts and the total buffer size in the buffer. Furthermore, since the
endianness of the TFLite string tensor description is explicit, TFLite strings
are self-describing, exportable, and effectively mmap-able.
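As a rough illustration of the TFLite-style layout described above, the following sketch packs a few strings into one buffer and reads them back. It assumes 32-bit offsets measured from the start of the buffer and, for brevity, stores them in host byte order rather than a fixed endianness. `PackStrings`, `ReadInt32`, and `GetString` are hypothetical helper names, not part of the TFLite API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Packs strings as: [count][offset_0 .. offset_{n-1}][total_size][chars...].
// Assumption: offsets are int32 values measured from the start of the buffer.
std::vector<char> PackStrings(const std::vector<std::string>& strs) {
  const int32_t count = static_cast<int32_t>(strs.size());
  const size_t header_bytes = sizeof(int32_t) * (strs.size() + 2);
  size_t total = header_bytes;
  for (const auto& s : strs) total += s.size();

  std::vector<char> buf(total);
  std::memcpy(buf.data(), &count, sizeof(count));
  size_t cursor = header_bytes;
  for (size_t i = 0; i < strs.size(); ++i) {
    const int32_t offset = static_cast<int32_t>(cursor);
    std::memcpy(buf.data() + sizeof(int32_t) * (i + 1), &offset, sizeof(offset));
    std::memcpy(buf.data() + cursor, strs[i].data(), strs[i].size());
    cursor += strs[i].size();
  }
  // The trailing entry of the offset list is the total buffer size.
  const int32_t total_size = static_cast<int32_t>(total);
  std::memcpy(buf.data() + sizeof(int32_t) * (strs.size() + 1), &total_size,
              sizeof(total_size));
  return buf;
}

// Reads the index-th int32 of the header.
int32_t ReadInt32(const std::vector<char>& buf, size_t index) {
  int32_t value;
  std::memcpy(&value, buf.data() + sizeof(int32_t) * index, sizeof(value));
  return value;
}

// No per-string length is stored: it is inferred as offset[i + 1] - offset[i],
// with the trailing total size closing the last string.
std::string GetString(const std::vector<char>& buf, int32_t i) {
  const int32_t count = ReadInt32(buf, 0);
  assert(i >= 0 && i < count);
  const int32_t begin = ReadInt32(buf, static_cast<size_t>(i) + 1);
  const int32_t end = ReadInt32(buf, static_cast<size_t>(i) + 2);
  return std::string(buf.data() + begin, static_cast<size_t>(end - begin));
}
```

Because every value in the buffer is an offset rather than a pointer, the whole buffer can be written to disk or memory-mapped without fix-ups.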
When string tensors are marshalled across the C API, an expensive conversion
process, via `TF_TensorToTensor` and `TF_TensorFromTensor`, is done to convert a
`TF_STRING` to a `DT_STRING` and vice-versa. This results in a performance hit
at external language binding boundaries for string tensors. Furthermore, the
current implementation of the C API does not provide setters/getters or other
ancillary methods for constructing a `TF_STRING`. As a result, downstream
language bindings to Java, golang, etc., modify a raw buffer in order to build
the index offset list of strings. A similar conversion is done when TFLite
strings are passed to C++ kernels.
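To make the burden on language bindings concrete, here is a minimal sketch of building the legacy C layout by hand: a table of uint64 offsets followed by varint-prefixed strings. It assumes each offset is measured from the end of the offset table and that values are in host byte order. `AppendVarint`, `BuildLegacyStringBuffer`, and `ReadLegacyString` are hypothetical names, not functions of the C API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Appends `value` as a base-128 varint: 7 payload bits per byte, with the
// high bit set on every byte except the last.
void AppendVarint(uint64_t value, std::vector<uint8_t>* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<uint8_t>((value & 0x7f) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<uint8_t>(value));
}

// Builds [offset_0 .. offset_{n-1}][varint len_0][bytes_0][varint len_1]...
// Assumption: each offset is relative to the end of the offset table.
std::vector<uint8_t> BuildLegacyStringBuffer(
    const std::vector<std::string>& strs) {
  std::vector<uint8_t> data;
  std::vector<uint64_t> offsets;
  for (const auto& s : strs) {
    offsets.push_back(data.size());
    AppendVarint(s.size(), &data);
    data.insert(data.end(), s.begin(), s.end());
  }
  std::vector<uint8_t> buf(offsets.size() * sizeof(uint64_t));
  if (!offsets.empty()) std::memcpy(buf.data(), offsets.data(), buf.size());
  buf.insert(buf.end(), data.begin(), data.end());
  return buf;
}

// Reads string `i` of `count` back out of the buffer.
std::string ReadLegacyString(const std::vector<uint8_t>& buf, size_t count,
                             size_t i) {
  assert(i < count);
  uint64_t offset;
  std::memcpy(&offset, buf.data() + i * sizeof(uint64_t), sizeof(offset));
  const uint8_t* p = buf.data() + count * sizeof(uint64_t) + offset;
  uint64_t len = 0;
  int shift = 0;
  while (*p & 0x80) {  // continuation bytes
    len |= static_cast<uint64_t>(*p++ & 0x7f) << shift;
    shift += 7;
  }
  len |= static_cast<uint64_t>(*p++) << shift;  // final byte
  return std::string(reinterpret_cast<const char*>(p),
                     static_cast<size_t>(len));
}
```

Every binding that lacks constructor and accessor functions has to reimplement logic of this kind, which is part of the motivation for the accessors proposed in this document.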
Our aim with modular TensorFlow is to facilitate the creation of externally
built and dynamically loaded kernels. With modular TensorFlow, we plan to
provide a thin header only C++ API that depends on the C API. If we do not
update our approach to string tensors for modular TensorFlow, we will incur a
heavy cost when processing string tensors in kernels due to the constant
conversion between `DT_STRING` and `TF_STRING` across the C API. Currently, the
marshalling of string tensors across TFLite and TFCore incurs a similar cost.
In order to mitigate unnecessary performance degradation, we need to have a single definition of a string tensor which is ABI compatible across TFLite, C and C++. Furthermore, the string implementation needs to be ABI compatible across various compilers in order to enable modular TensorFlow.
STL containers/strings are not ABI stable, and can vary across compilers,
compiler versions, and even compiler flags. To mitigate these issues, we
propose a lightweight ABI stable replacement for the underlying objects
representing `DT_STRING`/`TF_STRING`/`kTfLiteString`s.
We are proposing two sets of changes in order to (1) unify the definition of string tensors across C++/Core, C-API, and TFLite; and (2) to achieve ABI stability for string tensors across compilers on a single architecture.
In order to unify the definition of string tensors in C, we propose the addition
of new methods for creating and ingesting tensors in C. We are also proposing
for the original method of creating tensors in the C API to be marked as
deprecated. In order to support the transition, we will include a flag in the
`TF_Tensor` struct to track which API the tensor was created with. Furthermore,
we plan to provide accessors and mutators for string tensors in C, in order to
simplify language bindings, and ease potential future changes to the byte layout
of string tensors.
For TFLite, we propose an additional enum for the new string tensor type,
allowing for backwards compatibility with existing kTfLiteString tensors. The
prototypes for string creation and string accessors in `strings_util.h` and
`strings.h` do not need to change.
For ABI stability, we propose a new string type that can handle four string
variants: local “small strings” (`SmallType`), longer heap-allocated strings
(`LargeType`), exportable/mmap-able offset-based strings (`OffsetType`), and
preallocated strings that are part of a contiguous buffer, with capacity
defined at string tensor initialization (`PreallocType`).
To achieve our aim of having universal ABI stable string tensors, we must adhere to the following requirements:
- Our approach must be ABI stable.
- Our approach must work with the Eigen C++ header library.
- For TFLite adoption, our approach must support direct memory mapping of string tensors.
- For TFLite adoption, our approach must allow for the packed representation of a string tensor.
- Our approach must be performant relative to the current use of std::string.
- Our approach must allow for lvalue assignment of string values.
- Our approach must allow for piecewise deployment externally. In other words, during the migration period, downstream users must be able to opt out of our new string tensor implementation.
We propose a new header-only ABI-stable `tstring` class. Our aim with `tstring`
is to provide a simplified container for strings which fits the narrow set of
requirements stipulated above. `tstring` is not meant as a replacement for
`std::string`, and will not implement the full interface of `std::string`. (Note
that `tensorflow::string` is currently aliased to `std::string`.)
Our proposed string implementation will be similar to the canonical 'Small
Strings Optimization' (SSO) used to implement std::string in C++ libraries, but
will feature two additional underlying string container types. In particular,
in addition to having a small local definition, and a heap allocated variant
for longer strings, we propose two additional types: an mmap-able/exportable
offset based string tensor (`OffsetType`), and a preallocated string tensor that
allocates a user-specified minimum capacity for each string as a part of a
contiguous buffer (`PreallocType`).
OffsetType strings will be used to replace the current TFLite string tensor
layout. PreallocType strings can be used in the future for performance
improvements, where N strings with M capacity can be pre-allocated from a
contiguous block of memory at tensor initialization with a single malloc
(instead of incurring N mallocs for large strings with the current std::string
implementation). In the scenario where a `PreallocType` string’s capacity is
exceeded, the `PreallocType` would be converted to a `LargeType`.
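The single-allocation idea behind `PreallocType` can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: `PreallocSlot`, `AllocatePrealloc`, and `Assign` are hypothetical names, and the headers are placed ahead of the character storage inside one `malloc`'d block.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for PreallocType: a size, a fixed capacity, and a
// pointer into storage that lives in the same allocation as the header.
struct PreallocSlot {
  uint32_t size;
  uint32_t cap;
  char* ptr;
};

// Allocates n headers plus n * cap bytes of character storage with ONE malloc.
// Layout: [slot_0 .. slot_{n-1}][cap bytes for slot_0] .. [cap bytes for slot_{n-1}]
PreallocSlot* AllocatePrealloc(size_t n, uint32_t cap) {
  void* block = std::malloc(n * (sizeof(PreallocSlot) + cap));
  if (block == nullptr) return nullptr;
  PreallocSlot* slots = static_cast<PreallocSlot*>(block);
  char* chars = static_cast<char*>(block) + n * sizeof(PreallocSlot);
  for (size_t i = 0; i < n; ++i) {
    slots[i].size = 0;
    slots[i].cap = cap;
    slots[i].ptr = chars + i * cap;
  }
  return slots;
}

// Copies `len` bytes into the slot. Returns false when the capacity would be
// exceeded, which is the point where the proposal converts to LargeType.
bool Assign(PreallocSlot* slot, const char* src, uint32_t len) {
  assert(slot != nullptr);
  if (len > slot->cap) return false;
  std::memcpy(slot->ptr, src, len);
  slot->size = len;
  return true;
}
```

With this layout, releasing the tensor is a single `std::free` of the block, compared with one deallocation per large string today.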
The following is a layout overview of the proposed new string container type:
```cpp
#include <cstddef>
#include <cstdint>

namespace tensorflow {

class tstring {
 public:
  // The string type is stored in the two lowest-order bits of the first byte.
  static const uint8_t kSmallType = 0x00;
  static const uint8_t kLargeType = 0x01;
  static const uint8_t kOffsetType = 0x02;
  static const uint8_t kPreallocType = 0x03;
  static const uint8_t kTypeMask = 0x03;

  // Heap-allocated string.
  struct LargeType {
    size_t size_;
    char* ptr_;
    // ...
  };

  // String with fixed capacity inside a preallocated contiguous buffer.
  struct PreallocType {
    uint32_t size_;
    uint32_t cap_;
    char* ptr_;
    // See “Capacity member variable” section below.
    // ...
  };

  // Exportable/mmap-able string addressed by offset instead of pointer.
  struct OffsetType {
    uint32_t size_;
    uint32_t offset_;  // `this` pointer + offset_ points to char string
    uint32_t count_;
    // ...
  };

  struct RawType {
    uint8_t raw_[16];
  };

  // Used only to compute the capacity available to SmallType.
  union UnionedType {
    LargeType p;
    OffsetType o;
    PreallocType f;
    RawType r;
  };

  enum {
    SmallTypeCapacity = (sizeof(UnionedType) - sizeof(uint8_t)) / sizeof(char),
  };

  // String stored inline, without any heap allocation.
  struct SmallType {
    uint8_t size_;
    char str_[SmallTypeCapacity];
    // ...
  };

  union {
    LargeType p;
    OffsetType o;
    PreallocType f;
    SmallType s;
    RawType r;
  };

  uint8_t type() const { return r.raw_[0] & kTypeMask; }
  // ...
};

}  // namespace tensorflow
```
Regardless of endianness, the two lowest-order bits of the first byte denote
the string type (i.e. `r.raw_[0] & 0x03` above). For all string types except
`kOffsetType`, values will be stored in host byte order. Values for `OffsetType`
strings will be explicitly defined as little endian.
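The tag check can be illustrated with a small sketch. The text above only fixes where the type bits live; the assumption made here, which the proposal does not spell out, is that a small string keeps its length in the remaining six bits of the same byte. `Raw16`, `TypeOf`, `SetSmall`, and `SmallSize` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint8_t kSmallType = 0x00;
constexpr uint8_t kLargeType = 0x01;
constexpr uint8_t kOffsetType = 0x02;
constexpr uint8_t kPreallocType = 0x03;
constexpr uint8_t kTypeMask = 0x03;

// Hypothetical 16-byte raw view of a string object.
struct Raw16 {
  uint8_t raw[16];
};

// The type tag lives in the two lowest-order bits of the first byte.
uint8_t TypeOf(const Raw16& s) { return s.raw[0] & kTypeMask; }

// Assumption: the small-string length occupies the upper six bits of byte 0,
// leaving bytes 1..15 for the characters.
void SetSmall(Raw16* s, const char* src, uint8_t len) {
  assert(len <= 15);
  s->raw[0] = static_cast<uint8_t>((len << 2) | kSmallType);
  std::memcpy(&s->raw[1], src, len);
}

uint8_t SmallSize(const Raw16& s) {
  assert(TypeOf(s) == kSmallType);
  return static_cast<uint8_t>(s.raw[0] >> 2);
}
```

Reading a single byte makes the type check independent of the byte order used for the wider fields.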
Of the four string types defined above, the only type with potential Eigen
compatibility issues is the `OffsetType`. Since the `OffsetType` relies on an
offset value instead of a char pointer to point to the character string, and
since Eigen is restrictive on how values are indexed, the simplest approach for
providing Eigen compatibility is to define the 'offset' value as an offset from
the `this` pointer of a `tstring` scalar, and not as an offset from the start of
the tensor buffer. More concretely, accessing the character string for an
`OffsetType` would be analogous to:
```cpp
const char* data() const {
  return reinterpret_cast<const char*>(this) + offset_;
}
```
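A slightly fuller sketch shows why `this`-relative offsets keep a packed buffer position independent. It is a simplified stand-in, not the proposed class: `OffsetString` omits the type tag, uses host byte order, adds an explicit padding field to reach 16 bytes, and leaves `count_` unused. `PackOffsetStrings` and `HeaderAt` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Simplified stand-in for OffsetType.
struct OffsetString {
  uint32_t size_;
  uint32_t offset_;  // distance in bytes from `this` to the characters
  uint32_t count_;   // not used in this sketch
  uint32_t pad_;     // assumption: pads the header to 16 bytes

  const char* data() const {
    return reinterpret_cast<const char*>(this) + offset_;
  }
};
static_assert(sizeof(OffsetString) == 16, "header is expected to be 16 bytes");

// Packs all headers first, followed by the character data, in one buffer.
std::vector<char> PackOffsetStrings(const std::vector<std::string>& strs) {
  const size_t n = strs.size();
  size_t total = n * sizeof(OffsetString);
  for (const auto& s : strs) total += s.size();

  std::vector<char> buf(total);
  size_t cursor = n * sizeof(OffsetString);
  for (size_t i = 0; i < n; ++i) {
    const size_t header_pos = i * sizeof(OffsetString);
    OffsetString h;
    h.size_ = static_cast<uint32_t>(strs[i].size());
    h.offset_ = static_cast<uint32_t>(cursor - header_pos);  // relative to header
    h.count_ = 0;
    h.pad_ = 0;
    std::memcpy(buf.data() + header_pos, &h, sizeof(h));
    std::memcpy(buf.data() + cursor, strs[i].data(), strs[i].size());
    cursor += strs[i].size();
  }
  return buf;
}

// Views header `i` of `n` in place; relies on the buffer being suitably
// aligned, which holds for heap-allocated vectors on common platforms.
const OffsetString* HeaderAt(const std::vector<char>& buf, size_t i, size_t n) {
  assert(i < n);
  return reinterpret_cast<const OffsetString*>(buf.data()) + i;
}
```

Since each element resolves its characters from its own address, a byte-for-byte copy of the buffer at a different address remains valid, and indexing element `i` requires no knowledge of where the tensor buffer starts.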
In the scenario where an assignment exceeds the capacity of an `OffsetType`,
`SmallType`, or `PreallocType` string, the string is converted to a `LargeType`
string. The original string representation is copied as a prefix to the
`LargeType` buffer so that the string can revert to its original type when a
subsequent assignment fits within the original capacity. This feature will
allow for lvalue assignment of `OffsetType` strings, which, as a corollary,
will allow for the assignment of TFLite string tensors.
Implementation | Size in Bytes | Small String Capacity |
---|---|---|
tensorflow::tstring | 16 | 15 |
tensorflow::tstring w/ capacity (see below) | 24 | 23 |
GCC | 32 | 15 |
MSVC | 32 | 15 |
LLVM | 24 | 22 |
Currently, TFLite has an overhead of roughly 8 bytes per string entry, which is used to describe the offset. The tstring class described above has an overhead of 16 bytes, and does not include the capacity field normally found in SSO implementations for LargeType strings. This comes with a downside: without a capacity field, LargeType strings are forced to always call realloc on assignment. To reduce the potential number of calls to realloc, we can add a capacity field, at the cost of increasing the per-string overhead to 24 bytes. This would put us at parity with LLVM strings, but would result in a 3x overhead compared to current TFLite strings.
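The trade-off can be sketched with two hypothetical layouts and a small counting model. `LargeNoCapacity`, `LargeWithCapacity`, and `CountReallocs` are illustrative names. The model assumes that a string without a capacity field reallocates on every assignment, and that a string with one reallocates only when the new length exceeds the largest capacity seen so far.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical heap-string headers, without and with a capacity field.
struct LargeNoCapacity {
  size_t size_;
  char* ptr_;
};

struct LargeWithCapacity {
  size_t size_;
  size_t cap_;
  char* ptr_;
};

// Counts the reallocations needed to assign strings of the given lengths,
// one after another, to the same string object.
size_t CountReallocs(const std::vector<size_t>& lengths, bool track_capacity) {
  size_t cap = 0;
  size_t reallocs = 0;
  for (size_t len : lengths) {
    if (!track_capacity) {
      ++reallocs;  // capacity is unknown, so every assignment reallocates
      continue;
    }
    assert(cap <= static_cast<size_t>(-1) - len);  // guard for the sketch only
    if (len > cap) {  // grow only when the tracked capacity is insufficient
      cap = len;
      ++reallocs;
    }
  }
  return reallocs;
}
```

On a typical 64-bit target the two headers occupy 16 and 24 bytes respectively, which matches the per-string overheads discussed above.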
Currently, downstream language bindings to Java, golang, etc., are expected to
modify a raw buffer in order to construct and pass a string tensor. We propose
new functions to abstract the creation of C string tensors, and a new flag in
`TF_Tensor` which tracks the method with which a C string tensor was created.
Using the new methods will effectively create C++ string tensors underneath,
and, when passed back and forth, will mitigate the conversion of C `TF_STRING`
tensors to C++ `DT_STRING` tensors and vice-versa via `TF_TensorToTensor` and
`TF_TensorFromTensor`.
Since TFLite provides generators and accessors for TFLite string tensors, the
changes needed to have TFLite conform to the `OffsetType` defined above are on
the order of a ~20 line CL. Backwards compatibility can be maintained by
creating a new TFLite enum for `tstring` separate from the existing
`kTfLiteString` enum.