feature: Extend the max precision of decimal datatype to 38 digits #456
Labels
A-feature
feature with good idea
B-storage
data type, data storage, insert,update,delete, transactions
prio: high
High priority
Milestone
Describe the problem
In Tianmu, a decimal value is converted to an int64_t for storage. The following code shows the conversion:
The following code defines dec_f->val_real():
Solution
Boost Multiprecision Library provides integer, rational, and floating-point number types in C++ that have more range and precision than built-in types of C++. The big number types in Multiprecision can be used with a wide selection of basic mathematical operations. Boost Multiprecision provides a generic interface to GMP, MPFR, MPIR, TomMath backends, with support for integer, rational and floating-point types. In addition, user-defined backends can be created and used with the interface.
The Multiprecision library consists of two parts:
● An expression-template-enabled frontend number that handles all the operator overloading, expression evaluation optimization, and code reduction.
● A selection of backends that implement the actual arithmetic operations, and need conform only to the reduced interface requirements of the frontend. Supported types of backends include: GMP, MPFR, MPIR, TomMath, and Boost-licensed.
Frontend declaration:
Backend declaration (cpp_int_backend is used as an example):
Frontend and backend declaration (boost::multiprecision::int128_t is used as an example):
Another backend declaration (GMP is used as an example):
● Declaration of the Boost's built-in Integer data type:
// Fixed precision unsigned types:
// Fixed precision signed types:
● Analysis on the storage of data of Boost's built-in Integer data type
If is_trivial_cpp_int::value is true, the maximum precision is smaller than or equal to that of double_limb_type:
If is_trivial_cpp_int::value is false, the maximum precision is larger than that of double_limb_type:
● Required storage for each data type (BOOST_HAS_INT128 Macro enabled)
The storage size of the intX_t data types is determined at compile time. They consume more memory than arbitrary-precision integer types but provide higher performance.
The cpp_int data type combines fixed in-object storage with dynamic memory allocation, so it can grow as needed. It sacrifices performance to ensure scalability.
● Required storage for cpp_int, GMP, and TomMath (BOOST_HAS_INT128 Macro enabled)
According to the table above, GMP consumes the least memory, TomMath requires more memory allocations, and cpp_int consumes the most memory. The required storage for each data type in the table is presented in the "a+b" format, where a indicates the size of the data type itself and b indicates the memory dynamically allocated for the data type.
● Performance comparison
Based on the official description, GMP provides higher performance than cpp_int. However, whether GMP provides higher performance than int256_t needs to be tested.
● Performance comparison between GMP and int256_t
About the performance test: a single thread is bound to one CPU core. Ten thousand 1- to 65-bit integers are populated, and two of them are randomly chosen to complete an operation (add, subtract, multiply, or divide). This is repeated 1 billion times.
The following table provides the time consumed (unit: ms):
Based on the table above, GMP takes about 1.68 times as long as int256_t.
The test code is as follows.
Command: numactl -C 2 ./mmx
3. Conclusion
For integers with a precision of 1 to 65, GMP requires 40 to 56 bytes of storage. If the precision is evenly distributed, the average is 48 bytes. int256_t uses fixed precision and requires 48 bytes, and its performance is 68% higher than that of GMP. In addition, using GMP would introduce another third-party library, which increases integration complexity, and GMP is not suitable for cross-platform scenarios. Therefore, GMP is not the optimal option.
cpp_int provides good scalability but low performance, whereas int256_t provides high performance but no scalability. Currently, scalability is not a factor we need to consider.
In conclusion, to address the current problem, int128_t and int256_t can be used instead of cpp_int.
Implementation steps
● Data type for storage and storage encoding format.
● decimal's support for SQL comparison operators, including Greater Than, Less Than, and Equal To.
● decimal's support for aggregate functions, including SUM, MAX, and STD.
● decimal's support for GROUP BY, JOIN, SORT, IN, and NOT IN.
● Range check for decimal.
● decimal's support for indexing and filtering.
Specifically, the following must be achieved:
● Conversion of the storage data type from PackInt to PackStr, conversion of decimal values to binary strings, and support for all kinds of conditions on decimal.
● core::ValueOrNull's support for decimal (optional).
● ConstColumn's support for decimal.
● Item_tianmufield's support for decimal.
● core::DataType's support for decimal: Binary strings can be converted to any other data type.
● RCAttr's support for decimal: Filtering and comparison of data packs whose data type is decimal are supported.
● MysqlExpression's support for decimal (optional): The underlying support can remain unchanged.
● MultiValColumn's support for decimal: especially for IN and NOT IN.
● decimal's support for AVG, SUM, MAX, and MIN.
● decimal's support for JOIN: HASH, SORT, and MAP.
● SorterWrapper's support for decimal: ORDER is supported.
CheckLists