Support IoTDB as Time Series Storage #2857

bossenti · 2024-05-13T08:38:04Z

bossenti
May 13, 2024
Collaborator

I'm currently working on supporting Apache IoTDB as an additional option to store our time series data in StreamPipes.
My experience with IoTDB is limited to reading the documentation, performing some small evaluations, and the implementation effort I've spent so far.
Thus, the following is based on my current understanding of IoTDB's concept and may therefore be erroneous or incomplete.

What is already possible

As of now, we have very rudimentary support for Apache IoTDB as time series storage in our development branch.
IoTDB can be enabled as time series storage by setting a set of environment variables.
As a result, all time series data persisted in StreamPipes is stored in IoTDB.

Under the hood, this works as follows:

Let's assume we have created a data stream in StreamPipes with the help of the machine data simulator
and persist its data in StreamPipes into a measure called flowrate.
StreamPipes now has a so-called DataLakeMeasure with the name flowrate and the following schema:

timestamp: Long
sensorId: String
mass_flow: Float
volume_flow: Float
temperature: Float
density: Float
sensor_fault_flags: Boolean

Every event of our data stream is written into the aligned time series of one device.
The device name in our scenario equals the name of the data lake measure (flowrate), and the following path is used within IoTDB: root.streampipes.flowrate.
This is what the data looks like:

select * from root.streampipes.flowrate limit 10 align by device
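
The mapping described above can be sketched in a few lines of Python. This is only an illustration, not the StreamPipes implementation: the helper names device_path and to_aligned_record are hypothetical, the event values are made-up sample data, and only the path layout root.streampipes.<measure> comes from the post.

```python
# Path prefix taken from the post; the helper names are hypothetical.
STORAGE_PREFIX = "root.streampipes"


def device_path(measure_name: str) -> str:
    """Return the IoTDB device path for a data lake measure."""
    return f"{STORAGE_PREFIX}.{measure_name}"


def to_aligned_record(event: dict, timestamp_field: str = "timestamp"):
    """Split one event into (timestamp, measurement names, values).

    All non-timestamp fields become measurements of the same device,
    which is what writing into aligned time series requires.
    """
    timestamp = event[timestamp_field]
    names = [key for key in event if key != timestamp_field]
    values = [event[key] for key in names]
    return timestamp, names, values


# Sample event following the flowrate schema (values are invented).
event = {
    "timestamp": 1715589484000,
    "sensorId": "flowrate01",
    "mass_flow": 4.2,
    "volume_flow": 3.1,
    "temperature": 45.5,
    "density": 1.3,
    "sensor_fault_flags": False,
}

device = device_path("flowrate")
timestamp, names, values = to_aligned_record(event)
print(device, timestamp, names, values)
```

Note that in this layout sensorId is stored as just another measurement of the single device, which is why grouping by it is not directly possible.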

In addition, the count mechanism that provides an overview of how many events exist per data lake measure is implemented as well, using the following query:

Select count(temperature) from root.streampipes.flowrate

Well, so far so good. Getting here was straightforward, and IoTDB is easy to work with. It's a really cool piece of software!

Where problems arise

The approach outlined above is straightforward and works well with StreamPipes.
However, there are two aspects we haven't considered yet that make things more challenging, and it is currently unclear how to support them:

Dimension properties
Complex data types (mainly lists and nested data)

The latter is not directly relevant, but if anyone has a viable suggestion, I'd really appreciate it.
So let's have a closer look at dimension properties.
In StreamPipes, dimension properties are event properties (event fields) that represent dimensions.
In the example given above, sensorId would typically be modeled as a dimension property.
As such, sensorId has a discrete value space containing values like flowrate01 and flowrate02.
Dimension properties allow users, e.g., to group data along the provided dimensions in the data explorer. See an example configuration from the StreamPipes UI in the screenshot below.

In general, there can be more than one dimension property, such as locationId and sensorId, and users can potentially change an existing adapter to remove or add a dimension.

If our flowrate data stream is persisted as above, there is, to the best of my knowledge, no way to group data based on sensorId,
e.g., to calculate the average temperature per sensorId. Or is there anything I'm missing?

One possible solution is to split the data of our data stream into multiple devices in IoTDB and work with tags.
The idea is to have one device per value of a dimension property, in this case flowrate.flowrate01 and flowrate.flowrate02.
This allows us to tag the corresponding time series, e.g., flowrate.flowrate01.temperature, with a corresponding tag: sensorId=flowrate01.
This gives us the desired capability of grouping values along dimensions:

select avg(temperature) from root.streampipes.flowrate.** GROUP BY TAGS(sensorId)

Please excuse the different tag name and values in this screenshot, but I think you get the point.
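
To make the split concrete, here is a minimal Python sketch of how an event could be routed to a device per dimension value, together with the tags that would be attached to its time series. The function route_event and its parameters are hypothetical, and the sketch assumes that dimension values are valid path nodes (for example, that they contain no dots).

```python
def route_event(measure, event, dimensions, prefix="root.streampipes"):
    """Derive the target device, the tags, and the remaining fields of an event.

    Assumption: every dimension value can be used as a path node as-is.
    """
    dimension_values = [str(event[name]) for name in dimensions]
    # One device per combination of dimension values.
    device = ".".join([prefix, measure, *dimension_values])
    # Tags to attach to each time series of that device.
    tags = {name: str(event[name]) for name in dimensions}
    # Dimension fields move into the path, so they are no longer measurements.
    fields = {key: value for key, value in event.items() if key not in dimensions}
    return device, tags, fields


event = {"timestamp": 1, "sensorId": "flowrate01", "temperature": 45.5}
device, tags, fields = route_event("flowrate", event, ["sensorId"])
print(device)  # root.streampipes.flowrate.flowrate01
print(tags)    # {'sensorId': 'flowrate01'}
print(fields)  # {'timestamp': 1, 'temperature': 45.5}
```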

As an alternative, we could also skip tags and use the GROUP BY LEVEL statement instead.
The upside is that this allows us to calculate aggregations based on dimension values, but it also has some downsides.
First of all, it would break the current relationship in which a data lake measure corresponds to exactly one measurement in the time series storage.
Modeling the data as described here could result in a huge number of time series, since there must be one for each combination of dimension property values.
In addition, it would make our queries more complex, since we could no longer directly query all data for a data lake measure or count all records for one data lake measure without further computation.
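
A quick back-of-the-envelope calculation shows how fast the number of time series can grow. The cardinalities below are invented for illustration, and the result is an upper bound that is only reached if every combination of dimension values actually occurs.

```python
from math import prod


def series_upper_bound(dimension_cardinalities, measurements_per_device):
    """Return (devices, time series) if every dimension combination occurs."""
    devices = prod(dimension_cardinalities.values())
    return devices, devices * measurements_per_device


# Hypothetical cardinalities: 2 sensors at each of 5 locations.
cardinalities = {"sensorId": 2, "locationId": 5}
# mass_flow, volume_flow, temperature, density, sensor_fault_flags
devices, series = series_upper_bound(cardinalities, measurements_per_device=5)
print(devices, series)  # 10 50
```

With a single device per data lake measure, the same schema needs only one device, so the growth comes entirely from the dimension combinations.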

What are your thoughts on these considerations?
Am I using the concepts in the right way?
Are there any alternatives with which we could achieve the same results?
What if an adapter is modified, e.g., a dimension field is removed?

What should be possible in the end

Beyond the scenario above, we need the following functionality to be available via IoTDB queries. This may raise further considerations or issues and should ideally be taken into account when finding a solution:

simply list all values per data lake measure (Select *)
filter data based on values of dimension properties
group data based on a dimension property (e.g., show a line series per dimension property in the data explorer)
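
Under the device-per-dimension-value layout discussed above, the three requirements could map to query strings roughly as follows. This is a sketch only: the builder functions are hypothetical, the statements reuse the clauses quoted in this thread (align by device, group by tags), and none of them has been validated against a running IoTDB instance here.

```python
PREFIX = "root.streampipes"  # path prefix used in this thread


def list_all(measure):
    """Requirement 1: list all values of a data lake measure."""
    return f"select * from {PREFIX}.{measure}.** align by device"


def filter_by_dimension(measure, dimension_values):
    """Requirement 2: restrict to one device by putting dimension values in the path."""
    path = ".".join([PREFIX, measure, *dimension_values])
    return f"select * from {path} align by device"


def avg_by_tag(measure, field, tag_key):
    """Requirement 3: aggregate a field per value of a tagged dimension."""
    return f"select avg({field}) from {PREFIX}.{measure}.** group by tags({tag_key})"


print(list_all("flowrate"))
print(filter_by_dimension("flowrate", ["flowrate01"]))
print(avg_by_tag("flowrate", "temperature", "sensorId"))
```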

chrisdutz · 2024-05-13T09:34:15Z

chrisdutz
May 13, 2024
Collaborator

I was going to propose that data from multiple devices should be stored under different deviceIds in IoTDB. I also think that one device will most probably send one measurement using a fixed measure, so ideally this should be implemented as a tag assigned to a measure.

While IoTDB can generally handle an unbounded number of time series and an unbounded number of measurements for each of them, my gut feeling tells me that option 1 would be the better way to approach this.

1 reply

chrisdutz May 13, 2024
Collaborator

However, I do hope one of my colleagues will jump on this thread with more detailed help.

ottlukas · 2024-05-13T09:42:08Z

ottlukas
May 13, 2024
Collaborator

First of all thank you for getting that started.

One concept you did not mention is the device template. From the documentation on Device Template: "IoTDB supports the device template function, enabling different entities of the same type to share metadata, reduce the memory usage of metadata, and simplify the management of numerous entities and measurements."
As far as I understand, this would correspond to the "DataLakeMeasure".

+1 for using tags. This is IMHO the right path.
For your requirements (simply list, filter data, and group by), see: Select Clause -> Aggregation query by one single / multiple tags

1 reply

bossenti May 14, 2024
Collaborator Author

Thanks for bringing device templates to the table @ottlukas!

If I understand the concept correctly, this would only help us if we have multiple data streams (or then data lake measures) in StreamPipes that share the same metadata, i.e. event properties. Right?

JackieTien97 · 2024-05-14T04:26:50Z

JackieTien97
May 14, 2024

Actually, we're developing a relational model, but it may take several months to finish.

For the current tree model of IoTDB, I suggest that you put all your tag dimensions' values into the path, like root.streampipes.flowrate.flowrate02. If you have more than one dimension, you can keep adding them to the path, like root.streampipes.flowrate.West.flowrate02.

Then, if you want to group the data by all dimensions, you can use the align by device clause (https://iotdb.apache.org/UserGuide/latest/User-Manual/Query-Data.html#align-by-clause-1), like select avg(temperature) from root.streampipes.flowrate.** align by device.

The only limitation of this align by device approach is that it can only group data by all dimensions; if you only need one or some of the dimensions, you may need to do the aggregation in your client.
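
A client-side aggregation over a subset of dimensions could look like the following Python sketch. It assumes the client has fetched a sum and a count per device (for instance via sum(temperature) and count(temperature) with align by device) rather than per-device averages, because averaging averages would weight devices incorrectly. The function name, the row format, and the sample numbers are hypothetical, and dimension values are assumed to contain no dots.

```python
from collections import defaultdict


def average_by(rows, prefix, dimension_names, group_by):
    """Combine per-device (sum, count) pairs into averages per dimension subset.

    rows: iterable of (device_path, sum, count)
    dimension_names: order of the dimensions inside the device path
    group_by: the dimensions to keep in the result key
    """
    totals = defaultdict(lambda: [0.0, 0])
    for device, value_sum, value_count in rows:
        # Strip "<prefix>." and read the dimension values from the path.
        values = device[len(prefix) + 1:].split(".")
        dimensions = dict(zip(dimension_names, values))
        key = tuple(dimensions[name] for name in group_by)
        totals[key][0] += value_sum
        totals[key][1] += value_count
    return {key: s / c for key, (s, c) in totals.items() if c > 0}


rows = [
    ("root.streampipes.flowrate.West.flowrate01", 100.0, 4),
    ("root.streampipes.flowrate.West.flowrate02", 60.0, 2),
    ("root.streampipes.flowrate.East.flowrate03", 90.0, 3),
]
result = average_by(
    rows,
    prefix="root.streampipes.flowrate",
    dimension_names=["locationId", "sensorId"],
    group_by=["locationId"],
)
print(result)
```

Here West combines two devices (160.0 over 6 values), while East consists of a single device.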

12 replies

JackieTien97 May 20, 2024

For IoTDB, old data still remains too. For example, if we decide to use level 2 to represent the version of the device, like root.db.version1.d1, we can write data of that device into root.db.version1.d1(time, s1, s2, s3).

If the version of that device is upgraded to version2, we will generate another device (same as Influx, which will generate a new seriesKey) called root.db.version2.d1, and then we can write data into this new device root.db.version2.d1(time, s1, s2, s3).
Old data of version1 is still in root.db.version1.d1.

You can use select * from root.db.version1.d1 to query old data and select * from root.db.version2.d1 to query new data. The schema can also differ between root.db.version1.d1 and root.db.version2.d1; for example, we can have root.db.version1.d1(time, s1, s2, s3) but root.db.version2.d1(time, s4, s5, s6).

qiaojialin May 20, 2024

Hi, IoTDB's tree schema is completely equivalent to InfluxDB's tag schema. Here is an example

Here are some concept correspondences:

Database (InfluxDB) = Database (IoTDB)
Measurement (InfluxDB) = Internal node under database (IoTDB)
Tags (InfluxDB) = Other internal nodes (IoTDB)
Fields (InfluxDB) = Leaf nodes (IoTDB)
Measurement + Tags (series in InfluxDB) = All internal nodes under database (device in IoTDB)
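
The correspondence above can be illustrated with a small Python sketch that turns an InfluxDB-style point into IoTDB time series paths. The function name and the exact path layout root.<database>.<measurement>.<tag values> are my reading of the list and therefore assumptions, as is the fixed tag order that the caller has to provide, since InfluxDB tags are unordered while path nodes are not.

```python
def influx_point_to_iotdb(database, measurement, tags, fields, tag_order):
    """Map one InfluxDB-style point to {IoTDB time series path: value}.

    Measurement and tag values become internal nodes (together: the device),
    and each field becomes a leaf node under that device.
    """
    tag_nodes = [tags[key] for key in tag_order]
    device = ".".join(["root", database, measurement, *tag_nodes])
    return {f"{device}.{name}": value for name, value in fields.items()}


paths = influx_point_to_iotdb(
    database="streampipes",
    measurement="flowrate",
    tags={"sensorId": "flowrate02", "locationId": "West"},
    fields={"temperature": 45.5, "density": 1.3},
    tag_order=["locationId", "sensorId"],
)
for path, value in paths.items():
    print(path, value)
```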

Adding new schema (Tag3 in the example) will create new series in InfluxDB; likewise, IoTDB will create new devices (root.m1.A2.B2.C1, root.m1.A2.B3.C2).

Adding new time series also does not affect old time series; we can still access their data by path in IoTDB.

SteveYurongSu May 20, 2024
Collaborator

@bossenti @tenthe As previously mentioned, IoTDB will develop a relational model later on. I will also provide a development roadmap for the IoTDB relational model here. After the relational model is developed, the tree model and the relational model will coexist within IoTDB.

Relation Model V1 (expected to be released around July this year):

Establish the overall framework of the table model engine.
Writing: Support for JAVA SDK and SQL writing.
Query: Single table raw data query, support for time and value filtering, common functions, and operators.
Metadata: Database management, addition, deletion, and query of tables.

Relation Model V2 (expected to be released around September this year):

Writing: Support for C++ and Python SDK writing, support for Java Session SDK redirection, support for compaction, TTL, deletion, and write performance that is not inferior to the original tree interface.
Query: Support for multi-table inner join based on the time column, aggregate queries (group by clause support for any combination of time, identifier, and attribute columns), support for more scalar and aggregate functions.
Metadata: Table modification, index creation, permissions.

I think we can start by adapting the tree model now. However, considering factors such as understandability, we can switch to adapting the relational model after the relational model is developed.

bossenti May 21, 2024
Collaborator Author

@qiaojialin @JackieTien97 thank you very much for your detailed explanation! This underlines my current understanding of IoTDB's tree model.

@SteveYurongSu thank you, this looks awesome!

With the schedule in mind, I would suggest waiting for V1 of the relational model. It promises all the features and functionality we currently have, and I would like to save the implementation effort if we switch to the relational model anyway. This way, we can also be one of the early adopters of the new relational model and provide you with feedback.

JackieTien97 May 21, 2024

sure, look forward to your feedback
