Using IoTDB as database for the time-series storage? #1369

dominikriemer · 2023-03-01T18:40:44Z

dominikriemer
Mar 1, 2023
Collaborator

Hi everyone,
Currently, we use InfluxDB as the backend for our time-series storage. I'm not 100% satisfied with this setup, as Influx is missing some query features such as grouping by fields. On the other hand, we now have a really cool database within the ASF IoT space - so I thought: why not use IoTDB as the storage backend of StreamPipes?

IoTDB should fit really well into our model, has many cool features Influx is missing (many built-in functions) and from my point of view it would just make sense to use an ASF sister project ;-)

The only thing is that we would need to find a good migration path - but I think it should be possible to start with refactoring the current implementation so that there are clear APIs for queries coming from the data explorer API, and then having IoTDB as an alternative database with Influx as default until we reach a stable state and could then switch to IoTDB as default.
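
To make the "clear API plus switchable default" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the interface name `TimeSeriesStore`, the config key `timeseries.backend`, and the in-memory stand-in backends are all hypothetical and are not part of StreamPipes, InfluxDB, or IoTDB.

```python
from abc import ABC, abstractmethod


class TimeSeriesStore(ABC):
    """Hypothetical storage interface that the data explorer API would call."""

    @abstractmethod
    def write(self, measurement: str, timestamp_ms: int, fields: dict) -> None:
        ...

    @abstractmethod
    def query(self, measurement: str, start_ms: int, end_ms: int) -> list:
        ...


class InMemoryStore(TimeSeriesStore):
    """Stand-in backend so the sketch runs without any database."""

    def __init__(self, name: str):
        self.name = name
        self._rows = {}

    def write(self, measurement, timestamp_ms, fields):
        self._rows.setdefault(measurement, []).append((timestamp_ms, dict(fields)))

    def query(self, measurement, start_ms, end_ms):
        rows = self._rows.get(measurement, [])
        in_range = (row for row in rows if start_ms <= row[0] <= end_ms)
        return sorted(in_range, key=lambda row: row[0])


# Hypothetical registry: real implementations would wrap the Influx and IoTDB clients.
BACKENDS = {
    "influx": lambda: InMemoryStore("influx"),
    "iotdb": lambda: InMemoryStore("iotdb"),
}


def create_store(config: dict) -> TimeSeriesStore:
    """Pick the backend from config; Influx stays the default during the migration."""
    backend = config.get("timeseries.backend", "influx")
    if backend not in BACKENDS:
        raise ValueError(f"unknown time-series backend: {backend!r}")
    return BACKENDS[backend]()
```

With such a layout, switching the default later would only mean changing the fallback value in `create_store`, while callers keep using the same interface.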

What do you think of this idea?

bossenti · 2023-03-02T18:42:43Z

bossenti
Mar 2, 2023
Collaborator

@dominikriemer thank you for starting this interesting discussion.
Although I am not deeply familiar with the strengths and weaknesses of InfluxDB and IoTDB, your proposal sounds good to me.
I especially like the outlined migration path.

1 reply

bossenti Mar 2, 2023
Collaborator

I just skimmed through a few comparisons of both DBs, and I think IoTDB's full support of data types, compared to InfluxDB, is a major advantage that also connects really well to StreamPipes:
https://db-engines.com/en/system/Apache+IoTDB%3BInfluxDB

tenthe · 2023-03-03T15:47:58Z

tenthe
Mar 3, 2023
Collaborator

Thank you for initiating the discussion. I think a phased approach, starting with API optimization and followed by database migration, would be a good strategy.
In addition, I believe it is important to establish a clear API that facilitates the transition to the new database in the future. However, I am not in favor of supporting multiple databases as this introduces additional dependencies that must be maintained.

I think it would be beneficial to conduct a very brief comparison of the two databases to evaluate their respective advantages and disadvantages, especially how the data model maps to the one used in StreamPipes. If anyone has experience working with both, it would be helpful to share their insights.

@SteveYurongSu do you have any thoughts on this discussion?

4 replies

bossenti Mar 3, 2023
Collaborator

However, I am not in favor of supporting multiple databases as this introduces additional dependencies that must be maintained.

Are we talking here about supporting multiple databases in the long term?
If so, I would agree.
But I think we should support the old world (InfluxDB) and the new world (IoTDB, if we adopt it) in parallel for some period of time, e.g. two releases (that's also how I understood @dominikriemer).

tenthe Mar 4, 2023
Collaborator

Okay, then I must have misunderstood. I thought the idea was to support multiple time-series databases in the long run.
Doing the migration over one or two releases makes sense. But then we need to make sure that users can safely change the database during an update.

bossenti Mar 4, 2023
Collaborator

Fully agree 🙂

SteveYurongSu Mar 4, 2023
Collaborator

@tenthe @bossenti +1 for doing the migration over one or two releases, keeping both of them in parallel and finally finding a way for users to upgrade safely 🙂

SteveYurongSu · 2023-03-03T18:39:35Z

SteveYurongSu
Mar 3, 2023
Collaborator

@dominikriemer Thanks for starting this discussion. I'm so excited to see the discussion here!

@bossenti Here are some key differences between IoTDB and InfluxDB (most of them are the main advantages of IoTDB):

1. [Cluster]

The cluster version of IoTDB is open source (and of course under the ASF 😆), while the cluster version of InfluxDB is NOT open source and must be paid for.

IoTDB cluster document: https://iotdb.apache.org/UserGuide/Master/Cluster/Cluster-Concept.html

2. [Schema / Data Model]

The data model of IoTDB is tree-like, and it is more suitable for the IIoT scenario than the tag-field model (the InfluxDB data model). Data in the tree model can be naturally bound to the asset management of industrial sites.

IoTDB Data Model: https://iotdb.apache.org/UserGuide/Master/Data-Concept/Data-Model-and-Terminology.html

@tenthe Another key issue is how to convert data from the tag-field model to the tree model. I know a document which may help: https://iotdb.apache.org/UserGuide/Master/API/InfluxDB-Protocol.html

3. [Performance]

In our test environment, IoTDB has better read and write performance than InfluxDB.

There are many reasons for this. For example, IoTDB provides a native Thrift-based Java client, while InfluxDB's Java client is implemented as a simple HTTP client wrapper.

We have a benchmark tool to test the performance differences between IoTDB and many other time series databases. You can also try it: https://github.com/thulab/iot-benchmark

4. [Data Processing]

[User Defined Plugins] IoTDB supports user-defined functions and triggers, which means that calculations currently done in StreamPipes (SP) could potentially be pushed down to the database.
- UDF: https://iotdb.apache.org/UserGuide/Master/Operators-Functions/User-Defined-Function.html
- Triggers: https://iotdb.apache.org/UserGuide/Master/Trigger/Instructions.html
[Rich Group-by Ways] https://iotdb.apache.org/UserGuide/Master/Query-Data/Group-By.html
[Rich IoT Functions] IoTDB has many special functions designed for IoT scenarios:
- Data Matching: https://iotdb.apache.org/UserGuide/Master/Operators-Functions/Data-Matching.html
- Frequency Domain Analysis: https://iotdb.apache.org/UserGuide/Master/Operators-Functions/Frequency-Domain.html
- Data Quality: https://iotdb.apache.org/UserGuide/Master/Operators-Functions/Data-Quality.html
…

Of course, IoTDB has its drawbacks, such as a less rich ecosystem of surrounding tools, the stability of the system, and so on.

For the migration path:

but I think it should be possible to start with refactoring the current implementation so that there are clear APIs for queries coming from the data explorer API, and then having IoTDB as an alternative database with Influx as default until we reach a stable state and could then switch to IoTDB as default.

I think a phased approach, starting with API optimization and followed by database migration, would be a good strategy.

we should support the old world (influx) and the new world (iotdb, in case) for some period of time in parallel.

I totally agree with your approach. @dominikriemer @tenthe @bossenti

I think we can start the migration task as follows:

1. Break down the task and determine which components we need to implement/adapt at each step
2. Discuss the tasks of each step in detail and determine how to implement/adapt them

I am very excited to join in the migration task, but I am not completely familiar with the system, so I may need some help with breaking down the task first. Can someone help me with that? Then we can have some more detailed discussions. 😄

6 replies

bossenti Mar 5, 2023
Collaborator

The tree-based data model structure seems to be a great fit.
This probably maps very well to OPC-UA, right?
cc @tenthe

tenthe Mar 6, 2023
Collaborator

Yes, you are right, this is similar to how data is represented in PLCs. However, I think this is a point we could discuss. Currently we represent data in event streams with a pretty fixed event schema. Therefore, we should talk about how to adapt the event schema to the IoTDB tree model.

For initial integration, I think this should work well since we can map an event to a tree, but perhaps there are concepts from IoTDB we can incorporate into StreamPipes to soften the "rigid" event schema a bit in the future. I look forward to hearing your ideas on this topic in the discussions that follow.

tenthe Mar 6, 2023
Collaborator

Am I assuming correctly that a StreamPipes EventStream with an EventSchema would be equivalent to an IoTDB aligned timeseries?
@qiaojialin @SteveYurongSu

SteveYurongSu Mar 7, 2023
Collaborator

Am I assuming correctly that a StreamPipes EventStream with an EventSchema would be equivalent to an IoTDB aligned timeseries? @qiaojialin @SteveYurongSu

@tenthe You are right :)

For example, here is an event in an EventStream, where x, y, z are described by an EventSchema.

{
  "group-id": "614156E02",
  "factory-id": "ASD",
  "timestamp": "2023-02-09 10:42:27.000",
  "device-id": "ZXC",
  "values": {
    "x": "12.03",
    "y": "0.0",
    "z": "0.9"
  }
}

Then the event can be stored in an aligned timeseries:

root.`614156E02`.`ASD`.`ZXC`.(timestamp, x, y, z)

x, y, and z share the same timestamp.
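
The mapping above can be sketched in a few lines of Python. This is only an illustration, not the IoTDB client API: the function name `to_aligned_record` is hypothetical, the order of path levels (group-id, factory-id, device-id) is taken from the example, and the timestamp is assumed to be in UTC. Node names that themselves contain backticks would need extra escaping, which is not handled here.

```python
from datetime import datetime, timezone

# Assumption: the tree levels follow the order used in the example path.
PATH_KEYS = ("group-id", "factory-id", "device-id")


def to_aligned_record(event: dict) -> tuple:
    """Turn one event into (device path, timestamp in ms, measurements).

    All measurements of the event share one timestamp, which is the
    property an aligned timeseries relies on.
    """
    nodes = ["root"] + [f"`{event[key]}`" for key in PATH_KEYS]
    device_path = ".".join(nodes)

    # Assumption: the event timestamp is UTC with millisecond precision.
    parsed = datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    timestamp_ms = int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)

    # The example carries numbers as strings, so convert them to floats.
    measurements = {name: float(value) for name, value in event["values"].items()}
    return device_path, timestamp_ms, measurements


event = {
    "group-id": "614156E02",
    "factory-id": "ASD",
    "timestamp": "2023-02-09 10:42:27.000",
    "device-id": "ZXC",
    "values": {"x": "12.03", "y": "0.0", "z": "0.9"},
}
device_path, timestamp_ms, measurements = to_aligned_record(event)
```

Here `device_path` is the device part of the path shown above, and `measurements` holds x, y, z for the single shared timestamp.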

tenthe Mar 8, 2023
Collaborator

Great, thanks.
This is a good example you have given that illustrates something we should discuss.
In your example, the metadata, such as group-id, is part of the event. In StreamPipes we usually only have the actual sensor data, in your example x, y, z.

To represent this data, a user can create an asset model in the user interface. Currently, this information is only used to group the various resources (adapters, pipelines, dashboards, etc). Perhaps we can find a way to use this information for the tree structure as well.

qiaojialin · 2023-03-04T04:46:36Z

qiaojialin
Mar 4, 2023

Hi, I'm Jialin Qiao, a PMC member of Apache IoTDB. Thanks for bringing up the integration between Apache StreamPipes and Apache IoTDB :)

Both StreamPipes and IoTDB are built for IIoT scenarios, and SP is closer to the end user. We will be very glad to meet the interesting requirements of SP, which will help IoTDB reach more deeply into the industry.

Here is a comparison between TSDBs: https://iotdb.apache.org/UserGuide/Master/Reference/TSDB-Comparison.html#feature-comparison

Some highlights of IoTDB:

- IoTDB is designed for end-edge-cloud scenarios; data synchronization between IoTDB instances is built in and needs no external ETL tools.
- The batch ingestion speed of IoTDB using the Session.insertTablet() interface is higher (10×) than with InfluxDB's line protocol, so IoTDB can support high-frequency (1 kHz) timeseries.
- The longer the time range to aggregate, the more IoTDB's aggregation speed stands out, because IoTDB has pre-aggregations while InfluxDB does not.
- The data file of IoTDB is called TsFile, which is a highly compressed file format for time series. TsFile has its own API and can be used independently, like Parquet and ORC.
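
To illustrate why pre-aggregation helps more on longer ranges, here is a minimal Python sketch of the general technique. It is not IoTDB's actual implementation: the names `ChunkStats`, `build_chunks`, and `average` are hypothetical, and the fixed chunk size is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Statistics stored once per chunk when the data is written."""
    start: int    # first timestamp in the chunk
    end: int      # last timestamp in the chunk (inclusive)
    count: int
    total: float


def build_chunks(points, chunk_size):
    """Split time-sorted (timestamp, value) points into chunks with stats."""
    chunks = []
    for i in range(0, len(points), chunk_size):
        part = points[i:i + chunk_size]
        stats = ChunkStats(part[0][0], part[-1][0], len(part),
                           sum(value for _, value in part))
        chunks.append((stats, part))
    return chunks


def average(chunks, start, end):
    """Average over [start, end]; only partially covered chunks are scanned."""
    count, total = 0, 0.0
    for stats, part in chunks:
        if stats.end < start or stats.start > end:
            continue  # chunk lies outside the query range
        if start <= stats.start and stats.end <= end:
            # Fully covered chunk: reuse the stored statistics.
            count += stats.count
            total += stats.total
        else:
            # Boundary chunk: fall back to scanning its raw points.
            for timestamp, value in part:
                if start <= timestamp <= end:
                    count += 1
                    total += value
    return total / count if count else None
```

At most two boundary chunks need a raw scan per query, so the longer the range, the larger the share of the work that is answered from stored statistics.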

Looking forward to building close cooperation between Apache StreamPipes and Apache IoTDB communities 😄

4 replies

bossenti Mar 5, 2023
Collaborator

@qiaojialin thanks a lot for reaching out to us 🎉
The native support of aggregations is really great and something our data explorer will definitely benefit from.
Since StreamPipes follows an event-based approach, batch-based ingestion is more of a corner case for us.
Do you have some references or experience with high-frequency/streaming approaches?

qiaojialin Mar 6, 2023

Hi, we usually write high-frequency data (as a batch) directly into the database without processing the data points one by one, for the sake of write throughput. Then, IoTDB can store all the raw data, or use SDT (dead zone method) to filter out unnecessary data and use triggers to generate events.

In StreamPipes, maybe we could treat high-frequency data (into the database) and event-based data (into the rule engine) separately.
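
As a rough illustration of filtering out unnecessary high-frequency data, here is a minimal Python sketch of a plain dead-band filter. Note that this is a deliberately simpler technique than SDT itself, and it is neither IoTDB's nor StreamPipes' implementation; the function name `deadband_filter` and the threshold value are assumptions for the example.

```python
def deadband_filter(points, threshold):
    """Keep a point only if its value moved more than `threshold`
    away from the last kept value.

    points: iterable of (timestamp, value) pairs in time order.
    """
    kept = []
    for timestamp, value in points:
        if not kept or abs(value - kept[-1][1]) > threshold:
            kept.append((timestamp, value))
    return kept


raw = [(0, 1.0), (1, 1.05), (2, 1.2), (3, 1.25), (4, 0.9)]
filtered = deadband_filter(raw, threshold=0.1)
```

Only the points that change by more than the threshold survive, so a slowly varying 1 kHz signal would shrink considerably before it is stored or turned into events.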

SteveYurongSu Mar 7, 2023
Collaborator

Since StreamPipes follows an event-based approach, batch-based ingestion is more a corner case for us.

@bossenti And I might add that IoTDB has a dedicated ingestion interface for event-by-event ingestion scenarios. The interface is in fact used by the new SP IoTDB sink that I've re-implemented previously. :)

or use SDT(dead zone method) to filter out unnecessary data and use triggers for generating events.

@qiaojialin Recently I've submitted a PR to bring the SDT method to StreamPipes. I think this may help in very high-frequency insertion scenarios.

bossenti Mar 7, 2023
Collaborator

@bossenti And I might add that, IoTDB has a dedicated ingestion interface for event-by-event ingestion scenarios. The interface is in fact used by the new SP IoTDB sink that I've re-implemented previously. :)

That's great!
Very interesting to have this sink as a minimal example 🙂

dominikriemer · 2023-03-04T21:45:59Z

dominikriemer
Mar 4, 2023
Collaborator Author

Hi everyone!
Great that everyone likes the idea :-)

@SteveYurongSu @qiaojialin thanks a lot for providing so much background information - that's very helpful! I've already started to get a deeper understanding and really like the features IoTDB offers.

For the next steps, I'll have a look at the data explorer module to get a better understanding of which Influx features we currently use. Afterwards I could create a page in our wiki where we can collect requirements and identify changes and mappings from the current implementation to the IoTDB model. Then, we can split the work into tasks and work towards a first prototype.

I'm really excited to work with you on this topic!

7 replies

SteveYurongSu Mar 7, 2023
Collaborator

I think @dominikriemer's suggestion is great: once the interface is cleaned up, we can plan the integration in the wiki.

That will be great! Looking forward to our first prototype :) @dominikriemer @tenthe @bossenti 🚀

dominikriemer Apr 3, 2023
Collaborator Author

Hi, just to follow up on this - in the meantime, I had a closer look at the streampipes-data-explorer module and refactored the code to get a better understanding of the queries and to decouple query management from the underlying database. It's still not perfect, but I think it should make it easier to check the requirements against IoTDB features.

The Influx implementation is at [1] - I guess the next good step would be to create the wiki page, but any comments so far are welcome!

[1] https://github.com/apache/streampipes/tree/dev/streampipes-data-explorer/src/main/java/org/apache/streampipes/dataexplorer/influx

tenthe Apr 3, 2023
Collaborator

Great work! Thanks to the refactoring, it is now structured much better.

SteveYurongSu Apr 4, 2023
Collaborator

@dominikriemer Thanks a lot for the refactor! This will be very helpful for us ❤️

Recently, my friend @ppppoooo and I have been delving deeply into the code of StreamPipes.

We have initially implemented an IoTDB datalake sink and IoTDB data explorer, and we are currently testing the implementation (it's dirty but works) [1]. Afterwards, we will refer to DataLakeInfluxQueryBuilder.java to adjust the implementation and consider splitting it into multiple PRs to contribute to the community.

[1] SteveYurongSu#2

dominikriemer Apr 10, 2023
Collaborator Author

@SteveYurongSu absolutely awesome that you two have managed to create a working version! If you have any feedback or further ideas for improvement of the refactored API just let me know.
