Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Be able to merge streams with different groups #144

Closed
melck opened this issue Jan 14, 2016 · 5 comments
Closed

[Feature Request] Be able to merge streams with different groups #144

melck opened this issue Jan 14, 2016 · 5 comments
Milestone

Comments

@melck
Copy link

melck commented Jan 14, 2016

I want to merge streams with differents groups.

This issue is related to this post on google groups for @nathanielc

@nathanielc
Copy link
Contributor

@melck Thanks for creating the issue.

Simplified use case from the mailing list. This TICKscript wants to be able to calculate the percentage of user cpu each process is consuming on a given host.

// Get the user cpu consumed for the host
var user = stream.from().measurement('cpu_time_user')
    .where(lambda: "cpu" == 'cpu-total')
    .groupBy('host')
      ....

// Get the per process user cpu consumed.
var puser = stream
    .from()
    .measurement('procstat_cpu_user')
    .groupBy('host', 'name')
     ....

// Here the puser stream is grouped by 'host', and process 'name'
// The user stream is only grouped by 'host'
// The intent is to apply the same user value for each process name per host.
// This way we calculate the percentage of user cpu each process consumes by host.
// Right now this doesn't work and no points are joined.
puser.join(user)
      .as('puser', 'user')
    .eval(lambda: 100.0 * ("puser.value" / "user.value"))
      .as('value')
    .influxDBOut()
     //store the per process percentage of user cpu consumed by host.

My proposed solution is to add an optional on property to the join node. Theon will take a list of dimensions to join on and if on of the parent streams has more dimensions than specified then an outer join is performed for each unique instance of the more specific stream.

puser.join(user)
      .as('puser', 'user')
       // Instruct the join to only join on host.
       // But since puser has a dimensions host and name,
      // the same user value will be used for each unique process name by host.
      // The resultant data points will continue to be grouped by both host and name.
      .on('host')
    .eval(lambda: 100.0 * ("puser.value" / "user.value"))
      .as('value')
    .influxDBOut()
    // Store the per process percentage of user cpu consumed by host.
    // The data points stored will have the tags 'host' and 'name'

This should continue to work for more dimensions. Taking a different example say we were trying to get percentage of network traffic per interface, protocol and port by host.

var packets = stream....
       .groupBy('host', 'interface', 'protocol', 'port')

var interface_packets = stream...
       .groupBy('host','interface')
       .mapReduce(influxql.sum('value')).as('value')

var total_packets = stream...
      .groupBy('host')
      .mapReduce(influxql.sum('value')).as('value')

// Get per interface protocol port by host percentages.
packets.join(total_packets)
       .as('packets', 'total')
       .on('host')
    .eval(lambda: 100.0 * "packets.value" / "total.value").as('value')
    .influxDBOut()
      .measurement('host_network_percentages')
    // store package used for each interface protocol and port as a percentage of the total host packets.
    // The data points stored will have the tags 'host', 'interface', 'protocol' and 'port'

// Get per protocol port by host, interface percentages.
packets.join(interface_packets)
       .as('packets', 'total')
       .on('host', 'interface')
    .eval(lambda: 100.0 * "packets.value" / "total.value").as('value')
    .influxDBOut()
      .measurement('interface_network_percentages')
    // store package used for each  protocol and port as a percentage of the total per host and interface.
    // The data points stored will have the tags 'host', 'interface', 'protocol' and 'port'

While both joins will have the same tags set the value will have a different meaning. As a result they will need to be stored in different measurements.

Thoughts, Is the on property intuitive on how it would work and what it would do?

@melck
Copy link
Author

melck commented Jan 16, 2016

I like the way to do it, it's clear and simple but after implementation, it would be nice to have those examples in documentation.

In meantime, i will see in what other cases i need to use this feature and check if it theoricly works

@nathanielc When do you think it would be available with your current solution ?

@nathanielc
Copy link
Contributor

@melck Could you give the new join.on feature a spin and let us know how it works? I implemented as discussed above there shouldn't surprises. It was merged just now so you can build from source or wait for the nightly build tonight at midnight PST.

@melck
Copy link
Author

melck commented Feb 27, 2016

@nathanielc I will try asap and do a feedback. Thanks

@nathanielc
Copy link
Contributor

Great, FYI, I discovered a bug where Kapacitor will hang when using theses
new features. I am working on the fix, so if you run into that just kill
the daemon and start over.
On Feb 27, 2016 6:53 AM, "Melck" [email protected] wrote:

@nathanielc https://github.com/nathanielc I will try asap and do a
feedback. Thanks


Reply to this email directly or view it on GitHub
#144 (comment)
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants