
Azure Blob Store support for Firehose and Indexing Service Logs #1518

Merged 1 commit into apache:master on Jul 17, 2015

Conversation

helloitszak
Contributor

This builds on the existing Azure extension, adding support for Azure's Blob Store as a firehose and as a target for indexing service logs. The functionality is nearly identical to the S3 implementations; most of the code is based on them, simply using the Azure SDK instead.

For the firehose, I chose a URI scheme like azureblobstore://container/path/to/your/file.json, which might seem weird considering the "endpoint" is actually https://myaccount.blob.core.windows.net/container/path/to/your/file.json

I opted for a URI like this because the current Azure extension configures settings like the storage account, access key, and scheme (http/https) globally. I believe using the "official" URI would give the illusion of being able to use any storage account, which would lead to a bad user experience.

I considered a couple of other solutions, such as requiring a storage account and key for each usage (e.g. separate properties for the firehose account and the indexing service account), but I didn't like that either: the current property for setting an account implies that it's global, and I didn't want to introduce breaking changes in the Azure extension. Having a property that sounds global but only applies to deep storage would also be a bad user experience.

If anyone has any other ideas for this I'd be happy to discuss. Otherwise, merge at will.
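
For context, the global configuration being referred to is a handful of runtime properties along these lines; the values are placeholders, and the exact property names should be checked against the Azure extension docs:

# illustrative settings for the existing Azure extension; values are placeholders
druid.azure.account=myaccount
druid.azure.key=BASE64_ENCODED_KEY
druid.azure.container=mycontainer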

@@ -19,11 +19,11 @@ The indexing service uses several of the global configs in [Configuration](../co

#### Task Logging

-If you are running the indexing service in remote mode, the task logs must S3 or HDFS.
+If you are running the indexing service in remote mode, the task logs must S3, Azure Blob Store or HDFS.
Member

Can we fix the missing verb after "must" while we're at it?

@fjy
Contributor

fjy commented Jul 14, 2015

@ubercow Thanks for the contrib! I have a few minor comments. Can I also request some tests similar to what is in FileTaskLogsTest and HDFSTaskLogsTest?

@drcrallen
Contributor

I've been back and forth with @cheddar about when is the appropriate time to use URIs vs other object serialization formats. In general I think URIs are most useful when communicating with outside tools, but when used purely for internal stuff, sticking to Druid's commonly used internal object SerDe (Jackson/JSON) is much preferred when possible.

Since this is implementing a custom firehose (factory) JSON structure, can you just put the required object definitions there and not use a URI?

That is to say, unless you can show that either:

  1. a URI like this is needed for inter-tool operability
  2. A URI like this provides significant functionality improvements

... then we should be using Json structures to describe objects.

I love URIs, and think they are a handy way to communicate resource information between multiple tools, but don't think we should be inventing our own URIs.

@cheddar
Contributor

cheddar commented Jul 15, 2015

@ubercow to kinda make what Charles is suggesting a bit more concrete, did you consider having a json blob like:

{
  "type": "azurestorage",
  "container": "myContainer",
  "path": "/the/path"
}

?

Using JSON in this way should allow for pretty wide flexibility in how things are defined and the things like your credentials and stuff can all be done externally. Does that make sense?

@helloitszak
Contributor Author

@cheddar and @drcrallen I didn't even consider that honestly and I think you're right. I was kind of tunnel-visioned because the s3 version just used a URI.

I could see myself implementing it like so, with all the logic remaining in place, just swapping out the list of URIs for a list of this new JSON object:

"firehose" : {
  "type" : "static-azure-blobstore",
  "uris": [
    {
      "container": "container",
      "path": "/path/to/your/file.json"
    },
    {
      "container": "anothercontainer",
      "path": "/another/path.json"
    }
  ]
}

Thoughts?

@drcrallen
Contributor

Cool. Could it be called "blobs" (or similar) instead of "uris"? Otherwise that format works for me.

FYI, if you're going to have a list for a firehose, you'll need pretty good rules for ordering, for what to do about errors, and for what happens if there are interruptions and restarts. Imagine a JVM crash 3/4 of the way through your first container: what do you want the system to do?
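
For reference, a sketch of what the spec above would look like with the suggested field name; this is just an illustration of the proposal, not necessarily the exact shape that was merged:

"firehose" : {
  "type" : "static-azure-blobstore",
  "blobs": [
    {
      "container": "container",
      "path": "/path/to/your/file.json"
    },
    {
      "container": "anothercontainer",
      "path": "/another/path.json"
    }
  ]
}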

@helloitszak
Contributor Author

Alright, I think I've addressed all of your concerns. I'd be happy to squash if you want; I made a lot of commits just to make the changes easy to follow.

@drcrallen As for your concern about rules, I really don't know how to approach this. It should be noted that this behaves similarly to the S3 firehose, so whatever changes we make here we should probably make there for consistency. I don't see this firehose being applicable for realtime scenarios, only for the indexing service, and correct me if I'm wrong, but it's my understanding that the indexing service won't push segments until the task completes with a "Success" status. The firehose doesn't delete files either, so I believe the answer to your question is just "run the task again".

@fjy
Contributor

fjy commented Jul 16, 2015

👍

)
);
} catch (Exception e) {
log.error(e,
Contributor

Do we need to close innerInputStream in the error case?

Contributor Author

I honestly don't know, haven't dealt with Java streams enough to be able to make that call myself.

From some basic source code digging, it seems the gzipInputStream will throw IOException if the parent stream (in this case the stream from the blob store) is closed. Since close appears to be idempotent here, I'll close them both, starting with outerInputStream.

close, however, can also throw IOException, and I plan on just letting that bubble up.
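
A minimal sketch of the close-on-error handling described above; the class, method, and variable names here are illustrative, not the exact ones in this PR:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical example: wrap a raw blob-store stream in a GZIPInputStream and,
// if wrapping fails, close the wrapper first (when it exists) and then the
// underlying stream. close() is idempotent for these streams, so double-closing
// is safe, and any IOException thrown by close() is allowed to bubble up.
class BlobStreamCloseExample
{
  static InputStream openDecompressed(InputStream innerInputStream) throws IOException
  {
    InputStream outerInputStream = null;
    try {
      outerInputStream = new GZIPInputStream(innerInputStream);
      return outerInputStream;
    }
    catch (IOException e) {
      if (outerInputStream != null) {
        outerInputStream.close();
      }
      innerInputStream.close();
      throw e;
    }
  }
}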

@drcrallen drcrallen added this to the 0.8.1 milestone Jul 17, 2015
@helloitszak
Contributor Author

Alright, I think this is everything.

@drcrallen
Contributor

Cool, can you squash?

👍

Azure Blob Store support for Task Logs and a firehose for data ingestion
@helloitszak helloitszak force-pushed the more-azure-features branch from b6bfd7d to 0bda7af Compare July 17, 2015 22:40
@helloitszak
Contributor Author

Squashed! Ready for merge at your leisure.

@drcrallen
Contributor

@ubercow Thanks! Do you mind filling out our CLA (http://druid.io/community/cla.html)? I'll merge when Travis completes.

@helloitszak
Contributor Author

@drcrallen Since this work was done on my employer's time, he sent one of the corporate CLAs to the druid-admin email. Should have come from an Andre Andersen.

@drcrallen
Contributor

@ubercow Yep thanks!

drcrallen added a commit that referenced this pull request Jul 17, 2015
Azure Blob Store support for Firehose and Indexing Service Logs
@drcrallen drcrallen merged commit e051e93 into apache:master Jul 17, 2015
@helloitszak helloitszak deleted the more-azure-features branch July 17, 2015 23:48