
Azure Blob Store support for Firehose and Indexing Service Logs #1518

Merged 1 commit into apache:master on Jul 17, 2015

Conversation

helloitszak
Contributor

This builds on the existing Azure extension, adding support for Azure's Blob Store as a firehose and as a target for indexing service logs. The functionality is nearly identical to the S3 implementations; most of the code is based on them, simply using the Azure SDK instead.

For the firehose, I chose a URI scheme like azureblobstore://container/path/to/your/file.json, which might seem weird considering the "endpoint" is actually https://myaccount.blob.core.windows.net/container/path/to/your/file.json

I opted for a URI like this because the current Azure extension configures settings like the storage account, access key, and scheme (http/https) globally. I believe using the "official" URI would give the illusion of being able to use any storage account, which would lead to a bad user experience.

I considered a couple of other solutions, such as requiring a storage account and key for each usage (e.g. separate properties for the firehose account and the indexing service account), but I didn't like that either: the current property for setting an account implies that it's global, and I didn't want to introduce breaking changes in the Azure extension. Having a property that sounds global but only applies to deep storage would also be a bad user experience.

If anyone has any other ideas for this I'd be happy to discuss. Otherwise, merge at will.
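
For context, the global configuration being referred to is a handful of runtime properties along these lines; the values are placeholders, and the exact property names should be checked against the Azure extension docs:

# illustrative settings for the existing Azure extension; values are placeholders
druid.azure.account=myaccount
druid.azure.key=BASE64_ENCODED_KEY
druid.azure.container=mycontainer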

@@ -19,11 +19,11 @@ The indexing service uses several of the global configs in [Configuration](../co

#### Task Logging

-If you are running the indexing service in remote mode, the task logs must S3 or HDFS.
+If you are running the indexing service in remote mode, the task logs must S3, Azure Blob Store or HDFS.
Member

Can we fix the missing verb after "must" while we're at it?

@fjy
Contributor

fjy commented Jul 14, 2015

@ubercow Thanks for the contrib! I have a few minor comments. Can I also request some tests similar to what is in FileTaskLogsTest and HDFSTaskLogsTest?

@drcrallen
Contributor

I've been back and forth with @cheddar about when is the appropriate time to use URIs vs other object serialization formats. In general I think URIs are most useful when communicating with outside tools, but when used purely for internal stuff, sticking to Druid's commonly used internal object SerDe (Jackson/JSON) is much preferred when possible.

Since this is implementing a custom firehose (factory) JSON structure, can you just put the required object definitions there and not use a URI?

That is to say, unless you can show that either:

  1. a URI like this is needed for inter-tool operability
  2. A URI like this provides significant functionality improvements

... then we should be using Json structures to describe objects.

I love URIs, and think they are a handy way to communicate resource information between multiple tools, but don't think we should be inventing our own URIs.

@cheddar
Contributor

cheddar commented Jul 15, 2015

@ubercow to kinda make what Charles is suggesting a bit more concrete, did you consider having a json blob like:

{
  "type": "azurestorage",
  "container": "myContainer",
  "path": "/the/path"
}

?

Using JSON in this way should allow for pretty wide flexibility in how things are defined and the things like your credentials and stuff can all be done externally. Does that make sense?

@helloitszak
Contributor Author

@cheddar and @drcrallen I didn't even consider that honestly and I think you're right. I was kind of tunnel-visioned because the s3 version just used a URI.

I could see myself implementing it like so, with all the logic remaining in place, just swapping out the list of URIs for a list of this new JSON object:

"firehose" : {
  "type" : "static-azure-blobstore",
  "uris": [
    {
      "container": "container",
      "path": "/path/to/your/file.json"
    },
    {
      "container": "anothercontainer",
      "path": "/another/path.json"
    }
  ]
}

Thoughts?

@drcrallen
Contributor

Cool. Could it be called "blobs" (or similar) instead of "uris"? Otherwise that format works for me.

FYI, if you're going to have a list for a firehose, you'll need pretty good rules for ordering, for what to do about errors, and for what happens if there are interruptions and restarts. Imagine a JVM crash 3/4 of the way through your first container: what do you want the system to do?
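
For reference, a sketch of what the spec above would look like with the suggested field name; this is just an illustration of the proposal, not necessarily the exact shape that was merged:

"firehose" : {
  "type" : "static-azure-blobstore",
  "blobs": [
    {
      "container": "container",
      "path": "/path/to/your/file.json"
    },
    {
      "container": "anothercontainer",
      "path": "/another/path.json"
    }
  ]
}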

@helloitszak
Contributor Author

Alright, I think I've addressed all of your concerns. I'd be happy to squash if you want; I made a lot of commits just to make the changes easy to follow.

@drcrallen As for your concern about rules, I really don't know how to approach this. It should be noted that this behaves similarly to the S3 firehose, so whatever changes we make here we should probably make there for consistency. I don't see this firehose being applicable for realtime scenarios, only for the indexing service, and correct me if I'm wrong, but it's my understanding that the indexing service won't push segments until the task completes with a "Success" status. The firehose doesn't delete files either, so I believe the answer to your question is just "run the task again".

@fjy
Contributor

fjy commented Jul 16, 2015

👍

)
);
} catch (Exception e) {
log.error(e,
Contributor

Do we need to close innerInputStream in the error case?

Contributor Author

I honestly don't know, haven't dealt with Java streams enough to be able to make that call myself.

From some basic source code digging, it seems the gzipInputStream will throw IOException if the parent stream (in this case the stream from the blob store) is closed. Since close appears to be idempotent here, I'll close them both, starting with outerInputStream.

close, however, can also throw IOException, and I plan on just letting that bubble up.
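
A minimal sketch of the close-on-error handling described above; the class, method, and variable names here are illustrative, not the exact ones in this PR:

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical example: wrap a raw blob-store stream in a GZIPInputStream and,
// if wrapping fails, close the wrapper first (when it exists) and then the
// underlying stream. close() is idempotent for these streams, so double-closing
// is safe, and any IOException thrown by close() is allowed to bubble up.
class BlobStreamCloseExample
{
  static InputStream openDecompressed(InputStream innerInputStream) throws IOException
  {
    InputStream outerInputStream = null;
    try {
      outerInputStream = new GZIPInputStream(innerInputStream);
      return outerInputStream;
    }
    catch (IOException e) {
      if (outerInputStream != null) {
        outerInputStream.close();
      }
      innerInputStream.close();
      throw e;
    }
  }
}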

@drcrallen drcrallen added this to the 0.8.1 milestone Jul 17, 2015
@helloitszak
Contributor Author

Alright, I think this is everything.

@drcrallen
Contributor

Cool, can you squash?

👍

Azure Blob Store support for Task Logs and a firehose for data ingestion
@helloitszak helloitszak force-pushed the more-azure-features branch from b6bfd7d to 0bda7af Compare July 17, 2015 22:40
@helloitszak
Contributor Author

Squashed! Ready for merge at your leisure.

@drcrallen
Contributor

@ubercow Thanks! Do you mind filling out our CLA (http://druid.io/community/cla.html)? I'll merge when Travis completes.

@helloitszak
Contributor Author

@drcrallen Since this work was done on my employer's time, he sent one of the corporate CLAs to the druid-admin email. Should have come from an Andre Andersen.

@drcrallen
Contributor

@ubercow Yep thanks!

drcrallen added a commit that referenced this pull request Jul 17, 2015
Azure Blob Store support for Firehose and Indexing Service Logs
@drcrallen drcrallen merged commit e051e93 into apache:master Jul 17, 2015
@helloitszak helloitszak deleted the more-azure-features branch July 17, 2015 23:48