Labeled feeds #86
…, now every feed is labeled in the config file and is stored by label (previously by URL). In thresher.py, the thresher_map has been replaced with a list of parsers that the user defines in the config file as well; this will come in handy when plugins are implemented
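A minimal sketch of the label-to-parser idea, assuming Python 3 and an INI-style config. The section names `[feeds]` and `[parsers]`, the label `example_ips`, the URL, and the function names are all hypothetical and are not taken from this PR; the actual config layout and the structure of thresher_map are not shown in this thread.

```python
import configparser

# Hypothetical config: each feed has a label, a URL, and a parser name.
CONFIG_TEXT = """
[feeds]
example_ips = http://feeds.example.invalid/ips.txt

[parsers]
example_ips = parse_plain_lines
"""


def parse_plain_lines(text):
    """Return the non-empty, non-comment lines of a plain-text feed."""
    lines = (ln.strip() for ln in text.splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


# Registry of available parser functions, looked up by name from the config.
PARSER_FUNCTIONS = {"parse_plain_lines": parse_plain_lines}


def load_feed_parsers(config_text):
    """Map each feed label to a (url, parser function) pair."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    table = {}
    for label, url in config["feeds"].items():
        parser_name = config["parsers"][label]
        table[label] = (url, PARSER_FUNCTIONS[parser_name])
    return table
```

Keying everything by label rather than by URL means a feed can change its URL without invalidating stored data or the parser mapping.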
I have pulled this over into the |
Alternately @gbrindisi if you can pull the current master into yours, the conflict is pretty easy to fix and it will update this PR. |
Done! |
So, I am not ignoring this, but I was thinking that this is an opportunity to begin to address the extra fields that a more robust TI feed parsing engine would require (such as confidence, campaign, other notes). I still have some work to do on defining those for the internal, non-open-source code I am working on, so I'd like to propose a direction for us to move forward by early next week. @gbrindisi I think this is a step in the right direction, but I want to make sure we do not code ourselves into a corner :) |
Ok I understand :) Feel free to mail or message me on slack (I'm lurking daily btw ;)). |
Just a quick poke here to see where we are on this :) |
I can help out with this - in the past, I did something like this for feed parsing, and had the engine just read the line regex from the conf file. I also used CRITs-specific indicator names, but that's obviously just semantics and easily changed: [
  {
    "impact": "high",
    "source": "malwareDomainList",
    "campaign": "testCampaign",
    "confidence": "medium",
    "format": "^\\\".*\\\"\\,\\\"(.*?)\\\"\\,\\\"(\\d+\\.\\d+\\.\\d+\\.\\d+|-)\\\"\\,\\\"(.*?)\\\"\\,\\\".*?\\\"\\,\\\".*?\\\"\\,\\\"(\\d+|-)\\\"",
    "reference": "http://www.malwaredomainlist.com/updatescsv.php",
    "fields": ["URI - URL", "Address - ipv4-addr", "URI - Domain Name", "Address - asn"]
  },
  {
    "impact": "medium",
    "confidence": "medium",
    "campaign": "testCampaign",
    "format": "(.+)",
    "reference": "https://zeustracker.abuse.ch/blocklist.php?download=compromised",
    "fields": ["URI - URL"],
    "source": "ZeusTracker"
  }
]
If you guys want to go somewhere like this, I can create a branch and see if I can bastardize the code to allow for this. @gbrindisi, where did you want to put the custom feed parsers? |
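The regex-plus-fields approach described in this comment can be sketched roughly as follows, assuming Python 3. The feed entries, the simplified regexes, and the `parse_line` helper are hypothetical illustrations modeled on the config above; they are not code from Combine or from the commenter's engine.

```python
import re

# Hypothetical feed definitions, modeled on the JSON config above.
# Each capture group in "format" maps, in order, to one name in "fields".
FEEDS = [
    {
        "source": "exampleCsvFeed",
        "confidence": "medium",
        "format": r'^"(.*?)","(\d+\.\d+\.\d+\.\d+|-)"$',
        "fields": ["URI - URL", "Address - ipv4-addr"],
    },
    {
        "source": "exampleLineFeed",
        "confidence": "medium",
        "format": r"(.+)",
        "fields": ["URI - URL"],
    },
]


def parse_line(feed, line):
    """Return one indicator dict per captured field, or [] if the line does not match."""
    match = re.match(feed["format"], line.strip())
    if match is None:
        return []
    indicators = []
    for field, value in zip(feed["fields"], match.groups()):
        # Skip empty captures and the "-" placeholder some feeds use for missing values.
        if value in ("", "-"):
            continue
        indicators.append({
            "type": field,
            "value": value,
            "source": feed["source"],
            "confidence": feed["confidence"],
        })
    return indicators
```

Adding a new feed then only requires a new config entry, at the cost of minimal error checking on lines that do not match the expected layout.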
I know this is from October, and I said I would think about this, but believe it or not I have not finished thinking about it yet. 😕 I am trying to align this with some other ideas for projects I am entertaining right now (including something for presentations at BlackHat and DefCon 2015). I will have a "recommendation" for you guys to give input on before the end of the holidays. 🎅 🎄 |
That sounds a lot like 'here, you guys write code for my blackhat prez.' We should probably think about relationships between indicators, contextual information, bla bla bla. |
@paulpc I just realized that did sound wrong. That was not what I meant, and I am sorry if it sounded that way. I am aiming for a minimum set of fields and parameters that would give anyone analyzing the data the ability to pivot and aggregate the data in multiple ways. Having this kind of flexibility would help me on some of the things I want to work on and I am sure would help others as well. So, I think a bare minimum would be:
I am not sure how you would do the relationship matching without a storage back-end such as CRITs, so maybe that is what you mean. I want to make sure we are feeding something such as CRITs enough data so what you described is possible to be done there with queries on different fields. What are other fields you would like to see in this? |
@alexcpsec, why constrain ourselves to a few indicator types? We could use the openIOC (http://schemas.mandiant.com/) or the STIX (http://stixproject.github.io/documentation/idioms/) dictionaries. We don't have to go all out and output STIX or openIOC XML, but we don't need to reinvent the language either and should provide an easy conversion mechanism (maybe if we wrote a STIX output function, it would help materialize the ideas, but I realize it's obscene scope creep). As for relationship matching, it would obviously be easy to do it in a backend system a la CRITs, Avalanche / Soltra Edge, commercial threatX. But since we're ingesting a bunch of OS-INT feeds in combine, we could do a very rudimentary relationship model here based on what we see in which feeds, any common elements, and any metadata present with those feeds. I had ideas to do some post hoc relationship building in CRITs, but you would lose the point-in-time aspect. For example, malwaredomainlist has some registrar info, domain, IP, ASN. Alienware has category and some confidence information. We can maybe connect all those before inserting them in an intelligent analysis system (CRITs) to help further analysis and documentation of point-in-time relationships. This is all a moot point if all the end-user is doing is putting the IPs in a firewall blocklist. |
First, I don't think correlation / relationship matching is in scope for Combine. It grabs the data, does minimal normalization, then outputs it in forms other stuff can consume. And we've definitely always considered the STIX stack to be on the roadmap; see issue #33 for example. The use case here, as I understand it, is to grab a bit of extra metadata and make that available to users. Some of them use STIX or OpenIOC and some don't (although anyone putting it in a firewall blocklist is omg doing it wrong). For this issue, we should probably just make sure we consider the metadata we should grab and add that to the data model. The list from @alexcpsec above is a good start. |
Here is what I am thinking: Immediate action:
I would not, however, extract data that we could optionally enrich later in winnower (such as ASN). I appreciate that the enrichment code could be faster than it is now, but that is what it is for. The point-in-time aspect could be mitigated by having 2 different timestamps on the entries:
I'd try to stay away from the relationship matching or correlation just now, at least until these more immediate things can be done. Combine is not trying to be a fully fledged TI platform, and maybe these would be functionalities for a separate project ("hay silo", anyone?). Or even leave this to CRITs, MISP or even CIF. What do you people think? |
I understand the correlation and relationships might be out of the purview. The ASN example wasn't meant as a replacement for the current utility, but just another datapoint, and since Combine is not a TI platform, there is no reason to worry about my point-in-time complications. The reason why I went with regex is b/c I can do meta programming in the config file - for new feeds I don't have to change my code to allow for column names, mapping to existing columns, et cetera, but instead I can define everything in the regex and fields. Downside (and it's a pretty big one) is that I am expecting normalized input and I am doing minimum error checking. @btv's idea is awesome! To take it a bit further, since JSON is our bread and butter for transporting observables between the modules, why not do something like csv.DictReader for the csv-formatted fields and get a dictionary object out of it automatically. It would be parsed by the csv library, so it might do some of the heavy lifting. Unfortunately, I think it will drive us further into conversations about feed-format-specific parsers and complicating the retrieving and parsing algorithms. It seems we're heading towards creating/adopting a normalized observable-descriptive language and coming up with transform tables from all the feeds ingested. If so, JSON / XML / CSV output would be pretty trivial regardless of how complicated we decide to get with the metadata. |
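For illustration, the csv.DictReader suggestion could look roughly like this, assuming Python 3. The sample rows, the column names in `FIELDNAMES`, and the `parse_csv_feed` helper are hypothetical; a real feed would need its own column list in the config.

```python
import csv

# Hypothetical sample of a CSV-formatted feed, including a comment line.
SAMPLE = (
    "# example feed header\n"
    '"example.com/a","192.0.2.1","64496"\n'
    '"example.org/b","-","-"\n'
)

# Hypothetical per-feed column names; these would come from the config file.
FIELDNAMES = ["url", "ipv4", "asn"]


def parse_csv_feed(text, fieldnames):
    """Parse CSV feed text into a list of dicts keyed by the given column names."""
    # Drop blank lines and '#' comments before handing the rest to the csv module.
    data_lines = (
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    )
    reader = csv.DictReader(data_lines, fieldnames=fieldnames)
    return [dict(row) for row in reader]
```

Because each row is already a dictionary, it can be passed to `json.dumps` without any further mapping step; the csv module also handles quoting and embedded commas, which the regex approach has to spell out by hand.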
Yes, I think we should get this done right first before making the scope bigger. |
The parsers are just functions in |
Gianluca,
|
That would be super cool and relevant to #23 |
I am closing this PR unmerged, and we should focus on #110, which has a more complete implementation of the ideas we started discussing here. I really want to thank @gbrindisi for kicking this off and providing us with the cornerstone for this discussion. Please help us with #110 as well. :) |
@alexcpsec you are welcome! :) |
Hi!
I've managed to tidy up the code a bit.
I'm aware that the dot-based nomenclature I've chosen is not so pretty, but I still find the overall functionality useful. Anyhow, check it out and let me know if you would like something different.
The config I've used is the following: