Labeled feeds #86

Closed
wants to merge 3 commits into from

Conversation

gbrindisi
Contributor

Hi!
I've managed to tidy up the code a bit.

I'm aware that the dot-based nomenclature I've chosen is not so pretty, but I still find the overall functionality useful. Anyhow, check it out and let me know if you would like something different.

The config I've used is the following:

[feeds.outbound]
#malwaregroup     = http://www.malwaregroup.com/ipaddresses
#malc0de          = http://malc0de.com/bl/IP_Blacklist.txt
#zeustracker      = https://zeustracker.abuse.ch/blocklist.php?download=ipblocklist
#spyeyetracker    = https://spyeyetracker.abuse.ch/blocklist.php?download=ipblocklist
#palevotracker    = https://palevotracker.abuse.ch/blocklist.php?download=ipblocklist
alienvault       = http://reputation.alienvault.com/reputation.data
#nothink-malware-dns = http://www.nothink.org/blacklist/blacklist_malware_dns.txt
#nothink-malware-http = http://www.nothink.org/blacklist/blacklist_malware_http.txt
#nothink-malware-irc = http://www.nothink.org/blacklist/blacklist_malware_irc.txt

[feeds.inbound]
#projecthoneypot = http://www.projecthoneypot.org/list_of_ips.php?rss=1
#openbl = http://www.openbl.org/lists/base_30days.txt
#blocklist-ssh = http://www.blocklist.de/lists/ssh.txt
#blocklist-apache = http://www.blocklist.de/lists/apache.txt
#blocklist-asterisk = http://www.blocklist.de/lists/asterisk.txt
#blocklist-bots = http://www.blocklist.de/lists/bots.txt
#blocklist-courierimap = http://www.blocklist.de/lists/courierimap.txt
#blocklist-courierpop3 = http://www.blocklist.de/lists/courierpop3.txt
#blocklist-email = http://www.blocklist.de/lists/email.txt
#blocklist-ftp = http://www.blocklist.de/lists/ftp.txt
#blocklist-imap = http://www.blocklist.de/lists/imap.txt
#blocklist-ircbot = http://www.blocklist.de/lists/ircbot.txt
#blocklist-pop3 = http://www.blocklist.de/lists/pop3.txt
#blocklist-postfix = http://www.blocklist.de/lists/postfix.txt
#blocklist-proftpd = http://www.blocklist.de/lists/proftpd.txt
#blocklist-sip = http://www.blocklist.de/lists/sip.txt
#ciarmy = http://www.ciarmy.com/list/ci-badguys.txt
alienvault-inbound = http://reputation.alienvault.com/reputation.data
#drg-ssh = http://dragonresearchgroup.org/insight/sshpwauth.txt
#drg-vnc = http://dragonresearchgroup.org/insight/vncprobe.txt
#rulez = http://danger.rulez.sk/projects/bruteforceblocker/blist.php
#sans = https://isc.sans.edu/ipsascii.html
#nothink-ssh = http://www.nothink.org/blacklist/blacklist_ssh_day.txt
#packetmail = https://www.packetmail.net/iprep.txt
#autoshun = http://www.autoshun.org/files/shunlist.csv
#haleys = http://charles.the-haleys.org/ssh_dico_attack_hdeny_format.php/hostsdeny.txt
#virbl = http://virbl.org/download/virbl.dnsbl.bit.nl.txt
#botscout = http://botscout.com/last_caught_cache.htm

[feeds.parsers]
alienvault = process_alienvault
alienvault-inbound = process_alienvault
projecthoneypot = process_project_honeypot
rulez = process_rulez
sans = process_sans
packetmail = process_packetmail
autoshun = process_autoshun
haleys = process_haleys
drg-ssh = process_drg
drg-vnc = process_drg
malwaregroup = process_malwaregroup

…now every feed is labeled in the config file and is stored by label (previously by URL). In thresher.py the thresher_map has been replaced with a list of parsers defined by the user in the config file too - this will come in handy when plugins are implemented.
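
For illustration, a minimal sketch of the label-to-parser dispatch described above, assuming Python's configparser and a config file named combine.cfg; the file name and function body are invented, while the [feeds.parsers] section and the process_alienvault name come from the config above.

import configparser

def process_alienvault(response, source):
    # Placeholder body: real parsing of reputation.data would go here
    return [(line.split('#')[0], source) for line in response.splitlines() if line]

def load_parsers(config_path='combine.cfg'):
    config = configparser.ConfigParser()
    config.read(config_path)
    # [feeds.parsers] maps each feed label to a parser function name;
    # resolve those names against this module's globals
    return {label: globals()[name] for label, name in config.items('feeds.parsers')}

parsers = load_parsers()
# thresher can now dispatch by feed label instead of by URL:
# parsers['alienvault'](response_body, 'alienvault')
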
@krmaxwell
Member

I have pulled this over into the gbrindisi-labeled-feeds branch and fixed the merge conflicts. Will try to test tonight.

@krmaxwell
Member

Alternately @gbrindisi if you can pull the current master into yours, the conflict is pretty easy to fix and it will update this PR.

@gbrindisi
Contributor Author

Done!
...I'm not totally sure I did it right 🐐

@alexcpsec
Member

So, I am not ignoring this, but I was thinking that this is an opportunity to begin addressing the extra fields that would be required by a more robust TI feed-parsing engine (such as confidence, campaign, and other notes).

I have some work to do around defining that for the internal parts of some non-open-source code I am working on, so I'd like to propose a direction for us to move forward by early next week.

@gbrindisi I think this is a step in the right direction, but I want to make sure we do not code ourselves into a corner :)

@gbrindisi
Contributor Author

Ok I understand :)
Let me know what you decide, and if you and @technoskald want, I can help with the coding.

Feel free to email or message me on Slack (I'm lurking daily, btw ;)).

@krmaxwell
Member

Just a quick poke here to see where we are on this :)

@paulpc
Contributor

paulpc commented Dec 23, 2014

I can help out with this - in the past, I did something like this for feed parsing and had the engine just read the line regex from the conf file. I also used CRITs-specific indicator names, but that's obviously just semantics and easily changed:

[
{
  "impact": "high", 
  "source": "malwareDomainList",
  "campaign":"testCampaign", 
  "confidence": "medium", 
  "format": "^\\\".*\\\"\\,\\\"(.*?)\\\"\\,\\\"(\\d+\\.\\d+\\.\\d+\\.\\d+|-)\\\"\\,\\\"(.*?)\\\"\\,\\\".*?\\\"\\,\\\".*?\\\"\\,\\\"(\\d+|-)\\\"", 
  "reference": "http://www.malwaredomainlist.com/updatescsv.php", 
  "fields": ["URI - URL", "Address - ipv4-addr", "URI - Domain Name","Address - asn"] 
},
{
  "impact": "medium",
  "confidence": "medium",
  "campaign":"testCampaign",
  "format": "(.+)",
  "reference": "https://zeustracker.abuse.ch/blocklist.php?download=compromised", 
  "fields": ["URI - URL"], 
  "source": "ZeusTracker"
}
]

If you guys want to go somewhere like this, I can create a branch and see if I can bastardize the code to allow for this. @gbrindisi, where did you want to put the custom feed parsers?
The conf can stay in the standard conf format; it doesn't have to be changed to JSON (I just did JSON at the time).
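
For illustration, a sketch of how the regex-plus-fields definitions above could drive a generic parser: capture groups are zipped with the "fields" names, and the remaining keys ride along as metadata. The code and sample input are assumptions, not from the PR.

import re

def parse_feed(feed_def, raw_text):
    pattern = re.compile(feed_def['format'])
    # Everything except the regex and the field names is carried as metadata
    metadata = {k: v for k, v in feed_def.items() if k not in ('format', 'fields')}
    observables = []
    for line in raw_text.splitlines():
        match = pattern.search(line)
        if not match:
            continue  # minimal error checking, as noted above
        record = dict(zip(feed_def['fields'], match.groups()))
        record.update(metadata)
        observables.append(record)
    return observables

zeus_def = {
    "impact": "medium", "confidence": "medium", "campaign": "testCampaign",
    "format": "(.+)", "fields": ["URI - URL"], "source": "ZeusTracker",
    "reference": "https://zeustracker.abuse.ch/blocklist.php?download=compromised",
}
rows = parse_feed(zeus_def, "badsite.example/panel.php\n")
# -> [{'URI - URL': 'badsite.example/panel.php', 'impact': 'medium', ...}]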

@alexcpsec
Member

I know this is from October, and I said I would think about this, but believe it or not I have not finished thinking about it yet. 😕

I am trying to align this with some other ideas for projects I am entertaining right now (including something for presentations at Black Hat and DEF CON 2015). I will have a "recommendation" for you guys to give input on before the end of the holidays. 🎅 🎄

@paulpc
Contributor

paulpc commented Dec 23, 2014

That sounds a lot like 'here, you guys write code for my blackhat prez.'

We should probably think about relationships between indicators, contextual information, bla bla bla.
Maybe looking at them from a STIX/CybOX standpoint would help, as long as we can generate the relationships between indicators (e.g. not just everything is connected to the ZeusTracker feed, but x.x.x.x was seen by ZeusTracker in conjunction with www.pornmalware.com, which was seen by AlienVault along with y.y.y.y, and the C2 communication was using ZZZ user agent)

@alexcpsec
Member

@paulpc I just realized that did sound wrong. That was not what I meant, and I am sorry if it sounded that way. I am aiming for a minimum set of fields and parameters that would give anyone analyzing the data the ability to pivot and aggregate it in multiple ways. Having this kind of flexibility would help me with some of the things I want to work on, and I am sure it would help others as well.

So, I think a bare minimum would be:

  • source: feed or reference where the indicator came from
  • type: IPv4, FQDN, user-agent, MD5 hash, etc.
  • category: name of the malware / exploit kit / dropper family, if available (ideally respecting that different versions have different names)
  • campaign: public campaign or private org incident related to the indicator, if available
  • impact & confidence: could be read from the feeds or configured as per-feed defaults (as in: I think feed XXX is a "medium" confidence, regardless of what they say)

I am not sure how you would do the relationship matching without a storage back-end such as CRITs, so maybe that is what you mean. I want to make sure we are feeding something like CRITs enough data that what you described can be done there with queries on different fields.

What are other fields you would like to see in this?
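
A sketch of that bare-minimum field set as a Python structure; the field names follow the list above, while the class name and defaults are assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Observable:
    value: str                      # the indicator itself, e.g. "198.51.100.7"
    source: str                     # feed or reference it came from
    type: str                       # IPv4, FQDN, user-agent, MD5 hash, ...
    category: Optional[str] = None  # malware / exploit kit / dropper family
    campaign: Optional[str] = None  # public campaign or private org incident
    impact: str = 'unknown'         # from the feed, or a per-feed default
    confidence: str = 'unknown'     # likewise, e.g. "medium" for feed XXX

example = Observable(value='198.51.100.7', source='alienvault',
                     type='IPv4', confidence='medium')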

@paulpc
Contributor

paulpc commented Dec 23, 2014

@alexcpsec, why constrain ourselves to a few indicator types? We could use the OpenIOC (http://schemas.mandiant.com/) or the STIX (http://stixproject.github.io/documentation/idioms/) dictionaries. We don't have to go all out and output STIX or OpenIOC XML, but we don't need to reinvent the language either, and we should provide an easy conversion mechanism (maybe if we wrote a STIX output function it would help materialize the ideas, but I realize that's obscene scope creep).

As for relationship matching, it would obviously be easy to do in a backend system a la CRITs, Avalanche / Soltra Edge, or commercial threatX. But since we're ingesting a bunch of OSINT feeds in Combine, we could do a very rudimentary relationship model here based on what we see in which feeds, any common elements, and any metadata present with those feeds. I had ideas to do some post hoc relationship building in CRITs, but you would lose the point-in-time aspect.

For example, malwaredomainlist has some registrar info, domain, IP, ASN. AlienVault has category and some confidence information. We can maybe connect all of those before inserting them into an intelligent analysis system (CRITs) to help further analysis and documentation of point-in-time relationships.

This is all a moot point if all the end-user is doing is putting the IPs in a firewall blocklist.

@krmaxwell
Member

First, I don't think correlation / relationship matching is in scope for Combine. It grabs the data, does minimal normalization, then outputs it in forms other stuff can consume. And we've definitely always considered the STIX stack to be on the roadmap; see issue #33 for example.

The use case here, as I understand it, is to grab a bit of extra metadata and make that available to users. Some of them use STIX or OpenIOC and some don't (although anyone putting it in a firewall blocklist is omg doing it wrong). For this issue, we should probably just make sure we consider the metadata we should grab and add that to the data model. The list from @alexcpsec above is a good start.

@alexcpsec
Member

Here is what I am thinking:

Immediate action:

  • Define a minimal schema that makes use of varied metadata from the feeds - this is a no-brainer; we already discussed it in #84 (use more of the feed) and it is hinted at in #23 (Plugins) way back in the first version
  • Extract the additional metadata from the feeds - again referencing #84. @paulpc's suggestion of the regex and capture points is good, but I'd rather have something more involved (I mean actual Python functions) if that makes the extraction easier. I think a proof of concept with a few feeds for that idea would be awesome. Also, check out the work @btv is suggesting in #98 (updated packetmail to use csv) to clean up the parsing.

I would not, however, extract data that we could optionally enrich later in winnower (such as ASN). I appreciate that the enrichment code could be faster than it is now, but that is what winnower is for.

The point-in-time aspect could be mitigated by having two different timestamps on the entries:

  • A timestamp extracted from the feed (if the feed says when the indicator entered)
  • A timestamp from when the feed was scraped.
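
A tiny sketch of an entry carrying both timestamps; the field names are invented for illustration.

from datetime import datetime, timezone

entry = {
    'value': '198.51.100.7',
    'source': 'alienvault',
    'feed_timestamp': '2014-12-20T08:15:00Z',               # when the feed says it entered
    'retrieved_at': datetime.now(timezone.utc).isoformat(), # when we scraped the feed
}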

I'd try to stay away from the relationship matching or correlation for now, at least until these more immediate things are done. Combine is not trying to be a fully fledged TI platform, and maybe those would be functionalities for a separate project ("hay silo", anyone?). Or we could leave that to CRITs, MISP, or even CIF.

What do you people think?

@paulpc
Contributor

paulpc commented Dec 23, 2014

I understand the correlation and relationships might be outside the purview. The ASN example wasn't meant as a replacement for the current utility, just as another data point, and since Combine is not a TI platform, there's no reason to worry about my point-in-time complications.

The reason I went with regex is because I can do metaprogramming in the config file - for new feeds I don't have to change my code to allow for column names, mapping to existing columns, et cetera; instead I can define everything in the regex and fields. The downside (and it's a pretty big one) is that I am expecting normalized input and doing minimal error checking.

@btv's idea is awesome! To take it a bit further, since JSON is our bread and butter for transporting observables between the modules, why not do something like csv.DictReader for the CSV-formatted feeds and get a dictionary object out of it automatically? It would be parsed by the csv library, so it might do some of the heavy lifting. Unfortunately, I think it will drive us further into conversations about feed-format-specific parsers and complicate the retrieval and parsing algorithms.
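
A minimal sketch of the csv.DictReader idea: a CSV feed with a header row comes back as one dict per observable, ready to serialize as JSON between modules. The feed content here is invented.

import csv
import io
import json

raw_feed = "ip,category,confidence\n198.51.100.7,scanner,medium\n203.0.113.9,c2,high\n"

reader = csv.DictReader(io.StringIO(raw_feed))
observables = [dict(row) for row in reader]
print(json.dumps(observables, indent=2))
# The csv module handles quoting and delimiters (the heavy lifting),
# but each feed still needs its own column-to-schema mapping.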

It seems we're heading towards creating/adopting a normalized observable-descriptive language and coming up with transform tables from all the feeds ingested. If so, JSON / XML / CSV output would be pretty trivial regardless of how complicated we decide to get with the metadata.

@alexcpsec
Member

It seems we're heading towards creating/adopting a normalized observable-descriptive language and coming up with transform tables from all the feeds ingested. If so, JSON / XML / CSV output would be pretty trivial regardless of how complicated we decide to get with the metadata.

Yes, I think we should get this done right first before making the scope bigger.

@gbrindisi
Contributor Author

@paulpc

 @gbrindisi, where did you want to put the custom feed parsers?

The parsers are just functions in thresher.py, so they could be moved out into a separate module to allow easy customization.
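
A sketch of that separation: parser functions live in a user-editable module and thresher resolves them by the names from [feeds.parsers]. The module name parsers.py is an assumption, not the actual Combine layout.

import importlib

def resolve_parser(func_name, module_name='parsers'):
    # Look up a parser function by name in a user-editable module
    module = importlib.import_module(module_name)
    return getattr(module, func_name)

# With the [feeds.parsers] config from earlier in the thread:
# process = resolve_parser('process_alienvault')
# observables = process(response_body, 'alienvault')
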

@paulpc
Contributor

paulpc commented Jan 8, 2015

Gianluca,
I poked around and got something similar on my version.
I'll put in a PR.

@krmaxwell
Member

That would be super cool and relevant to #23

@alexcpsec
Member

I am closing this PR unmerged; we should focus on #110, which has a more complete implementation of the ideas we started discussing here.

I really want to thank @gbrindisi for kicking this off here and providing us with the cornerstone for this discussion. Please help us with #110 as well. :)

@gbrindisi
Contributor Author

@alexcpsec you are welcome! :)
