Data Science Hunting Funnel


In the age of cybersecurity and data science, we often hear about machine learning being applied to cybersecurity to catch sophisticated hackers who evade Intrusion Detection Systems (IDS). Data science can be used to identify anomalous network activity, but it requires an additional level of processing in order to prevent analyst alert fatigue.

The Data Science Hunting Funnel was created to illustrate a workflow that helps security researchers and data scientists reduce their dataset, gives them the best likelihood of identifying malicious traffic, and sets expectations. Data science is not a cybersecurity silver bullet, but it can be very useful when coupled with the right domain expertise.

Hunt Funnel Breakdown

Values are approximations

  • [INPUT] All Network Traffic - Bits flowing into the funnel ready to be processed.
  • [PHASE 1] Produced Naturally - Generated naturally by network users and devices. Represents all of your normal network traffic.
  • [PHASE 2] Machine Learning - After applying machine learning, you can reduce your set of data to a much smaller subset by identifying anomalies. I chose ~10% to help visualize the data left to analyze after applying machine learning. A few examples include:
    • Identify periodic communication in the network in an attempt to find an infected computer talking to command and control.
    • Apply a Markov model to user agents to find those with the lowest likelihood of occurrence.
    • Identify DNS requests that have high entropy or are identified as DGA using Flare.
    • and much more…
  • [PHASE 3] Domain Knowledge - Once you have reduced your dataset using machine learning, domain expertise must be applied to categorize the results. The goal is to separate legitimate traffic from suspicious or malicious network traffic and further reduce the dataset to approximately 1-5% of total network traffic. This is where the interesting results live. By interesting, I mean results that are anomalous to the network and did NOT pass the common questions an analyst might ask of network traffic (filtering out known good). Depending on the protocol you’re analyzing, you can apply domain expertise such as:
    • Is this domain in Umbrella, Majestic or Alexa Top 1 million?
    • Is this IP a known Tor node?
    • Does this domain have any blacklist or threat intelligence association?
    • Who owns this IP space?
    • How long ago was this domain registered?
    • and much more…
  • [PHASE 4] Potential Bad - Your data is ready to hunt on. The value .001 is meant to set expectations that finding malicious traffic, especially in larger networks, is very difficult. It requires the right amount of data science and domain expertise.
    • If you look closely, bad is written at the bottom of the funnel. It’s a bit hard to see because, like reality, finding evil in networks can be very difficult. This is where your hunt teams should focus.
  • [OUTPUT] Malicious Traffic - Confirmed malicious traffic identified by threat hunters/analysts.
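
To make Phases 2 and 3 a little more concrete, here is a minimal Python sketch of how the narrowing could look. It is not the code behind the funnel or the slides. It uses plain Shannon entropy as a stand-in for a DGA classifier such as Flare, and the spread of gaps between connections as a very rough beaconing hint. The function names, the 3.5 entropy threshold, and the known-good set are all assumptions made up for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def interval_cv(timestamps):
    """Coefficient of variation of the gaps between connection times.

    Values near 0 mean evenly spaced connections, one simple hint of
    periodic (beacon-like) communication. Returns None when there are
    fewer than three events or all events share one timestamp.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return None
    return pstdev(gaps) / average_gap


def funnel(domains, known_good, entropy_threshold=3.5):
    """Phase 2, then Phase 3, on a list of queried domains.

    Phase 2 keeps domains whose first label has high entropy.
    Phase 3 drops anything found in a known-good set.
    The threshold is an illustrative assumption, not a tuned value.
    """
    anomalous = [
        d for d in domains
        if shannon_entropy(d.split(".")[0]) >= entropy_threshold
    ]
    return [d for d in anomalous if d not in known_good]


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    queried = [
        "google.com",
        "x7f3kq9zt2lmw8vb.com",
        "a1b2c3d4e5f6g7h8.cdn-provider.example",
    ]
    known_good = {"a1b2c3d4e5f6g7h8.cdn-provider.example"}
    print(funnel(queried, known_good))
    print(interval_cv([0, 60, 120, 180]))
```

In this sketch the known-good set stands in for the Phase 3 questions (top 1 million lists, ownership, registration age). In practice each of those would be its own lookup, and whatever survives is what a hunt team would review by hand.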

Below are slides from my presentation at Data Intelligence Conference Practitioner Focused Machine Learning, where I applied the Data Science Hunting Funnel to results from the beaconing and DGA use cases.

Slides from Capital One Presentation

Threat Hunting with Data Science from Austin Taylor

Have you applied data science to network traffic? What has your experience been? Share in the comments section below!