These days you can’t pick up a trade magazine or get a cyber news email without hearing about machine learning and AI. Beyond the current buzz, there are many legitimate reasons to pay attention to this technology. I first heard about Cylance, a leader in the application of machine learning to cyber issues, three years ago in connection with its help in getting the OPM breach under control. Given the success of that effort, I was quite interested to learn more. So I reached out to Cylance, and they were quite amenable to sharing a deeper dive on machine learning and how they apply it. I was able to interview Mr. Strong, who leads their advances in this area. Read the interview below to find out how Cylance applies machine learning to detect and classify malware more accurately and to improve your endpoint vulnerability management program.
Spotlight on Homer Strong
Read his bio below.
April 16, 2018
Chris Daly, ActiveCyber: How is machine learning changing the landscape of cybersecurity tools and approaches? Where in the ecosystem of cybersecurity tools will machine learning have its greatest impact?
Mr. Homer Strong, Cylance: Machine learning is most effective when there are large, high-quality sets of data available. Malware classification is a great application for machine learning because it is relatively easy to collect a large number of executables. There are still many challenges to clean and curate the datasets, extract features, train models, and so forth, but all of these steps hinge on having data.
Any tool that depends on some sort of signature or heuristic can potentially be improved by machine learning. It may not be immediately viable to swap out signatures for ML models, but ultimately machine learning models have overwhelming potential. Even with ML-driven systems, signatures and heuristics can still have a place. For example, they can be used to maintain testing sets of known malware against which future ML models are tested.
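To make the last point concrete, here is a minimal sketch of how a signature list might be reused as a regression test for a new model. It is an illustration under stated assumptions, not a description of Cylance's tooling: the sample bytes, the names `known_bad`, `build_regression_set`, and `missed_by`, and the toy `candidate_model` are all hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a sample as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Invented sample contents, for illustration only.
samples = {
    "dropper": b"MZ\x90fake-dropper",
    "readme": b"hello world",
    "worm": b"MZ\x90fake-worm",
}

# Hypothetical signature database: digests of samples already confirmed malicious.
known_bad = {sha256_hex(samples["dropper"]), sha256_hex(samples["worm"])}

def build_regression_set(samples, known_bad):
    """Keep only the samples whose digest matches a known-malware signature."""
    return {name: data for name, data in samples.items()
            if sha256_hex(data) in known_bad}

def missed_by(model, regression_set):
    """List the known-malicious samples that the model fails to flag."""
    return sorted(name for name, data in regression_set.items()
                  if not model(data))

def candidate_model(data: bytes) -> bool:
    """Toy stand-in for an ML classifier; returns True when it flags a file."""
    return data.startswith(b"MZ") and b"dropper" in data

regression_set = build_regression_set(samples, known_bad)
print("missed known malware:", missed_by(candidate_model, regression_set))
```

Any sample that appears in the printed list is known malware that the candidate model would let through, which is the kind of regression such a test set is meant to catch before release.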
Cybersecurity is a quickly moving field and sometimes it seems that machine learning is already old. The first commercially successful applications of machine learning in cybersecurity, such as Cylance, have been focused on optimizing or augmenting existing tools and workflows. In the future I expect to see machine learning enable new types of tools and workflows that were previously not possible.
ActiveCyber: What types of cybersecurity use cases or features does Cylance support in the application of machine learning?
Strong: The flagship model that Cylance uses is a deep neural network for the supervised classification task of detecting malware before it executes. It has seen multiple released iterations over the years, and has scored files on over ten million endpoints. It sets the standard for a growing suite of static file classifiers which continue to roll out.
Cylance is beginning to deploy behavioral classifiers that detect malicious processes to complement the more mature models based on static analysis. These models represent major steps towards a more general Cylance AI platform.
In addition to the machine learning models which power Cylance’s product offerings, Cylance uses machine learning to accelerate many of our internal processes. For example, we use clustering techniques to help internal analysts sort through their threat review queues.
ActiveCyber: Please discuss the two major types of techniques used by machine learning, how they are different, and how Cylance employs these techniques for endpoint security and malware classification?
Strong: Machine learning traditionally comes in two flavors, supervised and unsupervised. The difference is whether the algorithm is provided with a so-called ‘label’ that it should learn how to predict. An example of a supervised model is a neural network which is provided with pictures of dogs along with what breed each dog is. Given a new dog picture, the model can usually correctly identify the breed of dog. An unsupervised model could also be shown pictures of dogs but would not require any breed labels. The unsupervised model could determine which dogs are most similar to which other dogs, and possibly it could group together similar dogs roughly into breeds – however, it would not know anything about the concept of breeds; it would just notice that some dogs share striking similarities like fur color or size.
Supervised learning is easier to optimize for because you have a very natural way to evaluate the performance of a model: the accuracy with which it can predict the correct label (how often the dog model guesses the right breed). Unsupervised learning tends to be more difficult because there isn’t always a good way to evaluate how well a model is doing. However, unsupervised learning is operationally somewhat easier because it does not require labels. Thus unsupervised models have fewer prerequisites than supervised models.
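The contrast can be sketched in a few lines of Python. This is a toy illustration, not Cylance's models: the two-dimensional points and the "small"/"large" labels are invented, a nearest-centroid classifier stands in for the supervised model, and a basic k-means with a deterministic farthest-point initialization stands in for the unsupervised one.

```python
# Invented 2-D feature vectors with labels, for illustration only.
LABELED = [
    ((1.0, 1.2), "small"), ((0.8, 1.0), "small"), ((1.3, 0.9), "small"),
    ((5.0, 5.2), "large"), ((5.4, 4.8), "large"), ((4.7, 5.1), "large"),
]

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Supervised: the labels are used to fit one centroid per class.
def fit_centroids(labeled):
    by_label = {}
    for point, label in labeled:
        by_label.setdefault(label, []).append(point)
    return {label: mean(pts) for label, pts in by_label.items()}

def predict(centroids, point):
    return min(centroids, key=lambda label: dist2(centroids[label], point))

def accuracy(centroids, labeled):
    """Fraction of points whose predicted label matches the true label."""
    hits = sum(predict(centroids, p) == y for p, y in labeled)
    return hits / len(labeled)

# Unsupervised: k-means sees only the points, never the labels.
def kmeans(points, k, iterations=10):
    # Deterministic farthest-point initialization of the k centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(centers[i], p))
            groups[nearest].append(p)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups

centroids = fit_centroids(LABELED)
held_out = [((0.9, 1.1), "small"), ((5.1, 5.0), "large")]
print("supervised accuracy:", accuracy(centroids, held_out))

groups = kmeans([p for p, _ in LABELED], k=2)
print("unsupervised group sizes:", sorted(len(g) for g in groups))
```

The supervised half comes with a natural score (accuracy on held-out labeled points), while the unsupervised half only returns groups; judging whether those groups are good requires some other criterion, which is the evaluation difficulty described above.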
Cylance primarily employs supervised learning to power endpoint security products. CylancePROTECT and CylanceOPTICS incorporate machine learning classifiers to stop attacks. Unsupervised learning is also used by internal systems to help sort through and manage our enormous amounts of data. We are always exploring new applications and methodologies for both supervised and unsupervised learning.
ActiveCyber: Your journey as a data scientist has taken you from medical fraud systems to cybersecurity systems and quite a bit in between. How do the goals and approaches for applying machine learning differ in these different problem domains? What particular challenges does cybersecurity present for classification and prediction using machine learning? How does the cybersecurity industry compare to other industries in its maturity for the application of machine learning and the accuracy of prediction that is provided?
Strong: Cybersecurity is still in early stages of leveraging machine learning and other data science techniques. There are three major distinctions that characterize machine learning in cybersecurity.
• Cybersecurity is itself a rapidly changing field. Both defensively and offensively, there continue to be many changes to the economic markets, the technical architectures, and the workforce. A consequence is that there are many opportunities to research or imagine applications of machine learning, but it can be difficult to reach maturity or evaluate the real-world effectiveness. Another issue is that senior experts can be hard to find; the growing demand for skilled human resources is much greater than the supply.
• We operate in an adversarial environment. Fraud detection also does so, but most other common areas of machine learning application do not. Adversarial machine learning is currently one of the hottest areas of academic research in machine learning, and it may well help to expand similar concerns to a wider audience of machine learning practitioners.
• The stakes tend to be high in cybersecurity. If a machine learning model is not sufficiently accurate then it can bring down a business’s critical infrastructure, either by missing an attack or by mistakenly interfering with normal operations. A single incorrect prediction can have major impacts. The bulk of machine learning applications tend to be low-risk, e.g. Amazon wants to have a recommendation algorithm that will maximize the number of product purchases on average, but they need not be concerned with an individual miss.
ActiveCyber: Are the attackers ahead of the defenders in the use of machine learning? Open source repositories for data mining algorithms are beginning to sprout up on the Internet such as Weka (https://www.cs.waikato.ac.nz/ml/weka/). Do you believe the presence of these repositories will accelerate the development of defensive capabilities or will they provide a greater benefit to attackers?
Strong: I believe that defenders are currently leveraging ML more than attackers. Attackers can have a great deal of success without any ML. In contrast, ML enables defensive capabilities that would be otherwise unreachable. It is possible that adoption of offensive ML will be driven by the increasing prevalence of defensive ML.
The availability of open source projects for data science does dramatically reduce the barrier to using techniques from machine learning and related fields such as data mining. However, those tools still require a great deal of expertise in order to use them successfully.
Generating targeted phishing messages is probably the most immediately concerning offensive ML application. ML-powered phishing optimizes a favorite vector of modern attackers, both increasing infection rates and reducing attacker costs. Furthermore, the phishing use case could follow commonly used ML patterns which have been developed for online advertising and marketing applications.
ActiveCyber: What type of process is involved with your threat hunters to ensure proper validation of the model?
Strong: The Cylance data science team collaborates extensively with Threat Research and other teams of SMEs. Human expert validation is absolutely critical for evaluating model quality. There are three primary mechanisms that are regularly leveraged.
1. Feature review: machine learning methods will automatically find sets of features which they consider useful for prediction. Part of our production modelling process is a review of the selected features with domain experts who can spot potential pitfalls or suggest alternative strategies for feature extraction.
2. Monitoring of deployed models: threat research monitors for false positives in model detections from released models, as well as proactively hunting for threats that the model could miss. As mis-classified samples are identified, they are remediated and tagged for future tracking. Clustering techniques are used to identify groups of similar misses.
3. Validation of release candidates: when we are considering the release of a new model, the new model and the current model will sometimes disagree. For disagreements with known files, we can just check the known label. However, there are always cases where the two models can disagree on files whose true nature is not known. In such cases, threat researchers are needed to verify that the new model is more likely to be correct than the current model. Again, clustering methods are used to group disagreements into similar groups.
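The release-candidate step above can be sketched as a simple triage routine. This is a hedged illustration only: the file IDs, verdicts, and feature sets are invented, the names `triage` and `group_similar` are hypothetical, and a greedy Jaccard-similarity grouping is used as a simple stand-in for whatever clustering method is used in practice.

```python
# Hypothetical verdicts from two model versions, keyed by an invented file ID.
current_model = {"a1": "malware", "b2": "benign", "c3": "benign",
                 "d4": "malware", "e5": "benign", "f6": "benign"}
candidate_model = {"a1": "malware", "b2": "malware", "c3": "malware",
                   "d4": "benign", "e5": "malware", "f6": "malware"}

# Ground-truth labels exist for only some of the files.
known_labels = {"a1": "malware", "b2": "malware", "d4": "malware"}

# Invented feature sets for the files, used to group similar disagreements.
features = {"c3": {"packed", "net", "registry"},
            "e5": {"packed", "net"},
            "f6": {"macro", "doc"}}

def triage(current, candidate, known):
    """Split disagreements into label-resolved buckets and a review queue."""
    candidate_wins, current_wins, needs_review = [], [], []
    for file_id in sorted(current.keys() & candidate.keys()):
        if current[file_id] == candidate[file_id]:
            continue  # the two models agree, nothing to check
        truth = known.get(file_id)
        if truth is None:
            needs_review.append(file_id)    # true nature unknown: human review
        elif candidate[file_id] == truth:
            candidate_wins.append(file_id)  # new model is right
        else:
            current_wins.append(file_id)    # new model regressed
    return candidate_wins, current_wins, needs_review

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def group_similar(items, features, threshold=0.5):
    """Greedy grouping: join the first group whose first member is similar."""
    groups = []
    for item in items:
        for group in groups:
            if jaccard(features[item], features[group[0]]) >= threshold:
                group.append(item)
                break
        else:
            groups.append([item])
    return groups

candidate_wins, current_wins, needs_review = triage(
    current_model, candidate_model, known_labels)
review_groups = group_similar(needs_review, features)
print("candidate right:", candidate_wins)
print("current right:", current_wins)
print("review groups:", review_groups)
```

Grouping the review queue means a threat researcher can judge one representative per group rather than every file individually, which is the practical benefit of clustering the disagreements.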
These practical processes allow threat researchers and other domain experts to ensure that the models are constantly improving. Furthermore, I would mention that there are many data science-driven opportunities to learn more efficiently and effectively from human experts. Active learning, semi-supervised learning, and weakly supervised learning are overlapping niches of ML research which contain many techniques that could be applied.
ActiveCyber: How will the use of machine learning evolve in the next 5 years in the cybersecurity industry? Do you believe that autonomous protections will extend to all endpoints as we see autonomous capabilities being extended in other industries?
Strong: I predict that machine learning will continue to grow in popularity, but also that we will begin to see more differentiation in the use cases and maturity of ML offerings. ML has proven itself in that many industry professionals see potential value, but there is also more hype than ever. Using machine learning does not automatically make a product better. Machine learning leaders such as Cylance will apply ML more and more effectively. Cylance has learned many lessons from building and deploying ML models. In contrast, vendors who implement machine learning ineffectively will gain little improvement while accumulating costly technical baggage. As Google researchers Sculley et al. put it, machine learning is the high-interest credit card of technical debt.
Yes, autonomous protections will become standard in endpoint security. The shift towards autonomy will be gradual as different capabilities are automated. There will always be humans in the loop to configure, verify, and validate. Further automation is inevitable because ultimately autonomous protections offer more effective defense at a lower human cost.
Thank you, Mr. Strong, for sharing your knowledge and insights on applying machine learning and associated data sciences to improve the security of endpoints. I know that Cylance will continue to make great advances under your leadership, and it looks like reliable autonomous endpoint security is not too far off.
And thanks for checking out ActiveCyber.net! Please give us your feedback because we’d love to know some topics you’d like to hear about in the area of active cyber defenses, PQ cryptography, risk assessment and modeling, autonomous security, securing the Internet of Things, or other security topics. Also, email firstname.lastname@example.org if you’re interested in interviewing or advertising with us at ActiveCyber.
About Homer Strong
Homer Strong is Director of Data Science at Cylance, where he leads the research and development of machine learning applications in cybersecurity. Previously Homer applied statistical models to combat healthcare fraud, sought cosmic rays through a crowd-sourced network of smartphones, and deployed distributed systems for matching online advertisements in real-time. He was CTO and co-founder at Lucky Sort, a natural language processing and visualization firm, which Twitter acquired in 2012. Homer holds an MS in statistics from the University of California, Irvine.