ABSTRACT : This paper presents our efforts towards developing a prescriptive maintenance system that integrates with and enhances state-of-the-art asset performance management software available in the industry. The goal of prescriptive maintenance is to analyze the behavior of an asset, assess its condition, and recommend specific actions to maximize the utility of that asset. Specifically, this work evaluates three approaches of differing complexity for vectorizing short-text maintenance case titles for kNN-based recommendation of cases relevant to a new input case title. Industrial text must first be vectorized to build automated and/or machine learning-based prediction and recommendation models. The choice of vectorization method heavily dictates how the language is modeled and consequently impacts the performance of downstream prediction and recommendation models. The objective of the nearest neighbor case recommendations is to reduce manual Subject Matter Expert (SME) effort and increase the consistency of recommended maintenance actions on industrial assets by reusing actions performed on the identified nearest neighbor cases from past maintenance work. Four models based on three text vectorization approaches are evaluated, quantitatively and qualitatively, using real data from a large variety of utility customers in the energy domain. A single-tier (WVEC-1tier) and a three-tier (WVEC-3tier) approach that represent case titles in word-based vector spaces each significantly outperform a more complex bag-of-phrases topic vector space-based approach (TVEC-K-topics). We present our findings and the challenges identified so far in building such a recommendation system.
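The WVEC tiers are not detailed in the abstract; as a minimal stand-in, word-level vectorization of case titles with cosine kNN retrieval might be sketched as below. The titles and the TF-IDF weighting are illustrative assumptions, not the paper's exact method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical past maintenance case titles (illustrative data only).
titles = [
    "pump seal leaking",
    "motor bearing vibration high",
    "transformer oil temperature alarm",
]

# Word-based vector space: each title becomes a TF-IDF weighted bag of words.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)

# Cosine-distance kNN index over the vectorized titles.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

def recommend(new_title, k=2):
    """Return the k past case titles nearest to a new input case title."""
    q = vectorizer.transform([new_title])
    _, idx = knn.kneighbors(q, n_neighbors=k)
    return [titles[i] for i in idx[0]]

print(recommend("pump seal leak"))
```

The actions recorded on the retrieved nearest-neighbor cases would then be surfaced to the SME as candidate recommendations.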
ABSTRACT : Network traffic analysis, with the objective of identifying and preempting malicious campaigns, is an active area of research. An effective model that predicts future malicious network events based on observed malicious event sequences can aid with preemptive action that includes intervention by a security analyst. Predicting threat events that are part of a cybersecurity threat campaign spanning a long duration of time remains a challenge, as the time lag between various steps in a campaign is unbounded. In this paper, we describe an approach to create an ensemble of Hidden Markov Models trained on sequences of malicious network events. The ensemble is used to predict the next expected malicious event given an already observed malicious traffic sequence at any network host. Ensembles of different sizes in combination with two prediction strategies are evaluated using prediction accuracy relative to two baseline predictors.
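The HMM internals are not reproduced in the abstract; as a hedged sketch, ensemble voting on the next expected event can be illustrated with simple first-order transition models standing in for the trained HMMs. All event names and partitions are hypothetical.

```python
from collections import Counter, defaultdict

def train_model(sequences):
    """Count event-to-event transitions in a set of malicious event sequences."""
    trans = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            trans[a][b] += 1
    return trans

def predict_next(model, last_event):
    """Most likely next event under one model, or None if unseen."""
    nxt = model.get(last_event)
    return nxt.most_common(1)[0][0] if nxt else None

def ensemble_predict(models, observed):
    """Majority vote across the ensemble on the next expected event."""
    votes = Counter(p for m in models
                    if (p := predict_next(m, observed[-1])) is not None)
    return votes.most_common(1)[0][0] if votes else None

# Illustrative training partitions, one per ensemble member.
part1 = [["scan", "exploit", "exfiltrate"]]
part2 = [["scan", "exploit", "exfiltrate"], ["scan", "bruteforce"]]
models = [train_model(p) for p in (part1, part2)]
print(ensemble_predict(models, ["scan", "exploit"]))
```

Majority voting is one plausible prediction strategy; the paper evaluates two strategies whose specifics are not given here.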
ABSTRACT : Evolving cybersecurity threats are a persistent challenge for system administrators and security experts as new malware is continually released. Attackers may look for vulnerabilities in commercial products or execute sophisticated reconnaissance campaigns to understand a target's network and gather information on security products like firewalls and intrusion detection / prevention systems (network- or host-based). Many new attacks tend to be modifications of existing ones. In such a scenario, rule-based systems fail to detect the new attack even when its conditions / attributes differ only slightly from those in the rules that identify the existing attack. To detect these differences, the IDS must be able to isolate the subset of conditions that are true and predict the likely conditions (different from the original) that must be observed. In this paper, we propose a probabilistic abductive reasoning approach that augments an existing rule-based IDS (snort [29]) to detect these evolved attacks by (a) predicting rule conditions that are likely to occur (based on existing rules) and (b) generating new snort rules, when provided with a seed rule (i.e., a starting rule), to reduce the burden on experts to constantly update them. We demonstrate the effectiveness of the approach by generating new rules from the snort 2012 rule set and testing them on the MACCDC 2012 dataset.
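The abductive step can be sketched as ranking unobserved rule conditions by how often they co-occur with the conditions that did fire, across the existing rule set. The rule conditions below are simplified stand-ins for snort rule options, not real snort syntax, and the scoring heuristic is an assumption for illustration.

```python
from collections import Counter

# Hypothetical rule set: each rule is a set of detection conditions
# (stand-ins for snort options such as protocol, port, and content matches).
rules = [
    {"tcp", "port:80", "content:cmd.exe"},
    {"tcp", "port:80", "content:powershell"},
    {"tcp", "port:443", "content:cmd.exe"},
]

def likely_conditions(observed, rules, top=2):
    """Rank unobserved conditions by weighted co-occurrence with the
    observed (true) conditions across the existing rule set."""
    scores = Counter()
    for rule in rules:
        overlap = len(observed & rule)
        if overlap:
            for cond in rule - observed:
                scores[cond] += overlap
    return [c for c, _ in scores.most_common(top)]

# A partial match: only these conditions of incoming traffic fired.
print(likely_conditions({"tcp", "port:80"}, rules))
```

A seed rule's conditions combined with the top-ranked predicted conditions would form candidate new rules for expert review.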
ABSTRACT : As government agencies increasingly make public data available online, opportunities arise to leverage such data for descriptive, predictive, and prescriptive analytics. One domain where these technological capabilities are applicable is real-estate development and the housing market, which is of interest to home buyers, investors, and policy makers. Diverse and varying preferences of the residents of a geography are latent behavioral factors that affect residential property prices. This paper describes a geographical-area-agnostic housing typology classifier for Baltimore City communities or neighborhoods. Further, it discusses correlation analysis and composite Vital Signs scores to characterize city population perceptions of different community development categories. These scores enable community clustering to investigate price disparity among comparable communities based on configurable categories and year-on-year trend analysis. Various visualization possibilities are discussed in conjunction with these approaches to make a case for interactive, visual exploration of geographical communities, which may be extended to comparative analysis across geographies.
ABSTRACT : A system and method for investigating trust scores. A trust score is calculated based on peer transfers. A graphical user interface displays actuatable elements associated with a first peer transfer from the peer transfers. In response to receiving an indication that a first actuatable element of the actuatable elements has been actuated, the trust score is recalculated without the first peer transfer.
ABSTRACT : A system and method for determining confidence scores for accounts based on peer-to-peer interactions. One or more clustering algorithms are applied to a database of peer-to-peer interactions to identify and group related peer-to-peer interactions. A classifying algorithm is applied to a group resulting from the one or more clustering algorithms that classifies each peer-to-peer interaction within the group based on one or more relationships between the peer-to-peer interactions within the group. A score is provided to each transaction in the group based at least in part on the classification. The system uses the score to change the functionality of at least one of the accounts associated with one of the transactions and/or provides information regarding the trustworthiness of a user of an account.
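The cluster-classify-score pipeline can be illustrated with a toy example; the grouping key, the relationship labels, and the scoring formula below are illustrative assumptions, not the claimed method.

```python
from collections import defaultdict

# Hypothetical peer-to-peer transfer records: (sender, receiver, amount).
transfers = [
    ("alice", "bob", 20.0),
    ("bob", "alice", 18.0),
    ("mallory", "alice", 500.0),
]

def group_by_pair(transfers):
    """Crude 'clustering' stand-in: group interactions by unordered account pair."""
    groups = defaultdict(list)
    for s, r, amt in transfers:
        groups[frozenset((s, r))].append((s, r, amt))
    return groups

def classify(group):
    """Label a group by the relationship among its interactions."""
    directions = {(s, r) for s, r, _ in group}
    return "reciprocal" if len(directions) > 1 else "one-way"

def score(group):
    """Toy confidence score: reciprocal, repeated activity scores higher."""
    base = 0.5 if classify(group) == "reciprocal" else 0.1
    return min(1.0, base + 0.1 * len(group))

groups = group_by_pair(transfers)
for pair, g in groups.items():
    print(sorted(pair), classify(g), score(g))
```

A production system would use actual clustering and classification models over richer interaction features; this sketch only shows the data flow from grouping to per-group scoring.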
ABSTRACT : Automated knowledge discovery is an active area of research that seeks to address the need for extensive knowledge acquisition and elicitation, curation, and archival for large quantities of text. We describe a generic, flexible, and extendable text analytics framework based on robust theme detection methods. A novel method is presented for extracting thematic hierarchies using Latent Dirichlet Allocation (LDA) topic models, noun-phrase extraction, and phrase filtering heuristics. Further, a visual representation of theme dynamics, the "Document Thematic Map (DTmap)", is created to enable text segmentation using the theme mix.
ABSTRACT : A large collection of software and hardware sensors exists for monitoring network traffic at different granularities and alerting when suspicious traffic is encountered. The sensors utilize large and diverse rule-sets to detect malicious network traffic patterns. The data generated by these sensors can be utilized to provide a holistic assessment of, and reason about, network threat patterns. We propose an analytic pipeline which applies graph theoretic and machine learning methods to achieve this. The proposed analytics pipeline allows a holistic assessment of network traffic patterns at custom temporal granularity. Further, the temporal co-occurrence of host interactions and their associativity can help discover possible collusion and attack campaign signatures. The automated workflow is extendable and customizable through new computation blocks and offers an interactive, human-in-the-loop experience.
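Two building blocks of such a pipeline can be sketched with NetworkX: an interaction graph built from sensor alerts, and a per-window view of temporally co-occurring source hosts. The alert records and window scheme are illustrative assumptions.

```python
from collections import defaultdict
import networkx as nx

# Hypothetical alert records: (time_window, source_host, dest_host).
alerts = [
    (0, "10.0.0.5", "10.0.0.9"),
    (0, "10.0.0.7", "10.0.0.9"),
    (1, "10.0.0.5", "10.0.0.9"),
]

# Directed host-interaction graph; edge weights count repeated interactions.
G = nx.DiGraph()
for _, src, dst in alerts:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

# Temporal co-occurrence: source hosts active in the same window are
# candidates for collusion / campaign analysis.
by_window = defaultdict(set)
for w, src, _ in alerts:
    by_window[w].add(src)
cooccurring = {frozenset(hosts) for hosts in by_window.values() if len(hosts) > 1}

print(list(G.edges(data=True)), cooccurring)
```

Downstream computation blocks would run graph algorithms (centrality, community detection) and machine learning models over this graph at the chosen temporal granularity.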
ABSTRACT : Automated knowledge discovery is central to augmenting knowledge acquisition and elicitation by humans from vast amounts of content. Precise and concise representations, both structured and semi-structured, of the knowledge contained in textual content have the potential to boost human productivity. Further, they can reduce, if not eliminate, human error and bias in knowledge retrieval and curation from vast collections of content used for subsequent knowledge-based tasks. Conventionally, knowledge discovery in text (KDT) approaches and paradigms have been designed to build domain knowledge by processing large collections of text documents, and to then process individual documents using this acquired domain knowledge for guidance. Consequently, these approaches are blind to the finer topical features of an individual document, because those features are abstracted away by topic models that infer topicality in the context of the whole corpus. We need an unsupervised method to extract topical or thematic phrases from a single text document without access to entire collections of texts or to background domain or language dictionaries and thesauri. Further, the method should not abstract away the fine-grained thematic phrases contained in the document, thus enabling its application to hierarchical knowledge representation and downstream document-level text analytics tasks. This work describes ThemaPhrase (ThP), a novel framework for unsupervised extraction of thematic phrases from single text artifacts. The framework operates without corpus-wide statistics or external domain knowledge, which makes it domain agnostic. ThP configurations are more robust than competing methods to the topic-to-partitions ratio and to varying average token occurrence frequencies in a document.
Different configurations of ThemaPhrase are identified that outperform competing methods in extracting thematic phrases that represent the topicality of a document at varied granularities. Further, this work shows that sentence pre-filtering based on thematic phrases and thematic words improves extractive summarization for texts, such as patents, that have relatively high token occurrence frequencies, where the baseline TextRank summarizer underperforms. ThemaPhrase configurations that outperform competing thematic phrase extraction methods in extractive summarization using sentence pre-filtering are discussed.
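The single-document, corpus-free setting can be illustrated with a crude stand-in for ThemaPhrase: score stopword-free bigrams by the within-document frequency of their tokens, using no corpus statistics or external dictionaries. The stopword list, scoring rule, and sample document are all illustrative assumptions, not the ThP algorithm.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is", "for", "and", "to", "in", "from"}

def thematic_phrases(text, top=2):
    """Crude single-document thematic phrase extraction: rank
    stopword-free bigrams by the in-document frequency of their tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter(t for t in tokens if t not in STOPWORDS)
    scores = Counter()
    for a, b in zip(tokens, tokens[1:]):
        if a not in STOPWORDS and b not in STOPWORDS:
            scores[(a, b)] = max(scores[(a, b)], freq[a] + freq[b])
    return [" ".join(p) for p, _ in scores.most_common(top)]

doc = ("The turbine blade coating protects the turbine blade from heat. "
       "A ceramic coating is applied to the turbine blade surface.")
print(thematic_phrases(doc))
```

The extracted phrases could then drive sentence pre-filtering: keep only sentences containing a top thematic phrase before running an extractive summarizer such as TextRank.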
ABSTRACT : Workflows identified from user event logs and click-stream data are useful as knowledge bases for behavioral analysis and recommendation systems. In this study, we identify abstractions, or summaries, of event logs modeled as user activity flow networks. The abstractions are identified based on structural properties as well as user activity flow dynamics over the network using community detection methods. We apply a fast modularity optimization and multi-level resolution approach to detect hierarchical community structure in user activity flow networks. The detected communities are compared to those detected by the information-theoretic map equation minimization approach to weigh the pros and cons of the fast modularity optimization approach in the workflows context. We further attempt to identify the most probable sources and sinks of user activity in individual communities and trim the network accordingly to reduce the entropy of the workflow abstractions.
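Modularity-based community detection on a weighted activity flow network can be sketched with NetworkX's greedy modularity routine, here standing in for the fast multi-level approach. The event nodes and transition weights below are hypothetical click-stream data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical user-activity flow network: nodes are UI events/screens,
# weighted edges are transition counts observed in click-stream logs.
G = nx.Graph()
G.add_weighted_edges_from([
    ("login", "dashboard", 9), ("dashboard", "reports", 8),
    ("reports", "export", 7),
    ("search", "results", 9), ("results", "detail", 8),
    ("detail", "search", 7),
    ("dashboard", "search", 1),  # weak bridge between two workflows
])

# Greedy (Clauset-Newman-Moore) modularity optimization: each detected
# community approximates one workflow abstraction.
communities = greedy_modularity_communities(G, weight="weight")
for c in communities:
    print(sorted(c))
```

Each detected community would then be inspected for its most probable activity sources and sinks before trimming, as described above.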