Toolbox Functionalities

The AMIDST system is an open source Java 8 toolbox that makes use of a functional programming style to support parallel processing on mutli-core CPUs (Masegosa et al., 2015). AMIDST provides a collection of functionalities and algorithms for learning both static and dynamic hybrid Bayesian networks from streaming data.

In what follows, we describe the main functionalities that the AMIDST toolbox supplies.

AMIDST Core Structures

An overview of the core structures of the AMIDST toolbox is illustrated in the figure below. Blue boxes represent software components that have been fully implemented, while green boxes represent components that are part of AMIDST design specification to be implemented in the future.

The core structures mainly relates to the basic components that play a key role in ensuring the implementation of the different AMIDST learning and inference algorithms.

For starters, instantiation of a particular probabilistic graphical model (PGM component) will be required. Currently, it is possible to create either a static Bayesian network (BN component) or a two time-slice dynamic Bayesian network (2T-DBN component). Then, the DAG component is defined over a list of Static Variables, whereas the 2T-DBN component is defined over a list of Dynamic Variables.

Both BN and 2T-DBN models rely on the Distributions component to define the conditional probability distributions. This component supports both Conditional Linear Gaussians and Exponential Family representations of the distributions. The implementation of the latter ensures an alternative representation of the standard distributions based on vectors of natural and moment parameters.

AMIDST Modules

The following figure shows a high-level overview of the key modules of the AMIDST toolbox:

Data Sources

In the AMIDST framework, we consider two types of data sources for learning, namely, DataStream, where data arrives at high frequency with no storage of historical data, and DataOnMemory for static databases that simply correspond to traditional databases. The data format supported by AMIDST is Weka’s attribute-relation file format (ARFF).

Moreover, we ensure a scalable DistributedDataProcessing using Apache Flink that runs the AMIDST developed algorithms on top of the Hadoop and Yarn architectures on Amazon Elastic MapReduce (EMR). This part will be soon released!

Each of the data sources are furthermore connected to the DataInstance and DynamicDataInstance components that represent a particular evidence configuration (static or dynamic, respectively) such as the observed values of a collection of variables at time t or a particular row in a database.

Inference Engines

Learning Engine

Learning engine includes the Structural Learning, Parameter Learning, and Feature Selection (still under development) components that are connected to both PGM and DataStream.

Links to MOA and Hugin

AMIDST leverages existing functionalities and algorithms by interfacing to software tools such as Hugin, and MOA (Massive Online Analysis). This allows the toolbox to efficiently exploit well-established systems and also broaden the AMIDST user-base.

Concept drift

AMIDST also offers some support for dealing with concept drift while learning BNs from data streams. More precisely, the toolbox supports a novel probabilistic approach based on latent variables (Borchani et al., 2015) for detecting and adapting to concept drifts.

Bibliography

Hanen Borchani, Ana M. Martinez, Andres R. Masegosa, et al. Modeling concept drift: A probabilistic graphical model based approach. In The Fourteenth International Symposium on Intelligent Data Analysis, pages 72-83, 2015.

Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.

J.M. Hammersley and D.C. Handscomb. Monte Carlo Methods. Methuen & Co, London, UK, 1964.

Anders L. Madsen, Frank Jensen, Uffe B. Kjærulff, and Michael Lang. HUGIN-the tool for Bayesian networks and influence diagrams. International Journal of Artificial Intelligence Tools, 14(3):507–543, 2005.

Anders L. Madsen, Frank Jensen, Antonio Salmeron, et al. A new method for vertical parallelisation of TAN learning Based on balanced incomplete block designs. In The Seventh European Workshop on Probabilistic Graphical Models, pages 302–317, 2014.

Anders L. Madsen, Frank Jensen, Antonio Salmeron, et al. Parallelization of the PC Algorithm. In The XVI Conference of the Spanish Association for Artificial Intelligence, pages 14-24, 2015.

Andres R. Masegosa, Ana M. Martinez, and Hanen Borchani. Probabilistic graphical models on multi-core CPUs using Java 8. IEEE Computational Intelligence Magazine, Special Issue on Computational Intelligence Software, Under review, 2015.

Antonio Salmeron, Dario Ramos-Lopez, Hanen Borchani, et al. Parallel importance sampling in conditional linear Gaussian networks. In The XVI Conference of the Spanish Association for Artificial Intelligence, pages 36-46, 2015.

F. W. Scholz. Maximum Likelihood Estimation. John Wiley & Sons, Inc., 2004.

John M. Winn and Christopher M. Bishop. Variational message passing. Journal of Machine Learning Research, 6:661–694, 2005.