The Y-DB vision
Scientific hypotheses are tentative, testable explanations of phenomena. In the era of data-intensive science and big data,
much of the scientific thinking is shifting to the data-analysis phase of the research life cycle. The vision behind
Y-DB meets this paradigm shift by abstracting the
data-intensive scientific method as a well-defined application
of uncertain and probabilistic databases.
What is Y-DB?
Y-DB (read "upsilon-DB") is a
probabilistic data system for scientists and engineers to manage data-intensive scientific hypotheses.
It is an abstraction layer on top of
MayBMS, a state-of-the-art probabilistic database
management system based on U-relations and probabilistic world-set algebra. Y-DB comprises a design methodology to automatically
synthesize a probabilistic database from a set of mathematical equations and their associated data, viz., parameter settings and
computed predictions. (Note: the system currently supports mathematical models in W3C
MathML-based
formats, e.g., Physiome's
MML, and loads data from file formats
such as .par, .csv, and .mat.)
What is Y-DB used for?
Hypothesis Management. In Y-DB,
hypotheses (as data) are encoded into Y-relations by processing their mathematical structure as provided in a MathML-based file. The system infers
the functional dependencies (FDs) hidden in the model equations. Once a hypothesis has been defined and at least one simulation trial dataset has been
loaded into the system, the user can query it just as in a traditional relational database.
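The FD inference step can be sketched as follows. This is a minimal illustration under assumptions: the helper name `infer_fd` is hypothetical (not Y-DB's actual API), and the Malthusian model stands in for a MathML-encoded equation.

```python
# Hypothetical sketch (names are illustrative, not Y-DB's API): each
# dependent variable of a model equation is functionally determined by
# the independent variables plus the parameter setting of a trial.

def infer_fd(dependent, independents, parameters):
    """Return the FD {independents ∪ parameters} -> dependent."""
    return (frozenset(independents) | frozenset(parameters), dependent)

# Malthusian growth, N(t) = N0 * exp(r * t): t, N0 and r determine N.
fd = infer_fd("N", ["t"], ["N0", "r"])
```

Each simulation trial fixes one parameter setting, so within a trial the inferred FD holds over the predicted data.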
Phenomena (as data)
are encoded by the incremental addition of empirical (observational or experimental) datasets. Attribute symbol mappings can be inserted to link phenomenon symbols (global) to
symbols of existing hypotheses (local).
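A symbol mapping can be pictured as a lookup from (global, hypothesis) pairs to local names. The mapping table below is a hypothetical sketch with made-up symbols, not data from the actual system.

```python
# Hypothetical sketch: phenomenon symbols are global, each hypothesis
# keeps its own local vocabulary. The concrete symbols are illustrative.

mappings = {
    # (phenomenon symbol, hypothesis id) -> local hypothesis symbol
    ("population", "malthus"):        "N",
    ("population", "logistic"):       "P",
    ("population", "lotka_volterra"): "x",
    ("time",       "malthus"):        "t",
}

def to_local(phenomenon_symbol, hypothesis_id):
    """Translate a global phenomenon attribute into the local symbol
    of a given hypothesis (None if no mapping has been inserted)."""
    return mappings.get((phenomenon_symbol, hypothesis_id))
```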
Predictive Analytics. In Y-DB, data predicted from the computational models (hypotheses) can be analyzed against empirical data (phenomena).
Based on the structure of both hypotheses and phenomena, the system supports the user in the setup of
data-driven evaluation studies,
i.e., given a phenomenon of interest, the user selects and assesses competing hypotheses whose structure is fit to explain the phenomenon.
Once the user assigns a prior probability distribution (uniform by default) to the competing hypotheses, Bayes' theorem is applied to obtain a posterior distribution,
possibly re-ranking the hypotheses in the face of the evidence available for the phenomenon.
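The posterior re-ranking step can be sketched as a direct application of Bayes' theorem. The likelihood values below are illustrative placeholders, not numbers produced by the system.

```python
# Sketch of posterior re-ranking: uniform prior over three competing
# hypotheses, with illustrative likelihoods P(D | h) of the observations.

def posterior(priors, likelihoods):
    """Bayes' theorem: P(h | D) = P(D | h) P(h) / sum_h' P(D | h') P(h')."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

priors = {"malthus": 1/3, "logistic": 1/3, "lotka_volterra": 1/3}
likelihoods = {"malthus": 0.02, "logistic": 0.05, "lotka_volterra": 0.20}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # hypothesis re-ranked to the top
```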
Key Concepts
Hypotheses as data.
In data-intensive science, hypotheses can be viewed as being
(i) formed as principles or ideas, (ii) then expressed mathematically, and (iii) implemented in silico as a program that is run
(iv) to yield their
decisive form: data.

Theoretical uncertainty.
Seen from a data perspective, hypotheses are used as functions to predict data. Because hypotheses are,
by definition, uncertain, the data generated from them is uncertain as well. For a given phenomenon
of interest, multiple working hypotheses are put forward as theoretical
alternatives.

Data-driven evaluation studies.
Data-driven evaluation studies are defined by the user by selecting a set of competing hypotheses, which the system validates as fit for a chosen phenomenon. The goal of each study is to
repair phi (the phenomenon ID) as a key w.r.t. upsilon (the hypothesis ID) in the explanation relation.
The system applies Bayes' theorem to the given prior probability distribution, re-assessing it into a posterior in the face of the evidence available for the phenomenon.
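The key-repair step can be sketched in plain Python: each phenomenon ID may be paired with several hypothesis IDs, and every way of keeping exactly one pair per phenomenon is a possible world weighted by the priors. The explanation relation below is a hypothetical two-phenomenon example, not actual system data.

```python
from itertools import product

# Hypothetical sketch of key repair on the explanation relation:
# repairing phi as a key w.r.t. upsilon keeps one tuple per phi, and each
# combination of choices is a possible world weighted by the priors.

explanation = [
    # (phi, upsilon, prior)
    (1, "malthus",        1/3),
    (1, "logistic",       1/3),
    (1, "lotka_volterra", 1/3),
    (2, "logistic",       1/2),
    (2, "lotka_volterra", 1/2),
]

# Group the competing alternatives by phenomenon ID.
alternatives = {}
for phi, upsilon, p in explanation:
    alternatives.setdefault(phi, []).append((upsilon, p))

# One possible world per combination of a single choice for each phi.
worlds = []
for combo in product(*alternatives.values()):
    prob = 1.0
    for _, p in combo:
        prob *= p
    worlds.append((tuple(u for u, _ in combo), prob))
```

In MayBMS this is what the repair-key construct expresses directly over a U-relation; the sketch only makes the resulting possible-worlds semantics explicit.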

Y-DB synthesis pipeline.
The construction and update of Y-DB are driven by two main user actions.
- Hypothesis data definition: the user uploads an XML file containing the hypothesis's mathematical structure, together with data files in
supported formats. Each hypothesis may have a large number of different simulation trials (obtained by varying parameters).
Phenomena are defined by uploading available empirical data. The system then synthesizes both hypotheses
and phenomena as certain relations.
- Evaluation study definition: the user selects a set of hypotheses as alternative explanations for a phenomenon of interest.
The system allows only the selection of hypotheses whose structure is fit to the phenomenon. A prior probability distribution
is then assigned to them; it is uniform by default, but the user may set a biased distribution instead.
Demonstration
A first prototype of the Y-DB system has been implemented as a Java web application, with the pipeline component on the server side
on top of MayBMS (a backend extension of PostgreSQL). The demonstrated system processes hypotheses through a design-by-synthesis
pipeline that defines its architecture (see Fig. below): from their XML-based extraction, to their encoding as
uncertain and probabilistic U-relational data, and eventually to their conditioning in the presence of observations.
Figure: Design-by-synthesis pipeline for processing hypotheses as uncertain and probabilistic data.
The figures below show screenshots of the system in a population dynamics scenario comprising the Malthusian model, the logistic
equation, and the Lotka-Volterra model, applied to predict the lynx population in Hudson's Bay, Canada, from 1900 to 1920. The
observations are used to rank the competing hypotheses and their trials accordingly.
Fig. demo (a) shows the research projects currently available to a user. Figs. demo (b, c) show the ETL interfaces for phenomenon
and hypothesis data definition (by synthesis), and then the insertion of hypothesis simulation trial datasets. Note that only a
brief phenomenon description, a hypothesis name, and file uploads are required to make phenomena and hypotheses available in the
system as probabilistic data.
Fig. demo (d) shows the interface for a basic retrieval of simulation data, given a selected phenomenon and a hypothesis trial.
Figs. demo (e, f) show two tabs of the predictive analytics module.
Note that the user chooses a phenomenon for study and imposes selectivity criteria on its observational sample.
The system then lists in the next tab the corresponding predictions available, ranked by their probabilities conditioned on the
selected observations. In this case, Lotka-Volterra's model (under trial tid=2) is the top-ranked hypothesis to explain the Lynx
population observations in Hudson's Bay from 1900 to 1920.
Figure demo (a)
Figure demo (b)

Figure demo (c)

Figure demo (d)

Figure demo (e)

Figure demo (f)