The Y-DB vision
Scientific hypotheses are tentative, testable explanations of phenomena. In the era of data-intensive science and big data,
much of the scientific thinking is shifting to the data-analysis phase of the research life cycle. The vision behind
Y-DB meets this paradigm shift by abstracting the
data-intensive scientific method as a well-defined application
of uncertain and probabilistic databases.
What is Y-DB?
Y-DB (read "upsilon-DB") is a
probabilistic data system for scientists and engineers to manage data-intensive scientific hypotheses.
It is an abstraction layer on top of
MayBMS, a state-of-the-art probabilistic database
management system based on U-relations and probabilistic world-set algebra. Y-DB comprises a design methodology to automatically
synthesize a probabilistic database from a set of mathematical equations and their associated data, viz., parameter settings and
computed predictions. (Note: the system currently supports mathematical models in W3C
MathML-based
formats, e.g., Physiome's
MML, and loads data from file formats
such as .par, .csv, and .mat.)
What is Y-DB used for?
Hypothesis Management. In Y-DB,
hypotheses (as data) are encoded into Y-relations by processing their mathematical structure as provided in a MathML-based file. The system infers
the functional dependencies (FDs) hidden in the model equations. Once a hypothesis has been defined and at least one simulation trial dataset has been
loaded into the system, the user can query it just as in a traditional relational database.
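The FD inference step can be sketched as follows. This is a minimal illustration under assumptions: the helper name `infer_fd` is hypothetical (not Y-DB's actual API), and the Malthusian model stands in for a MathML-encoded equation.

```python
# Hypothetical sketch (names are illustrative, not Y-DB's API): each
# dependent variable of a model equation is functionally determined by
# the independent variables plus the parameter setting of a trial.

def infer_fd(dependent, independents, parameters):
    """Return the FD {independents ∪ parameters} -> dependent."""
    return (frozenset(independents) | frozenset(parameters), dependent)

# Malthusian growth, N(t) = N0 * exp(r * t): t, N0 and r determine N.
fd = infer_fd("N", ["t"], ["N0", "r"])
```

Each simulation trial fixes one parameter setting, so within a trial the inferred FD holds over the predicted data.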
Phenomena (as data)
are encoded by the incremental addition of empirical (observational or experimental) datasets. Attribute symbol mappings can be inserted to link phenomenon symbols (global) to
symbols of existing hypotheses (local).
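A symbol mapping can be pictured as a lookup from (global, hypothesis) pairs to local names. The mapping table below is a hypothetical sketch with made-up symbols, not data from the actual system.

```python
# Hypothetical sketch: phenomenon symbols are global, each hypothesis
# keeps its own local vocabulary. The concrete symbols are illustrative.

mappings = {
    # (phenomenon symbol, hypothesis id) -> local hypothesis symbol
    ("population", "malthus"):        "N",
    ("population", "logistic"):       "P",
    ("population", "lotka_volterra"): "x",
    ("time",       "malthus"):        "t",
}

def to_local(phenomenon_symbol, hypothesis_id):
    """Translate a global phenomenon attribute into the local symbol
    of a given hypothesis (None if no mapping has been inserted)."""
    return mappings.get((phenomenon_symbol, hypothesis_id))
```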
Predictive Analytics. In Y-DB, data predicted from the computational models (hypotheses) can be analyzed against empirical data (phenomena).
Based on the structure of both hypotheses and phenomena, the system supports the user in the setup of
data-driven evaluation studies,
i.e., given a phenomenon of interest, the user selects and assesses competing hypotheses whose structure is fit to explain the phenomenon.
Once the user assigns a prior probability distribution (uniform by default) to the competing hypotheses, Bayes' theorem is applied to obtain a posterior distribution,
possibly re-ranking the hypotheses in the face of the evidence available for the phenomenon.
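The posterior re-ranking step can be sketched as a direct application of Bayes' theorem. The likelihood values below are illustrative placeholders, not numbers produced by the system.

```python
# Sketch of posterior re-ranking: uniform prior over three competing
# hypotheses, with illustrative likelihoods P(D | h) of the observations.

def posterior(priors, likelihoods):
    """Bayes' theorem: P(h | D) = P(D | h) P(h) / sum_h' P(D | h') P(h')."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

priors = {"malthus": 1/3, "logistic": 1/3, "lotka_volterra": 1/3}
likelihoods = {"malthus": 0.02, "logistic": 0.05, "lotka_volterra": 0.20}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)  # hypothesis re-ranked to the top
```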
Key Concepts
Hypotheses as data.
In data-intensive science, hypotheses can be viewed as being
(i) formed as principles or ideas, (ii) then expressed mathematically, and (iii) implemented in silico as a program that is run
(iv) to yield their
decisive form: data.

Theoretical uncertainty.
Seen from a data perspective, hypotheses are used as functions to predict data. Because hypotheses are,
by definition, uncertain, the data generated from them is uncertain as well. For a given phenomenon
of interest, multiple working hypotheses are put forward as theoretical
alternatives.

Data-driven evaluation studies.
Data-driven evaluation studies are defined by the user by selecting a set of competing hypotheses, which the system validates as fit for a chosen phenomenon. The goal of each study is to
repair phi (the phenomenon ID) as a key w.r.t. upsilon (the hypothesis ID) in the explanation relation.
The system applies Bayes' theorem to the given prior probability distribution, re-assessing it into a posterior in the face of the evidence available for the phenomenon.
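The key-repair step can be sketched in plain Python: each phenomenon ID may be paired with several hypothesis IDs, and every way of keeping exactly one pair per phenomenon is a possible world weighted by the priors. The explanation relation below is a hypothetical two-phenomenon example, not actual system data.

```python
from itertools import product

# Hypothetical sketch of key repair on the explanation relation:
# repairing phi as a key w.r.t. upsilon keeps one tuple per phi, and each
# combination of choices is a possible world weighted by the priors.

explanation = [
    # (phi, upsilon, prior)
    (1, "malthus",        1/3),
    (1, "logistic",       1/3),
    (1, "lotka_volterra", 1/3),
    (2, "logistic",       1/2),
    (2, "lotka_volterra", 1/2),
]

# Group the competing alternatives by phenomenon ID.
alternatives = {}
for phi, upsilon, p in explanation:
    alternatives.setdefault(phi, []).append((upsilon, p))

# One possible world per combination of a single choice for each phi.
worlds = []
for combo in product(*alternatives.values()):
    prob = 1.0
    for _, p in combo:
        prob *= p
    worlds.append((tuple(u for u, _ in combo), prob))
```

In MayBMS this is what the repair-key construct expresses directly over a U-relation; the sketch only makes the resulting possible-worlds semantics explicit.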

Y-DB synthesis pipeline.
The construction and update of Y-DB are driven by two main user actions.
- Hypothesis data definition: the user uploads an XML file containing the hypothesis's mathematical structure, together with data files in
supported formats. Each hypothesis may have a large number of different simulation trials (obtained by varying parameters).
Phenomena are defined by uploading available empirical data. The system then synthesizes both hypotheses
and phenomena as certain relations.
- Evaluation study definition: the user selects a set of hypotheses as alternative explanations for a phenomenon of interest.
The system allows only the selection of hypotheses whose structure is fit to the phenomenon. A prior probability distribution
is then assigned to them; it is uniform by default, but the user may set a biased distribution instead.
Demonstration
A first prototype of the Y-DB system has been implemented as a Java web application, with the pipeline component on the server side
on top of MayBMS (a backend extension of PostgreSQL). The demonstrated system processes hypotheses through a design-by-synthesis
pipeline that defines its architecture (see Fig. below): from their XML-based extraction, to their encoding as
uncertain and probabilistic U-relational data, and eventually to their conditioning in the presence of observations.
Figure: Design-by-synthesis pipeline for processing hypotheses as uncertain and probabilistic data.
The figures below show screenshots of the system in a population dynamics scenario comprising the Malthusian model, the logistic
equation, and the Lotka-Volterra model, applied to predict the lynx population in Hudson's Bay, Canada, from 1900 to 1920. The
observations are used to rank the competing hypotheses and their trials accordingly.
Fig. demo (a) shows the research projects currently available to a user. Figs. demo (b, c) show the ETL interfaces for phenomenon
and hypothesis data definition (by synthesis), and then the insertion of hypothesis simulation trial datasets. Note that only a
brief phenomenon description, a hypothesis name, and file uploads are required to make phenomena and hypotheses available in the
system as probabilistic data.
Fig. demo (d) shows the interface for a basic retrieval of simulation data, given a selected phenomenon and a hypothesis trial.
Figs. demo (e, f) show two tabs of the predictive analytics module.
Note that the user chooses a phenomenon for study and imposes selectivity criteria on its observational sample.
The system then lists in the next tab the corresponding predictions available, ranked by their probabilities conditioned on the
selected observations. In this case, Lotka-Volterra's model (under trial tid=2) is the top-ranked hypothesis to explain the Lynx
population observations in Hudson's Bay from 1900 to 1920.
Figure demo (a)
Figure demo (b)

Figure demo (c)

Figure demo (d)

Figure demo (e)

Figure demo (f)