The cloud has become a good match for managing big data since it provides virtually unlimited computing, storage and network resources on demand. By centralizing all data in a large-scale data-center, the cloud significantly simplifies the task of system administration. But for scientific data, where different organizations may have their own data-centers, a distributed (multisite) cloud model, where each site is visible from outside, is needed. The main objective of this research and scientific collaboration is to develop a multisite cloud architecture for managing and analyzing scientific data, including support for heterogeneous data, distributed scientific workflows, and complex big data analysis. The resulting architecture will enable scalable data management infrastructures that can be used to host a variety of scientific applications that benefit from computing, storage, and networking resources that span multiple data-centers.
The research challenge that we will confront is the design of new techniques for managing scientific data in a distributed and parallel manner, leveraging machines at multiple sites. In particular, the following issues will be investigated:
Distributed Scientific Data Management
The approach is to capitalize on the principles of distributed and parallel data management. In particular, this study investigates the impact of intersite data movement on the performance/cost tradeoffs of our algorithms.
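To make the role of intersite data movement concrete, the following is a minimal sketch and not the project's actual algorithms. It assumes a workflow described as producer/consumer edges annotated with data sizes, a placement of activities on sites, and a single fixed intersite bandwidth. The function name `intersite_cost`, the activity names, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical workflow edge: (producer, consumer, megabytes transferred).
Edge = Tuple[str, str, float]


def intersite_cost(edges: List[Edge], placement: Dict[str, str],
                   bandwidth_mb_s: float) -> Tuple[float, float]:
    """Return (megabytes moved between sites, estimated transfer seconds)."""
    # Only edges whose endpoints run on different sites move data across sites.
    moved = sum(size for src, dst, size in edges
                if placement[src] != placement[dst])
    return moved, moved / bandwidth_mb_s


# Illustrative three-edge workflow (sizes are made up).
edges = [("extract", "filter", 800.0),
         ("filter", "simulate", 200.0),
         ("simulate", "analyze", 50.0)]

# Two candidate placements of the same workflow on two sites.
cut_early = {"extract": "site_a", "filter": "site_b",
             "simulate": "site_b", "analyze": "site_b"}
cut_late = {"extract": "site_a", "filter": "site_a",
            "simulate": "site_a", "analyze": "site_b"}

early = intersite_cost(edges, cut_early, bandwidth_mb_s=10.0)
late = intersite_cost(edges, cut_late, bandwidth_mb_s=10.0)
print(early, late)
```

Under these assumptions, cutting the workflow after the data has been reduced moves far less data between sites than cutting it right after extraction. A realistic model would also account for compute cost and per-site pricing.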
The Management of Numerical Simulation data using a multidimensional array model.
Numerical Simulation applications aim at producing a realistic computer-based simulation of phenomena. Typically, the application computes the values of variables of interest in space-time. Depending on the simulation precision and the domain to be simulated, the computation may take a very long time and produce a huge amount of data. In this research, we aim at supporting simulation data analytics by providing efficient data management techniques to store, distribute and query simulation data. We have built SimDB on top of the multidimensional database system SciDB.
Data-locality-aware scientific workflow strategies.
Data locality has been successfully explored by the MapReduce paradigm and its best-known open-source implementation, Apache Hadoop. More recently, the Apache Spark system proposed a richer language for dataflow specification and in-memory data storage. Integrating existing scientific workflows into Spark is, however, difficult because of (1) the specific programming languages they adopt and (2) their use of file system I/O. In this work, we will investigate alternatives for integrating existing scientific workflows into the Spark execution model.
We will validate our techniques by building software prototypes that exploit the expertise of the two teams with Spark, SciCumulus and modern DBMS (MonetDB and SciDB). We will apply these techniques on real-world scientific data obtained from our application partners in astronomy, bioinformatics and computational engineering.
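As an illustration of the multidimensional array model, here is a minimal sketch of a chunked space-time array with a range query. It is not SimDB or SciDB code. The class `ChunkedArray`, the chunk shape, and the stored values are assumptions chosen for the example.

```python
from itertools import product
from typing import Dict, Tuple

Coord = Tuple[int, int, int]   # (time step, x, y)
CHUNK = (2, 4, 4)              # hypothetical chunk shape per dimension


def chunk_of(cell: Coord) -> Coord:
    """Map a cell coordinate to the coordinate of the chunk that holds it."""
    return tuple(c // s for c, s in zip(cell, CHUNK))


class ChunkedArray:
    """Toy array store: cells are grouped into fixed-size chunks."""

    def __init__(self) -> None:
        self.chunks: Dict[Coord, Dict[Coord, float]] = {}

    def put(self, cell: Coord, value: float) -> None:
        self.chunks.setdefault(chunk_of(cell), {})[cell] = value

    def range_query(self, lo: Coord, hi: Coord):
        """Return the cells in the inclusive box [lo, hi] and the chunks read."""
        first, last = chunk_of(lo), chunk_of(hi)
        result, touched = {}, 0
        # Visit only the chunks that can overlap the query box.
        for cid in product(*(range(a, b + 1) for a, b in zip(first, last))):
            chunk = self.chunks.get(cid)
            if chunk is None:
                continue
            touched += 1
            for cell, value in chunk.items():
                if all(l <= c <= h for l, c, h in zip(lo, cell, hi)):
                    result[cell] = value
        return result, touched


# Fill a 4 x 8 x 8 space-time grid with made-up values.
array = ChunkedArray()
for t, x, y in product(range(4), range(8), range(8)):
    array.put((t, x, y), float(t * 100 + x * 10 + y))

cells, chunks_read = array.range_query((0, 0, 0), (1, 3, 3))
print(len(cells), "cells from", chunks_read, "of", len(array.chunks), "chunks")
```

The point of the sketch is that a query over a region of space-time needs to read only the chunks overlapping that region, which is what makes distributing chunks across nodes or sites attractive.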
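One possible way to bridge file-based workflow activities and an in-memory dataflow engine is to wrap each activity in a function that takes records and returns records, hiding the file I/O in a temporary directory. The sketch below uses only the Python standard library. The "legacy program" is a stand-in written for this example, and `run_file_based_activity` is a hypothetical helper, not part of Spark or of the project's prototypes.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a legacy activity that only talks to files: it reads integers
# from the file named by argv[1] and writes their squares to argv[2].
LEGACY_PROGRAM = """
import sys
from pathlib import Path
values = Path(sys.argv[1]).read_text().split()
Path(sys.argv[2]).write_text(" ".join(str(int(v) ** 2) for v in values))
"""


def run_file_based_activity(records):
    """Feed in-memory records to the file-based program and read back its output."""
    with tempfile.TemporaryDirectory() as workdir:
        in_path = Path(workdir) / "input.txt"
        out_path = Path(workdir) / "output.txt"
        in_path.write_text(" ".join(str(r) for r in records))
        # Run the black-box program exactly as it would run outside the engine.
        subprocess.run(
            [sys.executable, "-c", LEGACY_PROGRAM, str(in_path), str(out_path)],
            check=True,
        )
        return [int(v) for v in out_path.read_text().split()]


# Each inner list plays the role of one data partition.
partitions = [[1, 2, 3], [4, 5]]
results = [run_file_based_activity(p) for p in partitions]
print(results)
```

In Spark, a wrapper of this shape could in principle be applied to each partition (for example through the RDD `mapPartitions` operation), so that the temporary files are written on the node that already holds the partition. Whether this is efficient enough in practice is exactly the kind of question the paragraph above raises.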
V. P. Freire, J. A. F. de Macedo, F. Porto, R. Akbarinia. NACluster: A Non-Supervised Clustering Algorithm for Matching Multi Catalogues. The 10th IEEE International Conference on e-Science, Guaruja, S.P., Brazil, Oct. 2014.
Liu, Ji; Silva, Vitor; Pacitti, Esther; Valduriez, Patrick; Mattoso, Marta. Scientific Workflow Partitioning in Multisite Cloud. In: 7th International Workshop on Multi/many-Core Computing Systems, 2014, Porto. EuroPar 2014.
Liu, Ji; Silva, Vitor; Pacitti, Esther; Valduriez, Patrick; Mattoso, Marta. Parallelization of Scientific Workflows in the Cloud. In: INRIA Research Report N° RR-8565 (2014).
Dias, J.; G. Guerra, F. Rochinha, A. Coutinho, P. Valduriez, M. Mattoso. Data-Centric Iteration in Dynamic Workflows. Future Generation Computer Systems, Elsevier, Vol. 4, 114-126 (2015).
Liu, Ji; V. Silva, E. Pacitti, P. Valduriez, M. Mattoso. Parallelization of Scientific Workflows in the Cloud. Journal of Grid Computing, DOI 10.1007/s10723-015-9329-8, online March (2015).
Lustosa, H.; F. Porto, R. Costa, P. Blanco, P. Valduriez. Managing Simulation Data with Multidimensional Arrays. Brazilian Symposium on Databases (SBBD) (2015).
Silva, V.; D. de Oliveira, P. Valduriez, M. Mattoso. Analyzing Related Raw Data Files through Dataflows. Concurrency and Computation: Practice and Experience, to appear (2015).
Souza, R.; V. Silva, D. de Oliveira, P. Valduriez, A. Lima, M. Mattoso. Parallel Execution of Workflows Driven by a Distributed Database Management System. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC15) (2015).
Esther Pacitti, University Montpellier 2, INRIA
Patrick Valduriez, INRIA, LIRMM
Reza Akbarinia, INRIA
Florin Masseglia, INRIA
Miguel Liroz-Gistau, INRIA
Ji Liu (PhD Student), INRIA
Saber Salah (PhD Student), INRIA
Maximilien Servajean (PhD Student), INRIA
Fabio Porto, LNCC
Marta Mattoso, COPPE-UFRJ
Alvaro Coutinho, COPPE-UFRJ
Daniel de Oliveira, UFF
Kary Ocaña, COPPE-UFRJ
Eduardo Ogasawara, CEFET-RJ
Flavio Costa (PhD Student), COPPE-UFRJ
Vitor Silva (PhD Student), COPPE-UFRJ
Douglas Ericson de Oliveira (PhD Student), LNCC
Daniel Gaspar (PhD Student), LNCC
Hermano Lustosa (PhD Student), LNCC
PARTICIPANTS: Esther Pacitti (LIRMM), Patrick Valduriez (INRIA - Zenith), Marta Mattoso (COPPE - UFRJ), Eduardo Ogasawara (CEFET-RJ), Daniel de Oliveira (UFF), Fabio Porto (LNCC), Kary Ocaña (LNCC), Hermano Lustosa (LNCC), Noel Lemus (LNCC), Heraldo Borges (CEFET-RJ), Yania Souto (LNCC)
PROGRAM:
Fabio Porto - Opening and Project Status
Esther Pacitti (LIRMM) - 15 min
Marta Mattoso (COPPE - UFRJ) - 15 min
Eduardo Ogasawara (CEFET-RJ) - 15 min
Daniel de Oliveira (UFF) - 15 min
New Project Proposal (11:00 - 11:30): Patrick Valduriez (INRIA - Zenith) - 30 min
PARTICIPANTS: Fabio Porto (LNCC), Marta Mattoso (COPPE-UFRJ), Patrick Valduriez (INRIA), Esther Pacitti (LIRMM-INRIA), Eduardo Ogasawara (CEFET-RJ), Daniel de Oliveira (UFF), Kary Ocaña (LNCC)
PRESENTATIONS:
Daniel de Oliveira (UFF): Runtime Performance Monitoring Using Provenance and Domain-Specific Data
Vitor Silva (COPPE-UFRJ): Analyzing Related Raw Data Files through Dataflows
Renan (COPPE-UFRJ): Controlling the Parallel Execution of Workflows Relying on a Distributed Database
Kary Ocaña (LNCC): Design and Execution of Large-Scale Bioinformatics Experiments: Experiences and Open Challenges at LNCC
Eduardo Ogasawara (CEFET-RJ): Identifying Motifs in Spatio-temporal Series
Amir Khatibi (LNCC): Unveiling Objects in Big Data
Hermano Lustosa (LNCC): Managing Numerical Simulation Data
Daniel Gaspar (LNCC): Optimizing Scientific Workflows
Title: Workshop - MUSIC - FAPERJ-INRIA
Coordination: Fabio Porto (LNCC), Esther Pacitti (LIRMM - INRIA)
Agenda:
PRESENTATIONS: Esther Pacitti, Fabio Porto, Marta Mattoso, Daniel de Oliveira, Kary Ocaña
PRESENTATIONS: Opening, Vitor Silva, Hermano Lustosa, Daniel de Oliveira, Douglas Ericson de Oliveira, Esther Pacitti