Dissertations and habilitations - Laboratoire de Recherche en Informatique

Ph.D
Group : Large-scale Heterogeneous DAta and Knowledge

Efficient techniques for large-scale Web data management

Starts on 08/09/2011
Advisor : COLAZZO, Dario
[MANOLESCU Ioana]

Funding :
Affiliation : Université Paris-Saclay
Laboratory : LRI - LEO

Defended on 25/09/2014, committee :

Thesis advisor :
- Mr. Dario Colazzo, Professor, Université Paris-Dauphine

Co-advisor :
- Ms. Ioana Manolescu, Research Director, Inria and Université Paris-Sud

Examiners :
- Mr. Reza Akbarinia, Research Scientist, Inria and Université Montpellier II
- Mr. Marc Baboulin, Professor, Université Paris-Sud and Inria
- Mr. Philippe Rigaux, Professor, Conservatoire National des Arts et Métiers

Reviewer :
- Mr. Donald Kossmann, Professor, ETH Zürich

Research activities :

Abstract :
The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing.

In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the now well-known MapReduce model and continuing with other, more expressive frameworks. As these models are increasingly used to express analytical data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems.

This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively.

First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA follows a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users bear monetary costs directly tied to their consumption of resources, our focus is not only on query performance in terms of execution time, but also on the monetary costs associated with this processing. In particular, we study the applicability of several content indexing strategies, and show that they reduce not only query evaluation time but also, importantly, the monetary costs of operating the cloud-based warehouse.
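As a rough illustration of why content indexing can lower monetary cost, the following minimal Python sketch builds a term-to-document index and compares how many documents a query must read with and without it. It is not AMADA's actual indexing strategy: the function names, the sample documents, and the per-read price `COST_PER_DOCUMENT_READ` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical price per document read; not taken from any provider's price list.
COST_PER_DOCUMENT_READ = 0.0001


def build_index(documents):
    """Map each lower-cased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def candidates(index, query_terms, all_ids):
    """Return the ids of the documents that contain every query term."""
    result = set(all_ids)
    for term in query_terms:
        result &= index.get(term.lower(), set())
    return result


def read_cost(num_documents):
    """Cost of reading the given number of documents from storage."""
    return num_documents * COST_PER_DOCUMENT_READ


# Toy documents, flattened to plain words for simplicity (an assumption).
documents = {
    "d1": "painting name Olympia painter Manet",
    "d2": "painting name Guernica painter Picasso",
    "d3": "museum name Louvre city Paris",
}

index = build_index(documents)
hits = candidates(index, ["painting", "Manet"], documents.keys())

cost_with_index = read_cost(len(hits))          # only matching documents are read
cost_without_index = read_cost(len(documents))  # every document is scanned
```

In this toy setting the index narrows the query to a single document, so fewer reads are billed. A real deployment would also have to account for the cost of building and storing the index itself.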

Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data.

Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with an extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions so that they share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
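The general idea of spotting common subexpressions across scripts can be sketched in a few lines of Python. This is only a simplified illustration, not the PigReuse algorithm: it detects syntactically identical sub-plans, whereas the thesis works on algebraic equivalence and uses a cost function to choose among merging opportunities. The plan encoding as nested tuples, the operator names, and the file name `visits.csv` are assumptions made for the example.

```python
from collections import Counter


def subexpressions(plan):
    """Yield a plan and all of its sub-plans.

    A plan is a tuple (operator, argument, *input_plans).
    """
    yield plan
    for child in plan[2:]:
        yield from subexpressions(child)


def shared_subexpressions(plans):
    """Return the sub-plans that occur in more than one of the given plans."""
    counts = Counter()
    for plan in plans:
        # Count each distinct sub-plan at most once per script.
        counts.update(set(subexpressions(plan)))
    return {expr for expr, n in counts.items() if n > 1}


# Two hypothetical scripts that share a load and a filter step.
load = ("LOAD", "visits.csv")
filtered = ("FILTER", "year == 2014", load)
script_a = ("GROUP", "url", filtered)
script_b = ("PROJECT", "user", filtered)

shared = shared_subexpressions([script_a, script_b])
```

Here `shared` contains the `LOAD` and `FILTER` sub-plans, which could be evaluated once and reused by both scripts. Deciding which shared sub-plans are actually worth materializing is where a cost model would come in.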