Ph.D
Group : Large-scale Heterogeneous DAta and Knowledge
Query Debugging and Fixing to Recover Missing Query Results
Starts on 01/10/2012
Advisor : BIDOIT, Nicole
[Mélanie HERSCHEL]
Funding : Contrat doctoral uniquement recherche
Affiliation : Université Paris-Saclay
Laboratory : LRI-BD
Defended on 14/12/2015, committee :
Directrice de thèse
Mme. Nicole Bidoit, Professeur, Université Paris-Sud
Co-encandrante de thèse
Mme. Mélanie Herschel, Professeur, Université de Stuttgart, Allemagne
Rapporteurs
M. David Gross-Amblard, Professeur, Université de Rennes
M. Yannis Velegrakis, Maître de Conférences, Université de Trento
Examinateurs
Mme. Christine Froidevaux, Professeur, Université Paris-Sud
M. Philippe Rigaux, Professeur, Conservatoire national des arts et métiers
Research activities :
Abstract :
With the increasing amount of available data and data transformations, typically specified by queries, the need to understand them also increases. “Why are there medicine books in my sales report?” or “Why are there not any database books?” For the first question we need to find the origins or provenance of the result tuples in the source data. However, reasoning about missing query results, specified by Why-Not questions as the latter previously mentioned, has not till recently received the attention it is worth of.
Why-Not questions can be answered by providing explanations for the missing tuples. These explanations identify why and how data pertinent to the missing tuples were not properly combined by the query. Essentially, the causes lie either in the input data (e.g., erroneous or incomplete data) or at the query level (e.g., a query operator like join).
Assuming that the source data contain all the necessary relevant information, we can identify the responsible query operators forming query-based explanations. This information can then be used to propose query refinements modifying the responsible operators of the initial query such that the refined query result contains the expected data. This thesis proposes a framework targeted towards SQL query debugging and fixing to recover missing query results based on query-based explanations and query refinements.
Our contribution to query debugging consist in two different approaches. The first one is a tree-based approach. First, we provide the formal framework around Why-Not questions, missing from the state-of-the-art. Then, we review in detail the state-of-the-art, showing how it probably leads to inaccurate explanations or fails to provide an explanation. We further propose the NedExplain algorithm that computes correct explanations for SPJA queries and unions there of, thus considering more operators (aggregation) than the state of the art. Finally, we experimentally show that NedExplain is better than the both in terms of time performance and explanation quality.
However, we show that the previous approach leads to explanations that differ for equivalent query trees, thus providing incomplete information about what is wrong with the query. We address this issue by introducing a more general notion of explanations, using polynomials. The polynomial captures all the combinations in which the query conditions should be fixed in order for the missing tuples to appear in the result. This method is targeted towards conjunctive queries with inequalities. We further propose two algorithms, Ted that naively interprets the definitions for polynomial explanations and the optimized Ted++. We show that Ted does not scale well w.r.t. the size of the database. On the other hand, Ted++ is capable of efficiently computing the polynomial, relying on schema and data partitioning and advantageous replacement of expensive database evaluations by mathematical calculations. Finally, we experimentally evaluate the quality of the polynomial explanations and the efficiency of Ted++, including a comparative evaluation.
For query fixing we propose is a new approach for refining a query by leveraging polynomial explanations. Based on the input data we propose how to change the query conditions pinpointed by the explanations by adjusting the constant values of the selection conditions. In case of joins, we introduce a novel type of query refinements using outer joins. We further devise the techniques to compute query refinements in the FixTed algorithm, and discuss how our method has the potential to be more efficient and effective than the related work.
Finally, we have implemented both Ted++ and FixTed in an system prototype. The query debugging and fixing platform, short EFQ allows users to interactively debug and fix their queries (conjunctive queries with inequalities) when having Why-Not questions.