Dissertations and habilitations - Laboratoire de Recherche en Informatique

Ph.D
Group : Large-scale Heterogeneous DAta and Knowledge

Automatic key discovery for Data Linking

Starts on 05/10/2011
Advisor : PERNELLE-MANSCOUR, Nathalie
[SAIS Fatiha]

Affiliation : Université Paris-Saclay
Laboratory : LRI-IASI

Defended on 09/10/2014, committee :

Thesis advisor :
- Ms. Nathalie Pernelle, Associate Professor, LRI, Université Paris Sud

Co-advisor :
- Ms. Fatiha Saïs, Associate Professor, LRI, Université Paris Sud

Reviewers :
- Ms. Marie-Christine Rousset, Professor, LIG, Université de Grenoble
- Mr. Aldo Gangemi, Professor, LIPN, Université Paris 13

Examiners :
- Mr. Olivier Curé, Associate Professor, LIGM, Université Marne-la-Vallée
- Mr. Alain Denise, Professor, LRI, Université Paris Sud

Abstract :
In recent years, the Web of Data has grown significantly and now contains a huge number of RDF triples. Integrating data described in different RDF datasets and creating semantic links among them has become one of the most important goals of RDF applications. These links express semantic correspondences between ontology entities or data. Among the different kinds of semantic links that can be established, identity links express that different resources refer to the same real-world entity. Comparing the number of resources published on the Web with the number of identity links shows that the goal of building a Web of Data has not yet been reached. Several data linking approaches infer identity links using keys. Nevertheless, in most datasets published on the Web, keys are not available, and declaring them can be difficult even for an expert.
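
To make the role of keys in data linking concrete, here is a minimal Python sketch of key-based identity linking. It is an illustration only, not the method of any particular linking tool: the function name `link_by_key`, the dictionary-based representation of RDF descriptions, and the sample resources are all assumptions made for this example. Two resources are linked when, for every property of the key, both have a value and share at least one.

```python
from itertools import product


def link_by_key(source, target, key):
    """Return sorted (source_uri, target_uri) pairs that agree on every key property.

    Each dataset maps a resource URI to {property: set of values}.
    Hypothetical helper: a pair is linked only if, for every property in
    `key`, both resources have values and share at least one of them.
    """
    links = []
    for (s_uri, s_desc), (t_uri, t_desc) in product(source.items(), target.items()):
        if all(
            s_desc.get(p) and t_desc.get(p) and s_desc[p] & t_desc[p]
            for p in key
        ):
            links.append((s_uri, t_uri))
    return sorted(links)


# Illustrative data, not taken from any real dataset.
source = {
    "a:p1": {"name": {"Ada Lovelace"}, "birthYear": {"1815"}},
    "a:p2": {"name": {"Alan Turing"}, "birthYear": {"1912"}},
}
target = {
    "b:x": {"name": {"Ada Lovelace"}, "birthYear": {"1815"}},
    "b:y": {"name": {"Alan Turing"}},  # birthYear is missing here
}

identity_links = link_by_key(source, target, ["name", "birthYear"])
```

With the composite key {name, birthYear}, only `a:p1` and `b:x` are linked, because `b:y` has no birth year. This also shows why incomplete data matters for key-based linking.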

The aim of this thesis is to study the problem of automatic key discovery in RDF data and to propose new, efficient approaches to tackle it. Data published on the Web are usually created automatically and may therefore contain erroneous information or duplicates, or may be incomplete. We therefore focus on key discovery approaches that can handle large datasets containing incomplete or erroneous information. Our objective is to discover as many keys as possible, even ones that are valid only in subparts of the data.

We first introduce KD2R, an approach that allows the automatic discovery of composite keys in RDF datasets that may conform to different ontologies. KD2R can handle datasets that may be incomplete and for which the Unique Name Assumption holds. To deal with the incompleteness of data, KD2R proposes two heuristics that offer different interpretations of the absence of data. KD2R uses pruning techniques to reduce the search space. However, this approach does not scale to the huge amounts of data found on the Web. We therefore present our second approach, SAKey, which scales to very large datasets by using effective filtering and pruning techniques. Moreover, SAKey can discover keys in datasets that may contain erroneous data or duplicates. More precisely, the notion of almost keys is proposed to describe sets of properties that are not keys because of a few exceptions.
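
The notion of an almost key can be illustrated with a short Python sketch. This is a naive pairwise check written for this page, not the SAKey algorithm, which relies on filtering and pruning to avoid such exhaustive comparisons. The names `exceptions` and `is_almost_key`, the threshold parameter `n`, and the sample data are assumptions. A missing value is treated here as never matching, which is only one possible interpretation of absent data.

```python
def exceptions(data, props):
    """Return the resources that share a value with another resource on every property.

    `data` maps a resource URI to {property: set of values}.
    Assumption: a resource with no value for a property never matches on it.
    """
    resources = list(data)
    found = set()
    for i, r1 in enumerate(resources):
        for r2 in resources[i + 1:]:
            if all(data[r1].get(p, set()) & data[r2].get(p, set()) for p in props):
                found.update((r1, r2))
    return found


def is_almost_key(data, props, n):
    """True if `props` has at most `n` exceptions (n = 0 means an exact key)."""
    return len(exceptions(data, props)) <= n


# Illustrative data: r4 is an accidental duplicate of r3.
data = {
    "r1": {"name": {"Paris"}, "country": {"France"}},
    "r2": {"name": {"Paris"}, "country": {"USA"}},
    "r3": {"name": {"Lyon"}, "country": {"France"}},
    "r4": {"name": {"Lyon"}, "country": {"France"}},
}

strict = is_almost_key(data, ["name", "country"], 0)    # False: r3 and r4 collide
tolerant = is_almost_key(data, ["name", "country"], 2)  # True: two exceptions allowed
```

In this toy dataset, {name, country} is not a key because of the duplicate pair, but it is accepted once two exceptions are tolerated, whereas {name} alone has four exceptions.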