I thought it would be interesting to share some of what we’re working on at Revelytix and also ask for comment on an initial piece of it. I’ll start with a little background; much of this is also covered in my recent slides from SemTech, which may serve as a useful set of complementary pictures.
Revelytix is currently working on a project for the US Department of Defense. The aim is to integrate information from across the various service branches within the DOD. The Office of the Secretary of Defense (OSD) does not actually manage or control much of the data within the DOD; instead, that data is managed independently in the various service branches, often in many different databases.
The solution we’re building integrates data from these service branches using semantic web techniques, allowing the OSD to ask questions of this federated data and get back combined answers in a way that is not currently possible. For a high-level architectural overview (including this project as only one example), I recommend this talk by Dennis Wisnosky.
Getting the data from those service branches requires talking to sources that do not natively provide data or answer queries in the languages used in the semantic web (RDF and SPARQL, respectively). In particular, much of the data resides in traditional relational databases like Oracle.
The general problem of exposing relational data for use in the semantic web has been explored for years. There is a W3C standards group (the RDB2RDF working group) actively working on standardizing the approach to these problems. Some important dimensions to consider in this process are:
- Usage – most existing projects allow you to access the data in one of two ways: 1) an ETL process that does a bulk dump of all relational data into an RDF triple store, or 2) dynamic query translation from SPARQL to SQL against the underlying database.
- Output ontology – many existing projects can automatically produce a domain ontology from the relational database schema. Generally, a domain class is created for each table and a domain property is created for each column. The key columns are used to generate a resource identifier for each row of each table (see the sketch after this list). Additionally, many of the existing projects allow you to specify a custom mapping into a pre-existing domain ontology.
- Mapping definition – when specifying a domain mapping, projects use a wide variety of forms: DDL, rules, ontologies, custom DSLs, etc.
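To make the automated-ontology approach concrete, here is a hedged sketch of the kind of triples a table-driven mapping might generate. The table, column names, base URI, and row data are all invented for illustration, not taken from any particular tool:

```turtle
# Sketch of output a direct mapping might generate from a hypothetical table:
#   CREATE TABLE EMPLOYEE (EMP_ID INTEGER PRIMARY KEY, NAME VARCHAR(100));

@prefix db:   <http://example.com/db/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# One domain class per table, one domain property per column
db:EMPLOYEE      a owl:Class .
db:EMPLOYEE_NAME a owl:DatatypeProperty ;
                 rdfs:domain db:EMPLOYEE .

# One resource per row, with an identifier minted from the key column
db:EMPLOYEE-42 a db:EMPLOYEE ;
               db:EMPLOYEE_NAME "Ada Lovelace" .
```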
We were looking for a solution that had the following features:
- Dynamically translates queries (a sketch of this translation follows this list). Moving the data in an ETL process is not an option due to data volume and security constraints.
- Leverages existing relational schema info. We want to leverage information already in the database schema without needing to repeat it in the mapping language.
- Minimizes repetition within the mapping. Repeating URI patterns or other info in the mapping is a common source of errors during initial modeling and evolution.
- Exposes full power of SQL. SQL supports virtually any possible transformation and we want to be able to bring all of that power to bear when specifying a mapping. In particular, we want to have the ability to use aggregates, subqueries, database functions, etc.
- Allows analysis of the mapping itself with semantic web tools. We have found a number of use cases for analyzing and validating the mappings themselves with SPARQL or rules. This allows us to do project management tracking, dependency analysis, impact analysis, etc.
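As a rough sketch of the dynamic translation mentioned above, continuing the invented EMPLOYEE example (and not showing any particular engine's actual output), a SPARQL pattern over the generated vocabulary maps naturally onto SQL:

```sparql
PREFIX db: <http://example.com/db/>

# Asking for every employee's name...
SELECT ?name WHERE {
  ?e a db:EMPLOYEE ;
     db:EMPLOYEE_NAME ?name .
}

# ...might be rewritten at query time to roughly:
#   SELECT NAME FROM EMPLOYEE;
```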
Some existing projects that provide database mapping capabilities include D2RQ, Virtuoso, Triplify, Asio, SquirrelRDF, R2O, Dartquery, SPASQL, RDBtoOnto, and many more. Of those, D2RQ seems to be the most commonly used solution, and we have built some deliverables using it ourselves. D2RQ satisfies many of the things we were looking for, but falls a bit short in areas like minimizing repetition and, in particular, providing mappings suitable for analysis. D2RQ refers to database information in strings of SQL without any other explicit linkage to the database schema.
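For example, a typical D2RQ mapping fragment looks roughly like the following (the d2rq vocabulary is real, but the table, column, and property names here are invented for illustration). Note how the column reference appears only as an opaque string:

```turtle
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix map:  <http://example.com/mapping#> .
@prefix ex:   <http://example.com/domain#> .

# map:database (a d2rq:Database holding JDBC connection details) is elided.
map:Employee a d2rq:ClassMap ;
    d2rq:dataStorage map:database ;
    d2rq:uriPattern  "employee/@@EMPLOYEE.EMP_ID@@" ;
    d2rq:class       ex:Employee .

map:employeeName a d2rq:PropertyBridge ;
    d2rq:belongsToClassMap map:Employee ;
    d2rq:property          ex:name ;
    d2rq:column            "EMPLOYEE.NAME" .   # the column is just a string
```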
Due to these deficiencies, we have created our own mapping language. It shares some of the ideas present in D2RQ, but it departs significantly from D2RQ in the details. One major difference is that the Revelytix mapping language depends on importing the database schema and building a source ontology that represents the tables, columns, keys, etc. from that source. The mapping ontology then references that source ontology.
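The authoritative syntax is in the specification linked below; purely as an illustrative sketch of the shape of the idea (the sql:, src:, and map: terms here are invented stand-ins, not the actual Revelytix vocabulary), it looks something like this:

```turtle
@prefix sql: <http://example.com/sql#> .   # invented stand-in vocabulary
@prefix src: <http://example.com/src#> .
@prefix map: <http://example.com/map#> .
@prefix ex:  <http://example.com/domain#> .

# Source ontology imported from the database schema: every table,
# column, and key becomes an addressable resource.
src:EMPLOYEE        a sql:Table .
src:EMPLOYEE_EMP_ID a sql:Column ; sql:inTable src:EMPLOYEE ;
                    sql:keyOf src:EMPLOYEE .
src:EMPLOYEE_NAME   a sql:Column ; sql:inTable src:EMPLOYEE .

# The mapping references those resources instead of embedding
# "EMPLOYEE.NAME" as an opaque string.
map:employeeName a map:PropertyMap ;
    map:sourceColumn   src:EMPLOYEE_NAME ;
    map:targetProperty ex:name .
```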
One major benefit of this strategy is that many of the ontology references to columns and keys are made by pointing to a resource in the database ontology rather than by naming it in a text string. This reduces the chance of errors in the mapping, makes the mapping process easier in an editor, and, most importantly, exposes actual linkages from database to domain. We can then write SPARQL queries against the mapping ontology that answer questions about data lineage.
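Continuing the invented vocabulary from the previous sketch, a lineage question becomes an ordinary SPARQL query over the mapping itself:

```sparql
PREFIX sql: <http://example.com/sql#>
PREFIX src: <http://example.com/src#>
PREFIX map: <http://example.com/map#>

# Which domain properties are derived from columns of the EMPLOYEE table?
SELECT ?domainProperty ?column WHERE {
  ?m a map:PropertyMap ;
     map:sourceColumn   ?column ;
     map:targetProperty ?domainProperty .
  ?column sql:inTable src:EMPLOYEE .
}
```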
We are very interested in hearing feedback about the choices made in this mapping language. You can find the full mapping language specification (still a work in progress) at this location or view the actual ontology and some examples in our public Knoodl repository. We have an evolving prototype implementation called Spyder that uses these mappings to perform dynamic SPARQL to SQL translation. In the future, we hope to be able to release Spyder as an open source project. Spyder is a key component in a larger semantic federation engine we are building.