What is Harmony?
Harmony is an innovative, open source schema matching tool, available both as a stand-alone application and as part of the OpenII information integration tool suite. It is a semi-automated tool that greatly speeds the finding of correspondences between two data schemas. (A schema is a template for data.) Harmony works with a wide variety of data models, including those expressed as XML Schema (XSD files), SQL data definition language, OWL, and spreadsheets with column headings. It is also fairly straightforward to create importers for new schema formats. Beyond matching across schemas, Harmony has been used successfully to speed matching across separately developed code lists.
Harmony is freely downloadable for use on your data integration projects. It comes both as a stand-alone version for simple matching tasks and as part of the OpenII open source information integration tool suite.
What problem does Harmony solve?
There are many reasons you might need to quickly find correspondences across different data models:
- You might be doing a mashup across different sources and need to find out what concepts they have in common and the terms they use to describe them so that you can create an integrated view.
- You might need to do the same thing in order to build a data warehouse.
- Maybe you need to quickly pull data out of a legacy database and use it to populate a message that uses some community exchange format.
- Two communities use different data standards, and you want to map between them to enable rapid data sharing across these previously separate communities. Alternatively, maybe two subcommunities have independently extended a community data standard with overlapping concepts, and the community wants to consolidate these models.
- Finding correspondences can also be a critical step toward prioritizing IT investments, answering questions such as: 1) What is a rough estimate of the effort required to get data to flow from S to T? 2) We're building a new database; does it cover enough of the concepts in legacy system L that we could begin to phase L out?
In all these cases, you need to quickly find correspondences across data models. Until recently, that's been a mostly manual process with the results captured in a spreadsheet. Harmony greatly speeds the process.
How does Harmony solve this problem?
There are two main pieces to Harmony: 1) match algorithms (called match voters) that automatically suggest candidate correspondences and 2) a sophisticated GUI that allows the user to interactively refine the computer-suggested list.
Harmony offers several match voters, each of which uses a different strategy for identifying the most likely correspondences. Match voters include:
- Bag of words. This treats both the name of a schema element and its associated text documentation as a "document" and uses information retrieval techniques to find the best matches. Names are parsed into separate tokens, stemming is done, and stop words are ignored. The stop word list can be customized.
- Edit distance. This considers only how closely the names of different elements match.
- Thesaurus. This supplements the bag of words matcher with an extensible thesaurus.
- Exact structure matcher. Sometimes it is useful to find portions of two schemas that are 100% identical. This is helpful when one suspects that the two schemas may have pulled some concepts from the same place.
- Roll your own matcher. Because Harmony is extensible and open source, it is straightforward for a developer to build a customized matcher tailored to a particular environment and to plug it into Harmony.
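To make the bag of words and edit distance strategies concrete, here is a minimal Python sketch. It is not Harmony's code or API: every function name, the default stop word list, the crude plural-stripping "stemmer", the cosine-similarity scoring, and the choice to keep the higher of the two voter scores are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical default stop word list (the real list is customizable).
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "for", "to", "in"}


def stem(token):
    """Crude plural stripping; a stand-in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def tokenize(name, doc="", stop_words=STOP_WORDS):
    """Treat an element's name plus documentation as one 'document' of stemmed tokens."""
    # Split camelCase boundaries before lowercasing; underscores are dropped by findall.
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", f"{name} {doc}")
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in stop_words]


def bag_of_words_score(tokens_a, tokens_b):
    """Cosine similarity between two token-count vectors, in [0, 1]."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(s, t):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(name_a, name_b):
    """Edit distance on names only, normalized so 1.0 means identical."""
    a, b = name_a.lower(), name_b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest_matches(source, target, threshold=0.3):
    """Score every source/target pair and return candidates, best first.

    Assumption: the pair's score is the higher of the two voter scores.
    """
    candidates = []
    for s_name, s_doc in source.items():
        for t_name, t_doc in target.items():
            bow = bag_of_words_score(tokenize(s_name, s_doc), tokenize(t_name, t_doc))
            score = max(bow, edit_similarity(s_name, t_name))
            if score >= threshold:
                candidates.append((s_name, t_name, round(score, 3)))
    return sorted(candidates, key=lambda c: -c[2])


if __name__ == "__main__":
    # Two toy schemas: element name -> documentation text.
    source = {"custName": "Name of the customer", "dob": "Date of birth"}
    target = {"customer_name": "The customer's name",
              "birthDate": "Birth date of the person"}
    for match in suggest_matches(source, target):
        print(match)
```

In this toy example the pair `dob` and `birthDate` share almost no characters, so a name-only edit distance voter scores it poorly, while the documentation text lets the bag of words voter pick it up. That is one reason it can help to offer several voters with different strategies.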
Harmony's innovative GUI offers several features that help a user more efficiently confirm or reject computer-suggested matches and add necessary annotations, including specifying transformation functions. For a description of Harmony's filters and other GUI features, please see the Harmony user guide.
Harmony also offers many other useful features. Because Harmony is part of OpenII, it includes importers for many schema models, so one can use Harmony with XML schemas, relational databases, and spreadsheets with column headings. It is also relatively straightforward to write your own importer for specialized data formats that your organization may use. In addition, Harmony includes a number of exporters (including one for spreadsheets) to facilitate visualization and further processing of mappings with other tools.