What is Unity?

Unity semi-automatically produces a common vocabulary by aligning the schema elements of a set of source schemas. This common vocabulary consists of a list of canonical terms the source schemas agree on, with mappings back to the corresponding source terms. Stated differently: given a set of N source schemas, Unity produces a new schema V, which is a flat list of canonical terms, and a set of N new mappings, one from V back to each source. Canonical terms can be sorted by how many sources participate in them: terms mapped to source terms in all N sources come first, terms mapped to all but one source come second, and so on down to terms found in only a single source.
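
The ordering by source participation can be illustrated with a small sketch. This is not Unity's data model: the CanonicalTerm class, its fields, the sort_by_participation helper, and the sample terms are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CanonicalTerm:
    """Hypothetical stand-in for one entry of the vocabulary V."""
    name: str
    # Maps a source schema name to the term used in that schema.
    sources: Dict[str, str] = field(default_factory=dict)


def sort_by_participation(vocabulary: List[CanonicalTerm]) -> List[CanonicalTerm]:
    """Order terms so those mapped to the most sources come first.

    Ties are broken alphabetically by canonical name so the result is stable.
    """
    return sorted(vocabulary, key=lambda t: (-len(t.sources), t.name))


# Illustrative data for three source schemas A, B and C.
vocabulary = [
    CanonicalTerm("Color", {"A": "Colour"}),
    CanonicalTerm("Car", {"A": "Auto", "B": "Car", "C": "Automobile"}),
    CanonicalTerm("Owner", {"A": "Owner", "B": "Proprietor"}),
]

ranked = sort_by_participation(vocabulary)
for term in ranked:
    print(term.name, len(term.sources))
```

In this sample, "Car" is listed first because all three sources map to it, and "Color" is listed last because only one source does.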

What problem does Unity solve?

Consider a group of potential sharing partners. Where do they begin? A productive starting point is to quickly determine the set of terms their data models hold in common. Simply performing manual pairwise matches (as in Harmony) between source schemas does not yield this answer. Unity combines evidence from term matches across schemas to infer likely synonym sets, so it can also be considered a thesaurus-generation tool.

How does Unity solve this problem?

Unity uses interschema mappings, such as those generated by Harmony. Given N schemas, a vocabulary can be based on up to N(N-1)/2 pairwise mappings; at a minimum, a spanning set is required, that is, enough mappings to connect every schema to the others directly or indirectly. These mappings are conveniently managed by the OpenII repository. Based on the available mappings between source schemas, Unity generates a set of "synsets", each consisting of a) a set of terms that strongly match each other, and b) a canonical term for the synset. For example, a synset could consist of {"Car", "Auto", "Car", "Automobile", "Car"}, where the first term is canonical and the other four are the terms from four schemas that belong to that synset. Synsets are generated efficiently using a modified disjoint set forest algorithm. Canonical terms can be chosen algorithmically: Unity allows schemas to be ranked by "authority", and the element from the highest-authority schema becomes the canonical term. Canonical terms can also be chosen manually, and synsets can be merged and split manually.
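
To make the synset step concrete, here is a minimal sketch that groups matched terms with a plain disjoint set forest (union-find with path compression) and then picks the canonical term by schema authority. It is an assumption-laden illustration, not Unity's implementation: the text does not say how Unity modifies the algorithm, and the build_synsets function, the (schema, term) pair representation, and the schema names S1 to S4 are hypothetical.

```python
from collections import defaultdict


def build_synsets(matches, authority):
    """Group strongly matching terms into synsets.

    matches:   iterable of ((schema, term), (schema, term)) pairs that are
               assumed to have already been judged strong matches.
    authority: list of schema names, highest authority first.
    Returns a list of (canonical_term, members) tuples.
    """
    parent = {}

    def find(x):
        # Locate the root of x's tree, registering x on first sight.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            next_node = parent[x]
            parent[x] = root
            x = next_node
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in matches:
        union(a, b)

    # Collect the members of each connected component.
    groups = defaultdict(list)
    for element in list(parent):
        groups[find(element)].append(element)

    # Lower index means higher authority; unknown schemas rank last.
    rank = {schema: i for i, schema in enumerate(authority)}
    synsets = []
    for members in groups.values():
        members.sort(key=lambda m: (rank.get(m[0], len(rank)), m[1]))
        canonical = members[0][1]  # term from the highest-authority schema
        synsets.append((canonical, members))
    return synsets


# Illustrative matches across four hypothetical schemas.
matches = [
    (("S1", "Auto"), ("S2", "Car")),
    (("S2", "Car"), ("S3", "Automobile")),
    (("S3", "Automobile"), ("S4", "Car")),
    (("S1", "Colour"), ("S2", "Color")),
]
authority = ["S2", "S1", "S3", "S4"]

synsets = build_synsets(matches, authority)
for canonical, members in synsets:
    print(canonical, members)
```

Because S2 is ranked as the highest authority in this example, its term "Car" becomes the canonical term for the synset that also contains "Auto" and "Automobile". Note that three mappings (S1-S2, S2-S3, S3-S4) are enough to connect all four schemas here, although up to 4(4-1)/2 = 6 pairwise mappings could contribute evidence.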