Frequently Asked Questions

What does the TF-IDF score mean in the Affinity "View terms shared by schemas" dialog?

TF-IDF is a measure of how well a particular term characterizes or distinguishes an Affinity cluster. Viewing terms with higher TF-IDF scores can provide insights into the semantics of the cluster. (Note that we are extending the traditional use of TF-IDF with text documents to apply to schema clusters.)

TF (term frequency) is the frequency of the term in the chosen cluster: TF = (# of schemas in the cluster containing the term) / (# of schemas in the cluster). Note that we disregard multiple occurrences of a term within one schema. (This is a normalization that extends TF to clusters.)

IDF (inverse document frequency) is a measure of the general importance of the term across all schemas. It is the log of the inverse of the fraction of schemas containing the term: IDF = log( (# of schemas overall) / (# of schemas overall containing the term) ).

TF-IDF = TF * IDF. It is always greater than or equal to 0, because TF lies between 0 and 1 and IDF is the log of a ratio that is at least 1. In practice it is usually much less than 10.
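
The formulas above can be sketched in a few lines of Python. This is only an illustration, not Affinity's actual code: the function name tf_idf, the representation of each schema as a set of terms, and the use of the natural log are all assumptions (the base of the log is not stated above).

```python
import math

def tf_idf(term, cluster_schemas, all_schemas):
    """Hypothetical helper: score one term for one cluster.

    cluster_schemas and all_schemas are lists of sets of terms,
    one set per schema. Using sets means repeated occurrences of a
    term inside a single schema are disregarded, as described above.
    """
    if not cluster_schemas:
        return 0.0

    # TF: fraction of schemas in the cluster that contain the term.
    in_cluster = sum(1 for schema in cluster_schemas if term in schema)
    tf = in_cluster / len(cluster_schemas)

    # IDF: log of (schemas overall / schemas overall containing the term).
    overall = sum(1 for schema in all_schemas if term in schema)
    if overall == 0:
        return 0.0  # term appears nowhere, so there is nothing to score
    idf = math.log(len(all_schemas) / overall)  # natural log is an assumption

    return tf * idf

# Made-up example data: four schemas, two of them in the chosen cluster.
s1 = {"patient", "name", "dob"}
s2 = {"patient", "address"}
s3 = {"order", "name"}
s4 = {"order", "price"}
all_schemas = [s1, s2, s3, s4]
cluster = [s1, s2]

patient_score = tf_idf("patient", cluster, all_schemas)  # TF = 2/2, IDF = log(4/2)
name_score = tf_idf("name", cluster, all_schemas)        # TF = 1/2, IDF = log(4/2)
```

In this made-up data, "patient" appears in every schema of the cluster and nowhere else, so it scores higher than "name", which appears in only half of the cluster and equally often outside it. A term that occurs in every schema overall gets IDF = log(1) = 0, and therefore a TF-IDF of 0.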