Infotools Harmoni Discover works by performing a series of statistical tests on each descriptor, comparing the cell value for each group with the value for those not in the group (the Rest). These values are compared using Bayesian statistics to calculate the probability that the group value is greater (or lower) than the Rest value.
1. Methodology and Bayesian Statistics
When comparing values using traditional null hypothesis significance testing (NHST), we initially assume there is no difference between the values. We then look at the actual difference in values and calculate how likely data at least this extreme would be under that assumption. If the resulting probability is small (typically under 0.05), we "reject" the hypothesis and conclude there is a significant difference between the values. But often the interesting question is not simply whether there is a difference, but how large the difference is.
Discover uses the magnitude of the difference between the target group and the Rest to rank the descriptors. More specifically, it uses the probability that the target value is greater than the Rest value.
Discover uses Bayesian statistics to reverse the conditional probabilities implicit in NHST and calculate the probability of a hypothesis given the data. In other words, instead of finding p(D|H) (the probability of the data given that the hypothesis is true), we can calculate p(H|D) (the probability of the hypothesis given the data), which is called the posterior probability.
For example, imagine we have two groups of 100 respondents each, with proportions on some metric of 22% and 31%. Is the difference of 9 percentage points between the two groups significant? NHST gives a p-value of about 0.2, and all we can say is that the test failed to find a significant difference.
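The p-value quoted above can be reproduced with a short sketch. This is an illustration only: it assumes the figure comes from a two-sided, pooled two-proportion test with Yates' continuity correction, which the text does not state. The function name `two_proportion_p_value` is hypothetical and is not part of Discover.

```python
import math

def two_proportion_p_value(p1, n1, p2, n2, continuity=True):
    """Two-sided p-value for H0: both groups share one proportion (pooled z-test)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = abs(p2 - p1)
    if continuity:
        # Yates' continuity correction (assumed here), floored at zero
        diff = max(0.0, diff - 0.5 * (1 / n1 + 1 / n2))
    z = diff / se
    # Two-sided normal tail area: 2 * (1 - Phi(z)) == erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2))

p_corrected = two_proportion_p_value(0.22, 100, 0.31, 100)
p_uncorrected = two_proportion_p_value(0.22, 100, 0.31, 100, continuity=False)
print(f"with continuity correction:    p = {p_corrected:.3f}")
print(f"without continuity correction: p = {p_uncorrected:.3f}")
```

Under these assumptions the corrected test gives a p-value of roughly 0.20 and the uncorrected test roughly 0.15. Either way the result is above the usual 0.05 threshold, so NHST alone reports no significant difference.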
With a similar Bayesian proportion test we can calculate a 95% credible interval for the difference of [-3%, +21%]. This interval includes 0, so ‘no difference’ is a credible conclusion, which complements the NHST conclusion. But in addition, given this data, we can determine the probability of the second percentage being larger than the first and vice versa. In this particular example, the probability of the second percentage being larger than the first happens to be 92% (the pink area in the chart below).
With Bayesian statistics, we can rank cell values by the probability that they are greater (or less) than their reference values, even if they would not be considered a “significant difference” using traditional NHST tests. This is because the ranking is based on how credible each size of difference is under the Bayesian posterior probability distribution, i.e. conditional on the data.
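A minimal sketch of such a Bayesian proportion comparison follows. It is not Discover's implementation: it assumes a uniform Beta(1, 1) prior on each proportion and estimates the posterior of the difference by Monte Carlo sampling. The function name `compare_proportions`, the number of draws and the seed are all illustrative choices.

```python
import random

def compare_proportions(successes_a, n_a, successes_b, n_b, draws=100_000, seed=1):
    """Posterior P(theta_b > theta_a) and an equal-tailed 95% credible interval
    for theta_b - theta_a, assuming independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        # With a Beta(1, 1) prior the posterior is Beta(1 + successes, 1 + failures)
        theta_a = rng.betavariate(1 + successes_a, 1 + n_a - successes_a)
        theta_b = rng.betavariate(1 + successes_b, 1 + n_b - successes_b)
        diffs.append(theta_b - theta_a)
    diffs.sort()
    prob_b_greater = sum(d > 0 for d in diffs) / draws
    lower = diffs[int(0.025 * draws)]
    upper = diffs[int(0.975 * draws) - 1]
    return prob_b_greater, (lower, upper)

# 22 of 100 in the first group, 31 of 100 in the second
prob, (lower, upper) = compare_proportions(22, 100, 31, 100)
print(f"P(second > first) = {prob:.2f}")
print(f"95% credible interval for the difference: [{lower:+.0%}, {upper:+.0%}]")
```

With these assumptions the sketch should land close to the figures in the text: a probability of about 92% and an interval of roughly -3% to +21%. A different prior would shift the numbers slightly.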
In this way, Discover ranks the descriptors by the probability that the target group values are greater than the Rest values.
Column Similarity using Robinson’s Agreement
The measurement of agreement is a special case of measuring the association between two or more variables. Agreement metrics measure the extent to which one set of values is identical to another, i.e. the extent to which they agree, rather than the extent to which they are correlated. For example, two sets of values where one set is exactly half the other set would be perfectly correlated but would not be in agreement.
Robinson’s agreement metric is calculated from the partitioning of the sums of squares (as in the analysis of variance).
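As a hedged illustration, one common formulation of Robinson's agreement is A = 1 - SS_within / SS_total, where SS_total is the sum of squared deviations of every value from the grand mean and SS_within is the sum of squared deviations of each value from the mean of its own row. The sketch below assumes that formulation; the exact variant used by Discover is not specified here, and the function names are hypothetical.

```python
import math

def robinson_agreement(*columns):
    """Robinson's A for two or more equal-length columns (assumed formulation:
    A = 1 - SS_within / SS_total)."""
    k = len(columns)
    if k < 2 or any(len(c) != len(columns[0]) for c in columns):
        raise ValueError("need at least two columns of equal length")
    n = len(columns[0])
    grand_mean = sum(sum(c) for c in columns) / (k * n)
    ss_total = sum((v - grand_mean) ** 2 for c in columns for v in c)
    if ss_total == 0:
        return 1.0  # every value is identical, so the columns agree trivially
    ss_within = 0.0
    for row in zip(*columns):
        row_mean = sum(row) / k
        ss_within += sum((v - row_mean) ** 2 for v in row)
    return 1.0 - ss_within / ss_total

def pearson(x, y):
    """Pearson correlation, included only to contrast with agreement."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [2, 4, 6, 8, 10]
half = [v / 2 for v in x]  # exactly half of x

print(f"correlation(x, half) = {pearson(x, half):.3f}")
print(f"agreement(x, half)   = {robinson_agreement(x, half):.3f}")
print(f"agreement(x, x)      = {robinson_agreement(x, x):.3f}")
```

The halved column is perfectly correlated with the original, yet its agreement is well below 1 (about 0.62 under this formulation), while a column compared with itself has an agreement of exactly 1.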