Paper ID: 4062

Title: Foundations of Comparison-Based Hierarchical Clustering

In this work the authors study hierarchical clustering under the quadruplet comparison framework. The authors show that single and complete linkage are inherently comparison based and propose two variants of average linkage clustering that exploit quadruplet comparisons. An exact hierarchy recovery guarantee is provided under a planted hierarchical partition model, along with an empirical evaluation.

1. The meanings of the variables \mu, \delta, etc. are hard to interpret from the description. They are nicely summarized (and explained) in Appendix A.1; moving these descriptions into the main paper would improve readability.
2. The drawback of Theorem 2 is that the size of a pure cluster, N_0, must be of the order \sqrt{N}. For large N this is a bit restrictive; is it possible to improve N_0 to be logarithmic in N, or perhaps \Theta(1)? Otherwise, is it possible to prove that no recovery is possible unless N_0 is at least \sqrt{N}?
3. The main difference between active and passive sampling is that in passive sampling comparison queries are selected by flipping a Bernoulli random variable for each of the O(N^4) quadruplets. Many such quadruplets may be uninformative, thus requiring more comparison queries. In active sampling, a tuple (i_0, j_0) is first fixed uniformly at random and the rest of the queries are asked with respect to this chosen tuple. Uniformly choosing a single tuple will have high variance, and using more than one such uniformly chosen sample may reduce the variance, as was suggested for the empirical evaluation (Equation 2). Is it possible to quantify the reduction in variance and derive the size of the set \mathcal{R} in Equation 2? In particular, can a properly chosen \mathcal{R} result in better than O(N log N) comparison complexity?
4. In Equation 5, K seems to be the number of clusters. But K will vary at different levels of the dendrogram; is that interpretation correct?
5. The interpretation of Theorem 4 in lines 256-259 is confusing. If the number of passive comparisons is O(N^4 log N / m), as given in line 251, and m = \Omega(log N), then the number of passive comparisons required is O(N^4), i.e., all quadruplets need to be observed, yet lines 256-257 claim otherwise. Similarly, only when m = \Omega(N) (in line 257, should it be \Omega(N) instead of \Omega(N_0)?) is the required number of comparison queries O(N^3 log N). But then a pure cluster of size m = \Omega(N) is rather useless, is it not (each pure cluster is just too big)?
6. In the planted hierarchical model experiment, the size of a pure cluster is N_0 = 30. In 4-AL-I3 the initial cluster size is 3; how is an initial cluster different from a pure cluster? In the case of 4-AL, is the size of a pure cluster (N_0) still 30? In that case the number of levels is L = 3. Now, for 4-AL-I3, where the initial cluster size is 3, the number of levels has to be greater than 3 if N is fixed in both cases. Can you please clarify this issue?
7. Since there is no common element in a quadruplet comparison, as opposed to a triplet comparison, is it possible that quadruplet comparisons have inherent noise? For example, in a crowdsourcing setting, asking "Is the similarity between dolphins and whales larger than the similarity between tigers and panthers?" might yield different answers from different workers. In the absence of an omniscient oracle, how can such inherent noise/bias be tackled to achieve perfect reconstruction?
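To make the active/passive distinction in point 3 concrete, here is a minimal sketch as I understand the two schemes. This is my own illustration, not the paper's code: the oracle signature `oracle(i, j, k, l)` (answering whether d(i, j) < d(k, l)) and the sampling probability `p` are notational assumptions.

```python
import random

def passive_sample(N, p, oracle):
    """Passive scheme: each of the O(N^4) quadruplets (i, j, k, l)
    is queried independently with probability p (a Bernoulli flip),
    regardless of how informative the quadruplet is."""
    answers = {}
    for i in range(N):
        for j in range(i + 1, N):
            for k in range(N):
                for l in range(k + 1, N):
                    if {i, j} != {k, l} and random.random() < p:
                        answers[(i, j, k, l)] = oracle(i, j, k, l)
    return answers

def active_sample(N, oracle):
    """Active scheme: fix one reference pair (i0, j0) uniformly at
    random, then compare every other pair against it -- O(N^2)
    queries here, fewer if only selected pairs are compared."""
    i0, j0 = random.sample(range(N), 2)
    answers = {}
    for k in range(N):
        for l in range(k + 1, N):
            if {k, l} != {i0, j0}:
                answers[(k, l)] = oracle(i0, j0, k, l)
    return (i0, j0), answers
```

Averaging over several reference pairs, as in Equation 2 of the paper, would amount to repeating `active_sample` for each element of \mathcal{R} and combining the answers.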

- Such a query setup may be done in the following modes:
  * Active queries: the clustering algorithm generates comparison queries during its execution; once these queries are answered, it continues its execution and possibly generates more queries. Note that this may take a lot of time, since the algorithm's running time depends on the queries being answered.
  * Passive queries: the clustering algorithm decides all the queries before its execution.
- This paper specifically discusses the effectiveness of quadruplet queries in the context of the planted partition model. Single linkage, complete linkage, and average linkage are popular algorithms in the classical hierarchical clustering framework, where pairwise distances are known; these involve finding the maximum/minimum/average distance between pairs of points in two clusters. In the new model of quadruplet queries, finding the max/min is still possible using the queries, so single linkage and complete linkage can still be implemented in the new setting. However, it is not immediately clear how to implement average linkage in the comparison model. The authors suggest a method to define a distance between points/clusters using the comparison queries, and this gives an average-linkage-based algorithm for agglomerative hierarchical clustering. The effectiveness of this algorithm in the context of the planted partition model is discussed and theoretical guarantees are given. Experimental results are also given for real datasets, where comparison is made with respect to the hierarchical clustering score given by Dasgupta.
- It would be good to see a theorem for average linkage similar to Theorem 1, which is for single linkage and complete linkage.
- The comparison-based algorithm seems to give a good Dasgupta score on some real datasets. Can theoretical guarantees be obtained for this?
- Can lower bounds on the number of queries be given in the comparison setting?
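To illustrate the second bullet, that single linkage needs only comparisons and never numeric distances, here is a minimal sketch of my own (not the paper's algorithm), driven entirely by a quadruplet comparator `less(i, j, k, l)` answering whether d(i, j) < d(k, l):

```python
def min_pair(cluster_a, cluster_b, less):
    """Closest cross pair between two clusters, found with
    quadruplet comparisons only (a linear-scan tournament)."""
    best = None
    for a in cluster_a:
        for b in cluster_b:
            if best is None or less(a, b, *best):
                best = (a, b)
    return best

def single_linkage(points, less):
    """Agglomerative single linkage: repeatedly merge the two
    clusters whose closest cross pair is smallest, again using
    only quadruplet comparisons."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None  # (x, y, closest cross pair of clusters x, y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                cand = min_pair(clusters[x], clusters[y], less)
                if best is None or less(*cand, *best[2]):
                    best = (x, y, cand)
        x, y, _ = best
        merges.append((clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return merges
```

Swapping the inner comparison in `min_pair` to find the farthest cross pair gives complete linkage; an average, by contrast, has no such direct reduction to comparisons, which is exactly the gap the paper addresses.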
Overall, this paper starts a very nice discussion. It would be even stronger if more questions in this context were explored.

This paper is original and solves an important and interesting problem. It is also very well written and I loved reading it. However, I have the following questions:

1) In Theorem 1, exact conditions on \Delta/\sigma are provided, and these are improved in the subsequent theorems. However, the necessary conditions on \Delta/\sigma are not stated explicitly in the subsequent theorems. Is it possible to provide a table/plot for comparison?
2) Can you provide an intuition for why average linkage performs better? Why not try something else besides these popular linkage algorithms, such as median linkage? Median linkage would certainly be more robust to outliers. Is it known that average linkage is the best one can do in these settings?
3) Is each comparison sampled multiple times in the algorithm to smooth the noise? Note that asking the same query multiple times might not be good in practice, although it will work in theory.
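On question 3: the standard remedy is repetition with a majority vote. This is a generic sketch, not necessarily what the paper does; the noisy oracle below (flip probability `flip_prob`) is a hypothetical model for illustration.

```python
import random

def make_noisy_oracle(d, flip_prob, rng):
    """Hypothetical noisy quadruplet oracle: answers whether
    d(i, j) < d(k, l), flipping the true answer with
    probability flip_prob."""
    def ask(i, j, k, l):
        truth = d(i, j) < d(k, l)
        return truth if rng.random() >= flip_prob else not truth
    return ask

def majority_query(i, j, k, l, ask, repeats=5):
    """Repeat the same query an odd number of times and return
    the majority answer; the error probability decays
    exponentially in `repeats` (Chernoff bound) whenever
    flip_prob < 1/2."""
    votes = sum(ask(i, j, k, l) for _ in range(repeats))
    return 2 * votes > repeats
```

As the reviewer notes, this works in theory but can be problematic in practice, e.g. with crowd workers who remember (and resent) being asked the same question repeatedly.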