Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Higher log2m value not always producing more accurate estimates #15

Open
hossman opened this issue May 6, 2015 · 1 comment
Open

Higher log2m value not always producing more accurate estimates #15

hossman opened this issue May 6, 2015 · 1 comment

Comments

@hossman
Copy link

hossman commented May 6, 2015

My understanding from the HLL algorithm (which may be flawed, in which case please correct me and close this issue) is that for any fixed set of input values, the accuracy of any estimate from an HLL built from those values should increase as the "m" value used in the HLL increases.

Ie:

if you build 2 HLL instances, with different log2m settings, and add the exact same set of (raw) values to both, then the HLL with the larger log2m will give you the most accurate results then the HLL with a smaller log2m setting.

In my testing however, I'm frequently encountering situations where "smaller" HLL instances are producing more accurate cardinality estimates -- which I can't explain.

I've created a reproducible test case that demonstrates the problem, which i will post as a separate comment.

@hossman
Copy link
Author

hossman commented May 6, 2015

Here's a test case demonstrating the issue (requires Guava to use murmur hash)...

package net.agkn.hll;

/*
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import static org.testng.Assert.assertEquals;
import static org.testng.Assert.assertTrue;
import static org.testng.Assert.assertFalse;

import com.google.common.hash.Hashing;
import com.google.common.hash.HashFunction;

import org.testng.annotations.Test;

/**
 * Tests {@link HLL} with differnet log2m params to confirm accuracy improvements
 *
 */
public class TestIncreasedAccuracy {

    final static int NUM_VALUES = 11762;
    final static long BIG_PRIME = 982451653L;
    final static long MAX_LONG = 3628504890999L; 
    final static HashFunction HASHER = Hashing.murmur3_128();

    @Test
    public void test_log2m14_Vs_log2m16() {

        final HLL hll_14 = new HLL(14, 6);
        final HLL hll_16 = new HLL(16, 6);
        // sanity check discrepency isn't because of explicit or sparse storage
        final HLL hll_16_nos = new HLL(16, 6, 0, false/* no sparse */, HLLType.FULL);

        long longValue = MAX_LONG;

        for (int i = 1; i <= NUM_VALUES; i++) {
            final long hashedValue = HASHER.hashLong(longValue).asLong();

            hll_14.addRaw(hashedValue);
            hll_16.addRaw(hashedValue);
            hll_16_nos.addRaw(hashedValue);

            longValue -= BIG_PRIME;
        }

        // sanity check how far things have promoted
        assertEquals(HLLType.FULL, hll_14.getType(), "hll_14 type");
        assertEquals(HLLType.SPARSE, hll_16.getType(), "hll_16 type");
        assertEquals(HLLType.FULL, hll_16_nos.getType(), "hll_16_nos type");

        final long hll_14_estimate = hll_14.cardinality();
        final long hll_16_estimate = hll_16.cardinality();
        final long hll_16_nos_estimate = hll_16_nos.cardinality();

        // sanity check that sparse and full log2m=16 representations have same estimate
        assertEquals(hll_16_estimate, hll_16_nos_estimate,
                     "sparse HLL had diff estimate from full HLL (log2m=16)");

        // FAILS: log2m=16 estimate (11739) is less accurate then log2m=14 estimate (11766) 
        assertTrue(Math.abs(NUM_VALUES - hll_16_estimate) <= Math.abs(NUM_VALUES - hll_14_estimate),
                   "log2m=16 estimate ("+hll_16_estimate+") is less accurate then "+
                   "log2m=14 estimate ("+hll_14_estimate+")");
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant