HyperLogLog - an algorithm for the count-distinct problem

:light_bulb: What is HyperLogLog?

HyperLogLog (HLL) is a probabilistic algorithm that allows you to estimate the number of unique items in a set (its “cardinality”) using a very small, fixed amount of memory.

For example, you can use it to answer questions like:

  • “How many unique players joined my game today?”
  • “How many unique votes were cast in this poll?”
  • “How many unique items were favorited this week?”

…all while using just a few kilobytes of memory, even if the number of items is in the millions or billions.

:thinking: Why use HyperLogLog?

Counting unique items seems simple, but it can be very difficult at scale.

Imagine you want to count the unique players who join your game in a day. The “simple” way is to store every unique UserId in a table and check #table at the end. If you have 10,000 unique players, your table is small. If you have 10 million unique players, that table becomes enormous, consuming vast amounts of memory.

This is commonly known as the count-distinct problem in computer science.

HyperLogLog solves this problem by trading perfect accuracy for massive efficiency.

  • :rocket: Extremely Fast: Adding an item is just a quick hashing operation and a memory check. It’s a constant-time operation O(1) that doesn’t slow down as the count grows.
  • :floppy_disk: Incredibly Memory-Efficient: This is the key benefit. A HyperLogLog counter uses a small, fixed amount of memory (just a few kilobytes) whether you add 1,000 items or 1,000,000,000 items.
  • :handshake: Mergeable: You can take two separate HLL counters (e.g., from different game servers) and merge them together. The resulting counter will give you the estimated unique count across both sets. This is extremely powerful for distributed data.
  • :chart_increasing: Tunable Accuracy: It’s an estimate, but it’s a very good one. You can control the accuracy vs. memory trade-off using the precision parameter. A higher precision gives you a more accurate estimate at the cost of more memory.

It’s the perfect tool for when you need a “good enough” count of unique items at a massive scale, without sacrificing performance or memory.

Credits

Thanks to @XoifailTheGod for helping optimize and test the library.

10 Likes

How funny, I was just looking at something like this the other day. Thank you for creating this. will be using it in an upcoming project.

1 Like