In [1]:

```
## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://localhost:5432/baseball
```

In [2]:

```
%sql CREATE EXTENSION IF NOT EXISTS plpython3u; -- import extension
```

Running query in 'postgresql://localhost:5432/baseball'

Out[2]:

- *CountMin* sketches for `SELECT COUNT(*) ... WHERE col = k`
  - Also `WHERE col BETWEEN k AND l`
- *HyperLogLog* sketches for `SELECT COUNT(DISTINCT col)`

`SELECT COUNT(*) ... WHERE col = x`

Incredibly simple idea!

Intuition:

- Pick a hash function $h$ that maps the data type of `col` to integers
- Create an array $A$ of $b$ counters
- For each value $v$ in `col`, increment the counter at $A[h(v)]$
- To compute the count of rows `WHERE col = x`, return $A[h(x)]$
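The steps above fit in a few lines of Python. This is illustrative only: Python's built-in `hash` stands in for $h$, and $b = 1024$ is an arbitrary choice.

```python
# Illustrative sketch of the single-hash counter array described above.
# Python's built-in hash() stands in for the hash function h.
b = 1024  # number of counters

def build(values):
    A = [0] * b
    for v in values:
        A[hash(v) % b] += 1   # increment the counter at A[h(v)]
    return A

def estimate(A, x):
    # estimate of SELECT COUNT(*) ... WHERE col = x
    return A[hash(x) % b]

A = build([1, 2, 2, 3, 3, 3])
print(estimate(A, 3))   # never less than the true count
```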

How bad is our estimate?

- Could be *too high*, due to hash "collisions" (never too low!)
  - If we collide with key $y \ne x$, we return the sum of their frequencies
  - $A[h(x)] = f_x + \sum_{y \in S}f_y \hspace{2em}$ where $S = \{y \ne x : h(y) = h(x)\}$
- We expect $x$ to collide with $1/b$ of the values in our data
  - So we expect $A[h(x)] = f_x + \frac{1}{b}\sum_{y \ne x}f_y \le f_x + \frac{n}{b}$
  - after all, the sum of all frequencies is $n$

- If we want to bound our expected overestimate by $\epsilon n$, just choose $b = 1/\epsilon$.
- Note that $b$ has no dependence on $n$, the size of our dataset: scales great!

Why not stop here?

- We got the expected error to an $\epsilon$ factor!
- But the *variance* in the error can be high: some buckets will be "unlucky"!
- We want the probability that the error exceeds $\epsilon n$ to be at most $\delta$.

- So let's use the idea of independent samples to help.

- Repeat the idea above with $l$ pairwise-independent hash functions.
  - Easy to parameterize these from a "family"
- The CountMin Sketch is an array of $l$ rows and $b$ columns
  - each row "belongs" to one of the hash functions $h_i$
- To answer a query, return the *minimum* across rows: $\min_i A_i[h_i(x)]$
  - every row overestimates, so the smallest answer is the best
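A toy CountMin sketch along these lines. This is a sketch under assumptions, not a reference implementation: keys are restricted to integers, and the hash family $h_i(x) = ((a_i x + c_i) \bmod p) \bmod b$ is one standard pairwise-independent choice.

```python
import random

class CountMin:
    # Toy CountMin sketch (integer keys only). The family
    # h_i(x) = ((a_i*x + c_i) mod p) mod b is pairwise independent.
    def __init__(self, l=8, b=1024, p=2**31 - 1, seed=42):
        rng = random.Random(seed)
        self.p, self.b = p, b
        self.coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(l)]
        self.A = [[0] * b for _ in range(l)]   # l rows, b counters each

    def _h(self, i, x):
        a, c = self.coeffs[i]
        return ((a * x + c) % self.p) % self.b

    def add(self, x):
        for i in range(len(self.A)):
            self.A[i][self._h(i, x)] += 1

    def count(self, x):
        # every row overestimates, so the minimum is the tightest estimate
        return min(self.A[i][self._h(i, x)] for i in range(len(self.A)))

cm = CountMin()
for v in [7] * 5 + [9] * 3:
    cm.add(v)
print(cm.count(7), cm.count(9))   # each at least the true count (5 and 3)
```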

Easy to write

- You can find lots of reference implementations online in Java, C++, Python, etc.
- PostgreSQL has a package called Apache MADLib that provides CountMin sketches and lots of Stat/ML routines in SQL

Scales to arbitrarily large data sets!

In practice, all the hashing at construction time can be slow

- Needs to be paid off by many queries
- Parallelizes trivially though!

PostgreSQL/MADlib example below

- MADlib CMsketch is set to $l = 8, b = 1024$. With 64-bit integers this is just 64 KB!
  - $\epsilon = 2/b \approx .002$
  - $\delta = \left(\frac{1}{2}\right)^l \approx .004$ (i.e., $99.6\%$ probability within $\epsilon n$!)

In [3]:

```
## MADlib is only compatible with PostgreSQL 12
## My PostgreSQL 12 installation didn't have Python3
## So let's switch connections now
%reload_ext sql
%sql postgresql://localhost:5433/baseball
```

Connecting and switching to connection postgresql://localhost:5433/baseball

RuntimeError: (psycopg2.OperationalError) connection to server at "localhost" (127.0.0.1), port 5433 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (::1), port 5433 failed: Cannot assign requested address
Is the server running on that host and accepting TCP/IP connections?
(Background on this error at: https://sqlalche.me/e/20/e3q8)

In [ ]:

```
%%sql
WITH data AS (SELECT floor(random()*10)::integer AS class,
                     floor(exp(log(random()*100)))::integer AS a1
              FROM generate_series(1,100000)),
     sketch AS (SELECT madlib.cmsketch(class) AS class_cm,
                       madlib.cmsketch(a1) AS a1_cm
                FROM data)
SELECT 'sketch' AS method,
       madlib.cmsketch_count(class_cm, 7) AS class_7,
       madlib.cmsketch_count(class_cm, 9) AS class_9,
       madlib.cmsketch_count(a1_cm, 3) AS a1_3,
       madlib.cmsketch_count(a1_cm, 7) AS a1_7
FROM sketch
UNION ALL
SELECT 'actual',
       sum(CASE WHEN class = 7 THEN 1 ELSE 0 END),
       sum(CASE WHEN class = 9 THEN 1 ELSE 0 END),
       sum(CASE WHEN a1 = 3 THEN 1 ELSE 0 END),
       sum(CASE WHEN a1 = 7 THEN 1 ELSE 0 END)
FROM data;
```

In [ ]:

```
%%sql
WITH sketch AS (SELECT madlib.cmsketch(hr) AS hr_cm FROM batting)
SELECT 'sketch' as method,
madlib.cmsketch_count(hr_cm, 40)
FROM sketch
UNION ALL
SELECT 'actual',
COUNT(*)
FROM batting
WHERE hr = 40;
```

`SELECT COUNT(DISTINCT col) FROM table`

Problem:

- Imagine you've seen 1 billion distinct values so far
  - To see if a new row has a new value, we need to remember the previous 1 billion!
- Originally studied to classify network packet streams at line rate
  - How many unique source/destination pairs have we seen?
- Can we do this in a small amount of space?

HyperLogLog is one of many solutions to this problem.

- Jelani Nelson is a co-author on the first asymptotically space- and time-optimal algorithm for this problem.
- Recently completely resolved the asymptotic space complexity of this problem!

- For each value $v$ we see, compute a hash $h(v)$
  - This effectively generates a uniformly random string of bits
- How many leading 0's on the left do we expect?

In [ ]:

```
# https://www.geeksforgeeks.org/number-of-leading-zeros-in-binary-representation-of-a-given-number/
def countZeros(x):
    # Keep shifting x by one until
    # leftmost bit does not become 1.
    total_bits = 32
    res = 0
    while ((x & (1 << (total_bits - 1))) == 0):
        x = (x << 1)
        res += 1
    return res

import pandas as pd
s = pd.Series([countZeros(hash(i)) for i in range(1, 1000)])
s.plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
```

Run the cell above a few times and you'll see:

- About 1/2 the values have no leading zeros. Makes sense!
  - These are random bit strings, so the odds that the first bit is `1` is $50\%$
- Decays by a factor of 2 for each bar to the right. Makes sense!
  - Odds that the first 2 bits are `10` is $25\%$. Etc.
- Continuing to divide by 2, we expect the rightmost non-zero bar to be at $\rho = \log_2(n) - 1$
  - Where $n$ is the total number of values
- But note: adding *duplicate* values raises all bars but doesn't add any new bars to the right!
- So $2^{\rho + 1}$ is a good estimator of `COUNT(DISTINCT)`!!

- But ... lots of variance across trials.
- HyperLogLog breaks the input into subsets and uses the harmonic mean of the resulting estimates
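A drastically simplified HyperLogLog-style estimator in Python, to make the bucketing plus harmonic mean concrete. This is illustrative only: real HyperLogLog adds small- and large-range corrections, and the bias constant $\alpha_{64} \approx 0.709$ is borrowed from the paper for $m = 64$ buckets.

```python
import hashlib

def rho(w, bits):
    # position of the leftmost 1-bit in a bits-wide word (1-based);
    # equivalently, the number of leading zeros plus one
    return bits - w.bit_length() + 1

def hll_estimate(values, log2_m=6):
    m = 1 << log2_m          # number of buckets (64 here)
    M = [0] * m
    for v in values:
        # 32 hash bits from SHA-1 (a stand-in for a "random" hash)
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:4], 'big')
        j = h >> (32 - log2_m)               # first log2_m bits pick a bucket
        w = h & ((1 << (32 - log2_m)) - 1)   # remaining bits
        M[j] = max(M[j], rho(w, 32 - log2_m))
    # harmonic mean of the per-bucket estimates 2^{M[j]}
    raw = m * m / sum(2.0 ** -x for x in M)
    return 0.709 * raw   # alpha_64: bias correction for m = 64

print(hll_estimate(range(10000)))   # roughly 10000, +/- a few percent
```

With $m$ buckets the relative error shrinks like $1/\sqrt{m}$, which is why real deployments use hundreds or thousands of buckets.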

In [ ]:

```
%%sql
WITH data AS (SELECT floor(random()*10)::integer AS class,
                     floor(exp(log(random()*100)))::integer AS a1
              FROM generate_series(1,100000)),
     approx AS (SELECT madlib.fmsketch_dcount(class) AS class_fm,
                       madlib.fmsketch_dcount(a1) AS a1_fm
                FROM data)
SELECT 'sketch' AS method, *
FROM approx
UNION ALL
SELECT 'actual',
       COUNT(DISTINCT class), COUNT(DISTINCT a1)
FROM data;
```

Sketches are like materialized views that can approximate the answer to a class of queries.

- Like materialized views, they take time to build, and need to be kept "fresh"
- But they're typically *tiny* and insensitive to input size, which is very cool
  - Can pass them around for all kinds of tricks: e.g. ship to apps in browser or phone, etc.
- Sketches typically work as streaming algorithms, which is nice
  - Most support incremental additions
  - Some support deletions
- Many can be computed in parallel
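A toy illustration of why sketches parallelize: if two shards build counter arrays with the *same* hash functions, merging them is just element-wise addition. Here plain `x % b` stands in for a hash function, purely for illustration.

```python
# Two shards build tiny counter arrays independently, then merge.
b = 16                       # number of counters (toy size)

def add(A, x):
    A[x % b] += 1            # x % b stands in for a real hash

shard1, shard2 = [0] * b, [0] * b
for v in [3, 3, 5]:
    add(shard1, v)
for v in [3, 5, 5]:
    add(shard2, v)

# merging = element-wise addition of the counter arrays
merged = [c1 + c2 for c1, c2 in zip(shard1, shard2)]
print(merged[3], merged[5])  # counts over the union of both shards
```

The same element-wise merge works for full CountMin arrays (and an analogous bucket-wise `max` works for HyperLogLog registers), which is what makes these sketches friendly to partitioned, streaming pipelines.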

- There are more sketch types and variants to handle more classes of queries
- "Heavy Hitter" queries (return the top k most popular values in the stream)
- Exists queries (the earliest sketch: Bloom Filters)
- Count-Range queries
- Histograms
- Approximate data cubes
- Etc.
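As a taste of the "exists" variant, here is a toy Bloom filter. Illustrative only: salted copies of Python's built-in `hash` stand in for $k$ independent hash functions, and the sizes are arbitrary.

```python
class Bloom:
    # Toy Bloom filter: m bits, k hash functions.
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, x):
        # salt the built-in hash with i to simulate k hash functions
        return [hash((i, x)) % self.m for i in range(self.k)]

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = True

    def maybe_contains(self, x):
        # False means definitely absent; True means "probably present"
        return all(self.bits[pos] for pos in self._positions(x))

bf = Bloom()
for i in range(1, 11):
    bf.add(i)
print(bf.maybe_contains(5))   # added keys are always found (no false negatives)
```

Like the other sketches, the filter never misses a key it has seen; the price of the small bit array is a tunable false-positive rate for absent keys.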

Sketches are mostly used in high-volume streaming settings

- The approximation/performance tradeoff has to be acceptable
- You need to run LOTS of queries on the sketch to amortize the cost of hashing
- Not typically supported in database systems even today

- Prof. Nelson's graduate course at Berkeley
- Book: Cormode/Garofalakis/Haas/Jermaine, *Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches*