Leftover Hash Lemma and Noise Smudging
The material here is drawn from David Wu's notes, Purdue's cryptography course notes, an MIT note, and the Stanford CS355 notes.
This note focuses on the Leftover Hash Lemma and on noise smudging in homomorphic encryption.
```mermaid
graph LR;
  subgraph Measure of Randomness
    GP(Guess Probability) --> ME(Min Entropy)
    CP(Collision Probability) --> RE(Renyi Entropy)
  end
  ME --> LHL(Leftover Hash Lemma)
  RE --> LHL
  subgraph Hash Function
    KI(k-independence) --> UHF(Universal Hash Function)
    CLBD(Collision Lower Bound) --> UHF
  end
  UHF --> LHL
  LHL --> SLHL(Simplified LHL versions)
  LHL -->|?| NS(Noise Smudging)
  SS(Subset Sum) -->|?| NS
```
Hash Functions
Let $\mathcal{H}$ be a family of hash functions $X \to Y$. We say $\mathcal{H}$ is $k$-wise independent if, for all distinct $x_1,\dots, x_k \in X$ and all $y_1 ,\dots, y_k \in Y$,
$$
\Pr\lbrack h \overset{R}{\leftarrow} \mathcal{H} : h(x_1) = y_1,\dots, h(x_k) = y_k\rbrack = \frac{1}{|Y|^k}
$$
In particular, for any $i \in \lbrack k\rbrack$, the output $h(x_i)$ on the $i$-th input is uniform over $Y$, and the $k$ outputs are mutually independent.
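For intuition, here is a minimal Python sketch of the classic construction (a standard example under assumed parameters, not taken from the notes above): a uniformly random polynomial of degree less than $k$ over a prime field is a $k$-wise independent family, since any $k$ point-value constraints determine the polynomial by interpolation. The prime `P` and the helper names are illustrative.

```python
import random

P = 2**61 - 1  # a Mersenne prime; domain and range are Z_P

def sample_kwise(k):
    # h(x) = a_0 + a_1*x + ... + a_{k-1}*x^{k-1} mod P,
    # with coefficients drawn uniformly from Z_P
    coeffs = [random.randrange(P) for _ in range(k)]
    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner evaluation of the polynomial
            acc = (acc * x + a) % P
        return acc
    return h

h = sample_kwise(4)  # one 4-wise independent hash function Z_P -> Z_P
print(h(12345))
```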
Let $\mathcal{H}$ be a family of $X \to Y$ hash functions. $\mathcal{H}$ is an $\epsilon$-universal hash function family ($\epsilon$-UHF) if $\forall x, x' \in X, x \ne x'$,
$$
\Pr\lbrack h \overset{R}{\leftarrow} \mathcal{H}: h(x) = h(x')\rbrack \le \epsilon
$$
A 2-wise independent hash family is $\frac{1}{|Y|}$-universal: for $x \ne x'$, $\Pr\lbrack h(x) = h(x')\rbrack = \sum_{y \in Y}\Pr\lbrack h(x) = y \wedge h(x') = y\rbrack = |Y| \cdot \frac{1}{|Y|^2} = \frac{1}{|Y|}$.
But a universal hash family is not necessarily 2-wise independent; see the construction in the referenced notes for a separating example.
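As a quick sanity check (a standard example; the prime `p` and the test points below are arbitrary choices), the affine family $h_{a,b}(x) = ax + b \bmod p$ on $X = Y = \mathbb{Z}_p$ is 2-wise independent, and its empirical collision rate matches $\frac{1}{|Y|} = \frac{1}{p}$:

```python
import random

p = 101  # small prime; X = Y = Z_p

def sample_h():
    # h_{a,b}(x) = a*x + b mod p: for x != x' the map (a, b) -> (h(x), h(x'))
    # is a bijection, so the family is 2-wise independent
    a, b = random.randrange(p), random.randrange(p)
    return lambda x: (a * x + b) % p

x1, x2, trials = 3, 77, 200_000
coll = 0
for _ in range(trials):
    h = sample_h()
    coll += h(x1) == h(x2)
print(f"collision rate {coll / trials:.5f}  vs  1/|Y| = {1 / p:.5f}")
```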
Collision Lower Bound
Let $\mathcal{H}$ be a family of hash functions $X \to Y$ with $|X| > |Y|$. Then there exist distinct $x_1^\ast, x_2^\ast \in X$ such that
$$
\Pr\lbrack h \overset{R}{\leftarrow} \mathcal{H} : h(x_1^\ast) = h(x_2^\ast)\rbrack \geq \dfrac{\frac{|X|}{|Y|} - 1}{|X| - 1}
$$
Fix a hash function $h \in \mathcal{H}$ and write the range as $Y = \lbrace y_1,\dots,y_m\rbrace$. The preimage sets $\lbrace x \in X: h(x) = y_i\rbrace$ partition $X$; denote their sizes by $n_i$.
Let $\sharp \text{col}_h$ count the pairs $\lbrace x_1, x_2 \rbrace$ with $x_1, x_2 \in X$, $x_1 \neq x_2$ and $h(x_1) = h(x_2)$, so $\sharp \text{col}_h = \sum\limits^m_{i = 1}C^2_{n_i}$.
Since $\sum_i n_i = |X|$, Cauchy-Schwarz gives $\sharp\text{col}_h = \sum\limits^m_{i = 1} \dfrac{n_i^2}{2} - \dfrac{|X|}{2} \ge \dfrac{1}{2} (\dfrac{|X|^2}{|Y|} - |X|)$, with equality when every $n_i = \dfrac{|X|}{|Y|}$.
Summing over the family, $\sharp\text{col}_{\mathcal{H}} := \sum_{h \in \mathcal{H}} \sharp\text{col}_h \ge \dfrac{|\mathcal{H}|}{2} (\dfrac{|X|^2}{|Y|} - |X|)$. Now run the experiment that samples a uniformly random pair of distinct elements $(x_1,x_2)\overset{R}{\leftarrow} X$ and $h\overset{R}{\leftarrow} \mathcal{H}$, and outputs 1 if $h(x_1) = h(x_2)$:
$$
\Pr\lbrack h(x_1) = h(x_2) \mid (x_1, x_2)\overset{R}{\leftarrow} X,h\overset{R}{\leftarrow} \mathcal{H}\rbrack = \dfrac{\sharp\text{col}_{\mathcal{H}}}{|\mathcal{H}| \cdot C^2_{|X|}}\geq \dfrac{\frac{|X|}{|Y|} - 1}{|X| - 1}
$$
Since the average collision probability over pairs is at least this bound, some fixed pair $(x_1^\ast, x_2^\ast)$ attains it.
Fact: $\Pr\lbrack h \overset{R}{\leftarrow} \mathcal{H} : h(x_1^\ast) = h(x_2^\ast)\rbrack > \dfrac{1}{|Y|} - \dfrac{1}{|X|}$, since $\dfrac{\frac{|X|}{|Y|} - 1}{|X| - 1} > \dfrac{\frac{|X|}{|Y|} - 1}{|X|} = \dfrac{1}{|Y|} - \dfrac{1}{|X|}$.
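The bound can be checked exactly on a toy example (all sizes here are arbitrary choices): taking $\mathcal{H}$ to be the family of all functions from a 6-element $X$ to a 3-element $Y$, every pair of distinct inputs collides with probability exactly $\frac{1}{|Y|}$, which sits above the lower bound $\frac{6/3 - 1}{6 - 1} = 0.2$.

```python
from itertools import combinations, product

X, Y = range(6), range(3)
H = list(product(Y, repeat=len(X)))  # every function X -> Y, as a tuple of outputs
best = max(
    sum(h[x1] == h[x2] for h in H) / len(H)  # Pr over h of a collision on (x1, x2)
    for x1, x2 in combinations(X, 2)
)
bound = (len(X) / len(Y) - 1) / (len(X) - 1)
print(f"best pair collision prob = {best:.4f} >= lower bound = {bound:.4f}")
```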
Measure of Randomness
Suppose $R$ is a random variable over a set $\Omega$ with $|\Omega|= n$. Let $U$ denote the uniform distribution over $\Omega$.
Statistical Distance
The statistical distance between two random variables $X,Y$ over $\Omega$ is defined as $\Delta(X,Y) = \dfrac{1}{2} \sum\limits_{u \in \Omega} |\Pr\lbrack X = u \rbrack - \Pr \lbrack Y = u \rbrack|$. We say two distributions are $\epsilon$-close when
$$
\Delta(X ,Y) \le \epsilon \Leftrightarrow X \approx_\epsilon Y
$$
Fact: $\Delta(X,Y) = \max\limits_{T \subseteq \Omega}(\Pr\lbrack X \in T\rbrack - \Pr \lbrack Y \in T\rbrack)$.
Fact: $\forall T \subseteq \Omega, \Pr\lbrack Y \in T\rbrack \le \Pr \lbrack X \in T \rbrack + \Delta (X,Y)$.
For every randomized function $F$, $\Delta(F(X), F(Y)) \le \Delta(X,Y)$. Moreover, equality holds when $F$ is a fixed one-to-one function (for a genuinely randomized $F$, mixing over realizations can shrink the distance further).
To see this, let $F$ be a random variable over functions $f: \Omega \to V$, independent of $X$ and $Y$:
$$
\begin{aligned}
\Delta(F(X),F(Y)) &= \dfrac{1}{2}\sum_{u \in V} | \Pr \lbrack F(X) = u\rbrack -\Pr \lbrack F(Y) = u\rbrack |\newline
&\le\sum_{f \in F} \Pr\lbrack F = f\rbrack \left(\dfrac{1}{2}\sum_{u \in V} | \Pr \lbrack F(X) = u \mid F = f\rbrack -\Pr \lbrack F(Y) = u\mid F = f \rbrack |\right) \quad \text{(triangle inequality)}\newline
&= \sum_{f \in F} \Pr\lbrack F = f\rbrack \Delta(f(X), f(Y))\newline
&\le \sum_{f \in F} \Pr\lbrack F = f\rbrack \Delta(X, Y)\newline
&= \Delta(X,Y)
\end{aligned}
$$
$R$ is $\delta$-uniform if $\delta = \Delta(U,R) = \dfrac{1}{2} \sum\limits_{x \in \Omega} | \Pr\lbrack R = x \rbrack - \dfrac{1}{n} |$.
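The definitions above are easy to compute directly; here is a small Python sketch (the distributions and the merging map `f` are arbitrary illustrations) that evaluates $\Delta$, measures $\delta$-uniformity, and sanity-checks the data-processing inequality:

```python
def stat_dist(p, q):
    """Delta(X, Y) = (1/2) * sum_u |Pr[X = u] - Pr[Y = u]| for probability dicts."""
    return sum(abs(p.get(u, 0) - q.get(u, 0)) for u in set(p) | set(q)) / 2

X = {0: 0.5, 1: 0.3, 2: 0.2}
Y = {0: 0.25, 1: 0.25, 2: 0.5}
U = {u: 1 / 3 for u in range(3)}             # uniform distribution on Omega
print("delta-uniformity of X:", stat_dist(U, X))

f = lambda u: u % 2                          # a non-injective (lossy) function
fX = {0: X[0] + X[2], 1: X[1]}               # distribution of f(X)
fY = {0: Y[0] + Y[2], 1: Y[1]}               # distribution of f(Y)
assert stat_dist(fX, fY) <= stat_dist(X, Y)  # data-processing inequality
```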
Randomness Extractor
The guessing probability $\gamma(R)$ is defined by $\gamma(R) = \max\limits_{x \in \Omega}\lbrace \Pr\lbrack R = x\rbrack \rbrace$. The min-entropy $H_\infty(R) = -\log \gamma(R) = -\log \max\limits_{x \in \Omega}\lbrace \Pr\lbrack R = x\rbrack \rbrace$.
We say that $R$ is a $k$-source if $H_\infty(R) \geq k$.
Let the seed $U_d$ be uniformly distributed over $\lbrace0,1\rbrace^d$. We say that a function $f_e: \lbrace 0,1 \rbrace^n \times \lbrace 0,1 \rbrace^d \to \lbrace 0,1 \rbrace^m$ is a $(k,\epsilon)$-extractor if, for every $k$-source $X$ on $\lbrace 0,1 \rbrace^n$ independent of $U_d$,
$$
(f_e(X, U_d), U_d) \approx_\epsilon (U_m, U_d)
$$
The collision probability $\kappa(R) = \sum\limits_{x \in \Omega} \Pr \lbrack R = x \rbrack^2$. The Renyi entropy $H_2(R) = -\log \kappa(R) = -\log \sum\limits_{x \in \Omega} \Pr \lbrack R = x \rbrack^2$.
Define $p_R \in \lbrack 0,1\rbrack^{n}$ to be the vector of probabilities $(\Pr\lbrack R = x\rbrack)_{x \in \Omega}$.
Then the guessing probability is the $l_\infty$-norm, $\gamma(R) = \lVert p_R \rVert_\infty$, and the collision probability is the squared $l_2$-norm, $\kappa(R) = \lVert p_R \rVert_2^2$.
Fact: for the uniform distribution $U$ over $\Omega$, $\gamma(U) = \kappa(U) = \dfrac{1}{n}$.
Fact: for $\delta$-uniform $R$ on $\Omega$, then $\dfrac{1}{n} + \dfrac{4\delta^2}{n} \le \kappa(R) \le \gamma(R) \le \dfrac{1}{n} + \delta$.
If $R$ is $\delta$-uniform, then $\sum\limits_{x \in \Omega} |\Pr \lbrack R = x \rbrack - \dfrac{1}{n}| = 2\delta$ by definition.
For $\gamma(R) \le \dfrac{1}{n} + \delta$: suppose toward a contradiction that $\gamma(R) > \dfrac{1}{n} + \delta$, and let $x^\ast = \arg\max_{x \in \Omega} \Pr\lbrack R = x\rbrack$; then
$$
\begin{aligned}
\sum\limits_{x \in \Omega} |\Pr \lbrack R = x \rbrack - \dfrac{1}{n}| &= \sum_{x \in \Omega, x \ne x^\ast} |\Pr \lbrack R = x \rbrack - \dfrac{1}{n}| + |\Pr \lbrack R = x^\ast \rbrack - \dfrac{1}{n}|\newline
&\ge \Big| \sum_{x \in \Omega, x \ne x^\ast} (\Pr \lbrack R = x \rbrack - \dfrac{1}{n})\Big| + |\gamma(R) - \dfrac{1}{n}|\newline
&= \Big| \dfrac{1}{n} - \gamma(R)\Big| + |\gamma(R) - \dfrac{1}{n}| = 2\,|\gamma(R) - \dfrac{1}{n}| > 2 \delta
\end{aligned}
$$
This contradicts $\delta$-uniformity, so $\gamma(R) \le \dfrac{1}{n} + \delta$. For $\kappa(R) \le \gamma(R)$, we have
$$
\begin{aligned}
\sum_{x \in \Omega} \Pr \lbrack R = x \rbrack^2 &\le \max_{x \in \Omega} \Pr \lbrack R = x\rbrack (\sum_{x \in \Omega} \Pr \lbrack R = x \rbrack)\newline
&= \max_{x \in \Omega} \Pr \lbrack R = x\rbrack
\end{aligned}
$$
For $\dfrac{1}{n} + \dfrac{4\delta^2}{n} \le \kappa(R)$: write $\Pr\lbrack R = x \rbrack = \dfrac{1}{n} + \delta_x$, and note that $\sum_{x \in \Omega} \delta_x = 0$. Then
$$
\begin{aligned}
\kappa(R) &= \sum\limits_{x \in \Omega} \Pr \lbrack R = x \rbrack^2 = \sum_{x \in \Omega} (\dfrac{1}{n} + \delta_x)^2\newline
&=\sum_{x \in \Omega} (\dfrac{1}{n^2} + \dfrac{2\delta_x}{n} + \delta_x^2) = \dfrac{1}{n} + \sum_{x \in \Omega} \delta_x^2\newline
&\ge\dfrac{1}{n} + \dfrac{1}{n}(\sum_{x \in \Omega}|\delta_x|)^2 = \dfrac{1}{n} + \dfrac{4\delta^2}{n}
\end{aligned}
$$
where the last inequality is Cauchy-Schwarz together with $\sum\limits_{x \in \Omega}|\delta_x| = 2\delta$.
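These inequalities are easy to sanity-check numerically; a sketch with a randomly generated distribution (the size `n` is arbitrary):

```python
import math, random

n = 16
w = [random.random() for _ in range(n)]
p = [v / sum(w) for v in w]                 # a random distribution on n points

gamma = max(p)                              # guessing probability (l_inf norm)
kappa = sum(v * v for v in p)               # collision probability (squared l_2 norm)
delta = sum(abs(v - 1 / n) for v in p) / 2  # distance from uniform

assert 1 / n + 4 * delta**2 / n <= kappa <= gamma <= 1 / n + delta
print(f"H_inf = {-math.log2(gamma):.3f} <= H_2 = {-math.log2(kappa):.3f}")
```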
Leftover Hash Lemma
Let $\mathcal{H} = \lbrace h : X \to Y \rbrace$ be an $\epsilon$-universal hash family, let $R$ be a random variable over $X$, and sample $h \overset{R}{\leftarrow} \mathcal{H}$ independently of $R$. Then $(h, h(R))$ is $\delta$-uniform over $\mathcal{H}\times Y$ with
$$
\delta \le \dfrac{1}{2}\sqrt{|Y| \cdot (\kappa(R) + \epsilon) - 1}
$$
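In a tiny regime the lemma can be verified exactly (the parameters and the choice of family are illustrative): take $\mathcal{H}$ to be all GF(2)-linear maps $h_A(x) = Ax$ with $A \in \lbrace0,1\rbrace^{m \times n}$, which form a $\frac{1}{|Y|}$-universal family since $\Pr_A\lbrack A(x \oplus x') = 0\rbrack = 2^{-m}$ for $x \ne x'$, and let $R$ be uniform over a random 16-element subset of $\lbrace0,1\rbrace^n$, so $\kappa(R) = \frac{1}{16}$:

```python
import itertools, random

n, m = 6, 2
Y = list(itertools.product([0, 1], repeat=m))
support = random.sample(list(itertools.product([0, 1], repeat=n)), 16)

def h(A, x):  # h_A(x) = A x over GF(2)
    return tuple(sum(a * b for a, b in zip(row, x)) % 2 for row in A)

matrices = list(itertools.product([0, 1], repeat=n * m))
delta = 0.0
for flat in matrices:
    A = [flat[i * n:(i + 1) * n] for i in range(m)]
    for y in Y:
        p_hy = sum(h(A, x) == y for x in support) / len(support)  # Pr[h_A(R) = y]
        # atom (A, y) of (h, h(R)) versus the uniform distribution on H x Y
        delta += abs(p_hy / len(matrices) - 1 / (len(matrices) * len(Y))) / 2

kappa, eps = 1 / len(support), 1 / len(Y)
bound = 0.5 * (len(Y) * (kappa + eps) - 1) ** 0.5
print(f"exact delta = {delta:.4f} <= LHL bound = {bound:.4f}")
```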
Some Simplified Versions
The CS355 course notes give a simplified version of the LHL. Let $\mathcal{H} = \lbrace H(\text{pk}, \cdot) : \mathcal{X} \to \mathcal{Y}\rbrace_{\text{pk} \in \mathcal{K}}$ be a universal family keyed by $\text{pk}$, and let $\mathcal{D_X}$ be some distribution on $\mathcal{X}$ with guessing probability $\gamma(\mathcal{D_X}) = \gamma$. Then for every (even unbounded) adversary $\mathcal{A}$,
$$
\left| \Pr\lbrack \mathcal{A}(\text{pk}, H(\text{pk}, x)) = 1 \mid \text{pk} \overset{R}{\leftarrow} \mathcal{K}, x\leftarrow \mathcal{D_X} \rbrack - \Pr\lbrack \mathcal{A}(\text{pk}, y) = 1 \mid \text{pk}\overset{R}{\leftarrow}\mathcal{K}, y \overset{R}{\leftarrow}\mathcal{Y} \rbrack \right| \le \dfrac{1}{2}\sqrt{\gamma\cdot |\mathcal{Y}|}
$$
This follows from the LHL above by taking $\epsilon = \frac{1}{|\mathcal{Y}|}$ and $\kappa(\mathcal{D_X}) \le \gamma(\mathcal{D_X}) = \gamma$.
Then, for a prime modulus $q$, if we sample $A \overset{R}{\leftarrow} \mathbb{Z}_q^{n \times m}, x \overset{R}{\leftarrow} \lbrace 0,1 \rbrace^m, y \overset{R}{\leftarrow} \mathbb{Z}_q^n$, the map $x \mapsto A\cdot x$ is a universal family (for $x \ne x'$, $\Pr_A\lbrack Ax = Ax'\rbrack = q^{-n}$), so once $m$ is sufficiently larger than $n\log_2 q$ we have $(A, A\cdot x) \approx_{\text{stat}} (A, y)$.
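Plugging in concrete numbers shows how $m$ must scale (the parameters below are arbitrary examples): with $|\mathcal{Y}| = q^n$ and $\gamma = 2^{-m}$, the bound is $\frac{1}{2}\sqrt{q^n \cdot 2^{-m}}$, i.e. about $2^{(n\log_2 q - m)/2 - 1}$.

```python
import math

def lhl_log2_bound(n, q, m):
    # log2 of (1/2) * sqrt(q**n * 2**-m): the statistical distance bound
    return 0.5 * (n * math.log2(q) - m) - 1

# e.g. taking m = n*log2(q) + 256 pushes the distance below 2^-129
print(lhl_log2_bound(n=512, q=2**15, m=512 * 15 + 256))
```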
Noise Smudging
This technique appears in Gentry's thesis, where it is used to achieve (statistical) circuit privacy.
(Statistical) Circuit Privacy: let $\mathcal{E}$ be an encryption scheme with permitted circuits $C_{\mathcal{E}}$. $\mathcal{E}$ is circuit private if, for any key pair $(\text{sk},\text{pk})$ output by $\text{KeyGen}_{\mathcal{E}}(\lambda)$, any circuit $C\in C_{\mathcal{E}}$, and any fixed ciphertexts $\Psi = \langle \psi_1,\dots,\psi_t\rangle$ that are in the image of $\text{Encrypt}_{\mathcal{E}}$ on plaintexts $\pi_1,\dots,\pi_t$, we have
$$
\text{Encrypt}_{\mathcal{E}}(\text{pk},C(\pi_1,\dots,\pi_t)) \approx_{\text{stat}} \text{Evaluate}_{\mathcal{E}}(\text{pk},C,\Psi)
$$
meaning the distributions are statistically indistinguishable.
The main idea: every ciphertext $\psi_i \in \Psi$ carries some noise of magnitude at most $\rho$, and the ciphertext obtained by homomorphically evaluating $C$ carries larger noise $\rho_f$. We add one more, significantly larger noise $\rho^\ast \gg \rho_f > \rho$ to the evaluated ciphertext; this drowns out ("smudges") the circuit-dependent noise, so the result is distributed essentially like a fresh encryption of $C(\pi_1,\dots,\pi_t)$ with the same large noise added. Quantitatively (the standard smudging lemma): if $|e| \le B$ and $e^\ast$ is uniform over the integers in $\lbrack -B', B' \rbrack$, then $\Delta(e^\ast, e + e^\ast) \le \frac{B}{B'}$, so choosing the ratio $B'/B$ superpolynomial in $\lambda$ makes the two distributions statistically indistinguishable.
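The smudging distance can be computed directly (the noise bounds below are arbitrary illustrations): for a fixed noise $e$ and $e^\ast$ uniform over the integers in $\lbrack -B', B'\rbrack$, the distributions of $e^\ast$ and $e + e^\ast$ differ only on the $2|e|$ boundary points, so the distance shrinks as $B'$ grows.

```python
def smudge_dist(e, Bp):
    """Exact Delta(e*, e + e*) for e* uniform over the integers in [-Bp, Bp]."""
    N = 2 * Bp + 1
    s1 = set(range(-Bp, Bp + 1))         # support of e*
    s2 = set(range(e - Bp, e + Bp + 1))  # support of e + e*
    # each point in the symmetric difference contributes |1/N - 0| to the sum
    return len(s1 ^ s2) / (2 * N)

for Bp in (10**2, 10**4, 10**6):
    print(f"B' = {Bp:>9}: Delta = {smudge_dist(5, Bp):.2e}  (bound |e|/B' = {5 / Bp:.2e})")
```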