Sum-of-squares: proofs, beliefs, and algorithms — Boaz Barak and David Steurer

Digression to boosting, experts, dense models, and their quantum counterparts

There is a large collection of results across many fields such as:

Duality in linear programming (Farkas 1902,Minkowski (1896))
The Hahn-Banach theorem in functional analysis (Hahn 1927,Banach (1929))
The minimax theorem in game theory (Neumann 1928)
Regret minimization and expert learning (Hannan 1957, Littlestone and Warmuth (1989))
Boosting in machine learning (Schapire 1990, Freund and Schapire (1995))
The hard core lemma in computational complexity (Impagliazzo 1995)
The dense model theorem in additive combinatorics (Green and Tao 2008,Tao and Ziegler (2008))

that all share the following characteristics:

They appear initially counterintuitive
They are incredibly useful
They are not that hard to prove once you gather the nerve to conjecture that they could be true. In fact, they can all proven by some kind of a local search/improvements type of algorithm such as best response, multiplicative weights or gradient descent.

To show optimality of sos we will need to use a result in this framework, and specifically the generalization of such results into the quantum or positivesemidefinite setting.

Regret minimization

Consider the following setting. There is some universe \(U\) of assets. An investor strategy can be thought of as a distribution \(\mu\) over the assets (which we can think of as either describing the way to partition the portfolio or as describing how to probabilistically sample a single asset to invest in). At each time period \(t\), the investor comes up with a distribution \(\mu_t\), the universe comes up with a function \(f_t\from U \to [-1,+1]\) and profit to the investor is \(\E_{x\sim \mu_t} f_t(x)\). In the setting of regret minimization (also known as expert learning) our goal is to come up with an investment strategy that would minimize the loss we suffer compared to the best fixed strategy in hindsight \(\mu^*\). That is, we wish to find a way such that if for \(t=1,\ldots,T\) we compute \(\mu_t\) based on \(f_0,\ldots,f_{t-1}\) then we will minimize the maximum of \[ \sum_{t=1}^t \E_{\mu^*} f_t \;-\; \sum_{t=1}^T \E_{\mu_t} f_t \] over all distributions \(\mu^*\) over \(U\).

The basic result in this area is the following:

For every parameter \(\eta\), and every choice of \(f_1,\ldots,f_t\) and distribution \(\mu^*\) we can choose \(\mu_t\) based only on \(f_1,\ldots,f_{t-1}\) such that \[ \sum_{t=1}^T \E_{\mu_t} f_t \leq (1+O(\eta))\bigl[\sum_{t=1}^T \E_{\mu^*} f_t \bigr] + \tfrac{1}{\eta}\Delta(\mu^*\|\mu_1) \label{eq:regret} \] where \(\Delta(\mu'\|\mu)\) denotes the KL divergence of \(\mu'\) from \(\mu\).

In particular if we set \(\mu_1\) to be the uniform distribution, then since \(\Delta(\mu^*\|\mu_1) \leq \log |U|\) we cam set \(\eta\) to be \(\sqrt{\log |U|}/T\) and get that the total regret is bounded by \(O(\sqrt{T \log |U|})\) which (for \(T \gg \log |U|\)) is sublinear in \(T\).

We are going to simply let \(\mu_{t+1}(x)\) be eaual to to \(Z_t\mu_t(x)2^{\eta f_t(x)}\) where \(Z_t = \left(\E_{\mu_t} 2^{\eta f_t(x)}\right)^{-1}\) is a normalization factor.

Now let us upper bound the decrease in distance between \(\mu^*\) and our current distribution by something related to the loss we suffer compared to the optimum: \[ \Delta(\mu^*\|\mu_{t+1}) - \Delta(\mu^*\|\mu_{t}) = \E_{x\sim \mu^*} \log\left(\tfrac{\mu^*(x)}{\mu_{t+1}(x)}\right) - \E_{x\sim \mu^*} \left(\tfrac{\mu^*(x)}{\mu_{t}(x)}\right) \] which equals \[ \E_{x\sim \mu^*} \left(\tfrac{\mu_t(x)}{\mu_{t+1}(x)}\right) = \E_{x\sim \mu^*} \log(\tfrac{1}{Z_t 2^{\eta f_t(x)}}) = \log Z_t^{-1} - \eta \E_{\mu^*} f_t \;. \] but since \(Z_t^{-1} = \E_{\mu_t} 2^{\eta f_t} \leq (\eta-O(\eta^2))\E_{\mu_t} f_t\) we get \[ \Delta(\mu^*\|\mu_{t+1}) - \Delta(\mu^*\|\mu_{t}) \leq \eta \bigl( (1-\eta) \E_{\mu_t} f_t - \E_{\mu^*} f_t \bigr) \label{eq:ghfdshdf} \;. \]

The telescopic sum of \eqref{eq:ghfdshdf} over all \(t\) from \(1\) to \(T\) yields the theorem.

The proof of Reference:thm-regret actually yields more than just the statement. In particular the following two points are important:

Our strategies \(\mu_1,\ldots,\mu_t\) are simple in the sense that they are composed of the initial prior \(\mu_1\) reweighed by “few” of the functions \(f\from U\to [0,1]\).
The complexity of our strategy can be controlled by the KL distance from our prior to the optimal distribution. That is, if there were only few bits of information that we were missing, then there is a simple strategy that is nearly optimal.

This is a general (and very useful) phenomena that simple tests can be fooled by simple distributions (see, e.g., Trevisan, Tulsiani, and Vadhan (2009)). In particular, the above proof establishes the following theorem:

Let \(\cF\) be a collection of test functions mapping some universe \(U\) to \([-1,1]\), and let \(\mu_{opt},\mu_{prior}\) be some distribution over \(U\). Then there exists a distribution \(\mu\) such that \[ \E_{\mu_{opt}} f - \E_{\mu_{prior}} f < \epsilon \] for every \(f\in\cF\) and \(\mu\) is simple in the sense that it is obtained by reweighing \(\mu_{prior}\) using a function proportional to \(\e^{\sum_{i=1}^t f_i}\) where \(t = \Delta(\mu_{opt}\| \mu_{prior})\poly(1/\epsilon)\).

Quantum version

We can extend the above observations to the quantum setting. Suppose that now the investor strategy is a quantum state \(\rho\) (i.e., psd matrix of trace \(1\)) on a system with the universe of states \(U\), and the gain is now the probability that \(\rho\) passes some measurement which is a \(|U|\times |U|\) matrix \(M\) satisfying \(0 \preceq M \preceq I\). The same algorithm works where we now use \(\rho_{t+1}\) as proportional to \(\rho_t e^{\eta M_t}\). This is known as the matrix multiplicative weights algorithm (e.g., see Arora, Hazan, and Kale (2012)). A matrix exponential can be computed using the power series for the exponential (or by keeping the same eigenbasis and exponentiating the eigenvalues). The same analysis works except that we now replace the KL divergence of \(\mu_{opt}\) and \(\mu_{prior}\) by the corresponding quantum relative entropy which corresponds to the von Neummann entropy of the states. In particular we can get the following result:

Let \(\cF\) be a collection of quantum measurements over a system with universe of states \(U\), where for every \(f\in\cF\), \(-\Id_U \preceq f \preceq +\Id_U\) where \(\Id_U\) denotes the \(|U|\times |U|\) identity matrix and \(\preceq\) denotes spectral domination. Let \(\rho_{opt},\rho_{prior}\) be two density matrices over \(U\) (i.e., \(|U|\times |U|\) psd matrices with trace 1). Then there exists a density \(\rho\) such that \[ \Tr(\rho f^*) - \Tr(\rho_{opt} f^*) < \epsilon \] for every \(f\in\cF\) and \(\rho\) is simple in the sense that it is obtained by reweighing \(\rho_{prior}\) using a function proportional to \(\e^{\sum_{i=1}^t f_i}\) where \(t = \Delta(\rho_{opt} \| \rho_{prior})\poly(1/\epsilon)\) and \(\Delta(\rho \| \sigma)\) denotes the quantum relative entropy \(\Tr(\rho(\log \rho-\log \sigma))\).

References

Arora, Sanjeev, Elad Hazan, and Satyen Kale. 2012. “The Multiplicative Weights Update Method: A Meta-Algorithm and Applications.” Theory of Computing 8 (6). Theory of Computing: 121–64. doi:10.4086/toc.2012.v008a006.

Banach, Stefan. 1929. “Sur Les Fonctionnelles Linéaires Ii.” Studia Mathematica 1 (1). Institute of Mathematics Polish Academy of Sciences: 223–39.

Farkas, Julius. 1902. “Theorie Der Einfachen Ungleichungen.” Journal Für Die Reine Und Angewandte Mathematik 124: 1–27.

Freund, Yoav, and Robert E Schapire. 1995. “A Desicion-Theoretic Generalization of on-Line Learning and an Application to Boosting.” In European Conference on Computational Learning Theory, 23–37. Springer.

Green, Ben, and Terence Tao. 2008. “The Primes Contain Arbitrarily Long Arithmetic Progressions.” Annals of Mathematics. JSTOR, 481–547.

Hahn, Hans. 1927. “Über Lineare Gleichungssysteme in Linearen Räumen.” Journal Für Die Reine Und Angewandte Mathematik 157: 214–29.

Hannan, James. 1957. “Approximation to Bayes Risk in Repeated Play.” Contributions to the Theory of Games 3: 97–139.

Impagliazzo, Russell. 1995. “Hard-Core Distributions for Somewhat Hard Problems.” In FOCS, 538–45. IEEE Computer Society.

Littlestone, Nick, and Manfred K. Warmuth. 1989. “The Weighted Majority Algorithm.” In FOCS, 256–61. IEEE Computer Society.

Minkowski, Hermann. 1896. “Geometrie Der Zahlen.” Teubner, Leipzig 1.

Neumann, J v. 1928. “Zur Theorie Der Gesellschaftsspiele.” Mathematische Annalen 100 (1). Springer: 295–320.

Schapire, Robert E. 1990. “The Strength of Weak Learnability.” Machine Learning 5 (2). Springer: 197–227.

Tao, Terence, and Tamar Ziegler. 2008. “The Primes Contain Arbitrarily Long Polynomial Progressions.” Acta Mathematica 201 (2). Springer: 213–305.

Trevisan, Luca, Madhur Tulsiani, and Salil P. Vadhan. 2009. “Regularity, Boosting, and Efficiently Simulating Every High-Entropy Distribution.” In IEEE Conference on Computational Complexity, 126–36. IEEE Computer Society.