Sum-of-squares: proofs, beliefs, and algorithms — Boaz Barak and David Steurer

# Mathematical Definitions

Let us now turn to formally defining the problem of polynomial optimization and the sum-of-squares algorithm. In the first few lectures, we will restrict our attention to the following basic special case, which still captures many interesting examples:

Given a low-degree polynomial $$f\from \bits^n\to \R$$, decide if $$f\ge 0$$ over the hypercube or if there exists a point $$x\in \bits^n$$ such that $$f(x)<0$$.

One interesting computational task captured by this problem is finding a maximum cut in a graph. For an $$n$$-vertex graph $$G$$, we encode a bipartition of the vertex set of $$G$$ by a vector $$x\in \bits^n$$ and we let $$f_G(x)$$ be the number of edges cut by the bipartition $$x$$. This function is a degree-$$2$$ polynomial, $f_G(x)=\sum_{\set{ij}\in E(G)} (x_i-x_j)^2\,. \label{eq:max-cut-objective}$ Therefore, deciding if the polynomial $$c-f_G$$ takes a negative value over the hypercube is the same as deciding if the maximum cut in $$G$$ is larger than $$c$$.
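To make the encoding concrete, here is a minimal brute-force sketch (not an efficient algorithm) that evaluates $$f_G$$ and decides whether $$c-f_G$$ takes a negative value on the hypercube. The helper names `cut_value` and `max_cut_exceeds` and the triangle example are our own illustrative choices.

```python
from itertools import product

def cut_value(edges, x):
    """f_G(x): number of edges {i, j} whose endpoints receive different bits."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)

def max_cut_exceeds(edges, n, c):
    """Check by enumeration whether c - f_G is negative somewhere on {0,1}^n."""
    return any(c - cut_value(edges, x) < 0 for x in product((0, 1), repeat=n))

triangle = [(0, 1), (1, 2), (2, 0)]
print(max_cut_exceeds(triangle, 3, 1))  # True: some cut has more than 1 edge
print(max_cut_exceeds(triangle, 3, 2))  # False: no cut has more than 2 edges
```

The enumeration takes $$2^n$$ steps, which is exactly the cost that the sum-of-squares approach tries to avoid.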

The traditional definition of the Max-Cut problem is to recover, given a graph $$G$$, the cut $$x$$ maximizing $$f_G(x)$$. A priori, computing $$\max_x f_G(x)$$, or deciding whether this maximum is larger than $$c$$, is an easier task than recovering the cut. However, in this and many other settings, all known algorithms for solving the decision task (i.e., is $$\max f_G(x)$$ larger than $$c$$?) easily generalize to solving the search problem (i.e., recovering $$x$$ that exactly or approximately maximizes $$f_G(x)$$).

The sum-of-squares algorithm, when restricted to the special case Reference:nonnegative-polynomial-over-the-hypercube, gets a polynomial $$f\from \bits^n\to \R$$ as input and outputs

• either a proof that $$f(x)\ge 0$$ for all $$x\in \bits^n$$,
• or an object that “pretends to be” a point $$x\in \bits^n$$ with $$f(x)<0$$ or, more generally, a collection of such points.

What is the form of this proof? What is the meaning of “pretends to be”? And how can we find such an object when finding an actual solution is hard? These are the questions we address next.

## Sum-of-squares certificates

How could we efficiently certify for a given polynomial $$f\from \bits^n\to\R$$ that it is nonnegative over the hypercube? Since a square is always non-negative, one simple certificate is to show that $$f$$ agrees with a sum of squares of polynomials over the hypercube. This observation motivates the following definition.

A degree-$$d$$ sum-of-squares certificate (of non-negativity) for a function $$f\from \bits^n\to \R$$ consists of polynomials $$g_1,\ldots,g_r\from \bits^n\to\R$$ of degree at most $$d/2$$ for some $$r\in \N$$ such that $f(x) = g_1^2(x) + \cdots + g_r^2(x) \label{eq:sos-representation}$ for every $$x\in\bits^n$$.

We will also refer to a degree-$$d$$ sos certificate for $$f$$ as a degree-$$d$$ sum-of-squares proof of the inequality $$f\ge 0$$.

### Verifying certificates

In what sense is this certificate efficiently verifiable? Since $$g_1,\ldots,g_r$$ have degree at most $$d/2$$, we can represent each polynomial $$g_i$$ by $$n^{O(d)}$$ coefficients (say in the monomial basis). It also turns out that we can assume $$r$$ to be at most $$n^{O(d)}$$. Thus in $$n^{O(d)}$$ time, we can reduce the task of verifying \eqref{eq:sos-representation} to the task of checking that an explicit polynomial $$p$$ (obtained by computing the coefficients of $$f-(g_1^2+ \cdots +g_r^2)$$) vanishes for every $$x\in\bits^n$$. It can be shown that this holds if and only if $$p$$ becomes the zero polynomial when we reduce it to a multilinear polynomial (where every monomial with non-zero coefficient is of the form $$\prod_{i\in S}x_i$$ for some subset $$S\subseteq [n]$$) by repeatedly applying the identity $$x_i^2 = x_i$$ (which holds when $$x_i\in\bits$$); see Reference:multilinear-representation. The underlying technical reason is the fact that $$\set{x_i(x_i-1)}_{i\in [n]}$$ is a small Groebner basis for the hypercube $$\bits^n$$. Since $$f-(g_1^2+ \cdots +g_r^2)$$ has degree at most $$d$$, we need to consider at most $$n^{O(d)}$$ coefficients. Finally, some mild assumptions on $$f$$ allow us to assume that the bit length of the coefficients is bounded by $$n^{O(d)}$$. (Concretely, we need to assume that already $$f-\e$$ has a degree-$$d$$ sos certificate.) It follows that we can verify the certificate in time $$n^{O(d)}$$.
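The verification procedure can be sketched as follows. This is a minimal illustration that assumes polynomials are already given in multilinear form, stored as dictionaries from index sets $$S$$ to coefficients; the representation and the helper names are our own choices. Because $$x_i^2=x_i$$ on the hypercube, two multilinear monomials multiply by taking the union of their index sets.

```python
from collections import defaultdict

def multiply(p, q):
    """Product of multilinear polynomials, reduced modulo x_i^2 = x_i.

    A polynomial is a dict mapping frozenset S (the monomial prod_{i in S} x_i)
    to its coefficient; x_S * x_T reduces to the monomial indexed by S | T.
    """
    out = defaultdict(float)
    for S, a in p.items():
        for T, b in q.items():
            out[S | T] += a * b
    return dict(out)

def subtract(p, q):
    """Coefficient-wise difference p - q."""
    out = defaultdict(float)
    for S, a in p.items():
        out[S] += a
    for S, b in q.items():
        out[S] -= b
    return dict(out)

def is_valid_certificate(f, gs, tol=1e-9):
    """Check that f - sum_i g_i^2 reduces to the zero multilinear polynomial."""
    residual = dict(f)
    for g in gs:
        residual = subtract(residual, multiply(g, g))
    return all(abs(c) <= tol for c in residual.values())

x0, x1, both = frozenset([0]), frozenset([1]), frozenset([0, 1])
f = {x0: 1.0, x1: 1.0, both: -2.0}   # x_0 + x_1 - 2 x_0 x_1
g = {x0: 1.0, x1: -1.0}              # x_0 - x_1
print(is_valid_certificate(f, [g]))  # True
```

The check touches only the coefficients of the polynomials involved and never enumerates the $$2^n$$ points of the hypercube. The floating-point tolerance `tol` is an assumption of this sketch; exact rational arithmetic would avoid it.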

For large enough degree, every nonnegative function has a sum-of-squares certificate of non-negativity:

Every nonnegative function $$f\from \bits^n\to\R$$ has a degree-$$2n$$ sum-of-squares certificate.

Let $$g\from \bits^n\to \R$$ be the function that agrees with $$\sqrt f$$ on the hypercube. This function satisfies $$f=g^2$$ over the hypercube, and the multilinear representation of $$g$$ has degree at most $$n$$. Therefore, $$g$$ is a degree-$$2n$$ sos certificate for $$f$$.
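The argument can be checked numerically on a small example. The sketch below computes the multilinear coefficients of $$g=\sqrt f$$ by Möbius inversion, $$c_S=\sum_{T\subseteq S}(-1)^{\lvert S\rvert-\lvert T\rvert}g(1_T)$$, and confirms that $$g^2$$ agrees with $$f$$ on the hypercube. The particular function `f` is an arbitrary nonnegative example chosen for illustration.

```python
from itertools import combinations, product
from math import sqrt

def multilinear_coefficients(g, n):
    """Coefficients c_S with g(x) = sum_S c_S prod_{i in S} x_i on {0,1}^n."""
    coeffs = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            total = 0.0
            for k in range(size + 1):
                for T in combinations(S, k):
                    point = tuple(1 if i in T else 0 for i in range(n))
                    total += (-1) ** (size - k) * g(point)
            coeffs[S] = total
    return coeffs

def evaluate(coeffs, x):
    """Evaluate a multilinear polynomial at a point of the hypercube."""
    return sum(c for S, c in coeffs.items() if all(x[i] for i in S))

n = 3
f = lambda x: 1 + x[0] * x[1] + 2 * x[2]   # illustrative nonnegative function
g = multilinear_coefficients(lambda x: sqrt(f(x)), n)
ok = all(abs(evaluate(g, x) ** 2 - f(x)) < 1e-9
         for x in product((0, 1), repeat=n))
print(ok)
```

Note that the certificate has $$2^n$$ coefficients in general, so it is not of the efficiently verifiable kind that low-degree certificates provide.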

In the most general setting (when we allow arbitrary polynomial equality and inequality constraints over $$\R^n$$ instead of just a single polynomial inequality over the hypercube) this result is known as the Positivstellensatz and was proven by Krivine in 1964 (and independently but later by Stengle in 1974), extending Artin’s 1927 resolution of Hilbert’s 17th problem.

### Finding certificates

Not only can we check sos certificates efficiently, but there is also an efficient algorithm to find them if they exist. This sum-of-squares algorithm is based on semidefinite programming and was first proposed by Naum Shor in 1987, and later refined by Pablo Parrilo in 2000 and Jean Lasserre in 2001.

There exists an algorithm that given a polynomial $$f\from \bits^n\to \R$$ (say represented in the monomial basis with polynomial bit complexity) and a number $$k\in \N$$, outputs a degree-$$k$$ sum-of-squares certificate for $$f+2^{-n}$$ in time $$n^{O(k)}$$ if $$f$$ has a degree-$$k$$ sos certificate. (Unless explicitly specified otherwise, when we give an $$n$$-variate degree-$$d$$ polynomial as an input to an algorithm, we represent it by its coefficients in the monomial basis up to degree $$d$$. Furthermore, we assume that the bit length of the coefficients is at most polynomial in the number of coefficients, which is roughly $$n^d$$.)

This result also extends far beyond the case of a single polynomial over the hypercube to any set of polynomial equalities and inequalities over $$\R^n$$.

To get some intuition for the sum-of-squares algorithm, note that the polynomials $$f$$ with degree-$$d$$ sos certificates form a convex cone (a set closed under convex combination and nonnegative scaling). See Reference:closed-convex-cone for some basic properties of this cone. We refer to this cone as the degree-$$d$$ sum-of-squares cone (over the hypercube).

The key insight for Reference:sum-of-squares-algorithm-certificate is that the degree-$$d$$ sos cone admits a small semidefinite programming (SDP) formulation, which turns out to follow from the following characterization of sos certificates in terms of positive semidefinite matrices.

A polynomial $$f$$ has a degree-$$d$$ sos certificate if and only if there exists a positive semidefinite matrix $$A$$ such that for all $$x\in \bits^n$$, $f(x) = \Bigiprod{(1,x)^{\otimes d/2}, A (1,x)^{\otimes d/2}}\,. \label{eq:sdp}$

Suppose \eqref{eq:sdp} holds for a positive semidefinite matrix $$A$$. Let $$g_i$$ be the polynomial such that $$g_i(x)=\iprod{ e_i, A^{1/2} (1,x)^{\otimes d/2}}$$. Then, $$f$$ has the following degree-$$d$$ sos certificate, $f(x) = \Norm{A^{1/2} (1,x)^{\otimes d/2}}^2 = \sum_i g_i(x)^2\,.$ (Here, we use that positive semidefinite matrices have square roots over the reals.)

On the other hand, suppose that $$f$$ has a deg-$$d$$ sos certificate, $$f=\sum_{i=1}^r g_i^2$$. Let $$v_1,\ldots,v_r$$ be vectors such that $$g_i(x)=\langle v_i, (1,x)^{\otimes d/2}\rangle$$ for all $$x\in\R^n$$ and let $$A=\sum_i \dyad{v_i}$$. Then, for every $$x\in \bits^n$$, $f(x) = \sum_i g_i(x)^2 = \sum_i \Iprod{v_i, (1,x)^{\otimes d/2}}^2 = \Bigiprod{(1,x)^{\otimes d/2}, A (1,x)^{\otimes d/2} }\,.$
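The first direction of this proof is easy to check numerically. The following sketch takes $$d=2$$, so that $$(1,x)^{\otimes d/2}=(1,x)$$, draws an arbitrary positive semidefinite matrix $$A$$ (a random example chosen only for illustration), and reads the polynomials $$g_i$$ off the rows of $$A^{1/2}$$.

```python
import numpy as np
from itertools import product

def psd_sqrt(A):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)          # clear tiny negative rounding errors
    return (V * np.sqrt(w)) @ V.T

rng = np.random.default_rng(0)
n = 3
B = rng.standard_normal((n + 1, n + 1))
A = B @ B.T                            # an arbitrary PSD matrix
G = psd_sqrt(A)                        # row i holds the coefficients of g_i

gaps = []
for x in product((0, 1), repeat=n):
    z = np.array((1,) + x, dtype=float)    # the vector (1, x)
    f_x = z @ A @ z                        # f(x) = <(1,x), A (1,x)>
    sos = np.sum((G @ z) ** 2)             # sum_i g_i(x)^2
    gaps.append(abs(f_x - sos))
print(max(gaps))
```

The printed gap should be at the level of floating-point rounding error.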

## Exercises I

The following exercises are about basic properties of sos certificates and some examples.

Show that every function $$f\from \bits^n\to \R$$ has a unique multilinear representation $$f(x)=\sum_{S\subseteq [n]} c_S x_S$$ where $$x_S=\prod_{i\in S}x_i$$.

The multilinear representation of a function $$f\from\bits^n\to\R$$ is closely related to its Fourier transform, see Ryan O’Donnell’s excellent book on this topic.

Show that every function $$f\from \bits^n\to \R$$ with a degree-$$d$$ sos certificate has one of rank at most $$n^{d/2}$$.

Show that for every $$k\in \N$$, the polynomials with degree-$$k$$ sos certificates of non-negativity form a closed convex cone.

For an $$n+2$$-vertex digraph $$G$$ with a source $$s$$ and sink $$t$$, let $$f(x)$$ with $$x\in\bits^{V(G)\setminus \set{s,t}}$$ be the number of edges going out of $$\set s \cup \Set{i\in V(G)\setminus \set{s,t} \mid x_i =1}$$. Show that $$f$$ is a degree-$$2$$ polynomial and that $$f-c$$ has a degree-$$4$$ sos certificate for all $$c\in \R$$ such that $$f-c\ge 0$$.

For a graph $$G$$, let $$L_G$$ be the Laplacian matrix $L_G = \sum_{(i,j)\in E(G)} \dyad{(e_i-e_j)}\,,$ where $$\set{e_i \mid i\in V(G)}$$ is the coordinate basis. Show that for every graph $$G$$ with $$n$$ vertices, the function $$\lambdamax(L_G)\cdot n/2 - f_G$$ has a degree-$$2$$ sos certificate.

Show that for every even $$d\in\N$$ and every function $$f\from \bits^n\to \R$$ of degree at most $$d$$, there exists some $$M\in \R_{\ge 0}$$ such that $$M-f$$ has a degree-$$d$$ sos certificate. Also show that $$M$$ can be chosen at most $$n^{O(d)}$$ times the largest coefficient of $$f$$ in the monomial basis.

## Pseudo-distributions

What can we say about a function $$f\from \bits^n\to \R$$ if there is no degree-$$k$$ sos certificate for its non-negativity? Obviously, if the function is not actually non-negative, then there is no certificate for it. Indeed that’s the only kind of obstruction for very large values of $$k$$ (by Reference:high-degree-sos-certificates it suffices that $$k \ge 2n$$). However, it turns out that for smaller values of $$k$$ other kinds of obstructions exist. Since the running time of the sum-of-squares algorithm is exponential in $$k$$, understanding these more general obstructions is key.

The most direct description of obstructions for sos certificates is geometric. In the previous section, we saw that functions with degree-$$k$$ sos certificates form a closed convex cone. By the hyperplane separation theorem for convex cones, for every function $$f\from \bits^n\to \R$$ without degree-$$k$$ sos certificate there exists a hyperplane through the origin that separates $$f$$ from the cone of functions with degree-$$k$$ sos certificates, in the sense that the halfspace $$H$$ above the hyperplane contains the degree-$$k$$ sos cone but not $$f$$.

What do such halfspaces look like? We can represent a halfspace $$H$$ by its normal function $$\mu\from\bits^n\to \R$$ so that $H=\Set{ g\from \bits^n\to \R \Mid \sum_{x\in \bits^n} \mu(x)\cdot g(x) \ge 0}\,.$ By scaling we can assume without loss of generality that $$\sum_{x\in\bits^n}\mu(x)=1$$. It’s illuminating to consider the special case that $$\mu$$ satisfies $$\mu(x)\ge 0$$ for all $$x\in\bits^n$$. Then, $$\mu$$ corresponds to a probability distribution over the hypercube where every point $$x\in \bits^n$$ has probability $$\mu(x)$$. In this case, the halfspace $$H$$ contains all nonnegative functions and therefore also the degree-$$k$$ sos cone. The condition $$f\not\in H$$ simply says that the expected value of $$f(x)$$ when $$x$$ is drawn from the distribution $$\mu$$ is negative. In particular, in this case, if $$f\not\in H$$ then there must exist some $$x\in\bits^n$$ such that $$f(x)<0$$.

It turns out that even if $$\mu$$ does not satisfy $$\mu\ge 0$$ it behaves in many ways like a probability distribution. To formalize this idea we introduce the following notation for the formal expectation of a function $$f\from \bits^n\to \R$$ with respect to another function $$\mu$$ (not necessarily corresponding to a probability distribution), $\pE_{\mu} f = \sum_{x\in \bits^n} \mu(x)\cdot f(x)\,.$

In order to emphasize the variable bound by the formal expectation, we use the notation $$\pE_{\mu(x)}f(x)$$. This notation is useful when the expression $$f(x)$$ also depends on other variables. (This notation is analogous to the notation $$\E_{x\sim \mu} f(x)$$ for actual probability distributions, where $$x\sim \mu$$ denotes that $$x$$ is a sample drawn from $$\mu$$. We avoid this notation because the process of sampling is not well-defined in the context of formal expectations.)

We define a “pseudo-distribution” to be a function $$\mu$$ such that the formal expectation with respect to $$\mu$$ satisfies some of the properties that expectations of probability distributions satisfy. However unlike actual probability distributions, pseudo-distributions may assign negative probabilities.

A degree-$$d$$ pseudo-distribution over $$\bits^n$$ is a function $$\mu:\bits^n\rightarrow\R$$ such that the formal expectation with respect to $$\mu$$ satisfies $$\pE_\mu 1 = 1$$ and for every polynomial $$f$$ of degree at most $$d/2$$, $\pE_\mu f^2 \geq 0\,.$ We refer to formal expectations with respect to degree $$d$$ pseudo-distributions as degree $$d$$ pseudo-expectations.
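For $$d=2$$ the definition can be tested directly on small examples. If $$g$$ has coefficient vector $$u$$ in the basis $$(1,x)$$, then $$\pE_\mu g^2 = \langle u, M u\rangle$$ with $$M=\pE_{\mu(x)} (1,x)(1,x)^\top$$, so the condition holds for all $$g$$ of degree at most $$1$$ exactly when $$M$$ is positive semidefinite. The sketch below uses this observation; the function names and the two example functions are our own illustrative choices, and the enumeration over $$2^n$$ points is only feasible for tiny $$n$$.

```python
import numpy as np
from itertools import product

def pseudo_expectation(mu, f, n):
    """Formal expectation sum_x mu(x) f(x); mu is a dict keyed by points."""
    return sum(mu[x] * f(x) for x in product((0, 1), repeat=n))

def is_degree2_pseudo_distribution(mu, n, tol=1e-9):
    """Check pE 1 = 1 and that pE (1,x)(1,x)^T is positive semidefinite."""
    M = np.zeros((n + 1, n + 1))
    for x in product((0, 1), repeat=n):
        z = np.array((1,) + x, dtype=float)
        M += mu[x] * np.outer(z, z)
    total = M[0, 0]                      # equals sum_x mu(x)
    return bool(abs(total - 1) <= tol and np.linalg.eigvalsh(M).min() >= -tol)

uniform = {x: 0.25 for x in product((0, 1), repeat=2)}
signed = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.5, (1, 1): -0.1}

print(is_degree2_pseudo_distribution(uniform, 2))
print(is_degree2_pseudo_distribution(signed, 2))
print(pseudo_expectation(signed, lambda x: x[0] * x[1], 2))
```

The function `signed` assigns a negative value to the point $$(1,1)$$, yet it passes the degree-$$2$$ test: the polynomial $$x_1x_2$$ has negative pseudo-expectation, but it is not the square of a polynomial of degree at most $$1$$.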

If a pseudo-distribution $$\mu$$ satisfies $$\mu(x) \geq 0$$ for all $$x$$ then it corresponds to an actual probability distribution over the hypercube. Reference:high-degree-sos-certificates implies that every degree-$$2n$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ satisfies $$\mu\ge 0$$.

Note that a priori a degree-$$d$$ pseudo-distribution $$\mu\from \bits^n\to \R$$ requires $$2^n$$ numbers to specify (i.e., the values of $$\mu$$ on all inputs). However, the following lemma allows us to reduce the number of parameters to $$n^{O(d)}$$.

Let $$\mu$$ be a degree-$$\ell$$ pseudo-distribution over $$\bits^n$$. Then there exists a multilinear polynomial $$\mu'$$ of degree at most $$\ell$$ such that $\pE_{\mu(x)}p = \pE_{\mu'(x)} p \;,$ for every polynomial $$p$$ of degree at most $$\ell$$.

Let $$U_\ell\subseteq \R^{\bits^n}$$ be the linear subspace of multilinear polynomials of degree at most $$\ell$$. By Reference:multilinear-representation this subspace contains all polynomials of degree at most $$\ell$$. Decompose the function $$\mu$$ as $$\mu=\mu'+\mu''$$ such that $$\mu'\in U_\ell$$ and $$\mu''$$ is orthogonal to $$U_\ell$$. (Here, orthogonality is with respect to the following inner product for real-valued functions on $$\bits^n$$, $\iprod{f,g} = \sum_{x\in \bits^n} f(x) g(x)\,.$) For every $$p\in U_\ell$$, $\pE_{\mu} p = \iprod{\mu'+\mu'',p} = \iprod{\mu',p} = \pE_{\mu'} p\,,$ where we used the fact that $$\mu''$$ is orthogonal to $$U_\ell$$.
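The decomposition in this proof is an ordinary orthogonal projection, so it can be sketched with a least-squares solve. In the code below, the columns of `Phi` are the monomials $$x_S$$ with $$\lvert S\rvert\le\ell$$ evaluated on all of $$\bits^n$$, and `mu` is an arbitrary random function (the projection step does not use positivity of pseudo-expectations). All names are illustrative.

```python
import numpy as np
from itertools import combinations, product

def low_degree_projection(mu, n, ell):
    """Project mu (a vector indexed by {0,1}^n) onto the span U_ell of the
    multilinear monomials of degree at most ell."""
    points = list(product((0, 1), repeat=n))
    monomials = [S for k in range(ell + 1) for S in combinations(range(n), k)]
    # Entry (x, S) is the value of x_S = prod_{i in S} x_i at the point x.
    Phi = np.array([[float(all(x[i] for i in S)) for S in monomials]
                    for x in points])
    coeffs, *_ = np.linalg.lstsq(Phi, mu, rcond=None)
    return Phi @ coeffs, Phi

rng = np.random.default_rng(1)
n, ell = 4, 2
mu = rng.standard_normal(2 ** n)
mu_prime, Phi = low_degree_projection(mu, n, ell)

# The formal expectations of every monomial of degree <= ell agree.
gap = np.abs(Phi.T @ mu - Phi.T @ mu_prime).max()
print(gap)
```

Only the `Phi.shape[1]` coefficients of $$\mu'$$ need to be stored, which is $$n^{O(\ell)}$$ numbers rather than $$2^n$$.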

We can extend the notation of $$\pE_{\mu(x)} f(x)$$ to the case that $$f$$ is a vector-valued function, in which case this denotes the vector obtained by taking the expectation of every coordinate of $$f$$. Using this notation we can write the conclusion of Reference:low-degree more succinctly as $\pE_{\mu(x)} (1,x)^{\otimes \ell} = \pE_{\mu'(x)} (1,x)^{\otimes \ell}\,,$ where for an $$m$$-dimensional vector $$v$$ and $$d\in\N$$, $$v^{\otimes d}$$ denotes the $$m^d$$-dimensional vector such that $$(v^{\otimes d})_{i_1,\ldots,i_d} = v_{i_1}\cdots v_{i_d}$$. Indeed, every coordinate of $$(1,x)^{\otimes \ell}$$ is a polynomial of degree at most $$\ell$$ in $$x$$, and these coordinates span all such polynomials, and so if the expectations of $$(1,x)^{\otimes \ell}$$ under $$\mu$$ and $$\mu'$$ are equal then the expectation of every polynomial $$p$$ of degree at most $$\ell$$ is equal as well.
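As a small illustration of the tensor-power notation, the following sketch builds $$v^{\otimes d}$$ as a flat vector using repeated Kronecker products; the helper name `tensor_power` is our own. With this flattening, the entry with multi-index $$(i_1,i_2)$$ sits at position $$i_1\cdot m+i_2$$ when $$d=2$$.

```python
import numpy as np
from functools import reduce

def tensor_power(v, d):
    """The d-fold tensor power of v, flattened to length len(v) ** d."""
    return reduce(np.kron, [np.asarray(v, dtype=float)] * d)

x = (1, 0, 1)
v = (1,) + x                 # the vector (1, x)
z = tensor_power(v, 2)       # all products v_i * v_j, i.e. monomials of degree <= 2
print(z.reshape(4, 4))
```

Each monomial appears several times among the coordinates (for example $$x_1x_2$$ and $$x_2x_1$$), which is why the coordinates span the low-degree polynomials without being linearly independent.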

## Exercises II

The following exercises are about basic properties of pseudo-distributions.

Show that every degree-$$2n$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ satisfies $$\mu(x)\ge 0$$ for every $$x\in\bits^n$$. (Therefore, $$\mu$$ corresponds to an actual probability distribution over $$\bits^n$$.)

Show that a function $$\mu\from \bits^n\to \R$$ is a degree-$$d$$ pseudo-distribution if and only if $$\pE_\mu 1 = 1$$ and the following pseudo-moment matrix is positive semidefinite, $\pE_{\mu(x)} \dyad{\Paren{(1,x)^{\otimes d/2}}} \succeq 0\,.$

Show that for every even $$d$$ and every degree-$$d$$ pseudo-distribution $$\mu$$, there exists a degree-$$d$$ pseudo-distribution $$\mu'$$ with the same pseudo-moments up to degree $$d$$ such that for every $$x\in\bits^n$$, $\abs{\mu'(x)}\le 2^{-n}\cdot \sum_{d'=0}^d\binom{n}{d'}\,.$ (If $$\mu'$$ were an actual probability distribution, this inequality would mean that $$\mu'$$ has min-entropy at least $$\approx n-\log (n^d)$$.) Hint: This exercise might require some Fourier analysis.

Show that the set of degree-$$d$$ pseudo-distributions over $$\bits^n$$ admits a separation algorithm with running time $$n^{O(d)}$$. Concretely, show that there exists an $$n^{O(d)}$$-time algorithm that given a vector $$N\in \Paren{\R^{n+1}}^{\otimes d}$$ outside of the following set $$\cX_d$$ outputs a halfspace that separates $$N$$ from $$\cX_d$$. Here, $$\cX_d$$ is the set that consists of all coefficient vectors $$M \in \Paren{\R^{n+1}}^{\otimes d}$$ such that the function $$\mu\from \bits^n\to \R$$ with $$\mu(x)=\Iprod{M,(1,x)^{\otimes d}}$$ is a degree-$$d$$ pseudo-distribution over $$\bits^n$$.

Show that for every even $$d\in \N$$, the following set of pseudo-moments admits a separation algorithm with running time $$n^{O(d)}$$, $\cM_d = \Set{ \pE_{\mu(x)} (1,x)^{\otimes d} \Mid \mu \text{ is a degree-}d\text{ pseudo-distribution over } \bits^n}\,.$

## Duality

We now show that pseudo-distributions are indeed dual to sos proofs by demonstrating that their existence certifies the non-existence of a proof and vice versa.

For every function $$f\from \bits^n\to \R$$ and every even $$d\in\N$$, there exists a degree-$$d$$ sos certificate for the non-negativity of $$f$$ if and only if every degree-$$d$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ satisfies $$\pE_\mu f \ge 0$$.

One direction is immediate. Suppose $$f$$ has a degree-$$d$$ sos certificate so that $$f=g_1^2+\dots+g_r^2$$ for some polynomials $$g_1,\ldots,g_r$$ with $$\deg g_i\le d/2$$. Then, every degree-$$d$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ satisfies $\pE_\mu f = \pE_\mu g_1^2 + \dots + \pE_\mu g_r^2 \ge 0\,.$ For the other direction, suppose that $$f$$ is not contained in the degree-$$d$$ sum-of-squares cone. By the hyperplane separation theorem, there exists a halfspace $$H$$ through the origin that contains the cone but not $$f$$. Let $$\mu\from \bits^n\to \R$$ be the “normal” of $$H$$ so that $H=\Set{g\from \bits^n\to \R \Mid \pE_{\mu} g \ge 0}\,.$ Since $$f$$ is not contained in $$H$$, it satisfies $$\pE_\mu f < 0$$. Since $$H$$ contains the degree-$$d$$ sos cone, every polynomial $$g$$ of degree at most $$d/2$$ satisfies $$\pE_{\mu} g^2\ge 0$$. It remains to argue that $$\pE_\mu 1 >0$$, which means that we can rescale $$\mu$$ by a positive factor to ensure that $$\pE_{\mu}1=1$$. Indeed, by Reference:some-bound, there exists $$M\in \R_{> 0}$$ such that $$M+f$$ has a degree-$$d$$ sos certificate, so that $$\pE_\mu (M+f)\ge 0$$. Together with $$\pE_\mu f<0$$, this means that $\pE_\mu 1 = \tfrac 1M\cdot \Paren{\pE_\mu (M+f) - \pE_\mu f}>0\,,$ as desired.

## Sum-of-squares algorithm

Recall that we described the degree-$$d$$ sos algorithm as an algorithm that, given as input a polynomial $$f\from\bits^n\to\R$$, runs in time $$n^{O(d)}$$ and either outputs a certificate that $$f(x)\geq 0$$ for all $$x$$, or outputs an object that “pretends to be” a distribution over vectors $$x\in\bits^n$$ such that $$f(x)<0$$. We now state this theorem formally.

For every even $$d\in \N$$, there exists an $$n^{O(d)}$$-time algorithm that given a polynomial $$f\from \bits^n \to \R$$ of degree at most $$d$$ (with polynomial bit length) either outputs a degree-$$d$$ sos certificate for $$f + 2^{-n}$$ or a degree-$$d$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ such that $$\pE_\mu f < 2^{-n}$$.

We will show one part of the theorem (about finding pseudo-distributions). The proof of the other part is similar but not needed for most of the algorithmic applications we will discuss.

Suppose that $$f$$ does not have a degree-$$d$$ sos certificate. By the duality between sos certificates and pseudo-distributions, there exists a degree-$$d$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ such that $$\pE_\mu f<0$$. Our goal is to efficiently find a pseudo-distribution $$\mu$$ over $$\bits^n$$ such that $$\pE_{\mu} f < 2^{-n}$$. Let $$v$$ be a vector such that $$f(x)=\iprod{v,(1,x)^{\otimes {d}}}$$. Then, $$\pE_\mu f = \langle v, \pE_\mu (1,x)^{\otimes d}\rangle$$. Therefore, we want to minimize the linear function $$y\mapsto \langle v,y\rangle$$ over the set $$\cM_d$$ of vectors of the form $$\pE_\mu (1,x)^{\otimes d}$$ for a degree-$$d$$ pseudo-distribution $$\mu$$ over $$\bits^n$$. By Reference:separation-algorithm-for-pseudo-moments, this set has a separation algorithm with running time $$n^{O(d)}$$. Using the ellipsoid algorithm, we can approximately minimize the linear function $$y\mapsto \langle v,y\rangle$$ over all $$y\in \cM_d$$ also in time $$n^{O(d)}$$.

## The different views of pseudo-distributions

Pseudo-distributions are not very complicated as mathematical objects: they can be simply represented as positive semidefinite matrices. But they are rather subtle to grasp conceptually. (They are related, though not identical, to quantum states, which are also modeled by positive semidefinite matrices and are not easy to grasp conceptually.) An often useful point of view is to pretend that pseudo-distributions are actual distributions. This viewpoint can help “predict” certain properties of pseudo-distributions. For example, pseudo-distributions satisfy the Cauchy-Schwarz inequality:

If $$\mu$$ is a degree $$d$$ pseudo-distribution and $$P,Q$$ are polynomials of degree at most $$d/2$$ then $\left(\pE_\mu PQ\right)^2 \leq \left(\pE_\mu P^2 \right)\left( \pE_\mu Q^2 \right)$

We may assume that both $$\pE_\mu P^2$$ and $$\pE_\mu Q^2$$ are strictly positive. (If at least one is zero, the proof is simpler.) By scaling $$P$$ and $$Q$$ by positive scalars, we may further assume without loss of generality that $\pE_\mu P^2 = \pE_\mu Q^2 = 1\,.$ It remains to prove $$\abs{\pE_\mu PQ} \le 1$$. Indeed, $$\pE_\mu (P-Q)^2 \geq 0$$ which means by linearity that $2\pE_\mu PQ = \pE_\mu P^2 + \pE_\mu Q^2 - \pE_\mu (P-Q)^2 \le 2\,.$ Applying the same argument with $$-Q$$ in place of $$Q$$ gives $$\pE_\mu PQ \ge -1$$.

## Do all pseudo-distributions correspond to actual distributions?

It turns out that the proofs of many of the inequalities we know and love, including Cauchy-Schwarz, Hölder and more, boil down to a sum-of-squares proof, which means that these statements hold not just for actual distributions but also for pseudo-distributions. In this light, a natural question to ask is whether perhaps every pseudo-distribution is an actual distribution. The answer to this question is negative.

There exists a degree-$$2$$ polynomial $$f\from \bits^n\to \R$$ that is nonnegative $$f\ge 0$$ but has no degree-$$2$$ sum-of-squares certificate. In particular, there exists a degree-$$2$$ pseudo-distribution $$\mu$$ over $$\bits^n$$ such that $$\pE_\mu f < 0$$.

Consider the following nonnegative function on $$\bits^3$$, $f(x) = 2 - \Paren{ (x_1-x_2)^2 + (x_2-x_3)^2 + (x_3-x_1)^2 }\,.$ The fact that this function is nonnegative corresponds to the fact that the maximum cut in a $$3$$-cycle is $$2$$. Consider the degree-$$2$$ pseudo-distribution $$\mu$$ over $$\bits^3$$ with mean $$\pE_{\mu(x)} \transpose x= \tfrac12 (1, 1, 1)$$ and covariance, $\pE_{\mu(x)} \dyad x - \dyad{\Paren{\pE_{\mu(x)} x}} =\frac18 \Paren{\begin{matrix} 2 & -1 & -1\\ -1 & 2 & -1\\ -1 & -1 & 2 \\ \end{matrix}}\,.$ Now, $\pE_{\mu(x)} (x_1-x_2)^2 = \pE_{\mu(x)} (x_2-x_3)^2 = \pE_{\mu(x)} (x_3-x_1)^2 = 3/4\,.$ Therefore, $$f$$ has negative expectation under $$\mu$$, $\pE_{\mu} f = 2- 3\cdot 3/4 = -1/4\,.$
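The arithmetic in this example can be double-checked with a few lines of NumPy. The sketch below assembles the pseudo-moment matrix $$\pE_{\mu(x)}(1,x)(1,x)^\top$$ from the stated mean and covariance, checks that it is positive semidefinite and consistent with $$x_i^2=x_i$$, and evaluates $$\pE_\mu f$$ by linearity. The variable names are our own.

```python
import numpy as np

mean = np.full(3, 0.5)
cov = np.array([[2, -1, -1],
                [-1, 2, -1],
                [-1, -1, 2]]) / 8
second = cov + np.outer(mean, mean)          # pE x x^T

# Pseudo-moment matrix pE (1,x)(1,x)^T in block form.
M = np.block([[np.ones((1, 1)), mean[None, :]],
              [mean[:, None], second]])

min_eig = np.linalg.eigvalsh(M).min()
booleanity = np.allclose(np.diag(second), mean)   # pE x_i^2 = pE x_i

def pe_sq_diff(i, j):
    """pE (x_i - x_j)^2, expanded by linearity."""
    return second[i, i] - 2 * second[i, j] + second[j, j]

pe_f = 2 - (pe_sq_diff(0, 1) + pe_sq_diff(1, 2) + pe_sq_diff(2, 0))
print(min_eig >= -1e-12, booleanity, pe_f)
```

The smallest eigenvalue of `M` is zero up to rounding, which reflects that the covariance matrix annihilates the all-ones vector.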

It took about 80 years from the time Hilbert showed non-constructively that nonnegative polynomials that are not sums of squares exist until Motzkin came up with an explicit example, and even that example has a low-degree sos proof of positivity. One lesson from this is that if an inequality is true and “natural” (i.e., constructed by methods known to Hilbert, not including the probabilistic method), then heuristically there should be a low-degree sos proof for this fact. A corollary of this heuristic, in the spirit of Bob Marley (Bob Marley and the Wailers, “Three Little Birds”, 1980):

“If you analyze the performance of an SOS-based algorithm pretending pseudo-distributions are actual distributions, then unless you used Chernoff+union bound type arguments, then every little thing gonna be alright.”

We will use Marley’s corollary extensively in analyzing sos algorithms. There is a recurring theme in mathematics of “power from weakness”. For example, we can often derandomize certain algorithms by observing that they fall in some restricted complexity class and hence can be fooled by certain pseudorandom generators. Another example, perhaps closer to ours, is that even though the original way people defined calculus with “infinitesimal” amounts was based on false premises, many of the results they deduced were still correct. One way to explain this is that they used a weak proof system that cannot prove all true facts about the real numbers, and in particular cannot detect if the real numbers are replaced with an object that does have such an “infinitesimal” quantity added to it. In a similar way, if you analyze an algorithm using a weak proof system (e.g., one that is captured by a small-degree sos proof), then the analysis will still hold even if we replace actual distributions with a pseudo-distribution of sufficiently large degree.

We have seen that not every pseudo-distribution is an actual distribution. However it turns out for every pseudo-distribution $$\mu$$ we can at least match the first two moments of $$\mu$$ by an actual probability distribution—albeit over $$\R^n$$ instead of $$\bits^n$$. The following lemma formalizes this idea which is related to hyperplane rounding in approximation algorithms and Gaussian copula in quantitative finance.

For every degree-$$2$$ pseudo-distribution $$\mu$$ over $$\bits^n$$, there exists a probability distribution $$\rho$$ over $$\R^n$$ with the same first two moments, that is, $\pE_{\mu(x)} (1,x)^{\otimes 2} = \E_{x \sim \rho} (1,x)^{\otimes 2}\,.$ Moreover, $$\rho$$ is a multivariate Gaussian distribution.

Let $$v=\pE_{\mu(x)} x$$ and $$\Sigma=\pE_{\mu(x)} \dyad{(x-v)}$$ be the formal mean and covariance of $$\mu$$. Like for an actual probability distribution, the covariance $$\Sigma$$ of a degree-$$2$$ pseudo-distribution is positive semidefinite. Indeed, for every $$u\in \R^n$$, $\langle u, \Sigma u\rangle = \pE_{\mu(x)} \langle u,x-v\rangle^2 \ge 0\,.$ The following randomized procedure outputs a random vector $$y$$ in $$\R^n$$ with mean $$v$$ and covariance $$\Sigma$$:

• choose a standard Gaussian vector $$g$$, i.e., the coordinates of $$g$$ are independently identically distributed Gaussian variables with mean $$0$$ and variance $$1$$,
• output the vector $$y=v+\Sigma^{1/2} g$$.

(In the last step, we use that the matrix $$\Sigma$$ has a square root because it is positive semidefinite.) Since $$\E g=0$$, the mean of this distribution is $$\E y=v$$. Since $$\E \dyad g=Id$$, the distribution has covariance $\E \dyad{(y-v)}= \Sigma^{1/2}\E\dyad g \Sigma^{1/2} = \Sigma\,.$ The distribution we described is called the Gaussian distribution with mean $$v$$ and covariance $$\Sigma$$ and is denoted $$N(v,\Sigma)$$. (The above sampling procedure shows that such a distribution $$N(v,\Sigma)$$ exists for every vector $$v\in\R^n$$ and every positive semidefinite matrix $$\Sigma\in\R^{n\times n}$$.)
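The sampling procedure translates directly into code. The sketch below computes $$\Sigma^{1/2}$$ from an eigendecomposition and compares empirical moments to $$v$$ and $$\Sigma$$; the mean and covariance used here are those of the $$3$$-cycle pseudo-distribution from the previous section, and the sample size and seed are arbitrary choices.

```python
import numpy as np

def sample_gaussian(v, Sigma, num_samples, rng):
    """Draw rows y = v + Sigma^{1/2} g with g a standard Gaussian vector."""
    w, V = np.linalg.eigh(Sigma)
    root = (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T   # symmetric Sigma^{1/2}
    g = rng.standard_normal((num_samples, len(v)))
    return v + g @ root       # root is symmetric, so each row is v + root @ g

rng = np.random.default_rng(0)
v = np.full(3, 0.5)
Sigma = np.array([[2, -1, -1],
                  [-1, 2, -1],
                  [-1, -1, 2]]) / 8
y = sample_gaussian(v, Sigma, 200_000, rng)

print(y.mean(axis=0))             # close to v
print(np.cov(y, rowvar=False))    # close to Sigma
```

The samples lie in $$\R^3$$ and not in $$\bits^3$$, which is consistent with the lemma: only the first two moments are matched.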

## Pseudo-distributions as Bayesian probabilities

The problem of maximizing a polynomial over $$\bits^n$$ is $$NP$$ hard (indeed Max-Cut is a special case of it), and so (assuming $$P\neq NP$$) if we run the sos algorithm with a small (e.g., constant) degree $$d$$ then the algorithm should sometimes fail to solve it. In other words, on input some function $$f\from\bits^n\to\R$$ the sos algorithm might return a pseudo-distribution $$\mu$$ that will not be an actual distribution over $$x$$’s with $$f(x)<0$$. How do we interpret this pseudo-distribution? One way to think about it is that the pseudo-distribution captures the uncertainty of a computationally bounded observer about the unknown $$x$$ such that $$f(x)<0$$. Bayesian probabilities are often used to capture uncertainty even about events that are completely determined, for example, it might make sense for me to say something like “the probability that my great-grandfather had blue eyes is 25%” since even though obviously he either did or didn’t have blue eyes, the information I have about this fact can still leave me with some uncertainty.
The fact that we have bounded computational powers means that we sometimes have uncertainty even about facts that are completely determined by the information we are given. For example, while the number $$2^{81712357}-1$$ is either prime or composite, the authors (and as far as we know, everyone else) do not know which of the two cases holds. In fact, the information gathered by the Great Internet Mersenne Prime search project only allows us to determine that the probability that this number is prime is roughly $$1.46\cdot 10^{-6}$$.

Similarly, even if a function $$f\from\bits^n\to\R$$ has a unique $$x$$ such that $$f(x)<0$$, this value $$x$$ might be hard to find, and so we could have some uncertainty about it. One way to think about the pseudo-distribution is that it captures this uncertainty, and so a statement such as $$\pE x_{17} = 0.7$$ can be interpreted as saying that, given the information we have, the probability that $$x_{17}=1$$ is $$0.7$$.

## What’s next?

The type of questions we are interested in regarding the sos algorithm are the following:

• For what families of problems does the sos algorithm give us the best-known guarantees? Are there families of problems for which it is reasonable to conjecture that the sos algorithm is optimal, in the sense that for any given $$d$$, there is no other algorithm running much faster than $$n^d$$ time that would do better than the degree $$d$$ sos algorithm?
• There are some a priori seemingly stronger algorithms and proof systems, such as the “dynamic sos” proof system. Can we show natural classes of problems on which sos matches the guarantees of those seemingly stronger systems?
• Can we use the sos algorithm to solve problems that have eluded us via other means? In particular, there are some average case problems arising in machine learning, statistical physics, and other areas for which the sos algorithm seems promising. There are also some very fascinating worst-case problems for which we do not know the sos algorithm’s performance and which resolving could settle important questions such as Khot’s unique games conjecture.
• Can we obtain a systematic understanding of the sos algorithm’s performance? Ideally we would have a “creativity free” analysis, whereby we reduce the question of analyzing the guarantees sos gives on any particular question to some potentially complicated or tedious but ultimately doable and non-creative calculations.