Sum-of-squares: proofs, beliefs, and algorithms — Boaz Barak and David Steurer

# Quantum entanglement, sum of squares, and the log rank conjecture.

Note: These are very rough notes (essentially copied from the introduction of the corresponding paper). A better version should be up shortly, but in the meantime, reading the paper and looking at the lecture video is probably preferable.

Entanglement is one of the more mysterious and subtle phenomena of quantum mechanics. The formal definition is below Reference:separable-states, but roughly speaking, a quantum state $$\rho$$ on two systems $$A$$ and $$B$$ is entangled if a quantum measurement of one system can effect the other system. A non-entangled state is called separable. This type of “spooky interaction at a distance” is responsible for many of the more counter-intuitive features of quantum mechanics. Entanglement is also used by all algorithms for quantum computers that obtain speedups over the best known classical algorithms, and it may be necessary for such speedups (???).

One of the ways in which the complexity of entanglement is manifested is that even given the full description of a quantum state $$\rho$$ as a density matrix, there is no known efficient algorithm for determining whether $$\rho$$ is entangled or not. Indeed, the best known algorithms take time which is exponential in the dimension of the state (which itself is exponential in the number of underlying qubits). This is in contrast to the classical case, where there is an efficient algorithm that given a probability distribution $$\mu$$ over a universe $$A\times B$$, can check whether or not $$\mu$$ is a product distribution by simply computing the rank of $$\mu$$ when viewed as a matrix.

Given the inherently probabilistic and noisy setting of quantum computing, arguably the right question is not to determine entanglement exactly, but rather to distinguish between the case that a state $$\rho$$ is separable, and the case that it is $$\epsilon$$-far from being separable, in the sense that there exists some measurement $$M$$ that accepts $$\rho$$ with probability $$p$$ but accepts every separable state with probability at most $$p-\epsilon$$. This problem is known as the Quantum Separability Problem with parameter $$\epsilon$$. Gharibian (???), improving on Gurvits (???), showed that this problem is NP hard when $$\epsilon$$ is inversely polynomial in the dimension of the state. Harrow and Montanaro Harrow and Montanaro (2013) showed that, assuming the Exponential Time Hypothesis, there is no $$n^{o(\log n)}$$ time algorithm for this problem for $$\epsilon$$ which is a small constant.

A tightly related problem, which is the one we focus on in this paper, is the Best Separable State (BSS) problem.Using the connection between optimization and separation oracles in convex programming, one can convert a sufficiently good algorithm for the search variant of one of these problems to the other. See Harrow and Montanaro (2013 Sec. 4.2) for a thorough discussion of the relations between these and many other problems. In the BSS problem the input is a measurement $$\cM$$ on a two part system and two numbers $$1 \geq c > s \geq 0$$ and the goal is to distinguish between the YES case that there is a separable state that $$\cM$$ accepts with probability at least $$c$$ and the NO case that $$\cM$$ accepts every separable state with probability at most $$s$$. In particular, certifying that a particular measurement $$\cM$$ satisfies the NO case is extremely useful since it implies that $$\cM$$ can serve as entanglement witness M. Horodecki, Horodecki, and Horodecki (1996; ???), in the sense that achieving acceptance probability with $$\cM$$ larger than $$s$$ certifies the presence of entanglement in a state. Such entanglement witnesses are used to certify entanglement in experiments and systems such as candidate computing devices Vedral (2008), and so having an efficient way to certify that they are sound (do not accept separable states) can be extremely useful.

Similarly to the quantum separability problem, the BSS problem is NP hard when $$c-s = 1/poly(n)$$ (???) and Harrow and Montanaro Harrow and Montanaro (2013) (Corollary 13(i)) show that (assuming the ETH) there is no $$n^{o(\log n)}$$ time algorithm for $${BSS}_{1,1/2}$$. An outstanding open question is whether the Harrow and Montanaro (2013) result is tight: whether there is a quasi-polynomial time algorithm for $${BSS}_{c,s}$$ for some constants $$1 \geq c > s \geq 0$$. This question also has a complexity interpretation. A measurement on a two part system can be thought of as a verifier (with hardwired input) that interacts with two provers. Requiring the state to be separable corresponds to stipulating that the two provers are not entangled. Thus it is not hard to see that an algorithm for $${BSS}_{c,s}$$ corresponds to an algorithm for deciding all languages in the complexity class $$QMA(2)$$ of two prover quantum Merlin Arthur systems with corresponding completeness and soundness parameters $$c$$ and $$s$$ respectively. In particular a quasi-polynomial time algorithm for $${BSS}_{0.99,0.5}$$ would imply that $$QMA(2) \subseteq EXP$$, resolving a longstanding problem in quantum complexity.For more on information on this problem and its importance, see the presentations in the recent workshop http://qma2016.quics.umd.edu/ that was dedicated to it.

In 2004, Doherty, Parrilo and Spedialieri Doherty, Parrilo, and Spedalieri (2004) proposed an algorithm for the BSS problem based on the Sum of Squares semidefinite programming hierarchy Parrilo (2000; Lasserre 2001). It is not known whether this algorithm can solve the $${BSS}_{c,s}$$ problem (for constants $$c>s$$) in quasi-polynomial time. However , Christandl and Yard (???) showed that it runs in quasi-polynomial time when the measurement $$\cM$$ is restricted to a special class of measurements known as one-way local operations and classical communications (1-LOCC). and Harrow Brandão and Harrow (2015) showed that similar performance for these types of measurements can be achieved by an algorithm based on searching on an appropriately defined $$\epsilon$$-net.

## Non quantum motivations

The BSS problem is actually quite natural and well motivated from classical considerations. As we’ll see in Section [sec:techniques] below, it turns out that at its core lies the following problem:

Let $$\mathbb{F}\in \{\R,\mathbb{C}\}$$ and $$\epsilon>0$$. The $$\epsilon$$ rank one vector problem over $$\F$$ is the task of distinguishing, given a linear subspace $$\cW \subseteq \mathbb{F}^{n^2}$$, between the case that there is a nonzero rank one matrix $$L\in \cW$$ and the case that $$\norm{L-M}_F \geq \epsilon\norm{L}_F$$ for every rank one $$L$$ and $$M\in\cW$$.For a $$k\times m$$ matrix $$A$$, we denote by $$\norm{A}_F$$ its Frobenius norm, defined as $$\sqrt{\sum_{i,j}|A_{i,j}|^2} = \Tr(AA^*)^{1/2}$$, which is the same as taking the $$\ell_2$$ norm of the matrix when considered as an $$km$$-dimensional vector.

This is arguably a natural problem in its own right. While solving this problem exactly (i.e., determining if there is a rank one solution to a set of linear equations) is the same as the NP hard task of solving quadratic equations, it turns out that we can obtain non-trivial algorithmic results by considering the above notion of approximation. Indeed, our main result implies an $$\exp(\tilde{O}(\sqrt{n}))$$ time algorithm for this problem for any constant $$\epsilon>0$$ in both the real and complex cases.

## Our results

In this work we give a $$2^{\tilde{O}(\sqrt{n})}$$ time algorithm for the $${BSS}_{1,s}$$ problem for every constant $$s<1$$. We now make the necessary definitions and state our main result.For the sake of accessibility, as well as to emphasize the connections with non-quantum questions, we use standard linear algebra notation in this paper as opposed to Dirac’s ket notation that is more common in quantum mechanics. A vector $$u$$ is a column vector unless stated otherwise, and $$u^*$$ denotes the complex conjugate transpose of the vector $$u$$. If $$u$$ is real, then we denote its transpose by $$u^\top$$. See the lecture notes (???) for a more complete coverage of separability and entanglement.

A quantum state on a system of $$m$$ elementary states (e.g., a $$\log m$$-qubit register) is an $$m\times m$$ complex Hermitian matrix $$\rho$$ (known as a density matrix) such that $$\Tr \rho = 1$$. A quantum state $$\rho$$ is pure if it is of the form $$\rho = ww^*$$ for some unit vector $$w\in \mathbb{C}^m$$. Otherwise we say that $$\rho$$ is mixed. Note that every mixed state $$\rho$$ is a convex combination of pure states.

If $$m=n^2$$, and we identify $$[m]$$ with $$[n]\times [n]$$ then an $$m$$-dimension pure quantum state $$\rho = ww^* \in \mathbb{C}^{m^2}$$ is separable if the vector $$w\in\mathbb{C}^m$$ is equal to $$uv^*$$ for some $$u,v \in \mathbb{C}^n$$. A general state $$\rho$$ is separable if it is a convex combination of separable pure states. That is, $$\rho = \E (uv^*)(uv^*)^*$$ where the expectation is taken over a distribution supported over pairs of unit vectors $$u,v \in \mathbb{C}^n$$. A state that is not separable is called entangled.

A quantum measurement operator is an $$m\times m$$ complex Hermitian matrix $$\cM$$ such that $$0 \preceq \cM \preceq I$$. The probability that a measurement $$\cM$$ accepts a state $$\rho$$ is $$\Tr(\rho \cM)$$.

For every $$s<1$$, there is a $$2^{\tilde{O}(\sqrt{n})}$$ time algorithm, based on $$\tilde{O}(\sqrt{n})$$ rounds of the sos hierarchy, that on input an $$n^2\times n^2$$ measurement operator $$\cM$$, distinguishes between the following two cases:

• YES: There exists a separable state $$\rho \in \mathbb{C}^{n^2\times n^2}$$ such that $$\Tr(\rho \cM)=1$$.
• NO: For every separable $$\rho \in \mathbb{C}^{n^2\times n^2}$$, $$\Tr(\rho\cM) \leq s$$

To our knowledge, this algorithm is the first for this problem that beats the brute force bound of $$2^{O(n)}$$ time for general measurements.

Like the algorithms of Doherty, Parrilo, and Spedalieri (2004; ???), our algorithm is based on the sum of squares SDP hierarchy, but we introduce new techniques for analyzing it that we believe are of independent interest. As we discuss in Section [sec:conclusions], it is a fascinating open question to explore whether our techniques can be quantitatively strengthened to yield faster algorithms and/or extended for other problems such as the $$2$$ to $$4$$ norm and small set expansion problems, that have been shown to be related to the BSS problem by Barak et al. (2012) (albeit in a different regime of parameters than the one we deal with in this work). As we remark below, this question seems related to other longstanding open questions in computer science and in particular to the log rank conjecture in communication complexity Lovász and Saks (1988).

[Imperfect completeness] [rem:perfect-completeness] We state our results for the case of perfect completeness for simplicity, but all of the proofs extend to the case of “near perfect completeness” where in the YES case we replace the condition $$\Tr(\rho\cM)=1$$ with the condition $$\Tr(\rho\cM) = 1 - \tfrac{1}{n}$$ (see the proof of Theorem [thm:alg-analysis]). It is an interesting open problem to find out whether our results can extend to the setting where in the YES case $$\Tr(\rho\cM)=1-\epsilon$$ for some absolute constant $$\epsilon$$. We conjecture that this is indeed the case.

While the natural setting for quantum information theory is the complex numbers, much of the power and interest already arises in the case of the real numbers, which is more natural for the sos algorithm (though it does have complex-valued generalization). Hence in this version of this paper we focus solely on the case that all operators, subspaces, matrices are real. We note that there is a natural mapping of an $$n\times n$$ complex matrix $$A+iB$$ (with real $$n\times n$$ matrices $$A,B$$)into the $$2n\times 2n$$ real matrix $$\left( \begin{smallmatrix} A&B\\ -B&A \end{smallmatrix} \right)$$. Note that a complex rank one decomposition $$A+iB = (x+iy)(z+iw)^*$$ for $$x,y,z,w\in \R^n$$ will translate into a rank two decomposition $$\left( \begin{smallmatrix} x&y\\ -y&x \end{smallmatrix} \right)\left( \begin{smallmatrix} z&w\\ -w&z \end{smallmatrix} \right)^*$$. We can use a higher dimensional version of our result (see Reference:thm:multi-dimensional-rank1 to find such decompositions, though we defer the complete derivation to the full version of this paper.

# Our techniques

Our algorithm follows a recent paradigm of constructing rounding algorithms for the sum of squares sdp by considering its solutions as “pseudo distributions” (???). These can be thought of as capturing the uncertainty that a computationally bounded solver has about the optimal solution of the given problem, analogously to the way that probability distributions model uncertainty in the classical information-theoretic Bayesian setting.

Somewhat surprisingly, our main tool in analyzing the algorithm are techniques that arose in proof of the currently best known upper bound for the log rank conjecture Lovász and Saks (1988). This conjecture has several equivalent formulations, one of which is that every $$N\times N$$ matrix $$A$$ with Boolean (i.e., $$0/1$$) entries and rank at most $$n$$, contains a submatrix of size at least $$2^{-\poly\log(n)}N \times 2^{-\poly\log(n)}N$$ that is of rank one.The original formulation of the log rank conjecture is that every such matrix has communication complexity at most $$\poly\log(n)$$, and Nisan and Wigderson Nisan and Wigderson (1994) showed that this is equivalent to the condition that such matrices contains a monochromatic submatrix of the above size. Every monochromatic submatrix is rank one, and every rank one submatrix of size $$s\times s$$ of a Boolean valued matrix contains a monochromatic submatrix of size at least $$\tfrac{s}{2} \times \tfrac{s}{2}$$. The best known bound on the log rank conjecture is by Lovett Lovett (2014) who proved that every such matrix contains a submatrix of size at least $$2^{-\tilde{O}(\sqrt{n})}N \times 2^{-\tilde{O}(\sqrt{n})}N$$.

Our algorithm works by combining the following observations:

1. Lovett’s proof can be generalized to show that every $$N\times N$$ rank $$n$$ real (or complex) matrix $$A$$ (not necessarily with Boolean entries) contains a $$2^{-\tilde{O}(\sqrt{n})}N\times 2^{-\tilde{O}(\sqrt{n})}N$$ submatrix that is close to rank one in Frobenius norm.
2. If $$\mu$$ is an actual distribution over solutions to the sos program for the BSS problem on dimension $$n$$, then we can transform $$\mu$$ into an $$N\times N$$ rank $$n$$ matrix $$A=A(\mu)$$ such that extracting an approximate solution from $$A$$ in time $$2^{\tilde{O}(k)}$$ can be done if $$A$$ contains an approximately rank one submatrix of size at least $$2^{-k}N\times 2^{-k}N$$.
3. Moreover all the arguments used to establish steps 1 and 2 above can be encapsulated in the sum of squares framework, and hence yield an algorithm that extracts an approximately optimal solution to the BSS problem from a degree $$\tilde{O}(\sqrt{n})$$ pseudo-distribution $$\mu$$ that “pretends” to be supported over exact solutions.

Thus, even though in the sos setting there is no actual distribution $$\mu$$, and hence no actual matrix $$A$$, we can still use structural results on this “fake” (or “pseudo”) matrix $$A$$ to obtain an actual rounding algorithm. We view this as a demonstration of the power of the “pseudo distribution” paradigm to help in the discovery of new algorithms, that might not seem as natural without placing them in this framework.

## Rounding from rank one reweighings

We now give a more detailed (yet still quite informal) overview of the proof. As mentioned above, we focus on the case that the $$n^2 \times n^2$$ measurement matrix $$\cM$$ is real (as opposed to complex) valued.

Let $$\cW \subseteq \R^{n^2}$$ be the subspace of vectors $$X$$ such that $$X^\top \cM X = \norm{X}^2$$ (this is a subspace since $$\cM \preceq I$$ and hence $$\cW$$ is the eigenspace of $$\cM$$ corresponding to the eigenvalue $$1$$). We pretend that the sos algorithm yields a distribution $$\mu$$ over rank one matrices of the form $$X=uv^\top$$ such that $$X \in \cW$$. When designing a rounding algorithm, we only have access to marginals of $$\mu$$, of the form $$\E_\mu f(X)$$ for some “simple” function $$f$$ (e.g., a low degree polynomial). We need to show that we can use such “simple marginals” of $$\mu$$ to extract a single rank one matrix $$u_0v_0^\top$$ that has large projection into $$\cW$$.

If $$\mu$$ is a distribution over matrices $$X$$ in a subspace $$\cW \subseteq \R^{n^2}$$ such that the expectation $$\E_\mu X$$ is approximately rank one, in the sense that $$\norm{L-\E_\mu X}_F \leq \epsilon\norm{L}_F$$ for some rank one matrix $$L$$, then $$\Tr (\cM\rho) \geq 1-2\epsilon^2$$ where $$\rho$$ is the pure separable state $$\rho = LL^\top/\norm{L}_F^2$$.

Since $$\mu$$ is supported over matrices in $$\cW$$, $$\E_\mu X$$ is in $$\cW$$. But this means that the $$\ell_2$$ (i.e., Frobenius) norm distance of $$L$$ to the subspace $$\cW$$ is at most $$\epsilon\norm{L}_F$$. Since $$\Tr(XX^\top\cM) = \Tr(X^\top \cM X) = \norm{X}_F^2$$ for every $$X\in\cW$$, the value $$\Tr(LL^\top M)$$ will be at least as large as the norm squared of the projection of $$L$$ to $$\cW$$.

In particular this means that if we were lucky and the condition of Reference:lem:round-from-rankone’s statement occurs, then it would be trivial for us to extract from the expectation $$\E_\mu X$$ (which is a very simple marginal) a rank one matrix that is close to $$\cW$$, and hence achieves probability $$1-\epsilon$$ in the measurement $$\cM$$. Note that even if every matrix in the support of $$\mu$$ has unit norm, the matrix $$L$$ could be of significantly smaller norm. We just need that there is some dimension-one subspace on which the cancellations among these matrices are significantly smaller than the cancellations in the rest of the dimensions.

Of course there is no reason we should be so lucky, but one power that the marginals give us is the ability to reweigh the original distribution $$\mu$$. In particular, for every “simple” non-negative function $$\zeta:\R^{n^2}\rightarrow\R_+$$, we can compute the marginal $$\E_{\mu_\zeta} X$$ where $$\mu_\zeta$$ is the distribution over matrices where $$\Pr_{\mu_\zeta}[X]$$ (or $$\mu_\zeta(X)$$ for short) is proportional to $$\zeta(X)\mu(X)$$. A priori in the degree $$k$$ sos algorithm we are only able to reweigh using functions $$\zeta$$ that are polynomials of degree at most $$k$$, but for the purposes of this overview, let us pretend that we can reweigh using any function that is not too “spiky” and make the following definition:

Let $$\mu$$ be a probability distribution. We say that a probability distribution $$\mu'$$ is a $$k$$-spike reweighing of $$\mu$$ if $$\Delta_{KL}(\mu'\|\mu)\leq k$$ where $$\Delta_{KL}(\mu'\|\mu)$$ denotes the Kullback-Leibler divergence of $$\mu'$$ and $$\mu$$, defined as $$\E_{X\sim\mu'} \log(\mu'(X)/\mu(X))$$.

Thus at least on a “moral level”, the following theorem should be helpful for proving our main result:

Let $$\mu$$ be any distribution over rank one $$n\times n$$ matrices and $$\epsilon>0$$. Then there exists an $$\sqrt{n}\poly(1/\epsilon)$$-spike reweighing $$\mu'$$ of $$\mu$$ and a rank one matrix $$L$$ such that $\norm{L-\pE_{\mu'} X}_F\leq \epsilon\norm{L}_F$

One of the results of this paper is a proof of Reference:thm:rank-one-reweighing (see Reference:rem:proof-rank-one-reweighing). It turns out that this can be done using ideas from the works on the log rank conjecture.

## From monochromatic rectangles to rank one reweighings

What does Theorem Reference:thm:rank-one-reweighing has to do with the log rank conjecture? To see the connection let us imagine that the distribution $$\mu$$ is flat in the sense that it is a uniform distribution over rank one matrices $$\{ u_1v_1^\top ,\ldots, u_Nv_N^\top \}$$ (this turns out to be essentially without loss of generality) and consider the $$n\times N$$ matrices $$U$$ and $$V$$ whose columns are $$u_1,\ldots,u_N$$ and $$v_1,\ldots,v_N$$ respectively. The $$n\times n$$ matrix $$\pE_\mu u_iv_i^\top$$ is proportional to $$UV^\top$$. This matrix has the same spectrum (i.e., singular values) as the $$N\times N$$ matrix $$U^\top V$$. Hence, $$UV^\top$$ is close to a rank one matrix if and only if $$U^\top V$$ is, since in both cases this happens when the square of the top singular value dominates the sum of the squares of the rest of the singular values. Now a flat distribution $$\mu'$$ with $$\Delta_{KL}(\mu'\|\mu)\leq k$$ corresponds to the uniform distribution over $$\{ u_iv_i^\top \}_{i\in I}$$ where $$I \subseteq [N]$$ satisfies $$|I| \geq 2^{-k}N$$. We can see that $$\E_{\mu'} u_iv_i^\top$$ will be approximately rank one if and only if the submatrix of $$U^\top V$$ corresponding to $$I$$ is approximately rank one. Using these ideas it can be shown that Theorem [thm:rank-one-reweighing] is equivalent to the following theorem:To show this formally we use the fact that by Markov, every distribution $$\mu'$$ with $$\Delta_{KL}(\mu'\|U_{[N]})=\log N - H(\mu') = k$$ is $$\epsilon$$-close to a distribution with min entropy $$\log N-O(k/\epsilon)$$ and every distribution of the latter type is a convex combination of flat distributions of support at least $$N2^{-O(k/\epsilon)}$$.

Let $$A$$ be any $$N\times N$$ matrix of rank at most $$n$$. Then there exists a subset $$I\subseteq [N]$$ with with $$|I| \geq \exp(-\sqrt{n}\poly(1/\epsilon))N$$ and a rank one matrix $$L$$ such that $\norm{L-A_{I,I}}_F\leq \epsilon\norm{L}_F$ where $$A_{I,I}$$ is the submatrix corresponding to restricting the rows and columns of $$A$$ to the set $$I$$.

One can think of Reference:thm:rank-one-reweighing-dual as an approximate and robust version of Lovett’s result Lovett (2014) mentioned above. Lovett showed that every $$N\times N$$ matrix of rank $$n$$ with Boolean entries has a $$2^{-\tilde{O}(\sqrt{{n}})}N\times 2^{-\tilde{O}(\sqrt{{n}})}N$$ submatrix that is of exactly rank 1. We show that the condition of Booleanity is not needed if one is willing to relax the conclusion to having a submatrix that is only approximately rank 1. It is of course extremely interesting in both cases whether the bound of $$\tilde{O}(\sqrt{n})$$ can be improved further, ideally all the way to $$polylog(n)$$. In the Boolean setting, such a bound might prove the log rank conjecture,We note a caveat that this depends on the notion of “approximate” used. Gavinsky and Lovett Gavinsky and Lovett (2014) showed that to prove the log rank conjecture it suffices to find a in a rank $$n$$ Boolean matrix a rectangle of measure $$\exp(-\polylog(n))$$ that is nearly monochromatic in the sense of having a $$1-1/O(n)$$ fraction of its entries equal. In this paper we are more concerned with rectangles whose distance to being rank one (or monochromatic) is some $$\e>0$$ that is only a small constant or $$1/\polylog(n)$$. [fn:approx-monochromatic] while in our setting such a bound (assuming it extends to “pseudo matrices”) would yield a quasipolynomial time algorithm for BSS, hence showing that $$QMA(2) \subseteq EXP$$. It can be shown that as stated, Theorem [thm:rank-one-reweighing] is tight. However there are different notions of being “close to rank one” that could be useful in both the log-rank and the quantum separability setting, for which there is hope to obtain substantially improved quantitative bounds. We discuss some of these conjectural directions in Reference:sec:conclusions.

## Overview of proof

In the rest of this technical overview, we give a proof sketch of Reference:thm:rank-one-reweighing-dual and then discuss how the proof can be “lifted” to hold in the setting of sum of square pseudo-distributions. The condition that a matrix $$A$$ is of rank $$n$$ is the same as that $$A = UV^\top$$ where $$U,V$$ are two $$n\times N$$ matrices with columns $$u_1,\ldots,u_N$$ and $$v_1,\ldots,v_N$$ respectively (i.e., $$A_{i,j}=\iprod{u_i,v_j}$$ for all $$i,j\in [N]$$). We will restrict our attention to the case that all the columns of $$U$$ and $$V$$ are of unit norm. (This restriction is easy to lift and anyway holds automatically in our intended application.) In this informal overview, we also restrict attention to the symmetric case, in which $$A=A^\top$$ and can be written as $$A=UU^\top$$ and also assume that $$U$$ is isotropic, in the sense that $$\E_{i\in [N]} u_iu_i^\top = \tfrac{1}{n} \Id$$.

Our inspiration is Lovett’s result Lovett (2014) which establishes a stronger conclusion for Boolean matrices. In particular, our proof follows Rothvoß’s proof Rothvoß (2014) of Lovett’s theorem, though the non-Boolean setting does generate some non-trivial complications. The $$N\times N$$ matrix $$A$$ satisfies that $$A_{i,j} = \iprod{u_i,u_j}$$. An equivalent way to phrase our goal is that we want to find a subset $$I\subseteq [N]$$ over the indices such that:

(i)

$$|I| \geq \exp(-\tilde{O}(\sqrt{n}))N$$.

(ii)

If $$\lambda_1 \geq \lambda_2 \geq \cdots \lambda_n$$ are the eigenvalues of $$\E_{i\in I} u_iu_i^\top$$ then $$\epsilon^2 \lambda_1^2 \geq \sum_{j=2}^n \lambda_j^2$$

We will chose the set $$I$$ probabilistically and show that (i) and (ii) above hold in expectation. It is not hard to use standard concentration of measure bounds to then deduce the desired result but we omit these calculations from this informal overview.

Our initial attempt for the choice of $$I$$ is simple, and is directly inspired by Rothvoß (2014). We choose a random standard Gaussian vector $$g\in N(0,\tfrac{1}{n}\Id)$$ (i.e., for every $$i$$, $$g_i$$ is an independent standard Gaussian of mean zero and variance $$1/n$$). We then define $$I_g = \{ i : \iprod{g,u_i} \geq \sqrt{k/n} \}$$ where $$k=\tilde{O}(\sqrt{n})$$ is a parameter to be chosen later. Since $$u_i$$ is a unit vector, $$\iprod{g,u_i}$$ is a Gaussian of variance $$1/n$$, and so for every $$i$$, the probability that $$i\in I_g$$ is $$\exp(-O(k))$$ hence satisfying (i) in expectation.

The value $$\lambda_1$$ of $$\E_{i\in I} u_iu_i^\top$$ will be at least $$\Omega(k/n)$$ in expectation. Indeed, we can see that the Gaussian vector $$g$$ that we choose (which satisfies $$\norm{g}^2 = 1\pm o(1)$$ with very high probability) will satisfy that $$g^\top \left( \E_i u_i u_i^\top \right) g = \E_{i\in I_g} \iprod{u_i,g}^2 \geq k/n$$ and hence in expectation the top eigenvalue of $$\E_i u_iu_i^\top$$ will be at least $$(1-o(1))k/n$$.

So, if we could only argue that in expectation it will hold that $$\sum_{j=1}^n \lambda_j^2 \ll k^2/n^2 = \mathrm{polylog}(n)/n$$ then we’d be done. Alas, this is not necessarily the case. However, if this does fail, we can see that we have made progress, in the sense that by restricting to the indices in $$I$$ we raised the Frobenius norm of $$\E u_iu_i^\top$$ from the previous value of $$1/n$$ (under the assumption that $$U$$ was isotropic) to $$polylog(n)/n$$. Our idea is to show that this holds in general: we can select a Gaussian vector $$g$$ and define the set $$I_g$$ as above such that by restricting to the indices in $$I_g$$ we either get an approx rank one matrix or we increase the Frobenius norm of our expectation matrix by at least an $$(1+\epsilon)$$ factor for an appropriately chosen $$\epsilon>0$$. Since the latter cannot happen more than $$\log n/\epsilon$$ times, the final set of indices still has measure $$\exp(-\tilde{O}(\sqrt{n}))$$.

In further rounds, if our current set of indices is $$I$$ and the matrix (after subtracting from each vector $$u_i$$ its expectation) $$U_I = \E_{i\in I} u_iu_i^\top = \sum_{j=1}^n \lambda_j v_jv_j^\top$$ is not approximately rank one, then rather than choosing $$g$$ as a standard Gaussian, we choose it from the distribution $$N(0,U_I)$$ where we use $$U_I$$ as the covariance matrix. The expected norm of $$g$$ is simply $$\Tr(U_I)$$ which equals $$1$$. For every $$i$$, the random variable $$\iprod{u_i,g}$$ is a Gaussian with mean zero and variance $$\sum_{j=1}^n \iprod{u_i,v_j}\lambda_j$$. But for every $$j$$ in expectation over $$i$$, $$\E \iprod{u_i,v_j}^2 = \lambda_j$$ and so it turns out that we can assume that this random variable has variance $$\sum \lambda_j^2 = \norm{U_I}_F^2$$.

This means that if we choose $$I' = \{ i\in I: \iprod{u_i,g} \geq \sqrt{k}\norm{U_I}_F \}$$ we get a subset of $$I$$ with measure $$\exp(-O(k))$$. But now the new matrix $$U_{I'} = \E_{i\in I'} u_iu_i^\top$$ will have an eigenvalue of at least $$k\norm{U_I}_F^2$$ magnitude which is much larger than $$\norm{U_I}_F$$ since we chose $$k\gg \sqrt{n}$$. Hence $$U_{I'}$$ has significantly larger Frobenius norm than $$U_I$$.

The above arguments can be made precise, and we do so in Section [sec:structure-thm].

## Rectangle lemma for pseudo-distributions

The above is sufficient to show that given $$N\times n$$ matrices $$U=(u_1|\cdots|u_N)$$ and $$V=(v_1|\cdots|v_n)$$ (which we view as inducing a distribution over rank one matrices by taking $$u_iv_i^\top$$ for a random $$i$$), we can condition on a not too unlikely event (of probability $$\exp(-\tilde{O}(\sqrt{n}))$$ to obtain that $$\E u_iv_i^\top$$ is roughly rank one. But in the sos setting we are not given such matrices. Rather we have access to an object called a “pseudo-distribution” $$\mu$$ which we behaves to a certain extent as if it is such a distribution, but for which it is not actually the case. In particular, we are not able to sample from $$\mu$$, or condition it on arbitrary events, but rather only compute $$\E_\mu f(X)$$ for polynomials $$f$$ of degree at most $$\tilde{O}(\sqrt{n})$$, and even these expectations are only “pseudo expectations” in the sense that they do not need to correspond to any actual probability distribution.

To lift the arguments above to the sos setting, we need to first show that if $$\mu$$ was an actual distribution, then we could perform all of the above operations using only access to $$\tilde{O}(\sqrt{n})$$ degree moments of $$\mu$$. Then we need to show that our analysis can be captured by the degree $$\tilde{O}(\sqrt{n})$$ sos proof systems. Both these steps, which are carried out in Section [??] of this paper, are rather technical and non-trivial, and we do not describe them in this overview.

For starters, we need to move from conditioning a probability distribution to reweighing it. All of our conditioning procedures above had the form of restricting to $$i$$’s such that $$\xi(i) \geq \sqrt{k}$$ where $$\xi(i)$$ was probabilistically chosen so that for every $$i$$ $$\xi(i)$$ is a a mean zero and standard deviation one random variable satisfying $$\Pr[ \xi(i) = \ell ] = \exp(-\Theta(\ell^2))$$. We replace this conditioning by reweighing the distribution $$i$$ with the function $$\zeta(i)=\exp(\sqrt{k}\xi(i))$$. Note that iterative conditioning based on functions $$\xi_1,\ldots,\xi_t$$ can be replaced with reweighing by the product function $$\zeta_1,\ldots,\zeta_t$$. We then show that these $$\zeta_j$$ functions can be approximated by polynomials of $$\tilde{O}(k)$$ degree.

The arguments above allow us to construct a rounding algorithm that at least makes sense syntactically, in the sense that it takes the $$\tilde{O}(\sqrt{n})$$ degrees moments of $$\mu$$ and produces a rank one matrix that is a candidate solution to the original matrix. To analyze this algorithm, we need to go carefully over our analysis before, and see that all the arguments used can be embedded in the sos proof system with relatively low degree. Luckily we can rely on the recent body of works that establishes a growing toolkit of techniques to show such embeddings (???).

# References

Barak, Boaz, Fernando G. S. L. Brandão, Aram Wettroth Harrow, Jonathan A. Kelner, David Steurer, and Yuan Zhou. 2012. “Hypercontractivity, Sum-of-Squares Proofs, and Their Applications.” In STOC, 307–26. ACM.

Brandão, Fernando G. S. L., and Aram Wettroth Harrow. 2015. “Estimating Operator Norms Using Covering Nets.” CoRR abs/1509.05065.

Doherty, Andrew C, Pablo A Parrilo, and Federico M Spedalieri. 2004. “Complete Family of Separability Criteria.” Physical Review A 69 (2). APS: 022308.

Gavinsky, Dmitry, and Shachar Lovett. 2014. “En Route to the Log-Rank Conjecture: New Reductions and Equivalent Formulations.” In ICALP (1), 8572:514–24. Lecture Notes in Computer Science. Springer.

Harrow, Aram Wettroth, and Ashley Montanaro. 2013. “Testing Product States, Quantum Merlin-Arthur Games and Tensor Optimization.” J. ACM 60 (1): 3.

Horodecki, Michał, Paweł Horodecki, and Ryszard Horodecki. 1996. “Separability of Mixed States: Necessary and Sufficient Conditions.” Physics Letters A 223 (1). Elsevier: 1–8.

Lasserre, Jean B. 2001. “An Explicit Exact SDP Relaxation for Nonlinear 0-1 Programs.” In IPCO, 2081:293–303. Lecture Notes in Computer Science. Springer.

Lovász, László, and Michael E. Saks. 1988. “Lattices, Möbius Functions and Communication Complexity.” In FOCS, 81–90. IEEE Computer Society.

Lovett, Shachar. 2014. “Communication Is Bounded by Root of Rank.” In STOC, 842–46. ACM.

Nisan, Noam, and Avi Wigderson. 1994. “On Rank Vs. Communication Complexity.” In FOCS, 831–36. IEEE Computer Society.

Parrilo, Pablo A. 2000. “Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization.” PhD thesis, Citeseer.

Rothvoß, Thomas. 2014. “A Direct Proof for Lovett’s Bound on the Communication Complexity of Low Rank Matrices.” CoRR abs/1409.6366.

Vedral, Vlatko. 2008. “Quantifying Entanglement in Macroscopic Systems.” Nature 453 (7198). Nature Publishing Group: 1004–7.