# Quantum entanglement, sum of squares, and the log rank conjecture.

**Note:** These are very rough notes (essentially copied from the
introduction of the corresponding paper). A better version should be up
shortly, but in the meantime, reading the paper and looking at the
lecture video is probably preferable.

*Entanglement* is one of the more mysterious and subtle phenomena of
quantum mechanics. The formal definition is below
Reference:separable-states, but roughly speaking, a quantum state \(\rho\)
on two systems \(A\) and \(B\) is *entangled* if a quantum measurement of
one system can effect the other system. A non-entangled state is called
*separable*. This type of “spooky interaction at a distance” is
responsible for many of the more counter-intuitive features of quantum
mechanics. Entanglement is also used by all algorithms for quantum
computers that obtain speedups over the best known classical algorithms,
and it may be necessary for such speedups (**???**).

One of the ways in which the complexity of entanglement is manifested is
that even given the full description of a quantum state \(\rho\) as a
density matrix, there is no known efficient algorithm for determining
whether \(\rho\) is entangled or not. Indeed, the best known algorithms
take time which is *exponential* in the dimension of the state (which
itself is exponential in the number of underlying qubits). This is in
contrast to the classical case, where there is an efficient algorithm
that given a probability distribution \(\mu\) over a universe \(A\times B\),
can check whether or not \(\mu\) is a *product distribution* by simply
computing the rank of \(\mu\) when viewed as a matrix.
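The classical check mentioned above is easy to carry out; here is a minimal sketch (with hypothetical example distributions):

```python
import numpy as np

# A distribution mu over A x B, viewed as an |A| x |B| matrix of
# probabilities, is a product distribution iff this matrix has rank one.
# (Hypothetical example data, not from the text.)
p_a = np.array([0.2, 0.8])          # marginal on A
p_b = np.array([0.5, 0.3, 0.2])     # marginal on B

mu_product = np.outer(p_a, p_b)     # product distribution: rank one
mu_correlated = np.array([[0.5, 0.0, 0.0],
                          [0.0, 0.3, 0.2]])  # correlated: rank two

def is_product(mu, tol=1e-9):
    """Check whether mu (a matrix of probabilities) is a product
    distribution by computing its rank."""
    return np.linalg.matrix_rank(mu, tol=tol) == 1

print(is_product(mu_product))     # True
print(is_product(mu_correlated))  # False
```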

Given the inherently probabilistic and noisy setting of quantum
computing, arguably the right question is not to determine entanglement
exactly, but rather to distinguish between the case that a state \(\rho\)
is separable, and the case that it is \(\epsilon\)-far from being
separable, in the sense that there exists some *measurement* \(M\) that
accepts \(\rho\) with probability \(p\) but accepts every separable state
with probability at most \(p-\epsilon\). This problem is known as the
*Quantum Separability Problem* with parameter \(\epsilon\).
Gharibian (**???**), improving on
Gurvits (**???**), showed that this problem is NP hard when
\(\epsilon\) is inversely polynomial in the dimension of the state. Harrow
and Montanaro (2013) showed that, assuming the
Exponential Time Hypothesis, there is no \(n^{o(\log n)}\) time algorithm
for this problem for \(\epsilon\) which is a small constant.

A closely related problem, which is the one we focus on in this paper,
is the *Best Separable State (BSS)* problem. (Using the connection
between optimization and separation oracles in convex programming, one
can convert a sufficiently good algorithm for the search variant of one
of these problems into one for the other; see
Harrow and Montanaro (2013, Sec. 4.2) for a thorough discussion of the
relations between these and many other problems.) In the BSS problem the
input is a measurement \(\cM\) on a two part system and two numbers
\(1 \geq c > s \geq 0\) and the goal is to distinguish between the YES
case that there is a separable state that \(\cM\) accepts with probability
at least \(c\) and the NO case that \(\cM\) accepts every separable state
with probability at most \(s\). In particular, certifying that a
particular measurement \(\cM\) satisfies the NO case is extremely useful
since it implies that \(\cM\) can serve as an *entanglement
witness* M. Horodecki, Horodecki, and Horodecki (1996; **???**), in
the sense that achieving acceptance probability with \(\cM\) larger than
\(s\) certifies the presence of entanglement in a state. Such entanglement
witnesses are used to certify entanglement in experiments and systems
such as candidate computing devices Vedral (2008), and so
having an efficient way to certify that they are sound (do not accept
separable states) can be extremely useful.

Similarly to the quantum separability problem, the BSS problem is NP
hard when \(c-s = 1/\poly(n)\) (**???**), and Harrow and
Montanaro (2013, Corollary 13(i)) show that
(assuming the ETH) there is no \(n^{o(\log n)}\) time algorithm for
\({BSS}_{1,1/2}\). An outstanding open question is whether the
Harrow and Montanaro (2013) result is *tight*: whether there is a
quasi-polynomial time algorithm for \({BSS}_{c,s}\) for some constants
\(1 \geq c > s \geq 0\). This question also has a complexity
interpretation. A measurement on a two part system can be thought of as
a *verifier* (with hardwired input) that interacts with two provers.
Requiring the state to be *separable* corresponds to stipulating that
the two provers are not entangled. Thus it is not hard to see that an
algorithm for \({BSS}_{c,s}\) corresponds to an algorithm for deciding all
languages in the complexity class \(QMA(2)\) of *two prover quantum Merlin
Arthur* systems with corresponding completeness and soundness parameters
\(c\) and \(s\) respectively. In particular a quasi-polynomial time
algorithm for \({BSS}_{0.99,0.5}\) would imply that
\(QMA(2) \subseteq EXP\), resolving a longstanding problem in quantum
complexity. (For more information on this problem and its importance,
see the presentations at the recent workshop
http://qma2016.quics.umd.edu/ that was dedicated to it.)

In 2004, Doherty, Parrilo, and Spedalieri (2004) proposed
an algorithm for the BSS problem based on the *Sum of Squares*
semidefinite programming hierarchy (Parrilo 2000; Lasserre 2001). It is not known whether this algorithm can
solve the \({BSS}_{c,s}\) problem (for constants \(c>s\)) in
quasi-polynomial time. However, Brandão, Christandl, and
Yard (**???**) showed that it runs in
quasi-polynomial time when the measurement \(\cM\) is restricted to a
special class of measurements known as *one-way local operations and
classical communications* (1-LOCC). Brandão and
Harrow (2015) showed that similar performance
for these types of measurements can be achieved by an algorithm based on
searching on an appropriately defined \(\epsilon\)-net.

## Non quantum motivations

The BSS problem is actually quite natural and well motivated from classical considerations. As we’ll see in Section [sec:techniques] below, it turns out that at its core lies the following problem:

Let \(\mathbb{F}\in \{\R,\mathbb{C}\}\) and \(\epsilon>0\). The *\(\epsilon\)
rank one vector problem over \(\mathbb{F}\)* is the task of distinguishing, given
a linear subspace \(\cW \subseteq \mathbb{F}^{n^2}\), between the case
that there is a nonzero rank one matrix \(L\in \cW\) and the case that
\(\norm{L-M}_F \geq \epsilon\norm{L}_F\) for every rank one \(L\) and
\(M\in\cW\). (For a \(k\times m\) matrix \(A\), we denote by \(\norm{A}_F\) its
*Frobenius* norm, defined as
\(\sqrt{\sum_{i,j}|A_{i,j}|^2} = \Tr(AA^*)^{1/2}\), which is the same
as taking the \(\ell_2\) norm of the matrix when considered as a
\(km\)-dimensional vector.)
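By the Eckart–Young theorem, the nearest rank one matrix in Frobenius norm is obtained from the top singular triple, so the distance appearing in the definition above can be computed with an SVD. A minimal sketch (with hypothetical data):

```python
import numpy as np

def rank_one_distance(M):
    """Best rank one approximation L of M (Eckart-Young: the top
    singular triple) and the relative distance ||M - L||_F / ||L||_F."""
    U, s, Vt = np.linalg.svd(M)
    L = s[0] * np.outer(U[:, 0], Vt[0, :])
    dist = np.sqrt(np.sum(s[1:] ** 2))   # ||M - L||_F
    return L, dist / np.linalg.norm(L, 'fro')

rng = np.random.default_rng(0)
# Hypothetical data: an exactly rank one matrix plus a tiny perturbation.
A = np.outer(rng.standard_normal(4), rng.standard_normal(4))
L, rel = rank_one_distance(A + 1e-8 * rng.standard_normal((4, 4)))
print(rel)  # tiny: the perturbed matrix is still nearly rank one
```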

This is arguably a natural problem in its own right. While solving this
problem exactly (i.e., determining if there is a rank one solution to a
set of linear equations) is the same as the NP hard task of solving
*quadratic equations*, it turns out that we can obtain non-trivial
algorithmic results by considering the above notion of approximation.
Indeed, our main result implies an \(\exp(\tilde{O}(\sqrt{n}))\) time
algorithm for this problem for any constant \(\epsilon>0\) in both the
real and complex cases.

## Our results

In this work we give a \(2^{\tilde{O}(\sqrt{n})}\) time algorithm for the
\({BSS}_{1,s}\) problem for every constant \(s<1\). We now make the
necessary definitions and state our main result. (For the sake of
accessibility, as well as to emphasize the connections with non-quantum
questions, we use standard linear algebra notation in this paper, as
opposed to Dirac’s ket notation that is more common in quantum
mechanics. A vector \(u\) is a column vector unless stated otherwise, and
\(u^*\) denotes the complex conjugate transpose of the vector \(u\). If
\(u\) is real, then we denote its transpose by \(u^\top\). See the lecture
notes (**???**) for a more complete coverage of separability and
entanglement.)

A *quantum state* on a system of \(m\) elementary states (e.g., a
\(\log m\)-qubit register) is an \(m\times m\) positive semidefinite complex
Hermitian matrix \(\rho\) (known as a *density matrix*) such that \(\Tr \rho = 1\). A quantum
state \(\rho\) is *pure* if it is of the form \(\rho = ww^*\) for some unit
vector \(w\in \mathbb{C}^m\). Otherwise we say that \(\rho\) is *mixed*.
Note that every mixed state \(\rho\) is a convex combination of pure
states.

If \(m=n^2\), and we identify \([m]\) with \([n]\times [n]\), then an
\(m\)-dimensional pure quantum state \(\rho = ww^* \in \mathbb{C}^{m\times m}\) is
*separable* if the vector \(w\in\mathbb{C}^m\), viewed as an \(n\times n\)
matrix, is equal to \(uv^*\) for some
\(u,v \in \mathbb{C}^n\). A general state \(\rho\) is *separable* if it is a
convex combination of separable pure states. That is,
\(\rho = \E (uv^*)(uv^*)^*\) where the expectation is taken over a
distribution supported over pairs of unit vectors
\(u,v \in \mathbb{C}^n\). A state that is not separable is called
*entangled*.

A quantum *measurement operator* is an \(m\times m\) complex Hermitian
matrix \(\cM\) such that \(0 \preceq \cM \preceq I\). The probability that a
measurement \(\cM\) accepts a state \(\rho\) is \(\Tr(\rho \cM)\).
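The definitions above can be illustrated with a minimal numerical sketch (real-valued for simplicity, and with a hypothetical small dimension and a hypothetical projector as the measurement):

```python
import numpy as np

n = 3
rng = np.random.default_rng(1)

# Separable pure state: rho = w w^T where w, viewed as an n x n matrix,
# equals u v^T for unit vectors u, v (real case for simplicity).
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
w = np.outer(u, v).reshape(n * n)      # unit vector in R^{n^2}
rho = np.outer(w, w)                   # density matrix: PSD, trace one

# A measurement operator 0 <= M <= I: here a projector onto a random
# 4-dimensional subspace (hypothetical choice).
Q, _ = np.linalg.qr(rng.standard_normal((n * n, 4)))
M = Q @ Q.T

p_accept = np.trace(rho @ M)           # acceptance probability Tr(rho M)
print(0.0 <= p_accept <= 1.0)          # True
```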

For every \(s<1\), there is a \(2^{\tilde{O}(\sqrt{n})}\) time algorithm, based on \(\tilde{O}(\sqrt{n})\) rounds of the sos hierarchy, that on input an \(n^2\times n^2\) measurement operator \(\cM\), distinguishes between the following two cases:

- YES: There exists a separable state \(\rho \in \mathbb{C}^{n^2\times n^2}\) such that \(\Tr(\rho \cM)=1\).
- NO: For every separable \(\rho \in \mathbb{C}^{n^2\times n^2}\), \(\Tr(\rho\cM) \leq s\).

To our knowledge, this algorithm is the first for this problem that beats the brute force bound of \(2^{O(n)}\) time for general measurements.

Like the algorithms of Doherty, Parrilo, and Spedalieri (2004; **???**), our algorithm is based on the *sum
of squares* SDP hierarchy, but we introduce new techniques for analyzing
it that we believe are of independent interest. As we discuss in
Section [sec:conclusions], it is a fascinating open question to
explore whether our techniques can be quantitatively strengthened to
yield faster algorithms and/or extended for other problems such as the
\(2\) to \(4\) norm and small set expansion problems, which have been shown
to be related to the BSS problem by Barak et al. (2012) (albeit
in a different regime of parameters than the one we deal with in this
work). As we remark below, this question seems related to other
longstanding open questions in computer science and in particular to the
*log rank conjecture* in communication
complexity Lovász and Saks (1988).

[Imperfect completeness] [rem:perfect-completeness] We state our results for the case of perfect completeness for simplicity, but all of the proofs extend to the case of “near perfect completeness” where in the YES case we replace the condition \(\Tr(\rho\cM)=1\) with the condition \(\Tr(\rho\cM) = 1 - \tfrac{1}{n}\) (see the proof of Theorem [thm:alg-analysis]). It is an interesting open problem to find out whether our results can extend to the setting where in the YES case \(\Tr(\rho\cM)=1-\epsilon\) for some absolute constant \(\epsilon\). We conjecture that this is indeed the case.

While the natural setting for quantum information theory is the *complex
numbers*, much of the power and interest already arises in the case of
the real numbers, which is more natural for the sos algorithm (though it
does have a complex-valued generalization). Hence in this version of this
paper we focus solely on the case that all operators, subspaces, and
matrices are *real*. We note that there is a natural mapping of an
\(n\times n\) complex matrix \(A+iB\) (with real \(n\times n\) matrices
\(A,B\)) into the \(2n\times 2n\) real matrix
\(\left( \begin{smallmatrix} A&B\\ -B&A \end{smallmatrix} \right)\). Note
that a complex rank one decomposition \(A+iB = (x+iy)(z+iw)^*\) for
\(x,y,z,w\in \R^n\) will translate into a rank two decomposition
\(\left( \begin{smallmatrix} x&y\\ -y&x \end{smallmatrix} \right)\left( \begin{smallmatrix} z&w\\ -w&z \end{smallmatrix} \right)^*\).
We can use a higher dimensional version of our result (see
Reference:thm:multi-dimensional-rank1) to find such decompositions,
though we defer the complete derivation to the full version of this
paper.
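The embedding and the rank doubling can be checked numerically; here is a sketch with hypothetical random data:

```python
import numpy as np

def real_embedding(C):
    """Map an n x n complex matrix A + iB to the 2n x 2n real matrix
    [[A, B], [-B, A]]."""
    A, B = C.real, C.imag
    return np.block([[A, B], [-B, A]])

n = 3
rng = np.random.default_rng(2)
x, y, z, w = (rng.standard_normal(n) for _ in range(4))

# A complex rank one matrix (x + iy)(z + iw)^* ...
C = np.outer(x + 1j * y, np.conj(z + 1j * w))
R = real_embedding(C)

# ... embeds as a rank TWO real matrix, matching the decomposition
# [[x, y], [-y, x]] [[z, w], [-w, z]]^T from the text.
print(np.linalg.matrix_rank(R))  # 2
```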

# Our techniques

Our algorithm follows a recent paradigm of constructing rounding
algorithms for the sum of squares SDP by considering its solutions as
“pseudo distributions” (**???**). These can be thought of as
capturing the uncertainty that a computationally bounded solver has
about the optimal solution of the given problem, analogously to the way
that probability distributions model uncertainty in the classical
information-theoretic Bayesian setting.

Somewhat surprisingly, our main tools in analyzing the algorithm are
techniques that arose in the proof of the currently best known upper bound
for the *log rank conjecture* Lovász and Saks (1988). This conjecture
has several equivalent formulations, one of which is that every
\(N\times N\) matrix \(A\) with Boolean (i.e., \(0/1\)) entries and rank at
most \(n\), contains a submatrix of size at least
\(2^{-\poly\log(n)}N \times 2^{-\poly\log(n)}N\) that is of rank one. (The
original formulation of the log rank conjecture is that every
such matrix has communication complexity at most \(\poly\log(n)\);
Nisan and Wigderson (1994) showed that this is
equivalent to the condition that such matrices contain a
monochromatic submatrix of the above size. Every monochromatic
submatrix is rank one, and every rank one submatrix of size
\(s\times s\) of a Boolean valued matrix contains a monochromatic
submatrix of size at least \(\tfrac{s}{2} \times \tfrac{s}{2}\).)
The best known bound on the log rank conjecture is by
Lovett (2014), who proved that every such matrix
contains a submatrix of size at least
\(2^{-\tilde{O}(\sqrt{n})}N \times 2^{-\tilde{O}(\sqrt{n})}N\).

Our algorithm works by combining the following observations:

- Lovett’s proof can be generalized to show that *every* \(N\times N\) rank \(n\) real (or complex) matrix \(A\) (not necessarily with Boolean entries) contains a \(2^{-\tilde{O}(\sqrt{n})}N\times 2^{-\tilde{O}(\sqrt{n})}N\) submatrix that is *close* to rank one in Frobenius norm.
- If \(\mu\) is an *actual* distribution over solutions to the sos program for the BSS problem on dimension \(n\), then we can transform \(\mu\) into an \(N\times N\) rank \(n\) matrix \(A=A(\mu)\) such that extracting an approximate solution from \(A\) in time \(2^{\tilde{O}(k)}\) can be done if \(A\) contains an approximately rank one submatrix of size at least \(2^{-k}N\times 2^{-k}N\).
- Moreover, all the arguments used to establish steps 1 and 2 above can be encapsulated in the sum of squares framework, and hence yield an algorithm that extracts an approximately optimal solution to the BSS problem from a degree \(\tilde{O}(\sqrt{n})\) pseudo-distribution \(\mu\) that “pretends” to be supported over exact solutions.

Thus, even though in the sos setting there is no actual distribution
\(\mu\), and hence no actual matrix \(A\), we can still use structural
results on this “fake” (or “pseudo”) matrix \(A\) to obtain an *actual*
rounding algorithm. We view this as a demonstration of the power of the
“pseudo distribution” paradigm to help in the discovery of new
algorithms, that might not seem as natural without placing them in this
framework.

## Rounding from rank one reweighings

We now give a more detailed (yet still quite informal) overview of the
proof. As mentioned above, we focus on the case that the
\(n^2 \times n^2\) measurement matrix \(\cM\) is *real* (as opposed to
*complex*) valued.

Let \(\cW \subseteq \R^{n^2}\) be the subspace of vectors \(X\) such that
\(X^\top \cM X = \norm{X}^2\) (this is a subspace since \(\cM \preceq I\)
and hence \(\cW\) is the eigenspace of \(\cM\) corresponding to the
eigenvalue \(1\)). We pretend that the sos algorithm yields a distribution
\(\mu\) over rank one matrices of the form \(X=uv^\top\) such that
\(X \in \cW\). When designing a rounding algorithm, we only have access to
*marginals* of \(\mu\), of the form \(\E_\mu f(X)\) for some “simple”
function \(f\) (e.g., a low degree polynomial). We need to show that we
can use such “simple marginals” of \(\mu\) to extract a single rank one
matrix \(u_0v_0^\top\) that has large projection into \(\cW\).

We start with the following simple observation:

If \(\mu\) is a distribution over matrices \(X\) in a subspace \(\cW \subseteq \R^{n^2}\) such that the expectation \(\E_\mu X\) is approximately rank one, in the sense that \(\norm{L-\E_\mu X}_F \leq \epsilon\norm{L}_F\) for some rank one matrix \(L\), then \(\Tr (\cM\rho) \geq 1-2\epsilon^2\) where \(\rho\) is the pure separable state \(\rho = LL^\top/\norm{L}_F^2\).

Since \(\mu\) is supported over matrices in \(\cW\), \(\E_\mu X\) is in \(\cW\). But this means that the \(\ell_2\) (i.e., Frobenius) norm distance of \(L\) to the subspace \(\cW\) is at most \(\epsilon\norm{L}_F\). Since \(\Tr(XX^\top\cM) = \Tr(X^\top \cM X) = \norm{X}_F^2\) for every \(X\in\cW\), the value \(\Tr(LL^\top \cM)\) will be at least as large as the norm squared of the projection of \(L\) to \(\cW\).

In particular this means that if we were lucky and the condition of Reference:lem:round-from-rankone’s statement occurs, then it would be trivial for us to extract from the expectation \(\E_\mu X\) (which is a very simple marginal) a rank one matrix that is close to \(\cW\), and hence achieves probability \(1-2\epsilon^2\) in the measurement \(\cM\). Note that even if every matrix in the support of \(\mu\) has unit norm, the matrix \(L\) could be of significantly smaller norm. We just need that there is some dimension-one subspace on which the cancellations among these matrices are significantly smaller than the cancellations in the rest of the dimensions.
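The rounding step in the observation above can be sketched numerically: compute \(\E_\mu X\), take its best rank one approximation \(L\) via the SVD, and output the pure state \(\rho = LL^\top/\norm{L}_F^2\). A toy sketch with hypothetical data (matrices whose noise directions cancel in the mean):

```python
import numpy as np

n = 4
rng = np.random.default_rng(3)

# Hypothetical toy data: matrices in a subspace whose mean is exactly
# rank one because the noise directions cancel.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
X0 = np.outer(u, v)
noise = rng.standard_normal((n, n))
samples = [X0 + 0.05 * noise, X0 - 0.05 * noise]

mean = sum(samples) / len(samples)          # E_mu X (here exactly X0)
U, s, Vt = np.linalg.svd(mean)
L = s[0] * np.outer(U[:, 0], Vt[0, :])      # best rank one approximation
rho = np.outer(L.flatten(), L.flatten()) / np.linalg.norm(L, 'fro') ** 2
print(np.isclose(np.trace(rho), 1.0))       # rho is a trace-one pure state
```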

Of course there is no reason we should be so lucky, but one power that
the marginals give us is the ability to *reweigh* the original
distribution \(\mu\). In particular, for every “simple” non-negative
function \(\zeta:\R^{n^2}\rightarrow\R_+\), we can compute the marginal
\(\E_{\mu_\zeta} X\) where \(\mu_\zeta\) is the distribution over matrices
where \(\Pr_{\mu_\zeta}[X]\) (or \(\mu_\zeta(X)\) for short) is proportional
to \(\zeta(X)\mu(X)\). A priori in the degree \(k\) sos algorithm we are
only able to reweigh using functions \(\zeta\) that are polynomials of
degree at most \(k\), but for the purposes of this overview, let us
pretend that we can reweigh using any function that is not too “spiky”
and make the following definition:

Let \(\mu\) be a probability distribution. We say that a probability
distribution \(\mu'\) is a *\(k\)-spike reweighing* of \(\mu\) if
\(\Delta_{KL}(\mu'\|\mu)\leq k\), where \(\Delta_{KL}(\mu'\|\mu)\) denotes
the Kullback-Leibler divergence of \(\mu'\) from \(\mu\), defined as
\(\E_{X\sim\mu'} \log(\mu'(X)/\mu(X))\).
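For intuition, conditioning a distribution on an event of measure \(2^{-k}\) is exactly a \(k\)-spike reweighing when the logarithm is taken base \(2\); a small numerical sketch (hypothetical parameters):

```python
import numpy as np

def kl_divergence(mu_prime, mu):
    """KL divergence D(mu' || mu) in bits for discrete distributions."""
    mask = mu_prime > 0
    return np.sum(mu_prime[mask] * np.log2(mu_prime[mask] / mu[mask]))

N = 1024
mu = np.full(N, 1.0 / N)            # uniform distribution over [N]
# Conditioning the uniform distribution on a subset of measure 2^{-k}
# gives a k-spike reweighing:
k = 4
subset = N // 2 ** k                # a 2^{-k} fraction of the points
mu_prime = np.zeros(N)
mu_prime[:subset] = 1.0 / subset

print(kl_divergence(mu_prime, mu))  # 4.0
```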

Thus at least on a “moral level”, the following theorem should be helpful for proving our main result:

Let \(\mu\) be any distribution over rank one \(n\times n\) matrices and \(\epsilon>0\). Then there exists a \(\sqrt{n}\poly(1/\epsilon)\)-spike reweighing \(\mu'\) of \(\mu\) and a rank one matrix \(L\) such that \[\norm{L-\E_{\mu'} X}_F\leq \epsilon\norm{L}_F\]

One of the results of this paper is a proof of Reference:thm:rank-one-reweighing (see Reference:rem:proof-rank-one-reweighing). It turns out that this can be done using ideas from the works on the log rank conjecture.

## From monochromatic rectangles to rank one reweighings

What does Theorem Reference:thm:rank-one-reweighing have to do with the
log rank conjecture? To see the connection, let us imagine that the
distribution \(\mu\) is *flat* in the sense that it is a uniform
distribution over rank one matrices
\(\{ u_1v_1^\top ,\ldots, u_Nv_N^\top \}\) (this turns out to be
essentially without loss of generality) and consider the \(n\times N\)
matrices \(U\) and \(V\) whose columns are \(u_1,\ldots,u_N\) and
\(v_1,\ldots,v_N\) respectively. The \(n\times n\) matrix
\(\E_\mu u_iv_i^\top\) is proportional to \(UV^\top\). This matrix has the
same spectrum (i.e., singular values) as the \(N\times N\) matrix
\(U^\top V\). Hence, \(UV^\top\) is close to a rank one matrix if and only
if \(U^\top V\) is, since in both cases this happens when the square of
the top singular value dominates the sum of the squares of the rest of
the singular values. Now a flat distribution \(\mu'\) with
\(\Delta_{KL}(\mu'\|\mu)\leq k\) corresponds to the uniform distribution
over \(\{ u_iv_i^\top \}_{i\in I}\) where \(I \subseteq [N]\) satisfies
\(|I| \geq 2^{-k}N\). We can see that \(\E_{\mu'} u_iv_i^\top\) will be
approximately rank one if and only if the submatrix of \(U^\top V\)
corresponding to \(I\) is approximately rank one. Using these ideas it can
be shown that Theorem [thm:rank-one-reweighing] is equivalent to the
following theorem. (To show this formally we use the fact that, by
Markov’s inequality, every distribution \(\mu'\) with
\(\Delta_{KL}(\mu'\|U_{[N]})=\log N - H(\mu') = k\) is
\(\epsilon\)-close to a distribution with min entropy
\(\log N-O(k/\epsilon)\), and every distribution of the latter type is
a convex combination of flat distributions of support at least
\(N2^{-O(k/\epsilon)}\).)

Let \(A\) be any \(N\times N\) matrix of rank at most \(n\), and let \(\epsilon>0\). Then there exists a subset \(I\subseteq [N]\) with \(|I| \geq \exp(-\sqrt{n}\poly(1/\epsilon))N\) and a rank one matrix \(L\) such that \[\norm{L-A_{I,I}}_F\leq \epsilon\norm{L}_F\] where \(A_{I,I}\) is the submatrix corresponding to restricting the rows and columns of \(A\) to the set \(I\).

One can think of Reference:thm:rank-one-reweighing-dual as an
approximate and robust version of Lovett’s result (2014)
mentioned above. Lovett showed that
every \(N\times N\) matrix of rank \(n\) with *Boolean* entries has a
\(2^{-\tilde{O}(\sqrt{n})}N\times 2^{-\tilde{O}(\sqrt{n})}N\)
submatrix that is of exactly rank 1. We show that the condition of
Booleanity is not needed if one is willing to relax the conclusion to
having a submatrix that is only *approximately* rank 1. It is of course
extremely interesting in both cases whether the bound of
\(\tilde{O}(\sqrt{n})\) can be improved further, ideally all the way to
\(\polylog(n)\). In the Boolean setting, such a bound might prove the log
rank conjecture, while in our setting such a bound (assuming it
extends to “pseudo matrices”) would yield a quasipolynomial time
algorithm for BSS, hence showing that \(QMA(2) \subseteq EXP\). (We note
a caveat that this depends on the notion of “approximate” used.
Gavinsky and Lovett (2014) showed that to prove the log rank conjecture
it suffices to find, in a rank \(n\) Boolean matrix, a rectangle of
measure \(\exp(-\polylog(n))\) that is *nearly monochromatic*, in the
sense of having a \(1-1/O(n)\) fraction of its entries equal. In this
paper we are more concerned with rectangles whose distance to being
rank one (or monochromatic) is some \(\epsilon>0\) that is only a small
constant or \(1/\polylog(n)\).) It can be
shown that, as stated, Theorem [thm:rank-one-reweighing] is tight.
However there are different notions of being “close to rank one” that
could be useful in both the log-rank and the quantum separability
setting, for which there is hope to obtain substantially improved
quantitative bounds. We discuss some of these conjectural directions in
Reference:sec:conclusions.

## Overview of proof

In the rest of this technical overview, we give a proof sketch of
Reference:thm:rank-one-reweighing-dual and then discuss how the proof
can be “lifted” to hold in the setting of sum of square
pseudo-distributions. The condition that a matrix \(A\) is of rank \(n\) is
the same as that \(A = U^\top V\) where \(U,V\) are two \(n\times N\) matrices
with columns \(u_1,\ldots,u_N\) and \(v_1,\ldots,v_N\) respectively (i.e.,
\(A_{i,j}=\iprod{u_i,v_j}\) for all \(i,j\in [N]\)). We will restrict our
attention to the case that all the columns of \(U\) and \(V\) are of unit
norm. (This restriction is easy to lift and anyway holds automatically
in our intended application.) In this informal overview, we also
restrict attention to the *symmetric* case, in which \(A=A^\top\) and can
be written as \(A=U^\top U\), and we also assume that \(U\) is *isotropic*, in
the sense that \(\E_{i\in [N]} u_iu_i^\top = \tfrac{1}{n} \Id\).
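A sanity check on this setup: the \(N\times N\) matrix with entries \(\iprod{u_i,u_j}\) shares its nonzero eigenvalues with the \(n\times n\) matrix \(UU^\top\), which is the spectral fact underlying the equivalence discussed earlier. A quick numerical sketch (hypothetical dimensions):

```python
import numpy as np

n, N = 4, 10
rng = np.random.default_rng(5)
U = rng.standard_normal((n, N))
U /= np.linalg.norm(U, axis=0)             # unit-norm columns u_1, ..., u_N

# The N x N Gram-type matrix U^T U (entries <u_i, u_j>) shares its
# nonzero eigenvalues with the n x n matrix U U^T.
eig_big = np.linalg.eigvalsh(U.T @ U)[-n:]  # top n of the N eigenvalues
eig_small = np.linalg.eigvalsh(U @ U.T)
print(np.allclose(eig_big, eig_small))      # True
```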

Our inspiration is Lovett’s result Lovett (2014) which establishes a stronger conclusion for Boolean matrices. In particular, our proof follows Rothvoß’s proof Rothvoß (2014) of Lovett’s theorem, though the non-Boolean setting does generate some non-trivial complications. The \(N\times N\) matrix \(A\) satisfies that \(A_{i,j} = \iprod{u_i,u_j}\). An equivalent way to phrase our goal is that we want to find a subset \(I\subseteq [N]\) over the indices such that:

**(i)** \(|I| \geq \exp(-\tilde{O}(\sqrt{n}))N\).

**(ii)** If \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n\) are the eigenvalues of \(\E_{i\in I} u_iu_i^\top\), then \(\epsilon^2 \lambda_1^2 \geq \sum_{j=2}^n \lambda_j^2\).

We will choose the set \(I\) *probabilistically* and show that **(i)** and
**(ii)** above hold in *expectation*. It is not hard to use standard
concentration of measure bounds to then deduce the desired result but we
omit these calculations from this informal overview.

Our initial attempt for the choice of \(I\) is simple, and is directly
inspired by Rothvoß (2014). We choose a random
Gaussian vector \(g\sim N(0,\tfrac{1}{n}\Id)\) (i.e., for every \(i\), \(g_i\)
is an independent Gaussian of mean zero and variance \(1/n\)). We
then define \(I_g = \{ i : \iprod{g,u_i} \geq \sqrt{k/n} \}\) where
\(k=\tilde{O}(\sqrt{n})\) is a parameter to be chosen later. Since \(u_i\)
is a unit vector, \(\iprod{g,u_i}\) is a Gaussian of variance \(1/n\), and
so for every \(i\), the probability that \(i\in I_g\) is \(\exp(-O(k))\) hence
satisfying **(i)** in expectation.

The top eigenvalue \(\lambda_1\) of \(\E_{i\in I_g} u_iu_i^\top\) will be at least \(\Omega(k/n)\) in expectation. Indeed, the Gaussian vector \(g\) that we choose (which satisfies \(\norm{g}^2 = 1\pm o(1)\) with very high probability) will satisfy \(g^\top \left( \E_{i\in I_g} u_i u_i^\top \right) g = \E_{i\in I_g} \iprod{u_i,g}^2 \geq k/n\), and hence in expectation the top eigenvalue of \(\E_{i\in I_g} u_iu_i^\top\) will be at least \((1-o(1))k/n\).
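This first restriction step can be simulated directly; a sketch with hypothetical data (random unit vectors, which are nearly isotropic for large \(N\)):

```python
import numpy as np

n, N, k = 20, 100000, 6
rng = np.random.default_rng(6)

# Hypothetical data: random unit vectors u_1, ..., u_N, which are
# close to isotropic for large N.
U = rng.standard_normal((n, N))
U /= np.linalg.norm(U, axis=0)

g = rng.standard_normal(n) / np.sqrt(n)   # g ~ N(0, (1/n) Id)
I = (g @ U) >= np.sqrt(k / n)             # the restricted set I_g

measure = I.mean()                        # roughly exp(-O(k))
M = (U[:, I] @ U[:, I].T) / I.sum()       # E_{i in I_g} u_i u_i^T
lam1 = np.linalg.eigvalsh(M)[-1]          # top eigenvalue

print(f"measure = {measure:.5f}, lambda_1 = {lam1:.3f} vs 1/n = {1/n}")
```

With these (hypothetical) parameters, the restricted set is non-empty and its top eigenvalue rises well above the isotropic value \(1/n\).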

So, if we could only argue that in expectation it holds that \(\sum_{j=2}^n \lambda_j^2 \ll k^2/n^2 = \polylog(n)/n\), then we’d be done. Alas, this is not necessarily the case. However, if this does fail, we can see that we have made progress, in the sense that by restricting to the indices in \(I\) we raised the squared Frobenius norm of \(\E u_iu_i^\top\) from the previous value of \(1/n\) (under the assumption that \(U\) was isotropic) to \(\polylog(n)/n\). Our idea is to show that this holds in general: we can select a Gaussian vector \(g\) and define the set \(I_g\) as above such that by restricting to the indices in \(I_g\) we either get an approximately rank one matrix or we increase the squared Frobenius norm of our expectation matrix by at least a \((1+\epsilon)\) factor for an appropriately chosen \(\epsilon>0\). Since the latter cannot happen more than \(\log n/\epsilon\) times, the final set of indices still has measure \(\exp(-\tilde{O}(\sqrt{n}))\).

In further rounds, if our current set of indices is \(I\) and the matrix (after subtracting from each vector \(u_i\) its expectation) \(U_I = \E_{i\in I} u_iu_i^\top = \sum_{j=1}^n \lambda_j v_jv_j^\top\) is not approximately rank one, then rather than choosing \(g\) as a standard Gaussian, we choose it from the distribution \(N(0,U_I)\), where we use \(U_I\) as the covariance matrix. The expected squared norm of \(g\) is simply \(\Tr(U_I)\), which equals \(1\). For every \(i\), the random variable \(\iprod{u_i,g}\) is a Gaussian with mean zero and variance \(\sum_{j=1}^n \iprod{u_i,v_j}^2\lambda_j\). But for every \(j\), in expectation over \(i\), \(\E \iprod{u_i,v_j}^2 = \lambda_j\), and so it turns out that we can assume that this random variable has variance \(\sum_j \lambda_j^2 = \norm{U_I}_F^2\).

This means that if we choose \(I' = \{ i\in I: \iprod{u_i,g} \geq \sqrt{k}\norm{U_I}_F \}\), we get a subset of \(I\) with measure \(\exp(-O(k))\). But now the new matrix \(U_{I'} = \E_{i\in I'} u_iu_i^\top\) will have an eigenvalue of magnitude at least \(k\norm{U_I}_F^2\), which is much larger than \(\norm{U_I}_F\) since we chose \(k\gg \sqrt{n}\). Hence \(U_{I'}\) has significantly larger Frobenius norm than \(U_I\).

The above arguments can be made precise, and we do so in Section [sec:structure-thm].

## Rectangle lemma for pseudo-distributions

The above is sufficient to show that given \(n\times N\) matrices
\(U=(u_1|\cdots|u_N)\) and \(V=(v_1|\cdots|v_N)\) (which we view as inducing
a distribution over rank one matrices by taking \(u_iv_i^\top\) for a
random \(i\)), we can condition on a not too unlikely event (of
probability \(\exp(-\tilde{O}(\sqrt{n}))\)) to obtain that \(\E u_iv_i^\top\)
is roughly rank one. But in the sos setting we are *not* given such
matrices. Rather we have access to an object called a
“pseudo-distribution” \(\mu\), which behaves to a certain extent as if
it were such a distribution, but for which this is not actually the case. In
particular, we are not able to sample from \(\mu\), or condition it on
arbitrary events, but rather only compute \(\E_\mu f(X)\) for polynomials
\(f\) of degree at most \(\tilde{O}(\sqrt{n})\), and even these expectations
are only “pseudo expectations” in the sense that they do not need to
correspond to any actual probability distribution.

To lift the arguments above to the sos setting, we need to first show
that if \(\mu\) were an actual distribution, then we could perform all of
the above operations using only access to \(\tilde{O}(\sqrt{n})\) degree
moments of \(\mu\). Then we need to show that our *analysis* can be
captured by the degree \(\tilde{O}(\sqrt{n})\) sos proof systems. Both
these steps, which are carried out in Section [??] of this paper, are
rather technical and non-trivial, and we do not describe them in this
overview.

For starters, we need to move from *conditioning* a probability
distribution to *reweighing* it. All of our conditioning procedures
above had the form of restricting to \(i\)’s such that
\(\xi(i) \geq \sqrt{k}\), where \(\xi(i)\) was probabilistically chosen so
that for every \(i\), \(\xi(i)\) is a mean zero and standard deviation one
random variable satisfying
\(\Pr[ \xi(i) \geq \ell ] = \exp(-\Theta(\ell^2))\). We replace this
conditioning by *reweighing* the distribution \(i\) with the function
\(\zeta(i)=\exp(\sqrt{k}\xi(i))\). Note that iterative conditioning based
on functions \(\xi_1,\ldots,\xi_t\) can be replaced with reweighing by the
product function \(\zeta_1\cdots\zeta_t\). We then show that these
\(\zeta_j\) functions can be approximated by polynomials of \(\tilde{O}(k)\)
degree.
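The polynomial approximation step can be sanity-checked numerically: a degree-\(d\) Taylor truncation of \(\zeta(x)=\exp(\sqrt{k}\,x)\) is accurate on the range where a standard-deviation-one variable typically lives. A sketch with hypothetical parameter choices:

```python
import math
import numpy as np

k = 9
d = 40                      # polynomial degree (the paper's regime is of order k, up to logs)
sqrt_k = math.sqrt(k)

# Degree-d Taylor truncation of the reweighing function
# zeta(x) = exp(sqrt(k) * x), standing in for conditioning on {xi >= sqrt(k)}.
coeffs = [sqrt_k ** j / math.factorial(j) for j in range(d + 1)]

def zeta_poly(x):
    return sum(c * x ** j for j, c in enumerate(coeffs))

# Accuracy on a few standard deviations around zero:
xs = np.linspace(-3, 3, 7)
rel_err = np.abs(zeta_poly(xs) - np.exp(sqrt_k * xs)) / np.exp(sqrt_k * xs)
print(rel_err.max())  # small on this range
```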

The arguments above allow us to construct a rounding algorithm that at
least makes sense syntactically, in the sense that it takes the
\(\tilde{O}(\sqrt{n})\) degree moments of \(\mu\) and produces a rank one
matrix that is a candidate solution to the original problem. To analyze
this algorithm, we need to go carefully over our analysis before, and
see that all the arguments used can be embedded in the sos proof system
with relatively low degree. Luckily we can rely on the recent body of
works that establishes a growing toolkit of techniques to show such
embeddings (**???**).

# References

Barak, Boaz, Fernando G. S. L. Brandão, Aram Wettroth Harrow, Jonathan A. Kelner, David Steurer, and Yuan Zhou. 2012. “Hypercontractivity, Sum-of-Squares Proofs, and Their Applications.” In *STOC*, 307–26. ACM.

Brandão, Fernando G. S. L., and Aram Wettroth Harrow. 2015. “Estimating Operator Norms Using Covering Nets.” *CoRR* abs/1509.05065.

Doherty, Andrew C, Pablo A Parrilo, and Federico M Spedalieri. 2004. “Complete Family of Separability Criteria.” *Physical Review A* 69 (2). APS: 022308.

Gavinsky, Dmitry, and Shachar Lovett. 2014. “En Route to the Log-Rank Conjecture: New Reductions and Equivalent Formulations.” In *ICALP (1)*, 8572:514–24. Lecture Notes in Computer Science. Springer.

Harrow, Aram Wettroth, and Ashley Montanaro. 2013. “Testing Product States, Quantum Merlin-Arthur Games and Tensor Optimization.” *J. ACM* 60 (1): 3.

Horodecki, Michał, Paweł Horodecki, and Ryszard Horodecki. 1996. “Separability of Mixed States: Necessary and Sufficient Conditions.” *Physics Letters A* 223 (1). Elsevier: 1–8.

Lasserre, Jean B. 2001. “An Explicit Exact SDP Relaxation for Nonlinear 0-1 Programs.” In *IPCO*, 2081:293–303. Lecture Notes in Computer Science. Springer.

Lovász, László, and Michael E. Saks. 1988. “Lattices, Möbius Functions and Communication Complexity.” In *FOCS*, 81–90. IEEE Computer Society.

Lovett, Shachar. 2014. “Communication Is Bounded by Root of Rank.” In *STOC*, 842–46. ACM.

Nisan, Noam, and Avi Wigderson. 1994. “On Rank Vs. Communication Complexity.” In *FOCS*, 831–36. IEEE Computer Society.

Parrilo, Pablo A. 2000. “Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization.” PhD thesis, Citeseer.

Rothvoß, Thomas. 2014. “A Direct Proof for Lovett’s Bound on the Communication Complexity of Low Rank Matrices.” *CoRR* abs/1409.6366.

Vedral, Vlatko. 2008. “Quantifying Entanglement in Macroscopic Systems.” *Nature* 453 (7198). Nature Publishing Group: 1004–7.