# SOS and the unit sphere: Sparse vectors, tensor decomposition, dictionary learning, and quantum separability

So far our main focus has been on optimizing polynomials over the
Boolean cube, but as we’ve seen, the sos algorithm can be applied in
more general settings. In particular, several very interesting problems
can be phrased as, or relate to, the task of maximizing a polynomial over the
*unit sphere*. That is, given some polynomial \(p\from\R^n\to\R\), compute
or approximate \(\max_{\norm{x}=1} p(x)\) and/or find an input \(x\) that
(approximately) achieves this maximum. This problem is also sometimes known as computing
the *injective tensor norm*; see (Harrow and Montanaro 2010).

Some examples that we will discuss include the following:

### Tensor PCA and tensor decomposition

In **principal component analysis (PCA)** we are given samples
\(x^1,\ldots,x^m\) of some distribution \(X\) over \(\R^n\), and want to find
the direction \(v\in\R^n\) that maximizes \(\sum_{i,j} v_iv_jM_{i,j}\) where
\(M_{i,j} = \tfrac{1}{m}\sum_{k=1}^m x^k_ix^k_j\). The idea is that if \(X\)
has the form \(v_0+Y\) with \(v_0\in\R^n\) and \(Y\) a mean-zero “noise
variable”, then \(\E XX^\top = v_0v_0^\top + \E YY^\top\), and hence (say,
when \(Y\) is isotropic) we expect the direction \(v\) that maximizes this
correlation to be proportional to \(v_0\).
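The reduction of PCA to a maximum-eigenvalue computation can be sketched numerically. The following is a minimal illustration, where the planted direction \(v_0\), the noise level, and the sample size are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: planted direction v0 plus isotropic Gaussian noise.
n, m = 50, 5000
v0 = np.zeros(n); v0[0] = 1.0
samples = v0 + 0.3 * rng.standard_normal((m, n))  # rows are x^1, ..., x^m

# Second moment matrix M_{i,j} = (1/m) sum_k x^k_i x^k_j.
M = samples.T @ samples / m

# PCA = maximizing the quadratic sum_{i,j} v_i v_j M_{i,j} over unit v,
# i.e. a top-eigenvector computation.
eigvals, eigvecs = np.linalg.eigh(M)
v = eigvecs[:, -1]  # eigenvector of the largest eigenvalue

print(abs(v @ v0))  # close to 1: the recovered direction aligns with v0
```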

PCA only involves maximizing a *quadratic* polynomial which amounts to
an efficiently solvable maximum eigenvalue computation. However, if for
example \(X\) is a *mixture* with equal weights of two distributions of
the form \(v_0 + Y\) and \(v_1 + Y'\) (where \(v_0,v_1\) are unit vectors, and
assume they are orthogonal for simplicity) then one can see that
\(\E XX^\top\) would be proportional to the identity linear operator on
the subspace \(Span \{ v_0,v_1 \}\). Hence, even if we had an infinite
number of samples and could get the expectation of the second moment
matrix \(\E XX^\top\) precisely, we would still not be able to recover
\(v_0\) and \(v_1\) but rather only the subspace that they span. However, it
can be shown that in this case \(v_0\) and \(v_1\) will be the only two
global maximizers of \(\sum_{i,j,k} T_{i,j,k}\,x_ix_jx_k\) where
\(T_{i,j,k} = \E X_iX_jX_k\). So, recovering these centers reduces to
maximizing this polynomial.
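To see why maximizing this degree-3 polynomial recovers a center, here is a sketch using *tensor power iteration* (a standard heuristic, not the sos algorithm itself) on the idealized noiseless tensor \(T = \tfrac12 v_0^{\otimes 3} + \tfrac12 v_1^{\otimes 3}\) with orthogonal \(v_0, v_1\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20

# Exact third-moment tensor for the noiseless orthogonal mixture:
# T = (1/2) v0^{⊗3} + (1/2) v1^{⊗3}.
v0 = np.zeros(n); v0[0] = 1.0
v1 = np.zeros(n); v1[1] = 1.0
T = 0.5 * np.einsum('i,j,k->ijk', v0, v0, v0) \
  + 0.5 * np.einsum('i,j,k->ijk', v1, v1, v1)

# Tensor power iteration: repeatedly map x -> T(·, x, x) and renormalize.
# From a random start this converges to one of the two components.
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
for _ in range(50):
    x = np.einsum('ijk,j,k->i', T, x, x)
    x /= np.linalg.norm(x)

print(max(abs(x @ v0), abs(x @ v1)))  # ≈ 1: converged to v0 or v1
```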

More generally the **tensor PCA** problem (see
(Richard and Montanari 2014)) is defined as follows: we are given the
\(d\) level moments \(\E X^{\otimes d}\) of some random variable \(X\) over
\(\R^n\), and our goal is to find a vector \(x\) maximizing the value of
\(p(x) = \E \iprod{X,x}^d\). It is an extremely useful generalization of
PCA, though unfortunately, unlike PCA, it is NP-hard in the worst case and
believed to be hard even on average.

If tensor PCA is the higher degree analog of computing the maximum
eigenvalue, the **tensor decomposition problem** (see
(Kolda and Bader 2009)) is the higher degree analog of *singular value
decomposition*. Namely, given some tensor \(T\in\R^{n^d}\) we want to find
the smallest \(r\) and \(dr\) \(n\)-dimensional vectors
\(\{ v_{i,j} \}_{i\in[r],j\in[d]}\) so we can write \(T\) as
\(\sum_{i=1}^r v_{i,1}\otimes \cdots \otimes v_{i,d}\). This is an
incredibly useful primitive and many of the known algorithms and
heuristics for it are obtained by iteratively running subroutines for
tensor PCA and then “peeling off” the resulting vectors. This viewpoint of tensor decomposition as a generalization of
tensor PCA to multiple vectors is somewhat inaccurate; the two
problems are really somewhat orthogonal. Tensor decomposition generalizes
the problem of finding the smallest matrices \(B\) and \(C\) such that
\(A=BC\) for a given matrix \(A\).
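The “peeling off” strategy can be sketched for the idealized case of an orthogonally decomposable tensor (a hypothetical easy instance; general tensors are much harder):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 15, 3

# Hypothetical orthogonally decomposable 3-tensor: T = sum_i v_i^{⊗3}.
V, _ = np.linalg.qr(rng.standard_normal((n, r)))  # orthonormal columns v_1..v_r
T = sum(np.einsum('i,j,k->ijk', V[:, i], V[:, i], V[:, i]) for i in range(r))

def top_component(T, iters=100):
    """Tensor power iteration for an (approximate) top component."""
    x = rng.standard_normal(T.shape[0]); x /= np.linalg.norm(x)
    for _ in range(iters):
        x = np.einsum('ijk,j,k->i', T, x, x)
        x /= np.linalg.norm(x)
    lam = np.einsum('ijk,i,j,k->', T, x, x, x)  # weight of this component
    return lam, x

# "Peel off" components one at a time by deflating the tensor.
recovered = []
for _ in range(r):
    lam, x = top_component(T)
    recovered.append(x)
    T = T - lam * np.einsum('i,j,k->ijk', x, x, x)

# Each recovered vector matches some column of V (up to sign).
overlaps = [max(abs(x @ V[:, i]) for i in range(r)) for x in recovered]
print(overlaps)
```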

### Finding sparse vectors in subspaces

In the **sparse vector problem** we are given an \(n\)-dimensional
subspace \(V\subseteq \R^m\) and want to find the nonzero vector \(v\in V\)
that is as *sparse* as possible, in the sense of having as few nonzero
entries as possible (or in the relaxed sense that a few of the entries have
large magnitude and the rest are very small). This is a natural problem
that can be thought of as a continuous variant of finding the *shortest
codeword* in a linear code. It also arises in a variety of applications
(including the sparse coding and small set expansion applications
mentioned below); see Demanet and Hand (2014). In particular, in the context
of *compressed sensing* certifying that a subspace does not contain a
sparse vector is closely related to certifying the *restricted isometry
property* (Candès and Tao 2005) of subspaces.

The sparse vector problem does not immediately translate into maximizing
a polynomial over the sphere, but it turns out that the sparseness of a vector
\(v\) can be approximated by the ratio \(\norm{v}_q/\norm{v}_p\) for
\(q>p\). Alas, there appears to be a subtle tradeoff between the quality
of this approximation and the tractability of computing this ratio. For
example, for \(p=1\) and \(q=\infty\) this ratio can be efficiently computed
(**exercise**) but only yields an \(\Tilde{O}(\sqrt{n})\) approximation.
One can get very good approximation guarantees by considering \(p=1\) and
\(q=2\) but no efficient algorithm is known for this case. For \(p=2\) and
\(q=4\) the problem becomes one of computing \(\max_{\norm{x}_2=1} p(x)\)
where \(p(x)=\norm{Ax}_4^4\) and \(A\from\R^n\to\R^m\) is a generating
matrix for \(V\). It turns out that this formulation yields non-trivial
guarantees, and while it is NP hard in general, it can be efficiently
computed via SOS in several important cases.
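A small numerical sketch of the \(p=2,q=4\) formulation, on a hypothetical planted instance in which one column of \(A\) is \(1\)-sparse:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 10

# Hypothetical planted instance: the subspace V = Image(A) ⊆ R^m contains
# a sparse vector, since the first column of A is 1-sparse.
A = rng.standard_normal((m, n)) / np.sqrt(m)
A[:, 0] = 0.0
A[0, 0] = 1.0

# The degree-4 objective is p(x) = ||Ax||_4^4, maximized over ||x||_2 = 1.
# Normalizing by ||Ax||_2^4 makes the comparison scale-invariant: sparse
# vectors have a large 4-norm relative to their 2-norm.
def ratio(x):
    v = A @ x
    return np.sum(v ** 4) / np.sum(v ** 2) ** 2

e1 = np.zeros(n); e1[0] = 1.0    # direction of the planted sparse vector
dense = np.ones(n) / np.sqrt(n)  # a typical dense direction

print(ratio(e1), ratio(dense))   # the sparse direction wins by a wide margin
```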

### Sparse coding

In the **sparse coding** problem (also known as *dictionary learning*)
we are again given samples \(x^1,\ldots,x^m\) of some distribution over
\(\R^n\) and our goal is to find a basis \(A = \{ a_1,\ldots,a_n \}\) for
\(\R^n\) which maximizes the average *sparsity* of the vectors
\(\{ (\iprod{a_1,x^i},\ldots,\iprod{a_n,x^i})\}_{i=1,\ldots,m}\). Often we also consider \(A\)’s that are not bases of \(\R^n\) but
merely span it, and hence also consider the case \(|A|>n\), which is
known as the *overcomplete* case. The
intuition behind this is that the sparse representation is often the
“right” one, just as sounds tend to be sparser when represented in the
Fourier basis, images in the wavelet basis, etc. Indeed, while for a
generic basis \(A = \{ a_1,\ldots,a_n \}\) a sample \(x\) would have most
of the values \(\iprod{a_i,x}\) be of similar magnitude, in a *sparse*
representation we can interpret the \(a_i\)’s as meaningful *features*
that are turned on or off depending on this magnitude.

Indeed, Olshausen and Field (1997) suggested that sparse coding may be used as
a strategy for the visual cortex and many deep neural networks use
sparse coding as a way to generate their bottom-most layer. One way to
solve the sparse coding problem is to try to recover the elements of \(A\)
one vector at a time, by trying to find sparse vectors in the subspace
\(V=Image(X)\) of \(\R^m\), where \(X\) is the \(m\times n\) matrix whose rows
are \(x^1,\ldots,x^m\).
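This reduction from sparse coding to finding sparse vectors in \(Image(X)\) can be sketched on synthetic data; the dictionary, sparsity level, and dimensions below are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 8, 500

# Hypothetical sparse coding instance: samples x^i = A s^i for a random
# orthogonal dictionary A and sparse coefficient vectors s^i.
A = np.linalg.qr(rng.standard_normal((n, n)))[0]               # hidden basis
S = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.05)  # sparse codes
X = S @ A.T                                                    # rows are the samples x^i

# For a ∈ R^n, the vector X a = (⟨x^i, a⟩)_{i=1..m} lies in V = Image(X) ⊆ R^m.
# Taking a = A e_j (the j-th dictionary element) yields the j-th column of S,
# which is sparse; a generic direction yields a dense vector.
def ratio(v):
    return np.sum(v ** 4) / np.sum(v ** 2) ** 2  # large when v is sparse

sparse_dir = X @ A[:, 0]                     # equals S[:, 0], sparse
generic_dir = X @ (np.ones(n) / np.sqrt(n))  # a dense vector in V
print(ratio(sparse_dir), ratio(generic_dir))
```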

### Quantum information theory, quantum entanglement and QMA(2)

In quantum information theory, a system with \(N\) classical states (such
as a system of \(\log n\) bits) is modeled as an \(N\times N\) positive
semidefinite matrix \(\rho\) of trace \(1\) known as the *density matrix* of
the state. A *classical* state, which can be thought of as a
distribution \((p_1,\ldots,p_N)\) over the \(N\) different outcomes
corresponds to the special case when \(\rho\) is diagonal. A quantum
*pure* state corresponds to the case where \(\rho\) is rank one, i.e.,
\(\rho = vv^\top\) for some unit vector \(v\in\R^N\). A *mixed* state
corresponds to the more general case where the matrix is not rank one.
Note that by the spectral theorem, any mixed state \(\rho\) can be
written as \(\rho = \sum_{i=1}^N p_i v_iv_i^\top\) where the \(v_i\)’s are
unit vectors and the \(p_i\)’s are non-negative and sum up to one. Quantum states are generally *complex* vectors, but essentially all
of their interesting phenomena already arise in the real case, in
which they can be thought of as “probabilities with negative
numbers”. In this lecture we will restrict attention to the real
case, but everything we say can be generalized to the complex case by
interpreting \(v^\top\) as the complex adjoint of \(v\) and
interpreting quantities such as \(x^2\) as \(x\bar{x} = |x|^2\).
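The basic properties of density matrices described above (positive semidefiniteness, unit trace, and the decomposition into pure states) can be checked numerically. A minimal sketch with a hypothetical mixture:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 4

# A (real) density matrix: PSD with trace 1. Build one from orthonormal
# vectors v_i and a probability vector p, as in the spectral decomposition.
p = np.array([0.5, 0.3, 0.2, 0.0])
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
rho = sum(p[i] * np.outer(V[:, i], V[:, i]) for i in range(N))

assert np.isclose(np.trace(rho), 1.0)
assert np.all(np.linalg.eigvalsh(rho) >= -1e-12)  # PSD

# Recover the decomposition rho = sum_i p_i v_i v_i^T via the spectral theorem.
eigvals, eigvecs = np.linalg.eigh(rho)
print(np.sort(eigvals)[::-1])  # ≈ [0.5, 0.3, 0.2, 0.0]
```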

If we have two systems each of \(M\) states, then we can think of the
combined system as a system of \(N=M^2\) states (e.g., two one-bit systems
are the same as a single two-bit system). One of the most mysterious
phenomena of quantum states is that they can create *entanglement*
between the different subsystems. A pure state \(v\in\R^N\) is *separable*
(i.e., non-entangled) if \(v=ww^\top\) for some \(w\in\R^M\). Actually, this is the definition for a state that is both
*symmetric* and *separable*: a general separable state would have the form
\(ww'^\top\) for some \(w,w'\). In other words, a separable state on an
\(M^2\)-state system can be interpreted as a rank one \(M\times M\)
matrix, where we make the additional requirement that this matrix is
symmetric (or Hermitian in the complex case). We restrict to the
symmetric case for notational convenience and easier relation to
other problems, though it does not make much of a difference. In other
words, the density matrix has the form \(w^{\otimes 4}\). A mixed state
\(\rho\) is separable if it is a convex combination of pure separable
states, that is, \(\rho = \sum_i p_i w_i^{\otimes 4}\) for non-negative
\(p_i\)’s that sum up to one.

One of the ways that the complexity of entanglement is manifested is
that it is NP hard to determine if a state is separable or not. But one
can ask how hard is it to *approximate*. A *quantum measurement* on an
\(N\)-state system is a p.s.d. \(N\times N\) matrix \(\cM\) such that
\(\cM \preceq I\). The probability that \(\cM\) accepts a state \(\rho\) is
simply \(\iprod{\cM,\rho}=\Tr(\cM\rho)\). (Can you see why this number is
always between zero and one?) Two states \(\rho,\rho'\) are identical if
and only if they are accepted with the same probability for all possible
measurements (**exercise**). We say that a state \(\rho\) is
\(\epsilon\)-separable if there is some separable \(\rho'\) such that
\(|\Tr(\cM\rho)-\Tr(\cM\rho')|\leq \epsilon\) for all measurements \(\cM\).
The *quantum separability problem* with parameter \(\epsilon\) is to
distinguish, given a density matrix \(\rho\), between the YES case that
\(\rho\) is separable and the NO case that \(\rho\) is not
\(\epsilon\)-separable. Assuming the exponential time hypothesis, the quantum
separability problem requires at least \(N^{\Omega(\log N)}\) time for
every constant \(\epsilon>0\) (Harrow and Montanaro 2010).
Doherty, Parrilo, and Spedalieri (2004) proposed using sos for this problem, but we still
do not know the degree required for this problem in the worst-case.
Brandão, Christandl, and Yard (2011) showed that it does work in \(O(\log N)\)
degree if we consider a relaxed notion of \(\epsilon\)-separability that
only restricts attention to particular types of measurements (known as
local operations and one-way communication or one-way LOCC).
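As a sanity check on the definitions above, the following sketch verifies numerically that the acceptance probability \(\Tr(\cM\rho)\) always lies in \([0,1]\); the sampling scheme for random measurements and states is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 6

def random_density(N):
    """Random density matrix: PSD with trace 1."""
    B = rng.standard_normal((N, N))
    rho = B @ B.T
    return rho / np.trace(rho)

def random_measurement(N):
    """Random measurement: PSD matrix M with M ⪯ I (eigenvalues in [0,1])."""
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
    return Q @ np.diag(rng.uniform(0, 1, N)) @ Q.T

# The acceptance probability Tr(M rho) is always in [0, 1]:
# 0 ≤ Tr(M rho) since both matrices are PSD, and Tr(M rho) ≤ Tr(rho) = 1
# since M ⪯ I.
for _ in range(100):
    p_accept = np.trace(random_measurement(N) @ random_density(N))
    assert 0.0 <= p_accept <= 1.0
print("all acceptance probabilities in [0, 1]")
```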

A related problem is the *Best Separable State (BSS)* problem, where one
is given a measurement \(\cM\) and parameters \(1>c>s>0\) as input and needs
to distinguish between the case that there is a separable state \(\rho\)
with \(\Tr(\cM\rho) \geq c\) and the case that every separable state
\(\rho\) satisfies \(\Tr(\cM\rho) \leq s\). (A natural setting of parameters
is \(c=1\) and \(s=1-\epsilon\).) This problem is also known as finding the
acceptance probability of a *quantum Arthur-Merlin verifier with two
provers*, since we can think of \(\cM\) as a verifying algorithm that
receives two quantum states from two provers that are guaranteed to be
non-entangled. A quasipolynomial time algorithm for this problem would
correspond to placing \(QMA(2)\) in \(EXP\), while the best known upper
bound is \(QMA(2)\in EE = Dtime(2^{O(N)})\) (in this problem \(N\)
corresponds to \(2^n\) where \(n\) is the number of qubits in the proofs).
The problem of determining the complexity of \(QMA(2)\) is of intense
interest to quantum information theorists, and recently the University of
Maryland’s QuICS center held a weeklong
workshop dedicated solely to this
problem.

It is not hard to prove that there is always a *pure* state maximizing
the acceptance probability of any measurement \(\cM\) and hence the BSS
problem corresponds to finding the maximum of
\(p(x) = \Tr(\cM (\dyad x)^{\otimes 2})\) over all unit vectors \(x\). Once again, recall that we restrict attention to the real
symmetric case. In the literature this problem is typically written
as maximizing \(\Tr(\cM (x\otimes y)(x\otimes y)^*)\) over all unit
complex \(x,y\), or in quantum notation, maximizing
\({\langle x,y\vert }\cM{\vert x,y \rangle }\).
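The objective \(p(x) = \Tr(\cM (\dyad x)^{\otimes 2})\) can be evaluated directly. The following sketch uses a hypothetical measurement that perfectly accepts a planted separable state \(w^{\otimes 2}\), so the BSS value at \(c=1\) is achieved:

```python
import numpy as np

rng = np.random.default_rng(6)
M_dim = 3            # single-system dimension
N = M_dim ** 2       # combined-system dimension

# Hypothetical measurement: the rank-one projector onto the separable
# pure state w^{⊗2}. It is PSD and ⪯ I.
w = np.zeros(M_dim); w[0] = 1.0
u = np.kron(w, w)       # w^{⊗2} as a vector in R^N
Meas = np.outer(u, u)

def p(x):
    """p(x) = Tr(Meas · (xx^T)^{⊗2}) for a unit vector x in R^{M_dim}."""
    xx = np.kron(x, x)  # x^{⊗2}; note (xx^T)^{⊗2} = (x⊗x)(x⊗x)^T
    return xx @ Meas @ xx

print(p(w))  # the planted w achieves acceptance probability 1

# A random unit vector typically does much worse:
x = rng.standard_normal(M_dim); x /= np.linalg.norm(x)
print(p(x))
```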

## What does this have to do with sos?

It turns out that all these problems share similar characteristics:

- They are NP hard to solve exactly in the worst-case. Indeed, that’s true for almost all problems involving tensors (Hillar and Lim 2013).
- The algorithm with the best known rigorous guarantees for these problems is the sos hierarchy.
- Often we can *prove* that the sos algorithm provides non-trivial guarantees in the worst-case and/or average-case setting that go beyond what other algorithms can achieve.
- We typically do not have tight bounds on the performance of the sos algorithm for these problems, in either the worst-case or the average-case setting.
- Some of these problems have natural heuristics that people apply in practice. More often than not, we do not know how to analyze the performance of these heuristics.

Beyond these superficial similarities, it turns out that there are some
deep and fascinating technical connections between these problems as
well as other important questions in theoretical CS such as the *Unique
Games Conjecture* of Khot (2002) and the *Log Rank Conjecture*
of Lovász and Saks (1988). We will see some of these results and
connections in the next lectures.

# References

Brandão, Fernando G. S. L., Matthias Christandl, and Jon Yard. 2011. “A Quasipolynomial-Time Algorithm for the Quantum Separability Problem.” In *Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011*, 343–52. doi:10.1145/1993636.1993683.

Candès, Emmanuel J., and Terence Tao. 2005. “Decoding by Linear Programming.” *IEEE Trans. Information Theory* 51 (12): 4203–15.

Demanet, Laurent, and Paul Hand. 2014. “Scaling Law for Recovering the Sparsest Element in a Subspace.” *Information and Inference*. Oxford University Press, iau007.

Doherty, Andrew C, Pablo A Parrilo, and Federico M Spedalieri. 2004. “Complete Family of Separability Criteria.” *Physical Review A* 69 (2). APS: 022308.

Harrow, Aram Wettroth, and Ashley Montanaro. 2010. “An Efficient Test for Product States with Applications to Quantum Merlin-Arthur Games.” In *FOCS*, 633–42. IEEE Computer Society.

Hillar, Christopher J., and Lek-Heng Lim. 2013. “Most Tensor Problems Are NP-Hard.” *J. ACM* 60 (6): Art. 45, 39. doi:10.1145/2512329.

Khot, Subhash. 2002. “On the Power of Unique 2-Prover 1-Round Games.” In *IEEE Conference on Computational Complexity*, 25. IEEE Computer Society.

Kolda, Tamara G, and Brett W Bader. 2009. “Tensor Decompositions and Applications.” *SIAM Review* 51 (3). SIAM: 455–500.

Lovász, László, and Michael E. Saks. 1988. “Lattices, Möbius Functions and Communication Complexity.” In *FOCS*, 81–90. IEEE Computer Society.

Olshausen, Bruno A, and David J Field. 1997. “Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?” *Vision Research* 37 (23). Elsevier: 3311–25.

Richard, Emile, and Andrea Montanari. 2014. “A Statistical Model for Tensor PCA.” In *NIPS*, 2897–2905.