Sum-of-squares: proofs, beliefs, and algorithms — Boaz Barak and David Steurer



Introduction

The terms “Algebra” and “Algorithm” both originate from the same person. Muhammad ibn Musa al-Khwarizmi was a 9th-century Persian mathematician, astronomer and geographer. The Latin translation of his books introduced the Hindu-Arabic decimal system to the western world. His book “The Compendious Book on Calculation by Completion and Balancing” also presented the first general solution of quadratic equations via the technique of “completing the square”. More than that, this book introduced the notion of solving general, as opposed to specific, equations by a sequence of manipulations such as subtracting or adding equal amounts. Al-Khwarizmi called the latter operation al-jabr (“restoration” or “completion”), and this term gave rise to the word Algebra. The word Algorithm is derived from the Latin form of al-Khwarizmi’s name. (See “The Equation That Couldn’t Be Solved” by Mario Livio for much of this history.)

Muhammad ibn Musa al-Khwarizmi (from a 1983 Soviet Union stamp commemorating his 1200th birthday).

However, the solution of equations of degree larger than two took much longer. Over the years, a great many ingenious people devoted significant effort to solving special cases of such equations. In the 14th century, the Italian mathematician Maestro Dardi of Pisa gave a classification of 198 types of cubic (i.e., degree \(3\)) and quartic (i.e., degree \(4\)) equations, but could not find a general solution covering all of them. Indeed, in the 16th century, Italian mathematicians would often hold “Mathematical Duels” in which opposing mathematicians would present to each other equations to solve. These public competitions attracted many spectators, were the subject of bets, and winning such duels was often a condition for obtaining appointments or tenure at universities. It is in the context of these competitions, and through a story of intrigue, controversy and broken vows, that the general formula for cubic and quartic equations was finally discovered, and later published in Cardano’s 1545 book.

However, the solution of quintic equations took another 250 years. Many great mathematicians including Descartes, Leibniz, Lagrange, Euler, and Gauss worked on the problem of solving equations of degree five and higher, finding solutions for special cases but without discovering a general formula. It took until the turn of the 19th century and the works of Ruffini, Galois and Abel to discover that in fact no such general formula exists for solving equations of degree 5 or higher via combinations of the basic arithmetic operations and taking roots. More than that, these works gave rise to a precise characterization of which equations are solvable in this way, and thus led to the birth of group theory.

Today, solving an equation such as \(x^{17}=1\) (which amounts to constructing a regular 17-gon using a compass and straightedge, one of the achievements Gauss was most proud of) can be done in a few lines of routine calculation. Indeed, this is a story that repeats itself often in science: we move from special cases to a general theory, and in the process transform what once required creative genius into mere calculation. Thus, often the sign of scientific success is that we eliminate the need for creativity and make boring what was once exciting.

Let us fast-forward to the present day, where the design of algorithms is another exciting field that requires a significant amount of creativity. The Algorithms textbook of Cormen et al. has 35 chapters, 156 sections, and 1312 pages, dwarfing even Dardi’s tome on the 198 types of cubic and quartic equations. The crux seems to be the nature of efficient computation. While there are some exceptions, typically when we ask whether a problem can be solved at all the answer is much simpler, and does not seem to require such a plethora of techniques as when we ask whether the problem can be solved efficiently. Is this state of affairs inherent, or is it just a matter of time until algorithm design becomes as boring as solving a single polynomial equation?

We will not answer this question in this course. However, it does motivate some of the questions we ask and the investigations we pursue. In particular, it motivates the study of general algorithmic frameworks as opposed to tailor-made algorithms for particular problems. (There is also a practical motivation: real-world problems often have their own kinks and features, and will rarely match up exactly to one of the problems in the textbook. A general algorithmic framework can be applied to a wider range of problems, even ones that have not been studied before.) There are several such general frameworks, but we will focus on one example that arises from convex programming: the Sum of Squares (SOS) Semidefinite Programming Hierarchy. It has the advantage that on the one hand it is general enough to capture many algorithmic techniques, and on the other hand it is specific enough that (if we are careful) we can avoid the “curse of completeness”. That is, we can actually prove impossibility results or lower bounds for this framework without inadvertently resolving questions such as \(P\) vs \(NP\). The hope is that we can understand this framework well enough to classify which problems it can and cannot solve. Moreover, as we will see, through such study we end up investigating issues that are of independent interest, including mathematical questions on geometry, analysis, and probability, as well as questions about modeling the beliefs and knowledge of computationally bounded observers.

The Sum of Squares Algorithm

Let us now take a step back from the pompous rhetoric and slowly start getting around to the mathematical contents of this course. It will mostly be focused on the Sum of Squares (SOS) semidefinite programming hierarchy. In a sign that perhaps we have not advanced so much since the Middle Ages, the SOS algorithm is also a method for solving polynomial equations, albeit systems of several equations in several variables. However, it turns out that this is a fairly general formalism. Not only is solving such equations, even in degree two, NP-hard, but in fact one can often reduce directly to this task from problems of interest in a fairly straightforward manner. In particular, given a 3SAT formula of a form such as \((\overline{x_7} \vee x_{12} \vee x_{29}) \wedge (x_{5} \vee x_7 \vee \overline{x_{32}}) \wedge \cdots\), we can easily translate the question of whether it has a satisfying assignment \(x\in\{0,1\}^n\) (where \(n\) is the number of variables) into the question of whether the equations \(x_1^2 - x_1 = 0,\ldots, x_n^2 - x_n = 0, P(x)-m = 0\) can be solved, where \(m\) is the number of clauses and \(P(x)\) is the degree \(3\) polynomial obtained by summing, for every clause \(j\), the polynomial \(C_j\) such that \(C_j(x)\) equals \(1\) if \(x\) satisfies the \(j^{th}\) clause and \(C_j(x)=0\) otherwise.
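To make this translation concrete, here is a minimal sketch in Python (using sympy; the clause encoding and the function name are our own illustrative choices, not part of the text):

```python
# A minimal sketch of the 3SAT-to-polynomial-equations translation described above.
# Clauses are given as triples of (variable index, is-positive) pairs.
import sympy as sp

def sat_to_polynomial_system(num_vars, clauses):
    x = sp.symbols(f"x1:{num_vars + 1}")           # the variables x1, ..., xn
    eqs = [xi**2 - xi for xi in x]                 # x_i^2 - x_i = 0 forces x_i in {0,1}
    P = 0
    for clause in clauses:
        unsat = 1
        for idx, positive in clause:
            lit = x[idx - 1] if positive else 1 - x[idx - 1]
            unsat *= (1 - lit)                     # product is 1 iff every literal is false
        P += 1 - unsat                             # C_j: equals 1 iff the clause is satisfied
    eqs.append(sp.expand(P - len(clauses)))        # P(x) - m = 0 iff all m clauses hold
    return eqs

# The first two clauses of the example formula above.
system = sat_to_polynomial_system(32, [((7, False), (12, True), (29, True)),
                                       ((5, True), (7, True), (32, False))])
print(system[-1])
```

A satisfying assignment of the formula corresponds exactly to a common real root of these polynomials, which is the setting the SOS algorithm addresses.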

We will be interested in solving such equations over the real numbers, and typically in settings where (a) the polynomials in question are of low degree, and (b) obtaining an approximate solution is essentially as good as obtaining an exact solution, which helps avoid at least some (if not all) issues of precision and numerical accuracy. Nevertheless, this is still a very challenging setting. In particular, whenever there is more than one equation, or the degree is higher than two, the task of solving polynomial equations becomes non-convex, and generally speaking, there can be exponentially many local minima of the “energy function” obtained by summing up the squared violations of the equations. This is problematic since many of the tools we use to solve such equations involve some form of local search, maintaining at each iteration a current solution and looking for directions of improvement. Such methods can and will get “stuck” at such local minima.

When faced with a non-convex problem, one approach that is used in both practice and theory is to enlarge the search space.
Geometrically, we hope that by adding additional dimensions, one may find new ways to escape local minima. Algebraically, this often amounts to adding additional variables, with a standard example being the linearization technique where we reduce, say, quadratic equations in \(n\) variables to linear equations in \(n^2\) variables by letting \(y_{i,j}\) correspond to \(x_ix_j\), as sketched below. If the original system was sufficiently overdetermined, one could hope that we can still solve for \(y\).
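As a small illustrative sketch of this linearization step (the notation and helper below are ours), a quadratic equation \(x^\top A x + b^\top x = c\) in \(n\) variables becomes a single linear equation in the \(n^2 + n\) unknowns consisting of the entries of \(Y\) and of \(x\):

```python
# Linearization sketch: the quadratic equation x^T A x + b^T x = c becomes the
# linear equation <A, Y> + <b, x> = c over the enlarged variable set (Y, x),
# where Y is intended to stand for the matrix of products (x_i x_j).
import numpy as np

def linearize(A, b, c):
    """Return (coefficients, rhs) of a linear equation over the stacked vector [vec(Y); x]."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.concatenate([A.flatten(), b]), float(c)

# Example: the equation x1*x2 + x1 = 3 in n = 2 variables.
A = np.array([[0.0, 0.5], [0.5, 0.0]])             # symmetric matrix representing x1*x2
b = np.array([1.0, 0.0])
coeffs, rhs = linearize(A, b, 3.0)

# Sanity check: substituting Y = x x^T recovers the original quadratic equation.
x = np.array([1.0, 2.0])
Y = np.outer(x, x)
assert np.isclose(coeffs @ np.concatenate([Y.flatten(), x]), x @ A @ x + b @ x)
```

Of course, the linear system alone “forgets” that \(Y\) came from an actual \(x\); the SOS constraint described next is one way to partially reimpose this structure.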

The SOS algorithm is a systematic way of enlarging the search space by adding variables in just such a manner. In the example above, it adds the additional constraint that the matrix \(Y=(y_{i,j})\) be positive semidefinite, that is, that it satisfies \(w^\top Y w \geq 0\) for every column vector \(w\). (Note that if \(Y\) were in fact of the form \(Y_{i,j}=x_ix_j\) then \(w^\top Y w\) would equal \(\langle w,x\rangle^2 \geq 0\).) More generally, the SOS algorithm is parameterized by a number \(\ell\), known as its degree, and for every set of polynomial equations on \(n\) variables it yields a semidefinite program on roughly \(n^\ell\) variables that becomes a tighter and tighter approximation of the original equations as \(\ell\) grows. (A linear program is the task of solving a set of linear inequalities, i.e., finding \(x\) that satisfies constraints of the form \(\sum a_i x_i \leq b\). The set of \(x\)’s satisfying some linear inequalities is known as a polyhedron and is convex. A semidefinite program is obtained by adding to a linear program a constraint of the form \(M(x) \succeq 0\), where \(M\) is a symmetric matrix whose every entry is a linear function of \(x\), and \(M \succeq 0\) denotes that \(M\) is positive semidefinite, i.e., \(w^\top M w \geq 0\) for all vectors \(w\). Geometrically, the intersection of a polyhedron with such a constraint is known as a spectrahedron.) As the underlying problem is NP-hard, we do not expect this algorithm to solve polynomial equations efficiently (i.e., with small degree \(\ell\)) in the most general case, but understanding in which cases it does so is the focus of much research effort and the topic of this course.
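To see what the simplest (degree-2) relaxation looks like as a concrete semidefinite program, here is a hedged sketch assuming the cvxpy package is available; the toy problem and variable names are ours, and it is only meant to illustrate the PSD constraint on the matrix of products, not to serve as a full SOS solver:

```python
# Degree-2 relaxation sketch for the toy problem
#     maximize x1*x2  subject to  x1^2 = 1, x2^2 = 1.
# M plays the role of the matrix of products of the monomials (1, x1, x2):
# M[0,0] ~ 1, M[0,i] ~ x_i, and M[i,j] ~ x_i*x_j, with the PSD constraint M >> 0.
import cvxpy as cp

M = cp.Variable((3, 3), symmetric=True)
constraints = [
    M >> 0,          # the positive-semidefiniteness constraint discussed above
    M[0, 0] == 1,    # entry corresponding to the constant monomial 1
    M[1, 1] == 1,    # linearized version of x1^2 = 1
    M[2, 2] == 1,    # linearized version of x2^2 = 1
]
prob = cp.Problem(cp.Maximize(M[1, 2]), constraints)  # M[1,2] stands in for x1*x2
prob.solve()
print(prob.value)    # the relaxation value upper-bounds the true optimum (here both are 1)
```

For this tiny example the relaxation happens to be exact, but in general the value of the semidefinite program only bounds the true optimum, which is where increasing the degree parameter \(\ell\) comes in.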

History

The SOS algorithm has its roots in questions raised in the late 19th century by Minkowski and Hilbert of whether every non-negative polynomial can be represented as a sum of squares of other polynomials. Hilbert realized that, except for some special cases (most notably univariate polynomials and quadratic polynomials), the answer is negative and that there are examples, which he showed to exist by non-constructive means, of non-negative polynomials that cannot be represented in this way. It was only in the 1960’s that Motzkin gave a concrete example of such a polynomial, namely \(1+ x^4y^2 + x^2y^4 - 3x^2y^2.\) By the arithmetic-mean geometric-mean inequality, \(\tfrac{1+x^4y^2+x^2y^4}{3} \geq (1\cdot x^4y^2 \cdot x^2y^4)^{1/3} = x^2y^2\) and hence this polynomial is always non-negative. However, it is not hard, though a bit tedious, to show that it cannot be expressed as a sum of squares.
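As a quick numerical sanity check of this claim (not a proof, and using sympy only for convenience):

```python
# Evaluate Motzkin's polynomial: it vanishes at x = y = 1 and is non-negative
# on a small grid of sample points, consistent with the AM-GM argument above.
import sympy as sp

x, y = sp.symbols("x y", real=True)
motzkin = 1 + x**4 * y**2 + x**2 * y**4 - 3 * x**2 * y**2

assert motzkin.subs({x: 1, y: 1}) == 0             # the minimum value 0 is attained
samples = [motzkin.subs({x: a, y: b}) for a in range(-3, 4) for b in range(-3, 4)]
assert all(v >= 0 for v in samples)                # non-negative on the sampled grid
```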

In his famous 1900 address, Hilbert asked as his 17th problem whether every non-negative polynomial can be represented as a sum of squares of rational functions. (For example, Motzkin’s polynomial above can be shown to be the sum of squares of four rational functions of denominator and numerator degree at most \(6\).) This was answered positively by Artin in 1927. His approach can be summarized as follows: given a hypothetical non-negative polynomial \(P\) that cannot be represented in this form, use the fact that the rational functions form a field to extend the reals to a “pseudo-real” field \(\tilde{\mathbb{R}}\) in which there would actually be an element \(\tilde{x} \in \tilde{\mathbb{R}}\) such that \(P(\tilde{x})<0\), and then use a “transfer principle” to show that there is an actual real \(x\in\mathbb{R}\) such that \(P(x)<0\). (This description is not meant to be understandable but to make you curious enough to look it up…) Later, in the 60’s and 70’s, Krivine and Stengle extended this result to show that any unsatisfiable system of polynomial equations can be certified to be unsatisfiable via a Sum of Squares (SOS) proof (i.e., by showing that it implies an equation of the form \(\sum_{i=1}^r p_i^2 = -1\) for some polynomials \(p_1,\ldots, p_r\)). This result is known as the Positivstellensatz.
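To make the form of such a certificate concrete, here is a toy example (ours, not from the text). The single equation \(x^2 + 1 = 0\) has no real solution, and the corresponding SOS refutation is immediate: the equation directly implies \(x^2 = -1\), which has the forbidden form \(\sum_{i=1}^r p_i^2 = -1\) with \(r=1\) and \(p_1 = x\). Since a sum of squares is non-negative at every real point, no real solution can exist. The Positivstellensatz guarantees that every unsatisfiable system admits a certificate of this shape, though possibly one of very high degree.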

In the late 1990’s / early 2000’s, there were two separate efforts to obtain quantitative / algorithmic versions of this result. On one hand, Grigoriev and Vorobjov (2001) asked the question of how large the degree of an SOS proof needs to be, and in particular Grigoriev (2001) proved several lower bounds on this degree for some interesting polynomials. On the other hand, Parrilo (2000) and Lasserre (2001) independently came up with hierarchies of algorithms for polynomial optimization based on the Positivstellensatz using semidefinite programming. (A less general version of this algorithm was also described by Shor (1987) in a Russian paper, which was cited by Nesterov in 1999.)

It turns out that the SOS algorithm generalizes and encapsulates many other convex-programming-based algorithmic hierarchies, such as those proposed by Lovász and Schrijver (1991) and Sherali and Adams (1990), as well as other more specific algorithmic techniques such as linear programming and spectral techniques. As mentioned above, the SOS algorithm seems to achieve a “goldilocks” balance of being strong enough to capture interesting techniques but weak enough that we can actually prove lower bounds for it. One of the goals of this course (and line of research) is to also understand which algorithmic techniques cannot be captured by SOS, particularly in the settings (e.g., noisy low-degree polynomial optimization) for which it seems most appropriate.

Applications of SOS

SOS has applications to: equilibrium analysis of dynamics and control (robotics, flight controls, …), robust and stochastic optimization, statistics and machine learning, continuous games, software verification, filter design, quantum computation and information, automated theorem proving, packing problems, etc. (For two very different examples, see the following figures.)

SOS was used to analyze the “falling leaf” mode of the U.S. Navy F/A-18 “Hornet”, see A. Chakraborty, P. Seiler, and G. J. Balas, Journal of guidance, control, and dynamics, 34(1):73–85, 2011. (Image credit: Wikipedia)
Bachoc and Vallentin used sum-of-squares to give the best known upper bounds for sphere kissing numbers in higher dimensions. See “New upper bounds for kissing numbers from semidefinite programming”, C Bachoc, F Vallentin, Journal of the American Mathematical Society 21 (3), 909-924. (Image credit: A. Traffas)

The TCS vs Mathematical Programming view of SOS

The SOS algorithm is intensively studied in several fields, but different communities emphasize different aspects of it. The main characteristics of the Theoretical Computer Science (TCS) viewpoint, as opposed to that of other communities, are:

  • In the TCS world, we typically think of the number of variables \(n\) as large and tending to infinity (as it corresponds to our input size), and the degree \(\ell\) of the SOS algorithm as being relatively small: a constant or logarithmic. In contrast, in the optimization and control world, the number of variables can often be very small (e.g., around ten or so, maybe even smaller) and hence \(\ell\) may be large compared to it. (Since both the time and space complexity of the general SOS algorithm scale roughly like \(n^\ell\), even \(\ell=6\) and \(n=100\) would take something like a petabyte of memory. This may justify the optimization/control view of keeping \(n\) small, although if we show that SOS yields a polynomial-time algorithm for a particular problem, we can hope to optimize further and obtain an algorithm that doesn’t require a full-fledged SOS solver. As we will see in this course, this hope has actually materialized in some settings.)
  • Typically in TCS our inputs are discrete and the polynomials are simple, with integer coefficients and constraints such as \(x_i^2 = x_i\) that restrict attention to the Boolean cube. Thus we are less concerned with issues of numerical accuracy, boundedness, etc.
  • Traditionally, people have been concerned with exact convergence of the SOS algorithm: when does it yield an exact solution to the optimization problem? This often precludes \(\ell\) from being much smaller than \(n\). In contrast, as TCS’ers we often want to understand approximate convergence: when does the algorithm yield an “approximate” solution (in some problem-dependent sense)? Since the output of the algorithm in this case is not actually in the form of a solution to the equations, this raises the question of obtaining rounding algorithms, which are procedures to translate the output of the algorithm into an approximate solution.

SOS as a “cockroach”

In theoretical computer science we typically define a computational problem \(P\) and then try to find the best (e.g., most time-efficient or best approximation factor) algorithm \(A\) for this problem. One can ask what is the point in restricting attention to a particular algorithmic framework such as SOS, as opposed to simply trying to find the best algorithm for the problem at hand. One answer is that we could hope that if a problem is solved via a general framework, then that solution would generalize better to different variants and cases (e.g., considering average-case variants of a worst-case problem, or measuring “goodness” of the solution in different ways). This is a general phenomenon that occurs time and again in many fields, known under many names including the “bias-variance trade-off”, the “stability-plasticity dilemma”, the “performance-robustness trade-off”, and many others. That is, there is an inherent tension between optimally solving a particular question (or optimally adapting to a particular environment) and being robust to changes in the question/environment (e.g., avoiding “over-fitting”). For example, consider the following two species that roamed the earth a few hundred million years ago during the Mesozoic era. The dinosaurs were highly complex animals that were well adapted to their environment. In contrast, cockroaches have extremely simple reflexes, operating only on very general heuristics such as “run if you feel a brush of air”. As one can tell by the scarcity of “dinosaur spray” in stores today, it was the latter species that was more robust to changes in the environment. With that being said, we do hope that the SOS algorithm is at least approximately optimal in several interesting settings.

References

Grigoriev, Dima. 2001. “Linear Lower Bound on Degrees of Positivstellensatz Calculus Proofs for the Parity.” Theor. Comput. Sci. 259 (1-2): 613–22.

Grigoriev, Dima, and Nicolai Vorobjov. 2001. “Complexity of Null-and Positivstellensatz Proofs.” Ann. Pure Appl. Logic 113 (1-3): 153–60.

Lasserre, Jean B. 2001. “Global Optimization with Polynomials and the Problem of Moments.” SIAM J. Optim. 11 (3): 796–817. doi:10.1137/S1052623400366802.

Lovász, László, and Alexander Schrijver. 1991. “Cones of Matrices and Set-Functions and 0-1 Optimization.” SIAM Journal on Optimization 1 (2): 166–90.

Parrilo, Pablo A. 2000. “Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization.” PhD thesis, California Institute of Technology.

Sherali, Hanif D., and Warren P. Adams. 1990. “A Hierarchy of Relaxations Between the Continuous and Convex Hull Representations for Zero-One Programming Problems.” SIAM J. Discrete Math. 3 (3): 411–30. doi:10.1137/0403036.

Shor, N. Z. 1987. “An Approach to Obtaining Global Extrema in Polynomial Problems of Mathematical Programming.” Kibernetika (Kiev), no. 5: 102–6, 136.