Conditional Expression Grammars. We introduce conditional expression grammars that separate an expression grammar into two grammars that generate: (a) the return value expressions, and (b) the conditionals that decide which return value is chosen. The generated return values (terms) and conditionals (predicates) are combined using if-then-else conditional operators.
A conditional expression grammar is a pair of grammars \(\langle G_T, G_P \rangle \) where: (a) the term grammar \(G_T\) is an expression grammar generating terms of type \(\mathsf {range}(f)\); and (b) the predicate grammar \(G_P\) is an expression grammar generating boolean terms. The set of expressions \([\![ \langle G_T, G_P \rangle ]\!]\) generated by \(\langle G_T, G_P \rangle \) is the smallest subset of \(\mathcal {T}[\mathsf {params}]\) such that: (a) \([\![ G_T ]\!] \subseteq [\![ \langle G_T, G_P \rangle ]\!]\), and (b) \(e_1, e_2 \in [\![ \langle G_T, G_P \rangle ]\!] \wedge p\in [\![ G_P ]\!] \implies \mathtt {if}~p~\mathtt {then}~e_1~\mathtt {else}~e_2 \in [\![ \langle G_T, G_P \rangle ]\!]\). Most SyGuS grammars that occur in practice can be rewritten automatically as conditional expression grammars.
Example 7
The grammar from Example 3 is easily decomposed into a conditional expression grammar \(\langle G_T, G_P \rangle \) where: (a) the term grammar \(G_T\) contains only the non-terminal T and the rules for rewriting T; and (b) the predicate grammar \(G_P\) contains the two non-terminals \(\{T, C\}\) and the associated rules.
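To make the decomposition concrete, the following is a minimal Python sketch (not the paper's implementation) of a conditional expression grammar \(\langle G_T, G_P \rangle \). The productions shown are hypothetical, in the spirit of the running LIA example: \(G_T\) generates integer-valued terms, \(G_P\) generates boolean predicates over such terms.

```python
from itertools import product

# Hypothetical productions: G_T generates integer terms, G_P boolean predicates.
G_T = {"T": [("x",), ("y",), ("0",), ("1",), ("+", "T", "T")]}
G_P = {"C": [("<=", "T", "T"), ("and", "C", "C"), ("not", "C")], "T": G_T["T"]}

def expand(grammar, symbol, depth):
    """All expressions derivable from `symbol` using at most `depth` rule applications."""
    if depth == 0:
        return set()
    out = set()
    for head, *args in grammar[symbol]:
        if not args:
            out.add(head)                                  # terminal production
        else:                                              # expand each argument recursively
            for combo in product(*(expand(grammar, a, depth - 1) for a in args)):
                out.add((head, *combo))
    return out

# [[<G_T, G_P>]] restricted to one level of conditionals: every term is an
# expression, and so is (ite p e1 e2) for a predicate p and expressions e1, e2.
terms = expand(G_T, "T", 2)
preds = expand(G_P, "C", 2)
exprs = terms | {("ite", p, e1, e2) for p in preds for e1 in terms for e2 in terms}
```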
Decision Trees. We use the concept of decision trees from the machine learning literature to model conditional expressions. Informally, a decision tree \( DT \) maps samples to labels. Each internal node in a decision tree contains an attribute, which may either hold or not hold for each sample, and each leaf node contains a label. In our setting, labels are terms, attributes are predicates, and samples are points.
To compute the label for a given point, we follow a path from the root of the decision tree to a leaf, taking the left (resp. right) child at each internal node if the attribute holds (resp. does not hold) for the sample. The required label is the label at the leaf. We do not formally define decision trees, but instead refer the reader to a standard textbook (see, for example, [4]).
Example 8
Figure 2 contains a decision tree in our setting, i.e., with attributes being predicates and labels being terms. To compute the label associated with the point \(\mathsf {pt}\equiv \{ x \mapsto 2, y \mapsto 0 \}\): (a) we examine the predicate at the root node, i.e., \(y \le 0\), and follow the left child as the predicate holds for \(\mathsf {pt}\); (b) we examine the predicate at the left child of the root node, i.e., \(x \le y\), and follow the right child as it does not hold; and (c) we return the label of the leaf, \(x + y\).
The expression \(\mathsf {expr}( DT )\) corresponding to a decision tree \( DT \) is defined as: (a) the label of the root node if the tree is a single leaf node; and (b) \(\mathtt {if}~p~\mathtt {then}~\mathsf {expr}( DT _L)~\mathtt {else}~\mathsf {expr}( DT _R)\) otherwise, where \(p\) is the attribute of the root node, and \( DT _L\) and \( DT _R\) are the left and right sub-trees.
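The following sketch illustrates both the path-following computation of Example 8 and the \(\mathsf {expr}( DT )\) translation. The Leaf/Node classes and the helper holds are illustrative data structures, not the paper's; the two leaves that Example 8 does not spell out are placeholder terms.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str              # a term, e.g. "x + y"

@dataclass
class Node:
    attribute: str           # a predicate, e.g. "y <= 0"
    left: "DecisionTree"     # child taken when the attribute holds
    right: "DecisionTree"    # child taken when it does not hold

DecisionTree = Union[Leaf, Node]

def follow(dt: DecisionTree, pt, holds) -> Leaf:
    """Path-following of Example 8: descend left/right according to the attributes."""
    while isinstance(dt, Node):
        dt = dt.left if holds(dt.attribute, pt) else dt.right
    return dt

def expr(dt: DecisionTree) -> str:
    """expr(DT): a leaf is its label; an internal node becomes an if-then-else."""
    if isinstance(dt, Leaf):
        return dt.label
    return f"if {dt.attribute} then ({expr(dt.left)}) else ({expr(dt.right)})"

# The fragment of Fig. 2 described in Example 8; "t1" and "t2" stand for the
# leaves that Example 8 does not mention.
dt = Node("y <= 0", Node("x <= y", Leaf("t1"), Leaf("x + y")), Leaf("t2"))
print(expr(dt))  # -> if y <= 0 then (if x <= y then (t1) else (x + y)) else (t2)
```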
Decision tree learning is a technique that learns a decision tree from a given set of samples. A decision tree learning procedure is given: (a) a set of samples (points), (b) a set of labels (terms), along with a function that maps a label to the subset of samples which it covers; and (c) a set of attributes (predicates). A sound decision tree learning algorithm returns a decision tree \( DT \) that classifies the points correctly, i.e., for every sample \(\mathsf {pt}\), the label associated with it by the decision tree covers the point. We use the notation LearnDT to denote a generic, sound decision tree learning procedure. The exact procedure we use for decision tree learning is presented in Sect. 4.2.
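As a small sketch of the soundness requirement on LearnDT (reusing the Leaf/Node/follow helpers from the sketch above, with a hypothetical predicate evaluator holds): a learned tree is sound if every point reaches a leaf whose term covers that point.

```python
def is_sound(dt, pts, cover, holds):
    """Check that the tree labels every point with a term whose cover contains it."""
    return all(pt in cover[follow(dt, pt, holds).label] for pt in pts)
```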
4.1 Algorithm
Algorithm 2 presents the full divide-and-conquer enumeration algorithm for synthesis. Like Algorithm 1, the divide-and-conquer algorithm maintains a set of points \(\mathsf {pts}\), and in each iteration: (a) computes a candidate solution expression \(e\) (lines 3–10); (b) verifies and returns \(e\) if it is correct (lines 10 and 11); and (c) otherwise, adds the counter-example point into the set \(\mathsf {pts}\) (line 12).
However, the key difference between Algorithm 2 and Algorithm 1 lies in the way the candidate solution expression \(e\) is generated. The generation of candidate expressions proceeds in two steps.
Term Solving. Instead of searching for a single candidate expression that is correct on all points in \(\mathsf {pts}\), Algorithm 2 maintains a set of candidate terms \(\mathsf {terms}\). We say that a term \(t\) covers a point \(\mathsf {pt}\in \mathsf {pts}\) if \(t\models \varPhi \downharpoonleft \mathsf {pt}\). The set of points that a term covers is computed and stored in \(\mathsf {cover}[t]\) (line 15). Note that the algorithm does not store terms that cover the same set of points as already generated terms (line 16). When the set of terms \(\mathsf {terms}\) covers all the points in \(\mathsf {pts}\), i.e., for each \(\mathsf {pt}\in \mathsf {pts}\), there is at least one term that is correct on \(\mathsf {pt}\), the term enumeration is stopped (while-loop condition in line 4).
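A minimal sketch of this term-solving loop follows; enumerate_terms (yielding candidate terms in order of increasing size) and is_correct_on (standing in for the check \(t\models \varPhi \downharpoonleft \mathsf {pt}\)) are assumed helpers, not the paper's identifiers.

```python
def term_solve(pts, enumerate_terms, is_correct_on):
    terms, cover, seen_covers, covered = [], {}, set(), set()
    for t in enumerate_terms():                   # candidate terms, smallest first
        c = frozenset(pt for pt in pts if is_correct_on(t, pt))   # cover[t]
        if c in seen_covers:                      # same cover as an earlier term: skip
            continue
        seen_covers.add(c)
        terms.append(t)
        cover[t] = c
        covered |= c
        if covered == set(pts):                   # every point is covered by some term
            return terms, cover
    return terms, cover                           # enumerator exhausted
```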
Unification and Decision Tree Learning. In the next step (lines 6–9), we generate a set of predicates \(\mathsf {preds}\) that will be used as conditionals to combine the terms from \(\mathsf {terms}\) into if-then-else expressions. In each iteration, we attempt to learn a decision tree that correctly labels each point \(\mathsf {pt}\in \mathsf {pts}\) with a term \(t\) such that \(\mathsf {pt}\in \mathsf {cover}[t]\). If such a decision tree \( DT \) exists, the conditional expression \(\mathsf {expr}( DT )\) is correct on all points, i.e., \(\mathsf {expr}( DT ) \models \varPhi \downharpoonleft \mathsf {pts}\). If a decision tree does not exist, we generate additional terms and predicates and retry.
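A sketch of this unification step, under stated assumptions: learn_dt stands for the LearnDT procedure of Sect. 4.2 (returning None when no tree exists), next_terms and next_preds for the two grammar enumerators, and is_correct_on for the per-point check; none of these are the paper's actual identifiers.

```python
def unify(pts, terms, cover, preds, next_terms, next_preds, learn_dt, is_correct_on):
    while True:
        dt = learn_dt(pts, terms, cover, preds)
        if dt is not None:                # expr(dt) labels every pt with a covering term
            return dt
        # No tree exists with the current terms/predicates: enumerate more of both.
        # New terms are generated even though pts is already covered (cf. Remark 1).
        for t in next_terms():
            cover[t] = frozenset(pt for pt in pts if is_correct_on(t, pt))
            terms.append(t)
        preds.extend(next_preds())
```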

Remark 1
In line 7, we generate additional terms even though \(\mathsf {terms}\) is guaranteed to contain terms that cover all points. This is required to achieve semi-completeness, i.e., without this, the algorithm might not find a solution even if one exists.
Theorem 2
Algorithm 2 is sound for the SyGuS problem. Further, assuming a sound and complete LearnDT procedure, if there exists a solution expression, Algorithm 2 is guaranteed to find it.
The proof of the above theorem is similar to the proof of soundness and semi-completeness for the original enumerative solver. The only additional assumption is that the LearnDT decision tree learning procedure will return a decision tree if one exists. We present such a procedure in the next section.
4.2 Decision Tree Learning
The standard multi-label decision tree learning algorithm (based on ID3 [16]) is presented in Algorithm 3. The algorithm first checks whether there exists a single label (i.e., term) \(t\) that applies to all the points (line 1). If so, it returns a decision tree with only a leaf node whose label is \(t\) (line 1). Otherwise, if no predicates are left, no decision tree exists, and the algorithm returns \(\bot \) (line 2). If predicates remain, the algorithm picks the best predicate \(p\) to split on based on a heuristic (line 3), and recursively computes the left and right sub-trees for the set of points on which \(p\) holds and does not hold, respectively (lines 4 and 5). The final decision tree is returned as a tree with a root (with attribute \(p\)), and positive and negative edges to the roots of the left and right sub-trees, respectively.
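A sketch of Algorithm 3, reusing the Leaf/Node classes from the earlier sketch; holds(p, pt) evaluates a predicate on a point, and pick_best is the line-3 heuristic (an information-gain based version is sketched after the formulas below).

```python
def learn_dt(pts, terms, cover, preds, holds, pick_best):
    pts = set(pts)
    for t in terms:                              # line 1: does one term cover all points?
        if pts <= cover[t]:
            return Leaf(t)
    if not preds:                                # line 2: no predicate left to split on
        return None                              # stands in for returning "bottom"
    p = pick_best(pts, terms, cover, preds, holds)   # line 3: heuristic choice
    rest = [q for q in preds if q != p]
    pts_y = {pt for pt in pts if holds(p, pt)}       # points on which p holds
    pts_n = pts - pts_y                              # points on which p does not hold
    left = learn_dt(pts_y, terms, cover, rest, holds, pick_best)    # line 4
    right = learn_dt(pts_n, terms, cover, rest, holds, pick_best)   # line 5
    if left is None or right is None:
        return None
    return Node(p, left, right)                  # p at the root, left/right sub-trees
```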

Information-Gain Heuristic. The choice of the predicate at line 3 influences the size of the decision tree learned by Algorithm 3, and hence, in our setting, the size of the solution expression generated by Algorithm 2. We use the classical information gain heuristic to pick the predicates. Informally, the information gain heuristic treats the label as a random variable, and chooses to split on the attribute whose value, once known, reveals the most information about the label. We do not describe all aspects of computing information gain, but refer the reader to any standard textbook on machine learning [4]. Given a set of points \(\mathsf {pts}' \subseteq \mathsf {pts}\), the entropy \(H(\mathsf {pts}')\) is defined in terms of the probability \(\mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t)\) of a point \(\mathsf {pt}\in \mathsf {pts}'\) being labeled with the term \(t\) as
$$ H(\mathsf {pts}') = -\sum _{t} \mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t) \cdot \log _2{\mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t)} $$
Further, given a predicate \(p\in \mathsf {preds}\), the expected entropy after splitting on \(p\) is defined as
$$ G(p) = \frac{|\mathsf {pts}_{\mathrm {y}}|}{|\mathsf {pts}|} \cdot H(\mathsf {pts}_{\mathrm {y}}) + \frac{|\mathsf {pts}_{\mathrm {n}}|}{|\mathsf {pts}|} \cdot H(\mathsf {pts}_{\mathrm {n}}) $$
where \(\mathsf {pts}_{\mathrm {y}} = \{\mathsf {pt}\in \mathsf {pts}\mid p[\mathsf {pt}]\}\) and \(\mathsf {pts}_{\mathrm {n}} = \{\mathsf {pt}\in \mathsf {pts}\mid \lnot p[\mathsf {pt}]\}\). Hence, at line 3, we compute the value \(G(p)\) for each predicate in \(\mathsf {preds}\), and pick the one that minimizes \(G(p)\), i.e., the one that maximizes the information gain \(H(\mathsf {pts}) - G(p)\).
We use conditional probabilities \(\mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t\mid \mathsf {pt})\) to compute the probability \(\mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t)\). The assumption we make about the prior distribution is that the likelihood of a given point \(\mathsf {pt}\) being labeled by a given term \(t\) is proportional to the number of points in \(\mathsf {cover}[t]\). Formally, we define:
$$ \mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t\mid \mathsf {pt}) = \begin{cases} \vert \mathsf {cover}[t]\vert \,/\, \sum _{t' : \mathsf {pt}\in \mathsf {cover}[t']} \vert \mathsf {cover}[t']\vert &{} \text {if } \mathsf {pt}\in \mathsf {cover}[t] \\ 0 &{} \text {otherwise} \end{cases} $$
Now, the unconditional probability of an arbitrary point being labeled with \(t\) is given by \(\mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t) = \sum _{\mathsf {pt}\in \mathsf {pts}'} \mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t\mid \mathsf {pt})\cdot \mathbb {P}_{\mathsf {pts}'}(\mathsf {pt})\). Assuming a uniform distribution for picking points, we have that
$$ \mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t) = \frac{1}{\vert \mathsf {pts}'\vert } \cdot \sum _{\mathsf {pt}\in \mathsf {pts}'} \mathbb {P}_{\mathsf {pts}'}(\mathsf {label}(\mathsf {pt}) = t\mid \mathsf {pt}) $$
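A sketch of the whole heuristic under the prior above: label probabilities are computed from cover, entropy from the label probabilities, and the predicate with the smallest expected post-split entropy \(G(p)\) is chosen. The helper holds(p, pt), evaluating a predicate on a point, is an assumption.

```python
from math import log2

def cond_label_prob(pt, t, terms, cover):
    """P(label(pt) = t | pt): proportional to |cover[t]| among the terms covering pt."""
    if pt not in cover[t]:
        return 0.0
    return len(cover[t]) / sum(len(cover[u]) for u in terms if pt in cover[u])

def label_prob(pts_sub, t, terms, cover):
    """P_{pts'}(label(pt) = t): average of the conditional probabilities over pts'."""
    if not pts_sub:
        return 0.0
    return sum(cond_label_prob(pt, t, terms, cover) for pt in pts_sub) / len(pts_sub)

def entropy(pts_sub, terms, cover):
    """H(pts') over the label distribution induced by the prior above."""
    h = 0.0
    for t in terms:
        q = label_prob(pts_sub, t, terms, cover)
        if q > 0:
            h -= q * log2(q)
    return h

def pick_best(pts, terms, cover, preds, holds):
    """Choose the predicate minimizing G(p), i.e. maximizing the information gain."""
    def g(p):
        pts_y = {pt for pt in pts if holds(p, pt)}
        pts_n = set(pts) - pts_y
        return (len(pts_y) / len(pts)) * entropy(pts_y, terms, cover) + \
               (len(pts_n) / len(pts)) * entropy(pts_n, terms, cover)
    return min(preds, key=g)
```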
4.3 Extensions and Optimizations
The Anytime Extension. Algorithm 2 stops the enumeration of terms and predicates as soon as it finds a single solution to the synthesis problem. However, there are cases where, due to the lack of sufficiently good predicates, the decision tree and the resulting solution can be large (see Example 9). Instead, we can let the algorithm continue, generating more terms and predicates. This can lead to different, potentially smaller decision trees and solutions.
Example 9
Consider the specification \((x \ge 0 \wedge y \ge 0) \Rightarrow (f(x, y) = 1 \Leftrightarrow x + y \le 2)\) and a run of Algorithm 2 in which the terms 0 and 1 are generated; together, these terms cover any set of points for this specification. Over a sequence of iterations, predicates are generated in order of size. The predicates of size 3 that are generated include \(x = 0\), \(x = 1\), \(x = 2\), \(y \le 2\), \(y \le 1\), and \(y \le 0\). With these predicates, the decision tree depicted in Fig. 3a is learned, and the corresponding conditional expression is correct for the specification. However, if the procedure continues to run after the first solution is generated, predicates of size 4 are generated. Among these is the predicate \(x + y \le 2\). With this additional predicate, the decision tree in Fig. 3b is learned, leading to the compact solution \(f(x, y) \equiv \mathtt {if}~x + y \le 2~\mathtt {then}~1~\mathtt {else}~0\).
Fig. 3. Initial decision tree (a) and the more compact version (b) learned with the anytime extension for Example 9
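A sketch of the anytime extension under stated assumptions: one_iteration runs one round of Algorithm 2 (possibly enumerating more terms and predicates), verify checks a candidate against the specification, and size measures expression size; all three names are placeholders.

```python
import time

def anytime_synthesize(budget_seconds, one_iteration, verify, size):
    best = None
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        candidate = one_iteration()              # may yield a solution built from a larger grammar slice
        if candidate is not None and verify(candidate):
            if best is None or size(candidate) < size(best):
                best = candidate                 # keep the smallest verified solution so far
    return best
```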
Decision Tree Repair. In Algorithm 2, we discard the terms that cover the same set of points as already generated terms in line 16. However, these discarded terms may lead to better solutions than the already generated ones.
Example 10
Consider a run of the algorithm for the running example, where the set \(\mathsf {pts}\) contains the points \(\{x \mapsto 1, y \mapsto 0\}\) and \(\{x \mapsto -1, y \mapsto 0\}\). Suppose the algorithm first generates the terms 0 and 1. These terms are each correct on one of the points and are added to \(\mathsf {terms}\). Next, the algorithm generates the terms x and y. However, these are not added to \(\mathsf {terms}\) as x (resp. y) is correct on exactly the same set of points as 1 (resp. 0).
Suppose the algorithm also generates the predicate \(x \le y\) and learns the decision tree corresponding to the expression \(e\equiv \mathtt {if}~x \le y~\mathtt {then}~0~\mathtt {else}~1\). Now, verifying this expression produces a counter-example point, say \(\{x \mapsto 1, y \mapsto 2\}\). While the term 0, and correspondingly the expression \(e\), is incorrect on this point, the term y, which was discarded as equivalent to 0, is correct on it.
Hence, for a practical implementation of the algorithm, we do not discard these terms and predicates, but store them separately in a map \( Eq : \mathsf {terms}\rightarrow 2^{[\![ G_T ]\!]}\) that maps each term in \(\mathsf {terms}\) to a set of equivalent terms. At line 16, if the check for distinctness fails, we instead add the term \(t\) to the \( Eq \) map. Now, when the decision tree learning algorithm returns an expression that fails to verify and produces a counter-example, we attempt to replace terms and predicates in the decision tree with equivalent ones from the \( Eq \) map so that the decision tree behaves correctly on the counter-example.
Example 11
Revisiting Example 10, instead of discarding the terms x and y, we store them in the \( Eq \) map, i.e., we set \( Eq (0) = \{y\}\) and \( Eq (1) = \{x\}\). Now, when the verification of the expression fails with the counter-example point \(\{x \mapsto 1, y \mapsto 2\}\), we check which term the decision tree returns for the counter-example point (here, 0). We then check whether any term in \( Eq (0)\) is correct on the counter-example point (here, the term y is). If so, we replace the original term with the equivalent term that is additionally correct on the counter-example point and proceed with verification. Replacing 0 with y in the expression gives us \(\mathtt {if}~x \le y~\mathtt {then}~y~\mathtt {else}~1\). Another round of verification and decision tree repair leads to replacing the term 1 with x, giving us the final correct solution.
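A sketch of this repair step, reusing the Leaf/follow helpers from the earlier sketch; holds and is_correct_on are assumed helpers for evaluating predicates on points and checking a term against the specification restricted to a point.

```python
def repair(dt, cex, eq, holds, is_correct_on):
    leaf = follow(dt, cex, holds)               # the leaf whose term labels cex
    if is_correct_on(leaf.label, cex):
        return True                             # this branch already handles cex
    for t in eq.get(leaf.label, set()):         # equivalent terms recorded at line 16
        if is_correct_on(t, cex):
            leaf.label = t                      # replace by a term also correct on cex
            return True
    return False                                # no equivalent term works; re-learn instead
```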
Branch-Wise Verification. In Algorithm 2, and in most synthesis techniques, an incorrect candidate solution is used to generate one counter-example point. However, in the case of conditional expressions and point-wise specifications, each branch (i.e., leaf of the decision tree) can be verified separately. Verifying a branch involves rewriting the specification as in the point-wise verification defined in Sect. 3, but instead of adding a premise to each clause asserting that the arguments to the function are equal to a point, we add a premise asserting that the arguments satisfy all predicates along the path to the leaf. This gives us two advantages:
- We are able to generate multiple counter-examples from a single incorrect expression. This reduces the total number of iterations required, as well as the number of calls to the expensive decision tree learning algorithm.
- It reduces the complexity of each call to the verifier in terms of the size of the SMT formula to be checked. As verification procedures generally scale exponentially with respect to the size of the SMT formula, multiple simpler verification calls are often faster than one more complex call.
This optimization works well in combination with the decision tree repair described above, as we can verify and repair each branch of the decision tree separately.
Example 12
Consider the verification of the expression \(\mathtt {if}~x \le y~\mathtt {then}~0~\mathtt {else}~1\) for the running example. Instead of running the full expression through the verifier to obtain one counter-example point, we can verify the branches separately by checking the satisfiability of the formulae \(x \le y \wedge f(x, y) = 0 \wedge \lnot \varPhi \) and \(\lnot (x \le y) \wedge f(x, y) = 1 \wedge \lnot \varPhi \). This gives us two separate counter-example points.
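A sketch of branch-wise verification under stated assumptions: check_branch is a placeholder for an SMT query that, given the path condition and the leaf's term \(t\), checks the satisfiability of path \(\wedge ~f(\ldots ) = t~\wedge ~\lnot \varPhi \) (as in Example 12) and returns a counter-example point or None; negate builds the negation of a predicate. The Leaf/Node classes are those of the earlier sketch.

```python
def verify_branches(dt, path, check_branch, negate):
    """Collect one counter-example per incorrect branch of the decision tree."""
    if isinstance(dt, Leaf):
        # SMT query: (path predicates) /\ f(args) = dt.label /\ not Phi, as in Example 12
        cex = check_branch(path, dt.label)
        return [] if cex is None else [cex]
    return (verify_branches(dt.left, path + [dt.attribute], check_branch, negate) +
            verify_branches(dt.right, path + [negate(dt.attribute)], check_branch, negate))

# Usage sketch: verify_branches(dt, [], check_branch, negate) returns a list of
# counter-example points, one per failing leaf, all of which can be added to pts.
```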