Complexity of Regex Crosswords
Abstract
In a regular expression crossword puzzle, one is given two non-empty lists \(\tup{\tup{R_1,\ldots , R_m}$ and $\tup{C_1, \ldots , C_n}}\) over some alphabet, and the challenge is to fill in an \(m\times n\) grid of characters such that the string formed by the \(i^\text{th}\) row is in \(L(R_i)\) and the string in the \(j^\text{th}\) column is in \(L(C_j)\). We consider a restriction of this puzzle where all the \(R_i\) are equal to one another and similarly the \(C_j\). We consider a 2-player version of this puzzle, showing it to be \(\relax \mathbf{PSPACE}\)-complete. Using a reduction from \(\relax \mathbf{3SAT}\), we also give a new, simple proof of the known result that the existence problem of a solution for the restricted (1-player) puzzle is \(\relax \mathbf{NP}\)-complete.
1 Introduction
Regular expression crossword puzzles (regex crosswords, for short) share some traits in common with traditional crossword puzzles and with sudoku. One is typically given two lists \(R_1,\ldots ,R_m\) and \(C_1,\ldots ,C_n\) of regular expressions labeling the rows and columns, respectively, of an \(m\times n\) grid of blank squares. The object is to fill in the squares with letters so that each row, read left to right as a string, matches (i.e., is in the language denoted by) the corresponding regular expression, and similarly for each column, read top to bottom. The solution itself may have some additional property, e.g., spelling out a phrase or sentence in row major order.
Regex crosswords have enjoyed some recent popularity, having been discussed in several popular media sources [? ? ], and thanks to some websites where people can solve the puzzles online [? ? ]. Some variants of the basic puzzle have also been posed [? ].
A natural complexity theoretic question to ask is: How hard is it to solve a regex crossword in general?1 The folklore answer—easy to show and apparently found by several people independently—is that it is \(\relax \mathbf{NP}\)-hard, and the corresponding decision problem (“Does a solution exist?”) is \(\relax \mathbf{NP}\)-complete.
In this paper, we consider two variations on the basic regex crossword puzzle: (1) a restriction of the puzzle where all the row regexes \(R_1,\ldots ,R_m\) are equal and all the column regexes \(C_1,\ldots ,C_n\) are equal; and (2) a 2-player game where players take turns attempting to fill in successive rows and columns of the grid. Variation (2) can also be restricted to having equal row regexes and equal column regexes for the two players. These variants have corresponding decision problems: Let \(\relax \mathbf{RC}\) be the solution existence problem for variation (1), \(\relax \mathbf{RCG'}\) the first-player-win problem for variation (2), and \(\relax \mathbf{RCG}\) the first-player-win problem for the restricted version of (2) (see Sections 3 and 4 for precise definitions). Our main result is that \(\relax \mathbf{RCG'}\) and \(\relax \mathbf{RCG}\) are both \(\relax \mathbf{PSPACE}\)-complete (see Section 4, below). We give explicit polynomial reductions from \(\relax \mathbf{TQBF}\) to \(\relax \mathbf{RCG'}\) and from \(\relax \mathbf{RCG'}\) to \(\relax \mathbf{RCG}\).
The \(\relax \mathbf{NP}\)-completeness of \(\relax \mathbf{RC}\) was shown in [? ],2 but the polynomial reduction used there was indirect and needlessly complicated for its purpose. As a warm-up to our main result, we give a simple, straightforward polynomial reduction from \(\relax \mathbf{3SAT}\) to \(\relax \mathbf{RC}\).
In the spirit of the Post Correspondence Problem in computability, our results have the pedagogical benefit of showing the hardness of some decision problems in automata theory that are simply stated and accessible to any undergraduate theory student. The proofs given here are similarly accessible.
1.1 Connections to other work
Regex crossword techniques bear some similarity to results in cellular automata, to the Cook-Levin theorem, and to results of Berger from the 1960s showing the undecidability of tiling the plane with Wang tiles (the so-called “domino problem” [? ], which was the first proof that there exist finite tile sets that tile the whole plane but only aperiodically).
The particular problems we study here are perhaps chiefly inspired by results in the theory of two-dimensional languages (picture languages) from formal language theory [? ]. Given two regexes \(R\) and \(C\) for the rows and columns, respectively, the unbounded \((R,C)\)-crossword problem asks whether a solution grid exists of any size. One can show that the recognizable picture languages coincide exactly with the letter-to-letter projections of \((R,C)\)-crossword solutions [? , Theorem 8.6] (except that the empty picture may also be included in the language). Recognizable picture languages can be defined in terms of finite objects known as tiling systems [? ] (cf. [? , Definition 7.2]), and given a tiling system \(\mathcal{T}\), it is not hard to show that one can effectively find two regular expressions \(R\) and \(C\) (over some alphabet) and a projection \(\pi \) that defines the same picture language as \(\mathcal{T}\). The existence problem for recognizable picture languages (“Given a tiling system, does it define a nonempty language?”) is known to be undecidable ([? , Theorem 9.1]), and so, putting these results together, we get that the existence problem for unbounded \((R,C)\)-crosswords is undecidable as well. A much more direct reduction from the halting problem to unbounded \((R,C)\)-crossword existence was given in [? ], where it was also shown that one could even fix the column regex \(C\) once and for all, as well as restricting \(R\) and \(C\) to be over a binary alphabet.
The unbounded regex crossword problem naturally assumes one regex \(R\) for all rows and one regex \(C\) for all columns, since the number of rows and columns is unspecified. This directly motivates us to impose similar restrictions on the bounded regex crossword problems we study here, where the dimensions of the grid are given as part of the input.
We give some basic concepts and definitions in Section 2. Section 3 gives our polynomial reduction from \(\relax \mathbf{3SAT}\) to \(\relax \mathbf{RC}\). This reduction suggests the technique we use to show our main results about 2-player crossword games in Section 4. We give open problems in Section 5.
2 Preliminaries
We fix an alphabet \(\Sigma \) once and for all and assume it contains the symbols \(0\) and \(1\) at least. For the \(\relax \mathbf{NP}\)-completeness result of Section 3, one can assume that \(\Sigma = \{0,1\}\). For the \(\relax \mathbf{PSPACE}\)-completeness result of Section 4, it suffices that \(\Sigma = \{0,1,2\}\).
2.1 3SAT
An instance of \(\relax \mathbf{3SAT}\) is described by a Boolean formula \(\p \) over \(k\) variables \(x_1,\ldots ,x_k\), given in conjunctive normal form: \begin{align*} \p \coloneqq C_i \land \cdots \land C_d \end{align*}
where each \(C_i\) is a clause of three literals (each a variable or its negation) connected by disjunctions: \begin{align*} C_i \coloneqq \ell _{i,1} \lor \ell _{i,2} \lor \ell _{i,3} \end{align*}
The question is, is \(\p \) true (is it satisfied) for some assignment of the variables. This is the canonical complete problem for \(\relax \mathbf{NP}\). In Section 3 we show that the language \(\relax \mathbf{RC}\) — the language of \((R,C)\)-crosswords — is \(\relax \mathbf{NP}\)-complete by giving reduction from \(\relax \mathbf{3SAT}\).
2.2 TQBF
An instance of \(\relax \mathbf{TQBF}\) is described by a closed Boolean formula \(\p \), given in prenex normal form: \begin{align} \label{eqn:qbf} \p \coloneqq \exists x_0 \forall y_0 \cdots \exists x_{k-1} \forall y_{k-1}\exists x_k \tilde{\p }(x_0, y_0, \ldots , x_{k-1}, y_{k-1}, x_k) \end{align}
where \(\tilde{\p }\) is a quantifier-free Boolean formula which can be assumed to be in conjunctive normal form with \(c\) clauses and \(2k+1\) variables, for some positive \(c\) and \(k\).
The sentence \(\p \) is naturally viewed as a two-player game, where the players alternate choosing truth values for the variables in order, the first player wishing to make the formula \(\tilde{\p }\) true and second player wishing to make it false. The question to be answered is whether \(\p \) is true when the quantified variables range over the Boolean values \(\False \) and \(\True \). 3 That is, whether the first player has a winning strategy in the corresponding game.
As \(\relax \mathbf{3SAT}\) is for \(\relax \mathbf{NP}\), \(\relax \mathbf{TQBF}\) is the canonical complete problem for \(\relax \mathbf{PSPACE}\). In Section 4, \(\relax \mathbf{RCG}\) — the language of (R,C)-crossword games (defined below) with a winning strategy for the first player — is \(\relax \mathbf{PSPACE}\)-complete by reduction from \(\relax \mathbf{TQBF}\).
3 (R,C)-crosswords
For two given regexes \(R\) and \(C\) over \(\Sigma \), an \((R,C)\)-crossword solution is a two-dimensional \(m\) by \(n\) grid of symbols from the alphabet. Interpreting rows and columns as strings, each row must match \(R\) and each column must match \(C\).
An \((R,C)\)-crossword is represented as a 4-tuple \(\tup{0^m, 0^n, R, C}\) where the number of rows and columns are given in unary as \(m\) and \(n\), and \(R\) and \(C\) are row and column regexes over \(\Sigma \) (defined in the usual way, using the operators \(\cup \), \(\|\), \(*\), where \(\|\) or juxtaposition both indicate concatenation).
Definition 1. The language \(\relax \mathbf{RC}\) is the set of all \((R,C)\)-crosswords for which there exists an \((R,C)\)-crossword solution of the given dimensions.
\(\relax \mathbf{RC}\) was shown to be \(\relax \mathbf{NP}\)-complete in [? ] via an indirect, complicated reduction. In this section, we give a much more straightforward polynomial reduction from \(\relax \mathbf{3SAT}\) to \(\relax \mathbf{RC}\).
3.1 The reduction
Given a Boolean formula \(\p \) with \(k\ge 1\) variables and \(d\) clauses as defined in Section 2.1 above (where we can assume \(d\ge 3\)), we construct an instance \(\tup{0^{d+1},0^{k+d},R,C}\) of \(\relax \mathbf{RC}\) as follows: For \(1 \le i \le d\), we define \(t_i\) to be the regex \begin{align*} t_i & = \0^{i-1}\1\0^{d-i} = \underbrace{\0 \cdots \0}_{i - 1} \1 \underbrace{\0 \cdots \0}_{d - i}\;. \\ \shortintertext{Then we define} S & = \1^d\0^* \\ R & = \left ( \displaystyle \bigcup _{i=1}^{d} t_i R_i \right ) \cup S \\ C & = \1\left (\0^*\1\0^*\right ) \cup \0(\0^* \cup \1^*) \end{align*}
where \(S\) is called the ‘spine,’ and for \(1\le i\le d\), \(R_i\) is derived from the formula \(\p \) as follows: \begin{align*} R_i & = (a_{i,1} \cdots a_{i,k}) \cup (b_{i,1} \cdots b_{i,k}) \cup (c_{i,1} \cdots c_{i,k}) \\ \shortintertext{where, for $1 \le j \le k$,} a_{i,j} & = \left \{ \begin{array}{ll} \1 & \text{ if the first literal in the } i^{\text{th}} \text{ clause is } x_j \\ \0 & \text{ if the first literal in the } i^{\text{th}} \text{ clause is } \overline{x_j} \\ (\1 \cup \0) & \text{ otherwise } \end{array} \right . \end{align*}
and \(b_{i,j},\;c_{i,j}\) are set similarly according to the second and third literals in each clause.
We show that \(\p \) is satisfiable iff an \((R, C)\)-crossword solution exists.
First, assuming that \(\p \) is satisfiable, where \(\tup{z_1, \ldots , z_k}\) is a satisfying assignment, then this sets up a \(d+1\) by \(d+k\) crossword solution of the following form:
Here, the first row is the spine (matching \(S\)); the block on the left below the spine is akin to an identity matrix; and the block on the right consists of columns where each column is either all \(1\)’s or all \(0\)’s (save the first element, which is always \(0\)), according to each \(z_i\). An overview representation is shown below:
Where the spine is the string that matches \(S\). The ‘clause verification region’ is determined by the satisfying assignment to \(\p \), i.e., if \(z_j\) is true in the satsifying assignment, then column \(c_{d + j}\) will match the regex \(\0\1^*\); otherwise it will match \(\0\0^*\).
By construction, it is clear that if \(\p \) is satisfiable, then the \((R, C)\)-crossword constructed above is solvable. In other words, there is a way to fill in the crossword such that all rows match the regular expression \(R\), and all columns match the regular expression \(C\).
In fact, since the calibration region requires only one \(1\) per row and column, the solution given in table 1 is not the only valid one. It is easy to see that once any solution is given, any rearranging of the (non-spine) rows gives another valid solution. Due to this fact it is guaranteed that for each \(i\), some row matches \(t_i R_i\), which is important for the converse below.
3.2 An \((R, C)\)-crossword solution guarantees \(\p \) is satisfiable
To complete the proof, it must be shown that if the crossword is solvable, this implies that \(\p \) is satisfiable. We do this via a series of lemmas.
Here we assume an \((R, C)\)-crossword solution exists with rows \(\tup{r_0, \ldots , r_d}\) and columns \(\tup{c_1, \ldots , c_{d+k}}\).
Observe that since each \(r_j\) matches \(R\), it must either start with \(d\) many \(1\)’s or else have exactly one \(1\) among its first \(d\) symbols.
From the definition of \(C\), we see that \(c_i\) must match \(\1(\0^*\1\0^*)\), that is, \(c_i = 10^{j-1}10^{d-j}\) for some \(1\le j\le d\). The picture below shows the case where \(i=2\) and \(j=2\), that is, where \(c_i = c_2 = 10100 \cdots 0\):
For \(r_j\), we have two cases, both leading to contradiction:
- \(r_j\)
matches \(S\): - This requires that all of the first \(d\) columns other than \(c_i\) match \(\0\1^*\), which means \(r_{j'}\) starts with \(1^{i-1}01^{d-i}\cdots \) for all \(j'\ge 1\) such that \(j' \ne j\). These rows do not match \(R\).
- \(r_j\)
matches \(t_i R_i\), that is, \(r_j = 0^{i-1}10^{d-i}\cdots \): - This requires that all of the first \(d\) columns other than \(c_i\) match \(\0^*\), which means no rows other than \(r_j\) and \(r_0\) will match \(R\), since they all start with \(0^d\).
This proves the lemma. □
By Lemma 2, the first \(d\) columns must match \(\1(\0^*\1\0^*)\); we call such columns
But then column \(c_\ell \) does not match \(C\). This completes the proof. □
Since \(i\) was arbitrary, we have that \(\p \) is satisfied by \(\tup{z_1,\ldots ,z_k}\). □
4 (R,C)-crossword games
For two given regexes \(R\) and \(C\) over \(\Sigma \), an \((R,C)\)
Player 2, who we call
We represent an \((R,C)\)-game as a 4-tuple \(\tup{0^m,0^n,R,C}\), where \(m\) and \(n\) are positive integers (the number of rows and columns of the grid, respectively), and \(R\) and \(C\) are the corresponding regexes over \(\Sigma \) (defined in the usual way, using the operators \(\cup \), \(\|\), \(*\)).
Note that the numbers \(m\) and \(n\) are given in
4.1 \(\relax \mathbf{RCG}\) \(\in \lang{PSPACE}\)
It is straightforward to observe that \(\relax \mathbf{RCG}\) \(\in \lang{PSPACE}\). This follows from the properties of \((R,C)\)-games: Given an instance of \(\relax \mathbf{RCG}\) of size \(N = m \cdot n\),
- all game positions are representable by strings of polynomial length (in \(N\)),
- any play of the game lasts for at most polynomially many turns, and
- given any game position, whether a given next move is legal can be determined in polynomial space (polynomial time, in fact).
For this it is crucial that the dimensions of the board be given in unary. If the dimensions were given in binary, then we conjecture that the corresponding language would be complete for \(\relax \mathbf{EXPSPACE}\). Also note that the regex matching problem (“Given a regex \(E\) and string \(w\), does \(w\) match \(E\)?”) is in \(\relax \mathbf{P}\).
4.2 Hardness of \(\relax \mathbf{RCG}\)
Here is the main result of our paper.
To prove Theorem 7, our main result, we first consider a variant of \(\relax \mathbf{RCG}\), where each row and each column may correspond to a different regex, that is, the input is a pair \(\tup{\tup{R_1,\ldots ,R_m},\tup{C_1,\ldots ,C_n}}\) of lists of regexes. Rose and Colin alternate turns as before, but on her \(i^\text{th}\) turn, Rose must fill the remainder of the \(i^\text{th}\) row so that it matches \(R_i\), and similarly, on his \(j^\text{th}\) turn, Colin must fill the remainder of the \(j^\text{th}\) column so that it matches \(C_j\). Call this variant \(\relax \mathbf{RCG'}\).
We show our main result in two steps: in Lemma 8 we show how to polynomially reduce \(\relax \mathbf{TQBF}\) to \(\relax \mathbf{RCG'}\); then we give a polynomial reduction from \(\relax \mathbf{RCG'}\) to \(\relax \mathbf{RCG}\) (Theorem 9 below). In using \(\relax \mathbf{RCG'}\), the goal is to first consider a ‘simpler’ game to verify that there is a correspondence between the formulæ in \(\relax \mathbf{TQBF}\) and the possible games in \(\relax \mathbf{RCG}\).
where given \(1\le i\le k+1\), and \(a,b\in \{\0,\1\}\), the regexes \(S_{iab}\) and \(T_a\) are defined as follows: First, for \(1\le j\le c\) let \begin{align*} u_j & \coloneqq \left \{ \begin{array}{ll} \0 & \mbox{if $x_{i-1}$ occurs negatively in the $j$th clause,} \\ \1 & \mbox{if $x_{i-1}$ occurs positively in the $j$th clause,} \\ \bot & \mbox{if $x_{i-1}$ does not occur in the $j$th clause.} \end{array}\right . & v_j & \coloneqq \left \{ \begin{array}{ll} \0 & \mbox{if $y_{i-1}$ occurs negatively in the $j$th clause,} \\ \1 & \mbox{if $y_{i-1}$ occurs positively in the $j$th clause,} \\ \bot & \mbox{if $y_{i-1}$ does not occur in the $j$th clause.} \end{array}\right . \end{align*}
Now for \(1\le j \le c\), define \begin{align*} d_j & \coloneqq \left \{ \begin{array}{ll} \1 & \mbox{if $u_j=a$ or $v_j=b$,} \\ \0 & \mbox{otherwise.} \end{array}\right . & e_j & \coloneqq \left \{ \begin{array}{ll} \1 & \mbox{if $u_j=a$,} \\ \0 & \mbox{otherwise.} \end{array}\right . \end{align*}
Finally, we let \(S_{iab} := d_1\|\cdots \|d_n\) and \(T_a\coloneqq e_1\|\cdots \| e_n\).
Each of the first \(k+1\) rows and columns corresponds to the truth value (\(0\) or \(1\)) of one or two particular variables in the original formula, as depicted in Figure 2. The remainder of the rows (\(c\) of them) correspond to the clauses of \(\p \).
Here is how this \(\lang{\RCG '}\) game reflects the original instance of \(\lang{\TQBF }\) viewed as a game. When Rose plays the \(i\)th row (for \(1\le i\le k+1\)) she is able to choose the truth value of \(x_{i-1}\) by placing a \(0\) or \(1\) in the corresponding square in Figure 2 (Rose can play any binary string in the remainder of her row, because \(R_i ={(\0\cup \1)}^*\)). Then when Colin plays the remainder of the \(i\)th column according to \(C_i\), he can similarly choose the truth value of \(y_{i-1}\) by placing a \(0\) or \(1\) in the corresponding square. However, because of the \(S_{iab}\) component of \(C_i\), Colin is forced to then place a \(1\) in each of the last \(c\) positions corresponding to a clause that is satisfied by the truth settings of these two variables. (The minor exception is the \((k+1)\)st column, where there is only the variable \(x_k\) to consider.)
Also note that in order for Rose to complete the board, there must be a \(1\) in at least one of the first \(k+1\) positions in every row of the clause region. That is, Rose can win just when the chosen truth values of the variables satisfy all clauses of \(\tilde{\p }\), making the two games equivalent. Our construction is clearly polynomial time, which finishes the proof. □
4.3 Constraining the Players
The rest of this section is a proof of Theorem 9. To reduce from \(\relax \mathbf{RCG'}\) to \(\relax \mathbf{RCG}\) we need to provide a method to consolidate the families of regular expressions into one regex per player. Here, we present a generic construction that forces the players to play in order, which can be applied to any \(\lang{RCG'}\) game — forcing each player to play their families of regexes in index order.
Given an arbitrary instance \(G \coloneqq \tup{\tup{R_1,\ldots ,R_m},\tup{C_1,\ldots ,C_n}}\) of \(\relax \mathbf{RCG'}\), we construct an equivalent instance of \(\relax \mathbf{RCG}\). Our construction requires the alphabet \(\Sigma \) to contain a third symbol “\(2\)” that is not part of any string matching any of the \(R_i\) or \(C_i\). We currently do not know how to remove this requirement. We can assume that the given \(\relax \mathbf{RCG'}\) grid is square, i.e., \(m=n\): Suppose this is not the case; for example, suppose \(m < n\). Then we can pad the grid with \(n-m\) bottom rows by
- concatenating each \(C_i\) with \(\0^{n-m}\) on the right, and
- defining \(R_i \coloneqq \1^*\) for \(m < i \le n\),
yielding an evidently equivalent \(n\times n\) game. We can do something similar if \(m > n\). The instance of \(\lang{\RCG }\) we construct from \(G\) will then be a \((2n+1)\times (2n+1)\) game \(H \coloneqq \tup{0^{2n+1},0^{2n+1},R,C}\). We may also assume that \(n\ge 2\).
The regular expressions \(R\) and \(C\) we construct for the respective players are given below, again with explanations afterwards:
Figure 4a illustrates how \(H\) ‘wraps’ around the game \(G\): players first fill in the spine, then regions I, II, and III before simulating the game \(G\) in the lower right square (light grey).
4.4 Normal Play
By a
Spine: - In round \(0\), both players play the spine, i.e., a string matching \(\2\1\0^*\).
Calibration: - In round \(i\), where \(1\le i\le n\), Rose and Colin each play a ‘calibration string,’ i.e. either the string matching \(\0^i\1^3\0^{n-i-2}\|\0^i\1\0^{n-i-1}\) (if \(i<n\)) or \(\0\0^{n-2}\1\1\|\0^{n-1}\1\) (if \(i=n\)).
Simulation: - Rose and Colin now simulate the given \(\lang{\RCG '}\) game: In round \((n+i)\), for \(1\le i\le n\), Rose plays a string matching \(\0^i\1\0^{n-i-1}\|R_i\) (if she can), and Colin plays a string matching \(\0^i\1\0^{n-i-1}\|C_i\) (if he can).
Figure 4b illustrates the state of the grid after round \(n\) of normal play (here, \(n=16\)). If either player deviates from normal play, we say that the first player to do so is
In either case, Rose would quickly lose. If Rose does play \(\2\1\0^*\) on her first turn, Colin must play a string prefixed with \(2\), his only option matching the regex \(\2\1\0^*\). □
Rose has a choice of regexes (
Rose cannot then play any string with prefix \(0^i2\), so she loses in round \((i+1)\). □
The preceding lemmas show that normal play is optimal for both players (even required for Colin) through round \(n\). Thus we can assume normal play through round \(n\), filling regions II and III of the grid with \(1\)’s along their diagonals and \(0\)’s elsewhere (as with the identity matrix).
In rounds \((n+1)\) through \(2n\), the players are essentially playing the game \(G\) in region IV, so the winner of \(H\) is the winner of \(G\). This completes the proof of Theorem 9.
5 Open Problems
The most immediate question arising from our work is whether \(\relax \mathbf{RCG}\) is \(\relax \mathbf{PSPACE}\)-hard restricted to a binary alphabet. Our proof shows only that it is \(\relax \mathbf{PSPACE}\)-hard for a ternary alphabet. Doing without the third symbol “\(2\)” in the alphabet currently seems like a daunting task, despite the fact that under normal play, that symbol appears only once in the upper left-hand corner.
Another question is whether we still get \(\relax \mathbf{PSPACE}\)-hardness if we restict the regexes \(R\) and \(C\) to be equal to each other. If one can show \(\relax \mathbf{PSPACE}\)-hardness for \(\relax \mathbf{RCG'}\) restricted so that \(R_i = C_i\) for all \(i\), then it may be easy to get \(R=C\) for the constructed instance of \(\relax \mathbf{RCG}\), since these two latter regexes are close to being equal anyway.
Acknowledgments
We would like to thank Thomas Thierauf for several interesting discussions on this topic and to Joshua Cooper for finding for us a particularly challenging and fun regex crossword puzzle. We are also grateful to Klaus-Jörn Lange for suggesting the connection between our work and the theory of picture languages.
1
2
3
4