Reformulation and compression (how changing the language changes the leverage)
Reformulation is translation that preserves meaning while unlocking better tools. Compression is the common thread, and neuro-symbolic gates turn proposals into evidence.
This tutorial is about a quiet superpower: saying the same thing in a different language, without changing what is true, and suddenly having stronger tools.
It is also about the other half of the story. Once a language is expressive enough to talk about real systems, there are hard limits on what can be proven or solved automatically. That is where “propose, then gate with evidence” becomes a practical architecture.
Mental pictures to keep
- A problem as a shape that can be redrawn in many coordinate systems
- A reformulation space: representations as nodes, translations as edges
- Abstraction as compression: forget detail, keep the structure needed for a property
- Symbol manipulation as search with structure, not guessing
- Logic as a counterexample factory for universal claims
- Neuro-symbolic programming as a formal gate: propose, check, refute, refine
Part I: inside my head (the problem did not change, but the tool did)
I am staring at a puzzle that feels small, but the search space is enormous.
There is a row of lights, some on and some off. A button press flips a fixed pattern. The goal is the all-off board.
For a few lights, I can get away with guessing. As the row grows, guessing turns into noise. Each press branches the future, and I cannot see a reliable path.
Then I redraw the situation.
Instead of “a row of lights”, I see a bitvector. Instead of “toggle”, I see XOR. Instead of “a sequence of button presses”, I see addition mod 2.
Now the puzzle is no longer a tree of guesses. It is a linear system over $\mathbb{F}_2$.
The board did not change. The question did not change. Whether a solution exists did not change. Only the representation changed. The toolchain changed from “try moves” to “Gaussian elimination”.
That is reformulation in one sentence:
Keep the meaning fixed, change the language, and steal the best tools that language has.
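The translation can be sketched in a few lines before any matrix machinery appears. This is a minimal sketch: the 4-light row and the "flip yourself and your neighbors" toggle rule are illustrative assumptions, not the only possible flip pattern.

```python
# A tiny "row of lights" puzzle, re-seen as bitvectors and XOR.
# Assumption: 4 lights, each button flips itself and its neighbors.

N = 4  # number of lights

def button_mask(i: int) -> int:
    """The fixed flip pattern of button i, as a bitmask."""
    mask = 1 << i
    if i > 0:
        mask |= 1 << (i - 1)
    if i < N - 1:
        mask |= 1 << (i + 1)
    return mask

start = 0b1011  # some lights on

# "Press a sequence of buttons" is just XOR-ing masks onto the state.
after = start ^ button_mask(0) ^ button_mask(2)

# Two facts the reformulation makes obvious:
# 1) press order never matters (XOR is commutative),
# 2) pressing the same button twice cancels (x ^ m ^ m == x).
assert start ^ button_mask(2) ^ button_mask(0) == after
assert (start ^ button_mask(1)) ^ button_mask(1) == start
```

The two assertions at the end are exactly the algebraic facts that collapse the "tree of guesses": only the set of pressed buttons matters, and each button matters only mod 2.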
Part II: reformulation space (representations as a graph)
From far away, reformulation looks like a graph problem.
The running question stays fixed:
Can this light board be driven to the all-off state?
This section keeps redrawing that same question. The object does not change. The coordinates do.
- Each node is a representation of the same underlying situation (a puzzle board, a vector, a formula).
- Each edge is a translation (sometimes lossless, sometimes approximate).
- Each node comes with a different library of tools (algebra, graph algorithms, SAT solving, model checking).
Reformulation space
To make this precise, we reuse the distinctions from Tutorial 2:
- Isomorphism: a 1-to-1 translation that is reversible and preserves the operations and relations in view.
- Equivalence: a “same meaning” relation, often defined by equality after applying a semantics function.
- Sound abstraction: a deliberate merge of distinctions that preserves a chosen property (often one-way).
- Encoding/simulation: one formalism represents another’s behavior, but not as a simple 1-to-1 correspondence.
Many descriptions, one meaning
Reformulation is not only isomorphism. It is the broader habit of moving along meaning-preserving edges whenever possible, and along sound abstraction edges when the goal is a property rather than exact reconstruction.
Speculative reformulations still belong in the picture
Some proposals, like “it from bit” or “digital physics”, suggest that the most fundamental description of reality is informational or computational. These are best viewed as hypotheses about a future reformulation edge.
The standard is the same: make the translation explicit, and show that it preserves the predictions we already trust.
A speculative edge in reformulation space
Four examples of leverage-by-reformulation
This is the move, repeated with different clothing.
Running example for first read
If this section starts to feel like too many examples at once, keep only the toggle puzzle in view. That one example already contains the full pattern: original problem, meaning-preserving translation, stronger toolchain, mapped-back answer.
- Toggle puzzle to linear algebra: Represent board states as vectors in $\{0,1\}^n$ and moves as additions mod 2. Then “is there a sequence of moves that reaches all-off?” becomes “does a linear system $A x = b$ over $\mathbb{F}_2$ have a solution?” The solver is elimination, not branching search.
- Circuit equivalence to SAT: Two circuits are equivalent exactly when there is no input that makes their outputs differ. Build a formula for $C_1(x) \ne C_2(x)$ and ask a SAT solver for a satisfying assignment. A model is a concrete counterexample input. UNSAT means no counterexample exists.
- Deadlock to cycle detection: Build a wait-for graph with an edge $A \to B$ meaning “A is waiting for B”. A deadlock corresponds to a directed cycle. The solver is graph theory.
- Program synthesis to logic learning by relational decomposition: Many synthesis systems treat each example as a whole pair: an input structure and an output structure. A different representation is to decompose the structures into relational facts, like `in(index, value)` and `out(index, value)`. Then learning becomes: find a logic program that entails the output facts from the input facts. Hocquette and Cropper show that this reformulation can make hard synthesis problems much easier to solve using an off-the-shelf inductive logic programming system.
Relational decomposition: one example becomes many facts
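Of the four examples above, the deadlock one is the quickest to make executable. The sketch below is a standard colored depth-first search for a directed cycle; the wait-for graphs at the bottom are made-up examples, not from the text.

```python
# Deadlock detection, reformulated: build a wait-for graph and look for a
# directed cycle.

def has_cycle(graph: dict) -> bool:
    """Detect a directed cycle with a colored DFS (white/gray/black)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def dfs(node) -> bool:
        color[node] = GRAY  # gray = on the current DFS path
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:  # back edge: cycle found
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK  # fully explored, no cycle through here
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

# A waits for B, B waits for C, C waits for A: classic deadlock.
deadlocked = {"A": ["B"], "B": ["C"], "C": ["A"]}
# A waits for B, B waits for C: just a queue, no deadlock.
waiting = {"A": ["B"], "B": ["C"], "C": []}

print(has_cycle(deadlocked))  # True
print(has_cycle(waiting))     # False
```

Nothing about deadlocks survives into the algorithm itself: once the wait-for graph is built, the solver is plain graph theory, which is the whole point of the reformulation.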
These are not metaphors. They are meaning-preserving translations. The only real obligation is the one that matters: show that the translation preserves the question being asked.
For the running toggle puzzle, the question started as “which buttons should be pressed?” and became “does the linear system have a solution?” The mapped-back answer is still a button pattern. That full loop is the pattern the rest of the tutorial keeps unpacking.
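That full loop can be sketched end to end. This is a minimal implementation of Gaussian elimination over $\mathbb{F}_2$; the 3-light board and its "self plus neighbors" move matrix are illustrative assumptions.

```python
# Board language in, linear algebra in the middle, button presses back out.

def solve_gf2(A, b):
    """Solve A x = b over GF(2) by Gaussian elimination.
    A is a list of 0/1 rows; returns a solution x, or None if inconsistent."""
    rows, cols = len(A), len(A[0])
    # Work on an augmented copy so the inputs stay intact.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    pivot_col_of_row = []
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c]), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(rows):
            if i != r and M[i][c]:
                # Row addition mod 2 is XOR: a meaning-preserving rewrite.
                M[i] = [a ^ p for a, p in zip(M[i], M[r])]
        pivot_col_of_row.append(c)
        r += 1
    # A row 0 ... 0 | 1 is a witness of impossibility.
    if any(not any(row[:-1]) and row[-1] for row in M):
        return None
    x = [0] * cols
    for i, c in enumerate(pivot_col_of_row):
        x[c] = M[i][-1]
    return x

# Column j of A is the flip pattern of button j (here: self + neighbors).
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
b = [1, 0, 1]        # lights currently on
x = solve_gf2(A, b)  # the mapped-back answer: which buttons to press
```

For this board, `x` comes back as `[1, 0, 1]`: press buttons 0 and 2, in any order. The answer is still a button pattern, which is what "mapped-back" means.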
Interlude: two working examples (Tao and Feynman)
Two of the cleanest real-world examples of abstraction and reformulation come from modern mathematics and modern physics.
Why this interlude is here
The toggle puzzle is the small running example. Tao and Feynman are the same move at expert scale: keep meaning fixed, redraw the object so a better tool can see it.
Terence Tao: split one hard object into two easier ones
A recurring method in additive combinatorics and analytic number theory is the “structure vs randomness” split:
- take a complicated object (often a function or a set),
- decompose it into a structured part and a pseudorandom part,
- use the right tool on each part,
- then combine the results.
The structured part is chosen so that it can be described compactly and manipulated algebraically. The pseudorandom part is chosen so that, for the patterns being counted, it behaves like noise (it has small correlation with the relevant test functions).
This is abstraction as compression with a target property in view. The split forgets detail differently depending on the question, but it keeps exactly what makes the proof go through.
Tao style decomposition: structure plus randomness
For an entry point written as a guide to this mindset, see Tao’s book Structure and Randomness.
Richard Feynman: replace a formula jungle with a diagrammatic calculus
In quantum field theory, the objects that appear in calculations can become enormous symbolic expressions.
Feynman’s diagram technique is a reformulation move:
- represent an interaction process as a graph with typed parts,
- translate the graph into a mathematical term by fixed rules,
- sum the contributions.
The diagram is not an illustration added after the fact. It is a compressed representation of a term in a perturbative expansion, designed so the bookkeeping is visible and repeatable.
Feynman diagrams as a compressed representation
Part III: abstraction as compression (and why compression feels like understanding)
We will keep the toggle-puzzle example in the background here. The board has not changed. Only the representation, and therefore the available operations, have changed.
Running example checkpoint
Keep one concrete object in view: a board state, a move matrix, and the equation $A x = b$ over $\mathbb{F}_2$.
In board language, the puzzle is “which buttons should be pressed?” In linear language, the puzzle is “which vector $x$ satisfies the equation?”
Part III keeps translating that same object into smaller, sharper, or more structured forms.
The Abstraction Cheat Sheet has a strong rule of thumb:
Abstraction means: forget some implementation detail, while preserving the structure needed to reason about a property.
This is not vagueness. It is a controlled blur. It is a camera with a focus ring: some details are deliberately out of focus so the relevant shape is sharp.
One clean way to write the idea is as a map:
- concrete states: $C$
- abstract states: $A$
- abstraction: $\alpha : C \to A$
Typically, $\alpha$ is many-to-one. That is the point. Many concrete states collapse to one abstract state.
For the toggle puzzle, one concrete board might be stored as a row of little lights, while an abstract view stores only the bitvector or only the syndrome relevant to solvability in the chosen model. The point is not that one picture is prettier. The point is that one picture makes the next useful operation cheaper.
Compression is a closely related lens. A compression scheme has:
- an encoder $enc : X \to Z$
- a decoder $dec : Z \to X$
- and a promise that $dec(enc(x)) = x$ for the data of interest
If that promise holds on a set $X$, then the encoding is lossless on $X$. The original can be recovered exactly.
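To make the promise $dec(enc(x)) = x$ concrete, here is a minimal lossless scheme. Run-length encoding is chosen only because its invertibility is easy to see; the pair-list representation is an illustrative choice.

```python
# A minimal encoder/decoder pair with the lossless promise checked.

def enc(s: str):
    """Encode a string as (char, run_length) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def dec(pairs) -> str:
    """Decode by expanding each run back out."""
    return "".join(ch * n for ch, n in pairs)

x = "aaabbbbcc"
z = enc(x)               # [('a', 3), ('b', 4), ('c', 2)]
assert dec(enc(x)) == x  # the lossless promise, checked on this input
```

Abstraction, by contrast, is what remains when the decoder is deliberately thrown away: `enc` keeps only what the task needs, and no `dec` is promised.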
Abstraction is usually not like that. It does not promise invertibility. It is not lossless.
But abstraction and compression rhyme:
- Both create a smaller object that stands in for a larger one.
- Both succeed only when the smaller object preserves the structure needed for the task.
- Both fail when they merge distinctions that matter.
For the running puzzle, that means:
- the concrete view is a branching search over move sequences,
- the compressed view is a system of equations over $\mathbb{F}_2$,
- and the whole move is justified only because solving the equations still answers the original board question.
Shrink behavior, compress representation, abstract by merging
Why compression looks like intelligence (without turning it into a slogan)
When a mind learns a pattern, one useful way to describe what happened is:
- many experiences got mapped into a smaller internal representation,
- that representation kept the regularities that matter,
- and the result can now be used to predict, plan, or explain.
This is why “having a model” and “being able to compress” feel close. A child does not memorize every chair. They form a concept that throws away detail but preserves what matters for a goal like sitting.
But it is important not to oversell it:
- Compression is not the only ingredient of intelligence.
- There are many incompatible notions of “best compression”.
- Some tasks are hard even with the best representation.
So the safe claim is modest: good abstractions are compressions that preserve the right structure, and finding them is a major source of cognitive and mathematical leverage.
Decomposition (why good interfaces shrink both learning and search)
A lot of technical problem solving has the same outer shape:
- there is a space of candidates, call it $H$,
- there is a notion of “success” (fit the data, satisfy the spec, meet the invariant),
- and there is a process that eliminates candidates by finding refuters.
This is true in multiple domains:
- Learning: $H$ is a hypothesis class. Refuters are misclassified examples. The goal is low error.
- Program synthesis: $H$ is a class of programs. Refuters are failing tests or counterexamples from a verifier. The goal is to satisfy a specification.
- Verification and invariant discovery: $H$ is a space of invariants. Refuters are counterexample traces. The goal is an inductive invariant.
- Debugging: $H$ is a space of patches. Refuters are reproducers, failing tests, and traces. The goal is to remove the failure without breaking other behavior.
The running puzzle fits this template too:
- before reformulation, $H$ can be seen as candidate button-press sequences,
- after reformulation, $H$ can be seen as candidate solution vectors $x$ in $A x = b$,
- the refuter is anything that shows the proposed sequence or vector does not actually solve the board.
In all of these settings, “decomposition” is a representation choice that restricts $H$ to candidates built from smaller parts with explicit interfaces.
The same move appears in the running example. A large “try every press sequence” search is replaced by smaller local obligations: how each button affects the board, how those effects compose, and how elimination combines them. The puzzle did not become easier because the universe was kind. It became easier because the representation exposed reusable local structure.
Decomposition shrinks the hypothesis space
VC dimension and shattering (what the definition checks)
A binary hypothesis class is a set $H$ of functions $h : X \to \{0,1\}$. Think of each $h$ as one possible rule for labeling inputs as false or true.
Let $S = \{x_1, \dots, x_m\}$ be a finite set of inputs. The restriction of $H$ to $S$ is the set of labelings that $H$ can realize on those points:
\[H\!\upharpoonright_S := \{(h(x_1), \dots, h(x_m)) : h \in H\} \subseteq \{0,1\}^m\]
Now two definitions:
- $H$ shatters $S$ if $H\!\upharpoonright_S = \{0,1\}^m$, meaning that every one of the $2^m$ labelings occurs for some $h \in H$.
- The VC dimension $\mathrm{VC}(H)$ is the largest $m$ such that some set $S$ of size $m$ is shattered by $H$.
Shattering as an “all labelings” test
There are two complementary ways to talk about how “large” $H$ is.
- If $H$ is finite: size is literally $\lvert H \rvert$, and bounds often depend on $\log \lvert H \rvert$.
- If $H$ is infinite: size is measured by a capacity notion such as $\mathrm{VC}(H)$.
These are not competing religions. For a finite class, they are linked:
\[\mathrm{VC}(H) \le \log_2 \lvert H \rvert\]
Reason: if $H$ shatters $m$ points, then it must realize all $2^m$ labelings on those points, so $\lvert H \rvert \ge 2^m$ and therefore $m \le \log_2 \lvert H \rvert$.
Two tiny examples anchor the intuition:
- Thresholds on the real line ($h_a(x) = 1[x \ge a]$) have $\mathrm{VC}=1$. One point can be labeled either way, but two ordered points cannot realize the labeling (1,0).
- Intervals on the real line ($h_{a,b}(x) = 1[a \le x \le b]$) have $\mathrm{VC}=2$. Two points can be labeled in all four ways, but three points cannot realize the labeling (1,0,1).
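Both tiny examples can be checked by brute force. The sketch below enumerates thresholds and intervals with parameters drawn from a small grid (an assumption that is harmless here, since the witness points sit between grid values) and computes which labelings each class realizes.

```python
# Brute-force check of VC(thresholds) = 1 and VC(intervals) = 2
# on small point sets, with parameters from a finite grid.

from itertools import product

GRID = [0.5 * k for k in range(-2, 13)]  # candidate parameters: -1.0 .. 6.0

def labelings(hyps, points):
    """All labelings the class realizes on the given points."""
    return {tuple(h(x) for x in points) for h in hyps}

# h_a(x) = 1[x >= a]
thresholds = [lambda x, a=a: int(x >= a) for a in GRID]
# h_{a,b}(x) = 1[a <= x <= b]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a, b in product(GRID, GRID) if a <= b]

# Thresholds shatter one point but not two: (1, 0) on x1 < x2 is impossible.
assert labelings(thresholds, [1.0]) == {(0,), (1,)}
assert (1, 0) not in labelings(thresholds, [1.0, 2.0])

# Intervals shatter two points but not three: (1, 0, 1) is impossible.
assert len(labelings(intervals, [1.0, 2.0])) == 4
assert (1, 0, 1) not in labelings(intervals, [1.0, 2.0, 3.0])
```

The failing labelings, (1, 0) and (1, 0, 1), are exactly the refuters named in the text: one missing labeling is enough to show the set is not shattered.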
Fat-shattering (VC, but for real-valued outputs and margins)
Many hypothesis classes output real numbers, not bits. In that setting, the right analog of shattering uses a margin.
Let $F$ be a class of functions $f : X \to \mathbb{R}$ and fix a scale $\gamma > 0$. A finite set $S$ is $\gamma$-fat-shattered by $F$ if there exists a choice of thresholds $(r_x)_{x \in S}$ such that for every labeling $\sigma : S \to \{-1,+1\}$, there is some $f \in F$ with:
\[\forall x \in S.\; \sigma(x)\bigl(f(x) - r_x\bigr) \ge \gamma\]
Intuition: each point $x$ comes with a “bar” $r_x$, and the class must be able to push the output at $x$ at least $\gamma$ above the bar (if $\sigma(x)=+1$) or at least $\gamma$ below it (if $\sigma(x)=-1$), simultaneously for every assignment of signs.
The fat-shattering dimension at scale $\gamma$, written $\mathrm{fat}_\gamma(F)$, is the largest size of a $\gamma$-fat-shattered set. As $\gamma$ shrinks, fat-shattering usually grows, since it becomes easier to separate points by a smaller margin.
Why this matters for learning (Blumer style, in one line)
In a realizable setting where some hypothesis in $H$ is perfectly consistent with the data, a simple Occam bound says that the number of examples needed to rule out most wrong hypotheses scales like:
\[m \gtrsim \frac{1}{\varepsilon}\bigl(\ln \lvert H \rvert + \ln(1/\delta)\bigr)\]
The precise constants depend on assumptions. The shape is the point: smaller candidate sets need fewer refuters.
For infinite classes, VC bounds replace $\ln \lvert H \rvert$ with a term involving $\mathrm{VC}(H)$ (roughly, sample complexity grows with VC dimension, up to additional logarithmic factors).
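Plugging numbers into the Occam-style shape makes the scaling visible. The constants below are schematic (the bound itself is only stated up to constants), so only the comparison between the two class sizes is meaningful.

```python
# Schematic sample counts from m ~ (1/eps) * (ln|H| + ln(1/delta)).

from math import log, ceil

def occam_samples(size_H: int, eps: float, delta: float) -> int:
    """Schematic sample count; constants are not the point."""
    return ceil((log(size_H) + log(1 / delta)) / eps)

# A class with 20 independent binary choices vs one with 10.
big = occam_samples(2**20, eps=0.1, delta=0.05)    # 169
small = occam_samples(2**10, eps=0.1, delta=0.05)  # 100
```

Shrinking the class from $2^{20}$ to $2^{10}$ candidates, a factor of about a thousand, saves only a modest number of examples: the dependence on $\lvert H \rvert$ is logarithmic, which is why decomposition (which subtracts from $\ln \lvert H \rvert$) is the high-leverage move.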
What decomposition does to these measures
Decomposition helps only when it genuinely restricts the candidates.
One clean case is a product form. If a candidate is determined by independent choices from $H_1$ and $H_2$, then:
\[\lvert H \rvert = \lvert H_1 \rvert \cdot \lvert H_2 \rvert \quad\text{so}\quad \ln \lvert H \rvert = \ln \lvert H_1 \rvert + \ln \lvert H_2 \rvert\]
This turns a monolithic search into a sum of smaller searches.
There is also a dual move, common in “mixtures” and routing: $H$ is a union of families $H = \bigcup_{i=1}^k H_i$. That can increase capacity because there are more options. Intuitively, selecting among $k$ modules costs about $\log k$ bits of freedom.
Assumption hygiene: decomposition is a claim about coupling
Decomposition does not automatically make a problem easy. It helps when most dependencies are local, and the interface between parts is narrow enough that refuters can be localized. If "everything depends on everything", the effective hypothesis space stays huge even if the code is split into files.
Why this matters for search tactics (epiplexity in plain language)
Statistical capacity is only half the story. Even if a good candidate exists in a small class, it can still be computationally hard to find.
A useful modern way to say the missing half is:
- Some structure exists “in principle”, but it is not accessible under a realistic compute budget.
- A good reformulation or decomposition can make the same structure cheap enough to exploit.
This is the intuition behind “information for bounded agents” proposals such as epiplexity. One way to summarize the idea is: separate the part of a dataset that can be turned into a shorter program by a bounded learner (structure) from the part that remains unpredictable under the same budget (time-bounded entropy). The moral is stable: compute budgets change what counts as usable structure.
Tree search versus propagation on a decomposed graph
This is a deep reason tactics work. A tactic is often a move that:
- reduces degrees of freedom (shrinks $H$),
- increases locality (makes refuters informative),
- and converts branching into propagation.
Canonicalizers and quotients (remove redundant degrees of freedom)
Decomposition is not the only way to shrink a search. Another common move is to notice that many candidates are “the same thing” in disguise.
This starts with an equivalence relation $\sim$ that captures “same meaning” for the purpose at hand. For example:
- In lambda calculus, many terms differ only by renaming bound variables (alpha-equivalence).
- In a state space, many states differ only by a symmetry (renaming IDs, permuting identical components).
Toy example: suppose two identical servers hold one job total. The states (server_a=1, server_b=0) and (server_a=0, server_b=1) are different raw states if the labels matter, but the same state for a load-balancing question where the servers are interchangeable. Searching both gives twice the paperwork and zero new information.
A canonicalizer is a function $\mathrm{can} : X \to X$ that picks a representative for each equivalence class. One clean specification is:
\[\forall x.\; \mathrm{can}(x) \sim x \quad\land\quad \forall x,y.\; \bigl(\mathrm{can}(x) = \mathrm{can}(y)\bigr) \leftrightarrow (x \sim y)\]
The first clause says the representative preserves meaning. The second clause says representatives are unique per class. It implies an idempotence property, $\mathrm{can}(\mathrm{can}(x)) = \mathrm{can}(x)$.
Canonicalization as symmetry breaking
This move is best seen as a quotient. Instead of searching $X$ directly, search $X/{\sim}$, the set of equivalence classes. In practice, one searches only canonical representatives, which makes $X/{\sim}$ concrete.
That is the beginner test for quotienting: if two states differ only by a renaming that the property cannot observe, keep one and stop paying rent on the other.
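For the two-server toy, the canonicalizer is one line, and both clauses of the specification can be checked directly. The sorted-tuple representative is an illustrative choice; any deterministic pick-one-per-class rule would do.

```python
# A canonicalizer for interchangeable servers: states equal up to permuting
# the labels collapse to one representative.

def can(state: tuple) -> tuple:
    """Pick the sorted tuple as the canonical representative."""
    return tuple(sorted(state))

s1 = (1, 0)  # server_a holds the job
s2 = (0, 1)  # server_b holds the job

assert can(s1) == can(s2)       # same equivalence class, same representative
assert can(can(s1)) == can(s1)  # idempotence, as the spec implies

# Quotient search in practice: keep only canonical states.
raw_states = [(1, 0), (0, 1), (2, 0), (0, 2), (1, 1)]
quotient = {can(s) for s in raw_states}  # 3 classes instead of 5 raw states
assert len(quotient) == 3
```

The last assertion is the rent savings in miniature: five raw states, three classes, and the search only ever pays for the classes.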
Canonicalizers can reduce both:
- effective capacity, when $H$ contains many redundant descriptions of the same behavior (shrinking $\lvert H \rvert$ without removing semantic variety),
- effective epiplexity, when a compute-bounded search would otherwise revisit symmetric cases again and again.
Assumption hygiene: canonicalization is not always available or cheap
Some equivalence relations are expensive to decide, and some canonical forms are expensive to compute. In those settings, symmetry breaking is still valuable, but it may need weaker, cheaper canonicalizers or partial symmetry breaking.
A reformulation generator (a sketch of algorithm discovery)
This section connects VC-style “hypothesis classes” to automatic algorithm discovery. The link is simple: many discovery loops are search over a candidate class, guided by refuters.
If the candidates are programs, invariants, or proof scripts, the search state is often a partial object. If the candidates are reformulations, the search state is a representation plus a library of tools that representation unlocks.
A reformulation generator is a policy for proposing meaning-preserving (or soundly abstracting) translation steps in reformulation space. It does not need to be perfect. It needs to be productive under a compute budget, and it must be kept honest by gates.
One safe architecture has two loops:
- Proposal loop (heuristic): propose a reformulation or decomposition move that might make progress.
- Gate loop (fail-closed): check that the move preserves the intended meaning or property, then run a mature solver in the new representation and demand a certificate or a refuter.
Written as a minimal recipe:
- State: a representation $R$, a spec $\varphi$, and a set of accumulated refuters $E$ (counterexamples, failing traces, unsat cores, conflict clauses).
- Moves: a library of translation operators $\tau : R \to R'$ (rewrites, reductions, decompositions, symmetry breaking, relaxations).
- Gate: a checker that rejects moves that do not preserve the intended semantics (exact equivalence) or intended property (sound abstraction).
- Progress signal: solved-with-certificate, or refuted-with-witness.
The proposal loop can be hand-written (a portfolio of tactics) or learned (a model that predicts which $\tau$ is worth trying next). Either way, the safety claim comes only from the gate.
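The two-loop shape can be written as a skeleton. Everything here is illustrative: the function names, the deliberately dumb proposer (it just tries moves in order), and the toy sorting instance at the bottom. The one load-bearing feature is where the trust lives: in the gate and the certificate check, never in the proposer.

```python
# Propose-then-gate, as a fail-closed skeleton.

def gated_solve(problem, tactics, check_preserves, solve, check_certificate):
    """Try reformulation moves; accept only gated, certificate-checked wins."""
    for tau in tactics:                  # proposal loop (heuristic)
        new_repr = tau(problem)
        if not check_preserves(problem, new_repr):
            continue                     # gate: reject unsound moves
        result = solve(new_repr)
        if result is not None and check_certificate(problem, result):
            return result                # solved, with checked evidence
    return None                          # fail closed: no unchecked claims

# Toy instance: "sort this list", with the identity reformulation as the
# only tactic and a certificate check that re-verifies the answer.
answer = gated_solve(
    problem=[3, 1, 2],
    tactics=[lambda p: list(p)],
    check_preserves=lambda old, new: sorted(old) == sorted(new),
    solve=sorted,
    check_certificate=lambda p, r: r == sorted(p),
)
```

A learned proposer would replace the `tactics` list with a model's suggestions; nothing else in the skeleton would need to change, which is the sense in which the safety claim comes only from the gate.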
Reformulation as a conveyor, not a leap of faith
One complete toy pass looks like this:
- Start with the toggle puzzle in board language.
- Translate it to a linear system over $\mathbb{F}_2$.
- Run elimination.
- Get either a press vector (witness of solvability) or inconsistency (witness of impossibility in that model).
- Map the press vector back to concrete button presses on the original board.
This is why reformulation feels so powerful. The solver answer comes back carrying structure, not just a yes/no mood.
Assumption hygiene: “automatic solver” is always a scoped claim
A reformulation generator is not a magic general problem solver. For expressive enough languages, there are hard limits: some properties are undecidable, and many are computationally intractable in the worst case. The realistic goal is narrower: for a problem family, learn or curate a set of meaning-preserving moves that often land in a representation where mature tools work.
Part IV: symbol manipulation (why rewriting can beat guessing)
From the outside, symbolic reasoning can look like moving marks around.
From the inside, it is search with rails.
Keep the toggle puzzle in view here. Once the board became algebra, a solver no longer had to “feel around” in move space. It could rewrite, eliminate, and propagate.
Keep the toggle puzzle in view for one more section. Row operations in elimination are symbolic rewrites too. They look like bookkeeping on paper, but they preserve meaning while making the hidden structure easier to see.
Three examples:
- Algebraic simplification: A rewrite like $(x + y)^2 = x^2 + 2xy + y^2$ is not a guess. It is a meaning-preserving transformation that changes what tools can see. In the running example, adding one row of the system to another over $\mathbb{F}_2$ is the same kind of move. The equation changes shape, the solvability question does not.
- SAT and SMT solving: A SAT solver does not enumerate all $2^n$ assignments. It propagates constraints, learns from conflicts, and prunes vast regions of the search space because the syntax has structure.
- Unification and type inference: Type inference solves systems of equations between types by unification. The “symbols moving around” are the constraints being made consistent.
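The unification move can be sketched in a page. The term encoding below is a local assumption (strings starting with `?` are type variables, tuples like `("arrow", t1, t2)` are function types), and the sketch omits the occurs check that a production unifier needs.

```python
# A small unification sketch, the engine behind Hindley-Milner style
# type inference.

def walk(t, subst):
    """Follow the substitution chain for a variable."""
    while isinstance(t, str) and t.startswith("?") and t in subst:
        t = subst[t]
    return t

def unify(t1, t2, subst=None):
    """Return a substitution making t1 and t2 equal, or None on clash."""
    subst = dict(subst or {})
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, str) and t1.startswith("?"):
        subst[t1] = t2            # bind the variable
        return subst
    if isinstance(t2, str) and t2.startswith("?"):
        subst[t2] = t1
        return subst
    if isinstance(t1, tuple) and isinstance(t2, tuple) and len(t1) == len(t2):
        for a, b in zip(t1, t2):  # unify component-wise
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                   # clash, e.g. int vs bool

# If f : ?a -> ?b must match int -> bool, the constraints force the answer.
s = unify(("arrow", "?a", "?b"), ("arrow", "int", "bool"))
# s == {"?a": "int", "?b": "bool"}
```

Each recursive call is "symbols moving around" in exactly the text's sense: a constraint between two terms is rewritten into constraints between their parts until everything is consistent or a clash refutes the whole system.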
Symbol manipulation becomes powerful when:
- the transformations are guaranteed to preserve meaning, and
- the representation exposes the right local structure for pruning and composition.
This is reformulation again, but inside a language rather than between languages.
At the bottom, this is not mystical. It is what computation is.
A Turing machine is symbol manipulation: it reads a symbol, writes a symbol, moves left or right, and changes an internal state. That is controlled rewriting. Everything above it is a more ergonomic language for the same core act.
Part V: why logic is powerful, and why it cannot be the whole story
Logic is the mathematics of “must”.
It is also a reading discipline. Small symbols carry fixed meanings, so a short formula can say exactly what a long paragraph tries to gesture at.
This is also what made the running puzzle easier after reformulation. Once the board became equations, the question stopped being “what move feels promising?” and became “what must be true if this board is solvable?”
This is the one deliberate example switch in the tutorial. Up to here, the running object was the toggle puzzle. For the logic section, the turnstile is a better teaching object because implications, invariants, and time steps are easier to read in state-transition form than in matrix form.
Logic legend (how to read the symbols)
| Symbol | Read as | Meaning (informal) |
|---|---|---|
| ∀ | for all | universal claim |
| ∃ | there exists | existential claim |
| . | such that | separator between a binder (∀x, ∃x) and its body |
| ∧ | and | both must hold |
| ∨ | or | at least one holds |
| ¬ | not | negation |
| → | implies | if left holds, right must hold |
| ↔ | if and only if | both directions of implication |
| ∈ | is in | membership, as in x ∈ S |
| * | times | usually multiplication (not a standard logical connective; for logical “and”, use ∧) |
| := | defined as | a definition, not a claim to prove |
| ⊢ | derives | there is a valid proof from premises to conclusion |
| ∴ | therefore | the conclusion follows |
Many authors write ∀x. P(x), ∀x, P(x), or ∀x (P(x)). These are the same idea: “for all x, P(x)”.
Similarly, ∃x. P(x) means “there exists an x such that P(x)”.
In this tutorial, a symbol like P is usually a predicate (a property or relation), so P(x) is a statement that can be true or false.
It does not mean “a function of P”.
A helpful contrast is f(x), where f is a function symbol and f(x) names a value.
Predicates can be viewed as boolean-valued functions on the domain, but they are used as formulas (claims), not as values.
A common shorthand is ∀x ∈ S. P(x), read “for all x in S, P(x)”. Formally, it expands to
∀x. (x ∈ S → P(x)). Likewise, ∃x ∈ S. P(x) expands to ∃x. (x ∈ S ∧ P(x)).
Parentheses indicate what binds to what, the same way they do in algebra.
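Over a finite set, the bounded quantifiers and their expansions have a direct executable reading, sketched here with Python's built-in `all` and `any` (the set `S` and predicate are made-up examples):

```python
# forall x in S. P(x)  ~  all(P(x) for x in S)
# exists x in S. P(x)  ~  any(P(x) for x in S)

S = {2, 4, 6, 8}
P = lambda x: x % 2 == 0  # "x is even"

assert all(P(x) for x in S)   # ∀x ∈ S. P(x)
assert any(x > 6 for x in S)  # ∃x ∈ S. x > 6

# The formal expansions, checked over a larger ambient domain:
domain = range(10)
# ∀x. (x ∈ S → P(x)): implication A → B rendered as (not A) or B
assert all((not (x in S)) or P(x) for x in domain)
# ∃x. (x ∈ S ∧ x > 6)
assert any((x in S) and x > 6 for x in domain)
```

The last two assertions are the expansions from the text run literally: the bounded ∀ hides an implication, and the bounded ∃ hides a conjunction.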
Operator precedence and scope (a PEMDAS-like convention)
Logic does not have a single universal precedence standard, but most texts follow a "tightest to loosest" convention close to:
- Parentheses: (...)
- Negation: ¬P
- Conjunction: P ∧ Q
- Disjunction: P ∨ Q
- Implication: P → Q
- Biconditional: P ↔ Q
Quantifiers (∀x, ∃x) are not "infix operators" like ∧. They introduce a bound variable, and their scope is the formula they govern.
This tutorial uses ∀x. ... and ∃x. ... (dot as a scope marker) to keep that scope explicit.
Examples under the convention above:
- ¬P ∧ Q parses as (¬P) ∧ Q.
- P ∧ Q → R parses as (P ∧ Q) → R.
- P → Q ∧ R parses as P → (Q ∧ R).
For chained operators, conventions vary, but ∧ and ∨ are usually treated as associative, and → is often treated as right-associative.
In ambiguous cases, parentheses are the only universally correct disambiguation.
A paragraph of formulas can out-precise a page of English
Take the turnstile rules from Tutorial 3 and say them in a tiny logic language.
Let:
- $\mathrm{state}_t \in \{0,1\}$ where 0 means Locked and 1 means Unlocked
- $\mathrm{coin}_t \in {0,1}$ and $\mathrm{push}_t \in {0,1}$
- $\mathrm{alarm}_t \in {0,1}$
Now the rules fit in a few lines:
\[\forall t.\, \mathrm{state}_t \in \{0,1\}\]
\[\forall t.\, \mathrm{push}_t = 1 \to \mathrm{state}_{t+1} = 0\]
\[\forall t.\, (\mathrm{state}_t = 0 \land \mathrm{coin}_t = 1 \land \mathrm{push}_t = 0) \to \mathrm{state}_{t+1} = 1\]
\[\forall t.\, \mathrm{alarm}_t = 1 \leftrightarrow (\mathrm{state}_t = 0 \land \mathrm{push}_t = 1)\]
This looks dense at first, but notice what it buys:
- no ambiguity about time (it says exactly which step looks at which state),
- no ambiguity about cases (each implication has an explicit condition),
- and a path to automation (a solver can search for a counterexample trace).
English can say the same thing, but it tends to hide these commitments in phrasing. Logic makes the commitments explicit.
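"A solver can search for a counterexample trace" can itself be demonstrated by brute force. This sketch encodes the turnstile rules above as a transition function, then enumerates all short input traces and checks the alarm formula at every step; a deliberately broken rule set is refuted with a witness trace. The horizon of 4 steps is an arbitrary small choice.

```python
# Brute-force counterexample search over bounded turnstile traces.

from itertools import product

def step(state, coin, push):
    """Transition rules: push always locks; coin (alone) unlocks."""
    if push == 1:
        return 0
    if state == 0 and coin == 1 and push == 0:
        return 1
    return state

def alarm(state, coin, push):
    """Alarm fires exactly on a push while Locked."""
    return int(state == 0 and push == 1)

def find_counterexample(alarm_fn, horizon=4):
    """Return a trace where alarm_fn violates the spec, or None."""
    inputs_alphabet = [(c, p) for c in (0, 1) for p in (0, 1)]
    for inputs in product(inputs_alphabet, repeat=horizon):
        state = 0  # start Locked
        for t, (coin, push) in enumerate(inputs):
            spec = int(state == 0 and push == 1)
            if alarm_fn(state, coin, push) != spec:
                return inputs[: t + 1]  # a concrete witness trace
            state = step(state, coin, push)
    return None

assert find_counterexample(alarm) is None  # rules match the spec
bad = lambda s, c, p: int(p == 1)          # wrongly alarms on every push
assert find_counterexample(bad) is not None  # refuted with a witness
```

The refuter for `bad` is not a yes/no mood: it is a concrete input sequence that unlocks the turnstile and then pushes, exactly the kind of structured answer the logic reformulation promised.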
Modus ponens (one step of deduction)
Reasoning is not a vibe. It is rule-governed movement from premises to conclusions.
The most famous rule is modus ponens:
\[\text{from } A \text{ and } (A \to B) \text{ infer } B\]
In turnstile form, suppose:
\[\mathrm{state}_7 = 0 \land \mathrm{push}_7 = 1\]
and we have the rule:
\[(\mathrm{state}_7 = 0 \land \mathrm{push}_7 = 1) \to \mathrm{alarm}_7 = 1\]
Then we can conclude:
\[\therefore \mathrm{alarm}_7 = 1\]
This is what “deduction” means in its cleanest form: the conclusion is forced by the premises under the rules of the logic.
Induction (prove it for all time steps)
Many program claims have the shape “for all steps, an invariant holds”. Mathematical induction matches that shape.
To prove $\forall t.\, I(t)$ by induction, one proves:
- Base case: $I(0)$.
- Step case: $\forall t.\, (I(t) \to I(t+1))$.
Then $I(t)$ holds for every natural number $t$.
This is the skeleton behind many safety proofs: start with an initial condition, show every transition preserves the invariant, and conclude the invariant holds forever.
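For a finite-state system, both halves of the skeleton can be checked mechanically: the base case once, and the step case on every (state, input) pair. The turnstile transition rules are taken from the section above; the invariant chosen here, "state is always 0 or 1", is deliberately the simplest one available.

```python
# Induction, mechanized for a finite-state system.

def step(state, coin, push):
    """Turnstile transitions: push locks; coin (alone) unlocks."""
    if push == 1:
        return 0
    if state == 0 and coin == 1 and push == 0:
        return 1
    return state

def invariant(state):
    return state in (0, 1)

def holds_by_induction(init, inv, states, inputs):
    """Base case: inv(init). Step case: inv is preserved by every move."""
    base = inv(init)
    step_case = all(
        inv(step(s, c, p))
        for s in states if inv(s)  # assume I(t) holds ...
        for c, p in inputs
    )                              # ... and conclude I(t+1)
    return base and step_case

ok = holds_by_induction(
    init=0,  # start Locked
    inv=invariant,
    states=[0, 1],
    inputs=[(c, p) for c in (0, 1) for p in (0, 1)],
)
assert ok
```

The step case only quantifies over states already satisfying the invariant, which is exactly the shape $\forall t.\, (I(t) \to I(t+1))$; for infinite-state systems the same shape is handed to a solver instead of a loop.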
Logic's most practical move is to turn a universal claim into a refutable target:
\[\forall x.\, P(x)\]
becomes:
\[\text{find } x \text{ such that } \lnot P(x)\]
In pure logic notation, this is the same shape:
\[\lnot(\forall x.\, P(x)) \leftrightarrow \exists x.\, \lnot P(x)\]
If the solver finds such an $x$, that $x$ is a counterexample witness. If it proves none exists (in a decidable setting), the universal claim holds.
What “reasoning” is (and what language models simulate)
In this tutorial, “reasoning” means this:
- there is a language with semantics (the symbols mean something),
- there are rules for valid steps (proof rules, solver rules, algebraic rewrite rules),
- and a conclusion is trusted because it can be checked against those rules.
This gives three related words sharp meanings:
- Deduction: deriving a conclusion that is forced by premises. In symbols, $\Gamma \vdash \varphi$ means “from assumptions $\Gamma$, the conclusion $\varphi$ is derivable by valid rules”.
- Induction: a proof pattern for claims over time or natural numbers. It is still deduction, but it proves $\forall t.\, I(t)$ via a base case and a step case.
- Abduction: proposing a plausible hypothesis that would explain the observations. Abduction is useful, but it is not a proof.
The difference matters because a lot of human confusion comes from mixing them.
Language models can produce text that looks like reasoning, including proofs and derivations. Sometimes it is correct.
But the model is not forced to be correct by default. It is trained to produce plausible continuations of text. It often behaves like a strong abductive engine: it proposes hypotheses and fills in missing steps in a way that looks coherent.
That coherence can be real reasoning, or it can be pseudo reasoning: a sequence of steps that reads like deduction but does not actually follow the rules. The gap is not always obvious to a human reader, especially when the surface form is fluent.
This is why neuro-symbolic gates matter: treat the model as a proposer in a large space, and treat a checker as the mechanism that turns “looks right” into “is right, because it passed a refuter”.
Limits: undecidability and intractability
There are two clean, non-negotiable constraints on what logic can do for software:
- Undecidability: for sufficiently expressive languages (general programs), many interesting questions have no algorithm that always terminates with the right yes/no answer. The halting problem is the classic example.
- Intractability: even when a problem is decidable (SAT is the classic example), worst-case complexity can be enormous. Modern solvers win by exploiting structure, not by defeating the worst case. This is why representation choice matters so much. It is often the difference between “hopeless” and “instant”.
A deeper limit lives even in pure mathematics: in any sufficiently strong formal system, there are true statements that cannot be proven within that system (incompleteness). In software terms, this is a reminder that “formal” does not mean “omniscient”. It means “explicit about assumptions and rules”.
That is why formal methods is full of tradeoffs:
- restrict the language (finite-state models, bounded verification) to regain decidability,
- accept approximations (sound abstractions) to regain scalability,
- or accept incompleteness (find bugs reliably, but do not always prove absence of bugs).
Part VI: ultrafinitism (when “existence” means “feasible”)
Most working mathematics treats natural numbers as an unbounded infinity: for any $n$, there exists $n+1$.
Ultrafinitism is a family of views that pushes back on that comfort by treating feasibility as foundational. Strict finitism is one especially sharp version: it rejects not only actual infinity but also potential infinity in arithmetic.
The core intuition is feasibility:
- It is one thing to say “a number exists” in a formal system.
- It is another to say it can be constructed, represented, or used in any meaningful way.
One way to feel the motivation is physical: there are numbers that cannot be written down in the observable universe. Ultrafinitism treats that gap between “defined in a theory” and “usable in reality” as mathematically significant.
Historically, ultrafinitist ideas show up as a response to foundational debates about infinity, and as a response to the physical reality that reasoning is computation done by finite beings.
One way to keep the history grounded is to name a few signposts. This is not a complete timeline, and labels vary by author:
- 1958: Hao Wang surveys foundational work and explicitly centers concepts like proof and feasible number, proposing “anthropologism” as one feasibility-oriented domain.
- 1959 to 1961: Alexander Esenin-Volpin presents an “ultra-intuitionistic” program for foundations (often discussed as an ultrafinitist influence).
- 1982: Crispin Wright argues for strict finitism as a philosophical position in the philosophy of mathematics.
- Today: strict finitism continues as a live topic. One formulation posits a largest natural number, with variants that treat the successor of that number as undefined or as looping back to itself.
This matters for formal methods because it puts a spotlight on a practical truth:
- Proofs, models, and counterexamples are not only about truth. They are also about resources: time, memory, and proof length.
Even if one rejects ultrafinitism as a foundation, the engineering lesson remains: feasibility is part of reliability.
Part VII: the Lighthill report (a warning about demos without guarantees)
In July 1972, Sir James Lighthill completed a survey report on AI for the UK Science Research Council (often cited by its 1973 publication date).
One of the report’s themes was that many AI approaches ran into combinatorial explosion when scaled, and that demonstrations in small or toy settings did not necessarily carry to the general case. It is the same shape as the lights puzzle: a branching tree that looks manageable until it suddenly is not.
In June 1973, a BBC TV debate about the report took place at the Royal Institution. The report and the surrounding debate are often cited as part of the story of reduced UK funding for AI research in that era.
The reason it belongs in this tutorial is not nostalgia. It is a pattern:
- A system can look good in a demo.
- A claim about “always” needs a different kind of support.
Formal methods is what it looks like to take the second kind of claim seriously: state the property, then search for refuters, not applause.
Part VIII: neuro-symbolic programming (propose, then gate with evidence)
Here is a definition that is practical enough to build systems around:
Neuro-symbolic programming is a workflow where models propose ideas, hypotheses, candidates, or code, but final outputs are formally gated by evidence-based checks.
The model is a generator. The symbolic layer is a judge.
Neuro-symbolic gate (propose, check, refute)
This is a close cousin of CEGIS, counterexample-guided inductive synthesis, which uses the same template: propose a candidate, check it against the spec, and feed any counterexample back into the next round of proposals.
Why it is powerful
This architecture combines two strengths:
- Models are good at proposing candidates in huge, fuzzy spaces (programs, proofs, rewrites, plans).
- Formal methods are good at saying “no” precisely, with a witness.
Put as a slogan: learn broadly, then gate formally where correctness matters.
When a system is built so that “no” comes with a counterexample trace, iteration becomes cumulative.
Here is the concrete shape of the workflow:
- a model proposes a candidate (a patch, a rule, a table update, a plan),
- the gate checks it against explicit constraints,
- if it fails, the output is not “seems wrong”. It is a concrete witness, ideally a smallest one: an input that breaks a claim, or a trace that violates an invariant,
- the next proposal is conditioned on that witness.
The loop turns vague intent into explicit constraints. Each counterexample is information that can be carried forward.
A concrete example
Imagine an agent proposes an implementation of the turnstile from Tutorial 3. The spec says: if the previous state is Locked and the input is Push, then the alarm must be 1. The gate runs a checker.
If the implementation is wrong, the gate does not respond with a vibe. It returns a witness trace, like a short sequence of inputs (Coin, Push, Push) that leads to a violated invariant. That trace is new information the next proposal can directly address.
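A minimal gate for this property can be sketched directly. The candidate implementation, the state names, and the `gate` helper below are all hypothetical; the point is that a failing check returns a witness trace, not a verdict alone.

```python
# A sketch of a formal gate for the turnstile property: "if the previous state
# is Locked and the input is Push, the alarm must be 1". The implementation
# under test is a deliberately buggy made-up candidate.

from itertools import product

def buggy_turnstile(state, inp):
    """Candidate implementation: forgets to raise the alarm on Push-when-Locked."""
    if state == "Locked" and inp == "Coin":
        return "Unlocked", 0
    if state == "Unlocked" and inp == "Push":
        return "Locked", 0
    return state, 0  # bug: the alarm output is never set

def gate(impl, max_len=3):
    """Check the alarm property on all input traces up to max_len; return a witness or None."""
    for n in range(1, max_len + 1):
        for trace in product(["Coin", "Push"], repeat=n):
            state = "Locked"
            for inp in trace:
                prev = state
                state, alarm = impl(prev, inp)
                if prev == "Locked" and inp == "Push" and alarm != 1:
                    return trace  # witness: a concrete input sequence violating the spec
    return None

witness = gate(buggy_turnstile)
assert witness == ("Push",)  # the shortest refuting trace
```

The returned trace is exactly the kind of structured rejection the next proposal can condition on.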
The running puzzle has the same shape. A model could propose a candidate move vector $x$. The gate checks whether $Ax = b$ over $\mathbb{F}_2$. If not, the rejection is not “bad try”. It is the exact equation that failed, or a mapped-back board state showing the proposal does not solve the puzzle.
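That gate is a few lines of arithmetic mod 2. The matrix, right-hand side, and candidate vectors below are small made-up examples, not a specific puzzle instance from earlier.

```python
# A sketch of the gate for the lights puzzle in its linear-algebra form:
# check A x = b over F_2 and, on failure, report the exact equation that failed.

def gate_gf2(A, x, b):
    """Return None if A x = b (mod 2), else the index of the first violated equation."""
    for i, row in enumerate(A):
        lhs = sum(a * xi for a, xi in zip(row, x)) % 2
        if lhs != b[i]:
            return i  # witness: equation i is not satisfied by the proposal
    return None

A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]   # which buttons toggle which lights (hypothetical pattern)
b = [1, 0, 1]     # the initial on/off pattern to cancel

assert gate_gf2(A, [1, 0, 0], b) is None  # a correct move vector passes the gate
assert gate_gf2(A, [1, 1, 0], b) == 0     # a wrong proposal fails at equation 0
```

Mapping the failed equation index back to a board position turns an abstract rejection into a concrete light that is still on.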
What it does not solve
Formal gates do not remove the need for good specs.
If the specification is wrong, incomplete, or expressed in the wrong abstraction, the gate can approve the wrong thing. A gate is only as good as the property it checks.
So the deepest skill here is the same one that started the tutorial:
- reformulate until the meaning is clear, and
- choose the abstraction that matches the property being claimed.
References (entry points, not prerequisites)
- Abstraction Cheat Sheet: https://thedarklightx.github.io/Beyond-Code-Abstraction-Cheatsheet/
- Solar-Lezama, Introduction to Program Synthesis (Lecture 1): https://people.csail.mit.edu/asolar/SynthesisCourse/Lecture1.htm
- Blumer, Ehrenfeucht, Haussler, and Warmuth, Occam’s Razor (1987): https://doi.org/10.1016/0020-0190(87)90114-1
- Blumer, Ehrenfeucht, Haussler, and Warmuth, Learnability and the Vapnik-Chervonenkis dimension (1989): https://doi.org/10.1145/76359.76371
- Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms (2014): https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
- Finzi et al., From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence (2026): https://arxiv.org/abs/2601.03220
- Tao, Structure and Randomness: An Introduction to Probabilistic Combinatorics (2008): https://bookstore.ams.org/gsm-76
- Feynman, Space-Time Approach to Quantum Electrodynamics (1949): https://journals.aps.org/pr/abstract/10.1103/PhysRev.76.769
- Wang, Eighty years of foundational studies (1958): https://philpapers.org/rec/WANEYO
- Wright, Strict Finitism (1982): https://philpapers.org/rec/WRISF
- de Oliveira, Is strict finitism arbitrary? (2024): https://academic.oup.com/pq/advance-article/doi/10.1093/pq/pqae093/7730546
- Seisenberger and Sterkenburg, A strict finitistic logic (2023): https://link.springer.com/article/10.1007/s11229-022-03962-4
- Lighthill, Artificial Intelligence: A General Survey (1973): http://www.chilton-computing.org.uk/inf/literature/reports/lighthill_report/p001.htm
- McCarthy, Review of “Artificial Intelligence: A General Survey” (1973): http://www-formal.stanford.edu/jmc/reviews/lighthill/lighthill.html
- University of Edinburgh AIAI, Lighthill Report and debate (1972 to 1973): https://aiai.ed.ac.uk/project/lighthill/
- Agar, What is the AI Winter? (2020): https://www.cambridge.org/core/journals/bjhs-themes/article/what-is-the-ai-winter/F0D1E0F6D434D4A73D79DDDF5646ED46
- Hocquette and Cropper, Relational Decomposition for Program Synthesis (2024): https://arxiv.org/abs/2408.12212