Department of Mathematics

Indian Institute of Science

Bangalore

Tic-tac-toe

Othello

Deep Blue

Alpha Go

- Carrying out specific computations and control functions.
- Following precise instructions to carry out arbitrary computations, i.e., to be computationally universal.
- Learning from experience, by imitation and by experimentation to get better results.
- Computers are used today in mathematics as assistants, not collaborators.

- A game has:
- A
*state*at any time. - Rules for making moves.
- An outcome: win/loss, margin etc.

- A
- A strategy can be based on:
- A
*policy*function, giving (weighted) moves to consider. - A (relative)
*value*function, telling us how good a state is.

- A
- We consider certain sequences of moves, based on policy, and optimize value at the end of the sequence.
- In tic-tac-toe we can use simple policies and values.
- In Othello, we can refine the obvious value - number of black/white coins, by considering edges and corners.

- In Chess, a basic value function is obtained by counting pieces and pawns with weights.
- For openings, we also have strict policy functions - we only consider a small subset of possible move sequences.
- Deep Blue (and chess theory) greatly extend these.
- The value and policy functions of Kasparov were far better, but compensated for by Deep Blue being able to consider far more move sequences.
- “ Play the opening like a book, the middle game like a magician, and the endgame like a machine. ” – Spielmann
- A lesson from machines.

- In Go, the number of legal moves is much larger, causing the number of sequences for a weak policy to grow very fast.
- More importantly, it is very hard to describe a good value function.
- This makes it far harder for computers.
- Yet, in March 2016, a Go playing system AlphaGo defeated 18-time world champion Lee Sedol.

- The policy and value functions of AlphaGo are deep neural networks that were trained.
- The policy network was trained by learning to predict the next move from games of expert players.
- The value network was trained by AlphaGo playing against versions of itself and getting feedback from the outcome.
- AlphaGo considered fewer sequences of moves than Deep Blue.
- AlphaGo came up with unexpected moves.

- The state can be represented by a vector, on which we wish to find good functions.
- We first define a new vector in terms of the input vector as a composition of functions from vectors to vectors. Our desired functions are in terms of this new vector.
- We can view co-ordinates of vectors as forming layers, with each layer defined as a function of the previous layer.
- Each co-ordinate of a layer is a linear combination of (some of the) coordinates of the previous layer composed by a cut-off function.
- This can be trained by gradient flow, using the chain rule to backward propagate gradients.

- We are trying to solve an Engineering problem of finding a near-optimal strategy, and the related scientific problem of understanding the strategy used by the brain (say for vision).
- A good scientific theory gives a concise description of (properties of) observations, which are good approximations.
- However, any reasonably concise description may give a very poor approximation, and near-optimal solutions may be far from unique.
- Yet, AlphaGo is not based on a single strategy network, but on two black-box networks and some concrete descriptions.

Goals

Computer-Assisted proof components

Automated Deduction

Limits of Set theory and First-order Logic

**Goal:**Use computers to greatly increase our ability to discover and prove mathematical results across areas of mathematics.- Why?
- How?
*Learn*to*search*for,*construct*and*recognize*useful mathematical objects.

- There are infinitely many prime numbers.
- There are arbitrarily long arithmetic progressions all of whose elements are prime numbers.
- There are infinitely many natural numbers $n$ such $n$ and $n+2$ are both primes.

- Computers have been used in various ways to provide a
*component*of a proof:- Enumeration,
- Symbolic algebra,
- Exact real number arithmetic,
- Linear programming,
- SAT solvers.

- Some examples:
- Four colour theorem,
- Kepler conjecture,
- Boolean Pythagorean triples problem.

- A formal deduction system should giver rules that describe
- What are well-formed expressions, i.e., represent mathematical objects.
- How to deduce statements from other statements (more generally to make judgements).
- Axioms, which are well-formed propositions that we take to be proved.

- In the usual foundations of mathematics, the rules for forming expressions and making deductions are those of First-Order Logic, and the axioms are those of Set Theory.

- A first-order language, which describes a domain of discourse (e.g. $\mathbb{N}$) has vocabulary consisting of
- Variables - can be taken to be a fixed countable set.
- Constants (e.g. $0$, $1$).
- Functions (e.g. $+$).
- Relations (e.g. $<$, $=$).
- Special symbols $\Rightarrow$, $\iff$, $\wedge$, $\vee$, $\forall$, $\exists$, ...

- We form two kinds of expressions from these,
*terms*(e.g. $1+1$, $n + 3$) and*formulas*(e.g. $n + 1 < 2$, $4 < 3$). - Terms and formulas may depend on variables.
- Terms represent objects in the
*domain of discourse*. - Formulas are either
*true*or*false*.

- We can deduce formulas from other formulas using the rules of deduction.
- The main deduction rule is
*Modus Ponens*: given $P$ and $P\Rightarrow Q$ we deduce $Q$. - A
*theory*is a language together with a collection of formulas, called*axioms*in the language. - A formula is
*deducible*in a theory if it can be obtained from the axioms by the rules of deduction.

- A universal deducer is a computable function (algorithm) which, given a proposition (formula), returns a proof if it is deducible and otherwise returns failure.
- By results of Turing, there is no such algorithm.
- We can enumerate proofs and check if they prove either the given proposition or its negation. This does not always work as there are statements that are true but not provable.
- Practically, we can conclude that there is no best deducer, as any given proof can be found by some deducer but no deducer can find all proofs.

- Robbins conjecture was a conjectural characterization of Boolean algebras in terms of associativity and commutativity of $\vee$ and the Robbins equation $\neg(\neg(a\vee b)\vee \neg(a \vee \neg b)) = a$.
- This was conjectured in the 1930s, and finally proved in 1996 using the automated theorem prover EQP, which is a Resolution Theorem Prover with Paramodulation.
- So far, this seems to be the only major success of deductive theorem provers.
- First-order logic theorem provers are, however, used in interactive proof systems (hammer tactics).

- A proof in real-life mathematics consists of:
- definitions, axioms, assumptions, notation;
- assertions;
- hints about which assertions, definitions etc. are used in the proof of a given assertion.
- focussing attention on relevant objects and results.
- introducing variables,hypotheses etc. into a scope.

- The reader is expected to deduce all assertions based on the hints (or at least believe that he/she can do so).
- In particular, there is no objective sense in which a proof is complete or correct.

“ Since the first half of the 20th century mathematics has been presented as a science based on ZFC and ZFC was introduced as a particular theory in Predicate Logic.

“ Therefore someone who wanted to get to the bottom of things in mathematics had a simple road to follow - learn what Predicate Logic is, then learn a particular theory called ZFC, then learn how to translate propositions about a few basic mathematical concepts into formulas of ZFC, and then learn to believe, through examples, that the rest of mathematics can be reduced to these few basic concepts.”

Vladimir Voevodsky

- In the usual foundations of mathematics, $sin(3)$ and $3(sin)$ are syntactically equally valid, i.e., the usual language of mathematics is an informal language.
- Statements and proofs formalized in first-order logic are verbose and opaque.
- Propositions and Proofs are not first class, i.e., cannot be arguments or values of functions, so not composable. We instead have patterns of proof (such as induction).
- HoTT foundations are fully formal, yet much closer to working mathematics and with a syntax that actually describes valid objects. Further, introducing variables, hypothesis and assumptions as well as inductive proofs, are natural.

“ The roadblock that prevented generations of interested mathematicians and computer scientists
from solving the problem of computer verification of mathematical reasoning was the
unpreparedness of foundations of mathematics for the requirements of this task.”

“ Formulating mathematical reasoning in a language precise enough for a computer to follow
meant using a foundational system of mathematics not as a standard of consistency applied only
to establish a few fundamental theorems, but as a tool that can be employed in everyday
mathematical work. ”

Vladimir Voevodsky

Type theoretic Foundations

Terms, Types, Rules

Inductive types

Dependent Types

Propositions as types

- A language has words and phrases belonging to various
*syntactic categories*. - The grammar specifies rules for forming phrases from words and phrases, based on their syntactic categories.
- First order logic is closely modelled on this, with terms analogues of noun phrases, functions and relations analogues of verbs and formulas analogues of sentential phrases.
- However, meaningless but valid sentences are avoided by having no verb phrases other than the small number of verbs in the vocabulary.

- In a higher order logic we have rules for forming new syntactic categories.
- In a simple higher order language, these are formed from other syntactic categories (generics).
- In a
*dependently typed*language, syntactic categories are regarded as words/phrases. - Thus, we can form syntactic categories in terms of words/phrases.
- We can also form phrases using syntactic categories.

- A Type theory is a type system rich enough to replace Set theory as foundations for mathematics.
- Mathematical objects, called
*terms*, have*types*. - A term $a$ having a type $A$, denoted $a : A$, is analogous to an element $a$ belonging to a set $A$, i.e., $a \in A$.
- However the rules for forming terms and types, and for determining
whether a term has a type, are purely
*syntactic.* - Nevertheless, the rules for forming types are rich enough that types can play the role of sets - for instance, prime numbers form a type.
- Even more remarkably, propositions and proofs can be expressed in terms of types and terms.

- Mathematical objects are called
*terms.* - Every term has a
*type*, generally unique. - Types are also terms, whose types are
*universes*. - We have
*rules*to introduce terms (including types), individually or in groups, into the context. - Rules also let us make two kinds of
**judgements**:- that a term $a$ is of type $A$.
- that two terms are equal
*by definition*.

- All the rules are syntactic.
- Note that terms can be equal without being so by definition.
- There is a relation (type family)
*propositional equality*extending definitional equality.

- Given types $A$ and $B$, we can introduce the function type $A \to B$, whose members are functions.
- Given $f: A \to B$ and $a : A$, we get a term $f(a) : B$.
- We can construct a function $f: A \to B$, $$f(a) := b,$$ by giving an expression $b$ of type $B$ in terms of a variable $a : A$ and other terms in the context.
- We can also define functions
*recursively*on*inductive types*.

- We generalize functions $f : A \to B$ to
*dependent functions*, so that $f(a)$ has a type $B(a)$, depending in general on $a : A$. - More precisely,
- A
*type family*$B: A \to \mathfrak{U}$ is a function with codomain a universe, so all its values are types. - Given a type family $B: A \to \mathfrak{U}$, we can construct a corresponding type $\prod_{a : A} B(a)$ of dependent functions.
- We can apply $f : \prod_{a : A} B(a)$ to $a : A$, to obtain $f(a) : B(a)$.

- A
- Constructions of dependent functions are analogous to those of functions.

- An inductive type $T$ is defined by specifying terms (usually functions) that construct members of $T$.

```
data ℕ : Type where
zero : ℕ
succ : ℕ → ℕ
```

- We can define functions recursively on inductive types, by specifying in all cases.
- Formally, we can introduce recursion functions and apply them to the definition data.

```
_+_ : ℕ → ℕ → ℕ
zero + y = y
(succ x) + y = succ (x + y)
```

- The types $\mathbb{0}$ has no terms .
- The type $\mathbb{1}$ has a single term $*$.
- The product type $A \times B$ has terms (corresponding to) pairs $(a, b)$ with $a: A$ and $b : B$.
- The sum (disjoint union) type $A \oplus B$ has terms (corresponding to) terms of $A$ and terms of $B$.
- For a type family $B: A \to \mathcal{U}$, the
*dependent pair*type $\sum_{a: A} B(a)$ has terms (corresponding to) pairs $(a, b)$ with $a: A$ and $b : B(a)$.

- A type $A$ is
*inhabited*if there is a term $a$ with $a : A$. - By
*propostion*we mean a logical statement that must be true or false. - We represent propositions by types.
- If a type $A$ is viewed as a proposition, a term $a : A$ is a
*proof*of (or witness to) $A$. - In particular, a proposition is
**true**if and only if the corresponding type is**inhabited**. - Note that we must be able to form types representing mathematical propositions by the type formation rules.

Let $A$ and $B$ be types, regarded as representing propositions.

- The proposition $A \Rightarrow B$ is represented by $A \to B$.
- The propostion $A\wedge B$ is represented by $A \times B$.
- The proposition $A \vee B$ is represented by $A \oplus B$.
- The proposition $\neg A$ is represented by $A \to \mathbb{0}$.

- A proposition depending on a variable $x : A$ is represented by a type family $P : A \to \mathfrak{U}$.
- The proposition $\forall x\in A,\ P(x)$ is $\prod_{x: A} P(x)$.
- The proposition $\exists x\in A,\ P(x)$ is $\sum_{x : A} P(x)$.

- For a fixed type $A$, propositional equality is given by the identity type family freely generated by reflexivity.
- This is an inductive type family.
- However, for fixed $a: A$, $a = a$ is
**not**an inductive type, i.e., it is not suffiicient to define functions on $refl(a)$.

```
data _==_ {A : Type} : A → A → Type where
refl : (a : A) → a == a
```

```
symmetry : {A : Type} → {x y : A} → (x == y) → (y == x)
symmetry (refl a) = refl a
trans : {A : Type} → {x y z : A} → (x == y) → (y == z) → (x == z)
trans (refl a) (refl .a) = refl a
```

Equality, Paths, Homotopy

Levels from dimension

Sets, Propositions

- We
*interpret*- Types as
*spaces*. - Terms of a type as points of the space.
- Functions $A \to B$ as continuous functions $A \to B$.
- For a type $A$ and terms $x, y: A$, the identity type $x = y$ as
*paths*in $A$ from $x$ to $y$.

- Types as
- We do not actually construct spaces, i.e., sets with topology, starting with a type.
- Instead we make topological (specifically homotopy theoretic) constructions and prove topological results in type theory.
- A practical consequence for type theories is that we get a canonical, provably consistent, type theory.

- As above, for a type $A$ and $x, y : A$, a term $p : (x = y)$ is interpreted as a path from $x$ to $y$.
- Two such paths are equal if there is a path of paths, called a homotopy, between them.
- We have similar notions of equality for functions.
- Types are equal if the corresponding spaces are homotopy equivalent,
as a consequence of the
*Univalence axiom.*

- We define levels of types, based on a characterization of dimension in homotopy theory.
- By definition, there is a unique type at level $-2$ (the lowest), which has a single term.
- Inductively, we define the level of a type $A$ to be at most $(n + 1)$ if for $a, b : X$, the type $a = b$ has level at most $n$.
- Further, we can truncate a type canonically to an $n$-type.

- A set is a space with all of its components contractible.
- A type $A$ is a set if for $x, y: A$ and $p, q: x = y$, we have $p = q$.
- A mere proposition is a type which is either empty or all of its elements are equal, i.e., a type at level $-1$.
- The concept of mere propostions, as well as propositional truncation, allow consistent mixing of classical logic with the type theoretic form.
- For instance, in homotopy type theory the law of excluded middle is usually assumed for mere propositions, but not for all types.

HoTT implemented in Scala

Reinforcement learning

Representation learning

Natural language processing

- A theorem with a simple statement but difficult proof is useful.
- A theorem used to prove many other theorems is useful.
- Particularly for applications, we may have an externally determined
notion of
*a priori*usefulness. - With homotopy type theory and backward propagation, all these can be naturally captured in simple learning dynamics.

- Given an initial distribution $P_0$ on terms, rules for forming terms gives a new distribution which can be recursively defined by a relation of the form $$P = \alpha P_0 + \beta \mu_*(P) + \gamma \theta_*(P \times P) + \dots$$
- Given a probability distribution on terms we get one on types by mapping a term to its type (proof distribution).
- As (inhabited) types themselves are terms, we get a restricted distribution on them (theorem distribution).
- We have a flow of the (entropy of) the proof distribution towards the theorem distribution.
- We can backward propagate to get a flow on the distribution on terms.
- The distribution on theorems can have other components.

Using proximity and order, words, mathematical objects etc can be represented by vectors which capture many of their relations.

- We aim to extract mathematical objects from the literature, with reasonably high accuracy.
- These can be used for learning, both giving terms from which to generate others, and a distribution on types.
- This is similar to a translation problem, except the target language has strong restrictions on sentences being meaningful and true.
- To extract from natural language, we first use a parser.

- A knot $K$ is a smoothly embbed circle in $S^3$.
- $K$ is
*unknotted*if it bounds a smoothly embedded disc in $S^3$. - $K$ is
*slice*if it bounds a smoothly embedded disc in $B^4$. - $K$ is
*ribbon*if it bounds a slice disc so that $x^2+y^2+z^2$ has (no degenerate critical points and) no local maxima in it. - The slice-ribbon conjecture says that evey slice knot is ribbon.

- We seek useful results by interaction, working out and the literature.
- Working out is a combination of experimentation, computation, deduction and search, with learning.
- We seek:
- Consequences of being
*ribbon*- to contradict. - Constructions of
*slice*knots. - Modifications of, and relations between, slice discs for a fixed knot.
- Invariants, complexities etc associated to slice discs.

- Consequences of being
- One may similarly analyse, for example, the mean curvature flow.