### Motivation
**Why Compressive Learning?**
**Observation**:
We take extremely large pre-trained networks (LLMs) and quantize them to 1-bit precision, effectively turning them into Boolean Networks, and yet performance does not degrade by much. See the [1-bit LLMs](https://arxiv.org/abs/2402.17764) paper for recent results. This begs a new question: _Can we not train them in the quantized space itself?_ Answering it requires revisiting and revising many things we have learnt over the last few decades. The topics fall into a few broad categories.
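To make concrete what such extreme quantization does to a weight matrix, here is a rough sketch of round-to-nearest ternary quantization with an absolute-mean scale, in the spirit of the 1.58-bit scheme (the exact recipe in the paper may differ; the shapes and values below are purely illustrative):

```python
import numpy as np

def ternarize(W: np.ndarray, eps: float = 1e-8):
    """Quantize a full-precision weight matrix to {-1, 0, 1} with one
    per-matrix scale. A rough absmean round-to-nearest sketch, not the
    exact recipe of any particular paper."""
    scale = np.abs(W).mean() + eps              # a single scalar summarizing the matrix
    W_q = np.clip(np.round(W / scale), -1, 1)   # ternary codes in {-1, 0, 1}
    return W_q.astype(np.int8), scale           # store codes + scale; W is approx. scale * W_q

# Usage: the dequantized product scale * (x @ W_q) approximates x @ W.
W = np.random.randn(512, 512) * 0.02
W_q, s = ternarize(W)
print(np.unique(W_q), float(np.mean((W - s * W_q) ** 2)))
```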
### Analysis
- The Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis [(LTH)](https://arxiv.org/abs/1803.03635) conjectured that some sub-networks are as good as the full network, and that the remaining weights can be pruned away without incurring much degradation in performance (a minimal sketch of iterative magnitude pruning appears after this list).
In the [1-bit LLMs](https://arxiv.org/abs/2402.17764) paper, the authors showed that all weights can come from $\{-1, 0, 1\}$. This is the extreme form of quantization and pruning combined, and we see it applied to many modern LLMs for computational (and energy) efficiency at inference time.
The LTH authors observed that initialization mattered, though: the same winning/retained sub-network, retrained from a different initialization, did not achieve the performance it achieved with the original initialization. I suspect that this is an artefact of the training (learning) algorithm, and not of the initialization itself.
What role are the prunable sub-networks playing? Is it redundancy (serving some other important function, such as error control coding), or are they just plain useless? It is not clear unless we investigate in that direction. The LTH paper did not study brittleness after pruning (AFAIK).
- Are the pruned networks acting like "error control codes" in the functional space? Not having them may be fine from a prediction (accuracy) point of view, but not from a robustness (sensitivity) point of view.
- How do we study the generalization ability of such binary networks? From a _build_ point of view, what is needed to generalize? I hypothesize that _smoothness_ of the functions is central to generalization. So how do we define, and later enforce, such smoothness on Boolean Networks?
- A related question - what is the equivalent notion of robustness for Boolean maps?
- How do we analyze (and develop the theory to analyze) such Boolean functions, which map n-dimensional input bit streams to m-dimensional output bit streams? Will the theories of error control codes based on groups and fields help?
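As promised above, here is a minimal sketch of iterative magnitude pruning in the spirit of the LTH experiments: train, prune the smallest-magnitude weights, rewind the survivors to their original initialization, and retrain. The `train(model, masks)` callable is a hypothetical mask-respecting training loop, not code from any specific repository:

```python
import copy
import torch
import torch.nn as nn

def iterative_magnitude_pruning(model: nn.Module, train, rounds: int = 3, frac: float = 0.2):
    """Sketch of LTH-style pruning: repeatedly train, drop the smallest-magnitude
    surviving weights, and rewind the survivors to their original initialization.
    `train(model, masks)` is a placeholder training loop that respects the masks."""
    init_state = copy.deepcopy(model.state_dict())               # remember the original init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train(model, masks)                                      # train the current sub-network
        for name, p in model.named_parameters():
            alive = p[masks[name].bool()].abs()
            if alive.numel() == 0:
                continue
            thresh = alive.quantile(frac)                        # prune the smallest `frac` of survivors
            masks[name] *= (p.abs() > thresh).float()
        model.load_state_dict(init_state)                        # rewind survivors to the original init
        with torch.no_grad():
            for name, p in model.named_parameters():
                p.mul_(masks[name])                              # zero out the pruned weights
    return masks                                                 # the "winning ticket" masks
```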
### Build & Develop
- Do we need to go through the route of _Train large models with full precision and quantize_, or can we _Learn in the 1-bit space_ itself? Drawing inspiration from compressive sensing, can we do **Compressive Learning**?
- How do we train such 1-bit models to begin with? If `autograd` and `backprop` made training possible (with non-trivial enablers like data, compute, and massive engineering feats at data-center scale), are there equally efficient learning algorithms for 1-bit LLMs?
Boolean Logic Deep Learning [(BOLD)](https://arxiv.org/abs/2405.16339) introduced a Variational Theory to define gradients on Boolean variables, and a corresponding chain rule to go with it.
The implication is that we can compute gradients over mixed data types on any computational DAG (all neural networks are instances of such graphs) via the `backprop` algorithm. But will it work at scale? A crude surrogate for pushing gradients through a binarizer is sketched below.
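BOLD's variational construction is spelled out in the paper; as a much cruder point of reference, the straight-through estimator below is the common surrogate for "gradients on Boolean variables" used in binary-network training. It is a stand-in sketch, not BOLD's method:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: hard sign(). Backward: pass the gradient straight through,
    clipped to the region |x| <= 1. A common surrogate for gradients through
    a binarizer; not the variational construction of the BOLD paper."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()   # clipped straight-through estimate

# Usage: keep full-precision "latent" weights, binarize them on the forward pass,
# and let autograd update the latent weights with the surrogate gradient.
w = torch.randn(16, requires_grad=True)
x = torch.randn(16)
loss = (BinarizeSTE.apply(w) * x).sum()
loss.backward()
print(w.grad)
```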
### Kernel Machines
Can Kernels bridge the digital (Boolean Networks) and the analog (current full precision LLMs) divide?
Before Deep Learning took off circa 2012, the most successful models were Kernel Machines. It was a well-researched area, and probably well understood as well, relative to Deep Learning. Here is some evidence that they can.
- Trees are Kernel Machines: Some might argue: what about Trees and Ensembles? They are not Kernel Machines. Yet, under the hood, Random Forests are Kernel Machines too. In an obscure paper, [Some infinity theory for predictor ensembles](https://www.stat.berkeley.edu/~breiman/some_theory2000.pdf), Leo Breiman shows that Random Forests implicitly learn a Laplacian Kernel, asymptotically.
- Transformers are Kernel Machines: In another insightful work, Prof. [Rich B](https://richb.rice.edu/), who did pioneering work in signal processing in general and [compressive sensing](https://dsp.rice.edu/cs/) in particular (the 1-pixel camera project, for example), re-derived Transformers, like Prof. [Yi Ma](https://people.eecs.berkeley.edu/~yima/)'s group, but via a different route: by looking at Transformers as Kernel Machines. In [A Primal-Dual Framework for Transformers and Neural Networks](https://arxiv.org/abs/2406.13781), an explicit construction of a polynomial kernel is given whose dual form gives rise to the self-attention block of Transformers (a generic kernel-smoother reading of softmax attention is sketched at the end of this section). _Perhaps_ this can explain why Transformers were not very competitive with Trees: the inductive biases encoded by their respective Kernels differ, and not all Kernels are equal.
- SGD is a Kernel Machine Learner: In this [paper](https://arxiv.org/abs/2012.00152v1), Prof. Pedro Domingos shows that every model learned by Gradient Descent is approximately a Kernel Machine. The Kernel happens to be the Neural Tangent Kernel (NTK), which has become the main analytical device for studying the behavior of Deep Neural Networks (see [this](https://arxiv.org/abs/1806.07572), for example).
- Random Features are Large Scale Kernel Machines: In a [time-tested](https://eecs.berkeley.edu/news/ben-recht-wins-nips-test-time-award/) paper, Ali Rahimi and Ben Recht show that linear models on random features are approximately Kernel Machines, and that they are extremely competitive while remaining simple to implement and reason about (a minimal sketch appears after this list).
- Separability in High Dimensional Feature space: The conventional wisdom in ML is to transform low-dimensional inputs into a very high-dimensional (non-linear) feature space, where modeling becomes easier. Kernels provide that kind of statistical machinery, both to compute with and to theorize about, when developing new models (architectures) and learning algorithms.
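Here is the promised minimal sketch of random Fourier features, following the Rahimi-Recht construction: a linear model fit on these features behaves like a Gaussian-kernel machine at a fraction of the cost. Dimensions, `sigma`, and `D` below are arbitrary illustrative choices:

```python
import numpy as np

def random_fourier_features(X: np.ndarray, D: int = 1024, sigma: float = 1.0, seed: int = 0):
    """Map X (n x d) to D random Fourier features so that inner products of the
    feature vectors approximate the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)).
    A minimal sketch of the Rahimi-Recht construction."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))   # random frequencies ~ N(0, 1/sigma^2)
    b = rng.uniform(0.0, 2 * np.pi, size=D)         # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Z @ Z.T approximates the exact Gaussian kernel matrix.
X = np.random.randn(5, 3)
Z = random_fourier_features(X)
approx = Z @ Z.T
exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)
print(np.abs(approx - exact).max())
```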
But what about Boolean functions, i.e., functions that map n-dimensional Boolean inputs to m-dimensional Boolean outputs? In a beautiful paper, [Learning of Boolean Functions Using Support Vector Machines](https://dl.acm.org/doi/abs/10.5555/647719.735941), published in 2001, Ken Sadohara develops DNF Kernels to be used alongside SVMs (a rough sketch follows below). The connection between DNF (Disjunctive Normal Form) and Boolean circuits is rather straightforward to see. Any truth table (coding a Boolean function) can be represented either in Sum-of-Products (SOP) or Product-of-Sums (POS) form, and DNF corresponds to the SOP form. So, if we can construct such a model, and learning algorithms to go with it, the model becomes a universal learner (of any Boolean function). This is the constructionist approach to model (architecture) design. Using it, we can argue that Boolean SOP circuits are the MLPs of the Boolean space; see [here](https://mlsquare.github.io/intro2dl/lectures/L01.html#universal-approximation) for a discussion.
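The rough sketch: a DNF-style kernel on Boolean vectors implicitly counts the conjunctions of literals satisfied by both inputs. The closed form below follows that counting argument; the exact variant and normalization in Sadohara's paper may differ:

```python
import numpy as np

def dnf_kernel(x: np.ndarray, z: np.ndarray) -> float:
    """DNF-style kernel on Boolean vectors in {0, 1}^n.

    A conjunction of literals is satisfied by both x and z only if it uses
    coordinates where x and z agree (with the matching polarity), so the number
    of shared conjunctions is 2^(#agreements) - 1, excluding the empty one.
    This makes the feature space of all conjunctions implicit; the exact
    normalization in Sadohara's paper may differ from this sketch."""
    agree = int(np.sum(x == z))
    return float(2 ** agree - 1)

# Plugged into an SVM as a precomputed Gram matrix, the learned decision
# function is a weighted vote over conjunctions, i.e. a soft DNF/SOP form.
x = np.array([1, 0, 1, 1])
z = np.array([1, 1, 1, 0])
print(dnf_kernel(x, z))   # 2 agreements -> 3 shared conjunctions
```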
So the thesis is: all modern, empirically proven architectures are a _Composition of Learnable Kernel Machines_, and perhaps we can extend this to Deep Binary Neural Networks as well.
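To close the loop on the Transformer bullet above: the primal-dual paper builds an explicit polynomial kernel, but even without it, softmax attention already reads as a kernel smoother with the exponential kernel $k(q, k) = \exp(q^\top k / \sqrt{d})$. The sketch below shows only this generic reading, not the paper's construction:

```python
import numpy as np

def kernel_attention(Q, K, V):
    """Self-attention as a kernel smoother: each output row is a kernel-weighted
    average of the value rows, with k(q, k) = exp(q.k / sqrt(d)). This reproduces
    standard softmax attention exactly; it is the generic kernel reading only."""
    d = Q.shape[-1]
    G = np.exp(Q @ K.T / np.sqrt(d))                  # pairwise kernel evaluations
    return (G / G.sum(axis=1, keepdims=True)) @ V     # normalized kernel smoothing

def softmax_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(axis=1, keepdims=True))
    return (A / A.sum(axis=1, keepdims=True)) @ V

Q, K, V = (np.random.randn(4, 8) for _ in range(3))
print(np.allclose(kernel_attention(Q, K, V), softmax_attention(Q, K, V)))  # True
```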
### Axiomatic Notions of Intelligence
- Principle of Parsimony and Self-Consistency: In a series of papers, based on decades of work, Prof. [Yi Ma](https://people.eecs.berkeley.edu/~yima/) and colleagues argued that for intelligence to emerge, representations must be parsimonious (not exactly sparse) and self-consistent (able to recover the original signal).
- **Compression**: A key point Prof. Yi Ma emphasizes is that **Compression** is at the heart of intelligence. Using rate-reduction theory (widely used in communication theory), his group was able to re-derive Transformers and ConvNets, the empirically proven architectures, from first principles (a sketch of the coding-rate measure appears after this list).
- Ability to learn from few examples: Intelligent systems should be able to learn from a feedback loop (error control) with only a few examples.
- Recollection is not a sign of intelligence: By that account, and taking a cue from Prof. [Subbarao Kambhampati](https://rakaposhi.eas.asu.edu/), LLMs are approximate retrieval machines with no reasoning ability, unless specifically equipped for it, as recent work seems to suggest.
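For reference, the compression measure at the center of that line of work is, as I understand it, the lossy coding rate of a set of representations. A minimal sketch, with constants and conventions that may differ from the papers:

```python
import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Lossy coding rate of representations Z (d x n) up to distortion eps:
    R(Z, eps) = 1/2 * logdet(I + d / (n * eps^2) * Z @ Z.T).
    My reading of the rate measure used in the maximal coding rate reduction
    (MCR^2) line of work; constants and conventions may differ."""
    d, n = Z.shape
    M = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    return 0.5 * np.linalg.slogdet(M)[1]

# Representations squeezed onto a low-dimensional subspace cost fewer bits than
# isotropic ones: compression prefers parsimonious representations.
rng = np.random.default_rng(0)
iso = rng.normal(size=(32, 256))
low = rng.normal(size=(32, 4)) @ rng.normal(size=(4, 256))  # rank-4 representations
print(coding_rate(iso), coding_rate(low))
```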
### Axiomatic Notions of Perception and Processing
- Perceive in low dimensions but with high precision (fidelity).
- Project them onto very high-dimensional, parsimonious, self-consistent representations, but in low precision (1-bit, for example)
- Process them in the bit space (energy efficient)
What this means is that the interface between the external world and the processor (reasoner or model) is like an A/D converter (Analog to Digital), the processor (model) only performs bitwise operations, and the binary signals can be converted back to analog (D/A converter) for external communication. A toy version of this pipeline is sketched below.
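A toy version, assuming the simplest possible encoding: quantize an analog vector to its signs (A/D), compute the inner product purely with bit operations (XOR and popcount), and read the integer result back out (D/A). The packing is plain Python for illustration, not an optimized kernel:

```python
import numpy as np

def to_bits(x: np.ndarray) -> int:
    """A/D step: keep only the sign of each coordinate and pack it into an
    integer (bit = 1 encodes +1, bit = 0 encodes -1)."""
    bits = 0
    for i, v in enumerate(x):
        if v >= 0:
            bits |= 1 << i
    return bits

def bitwise_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Inner product of two {-1, +1}^n vectors using only XOR and popcount:
    dot = n - 2 * (number of disagreeing bits)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")

# D/A step: the integer result can be rescaled back to the analog range if needed.
x = np.random.randn(16)
y = np.random.randn(16)
n = x.size
assert bitwise_dot(to_bits(x), to_bits(y), n) == int(np.sign(x) @ np.sign(y))
print(bitwise_dot(to_bits(x), to_bits(y), n))
```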
Putting it all together,
### Compressive Learning
is to study and develop
- Computational Training Stack
- define `gradients` on Boolean variables, and a chain rule to go with
- develop `backprop` to scale to large computational DAGs
- Perhaps use Genetic Algorithms to augment the training
- Kernels
- Learnable DNF Kernels to learn Boolean features, or other Universal Learners of Boolean Networks in SOP or POS form
- Compose the Kernels depth-wise to retain the expressivity of modern Deep Learning models.
- Issues
- Can they be trained? Or do they just rote learn?
- Can they generalize?
- Can they be distilled?
- Can they be analyzed?
If successful, an optimistic outlook for these Deep Binary Neural Networks is that they are:
- Interpretable: since we can recover the DNF forms
- Modulo-LLM-ready: since symbols can be mined explicitly, can augment symbolic reasoning, making the LLMs [modulo-LLMs](https://arxiv.org/abs/2402.01817)
- Energy-efficient: requiring only bit-wise operations, with no need for giant matrix multiplications (so no GPUs)
- ASIC-friendly: we may be able to compile PyTorch models directly into HDLs and burn the models onto silicon. Like 3D printing, forge your model on silicon.
- SLA-friendly: with high token throughput and low latency
- Edge-friendly: with low memory footprint, can be deployed on edge devices
- End-to-End trainable: no need for the train-large-then-quantize paradigm; train in the compressed domain itself.