Playground project to learn about LLMs.
- Python Jupyter Notebook
- PyTorch
- Matplotlib
- Bigram LLM
- The main idea is to count how often each character pair occurs in the text.
- Arrange the counts in a matrix: each row is the first character of a pair and each column is the second character (the character in the column becomes the first character, i.e. the row, of the next pair)
- The probability of a letter following another letter is the count for that character pair divided by the total count of all pairs starting with that letter (row-wise normalization)
- To generate text, repeat in a loop: the column sampled at each step lines up with the starting character of the next pair, so the loop continues on that character's row (see the sketch below)
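A minimal sketch of the count-based bigram model in PyTorch. The toy word list and the `.` start/end-of-word marker are assumptions for illustration; the real notebook builds its vocabulary from its own dataset.

```python
import torch

words = ["hello", "world"]                      # hypothetical toy corpus
chars = sorted(set("".join(words)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}
stoi["."] = 0                                   # '.' marks the start/end of a word
itos = {i: ch for ch, i in stoi.items()}
V = len(stoi)

# Count matrix: rows = first char of the pair, columns = second char
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    cs = ["."] + list(w) + ["."]
    for c1, c2 in zip(cs, cs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Row-normalize the counts into probabilities: P[i, j] = P(next char j | current char i)
P = (N + 1).float()                             # +1 smoothing so no row is all zeros
P /= P.sum(dim=1, keepdim=True)

# Sample: the column picked at each step becomes the row index for the next step
g = torch.Generator().manual_seed(42)
ix = 0                                          # start at the '.' token
out = []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:                                 # sampled '.' again -> end of the word
        break
    out.append(itos[ix])
print("".join(out))
```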
- Bigram LLM built with a Neural Network (see the sketch below)
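A rough sketch of the same bigram model as a single linear layer trained with gradient descent. The tiny placeholder tensors `xs`/`ys` (first and second character index of each pair), the vocabulary size, and the learning rate are all assumptions, not values from the notebooks.

```python
import torch
import torch.nn.functional as F

# Placeholder bigram data: xs = first char of each pair, ys = second char
xs = torch.tensor([0, 1, 2, 1])
ys = torch.tensor([1, 2, 0, 1])
V = 3                                            # assumed vocabulary size

W = torch.randn((V, V), requires_grad=True)      # a single linear layer, no bias

for _ in range(100):
    # Forward pass: one-hot encode the inputs, multiply by W, softmax into probabilities
    xenc = F.one_hot(xs, num_classes=V).float()
    logits = xenc @ W
    probs = logits.exp() / logits.exp().sum(1, keepdim=True)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()   # negative log likelihood

    # Backward pass and gradient descent step
    W.grad = None
    loss.backward()
    W.data -= 10.0 * W.grad                      # learning rate chosen arbitrarily here
print(loss.item())
```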
- Neural Networks
- Following Andrej Karpathy's "building micrograd" lecture
- Derivatives
- Notebook
- Back Propagation using the Chain Rule
- A network is made up of inputs, weights, and biases that feed into layers of Neurons
- Loss is calculated after data passes through the layers
- Mean squared error, Max-margin, Cross Entropy Loss, Negative Log Likelihood
- For regression use Mean Squared Error; for classification use Negative Log Likelihood
- A back propagation pass determines the weight/bias adjustments needed to bring the output closer to the target
- Gradient Descent: loop back to running predictions with the updated weights, then repeat the loss calculation, back propagation, and parameter adjustments to continually lower the loss (see the sketch below)
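A minimal training-loop sketch in PyTorch showing forward pass, loss, back propagation, and a gradient descent step. The random regression data, the layer sizes, and the 0.05 learning rate are made up for illustration.

```python
import torch

# Hypothetical tiny regression dataset: 32 samples, 3 features -> 1 target
X = torch.randn(32, 3)
y = torch.randn(32, 1)

model = torch.nn.Sequential(            # layers feed into each other in order
    torch.nn.Linear(3, 4),
    torch.nn.Tanh(),
    torch.nn.Linear(4, 1),
)
loss_fn = torch.nn.MSELoss()            # regression -> Mean Squared Error
                                        # (classification would use e.g. torch.nn.NLLLoss)

for step in range(100):
    pred = model(X)                     # forward pass through the layers
    loss = loss_fn(pred, y)             # how far the predictions are from the targets

    model.zero_grad()
    loss.backward()                     # back propagation via the chain rule fills p.grad

    with torch.no_grad():
        for p in model.parameters():    # gradient descent: nudge weights/biases against the gradient
            p -= 0.05 * p.grad
```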
- $x_n$ : Inputs to the neuron
- $w_n$ : Weights (on the synapses)
- Processing in the Neuron: the set of weights multiplied by their corresponding inputs, plus a bias
- What flows to the neuron are the inputs multiplied by the weights: $w_1 \times x_1, w_2 \times x_2, \ldots, w_n \times x_n$
- Added to this is some bias $b$, which can be used to adjust the sensitivity or "trigger happiness" of the neuron regardless of the input: $$\sum_n w_n x_n + b$$
- The product of the inputs and weights, plus the bias, is piped to an Activation Function
- The Activation Function is usually a squashing function of some kind (Sigmoid, ReLU or Tanh)
- The squashing function squashes the output so that it plateaus and caps smoothly at 1 or -1 (as the inputs are increased or decreased from zero)
- The output of the neuron is the Activation Function applied to the dot product of the weights and inputs plus the bias: $$f\left(\sum_n x_n w_n + b\right)$$
- A set of Neurons evaluated independently forms a Layer
- A network with multiple Layers of Neurons
- The Layers feed into each other sequentially (in order); see the sketch below
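A plain-Python sketch of this Neuron/Layer/MLP structure, mirroring micrograd's classes but using raw floats (forward pass only, no gradient tracking), so the names and sizes here are illustrative.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]  # one weight per input
        self.b = random.uniform(-1, 1)                        # bias: the neuron's "trigger happiness"

    def __call__(self, x):
        # f(sum_n w_n * x_n + b), with tanh as the squashing activation
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return math.tanh(act)

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]     # a set of Neurons evaluated independently

    def __call__(self, x):
        return [n(x) for n in self.neurons]

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:                             # layers feed into each other in order
            x = layer(x)
        return x

net = MLP(3, [4, 4, 1])
print(net([2.0, 3.0, -1.0]))
```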
- In practice, for very large datasets, batching is used: a smaller random subset of the data is taken and used for each forward and backward pass
- See Andrej Karpathy's micrograd demo for example code:
```python
import numpy as np

# X and y are the full dataset (inputs and targets), defined earlier in the demo
def loss(batch_size=None):
    if batch_size is None:
        Xb, yb = X, y                                        # use the whole dataset
    else:
        ri = np.random.permutation(X.shape[0])[:batch_size]  # random row indices
        Xb, yb = X[ri], y[ri]                                # mini-batch for this pass
    # ...the demo then runs the forward pass on Xb and scores it against yb
```