report draft 2
outlawhayden committed Apr 30, 2024
1 parent e29201f commit 32f76eb
Showing 2 changed files with 13 additions and 9 deletions.
Binary file modified report/report.pdf
22 changes: 13 additions & 9 deletions report/report.tex
\begin{document}
\maketitle
\begin{abstract}
We present an expanded implementation of control vector engineering for the \emph{Mistral-7B-Instruct-v0.1} model, in which we pre-compute vector representations of 50 different model behaviors that can then be enforced on future queries by adding the vector to the activation function output between each layer. We also add a novel visualization framework for these control vectors, depicting both their individual weights and their magnitude of effect at each layer, along with a front-end that allows linear combinations of control vectors to be applied at a variety of magnitudes to induce a mixture of behaviors from the model.
\end{abstract}

%
\section{Background}

\item \emph{A Tutorial on Network Embeddings}~\cite{chen2018network}

Chen et al.\ offer an overview of ways to extract conceptual embeddings from models beyond PCA. This work differs from the high-level concepts we are trying to learn with control vectors, focusing instead on applications of low-dimensional latent node representations. Some of the methods covered include Isomap and locally linear embeddings. While somewhat outside the scope of our project, they provide a useful framework for these more advanced concepts and informed our understanding of concept embeddings within similar systems.

\item \emph{Representation Engineering Mistral-7B an Acid Trip}~\cite{vogel2024repeng}

\section{Approach}
While these projects provide proofs of concept, the high computational cost of generating control vectors means there does not yet exist an accessible, interpretable, user-facing mechanism for individuals to implement these control vector schemes. Our approach focuses on using control vectors for further representation engineering and explainability, and as a lens through which to examine the biases and behaviors within foundation models.

As such, our project has two main components: the pre-calculation of conceptual embeddings within the model, and a frontend that passes queries through an evaluation pipeline.

\subsection{Model Architecture}
Since we require manual adjustment of the activation functions between layers, we need an open-source model that is completely accessible at the weight level, while still being lightweight enough to query within a reasonable time for a web application. For this reason much of the relevant literature utilizes the \emph{Mistral-7B-Instruct-v0.1} \cite{jiang2023mistral} model, which is what we use as well; although some of its architectural details are outside the scope of this course, it includes specific \emph{[INST]} tokens for instructional control, which increase the effectiveness of the extraction and control process. We can access the model from \emph{HuggingFace}\footnote{https://huggingface.co/mistralai/Mistral-7B-v0.1} and load it directly into PyTorch after a simple license acceptance.
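
Loading the model reduces to a few lines (a sketch assuming the \emph{transformers} and \emph{accelerate} packages and an accepted license; the exact dtype and device placement depend on the available hardware):

\begin{verbatim}
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,  # half precision to fit on one GPU
    device_map="auto",          # requires the accelerate package
)
model.eval()
\end{verbatim}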
\subsection{Evaluation with Enforced Behavior}

In each forward pass, for each layer, the activation function output is modified given a normalization flag, a defined operation for combining the control vector into the activation output, and a scalar value. We first check each entry of the hidden state and create a binary mask that encodes whether an entry is nonzero. We then multiply the control vector by this mask, since any entry that is zero in the original layer should remain zero after modification. We then optionally normalize the control vector and multiply it by our chosen scalar; this scalar parameter sets the magnitude of the behavior enforcement. Finally, we combine the original activation output with the control vector using the chosen operation, and optionally normalize this output as well.
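
The per-layer modification can be summarized by the following sketch (the function and argument names here are illustrative, not the \emph{Repeng} library's actual signature):

\begin{verbatim}
import torch

def apply_control(hidden, control, coeff=1.0,
                  operator=torch.add, normalize=False):
    # Entries that are zero in the hidden state stay zero afterwards.
    mask = (hidden != 0).to(hidden.dtype)
    vec = control.to(hidden.dtype) * mask
    if normalize:
        # Optionally normalize the control vector before scaling.
        vec = vec / (vec.norm() + 1e-8)
    vec = coeff * vec                # scalar sets enforcement strength
    out = operator(hidden, vec)      # combine with the layer output
    if normalize:
        # Optionally rescale the result back to the original norm.
        out = out * (hidden.norm() / (out.norm() + 1e-8))
    return out
\end{verbatim}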

We made minimal code-level changes to the original \emph{Repeng} library itself, only adding a few utilities for caching and loading control vector objects as methods within the preconstructed classes. Most of our programming effort went toward iterative training processes for calculating a large number of control vectors with limited computational resources, as well as automating our novel visual renderings.
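
As a rough illustration, the caching utilities amount to something like the following (a minimal sketch assuming the control vector objects are picklable; the helper names are our own, not part of \emph{Repeng}):

\begin{verbatim}
import os
import pickle

def save_control_vector(cv, name, cache_dir="cv_cache"):
    # Persist a trained control vector so it never needs recomputing.
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, name + ".pkl"), "wb") as f:
        pickle.dump(cv, f)

def load_control_vector(name, cache_dir="cv_cache"):
    # Load a pre-computed control vector back into memory.
    with open(os.path.join(cache_dir, name + ".pkl"), "rb") as f:
        return pickle.load(f)
\end{verbatim}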


\subsection{Web App}
Our Flask web app allows users to select a pre-cached control vector and a magnitude of effect, and then to pass a query through the model with the desired control behavior enforced. The app loads the model once locally, and then loads the desired control vector into memory. The user's query is formatted, passed through the model, and the output is sent back to the frontend to be read. The frontend also passes the query through the model a second time with no control vector loaded (equivalent to a selected magnitude of 0), for comparison and examination of the effect. For more information, refer to Appendix \ref{appendix:A}. We note that this requires similarly capable local computation resources: the host has to be able to at least evaluate a local forward pass of the model, which often requires an external GPU processor. Alternatives include web deployment with cloud resources to load and evaluate the model. For implementation, deployment, and source code, see \href{https://github.com/tulane-cmps6730/project-control}{project-control} on Github.
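
The core route is structured roughly as follows (a simplified sketch: \texttt{load\_control\_vector}, \texttt{apply\_control\_vector}, and \texttt{clear\_control} stand in for our own helpers, and the model and tokenizer are assumed to be loaded at startup):

\begin{verbatim}
from flask import Flask, request, jsonify

app = Flask(__name__)
# model, tokenizer, and the control helpers are assumed to be
# defined once at startup (see Model Architecture).

@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json()
    # Mistral-7B-Instruct expects the [INST] instruction format.
    prompt = "[INST] " + data["query"] + " [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Baseline response, equivalent to a magnitude of 0.
    clear_control(model)
    base_ids = model.generate(**inputs, max_new_tokens=256)
    baseline = tokenizer.decode(base_ids[0], skip_special_tokens=True)

    # Response with the selected behavior enforced.
    cv = load_control_vector(data["behavior"])
    apply_control_vector(model, cv, coeff=float(data["magnitude"]))
    ctrl_ids = model.generate(**inputs, max_new_tokens=256)
    steered = tokenizer.decode(ctrl_ids[0], skip_special_tokens=True)
    clear_control(model)

    return jsonify({"baseline": baseline, "steered": steered})
\end{verbatim}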


\subsection{Visualizations}
For visual inspection, we compute two graphical summaries for each control vector. Recall that each ``vector'' is a list of vectors, one for each layer. We compute the element-wise average across all layers to obtain the average control vector, and the average weight magnitude within each layer's vector to obtain the magnitude of effect on each layer. For example, with the desired behavior of ``Elated'' versus the undesired behavior of ``Dejected'', Figure \ref{fig:elated_effect} depicts the average control vector across all layers of the model, and Figure \ref{fig:elated_mag} depicts the average weight magnitude for each layer. We can see that the control vector is mostly subtractive, with most values between 0 and -0.025, but with the most noticeable effects in layers 10-15.
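
These summaries are straightforward to compute; a sketch of the calculation is shown below (assuming the control vector exposes a mapping from layer index to its per-layer vector, here called \texttt{directions}):

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

layers = sorted(cv.directions.keys())
per_layer = np.stack([np.asarray(cv.directions[l]) for l in layers])

# Average control vector across all layers (one value per dimension).
avg_vector = per_layer.mean(axis=0)
# Average weight magnitude within each layer's vector.
avg_magnitude = np.abs(per_layer).mean(axis=1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(avg_vector)
ax1.set_title("Average control vector across layers")
ax1.set_xlabel("Dimension")
ax2.bar(layers, avg_magnitude)
ax2.set_title("Average magnitude per layer")
ax2.set_xlabel("Layer")
plt.tight_layout()
plt.show()
\end{verbatim}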

While it is possible to depict the entire list of vectors as a two-dimensional array, or potentially as a heatmap, we found that most of the values are so small, and the array so large, that any visual insight is generally lost.


\section{Discussion}
\subsection{Interpretability and Explainability}
The control vector framework offers not only the capacity to extract and examine potential representations of concepts from within the model, but also the ability to explore the effect that each representation has on the final output. The visual renderings are intuitive and allow a more holistic overview of a concept's effects on individual weights, in a way that the effects of prompt engineering could not capture.

\subsection{Computation Requirements}
Compared with fine-tuning, calculating and implementing control vectors is a cheaper and more accessible process. First, the calculation of control vectors is only feature extraction on the results of multiple forward passes: no gradients are calculated and no weights are adjusted within the model, which greatly reduces the computational cost of their generation. Enforcement is also relatively cheap, since a linear function combines the control vector with the hidden weights in each activation function, so the cost of enforcement is at worst linear in the number of layers of the model. Given that the bulk of the cost lies in the vectors' generation, which only needs to be done once before they can be served from a cache with no new learning required, they scale in a useful way for deployment in user-facing utilities such as chatbots.
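
Concretely, the extraction step reduces to a layer-wise principal component analysis over hidden-state differences from contrastive prompt pairs; the sketch below is our paraphrase of that general approach, not \emph{Repeng}'s exact code:

\begin{verbatim}
import numpy as np

def control_directions(pos_states, neg_states):
    # pos_states/neg_states: {layer: array of shape (n_prompts, dim)}
    # collected from forward passes over contrastive prompt pairs.
    directions = {}
    for layer in pos_states:
        diffs = pos_states[layer] - neg_states[layer]
        diffs = diffs - diffs.mean(axis=0, keepdims=True)
        # First principal component of the differences via SVD.
        _, _, vt = np.linalg.svd(diffs, full_matrices=False)
        direction = vt[0]
        # Orient the direction toward the desired behavior.
        if np.mean(diffs @ direction) < 0:
            direction = -direction
        directions[layer] = direction
    return directions
\end{verbatim}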

It is worth noting that the many forward passes required to collect the outputs that are refined into control vectors could be done in parallel. Our implementation did not use a multi-threaded approach, but the lack of gradient calculations within the model would make this easy to distribute across multiple processors without the machinery that gradient computation normally requires (gradient averaging, data/model sharding, etc.).
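
Each worker would only need its own copy of the model and a gradient-free collection loop along the following lines (a sketch; the \texttt{collect\_hidden\_states} helper is hypothetical, and the model and tokenizer are assumed to be loaded as before):

\begin{verbatim}
import torch

@torch.no_grad()
def collect_hidden_states(model, tokenizer, prompts, device):
    states = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt").to(device)
        out = model(**inputs, output_hidden_states=True)
        # Keep the last-token hidden state from every layer.
        states.append([h[0, -1].cpu() for h in out.hidden_states])
    return states
\end{verbatim}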

\subsection{Safety}
Control vectors are also better than prompt engineering at controlling model output in a user-facing setting. Especially given the recent cottage industry that has grown around jailbreaking models toward undesirable behavior using specific adversarial prompts or other injection methods, there is a need for control methods that are simultaneously transparent and robust. For a given behavior enforced with a control vector ('helpful', 'cautious', 'civil', 'balanced', etc.), removing the behavior externally using prompt engineering is extremely difficult, as illustrated by Vogel \cite{vogel2024repeng}\footnote{See also: \href{https://vgel.me/posts/representation-engineering/}{Representation Engineering Mistral-7B an Acid Trip}}. However, the other side of this feature is that control vectors provide a very efficient way to jailbreak models that were tuned or aligned to certain behaviors, if the individual layers are accessible. Therefore, for safety purposes, these tools are best suited for scenarios in which developers have direct access to model weights but users do not, which hampers open source accessibility and transparency.

Zou et al. \cite{zou2023representation} also excellently outline the use of representation engineering as a tool for transparency with regard to the model itself. While their visualization methods are more advanced than ours, we were aiming for a similarly simple visual rendering of hidden concepts within otherwise intractable layers. Often referred to as 'top-down' research, control vectors can also serve as a useful tool for examining the biases and behaviors of the model within its latent space. This could potentially include observing the weights of each hidden layer during a forward pass and measuring their magnitude with respect to a given control vector, checking for internal enforcement of a desired concept within each hidden state, but this has yet to be implemented anywhere that we know of. However, in our own work, we were unable to find useful quantitative insight into the similarities between control vectors or how such similarities translate into knowledge about the model's latent space. The high dimensionality and relatively low magnitude of each vector within the list makes measures of vector similarity (cosine similarity, word2vec or other latent-space similarity tools, etc.) difficult to interpret, especially given that calculating control vectors is a noisy process not conducted via gradient descent.
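
For reference, the kind of comparison we attempted amounts to a per-layer cosine similarity between two cached control vectors (attribute names as in the earlier sketches, assumed rather than prescribed by the library):

\begin{verbatim}
import numpy as np

def layerwise_cosine(cv_a, cv_b):
    # Cosine similarity between two control vectors, layer by layer.
    sims = {}
    for layer in cv_a.directions:
        a = np.asarray(cv_a.directions[layer]).ravel()
        b = np.asarray(cv_b.directions[layer]).ravel()
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
        sims[layer] = float(a @ b / denom)
    return sims
\end{verbatim}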

\subsection{Shortcomings}
While this process is generally model-agnostic, in the sense that it is theoretically feasible for any language model with distinct layers and activation functions, the final control vectors, like any latent-space representation, are model-specific. For each model architecture, the control vectors have to be calculated from scratch, and they do not transfer between different model versions or sizes.

Compared with prompt engineering, the number of different concepts that can be enforced via control vectors is relatively small. Since the overall magnitude of the changes has to remain below a certain threshold for the responses to remain cogent, enforcing multiple concepts at once requires that the magnitude of each concept be scaled down accordingly, which reduces their overall effect. The best results are seen when just one control vector is used; as a workaround, a single control vector can be calculated to encompass the desired joint concept, although that may make the effect more unstable.
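
A combined enforcement is simply a weighted sum of the per-layer vectors, with the coefficients scaled down so the total perturbation stays small (a sketch; the behavior names here are hypothetical examples):

\begin{verbatim}
# Combine two behaviors at half strength each so the total
# perturbation stays within the cogency threshold.
combined = {}
for layer in cv_elated.directions:
    combined[layer] = (0.5 * cv_elated.directions[layer]
                       + 0.5 * cv_formal.directions[layer])
\end{verbatim}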

\subsection{Future Work}
As additionally pointed out by \cite{vogel2024repeng}, the one-size-fits-all mechanism for generating the contrastive prompts used to compute control vectors could be improved. For example, calculating a control vector for 'honesty' should include more prompts asking the model to lie, return incorrect answers, and so on. For this project, something scalable and automated with respect to our synthetic data was warranted in order to calculate the variety of control vectors that we did; for a specific use case, treating this part of the process with greater care would yield more effective representations.
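
Such contrastive prompts can be generated from a simple persona template; the sketch below is purely illustrative (the template and suffixes are hypothetical examples, not our dataset):

\begin{verbatim}
def make_contrastive_pairs(positive, negative, suffixes):
    # Each pair differs only in the persona trait, so the
    # hidden-state difference isolates the concept being learned.
    template = ("[INST] Pretend you are an extremely {trait} person. "
                "{suffix} [/INST]")
    return [(template.format(trait=positive, suffix=s),
             template.format(trait=negative, suffix=s))
            for s in suffixes]

pairs = make_contrastive_pairs(
    "honest", "dishonest",
    ["Tell me about your day.", "Describe the ocean."])
\end{verbatim}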

\section{Conclusion}
In conclusion, this project provides a useful interactive implementation of a novel representation engineering concept and a demonstration of the effects of non-prompt-based model steering. The visualizations of calculated control vectors offer an intuitive and accessible view of concepts in a relatively straightforward way, functioning as a preliminary window into the model's conceptual space. Distilling underlying model concepts into a comprehensible format will ideally pave the way for more transparent AI systems by illustrating what specific behaviors are being systematically induced. In the constant effort by researchers to create ``ethical'' AI, control vectors offer significant potential for reducing model susceptibility to malicious input modifications and increasing output predictability.



