Explicit Tool Turns #626
base: master
Conversation
The general approach looks good to me. I have one major suggestion and one question. The suggestion first:

I don't think it's worth propertizing the prompt-prefix and response-prefix. The current approach is to strip them via text search (see …). I understand this is less than ideal when you change the prefixes. For example, if you open an old chat file from before you changed the prefixes, the prefixes in the file will no longer be stripped. However, I think the new approach is worse on balance because it doubles or triples the length of the serialized bounds list when you save the chat. We want to store as little metadata as we can get away with.

Note also that the local variables block, which is used for serialization in markdown and text chat files, has a maximum allowed length of 2000 characters or so (enforced by Emacs). Each extra character of metadata we store bumps us closer to this limit, and all hell breaks loose if we cross it.

Continuing to strip prefixes/suffixes using string matching will also simplify the code a fair bit -- no need for a ….

Finally, I haven't noticed any auto-mimicry problems with retaining the prefixes in the messages array, possibly because if the prefixes fail to be stripped, they always end up as part of the user prompts, not previous LLM responses. So failing to get it right is not a critical problem.

To address the case of rehydrating older chat files correctly, if we absolutely must strip the prefixes correctly we can store the prefix strings in the file as Org properties/local variables. I'm not sure this is required, and in any case it doesn't have to be part of this PR.

Will address the question in my next message. |
IIUC, you haven't yet updated |
gptel.el
Outdated
(add-text-properties
 beg end
 (pcase value
   ('response '(gptel response front-sticky (gptel)))
See my comment about using (beg end) instead of (beg end response).
The tool-use API looks like this right now: the ….

For tool calls, it is called with a list of the form

((tool args closure) ...)

where ….

For tool results, it is called with a list of the form

((name args result) ...)

where ….
For your use case, this API needs to be redesigned because you need the tool call id along with the result. We should keep the above goals in mind when making the API change.

One option I considered before is an explicit mention of the purpose of the call, i.e. send to the callback

(:tool-call ((tool args closure) ...))

for a tool-call confirmation/run, and

(:tool-result ((tool args result) ...))

for a tool result. I judged this as an unnecessarily complex structure before (a plist of a list of lists). Same for

(tool-call . ((tool args closure) ...)) ;and
(tool-result . ((tool args result) ...))

These are too verbose because most consumers of …. See the documentation for other types of ….
We want something at this level of simplicity for tool calls and tool results. |
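To make the current shape concrete, here is a hedged sketch of a consumer looping over a ((tool args closure) ...) list. In gptel, tool is actually a gptel-tool struct; the plist below is a stand-in to keep the sketch self-contained, and the names are illustrative, not gptel's internals:

```elisp
(defun my/run-tool-calls (tool-calls)
  "Confirm and run each (TOOL ARGS CLOSURE) triple in TOOL-CALLS.
TOOL is assumed here to be a plist with :name and :function slots,
ARGS a list of positional arguments, and CLOSURE a continuation
that resumes the request when called with the tool's result."
  (dolist (call tool-calls)
    (pcase-let ((`(,tool ,args ,closure) call))
      ;; Ask the user before running each tool call.
      (when (y-or-n-p (format "Run tool %s with args %S? "
                              (plist-get tool :name) args))
        (funcall closure (apply (plist-get tool :function) args))))))
```

The point of the shape is visible here: the consumer needs nothing beyond destructuring the triple, which is why keeping it this simple matters.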
We also need to add new tests, but maybe that can wait until the design is final? |
Technically it's broken for big chats already. When the day comes that someone needs this, we can use file storage. Add an option for the property to be a string. The string names a file at … (in the no-littering case).
It's not something I would stress over. I agree stripping can be very valuable in a few cases. An empty line can be ignored as a user message instead of treated blindly like a turn in the chat. In that case, I no longer need to mark some lines as ignore. The ignore property can be re-hydrated from ….
Aight, I think I will have to asynchronously answer this part. |
Losing self-contained files is not acceptable. At that point we can switch to using a TOML frontmatter block or something in Markdown files instead, the equivalent of Org properties. But anyway, this is not currently a problem.
Not sure what you mean by "stress over it". The current approach of stripping the prefixes via text search is simple and sometimes wrong, but that's me not stressing over it. I'd like to avoid explicitly marking prefixes as ignored text for the mentioned reasons:
Adding the ignore property to the tool call block decorations is fine 👍 |
Empty/whitespace lines should already be stripped, IIRC. |
I meant to confirm that not stressing over it was already correct. My reasoning was that any use case for keeping prefixes in sync is extremely niche and can be done by simple text replace if the user really needs. |
If I'm reading this right, you're debating if you should serialize the bounds of |
Avoid serialization, yes. However, instead of marking them I think I can filter them on the frontends easily enough. That seems more proper since they're an artifact of each frontend anyway. In both markdown and org, stripping these block headers + footers is just a linear scan going forward if we escape the block contents. It's more robust in the case that someone munges a block. |
Sure. As I see it the less we need to serialize the more robust the persistence feature remains.
Hmm, this requires going all-in into using a buffer copy for parsing. We're currently doing this for Org because of …. We've paid the price for Org already, but I'm hoping to avoid doing this everywhere else.
Well, one of your original arguments was that using intervals (text-properties/overlays) is both more efficient (…) ….
If someone messes up a block, aren't text properties more robust? If you're using …. On the other hand, if you're text-matching ….
I propose a hybrid solution:
|
If we're talking about 100MB contents, I could see this being heavy, but especially where we don't even activate the mode, buffer copying is idiomatic rather than an anti-pattern. It's super cheap and in many cases much more straightforward. We have something like 5GB/s available on this kind of workload. Try this function out on a typical buffer:

(defun clone-and-report ()
  (interactive)
  (let ((source (current-buffer))
        (beg (point-min))
        (end (point-max))               ; copy the whole buffer
        (conses cons-cells-consed)
        (start (float-time))
        (created))
    (dotimes (_ 100)
      (let ((buffer (generate-new-buffer (buffer-name) t)))
        (set-buffer buffer)
        (insert-buffer-substring source beg end)
        (push buffer created)))
    (let ((new-conses cons-cells-consed)
          (now (float-time)))
      (message "Old: %s New: %s Delta: %s"
               conses new-conses (- new-conses conses))
      (message "Start: %s End: %s Elapsed: %s"
               start now (- now start)))
    (mapc #'kill-buffer created)        ; side effects only, no result list
    (let ((new-conses cons-cells-consed)
          (now (float-time)))
      (message "After killing - Old: %s New: %s Delta: %s"
               conses new-conses (- new-conses conses))
      (message "After killing - Start: %s End: %s Elapsed: %s"
               start now (- now start)))))

I ran this on one of my org buffers and it was 4ms and 1858 conses, many of which are used to run the command and print the results. It's way, way faster than human speed and doesn't gum up the GC. There are reasons to do other things, like only copying a region from the buffer, but in general buffer-to-buffer copy is one of the fastest, cheapest, lowest-residue things in Emacs. |
I take your point. 👍 Independent of this, did you think about the other arguments (robustness to text munging, interval trees vs string/regexp search) for painting the tool call block decorators with the ignore property and avoiding text parsing? |
Unless we save them until restoring. Hmm... Btw, linear scans are O(n): they grow with buffer size. Parsing without the need to backtrack is also O(n) in time. Realistically it's only heavy in Org because of the overhead of push-down parsing and deciding on and scooping up every single kind of element into the result.
Without the id of the tool call, the API is incomplete. Possibly the llm package is incomplete and needs to make a breaking change here. Without ids, I'd say it's broken to begin with. I'll look around, but I definitely like the idea of any state machine callback receiving an explicit indication of state, either through a method name or a leading symbol.
Whatever the case, since the feature is young, I'd be in favor of an aggressive breaking change to get to a good place early in its life, one that is easy to understand. Let's not make it backwards compatible but convoluted. |
After looking through a bit more, I don't think any …. It wouldn't be the worst idea to add a TYPE argument to the callback, but I just don't think it solves enough here. The user's callback is about as simple as it can get unless a separate …. I need to throw some time at this tomorrow. Too sleepy to write anything smart. |
I don't follow. I meant that painting the decorators with … |
Interval tree lookups are O(log n), so they stay cheap as the buffer grows. |
I disagree, the tool call id is an internal communication detail. There's no reason the callback/API consumer needs to be aware of some random UUID. Your case is special since you want to encode the messages array as the buffer in a lossless way. (See my note at the end.)
The concern isn't backwards compatibility, as there is no "backwards" yet. We are at the start line. The concern is the complexity of and confusion in the ….

For tool call confirmation/run:

((#[gptel-tool name function args description category ...] args closure) ...)

where ….

For tool results:

((:name ... :args ... :result ... :id ... :arguments ... :function ...) ...)

Except for …. (The structure of the plist above isn't even uniform across backends, by the way. For OpenAI the above has both ….)

Compare with

((#[gptel-tool name function args description category ...] args closure) ...)

for a tool call confirmation/run and

((name args result) ...)

for a tool result, keeping in mind that the latter is going to be much more common in uses of …. These two are much more distinct.
But this is what's happening right now. In any case, the user can switch backends at any time so whatever tool id you store has to work for all backends. So I have a hypothesis: modulo some template or prefix like |
Agree.
I snipped a great deal of distracting reading because I need more answers from reading / writing code. Here is another great deal of less distracting reading. 🥲

While not for now, I think backends will need to be able to leak data into the front-end, but in a completely pass-through manner, using a ….

I think it would be better to always pass tool structs to the callback rather than "tool-name". This can later support a ….

The calls & results would be: ….

In this case, while the id is hopefully not backend-specific, I would like to use it and just hope for the best. If it doesn't matter, why replace it with another unique value unless we know such a value will fail less and not more?

How this can actually help backends (and they might not tell us) is that the computation up to the tool call might be cached, leading to faster first tokens. All of the big providers are strongly incentivized to silently cache and use several methods of content cleaning and hashing to avoid re-computing the model state up to the next token.

It does seem better to provide the callback with …
Which part, the round tripping of the backend detail or retaining the tool result in context?
If during restore I'm parsing decorators to re-ignore them, a fat-fingered decorator will not be propertized. Only if we save the properties of ignored content does this advantage hold during persistence. There are other problems with fat fingering, such as leaving behind a result that can no longer be read. Maybe we should mark every tool call as read-only and be done with these user what-if headaches. It's almost never meaningful for them to update the results by hand unless they just delete the entire tool call. I'm going with this. |
I'm not following, but I'll wait for the PR to develop further instead of taking up more of your time on future plans right now.
You still need a whole system written around the tool call ids to accommodate the various APIs. If you've stored ….
I understand the idea in general, but doubt very much tool call ids have anything to do with it. It also depends on where the cache boundary is placed. Tool calls appear in the last two messages in the messages array, unlikely to make much difference.
I like this and agree, except for two things:
Changing the callback calling convention is a no-no. This would make it a backwards-compatibility issue. But it's also unnecessary, since ….

However, looking at integrating the Deepseek model properly is convincing me that the callback is going to have to fulfill an increasing number of duties. In this case, it has to decide how to handle the "reasoning" text block supplied by Deepseek, for which gptel needs to signal that the provided text is a reasoning block. So a dispatch system where …:

tool-call: (tool-call . ((tool args closure) ...)) ; or just (call . ((...)))?
tool-result: (tool-result . ((tool args result) ...)) ; or just (result . ((...)))?
reasoning block: (reasoning . "The user is asking for...")

This is an alternative to running …
The latter, you're trying to map the messages array to the buffer in a one-to-one fashion. I want this for gptel's chat feature, but most other uses of tools don't have this goal. |
Yes, you lose robustness the moment text parsing enters the picture. But it's not all or nothing, as many chats (I'm guessing a majority) are single-use and not persisted to disk, and it works fine for these. Alternatively, you can save the properties of the ignored content too, if we can find some way to shorten the length of the serialized string: perhaps just ….

On that note, tool call and result properties can also be shortened when serialized, to ….

(I must reiterate here that the prompt/response prefixes don't deserve this much care, as they're part of the user prompt and not a cause of auto-mimicry. Even applying …)
I'm imagining the situation where a result is split into two chunks by an errant yank. Or worse, when a tool call (name, args) is split. There needs to be some way to sanitize a parsed tool block. I see a TODO item in your PR about coupling tool calls and results so they appear next to each other in the messages array. In the worst case it should all be parsed and included as regular response text, I guess.
You can't mark a region of the chat buffer …. But how do you ensure that the entire tool call can be deleted while a part of it can't? I can think of a couple of ways to do this; they're super clunky. |
They will. The front-ends are just rudimentary right now. The killer pattern will be having a usually hidden but editable session buffer to store and remix context. When re-writing regions, you want gptel to have seen how other source code works and to remember it for the duration of potentially 2-3 re-write attempts with different prompts. This session buffer can be used across multiple buffers, such as with occur workflows. It can be cloned to start multiple requests in parallel, forked, supports undo barriers and so on. Chat combines the idea of a working buffer and a target buffer. It's kind of like macros with LLMs providing the gummy heuristics and pseudo-programming interface and tools providing the RAG data necessary for contextualizing work with accurate facts.
I'm hesitant because it's a shared mutable object that lives for the entire request lifecycle. It doesn't feel right to throw such a clear function-signature relationship in there. I like your later suggestion of ….
Not the best, but simple enough. Recommend putting a TODO(2.0) on this?
Ah, hell. True. I was off on a tangent imagining there was a tool-call object.
Recommend:
The opposite is true. As I've demonstrated, tool calls are going to wind up heavily interwoven in recursive lookup use cases. We're pretty good at sending an append-only log in many cases. It's good to be consistent.
We can even combine tool calls and results. It might make other problems simpler. I'll take a look. A 2D structure is most efficient here. For enough flexibility in the values:
Fairly simple logic to use this. As for tool bounds that are haphazardly split by the user, I recommend interning them into re-education camps until they stop expecting to pour skittles into the gas tank without something breaking. I'll see what's in the manual about that. Claude Sonnet 3.0 wrote that 🙊 |
Interesting idea -- but what you're describing will essentially reuse the mapping that you're creating in this PR. No one else is going to write this feature from scratch. A big chunk of tool use is going to be for side-effects, and another chunk is going to be one-off queries or queries where retaining the history is not important.
(plist-put info :type 'tool-result)
(funcall (plist-get info :callback) result)
(plist-put info :type nil)
That said, including the type in the …
Yeah. Scales well enough to new tasks too. The way I think about it is that every case looks like this: (response . "response text")
(reasoning . "reasoning text")
(tool-call . ((tool args closure) ...))
(tool-result . ((tool args result) ...))

Only the first case is 90%+ of uses, so we make an exception and directly send ….
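A consumer-side sketch of dispatching on those shapes, purely as a hedged illustration of the proposed convention (not gptel's actual callback contract; the function name is hypothetical):

```elisp
(defun my/gptel-callback (response _info)
  "Handle RESPONSE, either plain text or a tagged cons cell.
The plain-string case is kept unwrapped since it is the common one."
  (pcase response
    ((pred stringp)                      ; 90%+ case: plain response text
     (insert response))
    (`(reasoning . ,text)                ; e.g. Deepseek reasoning blocks
     (insert (propertize text 'face 'shadow)))
    (`(tool-call . ,calls)
     (dolist (call calls)
       (pcase-let ((`(,tool ,args ,_closure) call))
         (message "Tool call: %S with %S" tool args))))
    (`(tool-result . ,results)
     (dolist (result results)
       (pcase-let ((`(,name ,_args ,res) result))
         (message "Tool %s returned %S" name res))))))
```

Note how `pcase` keeps the exception for bare strings cheap: one `pred` clause, with the tagged cases adding no cost to the common path.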
I hope the missing dot is a typo: we want ….

I maintain that including the id here is premature, as almost no one will need it except for us in this PR. (If an API consumer wants a persistent history of tool-calls, they'll just be reusing our UI.) How about picking it up from info instead?

(plist-get info :tool-use)

contains all the ids, as well as other details you might need.
Not quite -- the cache miss will only be for a fraction of the buffer each time, you can run through the experiment. But we might as well store the original id, yeah.
No strong opinions here, except that a slight extension of the current system is about the same, since we don't need to indicate …:

(beg end)       (beg end)
(beg end tc id) (beg end tr id)
(beg end i)     (beg end i)
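A hedged sketch of re-applying properties from such compact entries during restore. The entry layout follows the shapes above, but the `gptel` property values and the `gptel-id` property name are assumptions for illustration, not gptel's actual serialization:

```elisp
(defun my/bounds-restore (bounds)
  "Re-propertize buffer regions from compact BOUNDS entries.
Each entry is (BEG END) for a response, (BEG END i) for ignored
text, or (BEG END tc|tr ID) for a tool call / tool result."
  (dolist (entry bounds)
    (pcase entry
      (`(,beg ,end)
       (put-text-property beg end 'gptel 'response))
      (`(,beg ,end i)
       (put-text-property beg end 'gptel 'ignore))
      (`(,beg ,end tc ,id)
       (add-text-properties beg end (list 'gptel 'tool-call 'gptel-id id)))
      (`(,beg ,end tr ,id)
       (add-text-properties beg end (list 'gptel 'tool-result 'gptel-id id))))))
```

The single-letter tags keep the serialized string short while still letting restore recover the full property set from one `pcase`.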
To Emacs users a buffer is a sandbox, not a gas tank. So it's perfectly reasonable from their perspective to pour skittles into it. That it represents a structured array of LLM conversation turns is unfortunately our problem to handle. |
(defun gptel--modification-allow-ask (_beg _end)
  "Ask before allowing user to edit tool region."
  (if (y-or-n-p "Warning! Editing GPTel tool could break next request!
Do you know what you're doing? Be honest, bro.")
      (save-excursion
        ;; Search for the same hook symbol we installed below.
        (when (text-property-search-backward
               'modification-hooks '(gptel--modification-allow-ask) t)
          (when-let ((found (text-property-search-forward
                             'modification-hooks
                             '(gptel--modification-allow-ask) t)))
            (let ((inhibit-modification-hooks t))
              (remove-text-properties (prop-match-beginning found)
                                      (prop-match-end found)
                                      '(modification-hooks nil
                                        insert-in-front-hooks nil
                                        insert-behind-hooks nil)))
            (message "Region modification protection removed."))))
    (user-error "Edit aborted. You did the right thing.")))

(defun gptel--modification-undo-ask (beg end)
  "Ask user to undo edits to tool region."
  (unless (y-or-n-p "Warning! Inserting into GPTel tool could break next request!
Do you know what you're doing? Be honest, bro.")
    (let ((inhibit-modification-hooks t))
      (delete-region beg end)
      (message "Insertion deleted"))))

(defun gptel--modification-protect (beg end)
  "Install interactive modification protection on BEG..END."
  (add-text-properties
   beg end
   '(modification-hooks (gptel--modification-allow-ask)
     insert-in-front-hooks (gptel--modification-undo-ask)
     insert-behind-hooks (gptel--modification-undo-ask))))

This is what I spiked out to prevent fat fingering while allowing editing if the user insists. It is more complex than setting `read-only`. Instead, just adding `read-only` to a region is pretty nearly ideal. It allows killing. It's just a bit tricky to edit unless the user is industrious enough to write some Elisp. |
This isn't the right way; it makes the new frontend test pass and demonstrates the issue.
Made the new frontend test green. Wouldn't be surprised if that breaks some of the other tests that rely on prompt & response prefixes getting into the context to match the eld data. Priority was giving us a solid target to aim at. I'm about to crash. More notes in my final commits. I can do the rebase tomorrow, but it's very much 10am work, not 11pm work. I stress that the tradeoff is between more features and tests that can smoke out bugs faster, and a pristine git history that's not actually being read (maybe a good job for more tools!). |
There are multiple issues about this. I'm going to provide an option to strip property blocks from headings before sending for those who want it, but I'm not convinced yet that it should be the default, as I would like to abide by What You See is What is Sent. There is no evidence so far that the presence of the properties blocks confuses LLMs in any noticeable way. |
Some evidence shows that context length alone tends to degrade performance, independent of content. I'd argue that anything that appears like syntax creates much more specific patterns of activation that are more likely to confuse downstream attention. We can only suss this out with probabilistic techniques like cross validation. It's more like network engineering. Any thoughts on rebase versus more tests and propagating the changes to other backends? I'd be happy to support an experimental branch and just collect the necessary backend changes while pulling in changes from master. IMO an experimental branch would get more popular with programmers and both accelerate work and reduce bugs in stable releases. |
1. gptel--trim-prefixes returns nil whenever a string is empty after trimming (for blank strings or prompt-only strings, this will occur)
2. all models check for a non-nil after-trimming string before appending messages to their parts
3. all models trim the whole buffer using simple string-trim when response tracking is off
4. zero-length multi-part vectors are also skipped
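The trim-to-nil behavior described in points 1–3 could be sketched like this (hypothetical helper name; the real gptel--trim-prefixes also strips the configured prompt/response prefixes, which this sketch omits):

```elisp
(defun my/trim-to-nil (s)
  "Return S trimmed of surrounding whitespace, or nil if nothing remains.
Returning nil instead of \"\" lets callers skip the message with a
simple truthiness check before appending it to the parts."
  (let ((trimmed (string-trim s)))
    (unless (string-empty-p trimmed) trimmed)))

;; (my/trim-to-nil "  hello ") => "hello"
;; (my/trim-to-nil "   ")      => nil
```

Backends can then guard with (when-let ((msg (my/trim-to-nil text))) ...) rather than comparing against the empty string.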
This was an opportunistic update, but it seems this model is obviously out of use and will be abandoned soon
I updated all backends to support vanishing messages (by declining to push them to the message parts). I suppose this could result in empty message lists and backend rejections when the user has absolutely nothing in context. I can handle that if any backends care. This prepares for a by-inspection tool-use support update for the other backends.
There might be a mismatch in the JSON structure of the initial tool call and subsequent tool turns as encoded in tool blocks.
I did a rough pass on Gemini's tool support. Left a note regarding the raw JSON of the initial tool result and the need to splice it back into Elisp if we want to send nested JSON results in both cases. It seems like Gemini, at least, is expecting structure rather than a string of a structure. I'm going to check how this works on the OpenAI backend. Going to pause work on this branch and just get some testing in by actually doing things.
I create feature branches for specific features. The |
Okay, I've done a partial review covering most things except the tool handling itself. The changes mostly make sense. Will look at the tool call handling when I have more time.
@@ -277,35 +277,30 @@ TOOL-USE is a list of plists containing tool names, arguments and call results."
(goto-char (previous-single-property-change
            (point) 'gptel nil (point-min)))
(not (= (point) prev-pt)))
;; HACK Until we can find a more robust solution for editing
Please retain this HACK text?
It's no longer relevant? The change means empty messages are no longer sent.
Empty messages were not sent before either, when I installed the comment.
Not sending whitespace messages is the hack. Whitespace messages should not be created by editing response regions in the first place, but they are because of some limitations with text properties. This hack is how we work around it, so the message is relevant.
The string trimming is a bug. By fixing string trimming to return nil for a string that collapses to an empty string, and then not sending it, the behavior is correct. It's not a hack; it's just what should happen.
This is unrelated to string trimming, and has to do with the creation of whitespace regions, not their mitigation. Please add the text back as I need to fix the issue in the future. If you are interested in understanding the problem you can read the issues linked to in the comment.
"Remove prompt/response prefixes from string S. | ||
Return nil if string collapses to empty string." | ||
(let ((trimmed (string-trim | ||
(string-trim s) |
Why the double string-trim? The inner string-trim can cause this operation to fail when the prefix strings have spaces, such as when the response string is set to "Response: " (with the trailing space).
Revert to regular string trim.
Signed-off-by: Psionik K <[email protected]>
rebase later.
trim this so that blank strings will return nil
Signed-off-by: Psionik K <[email protected]>
I'll rework the string trim tomorrow. The general vision is to trim to nil instead of empty (blank will be empty already). The nil is very natural to check with ….

A possible silly corner case being introduced is if all messages are zero-length and the backend is one that fails when there are no messages. This can only happen if there's a buffer full of prefixes and no actual contents, so it's obvious to the user. We can't say anything clearer than the backend error message in that case, so I recommend just letting the API error message through.

Big stuff:
I debated this more. I think it's the right goal. Until there's a branch on gptel, nobody can rely on the branch in this PR to base off of. Trying to mostly just rebase and do QA from here. I think the behavior is about right. I probably broke bounds persistence. Will fix that in the morning. |
Signed-off-by: Psionik K <[email protected]>
The intent here was to make the initial tool result and the results sent from buffer contents identical. The OpenAI doc says a tool result can be an array, but I got an API error saying it expected an object, so I guess it must be an array of objects. Earlier I had removed a string coercion because it can lose type information: a string that could be a symbol name and a symbol are indistinguishable after coercion. At length, I don't think this matters. Sending escaped strings as literals to the model is probably the bigger crime. The Anthropic API might reveal something about the right choice. I can hardly read the Gemini docs. R1 docs say string only.
Can I continue looking at the code or should I wait for this? |
I just did a bunch of linting, I think you'll need to rebase on master before making more changes on your branch to avoid headaches. Additionally, the |
Draft
And does not pass tests yet. Should be close to passing. Just sleepy and wanted to give PST a chance to take a look at approaches before I finish up.
One of my commits almost vanished during merge and became whitespace only.
overlay, re-hydrate overlays when opening file