
Ability to specify a schema for structured output for models that support it #776

Closed
simonw opened this issue Feb 26, 2025 · 18 comments
Labels
design enhancement New feature or request

Comments


simonw commented Feb 26, 2025

Eventually this leads to tool support, but for the moment I'm going to start with the ability to provide a schema and have supporting models use that to return JSON.

Related:

Relevant PR:

@simonw simonw added design enhancement New feature or request labels Feb 26, 2025

simonw commented Feb 26, 2025

I just fully upgraded to Pydantic v2 in preparation for this work:


simonw commented Feb 26, 2025

Eventually I'd like this feature to express itself in a bunch of different ways, including an llm extract ... command for performing structured data extraction into JSON/CSV/SQLite.

To start with though I think the simplest version is the ability to pass a --schema to the llm prompt command, which is then used (for supporting models) to enforce a JSON output shape.


simonw commented Feb 26, 2025

I built a prototype of this which seems to work quite well - I got it working against OpenAI, Anthropic and Gemini. Pushing that to a branch now.


simonw commented Feb 26, 2025

Here's what it took to get that working in llm-anthropic:

diff --git a/llm_anthropic.py b/llm_anthropic.py
index 2bf7650..ff2a96c 100644
--- a/llm_anthropic.py
+++ b/llm_anthropic.py
@@ -201,6 +201,7 @@ class _Shared:
     can_stream = True
 
     supports_thinking = False
+    supports_schema = True
     default_max_tokens = 4096
 
     class Options(ClaudeOptions): ...
@@ -348,6 +349,13 @@ class _Shared:
             if "thinking" in kwargs:
                 kwargs["extra_body"] = {"thinking": kwargs.pop("thinking")}
 
+        if prompt.schema:
+            kwargs["tools"] = [{
+                "name": "output_structured_data",
+                "input_schema": prompt.schema,
+            }]
+            kwargs["tool_choice"] = {"type": "tool", "name": "output_structured_data"}
+
         return kwargs
 
     def set_usage(self, response):
@@ -374,8 +382,13 @@ class ClaudeMessages(_Shared, llm.KeyModel):
             with messages_client.stream(**kwargs) as stream:
                 if prefill_text:
                     yield prefill_text
-                for text in stream.text_stream:
-                    yield text
+                for chunk in stream:
+                    if hasattr(chunk, 'delta'):
+                        delta = chunk.delta
+                        if hasattr(delta, 'text'):
+                            yield delta.text
+                        elif hasattr(delta, 'partial_json'):
+                            yield delta.partial_json
                 # This records usage and other data:
                 response.response_json = stream.get_final_message().model_dump()
         else:

llm-gemini was slightly harder because it turns out Gemini doesn't accept the full JSON schema spec, just a subset of it. This worked:

diff --git a/llm_gemini.py b/llm_gemini.py
index e3572f6..800e6e2 100644
--- a/llm_gemini.py
+++ b/llm_gemini.py
@@ -1,3 +1,4 @@
+import copy
 import httpx
 import ijson
 import llm
@@ -79,10 +80,27 @@ def resolve_type(attachment):
     return mime_type
 
 
+
+def cleanup_schema(schema):
+    "Gemini supports only a subset of JSON schema"
+    keys_to_remove = ("$schema", "additionalProperties")
+    # Recursively remove them
+    if isinstance(schema, dict):
+        for key in keys_to_remove:
+            schema.pop(key, None)
+        for value in schema.values():
+            cleanup_schema(value)
+    elif isinstance(schema, list):
+        for value in schema:
+            cleanup_schema(value)
+    return schema
+
+
 class _SharedGemini:
     needs_key = "gemini"
     key_env_var = "LLM_GEMINI_KEY"
     can_stream = True
+    supports_schema = True
 
     attachment_types = (
         # Text
@@ -226,6 +244,12 @@ class _SharedGemini:
         if prompt.system:
             body["systemInstruction"] = {"parts": [{"text": prompt.system}]}
 
+        if prompt.schema:
+            body["generationConfig"] = {
+                "response_mime_type": "application/json",
+                "response_schema": cleanup_schema(copy.deepcopy(prompt.schema)),
+            }
+
         config_map = {
             "temperature": "temperature",
             "max_output_tokens": "maxOutputTokens",
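Pulled out of the diff, that recursive cleanup is self-contained and easy to sanity-check on its own (the sample schema here is just illustrative):

```python
import copy


def cleanup_schema(schema):
    "Gemini supports only a subset of JSON schema"
    keys_to_remove = ("$schema", "additionalProperties")
    # Recursively remove them
    if isinstance(schema, dict):
        for key in keys_to_remove:
            schema.pop(key, None)
        for value in schema.values():
            cleanup_schema(value)
    elif isinstance(schema, list):
        for value in schema:
            cleanup_schema(value)
    return schema


original = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "dogs": {
            "type": "array",
            "items": {"type": "object", "additionalProperties": False},
        }
    },
    "additionalProperties": False,
}

# Deep-copy first, as in the plugin, so the caller's schema is untouched
cleaned = cleanup_schema(copy.deepcopy(original))
print("$schema" in cleaned)  # False
print("additionalProperties" in cleaned["properties"]["dogs"]["items"])  # False
```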


simonw commented Feb 26, 2025

I tested these using dogs.schema.json:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "dogs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "minLength": 1
          },
          "bio": {
            "type": "string",
            "minLength": 1
          }
        },
        "required": ["name", "bio"],
        "additionalProperties": false
      }
    }
  },
  "required": ["dogs"],
  "additionalProperties": false
}

Here's what I got:

llm --schema "$(cat dogs.schema.json)" 'invent three dogs' -m gpt-4o-mini | jq
{
  "dogs": [
    {
      "name": "Biscuit",
      "bio": "A cheerful golden retriever with a love for frisbees and belly rubs. Biscuit is known for his friendly nature and his uncanny ability to sense when someone is feeling down. He often brings over his favorite toy to cheer them up."
    },
    {
      "name": "Pixel",
      "bio": "A small, tech-savvy pug with a knack for getting into trouble. Pixel loves to explore and has a particular fondness for shiny objects, often stealing keys and chargers around the house. Despite his mischief, he is adored for his adorable snorts and playful antics."
    },
    {
      "name": "Stormy",
      "bio": "A majestic Siberian husky with striking blue eyes and a spirited personality. Stormy is an adventurous soul who loves to howl along with the wind. He enjoys long runs in the snow and has a special bond with his human, always by their side during hikes and outdoor adventures."
    }
  ]
}
llm --schema "$(cat dogs.schema.json)" 'invent three dogs' -m claude-3.7-sonnet | jq
{
  "dogs": [
    {
      "name": "Buddy",
      "bio": "Buddy is an energetic 3-year-old Golden Retriever who loves swimming and playing fetch. He's known in the neighborhood for his friendly demeanor and his ability to catch frisbees mid-air. Buddy volunteers as a therapy dog at the local children's hospital on weekends."
    },
    {
      "name": "Luna",
      "bio": "Luna is a clever 5-year-old Border Collie with striking blue eyes. She excels at agility competitions and can solve puzzle toys in record time. When not herding sheep at her family's farm, Luna enjoys cuddling on the couch and watching nature documentaries."
    },
    {
      "name": "Max",
      "bio": "Max is a charming 7-year-old Dachshund with a playful personality. Despite his short legs, he's surprisingly fast and loves chasing squirrels in the park. Max is also a talented digger and has an impressive collection of buried toys in the backyard. His favorite food is peanut butter."
    }
  ]
}
llm --schema "$(cat dogs.schema.json)" 'invent three dogs' -m gemini-2.0-flash | jq
{
  "dogs": [
    {
      "bio": "A playful and energetic Jack Russell Terrier mix with a love for chasing squirrels and learning new tricks. He's always up for an adventure and brings joy to everyone he meets.",
      "name": "Sparky"
    },
    {
      "bio": "A gentle and intelligent Golden Retriever with a calm demeanor and a talent for retrieving. She loves to swim and is always eager to please her human companions.",
      "name": "Honey"
    },
    {
      "bio": "A quirky and independent French Bulldog with a goofy personality and a knack for making people laugh. He enjoys naps, short walks, and cuddling on the couch.",
      "name": "Pickles"
    }
  ]
}


simonw commented Feb 26, 2025

In the longer run I'd like not to have to remember and then author JSON schema syntax to use this feature - but LLMs are already great at writing schemas, so it's not a huge pain yet.

I think I'll focus on nicer ways to do that when I design and implement the llm extract command. That isn't needed to ship this though.


simonw commented Feb 26, 2025

Interestingly it's possible to make any model support this feature using prompt engineering. I think I'll leave that for people to implement using templates combined with extract: true though.

Quick demo of that:

llm -s 'Extract data matching this schema and return it as JSON in a fenced code block. {
  "type": "object",
  "properties": {
    "dogs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "minLength": 1
          },
          "bio": {
            "type": "string",
            "minLength": 1
          }
        },
        "required": ["name", "bio"],
        "additionalProperties": false
      }
    }
  },
  "required": ["dogs"],
  "additionalProperties": false
}' --extract --save dogs

Then:

llm -t dogs 'invent three dogs' -m gpt-4o

Output:

{
  "dogs": [
    {
      "name": "Baxter",
      "bio": "Baxter is a playful Golden Retriever who loves swimming and fetching tennis balls. He has a knack for making everyone smile with his goofy antics and warm, wagging tail."
    },
    {
      "name": "Luna",
      "bio": "Luna is a gentle Shetland Sheepdog with a love for herding and agility courses. Her intelligence and alert nature make her an excellent companion and protector."
    },
    {
      "name": "Shadow",
      "bio": "Shadow is a friendly Labrador Retriever known for his loyalty and love of adventure. Whether it's hiking in the mountains or playing at the park, he's always ready for the next journey."
    }
  ]
}

And (local model):

llm -t dogs 'invent three dogs' -m mlx-community/Llama-3.2-3B-Instruct-4bit

Output (it got creative and added extra properties):

[
  {
    "name": "Max",
    "bio": "Max is a playful and energetic golden retriever who loves playing fetch and going on long walks. He's always up for an adventure and is a loyal companion.",
    "age": 3,
    "breed": "Golden Retriever",
    "weight": 70,
    "favoriteToy": "Tennis Ball",
    "favoriteActivity": "Playing Fetch"
  },
  {
    "name": "Luna",
    "bio": "Luna is a calm and gentle beagle who loves snuggling up on the couch and taking naps. She's a bit of a homebody, but loves going on short walks and exploring the neighborhood.",
    "age": 5,
    "breed": "Beagle",
    "weight": 40,
    "favoriteToy": "Squeaky Toy",
    "favoriteActivity": "Snuggling"
  },
  {
    "name": "Rocky",
    "bio": "Rocky is a tough and energetic bulldog who loves playing rough-and-tumble games like tug-of-war and wrestling. He's a bit of a roughneck, but has a soft spot for belly rubs and treats.",
    "age": 2,
    "breed": "Bulldog",
    "weight": 60,
    "favoriteToy": "Tug-of-War Rope",
    "favoriteActivity": "Playing Tug-of-War"
  }
]


simonw commented Feb 26, 2025

Other concerns to solve:

  • Logging the schema that was used in a database (ideally without duplicating it too many times)
  • Docs on how to use this
  • Automated tests
  • Docs on how to write plugins that support schemas

It would be nice to have at least one local plugin that supports this too. llm-gguf is most likely as it has some level of grammar support in the underlying libraries - as far as I can tell mlx-lm doesn't have any grammar / JSON schema support yet.
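On the first bullet - logging without duplication - one approach is to key a schemas table on a content hash of the canonicalized schema, so identical schemas are stored once and responses reference them by ID. A minimal sketch of that idea (the `schema_id`/`log_schema` names and table layout are illustrative, not necessarily what LLM ends up doing):

```python
import hashlib
import json
import sqlite3

def schema_id(schema):
    # Canonicalize (sorted keys, no whitespace) so equivalent
    # schemas always hash to the same ID
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE schemas (id TEXT PRIMARY KEY, content TEXT)")

def log_schema(schema):
    sid = schema_id(schema)
    # INSERT OR IGNORE means repeated schemas are stored only once
    db.execute(
        "INSERT OR IGNORE INTO schemas (id, content) VALUES (?, ?)",
        (sid, json.dumps(schema)),
    )
    return sid  # a responses row would store this ID, not the full schema

a = log_schema({"type": "object", "properties": {"name": {"type": "string"}}})
b = log_schema({"properties": {"name": {"type": "string"}}, "type": "object"})
print(a == b)  # True - key order doesn't matter
print(db.execute("SELECT COUNT(*) FROM schemas").fetchone()[0])  # 1
```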


simonw commented Feb 26, 2025

I'd like to have a llm schemas command family, similar to llm templates, which lets you save schemas with aliases which you can later pass to llm prompt --schema X - but I won't build that just yet.

I'd also like it if --schema could take a '{block of JSON}' or a file path or an alias or even perhaps a URL, like -a does.
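The resolution logic for that could be quite small - roughly: inline JSON if the value starts with `{`, otherwise try it as a file path, then as a saved alias. A hypothetical sketch (`resolve_schema` and its signature are made up for illustration; URL support is left out):

```python
import json
from pathlib import Path

def resolve_schema(value, aliases=None):
    """Resolve a --schema argument: inline JSON, a file path, or a saved alias."""
    aliases = aliases or {}
    if value.lstrip().startswith("{"):
        return json.loads(value)             # a '{block of JSON}'
    path = Path(value)
    if path.exists():
        return json.loads(path.read_text())  # a file on disk
    if value in aliases:
        return aliases[value]                # a previously saved alias
    raise ValueError(f"Could not resolve schema: {value}")

print(resolve_schema('{"type": "object"}'))
print(resolve_schema("dogs", aliases={"dogs": {"type": "object"}}))
```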


simonw commented Feb 26, 2025

I'm going to ship this as an alpha so I can ship some plugins that use it as alphas too.


simonw commented Feb 26, 2025

Idea: build out a small collection of example schemas which live in this repository and which plugins are expected to be able to handle.

These could even be bundled with the LLM package itself (not hidden in tests/ or docs/) such that plugins depending on this package can use them in their own automated tests.

See:


simonw commented Feb 26, 2025

Idea: llm logs --schema X search option which figures out the ID of the schema you pass in and then returns all matching responses.

I could add options to llm logs for outputting just the JSON response data - then you could run the same schema a bunch of times to collect data and dump that back out from the logs as CSV or JSON later on.

Could even have a --key option or similar so that if you logged a bunch of responses that were an object with a single key containing an array of objects those objects can be dumped out flattened together somehow.
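That flattening step is simple in principle: pull the named array out of each logged response object and concatenate the rows. A hypothetical sketch (`flatten_responses` is an illustrative name, not a real LLM API):

```python
import json

def flatten_responses(responses, key=None):
    """Given logged JSON response strings, optionally pull out a single
    array key from each and concatenate the rows - ready for CSV export."""
    rows = []
    for raw in responses:
        data = json.loads(raw)
        rows.extend(data[key] if key else [data])
    return rows

# Two logged responses from the same schema, flattened into one list
logged = [
    '{"dogs": [{"name": "Biscuit"}, {"name": "Pixel"}]}',
    '{"dogs": [{"name": "Stormy"}]}',
]
print(flatten_responses(logged, key="dogs"))
# [{'name': 'Biscuit'}, {'name': 'Pixel'}, {'name': 'Stormy'}]
```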

See:


simonw commented Feb 26, 2025

I should make it so templates can have schemas saved to them, which get persisted in the YAML.

That way you could do this:

curl $URL | llm -t extract-headlines

Combined with the logging ideas above, it might be worth having a --silent option for llm prompt which causes the response to be logged but not output.

See:

simonw added a commit that referenced this issue Feb 26, 2025
simonw added a commit that referenced this issue Feb 27, 2025
…#777)

Refs #776

* Implemented new llm prompt --schema and model.prompt(schema=)
* Log schema to responses.schema_id and schemas table
* Include schema in llm logs Markdown output
* Test for schema=pydantic_model
* Initial --schema CLI documentation
* Python docs for schema=
* Advanced plugin docs on schemas

simonw commented Feb 27, 2025

OK, this is landed on main - I'm going to ship an alpha.

@simonw simonw closed this as completed Feb 27, 2025
simonw added a commit to simonw/llm-anthropic that referenced this issue Feb 27, 2025
simonw added a commit to simonw/llm-anthropic that referenced this issue Feb 27, 2025

simonw commented Feb 27, 2025

Now available in llm-anthropic alpha too:

llm install llm-anthropic==0.15a0

Then:

llm --schema '{
  "type": "object",
  "properties": {
    "dogs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "bio": {
            "type": "string"
          }
        }
      }
    }
  }
}' -m claude-3.7-sonnet 'invent three surprising dogs' | jq
{
    "dogs": [
        {
            "name": "Professor Whiskers",
            "bio": "Despite being a dog, Professor Whiskers has a peculiar fascination with cats. He meows convincingly, grooms himself like a feline, and prefers to nap in cardboard boxes. His doctoral thesis on string theory was surprisingly well-received by the scientific community."
        },
        {
            "name": "Quantum Barkley",
            "bio": "Quantum Barkley appears to exist in multiple places simultaneously. His owners have documented him sleeping in his bed while simultaneously being spotted stealing treats from the kitchen. Physicists are currently studying him as the first macroscopic example of quantum superposition."
        },
        {
            "name": "Sir Woofs-A-Lot",
            "bio": "Sir Woofs-A-Lot is the only known dog who doesn't bark - he instead speaks fluent French with a distinct Parisian accent. He works part-time as a wine critic and has an uncanny ability to predict stock market trends. His Instagram account dedicated to his beret collection has over 2 million followers."
        }
    ]
}

simonw added a commit to simonw/llm-gemini that referenced this issue Feb 27, 2025
simonw added a commit to simonw/llm-gemini that referenced this issue Feb 27, 2025

simonw commented Feb 27, 2025

... and in llm-gemini:

llm install llm-gemini==0.13a0

Then:

llm --schema '{
  "type": "object",
  "properties": {
    "dogs": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "bio": {
            "type": "string"
          }
        }
      }
    }
  }
}' -m gemini-2.0-flash 'invent three spanish dogs' | jq
{
  "dogs": [
    {
      "name": "Rayo Español"
    },
    {
      "name": "Trueno Ibérico"
    },
    {
      "name": "Sol de Castilla"
    }
  ]
}


simonw commented Feb 27, 2025

I'm going to start a milestone for the rest of this work.
