BLOOM Refactor #141

Closed · wants to merge 25 commits

Conversation

@danforbes (Contributor) commented Apr 16, 2023

Code refactor and BLOOM model implementation. This is an attempt to take the best parts of #85 from @hhamud and #125 from @iacore.

@philpax (Collaborator) commented Apr 16, 2023

Nice work! I really like the traits; I think this is definitely the direction to go (although I suspect there'll be some change here as we accommodate more eccentric models - but that's OK!)

Here's the feedback I had while looking through:

  • Rename llama-rs to llm-base, so it's clear what the relationship between bloom, llama and llm-base is. If possible, try to avoid users of bloom and llama having to bring in llm-base themselves (i.e. re-export any relevant types).
  • Expose a llm crate that includes both bloom and llama. I'd suggest feature-gating these, but specifying them as default features.
  • Consolidate the two CLIs into llm-cli, as a client crate to llm.
  • It doesn't seem like you've used most of the code from the standalone loader (#125)? That PR introduces a more generic way to load models of any GGML format and handles the other formats. It's not done yet, but it results in less code duplication and should be more versatile.
  • The Model::Model associated type is redundant, no? That would just be Self.
  • Model::Layer / the Layer types shouldn't be public; they're an implementation detail of the model
  • Document the changes made to GGML, to make it clearer that we've patched it with some new code (in CREDITS.txt)
  • The fields of InferenceSession, Vocabulary and other similar types were crate-public to ensure that users couldn't accidentally meddle/depend on the internals. Is making them public necessary?
  • Prefer using &dyn Model instead of &impl Model; monomorphisation is unnecessary here and comes at a cost if you use multiple models (see the sketch after this list).
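
To illustrate the last point, here's a minimal sketch; `Model` and `InferenceSession` are simplified stand-ins for the crate's real trait and session type, and `Llama` is a hypothetical concrete model:

```rust
pub struct InferenceSession;

pub trait Model {
    fn start_session(&self) -> InferenceSession;
}

struct Llama; // hypothetical concrete model

impl Model for Llama {
    fn start_session(&self) -> InferenceSession {
        InferenceSession
    }
}

// Generic version: the compiler emits a separate copy of this function
// for every concrete model type, growing code size with each model.
fn run_generic(model: &impl Model) {
    let _session = model.start_session();
}

// Trait-object version: a single compiled copy serves every model.
fn run_dynamic(model: &dyn Model) {
    let _session = model.start_session();
}

fn main() {
    let model = Llama;
    run_generic(&model);
    run_dynamic(&model);
}
```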

Can you also credit @hhamud and @iacore in your PR description?

@danforbes (Contributor, Author)

Thank you very much for the encouragement and detailed feedback! TBH I'm not a very experienced Rust programmer, nor do I know a ton about LLMs, so I'm very happy for the suggestions. I wasn't able to make a ton of sense out of #125, so I may need some more details there, but let me see what I can do.

@philpax (Collaborator) commented Apr 16, 2023

It's possible that we can get this going as-is, and then use BLOOM as the secondary test case for #125. Let's see how this looks after we've gotten through the other feedback 🚀

@danforbes (Contributor, Author)

Okay, I think I have made as many changes as I feel comfortable with...thanks again for the great suggestions! Other contributors can feel free to add their own commits to this branch (just let me know if you need perms). Some thoughts:

  • I've tested this a little locally and inference for both BLOOM and LLaMA models actually seems messed up 😅 There is text produced, but it is garbled.
  • I renamed llama-rs to llm-base, but I'm not sure if I re-exported the right types.
  • I didn't feel comfortable making the changes to the loader, which I think precludes me from moving forward with your suggestions around the llm and llm-cli crates.
  • I'm not sure what sort of changes need to be made to CREDITS.txt
  • When I tried using dyn instead of impl for Model, the compiler wanted me to spell out the associated type(s), so I wasn't sure how to implement that properly.

Thanks again and I'm happy to take another stab at more improvements if you have more suggestions 💯

@hhamud (Contributor) commented Apr 16, 2023

I've tested this a little locally and inference for both BLOOM and LLaMA models actually seems messed up 😅 There is text produced, but it is garbled.

You can try to print out the logit vectors of both models and compare them with the original C implementations in llama.cpp and bloom.cpp, or even just test the outputs of the models using the same prompts; if they are approximate enough, it should be fine.

Though I wonder if it is possible to implement a CI job that downloads weights and measures the difference in logit output for the same prompt (cosine similarity?), passing if it is within a specified range.
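
As a rough sketch of what that check could compute (illustrative only; the logit values and tolerance here are made up):

```rust
/// Cosine similarity between two logit vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Hypothetical logit dumps from the Rust port and the C reference.
    let rust_logits = [0.10f32, 0.90, -0.30];
    let c_logits = [0.11f32, 0.88, -0.29];
    // The CI check would pass if the implementations agree within a tolerance.
    assert!(cosine_similarity(&rust_logits, &c_logits) > 0.999);
}
```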

@philpax mentioned this pull request Apr 16, 2023
@danforbes (Contributor, Author)

@hhamud FWIW, I also received garbled output with a LLaMA 13B model on the current main branch, so I don't necessarily think it's a regression...I'm not sure if it's an existing bug or user error.

@philpax (Collaborator) commented Apr 18, 2023

Sorry about the delay on addressing your latest updates, I'll have a look after work soon!

@philpax mentioned this pull request Apr 19, 2023
@philpax (Collaborator) commented Apr 19, 2023

Hey! If all goes well, #125 will be landing soon. I'll take care of merging it into this PR in the next day or two and addressing the feedback, if that's OK with you?

@philpax mentioned this pull request Apr 20, 2023
@hhamud mentioned this pull request Apr 20, 2023
@danforbes (Contributor, Author)

All good 💯

@KerfuffleV2 (Contributor)

@danforbes Hey, just wondering. Did you already try to get a pull request with your alibi operation accepted upstream to the ggml or llama.cpp repo?

There's been some related discussion over here: 6b2765d#commitcomment-110389270

philpax added 2 commits Apr 26, 2023: "This disables BLOOM support for now."
@KerfuffleV2 (Contributor) commented Apr 26, 2023

I have also started

Better watch out, you're on the path to having an attention span as short as mi—

TBH the code was written by @hhamud. Maybe they would care to merge upstream?

It would be a lot nicer to get it merged than to have to maintain a custom GGML version indefinitely. Also, the llama.cpp/ggml repo gets a lot of pull requests so it's reasonably likely it could take a couple weeks to get something merged even if it was started today.

If no one else wants to do it, I can (once it's definitely working), but it would be better if the people who were actually involved were the ones who submitted the pull. (Not just because I'm lazy, but probably at least 50% that reason.) edit: There's currently a lot on my plate; I probably won't have the time to take something like this on.

@philpax (Collaborator) commented Apr 26, 2023

@philpax I have also started a branch for GPT2/Cerebras loading and inference. It crashes loading the vocabulary ATM. However, it produces garbled text.

Nice work! I would suggest not racing too far ahead with the model support; I'd like to make sure all of the ones we have now work well and share as much as they can. (Especially around the actual inference logic - I wonder how much can be shared?)

That being said, if it gets done and it works, it would be nice to support. Keep us posted 🚀

It would be a lot nicer to get it merged than to have to maintain a custom GGML version indefinitely. Also, the llama.cpp/ggml repo gets a lot of pull requests so it's reasonably likely it could take a couple weeks to get something merged even if it was started today.

If no one else wants to do it, I can (once it's definitely working), but it would be better if the people who were actually involved were the ones who submitted the pull. (Not just because I'm lazy, but probably at least 50% that reason.)

Yeah, I agree. The functions came from bloomz.cpp, which is not being developed at present, so I don't think there's any immediate pressure from anyone to get it into GGML.

If you want to tackle it, that'd be appreciated, but I think we're all just as lazy as you 😅

Review thread on llm-base/src/loader.rs (outdated, resolved)
@danforbes (Contributor, Author)

Keep us posted 🚀

I think I've gotten it working...the problem was a silly user error when calculating the position tensor during evaluation. Let me know if you'd like me to pull the gpt crate into this PR.

@philpax (Collaborator) commented Apr 27, 2023

I'd suggest opening a separate PR, just so that we can focus on getting this one in first.

@KerfuffleV2 (Contributor)

Does this currently run? I think I figured out a way to do this without a patched GGML (assuming the alibi op is the only thing that was patched).

What you could do is generate an F32 tensor with the (j + 1) * m_k calculation at https://github.com/rustformers/llama-rs/blob/c608b4b81b53fe331989c896af7b60df23bdc28e/ggml-sys/ggml/ggml.c#L8651 and just add it to the other one. This wouldn't even require the map operations. Or am I missing something?
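
In plain Rust, filling that buffer might look roughly like this (a sketch based on the referenced ggml.c code; the slope formula is transcribed by hand and the names are illustrative, so treat it as a starting point rather than a drop-in replacement):

```rust
/// Build the ALiBi bias values ((j + 1) * m_k per head) as a flat f32
/// buffer that could back an F32 tensor and be added to the attention
/// scores with the stock add op, instead of a patched GGML op.
fn alibi_bias(n_head: usize, seq_len: usize) -> Vec<f32> {
    // Largest power of two <= n_head, mirroring n_heads_log2_floor in ggml.c.
    let n_head_log2 = 1usize << (usize::BITS - 1 - n_head.leading_zeros());
    let m0 = 2f32.powf(-8.0 / n_head_log2 as f32);
    let m1 = 2f32.powf(-4.0 / n_head_log2 as f32);
    let mut bias = vec![0.0f32; n_head * seq_len];
    for k in 0..n_head {
        // Per-head slope, as in the referenced implementation.
        let m_k = if k < n_head_log2 {
            m0.powi(k as i32 + 1)
        } else {
            m1.powi(2 * (k - n_head_log2) as i32 + 1)
        };
        for j in 0..seq_len {
            bias[k * seq_len + j] = (j + 1) as f32 * m_k;
        }
    }
    bias
}

fn main() {
    // e.g. a 16-head model; print the first head's first few bias values.
    let bias = alibi_bias(16, 4);
    println!("{:?}", &bias[..4]);
}
```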

@philpax (Collaborator) commented Apr 27, 2023

For BLOOM evaluation, there is some difference here: https://github.com/NouamaneTazi/bloomz.cpp/blob/main/main.cpp#L667-L676 ➡️ https://github.com/danforbes/llama-rs/blob/dfo/model/bloom/bloom/src/lib.rs#L303-L312

Hm, I know that we've messed around with that a bit in the LLaMA model, but I don't think it should cause the behaviour we're seeing here?

The best thing to do would be to run inference for a single token with both bloom and bloomz.cpp and then compare the logits, but that'll require some more concerted debugging.

Does this currently run?

It runs, but does not produce any output.

Or am I missing something?

Wouldn't know - didn't try! That sounds reasonable. The alibi op was indeed the only thing that was patched - if it can be replicated without a new operation, that'd be awesome.

You could try reimplementing it and seeing if the logits match what's produced now? If they do, we can switch to that / remove the patched GGML and then get back to figuring out what the issue is.


I bet the issue's something incredibly simple; I just haven't put the debugging time in.

@KerfuffleV2 (Contributor)

It runs, but does not produce any output.

I can't really test a different version of the alibi op if it doesn't produce output. :)

Making something that runs but produces no output is probably pretty easy even without correctly reproducing that operation's functionality...

So I'll wait until it's actually functioning before I try to do that.

@philpax (Collaborator) commented Apr 27, 2023

Yes, that's why I suggested comparing the logits 😉

As long as the logits are the same, the overall computation should be identical, and we can continue as-is to identify the issue. If they aren't, then we know it isn't a drop-in replacement.

That being said, no stress if you want to wait until it's properly done.

@philpax (Collaborator) commented Apr 27, 2023

Just leaving this as a note for either me or you to deal with: rename Model -> KnownModel, ErasedModel -> Model
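
For reference, a rough sketch of the split that rename implies (simplified; the method and types here are placeholders, not the crate's actual signatures):

```rust
/// Statically-typed trait: keeps associated types, used by code that
/// knows the concrete architecture at compile time.
trait KnownModel {
    type Layer; // implementation detail, varies per architecture
    fn evaluate(&self);
}

/// Object-safe trait: associated types erased, so callers can hold a
/// plain Box<dyn Model> without naming the architecture.
trait Model {
    fn evaluate(&self);
}

// Blanket impl: every KnownModel is automatically usable as a dyn Model.
impl<T: KnownModel> Model for T {
    fn evaluate(&self) {
        KnownModel::evaluate(self)
    }
}

struct Bloom; // hypothetical concrete model

impl KnownModel for Bloom {
    type Layer = ();
    fn evaluate(&self) {
        println!("evaluating");
    }
}

fn main() {
    // Callers hold any model behind the erased trait.
    let model: Box<dyn Model> = Box::new(Bloom);
    model.evaluate();
}
```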

danforbes added a commit to danforbes/ggml that referenced this pull request Apr 27, 2023
@danforbes (Contributor, Author)

Hm, I know that we've messed around with that a bit in the LLaMA model, but I don't think it should cause the behaviour we're seeing here?

This is the only difference I was able to find between the implementation in this repository and what is currently in the main branch of bloomz.cpp. I have added some commented-out code that shows what happens when I try to bring this codebase in line with the upstream...I've tried a few different permutations to get things to work, but I haven't had much luck.

https://github.com/danforbes/llama-rs/blob/e35f93b642b6e1fc9975c22de938b08451d77d97/bloom/src/lib.rs#L302-L320

@danforbes (Contributor, Author)

@philpax would you be opposed to me doing some refactoring with respect to the ggml crates? I'd like to combine them into one crate and split up the current ggml/src/lib.rs file into context.rs, tensor.rs, etc.
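
For illustration, the split might look something like this in the combined crate's lib.rs (a layout sketch only; the module list beyond context and tensor is hypothetical):

```rust
// lib.rs of the combined ggml crate: one thin module per concept,
// re-exported at the crate root so the public API stays unchanged.
mod context;
mod tensor;

pub use context::Context;
pub use tensor::Tensor;
```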

@philpax (Collaborator) commented Apr 27, 2023

Keep ggml-sys separate, but the rest is all yours - I was planning on merging the loader into the ggml crate myself, so if you want to get a start on that, go for it!

@philpax (Collaborator) commented Apr 27, 2023

Hm, I know that we've messed around with that a bit in the LLaMA model, but I don't think it should cause the behaviour we're seeing here?

This is the only difference I was able to find between the implementation in this repository and what is currently in the main branch of bloomz.cpp. I have added some commented-out code that shows what happens when I try to bring this codebase in line with the upstream...I've tried a few different permutations to get things to work, but I haven't had much luck.

https://github.com/danforbes/llama-rs/blob/e35f93b642b6e1fc9975c22de938b08451d77d97/bloom/src/lib.rs#L302-L320

Were you able to get bloomz.cpp working with the same model?

@danforbes (Contributor, Author)

Were you able to get bloomz.cpp working with the same model?

Yes, and I tried to compare the logits but I got overwhelmed 😅 Are you opposed to maybe jumping on a Jitsi call or some other kind of web conference sometime to work on this together? I feel like we could get a lot accomplished that way...

@danforbes (Contributor, Author)

@philpax I'd like to make a suggestion to get this across the line. What if we pull out the non-functional BLOOM crate and replace it with the GPT2 implementation I have sitting in another branch? There's been a lot of good development on this branch and I think it's high time we merged it in. I'd also suggest pulling in the GGML refactor, since @setzer22 seems on board with the approach I've taken. What do you think?

@danforbes mentioned this pull request Apr 30, 2023
@danforbes (Contributor, Author)

Superseded by #162

@danforbes closed this Apr 30, 2023
@danforbes deleted the dfo/model/bloom branch Apr 30, 2023 21:42