
Fine-tuning Axial RoPE with frequency scaling? #32

Open
tasansal opened this issue Aug 26, 2024 · 0 comments

tasansal commented Aug 26, 2024

Hi @lucidrains

We have trained a 3D ViT masked autoencoder using axial RoPE at an image size of 512x512x512 (3D scientific images, sampled from much larger volumes). Now I want to try fine-tuning the pre-trained model for a larger context size (i.e. 1024x1024x1024). However, it isn't obvious how, and I am especially unclear on how to calculate the scale for axial RoPE correctly. Important note: we are not resizing the images; we tile the larger image with these "mini-cubes", so going up in size means we have more context.

I would love to hear your feedback on how to do this properly. Below is my thought process (and please correct me where I am wrong!).

Normally, with 1D RoPE, we have theta_rescale_factor, which changes the freqs in RoPE directly. However, when freqs_for is set to pixel, the theta parameter isn't used to build the freqs. That is probably fine, since we don't have a single sequence and instead reuse the [-1, 1] coordinate range for each axis.
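To make my mental model concrete, here is roughly how I understand the two frequency modes (a plain-Python sketch with my own function names, not the library's actual code; the `rescale ** (dim / (dim - 2))` exponent is the NTK-style adjustment I believe theta_rescale_factor applies):

```python
import math

def lang_freqs(dim, theta=10000.0, rescale=1.0):
    # 'lang' mode: geometric frequencies from theta; theta_rescale_factor
    # stretches the rotations by inflating theta (NTK-style adjustment)
    theta *= rescale ** (dim / (dim - 2))
    return [1.0 / theta ** (i / dim) for i in range(0, dim, 2)]

def pixel_freqs(dim, max_freq=10.0):
    # 'pixel' mode: linearly spaced frequencies, meant to be applied over a
    # normalized [-1, 1] coordinate grid; theta (and therefore
    # theta_rescale_factor) never enters the computation here
    n = dim // 2
    return [(1.0 + (max_freq / 2 - 1.0) * i / (n - 1)) * math.pi for i in range(n)]
```

If that reading is right, theta_rescale_factor has no effect at all in pixel mode, which is why I'm unsure what the analogous knob for extending context would be.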

Anyhow, assuming the above is fine: for axial RoPE we apply apply_rotary_emb instead of rotate_queries_and_keys. It looks like rotate_queries_and_keys does use get_scale to compute the scale and applies it to q and k separately. But if caching is disabled, is the scale hard-coded to be 1?
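For reference, here is roughly what I understand get_scale to be doing, based on my reading of the xPos-style code (again a plain-Python sketch with my own names, not the library's implementation; scale_base=512 is an assumption on my part):

```python
def xpos_scale(dim, seq_len, scale_base=512):
    # per-dimension base in (0, 1), raised to a position-dependent power
    # centered on the middle of the sequence (power 0 -> scale 1 there)
    base = [(i + 0.4 * dim) / (1.4 * dim) for i in range(0, dim, 2)]
    powers = [(pos - seq_len // 2) / scale_base for pos in range(seq_len)]
    return [[b ** p for b in base] for p in powers]

# q at position m is multiplied by scale[m], k at position n by scale[n] ** -1,
# so the q.k dot product picks up base ** (p_m - p_n): a decay that depends
# only on the relative distance m - n (which I assume is the point of Q2 below)
```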

Q1: Would it make sense to implement the same logic as rotate_queries_and_keys for the axial variant?

Q2: Maybe an ignorant question, but why scale q by scale and k by scale**-1?

Q3: Is it OK to apply the scale directly using apply_rotary_emb and then fine-tune the model?

Q4: Is the scaling linear in the size of the dimension change? I.e., if I double the resolution, should the scale be 2.0 in that direction? Or do we need to account for diagonal distances, etc., in the N-D case?
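To illustrate what I mean by "linear": under a position-interpolation view, I'd just squeeze the new grid's coordinates back into the trained range with one independent factor per axis (a sketch under my own assumptions; interpolate_coords is a hypothetical helper, not library API):

```python
def interpolate_coords(new_size, trained_size):
    # map the new grid's [-1, 1] coordinates back into the fraction of that
    # range the model was trained on; one independent factor per spatial axis
    scale = trained_size / new_size  # e.g. 0.5 when going 512 -> 1024
    return [scale * (-1.0 + 2.0 * i / (new_size - 1)) for i in range(new_size)]
```

Since axial RoPE rotates each axis's channel group with that axis's coordinate only, my guess is that no diagonal/N-D correction is needed, but I'd appreciate confirmation.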

Q5: Is there any write-up (paper, pre-print, etc.) about axial RoPE?

I may be completely off and need to understand the logic better. If that's the case, I would appreciate any help!

@tasansal tasansal changed the title What is the right way to fine tune Axiale RoPE with frequency scaling? What is the right way to fine tune Axial RoPE with frequency scaling? Aug 26, 2024
@tasansal tasansal changed the title What is the right way to fine tune Axial RoPE with frequency scaling? Fine-tuning Axial RoPE with frequency scaling? Aug 26, 2024