Hi @lucidrains
We have trained a 3D ViT masked autoencoder using axial RoPE at an image size of 512x512x512 (3D scientific images, sampled from much larger volumes). Now I want to try fine-tuning the pre-trained model for a larger (i.e., 1024x1024x1024) context size. However, it isn't obvious how to do this. I am especially unclear on how to calculate the `scale` for axial RoPE correctly. An important note: we are not resizing the images; we tile the larger image with these "mini-cubes", so going up in size means we have more context.
I would love to hear your feedback on how to do this properly. Below is my thought process (and please correct me where I am wrong!).
Normally, with 1D RoPE, we have `theta_rescale_factor`, which changes `freqs` in RoPE directly. However, when `freqs_for` is set to `'pixel'`, the `theta` parameter isn't used to build `freqs`, which is probably fine since we don't have a single sequence and instead reuse the [-1, 1] range per axis.
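To show what I mean, here is a paraphrased sketch of how I understand the two modes build `freqs` (my reading of the source, not the library's exact code; variable names are mine):

```python
import torch
from math import pi

dim, theta, max_freq = 64, 10_000, 10

# freqs_for='lang': theta shapes the spectrum, so theta_rescale_factor
# has a direct effect on the resulting frequencies
lang_freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

# freqs_for='pixel': a linear spectrum scaled by pi; theta is never
# touched, so theta_rescale_factor does nothing in this mode
pixel_freqs = torch.linspace(1.0, max_freq / 2, dim // 2) * pi
```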
Anyhow, assuming the above is fine, we apply axial RoPE with `apply_rotary_emb` instead of `rotate_queries_and_keys`. It seems that `rotate_queries_and_keys` does use `get_scale` to calculate the scale and applies it to q and k separately. But if caching is disabled, is the scale hard-coded to 1?
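For reference, this is roughly the xPos-style scaling I believe `rotate_queries_and_keys` performs via `get_scale` (a paraphrased sketch from my reading of the source, not a verbatim copy). If I understand correctly, multiplying q by `scale` and k by `scale**-1` means the `q @ k.T` logits pick up a relative factor `scale ** (m - n)` that depends only on the position offset, which is the xPos length-extrapolation trick:

```python
import torch

def xpos_scale(seq_len: int, dim: int, scale_base: float = 512.0) -> torch.Tensor:
    # per-dimension base scale in (0, 1], as in the xPos paper
    base = (torch.arange(0, dim, 2).float() + 0.4 * dim) / (1.4 * dim)
    # positions centered around the middle of the sequence
    power = (torch.arange(seq_len).float() - seq_len // 2) / scale_base
    scale = base ** power.unsqueeze(-1)       # (seq_len, dim // 2)
    return torch.cat((scale, scale), dim=-1)  # (seq_len, dim)

scale = xpos_scale(seq_len=1024, dim=64)
# q gets scale, k gets its inverse:
# rotated_q = apply_rotary_emb(freqs, q, scale=scale)
# rotated_k = apply_rotary_emb(freqs, k, scale=scale ** -1)
```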
Q1: Would it make sense to implement the same logic as `rotate_queries_and_keys`, but for the axial variant?
Q2: Maybe an ignorant question, but why scale q with `scale` and k with `scale**-1`?
Q3: Is it OK to apply the scale directly via `apply_rotary_emb` and then fine-tune the model?
Q4: Is the scaling linear in the size of the dimension change? I.e., if I double the resolution, should the scale be 2.0 in that direction? Or do we need to account for diagonal distances etc. in the N-D case? (See the coordinate sketch after Q5 for what I mean.)
Q5: Is there any write-up (paper, pre-print, etc.) about axial RoPE?
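To make Q4 concrete: with `freqs_for='pixel'` the axial positions live on a normalized grid, so going from 512 to 1024 per side could mean two different things. A sketch of the two options I am weighing (grid construction paraphrased, not the library's code):

```python
import torch

old_side, new_side = 512, 1024

# Option A: keep the [-1, 1] range -> positions are interpolated, and
# every rotation frequency is effectively halved per token
interp_coords = torch.linspace(-1, 1, new_side)

# Option B: widen the range in proportion to the resolution (here [-2, 2])
# -> per-token spacing matches pretraining, but positions extrapolate
extend = new_side / old_side
extrap_coords = torch.linspace(-extend, extend, new_side)
```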
I may be completely off and need to understand the logic better. If that's the case, I would appreciate any help!