Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Cranelift sizeless vector types #19

Merged

Conversation

sparker-arm
Copy link
Contributor

Copyright (c) 2021, Arm Limited.

Copyright (c) 2021, Arm Limited.
- The new types don't report themselves as vectors, so ty.is\_vector() = false, but are explicitly reported via ty.is\_sizeless\_vector().
- The TypeSet and ValueTypeSet structs gain a bool to represent whether the type is sizeless.
- TypeSetBuilder also gains a bool to control the building of those types.
- At the encoding level, a bit is used to represent whether the type is sizeless, this bit has been taken from the max number of vector lanes supported, so they'd be reduced to 128 from 256.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take it from the special types range instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in latest version.

The proposal is to introduce new sizeless vector types into Cranelift, that:
- Express a vector, with a lane type and size, but a target-defined number of lanes.
- Are denoted by prefixing the lane type with 'sv' (sizeless vector): svi16, svi32, svf32, etc...
- Have a minimum width of 128-bits, meaning we can simply map to existing simd-128 implementations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Riscv allows smaller vectors, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, apparently so, but the flexible vector spec does not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the flexible vectors specification might introduce the requirement that the underlying vector length is a power of 2, even though the Arm SVE and the RISC-V V extension, I suppose, do not have similar restrictions. Whether our IR should cover more than the common use cases (such as the Wasm flexible vectors, power-of-2 vector sizes, etc.) is something to discuss.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that supporting more general expressivity at the CLIF level is a good thing to establish early on if we plan to do it at all, or else we'll bake in a lot of assumptions to code that operates on these types, over time. So non-power-of-2 sizes is something to include (and test for) early. It does seem valuable to me to support this if underlying hardware commonly does.

On the other hand, we don't want to come up with a design that requires handling difficult edge-cases; emulating smaller vector support than the fundamental unit seems like a good example of that. (Edge-cases if one were to try to emulate with bigger ops involve e.g. ensuring loads/stores don't run beyond the end of the heap, and having smaller alignments than usual.) So scaling down to 64 bits seems like an unreasonable goal just because one possible future platform supports it. So, I'd tentatively suggest a 128-bit minimum size.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The main design decision is to have a vector type that isn't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such. Although setting the minimum size of the opaque type to 128 bits still allows us to infer the minimum number of lanes for a given lane type. The flexible vector specification also provides specific operations for lane accesses and shuffles so these aren't operations that we have to handle with our existing operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will operations like iadd work just like with regular vectors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think most polymorphic instructions can just be extended for sizeless types too. I've currently implemented: iadd, isub, imul, fadd, fsub and splat.

@bjorn3
Copy link
Contributor

bjorn3 commented Dec 17, 2021

What new instructions will be introduced for creating, loading and storing such sizeless vectors?

@sparker-arm
Copy link
Contributor Author

sparker-arm commented Dec 17, 2021

What new instructions will be introduced for creating, loading and storing such sizeless vectors?

I think we may need to add stack related instructions, since we have explicit ones, though I'm still not completely sure why. But I expect that the main changes will be encapsulated in new heap_addr and stack_addr operations. We could probably modify the existing ones, but new ones would help isolate the ambiguous semantics of the sizeless nature.

@cfallin
Copy link
Member

cfallin commented Dec 17, 2021

One thing to note related to the stack: both stack address computation (stack_addr) and regalloc spilling depend on the frame layout, which we compute by knowing the size of all types. How does the prototype currently work wrt stack layout -- does it assume some maximal size? Or are the types fixed to some (implementation-defined) size at some point between the CLIF generation and lowering?

@abrown
Copy link
Contributor

abrown commented Dec 17, 2021

cc: @penzn

@sparker-arm
Copy link
Contributor Author

sparker-arm commented Dec 20, 2021

@cfallin I've only just started looking into stack handling, so I don't have any concrete answers for you. For the backend part, my hand wavey plan was to try and do what is done in LLVM for SVE, where fixed-size and sizeless slots are assigned in different areas, using different pointers and the runtime value of 'VL' can be used to scale offsets into sizeless areas. I've only just looked at the machinst ABI layer though, and I will continue with that this week.

At an IR level, the slots would be as sizeless (minimum 128-bits, not sure about maximum) as any other sizeless value. But, as I don't know the IR well, this is the main contention that I am concerned about :) From my basic understanding, it looks like it would be easiest to create sizeless-specific instructions to handle anything stack related, so that we can preserve the current semantics for fixed-size objects, and avoid having to worry about immediate offsets into a slot of unknown size. But I'm also assuming that the different slots could be freely mingled because they're just SSA values...? Please feel free to point me at any areas of the code which you think could be problematic.

If the ambiguous size of objects is going to be a serious problem for us, then there is the option of forcing a lowering to a fixed size, though this could be sub-optimal for architectures like SVE when we're not functioning as a JIT.

@cfallin
Copy link
Member

cfallin commented Dec 20, 2021

@sparker-arm Thanks for the clarifications! I think that we can definitely find a way to accommodate variable-sized types; we just need to consider the problem explicitly and make sure to fix the places it breaks any assumptions :-)

A quick-and-dirty way of getting a feel for that would be to make ty.bits() panic when called on a variable-sized type, then run whatever vector tests updated to use the sizeless vectors. I expect that the ABI code is going to be large part of that. Note that it's not just explicit user loads/stores that we have to worry about: the regalloc, for example, assumes it can spill any register (including the new sizeless-vector ones) and needs to know how large of a spillslot to allocate.

Just for clarity -- I'm not actually sure I could say for certain despite being deep into this conversation! -- when exactly is the vector size made concrete? Do we ultimately compile the code while knowing the size (e.g., we decide we're compiling for microarchitecture X, and so we choose for vectors to be 512-bits wide)? Or are we actually generating code that will only know at runtime? I had been assuming the former, but your mention of computed offsets, etc., makes me think possibly it's the latter.

If we ultimately know the size statically, then it's just a phase-ordering/staging problem: we might need to rework how some of the ABI code operates, but it's not fundamentally incompatible with our model. If the latter, then it sounds like it's actually alloca() of sorts. We know what we need to do to make this work but it's a bit tricky, especially if we're going to have lots of spillslots.

@bjorn3
Copy link
Contributor

bjorn3 commented Dec 20, 2021

It determined at runtime. For arm I believe it is fixed for each cpu, while for riscv the code can ask for any length (up to a certain maximum) and then the minimum of what the user requested and the cpu supports is chosen as far as I understand. The whole point is to make executables agnostic to the vector size supported by the cpu and thus not require recompilation when a new cpu is released with bigger vectors.

@penzn
Copy link

penzn commented Dec 21, 2021

Just for clarity -- I'm not actually sure I could say for certain despite being deep into this conversation! -- when exactly is the vector size made concrete? Do we ultimately compile the code while knowing the size (e.g., we decide we're compiling for microarchitecture X, and so we choose for vectors to be 512-bits wide)? Or are we actually generating code that will only know at runtime? I had been assuming the former, but your mention of computed offsets, etc., makes me think possibly it's the latter.

It should be the former, in a sense that it would be known what SIMD length a given CPU supports and then the code can be generated with the right instructions, and ideally without dynamic handling when the generated code executes. "Determined at runtime" is from the point of view of the developer, not wasm runtime.

@bjorn3
Copy link
Contributor

bjorn3 commented Dec 21, 2021

That would only work for jit compilation and not aot compilation. With aot compilation it may not be known what cpu it runs on.

@sparker-arm
Copy link
Contributor Author

worry about: the regalloc, for example, assumes it can spill any register (including the new sizeless-vector ones) and needs to know how large of a spillslot to allocate.

Do we actually need to know the real size though? Or does it just need to understand there's some scaling factor for any register/spill slots that hold a sizeless type..? And, as we've discussed before, the new register allocator will have to accept and understand differences between the fixed and sizeless registers and how they potentially alias.

when exactly is the vector size made concrete?

The whole point is to make executables agnostic to the vector size supported by the cpu and thus not require recompilation when a new cpu is released with bigger vectors.

it would be known what SIMD length a given CPU supports and then the code can be generated with the right instructions, and ideally without dynamic handling when the generated code executes

@penzn I thought it was still a possibility that the flexible spec would support setting the length at runtime?

Either way, both these options are available, and will be both be implemented. For architectures that support runtime vector lengths, we can generate agnostic code (SVE uses the 'VL' runtime variable for most of this) while Neon, AVX-2 and AVX-512 will have to choose a size during codegen. I'm currently implementing SVE and Neon, in tandem, to support for these types, and my idea is, still, that we'd have a legalisation pass that can convert sizeless to fixed sizes which would enable all the SIMD compatible backends to add (basic) support with little or no effort.

@sparker-arm
Copy link
Contributor Author

A quick-and-dirty way of getting a feel for that would be to make ty.bits() panic when called on a variable-sized type

And thanks for this @cfallin

@cfallin
Copy link
Member

cfallin commented Dec 21, 2021

Do we actually need to know the real size though? Or does it just need to understand there's some scaling factor for any register/spill slots that hold a sizeless type..? And, as we've discussed before, the new register allocator will have to accept and understand differences between the fixed and sizeless registers and how they potentially alias.

Right, it can be made to work; it's just a bit of an API change that we'll need to design and coordinate. Right now both regalloc.rs and regalloc2 are built around the notion that they manage spill space (in units of slots) and can ask how many slots a given value will need. We'll need to have some sort of notion of a separate user-managed spill space for type/regclass X.

Re: stack layout more generally: none of this is impossible, but it's going to need some careful design rework in the ABI code. Specifically, right now the ABI implementation is mostly shared between aarch64, x64 and s390x, and on all architectures, FP (RBP) is kept at the top of the frame, just below return address/stack args, and used to access stack args; and SP (RSP) is kept at the bottom of the frame, as required (redzone notwithstanding). All stackslots and spillslots are accessed via offsets from SP.

If part of the frame is variable, we'll need to either start accessing the fixed part via negative offsets from FP, or move FP below the fixed part and above the variable part, or keep another base register. The third is not great (extra register pressure). The first was how my original aarch64 ABI impl worked, but we switched from negative-from-FP to positive-from-SP since the encoding is more efficient. So we're left with the second, which is I think how other compilers also handle variable frame sizes on aarch64 (?). The issue with that is that we share the generic ABI code with x64, and on x64, we have to keep RBP at the top of the frame in the Windows Fastcall ABI (EDIT: and so we always do, for uniformity). Also relevant: there's support for omitting frame pointer setup when unneeded (in leaf functions at least); that interacts with this decision too.

So we can definitely work this out, but we'll need to probably add "modes" to the ABI: where is FP, from which base register are (i) args, (ii) fixed slots, (iii) variable slots accessed, how unwind info is emitted correctly in all cases, etc. Lots of interacting moving parts.

All of the above is needed for e.g. alloca() support as well, and is not unique to runtime-variable-sized types, but this may be the thing that forces the need first; so, thanks for pioneering the trail :-)

@cfallin
Copy link
Member

cfallin commented Dec 21, 2021

Ah, another option I missed: variable part in the middle, fixed part at the bottom of the stack frame; then FP at the top and used to access args, as today, and SP at the bottom with fixed (independent of variable sizes) offsets for normal stackslots/spillslots. I think that satisfies all the constraints we have today (but not alloca(); the thing that makes VST-slots "weaker" than alloca in requirements is that we can know the size at prologue time rather than throughout the function body).

@penzn
Copy link

penzn commented Dec 21, 2021

@penzn I thought it was still a possibility that the flexible spec would support setting the length at runtime?

@sparker-arm, there is an idea to support setting 'current' length at runtime, RISC-V style, but the maximum available was always meant to be determined before, in the spirit of straight-forward compilation support. Doing something like that isn't impossible, but would most likely require dynamic dispatch, and was considered out of scope.

That would only work for jit compilation and not aot compilation. With aot compilation it may not be known what cpu it runs on.

@bjorn3, valid point - how does AOT compilation currently handle target features?

@bjorn3
Copy link
Contributor

bjorn3 commented Dec 21, 2021

There is a list of target features. When doing AOT compilation you can choose a list of allowed target features. The produced executable will then run on all cpus supporting all these target features. This list is generally chosen very conservatively to maximize the amount of cpus it works on. For example only sse and sse2 are enabled by default by rustc on x86_64. This allows at most 128 bit vectors despite the fact that modern cpus support 512 bit vectors through avx512. You can enable avx (for 256 bit vectors) or avx512 (for 512 bit vectors) but then it won't run on older cpus. Arm's SVE however allows you to compile once with the SVE target feature enabled and then it will use eg 512 bit vectors on cpus that support them while retaining compatibility with cpus that only support 128 bit vectors.

@penzn
Copy link

penzn commented Dec 22, 2021

It sounds like the same would work here, when the user requests AVX instructions, they are going to get 256-bit vectors, SSE - 128-bit and so on.

@sparker-arm
Copy link
Contributor Author

sparker-arm commented Dec 22, 2021

variable part in the middle, fixed part at the bottom of the stack frame; then FP at the top and used to access args, as today, and SP at the bottom with fixed (independent of variable sizes) offsets for normal stackslots/spillslots.

@cfallin I think you're generally describing how AArch64 is currently handling this, with third register used in the presence of dynamic objects : https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp

@penzn Does the flexible spec state that the size of all the flexible types are the same? I had assumed not because of the various type.length operations, or are these available to just be more user friendly?

@penzn
Copy link

penzn commented Dec 23, 2021

@penzn Does the flexible spec state that the size of all the flexible types are the same? I had assumed not because of the various because of the various type.length operations, or are these available to just be more user friendly?

Yes, those are there to express things in number of elements instead of bytes, which would be more natural for loops and arrays (also cuts one instruction out, but that isn't a big win really). Spec does not say whether or not different types can have different byte length, but practically hardware length is the same for architectures that use same SIMD registers for different types.

@sparker-arm
Copy link
Contributor Author

Thanks @penzn, though I feel having the size defined in the spec would be very useful in reducing the amount of ambiguity for any target independent parts of the compiler; the case of stack objects is a good example where, at the IR level, we can have homogeneously unsized slot type which can be (re)used by any flexible type. When it comes closer to codegen, it also makes the layout of the frame easier/smaller/efficient as, if they potentially had different sizes, we'd likely have to pad objects to a fixed alignment or have expensive ways of calculating the address of each object.

@sparker-arm
Copy link
Contributor Author

sparker-arm commented Dec 23, 2021

But now I remember that, for SVE, the sizeless registers aren't always equal - the predicate registers are x8 smaller than the data regs...

EDIT: This actually shouldn't matter at the moment, while comparisons produce vector masks in the vector regs, but is something we should consider if predicates types are introduced.


We can add sizeless vector types by modifying existing structs to hold an extra bit of information to represent the sizeless nature:

- The new types don't report themselves as vectors, so ty.is\_vector() = false, but are explicitly reported via ty.is\_sizeless\_vector().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid ambiguity, we should rename is_vector to is_sized_vector if we move forward with this.

@sparker-arm
Copy link
Contributor Author

@cfallin I've spent the last few days playing with the ABI layer, and I think I've stumbled upon (again) the limitation with the RegClass types, specifically while looking at callee saves... according to the AAPCS, in many cases SVE regs are treated the same as Neon but there are cases where the whole 'Z' register needs to be saved, and I'm not sure how to handle this while we just have shared V128 registers for Neon and SVE.

So two questions really: can you think of a way around this, and/or do we need to follow the AAPCS? RegClass is used quite a bit in the ABI layer so I'm assuming there's going to be more problems like this.

@cfallin
Copy link
Member

cfallin commented Jan 6, 2022

Hmm, interesting; can you say more about what factors influence prologue/epilogue register save details? Is it just "if you clobber the upper bits, save them" or something like that? I think it might be reasonable to, as we scan the function body building up the clobber set during ABI processing, look at some information provided by an instruction ("requires special save/restore") and use that info as well.

The basic abstraction at the regalloc layer is that we have disjoint classes of units to allocate; overlapping classes are a major refactor that would require quite a significant investment of time (a month or so in regalloc code, with significant correctness risk); so I think that we likely need to find a way to make this work without two different register classes that mean "Z/vec reg as traditional vec reg" and "Z/vec reg as Z reg". But it's fundamentally a single register, so this doesn't seem wrong to me. In a sense, the way that we use it is type information layered on top of the allocation itself.

@sparker-arm
Copy link
Contributor Author

sparker-arm commented Jan 7, 2022

It is when the routine receives and/or returns Z or predicate registers that the callee needs to save the whole Z register (z8-z23). From what you said though, I guess we could just grab the type information from the instruction so that we don't have to rely on RegClass? I definitely have no interest in trying to bend/break regalloc to the will of SVE, but if we can pass the high-level type information down then I think this will be fine. I still haven't looked at spill/restores though :)

edit: But now from quickly glancing, I see that type information is passed around in the spill/restore APIs so I'm hopeful.

edit: And the only place that I've found which doesn't provide a virtual reg for spill/reload is in regalloc's get_stackmap_artefacts_at function...

@sparker-arm
Copy link
Contributor Author

@cfallin Do you think adding something like 'add_sizeless_def' to RegUsageCollector would be a suitable way to communicate between regalloc and cranelfit? I think this would allow us to convert 'clobbered_registers' to hold a RealReg and bool to represent whether it is sizeless.

@cfallin
Copy link
Member

cfallin commented Jan 7, 2022

edit: But now from quickly glancing, I see that type information is passed around in the spill/restore APIs so I'm hopeful.

Not anymore actually; see the recent fuzzbug fix (and followup issue #3645) where we determined that this type information can be inaccurate in the presence of moves and move elision.

There's a bit of subtlety here: the register allocator should not know about IR types in a non-opaque way; that's an abstraction leak and a correctness risk (what if it tries to do something with that info?). The allocator is a thing that hands out registers; registers are just black boxes that hold bits; modern aarch64 machines have Z registers; Cranelift can decide to put a Z-register-sized value, or a v128-sized value, or an f64-sized value, or anything else into that register. That's a clear delineation of responsibilities and if we blur that line, I fear we could have bigger correctness problems later (similar to the one CVE and another almost-CVE this area has given us when we do blur the line).

The suggested fix in the issue above involves having the regalloc record type info alongside registers, but only to verify that moves are valid; we don't want it to actually peek into that type info.

But, the ABI code is free to reason about how the function body uses registers. So what I imagine could work is that when we generate prologue/epilogue code, we can scan over parameter/return types ("do we take or return a Z reg" condition above) and determine what we actually clobber and what we need to save. If something also depends on whether instructions actually e.g. touch high bits or whatnot, we can scan instructions for that too. But that all stays within Cranelift; it doesn't involve regalloc changes.

It's possible I'm missing some requirement here though -- is there some reason that we can't determine the info we need from scanning the code on the Cranelift side?

@cfallin
Copy link
Member

cfallin commented Feb 22, 2022

Creating a type in IR sounds rather useful, I had no idea that it was possible! Could you point me at an example if we're already doing this for something else?

I don't think we do anything of the sort currently -- it's a new idea :-) But, yes, it certainly seems possible to me. This will change some of the internal signatures, e.g. Type no longer always knows its size -- it could return an Option<usize> from bits() or it could return a TypeSize with Static(usize) and Dynamic(ir::Value) arms, I suppose, and to get the latter it would need the &ir::Function as an argument. But all of this seems reasonable to me at least.

So, at least for SVE, vector_length_reg will return the bit/byte width of either the Z-regs, or the predicates - both of which are scalable but are sized differently. So, I think we need to parameterize our special operation with a vector type, and move away from the notion of a VL register:

function %f() {
  gv0 = dyn_scale.i32x4  ; How many i32x4 vectors can fit in a register?
  dt0 = i32x4 * gv0
  dynslot0 = slot dt0
  
block0:
  v0 = load.dt0 dynslot0
  v1 = iadd.dt0 v0, v0
  ; ...
}

I believe this still provides all the characteristics we're looking for, and should make lowering more efficient with better type info. Does that look okay?

I think so, yes. Just to make sure I understand, the dyn_scale global value will be determined based on a known target microarchitecture/ISA level? Or read/computed from special register(s) in the function prologue?

My one concern is passing a type, instead of a size, to the slot as this is very different to the current way of doing things, but I will trust that it's feasible.

Yes, I think this actually makes more sense overall: associating the slot with a type feels more appropriate than requiring the IR producer to match the size given to the slot entity with the known size of the loads and stores used to access it.

@sparker-arm
Copy link
Contributor Author

I don't think we do anything of the sort currently -- it's a new idea :-) But, yes, it certainly seems possible to me.

Okay :) sounds like I'll be kept busy for a while hacking on the type system!

Just to make sure I understand, the dyn_scale global value will be determined based on a known target microarchitecture/ISA level? Or read/computed from special register(s) in the function prologue?

Yes, in most cases it will just be a constant set in the target backend. If/when we add fully dynamic support for SVE, it will be an instruction or two, in the prologue, that reads the runtime VL.

@sparker-arm
Copy link
Contributor Author

Hi @cfallin, now that I've had some time to revisit the meta level of cranelift, I'm still struggling to see how these dynamically created types would work. It just seems to be too against how types and instructions are currently implemented.

During the parsing we can check that dyn_scale and our dynamic type have been declared with the same base vector type, that's okay...At the IR level, the mechanical bits seem fine too: Types are just u8 values, we reserve some of those for dynamic types, as I have done for the original sizeless types, and these should be named such as I8X16XN. Implementing methods to handle the dynamic addition is not an issue. But this doesn't seem to tie dyn_scale value to the type in the way you would like, these are concrete types and not connected to a global value.

I actually got so lost in the meta layers that I actually can't remember which part made my mind finally fall over :) I think my problem comes when implementing the polymorphic instruction generation, when we want to iterate through concrete types. From the description above, they seem like concrete types (just a u8 with a bit for dynamic) but from the user perspective they are not. In the meta level, I've tried to generate a set of types that have a reference to an IR entitty, which then seems superficial since that entity isn't actually part of the type. Does any of this successfully convey my pain?

@cfallin
Copy link
Member

cfallin commented Mar 1, 2022

@sparker-arm , I'm happy to help work out implementation/prototype details -- do you have a WIP branch you can point me to that demonstrates what you're thinking / what issues you're running into?

Hi @cfallin, now that I've had some time to revisit the meta level of cranelift, I'm still struggling to see how these dynamically created types would work. It just seems to be too against how types and instructions are currently implemented.

During the parsing we can check that dyn_scale and our dynamic type have been declared with the same base vector type, that's okay...At the IR level, the mechanical bits seem fine too: Types are just u8 values, we reserve some of those for dynamic types, as I have done for the original sizeless types, and these should be named such as I8X16XN. Implementing methods to handle the dynamic addition is not an issue. But this doesn't seem to tie dyn_scale value to the type in the way you would like, these are concrete types and not connected to a global value.

I don't think that the IR literally needs to have a Type that refers to/holds an IR entity, or somesuch; I'm imagining something like:

  • Type has a new "dynamically-sized type" mode, which reserves some of the index space for dt0..dtN
  • In ir::Function, we have a new dyn_types vector that contains structs that define each dynamic type, something like:
struct Function {
    // ...
    dyn_types: Vec<DynType>,
}

/// A dynamically-sized type. Type `dtN` refers to the definition in `dyn_types[N]`.
struct DynType {
    /// The dynamically-sized type is defined in terms of a base type and a global value that indicates
    /// how many of those base-type elements this type contains.
    base_type: Type,
    length: GlobalValue,
}

It isn't really an issue that the DynType isn't literally part of the Type; the Type contains an index, and we can look up the DynType given the Function context.

Possibly I'm missing some other difficulty here though? Are there places where it would be awkward to look up the DynType (I guess we need to think about what additional plumbing this requires)?

I actually got so lost in the meta layers that I actually can't remember which part made my mind finally fall over :) I think my problem comes when implementing the polymorphic instruction generation, when we want to iterate through concrete types. From the description above, they seem like concrete types (just a u8 with a bit for dynamic) but from the user perspective they are not. In the meta level, I've tried to generate a set of types that have a reference to an IR entitty, which then seems superficial since that entity isn't actually part of the type. Does any of this successfully convey my pain?

When generating code, I imagine there would be a case that matches on dynamic types, and then switches on the base type; so just as we today have cases for i32x4 and i64x2, we would have cases for i32xN and i64xN. There's just one indirection to go look up the base type. Does that make sense?

@sparker-arm
Copy link
Contributor Author

It isn't really an issue that the DynType isn't literally part of the Type; the Type contains an index, and we can look up the DynType given the Function context.

Okay, thanks, this sounds much more feasible and sorry I kinda missed this suggestion in your previous comment.

One thing regarding the global value names, is it a fundamental property of clif that we need a prefix followed by a number? I was wondering whether we could have Nxi32x4 = i32x4 * gv0 instead, which could make the IR more readable.

I'm happy to help work out implementation/prototype details -- do you have a WIP branch you can point me to that demonstrates what you're thinking / what issues you're running into?

I greatly appreciate the offer, but I think your above suggestion is enough to get me moving forward again. I will be on holiday for a couple weeks though, starting Friday, so please don't interpret my silence as being stuck again.

@cfallin
Copy link
Member

cfallin commented Mar 2, 2022

It isn't really an issue that the DynType isn't literally part of the Type; the Type contains an index, and we can look up the DynType given the Function context.

Okay, thanks, this sounds much more feasible and sorry I kinda missed this suggestion in your previous comment.

One thing regarding the global value names, is it a fundamental property of clif that we need a prefix followed by a number? I was wondering whether we could have Nxi32x4 = i32x4 * gv0 instead, which could make the IR more readable.

The CLIF parser seems to be built assuming the entityN syntax and it does have a nice logical consistency to it, though I see the appeal of arbitrary type names as well... for now I think it's probably simplest to keep the same scheme for this new type of entity, but we can definitely discuss/refine syntax more when there's a prototype to play with, I think.

@sparker-arm
Copy link
Contributor Author

@cfallin @abrown Would you be able to take a look at this again, please? I'm happy that everything ended up working as we discussed, and I'm kinda keen to move into a code RFC to better illustrate all the moving pieces.

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally happy with this now; thanks for the patience in iterating on the high-level approach here!

I too am eager to see the code; some of the details here I think will be most clear once we have a concrete thing in front of us, and I think everything is solidified to the point that we won't have too much design churn that wastes effort. The biggest question to me is how the pipeline lowers this to the existing stackslot abstractions; as we discussed earlier if all the sizes are initially compile-time constants then in theory the ABI implementation could remain unchanged, if we legalize beforehand. But for full generality maybe the ABI needs to be aware of the separate category of slots. In any case, we can iterate that as we need -- let's hopefully get this RFC accepted soon, if others agree!

@@ -96,17 +99,19 @@ With the notion of a sizeless stack slot possible, but not a sizeless spill slot
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

The main design decision is to have a vector type that isn't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such. Although setting the minimum size of the opaque type to 128 bits still allows us to infer the minimum number of lanes for a given lane type. The flexible vector specification also provides specific operations for lane accesses and shuffles so these aren't operations that we have to handle with our existing operations.
The main change here is the introduction of dynamically created types, using an existing vector type as a base and a scaling factor represented by a global value. Using a global value fits with clif IR in that we have a value which is not expected (allowed?) to change during the execution of the function. The alternative is to add types which have an implicit scaling factor which could make verification more complicated, or impossible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, "not allowed" rather than just "not expected" -- the semantics are as-if loaded once in the prologue.

The main design decision is to have a vector type that isn't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such. Although setting the minimum size of the opaque type to 128 bits still allows us to infer the minimum number of lanes for a given lane type. The flexible vector specification also provides specific operations for lane accesses and shuffles so these aren't operations that we have to handle with our existing operations.
The main change here is the introduction of dynamically created types, using an existing vector type as a base and a scaling factor represented by a global value. Using a global value fits with clif IR in that we have a value which is not expected (allowed?) to change during the execution of the function. The alternative is to add types which have an implicit scaling factor which could make verification more complicated, or impossible.

The new vector types also aren't comparable to the existing vector types, which means that no existing paths can accidently try to treat them as such.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/accidently/accidentally/

@sparker-arm sparker-arm force-pushed the cranelift-sizeless-vector branch from ca2e00b to b7cb4c6 Compare April 21, 2022 08:31
@sparker-arm
Copy link
Contributor Author

Thanks @cfallin !

The biggest question to me is how the pipeline lowers this to the existing stackslot abstractions; as we discussed earlier if all the sizes are initially compile-time constants then in theory the ABI implementation could remain unchanged, if we legalize beforehand. But for full generality maybe the ABI needs to be aware of the separate category of slots.

Indeed, the current ABI changes that I've made aren't really needed for this initial fixed size implementation, but I've tried to implement it with true dynamic sizes, and general SVE, in mind.

Speaking of which, my current plan for the code RFC is to avoid including any SVE codegen, except some changes to the ABI API, purely to avoid distractions from the IR and ABI changes (it already feels big enough to me). Do you think this is wise, or would you also prefer to see some SVE codegen too?

@cfallin
Copy link
Member

cfallin commented Apr 21, 2022

Speaking of which, my current plan for the code RFC is to avoid including any SVE codegen, except some changes to the ABI API, purely to avoid distractions from the IR and ABI changes (it already feels big enough to me). Do you think this is wise, or would you also prefer to see some SVE codegen too?

I think your incremental approach sounds good: best to get support for the dynamically-sized types in, handling/storing/spilling them, then we can add ISA support as a followup. Actually maybe there are (at least) three pieces: the dynamically-sized types and ABI support; then a "polyfill" without SVE, just NEON (and others could build the same with Intel SIMD if interested); then actually using the new hardware instructions. But, we don't need to fix the details here, they can change as needed I think.

Copy link
Contributor

@abrown abrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I agree with the @cfallin's sentiment above that this can be incrementally improved but that this is a good place to start.

@sparker-arm sparker-arm requested a review from fitzgen April 25, 2022 07:52
@sparker-arm
Copy link
Contributor Author

Hi, it's been a week since I updated this so I'd now like to move this RFC into the final comment period. Thanks!

@cfallin
Copy link
Member

cfallin commented Apr 27, 2022

@sparker-arm could you post a final-comment-period approval checklist? Then we can start to collect approvals and hopefully get this in soon!

(+1 from me as well, i.e. feel free to mark me as checked already)

@sparker-arm
Copy link
Contributor Author

sparker-arm commented Apr 28, 2022

Stakeholders sign-off

Arm

Fastly

Intel

Unaffliated

IBM

Copy link

@penzn penzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC looks good from spec point of veiw.

I don't consider myself a stakeholder w.r.t cranelift, my opinion should carry much lower weight than others :) I would be interested in following this development one way or the other though.

@sparker-arm
Copy link
Contributor Author

I believe enough time has now passed? Can this now be merged?

@cfallin
Copy link
Member

cfallin commented May 10, 2022

Indeed, the FCP has now elapsed with no objections, so this RFC should now be merged! (I'm happy to click the button if you don't have permissions to do so, but we should also fix that if so)

@sparker-arm sparker-arm merged commit 8d1fb11 into bytecodealliance:main May 10, 2022
@sparker-arm
Copy link
Contributor Author

Great, I didn't think I had permissions! Thanks for all the help with this @cfallin

@cfallin
Copy link
Member

cfallin commented May 10, 2022

It was just added apparently (thanks to Till); you're part of the Cranelift core org group now so it shouldn't be an issue in the future :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants