Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MetaSchedule][ARM] Enable ARM CPU intrinsic for MetaSchedule #14209

Merged
merged 1 commit into from
Mar 31, 2023

Conversation

dsbarinov1
Copy link
Contributor

Motivation:
The purpose of this PR is to add support for intrinsics to optimize matrix multiplication operations (e.g. matmul, convolution) during tuning with MetaScheduler.

Information about PR:
The present PR integrates the existing neon and dotprod (namely, sdot and udot) ARM CPU intrinsics into MetaScheduler, introduces a new "hybrid" dotprod intrinsic ("hdot") working with uint8, uint8 -> int32 data types, and changes the intrinsic selection and application processes for the ARM CPU case, since we operate with multiple intrinsics, rather than with a specific one.

@tvm-bot
Copy link
Collaborator

tvm-bot commented Mar 6, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@dsbarinov1 dsbarinov1 marked this pull request as draft March 6, 2023 11:08
Copy link
Contributor

@vvchernov vvchernov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave comments here to think about how to make the code cleaner, but they are not required to fix

@@ -180,7 +180,7 @@ class ScheduleRule : public runtime::ObjectRef {
* \return The schedule rule created
*/
TVM_DLL static ScheduleRule MultiLevelTilingWithIntrin(
String intrin_name, String structure, Optional<Array<String>> tile_binds,
Array<String> intrin_name, String structure, Optional<Array<String>> tile_binds,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why it is done (String -> Array), but it should be rethink one more time due to API changing affects other places not only your own task

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, new API changes were reverted to the original ones, while keeping the same new functionality.

@@ -85,21 +101,23 @@ class MultiLevelTilingWithIntrinNode : public MultiLevelTilingNode {

public:
/*! \brief The name of a tensor intrinsic. */
String intrin_name;
Array<String> intrin_name;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the field type is still be changed, I recommend to rename it to intrin_names for the sake of clarity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outdated.

@@ -110,6 +155,16 @@ void SpaceGeneratorNode::InitializeWithTuneContext(const TuneContext& context) {
default_sch_rules = ScheduleRule::DefaultMicro();
default_postprocs = Postproc::DefaultMicro();
default_mutator_probs = Mutator::DefaultMicro();
} else if (kind == "neon") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like different levels of target types are checked here. Possibly it should be "arm" type with splitting to "neon"/"dotprod" in separated method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// return HasFlag_(attr.value(), flag);
// }

static inline bool HasFlag_(Optional<Array<String>> attr, std::string flag) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have the same code in src/target/parsers/aprofile.cc.
Instead of code duplication, can we move it into common place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code duplication fixed, we are now using different method to pull specific keys from the target.

ScheduleRule::AddRFactor(
/*max_jobs_per_core=*/8,
/*max_innermost_factor=*/Integer(32)),
ScheduleRule::MultiLevelTilingWithIntrin(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, new API in MultiLevelTilingWithIntrin is not required anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is not in use rn.

@dsbarinov1 dsbarinov1 force-pushed the dbarinov/metaschedule_arm_cpu_intrin branch 2 times, most recently from c3cc7bc to 5baa4fd Compare March 20, 2023 19:39
@dsbarinov1 dsbarinov1 marked this pull request as ready for review March 21, 2023 11:26
@dsbarinov1
Copy link
Contributor Author

@ibsidorenko could you review my changes, please ? :)

vec_c = C.vload([0], dtype="int32x4")

C[T.ramp(T.int32(0), 1, 4)] = T.call_llvm_pure_intrin(
T.llvm_lookup_intrinsic_id("llvm.aarch64.neon.udot.v4u32.v16u8"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You use the same intrinsic id "llvm.aarch64.neon.udot.v4u32.v16u8" in both cases. Is it Ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When experimenting with tflite_mobilenet_v3_quant model, we encountered convolution with multiplication of uint8 uint8 tensors into int32 accumulator, which did not allow to apply existing sdot/udot intrinsics, so we had to create new hdot intrinsic, which already works with such dtypes layout. From my knowledge, there is no such neon instruction to work with u8u8i32 layout of dtypes, for that reason we could try to call the closest instruction, which we did and the intrinsic succesfully applied to that type of convolution, bringing us a performance benefit.

@dsbarinov1
Copy link
Contributor Author

@echuraev, @masahi, could you review my changes, please :)

Copy link
Contributor

@echuraev echuraev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

vec_a,
vec_b,
dtype="int32x4",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to clean up a lot of code duplication between different dtypes here. See tensor_intrin/cuda.py for examples.

};
}

Array<ScheduleRule> ScheduleRule::DefaultARMDotprod() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the dup with DefaultARMDotprod, to make the difference obvious.

@dsbarinov1 dsbarinov1 force-pushed the dbarinov/metaschedule_arm_cpu_intrin branch 6 times, most recently from 7439490 to d08f54c Compare March 28, 2023 18:55
@dsbarinov1 dsbarinov1 force-pushed the dbarinov/metaschedule_arm_cpu_intrin branch from d08f54c to 615d61f Compare March 28, 2023 23:09
}

template <typename... Args>
static void AgregateImpl(Array<T>& dest) {} // NOLINT(*)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This edit (NOLINT) was made on the consideration that quote "Google C++ Style Guide seems to have allowed using non-const references as parameters".
Reference: https://github.innominds.com/cpplint/cpplint/issues/148

@masahi
Copy link
Member

masahi commented Mar 29, 2023

@tvm-bot rerun

1 similar comment
@masahi
Copy link
Member

masahi commented Mar 29, 2023

@tvm-bot rerun

@dsbarinov1
Copy link
Contributor Author

@masahi, could you review my changes, please? :)

};
}

Array<ScheduleRule> GetDotprodSpecificRules() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is specific to "ARM" dot product only so the naming is not the best. I'll merge it for now, but please fix this when you get a chance later.

@masahi masahi merged commit b724c87 into apache:main Mar 31, 2023
@masahi
Copy link
Member

masahi commented Mar 31, 2023

sorry I forgot to take another look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants