[TOPI] Fast tanh #3255

Merged
merged 1 commit into from
Jun 5, 2019

hlu1
Contributor

@hlu1 hlu1 commented May 29, 2019

Borrowing the fast_tanh_float implementation from Eigen (https://github.com/eigenteam/eigen-git-mirror/blob/80f488a7bc9b7c64c9d0c0e8fb301fd905ad1b95/Eigen/src/Core/MathFunctionsImpl.h#L26) gives about a 28x speedup for tanh.

benchmark:

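# Imports assumed by this snippet (they were not shown in the original post).
import logging
import time

import numpy as np
import tvm
from tvm import relay, rpc
from tvm.contrib import graph_runtime, util

target = "llvm -mcpu=core-avx2"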
num_iter = 1000
num_cycles = 5
dtype = "float32"

def bench_tanh(func, m, n):
    a = relay.var("a", shape=(m, n))
    out = func(a)
    f = relay.ir_pass.infer_type(relay.Function([a], out))
    opt_level = 3

    with relay.build_config(opt_level=opt_level):
        graph, lib, params = relay.build(f, target, params={})
    print(graph)

    remote = tvm.rpc.LocalSession()
    tmp = tvm.contrib.util.tempdir()
    lib_fname = tmp.relpath("net.tar")
    with tvm.target.create(target):
        lib.export_library(lib_fname)

    remote.upload(lib_fname)
    lib = remote.load_module("net.tar")
    ctx = remote.cpu(0)

    module = graph_runtime.create(graph, lib, ctx)

    logging.debug(graph)

    input = {'a': np.random.uniform(low=-10, high=10, size=(m, n)).astype(np.float32)}
    module.set_input(**input)

    ftimer = module.module.time_evaluator("run", ctx, num_iter)
    for _ in range(num_cycles):
        prof_res = ftimer()
        print("TVM time: ", prof_res.mean * 1e6, " us")
        time.sleep(1)

bench_tanh(relay.tanh, 1024, 128)

Results:

before:
TVM time:  1512.9183090000001  us
TVM time:  1406.613658  us
TVM time:  1444.041799  us
TVM time:  1445.61708  us
TVM time:  1407.4704649999999  us

after:
TVM time:  49.699045999999996  us
TVM time:  57.133776999999995  us
TVM time:  57.434446  us
TVM time:  59.131979  us
TVM time:  57.127435999999996  us

speedup = 28x

The speedup is about the same on Intel Skylake machines.
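For reference, the quoted speedup can be recomputed from the timings listed above. This is only a small sketch: the variable names and the `speedup` helper are made up for illustration, and the numbers are copied verbatim from the before/after runs.

```python
from statistics import mean

# Timings in microseconds, copied from the "before" and "after" runs above.
before_us = [1512.9183090000001, 1406.613658, 1444.041799,
             1445.61708, 1407.4704649999999]
after_us = [49.699045999999996, 57.133776999999995, 57.434446,
            59.131979, 57.127435999999996]


def speedup(before, after, stat=mean):
    """Ratio of a summary statistic of the old timings to the new timings."""
    return stat(before) / stat(after)


mean_speedup = speedup(before_us, after_us, mean)  # compares average runs
best_speedup = speedup(before_us, after_us, min)   # compares fastest runs

print(f"mean-based speedup: {mean_speedup:.1f}x")
print(f"best-run speedup:   {best_speedup:.1f}x")
```

Comparing the fastest runs gives roughly 28x, which matches the figure quoted above; comparing the means gives a somewhat lower figure of roughly 26x.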

@hlu1 hlu1 changed the title [TOPI] fast tanh [TOPI] Fast tanh May 29, 2019
@hlu1
Contributor Author

hlu1 commented May 29, 2019

@ajtulloch, could you review pls?

@jroesch
Member

jroesch commented May 30, 2019

LGTM, would be good to get a review from @ajtulloch and then we can merge. Does this have any approximation or numerical stability issues?

@pavpanchekha

pavpanchekha commented May 30, 2019

@jroesch asked me to take a look accuracy-wise; this is not a review, just a quick take.

The Eigen tanh is implemented using a rational approximation on [-9, 9] and is set to ±1 outside that range. (See the comment in the source, though note that the implementation is actually done by clamping.) In GLIBC, which I assume is the currently used implementation, tanh is computed via expm1 (see the comment in the source in this mirror).

Let's start with single precision. Generally speaking, I expect the Eigen implementation to be faster (evaluating two polynomials, plus one division, is going to be much faster than an exponential!) and I assume the polynomials are well chosen so that the accuracy is going to be acceptable (the comment says that it's within a few ULPs; that they don't say how many doesn't inspire confidence, but they're using a 13/6 approximation, which seems good enough). Plus, I assume you're using this implementation for an activation function, for which exact accuracy is likely unimportant. And the rational approximation is going to be monotonic, which is nice.
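To make the clamp-then-rational-function idea above concrete, here is a minimal sketch. It does not use Eigen's coefficients: as a stand-in it evaluates a truncated Lambert continued fraction for tanh, which is also a ratio of two polynomials. The function name `rational_tanh` and the default of 14 terms are assumptions chosen for illustration only.

```python
import math


def rational_tanh(x, terms=14, clamp=9.0):
    """Approximate tanh(x) by clamping x, then evaluating a rational function.

    NOTE: this is a truncated Lambert continued fraction,
        tanh(x) = x / (1 + x^2 / (3 + x^2 / (5 + ...))),
    used as a stand-in. It is NOT the 13/6 polynomial pair used by Eigen.
    """
    # Clamp instead of branching: beyond +/-clamp the result is ~ +/-1 anyway.
    x = max(-clamp, min(clamp, x))
    x2 = x * x
    # Evaluate the continued fraction bottom-up, starting from the last term.
    d = 2.0 * terms + 1.0
    for k in range(terms - 1, -1, -1):
        d = (2.0 * k + 1.0) + x2 / d
    return x / d


# Largest deviation from math.tanh on a grid that extends past the clamp.
grid = [i / 10.0 for i in range(-200, 201)]
max_abs_err = max(abs(rational_tanh(v) - math.tanh(v)) for v in grid)
print(f"max abs error on [-20, 20]: {max_abs_err:.2e}")
```

A production version would use precomputed polynomial coefficients and a single division, as the Eigen code does; the continued fraction here costs one division per term and is meant only to show the structure.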

Now let's do double precision. Here, the Eigen implementation will only be as accurate as a single-precision computation, because it's missing terms in the polynomial. And, while clamping at ±9 is appropriate in single precision, in double precision you have to clamp at ±19 (and so need a rational approximation that is accurate that far out). I don't know what exactly your users think about accuracy, but I suspect they wouldn't be happy with double precision being no more accurate than single precision.
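The ±9 and ±19 clamp points can be sanity-checked with a short calculation. For large x, 1 - tanh(x) is approximately 2·exp(-2x), so tanh(x) rounds to exactly 1.0 once that gap drops below half the spacing of floats just under 1.0. The sketch below assumes round-to-nearest and IEEE 754 precisions of 24 bits (float32) and 53 bits (float64); the helper name is made up.

```python
import math


def saturation_threshold(precision_bits):
    """Approximate x beyond which tanh(x) rounds to 1.0.

    precision_bits is the significand precision: 24 for float32, 53 for float64.
    Floats just below 1.0 are spaced 2**-precision_bits apart, so anything
    within half that spacing of 1.0 rounds to 1.0.
    """
    half_spacing = 2.0 ** -(precision_bits + 1)
    # Solve 2 * exp(-2x) = half_spacing for x.
    return 0.5 * math.log(2.0 / half_spacing)


threshold_f32 = saturation_threshold(24)
threshold_f64 = saturation_threshold(53)
print(f"float32 saturates near x = {threshold_f32:.2f}")
print(f"float64 saturates near x = {threshold_f64:.2f}")
```

Both values land just above 9 and 19 respectively, which is consistent with the clamp points mentioned above.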

The safe but practical thing, I think, is using the Eigen tanh for single precision but not double precision. If you wanted to, you could derive an analogous polynomial and get a double-precision version that way, or you could keep using the GLIBC implementation in that case and hope the higher memory bandwidth of double precision masks any additional CPU time spent computing tanh.

@hlu1 hlu1 force-pushed the fast_tanh branch 2 times, most recently from 4bd414a to 83eb574 Compare May 30, 2019 23:12
@hlu1
Contributor Author

hlu1 commented May 30, 2019

@pavpanchekha, thanks for the comment. It makes a lot of sense.
I added logic to invoke the Eigen fast_tanh_float only for fp32 and to fall back to the default GLIBC tanh implementation for all other datatypes. A double-precision test for tanh is also added.

@hlu1 hlu1 force-pushed the fast_tanh branch 2 times, most recently from ce650de to fe05f22 Compare May 31, 2019 21:31
Contributor

@ajtulloch ajtulloch left a comment

Looks great, excellent idea - just a suggestion for a ulp-bound test.

Review thread on topi/include/topi/elemwise.h (outdated, resolved)
Review thread on topi/tests/python/test_topi_math.py (outdated, resolved)
@hlu1 hlu1 force-pushed the fast_tanh branch 2 times, most recently from 5869797 to de7a162 Compare June 1, 2019 00:05
@ajtulloch
Contributor

Looks like tests fail because of CUDA expf being > 1 ULP (IIRC it's something like 5 ULP max), but maybe we should just enable ULP checking for the tanh impl?

high,
shape=(20, 3),
dtype=tvm.float32,
maxulp=1,
Contributor

Maybe just make this an optional setting that only tanh uses for now?

@hlu1
Contributor Author

hlu1 commented Jun 1, 2019

I did a bit more testing and noticed that the error between numpy.tanh and topi.tanh actually can be pretty big, even for the original implementation. The maxulp can be as big as 194. I think using absolute error and relative error is probably fine.

import numpy as np
import tvm
import topi
import topi.testing
from topi import util

m = tvm.var('m')
l = tvm.var('l')
A = tvm.placeholder((m, l), name='A')

shape = (20, 3)
B = topi.tanh(A)

# Build once; the compiled function does not depend on the random inputs.
device = "llvm"
ctx = tvm.context(device, 0)
with tvm.target.create(device):
    s = topi.generic.schedule_injective(B)
foo = tvm.build(s, [A, B], device, name="tanh")

for _ in range(10):
    a_np = np.random.uniform(low=-1, high=1, size=shape).astype(A.dtype)
    b_np = np.tanh(a_np)
    a = tvm.nd.array(a_np, ctx)
    b = tvm.nd.array(np.zeros_like(b_np), ctx)
    foo(a, b)
    try:
        # Default nulp=1: prints the maximum ULP difference when it exceeds 1.
        np.testing.assert_array_almost_equal_nulp(b.asnumpy(), b_np)
    except AssertionError as error:
        print(error)

Original:

    X and Y are not equal to 1 ULP (max is 20)
    X and Y are not equal to 1 ULP (max is 2)
    X and Y are not equal to 1 ULP (max is 194)
    X and Y are not equal to 1 ULP (max is 11)
    X and Y are not equal to 1 ULP (max is 3)
    X and Y are not equal to 1 ULP (max is 10)
    X and Y are not equal to 1 ULP (max is 8)
    X and Y are not equal to 1 ULP (max is 5)
    X and Y are not equal to 1 ULP (max is 32)
    X and Y are not equal to 1 ULP (max is 40)

Eigen:

X and Y are not equal to 1 ULP (max is 13)
X and Y are not equal to 1 ULP (max is 3)
X and Y are not equal to 1 ULP (max is 2)
X and Y are not equal to 1 ULP (max is 5)
X and Y are not equal to 1 ULP (max is 27)
X and Y are not equal to 1 ULP (max is 2)
X and Y are not equal to 1 ULP (max is 26)
X and Y are not equal to 1 ULP (max is 14)
X and Y are not equal to 1 ULP (max is 74)
X and Y are not equal to 1 ULP (max is 4)
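For readers unfamiliar with the ULP counts reported above, the following sketch shows one common way to measure the ULP distance between two float32 values using only the standard library. The helper names are made up for illustration; this is not how NumPy implements `assert_array_almost_equal_nulp` internally.

```python
import struct


def float32_to_ordered_int(x):
    """Map a float32 value to an integer whose ordering matches the float's."""
    # Round x to float32 and reinterpret its 32 bits as an unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        # Negative floats: drop the sign bit and negate, so -0.0 maps to 0.
        return -(bits & 0x7FFFFFFF)
    return bits


def ulp_distance_f32(a, b):
    """Number of representable float32 steps between a and b."""
    return abs(float32_to_ordered_int(a) - float32_to_ordered_int(b))


# 1.0 and the next float32 above it are exactly one step apart.
print(ulp_distance_f32(1.0, 1.0 + 2.0 ** -23))
```

For normal (non-subnormal) float32 values, one ULP corresponds to a relative error between 2^-24 and 2^-23, so even the 194 ULP case above is a relative error of at most about 2.3e-5, which is why an absolute/relative tolerance check is a reasonable alternative here.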

@ajtulloch
Contributor

Sounds good, looks great then. Thanks for digging into it.

@hlu1
Contributor Author

hlu1 commented Jun 3, 2019

@tqchen, @jroesch, it's ready to be merged.

@tqchen tqchen merged commit 165aa0d into apache:master Jun 5, 2019
@tqchen
Member

tqchen commented Jun 5, 2019

Thanks @pavpanchekha, @jroesch, @hlu1, @ajtulloch, and @antinucleon; this PR is now merged.

@hlu1
Contributor Author

hlu1 commented Jun 5, 2019

Thanks @tqchen

wweic pushed a commit to wweic/tvm that referenced this pull request Jun 26, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Jun 27, 2019