Jitted function runs slower than non-jitted function for EM updates. #23822
Edit: This may be OS specific; see the additional comment below.

I'm observing that a jitted function takes around 60% longer to complete than a non-jitted function. There are some discussions on GitHub and StackOverflow related to the same observation (e.g., very small functions where the overhead is larger than the benefit from jit, placing a large number of Python objects on the device, or compilation overhead due to Python loops), but I don't think I'm in any of these settings.

For context, I'm using jax to estimate the factors of a tensor decomposition model using classic EM-style updates (rather than gradient-based optimization). The motivation to jit is to speed up these updates; I try to minimize the L2 loss between the observations and the model's predictions. For the larger analysis, I'm using shrinkage priors on all components of the model and variational Bayes updates with a mean-field approximation for the posterior, but the function below exhibits the same behavior and is much more readable. The main update function is as follows (a full example is here).

```python
import jax.numpy as jnp


def update(factors, i, j, k, y):
"""
Update the factors given the COO representation of observations.
Args:
factors: Mapping from factor names to jax arrays.
i: Indices of obs. along first tensor dimension with shape `(n_obs,)`.
j: Indices of obs. along second tensor dimension with shape `(n_obs,)`.
k: Indices of obs. along third tensor dimension with shape `(n_obs,)`.
y: Observations at indices `(i, j, k)` with shape `(n_obs,)`.
Returns:
Updated factors and predictions as a tuple.
"""
    # Create a shallow copy to ensure the function is pure.
    factors = factors.copy()
    # Indices to expand each factor to the same size as the observations `y`.
    indices = {
        "a": (i,),
        "b": (j,),
        "c": (k,),
        "A": (i, j),
        "B": (j, k),
        "C": (k, i),
    }
    # Expand the factors so we can easily construct an estimate of `y`.
    summands = {key: factors[key][idx] for key, idx in indices.items()}
    # First update the grand mean `mu` separately because it doesn't have indices.
    y_hat = sum(summands.values())
    mu = (y - y_hat).mean()
    factors["mu"] = mu
    summands["mu"] = mu
    # Iterate over all factors and update them.
    for key, idx in indices.items():
        # Pop the factor we're currently updating.
        summands.pop(key)
        # Evaluate the number of observations per element of the factor. We add
        # 0.001 to avoid division by zero. This is just the prior precision in a
        # Bayesian context.
        precision = 0.001 + jnp.zeros_like(factors[key]).at[idx].add(1)
        # Evaluate the totals for each element and divide by the precision to get a
        # point estimate.
        y_hat = sum(summands.values())
        residuals = y - y_hat
        prod = jnp.zeros_like(precision).at[idx].add(residuals)
        factor = prod / precision
        # Update the factor and add to the summands.
        factors[key] = factor
        summands[key] = factor[idx]
    y_hat = sum(summands.values())
    return factors, y_hat
```
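For reference, here is a minimal sketch of how the inputs can be constructed and the function called. The dimensions and random data below are placeholders chosen purely for illustration; the post does not state the actual tensor dimensions used in the timings.

```python
import jax
import jax.numpy as jnp

# Placeholder dimensions, not the ones used in the timings below.
n_i, n_j, n_k, n_obs = 50, 60, 70, 100_000
keys = jax.random.split(jax.random.PRNGKey(0), 4)

# Random COO indices and observations.
i = jax.random.randint(keys[0], (n_obs,), 0, n_i)
j = jax.random.randint(keys[1], (n_obs,), 0, n_j)
k = jax.random.randint(keys[2], (n_obs,), 0, n_k)
y = jax.random.normal(keys[3], (n_obs,))

# Initialize all factors at zero; shapes follow the indexing in `update`.
factors = {
    "mu": jnp.zeros(()),
    "a": jnp.zeros(n_i),
    "b": jnp.zeros(n_j),
    "c": jnp.zeros(n_k),
    "A": jnp.zeros((n_i, n_j)),
    "B": jnp.zeros((n_j, n_k)),
    "C": jnp.zeros((n_k, n_i)),
}

# Run a few EM-style sweeps, feeding the updated factors back in.
for _ in range(5):
    factors, y_hat = update(factors, i, j, k, y)
```

The timings below then call `update` once per iteration, first eagerly and then through `jax.jit`.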
Timings are as follows (all run on the CPU of a 2020 MacBook Pro with an M1 chip).

```
>>> %timeit jax.block_until_ready(update(factors, i, j, k, y))
159 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> # Jit and run once to remove compilation overhead in timing.
>>> jitted = jax.jit(update)
>>> jax.block_until_ready(jitted(factors, i, j, k, y))
>>> %timeit jax.block_until_ready(jitted(factors, i, j, k, y))
266 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

I expected the jitted version to be faster.

My environment is as follows.

```
>>> jax.print_environment_info()
jax: 0.4.33
jaxlib: 0.4.33
numpy: 2.1.1
python: 3.11.5 (main, Dec 8 2023, 17:04:09) [Clang 15.0.0 (clang-1500.0.40.1)]
jax.devices (1 total, 1 local): [CpuDevice(id=0)]
process_count: 1
platform: uname_result(system='Darwin', node='Tills-MacBook-Pro.local', release='24.0.0', version='Darwin Kernel Version 24.0.0: Mon Aug 12 20:49:48 PDT 2024; root:xnu-11215.1.10~2/RELEASE_ARM64_T8103', machine='arm64')
```
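As a side note for readers skimming the function: the `precision` and `prod` lines rely on JAX's functional indexed updates, where duplicate indices accumulate under `.add`. A tiny standalone illustration with toy values (my own, purely for exposition):

```python
import jax.numpy as jnp

# Toy indices and residuals, chosen only to illustrate the scatter-add pattern.
idx = (jnp.array([0, 1, 1, 2]),)
residuals = jnp.array([1.0, 2.0, 3.0, 4.0])

counts = jnp.zeros(3).at[idx].add(1)          # [1., 2., 1.] -> observations per element
totals = jnp.zeros(3).at[idx].add(residuals)  # [1., 5., 4.] -> residual sums per element
estimate = totals / (0.001 + counts)          # per-element point estimate, mirroring `prod / precision`
print(counts, totals, estimate)
```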
Update: This may be OS specific (macOS Sequoia on my machine), because the timings are very different when running on a Colab CPU: the jitted function is about 4.5x faster.

```
>>> %timeit jax.block_until_ready(update(factors, i, j, k, y))
84.6 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> # Jit and run once to remove compilation overhead in timing.
>>> jitted = jax.jit(update)
>>> jax.block_until_ready(jitted(factors, i, j, k, y))
>>> %timeit jax.block_until_ready(jitted(factors, i, j, k, y))
18.6 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Thanks for trying that. For now perhaps you can set the following environment variable as a workaround for v0.4.33:
I expect that we'll be able to come up with a longer term solution for use cases like this, but I don't know the answer yet!