Inference breaks in TensorFlow 2.10 #1721

talmo · 2024-03-23T00:37:47Z

Bug description

As part of an experimental move to the latest version of TensorFlow available on Windows (v2.10), we are now facing an issue during inference.

The logs below show a JIT compilation error raised in find_global_peaks_rough, specifically on this line:

sleap/sleap/nn/peak_finding.py

Line 224 in eb14764

channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels

The exception (below) points to the modulo operation as the problem. There are mathematical workarounds that avoid the modulo operation, but it would be good to find the root cause.
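For reference, here is a minimal sketch of what such workarounds could look like, written on plain Python integers rather than tensors. The function names are hypothetical, and the tiled variant assumes total_peaks is a multiple of channels (which this thread does not confirm). In TensorFlow the analogous building blocks would be floor division (`//`) and `tf.tile`, but whether either avoids the JIT failure is untested here.

```python
def channel_subs_mod(total_peaks, channels):
    """Reference behaviour: the same arithmetic as line 224, on plain ints."""
    return [i % channels for i in range(total_peaks)]


def channel_subs_floordiv(total_peaks, channels):
    """Remainder via floor division: i - (i // channels) * channels."""
    return [i - (i // channels) * channels for i in range(total_peaks)]


def channel_subs_tiled(total_peaks, channels):
    """Repeat 0..channels-1; assumes total_peaks is a multiple of channels."""
    if total_peaks % channels != 0:
        raise ValueError("total_peaks must be a multiple of channels")
    return list(range(channels)) * (total_peaks // channels)


expected = channel_subs_mod(6, 3)
assert channel_subs_floordiv(6, 3) == expected
assert channel_subs_tiled(6, 3) == expected
print(expected)  # [0, 1, 2, 0, 1, 2]
```

All three produce the same channel index for each peak; the only difference is which operations they rely on.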

In the meantime, I've deleted the conda package from our Anaconda channel. It was being downloaded even for the stable release (v1.3.3) because we version-fenced SLEAP permissively to allow newer versions of TensorFlow. This issue may still have affected the small number of users (~100-200) who installed SLEAP after I pushed that conda package.

If we need to rebuild that conda package for testing, we can just rerun the jobs in this workflow to rebuild and reupload TensorFlow v2.10 to our conda channel. We should probably change the tag from main to dev to prevent users from downloading the new release until it's fixed though.

The short-term fix if others run into this is to pip install tensorflow==2.7, after which everything should work.

Actual behaviour

The error occurs during evaluation or inference with a centered instance model, specifically during the call to find_global_peaks_rough:

    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
         [[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
  (1) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]

Your personal set up

OS: Windows 10
Version(s): v1.3.3 or eb14764

SLEAP installation method (listed here):

Environment packages

# paste output of `pip freeze` or `conda list` here

Logs

Epoch 5/5
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:24.427056: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 160 } dim { size: 160 } dim { size: 1 } } }
200/200 - 9s - loss: 4.0422e-04 - head: 7.4481e-04 - torso: 1.1064e-04 - tail_base: 3.5720e-04 - val_loss: 3.9666e-04 - val_head: 6.5480e-04 - val_torso: 1.2842e-04 - val_tail_base: 4.0678e-04 - lr: 1.0000e-04 - 9s/epoch - 47ms/step
INFO:sleap.nn.training:Finished training loop. [0.9 min]
INFO:sleap.nn.training:Deleting visualization directory: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz
INFO:sleap.nn.training:Saving evaluation metrics to model folder...
Predicting... ----------------------------------------   0% ETA: -:--:-- ?Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:28.048155: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -3 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: -23 } dim { size: -24 } dim { size: 1 } } }
2024-03-22 13:26:28.058356: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -38 } dim { size: -39 } dim { size: -40 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -11 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48113909760 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: -42 } dim { size: -43 } dim { size: 1 } } }
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2024-03-22 13:26:28.906549: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: JIT compilation failed.
Predicting... ----------------------------------------   0% ETA: -:--:-- ?
Traceback (most recent call last):
  File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
    sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
  File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
    trainer.train()
  File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
    self.evaluate()
  File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
    split_name="train",
  File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
    labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
  File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
    self._make_labeled_frames_from_generator(generator, data)
  File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
    for ex in generator:
  File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
    ex = process_batch(ex)
  File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
    preds = self.inference_model.predict_on_batch(ex, numpy=True)
  File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
    outs = super().predict_on_batch(data, **kwargs)
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
    outputs = self.predict_function(iterator)
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\tensorflow\python\eager\execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'mod' defined at (most recent call last):
    File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
      sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
    File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
      trainer.train()
    File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
      self.evaluate()
    File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
      split_name="train",
    File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
      labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
      self._make_labeled_frames_from_generator(generator, data)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
      for ex in generator:
    File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
      ex = process_batch(ex)
    File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
      preds = self.inference_model.predict_on_batch(ex, numpy=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
      outs = super().predict_on_batch(data, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
      outputs = self.predict_function(iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2041, in predict_function
      return step_function(self, iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2027, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2015, in run_step
      outputs = model.predict_step(data)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
      return self(x, training=False)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
Detected at node 'mod' defined at (most recent call last):
    File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
      sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
    File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
      trainer.train()
    File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
      self.evaluate()
    File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
      split_name="train",
    File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
      labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
      self._make_labeled_frames_from_generator(generator, data)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
      for ex in generator:
    File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
      ex = process_batch(ex)
    File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
      preds = self.inference_model.predict_on_batch(ex, numpy=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
      outs = super().predict_on_batch(data, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
      outputs = self.predict_function(iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2041, in predict_function
      return step_function(self, iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2027, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2015, in run_step
      outputs = model.predict_step(data)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
      return self(x, training=False)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
         [[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
  (1) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_predict_function_37158]
INFO:sleap.nn.callbacks:Closing the reporter controller/context.
INFO:sleap.nn.callbacks:Closing the training controller socket/context.
Run Path: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1
Saving config: C:\Users\Talmo/.sleap/1.3.3/preferences.yaml

Screenshots

How to reproduce

Run inference with a top-down model (specifically the centered instance portion).


talmo · 2024-04-08T05:35:44Z

Solution: Stay on TF 2.7 😢

roomrys · 2024-12-19T19:06:25Z

Yes, we were getting JIT errors, but we found that we could upgrade tensorflow - the newer versions just need some help finding the CUDA directory (located at our conda prefix).
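As a rough sketch of what "helping it find the CUDA directory" could look like: the comment above does not say which mechanism was used, so this assumes the commonly cited `--xla_gpu_cuda_data_dir` XLA flag and assumes libdevice sits under `<conda prefix>/nvvm/libdevice`. The helper name `with_cuda_data_dir` is hypothetical.

```python
import os


def with_cuda_data_dir(cuda_dir, existing_flags=""):
    """Return an XLA_FLAGS value that points XLA at cuda_dir.

    Any --xla_gpu_cuda_data_dir already present is dropped so the new one wins.
    Note: flags are split on whitespace, so paths containing spaces are not handled.
    """
    kept = [
        flag
        for flag in existing_flags.split()
        if not flag.startswith("--xla_gpu_cuda_data_dir=")
    ]
    kept.append(f"--xla_gpu_cuda_data_dir={cuda_dir}")
    return " ".join(kept)


# Assumption: libdevice lives under <conda prefix>/nvvm/libdevice; adjust if your layout differs.
conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:
    os.environ["XLA_FLAGS"] = with_cuda_data_dir(
        conda_prefix, os.environ.get("XLA_FLAGS", "")
    )
# Import TensorFlow only after this point so the flag is already set when it loads.
```

This has to run before TensorFlow is imported; setting the same variable in the shell or in the conda environment's activation script would be an alternative.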

talmo added the "bug" (Something isn't working) label Mar 23, 2024

eberrigan mentioned this issue Mar 29, 2024

Update to new TensorFlow conda package #1726

Merged

11 tasks

talmo closed this as completed Apr 8, 2024

roomrys mentioned this issue Aug 31, 2024

Use tf.math.mod instead of % #1931

Merged

11 tasks
