Inference breaks in TensorFlow 2.10 #1721

Closed
1 of 4 tasks
talmo opened this issue Mar 23, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@talmo
Collaborator

talmo commented Mar 23, 2024

Bug description

As part of an experimental move to the latest version of TensorFlow available on Windows (v2.10), we are now facing an issue during inference.

The logs below show a JIT compilation error raised inside find_global_peaks_rough, specifically on this line:

channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels

The exception (below) points to the modulo operation as the problem. There are obviously mathematical workarounds that avoid the modulo operation, but it would be good to dig into why it fails.
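A minimal sketch of the kind of arithmetic workaround mentioned above: for non-negative indices, `i % c` equals `i - (i // c) * c`, and when the index range covers a whole number of samples the same sequence can be built by tiling `tf.range(channels)` with no arithmetic at all. The helper names here are hypothetical and not part of SLEAP, and whether either form actually avoids the JIT failure on TF 2.10 is untested.

```python
import tensorflow as tf


def channel_subs_floordiv(total_peaks, channels):
    """Hypothetical helper: i % c rewritten as i - (i // c) * c (valid for i >= 0)."""
    idx = tf.range(total_peaks, dtype=tf.int64)
    channels = tf.cast(channels, tf.int64)
    return idx - (idx // channels) * channels


def channel_subs_tile(samples, channels):
    """Hypothetical helper: repeat 0..channels-1 once per sample, no arithmetic."""
    return tf.tile(tf.range(channels, dtype=tf.int64), [samples])


# Both produce the same subscripts as tf.range(6, dtype=tf.int64) % 3,
# i.e. [0, 1, 2, 0, 1, 2].
subs_a = channel_subs_floordiv(6, 3)
subs_b = channel_subs_tile(2, 3)
```

The tiling form assumes `total_peaks == samples * channels`; if that does not hold in the real function, only the floor-division form applies.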

In the meantime, I've deleted the conda package from our Anaconda channel. Because we version-fenced SLEAP permissively to allow newer versions of TensorFlow, the package was being downloaded even for the stable release (v1.3.3). This issue may have affected a small number of users (~100-200) who installed SLEAP after I pushed that conda package.

If we need to rebuild that conda package for testing, we can just rerun the jobs in this workflow to rebuild and reupload TensorFlow v2.10 to our conda channel. We should probably change the tag from main to dev to prevent users from downloading the new release until it's fixed though.

The short-term fix if others run into this is to pip install tensorflow==2.7, after which everything should work.
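For anyone unsure which TensorFlow they ended up with, a small check along these lines may help. It is only a sketch: `KNOWN_GOOD` encodes nothing more than the 2.7 pin suggested in this thread, and the function names are made up for illustration.

```python
KNOWN_GOOD = (2, 7)  # assumption: the only version this thread reports as working


def major_minor(version):
    """Parse a version string such as '2.7.4' into (2, 7), ignoring the patch part."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def is_known_good(version):
    """Return True if the version's major.minor matches the pinned release."""
    return major_minor(version) == KNOWN_GOOD
```

In a SLEAP environment this would be called as `is_known_good(tf.__version__)` after `import tensorflow as tf`.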

Actual behaviour

Prediction breaks during evaluation or inference with a centered instance model, specifically during the call to find_global_peaks_rough:

    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
         [[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
  (1) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]

Your personal set up

  • OS: Windows 10
  • Version(s): v1.3.3 or eb14764
Environment packages
# paste output of `pip freeze` or `conda list` here
Logs
Epoch 5/5
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:24.427056: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -2 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -2 } dim { size: 160 } dim { size: 160 } dim { size: 1 } } }
200/200 - 9s - loss: 4.0422e-04 - head: 7.4481e-04 - torso: 1.1064e-04 - tail_base: 3.5720e-04 - val_loss: 3.9666e-04 - val_head: 6.5480e-04 - val_torso: 1.2842e-04 - val_tail_base: 4.0678e-04 - lr: 1.0000e-04 - 9s/epoch - 47ms/step
INFO:sleap.nn.training:Finished training loop. [0.9 min]
INFO:sleap.nn.training:Deleting visualization directory: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz
INFO:sleap.nn.training:Saving evaluation metrics to model folder...
Predicting... ----------------------------------------   0% ETA: -:--:-- ?Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132344.centroid.n=1\viz\validation.*.png
Polling: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1\viz\validation.*.png
2024-03-22 13:26:28.048155: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_UINT8 } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_UINT8 shape { dim { size: 1 } dim { size: 768 } dim { size: 1024 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -3 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "AuthenticAMD" model: "241" frequency: 3493 num_cores: 64 environment { key: "cpu_instruction_set" value: "SSE, SSE2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 524288 l3_cache_size: 134217728 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -3 } dim { size: -23 } dim { size: -24 } dim { size: 1 } } }
2024-03-22 13:26:28.058356: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -38 } dim { size: -39 } dim { size: -40 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -11 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48113909760 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -11 } dim { size: -42 } dim { size: -43 } dim { size: 1 } } }
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2024-03-22 13:26:28.906549: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: JIT compilation failed.
Predicting... ----------------------------------------   0% ETA: -:--:-- ?
Traceback (most recent call last):
  File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
    sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
  File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
    trainer.train()
  File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
    self.evaluate()
  File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
    split_name="train",
  File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
    labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
  File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
    self._make_labeled_frames_from_generator(generator, data)
  File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
    for ex in generator:
  File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
    ex = process_batch(ex)
  File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
    preds = self.inference_model.predict_on_batch(ex, numpy=True)
  File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
    outs = super().predict_on_batch(data, **kwargs)
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
    outputs = self.predict_function(iterator)
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\tensorflow\python\util\traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\tensorflow\python\eager\execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'mod' defined at (most recent call last):
    File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
      sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
    File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
      trainer.train()
    File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
      self.evaluate()
    File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
      split_name="train",
    File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
      labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
      self._make_labeled_frames_from_generator(generator, data)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
      for ex in generator:
    File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
      ex = process_batch(ex)
    File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
      preds = self.inference_model.predict_on_batch(ex, numpy=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
      outs = super().predict_on_batch(data, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
      outputs = self.predict_function(iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2041, in predict_function
      return step_function(self, iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2027, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2015, in run_step
      outputs = model.predict_step(data)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
      return self(x, training=False)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
Detected at node 'mod' defined at (most recent call last):
    File "C:\Miniconda3\envs\sleap_develop\Scripts\sleap-train-script.py", line 33, in <module>
      sys.exit(load_entry_point('sleap', 'console_scripts', 'sleap-train')())
    File "d:\sleap_develop\sleap\nn\training.py", line 2014, in main
      trainer.train()
    File "d:\sleap_develop\sleap\nn\training.py", line 953, in train
      self.evaluate()
    File "d:\sleap_develop\sleap\nn\training.py", line 966, in evaluate
      split_name="train",
    File "d:\sleap_develop\sleap\nn\evals.py", line 744, in evaluate_model
      labels_pr: Labels = predictor.predict(labels_gt, make_labels=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 526, in predict
      self._make_labeled_frames_from_generator(generator, data)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2642, in _make_labeled_frames_from_generator
      for ex in generator:
    File "d:\sleap_develop\sleap\nn\inference.py", line 436, in _predict_generator
      ex = process_batch(ex)
    File "d:\sleap_develop\sleap\nn\inference.py", line 399, in process_batch
      preds = self.inference_model.predict_on_batch(ex, numpy=True)
    File "d:\sleap_develop\sleap\nn\inference.py", line 1069, in predict_on_batch
      outs = super().predict_on_batch(data, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2474, in predict_on_batch
      outputs = self.predict_function(iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2041, in predict_function
      return step_function(self, iterator)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2027, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 2015, in run_step
      outputs = model.predict_step(data)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
      return self(x, training=False)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2265, in call
      if isinstance(self.instance_peaks, FindInstancePeaksGroundTruth):
    File "d:\sleap_develop\sleap\nn\inference.py", line 2274, in call
      peaks_output = self.instance_peaks(crop_output)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Miniconda3\envs\sleap_develop\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "d:\sleap_develop\sleap\nn\inference.py", line 2110, in call
      if self.offsets_ind is None:
    File "d:\sleap_develop\sleap\nn\inference.py", line 2112, in call
      peak_points, peak_vals = sleap.nn.peak_finding.find_global_peaks(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 366, in find_global_peaks
      rough_peaks, peak_vals = find_global_peaks_rough(
    File "d:\sleap_develop\sleap\nn\peak_finding.py", line 224, in find_global_peaks_rough
      channel_subs = tf.range(total_peaks, dtype=tf.int64) % channels
Node: 'mod'
2 root error(s) found.
  (0) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
         [[top_down_inference_model/find_instance_peaks_1/RaggedFromValueRowIds_1/RowPartitionFromValueRowIds/bincount/Minimum/_436]]
  (1) UNKNOWN:  JIT compilation failed.
         [[{{node mod}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_predict_function_37158]
INFO:sleap.nn.callbacks:Closing the reporter controller/context.
INFO:sleap.nn.callbacks:Closing the training controller socket/context.
Run Path: C:/Users/Talmo/Desktop/fun with nick\models\240322_132513.centered_instance.n=1
Saving config: C:\Users\Talmo/.sleap/1.3.3/preferences.yaml

Screenshots

How to reproduce

Run inference with a top-down model (specifically the centered instance portion).

@talmo talmo added the bug Something isn't working label Mar 23, 2024
@talmo
Collaborator Author

talmo commented Apr 8, 2024

Solution: Stay on TF 2.7 😢

@talmo talmo closed this as completed Apr 8, 2024
@roomrys
Collaborator

roomrys commented Dec 19, 2024

Yes, we were getting JIT errors, but we found that we could upgrade TensorFlow after all: the newer versions just need some help finding the CUDA directory (located at our conda prefix).
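One way to give TensorFlow that help, sketched under assumptions: XLA accepts a `--xla_gpu_cuda_data_dir` flag through the `XLA_FLAGS` environment variable and looks for `nvvm/libdevice` beneath that directory, which matches the "Can't find libdevice directory" lines in the logs above. The helper names are hypothetical, and the exact location of the CUDA files under the conda prefix is a guess that may differ per platform and package.

```python
import os


def xla_cuda_flag(cuda_dir):
    """Build the XLA flag for cuda_dir, or return None if libdevice is not under it."""
    libdevice = os.path.join(cuda_dir, "nvvm", "libdevice")
    if not os.path.isdir(libdevice):
        return None
    return "--xla_gpu_cuda_data_dir=" + cuda_dir


def configure_xla(cuda_dir, env=os.environ):
    """Append the flag to XLA_FLAGS in env; return True if it was set."""
    flag = xla_cuda_flag(cuda_dir)
    if flag is None:
        return False
    existing = env.get("XLA_FLAGS", "")
    env["XLA_FLAGS"] = (existing + " " + flag).strip()
    return True


# Assumption: the CUDA files sit directly under the active conda prefix.
# To be safe, run this before TensorFlow is imported.
prefix = os.environ.get("CONDA_PREFIX")
if prefix:
    configure_xla(prefix)
```

If `configure_xla` returns False, the `nvvm/libdevice` folder is somewhere else and the directory passed in needs adjusting.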
