Failed tests with 'make test' #537

Closed
tdd11235813 opened this issue Jun 19, 2018 · 7 comments

@tdd11235813
Contributor

tdd11235813 commented Jun 19, 2018

Can you run the runtime tests successfully?

I used these module and CMake settings:

module load cmake cuda/8.0.61 gcc/4.9.3
cmake -DBOOST_ROOT="$HOME/software/boost_1.65.1/taurus/" -DBOOST_LIBRARYDIR="$HOME/software/boost_1.65.1/taurus/lib" -DBoost_USE_STATIC_LIBS=ON -DBoost_USE_MULTITHREADED=ON -DBoost_USE_STATIC_RUNTIME=OFF -DALPAKA_ACC_CPU_B_SEQ_T_FIBERS_ENABLE=OFF .. 
make -j 8
make test

Failed Tests:

2:axpy (fixed)

  • due to binary comparison of floats (should be fabs(x-y)>eps instead of x!=y)
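
A minimal sketch of the tolerance-based comparison suggested above. The helper name `almostEqual` and both tolerance parameters are hypothetical and not part of alpaka or the axpy test; the combination of a relative and an absolute tolerance is one common choice, not necessarily the one the fix used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not an alpaka API): returns true if a and b agree
// within an absolute tolerance (useful near zero) or a relative tolerance
// scaled by the larger magnitude of the two values.
template<typename T>
bool almostEqual(T a, T b, T relEps, T absEps)
{
    assert(relEps >= T(0) && absEps >= T(0));
    T const diff = std::fabs(a - b);
    if(diff <= absEps)
    {
        return true;
    }
    T const scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= relEps * scale;
}
```

A result check would then use `!almostEqual(x, y, relEps, absEps)` instead of `x != y`, with tolerances picked for the data type and the expected rounding differences between device and host.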

3:cudaOnly (fixed)

  • An empty test tree gives an error.
    It seems necessary to build a second time with ALPAKA_ACC_GPU_CUDA_ONLY_MODE set in CMake and the other backends manually disabled; only then does make cudaOnly provide the test.
    I have not investigated this, but wouldn't it be possible to integrate this mode into a single build step?

12:event (fixed)

12/21 Testing: event
 12/21 Test: event
 Command: "/home/matwerne/cuda-workspace/alpaka/build/test/unit/event/event"
 Directory: /home/matwerne/cuda-workspace/alpaka/build/test/unit/event
 "event" start time: Jun 19 15:27 CEST
 Output:
 ----------------------------------------------------------
 Running 20 test cases...
 unknown location(0): fatal error: in "event/eventTestShouldBeFalseWhileInQueueAndTrueAfterBeingProcessed<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
 /home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(70): last checkpoint: "eventTestShouldBeFalseWhileInQueueAndTrueAfterBeingProcessed" entry.
 unknown location(0): fatal error: in "event/eventReEnqueueShouldBePossibleIfNobodyWaitsFor<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
 /home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(108): last checkpoint: "eventReEnqueueShouldBePossibleIfNobodyWaitsFor" entry.
 unknown location(0): fatal error: in "event/eventReEnqueueShouldBePossibleIfSomeoneWaitsFor<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
 /home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(162): last checkpoint: "eventReEnqueueShouldBePossibleIfSomeoneWaitsFor" entry.
 unknown location(0): fatal error: in "event/waitForEventThatAlreadyFinishedShouldBeSkipped<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
 /home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(232): last checkpoint: "waitForEventThatAlreadyFinishedShouldBeSkipped" entry.
 
 *** 4 failures are detected in the test module "event"

15:memBuf (fixed)

15/21 Testing: memBuf
 15/21 Test: memBuf
 Command: "/home/matwerne/cuda-workspace/alpaka/build/test/unit/mem/buf/memBuf"
 Directory: /home/matwerne/cuda-workspace/alpaka/build/test/unit/mem/buf
 "memBuf" start time: Jun 19 15:27 CEST
 Output:
 ----------------------------------------------------------

Running 225 test cases...
 unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_long>>": signal: integer divide by zero; address of failing instruction: 0x004950f7
 /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
 unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_unsigned_long>>": signal: integer divide by zero; address of failing instruction: 0x004d5410
 /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
 unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_int>>": signal: integer divide by zero; address of failing instruction: 0x0051504e
 /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
 unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_unsigned_int>>": signal: integer divide by zero; address of failing instruction: 0x00554f26
 /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
 unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_unsigned_short>>": signal: integer divide by zero; address of failing instruction: 0x00586f76
 /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint

 *** 5 failures are detected in the test module "memBuf"

18:queue (fixed)

18/21 Testing: queue
 18/21 Test: queue
 Command: "/home/matwerne/cuda-workspace/alpaka/build/test/unit/queue/queue"
 Directory: /home/matwerne/cuda-workspace/alpaka/build/test/unit/queue
 "queue" start time: Jun 19 15:27 CEST
 Output:
 ----------------------------------------------------------
 Running 16 test cases...
 /home/matwerne/cuda-workspace/alpaka/test/unit/queue/src//QueueTest.cpp(130): fatal error: in "queue/queueWaitShouldWork<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": critical check true == CallbackFinished has failed [true != true]
 
 *** 1 failure is detected in the test module "queue"

Other tests were successful.

@BenjaminW3
Member

BenjaminW3 commented Jun 19, 2018

make test is not yet supported. I am currently enabling this, see #534 for example.

2:axpy: At least on CPU accelerators this test works. Was the failure in CUDA? This might be related to --use_fast_math where the result differs from the IEEE reference result computed on the CPU. We really might need to do the epsilon comparison. Any volunteers to fix this?

3:cudaOnly: This is known to me and I will fix it. The CI test scripts manually exclude this test when the define is set. This has to be replaced with logic within CMake.

@BenjaminW3
Member

BenjaminW3 commented Jun 19, 2018

15:memBuf: I do not yet have a clue what happened here. This test worked once, when @psychocoderHPC added this feature. We do not execute tests when CUDA is enabled due to missing hardware. Can anyone investigate this?

@BenjaminW3
Member

18:queue: I can imagine what happened here and will provide a fix that has to be tested on real hardware. I would expect the other queue test to be flaky as well.

@BenjaminW3
Member

12:event: This test once worked. At least I remember that @psychocoderHPC once executed it. cuStreamWaitValue32 is part of the driver API and not the runtime API. Might this be the reason for the error? Do we need additional settings or options?

@BenjaminW3
Member

BenjaminW3 commented Jun 19, 2018

Update:
Documentation

4.15. Stream memory operations
This section describes the stream memory operations of the low-level CUDA driver application programming interface.

The whole set of operations is disabled by default. Users are required to explicitly enable them, e.g. on Linux by passing the kernel module parameter shown below:

modprobe nvidia NVreg_EnableStreamMemOPs=1

There is currently no way to enable these operations on other operating systems.

Users can programmatically query whether the device supports these operations with cuDeviceGetAttribute() and CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS.

Support for the CU_STREAM_WAIT_VALUE_NOR flag can be queried with CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_WAIT_VALUE_NOR.

Support for the cuStreamWriteValue64() and cuStreamWaitValue64() functions, as well as for the CU_STREAM_MEM_OP_WAIT_VALUE_64 and CU_STREAM_MEM_OP_WRITE_VALUE_64 flags, can be queried with CU_DEVICE_ATTRIBUTE_CAN_USE_64_BIT_STREAM_MEM_OPS.

Support for both CU_STREAM_WAIT_VALUE_FLUSH and CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES requires dedicated platform hardware features and can be queried with cuDeviceGetAttribute() and CU_DEVICE_ATTRIBUTE_CAN_FLUSH_REMOTE_WRITES.

Note that all memory pointers passed as parameters to these operations are device pointers. Where necessary a device pointer should be obtained, for example with cuMemHostGetDevicePointer().

None of the operations accepts pointers to managed memory buffers (cuMemAllocManaged).

In particular, the sentence "The whole set of operations is disabled by default." explains why the 12:event test fails on your system. We may have to skip those tests on systems where these operations are not enabled.
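
A rough sketch of how such a skip decision could look, assuming the support flag is obtained through an injected query so the logic can be exercised without a GPU. The names `AttributeQuery` and `shouldSkipStreamMemOpTests` are hypothetical and not part of alpaka; the attribute name in the comment is taken from the documentation quoted above and may differ in other CUDA versions.

```cpp
#include <cassert>
#include <functional>

// Hypothetical query type: writes the attribute value and returns true on success.
// In a CUDA build it could wrap
//   cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS, device)
// and report success only when the call returns CUDA_SUCCESS.
using AttributeQuery = std::function<bool(int& value)>;

// Returns true if tests that rely on cuStreamWaitValue32 should be skipped.
// A failed query is treated the same as "not supported".
bool shouldSkipStreamMemOpTests(AttributeQuery const& query)
{
    assert(query && "query must be callable");
    int value = 0;
    bool const queryOk = query(value);
    return !queryOk || value == 0;
}
```

The event tests could call this once during setup and return early, or register themselves as skipped, when it yields true. Treating a failed query as "unsupported" is a conservative assumption made for this sketch.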

@tdd11235813
Contributor Author

Ah, very interesting, thanks a lot for your answers. I'll take care of the float comparison and will test the queue test with your PR #538. The stream memory operations issue seems to be related to the CUDA bug in #504.

@tdd11235813
Contributor Author

For the record, the same errors are reproduced with gcc 5.5 + CUDA 9.1 (before: gcc 4.9.3 + CUDA 8).

Currently Loaded Modules:
  1) modenv/classic (S)   2) cmake/3.10.1   3) cuda/9.1.85   4) gcc/5.5.0   5) openmpi/3.0.0-gnu5.5   6) boost/1.65.1-gnu5.5
12/21 Testing: event
12/21 Test: event
Command: "/home/matwerne/cuda-workspace/alpaka/build_cuda/test/unit/event/event"
Directory: /home/matwerne/cuda-workspace/alpaka/build_cuda/test/unit/event
"event" start time: Jun 21 16:30 CEST
Output:
----------------------------------------------------------
Running 20 test cases...
unknown location(0): fatal error: in "event/eventTestShouldBeFalseWhileInQueueAndTrueAfterBeingProcessed<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
/home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(70): last checkpoint: "eventTestShouldBeFalseWhileInQueueAndTrueAfterBeingProcessed" entry.
unknown location(0): fatal error: in "event/eventReEnqueueShouldBePossibleIfNobodyWaitsFor<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
/home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(108): last checkpoint: "eventReEnqueueShouldBePossibleIfNobodyWaitsFor" entry.
unknown location(0): fatal error: in "event/eventReEnqueueShouldBePossibleIfSomeoneWaitsFor<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
/home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(162): last checkpoint: "eventReEnqueueShouldBePossibleIfSomeoneWaitsFor" entry.
unknown location(0): fatal error: in "event/waitForEventThatAlreadyFinishedShouldBeSkipped<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": std::runtime_error: /home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/event/EventHostManualTrigger.hpp(588) cuStreamWaitValue32( static_cast<CUstream>(queue.m_spQueueImpl->m_CudaQueue), reinterpret_cast<CUdeviceptr>(event.m_spEventImpl->m_devMem), 0x01010101u, CU_STREAM_WAIT_VALUE_GEQ) : 'unrecognized error code': 'unrecognized error code'!
/home/matwerne/cuda-workspace/alpaka/test/unit/event/src//EventTest.cpp(232): last checkpoint: "waitForEventThatAlreadyFinishedShouldBeSkipped" entry.

*** 4 failures are detected in the test module "event"
15/21 Testing: memBuf
15/21 Test: memBuf
Command: "/home/matwerne/cuda-workspace/alpaka/build_cuda/test/unit/mem/buf/memBuf"
Directory: /home/matwerne/cuda-workspace/alpaka/build_cuda/test/unit/mem/buf
"memBuf" start time: Jun 21 16:30 CEST
Output:
----------------------------------------------------------
Running 225 test cases...
unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_long>>": signal: integer divide by zero; address of failing instruction: 0x004b0c3b
/home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_unsigned_long>>": signal: integer divide by zero; address of failing instruction: 0x0051ff0a
/home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_int>>": signal: integer divide by zero; address of failing instruction: 0x0058eaa4
/home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_unsigned_int>>": signal: integer divide by zero; address of failing instruction: 0x005fd656
/home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint
unknown location(0): fatal error: in "memBuf/memBufZeroSizeTest<alpaka__acc__AccGpuCudaRt<std__integral_constant<unsigned_long,_3ul>,_unsigned_short>>": signal: integer divide by zero; address of failing instruction: 0x0065e2e4
/home/matwerne/cuda-workspace/alpaka/test/common/include/alpaka/test/mem/view/ViewTest.hpp(158): last checkpoint

*** 5 failures are detected in the test module "memBuf"
18/21 Testing: queue
18/21 Test: queue
Command: "/home/matwerne/cuda-workspace/alpaka/build_cuda/test/unit/queue/queue"
Directory: /home/matwerne/cuda-workspace/alpaka/build_cuda/test/unit/queue
"queue" start time: Jun 21 16:31 CEST
Output:
----------------------------------------------------------
Running 16 test cases...
/home/matwerne/cuda-workspace/alpaka/test/unit/queue/src//QueueTest.cpp(130): fatal error: in "queue/queueWaitShouldWork<std__tuple<alpaka__dev__DevCudaRt,_alpaka__queue__QueueCudaRtAsync>>": critical check true == CallbackFinished has failed [true != true]

*** 1 failure is detected in the test module "queue"
