OpenPBS job queue hang #1930

Closed
cblackworth-anl opened this issue Apr 14, 2021 · 18 comments

@cblackworth-anl

When running jobs on OpenPBS 20.0.1, we hit a problem that looks partly related to #1301 and #1473.

Our PBS installation has job history enabled. When we run jobs on PBS with ReFrame, the run errors out because of it:

[CMD] 'qstat -f 1015.sch1 1016.sch1 1017.sch1 1018.sch1'
[  FAILED  ] Ran 0/2 test case(s) from 3 check(s) (0 failure(s))
[==========] Finished on Wed Apr 14 16:32:55 2021
/usr/bin/reframe: run session stopped: job scheduler error: qstat failed with exit code 35 (standard error follows):
qstat: 1015.sch1 Job has finished, use -x or -H to obtain historical job information
qstat: 1016.sch1 Job has finished, use -x or -H to obtain historical job information
qstat: 1017.sch1 Job has finished, use -x or -H to obtain historical job information
qstat: 1018.sch1 Job has finished, use -x or -H to obtain historical job information

If we disable job history support (not ideal), the run hangs indefinitely, regardless of whether the PBS jobs fail or succeed.

Reframe version:
reframe --version 3.5.0

SLE 15 SP2
lmod
OpenPBS 20.0.1

@vkarak
Contributor

vkarak commented Apr 21, 2021

Hi @cblackworth, I think I know where the problem comes from. ReFrame uses qstat -f job1 job2 ... to query the status of pending jobs. Do you see messages such as "Job state not found (job info follows):" in the log file (or when you rerun ReFrame with -vvv)? Could you copy those messages here?
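To make the polling step concrete, here is a minimal sketch of pulling the per-job state out of qstat -f style output. It is not ReFrame's actual backend code. The function name parse_job_states and the sample text are hypothetical, and the sketch assumes the usual "Job Id:" header followed by indented "key = value" attribute lines. A job whose state stays None corresponds to the "Job state not found" situation mentioned above.

```python
import re


def parse_job_states(qstat_output):
    """Map each job ID found in `qstat -f` style output to its job_state.

    Hypothetical helper: assumes a 'Job Id: <id>' header per job followed
    by indented 'attribute = value' lines.
    """
    states = {}
    current = None
    for line in qstat_output.splitlines():
        header = re.match(r'^Job Id:\s*(\S+)', line)
        if header:
            current = header.group(1)
            states[current] = None      # state not seen yet for this job
            continue

        field = re.match(r'^\s+job_state\s*=\s*(\S+)', line)
        if field and current is not None:
            states[current] = field.group(1)

    return states


# Hypothetical sample output, for illustration only
sample = """Job Id: 1015.sch1
    Job_Name = rfm_HelloTest_job
    job_state = R
Job Id: 1016.sch1
    Job_Name = rfm_HelloTest_job
"""

print(parse_job_states(sample))
```

In this sample the second job has no job_state line, so it is reported with a state of None rather than being silently dropped.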

@cblackworth-anl
Author

Here's the output with logging set to debug (running with -vvv didn't print any more detail):

[  FAILED  ] Ran 0/2 test case(s) from 3 check(s) (0 failure(s))
[==========] Finished on Wed Apr 21 20:04:02 2021
/usr/bin/reframe: run session stopped: job scheduler error: qstat failed with exit code 35 (standard error follows):
qstat: 1037.sch1 Job has finished, use -x or -H to obtain historical job information
qstat: 1038.sch1 Job has finished, use -x or -H to obtain historical job information
qstat: 1039.sch1 Job has finished, use -x or -H to obtain historical job information
qstat: 1040.sch1 Job has finished, use -x or -H to obtain historical job information

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/reframe/frontend/cli.py", line 968, in main
    runner.runall(testcases, restored_cases)
  File "/usr/lib64/python3.6/site-packages/reframe/frontend/executors/__init__.py", line 410, in runall
    self._runall(testcases)
  File "/usr/lib64/python3.6/site-packages/reframe/frontend/executors/__init__.py", line 473, in _runall
    self._policy.runcase(t)
  File "/usr/lib64/python3.6/site-packages/reframe/frontend/executors/policies.py", line 365, in runcase
    self._poll_tasks()
  File "/usr/lib64/python3.6/site-packages/reframe/frontend/executors/policies.py", line 409, in _poll_tasks
    part.scheduler.poll(*part_jobs)
  File "/usr/lib64/python3.6/site-packages/reframe/core/schedulers/pbs.py", line 204, in poll
    f'qstat failed with exit code {completed.returncode} '
reframe.core.exceptions.JobSchedulerError: qstat failed with exit code 35 (standard error follows):
qstat: 1037.sch1 Job has finished, use -x or -H to obtain historical job information

If we turn off job history (which other uses at the site depend on), the run hangs and never finishes: ReFrame keeps running until it times out, stuck repeating "Entering stage: run_complete". This is the output in that case:

Reached concurrency limit for partition 'testbed:login': 8 job(s)
Polling 8 task(s) in 'testbed:login'
[S] local: pid 66859 already dead or assigned elsewhere
[S] local: pid 67056 already dead or assigned elsewhere
[S] local: pid 67170 already dead or assigned elsewhere
[S] local: pid 67773 already dead or assigned elsewhere
[S] local: pid 68068 already dead or assigned elsewhere
[S] local: pid 68180 already dead or assigned elsewhere
[S] local: pid 68765 already dead or assigned elsewhere
Entering stage: run_complete
Entering stage: run_wait
Removing task from the running list: ('HelloMultiLangTest_c', 'testbed:login', 'builtin')
Entering stage: run_complete
Entering stage: run_wait
Removing task from the running list: ('HelloMultiLangTest_c', 'testbed:login', 'cray')
Entering stage: run_complete
Entering stage: run_wait
Removing task from the running list: ('HelloMultiLangTest_cpp', 'testbed:login', 'gnu')
Entering stage: run_complete
Entering stage: run_wait
Removing task from the running list: ('HelloTest', 'testbed:login', 'builtin')
Polling 4 task(s) in 'testbed:pbs'
[CMD] 'qstat -f 1047.sch1 1048.sch1 1049.sch1 1050.sch1'
[S] pbs: Return code is 153: jobids not known by scheduler, assuming all jobs completed
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Scheduling test case ('HelloTest', 'testbed:login', 'cray') for running
Entering stage: compile
Copying /home/user/reframe_testbed/hello/src to stage directory
Symlinking files: []
[CMD] '/usr/share/lmod/lmod/libexec/lmod python show cray-pals'
[CMD] '/usr/share/lmod/lmod/libexec/lmod python load cray-pals'
[CMD] '/home/user/reframe_testbed/stage/testbed/login/cray/HelloTest/rfm_HelloTest_build.sh'

==> timings: setup: 0.006s compile: 1.287s run: 5.723s sanity: 0.000s performance: 0.000s total: 7.024s
Entering stage: cleanup
Copying test files to output directory
Removing stage directory
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete
Entering stage: run_complete

@vkarak
Contributor

vkarak commented Apr 22, 2021

I think it is a bug in our PBS scheduler backend, in the way we set the job's final state. I'll propose a fix in a PR.
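As an illustration of the kind of handling involved (not the actual patch), the sketch below treats the two qstat exit codes seen in this thread as "the jobs are done" instead of as a fatal scheduler error. The function name interpret_qstat and the returned labels are hypothetical. The meaning of codes 35 and 153 is taken from the log messages above and is an assumption about this OpenPBS setup, not a documented contract.

```python
# Exit codes observed in the logs of this thread (assumptions, see above)
JOB_FINISHED_RC = 35    # "Job has finished, use -x or -H to obtain ..."
UNKNOWN_JOB_RC = 153    # "jobids not known by scheduler"


def interpret_qstat(returncode, stderr=''):
    """Decide what a poller should do with a finished qstat call.

    Returns 'parse' when the output should be parsed for job states and
    'completed' when the polled jobs can be marked as finished.
    Any other exit code is reported as an error.
    """
    if returncode == 0:
        return 'parse'

    if returncode in (JOB_FINISHED_RC, UNKNOWN_JOB_RC):
        return 'completed'

    raise RuntimeError(
        f'qstat failed with exit code {returncode}: {stderr.strip()}'
    )


print(interpret_qstat(35, 'qstat: 1015.sch1 Job has finished'))
```

With job history enabled, a poller could instead query finished jobs with the -x option that the qstat error message itself suggests. Either way, the final state still has to be set on each job object so that the run does not keep re-entering run_complete.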

@cblackworth-anl
Author

I pulled your branch and installed it on our test system. With the same configuration on the same system that worked with 3.5.0, it now throws this error on the scheduler tests:

/usr/bin/reframe: no modules system is set: module 'cray-pals' will not be loaded: check the 'modules_system' configuration parameter for your system

@vkarak
Contributor

vkarak commented Apr 22, 2021

Hi @cblackworth, this should be just a warning... It simply says that you haven't set a modules system, so any attempt to load cray-pals will be ignored. You should set the modules_system configuration parameter to lmod or to whichever modules system you are using. For the scope of this issue, though, it should not play any role.
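A quick way to check which modules system a given system entry would end up with is sketched below. The helper modules_system_for is hypothetical and only inspects a settings dictionary; it assumes that an entry without modules_system falls back to 'nomod', which is the value that shows up in the -vvv output later in this thread.

```python
def modules_system_for(site_configuration, system_name):
    """Return the modules system configured for `system_name`.

    Hypothetical helper: falls back to 'nomod' when the system entry
    does not set 'modules_system'.
    """
    for system in site_configuration.get('systems', []):
        if system['name'] == system_name:
            return system.get('modules_system', 'nomod')

    raise KeyError(f'no system named {system_name!r} in the configuration')


# Trimmed-down configuration, for illustration only
example_config = {
    'systems': [
        {'name': 'testbed', 'modules_system': 'lmod'},
        {'name': 'generic'},
    ]
}

print(modules_system_for(example_config, 'testbed'))
print(modules_system_for(example_config, 'generic'))
```

If the warning appears even though the settings file sets modules_system, the helper points at the other possibility: ReFrame selected a different system entry than the one you expected.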

@vkarak
Contributor

vkarak commented Apr 22, 2021

And you get this warning, because my branch that you pulled is beyond 3.5.3 :-)

@cblackworth-anl
Author

It actually is set (the error wasn't there before the update). It makes jobs that otherwise run fine fail, because ReFrame can no longer compile my tests.

To be clear, the only change has been updating the version of ReFrame; everything else is unchanged from when this test worked.

Reverting ReFrame makes this test work again, but then it fails to run on PBS.

So I am unable to verify if this patch works.

[  FAILED  ] Ran 3/3 test case(s) from 3 check(s) (1 failure(s), 0 skipped)
[==========] Finished on Thu Apr 22 17:14:36 2021

==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for HelloMultiLangTest_cpp
  * Test Description: HelloMultiLangTest_cpp
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /shared_home/user/reframe/stage/generic/default/builtin/HelloMultiLangTest_cpp
  * Node list:
  * Job type: local (id=None)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: compile
  * Rerun with '-n HelloMultiLangTest_cpp -p builtin --system generic:default -r'
  * Reason: build system error: I do not know how to compile a C++ program

@vkarak
Contributor

vkarak commented Apr 22, 2021

This is interesting... ReFrame seems to pick up the generic system, which is very odd if you are running exactly the same command. Could you rerun with -vvv and copy here the first lines, including the [ReFrame Setup] section?

@cblackworth-anl
Author

Config file:

site_configuration = {
    'systems': [
        {
            'name': 'testbed',
            'descr': 'testbed Testbed',
            'hostnames': ['testbed','sch1','sc01'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'login',
                    'descr': 'Login nodes',
                    'scheduler': 'local',
                    'launcher': 'local',
                    'environs': ['builtin', 'gnu', 'cray'],
                },
                {
                    'name': 'mc',
                    'descr': 'Multicore nodes',
                    'scheduler': 'pbs',
                    'launcher': 'mpiexec',
                    'environs': ['gnu', 'cray'],
                    'max_jobs': 100,
                    'resources': [
                        {
                            'name': 'memory',
                            'options': ['--mem={size}']
                        }
                    ]
                }
            ]
        },
    ],
    'environments': [
        {
            'name': 'gnu',
            'modules': [{'name': 'PrgEnv-gnu', 'collection': True}],
            'cc': 'cc',
            'cxx': 'CC',
            'ftn': 'ftn',
            'target_systems': ['testbed']
        },
        {
            'name': 'cray',
            'modules': [{'name':  'PrgEnv-cray', 'collection': True}],
            'cc': 'cc',
            'cxx': 'CC',
            'ftn': 'ftn',
            'target_systems': ['testbed']
        },
        {
            'name': 'clang',
            'cc': 'clang',
            'cxx': 'clang++',
            'ftn': ''
        },
        {
            'name': 'builtin',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
        },
        {
            'name': 'builtin',
            'cc': 'cc',
            'cxx': 'CC',
            'ftn': 'ftn',
            'target_systems': ['testbed']
        }
    ],
    'logging': [
        {
            'level': 'debug',
            'handlers': [
                {
                    'type': 'stream',
                    'name': 'stdout',
                    'level': 'info',
                    'format': '%(message)s'
                },
                {
                    'type': 'file',
                    'level': 'debug',
                    'format': '[%(asctime)s] %(levelname)s: %(check_info)s: %(message)s',   # noqa: E501
                    'append': False
                }
            ],
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    'format': (
                        '%(check_job_completion_time)s|reframe %(version)s|'
                        '%(check_info)s|jobid=%(check_jobid)s|'
                        '%(check_perf_var)s=%(check_perf_value)s|'
                        'ref=%(check_perf_ref)s '
                        '(l=%(check_perf_lower_thres)s, '
                        'u=%(check_perf_upper_thres)s)|'
                        '%(check_perf_unit)s'
                    ),
                    'append': True
                }
            ]
        }
    ],
}

Command output:


reframe -C ./settings.py -r -R -c ./
[ReFrame Setup]
  version:           3.6.0-dev.3
  command:           '/usr/bin/reframe -C ./settings.py -r -R -c ./'
  launched by:       cblackworth@sc01
  working directory: '/shared_home/cblackworth/reframe_testbed'
  settings file:     './settings.py'
  check search path: (R) '/shared_home/cblackworth/reframe_testbed'
  stage directory:   '/shared_home/cblackworth/reframe_testbed/stage'
  output directory:  '/shared_home/cblackworth/reframe_testbed/output'

/shared_home/cblackworth/reframe_testbed/hello/hello2.py:10: WARNING: the @parameterized_test decorator is deprecated; please use the parameter() built-in instead
@rfm.parameterized_test(['c'], ['cpp'])

[==========] Running 3 check(s)
[==========] Started on Thu Apr 22 17:37:27 2021

[----------] started processing HelloMultiLangTest_c (HelloMultiLangTest_c)
[ RUN      ] HelloMultiLangTest_c on generic:default using builtin
/usr/bin/reframe: no modules system is set: module 'cray-pals' will not be loaded: check the 'modules_system' configuration parameter for your system
[----------] finished processing HelloMultiLangTest_c (HelloMultiLangTest_c)

[----------] started processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[ RUN      ] HelloMultiLangTest_cpp on generic:default using builtin
[     FAIL ] (1/3) HelloMultiLangTest_cpp on generic:default using builtin [compile: 0.010s run: n/a total: 0.023s]
==> test failed during 'compile': test staged in '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_cpp'
[----------] finished processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)

[----------] started processing HelloTest (HelloTest)
[ RUN      ] HelloTest on generic:default using builtin
[----------] finished processing HelloTest (HelloTest)

[----------] waiting for spawned checks to finish
[       OK ] (2/3) HelloMultiLangTest_c on generic:default using builtin [compile: 0.088s run: 0.176s total: 0.283s]
[       OK ] (3/3) HelloTest on generic:default using builtin [compile: 0.079s run: 0.174s total: 0.271s]
[----------] all spawned checks have finished

[  FAILED  ] Ran 3/3 test case(s) from 3 check(s) (1 failure(s), 0 skipped)
[==========] Finished on Thu Apr 22 17:37:27 2021

==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for HelloMultiLangTest_cpp
  * Test Description: HelloMultiLangTest_cpp
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_cpp
  * Node list:
  * Job type: local (id=None)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: compile
  * Rerun with '-n HelloMultiLangTest_cpp -p builtin --system generic:default -r'
  * Reason: build system error: I do not know how to compile a C++ program
------------------------------------------------------------------------------
Log file(s) saved in: '/tmp/rfm-gb82bgcv.log'

@cblackworth-anl
Author

Full output from -vvv run.

 reframe -C ./settings.py -r -R -c ./ -vvv
Loading user configuration
Loading configuration file: './settings.py'
Detecting system
Looking for a matching configuration entry for system 'sc01'
Configuration found: picking system 'generic'
Selecting subconfig for 'generic'
Initializing runtime
Selecting subconfig for 'generic:default'
Initializing system partition 'default'
Selecting subconfig for 'generic'
Initializing system 'generic'
Initializing modules system 'nomod'
[ReFrame Environment]
  RFM_CHECK_SEARCH_PATH=<not set>
  RFM_CHECK_SEARCH_RECURSIVE=<not set>
  RFM_CLEAN_STAGEDIR=<not set>
  RFM_COLORIZE=<not set>
  RFM_CONFIG_FILE=<not set>
  RFM_GRAYLOG_ADDRESS=<not set>
  RFM_IGNORE_CHECK_CONFLICTS=<not set>
  RFM_IGNORE_REQNODENOTAVAIL=<not set>
  RFM_INSTALL_PREFIX=/usr/lib/python3.6/site-packages/ReFrame_HPC-3.6.0.dev3-py3.6.egg
  RFM_KEEP_STAGE_FILES=<not set>
  RFM_MODULE_MAPPINGS=<not set>
  RFM_MODULE_MAP_FILE=<not set>
  RFM_NON_DEFAULT_CRAYPE=<not set>
  RFM_OUTPUT_DIR=<not set>
  RFM_PERFLOG_DIR=<not set>
  RFM_PREFIX=<not set>
  RFM_PURGE_ENVIRONMENT=<not set>
  RFM_REPORT_FILE=<not set>
  RFM_SAVE_LOG_FILES=<not set>
  RFM_STAGE_DIR=<not set>
  RFM_SYSLOG_ADDRESS=<not set>
  RFM_SYSTEM=<not set>
  RFM_TIMESTAMP_DIRS=<not set>
  RFM_UNLOAD_MODULES=<not set>
  RFM_USER_MODULES=<not set>
  RFM_USE_LOGIN_SHELL=<not set>
  RFM_VERBOSE=<not set>
[ReFrame Setup]
  version:           3.6.0-dev.3
  command:           '/usr/bin/reframe -C ./settings.py -r -R -c ./ -vvv'
  launched by:       cblackworth@sc01
  working directory: '/shared_home/cblackworth/reframe_testbed'
  settings file:     './settings.py'
  check search path: (R) '/shared_home/cblackworth/reframe_testbed'
  stage directory:   '/shared_home/cblackworth/reframe_testbed/stage'
  output directory:  '/shared_home/cblackworth/reframe_testbed/output'

Looking for tests in '/shared_home/cblackworth/reframe_testbed'
Validating '/shared_home/cblackworth/reframe_testbed/hello/hello2.py': OK
/shared_home/cblackworth/reframe_testbed/hello/hello2.py:10: WARNING: the @parameterized_test decorator is deprecated; please use the parameter() built-in instead
@rfm.parameterized_test(['c'], ['cpp'])

  > Loaded 2 test(s)
Validating '/shared_home/cblackworth/reframe_testbed/hello/hello1.py': OK
  > Loaded 1 test(s)
Validating '/shared_home/cblackworth/reframe_testbed/settings.py': not a test file
Loaded 3 test(s)
Generated 3 test case(s)
Filtering test cases(s) by name: 3 remaining
Filtering test cases(s) by tags: 3 remaining
Filtering test cases(s) by other attributes: 3 remaining
Building and validating the full test DAG
Full test DAG:
  ('HelloMultiLangTest_c', 'generic:default', 'builtin') -> []
  ('HelloMultiLangTest_cpp', 'generic:default', 'builtin') -> []
  ('HelloTest', 'generic:default', 'builtin') -> []
Final number of test cases: 3
Loading environment for current system
(Un)using module paths from command line
Loading user modules from command line
[==========] Running 3 check(s)
[==========] Started on Thu Apr 22 18:42:45 2021

[----------] started processing HelloMultiLangTest_c (HelloMultiLangTest_c)
Selecting subconfig for 'generic:default'
[ RUN      ] HelloMultiLangTest_c on generic:default using builtin
Entering stage: setup
Setting up test paths
Created stage directory '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_c' [clean_stagedir: True]
Created output directory '/shared_home/cblackworth/reframe_testbed/output/generic/default/builtin/HelloMultiLangTest_c'
Setting up job 'rfm_HelloMultiLangTest_c_job' (scheduler: 'local', launcher: 'local')
Setting up job 'rfm_HelloMultiLangTest_c_build' (scheduler: 'local', launcher: 'local')
Scheduling test case ('HelloMultiLangTest_c', 'generic:default', 'builtin') for running
Entering stage: compile
Copying /shared_home/cblackworth/reframe_testbed/hello/src to stage directory
Symlinking files: []
/usr/bin/reframe: no modules system is set: module 'cray-pals' will not be loaded: check the 'modules_system' configuration parameter for your system
[CMD] '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_c/rfm_HelloMultiLangTest_c_build.sh'
Entering stage: compile_wait
[S] local: pid 22765 already dead or assigned elsewhere
Entering stage: run
Generating the run script
[CMD] '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_c/rfm_HelloMultiLangTest_c_job.sh'
Spawned run job (id=22776)
[----------] finished processing HelloMultiLangTest_c (HelloMultiLangTest_c)

[----------] started processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)
[ RUN      ] HelloMultiLangTest_cpp on generic:default using builtin
Entering stage: setup
Setting up test paths
Created stage directory '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_cpp' [clean_stagedir: True]
Created output directory '/shared_home/cblackworth/reframe_testbed/output/generic/default/builtin/HelloMultiLangTest_cpp'
Setting up job 'rfm_HelloMultiLangTest_cpp_job' (scheduler: 'local', launcher: 'local')
Setting up job 'rfm_HelloMultiLangTest_cpp_build' (scheduler: 'local', launcher: 'local')
Scheduling test case ('HelloMultiLangTest_cpp', 'generic:default', 'builtin') for running
Entering stage: compile
Copying /shared_home/cblackworth/reframe_testbed/hello/src to stage directory
Symlinking files: []
caught reframe.core.exceptions.BuildSystemError: I do not know how to compile a C++ program
Entering stage: performance
Symlinking files: []
caught reframe.core.exceptions.BuildSystemError: I do not know how to compile a C++ program
Removing task from the running list: ('HelloMultiLangTest_cpp', 'generic:default', 'builtin')
Task was not running
[     FAIL ] (1/3) HelloMultiLangTest_cpp on generic:default using builtin [compile: 0.011s run: n/a total: 0.026s]
==> test failed during 'compile': test staged in '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_cpp'
==> timings: setup: 0.011s compile: 0.011s run: n/a sanity: n/a performance: n/a total: 0.026s
[----------] finished processing HelloMultiLangTest_cpp (HelloMultiLangTest_cpp)

[----------] started processing HelloTest (HelloTest)
[ RUN      ] HelloTest on generic:default using builtin
Entering stage: setup
Setting up test paths
Created stage directory '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloTest' [clean_stagedir: True]
Created output directory '/shared_home/cblackworth/reframe_testbed/output/generic/default/builtin/HelloTest'
Setting up job 'rfm_HelloTest_job' (scheduler: 'local', launcher: 'local')
Setting up job 'rfm_HelloTest_build' (scheduler: 'local', launcher: 'local')
Scheduling test case ('HelloTest', 'generic:default', 'builtin') for running
Entering stage: compile
Copying /shared_home/cblackworth/reframe_testbed/hello/src to stage directory
Symlinking files: []
[CMD] '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloTest/rfm_HelloTest_build.sh'
Entering stage: compile_wait
[S] local: pid 22782 already dead or assigned elsewhere
Entering stage: run
Generating the run script
[CMD] '/shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloTest/rfm_HelloTest_job.sh'
Spawned run job (id=22793)
[----------] finished processing HelloTest (HelloTest)

[----------] waiting for spawned checks to finish
Running tasks: 2
Polling 2 task(s) in 'generic:default'
[S] local: pid 22776 already dead or assigned elsewhere
Entering stage: run_complete
Entering stage: run_wait
Removing task from the running list: ('HelloMultiLangTest_c', 'generic:default', 'builtin')
Finalizing 1 task(s)
Finalizing task ('HelloMultiLangTest_c', 'generic:default', 'builtin')
Finalizing task ('HelloMultiLangTest_c', 'generic:default', 'builtin')
Entering stage: sanity
Entering stage: performance
[       OK ] (2/3) HelloMultiLangTest_c on generic:default using builtin [compile: 0.092s run: 0.195s total: 0.308s]
==> timings: setup: 0.012s compile: 0.092s run: 0.195s sanity: 0.001s performance: 0.000s total: 0.308s
Entering stage: cleanup
Copying test files to output directory
Removing stage directory
Poll rate control: sleeping for 0.1s (current poll rate: 699050.6666666666 polls/s)
Running tasks: 1
Polling 1 task(s) in 'generic:default'
[S] local: pid 22793 already dead or assigned elsewhere
Entering stage: run_complete
Entering stage: run_wait
Removing task from the running list: ('HelloTest', 'generic:default', 'builtin')
Finalizing 1 task(s)
Finalizing task ('HelloTest', 'generic:default', 'builtin')
Finalizing task ('HelloTest', 'generic:default', 'builtin')
Entering stage: sanity
Entering stage: performance
[       OK ] (3/3) HelloTest on generic:default using builtin [compile: 0.081s run: 0.184s total: 0.286s]
==> timings: setup: 0.012s compile: 0.081s run: 0.184s sanity: 0.001s performance: 0.000s total: 0.286s
Entering stage: cleanup
Copying test files to output directory
Removing stage directory
[----------] all spawned checks have finished

[  FAILED  ] Ran 3/3 test case(s) from 3 check(s) (1 failure(s), 0 skipped)
[==========] Finished on Thu Apr 22 18:42:45 2021

==============================================================================
SUMMARY OF FAILURES
------------------------------------------------------------------------------
FAILURE INFO for HelloMultiLangTest_cpp
  * Test Description: HelloMultiLangTest_cpp
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /shared_home/cblackworth/reframe_testbed/stage/generic/default/builtin/HelloMultiLangTest_cpp
  * Node list:
  * Job type: local (id=None)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: compile
  * Rerun with '-n HelloMultiLangTest_cpp -p builtin --system generic:default -r'
  * Reason: build system error: I do not know how to compile a C++ program
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ReFrame_HPC-3.6.0.dev3-py3.6.egg/reframe/frontend/executors/__init__.py", line 269, in _safe_call
    return fn(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ReFrame_HPC-3.6.0.dev3-py3.6.egg/reframe/core/hooks.py", line 36, in _fn
    func(obj, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ReFrame_HPC-3.6.0.dev3-py3.6.egg/reframe/core/pipeline.py", line 97, in _wrapped
    return fn(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ReFrame_HPC-3.6.0.dev3-py3.6.egg/reframe/core/pipeline.py", line 1242, in compile
    *self.build_system.emit_build_commands(self._current_environ),
  File "/usr/lib/python3.6/site-packages/ReFrame_HPC-3.6.0.dev3-py3.6.egg/reframe/core/buildsystems.py", line 436, in emit_build_commands
    raise BuildSystemError('I do not know how to compile a '
reframe.core.exceptions.BuildSystemError: I do not know how to compile a C++ program

------------------------------------------------------------------------------
Log file(s) saved in: '/tmp/rfm-zcq63gqr.log'

@vkarak
Contributor

vkarak commented Apr 26, 2021

Hi @cblackworth, regarding your system not being recognised after you pulled in my branch: it looks like a regression in the framework (see #1948). Meanwhile, you can explicitly pass --system=testbed in order to test this PR.

@cblackworth-anl
Author

That doesn't seem to work?

cblackworth@sc01:/shared_home/cblackworth/reframe_testbed> reframe -C ./settings.py -c ./hello/ -L -vvv --show-config=systems/0/name
Loading user configuration
Loading configuration file: './settings.py'
Detecting system
Looking for a matching configuration entry for system 'sc01'
Configuration found: picking system 'generic'
Selecting subconfig for 'generic'
Initializing runtime
Selecting subconfig for 'generic:default'
Initializing system partition 'default'
Selecting subconfig for 'generic'
Initializing system 'generic'
Initializing modules system 'nomod'
"generic"
cblackworth@sc01:/shared_home/cblackworth/reframe_testbed> reframe -C ./settings.py -r -C ./settings.py --system=testbed
/usr/bin/reframe: failed to load configuration: could not find a configuration entry for the requested system: 'testbed'
/usr/bin/reframe: Log file(s) saved in: '/tmp/rfm-rhrq61ed.log'
cblackworth@sc01:/shared_home/cblackworth/reframe_test>

@cblackworth-anl
Author

That last command should be "reframe -C ./settings.py -c ./hello/ --system=testbed -r"

@vkarak
Contributor

vkarak commented Apr 26, 2021

Hmm, your case doesn't seem to be the same as #1948. May I ask how you installed ReFrame? One strange thing that I can't understand is that your settings file does not define the generic system, so I'm wondering how ReFrame picks it up. That shouldn't happen...
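For reference, the expected auto-detection behavior can be sketched roughly as follows. This is a simplified illustration and not ReFrame's implementation. The function detect_system is hypothetical, and the sketch assumes that each hostnames entry is treated as a Python regular expression matched from the start of the hostname, with a fallback system when nothing matches.

```python
import re


def detect_system(hostname, systems, fallback='generic'):
    """Pick the first system whose 'hostnames' patterns match `hostname`.

    Hypothetical sketch: patterns are matched with re.match(), i.e.
    anchored at the start of the hostname only.
    """
    for system in systems:
        for pattern in system.get('hostnames', []):
            if re.match(pattern, hostname):
                return system['name']

    return fallback


# System entry taken from the settings file posted above
systems = [
    {'name': 'testbed', 'hostnames': ['testbed', 'sch1', 'sc01']},
]

print(detect_system('sc01', systems))     # testbed
print(detect_system('node42', systems))   # generic
```

Under these assumptions the host sc01 should resolve to testbed with the posted settings, so a run on sc01 that ends up on generic suggests that the settings actually loaded are not the ones in the file.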

@cblackworth-anl
Author

It looks like there was a lingering ReFrame module somewhere in the Python installation. I reran this using just the bootstrap script and it worked.

For the GitHub releases, is there a "correct" way to install that doesn't use bootstrap.sh? (This system can't access the internet.)
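A leftover installation like this can be spotted by asking Python where it would import a package from. The sketch below uses the standard importlib machinery; the helper name module_origin is hypothetical. The tracebacks earlier in this thread show two different locations, /usr/lib64/python3.6/site-packages/reframe/ and the ReFrame_HPC-3.6.0.dev3-py3.6.egg under /usr/lib/python3.6/site-packages/, which is the kind of mismatch this check surfaces.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints the path of the reframe package that this interpreter would load,
# or None if no reframe package is visible on sys.path
print(module_origin('reframe'))
```

Run it with the same interpreter that the reframe executable uses, since a different Python may see a different set of site-packages directories.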

@cblackworth-anl
Author

I am now running into this odd behavior and I'm wondering if it's related. The error shows up only for this one test and only when run through PBS; the local-only run of the same test works fine:

(both the gnu and cray tests fail, hence the changing test name, same error though)

caught builtins.FileNotFoundError: [Errno 2] No such file or directory: '/home/cblackworth/stage/testbed/pbs/cray/HelloTest/rfm_HelloTest_job.err'

cblackworth@sch1:~> ls -al /home/cblackworth/stage/testbed/pbs/cray/HelloTest/rfm_HelloTest_job.err
-rw------- 1 cblackworth users 0 Apr 28 14:12 /home/cblackworth/stage/testbed/pbs/cray/HelloTest/rfm_HelloTest_job.err

Further details:

[     FAIL ] ( 4/15) HelloTest on testbed:pbs using gnu [compile: 0.752s run: 0.818s total: 1.605s]
==> test failed during 'sanity': test staged in '/home/cblackworth/stage/testbed/pbs/gnu/HelloTest'


------------------------------------------------------------------------------
FAILURE INFO for HelloTest
  * Test Description: HelloTest
  * System partition: testbed:pbs
  * Environment: cray
  * Stage directory: /home/cblackworth/stage/testbed/pbs/cray/HelloTest
  * Node list: x1000
  * Job type: batch job (id=1116.sch1.hsn0.cm.testbed.imi.alcf.anl.gov)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: cleanup
  * Rerun with '-n HelloTest -p cray --system testbed:pbs -r'
  * Reason: file not found error: [Errno 2] No such file or directory: '/home/cblackworth/stage/testbed/pbs/cray/HelloTest/rfm_HelloTest_job.err'
Traceback (most recent call last):
  File "/reframe/reframe/frontend/executors/__init__.py", line 269, in _safe_call
    return fn(*args, **kwargs)
  File "/reframe/reframe/core/hooks.py", line 36, in _fn
    func(obj, *args, **kwargs)
  File "/reframe/reframe/core/pipeline.py", line 97, in _wrapped
    return fn(*args, **kwargs)
  File "/reframe/reframe/core/pipeline.py", line 1665, in cleanup
    self._copy_to_outputdir()
  File "/reframe/reframe/core/pipeline.py", line 1618, in _copy_to_outputdir
    self._copy_job_files(self._job, self.outputdir)
  File "/reframe/reframe/core/pipeline.py", line 1612, in _copy_job_files
    shutil.copy(stderr, dst)
  File "/usr/lib64/python3.6/shutil.py", line 245, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib64/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/cblackworth/stage/testbed/pbs/cray/HelloTest/rfm_HelloTest_job.err'

@vkarak
Contributor

vkarak commented Apr 28, 2021

Hi @cblackworth, is this failure persistent? It really looks like #1394. There, we were on a Lustre filesystem and I'm not sure whether there was some kind of glitch, but we haven't seen that error for a while.
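If the cause is indeed a file that becomes visible on the shared filesystem slightly after the job ends, which this thread does not confirm, a workaround could look like the sketch below. It is not what ReFrame does; the helper copy_when_visible and its timeout and interval parameters are hypothetical.

```python
import os
import shutil
import time


def copy_when_visible(src, dst, timeout=5.0, interval=0.1):
    """Copy `src` to `dst`, waiting up to `timeout` seconds for `src`.

    Hypothetical workaround for files that show up late on a shared
    filesystem. Raises FileNotFoundError if `src` never appears.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(src):
        if time.monotonic() >= deadline:
            raise FileNotFoundError(src)

        time.sleep(interval)

    return shutil.copy(src, dst)
```

A call such as copy_when_visible(stderr_path, output_dir) would replace a plain shutil.copy() call, where stderr_path and output_dir stand for the job's error file and the test's output directory. The ls output above, which shows the .err file present after the failure, is at least consistent with a timing issue of this kind.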

@vkarak vkarak removed this from the ReFrame sprint 21.05.1 milestone May 17, 2021
@vkarak vkarak added this to the ReFrame Sprint 21.07.1 milestone Jul 14, 2021
@vkarak
Contributor

vkarak commented Jul 23, 2021

Hi @cblackworth, I'll close this issue now. Feel free to reopen it, if it persists.
