Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String Interpolation and Immutable Errors #25

Merged
merged 1 commit into from
Oct 21, 2015
Merged

String Interpolation and Immutable Errors #25

merged 1 commit into from
Oct 21, 2015

Conversation

gcamilo
Copy link
Contributor

@gcamilo gcamilo commented Oct 9, 2015

Fixed string interpolation errors and new types in Julia 0.4

@andreasnoack
Copy link
Member

cc: @amitmurthy

@@ -68,7 +70,7 @@ function launch(manager::Union{PBSManager, SGEManager}, params::Dict, instances_
end

function manage(manager::Union{PBSManager, SGEManager}, id::Integer, config::WorkerConfig, op::Symbol)
if op == :finalize
if op == :interrupt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Shouldn't :interrupt send a message that simulates a SIGINT? And :finalize should kill the workers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it wasn't killing the workers, closing just dumped a bunch of random stuff to no end on STDOUT that I had to Ctrl+Z and kill, Ctrl+C did nothing. In the current implementation op SSHManagers there is no finalize op, just an interrupt so I changed it to this and it kills the worker properly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default implementation of kill(manager, pid, config) executes a exit() on the remote worker. So rmprocs(workers()) should result in the killing of workers. Can you try it out?

If not, you can implement kill(manager::Union{PBSManager, SGEManager}, id::Integer, config::WorkerConfig) and manually manage the killing of workers as you are doing here.

:finalize is called when the manager objects is gced - the timing of which is not predictable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using rmprocs(workers()) with the current implementation of qsub (with :finalize to kill workers) returns :ok but when you Ctrl+D to exit julia also prints to STDOUT unending stream of commands and must be killed by hand.

I tried implementing

function  kill(manager::Union{PBSManager, SGEManager}, id::Integer, config::WorkerConfig)
    kill(config.userdata[:process])
end

And the same happens with :finalinze.

If you look in managers.jl, there the killing of workers is never handled in a :finalize op, always in an :interrupt. Finalizers just show up in multi.jl:

 # install a finalizer to perform cleanup if necessary
finalizer(w, (w)->if myid() == 1 manage(w.manager, w.id, w.config, :finalize) end)

So we could follow this model and keep my implementation of the manage function as is, which is the only thing currently working for me, and do any cleanup in a finalize. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:interrupt is for Ctrl-C, i.e., SIGINT. In managers.jl it sends a kill -2 which is the command line way of sending a SIGINT to a process.

Don't do anything with :finalize

In your implementation of kill do the following:

remotecall(exit, id)
close(config.io)
kill(config.userdata[:process])
if isfile(config.userdata[:iofile]
    rm(config.userdata[:iofile])
end

This ought to work.

@gcamilo
Copy link
Contributor Author

gcamilo commented Oct 20, 2015

So my current implementation of kill is

function manage(manager::Union{PBSManager, SGEManager}, id::Integer, config::WorkerConfig, op::Symbol)
    if op == :interrupt
    kill(manager, id, config)
    end
end

function kill(manager::Union{PBSManager, SGEManager}, id::Integer, config::WorkerConfig)

    remotecall(id,exit)
    close(get(config.io))

    kill(get(config.userdata)[:process],2)  


    if isfile(get(config.userdata)[:iofile])
        rm(get(config.userdata)[:iofile])
    end

end

The gets are necessary because WorkerConfig is all Nullable types. Ctrl+D exits julia quietly, but the processes are not killed and the files not deleted. After looking around this seems to be the culprit which pops up after killing the tail command:

ERROR (unhandled task failure): EOFError: read end of file
 in read at stream.jl:911
 in message_handler_loop at multi.jl:847
 in process_tcp_streams at multi.jl:836
 in anonymous at task.jl:63

I tried wrapping in a try/catch but that did not help.

@amitmurthy
Copy link
Contributor

You don't need

  if op == :interrupt
    kill(manager, id, config)
    end

This would kill the workers when interrupt(workers()) is called.

I guess the problem with Ctrl-D is that the spawned tail commands are not being killed - maybe due to this - JuliaLang/julia#9095 . But this should have been an issue previously too?

What about on an exit() from Julia? Is the cleanup happening properly?

@gcamilo
Copy link
Contributor Author

gcamilo commented Oct 20, 2015

So I should leave the manage method empty?

This was happening before, just realized they are the same problem. exit() does not clean up properly and neither does Ctrl+D. The finalizer in your issue, finalizer(proc, (x)->kill(x)), either in launch or in kill didn't help.

@amitmurthy
Copy link
Contributor

Yes, leave the manage method empty.

Changing kill(get(config.userdata)[:process],2) to kill(get(config.userdata)[:process],15) in your kill implementation ought to terminate the tail commands.

@gcamilo
Copy link
Contributor Author

gcamilo commented Oct 20, 2015

It did not, should I also have the finalizer in there?

@amitmurthy
Copy link
Contributor

No. Does not kill 15 kill the spawned tail commands?

@gcamilo
Copy link
Contributor Author

gcamilo commented Oct 21, 2015

No, the tail commands are not killed either after exit() or Ctrl+D.

@amitmurthy
Copy link
Contributor

I think I found the issue. We need to export kill, init_worker and connect in src/ClusterManagers.jl

Without it your kill implementation is not being called.

Can you change

import Base: launch, manage
export launch, manage

to

import Base: launch, manage, kill, init_worker, connect
export launch, manage, kill, init_worker, connect 

and try again.

@gcamilo
Copy link
Contributor Author

gcamilo commented Oct 21, 2015

Nailed it, the tail -f processes are killed, and the files are deleted, I'll fix the pull request.

Everything seems to be working, at least for SGE.

@amitmurthy
Copy link
Contributor

Thanks for your patience. Can you also squash all the commits before we merge.

@gcamilo
Copy link
Contributor Author

gcamilo commented Oct 21, 2015

Done.

amitmurthy added a commit that referenced this pull request Oct 21, 2015
String Interpolation and Immutable Errors
@amitmurthy amitmurthy merged commit e0e2c02 into JuliaParallel:master Oct 21, 2015
@gcamilo gcamilo deleted the pull-request/399bcb2d branch October 21, 2015 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants