
Anonymous classes in inline code are loaded too late for serialization #104

Closed · facundominguez opened this issue Apr 11, 2017 · 6 comments

facundominguez commented Apr 11, 2017

`[java| $rdd.map(new Function<Object, Object>() { public Object call(Object x) { return x; } }) |]`

doesn't work on multi-node setups.

The problem is that an executor receives the serialized object `new Function<Object, Object>() { public Object call(Object x) { return x; } }`, and in order to deserialize it, it needs to load the anonymous class to which the object belongs.

The executor then finds that no jar or class on its classpath contains the class definition, so deserialization fails. Where is the class, then? Currently inline-java embeds the bytecode in the Haskell executable. The embedded bytecode is sent to the JVM at runtime by Language.Java.Inline.loadJavaWrappers, but this function is never called on executors.
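
For illustration only (this is not sparkle or inline-java code), the failing step on the executor boils down to plain Java deserialization, which can only succeed if the receiving JVM is able to load the object's class:

```java
import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;

// Minimal sketch of why the executor fails: readObject needs the anonymous
// class to be loadable in this JVM, and if the embedded bytecode has never
// been defined here it throws a ClassNotFoundException.
public class DeserializeTask {
    public static Object deserialize(byte[] taskBytes) throws Exception {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(taskBytes))) {
            return in.readObject();
        }
    }
}
```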

The ideal fix would be for Spark to provide startup hooks, so that Language.Java.Inline.loadJavaWrappers could be called when the executor starts. But this feature is not implemented.

Calling loadJavaWrappers when sparkle loads is no good, because upon receiving the serialized object, the executor has no clue that it needs to load sparkle in order to have the class defined.

The only workaround I've found so far is to dump the .class files that inline-java produces into a folder and add them to the sparkle application jar (tweag/inline-java#62).

Any preferences on how to better deal with this?

mboes changed the title from "inline-java is difficult to use with sparkle." to "Anonymous classes in inline code are loaded too late for serialization" on Apr 11, 2017

mboes commented Apr 11, 2017

Note that this problem isn't specific to inline-java: it's a problem for any use of JNI's defineClass, which Spark currently provides no way of performing preemptively at initialization time. It sounds to me like this is an upstream issue, which we could work around in various ways in inline-java for that particular special case.

@facundominguez

> Which we could work around in various ways in inline-java for that particular special case.

Indeed, any preference?


mboes commented Apr 12, 2017

Since, as noted above, this is an upstream issue with Spark itself, I have a preference for keeping any workaround in sparkle. We could remove the workaround once the ticket you mention above is resolved. Inline code is just stubs, and these stubs are best kept in the executable itself: no one other than the executable should see them, nor should anyone else be able to call them. And that way we don't need to parameterize the java QQ with a gazillion (aka 1-3) options whose combinations are hard to test exhaustively.

The JIRA ticket you mention includes comments from several folks who successfully hooked into JavaSerializer. We could call loadJavaWrappers once (or every time) from the serializer. Did the "epic struggle" you mention in inline-java#62 include that already?


facundominguez commented Apr 12, 2017

I didn't try it. Apparently we need to:

  1. Extend org.apache.spark.serializer.Serializer with a wrapper that loads sparkle in a static block and forwards calls to the appropriate serializer (see the sketch below this comment).
  2. Set our instance with sparkConf.set("spark.serializer", "our.serializer.class.name").

I don't see how this can be parameterized by the serializer that Spark currently uses; we might have to define a different wrapper for each serializer we ever want to use.
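
A minimal sketch of the wrapper idea under those assumptions. The class name and the `loadJavaWrappers` native hook are hypothetical stand-ins (sparkle would need to export something that triggers Language.Java.Inline.loadJavaWrappers), and the delegate is hard-wired to JavaSerializer, which is exactly the parameterization problem mentioned above:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.serializer.JavaSerializer;
import org.apache.spark.serializer.Serializer;
import org.apache.spark.serializer.SerializerInstance;

// Hypothetical wrapper: load the inline-java wrappers once, in a static
// initializer, then forward all serialization work to Spark's JavaSerializer.
public class SparkleJavaSerializer extends Serializer {
    static {
        // Hypothetical hook: trigger Language.Java.Inline.loadJavaWrappers
        // on the Haskell side before anything is deserialized here.
        loadJavaWrappers();
    }

    // Placeholder for the native entry point sparkle would have to provide.
    private static native void loadJavaWrappers();

    private final JavaSerializer delegate;

    public SparkleJavaSerializer(SparkConf conf) {
        this.delegate = new JavaSerializer(conf);
    }

    @Override
    public SerializerInstance newInstance() {
        return delegate.newInstance();
    }
}
```

The driver would then select it with `sparkConf.set("spark.serializer", "SparkleJavaSerializer")` (or its fully qualified name), and a different wrapper would indeed be needed for KryoSerializer or any other serializer.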

mboes added a commit that referenced this issue Apr 16, 2017
Just like in straight Java, it's perfectly legal in an inline-java quasiquote to create an object of an anonymous class. The problem is, such an object can't be deserialized by any process that hasn't yet loaded the wrappers for all quasiquotes, since it is the wrappers that "define" the anonymous class.

Spark executors can be given a task by the Spark driver that includes such anonymous objects. Without the InlineJavaRegistrator provided here, it is not possible to guarantee that the inline-java wrappers have been loaded *prior* to the task being deserialized.

The solution here consists of choosing the Kryo serializer. It's much faster than the default `JavaSerializer` that Spark uses anyway. `KryoSerializer` provides a crucial facility that `JavaSerializer` does not: class registration. Spark furthermore defines "registrator" classes that, when invoked, perform class registration, or indeed any arbitrary action. We provide an `InlineJavaRegistrator` to inline-java users, which abuses class registration to first load all wrappers. This happens on all executors prior to any work being performed.

Fixes #104.
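
For reference, a hedged usage sketch of the approach described in this commit message; the registrator's fully qualified name below is an assumed placeholder rather than the actual class path in sparkle:

```java
import org.apache.spark.SparkConf;

// Sketch: opt in to Kryo and point Spark at the registrator, so that the
// inline-java wrappers are loaded on every executor before any task is
// deserialized.
public class KryoSetup {
    public static SparkConf configure(SparkConf conf) {
        return conf
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            // Assumed placeholder name; check sparkle for the real registrator class.
            .set("spark.kryo.registrator", "io.tweag.sparkle.kryo.InlineJavaRegistrator");
    }
}
```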

facundominguez commented Apr 26, 2017

This is not quite as usable as sparkle users need: when the classes that loadJavaWrappers loads depend on classes in sparkle.jar, the class loader can't find them at the point where loadJavaWrappers is invoked from InlineJavaRegistrator.java.


mboes commented Apr 26, 2017

Could we please not unearth old issues from the dead and instead create a new one?
