[MXNET-984] Add Java NDArray and introduce Java Operator Builder class #12772
Conversation
Force-pushed from 0d83d6b to 6e061d4
Awesome contribution! Thanks a lot :) Do you think it's possible to introduce test coverage early on so we can track improvements?
Hi @marcoabreu, there are unit tests on the Java side to make sure the handmade operators and generated operators are tested. Unfortunately, we cannot have full unit test coverage for the generated operators, since there are 200+ of them and they are already tested on the Python end. I would be grateful if you could request more tests if you find some part of the code is not covered.
Sure, I fully understand and agree. I was thinking about targeted coverage measurement for the Java API: the backend calls would be mocked and we would validate that the passthrough is working properly. The actual logic is still tested under Scala. The idea is to detect if some change on the Scala side broke the compatibility or behaviour of the Java API. What do you think?
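For illustration, here is a minimal sketch of the passthrough-style test being proposed, assuming JUnit 4 and Mockito are on the classpath; `ScalaNDArray` and `JavaNDArray` below are hypothetical stand-ins, not the real MXNet classes:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class PassthroughTest {

  // Stand-in for the Scala backend class (hypothetical, for illustration only).
  interface ScalaNDArray {
    ScalaNDArray add(ScalaNDArray other);
  }

  // Stand-in for the Java wrapper: it holds the Scala object and only delegates.
  static class JavaNDArray {
    final ScalaNDArray nd;
    JavaNDArray(ScalaNDArray nd) { this.nd = nd; }
    JavaNDArray add(JavaNDArray other) { return new JavaNDArray(nd.add(other.nd)); }
  }

  @Test
  public void addDelegatesToScalaSide() {
    ScalaNDArray backend = mock(ScalaNDArray.class);
    ScalaNDArray other = mock(ScalaNDArray.class);
    when(backend.add(other)).thenReturn(backend);

    new JavaNDArray(backend).add(new JavaNDArray(other));

    // Only the delegation is asserted; numerical behaviour stays tested in Scala.
    verify(backend).add(other);
  }
}
```

The point of such a test is that it would catch a Scala-side signature or behaviour change breaking the Java wrapper, without duplicating the operator-level tests.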
Minor changes, but other than that LGTM.
@@ -24,4 +24,6 @@ object DType extends Enumeration {
   val UInt8 = org.apache.mxnet.DType.UInt8
   val Int32 = org.apache.mxnet.DType.Int32
   val Unknown = org.apache.mxnet.DType.Unknown
+
+  private[mxnet] def numOfBytes(dType: DType): Int = org.apache.mxnet.DType.numOfBytes(dType)
This looks like an unused private method. I think we can get rid of it.
scala-package/scalastyle-config.xml
@@ -64,7 +64,7 @@ You can also disable only one rule, by specifying its rule id, as specified in:
 <check level="error" class="org.scalastyle.file.FileLineLengthChecker" enabled="true">
   <parameters>
-    <parameter name="maxLineLength"><![CDATA[100]]></parameter>
+    <parameter name="maxLineLength"><![CDATA[132]]></parameter>
Undo this. I know you picked it up from my change but Naveen wanted us to have this discussion on the dev list before changing it.
I like the idea of getting a test coverage measurement working early on the Java API. I think this is something we can add to our plan. Our current testing strategy is that the Java unit tests will be aimed at testing the compatibility of the Java/Scala interaction, and all behavior testing will occur on the Scala side. With that said, the current Scala testing is lacking in some areas and needs to be improved.
Force-pushed from 82df9fd to 554b5e5
We should aim to have unit tests for all of the methods in the non-generated NDArray code. Given that it's just a wrapper, in early stages, and there are other things waiting on this, I won't block the PR for it, but we should do a follow-up with better test coverage. Other than that, LGTM.
I haven't looked into the details, but a high-level comment:
I feel it is better to follow Spark's approach: name it JavaNDArray or something and provide toNDArray(), so that we don't need to duplicate every class in Java. Each time an API needs an NDArray, Java users can use JavaNDArray to operate and then convert to Scala's NDArray. (A rough sketch of this wrapper pattern follows the list below.)
And I think Scala's Context, DType, Shape, and IO can be reused; we do not benefit much from inventing new classes. So far we duplicate too much:
- Context: the only thing I see is that devtype2str and devstr2type now become Java Maps, but who will actually use them? Even if they were used in some unusual case, users could easily invoke .asJava themselves.
- DType: nothing changes.
- Shape: Java users can already construct a Shape easily: Shape.create(1,2,3,4). If you worry about toVector returning a Scala Vector, we can add one more function, toList.
- IO: nothing new.
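For what it's worth, a rough sketch of the suggested wrapper pattern might look as follows. This is hypothetical, not code from this PR, and it assumes the Scala NDArray's symbolic `+` is reachable from Java under Scala's `$plus` encoding:

```java
// Hypothetical JavaNDArray following Spark's JavaRDD pattern: a thin shell
// around the Scala NDArray with an escape hatch back into the Scala API.
public class JavaNDArray {
  private final org.apache.mxnet.NDArray nd;

  public JavaNDArray(org.apache.mxnet.NDArray nd) { this.nd = nd; }

  // Java-friendly operations delegate to the Scala NDArray internally.
  // Scala compiles the symbolic method `+` to `$plus`, callable from Java.
  public JavaNDArray add(JavaNDArray other) {
    return new JavaNDArray(nd.$plus(other.nd));
  }

  // Whenever an API needs a Scala NDArray, Java users unwrap and hand it over.
  public org.apache.mxnet.NDArray toNDArray() { return nd; }
}
```

The design question in this thread is essentially whether such a shell is needed for every class, or only for the ones, like NDArray, whose Scala API is not Java-friendly.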
@@ -606,7 +606,7 @@ scalaclean:
 scalapkg:
 	(cd $(ROOTDIR)/scala-package; \
-		mvn package -P$(SCALA_PKG_PROFILE),$(SCALA_VERSION_PROFILE) -Dcxx="$(CXX)" \
+		mvn package -P$(SCALA_PKG_PROFILE),$(SCALA_VERSION_PROFILE),integrationtest -Dcxx="$(CXX)" \
Why run tests here?
No tests are actually triggered here. This is just to avoid the JUnit tests running during make scalapkg: the integrationtest profile flag disables the JUnit unit tests.
A few comments: what this is doing is essentially Spark's approach (see their JavaDoubleRDD). They have a JavaDoubleRDD class that they use to just call methods of their Scala RDD, handling the conversions internally. You can also see this in their Java examples; throughout all of the examples that I've looked at, the Java examples never use any part of the core library outside of the api.java package. If we don't provide a Java-friendly wrapper for all of the classes we want to expose to Java developers, it'll lead to a rather confusing experience. They'll be expected to learn what is and is not meant for them to use from Scala's package, while also having to know when to use the javaapi package instead. Take the IO wrapper for example: all the Java wrapper is exposing is a Java-friendly version of DataDesc. If we don't wrap that and instead expect them to use the Scala one, then they'll also get exposed to all the other classes in IO that aren't intended for them to use (aren't Java friendly). Any methods that we add there to make it Java-friendly will pollute the Scala API with stuff that Scala developers don't need.
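As a concrete illustration of the wrapper idea (not the code in this PR), a Java-friendly DataDesc shell might look roughly like this, assuming the Scala org.apache.mxnet.DataDesc constructor takes a name, shape, dtype, and layout:

```java
import scala.Enumeration;

// Hypothetical javaapi-style wrapper: Java users see only this class, while
// the rest of the Scala IO package stays out of their way.
public class DataDesc {
  private final org.apache.mxnet.DataDesc desc;

  public DataDesc(String name, org.apache.mxnet.Shape shape,
                  Enumeration.Value dtype, String layout) {
    // Delegate straight to the Scala class; no logic lives in the wrapper.
    this.desc = new org.apache.mxnet.DataDesc(name, shape, dtype, layout);
  }

  public String name() { return desc.name(); }

  // Escape hatch into the Scala API when a Scala method needs a DataDesc.
  public org.apache.mxnet.DataDesc toDataDesc() { return desc; }
}
```

Since the wrapper only forwards, maintenance amounts to keeping its signatures in sync with the Scala class.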
Thanks for providing your points of view @andrewfayres. IMO it is not always "bad" to expose Scala classes. Scala classes are not necessarily a "new language" that requires users to learn; e.g., for scala.Tuple, many Java third-party libraries (Guava, Apache Commons, javatuples) provide exactly the same thing. Spark only re-implements a very small set of classes, mostly RDDs, and not every RDD has a Java version. At least these four classes are already "Java-friendly"; duplicating them will only increase maintenance effort, and honestly I don't see why. It makes sense to me to have a Java version of NDArray (and maybe also Module, etc.), because they have too many functions that are not Java-friendly. But not every class is in such a situation.
Also, I can easily find a Java example using a Scala object in the link you mentioned: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java#L34-L37. And in their documentation, they at least recommend Java users to use … And here, https://spark.apache.org/examples.html, if you click the "Java" tab, there's a bunch of usage of Spark-Scala objects: LogisticRegression, Vector, etc. And let's check their Java test case: https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java. I don't think whether they come from …
I'm not taking the position that Scala objects should never be exposed to Java users. I have no objections to doing so when it's appropriate. My objection is about exposing half a package to them because it is already Java-friendly, and requiring them to know which parts they are expected to cherry-pick from.

I think everything you linked to is from outside the core module, which is what I was talking about. The reason I was making that distinction is that it seems to be where they have separate Java and Scala APIs. At a cursory glance, it looks like the entire mllib module is Java-friendly and doesn't have a separate API (like it had been originally designed to be that way). If we'd planned on Java support from the very beginning, this is probably how we'd have done things.

I have no objections to the idea that we could have them use scala.Tuple2. It's useful, straightforward, is a concept everyone knows, and has no analog in Java SE. I agree that DataDesc is Java-friendly (or at least good enough), but there are plenty of other classes (even in the same IO file as DataDesc) that aren't. I think it's an unreasonable ask to expect the Java developer to know that they're expected to only select certain things from that package while other things are off limits.

I'm not particularly concerned about the maintenance costs because they're just wrappers. The only time I foresee us doing maintenance is if we make a breaking change to the corresponding method of the Scala API, which should be very rare (and when it happens there's already so much to update that an extra wrapper should make little difference).
Then I do not agree with the opinion that, even though DataDesc is Java-friendly, we still duplicate it because other things are duplicated. User types …

I think you agree that, if one class is already designed to be Java-friendly, then we can just expose it; this is how you took spark-mllib as an example. But why tangle over the namespace? The main objective of providing a "Java API" is not to have a Java namespace; as a user, why do I care about which namespace I'm using? As a remedy for not making the APIs Java-friendly at the very beginning, Spark invented JavaRDD; we are just facing the same problem here.

Also, be careful about saying maintenance is easy. Count how many lines you wrote for only four tiny, simple classes. When an API changes, grows, or is removed, we'll suffer from maintaining two things. And remember that IO/Optimizer/etc. are very likely to be extended.
Force-pushed from 87d95a2 to cd93cc3
In offline discussions we've agreed that we'll collectively craft an email with both positions, send it out for feedback, and go with the consensus that emerges. In the meantime, development of the Java API will proceed on a new branch.
Here is the alternative one; closing this now! #12816
Description
This is a PR based on @andrewfayres' PR #12757. Please merge the previous one first and then this one.
In this PR, the Java NDArray is introduced, as well as a brand-new operator Builder class. Deprecated operators were removed from the Java API macros. Please see the unit tests for more information.
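As a hypothetical illustration of the Builder idea (the operator and parameter names below are invented, not the generated API): required arguments go in the constructor, optional ones become chainable setters, so Java callers avoid Scala's default and named parameters.

```java
// Hypothetical operator builder; "softmax", "axis", and "temperature" are
// made up here purely to show the pattern.
public final class SoftmaxBuilder {
  private final Object data;     // required input (stand-in type)
  private Integer axis;          // optional, null until set
  private Float temperature;     // optional, null until set

  public SoftmaxBuilder(Object data) { this.data = data; }

  public SoftmaxBuilder setAxis(int axis) {
    this.axis = axis;
    return this;                 // chainable, fluent-style
  }

  public SoftmaxBuilder setTemperature(float temperature) {
    this.temperature = temperature;
    return this;
  }

  public String invoke() {
    // A real builder would dispatch to the MXNet backend; this one just
    // reports what it collected.
    return "softmax(axis=" + axis + ", temperature=" + temperature + ")";
  }
}
```

A call site would then read `new SoftmaxBuilder(data).setAxis(-1).invoke()`, which stays readable even for operators with dozens of optional parameters.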
@andrewfayres @nswamy @yzhliu @gigasquid
Special thanks to @zachgk and @piyushghai for guidance and support on the new Java API!
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.