Updated flink 1.8 -> 1.17. Added smoke tests for Flink #9046

dotbg · 2023-04-18T14:07:12Z

Updated the Apache Flink - related code to the most recent version. It can be considered a first step towards #8650 as most recent Flink versions support scala 2.13. We can, certainly, follow Flink approach and deprecate scala-related flink logic, but I am not a person to make this decision.

trivialfis · 2023-04-18T16:21:24Z

Thank you for the work on flink! It will take a few days for a review since we haven't had any contributor to the flink interface for a long while.

dotbg · 2023-04-18T19:50:53Z

@trivialfis no worries. I've mostly converted existing code and partially translated some of scala code to java as flink moved significant part of their code base that way. I am happy to hear any suggestions/comments.

trivialfis · 2023-04-19T19:38:27Z

@CodingCat Hi, do you have some bandwidth for looking into this?

dotbg · 2023-04-20T10:34:26Z

rebased to include latest changes.

dotbg · 2023-04-20T16:37:15Z

I've noticed that flink does not publish flink-scala for 2.13. Perhaps, it is wise to keep just java API in place.

trivialfis · 2023-04-20T17:00:31Z

@dotbg Hi, are you familiar with flink (use flink in practice)? Or are you modifying the interface only for upgrading the Scala version? At the moment, we don't have any active contributors who work on flink. (nor any feature request for working with it)

If you are interested in the flink interface in general, we can spend some time bringing it up-to-date and I will do a crash course myself for catching up. Otherwise, my PR review will be based on the "minimum amount of changes that can get things going".

dotbg · 2023-04-20T20:48:41Z

@trivialfis my former team was using flink extensively, but they relied on jpmml. I have noticed that the flink xgboost interface was out of date and wanted to fix that.

My current job relies on spark and we want to upgrade our scala version as well. Hence I am in the "two birds with one stone" situation.

I am fine with the "minimum amount of changes that can get things going" and, perhaps, deprecating the flink integration. I do not mind supporting the interface for a while, but I can not devote myself for the flink support unless there is a community interest.

trivialfis · 2023-04-20T21:42:01Z

Actually, I'm quite excited that you work on the flink interface! We recently learned about the flink project and its strength in streaming data, then an expert come and bring it up to date! What are the odds?

but I can not devote myself for the flink support unless there is a community interest.

No worries, we can see how it goes. I will try to catch up on this area as well.

dotbg · 2023-04-20T22:03:08Z

@trivialfis Great to hear. Well, a while ago I have done some work on flink ml, but I can not call myself an expert. Xbgoost can consider collaborating with the flink-ml project though. But this is, perhaps, a future topic :)

trivialfis

Thank you for working on the flink interface. This is an initial review, I'm still going through the tutorials/learning at the moment.

I think we can start the flink interface from scratch if necessary considering its long inactive status. As a result, moving code from scala to java makes sense if it aligns with flink. These might be topics for the remote future, but having that in mind can reduce some constraints for us.

trivialfis · 2023-04-21T12:41:48Z

jvm-packages/xgboost4j-flink/src/main/scala/ml/dmlc/xgboost4j/scala/flink/XGBoostModel.scala


-class XGBoostModel (booster: Booster) extends Serializable {
+class XGBoostModel(wrapped: JXGBoostModel) extends Serializable {


I think the name booster is fine and seems clearer than the new name wrapped?

@trivialfis I am wrapping the Java XGBoostModel implementation here, hence, the naming. I can rename this to javaXGBoostModel or something alike. Do you have a better name in mind? Sadly, booster is not what is used.

In principle, I delegate all operations made by scala's XGBoostModel to the 'wrapped' java implementation hence the naming choice.

I have removed the class and rewrote the scala example so that the java implementation is used.

trivialfis · 2023-04-21T12:45:58Z

jvm-packages/xgboost4j-flink/pom.xml

            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
-            <artifactId>flink-ml_${scala.binary.version}</artifactId>
-            <version>${flink.version}</version>
+            <artifactId>flink-ml-uber</artifactId>


Out of curiosity, does uber have a different meaning in flink-ml than the company name?

flink-ml-uber is the artifact, which includes all flink-ml related classes. I cantry to cherry-pick particular dependencies, if necessary

I am using another dependency now.

trivialfis · 2023-04-21T12:48:40Z

jvm-packages/xgboost4j-flink/src/main/java/ml/dmlc/xgboost4j/java/flink/SVMUtil.java

@@ -0,0 +1,113 @@
+/*


For running tests and defining example, we can use https://github.com/dmlc/xgboost/blob/master/demo/data/veterans_lung_cancer.csv instead of libsvm files. We can copy it into resources during build.

I think that's probably better than us maintaining yet another svm parser.

Oh, I did not know that. thanks a lot! I will rewrite the tests then and this class will become obsolete.

Removed class, rewritten the tests.

I've added the train/test CSV files as flink does not have the train/test split functionality and implementing it should not be done in this repository.

trivialfis · 2023-04-21T12:50:49Z

jvm-packages/xgboost4j-flink/src/main/java/ml/dmlc/xgboost4j/java/flink/XGBoost.java

+   */
+  public static XGBoostModel train(DataSet<VectorWithNorm> dtrain,
+                                   Map<String, Object> params,
+                                   int round) throws Exception {


Let's use the name numBoostRound instead of round. It's fine to have a breaking change for this. We can safely ignore the existing implementation and start from scratch considering that there's no user for a long period of time.

thanks a lot!

wbo4958 · 2023-04-25T00:39:17Z

jvm-packages/xgboost4j-flink/src/main/java/ml/dmlc/xgboost4j/java/flink/XGBoost.java

+              null,
+              null,
+              numEarlyStoppingRounds);
+        Communicator.shutdown();


can we wrap the shutdown in the try/catch/finally?

@wbo4958 done, I made it look similar to spark implementation. I am keen to hear other suggestions.

wbo4958 · 2023-04-25T00:52:16Z

I am not familiar with Flink, but the code looks neat, and the training pipeline is quite similar to spark. LGTM.

dotbg · 2023-04-25T09:35:06Z

rebased on top of the latest master branch

trivialfis · 2023-04-25T10:10:48Z

demo/data/veterans_lung_cancer-test.csv

@@ -0,0 +1,42 @@
+Survival_label_lower_bound,Survival_label_upper_bound,Age_in_years,Karnofsky_score,Months_from_Diagnosis,Celltype=adeno,Celltype=large,Celltype=smallcell,Celltype=squamous,Prior_therapy=no,Prior_therapy=yes,Treatment=standard,Treatment=test


Not entirely sure if we need to make a copy of the data in the git repository in order to have train-test split. Do we have a slice operator with flink data structure?

no, we do not :( But I will work it around.

Maybe we just use the training dataset as the validation/test dataset for now, potentially with some added noise. We are kind of conservative about adding data files in git repository.

No worries, I worked it around. It should be much better now.

trivialfis

The PR looks good to me.

I need to make a follow-up PR that documents that the API for flink is not yet stabilized.

trivialfis · 2023-04-25T19:34:34Z

Please help resolve the conflict. Thank you again for the nice work on bringing up the flink interface!

dotbg · 2023-04-26T09:22:44Z

The PR looks good to me.

I need to make a follow-up PR that documents that the API for flink is not yet stabilized.

Thanks. a lot.

dotbg · 2023-04-26T10:18:43Z

@trivialfis I see that the XGBoost-R-Tests / Test R on OS windows-latest, R release is failing. I am not familiar with R and I do not use windows at the moment. Is this a known issue?

trivialfis · 2023-04-26T10:23:14Z

Please ignore it for the moment, we are aware of the issue #9091 .

dotbg force-pushed the flink-version-upgrade branch from e67a60b to 1fc59f0 Compare April 20, 2023 10:33

dotbg mentioned this pull request Apr 20, 2023

xgboost4j support of scala 2.13 or scala 3? #8650

Open

trivialfis reviewed Apr 21, 2023

View reviewed changes

dotbg force-pushed the flink-version-upgrade branch from 1fc59f0 to f508642 Compare April 24, 2023 22:31

dotbg requested a review from trivialfis April 24, 2023 22:34

wbo4958 reviewed Apr 25, 2023

View reviewed changes

dotbg force-pushed the flink-version-upgrade branch 2 times, most recently from 0bd2e70 to c632e8d Compare April 25, 2023 09:34

trivialfis reviewed Apr 25, 2023

View reviewed changes

dotbg force-pushed the flink-version-upgrade branch from c632e8d to 11277e8 Compare April 25, 2023 14:55

trivialfis approved these changes Apr 25, 2023

View reviewed changes

dotbg force-pushed the flink-version-upgrade branch from 11277e8 to f0894ce Compare April 26, 2023 09:18

Updated flink 1.8 -> 1.17. Added smoke tests for Flink

7035462

dotbg force-pushed the flink-version-upgrade branch from f0894ce to 7035462 Compare April 26, 2023 09:19

trivialfis merged commit 0e7377b into dmlc:master Apr 26, 2023

dotbg deleted the flink-version-upgrade branch April 26, 2023 11:19


		class XGBoostModel (booster: Booster) extends Serializable {
		class XGBoostModel(wrapped: JXGBoostModel) extends Serializable {

		@@ -0,0 +1,42 @@
		Survival_label_lower_bound,Survival_label_upper_bound,Age_in_years,Karnofsky_score,Months_from_Diagnosis,Celltype=adeno,Celltype=large,Celltype=smallcell,Celltype=squamous,Prior_therapy=no,Prior_therapy=yes,Treatment=standard,Treatment=test

Updated flink 1.8 -> 1.17. Added smoke tests for Flink #9046

Updated flink 1.8 -> 1.17. Added smoke tests for Flink #9046

Conversation

dotbg commented Apr 18, 2023

trivialfis commented Apr 18, 2023 • edited Loading

dotbg commented Apr 18, 2023

trivialfis commented Apr 19, 2023

dotbg commented Apr 20, 2023

dotbg commented Apr 20, 2023

trivialfis commented Apr 20, 2023 • edited Loading

dotbg commented Apr 20, 2023

trivialfis commented Apr 20, 2023

dotbg commented Apr 20, 2023

trivialfis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotbg Apr 21, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotbg Apr 25, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotbg Apr 25, 2023 • edited Loading

Choose a reason for hiding this comment

wbo4958 commented Apr 25, 2023

dotbg commented Apr 25, 2023

trivialfis Apr 25, 2023 • edited Loading

Choose a reason for hiding this comment

dotbg Apr 25, 2023 • edited Loading

Choose a reason for hiding this comment

trivialfis Apr 25, 2023 • edited Loading

Choose a reason for hiding this comment

dotbg Apr 25, 2023 • edited Loading

Choose a reason for hiding this comment

trivialfis left a comment

Choose a reason for hiding this comment

trivialfis commented Apr 25, 2023

dotbg commented Apr 26, 2023

dotbg commented Apr 26, 2023

trivialfis commented Apr 26, 2023

trivialfis commented Apr 18, 2023 •

edited

Loading

trivialfis commented Apr 20, 2023 •

edited

Loading

dotbg Apr 21, 2023 •

edited

Loading

dotbg Apr 25, 2023 •

edited

Loading

dotbg Apr 25, 2023 •

edited

Loading

trivialfis Apr 25, 2023 •

edited

Loading

dotbg Apr 25, 2023 •

edited

Loading

trivialfis Apr 25, 2023 •

edited

Loading

dotbg Apr 25, 2023 •

edited

Loading