Add delimiter option for reading CSV files for Feathr #307

Merged
22 commits merged on Aug 16, 2022

Conversation

ahlag
Contributor

@ahlag ahlag commented May 30, 2022

Signed-off-by: Chang Yong Lik <[email protected]>

Description

  • Added delimiter options

Resolves #241

How was this patch tested?

Tested locally with sbt

sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestCsvDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.util.TestSourceUtils'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestBatchDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.hdfs.TestFileFormat'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestDataLoaderFactory'

Progress Tracker

  • Use the job configurations like here to implement this, i.e. add a setting called spark.feathr.inputFormat.csvOptions.sep which allows end users to pass the delimiter as an option
  • In the Scala code, if you search for ss.read.format("csv").option("header", "true"), there will be a bunch of places that you need to modify. Eventually they will use something like the CSV reader here (https://spark.apache.org/docs/3.2.0/sql-data-sources-csv.html); see the sketch after this list.
  • You can get the config in different places through something like this: sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",")
  • Test case
  • Also please help update the job configuration docs (https://linkedin.github.io/feathr/how-to-guides/feathr-job-configuration.html) to make sure the options are clear to end users.
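As a rough illustration only (not code from this PR), the reader change could look something like the sketch below; ss is assumed to be the job's SparkSession and path is a placeholder for the CSV input location:

    // Read the delimiter from the Feathr job configuration, defaulting to a comma
    val csvDelimiterOption = ss.sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",")
    // Pass it through to Spark's CSV reader via the "delimiter" option
    val df = ss.read
      .format("csv")
      .option("header", "true")
      .option("delimiter", csvDelimiterOption)
      .load(path)  // path: placeholder for the input location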

Does this PR introduce any user-facing changes?

Allows users to specify the delimiter for CSV input via a job configuration option.
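For example (hypothetical usage, not taken from the PR description), a user who wants tab-delimited input could set the option when building the SparkSession:

    import org.apache.spark.sql.SparkSession

    // Hypothetical: request tab-separated input via the Feathr CSV delimiter option
    val ss = SparkSession.builder()
      .appName("feathr-offline-job")
      .config("spark.feathr.inputFormat.csvOptions.sep", "\t")
      .getOrCreate()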

Signed-off-by: Chang Yong Lik <[email protected]>
@ahlag ahlag force-pushed the feature/delimiter branch 2 times, most recently from 2b1c0a6 to ddb5643 Compare May 30, 2022 15:24
@ahlag ahlag force-pushed the feature/delimiter branch 2 times, most recently from c764d1e to 36fe0b3 Compare June 11, 2022 15:56
@xiaoyongzhu xiaoyongzhu added the safe to test Tag to execute build pipeline for a PR from forked repo label Jun 15, 2022
@ahlag ahlag force-pushed the feature/delimiter branch from ca7e91f to 3f17359 Compare June 21, 2022 11:58
Signed-off-by: Chang Yong Lik <[email protected]>
(cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0)
@ahlag ahlag force-pushed the feature/delimiter branch from 3f17359 to b5c4e12 Compare June 23, 2022 06:55
@ahlag ahlag force-pushed the feature/delimiter branch from 86637e4 to 6229f33 Compare June 24, 2022 03:48
@ahlag ahlag changed the title [WIP] Add delimiter option for reading CSV files for Feathr Add delimiter option for reading CSV files for Feathr Jun 24, 2022
@xiaoyongzhu
Member

@ahlag thanks for the contribution! This PR looks good to me, but I'm not sure why the test fails. I spent a bit of time investigating and feel it might be caused by the newly added tests?

@ahlag
Contributor Author

ahlag commented Jul 9, 2022

@xiaoyongzhu
Did it fail in GitHub Actions or on the command line, e.g. sbt 'testOnly com.linkedin.feathr.offline.util.TestSourceUtils'? I tried rerunning the following commands locally but the unit tests were not failing.

sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestCsvDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.util.TestSourceUtils'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestBatchDataLoader'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.hdfs.TestFileFormat'
sbt 'testOnly com.linkedin.feathr.offline.source.dataloader.TestDataLoaderFactory'

@xiaoyongzhu
Member

Talked with @ahlag offline and asked him to run sbt test. I feel the issue is mostly because those failing tests were relying on loadDataFrame to read the CSV files.

@ahlag ahlag force-pushed the feature/delimiter branch from 129fafa to d2c93cc Compare July 10, 2022 10:45
Signed-off-by: changyonglik <[email protected]>
@ahlag
Contributor Author

ahlag commented Jul 10, 2022

@xiaoyongzhu
I think I found the problem. The delimiter was not passed successfully when I tried passing the option with sqlContext.
Is there a way to set the config with SparkSession in unit tests?

TestFileFormat.scala

    val sqlContext = ss.sqlContext
    sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")

FileFormat.scala

val csvDelimiterOption = ss.sparkContext.getConf.get("spark.feathr.inputFormat.csvOptions.sep", ",")
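One likely explanation (my reading of the snippets above, not confirmed in the thread): ss.sparkContext.getConf returns the static SparkConf the context was created with, so values set at runtime through sqlContext.setConf are not visible there. A minimal sketch of reading the runtime SQL configuration instead:

    // The runtime SQL conf does see values set via sqlContext.setConf (or ss.conf.set)
    val csvDelimiterOption = ss.sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",")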

@xiaoyongzhu
Member

@xiaoyongzhu
I think I found the problem. The delimiter was not passed successfully when I tried passing the option with sqlContext.
Is there a way to set the config with SparkSession in unit tests?

TestFileFormat.scala

    val sqlContext = ss.sqlContext
    sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")

FileFormat.scala

val csvDelimiterOption = ss.sparkContext.getConf.get("spark.feathr.inputFormat.csvOptions.sep", ",")

Hmm, it's a bit weird. Is it possible to force-set the delimiter?

@ahlag
Contributor Author

ahlag commented Jul 10, 2022

Looks like there can only be one SparkContext:

[info] TestFileFormat:
[info] - testLoadDataFrame
[info] - testLoadDataFrameWithCsvDelimiterOption *** FAILED ***
[info]   org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243). The currently running SparkContext was created at:

@ahlag
Contributor Author

ahlag commented Jul 10, 2022

I think I will try a new approach. Since an existing SparkContext cannot be edited and another one cannot be created, I will test it end-to-end by passing the config from the client.

@xiaoyongzhu
Member

I think I will try a new approach. Since an existing SparkContext cannot be edited and another one cannot be created, I will test it end-to-end by passing the config from the client.

I did some research and found this answer:
https://stackoverflow.com/a/44613011

sqlContext.setConf("spark.sql.shuffle.partitions", "10") will set the property for the whole application, before the logicalPlan is generated.

sqlContext.sql("set spark.sql.shuffle.partitions=15") will also set the property, but only for the particular query, and it is applied at the time of logicalPlan creation.

Choosing between them depends on what your requirement is.

Maybe you can try sqlContext.sql?
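For reference, the two approaches from that answer look like this in test code (sketch only; sqlContext is assumed to come from the test's SparkSession):

    val sqlContext = ss.sqlContext
    // Session-wide: applies to the whole application before the logicalPlan is generated
    sqlContext.setConf("spark.feathr.inputFormat.csvOptions.sep", "\t")
    // Per-query style suggested in the answer, using its original example property
    sqlContext.sql("SET spark.sql.shuffle.partitions=15")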

@ahlag
Contributor Author

ahlag commented Jul 11, 2022

Ok, I'll give this a shot

@ahlag
Contributor Author

ahlag commented Jul 27, 2022

@xiaoyongzhu
Ok! I have updated the release version.

xiaoyongzhu
xiaoyongzhu previously approved these changes Jul 27, 2022
Collaborator

@hangfei hangfei left a comment


Also please update the wiki to note that we support TSV.

Signed-off-by: changyonglik <[email protected]>
ahlag added 2 commits August 2, 2022 23:17
@ahlag ahlag force-pushed the feature/delimiter branch from 880ea90 to 3ec3fdf Compare August 2, 2022 15:06
Signed-off-by: changyonglik <[email protected]>
@ahlag ahlag force-pushed the feature/delimiter branch 2 times, most recently from 3842194 to 7711253 Compare August 4, 2022 13:00
Signed-off-by: changyonglik <[email protected]>
@ahlag ahlag force-pushed the feature/delimiter branch from 062d916 to 2d41211 Compare August 4, 2022 13:13
@ahlag ahlag force-pushed the feature/delimiter branch from f2ecf06 to 0b01cc6 Compare August 4, 2022 13:24
Signed-off-by: changyonglik <[email protected]>
@ahlag ahlag force-pushed the feature/delimiter branch from 78b3cbd to b4e55fb Compare August 4, 2022 13:36
@ahlag
Contributor Author

ahlag commented Aug 4, 2022

@xiaoyongzhu @hangfei
I have finished the changes. Could you review?

xiaoyongzhu
xiaoyongzhu previously approved these changes Aug 11, 2022
@blrchen
Collaborator

blrchen commented Aug 16, 2022

@ahlag would you mind merging the latest main and resolving the conflicts so that we can get this merged? Thanks for your time!

@ahlag
Contributor Author

ahlag commented Aug 16, 2022

@xiaoyongzhu @blrchen
Done! Could you merge it today? Otherwise I'm afraid it might run into new conflicts.

@xiaoyongzhu xiaoyongzhu merged commit 6a0aba3 into feathr-ai:main Aug 16, 2022
@ahlag ahlag deleted the feature/delimiter branch August 17, 2022 08:05
ahlag added a commit to ahlag/feathr that referenced this pull request Aug 26, 2022
* Added documentation

Signed-off-by: Chang Yong Lik <[email protected]>

* Added delimiter to CSVLoader

Signed-off-by: Chang Yong Lik <[email protected]>
(cherry picked from commit bc71fad93c08f6d06e40f7e289456c6a1b4d45e0)

* Added delimiter to BatchDataLoader, FileFormat and SourceUtils

Signed-off-by: Chang Yong Lik <[email protected]>

* Added test case for BatchDataLoader

Signed-off-by: Chang Yong Lik <[email protected]>

* Added test case for FileFormat

Signed-off-by: Chang Yong Lik <[email protected]>

* Added test case for BatchDataLoader

Signed-off-by: Chang Yong Lik <[email protected]>

* Added test case and fixed indent

Signed-off-by: Chang Yong Lik <[email protected]>

* Passing failure

Signed-off-by: changyonglik <[email protected]>

* Removed unused imports from BatchDataLoader

Signed-off-by: changyonglik <[email protected]>

* Fixed test failures

Signed-off-by: changyonglik <[email protected]>

* Added release version

Signed-off-by: changyonglik <[email protected]>

* Removed trailing space

Signed-off-by: changyonglik <[email protected]>

* Removed wildcard imports

Signed-off-by: changyonglik <[email protected]>

* Paraphrased comments and docstring

Signed-off-by: changyonglik <[email protected]>

* Added DelimiterUtils

Signed-off-by: changyonglik <[email protected]>

* Refactored utils

Signed-off-by: changyonglik <[email protected]>

* Updated wiki to support both tsv and csv

Signed-off-by: changyonglik <[email protected]>

* Fixed spelling error

Signed-off-by: changyonglik <[email protected]>

* trigger GitHub actions

Signed-off-by: Chang Yong Lik <[email protected]>
Signed-off-by: changyonglik <[email protected]>