
Add delimiter option for reading CSV files for Feathr #241

Closed
xiaoyongzhu opened this issue May 10, 2022 · 8 comments · Fixed by #307 or #620
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

xiaoyongzhu (Member) commented May 10, 2022

Currently, the options for reading CSV files are very limited. It would be good to add a few more, for example a delimiter option.

CSV reading currently has no delimiter support beyond the default value (,); Feathr should support this use case and allow a bit of customization.

xiaoyongzhu added the good first issue (Good for newcomers) and help wanted (Extra attention is needed) labels on May 10, 2022
ahlag (Contributor) commented May 12, 2022

@xiaoyongzhu
Can I have a try? Could you tell me which part needs some upgrade?

xiaoyongzhu (Member, Author) commented

> @xiaoyongzhu
> Can I have a try? Could you tell me which part needs some upgrade?

Definitely! I'll provide more details on how I envision the implementation.

xiaoyongzhu (Member, Author) commented

What I'm imagining is:

  1. Use the job configurations (like here) to implement it, i.e. add a configuration called spark.feathr.inputFormat.csvOptions.sep which allows end users to pass the delimiter as an option.
  2. In the Scala code, if you search for ss.read.format("csv").option("header", "true"), there will be a bunch of places that you need to modify. Eventually they will all use the CSV reader options documented here: https://spark.apache.org/docs/3.2.0/sql-data-sources-csv.html.
  3. You can get the config in different places through something like sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",").
  4. Please also provide a test case.
  5. Please also update the job configuration docs (https://linkedin.github.io/feathr/how-to-guides/feathr-job-configuration.html) to make sure the options are clear to end users.

Let me know if you have any questions, and feel free to reach out to me (via Slack etc.).
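The steps above could be sketched roughly as follows. This is an illustrative sketch only, not Feathr's actual code: the names SEP_KEY and resolve_sep are hypothetical, and the commented-out reader snippet assumes a PySpark session for brevity even though the real change lands in Scala.

```python
# Illustrative sketch only -- SEP_KEY and resolve_sep are hypothetical
# names, not Feathr's actual implementation.
SEP_KEY = "spark.feathr.inputFormat.csvOptions.sep"

def resolve_sep(get_conf):
    """Resolve the CSV delimiter, falling back to the default ",".

    get_conf mirrors the Scala-side sqlContext.getConf(key, default).
    """
    return get_conf(SEP_KEY, ",")

# Inside a Spark job the resolved delimiter would then be wired into the
# CSV reader, roughly like this (not runnable without a SparkSession):
#   sep = resolve_sep(spark.conf.get)
#   df = (spark.read.format("csv")
#               .option("header", "true")
#               .option("sep", sep)
#               .load(path))
```

The key design point is the second argument to getConf: when the user does not set the option, the reader keeps the current behavior by defaulting to ",".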

xiaoyongzhu (Member, Author) commented

cc @jaymo001 @hangfei

ahlag (Contributor) commented May 13, 2022

@xiaoyongzhu
Thanks for sharing!
Do you have a deadline for this? I can start implementing this weekend.

xiaoyongzhu (Member, Author) commented

> @xiaoyongzhu
> Thanks for sharing!
> Do you have a deadline for this? I can start implementing this weekend.

Sounds good! Let me know if you have any questions!

ahlag (Contributor) commented May 13, 2022

@xiaoyongzhu
Just want to clarify my understanding. We want end users to pass spark.feathr.inputFormat.csvOptions.sep to specify the delimiter, as below (snippet I found in the test):
https://github.com/linkedin/feathr/blob/f2cea49b36d28dc60934304b584c09c09076e88d/feathr_project/test/test_input_output_sources.py#L89-L93

  1. Can you tell me how spark.feathr.inputFormat.csvOptions.sep is parsed as an argument in Feathr, i.e. where is the sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",") from step (3) set?
  2. Can you tell me how Scala code is unit tested in Feathr? (My guess: build/sbt test?)
  3. How is the website updated? Is it done from this repo?

xiaoyongzhu (Member, Author) commented

> @xiaoyongzhu Just want to clarify my understanding. We want end users to pass spark.feathr.inputFormat.csvOptions.sep to specify the delimiter, as below (snippet I found in the test):
> https://github.com/linkedin/feathr/blob/f2cea49b36d28dc60934304b584c09c09076e88d/feathr_project/test/test_input_output_sources.py#L89-L93
>
>   1. Can you tell me how spark.feathr.inputFormat.csvOptions.sep is parsed as an argument in Feathr, i.e. where is the sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",") from step (3) set?
>   2. Can you tell me how Scala code is unit tested in Feathr? (My guess: build/sbt test?)
>   3. How is the website updated? Is it done from this repo?

  1. spark.feathr.inputFormat.csvOptions.sep is passed as a Spark option (see how execution_configuratons is passed).
  2. Yes, using sbt test.
  3. Yes, the docs are updated directly from this repo, so you only need to update the markdown files.
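To illustrate the pass-through described in answer 1, here is a minimal sketch. It assumes, as the linked test suggests, that entries in execution_configuratons end up as plain Spark confs visible to sqlContext.getConf; the dict and the get_conf helper below are stand-ins, not Feathr's real API.

```python
# Minimal sketch of the conf pass-through: a plain dict stands in for the
# Spark conf that execution_configuratons would populate, and get_conf
# mimics the default-fallback behavior of sqlContext.getConf(key, default).

def get_conf(conf, key, default):
    # Returns the configured value, or the default when the key is absent.
    return conf.get(key, default)

# Client side: the user passes the delimiter via execution_configuratons,
# e.g. a tab character to read TSV files instead of comma-separated ones.
job_conf = {"spark.feathr.inputFormat.csvOptions.sep": "\t"}
```

With job_conf set as above, the job side resolves "\t"; with an empty conf it falls back to ",", preserving the current behavior for users who set nothing.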
