
Add delimiter option for reading CSV files for Feathr #241

Closed
xiaoyongzhu opened this issue May 10, 2022 · 8 comments · Fixed by #307 or #620
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

xiaoyongzhu (Member) commented May 10, 2022

Currently, the options for reading CSV files are very limited. It would be good to add a few more, for example a delimiter option.

CSV reading currently has no delimiter support beyond the default value (,); Feathr should support this use case and allow a bit of customization.

xiaoyongzhu added the good first issue (Good for newcomers) and help wanted (Extra attention is needed) labels on May 10, 2022
ahlag (Contributor) commented May 12, 2022

@xiaoyongzhu
Can I have a try? Could you tell me which part needs some upgrade?

xiaoyongzhu (Member, Author) commented

> @xiaoyongzhu
> Can I have a try? Could you tell me which part needs some upgrade?

Definitely! I'll provide more details on how I envision the implementation.

xiaoyongzhu (Member, Author) commented

What I'm imagining is:

  1. Use the job configurations (like here) to implement it, i.e. add a configuration called spark.feathr.inputFormat.csvOptions.sep which allows end users to pass the delimiter as an option.
  2. In the Scala code, if you search for ss.read.format("csv").option("header", "true"), there will be a bunch of places that you need to modify. Eventually they will all use the CSV reader options documented here: https://spark.apache.org/docs/3.2.0/sql-data-sources-csv.html.
  3. You can get the config in different places through something like sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",").
  4. Please also provide a test case.
  5. Please also update the job configuration docs (https://linkedin.github.io/feathr/how-to-guides/feathr-job-configuration.html) to make sure the options are clear to end users.

Let me know if you have any questions, and feel free to reach out to me (via Slack etc.).
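The steps above could be sketched roughly as follows. This is an illustrative sketch only, not Feathr's actual code: the names SEP_KEY and resolve_sep are hypothetical, and the commented-out reader snippet assumes a PySpark session for brevity even though the real change lands in Scala.

```python
# Illustrative sketch only -- SEP_KEY and resolve_sep are hypothetical
# names, not Feathr's actual implementation.
SEP_KEY = "spark.feathr.inputFormat.csvOptions.sep"

def resolve_sep(get_conf):
    """Resolve the CSV delimiter, falling back to the default ",".

    get_conf mirrors the Scala-side sqlContext.getConf(key, default).
    """
    return get_conf(SEP_KEY, ",")

# Inside a Spark job the resolved delimiter would then be wired into the
# CSV reader, roughly like this (not runnable without a SparkSession):
#   sep = resolve_sep(spark.conf.get)
#   df = (spark.read.format("csv")
#               .option("header", "true")
#               .option("sep", sep)
#               .load(path))
```

The key design point is the second argument to getConf: when the user does not set the option, the reader keeps the current behavior by defaulting to ",".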

xiaoyongzhu (Member, Author) commented

cc @jaymo001 @hangfei

ahlag (Contributor) commented May 13, 2022

@xiaoyongzhu
Thanks for sharing!
Do you have a deadline for this? I can start implementing this weekend.

xiaoyongzhu (Member, Author) commented

> @xiaoyongzhu
> Thanks for sharing!
> Do you have a deadline for this? I can start implementing this weekend.

Sounds good! Let me know if you have any questions!

ahlag (Contributor) commented May 13, 2022

@xiaoyongzhu
Just want to clarify my understanding. We want end users to pass spark.feathr.inputFormat.csvOptions.sep to specify the delimiter, as below (snippet I found in the test):
https://github.com/linkedin/feathr/blob/f2cea49b36d28dc60934304b584c09c09076e88d/feathr_project/test/test_input_output_sources.py#L89-L93

  1. Can you tell me how spark.feathr.inputFormat.csvOptions.sep is parsed as an argument in Feathr, i.e. where is the sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",") from step (3) set?
  2. Can you tell me how Scala code is unit tested in Feathr? (My guess: build/sbt test?)
  3. How is the website updated? Is it done from this repo?

xiaoyongzhu (Member, Author) commented

> @xiaoyongzhu Just want to clarify my understanding. We want end users to pass spark.feathr.inputFormat.csvOptions.sep to specify the delimiter, as below (snippet I found in the test):
> https://github.com/linkedin/feathr/blob/f2cea49b36d28dc60934304b584c09c09076e88d/feathr_project/test/test_input_output_sources.py#L89-L93
>
>   1. Can you tell me how spark.feathr.inputFormat.csvOptions.sep is parsed as an argument in Feathr, i.e. where is the sqlContext.getConf("spark.feathr.inputFormat.csvOptions.sep", ",") from step (3) set?
>   2. Can you tell me how Scala code is unit tested in Feathr? (My guess: build/sbt test?)
>   3. How is the website updated? Is it done from this repo?

  1. spark.feathr.inputFormat.csvOptions.sep is passed as a Spark option (see how execution_configuratons is passed).
  2. Yes, using sbt test.
  3. Yes, the docs are updated directly from this repo, so you only need to update the markdown files.
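To illustrate the pass-through described in answer 1, here is a minimal sketch. It assumes, as the linked test suggests, that entries in execution_configuratons end up as plain Spark confs visible to sqlContext.getConf; the dict and the get_conf helper below are stand-ins, not Feathr's real API.

```python
# Minimal sketch of the conf pass-through: a plain dict stands in for the
# Spark conf that execution_configuratons would populate, and get_conf
# mimics the default-fallback behavior of sqlContext.getConf(key, default).

def get_conf(conf, key, default):
    # Returns the configured value, or the default when the key is absent.
    return conf.get(key, default)

# Client side: the user passes the delimiter via execution_configuratons,
# e.g. a tab character to read TSV files instead of comma-separated ones.
job_conf = {"spark.feathr.inputFormat.csvOptions.sep": "\t"}
```

With job_conf set as above, the job side resolves "\t"; with an empty conf it falls back to ",", preserving the current behavior for users who set nothing.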
