ORC-1023: Support writing bloom filters in ConvertTool (#933)
### What changes were proposed in this pull request?

This PR adds an option to the Java tool ConvertTool that specifies the columns for which it should generate bloom filters.
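
For context, the new option forwards its value to `OrcFile.WriterOptions.bloomFilterColumns`, the same call available to any caller of the ORC Java API. The following is a minimal sketch of that call outside ConvertTool, not part of this patch. The class name `BloomFilterWriteSketch`, the output file `example.orc`, and the two-column schema are hypothetical and chosen only for illustration. It assumes `orc-core` and its Hadoop dependencies are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

import java.nio.charset.StandardCharsets;

// Hypothetical example class; not part of the ORC code base.
public class BloomFilterWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative schema; any column named here can carry a bloom filter.
    TypeDescription schema =
        TypeDescription.fromString("struct<id:bigint,name:string>");

    // Same comma-separated format that the -b option accepts.
    OrcFile.WriterOptions writerOpts = OrcFile.writerOptions(conf)
        .setSchema(schema)
        .overwrite(true)
        .bloomFilterColumns("id,name");

    Writer writer = OrcFile.createWriter(new Path("example.orc"), writerOpts);
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector id = (LongColumnVector) batch.cols[0];
    BytesColumnVector name = (BytesColumnVector) batch.cols[1];

    // Write three small rows so the file has data to index.
    for (int r = 0; r < 3; r++) {
      int row = batch.size++;
      id.vector[row] = r;
      name.setVal(row, ("row-" + r).getBytes(StandardCharsets.UTF_8));
    }
    writer.addRowBatch(batch);
    writer.close();
  }
}
```

When no columns are passed to `bloomFilterColumns`, the writer creates no bloom filters, which is why the patch only calls it when `-b` is given.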

### Why are the changes needed?

While debugging an issue, I needed to generate an ORC file with bloom filters using the Java APIs. ConvertTool is easy to use, but it doesn't generate bloom filters, so an option for this would be helpful.

### How was this patch tested?

I didn't find any existing tests for ConvertTool, so I tested it manually and verified that the bloom filters are generated.

(cherry picked from commit 7c45137)
Signed-off-by: Dongjoon Hyun <[email protected]>
stiga-huang authored and dongjoon-hyun committed Dec 16, 2021
1 parent a0bb4ad commit 9bd4824
Showing 2 changed files with 20 additions and 5 deletions.
15 changes: 13 additions & 2 deletions java/tools/src/java/org/apache/orc/tools/convert/ConvertTool.java
@@ -58,6 +58,7 @@ public class ConvertTool {
   private final int csvHeaderLines;
   private final String csvNullString;
   private final String timestampFormat;
+  private final String bloomFilterColumns;
   private final Writer writer;
   private final VectorizedRowBatch batch;

@@ -194,11 +195,17 @@ public ConvertTool(Configuration conf,
     this.csvHeaderLines = getIntOption(opts, 'H', 0);
     this.csvNullString = opts.getOptionValue('n', "");
     this.timestampFormat = opts.getOptionValue("t", DEFAULT_TIMESTAMP_FORMAT);
+    this.bloomFilterColumns = opts.getOptionValue('b', null);
     String outFilename = opts.hasOption('o')
         ? opts.getOptionValue('o') : "output.orc";
     boolean overwrite = opts.hasOption('O');
-    writer = OrcFile.createWriter(new Path(outFilename),
-        OrcFile.writerOptions(conf).setSchema(schema).overwrite(overwrite));
+    OrcFile.WriterOptions writerOpts = OrcFile.writerOptions(conf)
+        .setSchema(schema)
+        .overwrite(overwrite);
+    if (this.bloomFilterColumns != null) {
+      writerOpts.bloomFilterColumns(this.bloomFilterColumns);
+    }
+    writer = OrcFile.createWriter(new Path(outFilename), writerOpts);
     batch = schema.createRowBatch();
   }

@@ -238,6 +245,10 @@ private static CommandLine parseOptions(String[] args) throws ParseException {
     options.addOption(
         Option.builder("s").longOpt("schema").hasArg()
             .desc("The schema to write in to the file").build());
+    options.addOption(
+        Option.builder("b").longOpt("bloomFilterColumns").hasArg()
+            .desc("Comma separated values of column names for which bloom filter is " +
+                "to be created").build());
     options.addOption(
         Option.builder("o").longOpt("output").desc("Output filename")
             .hasArg().build());
10 changes: 7 additions & 3 deletions site/_docs/java-tools.md
@@ -11,7 +11,7 @@ supports both the local file system and HDFS.

 The subcommands for the tools are:

-* convert (since ORC 1.4) - convert JSON files to ORC
+* convert (since ORC 1.4) - convert JSON/CSV files to ORC
 * count (since ORC 1.6) - recursively find *.orc and print the number of rows
 * data - print the data of an ORC file
 * json-schema (since ORC 1.4) - determine the schema of JSON documents
@@ -28,9 +28,13 @@ The command line looks like:

 ## Java Convert

-The convert command reads several JSON files and converts them into a
+The convert command reads several JSON/CSV files and converts them into a
 single ORC file.

+`-b,--bloomFilterColumns <columns>`
+: Comma separated values of column names for which bloom filter is to be created.
+By default, no bloom filters will be created.
+
 `-e,--escape <escape>`
 : Sets CSV escape character

@@ -311,4 +315,4 @@ cost of printing the data out.

 ## Java Version

-The version command prints the version of this ORC tool.
+The version command prints the version of this ORC tool.
