diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
new file mode 100644
index 00000000..27237318
--- /dev/null
+++ b/CONTRIBUTING.md
@@ -0,0 +1,165 @@
+# Contributing
+
+This project is the work of [many contributors](https://github.com/rcongiu/Hive-JSON-Serde/graphs/contributors).
+
+You're encouraged to submit [pull requests](https://github.com/rcongiu/Hive-JSON-Serde/pulls) and to [propose features and discuss issues](https://github.com/rcongiu/Hive-JSON-Serde/issues).
+
+In the examples below, substitute your GitHub username for `contributor` in URLs.
+
+## Fork the Project
+
+Fork the [project on Github](https://github.com/rcongiu/Hive-JSON-Serde) and check out your copy.
+
+```
+git clone https://github.com/contributor/Hive-JSON-Serde.git
+cd Hive-JSON-Serde
+git remote add upstream https://github.com/rcongiu/Hive-JSON-Serde.git
+```
+
+## Build
+
+Ensure that you can build the project and run tests.
+
+```
+git checkout develop
+mvn test
+```
+
+### Architecture
+
+JSON encoding and decoding uses a somewhat modified version of [Douglas Crockford's JSON library](https://github.com/douglascrockford/JSON-java), which is included in the distribution.
+
+The SerDe builds a series of wrappers around `JSONObject`. Since serialization and deserialization are executed for every record (and there may be billions of them), we want to minimize object creation. So instead of serializing/deserializing to an `ArrayList`, the `JSONObject` is kept and a cached `ObjectInspector` is built around it. When deserializing, Hive gets a `JSONObject` and a `JSONStructObjectInspector` to read from it. Hive has `Structs`, `Maps`, `Arrays` and primitives, while JSON has `Objects`, `Arrays` and primitives. Hive `Maps` and `Structs` are both implemented as JSON `Objects`, which are less restrictive than Hive maps: a JSON `Object` could be a mix of keys and values of different types, while Hive expects you to declare the type of the map (e.g. `map<string,string>`).
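The struct-versus-map distinction just described can be sketched in a few lines of Python (an illustration of the type rules only, not SerDe code; the sample record is invented):

```python
import json

record = json.loads('{"name": "Switzerland", "population": 8}')

# As a Hive struct<name:string,population:int>, each field may carry
# its own declared type, so a mixed-type JSON object is fine.
assert isinstance(record["name"], str)
assert isinstance(record["population"], int)

# As a Hive map<string,string>, every value must share one declared type,
# so this same object would NOT match that declaration.
is_valid_string_map = all(isinstance(v, str) for v in record.values())
print(is_valid_string_map)  # False
```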
+The user is responsible for having the JSON data structure match the Hive table declaration.
+
+See [www.congiu.com](http://www.congiu.com/?s=serde) for details.
+
+### Compiling for Specific Targets
+
+Use Maven to compile the SerDe. This project uses Maven profiles to support multiple versions of Hive/CDH.
+
+#### CDH4
+
+```
+mvn -Pcdh4 clean package
+```
+
+#### CDH5
+
+```
+mvn -Pcdh5 clean package
+```
+
+#### HDP 2.3
+
+```
+mvn -Phdp23 clean package
+```
+
+### Generate a JAR
+
+All output is generated into `json-serde/target/json-serde-VERSION-jar-with-dependencies.jar`.
+
+```
+$ mvn package
+```
+
+#### Specific Versions of Hive
+
+If you want to compile the SerDe against a different version of the Cloudera libraries, use `-D`.
+
+```
+$ mvn -Dcdh.version=0.9.0-cdh3u4c-SNAPSHOT package
+```
+
+For Hive 0.14.0 and 1.0.0 (CDH5 profile):
+
+```
+mvn -Pcdh5 -Dcdh5.hive.version=1.0.0 clean package
+```
+
+## Write Tests
+
+Try to write a test that reproduces the problem you're trying to fix or describes a feature that you want to build.
+
+We definitely appreciate pull requests that highlight or reproduce a problem, even without a fix.
+
+## Write Code
+
+Implement your feature or bug fix.
+
+## Write Documentation
+
+Document any external behavior in the [README](README.md).
+
+## Update Changelog
+
+Add a line to [CHANGELOG](CHANGELOG.md) under *Next* release.
+Make it look like every other line, including your name and a link to your GitHub account.
+
+## Commit Changes
+
+Make sure git knows your name and email address:
+
+```
+git config --global user.name "Your Name"
+git config --global user.email "contributor@example.com"
+```
+
+Writing good commit logs is important. A commit log should describe what changed and why.
+
+```
+git add ...
+git commit
+```
+
+## Push
+
+```
+git push origin my-feature-branch
+```
+
+## Make a Pull Request
+
+Go to https://github.com/contributor/Hive-JSON-Serde and select your feature branch.
+Click the 'Pull Request' button and fill out the form. Pull requests are usually reviewed within a few days. + +## Rebase + +If you've been working on a change for a while, rebase with upstream/master. + +``` +git fetch upstream +git rebase upstream/master +git push origin my-feature-branch -f +``` + +## Update CHANGELOG Again + +Update the [CHANGELOG](CHANGELOG.md) with the pull request number. A typical entry looks as follows. + +``` +* [#123](https://github.com/rcongiu/Hive-JSON-Serde/pull/123): Reticulated splines - [@contributor](https://github.com/contributor). +``` + +Amend your previous commit and force push the changes. + +``` +git commit --amend +git push origin my-feature-branch -f +``` + +## Check on Your Pull Request + +Go back to your pull request after a few minutes and see whether it passed muster with Travis-CI. Everything should look green, otherwise fix issues and amend your commit as described above. + +## Be Patient + +It's likely that your change will not be merged and that the nitpicky maintainers will ask you to do more, or fix seemingly benign problems. Hang on there! + +## Thank You + +Please do know that we really appreciate and value your time and work. We love you, really. + + diff --git a/README.md b/README.md index 1747ac1c..2ffb56db 100644 --- a/README.md +++ b/README.md @@ -3,128 +3,107 @@ JsonSerde - a read/write SerDe for JSON Data [![Build Status](https://travis-ci.org/rcongiu/Hive-JSON-Serde.svg?branch=master)](https://travis-ci.org/rcongiu/Hive-JSON-Serde) -Serialization/Deserialization module for Apache Hadoop Hive -JSON conversion UDF +This library enables Apache Hive to read and write in JSON format. It includes support for serialization and deserialization (SerDe) as well as JSON conversion UDF. -This module allows hive to read and write in JSON format (see http://json.org for more info). 
+### Features
 
-Features:
 * Read data stored in JSON format
-* Convert data to JSON format when INSERT INTO table
-* arrays and maps are supported
-* nested data structures are also supported.
-* modular to support multiple versions of CDH
-
-IMPORTANT!!! READ THIS BELOW!!
-Json records must be _one per line_, that is, the serde
-WILL NOT WORK with multiline Json. Why ? Because the way hadoop
-works with files, they have to be _splittable_, for instance,
-hadoop will split text files at end of line..but in order to split
-a text file with json at a certain point, we would have to parse
-everything up to that point. See below
-```
-// this will work
-{ "key" : 10 }
+* Convert data to JSON format during `INSERT INTO `
+* Support for JSON arrays and maps
+* Support for nested data structures
+* Support for Cloudera's Distribution Including Apache Hadoop (CDH)
+* Support for multiple versions of Hadoop
 
-// this will not work
-{
-  "key" : 10
-}
-```
+### Installation
+Download the latest binaries (`json-serde-X.Y.Z-jar-with-dependencies.jar` and `json-udf-X.Y.Z-jar-with-dependencies.jar`) from [congiu.net/hive-json-serde](http://www.congiu.net/hive-json-serde). Choose the correct version for CDH 4, CDH 5, or Hadoop 2.3. Place the JARs into `hive/lib` or use `ADD JAR` in Hive.
 
-BINARIES
-----------
-github used to allow uploading of binaries, but not anymore.
-Many people have been asking me for binaries in private by email
-so I decided to upload binaries here:
+### JSON Data Files
 
-http://www.congiu.net/hive-json-serde/
+Upload JSON files to HDFS with `hadoop fs -put` or `LOAD DATA LOCAL`. JSON records in data files must appear _one per line_, without a trailing CR/LF after the last record. This is because Hadoop partitions files as text using CR/LF as a separator to distribute work.
 
-so you don't need to compile your own. There are versions for
-CDH4, CDH5 and HDP 2.3.
+The following example will work.
+```json +{ "key" : 10 } +{ "key" : 20 } +``` -COMPILE ---------- - -Use maven to compile the serde. -The project uses maven profiles to support multiple -version of hive/CDH. -To build for CDH4: +The following example will not work. -``` -mvn -Pcdh4 clean package +```json +{ + "key" : 10 +} +{ + "key" : 20 +} ``` -To build for CDH5: -``` -mvn -Pcdh5 clean package -``` +### Loading a JSON File and Querying Data -To build for HDP 2.3: -``` -mvn -Phdp23 clean package -``` +Uses [json-serde/src/test/scripts/test-without-cr-lf.json](json-serde/src/test/scripts/test-without-cr-lf.json). -the serde will be in -``` -json-serde/target/json-serde-VERSION-jar-with-dependencies.jar ``` +~$ cat test.json +{"text":"foo","number":123} +{"text":"bar","number":345} -```bash -$ mvn package +~$ perl -pe 'chomp if eof' test.json > test-without-cr-lf.json -# If you want to compile the serde against a different -# version of the cloudera libs, use -D: -$ mvn -Dcdh.version=0.9.0-cdh3u4c-SNAPSHOT package -``` +~$ cat test-without-cr-lf.json +{"text":"foo","number":123} +{"text":"bar","number":345}~$ +$ hadoop fs -put -f test-without-cr-lf.json /user/data/test.json +$ hive -Hive 0.14.0 and 1.0.0 ------------ +hive> CREATE DATABASE test; -Compile with -``` -mvn -Pcdh5 -Dcdh5.hive.version=1.0.0 clean package +hive> CREATE EXTERNAL TABLE test ( text string ) + ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' + LOCATION '/user/data'; + +hive> SELECT * FROM test; +OK + +foo 123 +bar 345 ``` +### Querying Complex Fields -EXAMPLES ------------- +Uses [json-serde/src/test/scripts/data.txt](json-serde/src/test/scripts/data.txt). -Example scripts with simple sample data are in src/test/scripts. 
Here some excerpts:
 
+```
+hive> CREATE DATABASE test;
 
-### Query with complex fields like arrays
+hive> CREATE TABLE test (
+    one boolean,
+    three array<string>,
+    two double,
+    four string )
+  ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
+  STORED AS TEXTFILE;
 
-```sql
-CREATE TABLE json_test1 (
-  one boolean,
-  three array<string>,
-  two double,
-  four string )
-ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-STORED AS TEXTFILE;
+hive> LOAD DATA LOCAL INPATH 'data.txt' OVERWRITE INTO TABLE test;
 
-LOAD DATA LOCAL INPATH 'data.txt' OVERWRITE INTO TABLE json_test1 ;
-hive> select three[1] from json_test1;
+hive> select three[1] from test;
 gold
 yellow
 ```
 
-If you have complex json it can become tedious to create the table
-by hand. I recommend [hive-json-schema](https://github.com/quux00/hive-json-schema) to build your schema from the data.
+If you have complex JSON it can be tedious to create tables manually. Try [hive-json-schema](https://github.com/quux00/hive-json-schema) to build your schema from data.
+See [json-serde/src/test/scripts](json-serde/src/test/scripts) for more examples.
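The `perl -pe 'chomp if eof'` step shown in the loading example strips the newline after the last record. A rough Python equivalent, offered only as a sketch (the records here are hypothetical sample data):

```python
import json

# Build a data-file body with records one per line and *no* newline after
# the last record, matching the one-record-per-line requirement above.
records = [{"text": "foo", "number": 123}, {"text": "bar", "number": 345}]
body = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

print(body)
print(body.endswith("\n"))  # False -- safe to upload with hadoop fs -put
```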
-### Nested structures - -You can also define nested structures: +### Defining Nested Structures ```sql -add jar ../../../target/json-serde-1.0-SNAPSHOT-jar-with-dependencies.jar; +ADD JAR json-serde-1.3.7-SNAPSHOT-jar-with-dependencies.jar; CREATE TABLE json_nested_test ( country string, @@ -133,127 +112,125 @@ CREATE TABLE json_nested_test ( ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE; --- data : {"country":"Switzerland","languages":["German","French", --- "Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}} +-- data : {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}} -LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_nested_test ; +LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_nested_test; -select * from json_nested_test; -- result: Switzerland ["German","French","Italian"] {"catholic":[10,20],"protestant":[40,50]} -select languages[0] from json_nested_test; -- result: German -select religions['catholic'][0] from json_nested_test; -- result: 10 -``` +select * from json_nested_test; -### SUPPORT FOR ARRAYS -You could have JSON arrays, in that case the SerDe would still work, -and it will expect data in the JSON arrays ordered just like the hive -columns, like you'd see in the regular text/csv serdes. -For instance, if you do -```sql -CREATE TABLE people ( name string, age int) +-- result: Switzerland ["German","French","Italian"] {"catholic":[10,20],"protestant":[40,50]} + +select languages[0] from json_nested_test; +-- result: German + +select religions['catholic'][0] from json_nested_test; +-- result: 10 ``` -your data should look like -```javascript + +### Using Arrays + +Data in JSON arrays should be ordered identically to Hive columns, similarly to text/csv. + +For example, array data as follows. 
+ +```js ["John", 26 ] ["Mary", 23 ] ``` -Arrays can still be nested, so you could have + +Can be imported into the following table. + +```sql +CREATE TABLE people (name string, age int) +``` + +Arrays can also be nested. + ```sql CREATE TABLE complex_array ( - name string, address struct) ... + name string, address struct +) + -- data: ["John", { street:"10 green street", city:"Paris" } .. ] ``` +### Importing Malformed Data -### MALFORMED DATA +The SerDe will raise exceptions with malformed data. For example, the following malformed JSON will raise `org.apache.hadoop.hive.serde2.SerDeException`. -The default behavior on malformed data is throwing an exception. -For example, for malformed json like +```json {"country":"Italy","languages" "Italian","religions":{"catholic":"90"}} +``` -you get: -Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] +``` +Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: +Row is not a valid JSON Object - JSONException: Expected a ':' after a key at 32 [character 33 line 1] +``` + +This may not be desirable if you have a few bad lines you wish to ignore. Set `ignore.malformed.json` in that case. -this may not be desirable if you have a few bad lines you wish to ignore. If so you can do: ```sql ALTER TABLE json_table SET SERDEPROPERTIES ( "ignore.malformed.json" = "true"); ``` -it will not make the query fail, and the above record will be returned as -NULL null null +While this option will not make the query fail, a NULL record will be inserted instead. -#### Promoting a scalar to an array - -It is a common issue to have a field that sometimes is a scalar and sometimes is an array, for instance: ``` -{ field: "hello", .. } -{ field: [ "hello", "world" ], ... 
+NULL NULL NULL
+```
+
+### Promoting a Scalar to an Array
+
+It is a common issue to have a field that sometimes is a scalar and sometimes an array.
+
+```json
+{ "field" : "hello", .. }
+{ "field" : [ "hello", "world" ], ...
+```
 
-In this case , if you declare your table as `array<string>`, if the SerDe finds a scalar, it will return a one-element
-array, effectively promoting the scalar to an array. The scalar has to be of the correct type.
+Declare your table as `array<string>`; if the SerDe finds a scalar, it will return a one-element array of the right type, promoting the scalar to an array.
 
+### Support for UNIONTYPE
 
-A Uniontype is a field that can contain different types, like in C.
-Hive usually stores a 'tag' that is basically the index of the datatype,
-for instance, if you create a uniontype<int,string,float>, tag would be
-0 for int, 1 for string, 2 for float (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypes).
+A `Uniontype` is a field that can contain different types. Hive usually stores a 'tag' that is basically the index of the datatype. For example, if you create a `uniontype<int,string,float>`, the tag would be 0 for int, 1 for string, 2 for float as per the [UnionType documentation](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypes).
 
+JSON data does not store anything describing the type, so the SerDe will try to infer it. The order matters. For example, if you define
+a field `f` as `UNIONTYPE<int,string>` you will get different results.
 
-Now, JSON data does not store anything like that, so the serde will try and
-look what it can do.. that is, check, in order, if the data is compatible
-with any of the given types. So, THE ORDER MATTERS.
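The "first declared type that parses wins" behavior can be sketched in Python; this is an illustration of the order-matters rule, not the SerDe's actual implementation:

```python
def infer_union_tag(value, declared_types):
    """Return (tag, coerced_value) for the first declared type that fits,
    mimicking how a union's types are tried in declaration order."""
    for tag, typ in enumerate(declared_types):
        try:
            return tag, typ(value)
        except (TypeError, ValueError):
            continue
    raise ValueError("no declared type matched")

# uniontype<int,string>: order matters. Note str() accepts anything,
# so a string member effectively acts as a catch-all and should come last.
print(infer_union_tag("123", [int, str]))  # (0, 123) -- int wins first
print(infer_union_tag("asv", [int, str]))  # (1, 'asv')
```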
Let's say you define
-a field f as UNIONTYPE<int,string> and your js has
-```{json}
-{ "f": "123" } // parsed as int, since int precedes string in definitions,
-               // and "123" can be parsed to a number
-{ "f": "asv" } // parsed as string
+The following data will be parsed as `int`, since it precedes the `String` type in the definition and `123` is successfully parsed as a number.
+
+```json
+{ "f": "123" }
 ```
 
-That is, a number in a string. This will return a tag of 0 and an int rather
-than a string.
-It's worth noticing that complex Union types may not be that efficient, since
-the SerDe may try to parse the same data in several ways; however, several
-people asked me to implement this feature to cope with bad JSON, so..I did.
+The following data will be parsed as a `String`.
+```json
+{ "f": "asv" }
+```
+It's worth noting that complex `Union` types may not be very efficient, since the SerDe may try to parse the same data in multiple ways.
 
-### MAPPING HIVE KEYWORDS
+### Mapping Hive Keywords
 
-Sometimes it may happen that JSON data has attributes named like reserved words in hive.
-For instance, you may have a JSON attribute named 'timestamp', which is a reserved word
-in hive, and hive will fail when issuing a CREATE TABLE.
-This SerDe can map hive columns over attributes named differently, using SerDe properties.
+Sometimes JSON data has attributes named like reserved words in Hive. For instance, you may have a JSON attribute named 'timestamp', and Hive will fail when issuing a `CREATE TABLE`. This SerDe can map Hive columns over attributes with different names using properties.
 
-For instance:
+In the following example `mapping.ts` translates the `ts` column into the JSON attribute called `timestamp`.
```sql
CREATE TABLE mytable (
-  myfield string,
-  ts string ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
+  myfield string, ts string
+) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
 WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
 STORED AS TEXTFILE;
```
 
-Notice the "mapping.ts", that means: take the column 'ts' and read into it the
-JSON attribute named "timestamp"
+### Mapping Names with Periods
 
-#### Mapping names with dots
+Hive doesn't support column names containing periods. In theory they should work when quoted in backticks, but in practice they don't, as noted in [SO#35344480](http://stackoverflow.com/questions/35344480/hive-select-column-with-non-alphanumeric-characters/35349822). To work around this issue set the property `dots.in.keys` to `true` in the SerDe properties and access these fields by substituting the period with an underscore.
 
-as noted in issue #131, Hive doesn't like column names containing dots/periods.
-In theory they should work when quoted in backtics, but as noted in this [stack overflow discussion]
-( http://stackoverflow.com/questions/35344480/hive-select-column-with-non-alphanumeric-characters/35349822)
-it doesn't work in practice for some limitation of the hive parser.
-
-So, you can then set the property `dots.in.keys` to `true` in the Serde Properties and access
-those fields by substituting the dot with an underscore.
-
-For example, if your JSON looks like
-```
-{ "my.field" : "value" , "other" : { "with.dots" : "blah } }
-```
-you can create the table like
+For example, create the following table.
 
```sql
CREATE TABLE mytable (
@@ -263,8 +240,13 @@ CREATE TABLE mytable (
 WITH SERDEPROPERTIES ("dots.in.keys" = "true" )
```
 
-Note how the table was created using underscores instead of dots.
-Now you can query the fields without hive getting confused.
+Load the following JSON.
+
+```
+{ "my.field" : "value" , "other" : { "with.dots" : "blah" } }
+```
+
+Query data substituting periods with underscores.
```sql
SELECT my_field, other.with_dots from mytable
@@ -272,72 +254,37 @@ SELECT my_field, other.with_dots from mytable
 value, blah
```
 
-### ARCHITECTURE
-
-For the JSON encoding/decoding, I am using a modified version of Douglas Crockfords JSON library:
-https://github.com/douglascrockford/JSON-java
-which is included in the distribution. I had to make some minor changes to it, for this reason
-I included it in my distribution and moved it to another package (since it's included in hive!)
-
-The SerDe builds a series of wrappers around JSONObject. Since serialization and deserialization
-are executed for every (and possibly billions) record we want to minimize object creation, so
-instead of serializing/deserializing to an ArrayList, I kept the JSONObject and built a cached
-objectinspector around it. So when deserializing, hive gets a JSONObject, and a JSONStructObjectInspector
-to read from it. Hive has Structs, Maps, Arrays and primitives while JSON has Objects, Arrays and primitives.
-Hive Maps and Structs are both implemented as object, which are less restrictive than hive maps:
-a JSON Object could be a mix of keys and values of different types, while hive expects you to declare the
-type of map (example: map). The user is responsible for having the JSON data structure
-match hive table declaration.
-
-More detailed explanation on my blog:
-http://www.congiu.com/articles/json_serde
-
-
-### UDF
+### User Defined Functions (UDF)
 
-As a bonus, I added a UDF that can turn anything into a JSON string.
-So, if you want to convert anything (arrays, structs..) into
-a string containing their JSON representation, you can do that.
+#### tjson
 
-Example:
+The `tjson` UDF can turn arrays, structs, or strings into JSON.
``` -add jar json-udf-1.3.8-jar-with-dependencies.jar; +ADD JAR json-udf-X.Y.Z-jar-with-dependencies.jar; create temporary function tjson as 'org.openx.data.udf.JsonUDF'; hive> select tjson(named_struct("name",name)) from mytest1; OK {"name":"roberto"} - - ``` +### Timestamps -### Notes - -#### Timestamp support -note that timestamp support will use the systems default timezone -to convert timestamps. +Note that the system default timezone is used to convert timestamps. +### Contributing -### CONTRIBUTING - -I am using gitflow for the release cycle. +See [CONTRIBUTING](CONTRIBUTING.md) for how to build the project. ### History -This library is written by [Roberto Congiu](http://www.congiu.com) +This library is written by [Roberto Congiu](http://www.congiu.com) <rcongiu@yahoo.com> during his time at [OpenX Technologies, Inc.](https://www.openx.com). See [CHANGELOG](CHANGELOG.md) for details. -### THANKS +### Thanks Thanks to Douglas Crockford for the liberal license for his JSON library, and thanks to my employer OpenX and my boss Michael Lum for letting me open source the code. - - - - - - diff --git a/json-serde/src/test/scripts/test-without-cr-lf.json b/json-serde/src/test/scripts/test-without-cr-lf.json new file mode 100644 index 00000000..4fa75a0a --- /dev/null +++ b/json-serde/src/test/scripts/test-without-cr-lf.json @@ -0,0 +1,2 @@ +{"text":"foo\nbar","number":123} +{"text":"bar\nfoo","number":345} \ No newline at end of file diff --git a/json-serde/src/test/scripts/test.json b/json-serde/src/test/scripts/test.json new file mode 100644 index 00000000..6d37c1b5 --- /dev/null +++ b/json-serde/src/test/scripts/test.json @@ -0,0 +1,2 @@ +{"text":"foo\nbar","number":123} +{"text":"bar\nfoo","number":345}
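Regarding the Timestamps note above (the system default timezone is used to convert timestamps): the effect can be illustrated with a short Python sketch. This is illustrative only; it relies on POSIX `TZ` handling (`time.tzset`), so it assumes a Unix-like system, and the timestamp value is invented.

```python
import os
import time

def parse_local_epoch(s):
    # Interpret a timestamp string in the *process's* local timezone,
    # mirroring a conversion that depends on the system default zone.
    return int(time.mktime(time.strptime(s, "%Y-%m-%d %H:%M:%S")))

os.environ["TZ"] = "UTC"; time.tzset()
utc_epoch = parse_local_epoch("2015-06-01 12:00:00")

os.environ["TZ"] = "America/Los_Angeles"; time.tzset()
la_epoch = parse_local_epoch("2015-06-01 12:00:00")

# The same literal yields different instants: Los Angeles is UTC-7 (PDT) in June.
print(la_epoch - utc_epoch)  # 25200 seconds, i.e. 7 hours
```

This is why two clusters with different system timezones can load the same file and end up with different timestamp values.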