
[SPARK-5280] RDF Loader added + documentation #4650

Closed
wants to merge 40 commits into from

Conversation

lukovnikov

I have been testing it with DBpedia dumps; it works well so far.
Any help with custom partitioning and optimization is welcome.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27643 has finished for PR 4650 at commit 80d9b72.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lukovnikov
Author

style errors fixed

@SparkQA

SparkQA commented Feb 18, 2015

Test build #27680 has finished for PR 4650 at commit 4014c7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Feb 18, 2015

Please add tests for your RDF loader; see my code for an example:
maropu@cc5ac0b

BTW, I think it would be better to separate the GraphLoader interface from its implementations, because we will likely add more types of GraphLoader for different formats in the future.

e.g.:

  • the interface, o.a.spark.graphx.GraphLoader:

        abstract class GraphLoader {
          def edgeListFile(...)
        }

  • the implementations:

    o.a.spark.graphx.impl.loader.LineLoader:

        class LineLoader extends GraphLoader {
          def edgeListFile() = { /* the current implementation of GraphLoader#edgeListFile */ }
        }

    o.a.spark.graphx.impl.loader.RDFLoader:

        class RDFLoader extends GraphLoader {
          def edgeListFile() = { /* your code */ }
        }

Thoughts?

@lukovnikov
Author

Will add tests soon.

I was also thinking about defining one interface for the different loaders (with a load() method instead of edgeListFile()), and maybe a facade combining all the loaders.
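
To illustrate the "one interface plus facade" idea, here is a minimal sketch in Java; all class and method names here are hypothetical (nothing below exists in Spark), and a real version would return a Graph rather than a String:

```java
import java.util.Map;

// Hypothetical sketch of "one interface for different loaders, plus a facade".
// The String return type stands in for a real Graph object.
interface GraphLoader {
    String load(String path);
}

class EdgeListLoader implements GraphLoader {
    public String load(String path) { return "edge-list graph from " + path; }
}

class RdfLoader implements GraphLoader {
    public String load(String path) { return "RDF graph from " + path; }
}

// Facade that dispatches to a concrete loader based on a format key.
class GraphLoaderFacade {
    private static final Map<String, GraphLoader> LOADERS = Map.of(
        "edgelist", new EdgeListLoader(),
        "rdf", new RdfLoader()
    );

    static String load(String format, String path) {
        GraphLoader loader = LOADERS.get(format);
        if (loader == null) {
            throw new IllegalArgumentException("unknown format: " + format);
        }
        return loader.load(path);
    }
}
```

Callers would then write GraphLoaderFacade.load("rdf", "dump.nt") without knowing which concrete loader handles the format.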

@lukovnikov
Author

Added a test + a test file (a small excerpt from DBpedia 3.9) + a small fix in RDFLoader.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27822 has finished for PR 4650 at commit 1bec795.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27817 has finished for PR 4650 at commit 04df47a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27823 has finished for PR 4650 at commit b658c55.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27825 has finished for PR 4650 at commit 3db73ab.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27826 has finished for PR 4650 at commit 4daa6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lukovnikov
Author

@maropu Tests are added and the build passed. Is it ready for merging now?

@blankdots

I would really like to have this. Is it going to be merged?

@emir-munoz

+1, is this going to be merged?

@maropu
Member

maropu commented Jul 14, 2015

@emir-munoz @blankdots This PR is totally stale, so it would be better to refactor it if you're interested.
Also, ISTM this kind of loader extension should be registered as a Spark package.

@emir-munoz

Thanks @maropu, I will take a closer look at it.

@rvesse
Member

rvesse commented Jul 14, 2015

A Spark plugin seems like a much better approach. I've done some experimentation on a plugin for this, which seems much cleaner and more lightweight, though I have no idea whether it will ever be open sourced (it is work for my employer) or move beyond the experimentation stage.

I would strongly suggest leveraging existing libraries like Apache Jena to do a lot of the work for you, so you avoid implementing your own RDF parsers. Personally I am using the Apache Jena Elephas modules for this, since their Hadoop input formats can simply be used by Spark. (Disclaimer: I am a committer on Apache Jena and the main contributor to the Elephas modules.)

@blankdots

@maropu @emir-munoz I'm also considering taking a look at it.
@rvesse Good suggestion.

@AmplabJenkins

Can one of the admins verify this patch?

@set0gut1

set0gut1 commented Nov 6, 2015

    def gethash(in: String): Long = {
      var h = 1125899906842597L
      for (x <- in) {
        h = 31 * h + x
      }
      h
    }

This hash function, used to compute vertex IDs, seems weak.
I tried it on the subject URIs of DBpedia's labels_en.nt (a sample).
Out of 11,519,154 unique URIs, the hash values of 20,741 URIs collided (0.18%).

For example, the hash values of these URI are same (-3127496886112549146).

  • http://dbpedia.org/resource/Dms
  • http://dbpedia.org/resource/EOT
  • http://dbpedia.org/resource/F15
  • http://dbpedia.org/resource/EP5
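
To make the collision concrete, the hash can be ported to Java and checked directly (the class name and harness below are mine, not part of the PR). The four URIs share an identical prefix, and their three-character suffixes contribute identical values under the 31-based recurrence, so the final hashes coincide:

```java
public class HashCollisionDemo {
    // Port of the PR's gethash: a polynomial hash with multiplier 31
    // and a large odd seed, producing a 64-bit value.
    static long gethash(String in) {
        long h = 1125899906842597L;
        for (int i = 0; i < in.length(); i++) {
            h = 31 * h + in.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://dbpedia.org/resource/Dms",
            "http://dbpedia.org/resource/EOT",
            "http://dbpedia.org/resource/F15",
            "http://dbpedia.org/resource/EP5",
        };
        // All four distinct URIs hash to the same value, so the loader
        // would conflate them into a single vertex.
        for (String uri : uris) {
            System.out.println(uri + " -> " + gethash(uri));
        }
    }
}
```

With only 31 as the multiplier, the suffixes "Dms", "EOT", "F15", and "EP5" all contribute 31²·c₁ + 31·c₂ + c₃ = 68842, which is why a mixing hash such as MurmurHash3 would be a safer choice for vertex IDs.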

@MLnick
Contributor

MLnick commented Nov 12, 2015

@lukovnikov If there is still interest in this, the best approach would be to first release something on spark-packages.org as a set of utilities for creating Graphs. Using existing third-party Hadoop formats makes the most sense, as per @rvesse.

Could you close this PR please?

@andrewor14
Contributor

+1 to making this a Spark package. I would recommend that we close this PR since it's gone stale.

asfgit closed this in ce5fd40 on Dec 17, 2015