
[SPARK-5280] RDF Loader added + documentation #4650

Closed
wants to merge 40 commits into from

Conversation

lukovnikov

I have been testing it with DBpedia dumps; it works well so far.
Any help with custom partitioning and optimization is welcome.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27643 has finished for PR 4650 at commit 80d9b72.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lukovnikov
Author

style errors fixed

@SparkQA

SparkQA commented Feb 18, 2015

Test build #27680 has finished for PR 4650 at commit 4014c7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Feb 18, 2015

Please add tests for your RDF loader; see my code for an example:
maropu@cc5ac0b

BTW, I think it would be better to separate the GraphLoader interface from its implementations, because we will likely add more types of GraphLoader for different formats in the future.

e.g.:

  • the interface, o.a.spark.graphx.GraphLoader:

        abstract class GraphLoader {
          def edgeListFile(...)
        }

  • the implementations:

    o.a.spark.graphx.impl.loader.LineLoader:

        class LineLoader extends GraphLoader {
          def edgeListFile() = { /* the current implementation of GraphLoader#edgeListFile */ }
        }

    o.a.spark.graphx.impl.loader.RDFLoader:

        class RDFLoader extends GraphLoader {
          def edgeListFile() = { /* your code */ }
        }

Thoughts?

@lukovnikov
Author

Will add tests soon.

I was also thinking about defining one interface for the different loaders (with a load() method instead of edgeListFile()), and maybe a facade combining all the loaders.
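
To illustrate the "one interface plus facade" idea, here is a minimal sketch in Java; all class and method names here are hypothetical (nothing below exists in Spark), and a real version would return a Graph rather than a String:

```java
import java.util.Map;

// Hypothetical sketch of "one interface for different loaders, plus a facade".
// The String return type stands in for a real Graph object.
interface GraphLoader {
    String load(String path);
}

class EdgeListLoader implements GraphLoader {
    public String load(String path) { return "edge-list graph from " + path; }
}

class RdfLoader implements GraphLoader {
    public String load(String path) { return "RDF graph from " + path; }
}

// Facade that dispatches to a concrete loader based on a format key.
class GraphLoaderFacade {
    private static final Map<String, GraphLoader> LOADERS = Map.of(
        "edgelist", new EdgeListLoader(),
        "rdf", new RdfLoader()
    );

    static String load(String format, String path) {
        GraphLoader loader = LOADERS.get(format);
        if (loader == null) {
            throw new IllegalArgumentException("unknown format: " + format);
        }
        return loader.load(path);
    }
}
```

Callers would then write GraphLoaderFacade.load("rdf", "dump.nt") without knowing which concrete loader handles the format.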

@lukovnikov
Author

Added a test + a test file (a small excerpt from DBpedia 3.9) + a small fix in RDFLoader.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27822 has finished for PR 4650 at commit 1bec795.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27817 has finished for PR 4650 at commit 04df47a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27823 has finished for PR 4650 at commit b658c55.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27825 has finished for PR 4650 at commit 3db73ab.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 21, 2015

Test build #27826 has finished for PR 4650 at commit 4daa6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lukovnikov
Author

@maropu Tests are added and the build passed. Is it ready for merging now?

@blankdots

I would really like to have this. Is it going to be merged?

@emir-munoz

+1, is this going to be merged?

@maropu
Member

maropu commented Jul 14, 2015

@emir-munoz @blankdots This PR is totally stale, so it would be better to refactor it if you're interested.
Also, ISTM this kind of loader extension should be registered as a Spark package.

@emir-munoz

Thanks @maropu, I will take a closer look at it.

@rvesse
Member

rvesse commented Jul 14, 2015

A Spark plugin seems like a much better approach. I've done some experimentation on a plugin for this, which seems much cleaner and more lightweight, though I have no idea whether it will ever be open sourced (it is work for my employer) or move beyond the experimentation stage.

I would strongly suggest leveraging existing libraries like Apache Jena to do a lot of the work for you, so you avoid implementing your own RDF parsers. Personally I am using the Apache Jena Elephas modules for this, since their Hadoop input formats can simply be used by Spark. (Disclaimer: I am a committer on Apache Jena and the main contributor to the Elephas modules.)

@blankdots

@maropu @emir-munoz I'm also considering taking a look at it.
@rvesse Good suggestion.

@AmplabJenkins

Can one of the admins verify this patch?

@set0gut1

set0gut1 commented Nov 6, 2015

    def gethash(in: String): Long = {
      var h = 1125899906842597L
      for (x <- in) {
        h = 31 * h + x
      }
      h
    }

This hash function, used to compute vertex IDs, seems weak.
I tried it on the subject URIs of DBpedia's labels_en.nt (a sample).
Out of 11,519,154 unique URIs, the hash values of 20,741 URIs collided (0.18%).

For example, the hash values of these URI are same (-3127496886112549146).

  • http://dbpedia.org/resource/Dms
  • http://dbpedia.org/resource/EOT
  • http://dbpedia.org/resource/F15
  • http://dbpedia.org/resource/EP5
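
To make the collision concrete, the hash can be ported to Java and checked directly (the class name and harness below are mine, not part of the PR). The four URIs share an identical prefix, and their three-character suffixes contribute identical values under the 31-based recurrence, so the final hashes coincide:

```java
public class HashCollisionDemo {
    // Port of the PR's gethash: a polynomial hash with multiplier 31
    // and a large odd seed, producing a 64-bit value.
    static long gethash(String in) {
        long h = 1125899906842597L;
        for (int i = 0; i < in.length(); i++) {
            h = 31 * h + in.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://dbpedia.org/resource/Dms",
            "http://dbpedia.org/resource/EOT",
            "http://dbpedia.org/resource/F15",
            "http://dbpedia.org/resource/EP5",
        };
        // All four distinct URIs hash to the same value, so the loader
        // would conflate them into a single vertex.
        for (String uri : uris) {
            System.out.println(uri + " -> " + gethash(uri));
        }
    }
}
```

With only 31 as the multiplier, the suffixes "Dms", "EOT", "F15", and "EP5" all contribute 31²·c₁ + 31·c₂ + c₃ = 68842, which is why a mixing hash such as MurmurHash3 would be a safer choice for vertex IDs.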

@MLnick
Contributor

MLnick commented Nov 12, 2015

@lukovnikov If there is still interest in this, the best approach would be to first release something on spark-packages.org as a set of utilities for creating Graphs. Using existing third-party Hadoop formats makes the most sense, as per @rvesse.

Could you close this PR please?

@andrewor14
Contributor

+1 to making this a Spark package. I would recommend that we close this PR since it's gone stale.

asfgit closed this in ce5fd40 on Dec 17, 2015