
# Knowledge Search

A graph-based knowledge search engine powered by Wikipedia.

[knowledgesearch.us](http://knowledgesearch.us)

## Connecting Articles in a Graph

The first link in the main body text identifies a hierarchical relationship between articles: banana links to fruit, piano to musical instruments, and so on.

The search engine constructs a directed graph connecting the 11 million English articles (6 million of which are redirects) using each article's first link. For those curious about the network's topology, see the research that inspired this project.

### Example: Piano

| Parent | Comparable | Children |
| --- | --- | --- |
| musical instruments | Music box, Violin family, Glass harmonica | Piano Music, Piano music, Grand Piano, Lily Maisky, William Merrigan Daly |

## Graph Implementation

1. Download the entire XML dump, available at https://dumps.wikimedia.org/enwiki/
2. Extract the first link in the main body text (`get_first_link.py`)
3. Store the graph in Neo4j
   - index articles by title and add page views as a property of each article
     - uses bulk import, which also includes page views as an attribute of each node
     - queries can filter results by page views to return the most relevant articles

Note: matching titles between the available hourly page view data and the displayed titles is imperfect (see `match_views.py`).
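One source of the mismatch is format differences: the pageview dumps use underscore-separated, percent-encoded titles, while the displayed titles use spaces. A minimal normalization sketch (the real `match_views.py` has to deal with further issues such as casing, redirects, and Unicode variants):

```python
from urllib.parse import unquote

def normalize_title(raw: str) -> str:
    """Map a pageview-dump title (underscores, percent-encoding)
    to a displayed article title."""
    return unquote(raw).replace("_", " ").strip()

print(normalize_title("Glass_harmonica"))   # Glass harmonica
print(normalize_title("Pulp_%28paper%29"))  # Pulp (paper)
```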

## Fuzzy Title Matching

In addition to the graph, the first 2000 characters of the main body text are indexed for fuzzy title searching.

- powered by Elasticsearch
- indexing is distributed using Scala
  - note: the PySpark Databricks XML package and EsSpark (the Spark-to-Elasticsearch connector) are not compatible
  - build the jar using Maven and run the Scala job (see `pom.xml` and `index_wiki.scala`)
- the Elasticsearch query weighs the title 2x more heavily relative to the introductory body text
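The 2x title weighting can be expressed with Elasticsearch's field-boost syntax (`title^2`). A sketch of such a query body; the field names `title` and `intro` are illustrative, not taken from the repo:

```python
def build_query(term: str) -> dict:
    """Build an Elasticsearch multi_match query that weighs the
    title field twice as heavily as the indexed introduction."""
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["title^2", "intro"],  # ^2 doubles the title's score
                "fuzziness": "AUTO",             # tolerate small typos
            }
        },
        "size": 1,  # return only the best-matching article
    }

q = build_query("paper")
```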

### Example: "paper" → "Pulp (paper)"

The lower-case query "paper" is matched to the correct Wikipedia article title.

## Setup

To install dependencies:

```
pip install -r requirements.txt
```

For distributed computation, the program also requires Spark, Java (version > 7), and Scala.

The program expects a `configs.py` file that sets environment variables for the database and Spark nodes:

```python
import os

os.environ["master_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["elasticsearch_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["neo4j_pass"] = "xxxx"
os.environ["neo4j_ip"] = "xx.xxx.xx"
```

## Application

Given a search term, the application:

- matches the search term to the closest title (using the Elasticsearch query above)
- fetches a subset of the network (parent, comparable, and child articles)
- filters the subgraph by page views to return only the most relevant subset
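The filtering step can be sketched as a sort on the stored page-view property. This assumes each fetched node carries a `views` value, as added during the bulk import; the function and key names are illustrative:

```python
def most_relevant(articles: list[dict], limit: int = 5) -> list[dict]:
    """Keep only the most-viewed articles from a fetched subgraph."""
    return sorted(articles, key=lambda a: a["views"], reverse=True)[:limit]

children = [
    {"title": "Piano music", "views": 4200},
    {"title": "Lily Maisky", "views": 310},
    {"title": "Grand Piano", "views": 980},
]
print([a["title"] for a in most_relevant(children, limit=2)])
# ['Piano music', 'Grand Piano']
```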

The front-end serves this result as a directed D3 graph.
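A sketch of shaping one subgraph into the nodes/links structure D3's force layout expects. Edge direction follows the first-link convention (the center article points to its parent, children point to the center); the exact payload the front-end uses may differ:

```python
import json

def to_d3(center: str, parent: str, children: list[str]) -> dict:
    """Convert a fetched subgraph into D3 force-layout nodes and links."""
    nodes = [{"id": center}, {"id": parent}] + [{"id": c} for c in children]
    links = [{"source": center, "target": parent}] + [
        {"source": c, "target": center} for c in children
    ]
    return {"nodes": nodes, "links": links}

payload = json.dumps(to_d3("Piano", "musical instruments", ["Piano music"]))
```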