-
Notifications
You must be signed in to change notification settings - Fork 12
An open source information retrieval system written in C++11 and Python. Aspires to be an alternative to Nutch / Lucene. It uses MongoDB as an storage engine.
License
larroy/mycelium
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The Mycelium Information retrieval system ========================================= Check the latest up-to-date user documentation at http://pedro.larroy.com/mycelium/sphinx/ For the impatient ----------------- $ ./bootstrap.sh $ ./build.py $ scons $ build/release/crawler $ echo 'http://google.com' | nc localhost 1024 Status of the modules / Features ================================ - Crawler: Feature complete. - Tokenizer/Stemmer: Work in progress. - Inverted index: TODO - Search frontend: TODO How to use it ------------- - Initialize the git submodules: $ git submodule init $ git submodule update - Build the 3rd-party libraries: $ ./build.py - Compile the sources with SCons: $ scons - Alternatively you might build with system curl: $ scons --system_curl - But as said previously, synchronous DNS resolving will harm the performance and block. So it's not recommended unless curl has been compiled with c-ares, as it will be done by build.py Running ------- - The environment variables that affect some configuration parameters are: * Specific for the crawler: MYCELIUM_CRAWLER_PORT: port to listen for urls MYCELIUM_CRAWLER_PARALLEL: number of parallel crawlers to run * General for all the tools that interact with the DB: MYCELIUM_DB_HOST: mongodb host for storing the documents, default is "localhost" MYCELIUM_DB_NS: database.collection, defaults to "mycelium.crawl" Dependencies ============ The software is build on debian / ubuntu systems, although it should be fairly easy to port to other platforms. Some (might be incomplete) list of libraries that are required: - z - boost_filesystem - boost_system - boost_regex - log4cxx - pthread - curl - event - ssl - libidn11-dev Other software: - scons - flex - autoconf (for building curl and c-ares) Troubleshoting ============== AttributeError: 'SConsEnvironment' object has no attribute 'CXXFile': File "/home/piotr/devel/mycelium/SConstruct", line 210: variant_dir='build/{0}'.format(build), duplicate=0) File "/usr/lib/scons/SCons/Script/SConscript.py", line 614: return method(*args, **kw) File "/usr/lib/scons/SCons/Script/SConscript.py", line 551: return _SConscript(self.fs, *files, **subst_kw) File "/usr/lib/scons/SCons/Script/SConscript.py", line 260: exec _file_ in call_stack[-1].globals File "/home/piotr/devel/mycelium/src/SConscript", line 11: env.CXXFile(target='Robots_flex.cc',source='robots.ll') To solve this problem make sure flex is installed.
About
An open source information retrieval system written in C++11 and Python. Aspires to be an alternative to Nutch / Lucene. It uses MongoDB as an storage engine.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published