Variable Length Gap Pattern Matching #324

Open
wants to merge 40 commits into
base: master
c7e7577
Refactoring WT code.
simongog Nov 4, 2014
6868241
Added method ranges_in_sorted_sequence.
simongog Nov 7, 2014
c11176f
make_pair -> {}
simongog Nov 7, 2014
8ec8eae
Iterator difference now declared const.
simongog Nov 13, 2014
4a44bea
ported gapped-matching benchmarks
olydis Jan 28, 2016
63a2f2d
submodule
olydis Jan 28, 2016
f77281b
updated gitignore
olydis Jan 28, 2016
1ca41da
scrambled together compatible files
olydis Jan 28, 2016
2a4cd27
ported pattern extractor
olydis Jan 28, 2016
9d946e8
convenience...
olydis Jan 28, 2016
7b5df6b
added multi-pattern support to Makefile
olydis Jan 28, 2016
4500e7c
more automation and parallel index creation
olydis Jan 28, 2016
6d5a107
redid fix of loading dynamically sized int_vectors
olydis Jan 28, 2016
c90f113
fixed regexp decomposition
olydis Jan 28, 2016
e67d1c5
new test cases and memory dump
olydis Jan 28, 2016
31854b4
added boost to the equation
olydis Jan 29, 2016
9cbb70c
added std to lots of places (g++4.9 didn't complain), set up test cas…
olydis Jan 29, 2016
5ac2e2d
cosmetic changes
olydis Feb 1, 2016
237e0af
experimental potential speedup
olydis Feb 2, 2016
f29f43d
smarter result concat
olydis Feb 2, 2016
6a75d3b
Merge branch 'sea' of https://github.com/olydis/vlg_matching into sea
olydis Feb 2, 2016
f0e1773
lazy result generation
olydis Feb 3, 2016
5e8fd23
lazified QGRAM index
olydis Feb 4, 2016
44dcfc1
full config
olydis Feb 5, 2016
2020f64
timeout
olydis Feb 12, 2016
a6b0c5f
Fix q-gram index bug.
mpetri Mar 21, 2016
cb409e3
more space efficient text loading for regexp_boost
olydis Mar 22, 2016
4f7c736
Merge pull request #1 from mpetri/master
olydis Mar 22, 2016
3fefa5f
PA preparation (vlg_index + example)
olydis Apr 18, 2016
2075f4e
better names and SDSLish layout
olydis Apr 18, 2016
93921f9
more comments, better names, etc
olydis Apr 19, 2016
4e3ad08
and, or
olydis Apr 19, 2016
2ed396e
polish
olydis Apr 19, 2016
55df26a
minor fix
olydis Apr 19, 2016
860f833
moved some things
olydis Apr 19, 2016
9b3e4b3
more static asserts and removed trailing whitespace
olydis Apr 19, 2016
c4d93fa
benchmark cleanup
olydis Apr 19, 2016
81fef65
merge
olydis Apr 21, 2016
6d29e23
manual merge towards simongog/master
olydis Apr 21, 2016
40ef282
fixed up ys_in_xrange example
olydis Apr 21, 2016
3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "benchmark/gapped-matching/external/easyloggingpp"]
path = benchmark/gapped-matching/external/easyloggingpp
url = https://github.com/easylogging/easyloggingpp.git
[submodule "external/googletest"]
path = external/googletest
url = https://chromium.googlesource.com/external/googletest
7 changes: 7 additions & 0 deletions benchmark/gapped-matching/.gitignore
@@ -0,0 +1,7 @@
results*
collections/*
build/*
*.sdsl
*.csv
*.swp
*~
81 changes: 81 additions & 0 deletions benchmark/gapped-matching/CMakeLists.txt
@@ -0,0 +1,81 @@
cmake_minimum_required(VERSION 2.8)
cmake_policy(SET CMP0015 NEW)
set(CMAKE_MODULE_PATH "${CMAKE_HOME_DIRECTORY}/../../CMakeModules")
include(AppendCompilerFlags)
include(ExternalProject)

project(gapped-matching CXX C)
set(CMAKE_BUILD_TYPE "Release")

# parse make helper (HACK :( ); relies on Make.helper listing LIB_DIR first and INC_DIR second
file(STRINGS ${CMAKE_HOME_DIRECTORY}/../../Make.helper makehelper_contents)
list (GET makehelper_contents 0 LIB_DIR_STR)
list (GET makehelper_contents 1 INC_DIR_STR)
STRING(REGEX REPLACE "INC_DIR = (.*)" "\\1" SDSL_INC_DIR "${INC_DIR_STR}")
STRING(REGEX REPLACE "LIB_DIR = (.*)" "\\1" SDSL_LIB_DIR "${LIB_DIR_STR}")
MESSAGE( STATUS "SDSL_INC_DIR: " ${SDSL_INC_DIR} )
MESSAGE( STATUS "SDSL_LIB_DIR: " ${SDSL_LIB_DIR} )


# set include and lib dirs
INCLUDE_DIRECTORIES(${SDSL_INC_DIR}
${CMAKE_HOME_DIRECTORY}/external/easyloggingpp/src/
${CMAKE_HOME_DIRECTORY}/include
)
LINK_DIRECTORIES(${SDSL_LIB_DIR})


# C++11 compiler Check
if(NOT CMAKE_CXX_COMPILER_VERSION) # work around for cmake versions smaller than 2.8.10
execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE CMAKE_CXX_COMPILER_VERSION)
endif()
if(CMAKE_CXX_COMPILER MATCHES ".*clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
set(CMAKE_COMPILER_IS_CLANGXX 1)
endif()
if( (CMAKE_COMPILER_IS_GNUCXX AND ${CMAKE_CXX_COMPILER_VERSION} VERSION_LESS 4.7) OR
(CMAKE_COMPILER_IS_CLANGXX AND ${CMAKE_CXX_COMPILER_VERSION} VERSION_LESS 3.2))
message(FATAL_ERROR "Your C++ compiler does not support C++11. Please install g++ 4.7 (or greater) or clang 3.2 (or greater)")
else()
message(STATUS "Compiler is recent enough to support C++11.")
endif()
if( MEASURE_ENERGY )
message(STATUS "Measure energy.")
append_cxx_compiler_flags("-DLIKWID_PERFMON -L$ENV{LIKWID_LIB} -I$ENV{LIKWID_INCLUDE} -llikwid" "GCC" CMAKE_CXX_FLAGS)
endif()
if( CMAKE_COMPILER_IS_GNUCXX )
append_cxx_compiler_flags("-std=c++11 -Wall -Wextra " "GCC" CMAKE_CXX_FLAGS)
append_cxx_compiler_flags("-msse4.2 -O3 -ffast-math -funroll-loops" "GCC" CMAKE_CXX_FLAGS_RELEASE)
else()
append_cxx_compiler_flags("-std=c++11" "CLANG" CMAKE_CXX_FLAGS)
append_cxx_compiler_flags("-stdlib=libc++" "CLANG" CMAKE_CXX_FLAGS)
append_cxx_compiler_flags("-msse4.2 -O3 -ffast-math -funroll-loops -DNDEBUG" "CLANG" CMAKE_CXX_FLAGS_RELEASE)
endif()

FIND_PACKAGE(Boost 1.45 COMPONENTS regex)

# # read the index configs
file(GLOB index_config_files RELATIVE ${CMAKE_HOME_DIRECTORY}/config/ "${CMAKE_HOME_DIRECTORY}/config/*.config")
foreach(f ${index_config_files})
file(STRINGS ${CMAKE_HOME_DIRECTORY}/config/${f} config_contents)
set(compile_defs "")
foreach(keyvalue ${config_contents})
string(REGEX REPLACE "^[ ]+" "" keyvalue ${keyvalue})
string(REGEX MATCH "^[^=]+" key ${keyvalue})
string(REPLACE "${key}=" "" value ${keyvalue})
set(${key} "${value}")
list(APPEND compile_defs ${key}=${value})
endforeach(keyvalue)
INCLUDE_DIRECTORIES(${Boost_INCLUDE_DIRS})
ADD_EXECUTABLE(gm_index-${NAME}.x src/gm_index.cpp)
TARGET_LINK_LIBRARIES(gm_index-${NAME}.x sdsl divsufsort divsufsort64 pthread ${Boost_LIBRARIES})
set_property(TARGET gm_index-${NAME}.x PROPERTY COMPILE_DEFINITIONS IDXNAME="${NAME}" ${compile_defs})

INCLUDE_DIRECTORIES(${Boost_INCLUDE_DIRS})
ADD_EXECUTABLE(gm_search-${NAME}.x src/gm_search.cpp)
TARGET_LINK_LIBRARIES(gm_search-${NAME}.x sdsl divsufsort divsufsort64 pthread ${Boost_LIBRARIES})
set_property(TARGET gm_search-${NAME}.x PROPERTY COMPILE_DEFINITIONS IDXNAME="${NAME}" ${compile_defs})
endforeach(f)

ADD_EXECUTABLE(create_collection.x src/create_collection.cpp)
TARGET_LINK_LIBRARIES(create_collection.x sdsl divsufsort divsufsort64 pthread)
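
The per-config loop in this CMakeLists.txt reads each `config/*.config` file as `KEY=VALUE` lines, strips leading spaces, and forwards every pair as a compile definition (it also sets a CMake variable per key, which is where `${NAME}` comes from). As a rough illustration only, the same split can be sketched in POSIX shell. The function name `parse_config` and the inline sample input are assumptions made for this sketch and are not part of the PR.

```shell
#!/bin/sh
# Sketch: turn KEY=VALUE config lines into -DKEY=VALUE flags.
# parse_config is a hypothetical helper, not something the PR defines.
parse_config() {
    while IFS= read -r line; do
        # Strip leading spaces, as the CMake loop does with "^[ ]+".
        line=${line#"${line%%[! ]*}"}
        [ -n "$line" ] || continue
        key=${line%%=*}     # everything before the first '='
        value=${line#*=}    # everything after the first '='
        printf -- '-D%s=%s\n' "$key" "$value"
    done
}

# Sample input modelled on config/SASEARCH.config (indentation added on purpose).
defs=$(printf 'NAME=SASEARCH\n  INDEX_TYPE=index_sasearch\nREGEXP_TYPE=std::regex::ECMAScript\n' | parse_config)
printf '%s\n' "$defs"
```

Splitting on the first `=` only keeps values such as `std::regex::ECMAScript` or `index_qgram_regexp<3>` intact.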

130 changes: 130 additions & 0 deletions benchmark/gapped-matching/Makefile
@@ -0,0 +1,130 @@
include ../Make.helper
include ../Make.download

TIMEOUT = 2050s
MEASURE_ENERGY = 0
MEASURE_CMD = $(if $(findstring 1,$(MEASURE_ENERGY)), likwid-perfctr -C S0:0 -g ENERGY -m,)

CXX_FLAGS = $(MY_CXX_FLAGS) # in compile_options.config
LIBS = -lsdsl
SRC_DIR = src
TMP_DIR = ../tmp

.SECONDARY:

NUM_SAMPLES:=20
NUM_SPS:=$(call config_ids,num_subpatterns.config)
TC_IDS:=$(call config_ids,test_case.config)
TC_COLLECTIONS:=$(foreach TC_ID,$(TC_IDS),collections/$(TC_ID)/text.TEXT)
UNIQUE_DATA_FILES:=$(shell echo $(call config_column,test_case.config,2) | tr " " "\n" | uniq | tr "\n" " ")

PATTERN_IDS:=$(call config_ids,patterns.config)
UNIQUE_SP_LENGTHS:=$(shell echo $(call config_column,patterns.config,2) | tr " " "\n" | uniq | tr "\n" " ")

ALGOS:=$(call config_ids,algorithms.config)

GM_WORDS = $(foreach TC_ID,$(TC_IDS),$(foreach SP_LENGTH,$(UNIQUE_SP_LENGTHS),collections/$(TC_ID)/patterns/words.$(TC_ID).$(SP_LENGTH).words.txt))
GM_PATTERNS = $(foreach TC_ID,$(TC_IDS),$(foreach PATTERN_ID,$(PATTERN_IDS),$(foreach NUM_SP,$(NUM_SPS),collections/$(TC_ID)/patterns/regex.$(TC_ID).$(PATTERN_ID).$(NUM_SP).regex.txt)))
GM_EXECS = $(foreach ALGO,$(ALGOS),build/gm_index-$(ALGO).x) \
$(foreach ALGO,$(ALGOS),build/gm_search-$(ALGO).x) \
build/create_collection.x
INDEX_SENTINELS = $(foreach TC_ID,$(TC_IDS),$(foreach ALGO,$(ALGOS),collections/$(TC_ID).$(ALGO).sentinel))

RES_FILES = $(foreach TC_ID,$(TC_IDS),$(foreach PATTERN_ID,$(PATTERN_IDS),$(foreach NUM_SP,$(NUM_SPS),$(foreach ALGO,$(ALGOS),results/$(ALGO).$(PATTERN_ID).$(NUM_SP).$(TC_ID)))))
RES_FILE=results/all.txt


debug:
@echo $(NUM_SPS)

inputs: $(GM_PATTERNS)
indices: $(INDEX_SENTINELS)

# Format: collections/[TC_ID].[ALGO].sentinel
collections/%.sentinel: $(GM_EXECS) $(TC_COLLECTIONS)
$(eval TC_ID:=$(call dim,1,$*))
$(eval ALGO:=$(call dim,2,$*))
test -e $@ || build/gm_index-$(ALGO).x -c collections/$(TC_ID)
touch $@

execs: $(GM_EXECS)

collections/%/text.TEXT: build/create_collection.x $(UNIQUE_DATA_FILES)
$(eval TC_ID:=$(call dim,1,$*))
$(eval SOURCE_FILE:=$(call config_select,test_case.config,$(TC_ID),2))
$(eval NUM_BYTES:=$(call config_select,test_case.config,$(TC_ID),5))
$(eval PREFIX_BYTES:=$(call config_select,test_case.config,$(TC_ID),6))
$(eval TEMP_DATA_FILE:=$(TC_ID).$(NUM_BYTES).$(PREFIX_BYTES).tmp)
@test -e $@ || echo "Creating collection $(TC_ID) from file $(SOURCE_FILE)"
$(if $(findstring -,$(PREFIX_BYTES)), , @test -e $@ || head -c $(PREFIX_BYTES) $(SOURCE_FILE) > $(TEMP_DATA_FILE))
$(if $(findstring -,$(PREFIX_BYTES)), , $(eval SOURCE_FILE:=$(TEMP_DATA_FILE)))
@test -e $@ || build/create_collection.x -i $(SOURCE_FILE) -n $(NUM_BYTES) -c collections/$(TC_ID)
$(if $(findstring -,$(PREFIX_BYTES)), , @rm -f $(TEMP_DATA_FILE))

timing: $(RES_FILES)
tail -n +1 $(RES_FILES) > $(RES_FILE)
@cd visualize; make

# Format: results/[ALGO].[PATTERN_ID].[NUM_SP].[TC_ID]
results/%: $(GM_EXECS) $(TC_COLLECTIONS) $(GM_PATTERNS) $(INDEX_SENTINELS)
@mkdir -p results
$(eval ALGO:=$(call dim,1,$*))
$(eval PATTERN_ID:=$(call dim,2,$*))
$(eval NUM_SP:=$(call dim,3,$*))
$(eval TC_ID:=$(call dim,4,$*))
$(eval IS_TEXT:=$(call config_select,test_case.config,$(TC_ID),7))
@test -e $@ || ( echo "# COLL_ID = $(TC_ID)" > $@ &&\
echo "# PATT_SAMPLE = $(PATTERN_ID)" >> $@ &&\
echo "# ALGO = $(ALGO)" >> $@ &&\
echo "gm_search-$(ALGO).x -c collections/$(TC_ID) -p collections/$(TC_ID)/patterns/regex.$(TC_ID).$(PATTERN_ID).$(NUM_SP).regex.txt -t $(IS_TEXT)" &&\
(timeout $(TIMEOUT) $(MEASURE_CMD) build/gm_search-$(ALGO).x -c collections/$(TC_ID) -p collections/$(TC_ID)/patterns/regex.$(TC_ID).$(PATTERN_ID).$(NUM_SP).regex.txt -t $(IS_TEXT) >> $@ || echo TIMEOUT) &&\
tail -n 1 mem-mon-out.csv >> $@ &&\
echo "" >> $@ &&\
mv mem-mon-out.csv [email protected])
#Rscript memvis.R [email protected] ;

../../examples/generate-pattern.x:
@cd ../../examples; make generate-pattern.x

# Format: collections/[TC_ID]/patterns/words.[TC_ID].[SP_LENGTH].words.txt
collections/%.words.txt: $(TC_COLLECTIONS) ../../examples/generate-pattern.x
$(eval TC_ID:=$(call dim,2,$*))
$(eval SP_LENGTH:=$(call dim,3,$*))
$(eval IS_TEXT:=$(call config_select,test_case.config,$(TC_ID),7))
@echo "Extracting words of length $(SP_LENGTH) from $(TC_ID)..."
@../../examples/generate-pattern.x collections/$(TC_ID)/text.TEXT 200 $(SP_LENGTH) $(IS_TEXT) > $@

# Format: collections/[TC_ID]/patterns/regex.[TC_ID].[PATTERN_ID].[NUM_SP].regex.txt
collections/%.regex.txt: $(GM_WORDS)
$(eval TC_ID:=$(call dim,2,$*))
$(eval PATTERN_ID:=$(call dim,3,$*))
$(eval NUM_SP:=$(call dim,4,$*))
$(eval SP_LENGTH:=$(call config_select,patterns.config,$(PATTERN_ID),2))
$(eval GAP:=$(call config_select,patterns.config,$(PATTERN_ID),3))
@echo "$(PATTERN_ID)"
@echo "Creating regex patterns with $(NUM_SP) words of length $(SP_LENGTH) from $(TC_ID) and a gap of $(GAP)..."
@cat collections/$(TC_ID)/patterns/words.$(TC_ID).$(SP_LENGTH).words.txt | sed -e 's/ /___/g' | xargs -n$(NUM_SP) | sed -e 's/ /.{$(GAP)}/g' | head -n $(NUM_SAMPLES) | sed -e 's/___/ /g' > $@

build/%:
@mkdir -p build
cd build && cmake -DMEASURE_ENERGY=$(MEASURE_ENERGY) .. && make -j 2

clean-build:
@echo "Remove executables"
rm -rf build/*

clean:
rm -rf build/*
rm -f $(GM_PATTERNS)
rm -f $(GM_WORDS)
rm -f $(INDEX_SENTINELS)
rm -f $(RES_FILES)

clean-results:
rm -f $(RES_FILES)

clean-inputs:
rm -f $(GM_PATTERNS)
rm -f $(GM_WORDS)

#cleanall: clean clean-results clean-build
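
The `collections/%.regex.txt` rule in this Makefile builds each gapped pattern by grouping `NUM_SP` words per line and joining them with `.{GAP}`. Spaces inside a word are protected with a `___` placeholder while `xargs` does the grouping, and `head` keeps only `NUM_SAMPLES` patterns. A minimal standalone sketch of that pipeline follows. The sample words and the gap value `1,5` are assumptions chosen for illustration; in the benchmark they come from `patterns.config` and the extracted word files.

```shell
#!/bin/sh
# Sketch of the pattern-building pipeline with hypothetical inputs.
num_sp=2          # words per pattern (NUM_SP in the Makefile)
gap='1,5'         # assumed gap bounds (GAP in the Makefile)
num_samples=2     # patterns to keep (NUM_SAMPLES in the Makefile)

make_patterns() {
    sed -e 's/ /___/g' |            # protect spaces inside a word
    xargs -n"$num_sp" |             # group num_sp words per line
    sed -e "s/ /.{$gap}/g" |        # join the words with the gap expression
    head -n "$num_samples" |        # keep the first num_samples patterns
    sed -e 's/___/ /g'              # restore the protected spaces
}

# One word per line; the second word contains a space.
patterns=$(printf 'abc\nde f\nghi\njkl\nmno\npqr\n' | make_patterns)
printf '%s\n' "$patterns"
```

With these inputs the sketch prints two patterns, `abc.{1,5}de f` and `ghi.{1,5}jkl`. Note that `xargs` also interprets quotes and backslashes in its input, so words containing those characters would need extra care.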
17 changes: 17 additions & 0 deletions benchmark/gapped-matching/README.md
@@ -0,0 +1,17 @@
INSTALL

1. git submodule update --init
2. mkdir build
3. cd build
4. cmake .. && make

CREATE COLLECTIONS

1. cd build
2. ./create_collection.x -c ../collections/NEWNAME -i NEWNAME.raw

EXECUTE BENCHMARK

1. cd build
2. ./gm_index-YOUR_IDX.x -c ../collections/your_collection
3. ./gm_search-YOUR_IDX.x -c ../collections/your_collection -p ../collections/your_collection/patterns/your_pattern.txt
21 changes: 21 additions & 0 deletions benchmark/gapped-matching/algorithms.config
@@ -0,0 +1,21 @@
# Configuration for test files
# (1) Identifier of configuration to use
# (2) LaTeX name
# (3) point type
# (4) line type
# (5) color

#REGEXP_ECMA;regexp-ecma;1;solid;black
#REGEXP_ECMAOPT;regexp-ecma-opt;1;solid;blue
QGRAM_REGEXP_ECMA;qgram-regexp-ecma;1;solid;blue
REGEXP_ECMA_BOOST;regexp-ecma-boost;1;solid;red
#WCSEARCH_DFS;wildcard-search-dfs;1;solid;green
WCSEARCH_DFS3;wildcard-search-dfs3;1;solid;green
SASEARCH;baseline-sa-search;2;solid;yellow
#STREE;bs-tree;3;solid;blue


#WCSEARCH_FULL_DFS;wildcard-search-dfs-full;2;solid;green
#WCSEARCH_BFS;wildcard-search-bfs;3;solid;green
#DBSEARCH;double-binary-search;2;solid;yellow

3 changes: 3 additions & 0 deletions benchmark/gapped-matching/config/QGRAM_REGEXP_ECMA.config
@@ -0,0 +1,3 @@
NAME=QGRAM_REGEXP_ECMA
INDEX_TYPE=index_qgram_regexp<3>
REGEXP_TYPE=std::regex::ECMAScript
3 changes: 3 additions & 0 deletions benchmark/gapped-matching/config/REGEXP_ECMA_BOOST.config
@@ -0,0 +1,3 @@
NAME=REGEXP_ECMA_BOOST
INDEX_TYPE=index_regexp_boost
REGEXP_TYPE=std::regex::ECMAScript
3 changes: 3 additions & 0 deletions benchmark/gapped-matching/config/SASEARCH.config
@@ -0,0 +1,3 @@
NAME=SASEARCH
INDEX_TYPE=index_sasearch
REGEXP_TYPE=std::regex::ECMAScript
3 changes: 3 additions & 0 deletions benchmark/gapped-matching/config/WCSEARCH_DFS3.config
@@ -0,0 +1,3 @@
NAME=WCSEARCH_DFS3
INDEX_TYPE=index_wcsearch3
REGEXP_TYPE=std::regex::ECMAScript
1 change: 1 addition & 0 deletions benchmark/gapped-matching/external/easyloggingpp
Submodule easyloggingpp added at f92680