Recursively search directories for a regex pattern

p-ranav/hypergrep

Highlights

  • Search recursively for a regex pattern using Intel Hyperscan.
  • When a git repository is detected, the repository index is searched using libgit2.
  • Similar to grep, ripgrep, ugrep, The Silver Searcher etc.
  • C++17, Multi-threading, SIMD.
  • USAGE GUIDE
  • Implementation notes here.
  • Not cross-platform. Tested on Linux.
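
The git-aware behavior mentioned in the highlights can be approximated with stock tools. The sketch below is only an analogy built on the git CLI and find, not hypergrep's libgit2-based implementation; the helper name list_search_targets is hypothetical.

```shell
# list_search_targets DIR (hypothetical helper, not part of hypergrep)
# Prints the files a git-aware search would look at:
#   - inside a git work tree: the paths recorded in the index
#   - anywhere else: every regular file under DIR
list_search_targets() {
    dir=$1
    if command -v git >/dev/null 2>&1 &&
       git -C "$dir" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
        git -C "$dir" ls-files
    else
        (cd "$dir" && find . -type f | sed 's|^\./||')
    fi
}

# Usage (paths are printed relative to DIR):
#   list_search_targets /path/to/project
```

Searching only indexed paths skips build outputs and other untracked files, which is one plausible reason a git-aware search has less work to do in a built source tree.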

Performance

The following tests compare the performance of hypergrep against ag (The Silver Searcher), ugrep, and ripgrep.

System Details

| Type | Value |
|:--|:--|
| Processor | 11th Gen Intel(R) Core(TM) i9-11900KF @ 3.50GHz 3.50 GHz |
| Instruction Set Extensions | Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2, Intel® AVX-512 |
| Installed RAM | 32.0 GB (31.9 GB usable) |
| SSD | ADATA SX8200PNP |
| OS | Ubuntu 20.04 LTS |
| C++ Compiler | g++ (Ubuntu 11.1.0-1ubuntu1-20.04) 11.1.0 |

Vcpkg Installed Libraries

vcpkg commit: 662dbb5

| Library | Version |
|:--|:--|
| argparse | 2.9 |
| concurrentqueue | 1.0.3 |
| fmt | 10.0.0 |
| hyperscan | 5.4.2 |
| libgit2 | 1.6.4 |

Single Large File Search: OpenSubtitles.raw.en.txt

The following searches are performed on a single large file cached in memory (~13GB, OpenSubtitles.raw.en.gz).

| Description | Regex | Line Count | ag | ugrep | ripgrep | hypergrep |
|:--|:--|--:|--:|--:|--:|--:|
| Count number of times Holmes did something | `hgrep -c 'Holmes did \w'` | 27 | n/a | 1.820 | 1.022 | 0.696 |
| Literal with Regex Suffix | `hgrep -nw 'Sherlock [A-Z]\w+' en.txt` | 7882 | n/a | 1.812 | 1.509 | 0.803 |
| Simple Literal | `hgrep -nw 'Sherlock Holmes' en.txt` | 7653 | 15.764 | 1.888 | 1.524 | 0.658 |
| Simple Literal (case insensitive) | `hgrep -inw 'Sherlock Holmes' en.txt` | 7871 | 15.599 | 6.945 | 2.162 | 0.650 |
| Alternation of Literals | `hgrep -n 'Sherlock Holmes\|John Watson\|Irene Adler\|Inspector Lestrade\|Professor Moriarty' en.txt` | 10078 | n/a | 6.886 | 1.836 | 0.689 |
| Alternation of Literals (case insensitive) | `hgrep -in 'Sherlock Holmes\|John Watson\|Irene Adler\|Inspector Lestrade\|Professor Moriarty' en.txt` | 10333 | n/a | 7.029 | 3.940 | 0.770 |
| Words surrounding a literal string | `hgrep -n '\w+[\x20]+Holmes[\x20]+\w+' en.txt` | 5020 | n/a | 6m 11s | 1.523 | 0.638 |
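
The searches above run against a file that is already cached in memory. One common way to get a file into the page cache on Linux is to read it once before measuring. The sketch below does that; it is an assumption about methodology, not the procedure used to produce these numbers, and warm_cache is a hypothetical helper.

```shell
# warm_cache FILE (hypothetical helper)
# Reads FILE once and discards the bytes so that, memory permitting,
# later reads are served from the OS page cache instead of the SSD.
warm_cache() {
    cat "$1" > /dev/null
}

# Usage, with the file name used in the commands above:
#   warm_cache en.txt
#   time hgrep -nw 'Sherlock Holmes' en.txt
```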

Git Repository Search: torvalds/linux

The following searches are performed on the entire Linux kernel source tree (after running make defconfig && make -j8). The commit used is f1fcb.

| Description | Regex | Line Count | ag | ugrep | ripgrep | hypergrep |
|:--|:--|--:|--:|--:|--:|--:|
| Simple Literal | `hgrep -nw 'PM_RESUME'` | 9 | 2.807 | 0.316 | 0.147 | 0.140 |
| Simple Literal (case insensitive) | `hgrep -niw 'PM_RESUME'` | 39 | 2.904 | 0.435 | 0.149 | 0.141 |
| Regex with Literal Suffix | `hgrep -nw '[A-Z]+_SUSPEND'` | 536 | 3.080 | 1.452 | 0.148 | 0.143 |
| Alternation of four literals | `hgrep -nw '(ERR_SYS\|PME_TURN_OFF\|LINK_REQ_RST\|CFG_BME_EVT)'` | 16 | 3.085 | 0.410 | 0.153 | 0.146 |
| Unicode Greek | `hgrep -n '\p{Greek}'` | 111 | 3.762 | 0.484 | 0.345 | 0.146 |

Git Repository Search: apple/swift

The following searches are performed on the entire Apple Swift source tree. The commit used is 3865b.

| Description | Regex | Line Count | ag | ugrep | ripgrep | hypergrep |
|:--|:--|--:|--:|--:|--:|--:|
| Function/Struct/Enum declaration followed by a valid identifier and opening parenthesis | `hgrep -n '(func\|struct\|enum)\s+[A-Za-z_][A-Za-z0-9_]*\s*\('` | 59026 | 1.148 | 0.954 | 0.154 | 0.090 |
| Words starting with alphabetic characters followed by at least 2 digits | `hgrep -nw '[A-Za-z]+\d{2,}'` | 127858 | 1.169 | 1.238 | 0.156 | 0.095 |
| Words starting with an uppercase letter, followed by alphanumeric characters and/or underscores | `hgrep -nw '[A-Z][a-zA-Z0-9_]*'` | 2012372 | 3.131 | 2.598 | 0.550 | 0.482 |
| Guard let statement followed by valid identifier | `hgrep -n 'guard\s+let\s+[a-zA-Z_][a-zA-Z0-9_]*\s*=\s*\w+'` | 839 | 0.828 | 0.174 | 0.054 | 0.047 |

Directory Search: /usr

The following searches are performed on the /usr directory.

| Description | Regex | Line Count | ag | ugrep | ripgrep | hypergrep |
|:--|:--|--:|--:|--:|--:|--:|
| Any HTTPS or FTP URL | `hgrep "(https?\|ftp)://[^\s/$.?#].[^\s]*"` | 13682 | 4.597 | 2.894 | 0.305 | 0.171 |
| Any IPv4 IP address | `hgrep -w "(?:\d{1,3}\.){3}\d{1,3}"` | 12643 | 4.727 | 2.340 | 0.324 | 0.166 |
| Any E-mail address | `hgrep -w "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"` | 47509 | 5.477 | 37.209 | 0.494 | 0.220 |
| Any valid date MM/DD/YYYY | `hgrep "(0[1-9]\|1[0-2])/(0[1-9]\|[12]\d\|3[01])/(19\|20)\d{2}"` | 116 | 4.239 | 1.827 | 0.251 | 0.163 |
| Count the number of HEX values | `hgrep -cw "(?:0x)?[0-9A-Fa-f]+"` | 68042 | 5.765 | 28.691 | 1.439 | 0.611 |
| Search any C/C++ file for a literal | `hgrep --filter "\.(c\|cpp\|h\|hpp)$" test` | 7355 | n/a | 0.505 | 0.118 | 0.079 |
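
To see which files the extension pattern in the last search selects, the sketch below runs the same regex over a few made-up paths. grep -E stands in for hypergrep's regex engine, and passes_filter is a hypothetical helper. Whether hypergrep applies --filter to the full path or only to the file name is not stated here; for a pattern anchored on the extension, the result is the same either way.

```shell
# passes_filter PATH (hypothetical helper)
# Succeeds when PATH matches the extension regex used with --filter above.
passes_filter() {
    printf '%s\n' "$1" | grep -Eq '\.(c|cpp|h|hpp)$'
}

# Example paths are invented for illustration.
for path in src/main.cpp include/util.h notes.txt src/old.cpp.bak; do
    if passes_filter "$path"; then
        echo "search $path"
    else
        echo "skip   $path"
    fi
done
```

The trailing `$` matters: without it, src/old.cpp.bak would also be selected because `.cpp` appears inside the name.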

Build

Install Dependencies with vcpkg

git clone https://github.com/microsoft/vcpkg
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install concurrentqueue fmt argparse libgit2 hyperscan

Build hypergrep using cmake and vcpkg

Clone the repository

git clone https://github.com/p-ranav/hypergrep
cd hypergrep

If cmake is older than 3.19

mkdir build
cd build
cmake -DCMAKE_TOOLCHAIN_FILE=<path_to_vcpkg>/scripts/buildsystems/vcpkg.cmake ..
make

If cmake is newer than 3.19

Use the release preset:

export VCPKG_ROOT=<path_to_vcpkg>
cmake -B build -S . --preset release
cmake --build build
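
If you are unsure which set of commands applies, the following sketch compares the installed cmake version against 3.19. The helper names version_ge and pick_flow are hypothetical, and treating 3.19 itself as preset-capable is an assumption, since the instructions above only cover older and newer versions.

```shell
# version_ge A B: succeeds when A >= B, comparing major.minor only.
version_ge() {
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
    [ "$a_major" -gt "$b_major" ] ||
        { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -ge "$b_minor" ]; }
}

# pick_flow VERSION: prints which set of build commands applies.
# Assumption: 3.19 itself is treated as able to use the preset flow.
pick_flow() {
    if version_ge "$1" 3.19; then echo preset; else echo toolchain-file; fi
}

if command -v cmake >/dev/null 2>&1; then
    # The first line of `cmake --version` looks like "cmake version 3.22.1".
    installed=$(cmake --version | head -n 1 | awk '{print $3}')
    echo "cmake $installed: use the $(pick_flow "$installed") flow"
fi
```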

Binary Portability

To build a binary that is portable across x86_64 systems, invoke cmake with the -DBUILD_PORTABLE=on option. This compiles with -march=x86-64 -mtune=generic and links with -static-libgcc -static-libstdc++, so the GCC runtime and the C++ standard library are linked statically into the binary, reducing dependencies on the target system.
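
As a sketch, a portable build using the toolchain-file flow could look like the following. It assumes the current directory is the hypergrep checkout and that vcpkg lives at $VCPKG_ROOT; the fallback to ~/vcpkg is an arbitrary default chosen for illustration.

```shell
# Assumed location of the vcpkg toolchain file (adjust to your setup).
toolchain="${VCPKG_ROOT:-$HOME/vcpkg}/scripts/buildsystems/vcpkg.cmake"

if command -v cmake >/dev/null 2>&1 && [ -f CMakeLists.txt ] && [ -f "$toolchain" ]; then
    # Run in a subshell so the caller's working directory is unchanged.
    (
        mkdir -p build && cd build &&
            cmake -DBUILD_PORTABLE=on -DCMAKE_TOOLCHAIN_FILE="$toolchain" .. &&
            make
    )
else
    echo "skipping: need cmake, a hypergrep checkout, and $toolchain" >&2
fi
```

Running ldd on the resulting binary is one way to check which shared libraries it still depends on.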