profile and optimize find_record #26

wez · 2013-09-28T20:55:29Z

Throwing this one out there for folks that want to dive in.

In the bufq API we have ph_bufq_consume_record which searches the bufq for a matching record delimiter.

For many internet protocols this delimiter is CRLF. For many other applications, we're likely to be looking for LF.

There are a couple of optimizations that could be investigated.

Can we accelerate the memmem call using the sse3_memchr function from here? http://repnop.org/carte/snippets/simd/. The performance.data file indicates that this performs consistently better than the darwin libc. To adopt this, we'd need to detect sse3 either at runtime or compile time
are there specializations of sse3_memchr that can be made for detecting CRLF?
are there specializations of string matching algorithms with a constant, known needle that we could use?
is there a more efficient way to match needles across the "straddle" point in cases where the delimiter straddles discontiguous memory regions?
for long records, we make repeated calls and search across the same memory regions repeatedly. We can surely cache the last searched offset and improve efficiency. We'd need to make sure that we invalidate this offset in the appropriate circumstances (mostly when we've consumed past that point)

The text was updated successfully, but these errors were encountered:

wez · 2013-09-29T00:21:31Z

Regarding SSE detection, we could use more of @sbahra's work from here: http://repnop.org/cpuid.html

dhobsd · 2013-09-30T18:27:45Z

We can probably accelerate memmem with an SSE version when searching for power-of-two sized strings. Otherwise, I think memmem will be faster.
We can specialize sse3_memchr for CRLF and friends, but at this point, we may as well just use the "memmem" version since these are all powers of two.
I don't understand what you mean by "constant, known needle"
For discontiguous buffers, I think we likely need to fill up a "stitch" space of 16 or 32 bytes depending on whether we're using xmm or ymm. I'm not sure whether that copy would invalidate performance gains; on the other hand, I don't see a way around a copy when searching for substrings longer than a byte long.

I think in either case, @sbahra's SSE string code / CPUID code is a good starting point for this.

ghost · 2015-08-04T18:01:27Z

Thank you for reporting this issue and appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days or the issue may be closed.

wez · 2015-08-05T15:12:36Z

4d56a10 made the situation here a bit better, but we haven't gone for the super low level aspect of this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

profile and optimize find_record #26

profile and optimize find_record #26

wez commented Sep 28, 2013

wez commented Sep 29, 2013

dhobsd commented Sep 30, 2013

ghost commented Aug 4, 2015

wez commented Aug 5, 2015

profile and optimize find_record #26

profile and optimize find_record #26

Comments

wez commented Sep 28, 2013

wez commented Sep 29, 2013

dhobsd commented Sep 30, 2013

ghost commented Aug 4, 2015

wez commented Aug 5, 2015