It would be interesting to know how something like attoparsec fares here as a slightly higher-level abstraction.
Very cool! This is a nice general optimization.
In your specific case, I think there is a much easier solution though. git grep has already parsed the exact match you’re looking for and you only need to re-parse every line because your git grep returns the entire line.
If you tell it to return only the exact matches (with git grep -o and a pattern that includes the contents of the references, e.g. with -E: (@|#)\(ref:[^)]*\)), then you will get one result per match (instead of one per line) and only need to trim off the first and last character.
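To make the suggestion concrete, here is a minimal sketch of what post-processing a single match could look like. It assumes the matched text has already been separated from any file name or line/column prefix that git grep prints, and that it has exactly the shape produced by the pattern above; parseRef is a hypothetical helper name, not something from the original code.

```haskell
import Control.Monad (guard)
import qualified Data.ByteString.Char8 as BC

-- | Split a match such as "@(ref:foo)" into its sigil and reference name.
-- Returns Nothing when the input does not have the assumed shape.
parseRef :: BC.ByteString -> Maybe (Char, BC.ByteString)
parseRef match = do
  (sigil, rest) <- BC.uncons match
  guard (sigil == '@' || sigil == '#')
  -- Strip the fixed wrapper instead of scanning the line for it.
  inner <- BC.stripPrefix (BC.pack "(ref:") rest
  name <- BC.stripSuffix (BC.pack ")") inner
  pure (sigil, name)
```

Because the wrapper is fixed, no per-character predicate is needed here: the work is a couple of constant-size prefix and suffix checks per match.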
Thanks for the reminder! An old version of the original algorithm used -o, but we removed it because the git that ships with CentOS 7 was too old and didn’t have the flag. But I am using --column, which is newer than -o, so let me add -o and document the minimum git version required. Thanks!
Would it make sense for the docs for break to have a note on performance, or mention that there are much faster ways to break on single specific characters?
Yes, it would make sense.
I think the migration from Text to ByteString could provide only marginal gains and probably was not worth it. UTF-8 decoding is implemented with SIMD instructions, so it is very fast, especially on the happy path. And one can use decodeUtf8Lenient to replace invalid bytes instead of failing on them.
You could have looked inside Text to find a ByteArray suitable for Data.Text.Internal.ArrayUtils.memchr (yes, suspiciously enough, text already uses memchr). For a more ergonomic solution I’d welcome PRs for two haskell/text issues: "Add RULE from break to breakOn" (#695) and "Search of a singleton needle should use memchr" (#696).
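For readers unfamiliar with the two functions named in those issues, here is a small sketch showing that breaking on a predicate for one specific character and breaking on a one-character needle return the same split. The names breakPred and breakNeedle are made up for illustration, and nothing here measures performance; which one is faster depends on the text version in use.

```haskell
import qualified Data.Text as T

-- | Break at the first ':' using a per-character predicate.
breakPred :: T.Text -> (T.Text, T.Text)
breakPred = T.break (== ':')

-- | Break at the first ':' using a singleton needle.
breakNeedle :: T.Text -> (T.Text, T.Text)
breakNeedle = T.breakOn (T.singleton ':')
```

In both cases the delimiter stays at the start of the second component, and an input without the delimiter comes back whole with an empty second component.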
Well, the migration from Text to ByteString unblocked using elemIndex, so it’s worth it now. In a future where Text provides an equivalent of elemIndex, sure. And no, I’d rather not use that internal memchr function.
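As a rough illustration of the elemIndex approach (a sketch, not the code from the original post), splitting a ByteString at the first occurrence of a single character can look like this. splitAtChar is a hypothetical helper name, and the sketch assumes the delimiter is a single-byte character.

```haskell
import qualified Data.ByteString.Char8 as BC

-- | Split at the first occurrence of the delimiter, dropping the delimiter.
-- Returns Nothing when the delimiter does not occur.
splitAtChar :: Char -> BC.ByteString -> Maybe (BC.ByteString, BC.ByteString)
splitAtChar c s = do
  i <- BC.elemIndex c s
  -- take and drop on a strict ByteString are slices; no bytes are copied.
  pure (BC.take i s, BC.drop (i + 1) s)
```

Unlike break with a predicate, elemIndex is told exactly which byte to look for, which lets the library search for it directly.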
So fun story: I tried this, and found a git bug.