-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Text: add fun isSubsequenceOf #369
base: master
Are you sure you want to change the base?
Text: add fun isSubsequenceOf #369
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Anton-Latukha nice! But could we possibly squeeze out more performance by operating on underlying buffers? I imagine fuzzy search needs to be as fast as possible.
I currently do not have enough knowledge of low-level and high performance Haskell. But would learn and try to improve. Currently, next I would cover the supplementary quality required: tests, benchmarks and things that good library piece requires. Having that would look into how further optimization. Thank you for your work and reviews. |
src/Data/Text.hs
Outdated
@@ -1877,6 +1879,29 @@ isInfixOf needle haystack | |||
| otherwise = not . L.null . indices needle $ haystack | |||
{-# INLINE [1] isInfixOf #-} | |||
|
|||
-- | The 'isSubsequenceOf' function takes two 'Text's and returns | |||
-- 'True' iff the second is a subsequence of the first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks backwards. first `isSubsequenceOf` second
should be testing whether first
is a subsequence of second
.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of arguments does not impact performance at all here. The order of argument's I would expect is that first `isSubsequenceOf` second
matches the English specification "first
is a subsequence of second
". Furhtermore this would be consistent with the order of arguments of isInfixOf
.
Maybe you're thinking of a future implementation where you might precompute a state machine given the subsequence, but 1) this is not it, 2) then it could be called isSupersequenceOf
instead.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that we digress here. The original ticket has a pretty solid motivation: there is Data.List.isSubsequenceOf
, it is desirable to have a counterpart working over Text
. However, swapping arguments of Data.Text.isSubsequence
to an opposite of Data.List.isSubsequenceOf
directly contradicts this goal.
Currently all isFooBarOf
functions in core packages take a needle first and a haystack second. If there is an interest to change this convention, let's discuss it separately (and probably not even in text
issue tracker). FWIW I think that even from a theoretical viewpoint it is better to keep a needle first: fuzzy search compares a fixed needle against thousands of file names.
Further, if there is a desire to add isSupersequenceOf
or any other function to the public API, let's discuss it in an issue first.
In the meantime I suggest we stick to the convention and make isSubsequenceOf
to take a needle first.
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
src/Data/Text.hs
Outdated
-- 'True' iff the second is a subsequence of the first. | ||
-- (characters of the second argument appear in same sequential order in | ||
-- the first, to say if second argument can be derived by deleting some | ||
-- or no elements from the first). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems worth mentioning in the doc that this is useful for fuzzy search.
I didn't test this (and so I wouldn't trust it to actually be correct!), but just to give you a rough idea of how a more efficient implementation might look, something like this might be a good starting point: isSubsequenceOf :: Text -> Text -> Bool
isSubsequenceOf (Text needle o1 l1) (Text target o2 l2) = go o1 o2
where
go :: Int -> Int -> Bool
go i _ | i >= l1 = True
go _ j | j >= l2 = False
go !i !j =
let
!(Iter c d) = iterArray needle i
!(Iter c' d') = iterArray target j
in
if c == c'
then go (i + d) (j + d')
else go i (j + d') There is plenty of good material to steal from in the rest of the library too if you want to see how to work with the underlying array. Getting tests and benchmarks going before figuring this out would be great though! |
505ffd3
to
6f06b72
Compare
Function to detect ordered subsequences in the content.
Previous argument order was alike `lookup` argument order mistake, when key cached (compiler saturates function) across DBs. Efficient use of the function is to all it once (per pattern) on the whole contents. So the contents should be cached first, and the search patterns are switchable.
As suggested, indeed `uncons` replaces 3 funciton invocations with 1.
`on` now ensures that `uncons` can be specialized/consumed once. It would be interesting to look into Core at some point of work.
This way may have more stack reuse.
6f06b72
to
073d676
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to see the function in the library, but there seems to be some issues with this implementation.
If we had efficient implementations of findCharIndex :: Char -> Text -> Maybe Int
then we could make this function quite fast in C on the underlying utf-8 representation.
subseqOf :: Text -> Text -> Bool | ||
subseqOf t s = | ||
on f uncons t s | ||
where | ||
f :: Maybe (Char, Text) -> Maybe (Char, Text) -> Bool | ||
f _ Nothing = True | ||
f Nothing _ = False | ||
f (Just (tc,ts)) (Just (sc,ss)) = | ||
subseqOf ts $ | ||
if tc == sc | ||
then s | ||
else ss |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a good reason not to just use
isSubsequenceOf needles haystack = List.isSubsequenceOf (T.unpack needles) (T.unpack haystack)
This code is doing exactly the same work, since T.unpack
(as far as I understand) is just repeated calls to uncons
. I can't see how there's be any performance here (though I could see an argument to using an optimised implementation in C if there's a more efficient algorithm that takes advantage of the UTF-8 encoding).
-- | ||
-- Examples: | ||
-- | ||
-- >>> isSubsequenceOf "1234567" "1356" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't the arguments her backwards? Generally isInfixOf
etc. are used infix:
>>> "ello" `isInfixOf` "Hello, world!"
True
but this would be
"1234567" `isSubsequenceOf` "1356"
which should be False, if you read it in English. It should be isSubsequenceOf needles haystack
- "1234567"
is not a subsequence of "1356"
.
genOrdSubseq txt = | ||
T.pack . transform <$> genTransformMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't understand what this implementation is actually doing, but is this not all we need?
genOrdSubseq txt = | |
T.pack . transform <$> genTransformMap | |
genOrdSubseq txt = | |
T.pack . map fst . filter snd . zip (T.unpack txt) <$> vectorOf (T.length txt) arbitrary :: Gen [Bool] |
We're just trying to select a random sample of elements from the given string in order right? Are there any other properties we need from this? Trying to ensure that the result is non-empty isn't too hard, you just generate two lists of Bools:
genOrdSubseq txt = | |
T.pack . transform <$> genTransformMap | |
genOrdSubseq txt = do | |
firstLen <- choose (1,T.length txt - 2) | |
prefix <- vectorOf firstLen arbitrary | |
suffix <- vectorOf (T.length txt - firstLen - 1) arbitrary | |
pure $ T.pack . map fst . filter snd . zip (T.unpack txt) $ prefix ++ True : suffix |
(not sure if I have the maths right, but I'm trying to make a list of length txt
which has a True
somewhere in the middle)
uh, sorry, I didn't see others' reviews when I was writing mine, and it looks like most of my comments have been mentioned already. Sorry for the duplicate review, but I hope I have something to add nonetheless. |
I would accept the implementation mentioned in an earlier comment:
If you want to stick to the current implementation,
Compared to using |
Closes report: #368
Function to detect ordered subsequences in the content.