Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text: add fun isSubsequenceOf #369

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

Anton-Latukha
Copy link

Closes report: #368

Function to detect ordered subsequences in the content.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Anton-Latukha nice! But could we possibly squeeze out more performance by operating on underlying buffers? I imagine fuzzy search needs to be as fast as possible.

src/Data/Text.hs Outdated Show resolved Hide resolved
@Anton-Latukha
Copy link
Author

I currently do not have enough knowledge of low-level and high performance Haskell. But would learn and try to improve.

Currently, next I would cover the supplementary quality required: tests, benchmarks and things that good library piece requires. Having that would look into how further optimization.

Thank you for your work and reviews.

src/Data/Text.hs Outdated
@@ -1877,6 +1879,29 @@ isInfixOf needle haystack
| otherwise = not . L.null . indices needle $ haystack
{-# INLINE [1] isInfixOf #-}

-- | The 'isSubsequenceOf' function takes two 'Text's and returns
-- 'True' iff the second is a subsequence of the first.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks backwards. first `isSubsequenceOf` second should be testing whether first is a subsequence of second.

This comment was marked as outdated.

Copy link
Contributor

@Lysxia Lysxia Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of arguments does not impact performance at all here. The order of argument's I would expect is that first `isSubsequenceOf` second matches the English specification "first is a subsequence of second". Furhtermore this would be consistent with the order of arguments of isInfixOf.

Maybe you're thinking of a future implementation where you might precompute a state machine given the subsequence, but 1) this is not it, 2) then it could be called isSupersequenceOf instead.

This comment was marked as outdated.

Copy link
Contributor

@Bodigrim Bodigrim Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that we digress here. The original ticket has a pretty solid motivation: there is Data.List.isSubsequenceOf, it is desirable to have a counterpart working over Text. However, swapping arguments of Data.Text.isSubsequence to an opposite of Data.List.isSubsequenceOf directly contradicts this goal.

Currently all isFooBarOf functions in core packages take a needle first and a haystack second. If there is an interest to change this convention, let's discuss it separately (and probably not even in text issue tracker). FWIW I think that even from a theoretical viewpoint it is better to keep a needle first: fuzzy search compares a fixed needle against thousands of file names.

Further, if there is a desire to add isSupersequenceOf or any other function to the public API, let's discuss it in an issue first.

In the meantime I suggest we stick to the convention and make isSubsequenceOf to take a needle first.

This comment was marked as outdated.

src/Data/Text.hs Outdated
-- 'True' iff the second is a subsequence of the first.
-- (characters of the second argument appear in same sequential order in
-- the first, to say if second argument can be derived by deleting some
-- or no elements from the first).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems worth mentioning in the doc that this is useful for fuzzy search.

@Boarders
Copy link

Boarders commented Sep 28, 2021

I didn't test this (and so I wouldn't trust it to actually be correct!), but just to give you a rough idea of how a more efficient implementation might look, something like this might be a good starting point:

isSubsequenceOf :: Text -> Text -> Bool
isSubsequenceOf (Text needle o1 l1) (Text target o2 l2) = go o1 o2
  where
    go :: Int -> Int -> Bool
    go i _ | i >= l1 = True
    go _ j | j >= l2 = False
    go !i !j =
      let
        !(Iter c d) =  iterArray needle i
        !(Iter c' d') = iterArray target j
      in
        if c == c'
          then go (i + d) (j + d')
          else go i (j + d')

There is plenty of good material to steal from in the rest of the library too if you want to see how to work with the underlying array. Getting tests and benchmarks going before figuring this out would be great though!

@Anton-Latukha Anton-Latukha force-pushed the 2021-09-25-add-isSubsequenceOf branch from 505ffd3 to 6f06b72 Compare March 2, 2022 17:34
Function to detect ordered subsequences in the content.
Previous argument order was alike `lookup` argument order mistake, when key
cached (compiler saturates function) across DBs.

Efficient use of the function is to all it once (per pattern) on the whole
contents.

So the contents should be cached first, and the search patterns are switchable.
As suggested, indeed `uncons` replaces 3 funciton invocations with 1.
`on` now ensures that `uncons` can be specialized/consumed once.

It would be interesting to look into Core at some point of work.
This way may have more stack reuse.
@Anton-Latukha Anton-Latukha force-pushed the 2021-09-25-add-isSubsequenceOf branch from 6f06b72 to 073d676 Compare March 2, 2022 18:15
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see the function in the library, but there seems to be some issues with this implementation.

If we had efficient implementations of findCharIndex :: Char -> Text -> Maybe Int then we could make this function quite fast in C on the underlying utf-8 representation.

Comment on lines +1910 to +1921
subseqOf :: Text -> Text -> Bool
subseqOf t s =
on f uncons t s
where
f :: Maybe (Char, Text) -> Maybe (Char, Text) -> Bool
f _ Nothing = True
f Nothing _ = False
f (Just (tc,ts)) (Just (sc,ss)) =
subseqOf ts $
if tc == sc
then s
else ss
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason not to just use

isSubsequenceOf needles haystack = List.isSubsequenceOf (T.unpack needles) (T.unpack haystack)

This code is doing exactly the same work, since T.unpack (as far as I understand) is just repeated calls to uncons. I can't see how there's be any performance here (though I could see an argument to using an optimised implementation in C if there's a more efficient algorithm that takes advantage of the UTF-8 encoding).

--
-- Examples:
--
-- >>> isSubsequenceOf "1234567" "1356"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't the arguments her backwards? Generally isInfixOf etc. are used infix:

>>> "ello" `isInfixOf` "Hello, world!"
True

but this would be

"1234567" `isSubsequenceOf` "1356"

which should be False, if you read it in English. It should be isSubsequenceOf needles haystack - "1234567" is not a subsequence of "1356".

Comment on lines +239 to +240
genOrdSubseq txt =
T.pack . transform <$> genTransformMap
Copy link

@ghost ghost Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't understand what this implementation is actually doing, but is this not all we need?

Suggested change
genOrdSubseq txt =
T.pack . transform <$> genTransformMap
genOrdSubseq txt =
T.pack . map fst . filter snd . zip (T.unpack txt) <$> vectorOf (T.length txt) arbitrary :: Gen [Bool]

We're just trying to select a random sample of elements from the given string in order right? Are there any other properties we need from this? Trying to ensure that the result is non-empty isn't too hard, you just generate two lists of Bools:

Suggested change
genOrdSubseq txt =
T.pack . transform <$> genTransformMap
genOrdSubseq txt = do
firstLen <- choose (1,T.length txt - 2)
prefix <- vectorOf firstLen arbitrary
suffix <- vectorOf (T.length txt - firstLen - 1) arbitrary
pure $ T.pack . map fst . filter snd . zip (T.unpack txt) $ prefix ++ True : suffix

(not sure if I have the maths right, but I'm trying to make a list of length txt which has a True somewhere in the middle)

@ghost
Copy link

ghost commented Jun 22, 2022

uh, sorry, I didn't see others' reviews when I was writing mine, and it looks like most of my comments have been mentioned already. Sorry for the duplicate review, but I hope I have something to add nonetheless.

@ghost ghost mentioned this pull request Jun 22, 2022
@axman6 axman6 mentioned this pull request Jul 1, 2022
@Lysxia
Copy link
Contributor

Lysxia commented Feb 7, 2023

I would accept the implementation mentioned in an earlier comment:

isSubsequenceOf needles haystack = List.isSubsequenceOf (T.unpack needles) (T.unpack haystack)

If you want to stick to the current implementation,

  • the order of argument should be the same as Data.List.isSubsequenceOf
  • the auxiliary subseqOf should avoid doing redundant uncons when the needle doesn't change. This can be done by specializing it to a non-empty needle (represented as a Char and Text argument

Compared to using List.isSubsequenceOf, a well optimized implementation would have the advantage of not allocating intermediate data structures (I don't think you need any unsafe primitives, GHC's worker-wrapper + inlining should be sufficient).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants