Handle nym similarity #588

Draft
wants to merge 10 commits into
base: master

Conversation

SatsAllDay
Contributor

@SatsAllDay SatsAllDay commented Oct 25, 2023

Closes #580

This PR updates the edit nym feature by introducing a fee when the new nym is too similar to another user's nym. Similarity is calculated using Levenshtein distance (see #580 for discussion).

The UX for the edit nym form was updated to include a cancel and a save button so users can back out of changing their nym if they want, which they can't explicitly do today.

The save button is largely a copy of the existing FeeButton component, built for this particular use case instead of being item-focused.

The save button also displays a receipt if the fee is non-zero, to explain to the user why changing their nym costs as much as it does.

The edit nym action is also now invoiceable, meaning a user can pay an invoice on-demand if their balance doesn't have enough to cover the transaction.

Paid nym changes are logged in a new NonItemAct table, and are therefore included in the wallet history aka satistics page. We still don't track a history of which nyms were used by any account - just how much was paid and when to change to a new nym.

Additionally, when new users are created, their auto-generated nyms are checked to see what they would cost; if the cost would be non-zero, a random nym is generated instead. This prevents a user from bypassing the fee for a high-value nym by creating an account via email, Twitter, or GitHub login.
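
The signup check described above could look roughly like the following Python sketch. It is an illustration only, not the code in this PR: `pick_signup_nym`, the injected `nym_cost` callable, and the `stacker_<hex>` fallback format are all assumptions.

```python
import secrets
from typing import Callable


def pick_signup_nym(suggested: str, nym_cost: Callable[[str], int]) -> str:
    """Keep the auto-generated nym only if changing to it would be free."""
    if nym_cost(suggested) == 0:
        return suggested
    # Hypothetical fallback format; the PR does not say what random nyms look like.
    return f"stacker_{secrets.token_hex(4)}"
```

A real version would presumably also confirm that the random nym is unused and free before accepting it.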

Some nuances in the code:

  1. The current user's nym is excluded from the Levenshtein distance calculation, since it's OK for a user to change their nym to something similar to their current one.
  2. I tried to update the user's name in the server-side session upon name change, but I'm not convinced it works. This was recently discovered as the source of a bug with hiding bookmarks from yourself. This may warrant further investigation.
  3. I introduced a new wallet history fact type nonItemSpent because it needs a different UI rendering compared to spent types, but I show them as spent type in wallet history because users don't need to know the difference.
  4. The API call to load name change costs is configured to not cache, since nym changes can happen frequently and they can influence the response greatly.
  5. I would not be opposed to tweaking the fee calculation formula; it seems a little heavy-handed IMO.
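
For reference, the first nuance above (skipping the user's own nym) can be sketched in Python roughly like this. It is not the PR's actual query; `levenshtein` and `closest_other_nym` are hypothetical names, and comparing nyms case-insensitively is an assumption.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance with unit costs, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        prev = curr
    return prev[-1]


def closest_other_nym(
    new_nym: str, current_nym: str, all_nyms: Iterable[str]
) -> Optional[Tuple[str, int]]:
    """Return (nym, distance) for the nearest nym that is not the user's own."""
    # Assumption: nyms are compared case-insensitively.
    others = [n for n in all_nyms if n.lower() != current_nym.lower()]
    if not others:
        return None
    scored = ((n, levenshtein(new_nym.lower(), n.lower())) for n in others)
    return min(scored, key=lambda pair: pair[1])
```

Without the exclusion, a small tweak to your own nym would always count as a near-collision with yourself.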

enum NonItemActType {
NYM_CHANGE
}

Member

Nice! Probably should've done this with donations!

Contributor Author

This has already grown far beyond what I originally expected lol

@SatsAllDay SatsAllDay changed the title WIP: nym similarity Handle nym similarity Oct 26, 2023
@SatsAllDay SatsAllDay marked this pull request as ready for review October 26, 2023 18:31
@huumn
Member

huumn commented Oct 31, 2023

I've been messing around with this. I don't think levenshtein by itself gives us enough information to know the distance for our purposes, ie it's really expensive for names that are nothing like any others.

I've been working with a plpgsql function that supports transpositions implementing damerau-levenshtein. It's really slow, but it's more accurate.

If we go this route though, we'll probably want to first filter users somehow to account for the slowness.

Anyway, not suggesting you need to do this, but I wanted to update my status on this.

CREATE OR REPLACE FUNCTION are_visually_similar(ch1 char, ch2 char)
RETURNS boolean AS $$
BEGIN
    RETURN (ch1 = ch2) OR (
        (LOWER(ch1), LOWER(ch2)) IN (
            ('0', 'o'), ('o', '0'),
            ('1', 'i'), ('i', '1'), ('1', 'l'), ('l', '1'),
            ('2', 'z'), ('z', '2'), ('2', '3'), ('3', '2'),
            ('5', 's'), ('s', '5'), ('5', '6'), ('6', '5'),
            ('6', '8'), ('8', '6'),
            ('7', '1'), ('1', '7'),
            ('9', 'q'), ('q', '9'), ('9', 'g'), ('g', '9'),
            ('a', '4'), ('4', 'a'),
            ('a', 'd'), ('d', 'a'),
            ('b', '6'), ('6', 'b'), ('b', 'h'), ('h', 'b'),
            ('c', 'o'), ('o', 'c'), ('c', 'e'), ('e', 'c'),
            ('d', '0'), ('0', 'd'), ('d', 'o'), ('o', 'd'),
            ('e', '3'), ('3', 'e'), ('e', 'o'), ('o', 'e'),
            ('f', 't'), ('t', 'f'),
            ('g', 'q'), ('q', 'g'),
            ('i', 'j'), ('j', 'i'),
            ('i', 'l'), ('l', 'i'),
            ('i', 't'), ('t', 'i'),
            ('k', 'x'), ('x', 'k'),
            ('m', 'n'), ('n', 'm'), ('m', 'r'), ('r', 'm'), ('m', 'nn'), ('nn', 'm'),
            ('n', 'u'), ('u', 'n'),
            ('o', 'u'), ('u', 'o'),
            ('p', 'q'), ('q', 'p'),
            ('v', 'w'), ('w', 'v'), ('v', 'y'), ('y', 'v'), ('v', 'vv'), ('vv', 'v'),
            ('x', 'z'), ('z', 'x'),
            ('c', 'g'), ('g', 'c'),
            ('f', 'e'), ('e', 'f'),
            ('p', 'f'), ('f', 'p'),
            ('q', 'o'), ('o', 'q'),
            ('c', 'c')
        )
    );
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION damerau_levenshtein_distance(
    str1 text,
    str2 text,
    insertion_cost int DEFAULT 1,
    deletion_cost int DEFAULT 1,
    substitution_cost int DEFAULT 1,
    transposition_cost int DEFAULT 1
)
RETURNS int AS $$
DECLARE
    len1 int;
    len2 int;
    i int;
    j int;
    cost int;
    d int[][];
    ch1 char;
    ch2 char;
BEGIN
    len1 = LENGTH(str1);
    len2 = LENGTH(str2);

    -- Short-circuit: If length difference is greater than possible edit distance
    IF ABS(len1 - len2) > GREATEST(insertion_cost, deletion_cost) * GREATEST(len1, len2) THEN
        RETURN ABS(len1 - len2);
    END IF;

    -- Special cases: if either string is empty
    IF len1 = 0 THEN
        RETURN len2 * insertion_cost;
    ELSIF len2 = 0 THEN
        RETURN len1 * deletion_cost;
    END IF;

    -- Initialize 2D array with dimensions (len1+1) x (len2+1)
    d := ARRAY_FILL(0, ARRAY[len1+1, len2+1]);

    -- Initialize the first row and column
    FOR i IN 0..len1 LOOP
        d[i+1][1] := i * deletion_cost;
    END LOOP;
    FOR j IN 0..len2 LOOP
        d[1][j+1] := j * insertion_cost;
    END LOOP;

    -- Populate the matrix
    FOR i IN 1..len1 LOOP
        ch1 := SUBSTRING(str1 FROM i FOR 1);
        FOR j IN 1..len2 LOOP
            ch2 := SUBSTRING(str2 FROM j FOR 1);
            IF are_visually_similar(ch1, ch2) THEN
                cost := 0;
            ELSE
                cost := substitution_cost;
            END IF;

            d[i+1][j+1] := LEAST(
                d[i][j+1] + deletion_cost,   -- Deletion
                d[i+1][j] + insertion_cost,   -- Insertion
                d[i][j] + cost   -- Substitution
            );

            -- Check for transposition
            IF i > 1 AND j > 1 THEN
                IF are_visually_similar(ch1, SUBSTRING(str2 FROM j-1 FOR 1)) AND are_visually_similar(ch2, SUBSTRING(str1 FROM i-1 FOR 1)) THEN
                    d[i+1][j+1] := LEAST(
                        d[i+1][j+1],
                        d[i-1][j-1] + transposition_cost -- Transposition
                    );
                END IF;
            END IF;
        END LOOP;
    END LOOP;

    -- The distance is at the bottom-right corner of the matrix
    RETURN d[len1+1][len2+1];
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
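
One possible way to prefilter users, as mentioned before the SQL above, is a cheap length check: only insertions and deletions change a string's length, so the length difference times the cheaper of the two costs is a lower bound on the distance. A Python sketch of that idea follows; `may_be_within` and `prefilter` are hypothetical names, and this is not something the PR implements.

```python
from typing import Iterable, List


def may_be_within(a: str, b: str, max_dist: int,
                  insertion_cost: int = 1, deletion_cost: int = 1) -> bool:
    """Necessary (not sufficient) condition for distance(a, b) <= max_dist."""
    # Each insertion or deletion changes the length by exactly one character;
    # substitutions and transpositions leave it unchanged.
    lower_bound = abs(len(a) - len(b)) * min(insertion_cost, deletion_cost)
    return lower_bound <= max_dist


def prefilter(target: str, nyms: Iterable[str], max_dist: int,
              insertion_cost: int = 1, deletion_cost: int = 1) -> List[str]:
    """Keep only nyms that could still be within max_dist of target."""
    return [n for n in nyms
            if may_be_within(n, target, max_dist, insertion_cost, deletion_cost)]
```

Only the survivors would then need the slow distance function. In Postgres the same bound could presumably be written as a condition on the name length, but that is untested here.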

@SatsAllDay
Contributor Author

SatsAllDay commented Oct 31, 2023

> ie it's really expensive for names that are nothing like any others.

I agree, that's kinda what I meant by it being too heavy handed. It could probably be accounted for (somewhat) by adjusting the cost formula, but ultimately a better similarity detection algorithm is best.

> I've been working with a plpgsql function that supports transpositions implementing damerau-levenshtein. It's really slow, but it's more accurate.
>
> If we go this route though, we'll probably want to first filter users somehow to account for the slowness.
>
> Anyway, not suggesting you need to do this, but I wanted to update my status on this.

Cool! It'll take me a bit to grok the algorithm, but that sounds like a great approach, assuming we can get the perf in good shape.

I imagine we should be able to update the name-cost SQL query to use this new similarity function and the rest of this PR would just work, yea?

@huumn
Member

huumn commented Oct 31, 2023

Yeah the rest of the PR is good to go afaict. We just need to be able to back up charging really well, which means we need to detect similarity really well ... which is hard.

With the above function, it's not bad if tuned to:

  1. double the distance on insertions/deletions/substitutions relative to transpositions
  2. don't add distance for substitutions if the substitution character is similar

e.g. SELECT name, damerau_levenshtein_distance(lower(name), 'stackerrnews', 2, 2, 2) as dist from users order by dist asc limit 10;

Is pretty excellent at finding similar things

name             dist
stacker_news     2
stackernews      2
stackernews9     4
stack_ernews     4
staker_news      4
stackernews7     4
stackernews5     4
stackernews1     4
stackernews92    6
stacker_mrw5g    6
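
To experiment with the tuning outside the database, the function can be approximated in Python. This is a rough port under stated assumptions, not the SQL itself: `_LOOKALIKES` is only an illustrative subset of the look-alike pairs, equality is checked case-insensitively, and the names are hypothetical.

```python
# Illustrative subset of look-alike pairs; the SQL function has a longer list.
_LOOKALIKES = {frozenset(p) for p in [
    ("0", "o"), ("1", "i"), ("1", "l"), ("5", "s"), ("i", "l"), ("m", "n"),
]}


def visually_similar(c1: str, c2: str) -> bool:
    """True if the characters match or form a known look-alike pair."""
    c1, c2 = c1.lower(), c2.lower()
    return c1 == c2 or frozenset((c1, c2)) in _LOOKALIKES


def dl_distance(s1: str, s2: str, ins: int = 1, dele: int = 1,
                sub: int = 1, trans: int = 1) -> int:
    """Damerau-Levenshtein (adjacent transpositions) with configurable costs."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i * dele
    for j in range(m + 1):
        d[0][j] = j * ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c1, c2 = s1[i - 1], s2[j - 1]
            # Substituting a look-alike character adds no distance.
            cost = 0 if visually_similar(c1, c2) else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition: the last two characters are swapped.
            if (i > 1 and j > 1
                    and visually_similar(c1, s2[j - 2])
                    and visually_similar(c2, s1[i - 2])):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans)
    return d[n][m]
```

With the doubled costs, `dl_distance("stackernews", "stackerrnews", 2, 2, 2)` is 2 (one insertion), which agrees with the table above, while a swapped pair such as `"kbo"` vs `"kob"` costs only 1.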

e.g. SELECT name, damerau_levenshtein_distance(lower(name), 'kob', 2, 2, 2) as dist from users order by dist asc limit 20;

name    dist
ko0b    2
mob     2
kb      2
ken6    2
koj     2
Kuba    2
iqb     2
k0ob    2
db      2
ken     2
Ko69    2
eb      2
koob    2
ku      2
dh      2
b0b     2
job     2
k00b    2
bob     2
pub     2

@huumn
Member

huumn commented Oct 31, 2023

Additionally, I think my cost function was too continuous.

It should probably be more of a step function.

  1. really expensive (1m) for a very short distance (0, 1, or 2)
    • this isn't accounting for my doubling of ins/del/sub costs but the idea is similar
    • also not sure on these numbers
  2. free otherwise
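
As a sketch, the step function above might look like this in Python. The threshold of 2 and the 1m fee are just the tentative numbers from the list, and `nym_change_cost` is a hypothetical name.

```python
from typing import Optional


def nym_change_cost(min_distance: Optional[int], threshold: int = 2,
                    fee_sats: int = 1_000_000) -> int:
    """Flat fee when the closest existing nym is within threshold, else free."""
    # min_distance is None when there is no other nym to compare against.
    if min_distance is not None and min_distance <= threshold:
        return fee_sats
    return 0
```

If ins/del/sub costs are doubled as in the earlier queries, the threshold would need to be scaled to match.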

@huumn
Member

huumn commented Nov 1, 2023

I also wonder if we should exclude inactive accounts from comparisons. Like, if we haven't seen someone on site in 6 months, what are we protecting?

@SatsAllDay
Contributor Author

> Additionally, I think my cost function was too linear.
>
> It should probably be more of a step function.
>
>   1. really expensive (1m) for a very short distance (0, 1, or 2)
>     • this isn't accounting for my doubling of ins/del/sub costs but the idea is similar
>     • also not sure on these numbers
>   2. free otherwise

Yea, that makes sense. Fine-tuning the cost algo might take some experimentation, but I do really like how nym-changes are currently free, so nickel and diming folks for low-risk changes would be a worsening of the experience, IMO.

> I also wonder if we should exclude inactive accounts from comparisons. Like, if we haven't seen someone on site in 6 months, what are we protecting?

Yea, that would make sense. You could take it a step further and factor activity frequency into the cost, like a scaling factor, where anything older than 6 months is 0, but as you get progressively more recently active, the scaling factor increases.
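
A rough Python sketch of that scaling factor, assuming a linear ramp and a 182-day window as a stand-in for "6 months" (both are my assumptions, as is the name `activity_weight`):

```python
from datetime import datetime, timedelta


def activity_weight(last_seen: datetime, now: datetime,
                    window: timedelta = timedelta(days=182)) -> float:
    """0.0 for accounts inactive for the whole window, rising to 1.0 if seen just now."""
    age = now - last_seen
    if age >= window:
        return 0.0  # inactive accounts are effectively excluded
    if age <= timedelta(0):
        return 1.0  # guard against timestamps slightly in the future
    return 1.0 - age / window  # timedelta / timedelta yields a float
```

The fee for colliding with a given nym could then be multiplied by that nym owner's weight, so inactive accounts drop out of the comparison entirely.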

There are probably many similar variations we could apply: basically any method of identifying a high-value account that someone would want to pretend to be, e.g. top stackers, verified contributors, verified corporate accounts, etc.

Let me know how much of this you want me to tackle, if any. I figured I'd let you finalize the above sql functions first.

@huumn
Member

huumn commented Nov 1, 2023

Thanks! I’ll let you know. The ball is fully in my court until I can figure this out

@huumn
Member

huumn commented Nov 1, 2023

Another approach: don't charge and instead let a nym "occupy" the entire grid of nyms that's 3 distance units away from it, ie consider all those nyms taken rather than for sale.
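
In code, the "occupy" idea reduces to an availability check. A minimal Python sketch, where `is_nym_available` is a hypothetical name and `distance` is whatever similarity function ends up being used:

```python
from typing import Callable, Iterable


def is_nym_available(candidate: str, taken_nyms: Iterable[str],
                     distance: Callable[[str, str], int],
                     radius: int = 3) -> bool:
    """A nym is available only if every existing nym is more than radius away."""
    return all(distance(candidate, nym) > radius for nym in taken_nyms)
```

A user editing their own nym would presumably still need their current nym left out of `taken_nyms`.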

Not actively pursuing any approach yet, but kind of background thinking through some of this.

@SatsAllDay
Contributor Author

I'll defer resolving the conflicts to whenever the overall approach of this PR is decided upon.

@huumn
Member

huumn commented Nov 19, 2023

Going to put this in draft temporarily. We definitely need this and I'm pretty sure we want to use the DL algo, but it's hard to tell what the thresholds should be and what to do when a nym is on the wrong side of the threshold.

@huumn huumn self-assigned this Nov 19, 2023
@huumn huumn marked this pull request as draft November 19, 2023 18:14
@SatsAllDay
Contributor Author

> Going to put this in draft temporarily. We definitely need this and I'm pretty sure we want to use the DL algo, but it's hard to tell what the thresholds should be and what to do when a nym is on the wrong side of the threshold.

Sounds good to me, I agree that we need to exercise caution with changes like this 👍
