Handle nym similarity #588

SatsAllDay · 2023-10-25T21:46:54Z

Closes #580

This PR updates the edit nym feature to charge a fee when the new nym is too similar to another user's nym. Similarity is measured by Levenshtein distance (see #580 for discussion).
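For readers unfamiliar with the metric, here is a minimal JavaScript sketch of plain Levenshtein distance. It is illustrative only: the function name `levenshtein` is made up for this example, and the PR's own name-cost calculation runs in SQL rather than in JavaScript.

```javascript
// Classic Levenshtein distance: insertions, deletions, and substitutions each cost 1.
// Illustrative sketch, not the code used by this PR.
function levenshtein (a, b) {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i] // turning i chars of a into the empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const sub = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + sub // substitution (free when the chars match)
      )
    }
    prev = curr
  }
  return prev[b.length]
}
```

For example, `levenshtein('k00b', 'koob')` is 2 (two substitutions), which is the kind of near-match the fee is meant to discourage.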

The UX for the edit nym form was updated to include a cancel and a save button so users can back out of changing their nym if they want, which they can't explicitly do today.

The save button is largely a copy of the existing FeeButton component, built for this particular use case instead of being item-focused.

The save button also displays a receipt if the fee is non-zero, to explain to the user why changing their nym costs as much as it does.

The edit nym action is also now invoiceable, meaning a user can pay an invoice on-demand if their balance doesn't have enough to cover the transaction.

Paid nym changes are logged in a new NonItemAct table, and are therefore included in the wallet history aka satistics page. We still don't track a history of which nyms were used by any account - just how much was paid and when to change to a new nym.

Additionally, when new users are created, their auto-generated nyms are checked to see what they would cost, and if the cost would be non-zero, a random nym is generated instead. This prevents a user from bypassing the cost of a high-value nym by creating an account via email, Twitter, or GitHub login.
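The fallback described above can be sketched roughly as follows. The names `chooseDefaultNym`, `costOf`, and `randomNym` are hypothetical stand-ins, not identifiers from this PR.

```javascript
// Keep the provider-derived nym only if claiming it would be free.
// `costOf(nym)` returns the fee for a nym; `randomNym()` returns a generated nym.
// Both are hypothetical callbacks used for illustration.
function chooseDefaultNym (candidate, costOf, randomNym) {
  return costOf(candidate) > 0 ? randomNym() : candidate
}
```

A nym that would cost sats never becomes someone's default, so signing up through a login provider can't be used to grab it for free.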

Some nuances in the code:

- The current user's nym is excluded from the Levenshtein distance calculation, since it's fine for a user to change their nym to something similar to their current one.
- I tried to update the user's name in the server-side session upon name change, but I'm not convinced it works. This was recently discovered as the source of a bug with hiding bookmarks from yourself. This may warrant further investigation.
- I introduced a new wallet history fact type, nonItemSpent, because it needs a different UI rendering than spent types, but I show these as the spent type in wallet history because users don't need to know the difference.
- The API call to load name change costs is configured not to cache, since nym changes can happen frequently and can greatly influence the response.
- I would not be opposed to tweaking the fee calculation formula; it seems a little heavy-handed IMO.
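As a rough illustration of the first point, the comparison can skip the requesting user's own row. The function name, the `{ id, name }` user shape, and the pluggable `distance` callback are assumptions made for this sketch. The PR itself does this in its SQL query.

```javascript
// Smallest distance from `newNym` to any OTHER user's nym.
// `users` is an assumed shape: [{ id, name }]; `distance` is any string-distance function.
function closestOtherNymDistance (newNym, users, currentUserId, distance) {
  let best = Infinity
  for (const user of users) {
    if (user.id === currentUserId) continue // a user's own nym never counts against them
    best = Math.min(best, distance(newNym, user.name))
  }
  return best // Infinity when there is nobody else to compare against
}
```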

huumn · 2023-10-25T21:53:10Z

prisma/schema.prisma

+enum NonItemActType {
+  NYM_CHANGE
+}
+


Nice! Probably should've done this with donations!

This has already grown far beyond what I originally expected lol

huumn · 2023-10-31T23:12:34Z

I've been messing around with this. I don't think Levenshtein by itself gives us enough information to know the distance for our purposes, ie it's really expensive for names that are nothing like any others.

I've been working with a plpgsql function that supports transpositions implementing Damerau-Levenshtein. It's really slow, but it's more accurate.

If we go this route though, we'll probably want to first filter users somehow to account for the slowness.
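One cheap way to filter first, sketched in JavaScript for illustration: an edit distance can never be smaller than the difference in length multiplied by the cheaper of the insertion and deletion costs, so nyms whose lengths differ too much can be skipped before the slow comparison runs. The function name and the `maxDistance` parameter are assumptions. In practice the same bound would more likely be a `WHERE` clause on the users query.

```javascript
// Pre-filter candidates by a lower bound on the edit distance.
// |len(a) - len(b)| * min(insertionCost, deletionCost) <= distance(a, b),
// so anything whose bound already exceeds `maxDistance` cannot be a close match.
// All parameter names here are illustrative.
function prefilterByLength (target, names, maxDistance, insertionCost = 1, deletionCost = 1) {
  const perChar = Math.min(insertionCost, deletionCost)
  return names.filter(
    name => Math.abs(name.length - target.length) * perChar <= maxDistance
  )
}
```

With costs of 2 and a maximum distance of 2, only nyms within one character of the target's length survive the filter.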

Anyway, not suggesting you need to do this, but I wanted to update my status on this.

CREATE OR REPLACE FUNCTION are_visually_similar(ch1 char, ch2 char)
RETURNS boolean AS $$
BEGIN
    RETURN (ch1 = ch2) OR (
        (LOWER(ch1), LOWER(ch2)) IN (
            ('0', 'o'), ('o', '0'),
            ('1', 'i'), ('i', '1'), ('1', 'l'), ('l', '1'),
            ('2', 'z'), ('z', '2'), ('2', '3'), ('3', '2'),
            ('5', 's'), ('s', '5'), ('5', '6'), ('6', '5'),
            ('6', '8'), ('8', '6'),
            ('7', '1'), ('1', '7'),
            ('9', 'q'), ('q', '9'), ('9', 'g'), ('g', '9'),
            ('a', '4'), ('4', 'a'),
            ('a', 'd'), ('d', 'a'),
            ('b', '6'), ('6', 'b'), ('b', 'h'), ('h', 'b'),
            ('c', 'o'), ('o', 'c'), ('c', 'e'), ('e', 'c'),
            ('d', '0'), ('0', 'd'), ('d', 'o'), ('o', 'd'),
            ('e', '3'), ('3', 'e'), ('e', 'o'), ('o', 'e'),
            ('f', 't'), ('t', 'f'),
            ('g', 'q'), ('q', 'g'),
            ('i', 'j'), ('j', 'i'),
            ('i', 'l'), ('l', 'i'),
            ('i', 't'), ('t', 'i'),
            ('k', 'x'), ('x', 'k'),
            ('m', 'n'), ('n', 'm'), ('m', 'r'), ('r', 'm'),
            ('n', 'u'), ('u', 'n'),
            ('o', 'u'), ('u', 'o'),
            ('p', 'q'), ('q', 'p'),
            ('v', 'w'), ('w', 'v'), ('v', 'y'), ('y', 'v'),
            ('x', 'z'), ('z', 'x'),
            ('c', 'g'), ('g', 'c'),
            ('f', 'e'), ('e', 'f'),
            ('p', 'f'), ('f', 'p'),
            ('q', 'o'), ('o', 'q')
        )
    );
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION damerau_levenshtein_distance(
    str1 text,
    str2 text,
    insertion_cost int DEFAULT 1,
    deletion_cost int DEFAULT 1,
    substitution_cost int DEFAULT 1,
    transposition_cost int DEFAULT 1
)
RETURNS int AS $$
DECLARE
    len1 int;
    len2 int;
    i int;
    j int;
    cost int;
    d int[][];
    ch1 char;
    ch2 char;
BEGIN
    len1 := LENGTH(str1);
    len2 := LENGTH(str2);

    -- Special cases: if either string is empty
    IF len1 = 0 THEN
        RETURN len2 * insertion_cost;
    ELSIF len2 = 0 THEN
        RETURN len1 * deletion_cost;
    END IF;

    -- Initialize 2D array with dimensions (len1+1) x (len2+1)
    d := ARRAY_FILL(0, ARRAY[len1+1, len2+1]);

    -- Initialize the first row and column
    FOR i IN 0..len1 LOOP
        d[i+1][1] := i * deletion_cost;
    END LOOP;
    FOR j IN 0..len2 LOOP
        d[1][j+1] := j * insertion_cost;
    END LOOP;

    -- Populate the matrix
    FOR i IN 1..len1 LOOP
        ch1 := SUBSTRING(str1 FROM i FOR 1);
        FOR j IN 1..len2 LOOP
            ch2 := SUBSTRING(str2 FROM j FOR 1);
            IF are_visually_similar(ch1, ch2) THEN
                cost := 0;
            ELSE
                cost := substitution_cost;
            END IF;

            d[i+1][j+1] := LEAST(
                d[i][j+1] + deletion_cost,   -- Deletion
                d[i+1][j] + insertion_cost,   -- Insertion
                d[i][j] + cost   -- Substitution
            );

            -- Check for transposition
            IF i > 1 AND j > 1 THEN
                IF are_visually_similar(ch1, SUBSTRING(str2 FROM j-1 FOR 1)) AND are_visually_similar(ch2, SUBSTRING(str1 FROM i-1 FOR 1)) THEN
                    d[i+1][j+1] := LEAST(
                        d[i+1][j+1],
                        d[i-1][j-1] + transposition_cost -- Transposition
                    );
                END IF;
            END IF;
        END LOOP;
    END LOOP;

    -- The distance is at the bottom-right corner of the matrix
    RETURN d[len1+1][len2+1];
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

SatsAllDay · 2023-10-31T23:28:47Z

> ie it's really expensive for names that are nothing like any others.

I agree, that's kinda what I meant by it being too heavy handed. It could probably be accounted for (somewhat) by adjusting the cost formula, but ultimately a better similarity detection algorithm is best.

> I've been working with a plpgsql function that supports transpositions implementing Damerau-Levenshtein. It's really slow, but it's more accurate.
>
> If we go this route though, we'll probably want to first filter users somehow to account for the slowness.
>
> Anyway, not suggesting you need to do this, but I wanted to update my status on this.

Cool! It'll take me a bit to grok the algorithm, but that sounds like a great approach, assuming we can get the perf in good shape.

I imagine we should be able to update the name-cost SQL query to use this new similarity function and the rest of this PR would just work, yea?

huumn · 2023-10-31T23:48:51Z

Yeah the rest of the PR is good to go afaict. We just need to be able to back up charging really well, which means we need to detect similarity really well ... which is hard.

With the above function, it's not bad if tuned to:

- double the distance on insertions/deletions/substitutions relative to transpositions
- don't add distance for substitutions if the substitution character is similar

e.g. SELECT name, damerau_levenshtein_distance(lower(name), 'stackerrnews', 2, 2, 2) as dist from users order by dist asc limit 10;

Is pretty excellent at finding similar things

name	dist
stacker_news	2
stackernews	2
stackernews9	4
stack_ernews	4
staker_news	4
stackernews7	4
stackernews5	4
stackernews1	4
stackernews92	6
stacker_mrw5g	6

e.g. SELECT name, damerau_levenshtein_distance(lower(name), 'kob', 2, 2, 2) as dist from users order by dist asc limit 20;

name	dist
ko0b	2
mob	2
kb	2
ken6	2
koj	2
Kuba	2
iqb	2
k0ob	2
db	2
ken	2
Ko69	2
eb	2
koob	2
ku	2
dh	2
b0b	2
job	2
k00b	2
bob	2
pub	2
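To make the tuning above easier to experiment with outside the database, here is a rough JavaScript port of the posted plpgsql function. It is a sketch under assumptions: the names `visuallySimilar` and `weightedDistance` are invented here, the pair table is a transcription of the single-character pairs from the SQL, and inputs are expected to be lowercased by the caller, as the queries above do with `lower(name)`. Like the SQL, it only considers transpositions of adjacent characters.

```javascript
// Single-character look-alike pairs transcribed from the SQL table (stored in both orders).
const PAIRS = ('0o 1i 1l 2z 23 5s 56 68 71 9q 9g a4 ad b6 bh co ce d0 do e3 eo ft gq ' +
  'ij il it kx mn mr nu ou pq vw vy xz cg fe pf qo').split(' ')
const SIMILAR = new Set()
for (const [x, y] of PAIRS) {
  SIMILAR.add(x + y)
  SIMILAR.add(y + x)
}

function visuallySimilar (c1, c2) {
  return c1 === c2 || SIMILAR.has(c1.toLowerCase() + c2.toLowerCase())
}

// Weighted edit distance with adjacent transpositions; look-alike substitutions are free.
function weightedDistance (s1, s2, ins = 1, del = 1, sub = 1, trans = 1) {
  // d[i][j] = cost of turning the first i chars of s1 into the first j chars of s2
  const d = Array.from({ length: s1.length + 1 }, () => new Array(s2.length + 1).fill(0))
  for (let i = 0; i <= s1.length; i++) d[i][0] = i * del
  for (let j = 0; j <= s2.length; j++) d[0][j] = j * ins

  for (let i = 1; i <= s1.length; i++) {
    for (let j = 1; j <= s2.length; j++) {
      const cost = visuallySimilar(s1[i - 1], s2[j - 1]) ? 0 : sub
      d[i][j] = Math.min(
        d[i - 1][j] + del, // deletion
        d[i][j - 1] + ins, // insertion
        d[i - 1][j - 1] + cost // substitution
      )
      // Transposition: the last two chars of each prefix match crosswise
      if (i > 1 && j > 1 &&
          visuallySimilar(s1[i - 1], s2[j - 2]) &&
          visuallySimilar(s2[j - 1], s1[i - 2])) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + trans)
      }
    }
  }
  return d[s1.length][s2.length]
}
```

With insertions, deletions, and substitutions at 2 and transpositions left at 1, `weightedDistance('stacker_news', 'stackerrnews', 2, 2, 2)` gives 2, matching the first row of the first result table above, and `weightedDistance('k00b', 'kob', 2, 2, 2)` also gives 2 because `0` and `o` are treated as the same character.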

huumn · 2023-10-31T23:55:30Z

Additionally, I think my cost function was too continuous.

It should probably be more of a step function.

- really expensive (1m) for a very short distance (0, 1, or 2)
  - this isn't accounting for my doubling of ins/del/sub costs but the idea is similar
  - also not sure on these numbers
- free otherwise
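As a sketch, the step function could look like this. The function name is hypothetical, "1m" is read here as 1,000,000 sats, and both that fee and the threshold of 2 are the tentative numbers from this comment rather than settled values.

```javascript
// Step-function pricing: a flat, high fee for very close nyms and nothing otherwise.
// `threshold` and `fee` default to the tentative numbers floated in the comment.
function nymChangeCost (distance, { threshold = 2, fee = 1_000_000 } = {}) {
  return distance <= threshold ? fee : 0
}
```

If the insertion/deletion/substitution costs are doubled as in the earlier queries, the threshold would need to be scaled to match.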

huumn · 2023-11-01T00:03:18Z

I also wonder if we should exclude inactive accounts from comparisons. Like, if we haven't seen someone on site in 6 months, what are we protecting?

SatsAllDay · 2023-11-01T00:16:28Z

> Additionally, I think my cost function was too linear.
>
> It should probably be more of a step function.
>
> - really expensive (1m) for a very short distance (0, 1, or 2)
>   - this isn't accounting for my doubling of ins/del/sub costs but the idea is similar
>   - also not sure on these numbers
> - free otherwise

Yea, that makes sense. Fine-tuning the cost algo might take some experimentation, but I do really like how nym-changes are currently free, so nickel and diming folks for low-risk changes would be a worsening of the experience, IMO.

> I also wonder if we should exclude inactive accounts from comparisons. Like, if we haven't seen someone on site in 6 months, what are we protecting?

Yea, that would make sense. You could take it a step further and factor activity frequency into the cost, like a scaling factor, where anything older than 6 months is 0, but as you get progressively more recently active, the scaling factor increases.
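One possible shape for such a scaling factor, purely as a sketch: a multiplier that is 1 for an account active right now and falls linearly to 0 at the inactivity cutoff. The linear ramp, the 180-day reading of "6 months", and the names `activityFactor` and `lastSeen` are all assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000

// Returns a multiplier in [0, 1] to apply to the nym-change cost:
// 1 for an account seen just now, falling linearly to 0 after `cutoffDays` of inactivity.
function activityFactor (lastSeen, now = new Date(), cutoffDays = 180) {
  const idleDays = (now - lastSeen) / DAY_MS
  if (idleDays >= cutoffDays) return 0 // inactive accounts are not protected
  return 1 - Math.max(idleDays, 0) / cutoffDays
}
```

The final fee would then be something like the base cost multiplied by the factor of the closest matching account.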

There's probably many similar variations we could apply - basically any method of identifying a high value account that someone would want to pretend to be - top stackers, verified contributor, verified corporate accounts, etc.

Let me know how much of this you want me to tackle, if any. I figured I'd let you finalize the above sql functions first.

huumn · 2023-11-01T02:32:36Z

Thanks! I’ll let you know. The ball is fully in my court until I can figure this out

huumn · 2023-11-01T18:47:05Z

Another approach: don't charge and instead let a nym "occupy" the entire grid of nyms that's 3 distance units away from it, ie consider all those nyms taken rather than for sale.
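A minimal sketch of that "occupied neighborhood" rule, assuming a pluggable `distance` function and reading "3 distance units away" as "within a distance of 3, inclusive". The function name and the inclusive boundary are guesses for illustration.

```javascript
// A candidate nym is available only if it lies outside the radius of every existing nym.
// `distance` is any string-distance function, such as the ones discussed above.
function isNymAvailable (candidate, existingNyms, distance, radius = 3) {
  return existingNyms.every(nym => distance(candidate, nym) > radius)
}
```

Unlike the fee approach, this needs no pricing formula, but it does make the choice of radius matter a lot more.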

Not actively pursuing any approach yet, but kind of background thinking through some of this.

SatsAllDay · 2023-11-19T02:37:20Z

I'll defer resolving the conflicts to whenever the overall approach of this PR is decided upon.

huumn · 2023-11-19T18:14:24Z

Going to put this in draft temporarily. We definitely need this, and I'm pretty sure we want to use the DL algo, but it's hard to tell what the thresholds should be and what to do when a nym is on the wrong side of the threshold.

SatsAllDay · 2023-11-20T01:11:44Z

> Going to put this in draft temporarily. We definitely need this, and I'm pretty sure we want to use the DL algo, but it's hard to tell what the thresholds should be and what to do when a nym is on the wrong side of the threshold.

Sounds good to me, I agree that we need to exercise caution with changes like this 👍

huumn reviewed Oct 25, 2023

SatsAllDay changed the title ~~WIP: nym similarity~~ Handle nym similarity Oct 26, 2023

SatsAllDay added 8 commits October 26, 2023 14:26

First drop of nym change fees based on similarities to other nyms

3332ee7

edit nym changes

78d8e72

* make edit nym invoiceable * consolidate schema migrations * only create an act instance if the cost is nonzero, since we only care about payments for nym changes, not free ones * display paid nym changes in wallet history aka satistics

Update wording for nym change wallet history item

940b09f

ensure default name for a new user isn't too similar to an existing nym

4f0be85

Adjust default nym for email-created accounts

47fe0f5

make random nym longer to help avoid high-cost defaults

ef1214f

remove console.log statements

fcc41ca

Ensure cost uses latest me.name and not a cached name from session

547b4ba

disable submit button while form is submitting

SatsAllDay force-pushed the 580-nym-similarities branch from 5648f9c to 547b4ba Compare October 26, 2023 18:27

Remove potentially stale name check for 0-fee

bed3500

SatsAllDay marked this pull request as ready for review October 26, 2023 18:31

SatsAllDay mentioned this pull request Oct 26, 2023

Disincentivizing All-Caps Headlines on Stacker News #584

Open

Merge branch 'master' into 580-nym-similarities

9f82718

huumn self-assigned this Nov 19, 2023

huumn marked this pull request as draft November 19, 2023 18:14

huumn mentioned this pull request Nov 19, 2023

Disincentivizing the use of all caps in post titles #590

Draft
