Suggestion for get_dupes: return additional variable "dupe_group" #371
Comments
I like this idea! I don't think it would be hard to implement either; any of these would probably do it: https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-index-based-on-a-grouping-variable-in-a-dataframe. We probably ought to test them for performance, though. Any thoughts from other users, either in terms of how to implement this or how it should work for the user? |
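For readers skimming the thread, here is a minimal base R sketch of one way to build a consecutive index from a grouping variable, in the spirit of the linked Stack Overflow question. It is only an illustration, not what janitor does internally: the data frame `df`, its columns `id` and `val`, and the output column `group_index` are all made up for this example.

```r
# Toy data; the column names are hypothetical.
df <- data.frame(
  id  = c("a", "b", "a", "c", "b"),
  val = c(1, 2, 1, 3, 2)
)

# Build one key per row from the grouping columns, then number the
# distinct keys in order of first appearance.
# "\r" is just a separator that is unlikely to occur in the data.
key <- do.call(paste, c(df[c("id", "val")], sep = "\r"))
df$group_index <- match(key, unique(key))

df$group_index
# 1 2 1 3 2
```

Rows that share the same `id` and `val` get the same index, which is the behavior being asked for with "dupe_group".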
I think this is a nice idea as well. And super easy to implement with probably zero performance hit. I'd wonder if it should be default or require an argument to get the output? I suggest the latter, as I think it isn't useful to most people unless they specifically want it in order to accomplish something. Here's the way I've just tested that seems super simple and shouldn't add any noticeable performance decrease:
Sorry for not doing a PR, my
And now:
|
This sounds like a good idea. Would it make sense to add a function like keep_first_dupe(), that would keep only the first occurrence of each duplicated observation? That is something I find myself doing fairly often... |
I think it's easier just to follow it up with distinct(), unless I'm missing something? |
Doh, you are absolutely right...
|
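A rough sketch of the distinct() follow-up discussed above, for anyone landing here later. The arguments passed to distinct() are an assumption (the comment only names the function), and the toy data frame `df` with columns `name`, `year`, and `score` is invented for illustration.

```r
library(dplyr)
library(janitor)

# Toy data; the column names are hypothetical.
df <- data.frame(
  name  = c("ann", "bob", "ann", "cat", "bob", "dan"),
  year  = c(2020, 2021, 2020, 2022, 2021, 2023),
  score = c(10, 20, 11, 30, 21, 40)
)

# get_dupes() returns only the rows duplicated on name + year;
# distinct(..., .keep_all = TRUE) then keeps the first row of each set.
first_of_each <- df %>%
  get_dupes(name, year) %>%
  distinct(name, year, .keep_all = TRUE)
```

If the goal is one row per combination across the whole data set rather than only among the duplicated rows, calling distinct() directly on `df` would probably be enough.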
Feature request
I use get_dupes() all the time. Thank you!
I would find it very helpful to index each set of dupes with an additional variable.
Something like "dupe_group" or "dupe_index"?
In the code below, I've used frank to create the variable I am looking for, but it would be great to have this automatically embedded into get_dupes().
Thanks for your work on janitor.
The following illustrates what I'm thinking of:
This returns the following:
Having the "dupe_group" variable lets me immediately operate on each set of duplicates.
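As a sketch of the requested behavior using what is available today: the example below adds a "dupe_group" column after get_dupes() by grouping on the duplicate-defining columns and calling dplyr's cur_group_id(). This swaps in dplyr for the frank approach mentioned in the request, and the data frame `df` with columns `name` and `year` is made up for illustration.

```r
library(dplyr)
library(janitor)

# Toy data; the column names are hypothetical.
df <- data.frame(
  name = c("ann", "bob", "ann", "cat", "bob", "dan"),
  year = c(2020, 2021, 2020, 2022, 2021, 2023)
)

# Find rows duplicated on name + year, then give each set of
# duplicates its own integer id.
dupes <- df %>%
  get_dupes(name, year) %>%
  group_by(name, year) %>%
  mutate(dupe_group = cur_group_id()) %>%
  ungroup()
```

With the index in place, each set of duplicates can be handled on its own, for example with `group_by(dupe_group)` or `split(dupes, dupes$dupe_group)`.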