Creating new columns on a view should fill in missings everywhere else. #2211
Comments
I do not know Stata well enough. Can you show a minimal example of expected input and output please? I understand that rows that do not meet the predicate are filled with |
I was thinking about it today. If we added it would you also request this to be added to Therefore it would be great if you gave some use-cases justifying the need, and in particular showing that just using broadcasted |
I would not advocate it for DataFrameRow but I would like it for In Stata, it is very common to add
The first command will copy over values in column This is a great syntax. Stata users love it. It reads very nicely like a sentence. "Generate this variable if it satisfies these conditions" Not only is this feature useful for end users, it's very easy to implement for programmers. When writing a Stata program, the opening syntax works like this
The command Why this is good syntax:
Here is a very rough outline for how it might be implemented.
All of the
This is a feature that both @jmboehm and @matthieugomez have requested, without stating it as formally. |
Tbh, I don’t think this is first order (ifelse does the same thing, albeit less elegantly). Maybe doing a specific type where() similarly to groupby() would be nice, but it’s more of an upgrade than a fundamental change. |
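For reference, the `ifelse` workaround mentioned in this comment can be sketched as below. This is only an illustration, not code from the thread: the column names `id`, `x` and `y` and the data are made up, and it assumes the predicate column contains no `missing` values, since `ifelse` needs a `Bool` condition.

```julia
using DataFrames

# Hypothetical data, chosen only for illustration.
df = DataFrame(id = [1, 1, 2, 2], x = [0, 2, 1, 3])

# Broadcasted ifelse: rows where x > 1 get x * 10, all other rows get `missing`.
# Both branches are evaluated for every row, unlike a true row filter.
df.y = ifelse.(df.x .> 1, df.x .* 10, missing)
```

This yields the "fill in missings everywhere else" result for simple row-wise expressions, but it does not restrict what an aggregating function such as `mean` sees, which is the part the thread is trying to address.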
Supporting this could make sense, but if we allow it it should work on any |
Currently
Also I think immediately after we add this we will get a request for two additional features:
Therefore in the long term I would rather think of appropriate kwargs to Still in this thread we can try to work out the API. |
So I’ve been thinking about that a bit more. Here is an idea for a syntax that builds on the recent changes (to be clear: it's not a request ;))
# create a new column equal to mean of x by id for rows for which x > 1
@> df by(:id) where(:x => x -> x > 1) transform!(:x => mean)
# return a DataFrame with the mean of x by id for rows for which x > 1
@> df by(:id) where(:x => x -> x > 1) combine(:x => mean)
The idea is that
The design is of course inspired by Stata and SQL where A nice thing about the current design (#2214) is that these clauses don’t have to live in DataFrames, so one could try them in an outside package first. |
Following up on Matthieu's idea, it seems to me a large class of operations on tabular data df_1,..., df_k can be described as a triple of
Split-apply-combine, reshapes, merges etc are all examples of these. Of course the above is so general that it's probably useless to think of in this generality. Still, the question is where you want to provide an interface to reduce the dimensionality of the above. Hence: should |
We are converging here towards a full:
combo 😄 (of course it would be nice to have it all covered). Probably, as @matthieugomez suggested, it would be good to experiment with it in some side package (or a PR that would not be intended to be merged) so that the syntax could be worked out without rushing. |
Yes this is exactly my proposal. I should add that I don't think what you describe is that different from the proposal for grouped transforms. A grouped data frame is, after all, a data frame with some information about row indexes for each group attached to it. A sub data frame is just a dataframe with row-indices attached to it. Composability is nice, though. I will make sure to explore this more thoroughly after Finals and probably after Quals as well. |
@pdeffebach yes, without The main difference with your proposal is to use a type different from SubDataFrame (and, in particular, that does not inherit from AbstractDataFrame) |
also column indices |
yes, i am missing the functionality of adding/modifying a column on a subset of a dataframe only. I think it's really important. I think a good reference here (alternative to the magrittr pipe mentioned above) is the R data.table setup. The simple subsetting with
|
what you write is currently supported, but with a different syntax:
what we do not allow is for a view to mutate the parent (in particular a view can have column |
Oh I see! that's good then. sorry about the noise. |
You are welcome. Note also that with |
Given #2329 if you have some comments after some time has passed, please add your opinion. Thank you! |
I think I may try my hand at an |
Actually, if we see the value in allowing the view to modify parents, I would prefer not to add a new object that is very similar to the one we have. I understand the rules would be:
Is this what you think would be good? If yes - and you go for implementing it - the key challenge is handling of Actually - maybe we want to allow this operation only if |
Having thought more about it I am leaning even more to allow adding columns only in Then we would have to highlight more how different is The general idea is that when you write |
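Thanks for this! I think this is what I want. I hadn't thought about the `push!` and `append!` cases, but you are right. One major reason for this logic is that it means we wouldn't have to do as much work in #2258. We could overload `transform` for a `SubDataFrame` and it would "just work". This would also mean that our "piping" functions are less special than working normally with `DataFrame` objects. cc @matthieugomez about this. |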
As mentioned earlier in the thread, I think I would still prefer a `WhereDataFrame` type that disappears after
a `transform`/`select`/`combine` call.
For instance,
```
# `where` is the proposed function; it does not exist in DataFrames.jl
transform!(where(df, :x => >(1)), :x => mean)
transform!(where(df, :x => x -> !ismissing(x)), :x => mean)
```
would return the original (parent) DataFrame.
But maybe having such a functionality would be too complex.
|
Another option, if we only need this functionality in piping and
the question is if we want to also be able to do such things in imperative style, so:
If we need it only for piping style I feel I would prefer to use a kwarg for this in the functions that we want to support it. |
I don't think we only need this functionality in piping. I think that this reasoning would be useful using This solution also strikes me as very "julian" since it relies on method overloading and is similar to our use of |
So as you have commented in the other thread to support So it would have to be I would propose to think about it a bit and decide if users will have an easy time understanding how it works or not. Probably writing some more code snippets that would show how it should be used would be nice. In particular let me stress what tension we have here:
|
I agree that adding a new type of data frame view is probably overkill. I'd rather recommend that people use in-place operations like
df = DataFrame(...)
transform!(filter(:x => >(1), df, view=true), :x => mean)
df
...
Of course if you prefer piping you can add Adding a |
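A runnable version of the pattern in this comment might look like the sketch below. It assumes DataFrames.jl 1.3 or later, where `transform!` accepts a `SubDataFrame` that keeps all columns. Earlier in the thread a view mutating its parent is described as not allowed, so this would not have worked when the comment was written. The data and the `id` column are made up for illustration.

```julia
using DataFrames
using Statistics

# Hypothetical data, chosen only for illustration.
df = DataFrame(id = 1:5, x = [0, 1, 2, 3, 4])

# `view=true` makes `filter` return a SubDataFrame over the rows where x > 1.
sdf = filter(:x => >(1), df, view=true)

# The mean is computed from the selected rows only (2, 3, 4) and stored in a
# new column `x_mean`. The parent `df` gains the same column, with `missing`
# in the rows that were filtered out.
transform!(sdf, :x => mean)
```

The point of the design is that no new type is needed: the row predicate lives in the view, and the in-place operation writes through to the parent.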
That was my fear. Having this via |
I really think a
Just to clarify the proposal, the function
In contrast with Then, the functions left to define on a This does not need to happen now, since this proposal is not breaking, but I just wanted to say I think it's the right way forward to solve a number of issues we've been talking about recently. |
I am OK with The question is if we need another type. An alternative we have been discussing is to allow In this case we should just add a Having a separate |
Ah - now I see one difference of my approach and your approach when you would assign to an EXISTING column (as opposed to creating a new column).
I am not sure which approach should be preferred in practice (maybe both.) |
The latter behavior could be obtained with I think we should first see how far we can get with DataFramesMeta without introducing new types, and then we'll see whether it's still worth it. |
It could - there is no restriction how we do it. My major concern is if we want to have both behaviors or having one of them is enough (indeed the former can be obtained by just writing But in general - if this is the case - then we have all we want with |
Yeah it could, but I think it would be more consistent with |
Then - such @matthieugomez - do you see any differences? |
Closing this as in 1.3 release we allow adding columns to |
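As a minimal sketch of what the closing comment appears to describe, assuming it refers to `SubDataFrame` support for new columns in DataFrames.jl 1.3 or later, a column can be assigned through a row-subset view and the parent receives `missing` in the rows outside the view. The column names and data here are hypothetical.

```julia
using DataFrames

# Hypothetical data: column `a` has a missing entry in the second row.
df = DataFrame(a = [1, missing, 3], b = [10, 20, 30])

# Row-subset view that keeps all columns; using `:` as the column selector
# is what permits adding new columns through the view.
sdf = view(df, .!ismissing.(df.a), :)

# Create column `c` through the view; its length must match nrow(sdf).
sdf[!, :c] = sdf.a .+ sdf.b
```

After this, `df.c` holds values for rows 1 and 3 and `missing` for row 2, which mirrors the Stata-style `if` behavior requested in the original post. A view created with an explicit list of columns is expected to reject new columns, so the `:` selector matters.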
This has been kicking around in my head for a while, but a `SubDataFrame` should allow creation of new columns.
Current behavior is:
I propose we add a convenience function `if`, to enable workflow as follows (with `magrittr` pipes for convenience)
The `transform` function only knows about rows where `:a` is not `missing`. This replicates the behavior of Stata's `if` syntax.
syntax.The text was updated successfully, but these errors were encountered: