Add groupby to SArray #303

Open
TobyRoseman opened this issue May 12, 2016 · 4 comments

@TobyRoseman
Contributor

SFrame has a groupby method but SArray does not. I think a groupby method for SArray would be useful for several operators: AVG, MEAN, COUNT, COUNT_DISTINCT, DISTINCT, FREQ_COUNT, MAX, MIN, QUANTILE, STD, STDV, SUM, VAR, VARIANCE.

Doing this might be relatively easy; hopefully we can just reuse the functionality already in SArray. The aggregator code will probably need to change, since I don't think it ever makes sense for an aggregator to take any parameters if it's being used in an SArray groupby.

@hershaw
Contributor

hershaw commented May 29, 2016

What would the API for this look like? Sketch objects already provide a lot of the aggregations you mentioned for a single SArray.

@TobyRoseman
Contributor Author

Well I think it would just be something like this:

sa = gl.SArray([.......])
sa.groupby(MAX)
sa.groupby(VAR)

What are your thoughts?

I guess we could add them all as functions rather than parameters to groupby. Perhaps it would be better to have sa.max() rather than sa.groupby(MAX).

Interesting, I didn't realize these were already in the sketch object. That seems like even more of a reason to add them to SArray (i.e. for consistency).

@hershaw
Contributor

hershaw commented Jun 7, 2016

So directly on SArray there are already min(), max(), var(), and mean(), among others. For an efficient summary of all the aggregates that can be computed in one pass, there is sketch_summary(), which you can call directly on the SArray, so in a sense they are already included in the SArray API.
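To make "aggregates that can be computed in one pass" concrete, here is a rough plain-Python sketch. It is not how sketch_summary() is implemented; the function name `one_pass_summary` is made up for illustration, and it uses Welford's running-variance update on an ordinary list rather than an SArray.

```python
import math

def one_pass_summary(values):
    """Hypothetical helper: count, min, max, mean and population variance
    from a single pass over the data (Welford's update for the variance)."""
    count = 0
    mean = 0.0
    m2 = 0.0            # running sum of squared deviations from the mean
    lo = math.inf
    hi = -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # uses the already-updated mean
        lo = min(lo, x)
        hi = max(hi, x)
    if count == 0:
        raise ValueError("need at least one value")
    return {"count": count, "min": lo, "max": hi,
            "mean": mean, "var": m2 / count}

summary = one_pass_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Each value is visited once and only a handful of running totals are kept, which is what makes this kind of summary cheap on large columns.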

I think that grouping operations only make sense when there are multiple series involved and you need to group one by the other. Otherwise it's just an aggregator, not really a grouper. What do you think? Was there something particular that you were trying to do that made you want the groupby()?
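A small plain-Python sketch of the distinction, with made-up data and a hypothetical helper name (`group_max`); it does not use the SFrame or SArray API. With two series, one supplies the group keys and the other the values; with a single series there is nothing to group by, so MAX is just an aggregate.

```python
from collections import defaultdict

def group_max(keys, values):
    """Hypothetical helper: group `values` by the parallel `keys` series
    and take the maximum within each group."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: max(vs) for k, vs in groups.items()}

keys = ["a", "b", "a", "b", "a"]     # the series we group by
values = [3, 10, 7, 2, 5]            # the series we aggregate

per_group = group_max(keys, values)  # one result per distinct key
overall = max(values)                # single series: a plain aggregate
```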

@TobyRoseman
Contributor Author

The main use case I'd like is the COUNT aggregator. I often find myself wishing SArray had this built in.
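Roughly, the behaviour I have in mind is grouping the array by its own values and counting each group. The sketch below shows that in plain Python; `value_counts` is a hypothetical name, not an existing SArray method, and a list stands in for the SArray.

```python
from collections import Counter

def value_counts(values):
    """Hypothetical helper: exact number of occurrences of each distinct value."""
    return dict(Counter(values))

counts = value_counts(["x", "y", "x", "z", "x", "y"])
```

Here the single series acts as both the group key and the thing being counted, and the counts are exact.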

Also, I think it would be good for the SArray interface to be as similar as reasonably possible to the SFrame interface. I'm a big fan of consistency where it makes sense.

The values returned by sketch_summary() are only estimates, not exact values. Perhaps this isn't clear from the documentation.
