Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add elbow_chart to help determine ideal class size #32

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

raashidsalih
Copy link

@raashidsalih raashidsalih commented Jun 23, 2024

Hey, @mthh.
Loved the experience using JenksPy, especially the model-like interface by @yasirroni. My use case involved trying to figure out the ideal class size (similar to this) and I had some code on hand, so thought it might be a good idea to integrate it into the package. I'm not sure if there is a methodological precedence to using elbow charts based on goodness of variance fit to do so, but the added convenience should be nice for people who need it.

At a high level, the elbow_chart function takes in the data array, a lower bound (default to 2) and an upper bound. It then gets the GVF values for each class size between the bounds (inclusive of both) and returns an elbow chart along with the results.

How Function Operates

  1. Checks and confirms both bounds are of int datatype, and also checks if upper bound is greater than lower bound. Didn't validate any further to prevent redundancy since the function later triggers validate_input for each n_class value.
  2. Pre-allocate lists to store results instead of appending via for loop. Not strictly necessary for performance, but could help for large n_class values.
  3. Loop over each n_class value and obtain results.
  4. Convert disparate results into dictionary to return later.
  5. Generate plot based on results.
  6. Return Tuple containing plot and results dictionary.

Changes Made

  1. Import Tuple from Typing for function return typehint in core.py.
  2. Import matplotlib for plot generation in core.py.
  3. Add function elbow_chart at end of core.py.
  4. Import elbow_chart in __init__.py.
  5. Add elbow_chart in __all__ in __init.py__.

Things to Discuss

  1. I'm not thrilled about the added matplotlib dependency. We have a few options:
    • We could throw an error instructing the user to install matplotlib if they want to use the function (as seen here)
    • Add an optional tag when installing that either contains [core] or [all]
    • I breifly considered using something like tkinter (which seems to be built-in) and even ASCII art, but clearly, they're not that feasible.
    • Honestly, I reckon most people who want to use this package will be the type that has matplotlib installed already, but I am curious as to what you think. Any other alternatives we can employ here?
  2. Do we need a default value for upper_bound? I don't have a strong opinion on this one, could go either way.

Apologies for the long message, but wanted to ensure clarity. I also have ideas about integrating parallel processing in the for loop for additional performance gains, and also a sampling strategy (as mentioned here), but I think that can wait for another PR.

Example Usage
image

@yasirroni
Copy link
Contributor

It has been a long time since I make contribution here. But I will give two opinions based on my experience.

  1. You should not optimize number of group, except you are optimizing in your sample data and later use the number found for the larger original data. As you already know, this is due to the upper (or lower) bound, especially for normalization.
  2. I've write small implementation of elbow long time ago that you want to read (?): https://github.com/yasirroni/kneed_elbow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants