
Cache check_is_active_rm #114

Open · lxorc opened this issue Jun 21, 2022 · 10 comments

lxorc commented Jun 21, 2022

rm = ResourceManager() spends a lot of time in the check_is_active_rm function. Could I cache it?

lxorc (Author) commented Jun 21, 2022

I need to wait about 20 seconds for one execution. That seems abnormal. Am I using it the wrong way?

kevin-bates (Member) commented:

It seems like you must have some network issues. check_is_active_rm merely hits the RM's cluster endpoint. Once you get the RM instance (after 20 seconds), do you continue to incur lengthy requests within that RM instance?

The usage model is to get an RM instance, then use that instance to query applications and manage their lifecycle - each of which hits different endpoints.
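
A minimal sketch of that flow, assuming the method names from the yarn-api-client README (cluster_applications, cluster_application_state, cluster_application_kill); the application id is made up:

```python
from yarn_api_client import ResourceManager

# One instance pays the check_is_active_rm cost once, up front...
rm = ResourceManager()

# ...then the same instance is reused for queries and lifecycle calls,
# each of which hits a different RM REST endpoint.
apps = rm.cluster_applications()                               # GET /ws/v1/cluster/apps
state = rm.cluster_application_state("application_1655_0001")  # GET .../apps/{id}/state
rm.cluster_application_kill("application_1655_0001")           # PUT KILLED to .../state
```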

Seems like using the @cache decorator on check_is_active_rm() would make sense - assuming that's the only endpoint performing poorly. But if you find others being slow as well, it's probably worth taking a closer look at your configuration.
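
For illustration, a minimal sketch with functools.cache (Python 3.9+); cached_active_rm and its probe are hypothetical stand-ins for the library's actual check, not its API:

```python
import requests
from functools import cache

@cache  # memoizes per URL tuple; note there is no expiry, so a later
        # RM failover would go unnoticed until the process restarts
def cached_active_rm(rm_urls: tuple) -> str:
    # Stand-in for the expensive probe: return the first URL whose
    # /cluster endpoint answers.
    for url in rm_urls:
        try:
            if requests.get(f"{url}/cluster", timeout=5).ok:
                return url
        except requests.RequestException:
            continue
    raise RuntimeError("no active ResourceManager found")

active = cached_active_rm(("http://rm1:8088", "http://rm2:8088"))  # slow, probes
active = cached_active_rm(("http://rm1:8088", "http://rm2:8088"))  # instant, cached
```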

dimon222 (Collaborator) commented Jun 21, 2022

I did some thinking about this question myself a few years ago, but here's why I still haven't done it:

  • Hadoop is by nature a clustered environment. That means that to achieve HA on the client side when running against the namenodes directly, you need to hit multiple nodes and be able to fall back easily when one or another goes down. If you cache the result, you risk being stuck on the same broken node for the cache's lifetime. We could consider a short-lived cache of a few minutes (sketched below), but it might still breach the concept of HA.
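
A sketch of that short-lived cache idea as a hand-rolled TTL decorator (functools has no built-in TTL; all names here are illustrative):

```python
import time
from functools import wraps

def ttl_cache(seconds):
    """Memoize results, but only for `seconds`, so a cached answer that
    points at a node that has since died is eventually re-checked."""
    def decorator(func):
        entries = {}  # args -> (expires_at, result)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            cached = entries.get(args)
            if cached and cached[0] > now:
                return cached[1]
            result = func(*args)
            entries[args] = (now + seconds, result)
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=180)  # re-probe every few minutes, bounding the HA risk window
def find_active_rm(rm_urls):
    ...  # the same expensive probe as today
```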

Now, enterprise distributions of Hadoop usually include a Knox gateway, which typically handles the HA concept on its own. If you have a single direct Knox URL, however, you don't need to check whether the cluster is active at all, because it is always considered active.

So in my opinion, what we could consider (see the sketch after this list):

  1. Knox mode - don't check for an active RM. It's always the same single URL, and it should be alive by design.
  2. Non-Knox mode - as is, what we have today - the default.
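
A rough sketch of how the two modes could look; the knox_mode flag and constructor shape are hypothetical, not an existing option of this library:

```python
class ResourceManager:
    def __init__(self, service_endpoints, knox_mode=False):
        if knox_mode:
            # Knox mode: one gateway URL, assumed alive by design,
            # so no active-RM probing at construction time.
            self.active_endpoint = service_endpoints[0]
        else:
            # Non-Knox mode (today's default): probe the endpoints and
            # keep the first one that reports itself as active.
            self.active_endpoint = self._find_active_rm(service_endpoints)

    def _find_active_rm(self, endpoints):
        ...  # existing check_is_active_rm logic
```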

kevin-bates (Member) commented:

Good points @dimon222. I definitely think if we were to add any kind of caching, it should be optional (i.e., configurable).

I think understanding the current scenario is more warranted though. I've never encountered such a delay. @dimon222 - do you know under what circumstances there might be such a delay to get a response from the /cluster endpoint?

dimon222 (Collaborator) commented Jun 21, 2022

> do you know under what circumstances there might be such a delay to get a response from the /cluster endpoint?

Honestly, nothing particular in my experience. Off the top of my head: high network latency, or an overloaded underlying YARN platform. I doubt those scenarios would come as a surprise...

lxorc (Author) commented Jun 22, 2022

> Seems like using the @cache decorator on check_is_active_rm() would make sense - assuming that's the only endpoint performing poorly. But if you find others being slow as well, it's probably worth taking a closer look at your configuration.

Thank you very much for your reply and suggestions. In my project I run a long-lived application, so during initialization I used pickle to persist the ResourceManager instance, and now it works very well. But waiting about 20 s for initialization is not a good idea. Maybe this persistence could be made configurable? I don't think the standby and active roles of Hadoop HA switch very frequently.
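
For reference, a sketch of that workaround (the cache path is made up, and it assumes the instance pickles cleanly; note it also freezes whichever RM was active at pickle time):

```python
import os
import pickle

from yarn_api_client import ResourceManager

CACHE_PATH = "/tmp/rm.pickle"  # hypothetical location

if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "rb") as f:
        rm = pickle.load(f)   # skips the ~20 s check_is_active_rm probe
else:
    rm = ResourceManager()    # pays the slow initialization once
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(rm, f)
```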

lxorc (Author) commented Jun 22, 2022

> do you know under what circumstances there might be such a delay to get a response from the /cluster endpoint?

Thanks again - that's a very good direction to think in. I'll dig deeper to find the reason; maybe it's not what it seems. If only my cluster waits this long, I will do more tests.

lxorc (Author) commented Jun 22, 2022

> I will do more tests.

[image]

lxorc (Author) commented Jun 22, 2022

> I will do more tests.

[image]

More than 50,000 applications have been submitted to my production cluster, and the number will keep increasing. With that many applications, GET http://yourhost/cluster downloads a large HTML page (containing the whole application list), and if your network bandwidth is small it takes a long time - about 20 s, as in the screenshot above. How about using another page for check_is_active_rm? (See the sketch below.)
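
For example, the standard YARN REST endpoint /ws/v1/cluster/info returns a small JSON document that includes the HA state, independent of the application count. A sketch (the function name is illustrative):

```python
import requests

def rm_is_active_light(rm_url: str, timeout: float = 5.0) -> bool:
    # /ws/v1/cluster/info is a few hundred bytes of JSON regardless of
    # how many applications the cluster has run, unlike the /cluster
    # web UI page, which embeds the whole application list.
    try:
        resp = requests.get(f"{rm_url}/ws/v1/cluster/info", timeout=timeout)
        resp.raise_for_status()
        return resp.json()["clusterInfo"]["haState"] == "ACTIVE"
    except (requests.RequestException, KeyError, ValueError):
        return False
```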

dimon222 (Collaborator) commented:

@lxorc thanks for this information. Indeed, that sounds plausible when the cluster page takes long to load. I still have to review the available endpoints, but if there's one that can play the role of an active-mode health check, it's a good idea to consider replacing the current one. That would be on top of the suggestions above (optimization-wise).
