
Cache check_is_active_rm #114

Open · lxorc opened this issue Jun 21, 2022 · 10 comments

lxorc commented Jun 21, 2022

rm = ResourceManager() spends a lot of time in the check_is_active_rm function. Could I cache it?

lxorc (Author) commented Jun 21, 2022

I need to wait about 20 seconds for one execution. That seems abnormal. Am I using it the wrong way?

kevin-bates (Member) commented:

It seems like you must have some network issues. check_is_active_rm merely hits the RM's cluster endpoint. Once you get the RM instance (after 20 seconds), do you continue to incur lengthy requests within that RM instance?

The usage model is to get an RM instance, then use that instance to query applications and manage their lifecycle - each of which hits different endpoints.
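
A minimal sketch of that flow, assuming the method names from the yarn-api-client README (cluster_applications, cluster_application_state, cluster_application_kill); the application id is made up:

```python
from yarn_api_client import ResourceManager

# One instance pays the check_is_active_rm cost once, up front...
rm = ResourceManager()

# ...then the same instance is reused for queries and lifecycle calls,
# each of which hits a different RM REST endpoint.
apps = rm.cluster_applications()                               # GET /ws/v1/cluster/apps
state = rm.cluster_application_state("application_1655_0001")  # GET .../apps/{id}/state
rm.cluster_application_kill("application_1655_0001")           # PUT KILLED to .../state
```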

Seems like using the @cache decorator on check_is_active_rm() would make sense - assuming that's the only endpoint performing poorly. But if you find others being slow as well, it's probably worth taking a closer look at your configuration.
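
For illustration, a minimal sketch with functools.cache (Python 3.9+); cached_active_rm and its probe are hypothetical stand-ins for the library's actual check, not its API:

```python
import requests
from functools import cache

@cache  # memoizes per URL tuple; note there is no expiry, so a later
        # RM failover would go unnoticed until the process restarts
def cached_active_rm(rm_urls: tuple) -> str:
    # Stand-in for the expensive probe: return the first URL whose
    # /cluster endpoint answers.
    for url in rm_urls:
        try:
            if requests.get(f"{url}/cluster", timeout=5).ok:
                return url
        except requests.RequestException:
            continue
    raise RuntimeError("no active ResourceManager found")

active = cached_active_rm(("http://rm1:8088", "http://rm2:8088"))  # slow, probes
active = cached_active_rm(("http://rm1:8088", "http://rm2:8088"))  # instant, cached
```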

dimon222 (Collaborator) commented Jun 21, 2022

I did some thinking about this question myself a few years ago, but here's why I still haven't done it:

  • Hadoop is by nature a clustered environment. That means that to achieve HA on the client side when running against the namenodes directly, you need to hit multiple nodes and be able to fall back easily when one or another goes down. If you cache the result, you risk being stuck on the same broken node for the cache's lifetime. We could consider a short-lived cache of a few minutes (sketched below), but it might still breach the concept of HA.
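
A sketch of that short-lived cache idea as a hand-rolled TTL decorator (functools has no built-in TTL; all names here are illustrative):

```python
import time
from functools import wraps

def ttl_cache(seconds):
    """Memoize results, but only for `seconds`, so a cached answer that
    points at a node that has since died is eventually re-checked."""
    def decorator(func):
        entries = {}  # args -> (expires_at, result)

        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            cached = entries.get(args)
            if cached and cached[0] > now:
                return cached[1]
            result = func(*args)
            entries[args] = (now + seconds, result)
            return result
        return wrapper
    return decorator

@ttl_cache(seconds=180)  # re-probe every few minutes, bounding the HA risk window
def find_active_rm(rm_urls):
    ...  # the same expensive probe as today
```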

Now, enterprise distributions of Hadoop usually include a Knox gateway, which typically handles the HA concept on its own. If you have a single direct Knox URL, however, you don't need to check whether the cluster is active at all, because it is always considered active.

So in my opinion, what we could consider (see the sketch after this list):

  1. Knox mode - don't check for an active RM. It's always the same single URL, and it should be alive by design.
  2. Non-Knox mode - as is, what we have today - the default.
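
A rough sketch of how the two modes could look; the knox_mode flag and constructor shape are hypothetical, not an existing option of this library:

```python
class ResourceManager:
    def __init__(self, service_endpoints, knox_mode=False):
        if knox_mode:
            # Knox mode: one gateway URL, assumed alive by design,
            # so no active-RM probing at construction time.
            self.active_endpoint = service_endpoints[0]
        else:
            # Non-Knox mode (today's default): probe the endpoints and
            # keep the first one that reports itself as active.
            self.active_endpoint = self._find_active_rm(service_endpoints)

    def _find_active_rm(self, endpoints):
        ...  # existing check_is_active_rm logic
```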

kevin-bates (Member) commented:

Good points @dimon222. I definitely think if we were to add any kind of caching, it should be optional (i.e., configurable).

I think understanding the current scenario is more warranted though. I've never encountered such a delay. @dimon222 - do you know under what circumstances there might be such a delay to get a response from the /cluster endpoint?

dimon222 (Collaborator) commented Jun 21, 2022

> do you know under what circumstances there might be such a delay to get a response from the /cluster endpoint?

Honestly, nothing particular in my experience. Off the top of my head: high network latency, or an overloaded underlying YARN platform. I doubt those scenarios would come as a surprise...

lxorc (Author) commented Jun 22, 2022

> Seems like using the @cache decorator on check_is_active_rm() would make sense - assuming that's the only endpoint performing poorly. But if you find others being slow as well, it's probably worth taking a closer look at your configuration.

Thank you very much for your reply and suggestions. In my project I run a long-lived application, so during initialization I used pickle to persist the ResourceManager instance, and now it works very well. But waiting about 20 s for initialization is not a good idea. Maybe this persistence could be made configurable? I don't think the standby and active roles of Hadoop HA switch very frequently.
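
For reference, a sketch of that workaround (the cache path is made up, and it assumes the instance pickles cleanly; note it also freezes whichever RM was active at pickle time):

```python
import os
import pickle

from yarn_api_client import ResourceManager

CACHE_PATH = "/tmp/rm.pickle"  # hypothetical location

if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "rb") as f:
        rm = pickle.load(f)   # skips the ~20 s check_is_active_rm probe
else:
    rm = ResourceManager()    # pays the slow initialization once
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(rm, f)
```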

lxorc (Author) commented Jun 22, 2022

> do you know under what circumstances there might be such a delay to get a response from the /cluster endpoint?

Thanks again - that's a very good direction to think in. I'll dig deeper to find the reason; maybe it's not what it seems. If only my cluster waits this long, I will do more tests.

lxorc (Author) commented Jun 22, 2022

> I will do more tests.

[image]

lxorc (Author) commented Jun 22, 2022

> I will do more tests.

[image]

More than 50,000 applications have been submitted to my production cluster, and the number will keep increasing. With that many applications, GET http://yourhost/cluster downloads a large HTML page (containing the whole application list), and if your network bandwidth is small it takes a long time - about 20 s, as in the screenshot above. How about using another page for check_is_active_rm? (See the sketch below.)
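
For example, the standard YARN REST endpoint /ws/v1/cluster/info returns a small JSON document that includes the HA state, independent of the application count. A sketch (the function name is illustrative):

```python
import requests

def rm_is_active_light(rm_url: str, timeout: float = 5.0) -> bool:
    # /ws/v1/cluster/info is a few hundred bytes of JSON regardless of
    # how many applications the cluster has run, unlike the /cluster
    # web UI page, which embeds the whole application list.
    try:
        resp = requests.get(f"{rm_url}/ws/v1/cluster/info", timeout=timeout)
        resp.raise_for_status()
        return resp.json()["clusterInfo"]["haState"] == "ACTIVE"
    except (requests.RequestException, KeyError, ValueError):
        return False
```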

dimon222 (Collaborator) commented:

@lxorc thanks for this information. Indeed, that sounds plausible when the cluster page takes long to load. I still have to review the available endpoints, but if there's one that can play the role of an active-mode health check, it's a good idea to consider replacing the current one. That would be on top of the suggestions above (optimization-wise).
