Fix java export for huge models #306
Conversation
@Aulust I've updated all project dependencies and fixed CI recently. And now I have enough rights to review and merge PRs in this repo. P.S. Unfortunately, I'm not familiar with Java, but I'll do my best to help with this PR.
Happy to hear that this project is back to being maintained. I've actually been using it in production this whole time (with this fix applied), so the pull request is still valid and I would like it to be ultimately merged into the master branch.
This is so cool to hear! 🎉
Many thanks! I'll get familiar with the content of this PR in a few days.
020de91
to
6ec1033
Compare
Thanks a lot for this PR! Let me share some of my initial thoughts. I have only one conceptual question about the implementation: do we want to enhance the current API this way?
I believe this is totally fine given the benefits that the new API introduces.
Is it something obsolete from the initial implementation? Looks like the new tests with huge trees pass in R. I know one awesome R expert and I think I can talk with him about R support in m2cgen. From my point of view, R is the most challenging language to support here :-(.
Do you know of any variants? I've spent quite a large amount of time reading articles about decision tree inference, but unfortunately haven't found anything useful.
Awesome research, and a great idea to document this in the FAQ! Thank you very much! I fully support it and a PR is very welcome. Also, linking #152 here as the related issue.
To be honest, I don't want to overcomplicate things.
I'm not against it. Feel free to propose a PR!
@Aulust , I think this update is absolutely amazing and I really appreciate you revisiting this almost 2 years later! Left a few minor comments but looks fantastic otherwise!
@@ -112,6 +115,7 @@ class SubroutinesMixin(BaseToCodeInterpreter):
    # disabled by default
    ast_size_check_frequency = sys.maxsize
    ast_size_per_subroutine_threshold = sys.maxsize
    subroutine_per_group_threshold = sys.maxsize
I feel like `subroutine_per_module_threshold` seems more accurate.
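For context, a minimal sketch of how these two size thresholds could interact when splitting generated subroutines into groups/modules; the function name and structure here are hypothetical, not the actual SubroutinesMixin internals:

```python
import sys

# Hypothetical illustration of the grouping implied by the diff above;
# the real SubroutinesMixin logic may differ.
def split_into_modules(subroutine_codes, subroutine_per_group_threshold=sys.maxsize):
    """Chunk generated subroutine code into groups, one group per module.

    With the default of sys.maxsize the behaviour is unchanged:
    everything ends up in a single module.
    """
    step = subroutine_per_group_threshold
    return [subroutine_codes[i:i + step]
            for i in range(0, len(subroutine_codes), step)]
```

Under that reading, the attribute is effectively a per-module limit, which is why the suggested rename seems to describe the unit of grouping more precisely.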
self._reset_reused_expr_cache()
subroutine = self.subroutine_expr_queue.pop(0)
subroutine_code = self._process_subroutine(subroutine)
subroutines.append((subroutine, subroutine_code))

subroutines.sort(key=lambda subroutine: subroutine[0].idx)
Why do we need this sort given that subroutines are added in a specific order already (the `subroutine_expr_queue`)? And a follow-up: why do we need the index as part of the Subroutine data structure?
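To illustrate the question, here is a toy sketch (all names hypothetical) showing that FIFO processing of a queue whose items were enqueued with increasing `idx` already yields `idx` order, which would make both the sort and the stored index redundant:

```python
from collections import namedtuple

# Toy model of the queue behaviour: subroutines are enqueued with
# increasing idx and popped from the front, so pop order == idx order.
Subroutine = namedtuple("Subroutine", ["idx", "expr"])

queue = [Subroutine(idx=i, expr=f"expr_{i}") for i in range(5)]

processed = []
while queue:
    processed.append(queue.pop(0))  # FIFO, mirroring subroutine_expr_queue.pop(0)

assert [s.idx for s in processed] == sorted(s.idx for s in processed)
```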
If my assumption in the previous comment is correct, I don't think we actually need `subroutine` here, and having only `subroutine_code` will suffice.
@@ -28,3 +28,9 @@ def array_index_access(self, array_name, index):

def vector_init(self, values):
    return f"c({', '.join(values)})"

def module_definition(self, module_name):
    raise NotImplementedError("Modules in r is not supported")
Since "Modules" is plural I suggest to you use are not supported
.
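Applied to the hunk above, the suggested wording would read:

```python
def module_definition(self, module_name):
    raise NotImplementedError("Modules in r are not supported")
```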
    raise NotImplementedError("Modules in r is not supported")

def module_function_invocation(self, module_name, function_name, *args):
    raise NotImplementedError("Modules in r is not supported")
Ditto
Hey @Aulust! I think this change is great and will be an amazing addition to the library. Any chance you plan to finish the work on this PR?
This is a suggestion on how to tackle the problem mentioned in #298. In my case, I had about 1000 estimators and this fix worked quite well in production.
This fix has some flaws, since it breaks backward compatibility in the sense that you get several class files after compilation. This might be relevant in cases like mine, where a custom class loader is used to load new models at runtime without an application restart. A class loader written with only one model class file in mind might not work correctly after this fix.
Another problem comes from the R code generator. With huge trees, the context stack overflows due to nested function calls, and I don't know a good fix for that other than refactoring the AST generation to reduce recursion.
And lastly, I would like to discuss Java code performance. Currently, it's quite bad, since the generated methods are too big to be JIT-compiled by default. A simple JMH benchmark, as well as checking this in production, shows up to a 10x performance boost with the VM flag `-XX:-DontCompileHugeMethods`. Reducing `ast_size_per_subroutine_threshold` helps, but it creates more function calls, which is expensive (the default JVM limit for method inlining, `-XX:FreqInlineSize`, is too small to overcome that). After a decent amount of tweaking `ast_size_per_subroutine_threshold`, I was able to get about 80% of the `-XX:-DontCompileHugeMethods` performance, but it certainly depends on the particular model.
I see several ways to mitigate this, and I would like to hear your take on them.
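For anyone who wants to experiment with the trade-off described above, here is a hedged sketch of tuning the subroutine threshold before export. The attribute is the class-level one from the diff earlier in the thread; the exact import path and the value 4600 are assumptions, and `trained_model` is a placeholder for any supported estimator:

```python
import m2cgen
# Import path is an assumption; the interpreter class may live elsewhere.
from m2cgen.interpreters import JavaInterpreter

# Smaller subroutines keep each generated method under the JIT's size limit,
# at the cost of more (expensive) function calls; tune per model.
JavaInterpreter.ast_size_per_subroutine_threshold = 4600  # assumed value

java_code = m2cgen.export_to_java(trained_model)  # trained_model: placeholder
```

Alternatively, running the JVM with `-XX:-DontCompileHugeMethods` sidesteps the tuning entirely, provided you control the deployment flags.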