Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binary ninja: optimize feature extraction #2402

Open
williballenthin opened this issue Sep 25, 2024 · 6 comments
Open

binary ninja: optimize feature extraction #2402

williballenthin opened this issue Sep 25, 2024 · 6 comments
Labels
binary-ninja performance Related to capa's performance

Comments

@williballenthin
Copy link
Collaborator

During some initial profiling, I'm finding that the Binary Ninja backend is substantially slower than vivisect or IDA. This thread will enumerate all the things we discover. It might include: bugs in Binary Ninja, things we're doing wrong, workarounds, etc.

Given how good Binary Ninja's code analysis is, we'd really like to be able to use it widely. So, let's prepare the code for this.

@williballenthin williballenthin added binary-ninja performance Related to capa's performance labels Sep 25, 2024
@williballenthin
Copy link
Collaborator Author

Vector35/binaryninja-api#5951

0953cc3b77ed2974b09e3a00708f88de931d681e2d0cb64afbaf714610beabe6 (100KB or so) takes a huge amount of time to load into Binary Ninja. Maybe there's an infinite loop somewhere.

@williballenthin
Copy link
Collaborator Author

williballenthin commented Sep 25, 2024

To run capa against 321338196a46b600ea330fc5d98d0699, it takes 2:48. But :36 is spent just in BNGetLowLevelILForInstruction to help recover calls to functions. I have expected this to be extremely fast (less than :01). Need to triage if this is a bug in the API or we're using it incorrectly and/or there's a workaround.

We can also see that BNGetFunctionMediumLevelIL takes quite a bit of time (:13, and its called twice).

  Austin  TUI   Wall Time Profile                                                                                         
   _________   Command python -m capa.main --json --backend=binja tests/data/321338196a46b600ea330fc5d98d0699.exe_                     
   ⎝__⎠ ⎝__⎠   Python 3.12.3    PID 3946467     PID:TID 3946467:0:3946467   
               Samples 1060897  ⏲️    2'48"       Threshold 1%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                              
  00"    2'09"     0.0%   76.6%  ├─ _run_module_as_main (<frozen runpy>:199)                                                           │
  00"    2'09"     0.0%   76.6%  │  └─ _run_code (<frozen runpy>:88)                                                                   │
  00"    2'08"     0.0%   76.3%  │     └─ <module> (~/code/public/capa/capa/main.py:1096)                                              │
  00"     14"      0.0%    8.4%  │        ├─ main (~/code/public/capa/capa/main.py:983)                                                │
  00"     14"      0.0%    8.4%  │        │  └─ get_extractor_from_cli (~/code/public/capa/capa/main.py:830)                           │
  00"     13"      0.0%    7.7%  │        │     ├─ get_extractor (~/code/public/capa/capa/loader.py:264)                               │
  00"     13"      0.0%    7.7%  │        │     │  └─ load (~/software/binaryninja/python/binaryninja/__init__.py:414)                 │
  00"     13"      0.0%    7.6%  │        │     │     └─ BinaryView.load (~/software/binaryninja/python/binaryninja/binaryview.py:2758)│
  13"     13"      7.6%    7.6%  │        │     │        └─ BNLoadFilename (~/software/binaryninja/python/binaryninja/_binaryninjacore.│
  00"    1'39"     0.0%   58.7%  │        ├─ main (~/code/public/capa/capa/main.py:987)                                                │
  00"    1'39"     0.0%   58.7%  │        │  └─ find_capabilities (~/code/public/capa/capa/capabilities/common.py:75)                  │
  00"    1'38"     0.0%   58.3%  │        │     ├─ find_static_capabilities (~/code/public/capa/capa/capabilities/static.py:184)       │
  00"     11"      0.0%    6.3%  │        │     │  ├─ find_code_capabilities (~/code/public/capa/capa/capabilities/static.py:117)      │
  00"     10"      0.0%    6.2%  │        │     │  │  ├─ BinjaFeatureExtractor.get_basic_blocks (~/code/public/capa/capa/features/extra│
  00"     10"      0.0%    6.0%  │        │     │  │  │  ├─ Function.mlil (~/software/binaryninja/python/binaryninja/function.py:1395) │
  10"     10"      6.0%    6.0%  │        │     │  │  │  │  └─ BNGetFunctionMediumLevelIL (~/software/binaryninja/python/binaryninja/_b│
  00"     48"      0.0%   28.9%  │        │     │  ├─ find_code_capabilities (~/code/public/capa/capa/capabilities/static.py:118)      │
  00"     08"      0.1%    4.6%  │        │     │  │  ├─ find_basic_block_capabilities (~/code/public/capa/capa/capabilities/static.py:│
  00"     07"      0.0%    4.2%  │        │     │  │  │  ├─ BinjaFeatureExtractor.get_instructions (~/code/public/capa/capa/features/ex│
  00"     02"      0.1%    1.4%  │        │     │  │  │  │  ├─ BasicBlock.__iter__ (~/software/binaryninja/python/binaryninja/basicbloc│
  00"     05"      0.0%    2.8%  │        │     │  │  │  │  ├─ BasicBlock.__iter__ (~/software/binaryninja/python/binaryninja/basicbloc│
  00"     02"      0.0%    1.5%  │        │     │  │  │  │  │  ├─ CoreArchitecture.get_instruction_text (~/software/binaryninja/python/│
  00"     35"      0.1%   20.8%  │        │     │  │  ├─ find_basic_block_capabilities (~/code/public/capa/capa/capabilities/static.py:│
  01"     31"      0.5%   18.6%  │        │     │  │  │  ├─ find_instruction_capabilities (~/code/public/capa/capa/capabilities/static.│
  00"     30"      0.0%   18.1%  │        │     │  │  │  │  ├─ BinjaFeatureExtractor.extract_insn_features (~/code/public/capa/capa/fea│
  01"     30"      0.8%   18.0%  │        │     │  │  │  │  │  ├─ extract_features (~/code/public/capa/capa/features/extractors/binja/i│
  00"     02"      0.0%    1.0%  │        │     │  │  │  │  │  │  ├─ extract_insn_number_features (~/code/public/capa/capa/features/ext│
  00"     02"      0.0%    1.1%  │        │     │  │  │  │  │  │  ├─ extract_insn_bytes_features (~/code/public/capa/capa/features/extr│
  00"     02"      0.0%    1.3%  │        │     │  │  │  │  │  │  ├─ extract_insn_api_features (~/code/public/capa/capa/features/extrac│
  00"     02"      0.0%    1.2%  │        │     │  │  │  │  │  │  │  ├─ is_stub_function (~/code/public/capa/capa/features/extractors/b│
  00"     02"      0.0%    1.0%  │        │     │  │  │  │  │  │  │  │  ├─ Function.llil (~/software/binaryninja/python/binaryninja/fun│
  02"     02"      1.0%    1.0%  │        │     │  │  │  │  │  │  │  │  │  ├─ BNGetFunctionLowLevelIL (~/software/binaryninja/python/bi│
  00"     03"      0.1%    2.1%  │        │     │  │  │  ├─ find_instruction_capabilities (~/code/public/capa/capa/capabilities/static.│
  00"     03"      0.0%    2.0%  │        │     │  │  │  │  ├─ RuleSet.match (~/code/public/capa/capa/rules/__init__.py:2053)          │
  02"     02"      1.4%    1.4%  │        │     │  │  │  │  │  ├─ RuleSet._match (~/code/public/capa/capa/rules/__init__.py:1879)      │
  00"     04"      0.0%    2.4%  │        │     │  │  ├─ find_basic_block_capabilities (~/code/public/capa/capa/capabilities/static.py:│
  00"     04"      0.0%    2.4%  │        │     │  │  │  ├─ RuleSet.match (~/code/public/capa/capa/rules/__init__.py:2053)             │
  00"     37"      0.0%   21.9%  │        │     │  ├─ find_code_capabilities (~/code/public/capa/capa/capabilities/static.py:128)      │
  00"     37"      0.0%   21.9%  │        │     │  │  ├─ BinjaFeatureExtractor.extract_function_features (~/code/public/capa/capa/featu│
  00"     37"      0.0%   21.9%  │        │     │  │  │  ├─ extract_features (~/code/public/capa/capa/features/extractors/binja/functio│
>>00"     36"      0.0%   21.5%  │        │     │  │  │  │  ├─ extract_function_calls_to (~/code/public/capa/capa/features/extractors/b│
>>00"     36"      0.0%   21.4%  │        │     │  │  │  │  │  ├─ ReferenceSource.llil (~/software/binaryninja/python/binaryninja/binar│
>>00"     36"      0.0%   21.3%  │        │     │  │  │  │  │  │  ├─ Function.get_low_level_il_at (~/software/binaryninja/python/binary│
>>36"     36"     21.3%   21.3%  │        │     │  │  │  │  │  │  │  └─ BNGetLowLevelILForInstruction (~/software/binaryninja/python/bi│
  00"     14"      0.0%    8.6%  │        ├─ main (~/code/public/capa/capa/main.py:990)                                                │
  00"     14"      0.0%    8.6%  │        │  └─ compute_layout (~/code/public/capa/capa/loader.py:662)                                 │
  00"     14"      0.0%    8.4%  │        │     ├─ compute_static_layout (~/code/public/capa/capa/loader.py:631)                       │
  00"     14"      0.0%    8.3%  │        │     │  ├─ BinjaFeatureExtractor.get_basic_blocks (~/code/public/capa/capa/features/extracto│
  00"     14"      0.0%    8.2%  │        │     │  │  ├─ Function.mlil (~/software/binaryninja/python/binaryninja/function.py:1395)    │
  14"     14"      8.2%    8.2%  │        │     │  │  │  ├─ BNGetFunctionMediumLevelIL (~/software/binaryninja/python/binaryninja/_bin

edit: maybe we can cache the results of fetching the llil/mlil to save some time. Still is surprising that it takes 3x longer to fetch the llil than do the complete analysis. Maybe its Python serialization overhead?

@xusheng6
Copy link
Contributor

To run capa against 321338196a46b600ea330fc5d98d0699, it takes 2:48. But :36 is spent just in BNGetLowLevelILForInstruction to help recover calls to functions. I have expected this to be extremely fast (less than :01). Need to triage if this is a bug in the API or we're using it incorrectly and/or there's a workaround.

We can also see that BNGetFunctionMediumLevelIL takes quite a bit of time (:13, and its called twice).

  Austin  TUI   Wall Time Profile                                                                                         
   _________   Command python -m capa.main --json --backend=binja tests/data/321338196a46b600ea330fc5d98d0699.exe_                     
   ⎝__⎠ ⎝__⎠   Python 3.12.3    PID 3946467     PID:TID 3946467:0:3946467   
               Samples 1060897  ⏲️    2'48"       Threshold 1%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                              
  00"    2'09"     0.0%   76.6%  ├─ _run_module_as_main (<frozen runpy>:199)                                                           │
  00"    2'09"     0.0%   76.6%  │  └─ _run_code (<frozen runpy>:88)                                                                   │
  00"    2'08"     0.0%   76.3%  │     └─ <module> (~/code/public/capa/capa/main.py:1096)                                              │
  00"     14"      0.0%    8.4%  │        ├─ main (~/code/public/capa/capa/main.py:983)                                                │
  00"     14"      0.0%    8.4%  │        │  └─ get_extractor_from_cli (~/code/public/capa/capa/main.py:830)                           │
  00"     13"      0.0%    7.7%  │        │     ├─ get_extractor (~/code/public/capa/capa/loader.py:264)                               │
  00"     13"      0.0%    7.7%  │        │     │  └─ load (~/software/binaryninja/python/binaryninja/__init__.py:414)                 │
  00"     13"      0.0%    7.6%  │        │     │     └─ BinaryView.load (~/software/binaryninja/python/binaryninja/binaryview.py:2758)│
  13"     13"      7.6%    7.6%  │        │     │        └─ BNLoadFilename (~/software/binaryninja/python/binaryninja/_binaryninjacore.│
  00"    1'39"     0.0%   58.7%  │        ├─ main (~/code/public/capa/capa/main.py:987)                                                │
  00"    1'39"     0.0%   58.7%  │        │  └─ find_capabilities (~/code/public/capa/capa/capabilities/common.py:75)                  │
  00"    1'38"     0.0%   58.3%  │        │     ├─ find_static_capabilities (~/code/public/capa/capa/capabilities/static.py:184)       │
  00"     11"      0.0%    6.3%  │        │     │  ├─ find_code_capabilities (~/code/public/capa/capa/capabilities/static.py:117)      │
  00"     10"      0.0%    6.2%  │        │     │  │  ├─ BinjaFeatureExtractor.get_basic_blocks (~/code/public/capa/capa/features/extra│
  00"     10"      0.0%    6.0%  │        │     │  │  │  ├─ Function.mlil (~/software/binaryninja/python/binaryninja/function.py:1395) │
  10"     10"      6.0%    6.0%  │        │     │  │  │  │  └─ BNGetFunctionMediumLevelIL (~/software/binaryninja/python/binaryninja/_b│
  00"     48"      0.0%   28.9%  │        │     │  ├─ find_code_capabilities (~/code/public/capa/capa/capabilities/static.py:118)      │
  00"     08"      0.1%    4.6%  │        │     │  │  ├─ find_basic_block_capabilities (~/code/public/capa/capa/capabilities/static.py:│
  00"     07"      0.0%    4.2%  │        │     │  │  │  ├─ BinjaFeatureExtractor.get_instructions (~/code/public/capa/capa/features/ex│
  00"     02"      0.1%    1.4%  │        │     │  │  │  │  ├─ BasicBlock.__iter__ (~/software/binaryninja/python/binaryninja/basicbloc│
  00"     05"      0.0%    2.8%  │        │     │  │  │  │  ├─ BasicBlock.__iter__ (~/software/binaryninja/python/binaryninja/basicbloc│
  00"     02"      0.0%    1.5%  │        │     │  │  │  │  │  ├─ CoreArchitecture.get_instruction_text (~/software/binaryninja/python/│
  00"     35"      0.1%   20.8%  │        │     │  │  ├─ find_basic_block_capabilities (~/code/public/capa/capa/capabilities/static.py:│
  01"     31"      0.5%   18.6%  │        │     │  │  │  ├─ find_instruction_capabilities (~/code/public/capa/capa/capabilities/static.│
  00"     30"      0.0%   18.1%  │        │     │  │  │  │  ├─ BinjaFeatureExtractor.extract_insn_features (~/code/public/capa/capa/fea│
  01"     30"      0.8%   18.0%  │        │     │  │  │  │  │  ├─ extract_features (~/code/public/capa/capa/features/extractors/binja/i│
  00"     02"      0.0%    1.0%  │        │     │  │  │  │  │  │  ├─ extract_insn_number_features (~/code/public/capa/capa/features/ext│
  00"     02"      0.0%    1.1%  │        │     │  │  │  │  │  │  ├─ extract_insn_bytes_features (~/code/public/capa/capa/features/extr│
  00"     02"      0.0%    1.3%  │        │     │  │  │  │  │  │  ├─ extract_insn_api_features (~/code/public/capa/capa/features/extrac│
  00"     02"      0.0%    1.2%  │        │     │  │  │  │  │  │  │  ├─ is_stub_function (~/code/public/capa/capa/features/extractors/b│
  00"     02"      0.0%    1.0%  │        │     │  │  │  │  │  │  │  │  ├─ Function.llil (~/software/binaryninja/python/binaryninja/fun│
  02"     02"      1.0%    1.0%  │        │     │  │  │  │  │  │  │  │  │  ├─ BNGetFunctionLowLevelIL (~/software/binaryninja/python/bi│
  00"     03"      0.1%    2.1%  │        │     │  │  │  ├─ find_instruction_capabilities (~/code/public/capa/capa/capabilities/static.│
  00"     03"      0.0%    2.0%  │        │     │  │  │  │  ├─ RuleSet.match (~/code/public/capa/capa/rules/__init__.py:2053)          │
  02"     02"      1.4%    1.4%  │        │     │  │  │  │  │  ├─ RuleSet._match (~/code/public/capa/capa/rules/__init__.py:1879)      │
  00"     04"      0.0%    2.4%  │        │     │  │  ├─ find_basic_block_capabilities (~/code/public/capa/capa/capabilities/static.py:│
  00"     04"      0.0%    2.4%  │        │     │  │  │  ├─ RuleSet.match (~/code/public/capa/capa/rules/__init__.py:2053)             │
  00"     37"      0.0%   21.9%  │        │     │  ├─ find_code_capabilities (~/code/public/capa/capa/capabilities/static.py:128)      │
  00"     37"      0.0%   21.9%  │        │     │  │  ├─ BinjaFeatureExtractor.extract_function_features (~/code/public/capa/capa/featu│
  00"     37"      0.0%   21.9%  │        │     │  │  │  ├─ extract_features (~/code/public/capa/capa/features/extractors/binja/functio│
>>00"     36"      0.0%   21.5%  │        │     │  │  │  │  ├─ extract_function_calls_to (~/code/public/capa/capa/features/extractors/b│
>>00"     36"      0.0%   21.4%  │        │     │  │  │  │  │  ├─ ReferenceSource.llil (~/software/binaryninja/python/binaryninja/binar│
>>00"     36"      0.0%   21.3%  │        │     │  │  │  │  │  │  ├─ Function.get_low_level_il_at (~/software/binaryninja/python/binary│
>>36"     36"     21.3%   21.3%  │        │     │  │  │  │  │  │  │  └─ BNGetLowLevelILForInstruction (~/software/binaryninja/python/bi│
  00"     14"      0.0%    8.6%  │        ├─ main (~/code/public/capa/capa/main.py:990)                                                │
  00"     14"      0.0%    8.6%  │        │  └─ compute_layout (~/code/public/capa/capa/loader.py:662)                                 │
  00"     14"      0.0%    8.4%  │        │     ├─ compute_static_layout (~/code/public/capa/capa/loader.py:631)                       │
  00"     14"      0.0%    8.3%  │        │     │  ├─ BinjaFeatureExtractor.get_basic_blocks (~/code/public/capa/capa/features/extracto│
  00"     14"      0.0%    8.2%  │        │     │  │  ├─ Function.mlil (~/software/binaryninja/python/binaryninja/function.py:1395)    │
  14"     14"      8.2%    8.2%  │        │     │  │  │  ├─ BNGetFunctionMediumLevelIL (~/software/binaryninja/python/binaryninja/_bin

edit: maybe we can cache the results of fetching the llil/mlil to save some time. Still is surprising that it takes 3x longer to fetch the llil than do the complete analysis. Maybe its Python serialization overhead?

I opened the file in binja GUI and the analysis only took 4.3 seconds:

[Analysis] Analysis update took 4.325 seconds

My machine is probably faster then the CI box used by GitHub, still quite surprising to see such a huge difference

@williballenthin
Copy link
Collaborator Author

williballenthin commented Sep 26, 2024

@xusheng6 on my test rig it took maybe 13s to load the binary. Then lots longer to extract the features (minutes). So accessing the LLIR/MLIR is taking integer multiples of the total load time 😕

Maybe 3s vs 13s comes from only having about two cores available in the test environment.

@xusheng6
Copy link
Contributor

@xusheng6 on my test rig it took maybe 13s to load the binary. Then lots longer to extract the features (minutes). So accessing the LLIR/MLIR is taking integer multiples of the total load time 😕

Maybe 3s vs 13s comes from only having about two cores available in the test environment.

thx for letting me know about it, it seems either I wrote the backend in a bad way, or the Python wrapping adds significant overhead to it

@williballenthin
Copy link
Collaborator Author

The profiler didn't expose any invocation counts, so I'm not yet sure if we're calling the API way too many times or if the API itself is slow. Given that it's both LLIR and MLIR, I sorta suspect the latter. But, in the few minutes I looked at the bindings, it didn't seem like all that much was happening (on the Python side).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binary-ninja performance Related to capa's performance
Projects
None yet
Development

No branches or pull requests

2 participants