-
Notifications
You must be signed in to change notification settings - Fork 561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extracts web domains and IP address and implements tests #2031
base: master
Are you sure you want to change the base?
Conversation
…d sandbox traces and presents them to the user. In (-v) and (-vv) modes, this pull request also tries to identify how web domains and IP addresses are used. It checks whether a WinAPI networking function is called soon after the web domain or IP address appears.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased)
section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed
CHANGELOG updated or no update needed, thanks! 😄
raise NotImplementedError("Confirm that argv is an attribute of doc.meta") | ||
|
||
else: | ||
print("in 'get_sigpaths_from_doc', run in debug (-d) mode") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was for debugging? Why are you not using logger here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @VascoSch92 yes trying to figure out why it's failing to build with PyInstaller 3.11
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok easy... I just saw it and I wanted to check :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha yes thank you for the question though, I appreciate the comments and feedback you're giving me
"""same as 'formatted_domain_verbose' but without 'ip_address_statement'""" | ||
return ( | ||
f"{ip_addr}\n" | ||
+ f" |---- {networking_functions_statement(doc, ip_addr)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ f" |---- {networking_functions_statement(doc, ip_addr)}" | |
+ f" |---- {networking_functions_statement(doc, ip_addr)}\n" |
or?
return statement | ||
|
||
else: | ||
raise LengthError("'api_functions' contains unexpected data!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can never have this error or?
Because the three cases above are already covering all the output of the len
function.
If api_functions
doesn't implement the length functions, an error will occur on the first if or?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Yes I think you are right, I have just taken that out!
capa/render/verbose.py
Outdated
|
||
|
||
def ceil(num): | ||
if isinstance(num, float): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if num
is integer? The return is None
. Do you want this behaviour?
but there is a function in the math default package of python which do that not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question - it shouldn't be an issue because the input to ceil
will always be some multiple of 1/3, so the input's type should always be float, but you're right that it's not explicit - I will add a type hint here. I thought it might be a bit unnecessary to import the math package just for a simple function lol but what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just updated, please let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the method is better :)
I have a problem with the fact that you implement the same method in two different scripts. If you have a bug in one method you have the same bug in the other, i.e, two bugs.
However, it is improbable that you have a bug in this method ;)
Using math.ceil is not a problem, math is a standard library in python.
But up to you ;)
Co-authored-by: Vasco Schiavo <[email protected]>
Co-authored-by: Vasco Schiavo <[email protected]>
Hey @VascoSch92 looks like most of these tests are passing but GitHub is saying it's missing a code license agreement from you - do you think you could check about that? Would love to commit this code together :) |
Done! Let me know if there are still problems ;) You should probably trigger the cla/google tests. I can not do that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work with many good parts, let's see how we can incorporate this as part of the main extraction phase and extend the result document with NBI/HBI information initially
yield feature, addr | ||
|
||
|
||
def default_extract_domain_names(doc: ResultDocument) -> Generator[str, None, None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid reprocessing the file here (in generate_insns_from_doc
etc.), let's discuss how to move this into the core extraction phase and how we can extend the ResultDocument with HBI/NBI data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mr-tz Thank you for your feedback! Yes, this is something I thought about too. I'll look into restructuring this program to move the domain/IP extraction to the core extraction phase and then extend ResultDocument with HBI/NBI data
(str): either the formatted IP address, or an error message | ||
""" | ||
try: | ||
ip_address = socket.gethostbyname(domain) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't want to do this and in most offline analysis systems it won't work anyway
return f"Could not get IP address for {domain.split('/')[0]}\n" | ||
|
||
|
||
def networking_functions_statement(doc: ResultDocument, domain_or_ip: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this can be simplified to called networking APIs in "close proximity" to the found string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mr-tz I think this suggestion could be a good idea. However, when I previously tried simple ways of extracting networking APIs in "close proximity" (like skipping over "push" statements until we get to APIs), I actually got reduced performance. I'll have another look at this and see if I can find a way to make the "close proximity" searching work
yield "Not able to identify the calling function" | ||
|
||
|
||
def potential_winapi_function(string: str) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a finite list we could generate and embed instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mr-tz thank you for this! Yes, I will do that
tests/test_domain_ip_extractor.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tests look great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mr-tz thank you!
Co-authored-by: Moritz <[email protected]>
This PR partially resolves #1907. It extracts web domains and IP addresses, and implements rendering functions and tests.
These changes likely don't require updates to the documentation, but if some users want to, they should be able to repurpose many of the extraction functions fairly easily.
In (-d) mode, this pull request extracts web domains and IP addresses from files and sandbox traces and presents them to the user. In (-v) and (-vv) modes, this pull request also tells the user how many times each web domain and IP address occur, and tries to identify a WinAPI networking function acting on a web domain and IP address for every time they occur. (-v) and (-vv) modes are currently the same.
This PR also implements tests for the part of the code that checks valid web domains, valid IP addresses, and potential WinAPI networking functions.
Example output:
Default (-d) output
Default (-d) output when there are no domains found
Verbose (-v) output
Very verbose (-vv) output
Checklist