Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase Specificity of dexdump Regex #9

Open
peterstory opened this issue Aug 20, 2018 · 2 comments
Open

Increase Specificity of dexdump Regex #9

peterstory opened this issue Aug 20, 2018 · 2 comments

Comments

@peterstory
Copy link

peterstory commented Aug 20, 2018

I'm not sure if it will make a practical difference on which trackers are detected, but I noticed that the regex here seems unnecessary broad, and will match on things other than class names:

cmd = '%s %s/classes*.dex | perl -n -e\'/[A-Z]+((?:\w+\/)+\w+)/ && print "$1\n"\'|sort|uniq' % (
dexdump, tmp_dir)

Here is a short example of what I mean. Here are the first 20 lines from running dexdump on WhatsApp:

dexdump work_dir/tmp_dir/classes.dex | head -n 20
Processing 'work_dir/tmp_dir/classes.dex'...
Opened 'work_dir/tmp_dir/classes.dex', DEX version '035'
Class #0            -
  Class descriptor  : 'La/a/a/a/a/a$a;'
  Access flags      : 0x0011 (PUBLIC FINAL)
  Superclass        : 'Ljava/lang/Object;'
  Interfaces        -
  Static fields     -
  Instance fields   -
    #0              : (in La/a/a/a/a/a$a;)
      name          : 'a'
      type          : 'Ljava/lang/String;'
      access        : 0x0001 (PUBLIC)
    #1              : (in La/a/a/a/a/a$a;)
      name          : 'b'
      type          : 'Ljava/lang/String;'
      access        : 0x0001 (PUBLIC)
    #2              : (in La/a/a/a/a/a$a;)
      name          : 'c'
      type          : 'Ljava/lang/String;'

And here is the result of running the regex which the code is currently using. Note the matching on the superclass and on instance field types.

dexdump work_dir/tmp_dir/classes.dex | head -n 20 | perl -n -e'/[A-Z]+((?:\w+\/)+\w+)/ && print "$1\n"' | sort | uniq
a/a/a/a/a/a
java/lang/Object
java/lang/String

I've written a slightly different regex which is more specific, so it will only match on the Class descriptor:

dexdump work_dir/tmp_dir/classes.dex | head -n 20 | perl -n -e'/\s*Class descriptor\s*:\s*\047L((?:\w+\/)+\w+)\W/ && print "$1\n"'  | sort | uniq
a/a/a/a/a/a

I would be curious to see whether this change has any downstream effects (i.e., on what trackers are detected). If this change looks good, I can include it in a PR, potentially addressing #7 at the same time.

EDIT: Fixed my regex. It didn't account for some class descriptors not including a $ or starting with a variable number of whitespace characters.

@peterstory
Copy link
Author

Although I haven't rigorously benchmarked the effect of this change, there appears to be a significant performance improvement from having a more specific regex (probably because there are fewer lines to sort). My unit test which includes the subprocesses described above went from about 3.4 seconds to 2.5 seconds.

@counter-reverse
Copy link
Contributor

I am already replacing the regex by python code. See my PR: #35

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants