upgraded to current SMDA, added migration helpers
danielplohmann committed Jan 24, 2024
1 parent 825697d commit 6228599
Showing 7 changed files with 82 additions and 28 deletions.
41 changes: 21 additions & 20 deletions README.md
@@ -121,6 +121,7 @@ In July 2023, we started populating a [Github repository](https://github.com/dan


## Version History
* 2024-01-24 v1.3.0: BREAKING: Milestone release with indexing improvements for PicHash and MinHash. To ensure full backward compatibility, recalculation of all hashes is recommended. Check this [migration guide](https://github.com/danielplohmann/mcrit/blob/main/docs/migration-v1.3.0.md).
* 2024-01-23 v1.2.26: Pinning lief to 0.13.2 in order to ensure that the pinned SMDA remains compatible.
* 2024-01-09 v1.2.25: Ensure that we can deliver system status regardless of whether there is a `db_state` and `db_timestamp` or not.
* 2024-01-05 v1.2.24: Now supporting "query" argument in CLI, as well as compact MatchingResults (without function match info) to reduce file footprint.
@@ -132,19 +133,19 @@ In July 2023, we started populating a [Github repository](https://github.com/dan
* 2023-12-05 v1.2.15: Added convenience functionality to Job objects, version number aligned with mcritweb.
* 2023-11-24 v1.2.11: SMDA pinned to version 1.12.7 before we upgrade SMDA and introduce a database migration to recalculate pic + picblock hashes with the improved generalization.
* 2023-11-17 v1.2.10: Added ability to set an authorization token for the server via header field: `apitoken`; added ability to filter by job groups; added ability to fail orphaned jobs.
* 2023-10-17 v1.2.8: Minor fix in job groups.
* 2023-10-16 v1.2.6: Summarized queue statistics, refined Job classification.
* 2023-10-13 v1.2.4: Exposed Queue/Job Deletion to REST interface, improved query speed for various queue lookups via indexing and parameterized mongodb queries.
* 2023-10-13 v1.2.3: Workers will now de-register from in-progress jobs in case they crash (THX to @yankovs for the code template).
* 2023-10-03 v1.2.2: MatchingResult filtering for min/max num samples (incl. fix).
* 2023-10-02 v1.2.0: Milestone release for Virus Bulletin 2023.
* 2023-09-18 v1.1.7: Bugfix: Tasking matching with 0 bands now deactivates minhash matching as it was supposed to be before. Also matching job progress percentage fixed.
* 2023-09-15 v1.1.6: Bugfix in BlockMatching, convenience functionality for interacting with Job objects.
* 2023-09-14 v1.1.5: Deactivated gunicorn as default WSGI handler for the time being due to issues with non-returning calls when handling compute-heavy calls.
* 2023-09-14 v1.1.4: BUGFIX: Added `requirements.txt` to `data_files` in `setup.py` to ensure it's available for the package.
* 2023-09-13 v1.1.3: Extracted some performance critical constants into parameters configurable in MinHashConfig and StorageConfig, fixed progress reporting for batched matching, BUGFIX: usage of GunicornConfig to proper dataclass.
* 2023-09-13 v1.1.1: Streamlined requirements / setup, excluded `gunicorn` for Windows (THX to @yankovs!!).
* 2023-09-12 v1.1.0: For Linux deployments, MCRIT now uses `gunicorn` instead of `waitress` as WSGI server because of [much better performance](https://github.com/danielplohmann/mcrit/pull/39). As gunicorn needs its own config, this required bumping the minor versions (THX to @yankovs!!).
* 2023-09-08 v1.0.21: All methods of McritClient now forward apitokens/usernames to the backend.
* 2023-09-05 v1.0.20: Use two's complement to represent addresses in SampleEntry, FunctionEntry when storing in MongoDB to address BSON limitations (THX to @yankovs).
* 2023-09-05 v1.0.19: Statistics are now using the internal counters that had been created a while ago (THX to @yankovs).
@@ -154,13 +155,13 @@ In July 2023, we started populating a [Github repository](https://github.com/dan
* 2023-08-23 v1.0.12: Added the ability to rebuild the minhash bands used for indexing.
* 2023-08-22 v1.0.11: Fixed a bug where when importing bulk data, the `function_name` was not also added as a `function_label`.
* 2023-08-11 v1.0.10: Fixed a bug where when importing bulk data, the function_id would not be adjusted prior to adding MinHashes to bands, possibly leading to non-existing function_ids.
* 2023-08-02 v1.0.9: IDA plugin can now filter by block size and minhash score, optimized layout and user experience (THX for the feedback to @r0ny123!!)
* 2023-07-28 v1.0.8: IDA plugin can now display colored graphs for remote functions and do queries for PicBlockHashes (for basic blocks) for the currently viewed function.
* 2023-06-06 v1.0.7: Extended filtering capabilities on MatchingResult.
* 2023-06-02 v1.0.6: IDA plugin can now task matching jobs, show their results and batch import labels. Harmonization of MatchingResult.
* 2023-05-22 v1.0.3: More robustness for path verification when using MCRIT CLI on Malpedia repo folder.
* 2023-05-12 v1.0.1: Some progress on label import for the IDA plugin. Reflected API extension of MCRITweb in McritClient.
* 2023-04-10 v1.0.0: Milestone release for Botconf 2023.
* 2023-04-10 v0.25.0: IDA plugin can now do function queries for the currently viewed function.
* 2023-03-24 v0.24.2: McritClient can forward username/apitoken, addJsonReport is now forwardable.
* 2023-03-21 v0.24.0: FunctionEntries can now store additional FunctionLabelEntries, along with the submitting user/date.
40 changes: 40 additions & 0 deletions docs/migration-v1.3.0.md
@@ -0,0 +1,40 @@
# MCRIT Migration Guide for v1.3.0

With the MCRIT v1.3.0 release, we address several issues noticed with SMDA over the last months.
In particular, we noticed that not all addresses were [properly masked](https://github.com/danielplohmann/smda/issues/37), which caused functions that should be PicHash-identical to have different hashes and thus be missed during this matching phase.
Additionally, some of you saw log output about [unhandled instructions](https://github.com/danielplohmann/smda/issues/48) during mnemonic escaping, which in rare cases also broke [opcode bytes](https://github.com/danielplohmann/smda/issues/46).
Furthermore, larger Delphi binaries could stall batch processing, as there were [issues](https://github.com/danielplohmann/smda/issues/44) in parsing internal structures.

All of these have been fixed, but some of the fixes come at the price of potential incompatibility with the PicHashes and MinHashes already calculated in your databases.
To simplify the migration, and especially to avoid having to reprocess any binary content, we have introduced specific migration functions in the MinHashIndex that help bring all content up to date with the new SMDA version.

## Triggering the Database Migration

After updating to the latest requirements, you should have SMDA v1.3.11 or higher available:

```bash
$ python -m pip install -r requirements.txt
...
$ python -m pip freeze | grep smda
smda==1.3.11
```
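
The same check can also be scripted. The following is only a minimal sketch and not part of MCRIT: the helper names `parse_version` and `smda_is_recent_enough` are hypothetical, the default threshold simply repeats the version named above, and plain dotted numeric version strings are assumed.

```python
from importlib import metadata


def parse_version(version):
    """Turn a dotted version string such as "1.3.11" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def smda_is_recent_enough(minimum="1.3.11", installed=None):
    """Return True if the installed SMDA is at least `minimum`.

    `installed` can be passed explicitly (e.g. for testing); otherwise the
    version is read from the metadata of the installed `smda` package.
    """
    if installed is None:
        try:
            installed = metadata.version("smda")
        except metadata.PackageNotFoundError:
            # SMDA is not installed at all
            return False
    return parse_version(installed) >= parse_version(minimum)
```

Comparing integer tuples instead of raw strings avoids the pitfall that `"1.3.9"` sorts after `"1.3.11"` in a plain string comparison.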

You can now do one of the following:

* use curl to queue the recalculation jobs for PicHash and MinHash:
```bash
$ curl http://127.0.0.1:8000/recalculate_pichashes
$ curl http://127.0.0.1:8000/recalculate_minhashes
```

* use the McritClient to queue the recalculation jobs for PicHash and MinHash:
```python
>>> from mcrit.client.McritClient import McritClient
>>> c = McritClient()
>>> c.recalculatePicHashes()
>>> c.recalculateMinHashes()
```
* use the McritWeb front-end to trigger the recalculation jobs (this will be implemented as soon as possible and will then be available to admin users in the server section).

Note that these jobs may run for a long time, depending on the number of functions indexed in your database.
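
If neither curl nor the MCRIT client is at hand, the two requests can presumably also be sent with the Python standard library alone. This is a hedged sketch under several assumptions: the endpoint paths and the default address are copied from the curl commands above, the server is assumed to accept unauthenticated GET requests, and `build_recalculation_urls` and `queue_recalculations` are hypothetical helper names that do not exist in MCRIT.

```python
from urllib.request import urlopen

# Endpoint paths as used in the curl commands of this guide.
RECALCULATION_ENDPOINTS = ("recalculate_pichashes", "recalculate_minhashes")


def build_recalculation_urls(base_url="http://127.0.0.1:8000"):
    """Return the PicHash and MinHash recalculation URLs, in that order."""
    base = base_url.rstrip("/")
    return ["%s/%s" % (base, endpoint) for endpoint in RECALCULATION_ENDPOINTS]


def queue_recalculations(base_url="http://127.0.0.1:8000", opener=urlopen):
    """Send one GET request per endpoint and return the raw response bodies.

    `opener` is injectable (assumption for illustration) so that the function
    can be exercised without a running server.
    """
    responses = {}
    for url in build_recalculation_urls(base_url):
        with opener(url) as response:
            responses[url] = response.read()
    return responses
```

The requests are sent in the same order as in the curl example, PicHash first and MinHash second.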
4 changes: 2 additions & 2 deletions mcrit/Worker.py
@@ -190,8 +190,8 @@ def recalculatePicHashes(self, progress_reporter=NoProgressReporter()):
     # Reports PROGRESS
     @Remote(progress=True)
     def recalculateMinHashes(self, progress_reporter=NoProgressReporter()):
-        return self._storage.recalculateAllMinHashes(progress_reporter=progress_reporter)
-
+        self._storage.deleteAllMinHashes(progress_reporter=progress_reporter)
+        return self.updateMinHashes(None, progress_reporter=progress_reporter)

     # Reports PROGRESS
     @Remote(progress=True)
2 changes: 1 addition & 1 deletion mcrit/config/McritConfig.py
@@ -10,7 +10,7 @@
 class McritConfig(object):

     # NOTE to self: always change this in setup.py as well!
-    VERSION = "1.2.26"
+    VERSION = "1.3.0"
     # basic pathing info
     CONFIG_FILE_PATH = str(os.path.abspath(__file__))
     PROJECT_ROOT = str(os.path.abspath(os.sep.join([CONFIG_FILE_PATH, "..", ".."])))
14 changes: 14 additions & 0 deletions mcrit/storage/MongoDbStorage.py
@@ -1102,6 +1102,20 @@ def rebuildMinhashBandIndex(self, progress_reporter=None):
progress_reporter.step()
         return {"minhash_functions_indexed": minhash_functions}

+    def deleteAllMinHashes(self, progress_reporter=None):
+        # delete all minhashes
+        self._getDb().functions.update_many({}, {"$set": {"minhash": ""}})
+        # reset bands
+        collections = []
+        for band_id in range(self._storage_config.STORAGE_NUM_BANDS):
+            collections.append("band_%d" % band_id)
+        for c in collections:
+            self._getDb()[c].drop()
+            col = self._getDb()[c]
+            self._getDb()[c].create_index("band_hash")
+        LOGGER.info("Dropped all Minhashes and created a fresh banding index.")
+        return
+
     def recalculateAllPicHashes(self, progress_reporter=None):
         # get current SMDA version
         smda_config = SmdaConfig()
7 changes: 3 additions & 4 deletions mcrit/storage/StorageInterface.py
@@ -650,11 +650,10 @@ def recalculateAllPicHashes(self) -> int:
         """
         raise NotImplementedError

-    def recalculateAllMinHashes(self) -> int:
-        """ Process all FunctionEntries and use this SMDA version and MCRIT config to recalculate and update the MinHashes
-        In the end, call rebuildMinhashBandIndex
+    def deleteAllMinHashes(self) -> int:
+        """ drop every minhash in all function_entries as a preparation for a full rebuild
         Returns:
-            the number of minhashes indexed
+            the number of minhashes dropped
         """
         raise NotImplementedError

2 changes: 1 addition & 1 deletion setup.py
@@ -7,7 +7,7 @@

 setup(
     name='mcrit',
-    version="1.2.26",
+    version="1.3.0",
     description='MCRIT is a framework created for simplified application of the MinHash algorithm to code similarity.',
     long_description_content_type="text/markdown",
     long_description=README,