Home
Welcome to the bulk_extractor wiki!
bulk_extractor is a C++ program that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. The results are stored in feature files that can be easily inspected, parsed, or processed with automated tools. bulk_extractor also creates histograms of features that it finds, as features that are more common tend to be more important.
We have made the following tools available for processing feature files generated by bulk_extractor:
- A small number of Python programs that perform automated processing on feature files.
- The Bulk Extractor Viewer User Interface (BEViewer) for browsing features stored in feature files and for launching bulk_extractor scans. See the BEViewer page.
bulk_extractor now creates an output directory that has the following layout:
alerts.txt | Processing errors. |
ccn.txt | Credit card numbers |
ccn_track2.txt | Credit card “track 2” information, which has previously been found in some bank card fraud cases. |
domain.txt | Internet domains found on the drive, including dotted-quad addresses found in text. |
email.txt | Email addresses. |
ether.txt | Ethernet MAC addresses found through IP packet carving of swap files and compressed system hibernation files and file fragments. |
exif.txt | EXIF metadata from JPEGs and video segments. This feature file contains all of the EXIF fields, expanded as XML records. |
find.txt | The results of specific regular expression search requests. |
ip.txt | IP addresses found through IP packet carving. |
rfc822.txt | Email message headers including Date:, Subject: and Message-ID: fields. |
tcp.txt | TCP flow information found through IP packet carving. |
telephone.txt | US and international telephone numbers. |
url.txt | URLs, typically found in browser caches, email messages, and pre-compiled into executables. |
url_searches.txt | A histogram of terms used in Internet searches from services such as Google, Bing, Yahoo, and others. |
url_services.txt | A histogram of the domain name portion of all the URLs found on the media. |
wordlist.txt | A list of all “words” extracted from the disk, useful for password cracking. |
wordlist_*.txt | The wordlist with duplicates removed, in a format that can be easily imported into a popular password-cracking program. |
zip.txt | Information about every ZIP file component found on the media. This is exceptionally useful because ZIP files contain internal structure and ZIP is increasingly the compound file format of choice for a variety of products such as Microsoft Office. |
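The feature files listed above are plain text, so they are straightforward to process with a script. The following is a minimal sketch in Python, assuming each feature line holds tab-separated offset, feature, and context fields and that lines beginning with `#` are comments. The `Feature` type and the `read_features` function are hypothetical names used for illustration; they are not part of bulk_extractor.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    offset: str   # kept as text, since it may not be a plain integer
    feature: str
    context: str


def read_features(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield Feature records from the lines of a feature file.

    Assumes tab-separated fields (offset, feature, context) and
    '#'-prefixed comment lines; adjust if your files differ.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) < 2:
            continue  # not a recognizable feature line
        context = parts[2] if len(parts) > 2 else ""
        yield Feature(parts[0], parts[1], context)


# Hypothetical sample lines, not real bulk_extractor output.
sample = [
    "# comment line",
    "1024\talice@example.com\tmail from alice@example.com today",
    "2048\tbob@example.com",
]
features = list(read_features(sample))
```

To read a real file, pass an open file object instead of the sample list, for example `read_features(open("email.txt", encoding="utf-8", errors="replace"))`.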
For each of the above, two additional files may be created:
_stopped.txt | bulk_extractor supports a stop list: a list of items that do not need to be brought to the user’s attention. However, rather than simply suppressing this information, which might cause something critical to be hidden, bulk_extractor stores stopped entries in the stopped files. |
_histogram.txt | bulk_extractor can also create histograms of features. This is important, as experience has shown that email addresses, domain names, URLs, and other information that appear more frequently on a hard drive or in a cell phone’s memory can be used to rapidly create a pattern of life report. |
bulk_extractor also creates a file that captures the provenance of the run:
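The two ideas above, diverting stopped entries instead of discarding them and ranking features by frequency, can be sketched in a few lines of Python. This is an illustration only, not how bulk_extractor implements either step; `split_stopped` and `histogram` are hypothetical helper names, and features are treated as plain strings.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def split_stopped(features: Iterable[str],
                  stop_list: Set[str]) -> Tuple[List[str], List[str]]:
    """Separate features into (kept, stopped) without dropping anything."""
    kept: List[str] = []
    stopped: List[str] = []
    for feature in features:
        if feature in stop_list:
            stopped.append(feature)   # preserved for later review
        else:
            kept.append(feature)
    return kept, stopped


def histogram(features: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (feature, count) pairs, most frequent first."""
    return Counter(features).most_common()


# Hypothetical sample data.
found = ["a@example.com", "b@example.com", "a@example.com", "root@localhost"]
kept, stopped = split_stopped(found, {"root@localhost"})
ranked = histogram(kept)
```

Keeping the stopped entries in their own list mirrors the purpose of the stopped files: an analyst can still inspect what the stop list filtered out.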
report.xml | A Digital Forensics XML report that includes information about the source media, how the bulk_extractor program was compiled and run, the time to process the digital evidence, and a meta report of the information that was found. |
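Because report.xml is an XML document, it can be given a first inspection with Python's standard library. The sketch below assumes only that the file is well-formed XML and makes no claim about which elements bulk_extractor writes; the sample string and its element names are hypothetical stand-ins, and `tag_counts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text: str) -> Counter:
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}name";
    # keep only the local name.
    return Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())


# Hypothetical stand-in for a report; real element names may differ.
sample = (
    "<dfxml><source><image_filename>disk.raw</image_filename></source>"
    "<feature_files/></dfxml>"
)
counts = tag_counts(sample)
```

Listing the element names first is a cautious way to learn the structure of a report before writing code that depends on specific fields.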
We have developed four programs for post-processing the bulk_extractor output:
bulk_diff.py | This program reports the differences between two bulk_extractor runs. The intent is to image a computer, run bulk_extractor on a disk image, let the computer run for a period of time, re-image the computer, run bulk_extractor on the second image, and then report the differences. This can be used to infer the user’s activities within a time period. |
cda_tool.py | This tool, currently under development, reads multiple bulk_extractor reports from multiple runs against multiple drives and performs a multi-drive correlation using Garfinkel’s Cross Drive Analysis technique. This can be used to automatically identify new social networks or to identify new members of existing networks. |
identify_filenames.py | In the bulk_extractor feature file, each feature is annotated with the byte offset from the beginning of the image in which it was found. The program takes as input a bulk_extractor feature file and a DFXML file containing the locations of each file on the drive (produced with Garfinkel’s fiwalk program) and produces an annotated feature file that contains the offset, feature, and the file in which the feature was found. |
make_context_stop_list.py | Although forensic analysts frequently make “stop lists”—for example, a list of email addresses that appear in the operating system and should therefore be ignored—such lists have a significant problem. Because it is relatively easy to get an email address into the binary of an open source application, ignoring all of these email addresses may make it possible to cloak email addresses from forensic analysis. Our solution is to create context-sensitive stop lists, in which the feature to be stopped is presented with the context in which it occurs. |
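The core ideas behind three of these programs can be illustrated with short Python sketches. These are not the implementations shipped with bulk_extractor: the function names are hypothetical, features are treated as plain strings, and file locations are assumed to be given as sorted, non-overlapping (start offset, length, filename) byte runs.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (start offset, length in bytes, filename); an assumed representation.
ByteRun = Tuple[int, int, str]


def diff_features(before: Iterable[str],
                  after: Iterable[str]) -> Dict[str, Set[str]]:
    """Report features that appeared or disappeared between two runs."""
    b, a = set(before), set(after)
    return {"added": a - b, "removed": b - a}


def find_file(offset: int, runs: List[ByteRun]) -> Optional[str]:
    """Return the file whose byte run contains offset, or None.

    runs must be sorted by start offset and must not overlap.
    """
    starts = [run[0] for run in runs]
    i = bisect_right(starts, offset) - 1   # last run starting at or before offset
    if i >= 0:
        start, length, name = runs[i]
        if offset < start + length:
            return name
    return None


def is_stopped(feature: str, context: str,
               stop_pairs: Set[Tuple[str, str]]) -> bool:
    """Context-sensitive stop check: both feature and context must match."""
    return (feature, context) in stop_pairs


# Hypothetical sample data.
changes = diff_features(["a@example.com", "b@example.com"],
                        ["b@example.com", "c@example.com"])
runs = [(0, 512, "boot.bin"), (4096, 1024, "mail/inbox")]
stop_pairs = {("dev@example.org", "Maintainer: dev@example.org")}
```

With a context-sensitive check, the address `dev@example.org` is suppressed only when it appears in the known context; the same address in any other context is still reported.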
- Download Current Source (Source is available in the .tar.gz download.)
- View ChangeLog
- Downloads
- BEViewer
- External Links
- FAQ
- Licensing
Image data is available at http://digitalcorpora.org. Suggested images include the following:
- nps-2009-domexusers
- nps-2009-domexusers.redacted
- nps-2009-ubnist1.gen3
- Bulk Extractor Users Group: http://groups.google.com/group/bulk_extractor-users.