Removing junk from your data

Michael A. Cianfrocco edited this page Jun 9, 2018 · 2 revisions

When you start to collect large datasets, automated particle picking becomes an effective way to save time while still picking almost all of the particles in your images.

The downside of this, however, is that there will inevitably be 'junk' present in your data:

  • Stain / ice crystals
  • Edges of holes
  • Aggregates

These features will be picked by the automated picking program, so you will need a few extra steps to remove this 'junk' from your data.

Removing junk:

General principle

The general approach to removing 'junk' is to classify your data into a large number of classes, and then to discard any classes that don't look like protein.

For instance, in this publication the authors show that initial 2D classification of their cryo-EM dataset resulted in a subset of class averages that did not resemble their protein because they were too fuzzy, too small, or contained ice crystals:

Jiang, Chan, et al. Science (2015)

So, the authors discarded these 'bad' particles and then continued analysis with a 'cleaned up' dataset.

Running a single round of Auto_Align.py & removing bad particles

Running a single round of Auto_Align.py

The best way to sort 'junk' out of your particle stack is to perform only classification, without subsequent rotational alignment. This is because the iterative comparison of 'junk' class averages to the data will inevitably begin to mix 'junk' particles together with 'good' particles. Therefore, all we need to do is run Auto_Align.py for the first step (calculating class averages), and then exit from the routine.

You will run the same script Auto_Align.py, with the exception that you will:

  1. Provide a larger number of initial averages (aim for 10 particles or fewer per class)
  2. Provide a flag --oneIter to stop the program after the first classification step

You will also need to provide all of the other input options for Auto_Align.py. The value given to --final has no effect here, since the alignment stops after the first classification step.

Example: Let's assume that you have a particle stack of 20,000 particles. For this step of removing junk, you should classify your data into classes of 10 particles (or fewer) per class. This means that you should ask for 2,000 class averages initially, using the --start input option:

$ EM-processing/2D-classification/Auto_Align.py -i stack2_norm_sel.img -o alignJunk --num=1001 --iter=1 --start=2000 --final=200 --maskradius=50 --oneIter --spider --radius=50
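
The arithmetic behind the --start value can be sketched in a few lines of Python. The helper name initial_class_count is hypothetical and is not part of Auto_Align.py; it only divides the stack size by the target class size and rounds up, so that the average class never exceeds the target.

```python
import math


def initial_class_count(num_particles, particles_per_class=10):
    """Return how many classes to request so that each class holds,
    on average, at most `particles_per_class` particles.

    Hypothetical helper for illustration; not part of Auto_Align.py.
    """
    if num_particles <= 0 or particles_per_class <= 0:
        raise ValueError("both arguments must be positive")
    # Round up so the average class size never exceeds the target.
    return math.ceil(num_particles / particles_per_class)


# A 20,000-particle stack at 10 particles per class gives --start=2000.
print(initial_class_count(20000))
```

For stack sizes that are not an exact multiple of 10, rounding up keeps the average class size at or below the target.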

NOTE: While you would filter & mask your data for normal 2D classification, for sorting junk it is not advised because we want to classify particles based upon their entire dimensions.

Record bad classes & remove using remove_bad_classes.py

Now that you have finished running a single round of classification using Auto_Align.py, you will need to inspect the class averages to determine which averages are made up of particles that you do NOT want to include in subsequent processing steps.

Following the example listed above, you would open the class average stack using EMAN:

$ v2 alignJunk/auto_iteration_1/classsums1_avg.img

Now, at the same time, open a text file and record the numbers that you see next to the 'bad' class averages, one entry per line. To help keep track of your files, it is a good idea to name the text file after the class average stack, with a descriptive suffix and a .txt extension:

$ gedit alignJunk/auto_iteration_1/classsums1_avg_bad_avgs.txt& 
alignJunk/auto_iteration_1/classsums1_avg_bad_avgs.txt:: 
2
4
50
...

NOTE: The numbering scheme for removing 'bad' particles assumes that you wrote down the numbers exactly as they appear next to the averages. This numbering is zero-based: the first average is labeled '0', and the nth average is labeled 'n-1'.
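
Before passing the list on, it can help to check that every entry fits the zero-based convention described above. The following is a minimal sketch, assuming one integer per line; the function name parse_bad_list and its num_classes argument are hypothetical and are not part of the EM-processing scripts.

```python
def parse_bad_list(text, num_classes):
    """Parse one zero-based class number per line and validate the range.

    Hypothetical helper for illustration; not part of remove_bad_classes.py.
    """
    bad = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        index = int(entry)  # raises ValueError on non-numeric entries
        if not 0 <= index < num_classes:
            raise ValueError(
                f"line {line_no}: {index} is outside 0..{num_classes - 1}"
            )
        bad.append(index)
    # Drop duplicates and sort for easier review.
    return sorted(set(bad))


# With 2,000 class averages, valid labels run from 0 to 1999.
print(parse_bad_list("2\n4\n50\n", num_classes=2000))
```

An entry equal to the number of classes (for example 2000 here) is rejected, which catches lists that were accidentally written with one-based numbering.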

Now that you have this text file, you will run the script remove_bad_classes.py:

$ EM-processing/2D-classification/remove_bad_classes.py 
Usage: remove_bad_classes.py -i  -o  --num=[number of particles in stack]

Options:
  -h, --help            show this help message and exit
  --autoalignfolder=FOLDER
                        Input output folder from Auto_Align.py that contains
                        the auto_iteration_? folders.
  --stack=FILE          Input particle stack (.img) from which the averages
                        were calculated, and from which particles will be
                        removed
  --badlist=FILE        Text file containing list of 'bad' class averages,
                        where they are numbered according to the convention
                        where particle #1 is '0', etc.
  -d                    debug

By providing this script with the text file containing the 'bad' classes, in addition to the Auto Align output folder and particle stack, you can remove all particles that were in the bad classes.

Example:

$ EM-processing/2D-classification/remove_bad_classes.py --autoalignfolder=alignJunk --stack=stack2_norm_sel.img --badlist=alignJunk/auto_iteration_1/classsums1_avg_bad_avgs.txt 

This will output a new stack of particles named stack2_norm_sel_select.img.
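
Conceptually, the removal step keeps every particle whose class is not on the bad list. The sketch below illustrates that idea on plain Python lists. It is not the implementation of remove_bad_classes.py, and the class_of_particle list (one zero-based class number per particle) is an assumed stand-in for whatever bookkeeping the Auto_Align.py output folder actually contains.

```python
def keep_good_particles(class_of_particle, bad_classes):
    """Return zero-based indices of particles not assigned to a bad class.

    Illustration only; `class_of_particle` is a hypothetical per-particle
    list of class numbers, not a file format used by the scripts.
    """
    bad = set(bad_classes)
    return [
        particle
        for particle, cls in enumerate(class_of_particle)
        if cls not in bad
    ]


# Six particles assigned to classes 0-2; class 1 is marked as junk.
assignments = [0, 1, 2, 1, 0, 2]
print(keep_good_particles(assignments, bad_classes=[1]))
```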

Removing bad particles using Relion

First, run 2D classification with Relion, asking for 200 - 250 classes.

After this finishes, you can use Relion's 'Display' feature within the Relion GUI to select 'good' particles. This way you can avoid analyzing the particles that went into the 'bad' classes during 3D classification and refinement.

To do this, find the _model.star file associated with the iteration of 2D classification that you liked the most. For example, if the 22nd iteration of this 2D classification looked the best:

Class2D/run1_it022_model.star

Select this file using the Relion 'Display' feature.

Then, the following window will display the class averages. Select the class(es) that you like by left-clicking with your mouse (you'll see a red box appear around the classes that you selected). Then, right-click one of these selected classes and choose SAVE star with particles from selected classes:

Now, specify an output name for this .star file. To keep everything straight, it is helpful to give the selected file the same base name as the original file, but with a suffix like '_selected.star'.
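
If you script your bookkeeping, the naming convention above can be generated rather than typed. This is a small sketch using Python's standard pathlib module; the function name selected_star_name is hypothetical and is not a Relion feature.

```python
from pathlib import PurePosixPath


def selected_star_name(star_path, suffix="_selected"):
    """Insert `suffix` between the base name and the .star extension.

    Hypothetical helper for illustration; not part of Relion.
    """
    path = PurePosixPath(star_path)
    return str(path.with_name(path.stem + suffix + path.suffix))


print(selected_star_name("Class2D/run1_it022_model.star"))
```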
