diff --git a/README.md b/README.md index d7ce587..d311689 100644 --- a/README.md +++ b/README.md @@ -36,13 +36,13 @@ starparser --i particles.star --count ``` ``` -starparser --i particles.star --list_column _rlnOriginX +starparser --i particles.star --list_column OriginX ``` For some options, a second star file can also be passed as input ```--f secondfile.star```. ``` -starparser --i particles1.star --f particles2.star --find_shared _rlnMicrographName +starparser --i particles1.star --f particles2.star --find_shared MicrographName ``` The list of options are organized by [Data Mining](#mining), [Modifications](#modify), and [Plots](#plot). Arguments that are not required are surrounded by parentheses in the descriptions below. Do not include the parentheses in your arguments. @@ -65,7 +65,7 @@ Find particles that match a column header ```--c``` and query ```--q``` (see the **```--limit```** *```column/comparator/value```* -Extract particles that match a specific comparison (*lt* for less than, *gt* for greater than, *le* for less than or equal to, *ge* for greater than or equal to). The argument to pass is "column/comparator/value" (e.g. *\_rlnDefocusU/lt/40000* for defocus values less than 40000). +Extract particles that match a specific comparison (*lt* for less than, *gt* for greater than, *le* for less than or equal to, *ge* for greater than or equal to). The argument to pass is "column/comparator/value" (e.g. *DefocusU/lt/40000* for defocus values less than 40000). **```--count```** *`(--c column --q query (--e))`* @@ -77,7 +77,7 @@ Count the number of unique micrographs and display the result. Optionally, this **```--list_column```** *```column-name(s)```* *`(--c column --q query (--e))`* -Write all values of a column to a file. For example, passing *\_rlnMicrographName* will write all values to MicrographName.txt. 
To output multiple columns, separate the column names with a slash (for example, *\_rlnMicrographName/\_rlnCoordinateX* outputs MicrographName.txt and CoordinateX.txt). Optionally, this can be used with ```--c``` and ```--q``` to only consider particles that match the query (see the [*Querying*](#query) options), otherwise it lists all values. +Write all values of a column to a file. For example, passing *MicrographName* will write all values to MicrographName.txt. To output multiple columns, separate the column names with a slash (for example, *MicrographName/CoordinateX* outputs MicrographName.txt and CoordinateX.txt). Optionally, this can be used with ```--c``` and ```--q``` to only consider particles that match the query (see the [*Querying*](#query) options), otherwise it lists all values. **```--find_shared```** *```column-name```* *`--f otherfile.star`* @@ -129,21 +129,21 @@ Split the input star file into independent star files for each optics group. The **```--sort_by```** *```column-name(/n)```* -Sort the columns in ascending order according to the column passed here. Outputs a new file to output.star (or specified with ```--o```). Add a slash followed by "*n*" if the column contains numeric values (e.g. *\_rlnClassNumber/n*); otherwise, it will sort the values as text. +Sort the columns in ascending order according to the column passed here. Outputs a new file to output.star (or specified with ```--o```). Add a slash followed by "*n*" if the column contains numeric values (e.g. *ClassNumber/n*); otherwise, it will sort the values as text. ### Modification Options **```--operate```** *```column-name[operator]value```* -Perform operation on all values of a column. The argument to pass is column[operator]value (without the brackets and without any spaces); operators include "\*", "/", "+", and "-" (e.g. *\_rlnHelicalTrackLength\*0.25*). The result is written to a new star file (default output.star, or specified with ```--o```). 
If your terminal throws an error, try surrounding the argument with quotations (e.g. *"\_rlnHelicalTrackLength\*0.25"*). +Perform operation on all values of a column. The argument to pass is column[operator]value (without the brackets and without any spaces); operators include "\*", "/", "+", and "-" (e.g. *HelicalTrackLength\*0.25*). The result is written to a new star file (default output.star, or specified with ```--o```). If your terminal throws an error, try surrounding the argument with quotations (e.g. *"HelicalTrackLength\*0.25"*). **```--operate_columns```** *```column1[operator]column2=newcolumn```* -Perform operation between two columns and write to a new column. The argument to pass is column1[operator]column2=newcolumn (without the brackets and without any spaces); operators include "\*", "/", "+", and "-" (e.g. *\_rlnCoordinateX+\_rlnOriginX=\_rlnShiftedX*). If your terminal throws an error, try surrounding the argument with quotations (e.g. *"\_rlnCoordinateX+\_rlnOriginX=\_rlnShiftedX"*). +Perform operation between two columns and write to a new column. The argument to pass is column1[operator]column2=newcolumn (without the brackets and without any spaces); operators include "\*", "/", "+", and "-" (e.g. *CoordinateX+OriginX=ShiftedX*). If your terminal throws an error, try surrounding the argument with quotations (e.g. *"CoordinateX+OriginX=ShiftedX"*). **```--remove_column```** *```column-name(s)```* -Remove column, renumber headers, and write to a new star file (default output.star, or specified with ```--o```). E.g. *\_rlnMicrographName*. To enter multiple columns, separate them with a slash: *\_rlnMicrographName/\_rlnCoordinateX*. Note that "relion_star_handler --remove_column" also does this. +Remove column, renumber headers, and write to a new star file (default output.star, or specified with ```--o```). E.g. *MicrographName*. To enter multiple columns, separate them with a slash: *MicrographName/CoordinateX*. 
Note that "relion_star_handler --remove_column" also does this. **```--remove_particles```** *`--c column --q query (--e)`* @@ -151,7 +151,7 @@ Remove particles that match a query (specified with ```--q```) within a column h **```--remove_duplicates```** *```column-name```* -Remove duplicate particles based on the column provided here (e.g. *\_rlnImageName*) (one instance of the duplicate is retained). +Remove duplicate particles based on the column provided here (e.g. *ImageName*) (one instance of the duplicate is retained). **```--remove_mics_fromlist```** *`--f micrographs.txt`* @@ -167,41 +167,41 @@ Replace all entries of a column with a list of values found in the file provided **```--copy_column```** *```source-column/target-column```* -Replace all entries of a target column with those of a source column in the same star file. If the target column does not exist, a new column will be made. The argument to pass is source-column/target-column (e.g. *\_rlnAngleTiltPrior/\_rlnAngleTilt*). The result is written to a new star file (default output.star, or specified with ```--o```) +Replace all entries of a target column with those of a source column in the same star file. If the target column does not exist, a new column will be made. The argument to pass is source-column/target-column (e.g. *AngleTiltPrior/AngleTilt*). The result is written to a new star file (default output.star, or specified with ```--o```) **```--reset_column```** *```column-name/new-value```* -Change all values of a column to the one provided here. The argument to pass is column-name/new-value (e.g. *\_rlnOriginX/0*). The result is written to a new star file (default output.star, or specified with ```--o```) +Change all values of a column to the one provided here. The argument to pass is column-name/new-value (e.g. *OriginX/0*). 
The result is written to a new star file (default output.star, or specified with ```--o```) **```--swap_columns```** *```column-name(s)```* *`--f otherfile.star`* -Swap columns from another star file (specified with ```--f```). For example, pass *\_rlnMicrographName* to swap that column. To enter multiple columns, separate them with a slash: *\_rlnMicrographName/\_rlnCoordinateX*. Note that the total number of particles should match. The result is written to a new star file (default output.star, or specified with ```--o```). +Swap columns from another star file (specified with ```--f```). For example, pass *MicrographName* to swap that column. To enter multiple columns, separate them with a slash: *MicrographName/CoordinateX*. Note that the total number of particles should match. The result is written to a new star file (default output.star, or specified with ```--o```). **```--insert_optics_column```** *```column-name/value```* -Insert a new column in the optics table with the name and value provided (e.g. *\_rlnAmplitudeContrast/0.1*). The value will populate all rows of the optics table. The result is written to a new star file (default output.star, or specified with ```--o```). +Insert a new column in the optics table with the name and value provided (e.g. *AmplitudeContrast/0.1*). The value will populate all rows of the optics table. The result is written to a new star file (default output.star, or specified with ```--o```). **```--fetch_from_nearby```** *```distance/column-name(s)```* *`--f otherfile.star`* -Find the nearest particle in a second star file (specified with ```--f```) and if it is within a threshold distance, retrieve its column value to replace the original particle column value. The argument to pass is distance/column-name(s) (e.g. *300/\_rlnClassNumber* or *100/\_rlnAnglePsi/\_rlnHelicalTubeID*). Outputs to output.star (or specified with ```--o```). Particles that couldn't be matched to a neighbor will be skipped (i.e. 
if the second star file lacks particles in that micrograph). The micrograph paths from \_rlnMicrographName do not necessarily need to match, just the filenames need to. +Find the nearest particle in a second star file (specified with ```--f```) and if it is within a threshold distance, retrieve its column value to replace the original particle column value. The argument to pass is distance/column-name(s) (e.g. *300/ClassNumber* or *100/AnglePsi/HelicalTubeID*). Outputs to output.star (or specified with ```--o```). Particles that couldn't be matched to a neighbor will be skipped (i.e. if the second star file lacks particles in that micrograph). The micrograph paths from MicrographName do not necessarily need to match, just the filenames need to. **```--import_mic_values```** *```column-name(s)```* *`--f otherfile.star`* -For every particle, find the micrograph that it belongs to in a second star file (specified with ```--f```) and replace the original column value with that of the second star file (e.g. *\_rlnOpticsGroup*). This requires that the second star file only has one instance of each micrograph name (e.g. a micrograph star file like micrographs_ctf.star). To import multiple columns, separate them with a slash. The result is written to a new star file (default output.star, or specified with ```--o```). +For every particle, find the micrograph that it belongs to in a second star file (specified with ```--f```) and replace the original column value with that of the second star file (e.g. *OpticsGroup*). The paths do not have to be identical, just the micrograph filename itself. To import multiple columns, separate them with a slash. The result is written to a new star file (default output.star, or specified with ```--o```). **```--import_particle_values```** *```column-name(s)```* *`--f otherfile.star`* -For every particle in the input star file, find the equivalent particle in a second star file (specified with ```--f```) (i.e. 
those with equivalent *\_rlnImageName* values) and replace the original column value with the one from the second star file. To import multiple columns, separate them with a slash. +For every particle in the input star file, find the equivalent particle in a second star file (specified with ```--f```) (i.e. those with identical *ImageName* values) and replace the original column value with the one from the second star file. To import multiple columns, separate them with a slash. **```--regroup```** *```particles-per-group```* -Regroup particles such that those with similar defocus values are in the same group (the number of particles per group is specified here) and write to a new star file (default output.star, or specified with ```--o```). Any value can be entered. This is useful if there aren't enough particles in each micrograph to make meaningful groups. This only works if *\_rlnGroupNumber* is being used in the star file rater than *\_rlnGroupName*. Note that Subset selection in Relion should be used for regrouping if possible (which groups on the \*\_model.star intensity scale factors). +Regroup particles such that those with similar defocus values are in the same group (the number of particles per group is specified here) and write to a new star file (default output.star, or specified with ```--o```). Any value can be entered. This is useful if there aren't enough particles in each micrograph to make meaningful groups. This only works if *GroupNumber* is being used in the star file rather than *GroupName*. Note that Subset selection in Relion should be used for regrouping if possible (which groups on the \*\_model.star intensity scale factors). **```--swap_optics```** Swap the optics table with that of another star file provided by ```--f```. The result is written to a new star file (default output.star, or specified with ```--o```). 
-**```--new_optics```** *```optics-group-name```* *`--c column --q query (--e)`* +**```--new_optics```** *```opticsgroup-name```* *`--c column --q query (--e)`* Provide a new optics group name. Use ```--c``` and ```--q``` to specify which particles belong to this optics group (see the [*Querying*](#query) options). The optics values from the last entry of the optics table will be duplicated. The result is written to a new star file (default output.star, or specified with ```--o```). @@ -221,7 +221,7 @@ Plot values of a column as a histogram. Optionally, use ```--c``` and ```--q``` **```--plot_orientations```** *`(--c column --q query (--e))`* -Plot the particle orientations based on the *\_rlnAngleRot* and *\_rlnAngleTilt* columns on a Mollweide projection (longitude and latitude, respectively). Optionally, use ```--c``` and ```--q``` to only plot a subset of particles, otherwise it will plot all. The result will be saved to Particle_orientations.png. Use ```--t``` to change the file type (see the [*Output*](#output) options). +Plot the particle orientations based on the *AngleRot* and *AngleTilt* columns on a Mollweide projection (longitude and latitude, respectively). Optionally, use ```--c``` and ```--q``` to only plot a subset of particles, otherwise it will plot all. The result will be saved to Particle_orientations.png. Use ```--t``` to change the file type (see the [*Output*](#output) options). **```--plot_class_iterations```** *```classes```* @@ -233,13 +233,13 @@ Find the proportion of particle sets that belong to each class. Pass at least tw **```--plot_coordinates```** *```number-of-micrographs(/circle-size)```* - Plot the particle coordinates for the input star file for each micrograph in a multi-page pdf (red circles). The argument to pass is the total number of micrographs to plot (pass \"all\" to plot all micrographs, but it might take a long time if there are many). 
Make sure you are running it in the Relion directory so that the micrograph .mrc files can be properly sourced (or change the *\_rlnMicrographName* column to absolute paths). Use ```--f``` to overlay the coordinates of a second star file (larger blue circles); in this case, the micrograph names should match between the two star files. Optionally, pass the desired size of the circle after a slash (e.g. *1/250* for 1 micrograph and a circle size of 250 pixels). The plots are written to Coordinates.pdf. + Plot the particle coordinates for the input star file for each micrograph in a multi-page pdf (red circles). The argument to pass is the total number of micrographs to plot (pass \"all\" to plot all micrographs, but it might take a long time if there are many). Make sure you are running it in the Relion directory so that the micrograph .mrc files can be properly sourced (or change the *MicrographName* column to absolute paths). Use ```--f``` to overlay the coordinates of a second star file (larger blue circles); in this case, the micrograph names should match between the two star files. Optionally, pass the desired size of the circle after a slash (e.g. *1/250* for 1 micrograph and a circle size of 250 pixels). The plots are written to Coordinates.pdf. ### Querying **```--c```** *```column-name(s)```* -Column query term(s). E.g. *\_rlnMicrographName*. This is used to look for a specific query specified with ```--q```. In cases where you can enter multiple columns, separate them with a slash: *\_rlnMicrographName/\_rlnCoordinateX*. +Column query term(s). E.g. *MicrographName*. This is used to look for a specific query specified with ```--q```. In cases where you can enter multiple columns, separate them with a slash: *MicrographName/CoordinateX*. **```--q```** *```query(ies)```* @@ -273,6 +273,8 @@ File type of the plot that will be written. 
Choose between png, jpg, svg, and pd * The term *particles* here refers to rows in a star file, but the star files don't need to contain particles (e.g. parsing movies in a *movies.star* file). +* Columns can be specified by their full or short names (e.g. \_rlnColumnName, rlnColumnName, or ColumnName). If scripting with the starparser package, columns are specified as their full name (i.e. \_rlnColumnName). + * If the star file lacks an optics table, such as those from Relion 3.0, add the ```--opticsless``` option to parse it. --- @@ -417,7 +419,7 @@ fileparser.writestar(particles[particles.index.isin(keeplist)], metadata, "parti * Plot a histogram of defocus values. ``` -starparser --i run_data.star --histogram _rlnDefocusU +starparser --i run_data.star --histogram DefocusU ```      → Output figure to **DefocusU.png**: @@ -446,10 +448,10 @@ starparser --i run_it025_data.star --plot_class_iterations all --- -* Plot the proportion of particles in each class that belong to particles with the term 200702 versus those with the term 200826 in the \_rlnMicrographName column. +* Plot the proportion of particles in each class that belong to particles with the term 200702 versus those with the term 200826 in the MicrographName column. ``` -starparser --i run_it025_data.star --plot_class_proportions --c _rlnMicrographName --q 200702/200826 +starparser --i run_it025_data.star --plot_class_proportions --c MicrographName --q 200702/200826 ```      → The percentage in each class will be displayed in terminal. @@ -478,7 +480,7 @@ starparser --i particles.star --f select_particles.star --plot_coordinates 1/200 **Remove columns** ``` -starparser --i run_data.star --o run_data_del.star --remove_column _rlnCtfMaxResolution/_rlnCtfFigureOfMerit +starparser --i run_data.star --o run_data_del.star --remove_column CtfMaxResolution/CtfFigureOfMerit ```      → A new star file named **run_data_del.star** will be identical to run_data.star except will be missing those two columns. 
The headers in the particles table will be renumbered. @@ -487,27 +489,27 @@ starparser --i run_data.star --o run_data_del.star --remove_column _rlnCtfMaxRes **Remove a subset of particles** ``` -starparser --i run_data.star --o run_data_del.star --remove_particles --c _rlnMicrographName --q 200702/200715 +starparser --i run_data.star --o run_data_del.star --remove_particles --c MicrographName --q 200702/200715 ``` -     → A new star file named **run_data_del.star** will be identical to run_data.star except will be missing any particles that have the term 200702 or 2000715 in the \_rlnMicrographName column. In this case, this was useful to remove particles from specific data-collection days that had the date in the filename. +     → A new star file named **run_data_del.star** will be identical to run_data.star except will be missing any particles that have the term 200702 or 200715 in the MicrographName column. In this case, this was useful to remove particles from specific data-collection days that had the date in the filename. --- **Replace values in a column with those of a text file** ``` -starparser --i particles.star --replace_column _rlnOpticsGroup --f newoptics.txt --o particles_newoptics.star +starparser --i particles.star --replace_column OpticsGroup --f newoptics.txt --o particles_newoptics.star ``` -     → A new star file named **particles_newoptics.star** will be output that will be identical to particles.star except for the \_rlnOpticsGroup column, which will have the values found in newoptics.txt. +     → A new star file named **particles_newoptics.star** will be output that will be identical to particles.star except for the OpticsGroup column, which will have the values found in newoptics.txt. 
--- **Swap columns** ``` -starparser --i run_data.star --f run_data_2.star --o run_data_swapped.star --swap_columns _rlnAnglePsi/_rlnAngleRot/_rlnAngleTilt/_rlnNormCorrection/_rlnLogLikeliContribution/_rlnMaxValueProbDistribution/_rlnNrOfSignificantSamples/_rlnOriginXAngst/_rlnOriginYAngst +starparser --i run_data.star --f run_data_2.star --o run_data_swapped.star --swap_columns AnglePsi/AngleRot/AngleTilt/NormCorrection/LogLikeliContribution/MaxValueProbDistribution/NrOfSignificantSamples/OriginXAngst/OriginYAngst ```      → A new star file named **run_data_swapped.star** will be output that will be identical to run_data.star except for the columns in the input, which will instead be swapped in from run_data_2.star. This is useful for sourcing alignments from early global refinements. @@ -520,17 +522,17 @@ starparser --i run_data.star --f run_data_2.star --o run_data_swapped.star --swa starparser --i run_data.star --o run_data_regroup200.star --regroup 200 ``` -     → A new star file named **run_data_regroup200.star** will be output that will be identical to run_data.star except for the \_rlnGroupNumber or \_rlnGroupName columns, which will be renumbered to have 200 particles per group. +     → A new star file named **run_data_regroup200.star** will be output that will be identical to run_data.star except for the GroupNumber or GroupName columns, which will be renumbered to have 200 particles per group. 
--- **Create a new optics group for a subset of particles** ``` -starparser --i run_data.star --o run_data_newoptics.star --new_optics myopticsname --c _rlnMicrographName --q 10090 +starparser --i run_data.star --o run_data_newoptics.star --new_optics myopticsname --c MicrographName --q 10090 ``` -     → A new star file named **run_data_newoptics.star** will be output that will be identical to run_data.star except that a new optics group called *myopticsname* will be created in the optics table and particles with the term 10090 in the \_rlnMicrographName column will have modified \_rlnOpticsGroup and/or \_rlnOpticsName columns to match the new optics group. +     → A new star file named **run_data_newoptics.star** will be output that will be identical to run_data.star except that a new optics group called *myopticsname* will be created in the optics table and particles with the term 10090 in the MicrographName column will have modified OpticsGroup and/or OpticsName columns to match the new optics group. --- @@ -540,7 +542,7 @@ starparser --i run_data.star --o run_data_newoptics.star --new_optics myopticsna starparser --i run_data.star --o run_data_3p0.star --relegate ``` -     → A new star file named **run_data_3p0.star** will be output that will be identical to run_data.star except will be missing the optics table and \_rlnOpticsGroup column. The headers in the particles table will be renumbered accordingly. +     → A new star file named **run_data_3p0.star** will be output that will be identical to run_data.star except will be missing the optics table and OpticsGroup column. The headers in the particles table will be renumbered accordingly. 
--- @@ -549,7 +551,7 @@ starparser --i run_data.star --o run_data_3p0.star --relegate **Extract a subset of particles** ``` -starparser --i run_data.star --o run_data_c1.star --extract --c _rlnClassNumber --q 1 --e +starparser --i run_data.star --o run_data_c1.star --extract --c ClassNumber --q 1 --e ```      → A new star file named **run_data_c1.star** will be output with only particles that belong to class 1. The `--e` option was passed to avoid extracting any class with the number 1 in it, such as "10", "11", etc. @@ -559,7 +561,7 @@ starparser --i run_data.star --o run_data_c1.star --extract --c _rlnClassNumber **Extract particles with limited defoci** ``` -starparser --i run_data.star --o run_data_under4um.star --limit _rlnDefocusU/lt/40000 +starparser --i run_data.star --o run_data_under4um.star --limit DefocusU/lt/40000 ```      → A new star file named **run_data_under4um.star** will be output with only particles that have defocus estimations below 4 microns. @@ -581,7 +583,7 @@ starparser --i particles.star --o particles_remove_weird_poses.star --remove_pos **Count specific particles** ``` -starparser --i particles.star --o output.star --count --c _rlnMicrographName --q 200702/200715 +starparser --i particles.star --o output.star --count --c MicrographName --q 200702/200715 ```      → *There are 7726 particles that match ['200702', '200715'] in the specified columns (out of 69120, or 11.2%).* @@ -601,7 +603,7 @@ starparser --i run_data.star --count_mics **Count the number of micrographs for specific particles** ``` -starparser --i run_data.star --count_mics --c _rlnMicrographName --q 200826 +starparser --i run_data.star --count_mics --c MicrographName --q 200826 ```      → *Creating a subset of 2358 particles that match ['200826'] in the columns ['\_rlnMicrographName'] \(or 3.4%\)* @@ -613,40 +615,40 @@ starparser --i run_data.star --count_mics --c _rlnMicrographName --q 200826 **List all items from a column in a text file** ``` -starparser --i 
run_data.star --list_column _rlnMicrographName +starparser --i run_data.star --list_column MicrographName ``` -     → All entries of \_rlnMicrographName will be written to *MicrographName.txt* in a single column. +     → All entries of MicrographName will be written to *MicrographName.txt* in a single column. --- **List all items from multiple columns in independent text files** ``` -starparser --i run_data.star --list_column _rlnDefocusU/_rlnCoordinateX +starparser --i run_data.star --list_column DefocusU/CoordinateX ``` -     → All entries of \_rlnDefocusU will be written to *DefocusU.txt* and all entries of \_rlnCoordinateX will be written to *CoordinateX.txt*. +     → All entries of DefocusU will be written to *DefocusU.txt* and all entries of CoordinateX will be written to *CoordinateX.txt*. --- **List all items from a column that match specific particles** ``` -starparser --i run_data.star --list_column _rlnDefocusU --c _rlnMicrographName --q 200826 +starparser --i run_data.star --list_column DefocusU --c MicrographName --q 200826 ``` -     → Only \_rlnDefocusU entries that have 200826 in \_rlnMicrographName will be written to *DefocusU.txt*. +     → Only DefocusU entries that have 200826 in MicrographName will be written to *DefocusU.txt*. --- **Compare particles between star files and extract those that are shared and unique** ``` -starparser --i run_data1.star --find_shared _rlnMicrographName --f run_data2.star +starparser --i run_data1.star --find_shared MicrographName --f run_data2.star ``` -     → Two new star files will be created named shared.star and unique.star that will have only the particles that are unique to run_data1.star relative to run_data2.star (unique.star) and only the particles that are shared between them (shared.star) based on the \_rlnMicrographName column. 
+     → Two new star files will be created named shared.star and unique.star that will have only the particles that are unique to run_data1.star relative to run_data2.star (unique.star) and only the particles that are shared between them (shared.star) based on the MicrographName column. --- @@ -666,10 +668,10 @@ starparser --i particles1.star --f particles2.star --extract_if_nearby 650 **Extract a random set of specific particles** ``` -starparser --i run_it025_data.star --extract_random 10000 --c _rlnMicrographName --q DOA3/OAA2 +starparser --i run_it025_data.star --extract_random 10000 --c MicrographName --q DOA3/OAA2 ``` -     → Two new star files will be created named DOA3_10000.star and OAA2_10000.star that will have a random set of 10000 particles that match DOA3 and OAA2 in the \_rlnMicrographName column, respectively. +     → Two new star files will be created named DOA3_10000.star and OAA2_10000.star that will have a random set of 10000 particles that match DOA3 and OAA2 in the MicrographName column, respectively. 
--- diff --git a/starparser/__init__.py b/starparser/__init__.py index 10747b2..42849d6 100644 --- a/starparser/__init__.py +++ b/starparser/__init__.py @@ -1,4 +1,4 @@ import os -__version__ = '1.49' +__version__ = '1.50' _ROOT = os.path.abspath(os.path.dirname(__file__)) \ No newline at end of file diff --git a/starparser/columnplay.py b/starparser/columnplay.py index 44e483b..90465ac 100644 --- a/starparser/columnplay.py +++ b/starparser/columnplay.py @@ -88,62 +88,6 @@ def swapcolumns(original_particles, swapfrom_particles, columns): return(swappedparticles) -""" ---import_mic_values -""" -def importmicvalues(importedparticles, importfrom_particles, column): - - #~needs explanation~# - - dropflag = False - - if "/" in importedparticles['_rlnMicrographName'][0]: - importedparticles["_rlnMicrographNameSimple"] = importedparticles['_rlnMicrographName'] - for idx, row in importedparticles.iterrows(): - micname = importedparticles.loc[idx,"_rlnMicrographName"] - importedparticles.loc[idx,"_rlnMicrographNameSimple"] = micname[micname.rfind("/")+1:] - - importedparticles = importedparticles.set_index('_rlnMicrographNameSimple') - - dropflag = True - - else: - - importedparticles = importedparticles.set_index('_rlnMicrographName') - - ## - - if "/" in importfrom_particles['_rlnMicrographName'][0]: - importfrom_particles["_rlnMicrographNameSimple"] = importfrom_particles['_rlnMicrographName'] - for idx, row in importfrom_particles.iterrows(): - micname = importfrom_particles.loc[idx,"_rlnMicrographName"] - importfrom_particles.loc[idx,"_rlnMicrographNameSimple"] = micname[micname.rfind("/")+1:] - - importfrom_particles = importfrom_particles[["_rlnMicrographNameSimple", column]] - importfrom_particles = importfrom_particles.set_index('_rlnMicrographNameSimple') - - else: - - importfrom_particles = importfrom_particles[["_rlnMicrographName", column]] - importfrom_particles = importfrom_particles.set_index('_rlnMicrographName') - - 
importedparticles.update(importfrom_particles) - - importedparticles.reset_index(inplace=True) - - #If we created a new column with simple micrograph names, we - #need to remove it - if dropflag: - - """ - The .drop can be used to drop a whole column. - The "1" tells .drop that it is the column axis that we want to drop - inplace means we want the dataframe to be modified instead of creating an assignment - """ - importedparticles.drop("_rlnMicrographNameSimple", axis=1, inplace=True) - - return(importedparticles) - """ --operate """ @@ -166,11 +110,11 @@ def operate(particles,column,operator,value): """ if operator == "multiply": - print("\n>> Multiplying all values in " + column + " by " + str(value) + ".") + print("\n>> Multiplying all values in " + column + " by " + str(value) + ".") particles[column] = particles[column] * value elif operator == "divide": - print("\n>> Dividing all values in " + column + " by " + str(value) + ".") + print("\n>> Dividing all values in " + column + " by " + str(value) + ".") particles[column] = particles[column] / value elif operator == "add": diff --git a/starparser/decisiontree.py b/starparser/decisiontree.py index 593d674..168c547 100644 --- a/starparser/decisiontree.py +++ b/starparser/decisiontree.py @@ -31,6 +31,15 @@ def decide(): if 'file' not in params: print("\n>> Error: no filename entered. See the help page (-h).\n") sys.exit() + + #Modify column name to be _rlnXXX regardless of the format input by the user (rln, _rln or no prefix) + if params["parser_column"] != "": + tempsplit = params["parser_column"].split("/") + for i,c in enumerate(tempsplit): + tempsplit[i] = makefullname(c) + params["parser_column"]='/'.join(tempsplit) + + print(params["parser_column"]) #This is a rare ocurance, but it's possible that the user asks to delete the _rlnOpticsGroup column as well as pass the --relegate option, which is redundant. 
if "_rlnOpticsGroup" in params["parser_column"] and params["parser_relegate"]: @@ -203,7 +212,9 @@ def decide(): """ if params["parser_delcolumn"] != "": - columns = params["parser_delcolumn"].split("/") + columns = makefullname(params["parser_delcolumn"]).split("/") + for i,c in enumerate(columns): + columns[i] = makefullname(c) newparticles, metadata = columnplay.delcolumn(allparticles, columns, metadata) print("\n>> Removed the columns " + str(columns)) fileparser.writestar(newparticles, metadata, params["parser_outname"], relegateflag) @@ -228,7 +239,7 @@ def decide(): """ if params["parser_delduplicates"] != "": - column = params["parser_delduplicates"] + column = makefullname(params["parser_delduplicates"]) if column not in allparticles: print("\n>> Error: the column " + str(column) + " does not exist in your star file.\n") sys.exit() @@ -268,6 +279,9 @@ def decide(): """ if params["parser_swapcolumns"] != "": + columnstoswap = params["parser_swapcolumns"].split("/") + for i,c in enumerate(columnstoswap): + columnstoswap[i]=makefullname(c) if params["parser_file2"] == "": print("\n>> Error: provide a second file with --f to swap columns from.\n") sys.exit() @@ -277,7 +291,6 @@ def decide(): sys.exit(); print("\n>> Reading " + file2) otherparticles, metadata2 = fileparser.getparticles(file2) - columnstoswap = params["parser_swapcolumns"].split("/") swappedparticles = columnplay.swapcolumns(allparticles, otherparticles, columnstoswap) print("\n>> Swapped in " + str(columnstoswap) + " from " + file2) fileparser.writestar(swappedparticles, metadata, params["parser_outname"], relegateflag) @@ -298,6 +311,8 @@ def decide(): print("\n>> Reading " + file2) otherparticles, metadata2 = fileparser.getparticles(file2) columnstoimport = params["parser_importmicvalues"].split("/") + for i,c in enumerate(columnstoimport): + columnstoimport[i]=makefullname(c) for column in columnstoimport: if column not in allparticles: @@ -322,7 +337,7 @@ def decide(): """ importedparticles 
= allparticles.copy() for column in columnstoimport: - importedparticles = columnplay.importmicvalues(importedparticles, otherparticles, column) + importedparticles = particleplay.importmicvalues(importedparticles, otherparticles, column) fileparser.writestar(importedparticles, metadata, params["parser_outname"], relegateflag) sys.exit() @@ -370,6 +385,8 @@ def decide(): print("\n>> Reading " + file2) otherparticles, metadata2 = fileparser.getparticles(file2) columnstoimport = params["parser_importpartvalues"].split("/") + for i,c in enumerate(columnstoimport): + columnstoimport[i] = makefullname(c) for column in columnstoimport: if column not in allparticles: @@ -417,6 +434,7 @@ def decide(): print("\n>> Error: the argument to pass is column[operator]value (e.g. _rlnHelicalTrackLength*0.25).\n") sys.exit() column, value = arguments + column = makefullname(column) try: value = float(value) except ValueError: @@ -455,12 +473,15 @@ def decide(): sys.exit() column1, secondhalf = arguments + column1 = makefullname(column1) try: column2 = secondhalf.split("=")[0] + column2 = makefullname(column2) newcolumn = secondhalf.split("=")[1] + newcolumn = makefullname(newcolumn) except IndexError: - print("\n>> Error: the argument to pass is column1[operator]column2=newcolumn (e.g. _rlnCoordinateX*_rlnOriginX=_rlnShifted).\n") + print("\n>> Error: the argument to pass is column1[operator]column2=newcolumn (e.g. CoordinateX*OriginX=Shifted). 
Try using quotations around it if it doesn't work\n") sys.exit() if column1 not in allparticles: @@ -484,7 +505,7 @@ def decide(): """ if params["parser_findshared"] != "": - columntocheckunique = params["parser_findshared"] + columntocheckunique = makefullname(params["parser_findshared"]) if columntocheckunique not in allparticles.columns: print("\n>> Error: could not find the " + columntocheckunique + " column in " + filename + ".\n") sys.exit() @@ -499,11 +520,13 @@ def decide(): otherparticles, f2metadata = fileparser.getparticles(file2) unsharedparticles = allparticles[~allparticles[columntocheckunique].isin(otherparticles[columntocheckunique])] sharedparticles = allparticles[allparticles[columntocheckunique].isin(otherparticles[columntocheckunique])] - if params["parser_findshared"] != "": - print("\nShared: \n-------\n" + str(len(sharedparticles.index)) + " particles are shared between " + filename + " and " + file2 + " in the " + str(columntocheckunique) + " column.\n") - fileparser.writestar(sharedparticles, metadata, "shared.star", relegateflag) - print("Unique: \n-------\n·" + filename + ": " + str(len(unsharedparticles.index)) + " particles (these will be written to unique.star)\n·" + file2 + ": " + str(len(otherparticles.index) - len(sharedparticles.index)) + " particles\n") + + print("\nShared: \n-------\n" + str(len(sharedparticles.index)) + " particles are shared between " + filename + " and " + file2 + " in the " + str(columntocheckunique) + " column.\n") + fileparser.writestar(sharedparticles, metadata, "shared.star", relegateflag) + print("Unique: \n-------\n·" + filename + ": " + str(len(unsharedparticles.index)) + " particles (written to unique.star if non-zero)\n·" + file2 + ": " + str(len(otherparticles.index) - len(sharedparticles.index)) + " particles\n") + if not unsharedparticles.empty: fileparser.writestar(unsharedparticles, metadata, "unique.star", relegateflag) + sys.exit() @@ -572,6 +595,8 @@ def decide(): print("\n>> Error: provide 
argument in this format: distance/column(s) (e.g. 300/_rlnClassNumber).") sys.exit() columnstoretrieve = retrieveparams[1:] + for i,c in enumerate(columnstoretrieve): + columnstoretrieve[i]=makefullname(c) for c in columnstoretrieve: if c not in allparticles: print("\n>> Error: " + c + " does no exist in the input star file.\n") @@ -670,7 +695,7 @@ def decide(): if len(parsedinput) != 3: print("\n>> Error: provide argument in this format: column/operator/value (e.g. _rlnDefocusU/lt/40000).\n") sys.exit() - columntocheck = parsedinput[0] + columntocheck = makefullname(parsedinput[0]) operator = parsedinput[1] limit = float(parsedinput[2]) if operator not in ["lt", "gt", "le", "ge"]: @@ -779,7 +804,7 @@ def decide(): if params["parser_file2"] == "": print("\n>> Error: provide a second file with --f that has the list of values.\n") sys.exit() - insertcol = params["parser_insertcol"] + insertcol = makefullname(params["parser_insertcol"]) if insertcol in allparticles.columns: print("\n>> Error: the column " + str(insertcol) + " already exists in your star file. 
Use --replace_column if you would like to replace it.\n") sys.exit() @@ -795,12 +820,17 @@ def decide(): fileparser.writestar(allparticles, metadata, params["parser_outname"], relegateflag) sys.exit() + """ + --insert_optics_column + """ + if params["parser_insertopticscol"] != "": try: new_header, value = params["parser_insertopticscol"].split("/") except ValueError: print("\n>> Error: the argument to pass is column-name/value.\n") sys.exit() + new_header = makefullname(new_header) print("\n Creating the column " + new_header + " in the optics table with the value " + value) metadata[2][new_header]=value @@ -868,7 +898,7 @@ def decide(): if params["parser_file2"] == "": print("\n>> Error: provide a second file with --f that has the list of values.\n") sys.exit() - replacecol = params["parser_replacecol"] + replacecol = makefullname(params["parser_replacecol"]) if replacecol not in allparticles.columns: print("\n>> Error: the column " + str(replacecol) + " does not exist in your star file.\n") sys.exit() @@ -894,6 +924,8 @@ def decide(): print("\n>> Error: the input should be source-column/target-column.\n") sys.exit() sourcecol, targetcol = inputparams + sourcecol = makefullname(sourcecol) + targetcol = makefullname(targetcol) if sourcecol not in allparticles: print("\n>> Error: " + sourcecol + " does not exist in the star file.\n") sys.exit() @@ -912,6 +944,7 @@ def decide(): print("\n>> Error: the input should be column-name/value.\n") sys.exit() columntoreset, value = inputparams + columntoreset = makefullname(columntoreset) if columntoreset not in allparticles: print("\n>> Error: " + columntoreset + " does not exist in the star file.\n") sys.exit() @@ -920,12 +953,12 @@ def decide(): sys.exit() """ - --sort + --sort_by """ if params["parser_sort"] != "": inputparams = params["parser_sort"].split("/") - sortcol = inputparams[0] + sortcol = makefullname(inputparams[0]) if sortcol not in allparticles.columns: print("\n>> Error: the column " + str(sortcol) + " 
does not exist in your star file.\n") sys.exit() @@ -934,7 +967,7 @@ def decide(): try: pd.to_numeric(allparticles[sortcol].iloc[0], downcast="float") print("\n----------------------------------------------------------------------") - print("\n>> Warning: it looks like this column is numeric but you haven't specified so.\n Make sure that this is the behavior you intended. Otherwise, use \"column/n\".\n") + print("\n>> Warning: it looks like this column is numeric but you haven't specified so.\n>> Make sure that this is the behavior you intended. Otherwise, use \"column/n\".\n") print("----------------------------------------------------------------------") except ValueError: pass @@ -945,7 +978,7 @@ def decide(): try: pd.to_numeric(allparticles[sortcol].iloc[0], downcast="float") print("\n----------------------------------------------------------------------") - print("\n>> Warning: it looks like this column is numeric but you haven't specified so.\n Make sure that this is the behavior you intended. Otherwise, use \"column/n\".\n") + print("\n>> Warning: it looks like this column is numeric but you haven't specified so.\n>> Make sure that this is the behavior you intended. 
Otherwise, use \"column/n\".\n") print("----------------------------------------------------------------------") except ValueError: pass @@ -1042,7 +1075,7 @@ def decide(): """ if params["parser_plot"] != "": - columntoplot = params["parser_plot"] + columntoplot = makefullname(params["parser_plot"]) if columntoplot not in particles2use: print("\n>> Error: the column \"" + columntoplot + "\" does not exist.\n") sys.exit() @@ -1071,7 +1104,9 @@ def decide(): if params["parser_writecol"] != "": colstowrite = params["parser_writecol"].split("/") - for col in colstowrite: + for i,col in enumerate(colstowrite): + col = makefullname(col) + colstowrite[i] = col if col not in particles2use: print("\n>> Error: the column \"" + str(col) + "\" does not exist in your star file.\n") sys.exit() @@ -1103,3 +1138,19 @@ def decide(): #The end. print("\n>> Error: either the options weren't passed correctly or none were passed at all. See the help page (-h).\n") + + +def makefullname(col): + if col.startswith("_rln"): + return(col) + elif col.startswith("rln"): + return("_"+col) + elif col.startswith("_rn") or col.startswith("rn"): + print(f"\n>> Error: check the column name {col}.\n") + sys.exit() + elif col.startswith("rn") and not col.startswith("rln"): + print(f"\n>> Error: check the column name {col}.\n") + sys.exit() + else: + return("_rln"+col) + diff --git a/starparser/particleplay.py b/starparser/particleplay.py index fbc5b21..c006c85 100644 --- a/starparser/particleplay.py +++ b/starparser/particleplay.py @@ -264,18 +264,48 @@ def setparticleoptics(particles,column,query,queryexact,opticsnumber): --import_particle_values """ def importpartvalues(original_particles, importfrom_particles, columnstoswap): + # Create a dictionary for fast lookup from importfrom_particles + lookup_dict = importfrom_particles.set_index('_rlnImageName')[columnstoswap].to_dict('index') - importedparticles = original_particles.copy() - - for index, particle in original_particles.iterrows(): - 
imagename = particle["_rlnImageName"] - importloc = importfrom_particles.index[importfrom_particles["_rlnImageName"] == imagename].tolist() - if len(importloc) > 1: - print("\n>> Error: " + imagename + " exists more than once in the star file.\n") - sys.exit() - importloc = importloc[0] - for c in columnstoswap: - importedparticles[c].iloc[index] = importfrom_particles[c].iloc[importloc] + # Check for duplicates in importfrom_particles + if importfrom_particles['_rlnImageName'].duplicated().any(): + print("\n>> Error: Duplicate entries found in the star file.\n") + sys.exit() + + # Perform the swapping of values + for c in columnstoswap: + # Map each value, checking if it exists in the dictionary + original_particles[c] = original_particles['_rlnImageName'].apply( + lambda x: lookup_dict[x][c] if x in lookup_dict else sys.exit(f"\n>> Error: {x} not found in file that you are importing from.") + ) + + return(original_particles) + +""" +--import_mic_values +""" +def importmicvalues(importedparticles, importfrom_particles, column): + # Extract the simple micrograph name if necessary + if "/" in importedparticles['_rlnMicrographName'][0]: + importedparticles["_rlnMicrographNameSimple"] = importedparticles['_rlnMicrographName'].str.split('/').str[-1] + else: + importedparticles["_rlnMicrographNameSimple"] = importedparticles['_rlnMicrographName'] + + if "/" in importfrom_particles['_rlnMicrographName'][0]: + importfrom_particles["_rlnMicrographNameSimple"] = importfrom_particles['_rlnMicrographName'].str.split('/').str[-1] + else: + importfrom_particles["_rlnMicrographNameSimple"] = importfrom_particles['_rlnMicrographName'] + + # Create a lookup dictionary from importfrom_particles + lookup_dict = importfrom_particles.set_index('_rlnMicrographNameSimple')[column].to_dict() + + # Update values in importedparticles using lookup_dict, with error reporting + importedparticles[column] = importedparticles['_rlnMicrographNameSimple'].apply( + lambda x: lookup_dict[x] if x in 
lookup_dict else sys.exit(f"\n>> Error: Micrograph {x} not found in original file.") + ) + + # Drop the temporary '_rlnMicrographNameSimple' column + importedparticles.drop("_rlnMicrographNameSimple", axis=1, inplace=True) return(importedparticles)
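
Review note on the `makefullname` helper added in `decisiontree.py`: its final `elif col.startswith("rn") and not col.startswith("rln")` branch can never run, because the preceding `elif` already catches every name starting with `rn`. A minimal sketch of the same normalization without that branch is shown below. The names `make_full_name` and `normalize_columns` are hypothetical and not part of the patch, and the sketch raises `ValueError` instead of calling `sys.exit()` so it can be tested in isolation; the patch itself exits.

```python
def make_full_name(col: str) -> str:
    """Return the column name in RELION's _rlnXXX form.

    Hypothetical rewrite of the patch's makefullname(); raises ValueError
    where the patch prints an error and calls sys.exit().
    """
    if col.startswith("_rln"):
        # Already fully qualified.
        return col
    if col.startswith("rln"):
        # Only the leading underscore is missing.
        return "_" + col
    if col.startswith(("_rn", "rn")):
        # Likely a typo of the rln prefix; refuse to guess.
        raise ValueError(f"check the column name {col}")
    # No prefix at all: add the full one.
    return "_rln" + col


def normalize_columns(argument: str) -> str:
    """Normalize every name in a slash-separated column argument."""
    return "/".join(make_full_name(c) for c in argument.split("/"))


if __name__ == "__main__":
    print(normalize_columns("MicrographName/rlnCoordinateX/_rlnOriginX"))
```

Splitting on `/` before normalizing each piece, as `normalize_columns` does, also makes the extra `makefullname(...)` call on the unsplit `--remove_column` argument unnecessary.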