Entity and attribute names and formats for sample and diffraction plan shipment/upload #4

KarlLevik · 2024-01-08T10:57:28Z

As a starting point, below is documentation for the CSV format we currently use for this at Diamond.

I imagine we would want to agree on a standard for attribute names as well as a JSON format to replace this.

These are the CSV column names:

oscillationRange,proteinAcronym,proteinName,spaceGroup,sampleBarcode,sampleName,samplePosition,sampleComments,
cell_a,cell_b,cell_c,cell_alpha,cell_beta,cell_gamma,subLocation,loopType,requiredResolution,centringMethod,experimentKind,
radiationSensitivity,energy,userPath,screenAndCollectRecipe,screenAndCollectNValue,sampleGroup

In our actual CSV files, the first line is a header which "dynamically" defines which columns you have and their ordering. So, you can have different columns and ordering for each file, just as long as the column names are ones we know about, and you have included the mandatory columns.

Here is an example - only the first three lines of data - and note that empty columns are ignored:

#proposalCode,proposalNumber,visitNumber,shippingName,dewarCode,containerCode,preObsResolution,neededResolution,oscillationRange,proteinAcronym,proteinName,spaceGroup,sampleBarcode,sampleName,samplePosition,sampleComments,cell_a,cell_b,cell_c,cell_alpha,cell_beta,cell_gamma,subLocation,loopType,requiredResolution,centringMethod,experimentKind,radiationSensitivity,energy,userPath,screenAndCollectRecipe,screenAndCollectNValue,sampleGroup
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,,,,GPP91,GPP91,,,GPP91-2059-263C10A,1,,,,,,,,,Litho Loop,,,,,,,,,
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,,,,GPP91,GPP91,,,GPP91-2059-263C10B,2,,,,,,,,,Litho Loop,,,,,,,,,
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,,,,GPP91,GPP91,,,GPP91-2059-263C10C,3,,,,,,,,,Litho Loop,,,,,,,,,
...
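
A minimal sketch of how a header-driven file like this could be read and turned into JSON, using only Python's standard library. This is not the Diamond uploader: the function name `rows_from_csv` and the JSON layout (one object per sample, keyed by column name) are assumptions for illustration, and the sample data is a shortened version of the example above.

```python
import csv
import io
import json


def rows_from_csv(text):
    """Parse a header-driven shipment CSV into a list of dicts.

    The first line names the columns (a leading '#' is stripped);
    empty cells are dropped, mirroring "empty columns are ignored".
    """
    reader = csv.reader(io.StringIO(text))
    header = [name.strip() for name in next(reader)]
    header[0] = header[0].lstrip("#")
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        row = {k: v.strip() for k, v in zip(header, values) if v.strip()}
        rows.append(row)
    return rows


# Shortened version of the example file (fewer columns, two rows).
example = """#proposalCode,proposalNumber,visitNumber,shippingName,dewarCode,containerCode,proteinAcronym,proteinName,sampleName,samplePosition,loopType
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,GPP91,GPP91,GPP91-2059-263C10A,1,Litho Loop
mx,32101,21,mx32101-23,DLS-MX-0079,MEP-005,GPP91,GPP91,GPP91-2059-263C10B,2,
"""

print(json.dumps(rows_from_csv(example), indent=2))
```

Because the header decides the columns, the same function works for any column subset or ordering; checking that the names are known and that the mandatory ones are present would be a separate step.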

I assume many of the attribute/column names are familiar and self-explanatory, but here is some extra info:

subLocation is an index referring to a position within a multipin sample.
userPath describes one or two levels of folders (folder1/folder2) that will be created inside the visit directory and into which the acquisition system will write diffraction images for the given sample.
screenAndCollectRecipe: can be "best", "all" or "none" (or empty). If using "best", then set screenAndCollectNValue to some integer, e.g. 3 if you want the best 3 samples from the group collected on. It has to be a value in the range 1 to 5.
sampleGroup: should be the name of a new group. If you want an existing group, use the group id.
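
The screenAndCollectRecipe rules could be checked roughly as follows. This is a sketch, not the real uploader code: `check_recipe` is a hypothetical helper, the messages are shortened, and the requirement that 'best' needs a sampleGroup is taken from the error messages listed further down. The resolution requirements for 'all' and 'none' are left out.

```python
def check_recipe(row):
    """Return a list of problems with the screen-and-collect fields of one row.

    `row` is a dict of column name -> string value (hypothetical layout).
    """
    problems = []
    recipe = row.get("screenAndCollectRecipe", "")
    if recipe == "":
        return problems  # an empty recipe is allowed
    if recipe not in ("best", "all", "none"):
        return ["'%s' not a valid screenAndCollectRecipe" % recipe]
    if recipe == "best":
        try:
            n = int(row.get("screenAndCollectNValue", ""))
        except (TypeError, ValueError):
            problems.append("screenAndCollectNValue is not an integer")
        else:
            if not 1 <= n <= 5:
                problems.append("screenAndCollectNValue must be from 1 to 5")
        if not row.get("sampleGroup"):
            problems.append("a sampleGroup is required for recipe 'best'")
    return problems
```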

The following fields are mandatory:

In the first line only: proposalCode, proposalNumber, shippingName
In all lines: dewarCode (i.e. dewar name) + containerCode + proteinAcronym + proteinName + sampleName + sampleBarcode
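
As an illustration of the two mandatory-field lists, here is a small sketch that reports every mandatory field left empty. The names `FIRST_ROW_ONLY`, `EVERY_ROW` and `missing_mandatory` are made up for this example, and rows are assumed to be dicts of column name to string value.

```python
FIRST_ROW_ONLY = ("proposalCode", "proposalNumber", "shippingName")
EVERY_ROW = ("dewarCode", "containerCode", "proteinAcronym",
             "proteinName", "sampleName", "sampleBarcode")


def missing_mandatory(rows):
    """Yield (row_index, field) for every mandatory field left empty.

    row_index is zero-based and counts data rows, not the header line.
    """
    for index, row in enumerate(rows):
        required = EVERY_ROW + (FIRST_ROW_ONLY if index == 0 else ())
        for field in required:
            if not row.get(field, "").strip():
                yield index, field
```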

Additionally, you can specify flags when you upload the file:

--queuecontainer so that the container is queued for Unattended Data Collection (UDC)
--highpriority|mediumpriority|lowpriority so that the container is moved in the UDC queue (DLS staff only)
--allowanyregcontainer to use any puck, not just the ones associated with a proposal
--allowmissingfacilitycode so you don't need to specify a dewar facility code

Validation

If not successful, the uploader will abort with an error message. If there was a minor problem, then it will complete but with a warning message.

The warning messages are:

Unable to calculate unit cell volume for sample %s with cell params %s.
Unit cell volume must be positive. Got %s for sample %s with cell params %s
Not setting lab contacts for shipment as the csv file owner %s is not a lab contact for proposal %s.
The csv file owner %s is not in the ISPyB database.

The error messages are:

client is required.

inputcsvfile is required.

file %s not found.

The csv file owner %s is not in the ISPyB database.

If either of the unit cell parameters are defined, then all must be defined. Got %s for sample %s

All unit cell angles must be < 180 degrees. Got %s for sample %s

User-defined field list is missing the following mandatory fields: %s

If uploading the csv file from a visit dir, then the visit's proposal (%s) must match that given in the file (%s).

Authorisation failure - the time delta is too large.

The csv file owner %s is not a member of any sessions/visits in the ISPyB database.

If not uploading the csv file from a visit dir, then you must be a member of a session on the proposal you're trying to upload to (%s).

Illegal characters in sampleGroup %s. Legal characters: alpha-numeric, hyphen and underscore.

The sample group ID %d does not exist

The proposalId of sample group ID %d is different from the proposalId of sample %s

There is already a sample group for proposal %s with name %s

screenAndCollectNValue is not an integer - problem with sampleName %s

screenAndCollectRecipe 'none' requires a value for requiredResolution - sampleName %s

For screenAndCollectRecipe 'best' the screenAndCollectNValue must be from 1 to 5 - problem with sampleName %s

For screenAndCollectRecipe 'best' a sampleGroup is required - problem with sampleName %s

screenAndCollectRecipe 'all' requires a value for neededResolution - problem with sampleName %s

'%s' not a valid screenAndCollectRecipe - problem with sampleName %s

Mandatory field %s not filled in. (Only mandatory for first row.) Required format is: %s

Mandatory field %s not filled in. Required format is: %s

Field %s must be max 45 characters long, this value is longer: %s

Illegal characters in sampleName %s. Legal characters: alpha-numeric, hyphen and underscore.

Space group must be at least 2 characters long or be a positive integer: %s

Space group number must be in the range [1, 230]: %s

The dewar code %s is not a registered facility code for proposal %s

The container code %s is not a registered container code

The userPath can be max 100 characters long, this one is longer: %s

The proteins must have been approved - this one isn't: acronym: %s

The proteins must already exist in ISPyB - this one doesn't: acronym: %s

Sample with name %s already exists for protein with acronym %s in this proposal.

Value required for experimentKind when UDC/queueContainer option specified. No value found for sampleName %s

Sample %s in container %s is in an invalid location %s. Valid locations are 1 to 16.

Sample %s in container %s has an invalid non-integer location %s

Sample %s in container %s is in an invalid sub-location %s. Valid locations are 0 to 7.

Sample %s in container %s has location %s, sub-location %s which is already taken.

Project %s does not exist

There are %d occurrences of sample with name %s and protein acronym %s in this CSV file.
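
To make the split between errors and warnings concrete, here is a sketch of a per-row check covering a handful of the rules above (sample name characters, userPath length, space group, sample position and unit cell). It is an assumption-laden illustration, not the Diamond code: `validate_row` is a hypothetical function, the messages are abbreviated, and the volume test uses the standard triclinic cell volume formula.

```python
import math
import re

CELL_FIELDS = ("cell_a", "cell_b", "cell_c",
               "cell_alpha", "cell_beta", "cell_gamma")
NAME_RE = re.compile(r"[A-Za-z0-9_-]+")


def validate_row(row):
    """Return (errors, warnings) for one sample row (dict of strings)."""
    errors, warnings = [], []
    name = row.get("sampleName", "")

    if name and not NAME_RE.fullmatch(name):
        errors.append("Illegal characters in sampleName %s." % name)

    if len(row.get("userPath", "")) > 100:
        errors.append("The userPath can be max 100 characters long.")

    space_group = row.get("spaceGroup", "")
    if space_group.isdecimal():
        if not 1 <= int(space_group) <= 230:
            errors.append("Space group number must be in the range [1, 230].")
    elif space_group and len(space_group) < 2:
        errors.append("Space group must be at least 2 characters long.")

    position = row.get("samplePosition", "")
    if position:
        if not position.isdecimal():
            errors.append("Sample %s has an invalid non-integer location." % name)
        elif not 1 <= int(position) <= 16:
            errors.append("Sample %s is in an invalid location." % name)

    cell = [row.get(field, "") for field in CELL_FIELDS]
    if any(cell) and not all(cell):
        errors.append("If any unit cell parameter is defined, all must be.")
    elif all(cell):
        try:
            a, b, c, alpha, beta, gamma = (float(v) for v in cell)
        except ValueError:
            warnings.append("Unable to calculate unit cell volume.")
            return errors, warnings
        if max(alpha, beta, gamma) >= 180:
            errors.append("All unit cell angles must be < 180 degrees.")
        else:
            ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
            # V = abc * sqrt(term); a non-positive term means no real volume.
            term = 1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg
            if term <= 0 or a * b * c <= 0:
                warnings.append("Unit cell volume must be positive.")
    return errors, warnings
```

Checks that need the database (registered dewar and container codes, approved proteins, existing sample groups) or the whole file (duplicate sample names, already-taken locations) are deliberately left out of this sketch.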


katesmith280 · 2024-01-22T10:28:43Z

Thanks Karl for your very comprehensive starting point!

Prior to the SLS darktime, this is what our users could provide before their experiment (by email): V6_TELLSamplesSpreadsheetTemplate.xlsx

Our website heidi.psi.ch allowed users to validate their spreadsheets prior to emailing them to us. Our desktop sample changer GUI would also run the same sample import validation when the spreadsheet is uploaded prior to an experiment.

Pydantic model: (https://github.com/HeidiProject/backend/blob/main/app/sample_models.py)
Sample importer module: (https://github.com/HeidiProject/backend/blob/main/app/sample_importer.py)

ejd53 · 2024-01-23T17:07:07Z

What I like about both of these is that the column names appear to be scientist-friendly and completely decoupled from those in the database :)

Here's some JSON Schema for a previous attempt at a one-shot shipment submission, intended to encompass both pin and plate shipments as well as retrieval of crystal coordinates when putting a plate onto a home source: https://icebear.fi/shiplink/v0_3_0/schema.json

(Karl, you might remember this one, back in the day...)

A more human-friendly representation is here: https://icebear.fi/shiplink/schemadoc/?schema=https://icebear.fi/shiplink/v0_3_0/schema.json

Some of this doesn't make any sense to me after not having seen it for a few years, and there's some stuff missing, but nothing fundamentally wrong with it as far as I can see.

antolinos · 2024-02-07T08:50:45Z

Hi,

Our column names are pretty similar to what Karl has described with some minor differences. The csv can be downloaded from here

Parameters

Parameter	Description
parcel name
container name
container name
container type
container position
protein acronym
sample acronym
barcode	pin barcode
SPG
cellA
cellB
cellC
cellAlpha
cellBeta
cellGamma
experimentType	this is the name of the workflow: MXPress-A, etc...
aimed Resolution
required Resolution
beam diameter
number of positions
aimed multiplicity
aimed completeness
forced SPG
radiation sensitivity
smiles
total rot. angle
min osc. angle
observed resolution
comments
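
To illustrate how close the two sets of column names are, here is a sketch that renames a few of the parameters above to the Diamond CSV names from the first comment. The correspondences in `ESRF_TO_DLS` are guesses based on the names alone and have not been agreed by anyone in this thread; columns without an obvious counterpart are passed through unchanged.

```python
# Assumed correspondences only, based on the names; not an agreed mapping.
ESRF_TO_DLS = {
    "protein acronym": "proteinAcronym",
    "barcode": "sampleBarcode",
    "SPG": "spaceGroup",
    "cellA": "cell_a",
    "cellB": "cell_b",
    "cellC": "cell_c",
    "cellAlpha": "cell_alpha",
    "cellBeta": "cell_beta",
    "cellGamma": "cell_gamma",
    "required Resolution": "requiredResolution",
    "radiation sensitivity": "radiationSensitivity",
    "comments": "sampleComments",
}


def rename_columns(row, mapping=ESRF_TO_DLS):
    """Return a copy of `row` with known column names translated."""
    return {mapping.get(key, key): value for key, value in row.items()}
```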

Currently, we are adding more parameters from online data analysis, but it is still in a very immature state.

hormiai76 · 2024-02-07T12:35:36Z

Hi,
at MAXIV we added 4 more columns to the ESRF ones. We need them to manage the unattended data collections:

energy
transmission
exposure time
oscillation range

example_MAXIV.csv

We are working on a new template in Excel that applies some restrictions to the diffraction plan columns; the user will then need to export the file as CSV and import it into py-ispyb-ui or exi.

CV-GPhL · 2024-02-07T13:31:25Z

Maybe too early, but a few comments about some of those items - also mainly to show the kind of connection one could make between some of the items here and a dictionary like PDBx/mmCIF (the definitions there are also not perfect in some places, but it seems the best we have and is actively developed and maintained).

"aimed Resolution" and "required Resolution":

We also need to define what "resolution" here means: is that a purely geometric value (e.g. the edge/corner of the detector) or something more related to the diffraction quality of a given project/crystal?
Probably the latter, in which case it is most likely the highest diffraction limit (in any direction) according to some (not yet defined) criterion ... and it might be better to stay away from that hugely loaded word "Resolution".
I would go with a generic "diffraction limit" (see also _diffrn_reflns.pdbx_d_res_high and _reflns.d_resolution_high - although these definitions are not perfect either) and then add another item "diffraction limit criterion"
- This could then be free text, which would be easier to define than an enumerated list (where we get into isotropic-vs-anisotropic, different binning methods, different bin sizes, Friedel's law true/false, anomalous etc). See also _reflns.pdbx_signal_details, _diffrn_reflns.pdbx_observed_criterion etc.

"aimed multiplicity":

Is there a style guide about upper/lower-casing ("Resolution" vs "multiplicity")?
Is that normal multiplicity or anomalous multiplicity? Probably the former (but this should be stated), i.e. _diffrn_reflns.pdbx_redundancy or _reflns.pdbx_redundancy and not _reflns.pdbx_redundancy_anomalous

"aimed completeness":

This also needs clarification: _diffrn_reflns.pdbx_percent_possible_obs, _reflns.pdbx_percent_possible_spherical, _reflns.pdbx_percent_possible_ellipsoidal and _reflns.pdbx_percent_possible_anomalous
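
One way to record this kind of connection is a simple lookup from plan attribute to candidate PDBx/mmCIF items. The sketch below only restates the items named in this comment; the attribute names "diffraction limit" and "diffraction limit criterion" are the suggestions made above, not agreed names, and `items_without_candidate` is a hypothetical helper.

```python
# Candidate PDBx/mmCIF items per plan attribute, as named in this comment.
RELATED_MMCIF_ITEMS = {
    "diffraction limit": [
        "_diffrn_reflns.pdbx_d_res_high",
        "_reflns.d_resolution_high",
    ],
    "diffraction limit criterion": [
        "_reflns.pdbx_signal_details",
        "_diffrn_reflns.pdbx_observed_criterion",
    ],
    "aimed multiplicity": [
        "_diffrn_reflns.pdbx_redundancy",
        "_reflns.pdbx_redundancy",
    ],
    "aimed completeness": [
        "_diffrn_reflns.pdbx_percent_possible_obs",
        "_reflns.pdbx_percent_possible_spherical",
        "_reflns.pdbx_percent_possible_ellipsoidal",
        "_reflns.pdbx_percent_possible_anomalous",
    ],
}


def items_without_candidate(columns):
    """Return the columns that have no candidate mmCIF item recorded yet."""
    return sorted(c for c in columns if c not in RELATED_MMCIF_ITEMS)
```

For "aimed completeness" the four entries are alternatives that still need a decision, not a set that all apply at once.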

Nothing mentioned above has any impact right now - apart from maybe a renaming of "Resolution" ;-)

antolinos · 2024-02-08T08:35:49Z

Hi @CV-GPhL

I remember discussing 'aimed resolution' and 'required resolution' for quite a long time in a recent meeting. The word 'desired' was also mentioned.

I have no say about this. My opinion, at this stage of the project, is to encourage more scientists to participate in the discussions. I've tried to involve some at the ESRF with little (or zero) success

Is there a style guide about upper/lower-casing ("Resolution" vs "multiplicity")?

At least in my case, I have just copied and pasted what we have in the CSV example template. It is only for listing purposes. This should not be considered as the final name that will be used to define the metadata in the catalog, where I presume each implementation will have its own styles.

CV-GPhL · 2024-02-08T13:57:08Z

@antolinos,

As I said, this kind of discussion is maybe a bit too early (and others might join in at later stages). What is important is that a discussion about the "proper" (whatever that means) scientific definition of various categories has to happen before anything goes into production. At the moment we shouldn't really care what a box is called - it's just a name after all with only a very rough meaning.
