Datasets from SIM-XL, Mascot and ProteomeDiscover in PRIDE #63

Open
ypriverol opened this issue Apr 3, 2024 · 23 comments

@ypriverol

We have to find out the list of datasets with the following conditions:

  • mzIdentML 1.2 (Complete submission)
  • Peak list files corresponding to the mzid
  • Software that produces the files: Proteome Discoverer / Mascot / SIM-XL
  • Crosslinking (this is the most important one)

Please let's update the list in this issue.

@sureshhewabi
Collaborator

sureshhewabi commented Apr 3, 2024

Datasets in PRIDE with "crosslink" or "cross-link" word in TITLE which contain mzIdentML files:

Ordered by priority:

Need to check version 1.2, the corresponding peak list and producer

@colin-combe

colin-combe commented Apr 3, 2024

as noted in meeting, they might not be complete submissions
(what does the strikeout represent above? PXD018935 / PXD012759)

@colin-combe

"crosslink" or "cross-link" word in TITLE

wasn't there a "crosslink" tag people were referring to? (I don't know but people spoke of this)

@ypriverol
Author

We will continue with different combinations. We will let you know when errors start to happen.

@colin-combe

OK, great, thanks!

@ypriverol
Author

@sureshhewabi reported the following error in this one:

PXD014359 - Error parsing C_Lee_141014_CRM_dialysis_NCE20_2.mzid
MzIdParseException ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))
2024-04-04 09:35:10 - main - ERROR - ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))

@colin-combe

i guess the error message is correct and it is not valid XML
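
One way to narrow down a "Start tag expected, '<' not found, line 1, column 1" error is to look at the first bytes of the file before handing it to the XML parser. The sketch below is only a diagnostic aid under stated assumptions: `describe_file_start` is a hypothetical helper, not part of the parser discussed here, and which of these cases (if any) applies to the failing files is not established in this thread.

```python
def describe_file_start(path, n=16):
    """Classify the first bytes of a file that an XML parser rejected.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as handle:
        head = handle.read(n)
    if not head:
        return "empty file"
    if head.startswith(b"\x1f\x8b"):
        return "gzip-compressed"
    if head.startswith(b"PK\x03\x04"):
        return "zip archive"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 byte order mark"
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16 byte order mark"
    if head.lstrip().startswith(b"<"):
        return "looks like XML"
    # Anything else: show the raw bytes so a human can inspect them.
    return "unknown: %r" % head
```

Calling `describe_file_start("C_Lee_141014_CRM_dialysis_NCE20_2.mzid")` on a local copy would report which of these categories the file falls into.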

@ypriverol
Author

Another similar error for OpenMS

Error parsing XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid
MzIdParseException ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))
2024-04-04 14:28:13 - parser.process_dataset - ERROR - ('XMLSyntaxError', ("Start tag expected, '<' not found, line 1, column 1",))

@sureshhewabi
Collaborator

sureshhewabi commented Apr 4, 2024

Schema seems valid in XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid file from PXD021417:
xmllint --noout --schema mzIdentML1.2.0.xsd XLpeplib_Beveridge_QEx-HFX_DSS_R1.mzid > XLpeplib_Beveridge_QEx-HFX_DSS_R1_output_file1.txt 2>&1

There must be an issue with the parser. @colin-combe any idea?

@colin-combe

yes, could be an issue with the parser. Or perhaps something to do with character encoding. I looked into it and was confused.

@sureshhewabi - I'm not sure what your xmllint command does?

@sureshhewabi
Collaborator

It is a command that checks the file's validity against the schema definition file (xsd)

@colin-combe

one problem is the empty location attribute for spectra data:

XLpeplib_Beveridge_QEx-HFX_DSS_R3.mzid, line 527672:
<SpectraData location="" id="SDAT_1534307058980521776">

It is a required attribute, but an empty string is enough to make the file schema valid (http://www.datypic.com/sc/xsd/t-xsd_anyURI.html).

But this isn't the only problem, there's something else that's still mysterious...
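
Because an empty `location` passes schema validation, it has to be checked separately. A minimal sketch of such a check, using only the Python standard library, might look like the following. `empty_spectra_locations` is a hypothetical name and this is not the code used by the parser in this project; the sample XML is a trimmed-down stand-in for a real mzIdentML file.

```python
import io
import xml.etree.ElementTree as ET


def empty_spectra_locations(source):
    """Return the ids of SpectraData elements whose location is missing or blank.

    `source` is a file name or a binary file object. Hypothetical helper.
    """
    missing = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags carry the namespace as '{uri}SpectraData'; compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "SpectraData":
            if not (elem.get("location") or "").strip():
                missing.append(elem.get("id"))
        # Free memory as we go; mzIdentML files can be very large.
        elem.clear()
    return missing


# Usage with a tiny in-memory document (illustrative only).
sample = (
    b'<MzIdentML><DataCollection><Inputs>'
    b'<SpectraData location="" id="SDAT_1534307058980521776"/>'
    b'<SpectraData location="run2.mzML" id="SDAT_2"/>'
    b'</Inputs></DataCollection></MzIdentML>'
)
print(empty_spectra_locations(io.BytesIO(sample)))
```

Streaming with `iterparse` rather than loading the whole tree is a design choice here, made because the element in question was found more than half a million lines into the file.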

@sureshhewabi
Collaborator

This means we cannot use this dataset anyway, doesn't it? Because we cannot find the peak list file.

@colin-combe

we could manually fix the location.
But also, yes, there is a problem with the parser. It is requiring some elements that are optional. (Breaks if they're not there.)
I'll provide an update (will make PR when it's fixed).
I would stop testing datasets until this is fixed.

@colin-combe

think this fixes a problem - #64
the dataset with the empty location still won't work, but maybe some of the other ones throwing errors like that will.

sorry about that

@sureshhewabi
Collaborator

sureshhewabi commented Apr 5, 2024

PXD021417 Dataset Issues:

  • Spectra Data Location was not available - Manually inserted, as we can see a one-to-one relationship between mzIdentML files and mzML files
  • Issues with Score
  • Run Name is missing
  • <Seq> is missing

@sureshhewabi
Collaborator

sureshhewabi commented Apr 8, 2024

PXD026603 Dataset Issues:

  • ModificationParam

parser.process_dataset - INFO - parsing AnalysisProtocolCollection- start
Error parsing GPR158-RGS7-Gb5_CONSENSUS.mzid
KeyError 'ModificationParams'
parser.process_dataset - ERROR - 'ModificationParams'
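
A `KeyError` like the one in this log is what a direct dictionary lookup raises when a key is absent. The sketch below shows the general pattern of treating such a key as optional. It assumes, purely for illustration, that the parsed protocol is a plain `dict` with a nested `SearchModification` list; the real parser's data structures are not shown in this thread, and `modification_params` is a hypothetical function.

```python
def modification_params(protocol):
    """Return the list of search modifications, or [] when none are declared.

    `protocol` is assumed (hypothetically) to be a dict built from a
    SpectrumIdentificationProtocol element.
    """
    # protocol["ModificationParams"] would raise KeyError when the optional
    # element is absent, so use .get() with a fallback instead.
    params = protocol.get("ModificationParams") or {}
    return params.get("SearchModification", [])


# Usage: both calls succeed, with and without the optional element.
print(modification_params({}))
print(modification_params({"ModificationParams": {"SearchModification": [{"residues": "K"}]}}))
```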

@colin-combe

thanks, will check it

@colin-combe

colin-combe commented Apr 8, 2024

similar to before - parser was treating things that are optional as if they were required
fixed by #65

re PXD026603 - the peaklists are missing?

@sureshhewabi
Collaborator

Yes, peakfile is missing too:
<SpectraData location="C:\Users\griffinlab.PG18844\Dropbox (Scripps Research)\Griffin Lab\fusion lumos\TSS\XLMS\20210129 GPR158 Complex\GPR158_Complex_XLMS\GPR158-RGS7-Gb5_CONSENSUS.mzML" name="MzML spectra file" id="ID_MZML_FILE_with_spectra"> but GPR158-RGS7-Gb5_CONSENSUS.mzML is not available
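
Since the `location` here is an absolute Windows path from the submitter's machine, only its final component can be matched against the files in the dataset. A minimal sketch of that matching, assuming the dataset's file names are available as a list of strings (`peak_list_name` and `has_peak_list` are hypothetical helpers, not functions from the project):

```python
from pathlib import PureWindowsPath


def peak_list_name(location):
    """Return the file name part of a SpectraData location.

    PureWindowsPath splits on both backslashes and forward slashes,
    so Windows-style and POSIX-style locations are both handled.
    """
    return PureWindowsPath(location).name


def has_peak_list(location, dataset_files):
    """True if the referenced peak list name is among the dataset's file names."""
    wanted = peak_list_name(location).lower()
    return wanted in {name.lower() for name in dataset_files}


# Usage with a shortened, made-up path (illustrative only).
print(peak_list_name(r"C:\data\run\example.mzML"))
```

The comparison is case-insensitive by choice in this sketch, on the assumption that file names may have changed case between the search and the upload.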

@colin-combe

We have to find out the list of datasets with the following conditions:

  • mzIdentML 1.2 (Complete submission)

would these actually have complete submission status? I thought complete submission status wasn't previously being given to crosslinking data?

@colin-combe

PXD021417 Dataset Issues:

  • Spectra Data Location was not available - Manually inserted, as we can see a one-to-one relationship between mzIdentML files and mzML files
  • Issues with Score
  • Run Name is missing
  • <Seq> is missing

this one shouldn't be in DB because the sequences are missing
(related #78 (comment))

@colin-combe

re. PXD021417 - maybe let's leave this in for testing purposes
