EE-642 Extend xls2xml converter to allow user defined columns #21

Open
wants to merge 3 commits into
base: master
22 changes: 11 additions & 11 deletions xls2xml/README.md
Expand Up @@ -48,30 +48,30 @@ python ./xls2xml/xls2xml.py -h
```
Here are some examples you can try out:
```commandline
python ./xls2xml/validate_xls.py --conf tests/data/T2D_xls2xml_v1.conf --schema tests/data/T2D_xls2xml_v1.schema tests/data/example_AMP_T2D_Submission_form_V2.xlsx
python ./xls2xml/validate_tsv.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Sample --schema tests/data/T2D_xls2xml_v1.schema tests/data/example_samples.tsv
python ./xls2xml/xls2tsv.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Sample --schema tests/data/T2D_xls2xml_v1.schema tests/data/example_AMP_T2D_Submission_form_V2.xlsx tests/data/output_xls2tsv.tsv
python ./xls2xml/tsv2xml.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Sample --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v1.xslt tests/data/example_samples.tsv tests/data/output_tsv2xml.xml
python ./xls2xml/xls2xml.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Analysis --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_AMP_T2D_Submission_form_V2.xlsx tests/data/output_xls2xml_single.xml
python ./xls2xml/xls2xml.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Analysis,File --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_AMP_T2D_Submission_form_V2.xlsx tests/data/output_xls2xml_multiple.xml
python ./xls2xml/tsv2xml.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Analysis --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_analysis.tsv tests/data/output_tsv2xml_single.xml
python ./xls2xml/tsv2xml.py --conf tests/data/T2D_xls2xml_v1.conf --conf-key Analysis,File --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_analysis.tsv,tests/data/example_files.tsv tests/data/output_tsv2xml_multiple.xml
python ./xls2xml/validate_xls.py --conf tests/data/T2D_xls2xml_v3.conf --schema tests/data/T2D_xls2xml_v1.schema tests/data/example_AMP_T2D_Submission_form_V2.xlsx
python ./xls2xml/validate_tsv.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Sample --schema tests/data/T2D_xls2xml_v1.schema tests/data/example_samples.tsv
python ./xls2xml/xls2tsv.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Sample --schema tests/data/T2D_xls2xml_v1.schema tests/data/example_AMP_T2D_Submission_form_V2.xlsx tests/data/output_xls2tsv.tsv
python ./xls2xml/tsv2xml.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Sample --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v1.xslt tests/data/example_samples.tsv tests/data/output_tsv2xml.xml
python ./xls2xml/xls2xml.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Analysis --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_AMP_T2D_Submission_form_V2.xlsx tests/data/output_xls2xml_single.xml
python ./xls2xml/xls2xml.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Analysis,File --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_AMP_T2D_Submission_form_V2.xlsx tests/data/output_xls2xml_multiple.xml
python ./xls2xml/tsv2xml.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Analysis --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_analysis.tsv tests/data/output_tsv2xml_single.xml
python ./xls2xml/tsv2xml.py --conf tests/data/T2D_xls2xml_v3.conf --conf-key Analysis,File --schema tests/data/T2D_xls2xml_v1.schema --xslt tests/data/T2D_xls2xml_v2.xslt tests/data/example_analysis.tsv,tests/data/example_files.tsv tests/data/output_tsv2xml_multiple.xml
```

### Writing the configuration files
There are a few different configuration files for these scripts. Examples of each can be found in tests/data:
```commandline
tests/data/T2D_xls2xml_v1.conf
tests/data/T2D_xls2xml_v3.conf
tests/data/T2D_xls2xml_v1.schema
tests/data/T2D_xls2xml_v1.xslt
tests/data/T2D_xls2xml_v2.xslt # for combining multiple worksheets into a single xml file
tests/data/T2D_xls2xml_v3.xslt # for allowing user defined columns
```

For details on how they should be written, see the comments in each of the example files above.

Other data files serve as test-case input. They are also used in the example usages illustrated above:
```commandline
tests/data/example_AMP_T2D_Submission_form_V2.xlsx
tests/data/example_AMP_T2D_Submission_form_V3.xlsx
tests/data/example_samples.tsv
tests/data/example_samples.xml
tests/data/example_analysis.xml
2 changes: 1 addition & 1 deletion xls2xml/amp-t2d/T2D_xls2xml.xslt
Expand Up @@ -12,7 +12,7 @@ Please note that because XML tag must start with a letter or underscore and cont
letters, digits, hyphens, underscores and periods, any violating characters should be replaced
with underscores.
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
xmlns:exsl="http://exslt.org/common"
extension-element-prefixes="exsl">
<xsl:output method="xml" indent="yes"/>
Expand Up @@ -47,6 +47,11 @@ Sample: # worksheet title itself
- Paternal_id
- Novel Attributes

# Regex and placeholder to match user defined column names
user_defined_columns:
regex: '^attribute_\[[a-z0-9_]*\]$'
placeholder: 'attribute_[add_value]'

data_type:
T2D: int

Expand Up @@ -12,9 +12,11 @@ Please note that because XML tag must start with a letter or underscore and cont
letters, digits, hyphens, underscores and periods, any violating characters should be replaced
with underscores.
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
xmlns:regexp="http://exslt.org/regular-expressions"
extension-element-prefixes="regexp">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/SampleSet"><!-->Should match <key_in_config>+'Set'<-->
<xsl:template match="SampleSet"><!-->Should match <key_in_config>+'Set'<-->
<SAMPLE_SET noNamespaceSchemaLocation="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.sample.xsd">
<xsl:for-each select="Sample"><!-->Should select from <key_in_config><-->
<SAMPLE>
Expand Down Expand Up @@ -52,13 +54,21 @@ with underscores.
<TAG>year_of_birth</TAG>
<VALUE><xsl:value-of select="Year_of_Birth"/></VALUE>
</SAMPLE_ATTRIBUTE>
<xsl:for-each select="*">
<xsl:if test="regexp:test(name(.), '^attribute__[a-z0-9_]*_$')">
<SAMPLE_ATTRIBUTE>
<TAG><xsl:value-of select="name(.)"/></TAG>
<VALUE><xsl:value-of select="."/></VALUE>
</SAMPLE_ATTRIBUTE>
</xsl:if>
</xsl:for-each>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
</xsl:for-each>
</SAMPLE_SET>
</xsl:template>

<xsl:template match="/AnalysisSet">
<xsl:template match="AnalysisSet">
<ANALYSIS_SET noNamespaceSchemaLocation="ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_5/SRA.analysis.xsd">
<xsl:for-each select="Analysis">
<ANALYSIS>
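The stylesheet tests element names against `^attribute__[a-z0-9_]*_$`, while the conf regex matches headers of the form `attribute_[...]`. The difference follows from the stylesheet comment: characters that are illegal in an XML tag are replaced with underscores, so both brackets become `_`. A minimal sketch of that mapping, assuming a one-for-one replacement (the helper `to_xml_tag` is hypothetical and is not taken from the converter):

```python
import re

# Assumption: every character outside letters, digits, hyphens, underscores and
# periods is replaced one-for-one with an underscore, as the stylesheet comment says.
INVALID_TAG_CHARS = re.compile(r'[^A-Za-z0-9_.\-]')

# Pattern copied from the regexp:test() call in the stylesheet hunk.
SANITIZED_ATTRIBUTE = re.compile(r'^attribute__[a-z0-9_]*_$')


def to_xml_tag(header):
    """Hypothetical helper: turn a column header into a legal XML element name."""
    tag = INVALID_TAG_CHARS.sub('_', header)
    # An XML name must start with a letter or an underscore.
    if not re.match(r'[A-Za-z_]', tag):
        tag = '_' + tag
    return tag


for header in ('attribute_[test_column_1]', 'Year of Birth'):
    tag = to_xml_tag(header)
    # Print the sanitized tag and whether the stylesheet regex would pick it up.
    print(header, '->', tag, bool(SANITIZED_ATTRIBUTE.match(tag)))
```

Under the same assumption, the placeholder header `attribute_[add_value]` would also sanitize to a name that matches the stylesheet regex, so it presumably has to be dropped before the XML is generated.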
Binary file not shown.
Binary file not shown.
15 changes: 7 additions & 8 deletions xls2xml/tests/data/example_samples.tsv
@@ -1,8 +1,7 @@
Sample_ID Subject_ID Geno_ID Phenotype Gender Analysis_alias Cohort ID Ethnicity Ethnicity Description T2D Case_Control Description Center_name Hispanic or Latino; of Spanish origin Age Year of Birth Year of first visit Cell Type Maternal_id Paternal_id Novel Attributes
SAM111111 SAM111111 MeSH:D006262 male AN001 CO1111 EUWH 0 Control Male normal WTGC cambridge 31 1986 Blood SAM111113 SAM111115
SAM111112 SAM111112 FEM1 MeSH:D006262 female AN001 CO1111 EUWH 0 Control normal WTGC cambridge 21 2016 Blood SAM111114 SAM111116
SAM111113 SAM111113 EFO:0001360 female AN001 CO1112 EUWH 1 Case WTGC cambridge 67 2016 Blood
SAM111114 SAM111114 EFO:0001360,EFO:0001359 female AN001 CO1112 EUWH 1 Case WTGC cambridge 72 2015 Blood
SAM111115 SAM111115 EFO:0001359 male AN001 CO1112 EUWH 0 Case WTGC cambridge 56 2017 Blood
SAM111116 SAM111116 EFO:0001359 male AN001 CO1112 UNKN 0 Case WTGC cambridge 1 73 2017 Blood

Sample_ID Subject_ID Geno_ID Phenotype Gender Analysis_alias Cohort ID Ethnicity Ethnicity Description T2D Case_Control Description Center_name Hispanic or Latino; of Spanish origin Age Year of Birth Year of first visit Cell Type Maternal_id Paternal_id Novel Attributes attribute_[test_column_1] attribute_[test_column_2] attribute_[add_value]
SAM111111 SAM111111 MeSH:D006262 male AN001 CO1111 EUWH 0 Control Male normal WTGC cambridge 31 1986 Blood SAM111113 SAM111115 attribute_test_value_1_1 attribute_test_value_2_1
SAM111112 SAM111112 FEM1 MeSH:D006262 female AN001 CO1111 EUWH 0 Control normal WTGC cambridge 21 2016 Blood SAM111114 SAM111116 attribute_test_value_1_2 attribute_test_value_2_2
SAM111113 SAM111113 EFO:0001360 female AN001 CO1112 EUWH 1 Case WTGC cambridge 67 2016 Blood attribute_test_value_1_3 attribute_test_value_2_3
SAM111114 SAM111114 EFO:0001360,EFO:0001359 female AN001 CO1112 EUWH 1 Case WTGC cambridge 72 2015 Blood attribute_test_value_1_4 attribute_test_value_2_4
SAM111115 SAM111115 EFO:0001359 male AN001 CO1112 EUWH 0 Case WTGC cambridge 56 2017 Blood attribute_test_value_1_5 attribute_test_value_2_5
SAM111116 SAM111116 EFO:0001359 male AN001 CO1112 UNKN 0 Case WTGC cambridge 1 73 2017 Blood attribute_test_value_1_6 attribute_test_value_2_6
48 changes: 48 additions & 0 deletions xls2xml/tests/data/example_samples.xml
Expand Up @@ -32,6 +32,14 @@
<TAG>year_of_birth</TAG>
<VALUE>1986</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_2_</TAG>
<VALUE>attribute_test_value_2_1</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_1_</TAG>
<VALUE>attribute_test_value_1_1</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<SAMPLE center_name="WTGC cambridge">
Expand Down Expand Up @@ -66,6 +74,14 @@
<TAG>year_of_birth</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_2_</TAG>
<VALUE>attribute_test_value_2_2</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_1_</TAG>
<VALUE>attribute_test_value_1_2</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<SAMPLE center_name="WTGC cambridge">
Expand Down Expand Up @@ -100,6 +116,14 @@
<TAG>year_of_birth</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_2_</TAG>
<VALUE>attribute_test_value_2_3</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_1_</TAG>
<VALUE>attribute_test_value_1_3</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<SAMPLE center_name="WTGC cambridge">
Expand Down Expand Up @@ -134,6 +158,14 @@
<TAG>year_of_birth</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_2_</TAG>
<VALUE>attribute_test_value_2_4</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_1_</TAG>
<VALUE>attribute_test_value_1_4</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<SAMPLE center_name="WTGC cambridge">
Expand Down Expand Up @@ -168,6 +200,14 @@
<TAG>year_of_birth</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_2_</TAG>
<VALUE>attribute_test_value_2_5</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_1_</TAG>
<VALUE>attribute_test_value_1_5</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
<SAMPLE center_name="WTGC cambridge">
Expand Down Expand Up @@ -202,6 +242,14 @@
<TAG>year_of_birth</TAG>
<VALUE/>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_2_</TAG>
<VALUE>attribute_test_value_2_6</VALUE>
</SAMPLE_ATTRIBUTE>
<SAMPLE_ATTRIBUTE>
<TAG>attribute__test_column_1_</TAG>
<VALUE>attribute_test_value_1_6</VALUE>
</SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
</SAMPLE>
</SAMPLE_SET>
8 changes: 4 additions & 4 deletions xls2xml/tests/test_MetadataValidator.py
Expand Up @@ -4,26 +4,26 @@

def test_valid_data():
validator = MetadataValidator('data/T2D_xls2xml_v1.schema')
reader = XLSReader('data/example_AMP_T2D_Submission_form_V2.xlsx', 'data/T2D_xls2xml_v1.conf')
reader = XLSReader('data/example_AMP_T2D_Submission_form_V3.xlsx', 'data/T2D_xls2xml_v3.conf')
reader.active_worksheet = 'Sample'
row = reader.next()
assert validator.validate_data(row, 'Sample')
reader.active_worksheet = 'Analysis'
row = reader.next()
assert validator.validate_data(row, 'Analysis')
reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
row = reader.next()
assert validator.validate_data(row, 'Sample')

def test_invalid_data():
validator = MetadataValidator('data/T2D_xls2xml_v1.schema')
reader = XLSReader('data/example_AMP_T2D_Submission_form_V2.xlsx', 'data/T2D_xls2xml_v1.conf')
reader = XLSReader('data/example_AMP_T2D_Submission_form_V3.xlsx', 'data/T2D_xls2xml_v3.conf')
reader.active_worksheet = 'Sample'
row = reader.next()
assert not validator.validate_data(row, 'Analysis')
reader.active_worksheet = 'Analysis'
row = reader.next()
assert not validator.validate_data(row, 'Sample')
reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
row = reader.next()
assert not validator.validate_data(row, 'Analysis')
36 changes: 20 additions & 16 deletions xls2xml/tests/test_TSVReader.py
@@ -1,20 +1,20 @@
from xls2xml import TSVReader

def test_get_valid_conf_keys():
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
assert set(tsv_reader.get_valid_conf_keys()) == {'Sample'}
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Analysis')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Analysis')
assert tsv_reader.get_valid_conf_keys() == []

def test_set_current_conf_key():
# set_current_conf_key() should do nothing
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
assert tsv_reader.is_valid()
assert set(tsv_reader.get_valid_conf_keys()) == {'Sample'}
tsv_reader.set_current_conf_key('Analysis')
assert tsv_reader.is_valid()
assert set(tsv_reader.get_valid_conf_keys()) == {'Sample'}
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Analysis')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Analysis')
assert not tsv_reader.is_valid()
assert tsv_reader.get_valid_conf_keys() == []
tsv_reader.set_current_conf_key('Sample')
Expand All @@ -23,32 +23,36 @@ def test_set_current_conf_key():


def test_is_not_valid():
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Analysis')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Analysis')
assert not tsv_reader.is_valid()

def test_is_valid():
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
assert tsv_reader.is_valid()

def test_get_current_headers():
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
headers = tsv_reader.get_current_headers()
assert isinstance(headers, list)
assert set(headers) == {'Sample_ID', 'Subject_ID', 'Geno_ID', 'Phenotype', 'Gender', 'Analysis_alias', 'Cohort ID',
'Ethnicity', 'Ethnicity Description', 'T2D', 'Case_Control', 'Description', 'Center_name',
'Hispanic or Latino; of Spanish origin', 'Age', 'Year of Birth', 'Year of first visit',
'Cell Type', 'Maternal_id', 'Paternal_id', 'Novel Attributes'}
'Cell Type', 'Maternal_id', 'Paternal_id', 'Novel Attributes', 'attribute_[test_column_2]',
'attribute_[test_column_1]', 'attribute_[add_value]'}

def test_next():
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v1.conf', 'Sample')
tsv_reader = TSVReader('data/example_samples.tsv', 'data/T2D_xls2xml_v3.conf', 'Sample')
row = tsv_reader.next()
assert isinstance(row, dict)
assert 0 == cmp(row, {'Novel Attributes': None, 'Ethnicity Description': None, 'Description': 'Male normal',
'Cell Type': 'Blood', 'Maternal_id': 'SAM111113', 'Center_name': 'WTGC cambridge',
'Gender': 'male', 'Subject_ID': 'SAM111111', 'Paternal_id': 'SAM111115', 'T2D': 0,
'Hispanic or Latino; of Spanish origin': None, 'Cohort ID': 'CO1111', 'Year of Birth': '1986',
'Age': '31', 'Analysis_alias': 'AN001', 'Sample_ID': 'SAM111111', 'Geno_ID': None,
'Year of first visit': None, 'Case_Control': 'Control', 'Ethnicity': 'EUWH',
'Phenotype': 'MeSH:D006262'})
print(row)
assert 0 == cmp(row, {'Hispanic or Latino; of Spanish origin': None, 'Phenotype': 'MeSH:D006262',
'attribute_[test_column_1]': 'attribute_test_value_1_1', 'Description': 'Male normal',
'Center_name': 'WTGC cambridge', 'Case_Control': 'Control', 'T2D': 0,
'Analysis_alias': 'AN001', 'Geno_ID': None, 'Year of first visit': None, 'Cell Type': 'Blood',
'Maternal_id': 'SAM111113', 'Gender': 'male', 'Subject_ID': 'SAM111111',
'Paternal_id': 'SAM111115', 'Cohort ID': 'CO1111',
'attribute_[test_column_2]': 'attribute_test_value_2_1', 'Novel Attributes': None,
'Ethnicity Description': None, 'Year of Birth': '1986', 'Sample_ID': 'SAM111111', 'Age': '31',
'Ethnicity': 'EUWH'})
for row in tsv_reader:
assert isinstance(row, dict)