proteome_xml_for_prokka

An XSLT stylesheet that transforms a UniProt proteome XML file into a protein FASTA file for Prokka.

For each protein in the XML file, it extracts the information needed to write a FASTA record that can be used as a custom database with Prokka's --proteins argument. Each header has the form:

>SeqID EC_number~~~gene~~~product~~~COG
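
To make the header layout concrete, here is a minimal sketch that splits one header into its fields with awk. The function name parse_prokka_header is hypothetical and not part of this repository. The sketch assumes the sequence ID ends at the first space and that the remaining fields are separated by `~~~`, as in the format line above.

```shell
# Split a Prokka-style FASTA header into its fields.
# parse_prokka_header is a hypothetical helper name, not part of this repository.
parse_prokka_header() {
  printf '%s\n' "$1" | awk '
    {
      sub(/^>/, "")                 # drop the leading ">"
      i = index($0, " ")            # the first space ends the sequence ID
      id = substr($0, 1, i - 1)
      # Remaining text: EC_number~~~gene~~~product~~~COG (any field may be empty)
      split(substr($0, i + 1), f, "~~~")
      printf "id=%s\nec=%s\ngene=%s\nproduct=%s\ncog=%s\n", id, f[1], f[2], f[3], f[4]
    }'
}

parse_prokka_header '>LDH_LACCA 1.1.1.27~~~ldh~~~L-lactate dehydrogenase~~~COG0039'
```

A missing EC number simply yields an empty `ec=` line, because the header then begins its annotation directly with `~~~`.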

Installation

Install an XSLT processor

For example, xsltproc on Ubuntu/Debian:

sudo apt-get install xsltproc

Download the XSLT file

Download proteome_xml_for_prokka.xslt from GitHub.

Usage

xsltproc proteome_xml_for_prokka.xslt UNIPROT_XML_FILE

Examples

Single protein

Download a protein XML file from UniProt (e.g. L-lactate dehydrogenase from Lactobacillus casei)

curl \
  --fail \
  "https://www.uniprot.org/uniprot/P00343.xml" \
  --output "P00343.xml" # 25.3 kB

Conversion to FASTA

xsltproc \
  proteome_xml_for_prokka.xslt \
  P00343.xml \
  > P00343.faa

Expected result:

>LDH_LACCA 1.1.1.27~~~ldh~~~L-lactate dehydrogenase~~~COG0039
MASITDKDHQKVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSAEYSDAKDADLVVITAGAPQKPGETRLDLVNKNLKILKSIVDPIVDSGFNGIFLVAANPVDILTYATWKLSGFPKNRVVGSGTSLDTARFRQSIAEMVNVDARSVHAYIMGEHGDTEFPVWSHANIGGVTIAEWVKAHPEIKEDKLVKMFEDVRDAAYEIIKLKGATFYGIATALARISKAILNDENAVLPLSVYMDGQYGLNDIYIGTPAVINRNGIQNILEIPLTDHEEESMQKSASQLKKVLTDAFAKNDIETRQ

Proteome

Download a UniProt proteome XML file (e.g. Lactobacillus acidophilus)

curl \
  --fail \
  'https://www.uniprot.org/uniprot/?query=proteome:UP000006381&format=xml' \
  --output 'UP000006381.xml' # 11.8 MB

Conversion to FASTA

xsltproc \
  proteome_xml_for_prokka.xslt \
  UP000006381.xml \
  > UP000006381.faa

Expected result (first 3 entries):

>RPOE_LACAC ~~~rpoE~~~Probable DNA-directed RNA polymerase subunit delta~~~COG3343
MGLDKFKDKNRDELSMIEVARAILEDNGKRMAFADIVNAVQKYLNKSDEEIRERLPQFYTDMNTDGEFISMGENVWALRSWFPYESVDEEVNHPEDEEEDDSRKHHKKVNAFLASATGDDDIIDYDNDDPEDDDLDAATDDSDDDYSDDDSDYDEDNDDADDVLPDGIEGQLSQLNDEDDDEDD
>XPT_LACAC 2.4.2.22~~~xpt~~~Xanthine phosphoribosyltransferase~~~COG0503
MKLLEERIKRDGEVLDGNVLKINSFLNHQVDPKLMMEVGKEFKRLFAGEQIDKVLTCEASGIAPGVMTAYQLGVPMVFARKKKPSTLNDAVYWADVFSYTKKVNSKICVEEKFLHEGENILIIDDFVAHGEAVKGMVNIAKQAHCNIVGVGAVVAKTFQGGSDWVKDEGLRFESLASIASFKDGQVHFEGEE
>RL6_LACAC ~~~rplF~~~50S ribosomal protein L6~~~COG0097
MSRIGLKTIEVPDSVTVTKEGDNITVKGPKGELTRYFDPKITFEQNDGEINFSRSSESDKALHGTERANLASMIEGVLNGYKKTLKLIGVGYRAQAQGNKITLNVGYSHPVVLTAPEGVSVKATSATDVEVEGVSKQDVGQFAAEIRAVRPPEPYKGKGIRYVDEYVRRKEGKTGK
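
As a quick sanity check on a converted file, the following sketch counts the FASTA records and how many of them carry an EC number. The function name summarize_faa is hypothetical and not part of this repository. It assumes every header follows the `>SeqID EC_number~~~gene~~~product~~~COG` layout, with the EC number being the text between the first space and the first `~~~`. The sample input reuses the three headers shown above, with sequence lines omitted for brevity.

```shell
# Count records and how many carry an EC number in a Prokka-style FASTA file.
# summarize_faa is a hypothetical helper name, not part of this repository.
summarize_faa() {
  awk '
    /^>/ {
      total++
      # The annotation starts after the first space; the EC number is the
      # text before the first "~~~" and may be empty.
      ann = substr($0, index($0, " ") + 1)
      split(ann, f, "~~~")
      if (f[1] != "") with_ec++
    }
    END { printf "records=%d with_ec=%d\n", total, with_ec }
  ' "$1"
}

# Demo on the three headers from the expected result (sequences omitted).
sample=$(mktemp)
cat > "$sample" <<'EOF'
>RPOE_LACAC ~~~rpoE~~~Probable DNA-directed RNA polymerase subunit delta~~~COG3343
>XPT_LACAC 2.4.2.22~~~xpt~~~Xanthine phosphoribosyltransferase~~~COG0503
>RL6_LACAC ~~~rplF~~~50S ribosomal protein L6~~~COG0097
EOF
summarize_faa "$sample"
rm -f "$sample"
```

For the three sample headers this prints `records=3 with_ec=1`. On a real conversion, pass the generated file instead, e.g. `summarize_faa UP000006381.faa`.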