XSLT stylesheet to transform a UniProt proteome XML file into a protein FASTA file for Prokka.
It extracts information for each protein from the XML file and formats a FASTA file that can be used as a custom database with the --proteins
argument in Prokka. Each FASTA header follows this format:
>SeqID EC_number~~~gene~~~product~~~COG
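To make the header layout concrete, here is a minimal shell sketch that splits one header line into its four annotation fields. The function name `parse_prokka_header` is hypothetical and is not part of Prokka or of this repository; it only assumes the `>SeqID EC_number~~~gene~~~product~~~COG` layout shown above.

```shell
# Hypothetical helper (not part of Prokka): split a header of the form
# ">SeqID EC_number~~~gene~~~product~~~COG" into labelled fields.
parse_prokka_header() {
    header=${1#>}              # drop the leading '>'
    seq_id=${header%% *}       # text before the first space
    annotation=${header#* }    # text after the first space
    # '~~~' is the field separator; empty fields (e.g. no EC number) are kept.
    printf '%s\n' "$annotation" | awk -F'~~~' -v id="$seq_id" '{
        printf "id=%s\nec=%s\ngene=%s\nproduct=%s\ncog=%s\n", id, $1, $2, $3, $4
    }'
}

parse_prokka_header '>LDH_LACCA 1.1.1.27~~~ldh~~~L-lactate dehydrogenase~~~COG0039'
```

A missing field, such as an absent EC number, simply appears as an empty value (`ec=`), because the `~~~` separators are still present in the header.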
Example
Install xsltproc, with Ubuntu/Debian:
sudo apt-get install xsltproc
Download proteome_xml_for_prokka.xslt from GitHub and run it on a UniProt XML file:
xsltproc proteome_xml_for_prokka.xslt UNIPROT_XML_FILE
Download a protein XML file from UniProt (e.g. L-lactate dehydrogenase from Lactobacillus casei)
curl \
--fail \
"https://www.uniprot.org/uniprot/P00343.xml" \
--output "P00343.xml" # 25.3 kB
Conversion to FASTA
xsltproc \
proteome_xml_for_prokka.xslt \
P00343.xml \
> P00343.faa
Expected result:
>LDH_LACCA 1.1.1.27~~~ldh~~~L-lactate dehydrogenase~~~COG0039
MASITDKDHQKVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSAEYSDAKDADLVVITAGAPQKPGETRLDLVNKNLKILKSIVDPIVDSGFNGIFLVAANPVDILTYATWKLSGFPKNRVVGSGTSLDTARFRQSIAEMVNVDARSVHAYIMGEHGDTEFPVWSHANIGGVTIAEWVKAHPEIKEDKLVKMFEDVRDAAYEIIKLKGATFYGIATALARISKAILNDENAVLPLSVYMDGQYGLNDIYIGTPAVINRNGIQNILEIPLTDHEEESMQKSASQLKKVLTDAFAKNDIETRQ
Download a UniProt proteome XML file (e.g. Lactobacillus acidophilus)
curl \
--fail \
'https://www.uniprot.org/uniprot/?query=proteome:UP000006381&format=xml' \
--output 'UP000006381.xml' # 11.8 MB
Conversion to FASTA
xsltproc \
proteome_xml_for_prokka.xslt \
UP000006381.xml \
> UP000006381.faa
Expected result (first 3 entries):
>RPOE_LACAC ~~~rpoE~~~Probable DNA-directed RNA polymerase subunit delta~~~COG3343
MGLDKFKDKNRDELSMIEVARAILEDNGKRMAFADIVNAVQKYLNKSDEEIRERLPQFYTDMNTDGEFISMGENVWALRSWFPYESVDEEVNHPEDEEEDDSRKHHKKVNAFLASATGDDDIIDYDNDDPEDDDLDAATDDSDDDYSDDDSDYDEDNDDADDVLPDGIEGQLSQLNDEDDDEDD
>XPT_LACAC 2.4.2.22~~~xpt~~~Xanthine phosphoribosyltransferase~~~COG0503
MKLLEERIKRDGEVLDGNVLKINSFLNHQVDPKLMMEVGKEFKRLFAGEQIDKVLTCEASGIAPGVMTAYQLGVPMVFARKKKPSTLNDAVYWADVFSYTKKVNSKICVEEKFLHEGENILIIDDFVAHGEAVKGMVNIAKQAHCNIVGVGAVVAKTFQGGSDWVKDEGLRFESLASIASFKDGQVHFEGEE
>RL6_LACAC ~~~rplF~~~50S ribosomal protein L6~~~COG0097
MSRIGLKTIEVPDSVTVTKEGDNITVKGPKGELTRYFDPKITFEQNDGEINFSRSSESDKALHGTERANLASMIEGVLNGYKKTLKLIGVGYRAQAQGNKITLNVGYSHPVVLTAPEGVSVKATSATDVEVEGVSKQDVGQFAAEIRAVRPPEPYKGKGIRYVDEYVRRKEGKTGK
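Before passing the generated file to Prokka, a quick sanity check can be useful. The sketch below counts the FASTA records and lists any header whose annotation does not contain exactly four `~~~`-separated fields. Both function names are hypothetical helpers written for this illustration, and the check assumes only the header layout described at the top of this document.

```shell
# Hypothetical helper: count FASTA records (lines starting with '>').
count_fasta_entries() {
    grep -c '^>' "$1"
}

# Hypothetical helper: print headers whose annotation part (everything
# after the first space) does not split into exactly four '~~~' fields.
check_prokka_headers() {
    awk '/^>/ {
        annotation = substr($0, index($0, " ") + 1)
        if (split(annotation, fields, "~~~") != 4) print
    }' "$1"
}

# Example usage on the file produced above:
#   count_fasta_entries UP000006381.faa
#   check_prokka_headers UP000006381.faa
```

If `check_prokka_headers` prints nothing, every header matches the expected layout; any printed line points at an entry worth inspecting in the source XML.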