Manage the flu-crew swine surveillance database
octofludb is most easily installed through PyPI:
pip install octofludb
You may need to replace pip with pip3 on your system.
octofludb also depends on GraphDB, a triplestore database to which all data
is pushed. Go to https://graphdb.ontotext.com and download the GraphDB Free
version.
You can get a list of subcommands with octofludb -h. More detailed
documentation is available for each subcommand, e.g., octofludb query -h.
octofludb has builtin documentation for each subcommand that describes all
arguments and general usage. Below I will cover each of the subcommands at
a higher level with examples and background info as needed.
To start a new database, first you have to call the daemon:
$ graphdb -d
Then initialize the octofludb database:
$ octofludb init
If an octofludb repo is already present in GraphDB, this command does nothing. It will NOT destroy an existing database, so it is always safe to run.
If you want to destroy the database and rebuild, then you can use pgraphdb.
This is a Python wrapper around GraphDB that is used internally throughout
octofludb. It can be installed with pip: pip3 install pgraphdb.
pgraphdb rm_repo octofludb
Use of this command requires some understanding of the octofludb
configuration.
The octofludb home directory is ~/.octofludb. This directory is created when
you call octofludb init and the default configuration file config.yaml is
copied into it. The config.yaml file is well documented and whatever it says
is more likely to be true than what I write here, so please look it over. In
short, it specifies where the data octofludb needs can be found.
A particularly important field is octoflu_reference, which allows you to
specify a local reference fasta file that will be used by octoFLU to classify
all strains.
The command octofludb pull, without any additional flags, will do the
following:
- Go to the $OCTOFLUDB_HOME/build directory (create it if it doesn't exist).
- Load the local config.yaml file.
- Load the schema.ttl and geography.ttl Turtle files. These are both stored in the octofludb Python package in the octofludb/data folder. The schema.ttl file includes logical relationships such as the subProperty relationship between H1 and HA that allows the database to infer that a subject is an HA if it is an H1. The geography.ttl file includes hierarchical relationships between countries and regions (Iowa is in the USA and the USA is in North America) and the relationships between country names (United Arab Emirates) and ISO country codes (ARE).
- Retrieve and process all Influenza A Virus data from GenBank. This step will take a long time, most of a day. There are two ways to speed this up. First, if you are building a new repo, and octofludb's GenBank processing code hasn't changed, you can go to your ~/.octofludb/build directory and run octofludb upload .gb*ttl. This will upload all the past GenBank turtle files. Of course, you can also copy over someone else's files. It is safe to re-upload the same files many times -- no duplicate triples will be created. Second, if you are already mostly up-to-date and just need to add the latest few months of GenBank data, you can add a --nmonths=4 argument to just pull the data submitted in the last 4 months. All data is pulled by submission date, NOT collection date, so incrementally pulling the last few months every month will be fine.
- (OPTIONAL) If the --include-gisaid flag is included, and if paths to gisaid sequence and metadata files are in the config.yaml file, then gisaid data will be parsed and uploaded.
- All unclassified swine sequences are classified with octoFLU using the reference file listed in config.yaml or, if this field is missing, the default octoFLU reference.
- The subtype of each strain is determined. GenBank has a subtype annotation under the serotype field, octoFLU also determines the subtype of the HA and NA, and for gisaid a gisaid_subtype field is extracted. octofludb synthesizes all this info to create one subtype. If there are conflicts, the octoFLU subtype is given priority followed by the GenBank subtype. NOTE: due to a performance bug in the Python rdflib library, this step can take a long time to serialize to a turtle file; you'll have to wait for it.
- H1 and H3 antigenic motifs are determined from the HA sequences using the flutile module.
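The priority rule in the subtype step can be sketched in a few lines of Python. This is only an illustration of the rule as described above (octoFLU first, then GenBank, then gisaid), not octofludb's actual code; the function name `resolve_subtype` and the use of `None` for a missing annotation are my assumptions.

```python
from typing import Optional


def resolve_subtype(
    octoflu: Optional[str],
    genbank: Optional[str],
    gisaid: Optional[str],
) -> Optional[str]:
    """Return the first available subtype in priority order.

    Hypothetical helper: octoFLU wins over GenBank, which wins over gisaid.
    """
    for candidate in (octoflu, genbank, gisaid):
        if candidate:  # skip None and empty strings
            return candidate
    return None


# A conflict between octoFLU and GenBank is resolved in favor of octoFLU.
conflict = resolve_subtype("H1N2", "H1N1", None)
# With no octoFLU call, the GenBank serotype is used.
fallback = resolve_subtype(None, "H3N2", "H3N1")
```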
Once you've uploaded your data, you will want to access it. This is done with
octofludb query. Data is pulled using SPARQL queries. SPARQL is a query
language for accessing data in a triplestore database, just like SQL is a query
language for relational databases. There are no join statements in SPARQL. A
triplestore is, in principle, nothing more than a long list of subject,
predicate, object triples. A SPARQL query defines a pattern over these triples
and returns every set of values for which the pattern is true.
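To make the "pattern over triples" idea concrete, here is a toy matcher in plain Python. It is a sketch of the concept only and has nothing to do with how GraphDB evaluates queries; the ids `s1` and `s2` and the `match` function are made up for illustration. Note that the two patterns are connected simply by sharing the `?sid` variable, which is why no join statement is needed.

```python
# Toy triplestore: a plain list of (subject, predicate, object) triples.
TRIPLES = [
    ("s1", "strain_name", "A/swine/R123456/Iowa/2020"),
    ("s1", "host", "swine"),
    ("s2", "strain_name", "A/horse/T4343/Canada/2020"),
    ("s2", "host", "horse"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


rows = list(
    match(
        [("?sid", "strain_name", "?strain"), ("?sid", "host", "?host")],
        TRIPLES,
    )
)
for row in rows:
    print(row["?strain"], row["?host"])
```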
Here is a simple example:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?strain ?host ?genbank_id ?country_code ?date
WHERE {
?sid f:strain_name ?strain .
?sid f:host ?host .
?sid f:date ?date .
?sid f:country ?country .
?country f:code ?country_code .
}
LIMIT 10
The first line defines our unique flu-crew namespace, which ensures there are no name conflicts if we link this database to other SPARQL databases.
The SELECT line determines which columns are in the returned table.
The WHERE block is a set of patterns that defines what is returned.
All subjects and relations in a triplestore are UUIDs. Objects (such as
?strain or ?country) may be values or UUIDs. So ?sid is some UUID such
that it has one or more f:strain_name edges, one or more f:host edges, one
or more f:date edges, and one or more f:country edges. Countries are stored
both by name and by 3-letter ISO code. ?country is a UUID that links to the
literal code and name strings. The final two lines of the WHERE clause could
alternatively be written as ?sid f:country/f:code ?country_code.
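The f:country/f:code shorthand is a SPARQL sequence path: follow one edge, then the next. A rough Python sketch of what that means over a toy list of triples is below. The node names (`strain1`, `country_usa`), the `f:name` predicate, and the helper functions are hypothetical and only there to make the example run.

```python
# Toy triples; node ids and the f:name predicate are made up for illustration.
TRIPLES = [
    ("strain1", "f:country", "country_usa"),
    ("country_usa", "f:code", "USA"),
    ("country_usa", "f:name", "United States"),
]


def objects(subject, predicate, triples):
    """Return every object reachable from `subject` over one `predicate` edge."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow_path(subject, path, triples):
    """Follow a '/'-separated sequence path such as 'f:country/f:code'."""
    frontier = [subject]
    for predicate in path.split("/"):
        frontier = [o for node in frontier for o in objects(node, predicate, triples)]
    return frontier


codes = follow_path("strain1", "f:country/f:code", TRIPLES)
print(codes)
```

The intermediate country node never needs a variable name of its own, which is the whole appeal of the path form.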
The final line LIMIT 10 limits the return data to just 10 entries. I often
place limits on queries while I am building them.
While tables are often needed, in most of our pipelines we would rather have
FASTA files with annotated headers. This can be accomplished by adding the
--fasta flag to the octofludb query command and placing sequence data as
the final entry in the SELECT statement. For example:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?strain ?host ?genbank_id ?country_code ?date ?seq
WHERE {
?sid f:strain_name ?strain .
?sid f:host ?host .
?sid f:date ?date .
?sid f:country ?country .
?country f:code ?country_code .
?sid f:has_segment ?gid .
?gid f:segment_name "HA" .
?gid f:dnaseq ?seq .
}
LIMIT 10
I usually pass sequences through smof to clean them up:
$ octofludb query --fasta myquery.rq | smof clean -t n -drux > myseqs.fna
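Conceptually, the --fasta flag turns each result row into one FASTA record, with the last column as the sequence and the remaining columns as the header. The sketch below shows that transformation in Python. It is not the octofludb implementation, and joining the header fields with '|' is my assumption about the output format.

```python
def rows_to_fasta(rows, sep="|"):
    """Format rows as FASTA text.

    All fields but the last are joined into the header; the last field is
    the sequence. The '|' separator is an assumption, not a documented fact.
    """
    records = []
    for row in rows:
        *header_fields, seq = row
        header = sep.join(str(field) for field in header_fields)
        records.append(f">{header}\n{seq}")
    return "\n".join(records) + "\n"


example_rows = [
    ("A/swine/R123456/Iowa/2020", "swine", "USA", "2020-06-21", "GATACAGATACA"),
]
print(rows_to_fasta(example_rows), end="")
```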
For complete examples with filters, aggregates, and optional data, see the
*.rq query files in the octofludb/data folder of the octofludb git repo.
To explore the relations in the database, you can run queries such as the following:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?gid ?p ?o WHERE {
?sid f:has_segment ?gid .
?sid f:host "swine" .
?gid f:segment_name "HA" .
?gid f:clade ?clade .
?gid ?p ?o .
}
LIMIT 100
This gives you a list of links from a swine segment to other data.
Like octofludb query, octofludb update also takes a SPARQL file. They are
called in different ways since query is a read-only call and update can
delete values from the database. I have only used update for deletion, though
it can also insert data.
Use this command with care -- deletion can't be undone. When possible, use the
octofludb delete subcommand to delete specific selections of data (such as
all clades or all motifs).
As an example, say you want to reclassify all strains in the 1C lineage. You can do this by updating the octoFLU reference file and running:
octofludb pull --no-subtype --no-motifs --no-schema --nmonths=0
This will just run the classification step.
However, it changes nothing, because octofludb finds that all the 1C's are
already annotated and tries to save you time by not passing them to octoFLU.
You could create your own table of data linking genbank ids to clades with
octofludb prep table. But then you would end up with two clades for each 1C
strain. The triplestore doesn't know that this relationship should be unique
(there is a way to give it this information, but let's save that for later).
Instead, we need to specifically delete the 1C clade triples. This can be done
with octofludb update and the following SPARQL file:
PREFIX f: <https://flu-crew.org/term/>
DELETE {
?sid f:gl_clade ?clade .
} WHERE {
?sid f:gl_clade ?clade .
FILTER(REGEX(?clade, "1C")) .
}
You can check that octofludb did indeed delete the 1C clades with the query:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?sid ?clade
WHERE {
?sid f:gl_clade ?clade .
FILTER(REGEX(?clade, "1C")) .
}
Which should return nothing.
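One thing worth knowing about the filter: SPARQL's REGEX matches anywhere in the string unless the pattern is anchored, so "1C" would also hit any clade name that merely contains "1C". The Python sketch below mimics the delete over a toy list of triples to show the difference. The segment ids and the clade name "1B.2.1C" are invented for the demonstration, and `delete_matching` is a hypothetical helper, not part of octofludb.

```python
import re


def delete_matching(triples, predicate, pattern):
    """Drop triples with `predicate` whose object matches the regex anywhere."""
    regex = re.compile(pattern)
    return [
        (s, p, o)
        for s, p, o in triples
        if not (p == predicate and regex.search(o))
    ]


triples = [
    ("seg1", "f:gl_clade", "1C.2.1"),
    ("seg2", "f:gl_clade", "1A.3.3.2"),
    ("seg3", "f:gl_clade", "1B.2.1C"),  # invented name that contains "1C"
]

# Unanchored, like REGEX(?clade, "1C"): seg1 and seg3 are both removed.
remaining = delete_matching(triples, "f:gl_clade", "1C")
# Anchored, like REGEX(?clade, "^1C"): only seg1 is removed.
anchored = delete_matching(triples, "f:gl_clade", "^1C")
```

Whether this matters depends on the clade names actually in your database, so running the SELECT version of the query before the DELETE is a cheap way to see what would be removed.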
octofludb delete can be used as an alternative to the lower-level
octofludb update SPARQL calls described above. It deletes specific kinds of
data, such as all constellations, all subtypes, all us-clades and global
clades, or all motifs.
Honestly, I never use this subcommand and am not sure if it works. Maybe I should remove it?
If you want to work with data beyond that which octofludb pull loads, then
you will need prep. This subcommand parses data out of various formats and
writes it into Turtle files that can be uploaded to the database.
The different file formats prep supports are available as a list of
sub-subcommands. The three most important of these are fasta, table, and
unpublished. The other sub-subcommands are slated for deprecation.
Internally, octofludb
tries to fit a FASTA file to a table where columns are
derived from the headers and the last column is the sequences. The default
delimiter for a FASTA header is '|'. So the following FASTA file:
>A/swine/R123456/Iowa/2020|H3N2|2020-06-21
GATACAGATACAGATACAGATACAGATACAGATACA
>A/horse/T4343/Canada/2020|H1N2|2020-07-04
CATCATCATCATCATCATCATCATCATCATCATCAT
Would be viewed internally as the table:
strain_name                subtype  date        dnaseq
A/swine/R123456/Iowa/2020  H3N2     2020-06-21  GATACA...
A/horse/T4343/Canada/2020  H1N2     2020-07-04  CATCAT...
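The FASTA-to-table view can be sketched as follows. This is a minimal illustration of the idea (split each header on the delimiter, append the sequence as the last column), not the parser octofludb actually uses; `fasta_to_rows` is a hypothetical name.

```python
def fasta_to_rows(text, sep="|"):
    """Split each FASTA header on `sep` and append the sequence as the last column."""
    rows = []
    fields, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if fields is not None:
                rows.append(fields + ["".join(seq_lines)])
            fields, seq_lines = line[1:].split(sep), []
        else:
            seq_lines.append(line)
    if fields is not None:
        rows.append(fields + ["".join(seq_lines)])
    return rows


fasta = """>A/swine/R123456/Iowa/2020|H3N2|2020-06-21
GATACAGATACAGATACAGATACAGATACAGATACA
>A/horse/T4343/Canada/2020|H1N2|2020-07-04
CATCATCATCATCATCATCATCATCATCATCATCAT
"""
table = fasta_to_rows(fasta)
for row in table:
    print(row[:3], row[3][:6] + "...")
```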
octofludb will attempt to guess the type of each column. Each element in
the first field matches the parser for the strain_name type, the second field
matches the parser for subtype, the third for date, and the fourth for DNA
sequence. But here octofludb hits a snag. How should the DNA sequence be
connected to the strain name? Strains don't have DNA sequences. Only the
segments do. This query shows the problem:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?strain ?seq
WHERE {
?sid f:strain_name ?strain .
?sid f:has_segment ?gid .
?gid f:dnaseq ?seq .
}
A strain in octofludb is linked to many segment UUIDs. Each segment UUID is
linked to a DNA sequence. The UUIDs for segments are usually derived from the
genbank or epiflu ids. For example, MW872131 is translated to the UUID
<https://flu-crew.org/id/mw872131>. Because there is a deterministic way of
mapping a GenBank id to a UUID, any new data that comes in that is associated
with a known GenBank ID can be automatically linked. But in the above FASTA, no
ID is specified for the segment.
octofludb circumvents this issue by using the md5 hash of the DNA sequence as
the base of the UUID. This will merge together segments that have duplicate
sequences. So octofludb prep fasta will produce the turtle file:
@prefix f: <https://flu-crew.org/term/> .
@prefix fid: <https://flu-crew.org/id/> .
@prefix usa: <https://flu-crew.org/geo/country/usa/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<https://flu-crew.org/id/a%2Fhorse%2Ft4343%2Fcanada%2F2020> f:date "2020-07-04"^^xsd:date ;
f:has_segment fid:e7e8df44044f1ee2b4fd9e60b33efe9a ;
f:strain_name "A/horse/T4343/Canada/2020" ;
f:subtype "H1N2" .
<https://flu-crew.org/id/a%2Fswine%2Fr123456%2Fiowa%2F2020> f:date "2020-06-21"^^xsd:date ;
f:has_segment fid:1ca74a6402f97cf38907d194bf6b837a ;
f:state usa:IA ;
f:strain_name "A/swine/R123456/Iowa/2020" ;
f:subtype "H3N2" .
fid:1ca74a6402f97cf38907d194bf6b837a f:chksum "1ca74a6402f97cf38907d194bf6b837a" ;
f:dnaseq "GATACAGATACAGATACAGATACAGATACAGATACA" .
fid:e7e8df44044f1ee2b4fd9e60b33efe9a f:chksum "e7e8df44044f1ee2b4fd9e60b33efe9a" ;
f:dnaseq "CATCATCATCATCATCATCATCATCATCATCATCAT" .
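The identifier scheme can be approximated in Python as below. The accession and strain-name mappings reproduce the identifiers shown in this document (lowercase, percent-encoded, under https://flu-crew.org/id/). The exact input that octofludb feeds to md5 (case, whitespace handling) is not stated here, so `sequence_uri` is an assumption and may not reproduce the checksums in the Turtle output above. All three function names are hypothetical.

```python
import hashlib
from urllib.parse import quote

ID_BASE = "https://flu-crew.org/id/"


def accession_uri(accession):
    """Map a GenBank or epiflu accession to an id by lowercasing it."""
    return ID_BASE + accession.strip().lower()


def strain_uri(strain_name):
    """Lowercase the strain name and percent-encode everything, including '/'."""
    return ID_BASE + quote(strain_name.lower(), safe="")


def sequence_uri(seq):
    """Derive an id from the md5 of the sequence text.

    Assumption: the sequence is hashed exactly as given, encoded as ASCII.
    """
    digest = hashlib.md5(seq.encode("ascii")).hexdigest()
    return ID_BASE + digest


print(accession_uri("MW872131"))
print(strain_uri("A/horse/T4343/Canada/2020"))
# Identical sequences always map to the same id, which is what merges duplicates.
same = sequence_uri("GATACA") == sequence_uri("GATACA")
```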
Whenever you use octofludb prep, you should always double check the resulting
Turtle files before uploading them to the database to ensure the right
information is being pushed.
Not much to say here: the upload command uploads Turtle files to the
database. Any triples that are already present in the database are ignored.
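The reason re-uploading is harmless is that an RDF graph is a set of triples, so adding a triple that is already there changes nothing. A Python set models this directly. The `upload` function below is a toy stand-in for illustration, not the octofludb or GraphDB API.

```python
def upload(store, triples):
    """Add triples to the store (a set) and report how many were new."""
    before = len(store)
    store.update(triples)
    return len(store) - before


store = set()
batch = [
    ("seg1", "f:segment_name", "HA"),
    ("seg1", "f:dnaseq", "GATACA"),
]

first = upload(store, batch)   # both triples are new
second = upload(store, batch)  # same file again: nothing is added
print(first, second, len(store))
```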
octofludb classify takes a fasta file as an argument and produces a table
showing the clades. It is a wrapper around octoFLU that uses the locally
specified reference file. The path to this reference file may be specified in
the ~/.octofludb/config.yaml file. If no reference file is specified, the
default octoFLU reference is used.
Any report-generating logic that is important enough to crystallize into octofludb itself, and that consists of both a query and filtering/cleaning in Python, should be added here.
There are currently two useful reports.
The first is octofludb report masterlist. This produces the file that is the input to octoflushow and the quarterly reports.
The second is octofludb report monthly. This produces the input to the monthly WGS selection pipeline.
There are currently two other commands: offlu and quarter. offlu is a
stub that will eventually produce the public and gisaid inputs to the VCM-offlu
pipeline. quarter is currently just a wrapper around masterlist, since both
octoflushow and the quarterly report use the same input.
octofludb fetch contains a selection of tools for working with specific sets of identifiers.
For example:
octofludb fetch tag id-list.txt
This will read all the ids in id-list.txt and associate them all with a unique
tag. The ids may be strain names, genbank ids, epiflu ids, epiflu isolate ids,
or barcodes. octofludb understands all these ids and will figure out the right
thing for you.
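I won't describe here how octofludb tells these id types apart, but the general idea of recognizing an id by its shape can be sketched with a few regular expressions. Everything below is an assumption made for illustration: the patterns, the type names, and the `guess_id_type` function are mine, and barcodes are left as "unknown" because their format is not described in this document.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
ID_PATTERNS = [
    ("epiflu_isolate", re.compile(r"^EPI_ISL_\d+$")),
    ("epiflu", re.compile(r"^EPI\d+$")),
    # Common GenBank nucleotide accession shapes: 1 letter + 5 digits,
    # 2 letters + 6 digits, or 2 letters + 8 digits.
    ("genbank", re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})$")),
    ("strain", re.compile(r"^A/.+")),
]


def guess_id_type(identifier):
    """Return a guessed id type, or 'unknown' if no pattern matches."""
    identifier = identifier.strip()
    for name, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


for example in ["EPI653195", "CY246223", "A/swine/R123456/Iowa/2020"]:
    print(example, guess_id_type(example))
```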
Once your ids are tagged, you can retrieve them with the sub-subcommands:
isolate, strain, segment, or sequence. Each of these commands pulls the
tagged data at a different level.
The tags may also be used in SPARQL queries.
Here is a longer example. Suppose you have a mysterious list of 10 sequence ids (GenBank and epiflu) in a file named mystery.txt:
EPI653195
CY246223
JX983103
MF801489
JN652492
CY099224
MF145504
HQ896660
KM503168
JF316643
First you want to clear any previous tags:
octofludb fetch clear
Then tag your mystery ids:
octofludb fetch tag mystery.txt
Now you can view default sections of your data. Try all of the following:
octofludb fetch segment
octofludb fetch strain
octofludb fetch isolate
octofludb fetch sequence
You may also specify the tagged terms in a SPARQL query:
PREFIX f: <https://flu-crew.org/term/>
SELECT ?strain ?host ?genbank_id ?country_code ?date ?seq
WHERE {
?sid f:strain_name ?strain .
?sid f:host ?host .
?sid f:date ?date .
?sid f:country ?country .
?country f:code ?country_code .
?sid f:has_segment ?gid .
?gid f:segment_name "HA" .
?gid f:dnaseq ?seq .
# limit the query to tagged segments
?tag_id f:query_tag ?tag .
?gid f:genbank_id ?tag .
}
Here is a (very incomplete) list of strange strain names I have to deal with:
- A/USA/LAN_(P10)_NA/2018 - parentheses
- A/R(duck/Hokkaido/9/99-tern/South Africa/1961) - worse parentheses
- A/swine/Denmark/18-13002-73_PB1/2018 - includes segment name (there are lots of these)
- A/swine/Cotes d'Armor/110466/2010 - single quotes