Phylografter user interviews

Sep 12, 2013 -- Jim Allman [email protected]

This is a summary of interviews with five of the principal users of phylografter (listed with their respective taxonomic areas):

Romina Gazis [email protected], fungal studies
Bryan Drew [email protected], plants
Chris Owen [email protected], animals
Jessica Grant [email protected], microbes
Laura Katz [email protected], microbes

These notes will be organized broadly by topic. Detailed interview notes are available upon request.

Interviews were conducted via Skype, and focused on

best practices and lessons learned;
ideas for needed features; and
recommended roles and workflow for a larger community

We also explored some of the "pain points" reported in earlier curation feedback and comments from the Nov 2012 Curation Sprint.

The term "phylografter" will be used here to refer to both the existing phylografter tool and possible replacements with similar functionality.

BEST PRACTICES & LESSONS LEARNED

Experienced phylografter users can work quickly, but the learning curve is very difficult for new users. Key issues include:

general confusion about tasks and outcomes
some input fields are confusing (Where will this information appear? and to whom?)
prescribed sequence of operations is not obvious

Some users recommended GenBank as an example of a better first-time experience.

Experienced users consider study submission to be a one-time task, as opposed to one that needs repeated sessions. But most felt that a first-time or occasional user might need to return to the tool more than once, as they gather required data or groom trees for re-submission. In addition, OTU mapping for large studies will likely require multiple sessions.

One use case that involves multiple sessions is when a curator is handling a submission, but they're asking for data from the study's author. This typically takes a few rounds, during which the study will be in an incomplete state. (This would be reflected in the quality / fitness indicator discussed elsewhere.)

There's been a large accumulation of bad study data (tests, experiments, failed submissions, etc.). We should avoid this by clearly distinguishing test data and allowing sensible deletion of unwanted studies and trees. Perhaps this is handled under the topic of privacy / ownership below.

INTERNAL COMMUNICATION (AMONG CURATORS AND THE SYNTHESIS TEAM)

We've identified some need for communication among the community of curators and those managing synthesis. To date, this has been handled via email and informally using phylografter's tagging facility, eg, to mark some trees for deletion, or as recommended for synthesis.

Ideally we would provide some form of shared institutional memory for

submissions in progress
pending data requests from authors (incl. reminders?)
pending changes to taxonomy (required for OTU mapping)
noteworthy judgment calls or best guesses in each study
who's worked in each study (and when, and what they did)
who else is working in my (family-level?) areas of interest

Some of this information should appear in the status app. But we should also consider adding internal notes to each study -- at minimum, a text log with timestamped entries -- or using an external forum (or ticket management system?) to capture this information.

Of course, curatorial notes could also be stored as Nexson metadata, in which case we should accommodate attaching curators' notes to any study, file, tree or node. In this case, we should discuss whether these notes are static and isolated, or whether they would allow discussion threads. [ADD THESE IDEAS TO ANNOTATION / CONTROLLED VOCABULARY DOCUMENT]

Notification features would also be welcome, to let me know what's happening in my areas of interest (families, clades, even particular studies). This could be implemented as email notification, or an activity feed on the OpenTree site, or RSS. Apparently TreeBASE has a "Likes" feature we should study.

NEEDED FEATURES

All agreed that first-time users would benefit from a more intuitive (self-descriptive) user interface.

Some mentioned tutorials as a good solution:

Easy/minimal tutorial is most important
More advanced example, or just a link to help pages?
Mouse-over hints throughout the UI, esp. to explain where data will appear

In particular, users suggested we review the BEAST tutorials as an effective model.

There's also a user-generated guide for contributors that we might formalize or share more widely.

All users were interested in our notion of an ever-present "fitness" or quality indicator for the current study. This should mirror the one in the status app, and would ideally be regularly updated within the curation UI.

Users expressed a need for easy embargo features for pre-publication data, and a simple publication trigger. (See PRIVACY & OWNERSHIP OF DATA below.)

As often happens, users have hinted at useful features in the way that they've used free-form tags on studies. Common practices so far suggest a few tools that might be formalized in the UI:

marking a tree or study for deletion
marking a tree as "synthesis-worthy"
indicating that a tree's OTUs are fully mapped (this is probably handled by our quality indicator, described elsewhere)
indicating what genes a particular study is based on

One user suggested that TreeBASE import should incorporate gene trees "to show multiple resolutions of the focal group."

Users are definitely using the search-for-clade and search-for-species tools. One request is to add search-for-locus, using barcode or ITS markers. (Could these be indexed from trees or other data files? Or would this require additional input from the submitter?)

The choice of tree types can probably be reduced to a friendly list of perhaps a dozen choices, eg, neighbor-joining tree or chronogram. If we think other types are possible, we could add an option for a custom value.

ROLES & PERMISSIONS

Users have conflicting feelings about whether studies and trees should be "owned" by the original submitter. While everyone recognizes that some corrections might be welcome, as well as help with OTU mapping, they're concerned about possible damage in the event that:

tree structure is changed (this is probably off-limits)
taxa are mapped incorrectly by an unqualified curator
a rival might vandalize data, perhaps in subtle ways

A possible solution for OTU mapping is to add an automated integrity check, such that we flag conflicts within a tree if they exceed a preset threshold. This should be a reliable indicator that some taxa have been poorly mapped.
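One way such an integrity check might work is sketched below in Python. Everything specific here is an assumption made for illustration, not something from the interviews or from phylografter itself: the tree is written as nested tuples of tip labels, a hypothetical `tip_genus` dictionary holds the genus each tip was mapped to, a "conflict" is taken to mean a genus whose mapped tips do not form a single clade, and the default threshold is arbitrary.

```python
# Hypothetical sketch: flag a tree whose OTU mapping yields too many
# non-monophyletic genera. Tree format, names and threshold are assumptions.

def tips(node):
    """Return the set of tip labels under a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= tips(child)
    return result


def clades(node):
    """Yield the tip set of every node in the tree, including tips and root."""
    yield tips(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def conflicting_genera(tree, tip_genus):
    """Return the genera whose mapped tips do not form exactly one clade."""
    mapped = set(tip_genus)
    # Ignore unmapped tips so they cannot cause a conflict on their own.
    clade_sets = {frozenset(c & mapped) for c in clades(tree)}
    by_genus = {}
    for tip, genus in tip_genus.items():
        by_genus.setdefault(genus, set()).add(tip)
    return sorted(genus for genus, members in by_genus.items()
                  if frozenset(members) not in clade_sets)


def mapping_is_suspect(tree, tip_genus, threshold=0.25):
    """True if the share of conflicting genera exceeds the preset threshold."""
    genera = set(tip_genus.values())
    if not genera:
        return False
    share = len(conflicting_genera(tree, tip_genus)) / len(genera)
    return share > threshold


# Made-up example: genus B is split across two distant parts of the tree.
example_tree = ((("a1", "a2"), "b1"), ("b2", "c1"))
example_mapping = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
```

A real check would also have to allow for genuine non-monophyly, so a flag like this should prompt a second look at the mapping rather than reject it outright.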

While volunteer effort is appreciated, the consensus is that the most qualified curator/editor for any study is an author, and the best time to submit data is while the study is fresh in their minds.

PRIVACY & OWNERSHIP OF DATA

Users pointed out situations in which study data should be private (pre-publication, testing, exploration, unfinished trees). This is clearly something we should provide. It might be desirable to restrict unfinished trees within a public study, but it's probably just as easy to validate the tree prior to accepting it into synthesis.

There should also be a clear indication of the private/public status of any study, so that an editor knows the visibility and consequences of what they're doing.

Users have different views of "ownership" of study data, with some wanting the submitter (owner) to have exclusive control of their studies.

Others are comfortable with the idea of collaborative curation, but usually only by qualified curators. This raises questions of vetting qualifications and designating curators, possibly within certain families.

There was some interest in the idea of "forking" a study, but only if we can clearly indicate who's doing what. Even this is seen as problematic, if the changes are beyond obvious cleanup of typographic mistakes, etc. Most users agreed that tree structure should be "hands-off" and that this is the heart of a phylografter submission.

Regarding pre-published data, one user felt that we could probably make this study data visible, pointing to GenBank's policy on early release.

She pointed out that all these issues hinge on one over-arching question for submitters: What's your motive? To share, or to publish?

PAIN POINTS

Delays and latency in phylografter are very irritating. These are most pronounced during OTU mapping, but there are occasional serious delays (or server timeouts) during general use.

A major frustration (esp. in microbial studies) is the need to pre-approve new taxon names in order to map trees. This creates a slow and complex bottleneck in the submission process. One suggested fix is to provide tentative mapping, ie, "pending" submissions whose OTU mapping will periodically be tested against the latest taxonomies. It would also be great to incorporate new-taxon requests inline in phylografter.
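The "pending" idea could be sketched roughly as follows, in Python. The function name, the data shapes (a list of unmatched labels and a name-to-ID dictionary for the latest taxonomy) and the underscore normalization are all assumptions for illustration; the interviews did not specify how re-testing should work.

```python
# Hypothetical sketch of "pending" OTU mappings that are retried against
# a newer taxonomy release. All names and data shapes are assumptions.

def retest_pending(pending_labels, taxonomy):
    """Split pending labels into newly mapped ones and those still pending.

    pending_labels: OTU labels that had no match when first submitted.
    taxonomy: dict of taxon name -> taxon ID from the latest release.
    """
    newly_mapped = {}
    still_pending = []
    for label in pending_labels:
        # Treat underscores as spaces, as is common in tree files.
        key = label.strip().replace("_", " ")
        if key in taxonomy:
            newly_mapped[label] = taxonomy[key]
        else:
            still_pending.append(label)
    return newly_mapped, still_pending


# Made-up placeholder names and IDs, not real taxa.
pending = ["Genus_alpha", "Genus beta"]
latest_taxonomy = {"Genus alpha": 101}
mapped, waiting = retest_pending(pending, latest_taxonomy)
```

Run periodically, a job like this would let a submission proceed with unmapped names and pick up the matches once the taxonomy catches up.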

Unrooted trees sometimes require a lot of extra grooming. They also present a dilemma, particularly in microbe studies where any root would be an arbitrary choice. These users suggested that it would be preferable to submit (legitimately) unrooted trees for synthesis as-is, so that other sources can suggest the proper orientation/placement of the tree.

When importing data from TreeBASE, phylografter currently imposes a strict order of operations. Any deviation from this order means starting over or repeating lots of steps manually.

Branch lengths are problematic in a couple of ways. It's not always clear whether/how their meaning will be used, so users will sometimes omit this information.

Also, trees that are imported from TreeBASE sometimes have important information (branch lengths, units, bootstrap info) stripped due to TreeBASE's submission requirements. In this case, it's not clear whether a phylografter user should describe the meaningful branch lengths in the original tree, or the version imported from TreeBASE.

Some users see in-group selection as problematic in itself, because of its wider implications.

Some users could really use bulk uploading of tree files, eg, a microbe study might have thousands of trees.

While the tagging tools are generally used, the current UI for this is tricky, particularly the requirement to press the Enter key to accept each tag. This has resulted in lots of lost work when tags have been typed into this field, but not saved along with the rest of the study data.

Generally speaking, there's a concern that phylografter is asking for too much data. We should carefully evaluate everything we're asking for:

Is this truly useful and necessary? (eg, nucleotide data)
Can it lead to a "wild goose chase"? (eg, chasing down the email address for a minor contributor)

Some users have asked for additional kinds of submission in phylografter:

molecular phylogeny
morphological characteristics
taxonomy
other?

USING OTHER TOOLS

Some curators have used third-party tools to groom tree data before submission (or re-submission). For general cleanup, Mesquite is a recommended tool. Search-and-replace features (eg, in a text editor) are sometimes used to convert lab-specific taxon names to more standard forms, for easier OTU mapping.

Note that there's a sort of convenience threshold for doing this; it's not worth it for just a few unmapped taxa.

A preferred solution might be to incorporate some of these taxon-name transformations within the web UI, perhaps in a way that allows quick round-trip testing to see if OTU mapping is improved.
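As a rough illustration of that round trip, the Python sketch below applies a list of search-and-replace rules to tip labels and reports how many labels match a known taxon name before and after. The rule format, the function names and the exact-match test are assumptions; real OTU mapping would be more forgiving than a plain set lookup.

```python
import re

# Hypothetical sketch: test whether name-transformation rules improve
# OTU mapping. Rule format and exact-match lookup are assumptions.

def apply_rules(label, rules):
    """Apply (pattern, replacement) regex rules to a label, in order."""
    for pattern, replacement in rules:
        label = re.sub(pattern, replacement, label)
    return label.strip()


def mapping_improvement(labels, rules, known_names):
    """Return (mapped_before, mapped_after) counts for a set of rules."""
    before = sum(1 for label in labels if label in known_names)
    after = sum(1 for label in labels
                if apply_rules(label, rules) in known_names)
    return before, after


# Made-up labels with a lab-specific prefix; names are placeholders.
labels = ["LAB01_Genus_alpha", "Genus beta", "LAB02_Genus_gamma"]
rules = [(r"^LAB\d+_", ""), (r"_", " ")]
known_names = {"Genus alpha", "Genus beta"}
counts = mapping_improvement(labels, rules, known_names)
```

Showing the two counts side by side in the UI would let a curator try a rule, see whether it helps, and discard it if it does not.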

One curator (Chris Owen) has used external tools for tracking his work in phylografter. His spreadsheet suggests the kind of information we might show in a curator's personal "dashboard", for each study:

study number
focal group
first author (or first two authors)
date uploaded
rooted? or in-group designation
how many taxa (and how many mapped)
was double-checked?

RELATED INSIGHTS

There was general agreement that a status app (currently under development) would be very useful. This should make it easy to list all studies in the system, see the relative status / quality of each, and help someone to see if a study is already in the works (to avoid duplication of effort). It's not clear whether the status app should also show private studies.

Users generally hate repetition, and are interested in any tool that prevents it. One request was for automatic import whenever data is submitted to TreeBASE (or even Dryad), perhaps enabled by an opt-in (or opt-out?) checkbox in those sites.

Regarding contributions from authors, clearly we need additional incentives here. The most obvious would be for submission to become a required step for publication in major journals. Another suggestion is to show an increase in citations for studies in phylografter.

While our current users have been populating the system with many studies, most new contributions will be from one-time or occasional users, often grads or undergrads. We should design with them in mind, even if it means extra steps for an expert user.

COMMUNITY OF PRACTICE

We should consider helping new curators to learn from those with more experience, and helping everyone to benefit from lessons learned and best practices. Topics might include:

useful names for trees and studies
the intent and visibility of tags
best use of comments
how to handle common problems

More widely, we might promote best practices for preparing trees for easier import. For example, naming (or somehow marking) terminal taxa with a GenBank accession number or other ID should make OTU mapping easy.
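To show why such IDs would help, here is a small Python sketch that pulls an accession-like token out of a tip label. The function name is hypothetical, and the pattern covers only two classic nucleotide accession shapes (one letter plus five digits, or two letters plus six digits, with an optional version suffix); other accession formats exist and are not handled here.

```python
import re

# Hypothetical sketch: find a GenBank-style accession inside a tip label.
# Only two classic nucleotide shapes are covered; this is an assumption,
# not a complete description of accession formats.
ACCESSION = re.compile(
    r"(?<![A-Za-z0-9])"                           # not inside a longer token
    r"((?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?)"   # accession, optional version
    r"(?![A-Za-z0-9])"
)


def extract_accession(label):
    """Return the first accession-like token in a label, or None."""
    match = ACCESSION.search(label)
    return match.group(1) if match else None


# Illustrative strings only, not references to real records.
with_id = extract_accession("Genus_alpha_AB123456")
versioned = extract_accession("U12345.1_Genus_beta")
without_id = extract_accession("Genus_gamma")
```

With an ID recovered this way, mapping could in principle start from the sequence record instead of from a free-text name.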

This might also involve rewarding good work with increased visibility in the system, perhaps using a "leader board" or highlighting exemplary curation efforts on the OpenTree site. This might be a step toward rewarding the real work being done here and encouraging more. Suggested metrics include:

mapped taxa
resolved nodes
general contributions / activity
