Cproject Structure Query #65

Open
danmaclean opened this issue Oct 12, 2017 · 2 comments

Comments

@danmaclean

Hi,

This document https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/cproject appears to be the definitive description of the CProject structure, but it is at odds with this document about the output of ami: https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/ami/README.md#ami2-species. In the CProject definition, a sequence results directory, for example, looks much simpler than the results described in the tutorial.

CProject folder structure:

│   ├── results
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml

ami output tutorial

│   ├── results
│   │   ├── sequence
│   │   │   ├── rna
│   │   │   │   └── empty.xml
│   │   │   ├── dna
│   │   │   │   └── empty.xml
│   │   │   └── prot
│   │   │       └── empty.xml

I'm trying to write a parser for CProjects. Could you let me know whether the ami tools are going to produce many directories (e.g. ami2seq generating sequence/sequencetype folders) or, as the CProject document suggests, just the sequence/dnaprimer folder? Or is the information in one of these docs out of date?

Thanks for clarification.

@ghost

ghost commented Oct 12, 2017 via email

@petermr
Member

petermr commented Oct 14, 2017

Dan,
First, many thanks for working on ContentMine. I'm happy to talk more about your requirements and interests.

The CM data structure is intentionally somewhat fluid because we are reacting to the very wide range of structures and information that people use in scientific communication. The philosophy is perhaps similar to JSON and other lightly typed structures rather than the rigidity of XML schemas and DTDs.
In the case you give, the names rna, dnaprimer, etc. are determined by the dictionaries or query types used in the query. The first query will have been for any of sequence(rna, dna, prot), while the second was for sequence(dnaprimer). This means that the names of the directories depend on the query: most are optional and may be set by the user's choice of dictionaries. If I use the dictionaries foo.xml and bar.xml, then the output will be of the form:

│   ├── results
│   │   ├── dict
│   │   │   ├── foo
│   │   │   │   └── empty.xml
│   │   │   └── bar
│   │   │       └── empty.xml
...

This means that a parser will have fewer hard coded names and more that are determined at runtime.
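As a rough illustration of what "determined at runtime" could mean for a parser, here is a minimal Python sketch that discovers the directory names instead of hardcoding them. It assumes the layout `results/<plugin>/<option>/*.xml` shown in the trees in this thread; the function name `discover_results` and the returned structure are hypothetical and not part of any ContentMine tool.

```python
from pathlib import Path


def discover_results(ctree_dir):
    """Map plugin name -> option name -> sorted list of XML file names.

    Assumes the layout <ctree_dir>/results/<plugin>/<option>/*.xml seen in
    the trees in this thread; nothing here is taken from the ami source.
    """
    results_dir = Path(ctree_dir) / "results"
    found = {}
    if not results_dir.is_dir():
        # No results directory yet: nothing to report.
        return found
    for plugin_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        options = {}
        for option_dir in sorted(p for p in plugin_dir.iterdir() if p.is_dir()):
            # Option names (rna, dnaprimer, foo, ...) are read from disk,
            # never hardcoded.
            options[option_dir.name] = sorted(
                f.name for f in option_dir.glob("*.xml")
            )
        found[plugin_dir.name] = options
    return found
```

With this approach only the name `results` is fixed in the parser; names such as `sequence`, `dict`, `rna` or `foo` are whatever the query happened to produce.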

I think that JSON is a good analogy here (and indeed the output could be transformed into JSON). It makes parsing more challenging than with hardcoded names, and means that tools such as XPath and JSONPath are often useful.
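To make the JSON and XPath points concrete, here is a small Python sketch using only the standard library. The sample XML, including the `results` and `result` element names and the `match` attribute, is a hypothetical stand-in; the real files may use different elements and attributes. The helper `element_to_dict` is also just an illustration, not an existing ContentMine function.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element to a plain dict without assuming any tag names."""
    return {
        "tag": element.tag,
        "attributes": dict(element.attrib),
        "text": (element.text or "").strip(),
        "children": [element_to_dict(child) for child in element],
    }


# Hypothetical content standing in for a results file.
sample = '<results title="foo"><result match="abc"/></results>'
root = ET.fromstring(sample)

# Generic XML -> JSON transformation; no element names are hardcoded.
as_json = json.dumps(element_to_dict(root), indent=2)

# ElementTree supports a limited XPath subset. ".//*[@match]" selects any
# descendant that carries a "match" attribute, whatever its tag is.
matches = [el.get("match") for el in root.findall(".//*[@match]")]
```

Querying by attribute rather than by tag name is one way to keep the parser tolerant of names that vary with the dictionaries in use.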

(The info is probably also out of date in places. Sorry! That is often the case with evolving projects.)
