
Commit

Merge pull request #76 from BenKaehler/master
added a conda recipe
wasade authored Mar 28, 2019
2 parents e599d9b + a92cdd4 commit 5f7c6d7
Showing 12 changed files with 62,172 additions and 29 deletions.
6 changes: 6 additions & 0 deletions MANIFEST.in
@@ -1,6 +1,12 @@
include README.md
include COPYING.txt
include logo.png
include redbiom/assets/nltk_data/corpora/stopwords/README
include redbiom/assets/nltk_data/corpora/stopwords/english
include redbiom/assets/nltk_data/tokenizers/punkt/PY3/english.pickle
include redbiom/assets/nltk_data/tokenizers/punkt/PY3/README
include redbiom/assets/nltk_data/tokenizers/punkt/english.pickle
include redbiom/assets/nltk_data/tokenizers/punkt/README

graft redbiom
graft licenses
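
The include directives above bundle NLTK's stopword and Punkt data directly into the package, so redbiom can run without a separate nltk.download() step. How redbiom actually wires these files into NLTK is not shown in this diff; the sketch below illustrates the standard approach of prepending the bundled directory to nltk.data.path. The path layout mirrors the manifest entries, but the loading code itself is an assumption, not redbiom's.

import os

import nltk.data
import redbiom

# Assumed layout: the nltk_data directory added to MANIFEST.in above,
# resolved relative to the installed redbiom package.
bundled = os.path.join(os.path.dirname(redbiom.__file__), 'assets', 'nltk_data')

# Prepend so NLTK lookups (stopwords, punkt) resolve against the
# bundled copies instead of a user-level download.
if bundled not in nltk.data.path:
    nltk.data.path.insert(0, bundled)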
55 changes: 55 additions & 0 deletions conda/recipe/meta.yaml
@@ -0,0 +1,55 @@
{% set data = load_setup_py_data() %}
{% set version = data.get('version') or 'placehold' %}

package:
  name: redbiom
  version: "{{ version }}"

source:
  path: ../..

build:
  script: python setup.py install
  noarch: generic

requirements:
  host:
    - cython
    - biom-format >=2.1.5
    - click >=6.7
    - h5py
    - joblib
    - nltk
    - pandas
    - singledispatch
    - pip
    - python
    - requests
    - scikit-bio >=0.4.2
    - setuptools
  run:
    - cython
    - biom-format >=2.1.5
    - click >=6.7
    - h5py
    - joblib
    - nltk
    - pandas
    - singledispatch
    - python
    - requests
    - scikit-bio >=0.4.2
    - setuptools

test:
  imports:
    - redbiom
    - redbiom.commands
    - redbiom.tests
  commands:
    - redbiom --help

about:
  home: https://github.com/biocore/redbiom
  license: BSD-3-Clause
  license_family: BSD
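
To produce a package from this recipe, conda-build renders meta.yaml (resolving load_setup_py_data() against the setup.py two directories up, per the source path) and then runs the install script. A minimal sketch using conda-build's Python API, assuming conda-build is installed; the CLI equivalent is `conda build conda/recipe`:

from conda_build import api

# Render meta.yaml and build the noarch package; returns the paths
# of the built artifacts.
artifacts = api.build('conda/recipe')
print(artifacts)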
16 changes: 16 additions & 0 deletions redbiom/assets/nltk_data/corpora/stopwords/README
@@ -0,0 +1,16 @@
Stopwords Corpus

This corpus contains lists of stop words for English. These
are high-frequency grammatical words which are usually ignored in text
retrieval applications. Other languages are available from the link below.

They were obtained from:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip

They were derived from:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/

The English list has been augmented:
https://github.com/nltk/nltk_data/issues/22

Last updated 20 March 2019.
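
As a quick illustration of how such a list is used in text retrieval, the sketch below filters high-frequency grammatical words out of a token stream. The file path matches the asset added in this commit; the token list is invented for the example.

# Load the bundled English stopword list (one word per line).
with open('redbiom/assets/nltk_data/corpora/stopwords/english') as fh:
    stopwords = set(line.strip() for line in fh)

tokens = ['soil', 'samples', 'from', 'the', 'gut', 'of', 'mice']
print([t for t in tokens if t.lower() not in stopwords])
# -> ['soil', 'samples', 'gut', 'mice']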
179 changes: 179 additions & 0 deletions redbiom/assets/nltk_data/corpora/stopwords/english
@@ -0,0 +1,179 @@
i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
needn't
shan
shan't
shouldn
shouldn't
wasn
wasn't
weren
weren't
won
won't
wouldn
wouldn't
98 changes: 98 additions & 0 deletions redbiom/assets/nltk_data/tokenizers/punkt/PY3/README
@@ -0,0 +1,98 @@
Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)

Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
been contributed by various people using NLTK for sentence boundary detection.

For information about how to use these models, please consult the tokenization HOWTO:
http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
and chapter 3.8 of the NLTK book:
http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation

There are pretrained tokenizers for the following languages:

File Language Source Contents Size of training corpus (in tokens) Model contributed by
=======================================================================================================================================================================
czech.pickle Czech Multilingual Corpus 1 (ECI) Lidove Noviny ~345,000 Jan Strunk / Tibor Kiss
Literarni Noviny
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
danish.pickle Danish Avisdata CD-Rom Ver. 1.1. 1995 Berlingske Tidende ~550,000 Jan Strunk / Tibor Kiss
(Berlingske Avisdata, Copenhagen) Weekend Avisen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dutch.pickle Dutch Multilingual Corpus 1 (ECI) De Limburger ~340,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
english.pickle English Penn Treebank (LDC) Wall Street Journal ~469,000 Jan Strunk / Tibor Kiss
(American)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
estonian.pickle Estonian University of Tartu, Estonia Eesti Ekspress ~359,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
finnish.pickle Finnish Finnish Parole Corpus, Finnish Books and major national ~364,000 Jan Strunk / Tibor Kiss
Text Bank (Suomen Kielen newspapers
Tekstipankki)
Finnish Center for IT Science
(CSC)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
french.pickle French Multilingual Corpus 1 (ECI) Le Monde ~370,000 Jan Strunk / Tibor Kiss
(European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
german.pickle German Neue Zürcher Zeitung AG Neue Zürcher Zeitung ~847,000 Jan Strunk / Tibor Kiss
(Switzerland) CD-ROM
(Uses "ss"
instead of "ß")
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
greek.pickle Greek Efstathios Stamatatos To Vima (TO BHMA) ~227,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
italian.pickle Italian Multilingual Corpus 1 (ECI) La Stampa, Il Mattino ~312,000 Jan Strunk / Tibor Kiss
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
norwegian.pickle Norwegian Centre for Humanities Bergens Tidende ~479,000 Jan Strunk / Tibor Kiss
(Bokmål and Information Technologies,
Nynorsk) Bergen
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
polish.pickle Polish Polish National Corpus Literature, newspapers, etc. ~1,000,000 Krzysztof Langner
(http://www.nkjp.pl/)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
portuguese.pickle Portuguese CETENFolha Corpus Folha de São Paulo ~321,000 Jan Strunk / Tibor Kiss
(Brazilian) (Linguateca)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
slovene.pickle Slovene TRACTOR Delo ~354,000 Jan Strunk / Tibor Kiss
Slovene Academy for Arts
and Sciences
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
spanish.pickle Spanish Multilingual Corpus 1 (ECI) Sur ~353,000 Jan Strunk / Tibor Kiss
(European)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
swedish.pickle Swedish Multilingual Corpus 1 (ECI) Dagens Nyheter ~339,000 Jan Strunk / Tibor Kiss
(and some other texts)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
turkish.pickle Turkish METU Turkish Corpus Milliyet ~333,000 Jan Strunk / Tibor Kiss
(Türkçe Derlem Projesi)
University of Ankara
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
Unicode using the codecs module.

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

---- Training Code ----

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in the training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain", "r", encoding="iso-8859-2").read()

# Train the tokenizer on the raw text
tokenizer.train(text)

# Dump the pickled tokenizer
import pickle
out = open("slovene.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()

---------
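
The english.pickle files referenced above are the binary artifacts in this commit (rendered below as "Binary file not shown"). A minimal sketch of the flip side of the training code: loading a pretrained Punkt model and using it for sentence splitting. The path matches the MANIFEST.in entry; the sample text and the indicated output are illustrative.

import pickle

with open('redbiom/assets/nltk_data/tokenizers/punkt/PY3/english.pickle', 'rb') as fh:
    tokenizer = pickle.load(fh)

text = 'The samples arrived from the U.S. on Monday. Processing began the next day.'
print(tokenizer.tokenize(text))
# e.g. ['The samples arrived from the U.S. on Monday.', 'Processing began the next day.']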
Binary file not shown.
