scrapinghub · Kebniss · May 5, 2014 · May 12, 2014
diff --git a/.gitignore b/.gitignore
@@ -24,7 +24,10 @@ pip-log.txt
 # Unit test / coverage reports
 .coverage
 .tox
+cover
 nosetests.xml
+.cache
+htmlcov/
 
 # Translations
 *.mo
@@ -35,5 +38,19 @@ nosetests.xml
 .pydevproject
 
 # Other
+.idea
 webstruct_data/datastore
-
+.ipynb_checkpoints
+docs/_build
+webstruct_data/todo
+notebooks/old
+notebooks/*.zip
+notebooks/*.html
+notebooks/*.ipynb
+notebooks/*.marisa
+notebooks/*.wapiti
+notebooks/*.crfsuite
+webstruct_data/corpus/us_contact_pages/cleaned
+example/_data/*
+example/*.joblib
+example/*.html
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,40 @@
+language: python
+python: 3.5
+sudo: false
+
+branches:
+    only:
+        - master
+        - /^\d\.\d+$/
+
+matrix:
+  include:
+    - python: 2.7
+      env: TOXENV=py27
+    - python: 3.4
+      env: TOXENV=py34
+    - python: 3.5
+      env: TOXENV=py35
+    - python: 3.6
+      env: TOXENV=py36
+
+addons:
+    apt:
+        packages:
+            - python-numpy
+            - python-scipy
+            - libatlas-base-dev
+            - liblapack-dev
+            - gfortran
+
+install:
+    - pip install -U pip tox codecov
+
+script: tox
+
+after_success:
+    - codecov
+
+cache:
+    directories:
+        - $HOME/.cache/pip
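The `branches.only` safelist above builds `master` plus any branch whose name matches `/^\d\.\d+$/`. As a rough illustration of which names that pattern accepts, here is a minimal sketch using Python's `re` module. Travis evaluates the pattern itself, so Python is only an approximation here, and `is_built` is a hypothetical helper name, not part of the repository.

```python
import re

# Pattern copied from the `branches.only` entry in .travis.yml.
# Assumption: Python's `re` behaves the same as Travis for plain branch names.
RELEASE_BRANCH = re.compile(r"^\d\.\d+$")


def is_built(branch):
    """Return True if the branch name passes the safelist (master or X.Y)."""
    return branch == "master" or RELEASE_BRANCH.search(branch) is not None


for name in ["master", "0.6", "0.10", "10.0", "0.4.1", "develop"]:
    print(name, is_built(name))
```

Note that the pattern allows a single digit before the dot, so a branch named `10.0` or a three-part name such as `0.4.1` would not be built under this reading.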
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -0,0 +1,59 @@
+Changes
+=======
+
+0.6 (2017-12-29)
+----------------
+
+* A complete example (contact extractor) is added to the repo;
+* fixed a lot of issues in the annotated data;
+* fixed loading of ``<title>`` annotations;
+* all annotated data is converted from GATE to WebAnnotator format;
+* text tokenizers can optionally return original token positions;
+* converting text from tokenized to raw is now lossless;
+* ``webstruct.webannotator.to_webannotator`` is rewritten;
+* ``<script>``, ``<style>`` elements, HTML comments and processing
+  instructions are ignored when they are inside entities;
+* tutorial is rewritten for CRFSuite;
+* Wapiti support is fixed in Python 3;
+* top-N parsing support when using Wapiti; an option to merge top N chains,
+  to increase recall;
+* benchmarking script;
+* don't declare Python 3.3 support (it is EOL).
+
+0.5 (2017-05-10)
+----------------
+
+* webstruct.model.NER now uses the ``requests`` library to make HTTP requests;
+* changed default headers used by webstruct.model.NER;
+* new ``webstruct.infer_domain`` module useful for proper cross-validation;
+* webstruct.webannotator.to_webannotator got an option to add ``<base>``
+  tag with the original URL to the page;
+* fixed a warning in webstruct.gazetteers.geonames.read_geonames;
+* added a few more country names to the countries.txt list.
+
+0.4.1 (2016-11-28)
+------------------
+
+* fixed a bug in NER.extract().
+
+0.4 (2016-11-26)
+----------------
+
+* sklearn-crfsuite_ is used as a CRFsuite wrapper, CRFsuiteCRF class
+  is removed;
+* comments are preserved in HTML trees because recent Firefox versions put
+  ``<base>`` tags in a comment when saving pages, and this affects
+  WebAnnotator;
+* fixed 'dont_penalize' argument of webstruct.NER.extract_groups_from_url;
+* new webstruct.model.extract_entity_groups utility function;
+* HtmlTokenizer and HtmlToken are moved to their own module
+  (webstruct.html_tokenizer);
+* test improvements.
+
+.. _sklearn-crfsuite: https://github.com/TeamHG-Memex/sklearn-crfsuite
+
+0.3 (2016-09-19)
+----------------
+
+There are many changes from the previous version: the API is changed,
+Python 3 is supported, gazetteer support is better, CRFsuite is supported, etc.
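The 0.6 entries above mention tokenizers that can return original token positions and a lossless conversion from tokenized back to raw text. The sketch below shows one way those two ideas fit together. It is not webstruct's tokenizer: `tokenize_with_positions` and `raw_span` are hypothetical names, and whitespace splitting is an assumption made to keep the example short.

```python
import re


def tokenize_with_positions(text):
    """Return (token, start, end) triples for whitespace-separated tokens.

    Hypothetical helper for illustration; not part of webstruct.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def raw_span(text, tokens):
    """Recover the exact raw text covered by a run of (token, start, end)."""
    if not tokens:
        return ""
    # Slicing the original string keeps the original whitespace intact,
    # which joining tokens with single spaces would not.
    return text[tokens[0][1]:tokens[-1][2]]


text = "Open  Mon-Fri,\t9am - 5pm"
tokens = tokenize_with_positions(text)
print(tokens)
print(repr(raw_span(text, tokens[1:3])))
```

Keeping the offsets is what makes the round trip lossless: every token can be mapped back to `text[start:end]`, and any run of tokens maps back to the exact original substring.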
diff --git a/README.rst b/README.rst
@@ -0,0 +1,45 @@
+Webstruct
+=========
+
+.. image:: https://img.shields.io/pypi/v/webstruct.svg
+   :target: https://pypi.python.org/pypi/webstruct
+   :alt: PyPI Version
+
+.. image:: https://travis-ci.org/scrapinghub/webstruct.svg?branch=master
+   :target: https://travis-ci.org/scrapinghub/webstruct
+   :alt: Build Status
+
+.. image:: https://codecov.io/gh/scrapinghub/webstruct/branch/master/graph/badge.svg
+   :target: https://codecov.io/gh/scrapinghub/webstruct
+   :alt: Code Coverage
+
+.. image:: https://readthedocs.org/projects/webstruct/badge/?version=latest
+   :target: http://webstruct.readthedocs.io/en/latest/
+   :alt: Documentation
+
+
+Webstruct is a library for creating statistical NER_ systems that work
+on HTML data, i.e. a library for building tools that extract named
+entities (addresses, organization names, opening hours, etc.) from webpages.
+
+Unlike most NER systems, webstruct works on HTML data, not only
+on text data. This makes it possible to define features that use HTML
+structure, and also to embed annotation results back into HTML.
+
+Read the docs_ for more info.
+
+License is MIT.
+
+.. _docs: http://webstruct.readthedocs.io/en/latest/
+.. _NER: http://en.wikipedia.org/wiki/Named-entity_recognition
+
+Contributing
+------------
+
+* Source code: https://github.com/scrapinghub/webstruct
+* Bug tracker: https://github.com/scrapinghub/webstruct/issues
+
+To run tests, make sure tox_ is installed, then run
+``tox`` from the source root.
+
+.. _tox: https://tox.readthedocs.io/en/latest/
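The README says that working on HTML rather than plain text makes it possible to define features that use HTML structure. To make that concrete, here is a minimal sketch of per-token features derived from the enclosing tags. It uses only the standard library's `html.parser` and does not use webstruct's API. `TokenFeatureParser`, `html_token_features`, and the feature names are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TokenFeatureParser(HTMLParser):
    """Collect (token, features) pairs; illustrative only, not webstruct."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (token, feature dict) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.stack[-1] if self.stack else None
        for token in data.split():
            self.tokens.append((token, {
                "parent_tag": parent,
                "inside_link": "a" in self.stack,
                "is_title_case": token.istitle(),
            }))


def html_token_features(html):
    parser = TokenFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = '<div><p>Call <a href="#">Acme Corp</a> today</p></div>'
for token, features in html_token_features(page):
    print(token, features)
```

Features such as `parent_tag` or `inside_link` are not available to a text-only NER system, which is the distinction the README draws.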
diff --git a/block_model/README.md b/block_model/README.md
diff --git a/block_model/convert_html.py b/block_model/convert_html.py
diff --git a/block_model/convert_labeled_data.py b/block_model/convert_labeled_data.py
diff --git a/block_model/data/1.html b/block_model/data/1.html
diff --git a/block_model/data/1.txt b/block_model/data/1.txt