Per entity metric #63

Open
wants to merge 464 commits into
base: master
Changes from 250 commits
98a2a0b
doctests indent
chekunkov May 5, 2014
989072c
fix unicode handling for a new tokenizer; add pounds char to rules
kmike May 12, 2014
177ad80
Merge branch 'speed_up_text_tokenizer' of https://github.com/chekunko…
kmike May 12, 2014
5fe04f6
Merge pull request #16 from scrapinghub/speed_up_text_tokenizer
kmike May 12, 2014
226e53f
small tokenizer cleanup
kmike May 13, 2014
24926c5
make min_length and max_length arguments required for utils.substrings
kmike May 14, 2014
b6d60f1
add crfsuite backend base on python-crfsuite
tpeng Apr 23, 2014
e3ef37a
DOC: fix crfsuite docstring
tpeng Apr 24, 2014
f96cae1
DOC fix style and typo
tpeng Apr 24, 2014
383f8b7
fix HtmlTokenizer pickling
kmike May 15, 2014
0adaaf2
WapitiCRF.fit returns self
kmike May 15, 2014
92553b7
train_test_split_noshuffle
kmike May 15, 2014
55598e0
TST runcoverage script
kmike May 15, 2014
a2111d4
python-crfsuite support; tests for NER and crfsuite pipeline
kmike May 15, 2014
01b0ee6
expose CRFsuiteCRF and CCRFsuiteFeatureEncoder
kmike May 16, 2014
0f248b6
rename wapiti_kwargs to crf_kwargs for consistency
kmike May 16, 2014
441ebf4
move tostr to wapiti module because it is wapiti-specific
kmike May 16, 2014
7d12376
NER.annotate and NER.annotate_url methods
kmike May 16, 2014
85e9407
Abstract temporary model files handling; add this feature to wapiti. …
kmike May 16, 2014
9525c46
A corpus (not annotated yet) with 450 pages from business websites in…
kmike May 19, 2014
38730d8
add EMAIL to dtd in order to load annotated files properly
kmike May 19, 2014
4619e8f
annotation fixes
kmike May 19, 2014
be9a91c
Fix html produced by WebAnnotator.
kmike May 19, 2014
591051d
(backwards incompatible) drop existing `load_trees`; rename `load_tre…
kmike May 20, 2014
5bb3768
make it possible to use existing WebAnnotator colors
kmike May 20, 2014
6cd6265
+100 annotated pages
kmike May 20, 2014
2e746c4
annotation fixes
kmike May 21, 2014
223d8f1
annotation fixes
kmike May 21, 2014
8875d3c
more annotation fixes
kmike May 21, 2014
146ad5e
+100 pages
kmike May 21, 2014
448048e
annotation fixes
kmike May 21, 2014
87279df
BUG fix an issue with WebAnnotatorLoader: it shouldn't add extra "Non…
kmike May 21, 2014
2150bda
fix a test after annotation fix
kmike May 21, 2014
79d81c5
easier Trainer customization for CRFsuiteCRF
kmike May 26, 2014
a98431e
X_dev and y_dev support for webstruct.crfsuite
kmike May 26, 2014
1c47f9e
+100 pages
kmike May 27, 2014
e9ebeaa
doctests (failing) for some tokenization gotchas
kmike May 27, 2014
f80c382
expose LongestMatchGlobalFeature
kmike May 27, 2014
1c17e7c
annotations fix
kmike May 27, 2014
17a5d4e
one more failing tokenization example
kmike May 27, 2014
9d8fcdc
webstruct.gazetteers.geonames.read_geonames_zipped: try to handle geo…
kmike May 28, 2014
ce775e6
DAWG gazetteers support (they are much faster than MARISA-based, but …
kmike May 28, 2014
6ee718f
more annotated data
kmike May 28, 2014
ed40e3e
CRFsuiteFeatureEncoder is not needed with python-crfsuite==0.6
kmike May 28, 2014
b2cb0e7
Undocumented HtmlFeatureExtractor post-processing step is removed to …
kmike May 28, 2014
649c814
bias feature
kmike May 28, 2014
12be72e
tiny speedup for BestMatch._find_matches
kmike May 28, 2014
727f61b
NER.extract_groups_from_url
kmike May 30, 2014
cd1860d
export webstruct.smart_join
kmike May 30, 2014
56cd57e
annotation fixes (more locations for about 70 pages)
kmike May 30, 2014
4019595
DOC suggest to use "Save as" in WebAnnotator
kmike Jul 7, 2014
c0448c9
get rid of seqlearn dependency
tpeng Aug 11, 2014
3dc2024
fix document
tpeng Aug 11, 2014
6e36995
Merge pull request #23 from tpeng/remove-seqlearn-deps
kmike Aug 11, 2014
9cfe657
Update requirements so that they will work automatically
Suor Feb 14, 2015
1c4c378
Set up tox to test py27, py33, py34 and docs
Suor Feb 14, 2015
3060950
Add Travis CI config
Suor Feb 14, 2015
ee25440
Use miniconda to test on Travis CI
Suor Feb 26, 2015
225cc76
Merge pull request #28 from Suor/travis
kmike Feb 26, 2015
c7c79b5
Migrate code to support Python 3
Suor Feb 27, 2015
4de7573
Rename cross module to compat
Suor Feb 27, 2015
b5d19c8
Get rid of bprint()/bformat()
Suor Feb 28, 2015
0e66518
Return to more natural doctest in HtmlTokenizer.tokenize_single()
Suor Feb 28, 2015
d2d3d5c
Set ELLIPSIS and IGNORE_UNICODE as default doctest options
Suor Mar 3, 2015
22c27c4
Add Python 3 version modifiers to setup.py
Suor Mar 3, 2015
86d44e6
Update python version requirements in installation docs
Suor Mar 3, 2015
d7e2fae
Merge pull request #29 from Suor/py3-clean
kmike Mar 3, 2015
b3be38c
add Travis badge to readme
kmike Mar 3, 2015
eba8084
fix requirements.txt: cython is no longer needed; bump python-crfsuit…
kmike Mar 3, 2015
a54dae3
Fix setup.py requires
Suor Apr 14, 2015
d21a83f
fixing typo: toolikit -> toolkit
carlosp420 Jul 18, 2015
8133674
Merge pull request #31 from carlosp420/patch-0
kmike Jul 19, 2015
06be1b4
declare Python 3.5 support
kmike Sep 19, 2016
d8f1d0a
bump version to 0.3
kmike Sep 19, 2016
6d3d109
Merge pull request #30 from Suor/master
kmike Nov 16, 2016
005c88b
fixed compatibility with recent scikit-learn
kmike Nov 16, 2016
f8fa440
TST simplify travis.yml. See GH-33.
kmike Nov 16, 2016
d043435
TST don’t test with Python 3.3
kmike Nov 16, 2016
0dfc6ac
TST don’t run tests twice for pull requests
kmike Nov 16, 2016
e7d552e
Merge pull request #34 from scrapinghub/fix-ci
kmike Nov 16, 2016
920df38
(backwards incompatible) remove custom CRFsuite wrapper, use sklearn-…
kmike Nov 16, 2016
c49301f
Merge pull request #35 from scrapinghub/sklearn-crfsuite
kmike Nov 16, 2016
93fc8c2
DOC more documentation for webstruct_data datasets
kmike Nov 16, 2016
db287d2
annotation fixes: emails, org names
kmike Nov 16, 2016
2c611c4
preserve comments in loaded trees
kmike Nov 16, 2016
03a82b4
annotations: remove problematic js code
kmike Nov 17, 2016
c51140a
DOC clarify known_entities of GateLoader
kmike Nov 17, 2016
1d0f4ac
add country names gazetteer
kmike Nov 17, 2016
9000067
TST switch to pytest, check that docs are building without warnings
kmike Nov 25, 2016
71f1e34
gitignore more files
kmike Nov 25, 2016
54b61a6
TST revert strict doc check
kmike Nov 25, 2016
a509bcb
Update codecov.yml
kmike Nov 25, 2016
e0fde7e
add codecov badge
kmike Nov 25, 2016
51684c0
DOC whoops, fix whitespaces in README
kmike Nov 25, 2016
3e05642
fixed NER.extract_groups_from_url `dont_penalize` argument
kmike Nov 25, 2016
d44d6f4
extract_entity_groups utility function
kmike Nov 25, 2016
6000221
move HtmlTokenizer to its own module
kmike Nov 25, 2016
8e5d98c
DOC trying to fix readthedocs build
kmike Nov 26, 2016
786a1f0
DOC try to fix readthedocs, again..
kmike Nov 26, 2016
9b7986b
bump version to 0.4; add changelog
kmike Nov 26, 2016
784fd3a
DOC typo fixes
kmike Nov 26, 2016
071bc78
fixed NER.extract bug
kmike Nov 28, 2016
628c8c2
bump version
kmike Nov 28, 2016
b63b9bb
webstruct.infer_domain
kmike Apr 6, 2017
5c33d14
TST create html coverage report locally by default
kmike Apr 6, 2017
1856b46
style fix: proper blank lines in imports
kmike Apr 6, 2017
97d6d37
Merge pull request #38 from scrapinghub/infer-domain
kmike Apr 6, 2017
c4786cb
preserve URL in <base> tag
kmike Apr 6, 2017
1499ad0
Merge pull request #39 from scrapinghub/wa-baseurl
kmike Apr 6, 2017
dfe77c2
switch to requests
kmike Apr 6, 2017
13d4437
a few countries.txt gazetter improvements
kmike Apr 11, 2017
0d6eaf7
fixed warning when reading geonames
kmike Apr 11, 2017
8c43f41
ignore more files in gitignore
kmike May 10, 2017
a5282a7
DOC more badges in README
kmike May 10, 2017
5a3f39e
v0.5
kmike May 10, 2017
8fb60d3
A complete example (contact extraction). See GH-24.
kmike Aug 4, 2017
9949492
DOC fix example README
kmike Aug 4, 2017
f656552
DOC mention requirements.txt in the example's README
kmike Aug 4, 2017
56913e2
hand made annotation
Sep 8, 2017
3aeb2d6
fix annotations
Sep 8, 2017
c016135
Merge pull request #41 from whalebot-helmsman/master
kmike Sep 8, 2017
7a42a23
add description for punctuation removing (#42)
whalebot-helmsman Sep 8, 2017
210f81a
more annotations
whalebot-helmsman Sep 12, 2017
e4fac51
more annotations
whalebot-helmsman Sep 12, 2017
2df7c24
more annotations
whalebot-helmsman Sep 12, 2017
b3956cc
correct ids
whalebot-helmsman Sep 12, 2017
51ef652
correct ids
whalebot-helmsman Sep 12, 2017
f1e002c
does not copy wa-title attributes
whalebot-helmsman Sep 13, 2017
807f3a6
verify conversion
whalebot-helmsman Sep 13, 2017
57bc016
convert annotation
whalebot-helmsman Sep 13, 2017
02aad41
write as html
whalebot-helmsman Sep 13, 2017
b7e1e17
move gate annotations to webannotator
whalebot-helmsman Sep 13, 2017
eb97aa5
tests for html tools
whalebot-helmsman Sep 14, 2017
e541806
pep8 style
whalebot-helmsman Sep 15, 2017
d42dcea
add program description
whalebot-helmsman Sep 15, 2017
37b8728
pep8 style
whalebot-helmsman Sep 15, 2017
80a6fb8
pep8 style
whalebot-helmsman Sep 15, 2017
dca5dd3
add program description
whalebot-helmsman Sep 15, 2017
9e480d4
pep8 style
whalebot-helmsman Sep 15, 2017
c1a1175
ability to pass entities list to verify
whalebot-helmsman Sep 15, 2017
5feb78a
look for annotations in WebAnnotator folder
whalebot-helmsman Sep 15, 2017
6d59b83
pep8
whalebot-helmsman Sep 15, 2017
2a7b013
test attribute removal for wa-title
whalebot-helmsman Sep 15, 2017
3556a57
Merge pull request #47 from whalebot-helmsman/master
kmike Sep 15, 2017
bc26275
mess is gone
whalebot-helmsman Sep 20, 2017
4130d78
no need for gate loader
whalebot-helmsman Sep 20, 2017
c2af278
Merge pull request #48 from whalebot-helmsman/master
kmike Sep 20, 2017
36d56f2
text tokenizer return postions of token
whalebot-helmsman Sep 21, 2017
2d4d2ef
update tests
whalebot-helmsman Sep 21, 2017
80658ca
separate statement for every action
whalebot-helmsman Sep 21, 2017
c52e449
comma preserving test
whalebot-helmsman Sep 21, 2017
8178776
too much tokens around
whalebot-helmsman Sep 21, 2017
51c0932
encode in indices instead of entities
whalebot-helmsman Sep 21, 2017
1a667ec
handle empty lists
whalebot-helmsman Sep 21, 2017
24465b1
pass token length and position from TextToken to HtmlToken
whalebot-helmsman Sep 21, 2017
06befbb
letter perfect detokenization
whalebot-helmsman Sep 22, 2017
e5730b2
do not cleanup tokenized tree by default, separate method for tree cl…
Sep 25, 2017
e340444
update tests for separate tree cleaning
Sep 25, 2017
89673c1
update tests for correct punctuation positions
Sep 25, 2017
7c45984
correct length for replaced quotes
Sep 25, 2017
46fc4df
pep8
Sep 29, 2017
90bdefd
new html tree based to webannotator transformer
Sep 26, 2017
1fb67a0
ignore scripts and styles
Sep 26, 2017
3117640
ignore elements with non-text tokens
Sep 27, 2017
084fb33
as we search use our regexp for text and tail in same moment, our sta…
Sep 27, 2017
43449a1
pep8
Sep 29, 2017
388170e
comma at line end, not start
Sep 29, 2017
71caf61
one join instead of many additions, dont be Schleimel
Sep 29, 2017
37d7470
correct formatting
Sep 29, 2017
e93c6dc
add clarification
Sep 29, 2017
e02c275
fix typo
Sep 29, 2017
f26569f
pep8
Sep 29, 2017
d1aecbb
preserve tokenize method for compatibility
Sep 29, 2017
35a9d88
function to reduce code in tests
Sep 29, 2017
9033188
remove test for nltk tokenizer
Sep 29, 2017
c14f363
test our behaviour, which difers from original treebank tokenizer
Sep 29, 2017
a071cd4
remove useless conversion
Sep 29, 2017
a33f564
rename method to avoid confusion with nltk tokenize_span method
Sep 29, 2017
75a9698
remove brittle tests
Sep 29, 2017
4729323
small benchmark for html tokenizer
Sep 29, 2017
943a44e
Revert "remove brittle tests"
whalebot-helmsman Oct 2, 2017
ba7d6fe
move brittle tests to pytest xfail
whalebot-helmsman Oct 2, 2017
b72bcc1
expect behaviour of nltk tokenizer
whalebot-helmsman Oct 2, 2017
f9190c3
Merge pull request #49 from whalebot-helmsman/master
kmike Oct 2, 2017
09f1699
Merge branch 'master' into webannotator-html
whalebot-helmsman Oct 3, 2017
281d4a5
rename variable
whalebot-helmsman Oct 3, 2017
a0d2519
make TagPosition private
whalebot-helmsman Oct 4, 2017
caa76cc
make translate_to_dfs private
whalebot-helmsman Oct 4, 2017
500ccf4
make fabricate_start/end private
whalebot-helmsman Oct 4, 2017
a743aed
make enclosure private
whalebot-helmsman Oct 4, 2017
f7e7a86
move enclosure deciding to separate function
whalebot-helmsman Oct 4, 2017
91c3962
rename generic tasks to concrete enclosures
whalebot-helmsman Oct 4, 2017
9e3b49a
move dfs order numbering to separate function
whalebot-helmsman Oct 4, 2017
3266427
move start/end tag locating in separate function
whalebot-helmsman Oct 4, 2017
7d56973
pep8
whalebot-helmsman Oct 4, 2017
1dc3f28
high level explanation of whats heppening here
whalebot-helmsman Oct 4, 2017
a92a339
no unicode tags, so string_types is enough
whalebot-helmsman Oct 4, 2017
833603b
reduce code
whalebot-helmsman Oct 4, 2017
4f22537
Merge pull request #50 from whalebot-helmsman/master
kmike Oct 4, 2017
ced2fd8
tutorial rewritten with usage of crfsuite
sibiryakov Oct 17, 2017
67763e6
wapiti link restored
sibiryakov Oct 17, 2017
770d777
Merge pull request #52 from scrapinghub/crfsuite-tutorial
kmike Oct 17, 2017
0bb8fd7
wapiti return bytes, not str
whalebot-helmsman Oct 19, 2017
2d92efb
collect all top N results but return only first of them
whalebot-helmsman Oct 19, 2017
b801d7a
merge top N chains for better recall
whalebot-helmsman Oct 19, 2017
739e269
benchmark script for model prediction
whalebot-helmsman Dec 21, 2017
d8afda6
we need newer wapiti version for python3 support
whalebot-helmsman Dec 21, 2017
0d92091
add various overlapping schemes for chains
whalebot-helmsman Dec 21, 2017
3842740
add description of merging method
whalebot-helmsman Dec 21, 2017
83b5327
Merge pull request #55 from whalebot-helmsman/master
kmike Dec 21, 2017
1713694
there are various types of unusual tags, not only comments
whalebot-helmsman Dec 22, 2017
7a68569
Merge pull request #56 from whalebot-helmsman/master
kmike Dec 22, 2017
0176cdb
non-recursive implementation of algorithm
whalebot-helmsman Dec 22, 2017
f4a1896
add description of WordTokenizer improvements
whalebot-helmsman Dec 22, 2017
3e09c9f
changd comment as code structure changed
whalebot-helmsman Dec 22, 2017
bff4c3e
Merge pull request #57 from whalebot-helmsman/master
kmike Dec 22, 2017
d8b1984
don't declare Python 3.3 support
kmike Dec 29, 2017
d5a7fcf
v0.6
kmike Dec 29, 2017
6b3bc61
fix boolean bug
Kebniss May 3, 2018
0aaef9f
add test case
Kebniss May 10, 2018
7816e9d
add bool test to test_pattern_features
Kebniss May 16, 2018
9fe8988
Merge pull request #59 from scrapinghub/fix-boolean-bug
kmike May 16, 2018
15308c0
update travis to run different python versions
Kebniss May 18, 2018
71bed4f
add branches
Kebniss May 18, 2018
7ddd317
Merge pull request #62 from scrapinghub/fix-travis
kmike May 18, 2018
83dbcc0
add order independent evaluation for a single page
Kebniss May 22, 2018
422a122
add evaluation and test
Kebniss May 24, 2018
25923f3
Fix bugs and add tests
Kebniss May 24, 2018
60ce632
fix bad error in tests
Kebniss May 24, 2018
8851704
make zeroes floats
Kebniss May 25, 2018
edb6123
reduce digits to compare in almost equal
Kebniss May 25, 2018
3f4f583
trying to hunt a bug in py 2.7
Kebniss May 25, 2018
f32e735
bug hunting
Kebniss May 25, 2018
17fd259
debug
Kebniss May 25, 2018
ad8bb48
debug
Kebniss May 25, 2018
8169a02
debug
Kebniss May 25, 2018
faa8fe0
debug
Kebniss May 25, 2018
e59813e
debug
Kebniss May 25, 2018
b824f16
debug
Kebniss May 25, 2018
241fe96
debug
Kebniss May 25, 2018
c6a34f6
debug
Kebniss May 25, 2018
f7d56b0
debug
Kebniss May 25, 2018
257a32f
still debug
Kebniss May 25, 2018
8b54fc1
debug
Kebniss May 25, 2018
2b8e9e7
debug
Kebniss May 25, 2018
7a19028
fix
Kebniss May 25, 2018
b3aa2c4
add single metric functions
Kebniss May 28, 2018
ca01649
fix tests
Kebniss May 28, 2018
8fab06a
remove comment and use np.allclose in test_evaluation
Kebniss Jun 27, 2018
6e5c305
remove spaces
Kebniss Jul 5, 2018
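
The last commits in this list add an order-independent, per-entity evaluation for a single page. The code from those commits is not shown on this page, so the following is only a minimal sketch of the general idea, not the implementation in this pull request. The function name `per_entity_scores` and the `(entity_type, text)` input format are assumptions made for illustration.

```python
from collections import Counter


def per_entity_scores(true_entities, predicted_entities):
    """Return {entity_type: (precision, recall, f1)} for one page.

    Hypothetical helper: both arguments are iterables of
    (entity_type, text) pairs. Order is ignored because the
    entities are compared as multisets.
    """
    true_counts = Counter(true_entities)
    pred_counts = Counter(predicted_entities)
    # Multiset intersection: an entity is matched at most as many
    # times as it occurs in both the annotation and the prediction.
    matched = true_counts & pred_counts

    def totals_by_type(counter):
        totals = Counter()
        for (entity_type, _text), n in counter.items():
            totals[entity_type] += n
        return totals

    tp = totals_by_type(matched)
    n_true = totals_by_type(true_counts)
    n_pred = totals_by_type(pred_counts)

    scores = {}
    for entity_type in set(n_true) | set(n_pred):
        precision = tp[entity_type] / n_pred[entity_type] if n_pred[entity_type] else 0.0
        recall = tp[entity_type] / n_true[entity_type] if n_true[entity_type] else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        scores[entity_type] = (precision, recall, f1)
    return scores


true_page = [("ORG", "Acme"), ("TEL", "555-1234"), ("TEL", "555-9999")]
predicted_page = [("TEL", "555-9999"), ("ORG", "Acme"), ("TEL", "000")]
scores = per_entity_scores(true_page, predicted_page)
```

Using a multiset rather than a set keeps repeated entities (for example the same phone number listed twice on a page) from being collapsed into one match.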
19 changes: 18 additions & 1 deletion .gitignore
@@ -24,7 +24,10 @@ pip-log.txt
# Unit test / coverage reports
.coverage
.tox
cover
nosetests.xml
.cache
htmlcov/

# Translations
*.mo
@@ -35,5 +38,19 @@ nosetests.xml
.pydevproject

# Other
.idea
webstruct_data/datastore

.ipynb_checkpoints
docs/_build
webstruct_data/todo
notebooks/old
notebooks/*.zip
notebooks/*.html
notebooks/*.ipynb
notebooks/*.marisa
notebooks/*.wapiti
notebooks/*.crfsuite
webstruct_data/corpus/us_contact_pages/cleaned
example/_data/*
example/*.joblib
example/*.html
40 changes: 40 additions & 0 deletions .travis.yml
@@ -0,0 +1,40 @@
language: python
python: 3.5
sudo: false

branches:
  only:
    - master
    - /^\d\.\d+$/

matrix:
  include:
    - python: 2.7
      env: TOXENV=py27
    - python: 3.4
      env: TOXENV=py34
    - python: 3.5
      env: TOXENV=py35
    - python: 3.6
      env: TOXENV=py36

addons:
  apt:
    packages:
      - python-numpy
      - python-scipy
      - libatlas-base-dev
      - liblapack-dev
      - gfortran

install:
  - pip install -U pip tox codecov

script: tox

after_success:
  - codecov

cache:
  directories:
    - $HOME/.cache/pip
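
The `branches.only` list in the `.travis.yml` above allows `master` plus any branch matching `/^\d\.\d+$/`. As a rough illustration of which branch names that pattern accepts, here is a small Python sketch. It assumes Python's `re` behaves the same as the CI service's regex engine for this simple pattern, and the helper name `is_built` is hypothetical.

```python
import re

# Same pattern as in the config, without the surrounding slashes.
RELEASE_BRANCH = re.compile(r"^\d\.\d+$")


def is_built(branch):
    """Hypothetical check mirroring the branches.only list."""
    return branch == "master" or RELEASE_BRANCH.match(branch) is not None


results = {name: is_built(name) for name in ["master", "0.6", "1.10", "0.4.1", "10.0", "fix-ci"]}
```

Under this reading, single-digit-major release branches such as `0.6` are built, while patch-level names such as `0.4.1` and feature branches are not.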
59 changes: 59 additions & 0 deletions CHANGES.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
Changes
=======

0.6 (2017-12-29)
----------------

* A complete example (contact extractor) is added to the repo;
* fixed a lot of issues in the annotated data;
* fixed loading of ``<title>`` annotations;
* all annotated data is converted from GATE to WebAnnotator format;
* text tokenizers can optionally return original token positions;
* converting text from tokenized to raw is now lossless;
* ``webstruct.webannotator.to_webannotator`` is rewritten;
* ``<script>``, ``<style>`` elements, HTML comments and processing
  instructions are ignored when they are inside entities;
* tutorial is rewritten for CRFSuite;
* Wapiti support is fixed in Python 3;
* top-N parsing support when using Wapiti; an option to merge top N chains,
  to increase recall;
* benchmarking script;
* don't declare Python 3.3 support (it is EOL).

0.5 (2017-05-10)
----------------

* webstruct.model.NER now uses ``requests`` library to make HTTP requests;
* changed default headers used by webstruct.model.NER;
* new ``webstruct.infer_domain`` module useful for proper cross-validation;
* webstruct.webannotator.to_webannotator got an option to add ``<base>``
  tag with the original URL to the page;
* fixed a warning in webstruct.gazetteers.geonames.read_geonames;
* add a few more country names to countries.txt list.

0.4.1 (2016-11-28)
------------------

* fixed a bug in NER.extract().

0.4 (2016-11-26)
----------------

* sklearn-crfsuite_ is used as a CRFsuite wrapper, CRFsuiteCRF class
  is removed;
* comments are preserved in HTML trees because recent Firefox puts
  ``<base>`` tags to a comment when saving pages, and this affects
  WebAnnotator;
* fixed 'dont_penalize' argument of webstruct.NER.extract_groups_from_url;
* new webstruct.model.extract_entity_groups utility function;
* HtmlTokenizer and HtmlToken are moved to their own module
  (webstruct.html_tokenizer);
* test improvements.

.. _sklearn-crfsuite: https://github.com/TeamHG-Memex/sklearn-crfsuite

0.3 (2016-09-19)
----------------

There are many changes from the previous version: the API is changed,
Python 3 is supported, gazetteer support is improved, CRFsuite support
is added, etc.
45 changes: 45 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
Webstruct
=========

.. image:: https://img.shields.io/pypi/v/webstruct.svg
   :target: https://pypi.python.org/pypi/webstruct
   :alt: PyPI Version

.. image:: https://travis-ci.org/scrapinghub/webstruct.svg?branch=master
   :target: https://travis-ci.org/scrapinghub/webstruct
   :alt: Build Status

.. image:: https://codecov.io/gh/scrapinghub/webstruct/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/scrapinghub/webstruct
   :alt: Code Coverage

.. image:: https://readthedocs.org/projects/webstruct/badge/?version=latest
   :target: http://webstruct.readthedocs.io/en/latest/
   :alt: Documentation


Webstruct is a library for creating statistical NER_ systems that work
on HTML data, i.e. a library for building tools that extract named
entities (addresses, organization names, open hours, etc.) from webpages.

Unlike most NER systems, webstruct works on HTML data, not only
on text data. This makes it possible to define features that use HTML
structure, and also to embed annotation results back into HTML.

Read the docs_ for more info.

License is MIT.

.. _docs: http://webstruct.readthedocs.io/en/latest/
.. _NER: http://en.wikipedia.org/wiki/Named-entity_recognition

Contributing
------------

* Source code: https://github.com/scrapinghub/webstruct
* Bug tracker: https://github.com/scrapinghub/webstruct/issues

To run tests, make sure tox_ is installed, then run
``tox`` from the source root.

.. _tox: https://tox.readthedocs.io/en/latest/
13 changes: 0 additions & 13 deletions block_model/README.md

This file was deleted.

11 changes: 0 additions & 11 deletions block_model/convert_html.py

This file was deleted.

16 changes: 0 additions & 16 deletions block_model/convert_labeled_data.py

This file was deleted.

132 changes: 0 additions & 132 deletions block_model/data/1.html

This file was deleted.

32 changes: 0 additions & 32 deletions block_model/data/1.txt

This file was deleted.
