Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add date features #58

Open
wants to merge 450 commits into
base: master
Choose a base branch
from
Open

Add date features #58

wants to merge 450 commits into from

Conversation

Kebniss
Copy link
Contributor

@Kebniss Kebniss commented Jan 30, 2018

I added the features I created for Fireflax

kmike and others added 30 commits April 23, 2014 02:37
It is better to put models somewhere else, and notebooks were broken.
add base classifier and global ngrams feature functions
1. rename DEFAULT_TAGSET to EXAMPLE_TAGSET;
2. rename DEFAULT_FEATURES to EXAMPLE_TOKEN_FEATURES;
3. make token_features empty by default in create_wapiti_pipeline.
except model_filename must be kwargs now. Also, this fixes the example
from the tutorial.
@codecov
Copy link

codecov bot commented Jan 30, 2018

Codecov Report

Merging #58 into master will increase coverage by 0.13%.
The diff coverage is 84.44%.

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   81.01%   81.14%   +0.13%     
==========================================
  Files          40       41       +1     
  Lines        2091     2180      +89     
==========================================
+ Hits         1694     1769      +75     
- Misses        397      411      +14

@Kebniss
Copy link
Contributor Author

Kebniss commented Jan 30, 2018

Uh? Is it complaining because I did not write tests for the new features?

token = HtmlToken('1st')
expected = {'looks_like_day_ordinal': True}
result = looks_like_day_ordinal(token)
assert result == expected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind cleaning it up a bit - e.g. you can make a function assert_looks_like_day_ordinal("1st", True) to reduce copy-paste and make code more clear

from webstruct.features import token_lower, token_identity, Pattern


class PatternTest(unittest.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it removed?

@Kebniss Kebniss changed the title Add date features [WIP] Add date features Feb 2, 2018
@Kebniss Kebniss changed the title [WIP] Add date features Add date features Feb 2, 2018
return {'looks_like_date_pattern': True} # XX/XX/XXXX
if re.search('\d{1,2}\.\d{1,2}\.\d{2,4}', html_token.token):
return {'looks_like_date_pattern': True} # XX.XX.XXXX
if re.search('\d{1,2}-\d{1,2}-\d{2,4}', html_token.token):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matches XX.X.XXX, right? I think it makes sense to exclude 3-letter years from the pattern.

This function also doesn't catch common date variants like YYYY-MM-DD

def test_looks_like_ordinal():

def assert_looks_like_ordinal(token, expected):
assert looks_like_ordinal(token) == expected
Copy link
Member

@kmike kmike Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you replace it with

assert looks_like_ordinal(HtmlToken(text)) == {'looks_like_ordinal': expected}

test code will be smaller and more DRY, it'd be easier to add more tests. The same applies for test_looks_like_date_pattern.

@@ -99,4 +97,5 @@ def _add_pattern_features(feature_dicts, pattern, out_value, missing_value, sepa

# FIXME: there should be a cleaner/faster way
if not all(v == out_value for v in values):
values = [str(v) for v in values]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct in Python 2, as you'll be casting unicode features to str (i.e. to bytes).

@Kebniss
Copy link
Contributor Author

Kebniss commented May 3, 2018

I run some tests to check how much these features help identifying date objects and results were mixed:

  • when start and end dates were identified by a single entity the extra features slightly worsened the performance moving the F1 score for B-date and I-date from 0.567 and 0.628 to 0.548 and 0.611 respectively. Sequence accuracy remains the same
  • when start and end dates were identified in two separate entities the extra features slightly increased the performance. For B-END_DATE F1 score moved from 0.591 to 0.625, I-END_DATE went from 0.682 to 0.721, B-START_DATE went from 0.522 to 0.547 and I-START_DATE went from 0.667 to 0.690. sequence accuracy went from 1.5% to 3.1%

scores were evaluated cross validating (3 fold) on 45 labelled pages and using crf model

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants