Add date features #58
It is better to put models somewhere else, and notebooks were broken.
add base classifier and global ngrams feature functions
1. rename DEFAULT_TAGSET to EXAMPLE_TAGSET; 2. rename DEFAULT_FEATURES to EXAMPLE_TOKEN_FEATURES; 3. make token_features empty by default in create_wapiti_pipeline.
except model_filename must be kwargs now. Also, this fixes the example from the tutorial.
…v/webstruct into speed_up_text_tokenizer
Speed up text tokenizer
Codecov Report
@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   81.01%   81.14%   +0.13%
==========================================
  Files          40       41       +1
  Lines        2091     2180      +89
==========================================
+ Hits         1694     1769      +75
- Misses        397      411      +14
Uh? Is it complaining because I did not write tests for the new features?
token = HtmlToken('1st')
expected = {'looks_like_day_ordinal': True}
result = looks_like_day_ordinal(token)
assert result == expected
Would you mind cleaning it up a bit? E.g. you could add a helper function, assert_looks_like_day_ordinal("1st", True), to reduce copy-paste and make the code clearer.
from webstruct.features import token_lower, token_identity, Pattern


class PatternTest(unittest.TestCase):
Why is it removed?
webstruct/features/data_features.py
Outdated
    return {'looks_like_date_pattern': True}  # XX/XX/XXXX
if re.search('\d{1,2}\.\d{1,2}\.\d{2,4}', html_token.token):
    return {'looks_like_date_pattern': True}  # XX.XX.XXXX
if re.search('\d{1,2}-\d{1,2}-\d{2,4}', html_token.token):
It matches XX.X.XXX, right? I think it makes sense to exclude 3-digit years from the pattern.
This function also doesn't catch common date variants like YYYY-MM-DD.
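One possible way to address both points is a single compiled pattern that accepts only 2- or 4-digit years and adds a year-first alternative. This is a minimal sketch, not the code that ended up in the PR: the function here takes a plain string rather than an HtmlToken, and the choice to require the same separator on both sides is an assumption.

```python
import re

# Hypothetical rewrite of the check under review. The accepted formats and
# the "same separator twice" rule are assumptions made for illustration.
_DATE_RE = re.compile(
    r"(?<!\d)(?:"
    # Day/month first, year of exactly 4 or exactly 2 digits: XX/XX/XXXX, XX.XX.XX
    r"\d{1,2}(?P<s1>[./-])\d{1,2}(?P=s1)(?:\d{4}|\d{2})"
    r"|"
    # Year first: XXXX-XX-XX, XXXX.XX.XX
    r"\d{4}(?P<s2>[./-])\d{1,2}(?P=s2)\d{1,2}"
    r")(?!\d)"
)


def looks_like_date_pattern(text):
    """Return the feature dict for a plain token string."""
    return {"looks_like_date_pattern": bool(_DATE_RE.search(text))}
```

The digit lookarounds on both ends stop the pattern from matching inside a longer run of digits, which is what rejects a 3-digit year such as 12.5.201.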
def test_looks_like_ordinal():

    def assert_looks_like_ordinal(token, expected):
        assert looks_like_ordinal(token) == expected
If you replace it with
assert looks_like_ordinal(HtmlToken(text)) == {'looks_like_ordinal': expected}
the test code will be smaller and more DRY, and it'd be easier to add more tests. The same applies to test_looks_like_date_pattern.
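For illustration, the suggested helper could look roughly like the sketch below. The HtmlToken and looks_like_ordinal defined here are simplified stand-ins so the snippet runs on its own; the real ones live in webstruct and may have different fields and matching rules.

```python
import re
from collections import namedtuple

# Stand-in for webstruct's HtmlToken (assumption: only a `token` field is needed).
HtmlToken = namedtuple("HtmlToken", ["token"])


def looks_like_ordinal(html_token):
    # Stand-in feature function; the regex is an assumption, not the PR's code.
    matched = re.match(r"^\d+(st|nd|rd|th)$", html_token.token)
    return {"looks_like_ordinal": bool(matched)}


def assert_looks_like_ordinal(text, expected):
    # The helper builds both the token and the expected dict,
    # so each test case is a single line.
    assert looks_like_ordinal(HtmlToken(text)) == {"looks_like_ordinal": expected}


def test_looks_like_ordinal():
    assert_looks_like_ordinal("1st", True)
    assert_looks_like_ordinal("22nd", True)
    assert_looks_like_ordinal("first", False)
    assert_looks_like_ordinal("21", False)
```

Adding a case is then one extra line, and the same shape would work for test_looks_like_date_pattern.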
@@ -99,4 +97,5 @@ def _add_pattern_features(feature_dicts, pattern, out_value, missing_value, sepa

    # FIXME: there should be a cleaner/faster way
    if not all(v == out_value for v in values):
        values = [str(v) for v in values]
This is not correct in Python 2, as you'll be casting unicode features to str (i.e. to bytes).
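A version-agnostic conversion could avoid that cast. The sketch below is one possible approach, not the fix that was merged: the helper name to_text is hypothetical, and decoding byte strings as UTF-8 is an assumption.

```python
# `type(u"")` is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u"")


def to_text(value):
    """Convert a feature value to text without turning unicode into bytes."""
    if isinstance(value, text_type):
        return value  # already text, leave untouched
    if isinstance(value, bytes):
        # Assumption: byte strings hold UTF-8 encoded text.
        return value.decode("utf-8")
    # Numbers, booleans, None and so on.
    return text_type(value)
```

With such a helper, the list comprehension in the hunk would become values = [to_text(v) for v in values].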
- looks_like_date now includes patterns like XXXX.XX.XX and excludes 3-digit years like XX/XX/XXX
I ran some tests to check how much these features help in identifying date objects, and the results were mixed:
Scores were evaluated with 3-fold cross-validation on 45 labelled pages, using a CRF model.
I added the features I created for Fireflax