A Python library for neural text processing that generates data for sequence tagging.
It supports feature templates, which extract context-based features from text, and hybrid feature templates, which are often used in neural-network sequence labeling.
This library uses 'fields' to specify the format of the input data. In a template file, the field definition appears in the first two lines.
In particular, there are four reserved fields, 'w', 'y', 'x' and 'F', which represent the token, its label, the representation of the current token, and the corresponding features, respectively. Never use them as names for your own additional features.
For example, consider the following data:
我 C S
爱 C S
北 C B
京 C E
天 C B
安 C M
门 C E
。 P S
Each line consists of multiple columns; the first column is always the token itself (field 'w') and the last column is its label (field 'y'). Here the second column is the character type of the token: 'C' for a hanzi character, 'E' for a letter, 'N' for a number and 'P' for punctuation.
A template for this data can be defined like this:
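The character-type column has to be produced before the data is fed to the library. The following is only a minimal sketch of how such a column could be computed; the function name `char_type`, the Unicode range used for hanzi, and the fallback type 'O' are assumptions made for illustration and are not part of this library.

```python
import unicodedata


def char_type(ch):
    """Map one character to a coarse type code (illustrative only)."""
    # Assumption: treat the CJK Unified Ideographs block as hanzi.
    if '\u4e00' <= ch <= '\u9fff':
        return 'C'
    if ch.isascii() and ch.isalpha():
        return 'E'
    if ch.isdigit():
        return 'N'
    # Unicode general categories starting with 'P' are punctuation.
    if unicodedata.category(ch).startswith('P'):
        return 'P'
    # Hypothetical fallback type; the README does not define one.
    return 'O'


types = [char_type(ch) for ch in '我a1。']
```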
# Fields
w T y
# Templates
w:-1
w: 0
w: 1
T: 0
Here the field name 'T' refers to the second column. Any string other than 'w', 'y', 'x' and 'F' can be used as a field name.
A basic feature template (src.feature.Template) extracts context-based features from text.
Feature prefixes can be enabled or disabled. For example, consider these context-based feature templates:
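To make the relationship between the template file and the data columns concrete, here is a rough sketch of how such a file could be parsed and how field names could be mapped onto columns. The helpers `parse_template` and `read_rows` are hypothetical names written for this example; the library's own parser may behave differently.

```python
TEMPLATE_TEXT = """\
# Fields
w T y
# Templates
w:-1
w: 0
w: 1
T: 0
"""

DATA_TEXT = """\
我 C S
爱 C S
北 C B
京 C E
"""


def parse_template(text):
    """Return (field names, list of (field, offset)) from template text."""
    fields, templates = [], []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith('#'):
            # Section headers such as "# Fields" or "# Templates".
            section = line.lstrip('#').strip().lower()
        elif section == 'fields':
            fields = line.split()
        elif section == 'templates':
            name, offset = line.split(':')
            templates.append((name.strip(), int(offset)))
    return fields, templates


def read_rows(text, fields):
    """Map each whitespace-separated column to its field name."""
    return [dict(zip(fields, line.split()))
            for line in text.splitlines() if line.strip()]


fields, templates = parse_template(TEMPLATE_TEXT)
rows = read_rows(DATA_TEXT, fields)
```

With this mapping, `rows[2]['T']` gives the character type of "北" and `rows[2]['y']` gives its label.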
# Fields
w y
# Templates
w:-2
w:-1
w: 0
w: 1
w: 2
Given the sentence "我爱北京天安门。", it extracts the following features for "北" and "京":
- Prefix enabled
'北': ['w[-2]:我', 'w[-1]:爱', 'w[0]:北', 'w[1]:京', 'w[2]:天']
'京': ['w[-2]:爱', 'w[-1]:北', 'w[0]:京', 'w[1]:天', 'w[2]:安']
- Prefix disabled
'北': ['我', '爱', '北', '京', '天']
'京': ['爱', '北', '京', '天', '安']
With prefixes disabled, the 'w[n]:' prefix is dropped. This mode can be used to extract the raw words in a window.
Using the Template class is simple: create an instance with temp = Template(template_file, prefix), and then pass temp as a parameter.
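The prefix behaviour described above can be sketched in a few lines. This is an illustration rather than the library's implementation: the function `extract` and the padding string `'<PAD>'` used for positions outside the sentence are assumptions made for this example.

```python
PAD = '<PAD>'  # hypothetical filler for out-of-range positions


def extract(tokens, index, offsets, prefix=True):
    """Collect the tokens at the given offsets around tokens[index]."""
    feats = []
    for off in offsets:
        pos = index + off
        value = tokens[pos] if 0 <= pos < len(tokens) else PAD
        # With prefixes enabled, each feature records its offset.
        feats.append('w[{}]:{}'.format(off, value) if prefix else value)
    return feats


tokens = list('我爱北京天安门。')
offsets = [-2, -1, 0, 1, 2]

# Index 2 is the token "北".
with_prefix = extract(tokens, 2, offsets, prefix=True)
without_prefix = extract(tokens, 2, offsets, prefix=False)
```

Keeping the offset in the feature string lets a model distinguish "爱" seen one position to the left from "爱" seen elsewhere in the window, while the prefix-free form is just the raw window of words.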
A HybridTemplate (src.features.HybridTemplate) combines a prefix-enabled Template with a prefix-disabled one. It generates both the window representation and the context features.
For example, suppose the window size is 3, meaning each token is represented by itself together with its left and right neighbors, and the template is:
# Fields
w y
# Templates
w:-2
w:-1
w: 0
w: 1
w: 2
Given the sentence "我爱北京天安门。", it uses the following windows to represent "北" and "京":
'北': ['爱', '北', '京']
'京': ['北', '京', '天']
It also extracts the following features for "北" and "京" (prefix enabled by default):
'北': ['w[-2]:我', 'w[-1]:爱', 'w[0]:北', 'w[1]:京', 'w[2]:天']
'京': ['w[-2]:爱', 'w[-1]:北', 'w[0]:京', 'w[1]:天', 'w[2]:安']
Using the HybridTemplate class is just as simple: create an instance with temp = HybridTemplate(template_file, window), and then pass temp as a parameter.
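As a rough illustration of the two outputs a hybrid template produces, the sketch below builds the window representation and the prefixed context features side by side. It assumes an odd, symmetric window and a `'<PAD>'` filler at sentence boundaries; the function names are hypothetical and the library itself may handle windows differently.

```python
def window_repr(tokens, index, window, pad='<PAD>'):
    """Raw tokens in a symmetric window centred on tokens[index]."""
    half = window // 2  # assumption: odd window, equal left/right context
    return [tokens[p] if 0 <= p < len(tokens) else pad
            for p in range(index - half, index + half + 1)]


def context_features(tokens, index, offsets, pad='<PAD>'):
    """Prefixed context features, one per template offset."""
    feats = []
    for off in offsets:
        pos = index + off
        value = tokens[pos] if 0 <= pos < len(tokens) else pad
        feats.append('w[{}]:{}'.format(off, value))
    return feats


def hybrid(tokens, index, window, offsets):
    """Return (window representation, context features) for one token."""
    return (window_repr(tokens, index, window),
            context_features(tokens, index, offsets))


tokens = list('我爱北京天安门。')
# Index 3 is the token "京".
x, feats = hybrid(tokens, 3, window=3, offsets=[-2, -1, 0, 1, 2])
```

The window size and the template offsets are independent: here the representation covers three tokens while the features look two positions to each side.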
This project also provides an evaluation method for BIO/BISO tagged sequences. Labels must conform to the following format:
O
B-name or I-name
O
B-name or I-name
S-name # Single token entity
For example, an NER task usually has several kinds of entities to recognize, typically 'PER', 'LOC' and 'ORG'.
In this case, the label set could be:
# others
O
# Beginning token of an entity of a given type
B-PER
B-ORG
B-LOC
# Inside token of an entity of a given type
I-PER
I-LOC
I-ORG
If the label set is not subdivided like this, just attach a suffix such as '-ANYTHING' to every non-O label so that the program works correctly.
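To show why the 'TAG-name' format matters, here is a minimal sketch of entity-level scoring for BIO/BISO labels. It is not the evaluation code shipped with this project: `extract_entities` and `prf` are hypothetical helpers, and exact span-and-type matching is an assumption about how entities are compared.

```python
def extract_entities(labels):
    """Return a set of (name, start, end) spans; end is exclusive."""
    entities = set()
    start, kind = None, None
    # A trailing 'O' closes any entity still open at the end.
    for i, label in enumerate(list(labels) + ['O']):
        tag, _, name = label.partition('-')
        # Close the open span unless this token continues it.
        if start is not None and not (tag == 'I' and name == kind):
            entities.add((kind, start, i))
            start, kind = None, None
        if tag == 'B':
            start, kind = i, name
        elif tag == 'S':
            entities.add((name, i, i + 1))  # single-token entity
    return entities


def prf(gold, pred):
    """Entity-level precision, recall and F1 with exact matching."""
    g, p = extract_entities(gold), extract_entities(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'S-ORG']
pred = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O', 'S-ORG']
precision, recall, f1 = prf(gold, pred)
```

Because each label is split on the first '-', a label without a suffix would have an empty entity name, which is one reason to attach a suffix to every non-O label.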
- 2018-07-28 ver 0.2.2
- Updated evaluation for BIO/BISO tagged sequences.
- 2018-01-09 ver 0.2.1
- Prefix, not suffix (Ah, my poor English :sweat_smile:).
- 2017-10-30 ver 0.2.0
- New HybridTemplate support
- Window representation for the current token, i.e. Xt = [Wt-l,...,Wt+r]; you can represent Wt by concatenating the vectors.
- Traditional context-based features.
- Compatibility changes.
- Reserved fields are now {'w', 'y', 'x', 'F'}, and field 'x' is now used to represent Wt.
- The word2idx map is built from statistics on field 'x'.
- The shape of the tensor returned by the method 'src.pretreatment.conv_corpus' has changed.
- 2017-09-25 ver 0.1.3
- Add new method to return the size of feature templates
- Replace both 'START'&'END' tag with ''
- 2017-09-12 ver 0.1.2
- label2idx's index starts from 1
- Index of unknown words or labels will be 0
- 2017-09-04 ver 0.1.1
- Index of feature ‘OOV’ set to default 0
- label2idx's index starts from 0
- 2017-08-26 ver 0.1.0
- First version