
Commit

Merge pull request #108 from AmyOlex/chronobert
Merging ChronoBERT branch with Master
AmyOlex authored Sep 20, 2022
2 parents 4120f5e + 9e7dab0 commit eb2252c
Showing 32 changed files with 2,809 additions and 323 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -73,3 +73,13 @@ Chrono_SemEval2018_PostEvalSubmission_NB_NewswireModel_060818/Chrono_TempEval201
*.tiff
*.dct
SemEval-OfficialTrain/*
*.xml.txt
*.list
*.out
i2b2_*
results*
062120_i2b2chrono.txt
062120_i2b2gold.txt
SemEval-OfficialTrain-Subset
THYME
THYME_subset
8 changes: 8 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default.

16 changes: 16 additions & 0 deletions .idea/Chrono.iml

Some generated files are not rendered by default.

138 changes: 138 additions & 0 deletions Chrono.egg-info/PKG-INFO
@@ -0,0 +1,138 @@
Metadata-Version: 2.1
Name: Chrono
Version: 2.0.2
Summary: Chrono is a hybrid rule-based and machine learning system that identifies temporal expressions in text and normalizes them into the Semantically Compositional Annotations for Temporal Expressions (SCATE) schema developed by Bethard and Parker. Chrono has emerged as the top performing system for SemEval 2018 Task 6: Parsing Time Normalizations.
Home-page: https://github.com/AmyOlex/Chrono
Author: Amy Olex, Luke Maffey, Nick Morton, and Bridget McInnes
Author-email: [email protected], [email protected], [email protected], [email protected]
License: GPLv3
Keywords: nlp temporal time-normalization semeval2018-Chrono
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
License-File: LICENSE

<!---
output:
html_document: default
pdf_document: default
--->

# Chrono - Parsing Time Normalizations into the SCATE Schema

### Amy Olex, Luke Maffey, Nicholas Morton, and Bridget McInnes
---

### Overview

Chrono is a hybrid rule-based and machine learning system that identifies temporal expressions in text and normalizes them into the Semantically Compositional Annotations for Temporal Expressions (SCATE) schema developed by Bethard and Parker [1]. After minor parsing logic adjustments, Chrono emerged as the top-performing system for SemEval 2018 Task 6: Parsing Time Normalizations.

Chrono requires input text files to be in the Anafora XML Directory Structure (described below). All training was done on the SemEval 2013 AQUAINT/TimeML dataset, which can be found at <https://github.com/bethard/anafora-annotations>.

### Requirements

- Python 3 (included with Anaconda 3, available from <https://www.anaconda.com/download>)
- TensorFlow <https://www.tensorflow.org>
- Keras <https://keras.io/#installation>
- AnaforaTools (for evaluation) <https://github.com/bethard/anaforatools>

#### Python Modules Required

- nltk
- python-dateutil
- numpy
- sklearn

### Installation

Installation has been tested on Mac OSX, Linux, and Windows 10 platforms.

- Download or clone this git repo to your computer. If using Git over SSH, run `git clone [email protected]:AmyOlex/Chrono.git` in your terminal.

- Run
```bash
>> python setup.py install
```

or

- Ensure you have all prerequisites installed, including all required Python modules.


### Usage

Navigate to the Chrono folder. For a description of all available options use:

```bash
>> python Chrono -h
```

Prior to running Chrono you must have:

> 1) The input text files organized into the Anafora XML Directory Structure.

> 2) A machine learning (ML) training matrix and class information.

The ML matrix files utilized by Chrono in the SemEval 2018 Task 6 challenge are included in the "sample_files" directory provided with this system. You may use these, or create your own using the "Create ML Matrix" instructions below.

#### Running Chrono

To run Chrono without evaluation against a gold standard, all you need is the input file directory and the two ML training matrix files. We will assume use of the provided matrix files in the "sample_files" directory; otherwise, change the paths accordingly. We also assume the input data files are in a local folder named "./data/my_input/", the output is saved to "./results/my_output/", and all input files have a ".txt" extension. The ML training matrix files are named "official_train_MLmatrix_Win5_012618_data.csv" and "official_train_MLmatrix_Win5_012618_class.csv".

```
>> python Chrono.py -i ./data/my_input -x ".txt" -o ./results/my_output -m SVM -d "./sample_files/official_train_MLmatrix_Win5_012618_data.csv" -c "./sample_files/official_train_MLmatrix_Win5_012618_class.csv"
```

#### Evaluating Chrono with Anafora Tools

To evaluate Chrono performance you must have:

> 1) The gold standard Anafora annotations for your input files organized in the Anafora XML Directory Structure, with each gold standard XML file named the same as its input file plus an extension of the form ".\*.completed.\*.xml". These gold standard files may be located in the same directory as the associated input file (as long as only one XML file is present), in which case your gold standard directory is also your input directory.

> 2) AnaforaTools installed <https://github.com/bethard/anaforatools>.

The following assumes your gold standard XML files are stored in the same directory as your input files; if not, adjust the paths as needed. It also assumes AnaforaTools is installed locally in the directory "./anaforatools". Change paths as needed if this is not the case.

```bash
>> cd ./anaforatools
>> python -m anafora.evaluate -r ../data/my_input -p ../results/my_output
```

The evaluation can be customized to focus on specific entities. Read the AnaforaTools documentation and/or review the help documentation for details.

```bash
>> python -m anafora.evaluate -h
```

#### Training Data Matrix Generation

The machine learning methods require two files to operate: a data matrix and a class file. We provide files that use a window size of 5 in the "sample_files" directory; you can also create your own training files with different window sizes and different subsets of training data. To create your own training files, do the following:

> 1) Ensure all the gold standard data you want to use for training is in a separate directory structure from your testing data.

> 2) Run the Chrono_createMLTrainingMatrix.py script as follows (assuming your input text files and the gold standard XML files are in the same directory, named "./data/my_input"):

```bash
>> python Chrono_createMLTrainingMatrix.py -i ./data/my_input/ -g ./data/my_input/ -o MLTrainingMatrix_Win5 -w 5
```

The *-o* option is the base file name your training data matrix files will be saved to, and the *-w* option is the context window size, which is 3 by default. The output from this script is two ".csv" files that can be used as input to Chrono.
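If it helps to inspect the generated files, a minimal loader might look like the following. This is an illustrative sketch, not part of Chrono; the `_data.csv` / `_class.csv` suffix convention is inferred from the sample file names above.

```python
import csv

def load_training_matrix(base):
    """Load the feature matrix and class labels produced by
    Chrono_createMLTrainingMatrix.py, assuming the output files are
    named BASE_data.csv and BASE_class.csv (as in sample_files/)."""
    with open(base + "_data.csv", newline="") as f:
        data = [row for row in csv.reader(f)]
    with open(base + "_class.csv", newline="") as f:
        classes = [row for row in csv.reader(f)]
    return data, classes
```

For example, `load_training_matrix("MLTrainingMatrix_Win5")` would read the two files written by the command above.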

#### K-Fold Cross-Validation
In order to thoroughly test the performance of a machine learning method, modify ChronoKFold.py to use the appropriate -m option and then run:
```bash
>> python ChronoKFold.py > kfoldoutput.txt
```
### Anafora XML Directory Structure
In the Anafora XML Directory Structure, each input file sits in its own folder, with the folder named the same as the file minus its extension. There is also an additional text file containing the document time, named the same as the input file but with the extension ".dct". This DCT file contains only the document date. The Anafora XML Directory Structure can contain the raw input file as well as the Anafora annotation XML file used as a gold standard. It should NOT contain the result XML files generated by Chrono; results should be saved in a separate directory.
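The layout described above can be sketched with a small helper. This is illustrative only; the helper name is made up, and the `.txt` extension is the one assumed elsewhere in this README.

```python
from pathlib import Path

def make_anafora_doc(root, doc_name, text, doc_date, ext=".txt"):
    """Create one document entry in the Anafora XML Directory Structure:
    a folder named after the file (minus its extension) holding the raw
    text and a .dct file containing only the document date."""
    doc_dir = Path(root) / doc_name
    doc_dir.mkdir(parents=True, exist_ok=True)
    (doc_dir / f"{doc_name}{ext}").write_text(text)      # raw input text
    (doc_dir / f"{doc_name}.dct").write_text(doc_date)   # document creation time
    return doc_dir
```

The gold standard ".completed" XML annotations would sit in the same per-document folder; Chrono's own output XML should go to a separate results directory.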


---
### References

1. Bethard, S. and Parker, J. (2016) [A Semantically Compositional Annotation Scheme for Time Normalization](http://www.lrec-conf.org/proceedings/lrec2016/pdf/288_Paper.pdf). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, 5 2016


41 changes: 41 additions & 0 deletions Chrono.egg-info/SOURCES.txt
@@ -0,0 +1,41 @@
LICENSE
README.md
setup.py
Chrono/BuildEntities.py
Chrono/TimePhraseEntity.py
Chrono/__init__.py
Chrono/chronoEntities.py
Chrono/gold_standard_utils.py
Chrono/referenceToken.py
Chrono/temporalTest.py
Chrono/utils.py
Chrono/w2ny.py
Chrono.egg-info/PKG-INFO
Chrono.egg-info/SOURCES.txt
Chrono.egg-info/dependency_links.txt
Chrono.egg-info/requires.txt
Chrono.egg-info/top_level.txt
Chrono/TimePhraseToChrono/AMPM.py
Chrono/TimePhraseToChrono/BeforeAfter.py
Chrono/TimePhraseToChrono/DayOfMonth.py
Chrono/TimePhraseToChrono/DayOfWeek.py
Chrono/TimePhraseToChrono/Frequency.py
Chrono/TimePhraseToChrono/HourOfDay.py
Chrono/TimePhraseToChrono/Last.py
Chrono/TimePhraseToChrono/MinuteOfHour.py
Chrono/TimePhraseToChrono/Modifier.py
Chrono/TimePhraseToChrono/MonthYear.py
Chrono/TimePhraseToChrono/NthFromStart.py
Chrono/TimePhraseToChrono/NumericDate.py
Chrono/TimePhraseToChrono/PartOfDay.py
Chrono/TimePhraseToChrono/PartOfWeek.py
Chrono/TimePhraseToChrono/PeriodInterval.py
Chrono/TimePhraseToChrono/Season.py
Chrono/TimePhraseToChrono/SecondOfMinute.py
Chrono/TimePhraseToChrono/TextMonthAndDay.py
Chrono/TimePhraseToChrono/TextYear.py
Chrono/TimePhraseToChrono/This.py
Chrono/TimePhraseToChrono/TimeZone.py
Chrono/TimePhraseToChrono/TwentyFourHourTime.py
Chrono/TimePhraseToChrono/__init__.py
dictionary/__init__.py
1 change: 1 addition & 0 deletions Chrono.egg-info/dependency_links.txt
@@ -0,0 +1 @@

6 changes: 6 additions & 0 deletions Chrono.egg-info/requires.txt
@@ -0,0 +1,6 @@
nltk
python-dateutil
numpy
sklearn
keras
tensorflow
2 changes: 2 additions & 0 deletions Chrono.egg-info/top_level.txt
@@ -0,0 +1,2 @@
Chrono
dictionary
51 changes: 42 additions & 9 deletions Chrono.py
@@ -44,7 +44,9 @@
from Chrono import BuildEntities
from Chrono import referenceToken
from Chrono import utils
from keras.models import load_model
from transformers import BertModel, BertTokenizer
from tensorflow.keras.models import load_model
from joblib import load

debug=False

@@ -70,8 +72,17 @@
parser.add_argument('-d', metavar='MLTrainData', type=str, help='A string representing the file name that contains the CSV file with the training data matrix.', required=False, default=False)
parser.add_argument('-c', metavar='MLTrainClass', type=str, help='A string representing the file name that contains the known classes for the training data matrix.', required=False, default=False)
parser.add_argument('-M', metavar='MLmodel', type=str, help='The path and file name of a pre-build ML model for loading.', required=False, default=None)
parser.add_argument('-b', metavar='BERTmodel', type=str,
help='The path and file name of a pre-built BERT model for loading.', required=False,
default=None)
parser.add_argument('-B', metavar='BERTClassificationModel', type=str,
help='The path and file name of a pre-trained SVM or CNN classification model from ChronoBERT.', required=False,
default=None)
#parser.add_argument('-r',metavar='includeRelative', type=str2bool, help='Tell Chrono to mark relative phrases temporal words as temporal.', action="store_true", default=False)
parser.add_argument('--includeRelative', action="store_true")
parser.add_argument('--includeRelative', action="store_true", default=False)
parser.add_argument('--includeContext', action="store_true", default=False)
parser.add_argument('--includeAttention', action="store_true", default=False)
parser.add_argument('--cnn', action="store_true", default=False)

args = parser.parse_args()
## Now we can access each argument as args.i, args.o, args.r
@@ -158,7 +169,16 @@
feats = utils.get_features(args.d)

## Pass the ML classifier through to the parse SUTime entities method.


# load in BERT model
bert_model = BertModel.from_pretrained(args.b, output_hidden_states=True, use_cache=True, output_attentions=True)
bert_tokenizer = BertTokenizer.from_pretrained(args.b)

if args.cnn:
bert_classifier = load_model(args.B)
else:
bert_classifier = load(args.B)

## Loop through each file and parse
for f in range(0,len(infiles)) :
print("Parsing "+ infiles[f] +" ...")
@@ -173,29 +193,42 @@
doctime = utils.getDocTime(infiles[f] + ".dct", i2b2=False)
if(debug) : print(doctime)

## parse out reference tokens
raw_text, text, tokens, spans, tags, sents = utils.getWhitespaceTokens(infiles[f]+args.x)
## parse out reference tokens. The spans returned are character spans, not token spans.
## sents is per token, a 1 indicates that token is the last in the sentence.
##
raw_text, text, tokens, abs_text_spans, rel_text_spans, tags, sents, sent_text, sent_membership = utils.getWhitespaceTokens2(infiles[f]+args.x)
#my_refToks = referenceToken.convertToRefTokens(tok_list=tokens, span=spans, remove_stopwords="./Chrono/stopwords_short2.txt")
my_refToks = referenceToken.convertToRefTokens(tok_list=tokens, span=spans, pos=tags, sent_boundaries=sents)
my_refToks = referenceToken.convertToRefTokens(tok_list=tokens, abs_span=abs_text_spans, rel_span=rel_text_spans, pos=tags, sent_boundaries=sents, sent_membership=sent_membership)

if(args.includeRelative):
print("Including Relative Terms")

## mark all ref tokens if they are numeric or temporal
chroList = utils.markTemporal(my_refToks, include_relative = args.includeRelative)
chroList = utils.markTemporal(my_refToks, include_relative=args.includeRelative)

if(debug) :
print("REFERENCE TOKENS:\n")
for tok in chroList : print(tok)

tempPhrases = utils.getTemporalPhrases(chroList, doctime)
tempPhrases = utils.getTemporalPhrases(chroList, sent_text, doctime)

if(debug):
for c in tempPhrases:
print(c)


chrono_master_list, my_chrono_ID_counter, timex_phrases = BuildEntities.buildChronoList(tempPhrases, my_chrono_ID_counter, chroList, (classifier, args.m), feats, doctime)

chrono_master_list, my_chrono_ID_counter, timex_phrases = BuildEntities.buildChronoList(tempPhrases,
my_chrono_ID_counter,
chroList,
(classifier, args.m),
feats, bert_model,
bert_tokenizer,
bert_classifier,
args.includeContext,
args.includeAttention,
args.cnn,
doctime)

print("Number of Chrono Entities: " + str(len(chrono_master_list)))

16 changes: 8 additions & 8 deletions Chrono/BuildEntities.py
@@ -53,7 +53,7 @@
# @param list of TimePhrase Output
# @param document creation time (optional)
# @return List of Chrono entities and the ChronoID
def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeatures, dct=None):
def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeatures, bert_model, bert_tokenizer, bert_classifier, includeContext, includeAttention, cnn, dct=None):
chrono_list = []

## Do some further pre-processing on the ref token list
@@ -114,7 +114,7 @@ def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeature
chrono_tmp_list, chrono_id = Frequency.buildFrequency(s, chrono_id, chrono_tmp_list)


print("XXXXXXXXX")
#print("XXXXXXXXX")

# if len(chrono_tmp_list) > 0:
# print(s)
@@ -127,10 +127,10 @@ def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeature
## Need to add ISO conversion here!

if len(tmplist) > 0:
print("Converting phrase to ISO: " + str(s))
s.getISO(tmplist)
print("ISO Value: " + str(s))
print("TIMEX3 String: " + s.i2b2format())
#print("Converting phrase to ISO: " + str(s))
s.getISO(tmplist, bert_model, bert_tokenizer, bert_classifier, includeContext, includeAttention, cnn)
#print("ISO Value: " + str(s))
#print("TIMEX3 String: " + s.i2b2format())
timex_list.append(s)


@@ -226,12 +226,12 @@ def buildSubIntervals(chrono_list, chrono_id, dct, ref_list):
my_dayweek = weekdays[chrono_list[dayweek].get_day_type()]
if my_dayweek < dct_day:
chrono_list.append(chrono.ChronoLastOperator(entityID=str(chrono_id) + "entity", start_span=mStart, end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_list.append(chrono.ChronoLastOperator(entityID=str(chrono_id) + "entity", abs_start_span=mStart, abs_end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_id = chrono_id + 1
print("FOUND DAYWEEK LAST")
elif my_dayweek > dct_day:
chrono_list.append(chrono.ChronoNextOperator(entityID=str(chrono_id) + "entity", start_span=mStart, end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_list.append(chrono.ChronoNextOperator(entityID=str(chrono_id) + "entity", abs_start_span=mStart, abs_end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_id = chrono_id + 1
print("FOUND DAYWEEK NEXT")
'''
