
Commit

Merge pull request #108 from AmyOlex/chronobert
Merging ChronoBERT branch with Master
AmyOlex authored Sep 20, 2022
2 parents 4120f5e + 9e7dab0 commit eb2252c
Showing 32 changed files with 2,809 additions and 323 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -73,3 +73,13 @@ Chrono_SemEval2018_PostEvalSubmission_NB_NewswireModel_060818/Chrono_TempEval201
*.tiff
*.dct
SemEval-OfficialTrain/*
*.xml.txt
*.list
*.out
i2b2_*
results*
062120_i2b2chrono.txt
062120_i2b2gold.txt
SemEval-OfficialTrain-Subset
THYME
THYME_subset
8 changes: 8 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default.

16 changes: 16 additions & 0 deletions .idea/Chrono.iml

Some generated files are not rendered by default.

138 changes: 138 additions & 0 deletions Chrono.egg-info/PKG-INFO
@@ -0,0 +1,138 @@
Metadata-Version: 2.1
Name: Chrono
Version: 2.0.2
Summary: Chrono is a hybrid rule-based and machine learning system that identifies temporal expressions in text and normalizes them into the Semantically Compositional Annotations for Temporal Expressions (SCATE) schema developed by Bethard and Parker. Chrono has emerged as the top performing system for SemEval 2018 Task 6: Parsing Time Normalizations.
Home-page: https://github.com/AmyOlex/Chrono
Author: Amy Olex, Luke Maffey, Nick Morton, and Bridget McInnes
Author-email: [email protected], [email protected], [email protected], [email protected]
License: GPLv3
Keywords: nlp temporal time-normalization semeval2018-Chrono
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
License-File: LICENSE

<!---
output:
html_document: default
pdf_document: default
--->

# Chrono - Parsing Time Normalizations into the SCATE Schema

### Amy Olex, Luke Maffey, Nicholas Morton, and Bridget McInnes
---

### Overview

Chrono is a hybrid rule-based and machine learning system that identifies temporal expressions in text and normalizes them into the Semantically Compositional Annotations for Temporal Expressions (SCATE) schema developed by Bethard and Parker [1]. After minor parsing logic adjustments, Chrono emerged as the top-performing system for SemEval 2018 Task 6: Parsing Time Normalizations.

Chrono requires input text files to be in the Anafora XML Directory Structure (described below). All training was done on the SemEval 2013 AQUAINT/TimeML dataset, which can be found at <https://github.com/bethard/anafora-annotations>.

### Requirements

- Python 3 (included with Anaconda 3, available from <https://www.anaconda.com/download>)
- TensorFlow <https://www.tensorflow.org>
- Keras <https://keras.io/#installation>
- AnaforaTools (for evaluation) <https://github.com/bethard/anaforatools>

#### Python Modules Required

- nltk
- python-dateutil
- numpy
- sklearn

### Installation

Installation has been tested on Mac OSX, Linux, and Windows 10 platforms.

- Download or clone this git repo to your computer. If using Git over SSH, run `git clone [email protected]:AmyOlex/Chrono.git` in your terminal.

- Run
```bash
>> python setup.py install
```

or

- Ensure you have all prerequisites installed, including all required Python modules.


### Usage

Navigate to the Chrono folder. For a description of all available options use:

```bash
>> python Chrono -h
```

Prior to running Chrono you must have:

> 1) The input text files organized into the Anafora XML Directory Structure.

> 2) A machine learning (ML) training matrix and class information.

The ML matrix files utilized by Chrono in the SemEval 2018 Task 6 challenge are included in the "sample_files" directory provided with this system. You may use these, or create your own using the "Create ML Matrix" instructions below.

#### Running Chrono

To run Chrono without evaluation against a gold standard, all you need is the input file directory and the two ML training matrix files. We will assume use of the provided matrix files in the "sample_files" directory; otherwise, change the paths accordingly. We also assume the input data files are in a local folder named "./data/my_input/", the output is saved to "./results/my_output/", and all input files have a ".txt" extension. The ML training matrix files are named "official_train_MLmatrix_Win5_012618_data.csv" and "official_train_MLmatrix_Win5_012618_class.csv".

```
>> python Chrono.py -i ./data/my_input -x ".txt" -o ./results/my_output -m SVM -d "./sample_files/official_train_MLmatrix_Win5_012618_data.csv" -c "./sample_files/official_train_MLmatrix_Win5_012618_class.csv"
```

#### Evaluating Chrono with Anafora Tools

To evaluate Chrono performance you must have:

> 1) The gold standard Anafora annotations for your input files organized in the Anafora XML Directory Structure, with each gold standard XML file named the same as its input file plus an extension of the form ".\*.completed.\*.xml". These gold standard files may be located in the same directory as the associated input file (as long as only one XML file is present), in which case your gold standard directory is also your input directory.

> 2) AnaforaTools installed <https://github.com/bethard/anaforatools>.

The following assumes your gold standard XML files are stored in the same directory as your input files; if not, adjust the paths as needed. It also assumes AnaforaTools is installed locally in the directory "./anaforatools". Change paths as needed if this is not the case.

```bash
>> cd ./anaforatools
>> python -m anafora.evaluate -r ../data/my_input -p ../results/my_output
```

The evaluation can be customized to focus on specific entities. Read the AnaforaTools documentation and/or review the help documentation for details.

```bash
>> python -m anafora.evaluate -h
```

#### Training Data Matrix Generation

The machine learning methods require two files to operate: a data matrix and a class file. We provide files that use a window size of 5 in the "sample_files" directory; you can also create your own training files with different window sizes and different subsets of training data. To create your own training files, do the following:

> 1) Ensure all the gold standard data you want to use for training is in a separate directory structure from your testing data.

> 2) Run the Chrono_createMLTrainingMatrix.py script as follows (assuming your input text files and the gold standard XML files are in the same directory, named "./data/my_input"):

```bash
>> python Chrono_createMLTrainingMatrix.py -i ./data/my_input/ -g ./data/my_input/ -o MLTrainingMatrix_Win5 -w 5
```

The *-o* option is the base file name your training data matrix files will be saved to, and the *-w* option is the context window size, which is 3 by default. The output from this script is two ".csv" files that can be used as input to Chrono.
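If it helps to inspect the generated files, a minimal loader might look like the following. This is an illustrative sketch, not part of Chrono; the `_data.csv` / `_class.csv` suffix convention is inferred from the sample file names above.

```python
import csv

def load_training_matrix(base):
    """Load the feature matrix and class labels produced by
    Chrono_createMLTrainingMatrix.py, assuming the output files are
    named BASE_data.csv and BASE_class.csv (as in sample_files/)."""
    with open(base + "_data.csv", newline="") as f:
        data = [row for row in csv.reader(f)]
    with open(base + "_class.csv", newline="") as f:
        classes = [row for row in csv.reader(f)]
    return data, classes
```

For example, `load_training_matrix("MLTrainingMatrix_Win5")` would read the two files written by the command above.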

#### K-Fold Cross-Validation
In order to thoroughly test the performance of a machine learning method, modify ChronoKFold.py to use the appropriate -m option and then run:
```bash
>> python ChronoKFold.py > kfoldoutput.txt
```
### Anafora XML Directory Structure
In the Anafora XML Directory Structure, each input file sits in its own folder, with the folder named the same as the file minus its extension. There is also an additional text file containing the document time, named the same as the input file but with the extension ".dct". This DCT file contains only the document date. The Anafora XML Directory Structure can contain the raw input file as well as the Anafora annotation XML file used as a gold standard. It should NOT contain the result XML files generated by Chrono; results should be saved in a separate directory.
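The layout described above can be sketched with a small helper. This is illustrative only; the helper name is made up, and the `.txt` extension is the one assumed elsewhere in this README.

```python
from pathlib import Path

def make_anafora_doc(root, doc_name, text, doc_date, ext=".txt"):
    """Create one document entry in the Anafora XML Directory Structure:
    a folder named after the file (minus its extension) holding the raw
    text and a .dct file containing only the document date."""
    doc_dir = Path(root) / doc_name
    doc_dir.mkdir(parents=True, exist_ok=True)
    (doc_dir / f"{doc_name}{ext}").write_text(text)      # raw input text
    (doc_dir / f"{doc_name}.dct").write_text(doc_date)   # document creation time
    return doc_dir
```

The gold standard ".completed" XML annotations would sit in the same per-document folder; Chrono's own output XML should go to a separate results directory.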


---
### References

1. Bethard, S. and Parker, J. (2016) [A Semantically Compositional Annotation Scheme for Time Normalization](http://www.lrec-conf.org/proceedings/lrec2016/pdf/288_Paper.pdf). Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, 5 2016


41 changes: 41 additions & 0 deletions Chrono.egg-info/SOURCES.txt
@@ -0,0 +1,41 @@
LICENSE
README.md
setup.py
Chrono/BuildEntities.py
Chrono/TimePhraseEntity.py
Chrono/__init__.py
Chrono/chronoEntities.py
Chrono/gold_standard_utils.py
Chrono/referenceToken.py
Chrono/temporalTest.py
Chrono/utils.py
Chrono/w2ny.py
Chrono.egg-info/PKG-INFO
Chrono.egg-info/SOURCES.txt
Chrono.egg-info/dependency_links.txt
Chrono.egg-info/requires.txt
Chrono.egg-info/top_level.txt
Chrono/TimePhraseToChrono/AMPM.py
Chrono/TimePhraseToChrono/BeforeAfter.py
Chrono/TimePhraseToChrono/DayOfMonth.py
Chrono/TimePhraseToChrono/DayOfWeek.py
Chrono/TimePhraseToChrono/Frequency.py
Chrono/TimePhraseToChrono/HourOfDay.py
Chrono/TimePhraseToChrono/Last.py
Chrono/TimePhraseToChrono/MinuteOfHour.py
Chrono/TimePhraseToChrono/Modifier.py
Chrono/TimePhraseToChrono/MonthYear.py
Chrono/TimePhraseToChrono/NthFromStart.py
Chrono/TimePhraseToChrono/NumericDate.py
Chrono/TimePhraseToChrono/PartOfDay.py
Chrono/TimePhraseToChrono/PartOfWeek.py
Chrono/TimePhraseToChrono/PeriodInterval.py
Chrono/TimePhraseToChrono/Season.py
Chrono/TimePhraseToChrono/SecondOfMinute.py
Chrono/TimePhraseToChrono/TextMonthAndDay.py
Chrono/TimePhraseToChrono/TextYear.py
Chrono/TimePhraseToChrono/This.py
Chrono/TimePhraseToChrono/TimeZone.py
Chrono/TimePhraseToChrono/TwentyFourHourTime.py
Chrono/TimePhraseToChrono/__init__.py
dictionary/__init__.py
1 change: 1 addition & 0 deletions Chrono.egg-info/dependency_links.txt
@@ -0,0 +1 @@

6 changes: 6 additions & 0 deletions Chrono.egg-info/requires.txt
@@ -0,0 +1,6 @@
nltk
python-dateutil
numpy
sklearn
keras
tensorflow
2 changes: 2 additions & 0 deletions Chrono.egg-info/top_level.txt
@@ -0,0 +1,2 @@
Chrono
dictionary
51 changes: 42 additions & 9 deletions Chrono.py
@@ -44,7 +44,9 @@
from Chrono import BuildEntities
from Chrono import referenceToken
from Chrono import utils
from keras.models import load_model
from transformers import BertModel, BertTokenizer
from tensorflow.keras.models import load_model
from joblib import load

debug=False

@@ -70,8 +72,17 @@
parser.add_argument('-d', metavar='MLTrainData', type=str, help='A string representing the file name that contains the CSV file with the training data matrix.', required=False, default=False)
parser.add_argument('-c', metavar='MLTrainClass', type=str, help='A string representing the file name that contains the known classes for the training data matrix.', required=False, default=False)
parser.add_argument('-M', metavar='MLmodel', type=str, help='The path and file name of a pre-build ML model for loading.', required=False, default=None)
parser.add_argument('-b', metavar='BERTmodel', type=str,
help='The path and file name of a pre-built BERT model for loading.', required=False,
default=None)
parser.add_argument('-B', metavar='BERTClassificationModel', type=str,
help='The path and file name of a pre-trained SVM or CNN classification model from ChronoBERT.', required=False,
default=None)
#parser.add_argument('-r',metavar='includeRelative', type=str2bool, help='Tell Chrono to mark relative phrases temporal words as temporal.', action="store_true", default=False)
parser.add_argument('--includeRelative', action="store_true")
parser.add_argument('--includeRelative', action="store_true", default=False)
parser.add_argument('--includeContext', action="store_true", default=False)
parser.add_argument('--includeAttention', action="store_true", default=False)
parser.add_argument('--cnn', action="store_true", default=False)

args = parser.parse_args()
## Now we can access each argument as args.i, args.o, args.r
@@ -158,7 +169,16 @@
feats = utils.get_features(args.d)

## Pass the ML classifier through to the parse SUTime entities method.


# load in BERT model
bert_model = BertModel.from_pretrained(args.b, output_hidden_states=True, use_cache=True, output_attentions=True)
bert_tokenizer = BertTokenizer.from_pretrained(args.b)

if args.cnn:
bert_classifier = load_model(args.B)
else:
bert_classifier = load(args.B)

## Loop through each file and parse
for f in range(0,len(infiles)) :
print("Parsing "+ infiles[f] +" ...")
@@ -173,29 +193,42 @@
doctime = utils.getDocTime(infiles[f] + ".dct", i2b2=False)
if(debug) : print(doctime)

## parse out reference tokens
raw_text, text, tokens, spans, tags, sents = utils.getWhitespaceTokens(infiles[f]+args.x)
## parse out reference tokens. The spans returned are character spans, not token spans.
## sents is per token, a 1 indicates that token is the last in the sentence.
##
raw_text, text, tokens, abs_text_spans, rel_text_spans, tags, sents, sent_text, sent_membership = utils.getWhitespaceTokens2(infiles[f]+args.x)
#my_refToks = referenceToken.convertToRefTokens(tok_list=tokens, span=spans, remove_stopwords="./Chrono/stopwords_short2.txt")
my_refToks = referenceToken.convertToRefTokens(tok_list=tokens, span=spans, pos=tags, sent_boundaries=sents)
my_refToks = referenceToken.convertToRefTokens(tok_list=tokens, abs_span=abs_text_spans, rel_span=rel_text_spans, pos=tags, sent_boundaries=sents, sent_membership=sent_membership)

if(args.includeRelative):
print("Including Relative Terms")

## mark all ref tokens if they are numeric or temporal
chroList = utils.markTemporal(my_refToks, include_relative = args.includeRelative)
chroList = utils.markTemporal(my_refToks, include_relative=args.includeRelative)

if(debug) :
print("REFERENCE TOKENS:\n")
for tok in chroList : print(tok)

tempPhrases = utils.getTemporalPhrases(chroList, doctime)
tempPhrases = utils.getTemporalPhrases(chroList, sent_text, doctime)

if(debug):
for c in tempPhrases:
print(c)


chrono_master_list, my_chrono_ID_counter, timex_phrases = BuildEntities.buildChronoList(tempPhrases, my_chrono_ID_counter, chroList, (classifier, args.m), feats, doctime)

chrono_master_list, my_chrono_ID_counter, timex_phrases = BuildEntities.buildChronoList(tempPhrases,
my_chrono_ID_counter,
chroList,
(classifier, args.m),
feats, bert_model,
bert_tokenizer,
bert_classifier,
args.includeContext,
args.includeAttention,
args.cnn,
doctime)

print("Number of Chrono Entities: " + str(len(chrono_master_list)))

16 changes: 8 additions & 8 deletions Chrono/BuildEntities.py
@@ -53,7 +53,7 @@
# @param list of TimePhrase Output
# @param document creation time (optional)
# @return List of Chrono entities and the ChronoID
def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeatures, dct=None):
def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeatures, bert_model, bert_tokenizer, bert_classifier, includeContext, includeAttention, cnn, dct=None):
chrono_list = []

## Do some further pre-processing on the ref token list
@@ -114,7 +114,7 @@ def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeature
chrono_tmp_list, chrono_id = Frequency.buildFrequency(s, chrono_id, chrono_tmp_list)


print("XXXXXXXXX")
#print("XXXXXXXXX")

# if len(chrono_tmp_list) > 0:
# print(s)
@@ -127,10 +127,10 @@ def buildChronoList(TimePhraseList, chrono_id, ref_list, PIclassifier, PIfeature
## Need to add ISO conversion here!

if len(tmplist) > 0:
print("Converting phrase to ISO: " + str(s))
s.getISO(tmplist)
print("ISO Value: " + str(s))
print("TIMEX3 String: " + s.i2b2format())
#print("Converting phrase to ISO: " + str(s))
s.getISO(tmplist, bert_model, bert_tokenizer, bert_classifier, includeContext, includeAttention, cnn)
#print("ISO Value: " + str(s))
#print("TIMEX3 String: " + s.i2b2format())
timex_list.append(s)


@@ -226,12 +226,12 @@ def buildSubIntervals(chrono_list, chrono_id, dct, ref_list):
my_dayweek = weekdays[chrono_list[dayweek].get_day_type()]
if my_dayweek < dct_day:
chrono_list.append(chrono.ChronoLastOperator(entityID=str(chrono_id) + "entity", start_span=mStart, end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_list.append(chrono.ChronoLastOperator(entityID=str(chrono_id) + "entity", abs_start_span=mStart, abs_end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_id = chrono_id + 1
print("FOUND DAYWEEK LAST")
elif my_dayweek > dct_day:
chrono_list.append(chrono.ChronoNextOperator(entityID=str(chrono_id) + "entity", start_span=mStart, end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_list.append(chrono.ChronoNextOperator(entityID=str(chrono_id) + "entity", abs_start_span=mStart, abs_end_span=mEnd, repeating_interval=chrono_list[dayweek].get_id()))
chrono_id = chrono_id + 1
print("FOUND DAYWEEK NEXT")
'''
