- Name: Alessandro
- Surname: Grassi
- Student number: 221224
The main.py script requires Python, spaCy, and its English models to be installed.
To install spaCy through pip, run the following command in the shell:
pip install spacy
To install the English models, run the following command in the shell:
python -m spacy download en_core_web_sm
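All the functions below assume a spaCy Language object named nlp has already been created at the top of main.py. A minimal sketch of that setup:

import spacy

# load the small English pipeline installed with the command above
nlp = spacy.load("en_core_web_sm")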
Task: extract a path of dependency relations from the ROOT to a token
The task of the first function is to take one sentence as input and, for each token, return the path from the root to that token, specifying the dependency label of each arc.
Below is an example, where the token at the current iteration is "a":
['ROOT->', 'saw', 'dobj->', 'man', 'prep->', 'with', 'pobj->', 'telescope', 'det->', 'a']
To achieve the result, the input string is given to the parser and a list of sentences is created. The input of the function should be a single sentence, but if more sentences are provided they will be split and processed one by one. Each token of each sentence is then iterated: the token and the dependency label linking it to its head are inserted at the front of the list, and we move up to the head until the root token is reached. The root itself is not inserted inside the loop, so it is appended after the loop ends.
def get_dependency_paths(sentences):
    doc = nlp(sentences)  # get the doc object
    sentences = list(doc.sents)  # used to handle multiple sentences
    sentences_in_form_of_dependencies = []
    sentence_index = 0
    for sentence in sentences:  # iterate over each sentence
        sentences_in_form_of_dependencies.append([])
        token_index = 0
        for token in sentence:
            sentences_in_form_of_dependencies[sentence_index].append([])
            while token.head != token:  # climb until the root is reached
                sentences_in_form_of_dependencies[sentence_index][token_index].insert(0, token.text)
                sentences_in_form_of_dependencies[sentence_index][token_index].insert(0, token.dep_ + "->")
                token = token.head
            sentences_in_form_of_dependencies[sentence_index][token_index].insert(0, token.text)
            sentences_in_form_of_dependencies[sentence_index][token_index].insert(0, token.dep_ + "->")
            token_index += 1
        sentence_index += 1
    return sentences_in_form_of_dependencies
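A usage example (the exact paths depend on the parse produced by the model, e.g. the prepositional phrase may attach to "saw" instead of "man"):

paths = get_dependency_paths("I saw the man with a telescope.")
for sentence_paths in paths:
    for path in sentence_paths:
        print(path)
# e.g. ['ROOT->', 'saw', 'dobj->', 'man', 'prep->', 'with', 'pobj->', 'telescope', 'det->', 'a']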
Task: extract the subtree of dependents given a token
The algorithm splits the input if more than one sentence is provided. Then, for each token of each sentence, a list containing the tokens of its subtree, ordered as in the original sentence, is appended to the list of subtrees.
def get_subtree_of_dependents_given_a_token(sentences):
    doc = nlp(sentences)  # get the doc object
    sentences = list(doc.sents)  # used to handle multiple sentences
    subtrees = []
    sentence_index = 0  # init sentence index
    for sentence in sentences:  # iterate over each sentence
        subtrees.append([])
        for token in sentence:
            # token.subtree yields the token and its descendants in sentence order
            subtrees[sentence_index].append([subtree_token.text for subtree_token in token.subtree])
        sentence_index += 1
    return subtrees
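A usage example (again, the exact subtrees depend on the parse):

subtrees = get_subtree_of_dependents_given_a_token("I saw the man.")
print(subtrees)
# e.g. [[['I'], ['I', 'saw', 'the', 'man', '.'], ['the'], ['the', 'man'], ['.']]]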
Task: check if a given list of tokens (segment of a sentence) forms a subtree
The task is to check whether the dependency tree built from the input sentence contains, as a subtree, the list of tokens given as input.
The algorithm works by converting the given subtree to a set, which makes the equality check independent of token order. Then every token of the sentence is iterated. For each token, it is inserted in a list that is treated as a queue, in order to perform a BFS (Breadth First Search) through the graph. With this method a tree is created for each token, with that token as root node. Once all tokens are processed, each tree is converted to a set and compared with the input one; if they match, the algorithm returns True.
def check_if_subtree(sentence, subtree):
    subtree = set(subtree)  # order-independent comparison
    doc = nlp(sentence)  # get the doc object
    subtrees_array = []
    subtree_index = 0
    for token in doc:
        if not is_generator_empty(token.children):
            # a subtree includes its head token, so start from the token itself
            subtrees_array.append([token.text])
            queue = [token]
            while len(queue) > 0:  # BFS over the token's descendants
                current_node = queue.pop(0)
                for child in current_node.children:
                    subtrees_array[subtree_index].append(child.text)
                    queue.append(child)
            subtree_index += 1
    for tree in subtrees_array:
        if set(tree) == subtree:
            return True
    return False
Since there is no built-in way to check whether a generator is empty, the function below checks if a generator has elements in it. Note that it consumes the first element of the generator it receives; this is safe here because token.children returns a fresh generator on each access.
def is_generator_empty(generator):
    for element in generator:
        return False
    return True
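A usage example (whether a segment forms a subtree depends on the parse produced by the model):

print(check_if_subtree("I saw the man with a telescope.", ["a", "telescope"]))    # True, subtree of "telescope"
print(check_if_subtree("I saw the man with a telescope.", ["the", "telescope"]))  # False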
Task: identify head of a span, given its tokens
The algorithm receives a span in the form of a string. Each token is iterated until the one with the ROOT dependency label is reached, and that token is returned.
def identify_head_of_a_span_in_form_of_string(span):
    doc = nlp(span)
    for token in doc:
        if token.dep_ == "ROOT":
            return token.text
    return None
A variant of the algorithm that takes a Span object as input is also provided:
def identify_head_of_a_span(span):
    return span.root  # spaCy spans expose their syntactic head directly
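A usage example for both variants (doc[5:7] covers "a telescope"):

doc = nlp("I saw the man with a telescope.")
print(identify_head_of_a_span_in_form_of_string("a telescope"))  # 'telescope'
print(identify_head_of_a_span(doc[5:7]).text)                    # 'telescope'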
Task: extract sentence subject, direct object and indirect object spans
The algorithm first finds the root token. Then its children are inspected and, if their dependency label matches dobj, nsubj, or iobj, a BFS is performed starting from that child.
def get_parts_of_sentence(sentence):
    doc = nlp(sentence)
    root = None
    for token in doc:  # get root
        if token.dep_ == "ROOT":
            root = token
            break
    parts_of_sentence = {}
    for child in root.children:
        if child.dep_ == "dobj":
            parts_of_sentence["dobj"] = bfs(child)
        if child.dep_ == "nsubj":
            parts_of_sentence["nsubj"] = bfs(child)
        if child.dep_ == "iobj":
            parts_of_sentence["iobj"] = bfs(child)
    return parts_of_sentence
The code below performs a BFS implemented with a list treated as a queue. It returns a list of all the nodes encountered, including the starting one.
def bfs(token):
    tree_array = []
    queue = [token]
    while len(queue) > 0:
        current_node = queue.pop(0)
        tree_array.append(current_node)
        for child in current_node.children:
            queue.append(child)
    return tree_array
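A usage example (note that spaCy's English models typically label indirect objects as dative rather than iobj, so the iobj key may be missing even when an indirect object is present):

parts = get_parts_of_sentence("She gave me the book.")
print({label: [token.text for token in tokens] for label, tokens in parts.items()})
# e.g. {'nsubj': ['She'], 'dobj': ['book', 'the']}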
When the main.py script is executed in the shell, you will be asked to write a sentence to analyze. Before the fourth function you will have to write a span to analyze; the output of each function is already printed in a readable way.
In order to modify the features used to build the dependency graph, I overrode the extract_features method of the Configuration class and the following methods from TransitionParser: _create_training_examples_arc_std, _create_training_examples_arc_eager and parse. extract_features is where the features used for the classification problem are specified; in the TransitionParser class I just specified to use my configuration instead of the default one. To modify the kind of classifier I overrode the train method; the classifier used is a decision tree. Unfortunately the results are worse than those of the default parser; the code is provided just to prove my understanding of the TransitionParser and Configuration classes. The results are probably not promising because of the Hughes effect: I likely added too many features with too few training samples, causing overfitting. The features were selected in an ad hoc way, by inspecting the token object with the Visual Studio Code debugger.
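A minimal sketch of the train override is shown below. It assumes the internal layout of NLTK's TransitionParser (a temporary file in svmlight format filled by the _create_training_examples_arc_* methods, and a pickled model); exact internals may differ between NLTK versions, so this illustrates the general approach rather than a definitive implementation.

import os
import pickle
import tempfile

from nltk.parse.transitionparser import TransitionParser
from sklearn.datasets import load_svmlight_file
from sklearn.tree import DecisionTreeClassifier


class DecisionTreeTransitionParser(TransitionParser):
    # Sketch: reuse the parent's training-example generation (which writes
    # features in svmlight format to a temporary file) but fit a decision
    # tree instead of the default SVM.
    def train(self, depgraphs, modelfile, verbose=True):
        input_file = tempfile.NamedTemporaryFile(
            prefix="transition_parse.train", dir=tempfile.gettempdir(), delete=False)
        try:
            if self._algorithm == self.ARC_STANDARD:
                self._create_training_examples_arc_std(depgraphs, input_file)
            else:
                self._create_training_examples_arc_eager(depgraphs, input_file)
            input_file.close()
            x_train, y_train = load_svmlight_file(input_file.name)
            model = DecisionTreeClassifier()  # the swapped-in classifier
            model.fit(x_train, y_train)
            with open(modelfile, "wb") as f:
                pickle.dump(model, f)
        finally:
            os.remove(input_file.name)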