Model building #28
himesh257 commented on Aug 9, 2022
- Added PURE data setup and nervaluate scripts
- Restructured folders
- Took care of Dependabot alerts caused by previous changes to the master branch
sorted_by_direction = sorted(final_sent_arr, key=lambda d: d['right_to_left'], reverse=True)

# getting the index after which no sentence will have a right_to_left relationship
right_to_left_boundary = get_right_to_left_boundary(sorted_by_direction)
@himesh257 I am not sure about this part. Would using the stratify parameter of sklearn's split solve your problem instead of doing all this?
@CVxTz the only reason I did it manually is that I wanted to make sure enough left_to_right and right_to_left relationships were present in the training data. I tried relying entirely on train_test_split, but the samples weren’t evenly distributed between the two directions. We can talk more about it later if you'd like.
Also, for the entity model (with which we got 0.75 strict F1 for "base"), the data setup file is PURE_ent_data_setup.py, which uses train_test_split. Thanks!
You should try the stratify parameter; it keeps the class proportions the same across train, test and val.
It would also be better to use the same split for the entity and relation models. What do you think?
One thing that confuses me is how to use stratify when we don’t have a "y". @CVxTz do you know how it would work in this case?
And yes, PURE will first predict the entities and then the relationships. For simplicity and speed, I am populating the entities so I can skip that step and run the relationship model directly for now. So all in all, it will eventually have the same split.
@himesh257 You can create an indicator for the most common relationship direction in each sentence and use it as the stratify value.
Can you share the resulting files from your different splits somewhere?
Thanks
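For reference, here is a minimal sketch of the indicator idea, not the actual script. It assumes a hypothetical schema in which each sentence dict has a "relations" list and each relation has a "direction" of either "left_to_right" or "right_to_left"; the helper name dominant_direction and the toy data are made up for illustration. The per-sentence label plays the role of the missing "y" for stratify.

```python
from collections import Counter

from sklearn.model_selection import train_test_split


def dominant_direction(sentence):
    """Return the most common relation direction in a sentence, or "none"."""
    counts = Counter(rel["direction"] for rel in sentence["relations"])
    if not counts:
        return "none"
    # On a tie, Counter.most_common keeps the direction seen first.
    return counts.most_common(1)[0][0]


# Toy data in the assumed (hypothetical) schema: 80 vs. 20 sentences.
sentences = (
    [{"id": i, "relations": [{"direction": "left_to_right"}]} for i in range(80)]
    + [{"id": 80 + i, "relations": [{"direction": "right_to_left"}]} for i in range(20)]
)

# One label per sentence; this is what gets passed as stratify.
labels = [dominant_direction(s) for s in sentences]

train, test = train_test_split(
    sentences, test_size=0.1, stratify=labels, random_state=0
)
```

With labels like these, both directions should appear in train and test in roughly the same proportion as in the full data. Note that stratify raises an error if any label occurs only once, so very rare labels may need to be merged first.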
@CVxTz I am not sure what you mean by that. Can you fix the script if it's not a big change? I sent you the train and test splits on Slack.
I realized that stratify isn’t needed anymore and updated the scripts.
import os

file_name_answers = "gold_standard"
file_path_answers = "/Users/ash/Desktop/PURE/PURE/new_data/"+file_name_answers+".jsonl"
Hardcoded data paths are bad for reproducibility.
To make the experiment reproducible, it would probably be a good idea to give the user a script that downloads the dataset into a directory, and then to use paths relative to that directory. README.md should list the steps needed to run the experiment, e.g.
run download_data.py
and then data_setup.py
and then ...
Also, pathlib.Path should be used for paths to make sure they work on all machines.
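As a rough sketch of what that could look like, assuming the download script saves the file into a new_data directory inside the repository (the directory layout, the constant names and the answers_path helper are assumptions, not part of this PR):

```python
from pathlib import Path

# Constants at the top of the file; names are illustrative.
FILE_NAME_ANSWERS = "gold_standard"
DATA_DIR = Path("new_data")  # relative to where the script is run from


def answers_path(data_dir: Path = DATA_DIR, name: str = FILE_NAME_ANSWERS) -> Path:
    """Build the path to the answers file without any machine-specific prefix."""
    return data_dir / f"{name}.jsonl"


file_path_answers = answers_path()
```

The `/` operator joins path segments with the right separator for the current OS, so the same code works on a reviewer's machine without editing the path.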
train, test = train_test_split(final_sent_arr, test_size=0.1)
#val, test = train_test_split(test, test_size=0.5)
data_folder = "data_ent1"
This should be defined at the top of the Python file with the rest of the constants.
data_folder = "data_ent1"

#shutil.rmtree(data_folder)
os.makedirs(data_folder, exist_ok=True)
Again, pathlib is preferred for handling paths, and also over the os module for things like this.
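A possible pathlib equivalent of the os.makedirs call, as a sketch only. The ensure_data_dir helper is a made-up name, and the temporary directory is there just so the example can run anywhere without touching the repository:

```python
import tempfile
from pathlib import Path

DATA_FOLDER = "data_ent1"


def ensure_data_dir(root: Path, name: str = DATA_FOLDER) -> Path:
    """Create <root>/<name> if it is missing and return its path."""
    out_dir = root / name
    # Same behavior as os.makedirs(..., exist_ok=True).
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_data_dir(Path(tmp))
    ensure_data_dir(Path(tmp))  # calling it again does not raise
    created = out_dir.is_dir()
```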
# for item in val:
# file.write("%s\n" % json.dumps(item))

with open('./{}/train.json'.format(data_folder), 'w') as file:
- with open('./{}/train.json'.format(data_folder), 'w') as file:
+ with open(f'./{data_folder}/train.json', 'w') as file:
f-strings should be preferred over .format.
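Combining this with the pathlib comments, the write step could be sketched roughly as below. The write_jsonl helper is hypothetical. It keeps the one-JSON-object-per-line format and the .json file name used in the script, and writes to a temporary directory so the example is self-contained:

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(items, data_folder, split):
    """Write one JSON object per line to <data_folder>/<split>.json."""
    path = Path(data_folder) / f"{split}.json"
    with path.open("w") as file:
        for item in items:
            file.write(f"{json.dumps(item)}\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = write_jsonl([{"id": 1}, {"id": 2}], tmp, "train")
    lines = out.read_text().splitlines()
```

A single helper like this would also avoid repeating the open/write loop for each of the train, test and val files.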