Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Model building #28

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

Model building #28

wants to merge 14 commits into from

Conversation

himesh257
Copy link
Contributor

  • Added PURE data setup and nervaluate scripts
  • Restructured folders
  • Took care of Dependabot alerts caused by previous changes to the master branch

sorted_by_direction = sorted(final_sent_arr, key=lambda d: d['right_to_left'], reverse=True)

# getting the index after which no sentence will have a right_to_left relationship
right_to_left_boundary = get_right_to_left_boundary(sorted_by_direction)
Copy link

@CVxTz CVxTz Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himesh257 I am not sure about this part. Would using the stratify parameter of sklearn's split solve your problem instead of doing all this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CVxTz the only reason I manually did it is because I wanted to make sure that enough amount of left_to_right and right_to_left relationships were present in the training data. I tried completely relying on train_test_split but the samples weren’t equally classified. We can talk more on it later if you'd like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for the entity training model (with which we got 0.75 strict f1 for "base"), this is the file PURE_ent_data_setup.py which uses train_test_split. Thanks!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should try the stratify parameter, it will ensure everything is the same between train, test and val.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to also have the same split between ent/relation, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one thing that confuses me is the usage of stratify where we don’t have a "y", you see? @CVxTz do you know how it would work in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, PURE will first predict the entities and then relationships. For simplicity and speed, I am populating the entities so I can skip that step and directly run the relationship model for now. So all in all, it will eventually have the same split

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himesh257 You can create an indicator about the most common relationship direction in the sentence and use this as the stratify value.
Can you share the resulting files from your different splits somewhere ?
Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CVxTz I am not sure what you mean by that, can you fix the script if it's not a big change? I slacked you the train and test splits

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realized that stratify isn’t needed anymore, updated scripts

@himesh257 himesh257 self-assigned this Sep 17, 2022
import os

file_name_answers = "gold_standard"
file_path_answers = "/Users/ash/Desktop/PURE/PURE/new_data/"+file_name_answers+".jsonl"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded datapaths are not good for reproducibility

in order to achieve reproducibility it would probably be a good idea to have a way for the user to download dataset using a script which would save it in a directory and then use relative paths to that dir. README.md should contain steps necessary for executing experiment, e.g.

run download_data,py and then data_setup.py and then ...

also pathlib.Path should be used for paths in order to make sure that they work on all machines.


train, test = train_test_split(final_sent_arr, test_size=0.1)
#val, test = train_test_split(test, test_size=0.5)
data_folder = "data_ent1"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be defined on top of the python file with rest of the constants

data_folder = "data_ent1"

#shutil.rmtree(data_folder)
os.makedirs(data_folder, exist_ok=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again pathlib is always prefered for handling paths and also over os module for stuff like this

# for item in val:
# file.write("%s\n" % json.dumps(item))

with open('./{}/train.json'.format(data_folder), 'w') as file:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with open('./{}/train.json'.format(data_folder), 'w') as file:
with open(f'./{data_folder}/train.json', 'w') as file:

f-strings should be used over .format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants