This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

[Implementation] Sentence order prediction (SOP) label for a single-chunk-document in create_pretraining_data.py #234

Open
joongbo opened this issue Oct 26, 2020 · 0 comments

joongbo commented Oct 26, 2020

Thanks for the great work.

I have a question about a gap between what the paper reports and the released code for the sentence order prediction (SOP) task. The SOP code appears to still include an NSP-style case.

Section 3.1 in the ALBERT paper says that SOP can solve NSP (next sentence prediction) to a reasonable degree (as in Table 5, Section 4.6). Whereas the paper says SOP uses only consecutive sentences, the released code contains a random document selection procedure.

I think the problem is how sentence_order_label is set in create_pretraining_data.py for a document with a single chunk. In lines 315-317, the code randomly selects another document when len(current_chunk) == 1 and sets is_random_next = True (which means sentence_order_label = 1). This label does not mark a truly reversed order of consecutive sentences (as in SOP); it marks a random next segment, as in NSP.
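To make the distinction concrete, here is a minimal sketch of the labeling I would expect if SOP used only consecutive segments. It is not the repository's code: `make_sop_pair` is a hypothetical helper, and skipping a single-segment chunk is my assumption about one possible way to handle that case.

```python
import random


def make_sop_pair(current_chunk, rng):
    """Hypothetical helper, not part of create_pretraining_data.py.

    current_chunk: list of segments, each a list of tokens.
    Returns (tokens_a, tokens_b, sentence_order_label), or None when the
    chunk has fewer than two segments and no ordered pair can be formed.
    """
    # Assumption: a single-segment chunk is skipped instead of being paired
    # with a segment from a random document (which would be an NSP-style label).
    if len(current_chunk) < 2:
        return None

    # Split the chunk into a first part (A) and a second part (B).
    a_end = rng.randint(1, len(current_chunk) - 1)
    tokens_a = [tok for seg in current_chunk[:a_end] for tok in seg]
    tokens_b = [tok for seg in current_chunk[a_end:] for tok in seg]

    # With probability 0.5, swap A and B; label 1 then means "reversed order".
    if rng.random() < 0.5:
        return tokens_b, tokens_a, 1
    return tokens_a, tokens_b, 0


rng = random.Random(12345)
print(make_sop_pair([["only", "segment"]], rng))          # None: no pair possible
print(make_sop_pair([["first", "part"], ["second"]], rng))
```

With this kind of labeling, sentence_order_label = 1 always corresponds to two consecutive segments in swapped order, never to a segment drawn from another document.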

Am I misunderstanding something here?
If not, does the released code differ from the version used in the paper?

Or is this the intended way to handle single-chunk documents?

Thanks.

@joongbo joongbo changed the title Why a document with single paragraph has a random next sentence in create_pretraining_data.py [Implementation] Sentence order prediction (SOP) label for a single-chunk-document in create_pretraining_data.py Oct 28, 2020