[Implementation] Sentence order prediction (SOP) label for a single-chunk-document in create_pretraining_data.py #234

joongbo · 2020-10-26T16:54:01Z

Thanks for the great work.

I have a question about a gap between what the paper reports and what the released code does for the sentence order prediction (SOP) task. As far as I can tell, the SOP code still contains an NSP-style case.

Section 3.1 of the ALBERT paper says that a model trained with SOP can solve NSP (next sentence prediction) to a reasonable degree (see Table 5, Section 4.6). However, whereas the paper says SOP uses only consecutive sentences, the released code contains a random document selection procedure.

I think the problem is how sentence_order_label is set in create_pretraining_data.py for a document with a single chunk. In lines 315-317, the code randomly selects another document to handle len(current_chunk) == 1 and sets is_random_next = True (which means sentence_order_label = 1). This label does not mark a truly reversed order of consecutive sentences (as in SOP); it is effectively an NSP label.
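To make the concern concrete, here is a minimal sketch of the labeling behavior as described above. It is not a copy of create_pretraining_data.py: the function name make_sop_pair, the other_document argument, and the returned source tag are hypothetical and exist only for illustration, and the 50/50 swap for multi-chunk documents is an assumption about how a plain SOP negative would be sampled.

```python
import random


def make_sop_pair(current_chunk, other_document, rng):
    """Build one (tokens_a, tokens_b, label, source) example.

    Hypothetical helper for illustration only. Label 1 plays the role of
    sentence_order_label = 1 in the text above; `source` is an extra tag
    that records how the pair was actually built.
    """
    if len(current_chunk) == 1:
        # Single-chunk case: nothing in this document can follow segment A,
        # so segment B is drawn from a different document. This is an
        # NSP-style negative, yet it receives the same label as a swap.
        tokens_a = list(current_chunk[0])
        tokens_b = list(rng.choice(other_document))
        return tokens_a, tokens_b, 1, "random_document"

    # Split the chunk into two consecutive, non-empty parts.
    a_end = rng.randint(1, len(current_chunk) - 1)
    tokens_a = [tok for seg in current_chunk[:a_end] for tok in seg]
    tokens_b = [tok for seg in current_chunk[a_end:] for tok in seg]

    if rng.random() < 0.5:
        # True SOP negative: the same two consecutive parts, order swapped.
        return tokens_b, tokens_a, 1, "swapped"
    return tokens_a, tokens_b, 0, "in_order"


rng = random.Random(0)
other_doc = [["x", "y"], ["z"]]
single = make_sop_pair([["a", "b"]], other_doc, rng)
multi = make_sop_pair([["a"], ["b"], ["c"]], other_doc, rng)
print(single[2:], multi[2:])
```

In this sketch, label 1 covers two different situations ("random_document" and "swapped"), which is the mismatch with the paper's description that the question is about.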

Am I misunderstanding something?
If not, does the released code differ from the version used for the paper?

Or is this the best practice for handling a single-chunk document?

Thanks.

joongbo changed the title ~~Why a document with single paragraph has a random next sentence in create_pretraining_data.py~~ [Implementation] Sentence order prediction (SOP) label for a single-chunk-document in create_pretraining_data.py Oct 28, 2020
