Hi. I am sorry for the very long delay. This took me much longer than I had planned. I hope it can still be useful.
If the following information is NOT present in the issue, please populate:
Checkbox
- Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- Confirm dataloader script works with `datasets.load_dataset` function.
- Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.

The tests returned 2 failures:
1) IDs globally unique: coreferences (tlinks) in the original dataset use two formats for IDs.
2) Check passage offsets: offsets sometimes seem to be incorrect in the original XML files.

The tests also returned one error, "Check multi-label type", but I am not sure how to interpret it.

======================================================================
ERROR: runTest (__main__.TestDataLoader) [Check multi-label type]
Run all tests that check:
Traceback (most recent call last):
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 145, in runTest
self.test_multilabel_type(dataset_bigbio)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 636, in test_multilabel_type
match = re.search(_CONNECTORS, feature_type)
File "C:\Users\franc\miniconda3\envs\BigScience\lib\re.py", line 201, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
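The `TypeError` above is what `re.search` raises when its second argument is not a string or bytes object, which suggests that a `type` value reaching `test_multilabel_type` is something else, for example `None` or a list. The sketch below reproduces that behaviour with a clearer message. The `_CONNECTORS` pattern and the `find_connector` helper are assumptions made for illustration, not the actual code in `tests/test_bigbio.py`.

```python
import re

# Assumed pattern for illustration; the real _CONNECTORS in test_bigbio.py may differ.
_CONNECTORS = re.compile(r"[+,|;]")


def find_connector(feature_type):
    """Return the first connector character in feature_type, or None.

    Hypothetical helper: raises a descriptive error instead of the bare
    TypeError that re.search gives for non-string input.
    """
    if not isinstance(feature_type, str):
        raise TypeError(
            f"'type' must be a string, got {type(feature_type).__name__}: {feature_type!r}"
        )
    match = re.search(_CONNECTORS, feature_type)
    return match.group(0) if match else None


print(find_connector("OVERLAP"))       # a plain label without connectors
print(find_connector("BEFORE|AFTER"))  # a label that contains a connector
try:
    find_connector(None)               # the kind of input that triggers the error
except TypeError as err:
    print(err)
```

If this is indeed the cause, printing the offending value in the dataloader would show which schema field is emitting a non-string `type`.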
======================================================================
FAIL: runTest (__main__.TestDataLoader) [IDs globally unique]
Run all tests that check:
Traceback (most recent call last):
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 117, in runTest
self.test_are_ids_globally_unique(dataset_bigbio)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 277, in test_are_ids_globally_unique
self._assert_ids_globally_unique(example, ids_seen=ids_seen)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 258, in _assert_ids_globally_unique
self._assert_ids_globally_unique(elem, ids_seen)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 262, in _assert_ids_globally_unique
self.assertNotIn(v, ids_seen)
AssertionError: 'Sectime0' unexpectedly found in {'TL74', 'T2', 'TL57', 'E60', 'E21', 'TL93', 'TL55', 'E54', 'E25', 'Sectime28', 'T14', 'T11', 'E7', 'E26', 'TL9', 'TL73', 'E57', 'T3', 'Sectime12', 'Sectime18', 'TL39', 'Sectime0', 'T13', 'Sectime25', 'TL17', 'TL5', 'TL11', 'E43', 'TL1', 'Sectime8', 'TL69', 'TL40', 'E15', 'E8', 'TL42', 'E24', 'TL58', 'TL88', 'S0', 'E59', 'TL33', 'TL36', 'Sectime11', 'Sectime22', 'E68', 'TL67', 'T12', 'E71', 'E76', 'TL51', 'S1', 'E73', 'E17', 'E34', '1', 'TL64', 'TL7', 'T1', 'Sectime1', 'Sectime3', 'E40', 'E39', 'TL24', 'E3', 'Sectime23', 'Sectime26', 'TL63', 'E62', 'TL78', 'E30', 'E41', 'Sectime2', 'E28', 'TL75', 'Sectime15', 'Sectime29', 'E58', 'E11', '1-full-passage', 'Sectime4', 'Sectime20', 'TL62', 'TL3', 'TL90', 'E36', 'TL53', 'TL31', 'Sectime7', 'T7', 'E2', 'T6', 'E29', 'TL70', 'Sectime14', 'TL29', 'TL23', 'TL14', 'E48', 'TL56', 'TL10', 'TL68', 'E49', 'TL34', 'TL43', 'Sectime6', 'E14', 'Sectime13', 'E45', 'T0', 'E0', 'E53', 'TL71', 'TL91', 'Sectime16', 'TL2', 'TL44', 'E75', 'TL18', 'TL60', 'Sectime17', 'E23', 'E67', 'E55', 'E31', 'TL22', 'TL95', 'TL13', 'Sectime21', 'TL46', 'TL12', 'E69', 'TL27', 'E51', 'E32', 'TL48', 'E44', 'TL35', 'TL89', 'T9', 'T8', 'TL79', 'E72', 'E66', 'TL47', 'TL59', 'E16', 'E22', 'TL0', 'TL15', 'E4', 'E12', 'TL50', 'Sectime19', 'E37', 'E74', 'TL41', 'TL30', 'Sectime27', 'E27', 'TL20', 'TL49', 'TL83', 'E13', 'TL52', 'T4', 'TL86', 'TL37', 'Sectime5', 'E5', 'E20', 'TL19', 'E46', 'TL45', 'TL8', 'TL54', 'E33', 'TL66', 'TL77', 'E52', 'TL72', 'E61', 'TL85', 'TL82', 'E63', 'TL21', 'E35', 'TL4', 'TL38', 'E6', 'TL76', 'TL92', 'E42', 'Sectime24', 'T5', 'E9', 'E47', 'E56', 'Sectime9', 'T10', 'E1', 'TL26', 'E65', 'E38', 'E18', 'TL28', 'E64', 'TL6', 'TL61', 'TL80', 'TL87', 'Sectime10', 'E19', 'TL84', 'E70', 'E10', 'E50'}
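The failure above reports that an ID from the source annotations (`'Sectime0'`) occurs more than once within a single example. One possible workaround, sketched here under assumptions, is to derive the bigbio IDs from the document ID plus the source ID, and to add a counter when a source ID repeats. The `make_unique_ids` helper is hypothetical and is not part of this PR or of the BigBio test suite.

```python
from collections import Counter


def make_unique_ids(doc_id, raw_ids):
    """Map source IDs to IDs that are unique within and across documents.

    Hypothetical helper: prefixes each ID with the document ID and appends
    a running counter when the same source ID occurs more than once.
    """
    seen = Counter()
    unique_ids = []
    for raw_id in raw_ids:
        seen[raw_id] += 1
        new_id = f"{doc_id}-{raw_id}"
        if seen[raw_id] > 1:
            # Second and later occurrences get a numeric suffix.
            new_id = f"{new_id}-{seen[raw_id]}"
        unique_ids.append(new_id)
    return unique_ids


ids = make_unique_ids("1", ["TL0", "Sectime0", "Sectime0"])
print(ids)
```

Any field that refers to these IDs (for example relation arguments) would need to go through the same mapping, and the suffix scheme assumes no source ID already ends in the same `-N` form.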
======================================================================
FAIL: runTest (__main__.TestDataLoader) [Check passage offsets]
Run all tests that check:
Traceback (most recent call last):
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 136, in runTest
self.test_passages_offsets(dataset_bigbio)
File "C:\Users\franc\Documents\GitHub\biomedical\tests\test_bigbio.py", line 382, in test_passages_offsets
self.assertEqual(example_text[start:end], text[idx], msg)
AssertionError: '9/29/1993\n' != '09/29/1993'
? -
: Split:train - Example:1 - text:
9/29/1993
!= text_by_offset:09/29/1993
Ran 1 test in 55.097s
FAILED (failures=2, errors=1)
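For the passage offset failure, the mismatch `'9/29/1993\n' != '09/29/1993'` looks like an offset that is shifted by one character in the source file. A minimal sketch of one way to handle this in a dataloader, assuming the passage text itself is correct, is to search for the text near the declared start and use the position where it is actually found. The `realign_offsets` helper, the `window` parameter, and the sample document are made up for illustration.

```python
def realign_offsets(document, text, start, window=5):
    """Return (start, end) such that document[start:end] == text.

    Hypothetical helper: tries the declared start first, then searches
    within `window` characters on either side. Returns None if not found.
    """
    end = start + len(text)
    if document[start:end] == text:
        return start, end
    search_from = max(0, start - window)
    search_to = start + window + len(text)
    found = document.find(text, search_from, search_to)
    if found == -1:
        return None
    return found, found + len(text)


# Made-up sample text; the declared start is one character too far right.
doc = "Admission Date :\n09/29/1993\nDischarge Date :\n"
declared_start = doc.index("9/29/1993")
print(realign_offsets(doc, "09/29/1993", declared_start))
```

Returning `None` rather than guessing keeps genuinely inconsistent annotations visible, so they can be logged or skipped explicitly.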