Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse: crash on invalid dates "0", "99999" instead of handling gracefully as intended #1669

Open
corneliusroemer opened this issue Nov 10, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@corneliusroemer
Copy link
Member

corneliusroemer commented Nov 10, 2024

Description

When parsing FASTA headers containing "0" or "99999" as a date value, augur parse fails with an unhelpful error message instead of handling the invalid date more gracefully.

Steps to Reproduce

  1. Create a test FASTA file:
echo -e ">someid|0\nACTG" > test.fa
  1. Run augur parse:
augur parse --sequences test.fa --output-sequences out.fa --output-metadata out.tsv --fields id date --fix-dates dayfirst

Current Behavior

The command fails with error:

ERROR: Invalid date '0-XX-XX': Date does not match format `%Y-%m-%d`.

Expected Behavior

--fix-dates is documented as an attempt to parse non-standard dates that "should be spot-checked". The documented behavior of fix_dates is to return the input string if fixing fails and emit a warning. So it should never crash.

Stacktrace

The stack trace is actually quite interesting, showing how convoluted the current implementation is:

  1. Tries pandas newer method: parse_datetime_string_with_reso() ❌ (AttributeError)
  2. Falls back to pandas older method: parse_time_string() ❌ (DateParseError)
  3. Tries custom parsing with get_numerical_date_from_value() ❌ (InvalidDate)
  4. Finally raises AugurError

augur/augur/parse.py

Lines 28 to 65 in 8731851

def fix_dates(d, dayfirst=True):
'''
attempt to parse a date string using pandas date parser. If ambiguous,
the argument 'dayfirst' determines whether month or day is assumed to be
the first field. Incomplete dates will be padded with XX.
On failure to parse the date, the function will return the input.
'''
try:
from pandas.core.tools.datetimes import parsing
try:
# pandas 2.x
results = parsing.parse_datetime_string_with_reso(d, dayfirst=dayfirst)
except AttributeError:
# pandas 1.x
results = parsing.parse_time_string(d, dayfirst=dayfirst)
if len(results) == 2:
dto, res = results
else:
dto, _, res = results
if res == 'year':
return "%d-XX-XX"%dto.year
elif res == 'month':
return "%d-%02d-XX"%(dto.year, dto.month)
else:
return "%d-%02d-%02d"%(dto.year, dto.month, dto.day)
except ValueError as e:
# If the date can't be parsed by pandas above or as our own ambiguous
# date format (e.g., "2020-XX-XX"), let the user know.
try:
parsed_date = get_numerical_date_from_value(d, "%Y-%m-%d")
except ValueError:
parsed_date = None
if parsed_date is None:
print("WARNING: unable to parse %s as date"%d, e, file=sys.stderr)
return d

Traceback (most recent call last):
  File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 41, in fix_dates
    results = parsing.parse_datetime_string_with_reso(d, dayfirst=dayfirst)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'pandas._libs.tslibs.parsing' has no attribute 'parse_datetime_string_with_reso'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "pandas/_libs/tslibs/parsing.pyx", line 440, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso
  File "pandas/_libs/tslibs/parsing.pyx", line 667, in pandas._libs.tslibs.parsing.dateutil_parse
ValueError: day is out of range for month
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 44, in fix_dates
    results = parsing.parse_time_string(d, dayfirst=dayfirst)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/tslibs/parsing.pyx", line 367, in pandas._libs.tslibs.parsing.parse_time_string
  File "pandas/_libs/tslibs/parsing.pyx", line 445, in pandas._libs.tslibs.parsing.parse_datetime_string_with_reso
pandas._libs.tslibs.parsing.DateParseError: day is out of range for month
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/corneliusromer/code/augur_master/augur/dates/init.py", line 120, in get_numerical_date_from_value
    ambig_date = AmbiguousDate(value, fmt=fmt).range(min_max_year=min_max_year)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 48, in init
    self.assert_only_less_significant_uncertainty()
  File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 138, in assert_only_less_significant_uncertainty
    if "X" in self.uncertain_date_components["Y"]:
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/augur_master/augur/dates/ambiguous_date.py", line 97, in uncertain_date_components
    raise InvalidDate(self.uncertain_date,
augur.dates.errors.InvalidDate: Invalid date '0-XX-XX': Date does not match format %Y-%m-%d.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/corneliusromer/code/augur_master/blackbox.py", line 94, in <module>
    run_probes()
  File "/Users/corneliusromer/code/augur_master/blackbox.py", line 80, in run_probes
    result, warnings = probe_date_parser(test_input, dayfirst)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/augur_master/blackbox.py", line 20, in probe_date_parser
    result = fix_dates(test_input, dayfirst)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/augur_master/augur/parse.py", line 60, in fix_dates
    parsed_date = get_numerical_date_from_value(d, "%Y-%m-%d")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corneliusromer/code/augur_master/augur/dates/init.py", line 122, in get_numerical_date_from_value
    raise AugurError(str(error)) from error
augur.errors.AugurError: Invalid date '0-XX-XX': Date does not match format %Y-%m-%d.

Environment

current master of Augur (26.0.0)

Additional Notes

The documentation states that date parsing is a "brittle process that should be spot-checked", suggesting the tool should be more resilient to invalid date formats rather than failing completely.

I discovered this while trying to migrate fix_dates to be compatible with pandas v2. Given the complexity of the current implementation, I thought I'll start treating it as a blackbox, discover current behavior and then reimplement it with the help of unit tests in a future proof way.

It turns out that there are unhandled edge cases in the current implementation.

@corneliusroemer corneliusroemer added the bug Something isn't working label Nov 10, 2024
@corneliusroemer corneliusroemer changed the title parse: crash on invalid dates "0", "99999" instead of handling gracefully as promised in docstring parse: crash on invalid dates "0", "99999" instead of handling gracefully as intended Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant