Added fallback time extraction engine #135
facebook_scraper/extractors.py
if time_match:
    time = time_match.group(0)
    return {
        'time': dateparser.parse(time)
I'm not sure I understood the problem. Did you try to scrape a post with the date 'October 13 at 9:44 PM', or did you send the string 'at 13' to dateparser?
I would love to see the original post if so.
The thing is that the matched group was ASFOROctober 13 at 8:44 PM, which includes part of the page name or title, and that's how (\d{1,2} \w+) ended up finding 13 at.
I came up with this ugly regex fix:
time_regex = re.compile(r"((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|(\d{1,2} \w+)")
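To make the failure concrete, here is a minimal sketch using only the standard `re` module. It runs the loose `\d{1,2} \w+` alternative and the month-anchored regex above against the problematic string; the variable names (`loose`, `fixed`, `months`) are only for illustration and are not taken from the PR.

```python
import re

# The string reported above: page-name residue glued to the date.
text = "ASFOROctober 13 at 8:44 PM"

# The loose alternative on its own latches onto the day and the word "at".
loose = re.compile(r"\d{1,2} \w+")

# The month-anchored regex from the comment above, assembled from parts
# only to keep the line short (same pattern, same groups).
months = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?"
)
fixed = re.compile(
    r"((" + months + r") \d{1,2} at \d{1,2}:\d{2} (AM|PM))|(\d{1,2} \w+)"
)

loose_match = loose.search(text).group(0)   # "13 at"
fixed_match = fixed.search(text).group(0)   # "October 13 at 8:44 PM"
print(loose_match, "|", fixed_match)
```

Because the month names are matched case-sensitively, the uppercase residue `ASFOR` is skipped and the match starts at `October`.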
I also added a regex for matching dates like Yesterday at 12:30 PM:
((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|Yesterday at \d{1,2}:\d{2} (AM|PM)|(\d{1,2} \w+)
Thanks balazssandor. There is also Today, so it will be this:
((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|Today at \d{1,2}:\d{2} (AM|PM)|Yesterday at \d{1,2}:\d{2} (AM|PM)|(\d{1,2} \w+)
@evoup @balazssandor Thanks for the help! I added the new regex to
Dateparser is not thread-safe, so I don't think this is a really good implementation. This issue is still open on GitHub. I got the same errors with your approach.
Thanks for the issue, I will implement custom date parsing with the desired formats.
What errors did you get with 'my way'? Thread-safety errors?
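As a rough idea of what custom parsing could look like without dateparser, here is a sketch built only on the standard library. The function name `parse_post_time`, the explicit `now` argument, and the choice to assume the current year for dates without one are all assumptions made for illustration; this is not the code that ended up in the project.

```python
import re
from datetime import datetime, timedelta


def parse_post_time(text, now):
    """Parse the three formats mentioned in this thread; return None otherwise."""
    text = text.strip()

    # "Today at 9:05 AM" / "Yesterday at 12:30 PM"
    m = re.fullmatch(r"(Today|Yesterday) at (\d{1,2}:\d{2} (?:AM|PM))", text)
    if m:
        days_back = 1 if m.group(1) == "Yesterday" else 0
        clock = datetime.strptime(m.group(2), "%I:%M %p").time()
        return datetime.combine(now.date() - timedelta(days=days_back), clock)

    # "16 hrs" / "16h" (hours is the only relative unit reported so far)
    m = re.fullmatch(r"(\d{1,2}) ?(?:h|hrs)", text)
    if m:
        return now - timedelta(hours=int(m.group(1)))

    # "October 13 at 8:44 PM": no year in the text, so assume the current one.
    try:
        return datetime.strptime(f"{now.year} {text}", "%Y %B %d at %I:%M %p")
    except ValueError:
        return None


now = datetime(2020, 11, 18, 15, 0)
print(parse_post_time("October 13 at 8:44 PM", now))
print(parse_post_time("Yesterday at 12:30 PM", now))
print(parse_post_time("16 hrs", now))
```

Note that `%B` and `%p` depend on the active locale, so this only handles English month names and AM/PM under the default C locale.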
I ran into the problem described in the referenced issue ticket.
Thank you. I tried it and it worked. Thanks a lot once again. The original author should consider fixing this in his code.
There are multiple different date formats we need to parse, since as @valterartur stated, using
I think this is alright.
Co-authored-by: Kevin Zúñiga <[email protected]>
# Conflicts: # facebook_scraper/extractors.py
I agree, this change is a fallback solution and the benefits are bigger than the downsides (the thread safety bug). Let's merge!
Hi, can we please merge?
facebook_scraper/utils.py
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"
exact_time = f"({date}) at {hour}:{minute} ({period})"
Again, I think it shouldn't capture the date or period.
Suggested change:
- exact_time = f"({date}) at {hour}:{minute} ({period})"
+ exact_time = f"(?:{date}) at {hour}:{minute} (?:{period})"
I'm not sure I understand. Why shouldn't the whole date and AM/PM be captured?
Because those groups are not being used.
In parse_datetime, only group 0 is retrieved, which is the whole matched string.
It would make sense to capture every element of the date separately if we were parsing the date ourselves, but we are just passing the whole string to dateparser.
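A small sketch of the point above, using only `re`. The `date` pattern here is a simplified stand-in (the real definition in utils.py is not shown in this thread), so treat it as an assumption; `hour`, `minute` and `period` are copied from the diff.

```python
import re

date = r"[A-Z][a-z]+ \d{1,2}"  # assumed stand-in for the real `date` pattern
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"

# Parentheses are still required around `date` and `period`, otherwise the
# `|` inside `period` would split the whole pattern in two. The only question
# is whether they capture.
capturing = re.compile(f"({date}) at {hour}:{minute} ({period})")
non_capturing = re.compile(f"(?:{date}) at {hour}:{minute} (?:{period})")

text = "October 13 at 8:44 PM"
m1 = capturing.search(text)
m2 = non_capturing.search(text)

print(m1.group(0), m1.groups())  # same whole match, two unused sub-groups
print(m2.group(0), m2.groups())  # same whole match, no sub-groups
```

Since only `group(0)` is read, both versions hand the same string to the date parser; the non-capturing form just avoids storing groups nobody uses.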
I see. Thanks!
facebook_scraper/utils.py
minute = r"\d{2}"
period = r"AM|PM"
exact_time = f"({date}) at {hour}:{minute} ({period})"
relative_time = r"\d{1,2} \w+"
This seems a bit vague, since it is being matched against the text of the whole post and not just a date.
For example, with the previous date April 3, 2018 at 8:02 PM, it was matching 18 at and returning 2020-11-18 00:00.
What exactly is this trying to match?
According to a previous comment it could be 16 hrs or 16h, so maybe:
- relative_time = r"\d{1,2} \w+"
+ relative_time = r"\b\d{1,2}(?:h| hrs)"
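For comparison, a quick check of both `relative_time` patterns with the standard `re` module. The names `loose` and `strict` are just labels for this sketch.

```python
import re

loose = re.compile(r"\d{1,2} \w+")            # current relative_time
strict = re.compile(r"\b\d{1,2}(?:h| hrs)")   # suggested relative_time

full_date = "April 3, 2018 at 8:02 PM"

# The loose pattern picks the tail of the year plus the word "at".
loose_hit = loose.search(full_date).group(0)

# The strict pattern finds nothing in a full date...
strict_hit = strict.search(full_date)

# ...but still matches both relative forms mentioned so far.
hrs_hit = strict.search("16 hrs").group(0)
h_hit = strict.search("16h").group(0)

print(repr(loose_hit), strict_hit, repr(hrs_hit), repr(h_hit))
```

The `\b` keeps the pattern from starting in the middle of a longer number such as `2018`. It does not cover a spelled-out form like `16 hours`, which would need another alternative if it ever shows up.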
The only date I found is in the entire post text, and I only saw three kinds: dates with an explicit month, day of month, and time; dates with 'Today' or 'Yesterday' and a time; and purely relative times (16 hours ago; hours is the only unit I found).
"The only date I found is in the entire post text"
I guess this is a reply to my suggestion of using self.element.find('abbr', first=True). If there are cases where the date is not inside that tag, then it could fall back to looking in the entire text.
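Roughly what I have in mind, as a sketch. `find_time_text` is a hypothetical helper name, and `StubElement` is a throwaway stand-in so the snippet runs without requests_html; it only imitates the two things used here, `.text` and `.find(selector, first=True)` returning `None` when nothing matches.

```python
import re


class StubElement:
    """Hypothetical stand-in for a requests_html element (illustration only)."""

    def __init__(self, text, abbr_text=None):
        self.text = text
        self._abbr_text = abbr_text

    def find(self, selector, first=False):
        found = []
        if selector == 'abbr' and self._abbr_text is not None:
            found = [StubElement(self._abbr_text)]
        if first:
            return found[0] if found else None
        return found


def find_time_text(element, time_regex):
    """Look for a time string in <abbr> first, then in the whole post text."""
    abbr = element.find('abbr', first=True)
    candidates = [abbr.text] if abbr is not None else []
    candidates.append(element.text)  # fallback: entire post text
    for text in candidates:
        match = time_regex.search(text)
        if match:
            return match.group(0)
    return None


relative_time = re.compile(r"\b\d{1,2}(?:h| hrs)")

with_abbr = StubElement("Open 24 hrs a day! 16 hrs", abbr_text="16 hrs")
without_abbr = StubElement("Page name 9 hrs Post body")

print(find_time_text(with_abbr, relative_time))     # taken from the abbr tag
print(find_time_text(without_abbr, relative_time))  # taken from the full text
```

Checking the tag first means text in the post body that merely looks like a time (the `24 hrs` above) is only considered when no dedicated element is available.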
Solves #134, where for some reason the time is not always extracted correctly. I have therefore added a fallback way of extracting the time (used if the current one fails), which tries to extract it from the element's text.
As of right now, it works with two different types of dates:
Example 1:
October 28 11:58 AM
Example 2:
9 hrs
(9 hours ago; the example was only tested with hours)
The engine uses the following regex:
(\w+ \d{2} at \d{2}:\d{2} (AM|PM))|(\d{1,2} \w+)
with a segment for example 1 (\w+ \d{2} at \d{2}:\d{2} (AM|PM)) and one for example 2 ((\d{1,2} \w+)).
In order to easily parse dates, I have added a dependency on the package dateparser (added to requirements.txt and requirements-dev.txt).
Even though not all examples of dates have been tested yet, the regex will probably be able to extract other, still undiscovered date templates and let the date parser parse them automatically.
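For reference, a minimal sketch of how this first version of the regex behaves, using only `re`. The helper name `extract_time_text` is made up for this example; in the PR the matched string is then passed to `dateparser.parse`, which is left out here so the snippet has no third-party dependency.

```python
import re

# The regex from the description above, unchanged.
time_regex = re.compile(r"(\w+ \d{2} at \d{2}:\d{2} (AM|PM))|(\d{1,2} \w+)")


def extract_time_text(post_text):
    """Return the first substring that looks like a time, or None."""
    match = time_regex.search(post_text)
    return match.group(0) if match else None


two_digit = extract_time_text("October 28 at 11:58 AM")
relative = extract_time_text("9 hrs")
# \d{2} requires a two-digit day and hour, so this one falls through to the
# second alternative and only "3 at" is picked up.
one_digit = extract_time_text("October 3 at 8:02 PM")

print(repr(two_digit), repr(relative), repr(one_digit))
```

The last case is the kind of partial match that motivated tightening both alternatives during review.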