Added fallback time extraction engine #135

TheMulti0 · 2020-11-05T05:49:54Z

Solves #134, where for some reason the time is not always extracted correctly. I have added a fallback that, if the current method fails, tries to extract the time from the element's text.

As of right now, it works with two different types of dates:
Example 1: October 28 at 11:58 AM
Example 2: 9 hrs (9 hours ago; this kind of example was only tested with hours)

The engine uses the following regex:
(\w+ \d{2} at \d{2}:\d{2} (AM|PM))|(\d{1,2} \w+)
with one segment for example 1, (\w+ \d{2} at \d{2}:\d{2} (AM|PM)), and one for example 2, (\d{1,2} \w+).
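
As a rough illustration of how this pattern behaves, the sketch below runs it over a piece of text with `re.search` and returns the matched substring. The helper name `find_time_text` and the sample strings are made up for the example; they are not taken from the PR's code.

```python
import re

# The regex quoted above: an exact "Month DD at HH:MM AM/PM" date,
# or a looser "<number> <word>" relative time such as "9 hrs".
time_regex = re.compile(r"(\w+ \d{2} at \d{2}:\d{2} (AM|PM))|(\d{1,2} \w+)")


def find_time_text(text):
    """Return the first substring that looks like a time, or None (hypothetical helper)."""
    match = time_regex.search(text)
    return match.group(0) if match else None


print(find_time_text("Some Page October 28 at 11:58 AM Hello"))
print(find_time_text("Some Page 9 hrs Hello"))
```

Because the second alternative accepts any one or two digits followed by any word, it can also pick up text that is not a time at all.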

In order to parse dates easily, I have added a dependency on the dateparser package (added to requirements.txt and requirements-dev.txt).

Even though not all date formats have been tested yet, the regex will probably be able to extract the other, still undiscovered date templates and let dateparser parse them automatically.

balazssandor

It doesn't work in some cases:

-> 'time': dateparser.parse(time)
(Pdb) time
'13 at'
(Pdb) dateparser.parse(time)
datetime.datetime(2020, 11, 13, 0, 0)

dateparser is not able to recover the intended date from the string 13 at.

This is how the date appeared in the post where it failed:

balazssandor · 2020-11-05T11:28:56Z

facebook_scraper/extractors.py

+            if time_match:
+                time = time_match.group(0)
+                return {
+                    'time': dateparser.parse(time)


#135 (review)

TheMulti0 · Nov 5, 2020

I'm not sure I understood the problem. Did you try to scrape a post with the date 'October 13 at 9:44 PM', or did you send the string '13 at' to dateparser?

If so, I would love to see the original post.

balazssandor · Nov 5, 2020

The thing is that the matched text was ASFOROctober 13 at 8:44 PM, which includes part of the page name or title, and that's how (\d{1,2} \w+) ended up finding 13 at.
I came up with this ugly regex fix:
time_regex = re.compile(r"((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|(\d{1,2} \w+)")
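
To make the difference concrete, here is a small sketch that runs both the original pattern and the month-anchored one over the sample text from this comment. The variable names `old_regex` and `new_regex` are only for this illustration. In this sample the first alternative of the old pattern cannot match because it requires a two-digit hour, so the loose second alternative wins.

```python
import re

# Original pattern from the PR description.
old_regex = re.compile(r"(\w+ \d{2} at \d{2}:\d{2} (AM|PM))|(\d{1,2} \w+)")

# Month-anchored pattern from the comment above, split across lines for readability.
new_regex = re.compile(
    r"((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
    r" \d{1,2} at \d{1,2}:\d{2} (AM|PM))|(\d{1,2} \w+)"
)

text = "ASFOROctober 13 at 8:44 PM"

old_match = old_regex.search(text).group(0)
new_match = new_regex.search(text).group(0)

print(repr(old_match))  # '13 at'
print(repr(new_match))  # 'October 13 at 8:44 PM'
```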

balazssandor · Nov 5, 2020

I also added a pattern for matching strings like Yesterday at 12:30 PM:
((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|Yesterday at \d{1,2}:\d{2} (AM|PM)|(\d{1,2} \w+)

evoup · Nov 6, 2020

Thanks balazssandor. There is also Today, so it will be this:
((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|Today at \d{1,2}:\d{2} (AM|PM)|Yesterday at \d{1,2}:\d{2} (AM|PM)|(\d{1,2} \w+)

TheMulti0 · 2020-11-06T09:39:08Z

@evoup @balazssandor Thanks for the help!

I added the new regex to utils.py, in the latest commit (also extracted it a little bit to make it more readable), mind checking it out?
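
For readers who have not opened the commit, the following is one way such a pattern could be assembled from smaller named pieces. It is only a sketch: the names `hour`, `minute`, `period`, `exact_time`, `relative_time`, and `date` appear in the diff discussed later in this thread, while `month`, `day_of_month`, and `datetime_regex` are assumptions made for this example and may not match the actual utils.py.

```python
import re

month = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
day_of_month = r"\d{1,2}"
# Either an explicit month and day, or one of the two relative day names.
date = f"{month} {day_of_month}|Today|Yesterday"
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"
exact_time = f"({date}) at {hour}:{minute} ({period})"
relative_time = r"\d{1,2} \w+"

datetime_regex = re.compile(f"({exact_time})|({relative_time})")

for sample in ["Page Yesterday at 12:30 PM text",
               "Today at 1:05 AM",
               "Oct 16 at 11:00 PM"]:
    print(datetime_regex.search(sample).group(0))
```

Splitting the pattern this way keeps each alternative short enough to review on its own, which is the readability gain the comment above refers to.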

valterartur · 2020-11-06T14:42:18Z

dateparser is not thread-safe, so I don't think this is a really good implementation. This issue is still open on GitHub. I have got the same errors with your way.

TheMulti0 · 2020-11-06T15:19:54Z

Thanks for the issue, I will implement custom date parsing with the desired formats.

I have got the same errors with your way.

What errors did you get with 'my way'? Thread safe errors?

valterartur · 2020-11-06T15:35:45Z

Thanks for the issue, I will implement custom date parsing with the desired formats.

I have got the same errors with your way.

What errors did you get with 'my way'? Thread safe errors?

I ran into the problem described in the referenced issue ticket.

baraths92 · 2020-11-06T20:44:01Z

@evoup @balazssandor Thanks for the help!

I added the new regex to utils.py, in the latest commit (also extracted it a little bit to make it more readable), mind checking it out?

Thank you. I tried it and it worked. Thanks a lot once again. The original author should consider fixing this in his code.

TheMulti0 · 2020-11-07T10:05:18Z

Since, as @valterartur stated, using dateparser is not trivial, there are multiple different date formats we would need to parse ourselves:

Oct 16 at 11:00 PM
October 16 at 11:00 PM
Yesterday at 11:00 PM
Today at 11:00 PM
16 hrs (16 hours ago)
16h (16 hours ago)
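
A minimal sketch of what custom parsing for the formats listed above might look like, using only the standard library. Everything here is hypothetical (the function name `parse_post_time`, the `now` parameter, and the choice to assume the current year when none is given); it is not the code that was merged.

```python
import re
from datetime import datetime, timedelta

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# "Oct 16 at 11:00 PM", "October 16 at 11:00 PM",
# "Yesterday at 11:00 PM", "Today at 11:00 PM"
EXACT = re.compile(
    r"^(?P<date>Today|Yesterday|[A-Za-z]+ \d{1,2}) at "
    r"(?P<hour>\d{1,2}):(?P<minute>\d{2}) (?P<period>AM|PM)$"
)
# "16 hrs", "16h"
RELATIVE = re.compile(r"^(?P<hours>\d{1,2})(?:h| hrs)$")


def parse_post_time(text, now):
    """Parse one of the listed formats relative to `now`; return None if unknown."""
    text = text.strip()

    relative = RELATIVE.match(text)
    if relative:
        return now - timedelta(hours=int(relative.group("hours")))

    exact = EXACT.match(text)
    if not exact:
        return None

    # Convert the 12-hour clock to 24-hour: 12 AM -> 0, 12 PM -> 12.
    hour = int(exact.group("hour")) % 12
    if exact.group("period") == "PM":
        hour += 12
    minute = int(exact.group("minute"))

    date = exact.group("date")
    if date == "Today":
        year, month, day = now.year, now.month, now.day
    elif date == "Yesterday":
        yesterday = now - timedelta(days=1)
        year, month, day = yesterday.year, yesterday.month, yesterday.day
    else:
        month_name, day_text = date.split(" ")
        # Only the first three letters of the month name are checked.
        month = MONTHS.get(month_name[:3].lower())
        if month is None:
            return None
        # Assumption: the string carries no year, so use the current one.
        year, day = now.year, int(day_text)

    try:
        return datetime(year, month, day, hour, minute)
    except ValueError:  # e.g. an impossible day such as "Feb 30"
        return None


now = datetime(2020, 11, 7, 12, 0)
print(parse_post_time("Oct 16 at 11:00 PM", now))
print(parse_post_time("16 hrs", now))
```

Passing `now` in explicitly keeps the function deterministic and easy to test; a caller would typically pass `datetime.now()`.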

kevinzg · 2020-11-07T18:25:14Z

I think this is alright.
The only issue seems to be that it isn't thread safe, however anyone that needs a thread safe date parsing function could monkeypatch utils.parse_date, so I'm inclined to merge it with the dateparser library unless someone suggests a better option.
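
A sketch of the monkeypatching idea mentioned here. To keep it runnable on its own, `utils` below is a stand-in namespace with a fake `parse_date`, not the real facebook_scraper.utils module, and the lock-based wrapper is just one possible replacement. Whether serializing calls with a lock is enough to avoid the dateparser problem referenced in this thread is an assumption that has not been verified.

```python
import threading
import types
from datetime import datetime


def _fake_parse_date(text):
    # Placeholder parser; the real function would interpret `text`.
    return datetime(2020, 11, 7)


# Stand-in for the module that owns parse_date (hypothetical).
utils = types.SimpleNamespace(parse_date=_fake_parse_date)


def make_thread_safe(parse):
    """Wrap a parse function so only one thread calls it at a time."""
    lock = threading.Lock()

    def locked_parse(text):
        with lock:
            return parse(text)

    return locked_parse


# Monkeypatch: replace the module attribute with the wrapped version.
utils.parse_date = make_thread_safe(utils.parse_date)

print(utils.parse_date("16 hrs"))
```

Note that a patch like this only takes effect for callers that look up `utils.parse_date` at call time; code that imported the function by name before the patch keeps the original.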

TheMulti0 · 2020-11-07T20:11:06Z

I think this is alright.
The only issue seems to be that it isn't thread safe, however anyone that needs a thread safe date parsing function could monkeypatch utils.parse_date, so I'm inclined to merge it with the dateparser library unless someone suggests a better option.

I agree. This change is a fallback solution, and the benefits outweigh the downsides (the thread-safety bug).
Parsing all 5 of the different formats I specified isn't extremely complicated; it can be implemented if there is a need.

Let's merge!

TheMulti0 · 2020-11-08T15:03:46Z

Hi, can we please merge?

kevinzg · 2020-11-08T15:53:18Z

facebook_scraper/utils.py

+hour = r"\d{1,2}"
+minute = r"\d{2}"
+period = r"AM|PM"
+exact_time = f"({date}) at {hour}:{minute} ({period})"


Again, I think it shouldn't capture the date or period.

Suggested change

-exact_time = f"({date}) at {hour}:{minute} ({period})"
+exact_time = f"(?:{date}) at {hour}:{minute} (?:{period})"

TheMulti0 · Nov 9, 2020

I'm not sure I understand. Why, again, shouldn't the whole date and AM/PM be captured?

kevinzg · Nov 9, 2020

Because those groups are not being used.

In parse_datetime only group 0 is retrieved, which is the whole matched string.
It would make sense to capture every element of the date separately if we were parsing the date ourselves, but we are just passing the whole string to dateparser.
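
The point about unused groups can be checked with a short sketch. The `date` pattern below is a simplified stand-in chosen for this example (the real one is not shown in this thread); the other pieces mirror the diff above.

```python
import re

date = r"[A-Za-z]+ \d{1,2}"  # simplified stand-in for the real date pattern
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"

capturing = re.compile(f"({date}) at {hour}:{minute} ({period})")
non_capturing = re.compile(f"(?:{date}) at {hour}:{minute} (?:{period})")

text = "October 16 at 11:00 PM"

# Group 0 (the whole match) is the same either way.
print(capturing.search(text).group(0))
print(non_capturing.search(text).group(0))

# Only the capturing version stores the sub-matches.
print(capturing.search(text).groups())      # ('October 16', 'PM')
print(non_capturing.search(text).groups())  # ()
```

The parentheses are still needed in both versions so that the `|` inside `period` (and inside the real `date`) stays scoped to its own part of the pattern.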

TheMulti0 · Nov 9, 2020

I see. Thanks!

kevinzg · 2020-11-08T15:57:06Z

facebook_scraper/utils.py

+minute = r"\d{2}"
+period = r"AM|PM"
+exact_time = f"({date}) at {hour}:{minute} ({period})"
+relative_time = r"\d{1,2} \w+"


This seems a bit vague, since it is matched against the text of the whole post and not just a date.
For example, with the previous date April 3, 2018 at 8:02 PM, it matched 18 at and returned 2020-11-18 00:00.
What exactly is this trying to match?

According to a previous comment it could be 16 hrs or 16h so maybe:

Suggested change

-relative_time = r"\d{1,2} \w+"
+relative_time = r"\b\d{1,2}(?:h| hrs)"
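
A quick sketch comparing the two patterns on the example from this comment. The names `loose_relative_time`, `strict_relative_time`, and `first_match` are made up for the illustration.

```python
import re

loose_relative_time = re.compile(r"\d{1,2} \w+")
strict_relative_time = re.compile(r"\b\d{1,2}(?:h| hrs)")


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


full_date = "April 3, 2018 at 8:02 PM"
print(first_match(loose_relative_time, full_date))   # '18 at'
print(first_match(strict_relative_time, full_date))  # None

print(first_match(strict_relative_time, "Posted 16 hrs ago"))  # '16 hrs'
print(first_match(strict_relative_time, "16h"))                # '16h'
```

The `\b` keeps the pattern from starting in the middle of a longer number such as 2018, and the explicit `h` or ` hrs` suffix rejects arbitrary words like `at`.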

TheMulti0 · Nov 9, 2020

The only date I found is in the entire post text. I only saw dates with an explicit month, day of month, and time; dates with 'Today' or 'Yesterday' and a time; and relative dates (16 hours ago; hours is the only unit I found).

kevinzg · Nov 9, 2020

The only date I found is in the entire post text

I guess this is a reply to my suggestion of using self.element.find('abbr', first=True). If there are cases where the date is not inside that tag, then it could fall back to looking in the entire text.
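
The fallback order described here could be sketched roughly as follows. `FakeElement` is a hypothetical stand-in that models only `.text` and `find('abbr', first=True)`, so the example runs without the scraper; `extract_time_text` is likewise an invented name and not the function in extractors.py.

```python
import re


class FakeElement:
    """Hypothetical stand-in exposing only `.text` and `.find(selector, first=True)`."""

    def __init__(self, text, abbr_text=None):
        self.text = text
        self._abbr = FakeElement(abbr_text) if abbr_text is not None else None

    def find(self, selector, first=False):
        # Only the `find('abbr', first=True)` call used below is modelled.
        return self._abbr if selector == "abbr" and first else None


def extract_time_text(element, pattern):
    """Search the <abbr> tag first, then fall back to the whole element text."""
    abbr = element.find("abbr", first=True)
    candidates = [abbr.text] if abbr is not None else []
    candidates.append(element.text)
    for text in candidates:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None


relative_time = re.compile(r"\b\d{1,2}(?:h| hrs)")

with_abbr = FakeElement("Page name 16 hrs Some text", abbr_text="16 hrs")
without_abbr = FakeElement("Page name 3 hrs Some text")

print(extract_time_text(with_abbr, relative_time))
print(extract_time_text(without_abbr, relative_time))
```

Trying the narrower tag first reduces the chance of matching unrelated numbers in the post body, while the fallback still covers posts where the tag is missing.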

TheMulti0 added 2 commits November 5, 2020 07:45

Added fallback time extraction engine

15c3c64

Return none without an exception failure in extract_time

b04c5ac

balazssandor reviewed Nov 5, 2020

Integrated more comprehensive datetime regex

72373ef

richardscollin reviewed Nov 7, 2020

requirements.txt

kevinzg reviewed Nov 7, 2020

facebook_scraper/extractors.py

TheMulti0 and others added 5 commits November 7, 2020 21:42

Update facebook_scraper/extractors.py

6b12045

Co-authored-by: Kevin Zúñiga <[email protected]>

Added day digit to month regex

ed936cf

Disable date parsing catch (caught in extract_post)

c67238a

time -> datetime

f866e37

Merge remote-tracking branch 'origin/master'

0a72f3a

# Conflicts: # facebook_scraper/extractors.py

kevinzg reviewed Nov 8, 2020

facebook_scraper/utils.py

kevinzg reviewed Nov 8, 2020

facebook_scraper/extractors.py

TheMulti0 and others added 3 commits November 8, 2020 21:54

Update facebook_scraper/utils.py

2b6e7c1

Co-authored-by: Kevin Zúñiga <[email protected]>

Month capture is optional

a252065

Co-authored-by: Kevin Zúñiga <[email protected]>

Date and period capture are optional

35e6eac

Co-authored-by: Kevin Zúñiga <[email protected]>

kevinzg merged commit d6fab2f into kevinzg:master Nov 8, 2020

kevinzg mentioned this pull request Nov 8, 2020

time is not properly formatted from posts into into posts for groups #67

Closed

kevinzg mentioned this pull request Nov 8, 2020

no date for group posts #73

Closed

TheMulti0 mentioned this pull request Nov 9, 2020

Yesterday/Today are instead of month+dayofmonth and not just instead … #136

Merged
