Could you extend the documentation to include an example of collecting data from a table, please? #30
Comments
Hi mate, thanks for this! Let me know if that helps, or you can show me your exact example and I will tell you what to do with it.
Hey mate, thanks for the quick reply! You've actually caught me posting a comment just as I'm about to head off travelling for a week and a bit, so my replies might be a little slow, sorry! The following is an example of what I'm doing using a combination of scrapling and bs4. The code is in a lazy state since I'm still just stuffing around in a Jupyter notebook before tidying it up, so please accept my apologies for that. I've also broken up the page URL in the first couple of lines just so it's a little less findable in search. The output is essentially the table that's on the above page, loaded into a pandas DataFrame, noting that I've also had to account for colspans.

```python
from scrapling import Fetcher
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# NOTE: URL intentionally broken up with spaces to be less findable (see note above)
page = Fetcher().get('http://www.b om.g ov.au/cgi-bin/wra p_fwo.pl?IDQ 60005.html',
                     stealthy_headers=True, follow_redirects=True)

# Print the type of the page, then convert it to a string for bs4
print(type(page))
page_str = str(page)


def parse_time_day_to_datetime(value, today_aest=None):
    """
    Given a string like "11.49pm Tue", return a datetime in AEST for the *current week*.
    Search backward up to 6 days if the day-of-week doesn't match "today".
    """
    if not value or pd.isnull(value):
        return pd.NaT  # or None

    # If "today_aest" wasn't provided, use now in AEST:
    if today_aest is None:
        tz = ZoneInfo("Australia/Brisbane")
        today_aest = datetime.now(tz=tz)

    # 1) Extract the time portion and the day name.
    #    Typical format: "HH.MM(am|pm) DayName", e.g. "11.49pm Tue", "6.29am Wed"
    match = re.match(r"^(\d{1,2})\.(\d{1,2})(am|pm)\s+(\w+)$", value.strip(), re.IGNORECASE)
    if not match:
        # If it doesn't match our expected pattern, return NaT
        return pd.NaT

    hour_str, minute_str, ampm, day_str = match.groups()
    hour = int(hour_str)
    minute = int(minute_str)
    ampm = ampm.lower()  # 'am' or 'pm'

    # Convert to 24-hour format
    if ampm == 'pm' and hour < 12:
        hour += 12
    elif ampm == 'am' and hour == 12:
        hour = 0

    # 2) Figure out which calendar date in the last 7 days has the correct day name.
    #    Day names match strftime("%a") output: Mon, Tue, Wed, Thu, Fri, Sat, Sun.
    #    'today' is the day the script runs, in AEST; step back 0-6 days to find a
    #    date whose strftime("%a") equals day_str.
    tz = ZoneInfo("Australia/Brisbane")
    for offset in range(7):
        candidate = today_aest - timedelta(days=offset)
        if candidate.strftime("%a") == day_str.title():  # e.g. "Wed" or "Tue"
            # Found the matching day; combine it with hour/minute to form
            # the final reading datetime
            return datetime(candidate.year, candidate.month, candidate.day,
                            hour, minute, tzinfo=tz)

    # If we didn't find any match (unlikely if data is only 1-5 days old), fall back
    return pd.NaT


page_soup = BeautifulSoup(page_str, "html.parser")

# Find the table (assuming there's only one table in the HTML)
table = page_soup.find("table")

all_rows = []  # will hold lists of cell values
for tr in table.find_all("tr"):
    # Gather all cells (td/th) in this row
    cells = tr.find_all(["td", "th"])

    # Skip the row if any cell has a colspan
    # (you could also check for rowspan if needed)
    if any(cell.has_attr("colspan") for cell in cells):
        continue

    # Extract the text from each cell in the row
    row_data = [cell.get_text(strip=True) for cell in cells]

    # The table has 7 columns, so only keep rows that have 7 cells.
    # Note: row_data is appended *before* the two extra values below are added;
    # because lists are mutable, the entry already in all_rows picks them up too.
    if len(row_data) == 7:
        all_rows.append(row_data)

    # -------------------------------------------------------------------------
    # 1) Look for the link that contains the word "plot" (case-insensitive)
    # -------------------------------------------------------------------------
    plot_link_tag = tr.find("a", string=lambda text: text and "plot" in text.lower())

    idq_number = None
    second_number = None
    # If found, parse the link's href to extract the IDQ number and second number
    if plot_link_tag and plot_link_tag.has_attr("href"):
        href = plot_link_tag["href"]  # e.g. http://www.bom.gov.au/fwo/IDQ65388/IDQ65388.540612.plt.shtml
        # Get the filename after the last slash, e.g. "IDQ65388.540612.plt.shtml"
        filename = href.split("/")[-1]
        # Capture the part before the first '.' and the next part
        match = re.match(r'^(.*?)\.(.*?)\.plt\.shtml$', filename)
        if match:
            idq_number = match.group(1)     # e.g. "IDQ65388"
            second_number = match.group(2)  # e.g. "540612"

    # -------------------------------------------------------------------------
    # 2) Append these two new values onto row_data
    # -------------------------------------------------------------------------
    row_data.append(idq_number)
    row_data.append(second_number)

# Now each entry in `all_rows` has 7 + 2 = 9 columns:
# the original 7 columns + IDQ_number + second_number.
# The first row in `all_rows` is the header, so fill in the missing column titles:
header = all_rows[0]
header[7] = "IDQ_Number"
header[8] = "Station_ID"

df = pd.DataFrame(all_rows[1:], columns=header)

# Remove any instances of "^" from the height column, then convert it to float.
# regex=False so "^" is treated literally, not as a regex anchor.
df['Height'] = df['Height'].str.replace('^', '', regex=False)
df['Height'] = df['Height'].astype(float)

df["ReadingDateTime"] = pd.to_datetime(df["Time/Day"].apply(parse_time_day_to_datetime))

# Drop the 'Recent Data' column in place
df.drop(columns=['Recent Data'], inplace=True)

# In the column titles, replace spaces and "/" with "_"
df.columns = df.columns.str.replace(' ', '_').str.replace('/', '_')

print(df)

# Export to CSV
df.to_csv('flood.csv', index=False)
```
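As a quick sanity check on the day-matching helper, pinning `today_aest` makes the result deterministic; the date below is hypothetical, chosen only so the output doesn't depend on when it's run:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("Australia/Brisbane")
fixed_now = datetime(2024, 1, 10, 12, 0, tzinfo=tz)  # a Wednesday (hypothetical date)

# "Tue" resolves to the most recent Tuesday on or before the pinned "today":
print(parse_time_day_to_datetime("11.49pm Tue", today_aest=fixed_now))
# 2024-01-09 23:49:00+10:00

# Strings that don't match the "HH.MMam/pm Day" pattern come back as NaT:
print(parse_time_day_to_datetime("not a timestamp"))
# NaT
```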
Hey @sxwebster, I wasn't planning to reply; as I said, give me the URL only, and I didn't say I would rewrite your code! Here's the optimized code anyway for the most important part; note I'm using `from scrapling.defaults import Fetcher`:

```python
from scrapling.defaults import Fetcher
rows = []
page = Fetcher.get(url)  # url = the flood-warning page from the comment above
table = page.find('table')
headers = table.css('thead th::text')

for tr in table.find_all('tr')[1:].filter(lambda r: len(r.css('td')) > 1):
    row = [
        (
            element.text.clean()
            if element.tag != 'a' else
            element.attrib['href'].rstrip('/').re(r'.*/(.*?)\.(\d+)\.plt\.shtml$')
        )
        for element in (
            tr.css('td').filter(lambda cell: not cell.css('a')) +
            tr.css('td a:contains("Plot")')
        )
    ]
    rows.append(dict(zip(headers, row)))
```

The first two of the 1273 rows:

```
[{'Station Name': 'Lt Nerang Ck at Little Nerang Dam *',
  'Time/Day': '1.30am Tue',
  'Height': '168.10',
  'Tendency': 'steady',
  'Crossing': '0.08 above Spillway',
  'Flood Class': '',
  'Recent Data': ['IDQ65388', '540612']},
 {'Station Name': 'Lt Nerang Ck at Little Nerang Dam #',
  'Time/Day': '2.46am Tue',
  'Height': '0.08',
  'Tendency': 'steady',
  'Crossing': '0.08 above Spillway',
  'Flood Class': 'below minor',
  'Recent Data': ['IDQ65388', '540054']},
 ...]
```

Please read the documentation next time before opening an issue.
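To carry the example one step further, here's a minimal sketch of loading those rows into pandas; the `IDQ_Number`/`Station_ID` names simply mirror the bs4 version above and are illustrative, and it assumes every kept row had a matching "Plot" link (so "Recent Data" always holds two captures):

```python
import pandas as pd

df = pd.DataFrame(rows)

# "Recent Data" holds the two regex captures [IDQ number, station id];
# split them into their own columns and drop the original.
df[["IDQ_Number", "Station_ID"]] = pd.DataFrame(df["Recent Data"].tolist(), index=df.index)
df = df.drop(columns=["Recent Data"])
```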
Have you searched if there is an existing feature request for this?
Feature description
Great work on scrapling!
I would find it particularly useful if you could give an example in the readme of how to scrape from a table, particularly if that table doesn't necessarily have an ID or class.