Could you extend the documentation to include an example of collecting data from a table, please? #30
Comments
Hi mate, thanks for this! Let me know if that helps, or you can show me your exact example and I will tell you what to do with it.
Hey mate, thanks for the quick reply! You've actually caught me posting a comment just as I'm about to head off travelling for a week and a bit, so my replies might be a little slow, sorry! The following is an example of what I'm doing using a combination of scrapling and bs4. The code is in a lazy state since I'm still just stuffing around in a Jupyter notebook before tidying it up, so please accept my apologies for that. I've also broken up the page URL in the first couple of lines just so it's a little less findable in search. The output is essentially the table that's on the above page, loaded into a pandas DataFrame, noting that I've also had to account for colspans.

```python
from scrapling import Fetcher
from bs4 import BeautifulSoup
import pandas as pd
import re
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# NOTE: URL intentionally broken up with spaces to be less findable (see note above)
page = Fetcher().get('http://www.b om.g ov.au/cgi-bin/wra p_fwo.pl?IDQ 60005.html',
                     stealthy_headers=True, follow_redirects=True)

# Print the type of the page, then convert it to a string for bs4
print(type(page))
page_str = str(page)


def parse_time_day_to_datetime(value, today_aest=None):
    """
    Given a string like "11.49pm Tue", return a datetime in AEST for the *current week*.
    Search backward up to 6 days if the day-of-week doesn't match "today".
    """
    if not value or pd.isnull(value):
        return pd.NaT  # or None

    # If "today_aest" wasn't provided, use now in AEST:
    if today_aest is None:
        tz = ZoneInfo("Australia/Brisbane")
        today_aest = datetime.now(tz=tz)

    # 1) Extract the time portion and the day name.
    #    Typical format: "HH.MM(am|pm) DayName", e.g. "11.49pm Tue", "6.29am Wed"
    match = re.match(r"^(\d{1,2})\.(\d{1,2})(am|pm)\s+(\w+)$", value.strip(), re.IGNORECASE)
    if not match:
        # If it doesn't match our expected pattern, return NaT
        return pd.NaT

    hour_str, minute_str, ampm, day_str = match.groups()
    hour = int(hour_str)
    minute = int(minute_str)
    ampm = ampm.lower()  # 'am' or 'pm'

    # Convert to 24-hour format
    if ampm == 'pm' and hour < 12:
        hour += 12
    elif ampm == 'am' and hour == 12:
        hour = 0

    # 2) Figure out which calendar date in the last 7 days has the correct day name.
    #    Day names match strftime("%a") output: Mon, Tue, Wed, Thu, Fri, Sat, Sun.
    #    'today' is the day the script runs, in AEST; step back 0-6 days to find a
    #    date whose strftime("%a") equals day_str.
    tz = ZoneInfo("Australia/Brisbane")
    for offset in range(7):
        candidate = today_aest - timedelta(days=offset)
        if candidate.strftime("%a") == day_str.title():  # e.g. "Wed" or "Tue"
            # Found the matching day; combine it with hour/minute to form
            # the final reading datetime
            return datetime(candidate.year, candidate.month, candidate.day,
                            hour, minute, tzinfo=tz)

    # If we didn't find any match (unlikely if data is only 1-5 days old), fall back
    return pd.NaT


page_soup = BeautifulSoup(page_str, "html.parser")

# Find the table (assuming there's only one table in the HTML)
table = page_soup.find("table")

all_rows = []  # will hold lists of cell values
for tr in table.find_all("tr"):
    # Gather all cells (td/th) in this row
    cells = tr.find_all(["td", "th"])

    # Skip the row if any cell has a colspan
    # (you could also check for rowspan if needed)
    if any(cell.has_attr("colspan") for cell in cells):
        continue

    # Extract the text from each cell in the row
    row_data = [cell.get_text(strip=True) for cell in cells]

    # The table has 7 columns, so only keep rows that have 7 cells.
    # Note: row_data is appended *before* the two extra values below are added;
    # because lists are mutable, the entry already in all_rows picks them up too.
    if len(row_data) == 7:
        all_rows.append(row_data)

    # -------------------------------------------------------------------------
    # 1) Look for the link that contains the word "plot" (case-insensitive)
    # -------------------------------------------------------------------------
    plot_link_tag = tr.find("a", string=lambda text: text and "plot" in text.lower())

    idq_number = None
    second_number = None
    # If found, parse the link's href to extract the IDQ number and second number
    if plot_link_tag and plot_link_tag.has_attr("href"):
        href = plot_link_tag["href"]  # e.g. http://www.bom.gov.au/fwo/IDQ65388/IDQ65388.540612.plt.shtml
        # Get the filename after the last slash, e.g. "IDQ65388.540612.plt.shtml"
        filename = href.split("/")[-1]
        # Capture the part before the first '.' and the next part
        match = re.match(r'^(.*?)\.(.*?)\.plt\.shtml$', filename)
        if match:
            idq_number = match.group(1)     # e.g. "IDQ65388"
            second_number = match.group(2)  # e.g. "540612"

    # -------------------------------------------------------------------------
    # 2) Append these two new values onto row_data
    # -------------------------------------------------------------------------
    row_data.append(idq_number)
    row_data.append(second_number)

# Now each entry in `all_rows` has 7 + 2 = 9 columns:
# the original 7 columns + IDQ_number + second_number.
# The first row in `all_rows` is the header, so fill in the missing column titles:
header = all_rows[0]
header[7] = "IDQ_Number"
header[8] = "Station_ID"

df = pd.DataFrame(all_rows[1:], columns=header)

# Remove any instances of "^" from the height column, then convert it to float.
# regex=False so "^" is treated literally, not as a regex anchor.
df['Height'] = df['Height'].str.replace('^', '', regex=False)
df['Height'] = df['Height'].astype(float)

df["ReadingDateTime"] = pd.to_datetime(df["Time/Day"].apply(parse_time_day_to_datetime))

# Drop the 'Recent Data' column in place
df.drop(columns=['Recent Data'], inplace=True)

# In the column titles, replace spaces and "/" with "_"
df.columns = df.columns.str.replace(' ', '_').str.replace('/', '_')

print(df)

# Export to CSV
df.to_csv('flood.csv', index=False)
```
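As a quick sanity check on the day-matching helper, pinning `today_aest` makes the result deterministic; the date below is hypothetical, chosen only so the output doesn't depend on when it's run:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("Australia/Brisbane")
fixed_now = datetime(2024, 1, 10, 12, 0, tzinfo=tz)  # a Wednesday (hypothetical date)

# "Tue" resolves to the most recent Tuesday on or before the pinned "today":
print(parse_time_day_to_datetime("11.49pm Tue", today_aest=fixed_now))
# 2024-01-09 23:49:00+10:00

# Strings that don't match the "HH.MMam/pm Day" pattern come back as NaT:
print(parse_time_day_to_datetime("not a timestamp"))
# NaT
```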
Hey @sxwebster, I wasn't planning to reply; as I said, give me the URL only, and I didn't say I would rewrite your code! Here's the optimized code anyway for the most important part; note I'm using `from scrapling.defaults import Fetcher`:

```python
from scrapling.defaults import Fetcher
rows = []
page = Fetcher.get(url)  # url = the flood-warning page from the comment above
table = page.find('table')
headers = table.css('thead th::text')

for tr in table.find_all('tr')[1:].filter(lambda r: len(r.css('td')) > 1):
    row = [
        (
            element.text.clean()
            if element.tag != 'a' else
            element.attrib['href'].rstrip('/').re(r'.*/(.*?)\.(\d+)\.plt\.shtml$')
        )
        for element in (
            tr.css('td').filter(lambda cell: not cell.css('a')) +
            tr.css('td a:contains("Plot")')
        )
    ]
    rows.append(dict(zip(headers, row)))
```

The first two of the 1273 rows:

```
[{'Station Name': 'Lt Nerang Ck at Little Nerang Dam *',
  'Time/Day': '1.30am Tue',
  'Height': '168.10',
  'Tendency': 'steady',
  'Crossing': '0.08 above Spillway',
  'Flood Class': '',
  'Recent Data': ['IDQ65388', '540612']},
 {'Station Name': 'Lt Nerang Ck at Little Nerang Dam #',
  'Time/Day': '2.46am Tue',
  'Height': '0.08',
  'Tendency': 'steady',
  'Crossing': '0.08 above Spillway',
  'Flood Class': 'below minor',
  'Recent Data': ['IDQ65388', '540054']},
 ...]
```

Please read the documentation next time before opening an issue.
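To carry the example one step further, here's a minimal sketch of loading those rows into pandas; the `IDQ_Number`/`Station_ID` names simply mirror the bs4 version above and are illustrative, and it assumes every kept row had a matching "Plot" link (so "Recent Data" always holds two captures):

```python
import pandas as pd

df = pd.DataFrame(rows)

# "Recent Data" holds the two regex captures [IDQ number, station id];
# split them into their own columns and drop the original.
df[["IDQ_Number", "Station_ID"]] = pd.DataFrame(df["Recent Data"].tolist(), index=df.index)
df = df.drop(columns=["Recent Data"])
```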
Have you searched if there is an existing feature request for this?
Feature description
Great work on scrapling!
I would find it particularly useful if you could give an example in the readme of how to scrape from a table, particularly if that table doesn't necessarily have an ID or class.