Feature/updating indeed scraper (#166) #170

Merged
merged 3 commits into from
Sep 17, 2024
Changes from 1 commit
10 changes: 4 additions & 6 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -4,13 +4,11 @@ about: Create a report to help us improve
title: ''
labels: bug
assignees: ''

---

## Description

Please include a summary of the issue.
Please include the steps to reproduce.
Please include a summary of the issue. Please include the steps to reproduce.
List any additional libraries that are affected.

## Steps to Reproduce
@@ -29,6 +27,6 @@ A description of what happens instead.

## Environment

* Build: [e.g. 3180 - type "About" in the Command Palette]
* Operating system and version: [e.g. macOS 10.14, Windows 10, Ubuntu 18.04]
* [Linux] Desktop Environment and/or Window Manager: [e.g. Gnome, LXDE, i3]
- Build: [e.g. 3180 - type "About" in the Command Palette]
- Operating system and version: [e.g. macOS 10.14, Windows 10, Ubuntu 18.04]
- [Linux] Desktop Environment and/or Window Manager: [e.g. Gnome, LXDE, i3]
17 changes: 8 additions & 9 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -4,17 +4,16 @@ about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Is your feature request related to a problem? Please describe.** A clear and
concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe the solution you'd like** A clear and concise description of what you
want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.
**Describe alternatives you've considered** A clear and concise description of
any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
**Additional context** Add any other context or screenshots about the feature
request here.
9 changes: 4 additions & 5 deletions .github/issue_template.md
@@ -2,8 +2,7 @@

## Description

Please include a summary of the issue.
Please include the steps to reproduce.
Please include a summary of the issue. Please include the steps to reproduce.
List any additional libraries that are affected.

## Steps to Reproduce
@@ -22,6 +21,6 @@ A description of what happens instead.

## Environment

* Build: [e.g. 3180 - type "About" in the Command Palette]
* Operating system and version: [e.g. macOS 10.14, Windows 10, Ubuntu 18.04]
* [Linux] Desktop Environment and/or Window Manager: [e.g. Gnome, LXDE, i3]
- Build: [e.g. 3180 - type "About" in the Command Palette]
- Operating system and version: [e.g. macOS 10.14, Windows 10, Ubuntu 18.04]
- [Linux] Desktop Environment and/or Window Manager: [e.g. Gnome, LXDE, i3]
22 changes: 12 additions & 10 deletions .github/pull_request_template.md
@@ -2,10 +2,9 @@

## Description

Please include a summary of the change.
Please also include relevant motivation and context.
List any additional libraries that will be affected.
List any developers that will be affected or those who you had merge conflicts with.
Please include a summary of the change. Please also include relevant motivation
and context. List any additional libraries that will be affected. List any
developers that will be affected or those who you had merge conflicts with.

## Context of change

@@ -22,14 +21,15 @@ Please mark any boxes that apply.

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Breaking change (fix or feature that would cause existing functionality to
not work as expected)
- [ ] This change requires a documentation update

## How Has This Been Tested?

Please describe the tests that you ran to verify your changes.
Provide instructions so we can reproduce.
Please also list any relevant details for your test configuration.
Please describe the tests that you ran to verify your changes. Provide
instructions so we can reproduce. Please also list any relevant details for your
test configuration.

- [ ] Test A
- [ ] Test B
@@ -42,6 +42,8 @@ Please mark any boxes that have been completed.
- [ ] I have commented my code, particularly in hard-to-understand areas.
- [ ] I have made corresponding changes to the documentation.
- [ ] My changes generate no new warnings.
- [ ] I have added tests that prove my fix is effective or that my feature works.
- [ ] I have added tests that prove my fix is effective or that my feature
works.
- [ ] New and existing unit tests pass locally with my changes.
- [ ] Any dependent changes have been merged and published in downstream modules.
- [ ] Any dependent changes have been merged and published in downstream
modules.
9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,9 @@
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0 # Replace this with the version of Black you want to use
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: "v3.1.0" # Specify Prettier version
    hooks:
      - id: prettier
13 changes: 6 additions & 7 deletions demo/settings_USA.yaml
@@ -22,18 +22,17 @@ search:
# FIXME: we need to add back GLASSDOOR when that's working again.
providers:
- INDEED
- MONSTER

# Region that we are searching for jobs within:
province_or_state: "Texas" # NOTE: this is generally 2 characters long.
city: "Richardson" # NOTE: this is the full city / town name.
province_or_state: "CA" # NOTE: this is generally 2 characters long.
city: "San Francisco" # NOTE: this is the full city / town name.
radius: 25 # km (NOTE: if we were in locale: USA_ENGLISH it's in miles)

# These are the terms you would be typing into the website's search field:
keywords:
- Python
- Senior
- AI
- Data Science
- Machine Learning
- Software Engineer

# Don't return any listings older than this:
max_listing_days: 35
@@ -47,7 +46,7 @@ search:
remoteness: ANY

# Logging level options are: critical, error, warning, info, debug, notset
log_level: INFO
log_level: DEBUG

# Delaying algorithm configuration
delay:
3 changes: 2 additions & 1 deletion jobfunnel/__init__.py
@@ -1,3 +1,4 @@
"""JobFunnel base package init, we keep module version here.
"""
__version__ = "3.0.2"

__version__ = "4.0.0"
40 changes: 35 additions & 5 deletions jobfunnel/backend/job.py
@@ -1,6 +1,7 @@
"""Base Job class to be populated by Scrapers, manipulated by Filters and saved
to csv / etc by Exporter
"""

from copy import deepcopy
from datetime import date, datetime
from typing import Dict, List, Optional
@@ -132,7 +133,7 @@ def update_if_newer(self, job: "Job") -> bool:
Returns:
True if we updated self with job, False if we didn't
"""
if job.post_date > self.post_date:
if job.post_date >= self.post_date:
# Update all attrs other than status (which user can set).
self.company = deepcopy(job.company)
self.location = deepcopy(job.location)
@@ -152,6 +153,7 @@ def update_if_newer(self, job: "Job") -> bool:
# pylint: disable=protected-access
self._raw_scrape_data = deepcopy(job._raw_scrape_data)
# pylint: enable=protected-access

return True
else:
return False
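The `>` to `>=` change in `update_if_newer` means a re-scraped listing with the same post date now overwrites the cached fields instead of being ignored. A minimal sketch of that difference follows; `MiniJob` and its `inclusive` flag are hypothetical stand-ins for illustration, not part of JobFunnel.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MiniJob:
    """Hypothetical stand-in for Job, with only the fields this sketch needs."""

    title: str
    post_date: date

    def update_if_newer(self, job: "MiniJob", inclusive: bool = True) -> bool:
        """Copy fields from job when it is at least as new (or strictly newer)."""
        if inclusive:
            is_newer = job.post_date >= self.post_date  # new comparison
        else:
            is_newer = job.post_date > self.post_date  # old comparison
        if is_newer:
            self.title = job.title
            self.post_date = job.post_date
        return is_newer


cached = MiniJob("Data Engineer", date(2024, 9, 1))
rescraped = MiniJob("Senior Data Engineer", date(2024, 9, 1))

# Old comparison (>): a same-day re-scrape is ignored.
old_copy = MiniJob(cached.title, cached.post_date)
old_result = old_copy.update_if_newer(rescraped, inclusive=False)

# New comparison (>=): a same-day re-scrape replaces the cached fields.
new_result = cached.update_if_newer(rescraped)
```

With equal dates, the old comparison leaves the cached title untouched, while the new one picks up the re-scraped title.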
@@ -187,7 +189,7 @@ def as_row(self) -> Dict[str, str]:
self.location,
self.post_date.strftime("%Y-%m-%d"),
self.description,
", ".join(self.tags),
"\n".join(self.tags),
self.url,
self.key_id,
self.provider,
@@ -210,9 +212,11 @@ def as_json_entry(self) -> Dict[str, str]:
"title": self.title,
"company": self.company,
"post_date": self.post_date.strftime("%Y-%m-%d"),
"description": (self.description[:MAX_BLOCK_LIST_DESC_CHARS] + "..")
if len(self.description) > MAX_BLOCK_LIST_DESC_CHARS
else (self.description),
"description": (
(self.description[:MAX_BLOCK_LIST_DESC_CHARS] + "..")
if len(self.description) > MAX_BLOCK_LIST_DESC_CHARS
else (self.description)
),
"status": self.status.name,
}

@@ -243,3 +247,29 @@ def validate(self) -> None:
        assert self.url, "URL is unset!"
        if len(self.description) < MIN_DESCRIPTION_CHARS:
            raise ValueError("Description too short!")

    def __repr__(self) -> str:
        """Developer-friendly representation of the Job object."""
        return (
            f"Job("
            f"title='{self.title}', "
            f"company='{self.company}', "
            f"location='{self.location}', "
            f"status={self.status.name}, "
            f"post_date={self.post_date}, "
            f"url='{self.url}')"
        )

    def __str__(self) -> str:
        """Human-readable string representation of the Job object."""
        return (
            f"Job Title: {self.title}\n"
            f"Company: {self.company}\n"
            f"Location: {self.location}\n"
            f"Post Date: {self.post_date.strftime('%Y-%m-%d') if self.post_date else 'N/A'}\n"
            f"Status: {self.status.name}\n"
            f"Wage: {self.wage if self.wage else 'N/A'}\n"
            f"Remoteness: {self.remoteness if self.remoteness else 'N/A'}\n"
            f"Description (truncated): {self.description[:100]}{'...' if len(self.description) > 100 else ''}\n"
            f"URL: {self.url}\n"
        )
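The new `__repr__` and `__str__` methods are picked up in different places: `print()` and `str()` use `__str__`, while containers, debuggers, and `repr()` use `__repr__`. A small sketch of that split, using a hypothetical `MiniJob` class rather than the real `Job`; the `!r` conversion here is a substitution for the hand-written quotes in the diff, not the PR's own code.

```python
from datetime import date


class MiniJob:
    """Hypothetical stand-in for Job (not the JobFunnel class)."""

    def __init__(self, title: str, company: str, post_date: date) -> None:
        self.title = title
        self.company = company
        self.post_date = post_date

    def __repr__(self) -> str:
        # !r quotes and escapes each string field for us.
        return (
            f"MiniJob(title={self.title!r}, "
            f"company={self.company!r}, "
            f"post_date={self.post_date})"
        )

    def __str__(self) -> str:
        return (
            f"Job Title: {self.title}\n"
            f"Company: {self.company}\n"
            f"Post Date: {self.post_date:%Y-%m-%d}\n"
        )


job = MiniJob("ML Engineer", "Acme", date(2024, 9, 17))
repr_text = repr(job)  # what debuggers and "!r" formatting show
str_text = str(job)  # what print() shows
in_list = str([job])  # containers render their items with __repr__
```

Using `!r` keeps the repr unambiguous when a title or company name itself contains a quote character.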
6 changes: 5 additions & 1 deletion jobfunnel/backend/jobfunnel.py
@@ -1,6 +1,7 @@
"""Scrapes jobs, applies search filters and writes pickles to master list
Paul McInnis 2020
"""

import csv
import json
import os
@@ -230,7 +231,9 @@ def scrape(self) -> Dict[str, Job]:
try:
incoming_jobs_dict = scraper.scrape()
except Exception as e:
self.logger.error(f"Failed to scrape jobs for {scraper_cls.__name__}")
self.logger.error(
f"Failed to scrape jobs for {scraper_cls.__name__}: {e}"
)

# Ensure we have no duplicates between our scrapers by key-id
# (since we are updating the jobs dict with results)
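The `logger.error` change appends the exception text to the message, so a failed scrape reports why it failed and not only which scraper failed. A minimal sketch of the pattern with the standard `logging` module; the logger name, `failing_scrape`, and `DemoScraper` are hypothetical and used only for illustration.

```python
import io
import logging

# Capture log output in memory so the result can be inspected.
stream = io.StringIO()
logger = logging.getLogger("scrape_sketch")  # hypothetical logger name
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def failing_scrape() -> dict:
    """Hypothetical scraper call that always fails."""
    raise ConnectionError("HTTP 403 from provider")


try:
    failing_scrape()
except Exception as e:
    # Same pattern as the diff: include the exception text in the message.
    logger.error(f"Failed to scrape jobs for DemoScraper: {e}")
    # Alternative: logger.exception(...) would also append the traceback.

logged = stream.getvalue()
```

If the traceback is needed as well, `logger.exception` inside the `except` block is the standard-library option.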
@@ -425,6 +428,7 @@ def read_master_csv(self) -> Dict[str, Job]:
short_description=short_description,
post_date=post_date,
scrape_date=scrape_date,
wage=wage,
raw=raw,
tags=row["tags"].split(","),
remoteness=remoteness,
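One thing worth checking in review: `as_row` in `job.py` now joins tags with `"\n"`, while `read_master_csv` here still splits on `","`. Assuming the CSV tags column is written from `as_row` and read back with this split (an assumption; the writer is not shown in this diff), the round trip would behave as in this sketch with made-up tag values:

```python
tags = ["python", "remote", "full-time"]  # made-up example tags

# Writing side, as in the new as_row(): newline-joined.
cell = "\n".join(tags)

# Reading side, as in read_master_csv(): comma split.
read_back_comma = cell.split(",")

# Splitting on the delimiter that was written restores the list.
read_back_newline = cell.split("\n")
```

Under that assumption, the comma split returns a single element containing embedded newlines, so the reader and writer would need to agree on one delimiter.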
1 change: 1 addition & 0 deletions jobfunnel/backend/scrapers/base.py
@@ -1,6 +1,7 @@
"""The base scraper class to be used for all web-scraping emitting Job objects
Paul McInnis 2020
"""

import random
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor, as_completed
1 change: 1 addition & 0 deletions jobfunnel/backend/scrapers/glassdoor.py
@@ -1,6 +1,7 @@
"""Scraper for www.glassdoor.X
FIXME: this is currently unable to get past page 1 of job results.
"""

import re
from abc import abstractmethod
from concurrent.futures import ThreadPoolExecutor, wait