SeekSpider: A Scrapy Project for Job Scraping

Table of Contents

  • Introduction
  • Features
  • Getting Started
  • Web API Parameters Explanation
  • Project Components
  • Configuration
  • Database Schema
  • Contributing
  • License
  • Acknowledgments

Introduction

SeekSpider is a comprehensive job market analysis tool built with Scrapy. It not only extracts job listings from seek.com.au but also performs advanced analysis of job descriptions, salaries, and technology stacks. The system uses a modular architecture with dedicated components for web scraping, data processing, and AI-powered analysis.

Key capabilities include:

  • Automated data collection from SEEK's job listings
  • AI-powered analysis of job descriptions and requirements
  • Salary standardization and analysis
  • Technology stack trend analysis
  • Real-time job market statistics

Features

Core Architecture:

  • Modular architecture with clear separation of concerns
  • Core components for database, logging, and AI integration
  • Centralized configuration management
  • Robust error handling and logging

Data Collection:

  • Scrapy framework for efficient web crawling
  • Selenium-based authentication system
  • Smart pagination with category management
  • BeautifulSoup integration for detailed parsing

Data Processing:

  • PostgreSQL database with JSONB support
  • Transaction management and data integrity
  • Job status tracking and updates
  • Batch processing capabilities

AI Integration:

  • AI-powered tech stack analysis
  • Salary standardization
  • Technology trend analysis
  • Configurable AI parameters

Monitoring and Management:

  • Detailed logging system
  • Performance monitoring
  • Rate limiting compliance
  • Automatic error recovery

Getting Started

Prerequisites

Ensure you have the following installed on your system:

  • Python 3.9 or above
  • pip (Python package installer)
  • PostgreSQL server (for database storage)
  • Chrome/Chromium browser (for Selenium)
  • ChromeDriver (for Selenium WebDriver)
  • Git (for version control)

Installation

  1. Clone the repository and move into the project directory:
git clone https://github.com/your-username/SeekSpider.git
cd SeekSpider
  2. Install the required Python packages listed in requirements.txt:
pip install -r requirements.txt
  3. Install ChromeDriver:

For Ubuntu/Debian:

sudo apt-get update
sudo apt-get install chromium-chromedriver

For macOS:

brew install chromedriver

For Windows, download from the official ChromeDriver website.

  4. Create a .env file in the project root:
POSTGRESQL_HOST=your_host
POSTGRESQL_PORT=5432
POSTGRESQL_USER=your_user
POSTGRESQL_PASSWORD=your_password
POSTGRESQL_DATABASE=your_database
POSTGRESQL_TABLE=your_table

SEEK_USERNAME=your_seek_email
SEEK_PASSWORD=your_seek_password
AI_API_KEY=your_api_key
AI_API_URL=your_api_url
AI_MODEL=your_model_name

Make sure PostgreSQL is installed and running on your local machine or a remote server, with the required database and table schema set up (see Database Schema below). The database connection settings in your .env file must point to that PostgreSQL instance.

Setup

  1. Database Configuration:

    • Create your PostgreSQL database and user with appropriate privileges.
    • Define your database connection settings in the .env file.
  2. Parameters Configuration:

    • Customize search parameters in the search_params dictionary of the SeekSpider class for targeted scraping.

Execution

You can run the spider in two ways:

  1. Using the main script:
python main.py
  2. Using Scrapy directly:
scrapy crawl seek

Upon execution, the spider will start to navigate through the job listings on SEEK and insert each job's data into the database using the pipeline.

Note: I'm not trying to be lazy, but you can simply run main.py instead : )

Web API Parameters Explanation

The spider makes use of the Seek Job Search API with several query parameters to tailor the search results according to specific needs. Below is a detailed explanation of these parameters used in the spider's query string:

search_params = {
    'siteKey': 'AU-Main',  # Identifies the main SEEK Australia site
    'sourcesystem': 'houston',  # SEEK's internal system identifier
    'where': 'All Perth WA',  # Location filter
    'page': 1,  # Current page number
    'seekSelectAllPages': 'true',  # Enable full page access
    'classification': '6281',  # IT jobs classification
    'subclassification': '',  # Specific IT category
    'include': 'seodata',  # Include SEO metadata
    'locale': 'en-AU',  # Australian English locale
}

Key API features:

  • Systematic category traversal using subclassifications
  • Automatic pagination handling
  • Location-based filtering
  • SEO data inclusion
  • Locale support

Note: SEEK caps search results at 26 pages, so a single query can never return more than 26 pages of listings. To work around this limit, the spider breaks the job into smaller queries, one per subclassification.

The params are converted to a query string by urlencode, which ensures they are properly formatted for the HTTP request. Adjusting these parameters allows for a wide range of searches, so you can collect data that matches your own intent.
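
As a minimal illustration of how this works (the API endpoint and subclassification IDs below are placeholders, and the real request construction lives in the spider):

from urllib.parse import urlencode

API_ENDPOINT = "<seek-job-search-api-endpoint>"  # placeholder; the real URL is defined in the spider

search_params = {
    'siteKey': 'AU-Main',
    'sourcesystem': 'houston',
    'where': 'All Perth WA',
    'page': 1,
    'seekSelectAllPages': 'true',
    'classification': '6281',
    'subclassification': '',
    'include': 'seodata',
    'locale': 'en-AU',
}

# Walk each subclassification to stay under the 26-page cap
# (IDs here are placeholders, not SEEK's real codes).
for sub_id in ['<subclass-id-1>', '<subclass-id-2>']:
    search_params['subclassification'] = sub_id
    search_params['page'] = 1
    url = f"{API_ENDPOINT}?{urlencode(search_params)}"
    # ...issue the request, then increment 'page' until no results remain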

The spider automatically handles:

  • URL encoding of parameters
  • Authentication token management
  • Request retries on failure
  • Rate limiting compliance
  • Response validation

These parameters dictate the scope and specificity of the scraping task; modify them to collect job listing data that matches your own search criteria.

Project Components

Core Components

The core components of SeekSpider are responsible for database, logging, and AI integration.

Database Manager

The DatabaseManager class provides a centralized interface for all database operations:

  • Connection and transaction management using context managers
  • Parameterized queries to prevent SQL injection
  • Automatic retry mechanism for failed operations
  • Logging of all database operations
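
A rough sketch of the transaction-management part, using psycopg2 (names and details are illustrative, not the project's exact implementation):

from contextlib import contextmanager

import psycopg2

class DatabaseManager:
    def __init__(self, host, port, user, password, database):
        self._dsn = dict(host=host, port=port, user=user,
                         password=password, dbname=database)

    @contextmanager
    def transaction(self):
        """Open a connection, commit on success, roll back on any error."""
        conn = psycopg2.connect(**self._dsn)
        try:
            yield conn.cursor()
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()

# Usage: parameterized queries keep user input out of the SQL string.
# with db.transaction() as cur:
#     cur.execute('SELECT "JobTitle" FROM "Jobs" WHERE "Id" = %s', (job_id,))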

Logger

The Logger class provides a unified logging interface:

  • Console output with formatted messages
  • Different log levels (INFO, ERROR, WARNING, DEBUG)
  • Component-specific logging with named loggers
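
A named console logger along these lines can be built with the standard logging module (a sketch, not the project's exact Logger class):

import logging

def get_logger(name: str) -> logging.Logger:
    """Named, console-only logger with a consistent message format."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s [%(name)s] %(levelname)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger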

AI Client

The AIClient class handles all AI-related operations:

  • Integration with AI APIs for text analysis
  • Automatic retry for rate-limited requests
  • Configurable request parameters
  • Error handling and logging
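
A simplified sketch of the retry behaviour, assuming an OpenAI-compatible chat endpoint behind AI_API_URL (the project's actual payload and response shapes may differ):

import time

import requests

class AIClient:
    def __init__(self, api_url, api_key, model, max_retries=3):
        self.api_url, self.api_key, self.model = api_url, api_key, model
        self.max_retries = max_retries

    def analyse(self, prompt: str) -> str:
        payload = {"model": self.model,
                   "messages": [{"role": "user", "content": prompt}]}
        headers = {"Authorization": f"Bearer {self.api_key}"}
        for attempt in range(self.max_retries):
            response = requests.post(self.api_url, json=payload,
                                     headers=headers, timeout=60)
            if response.status_code == 429:   # rate limited: back off and retry
                time.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            return response.json()["choices"][0]["message"]["content"]
        raise RuntimeError("AI request still rate-limited after retries")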

Utils

The utils package contains specialized analyzers:

  • TechStackAnalyzer: Extracts technology stack information from job descriptions
  • SalaryNormalizer: Standardizes salary information into consistent format
  • TechStatsAnalyzer: Generates statistics about technology usage
  • get_token: Handles SEEK authentication and token management
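
The analyzers lean on the AI client for the heavy lifting; purely as an illustration of the tech-stack idea, a plain keyword matcher might look like this (the keyword list and function name are hypothetical):

import re

KNOWN_TECH = ["python", "java", "c#", "react", "aws", "docker", "postgresql"]  # illustrative subset

def extract_tech_stack(job_description: str) -> list[str]:
    """Very simplified keyword matcher; the real analyzer is AI-assisted."""
    text = job_description.lower()
    return [tech for tech in KNOWN_TECH
            if re.search(rf"(?<![a-z0-9]){re.escape(tech)}(?![a-z0-9])", text)]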

Items

The SeekspiderItem class is defined as a Scrapy Item. Items provide a means to collect the data scraped by the spiders. The fields collected by this project are:

Field Name       Description
job_id           The unique identifier for the job posting.
job_title        The title of the job.
business_name    The name of the business advertising the job.
work_type        The type of employment (e.g., full-time, part-time).
job_description  A description of the job and its responsibilities.
pay_range        The salary or range provided for the position.
suburb           The suburb where the job is located.
area             A broader area designation for the job location.
url              The direct URL to the job listing.
advertiser_id    The unique identifier for the advertiser of the job.
job_type         The classification of the job.
posted_date      The original posting date of the job.
is_active        Indicates if the job listing is still active.
expiry_date      When the job listing expired (if applicable).
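
Based on the table above, the item definition looks roughly like this (one scrapy.Field per field):

import scrapy

class SeekspiderItem(scrapy.Item):
    job_id = scrapy.Field()
    job_title = scrapy.Field()
    business_name = scrapy.Field()
    work_type = scrapy.Field()
    job_description = scrapy.Field()
    pay_range = scrapy.Field()
    suburb = scrapy.Field()
    area = scrapy.Field()
    url = scrapy.Field()
    advertiser_id = scrapy.Field()
    job_type = scrapy.Field()
    posted_date = scrapy.Field()
    is_active = scrapy.Field()
    expiry_date = scrapy.Field()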

Spider

The heart of the SeekSpider project is the scrapy.Spider subclass that defines how job listings are scraped. It constructs the necessary HTTP requests, parses the responses returned from the web server, and extracts the data using selectors to populate SeekspiderItem objects.

The spider now includes several key components:

Authentication

  • Automated login using Selenium WebDriver
  • Token management and refresh mechanism
  • Secure credential handling
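
A heavily simplified sketch of this flow (the sign-in URL, element selectors, and storage key are placeholders; the real get_token helper will differ):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_token(username: str, password: str) -> str:
    """Log in to SEEK with Selenium and read an auth token from browser storage."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.seek.com.au/sign-in")  # placeholder URL
        WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, "emailAddress"))).send_keys(username)
        driver.find_element(By.ID, "password").send_keys(password)
        driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
        WebDriverWait(driver, 30).until(EC.url_contains("seek.com.au"))
        # 'auth_token' is a placeholder key, not SEEK's actual storage layout.
        return driver.execute_script("return window.localStorage.getItem('auth_token')")
    finally:
        driver.quit()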

Job Category Management

  • Systematic traversal through IT job categories
  • Smart pagination with subclassification support
  • Detailed logging of category transitions

Data Extraction

  • API-based job listing retrieval
  • BeautifulSoup integration for detailed parsing
  • Robust error handling and retry logic
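
A minimal sketch of the BeautifulSoup parsing step (the container selector is a placeholder, not necessarily the element the spider targets):

from bs4 import BeautifulSoup

def extract_description(html: str) -> str:
    """Strip markup from a job ad body, falling back to the whole page."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"data-automation": "jobAdDetails"})  # hypothetical selector
    return (container or soup).get_text(separator="\n", strip=True)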

Post-Processing

  • Automated tech stack analysis
  • Salary standardization
  • Technology usage statistics generation
  • Job status tracking (active/inactive)

Pipeline

The pipeline takes each scraped item and writes it to PostgreSQL. Key pipeline features:

  • Efficient database connection management
  • Transaction support for data integrity
  • Automatic job deactivation for expired listings
  • Smart update/insert logic based on job ID
  • Batch processing capabilities
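
A simplified sketch of the update/insert logic, assuming psycopg2 and the Jobs table from the Database Schema section (only a handful of columns are written here for brevity):

import os

import psycopg2

UPSERT_SQL = """
INSERT INTO "Jobs" ("Id", "JobTitle", "BusinessName", "Url", "PayRange", "PostedDate")
VALUES (%(job_id)s, %(job_title)s, %(business_name)s, %(url)s, %(pay_range)s, %(posted_date)s)
ON CONFLICT ("Id") DO UPDATE
SET "JobTitle"  = EXCLUDED."JobTitle",
    "PayRange"  = EXCLUDED."PayRange",
    "IsActive"  = TRUE,
    "UpdatedAt" = CURRENT_TIMESTAMP;
"""

class SeekspiderPipeline:
    def open_spider(self, spider):
        # Connection settings come from the .env configuration described above.
        self.conn = psycopg2.connect(
            host=os.getenv("POSTGRESQL_HOST"),
            port=os.getenv("POSTGRESQL_PORT"),
            user=os.getenv("POSTGRESQL_USER"),
            password=os.getenv("POSTGRESQL_PASSWORD"),
            dbname=os.getenv("POSTGRESQL_DATABASE"),
        )

    def process_item(self, item, spider):
        with self.conn.cursor() as cur:
            cur.execute(UPSERT_SQL, dict(item))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()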

Settings

I have intentionally slowed the crawl down to avoid any ban. If the spider feels too slow, try increasing CONCURRENT_REQUESTS and decreasing DOWNLOAD_DELAY; a settings.py excerpt is shown after the list below.

Additional important settings:

  • CONCURRENT_REQUESTS = 16: Concurrent request limit
  • DOWNLOAD_DELAY = 2: Delay between requests
  • Custom retry middleware configuration
  • Logging level configuration
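
For reference, the corresponding part of settings.py looks roughly like this (the retry values shown are illustrative defaults, not necessarily the project's exact numbers):

# settings.py (excerpt)
CONCURRENT_REQUESTS = 16   # raise with care: SEEK may throttle aggressive crawlers
DOWNLOAD_DELAY = 2         # seconds between requests to the same domain

RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

LOG_LEVEL = "INFO"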

Configuration

The project uses a centralized configuration management system through the Config class and environment variables. All configuration is loaded from a .env file in the project root.

Before running the spider, create a .env file with the following configuration:

# Database Configuration
POSTGRESQL_HOST=your_host
POSTGRESQL_PORT=5432
POSTGRESQL_USER=your_user
POSTGRESQL_PASSWORD=your_password
POSTGRESQL_DATABASE=your_database
POSTGRESQL_TABLE=your_table

# SEEK Credentials
SEEK_USERNAME=your_seek_email
SEEK_PASSWORD=your_seek_password

# AI API Configuration
AI_API_KEY=your_api_key
AI_API_URL=your_api_url
AI_MODEL=your_model_name

Note: Make sure to add .env to your .gitignore file to prevent sensitive information from being committed to your repository.

Key configuration features:

  • Environment-based configuration management
  • Automatic validation of required settings
  • Secure credential handling
  • Centralized configuration access
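
A minimal sketch of how the Config class might load and validate these settings, assuming python-dotenv is used to read the .env file (class internals here are illustrative):

import os

from dotenv import load_dotenv

REQUIRED_KEYS = [
    "POSTGRESQL_HOST", "POSTGRESQL_USER", "POSTGRESQL_PASSWORD",
    "POSTGRESQL_DATABASE", "SEEK_USERNAME", "SEEK_PASSWORD", "AI_API_KEY",
]

class Config:
    def __init__(self):
        load_dotenv()  # read .env from the project root
        missing = [key for key in REQUIRED_KEYS if not os.getenv(key)]
        if missing:
            raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
        # Expose each value as a lowercase attribute, e.g. self.postgresql_host.
        for key in REQUIRED_KEYS:
            setattr(self, key.lower(), os.getenv(key))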

You can tweak the crawl parameters like search location, category, and job type in the spider's search_params dictionary:

search_params = {
    'siteKey': 'AU-Main',
    'sourcesystem': 'houston',
    'where': 'All Perth WA',
    'page': 1,
    'seekSelectAllPages': 'true',
    'classification': '6281',
    'subclassification': '',
    'include': 'seodata',
    'locale': 'en-AU',
}

The spider also supports configuration of:

  • AI analysis parameters
  • Database connection settings
  • Logging levels and formats
  • Retry mechanisms and delays
  • Authentication parameters

Database Schema

Make sure that your PostgreSQL database has a table with the correct schema to store the data. Below is a guideline schema based on the fields defined in the SeekspiderItem:

CREATE TABLE "Jobs"
(
    "Id"             INTEGER PRIMARY KEY,
    "JobTitle"       VARCHAR(255),
    "BusinessName"   VARCHAR(255),
    "WorkType"       VARCHAR(50),
    "JobDescription" TEXT,
    "PayRange"       VARCHAR(255),
    "Suburb"         VARCHAR(255),
    "Area"           VARCHAR(255),
    "Url"            VARCHAR(255),
    "AdvertiserId"   INTEGER,
    "JobType"        VARCHAR(50),
    "CreatedAt"      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    "UpdatedAt"      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    "ExpiryDate"     TIMESTAMP,
    "IsActive"       BOOLEAN   DEFAULT TRUE,
    "IsNew"          BOOLEAN   DEFAULT TRUE,
    "PostedDate"     TIMESTAMP,
    "TechStack"      JSONB,
    "MinSalary"      INTEGER,
    "MaxSalary"      INTEGER
);

CREATE TABLE "tech_word_frequency"
(
    "word"       VARCHAR(255) PRIMARY KEY,
    "frequency"  INTEGER,
    "created_at" TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Please include relevant indices based on your query patterns for optimal performance.

Recommended indices:

CREATE INDEX idx_jobs_tech_stack ON "Jobs" USING GIN ("TechStack");
CREATE INDEX idx_jobs_salary ON "Jobs" ("MinSalary", "MaxSalary");
CREATE INDEX idx_jobs_active ON "Jobs" ("IsActive");
CREATE INDEX idx_jobs_posted_date ON "Jobs" ("PostedDate");

Key database features:

  • JSONB support for flexible tech stack storage
  • Automated timestamp management
  • Salary range normalization
  • Active job tracking
  • Tech stack frequency analysis

Contributing

Contributions are welcome. Fork the project, make your changes, and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

The content for this README was generated with the assistance of Generative AI, ensuring accuracy and efficiency in delivering the information needed to understand and use this project.

(Again, I'm not a lazy boy, definitely)
