SeekSpider is a comprehensive job market analysis tool built with Scrapy. It not only extracts job listings from seek.com.au but also performs advanced analysis of job descriptions, salaries, and technology stacks. The system uses a modular architecture with dedicated components for web scraping, data processing, and AI-powered analysis.
Key capabilities include:
- Automated data collection from SEEK's job listings
- AI-powered analysis of job descriptions and requirements
- Salary standardization and analysis
- Technology stack trend analysis
- Real-time job market statistics
Core Architecture:
- Modular architecture with clear separation of concerns
- Core components for database, logging, and AI integration
- Centralized configuration management
- Robust error handling and logging
Data Collection:
- Scrapy framework for efficient web crawling
- Selenium-based authentication system
- Smart pagination with category management
- BeautifulSoup integration for detailed parsing
Data Processing:
- PostgreSQL database with JSONB support
- Transaction management and data integrity
- Job status tracking and updates
- Batch processing capabilities
AI Integration:
- AI-powered tech stack analysis
- Salary standardization
- Technology trend analysis
- Configurable AI parameters
Monitoring and Management:
- Detailed logging system
- Performance monitoring
- Rate limiting compliance
- Automatic error recovery
Ensure you have the following installed on your system:
- Python 3.9 or above
- pip (Python package installer)
- PostgreSQL server (for database storage)
- Chrome/Chromium browser (for Selenium)
- ChromeDriver (for Selenium WebDriver)
- Git (for version control)
- Clone the repository to your local machine and navigate into the project directory:

  ```bash
  git clone https://github.com/your-username/SeekSpider.git
  cd SeekSpider
  ```

- Install the required Python packages listed in `requirements.txt` using pip:

  ```bash
  pip install -r requirements.txt
  ```
- Install ChromeDriver:

  For Ubuntu/Debian:

  ```bash
  sudo apt-get update
  sudo apt-get install chromium-chromedriver
  ```

  For macOS:

  ```bash
  brew install chromedriver
  ```

  For Windows, download ChromeDriver from the official ChromeDriver website. (A quick Selenium sanity check is sketched after these setup steps.)
- Create a `.env` file in the project root:

  ```env
  POSTGRESQL_HOST=your_host
  POSTGRESQL_PORT=5432
  POSTGRESQL_USER=your_user
  POSTGRESQL_PASSWORD=your_password
  POSTGRESQL_DATABASE=your_database
  POSTGRESQL_TABLE=your_table
  SEEK_USERNAME=your_seek_email
  SEEK_PASSWORD=your_seek_password
  AI_API_KEY=your_api_key
  AI_API_URL=your_api_url
  AI_MODEL=your_model_name
  ```
Be sure to have PostgreSQL installed and running on your local machine or a remote server, with the required database and table schema set up (see the database schema section below). The database settings in your `.env` file must point to your PostgreSQL instance.
- Database Configuration:
  - Create your PostgreSQL database and user with appropriate privileges.
  - Define your database connection settings in the `.env` file.
- Parameters Configuration:
  - Customize the search parameters in the `search_params` dictionary of the `SeekSpider` class for targeted scraping.
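Before moving on, you can optionally confirm that Chrome and ChromeDriver are usable from Selenium. The snippet below is a minimal sanity check, assuming a Selenium 4+ install (the project's requirements should provide it); it launches a headless browser, prints its version, and quits:

```python
# Minimal sanity check (assumes Selenium 4+): launch a headless Chrome session,
# print the browser version, and close it again.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
print(driver.capabilities.get("browserVersion"))
driver.quit()
```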
You can run the spider in two ways:

- Using the main script:

  ```bash
  python main.py
  ```

- Using Scrapy directly:

  ```bash
  scrapy crawl seek
  ```
Upon execution, the spider will start to navigate through the job listings on SEEK and insert each job's data into the database using the pipeline.
Note: I'm not trying to be lazy, but you can simply run `main.py` instead : )
The spider makes use of the Seek Job Search API with several query parameters to tailor the search results according to specific needs. Below is a detailed explanation of these parameters used in the spider's query string:
```python
search_params = {
    'siteKey': 'AU-Main',          # Identifies the main SEEK Australia site
    'sourcesystem': 'houston',     # SEEK's internal system identifier
    'where': 'All Perth WA',       # Location filter
    'page': 1,                     # Current page number
    'seekSelectAllPages': 'true',  # Enable full page access
    'classification': '6281',      # IT jobs classification
    'subclassification': '',       # Specific IT category
    'include': 'seodata',          # Include SEO metadata
    'locale': 'en-AU',             # Australian English locale
}
```
Key API features:
- Systematic category traversal using subclassifications
- Automatic pagination handling
- Location-based filtering
- SEO data inclusion
- Locale support
Note: SEEK enforces a 26-page limit on job listings, so a single search can never return more than 26 pages of results. To work around this, the spider breaks the search into smaller chunks by iterating over subclassifications.
- The search parameters are converted to a query string with `urlencode`, which ensures they are properly formatted for the HTTP request (see the sketch below). Adjusting these parameters allows a wide range of searches, so the collected data can match different users' needs.
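A minimal sketch of how the query strings could be assembled, combining the `urlencode` step with the subclassification workaround for the 26-page cap. The helper name, `base_url` parameter, and subclassification list are illustrative placeholders, not values taken from the project's source:

```python
# Illustrative sketch only: builds one request URL per (subclassification, page)
# so that no single search exceeds SEEK's 26-page cap.
from urllib.parse import urlencode

def build_search_urls(base_url, base_params, subclassifications, max_pages=26):
    for sub in subclassifications:
        for page in range(1, max_pages + 1):
            params = dict(base_params, subclassification=sub, page=page)
            yield f"{base_url}?{urlencode(params)}"
```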
The spider automatically handles:
- URL encoding of parameters
- Authentication token management
- Request retries on failure
- Rate limiting compliance
- Response validation
These parameters are crucial for the spider's functionality as they dictate the scope and specificity of the web scraping task at hand. Users can modify these parameters as per their requirements to collect job listing data relevant to their own specific search criteria.
The core components of SeekSpider are responsible for database, logging, and AI integration.
The `DatabaseManager` class provides a centralized interface for all database operations:
- Connection and transaction management using context managers
- Parameterized queries to prevent SQL injection
- Automatic retry mechanism for failed operations
- Logging of all database operations
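For orientation, a stripped-down version of such an interface might look like the sketch below. The method names and the use of psycopg2 are assumptions for illustration, not the project's actual `DatabaseManager` code:

```python
# Illustrative sketch: context-managed connections plus parameterized queries.
from contextlib import contextmanager
import psycopg2

class DatabaseManager:
    def __init__(self, dsn):
        self.dsn = dsn

    @contextmanager
    def connection(self):
        """Open a connection, commit on success, roll back on error."""
        conn = psycopg2.connect(self.dsn)
        try:
            yield conn
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()

    def execute(self, query, params=None):
        """Run a parameterized query to avoid SQL injection."""
        with self.connection() as conn:
            with conn.cursor() as cur:
                cur.execute(query, params or ())
```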
The `Logger` class provides a unified logging interface:
- Console output with formatted messages
- Different log levels (INFO, ERROR, WARNING, DEBUG)
- Component-specific logging with named loggers
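A component-scoped logger along these lines can be built on the standard `logging` module; the sketch below is illustrative, and the project's actual `Logger` class may differ:

```python
# Sketch only: named, component-specific loggers with formatted console output.
import logging

def get_logger(component: str, level: int = logging.INFO) -> logging.Logger:
    logger = logging.getLogger(component)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s [%(name)s] %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

# Usage: get_logger("spider").info("Crawl started")
```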
The `AIClient` class handles all AI-related operations:
- Integration with AI APIs for text analysis
- Automatic retry for rate-limited requests
- Configurable request parameters
- Error handling and logging
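The retry-on-rate-limit idea can be pictured roughly as follows. The endpoint behaviour and OpenAI-style payload are assumptions about the configured `AI_API_URL`, not the project's actual `AIClient` code:

```python
# Sketch only: retries rate-limited requests with exponential backoff.
import time
import requests

def call_ai_api(api_url, api_key, model, prompt, max_retries=3):
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    for attempt in range(max_retries):
        response = requests.post(api_url, json=payload, headers=headers, timeout=60)
        if response.status_code == 429:  # rate limited: back off and try again
            time.sleep(2 ** attempt)
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("AI API request failed after retries")
```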
The utils package contains specialized analyzers:

- `TechStackAnalyzer`: Extracts technology stack information from job descriptions
- `SalaryNormalizer`: Standardizes salary information into a consistent format
- `TechStatsAnalyzer`: Generates statistics about technology usage
- `get_token`: Handles SEEK authentication and token management
The `SeekspiderItem` class is defined as a Scrapy Item. Items provide a means to collect the data scraped by the spider. The fields collected by this project are:
| Field Name | Description |
| --- | --- |
| `job_id` | The unique identifier for the job posting. |
| `job_title` | The title of the job. |
| `business_name` | The name of the business advertising the job. |
| `work_type` | The type of employment (e.g., full-time, part-time). |
| `job_description` | A description of the job and its responsibilities. |
| `pay_range` | The salary or range provided for the position. |
| `suburb` | The suburb where the job is located. |
| `area` | A broader area designation for the job location. |
| `url` | The direct URL to the job listing. |
| `advertiser_id` | The unique identifier for the advertiser of the job. |
| `job_type` | The classification of the job. |
| `posted_date` | The original posting date of the job. |
| `is_active` | Indicates whether the job listing is still active. |
| `expiry_date` | When the job listing expired (if applicable). |
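Based on the field table above, the item definition presumably looks roughly like the following sketch (the real `items.py` may differ in ordering or extra metadata):

```python
# Sketch of the item definition implied by the field table above.
import scrapy

class SeekspiderItem(scrapy.Item):
    job_id = scrapy.Field()
    job_title = scrapy.Field()
    business_name = scrapy.Field()
    work_type = scrapy.Field()
    job_description = scrapy.Field()
    pay_range = scrapy.Field()
    suburb = scrapy.Field()
    area = scrapy.Field()
    url = scrapy.Field()
    advertiser_id = scrapy.Field()
    job_type = scrapy.Field()
    posted_date = scrapy.Field()
    is_active = scrapy.Field()
    expiry_date = scrapy.Field()
```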
The heart of the SeekSpider project is the `scrapy.Spider` subclass that defines how job listings are scraped. It constructs the necessary HTTP requests, parses the responses returned from the web server, and extracts the data using selectors to populate `SeekspiderItem` objects.
The spider now includes several key components:
- Automated login using Selenium WebDriver
- Token management and refresh mechanism
- Secure credential handling
- Systematic traversal through IT job categories
- Smart pagination with subclassification support
- Detailed logging of category transitions
- API-based job listing retrieval
- BeautifulSoup integration for detailed parsing
- Robust error handling and retry logic
- Automated tech stack analysis
- Salary standardization
- Technology usage statistics generation
- Job status tracking (active/inactive)
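As one concrete illustration of the "BeautifulSoup integration for detailed parsing" listed above, extracting plain text from a job description HTML fragment could look like this (a sketch; the function name is hypothetical):

```python
# Sketch: pull plain text out of a job-description HTML fragment with BeautifulSoup.
from bs4 import BeautifulSoup

def extract_description(html_fragment: str) -> str:
    soup = BeautifulSoup(html_fragment, "html.parser")
    return soup.get_text(separator="\n", strip=True)
```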
Key pipeline features:
- Efficient database connection management
- Transaction support for data integrity
- Automatic job deactivation for expired listings
- Smart update/insert logic based on job ID
- Batch processing capabilities
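The "smart update/insert logic based on job ID" can be pictured as a PostgreSQL upsert keyed on the job's `Id`, roughly like the sketch below. It is written against the `"Jobs"` schema shown later in this README (remaining columns omitted for brevity) and is not the pipeline's literal code:

```python
# Sketch of the update-or-insert step keyed on job ID.
UPSERT_SQL = """
INSERT INTO "Jobs" ("Id", "JobTitle", "BusinessName", "Url", "IsActive")
VALUES (%s, %s, %s, %s, TRUE)
ON CONFLICT ("Id") DO UPDATE
SET "JobTitle"     = EXCLUDED."JobTitle",
    "BusinessName" = EXCLUDED."BusinessName",
    "Url"          = EXCLUDED."Url",
    "UpdatedAt"    = CURRENT_TIMESTAMP,
    "IsActive"     = TRUE;
"""

def store_job(cursor, item):
    cursor.execute(
        UPSERT_SQL,
        (item["job_id"], item["job_title"], item["business_name"], item["url"]),
    )
```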
I have intentionally slowed the spider down to avoid bans. If you feel it is too slow, try increasing `CONCURRENT_REQUESTS` and decreasing `DOWNLOAD_DELAY`.
Additional important settings:
- `CONCURRENT_REQUESTS = 16`: Concurrent request limit
- `DOWNLOAD_DELAY = 2`: Delay between requests (in seconds)
- Custom retry middleware configuration
- Logging level configuration
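These map directly onto Scrapy's `settings.py`; the relevant lines would look something like this (values as documented above, comments added for clarity):

```python
# Crawl-politeness settings in Scrapy's settings.py (values as listed above).
CONCURRENT_REQUESTS = 16  # how many requests Scrapy issues in parallel
DOWNLOAD_DELAY = 2        # seconds to wait between consecutive requests
RETRY_ENABLED = True      # let the retry middleware re-issue failed requests
LOG_LEVEL = "INFO"        # raise to WARNING for quieter output, lower to DEBUG for more detail
```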
The project uses a centralized configuration management system through the `Config` class and environment variables. All configuration is loaded from a `.env` file in the project root. Before running the spider, create a `.env` file with the following configuration:
```env
# Database Configuration
POSTGRESQL_HOST=your_host
POSTGRESQL_PORT=5432
POSTGRESQL_USER=your_user
POSTGRESQL_PASSWORD=your_password
POSTGRESQL_DATABASE=your_database
POSTGRESQL_TABLE=your_table

# SEEK Credentials
SEEK_USERNAME=your_seek_email
SEEK_PASSWORD=your_seek_password

# AI API Configuration
AI_API_KEY=your_api_key
AI_API_URL=your_api_url
AI_MODEL=your_model_name
```
- Make sure to add `.env` to your `.gitignore` file to prevent sensitive information from being committed to your repository.
Key configuration features:
- Environment-based configuration management
- Automatic validation of required settings
- Secure credential handling
- Centralized configuration access
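A `Config` class along these lines could be implemented with `python-dotenv`, as in the sketch below. The attribute names, and the use of `python-dotenv` itself, are assumptions rather than the project's actual implementation:

```python
# Sketch only: loads the .env file and validates that required keys are present.
import os
from dotenv import load_dotenv

REQUIRED_KEYS = [
    "POSTGRESQL_HOST", "POSTGRESQL_PORT", "POSTGRESQL_USER",
    "POSTGRESQL_PASSWORD", "POSTGRESQL_DATABASE", "POSTGRESQL_TABLE",
    "SEEK_USERNAME", "SEEK_PASSWORD",
    "AI_API_KEY", "AI_API_URL", "AI_MODEL",
]

class Config:
    def __init__(self):
        load_dotenv()  # read .env from the project root
        missing = [k for k in REQUIRED_KEYS if not os.getenv(k)]
        if missing:
            raise ValueError(f"Missing required settings: {', '.join(missing)}")
        self.settings = {k: os.getenv(k) for k in REQUIRED_KEYS}

    def get(self, key):
        return self.settings[key]
```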
You can tweak crawl parameters like search location, category, and job type in the spider's `search_params` dictionary:
```python
search_params = {
    'siteKey': 'AU-Main',
    'sourcesystem': 'houston',
    'where': 'All Perth WA',
    'page': 1,
    'seekSelectAllPages': 'true',
    'classification': '6281',
    'subclassification': '',
    'include': 'seodata',
    'locale': 'en-AU',
}
```
The spider also supports configuration of:
- AI analysis parameters
- Database connection settings
- Logging levels and formats
- Retry mechanisms and delays
- Authentication parameters
Make sure that your PostgreSQL database has a table with the correct schema to store the data. Below is a guideline schema based on the fields defined in `SeekspiderItem`:
CREATE TABLE "Jobs"
(
"Id" INTEGER PRIMARY KEY,
"JobTitle" VARCHAR(255),
"BusinessName" VARCHAR(255),
"WorkType" VARCHAR(50),
"JobDescription" TEXT,
"PayRange" VARCHAR(255),
"Suburb" VARCHAR(255),
"Area" VARCHAR(255),
"Url" VARCHAR(255),
"AdvertiserId" INTEGER,
"JobType" VARCHAR(50),
"CreatedAt" TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
"UpdatedAt" TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
"ExpiryDate" TIMESTAMP,
"IsActive" BOOLEAN DEFAULT TRUE,
"IsNew" BOOLEAN DEFAULT TRUE,
"PostedDate" TIMESTAMP,
"TechStack" JSONB,
"MinSalary" INTEGER,
"MaxSalary" INTEGER
);
CREATE TABLE "tech_word_frequency"
(
"word" VARCHAR(255) PRIMARY KEY,
"frequency" INTEGER,
"created_at" TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Please include relevant indices based on your query patterns for optimal performance.
Recommended indices:
```sql
CREATE INDEX idx_jobs_tech_stack ON "Jobs" USING GIN ("TechStack");
CREATE INDEX idx_jobs_salary ON "Jobs" ("MinSalary", "MaxSalary");
CREATE INDEX idx_jobs_active ON "Jobs" ("IsActive");
CREATE INDEX idx_jobs_posted_date ON "Jobs" ("PostedDate");
```
Key database features:
- JSONB support for flexible tech stack storage
- Automated timestamp management
- Salary range normalization
- Active job tracking
- Tech stack frequency analysis
Contributions are welcome. Fork the project, make your changes, and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
The content for this README was generated with the assistance of Generative AI, ensuring accuracy and efficiency in delivering the information needed to understand and use the project.
(Again, I'm not a lazy boy, definitely)