Rewrite with Node, support new website #8

Open
wants to merge 9 commits into
base: master
9 changes: 9 additions & 0 deletions .devcontainer/Dockerfile
@@ -0,0 +1,9 @@
FROM library/node:16-bookworm

ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update \
&& apt-get install -y --no-install-recommends sudo \
&& apt-get autoremove -y \
&& rm -rf /var/lib/apt/lists/* \
&& echo "node ALL=(ALL) NOPASSWD: ALL" >/etc/sudoers.d/node \
&& chmod 0440 /etc/sudoers.d/node
5 changes: 5 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,5 @@
{
"name": "Basic Node.js",
"build": { "dockerfile": "Dockerfile" },
"remoteUser": "node"
}
130 changes: 130 additions & 0 deletions .gitignore
@@ -2,3 +2,133 @@
data.sqlite
vendor

# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
lerna-debug.log*
.pnpm-debug.log*

# Diagnostic reports (https://nodejs.org/api/report.html)
report.[0-9]*.[0-9]*.[0-9]*.[0-9]*.json

# Runtime data
pids
*.pid
*.seed
*.pid.lock

# Directory for instrumented libs generated by jscoverage/JSCover
lib-cov

# Coverage directory used by tools like istanbul
coverage
*.lcov

# nyc test coverage
.nyc_output

# Grunt intermediate storage (https://gruntjs.com/creating-plugins#storing-task-files)
.grunt

# Bower dependency directory (https://bower.io/)
bower_components

# node-waf configuration
.lock-wscript

# Compiled binary addons (https://nodejs.org/api/addons.html)
build/Release

# Dependency directories
node_modules/
jspm_packages/

# Snowpack dependency directory (https://snowpack.dev/)
web_modules/

# TypeScript cache
*.tsbuildinfo

# Optional npm cache directory
.npm

# Optional eslint cache
.eslintcache

# Optional stylelint cache
.stylelintcache

# Microbundle cache
.rpt2_cache/
.rts2_cache_cjs/
.rts2_cache_es/
.rts2_cache_umd/

# Optional REPL history
.node_repl_history

# Output of 'npm pack'
*.tgz

# Yarn Integrity file
.yarn-integrity

# dotenv environment variable files
.env
.env.development.local
.env.test.local
.env.production.local
.env.local

# parcel-bundler cache (https://parceljs.org/)
.cache
.parcel-cache

# Next.js build output
.next
out

# Nuxt.js build / generate output
.nuxt
dist

# Gatsby files
.cache/
# Comment in the public line in if your project uses Gatsby and not Next.js
# https://nextjs.org/blog/next-9-1#public-directory-support
# public

# vuepress build output
.vuepress/dist

# vuepress v2.x temp and cache directory
.temp
.cache

# Docusaurus cache and generated files
.docusaurus

# Serverless directories
.serverless/

# FuseBox cache
.fusebox/

# DynamoDB Local files
.dynamodb/

# TernJS port file
.tern-port

# Stores VSCode versions used for testing VSCode extensions
.vscode-test

# yarn v2
.yarn/cache
.yarn/unplugged
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
"typescript.tsdk": "node_modules/typescript/lib"
}
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Mark Donnellon

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
22 changes: 20 additions & 2 deletions README.md
@@ -4,7 +4,25 @@
* Cookie tracking - No
* Pagination - No
* Javascript - No
* Clearly defined data within a row - No
* Clearly defined data within a row - Yes

# Local Development
## Prerequisites
- Node 16 - BEWARE: this is an old version, but it is the one supported on morph.io. See `nvm` if you need to switch between versions.
- OPTIONAL: The `./.devcontainer` directory contains config for running a Node 16 Dev Container. See https://code.visualstudio.com/docs/devcontainers/containers
- Visual Studio Code - Highly recommended as your editor for its built-in type checking.

Enjoy
## Getting Started
In a terminal, run the following commands:
```
npm install
npm run dev
```
This will run the scraper whenever you save a change to a file.

Edit the `.ts` files in the `src` directory to make changes. These are TypeScript files that are "compiled" to JavaScript files in the `build` directory.

## Committing
Ensure you have run `npm run dev` or `npm run tsc` before committing, and include the contents of `./build` in the commit.

morph.io runs the built JavaScript, not the TypeScript files.
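
Because the built files have to be committed alongside the sources, a stale `build` directory is easy to miss. The following is a minimal sketch of a check for that, assuming the flat `src`/`build` layout described above. The function name `findStaleBuilds` is hypothetical and is not part of this repository.

```javascript
// Hypothetical helper (not part of this repository): lists TypeScript sources
// whose compiled output in the build directory is missing or older than the source.
const fs = require("node:fs");
const path = require("node:path");

function findStaleBuilds(srcDir, buildDir) {
  return fs
    .readdirSync(srcDir)
    .filter((name) => name.endsWith(".ts"))
    .filter((name) => {
      const built = path.join(buildDir, name.replace(/\.ts$/, ".js"));
      if (!fs.existsSync(built)) return true; // never compiled
      const srcTime = fs.statSync(path.join(srcDir, name)).mtimeMs;
      return fs.statSync(built).mtimeMs < srcTime; // compiled before the last edit
    });
}
```

Calling `findStaleBuilds("src", "build")` from the repository root returns the names of any sources that still need compiling. Modification times are only a heuristic, so running the TypeScript build before committing remains the reliable option.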
86 changes: 86 additions & 0 deletions build/db.js
@@ -0,0 +1,86 @@
"use strict";
var __importDefault = (this && this.__importDefault) || function (mod) {
return (mod && mod.__esModule) ? mod : { "default": mod };
};
Object.defineProperty(exports, "__esModule", { value: true });
exports.insertData = exports.getDb = exports.fieldNames = void 0;
const sqlite3_1 = __importDefault(require("sqlite3"));
/** The field names in the SQL database.
* Use this for consistent ordering of fields in queries.
*/
exports.fieldNames = [
"council_reference",
"address",
"description",
"info_url",
"comment_url",
"date_scraped",
"on_notice_from",
"on_notice_to",
"documents",
];
const dbPromise = new Promise((resolve, reject) => {
const db = new sqlite3_1.default.Database("data.sqlite", (err) => {
if (err === null)
resolve(db);
else
reject(err);
});
});
async function getDb() {
return dbPromise;
}
exports.getDb = getDb;
getDb().then((db) => {
db.serialize();
const createFields = exports.fieldNames
.map((f, i) => {
if (i === 0)
return `${f} TEXT PRIMARY KEY`;
return `${f} TEXT`;
})
.join(", ");
const createQuery = `CREATE TABLE IF NOT EXISTS data (${createFields})`;
//Create new table
console.log(`createQuery:`, createQuery);
db.run(createQuery);
// add the documents column if it doesn't exist
const checkQuery = `PRAGMA table_info(data)`;
db.all(checkQuery, function (err, rows) {
if (err) {
console.error(err.message);
return;
}
const rowExists = !!rows.find((r) => r.name === "documents");
if (!rowExists) {
// Column doesn't exist, execute the ALTER TABLE statement
db.run(`ALTER TABLE data ADD COLUMN documents TEXT`, function (err) {
if (err) {
console.error(err.message);
return;
}
console.log('Column "documents" added to the table "data"');
});
}
});
});
async function insertData(data) {
const db = await getDb();
const insertFields = exports.fieldNames.join(", ");
/** Morph.io appears to persist the database across scraper runs.
* This should be enough to insert new DAs, update DAs when they change,
* and keep their data when they are removed from the website.
*/
const insertQuery = `INSERT OR REPLACE INTO data (${insertFields}) VALUES (${exports.fieldNames
.map(() => "?")
.join(", ")})`;
console.log(`insertQuery:`, insertQuery);
/** Insert new records */
var statement = db.prepare(insertQuery);
data.forEach((record) => {
statement.run(record[exports.fieldNames[0]], record[exports.fieldNames[1]], record[exports.fieldNames[2]], record[exports.fieldNames[3]], record[exports.fieldNames[4]], record[exports.fieldNames[5]], record[exports.fieldNames[6]], record[exports.fieldNames[7]], record[exports.fieldNames[8]]);
});
statement.finalize();
console.log("Inserted/updated", data.length, "records");
}
exports.insertData = insertData;
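
The `statement.run(...)` call in `insertData` indexes `fieldNames` nine times by hand, so the query and its parameters can drift apart if a field is added. As a sketch only, the query text and the parameter list could both be derived from the same array. The `buildInsert` helper and the shortened field list below are illustrative assumptions, not code from this PR.

```javascript
// Sketch only: builds the same kind of upsert query as insertData, but derives
// the parameter list from the field names instead of indexing them one by one.
// The field list is shortened here for illustration.
const fieldNames = ["council_reference", "address", "description"];

// Hypothetical helper: returns the SQL text and the matching parameter array.
function buildInsert(fields, record) {
  const placeholders = fields.map(() => "?").join(", ");
  const sql = `INSERT OR REPLACE INTO data (${fields.join(", ")}) VALUES (${placeholders})`;
  // Missing fields become null so that every placeholder has a value.
  const params = fields.map((f) => (record[f] === undefined ? null : record[f]));
  return { sql, params };
}

const { sql, params } = buildInsert(fieldNames, {
  council_reference: "DA-1",
  address: "1 Example St, Tasmania",
});
console.log(sql);
console.log(params);
```

With the prepared statement from `insertData`, the parameters could then be spread into the existing call shape as `statement.run(...params)`.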
84 changes: 84 additions & 0 deletions build/index.js
@@ -0,0 +1,84 @@
"use strict";
var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) {
if (k2 === undefined) k2 = k;
var desc = Object.getOwnPropertyDescriptor(m, k);
if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) {
desc = { enumerable: true, get: function() { return m[k]; } };
}
Object.defineProperty(o, k2, desc);
}) : (function(o, m, k, k2) {
if (k2 === undefined) k2 = k;
o[k2] = m[k];
}));
var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) {
Object.defineProperty(o, "default", { enumerable: true, value: v });
}) : function(o, v) {
o["default"] = v;
});
var __importStar = (this && this.__importStar) || function (mod) {
if (mod && mod.__esModule) return mod;
var result = {};
if (mod != null) for (var k in mod) if (k !== "default" && Object.prototype.hasOwnProperty.call(mod, k)) __createBinding(result, mod, k);
__setModuleDefault(result, mod);
return result;
};
var __importDefault = (this && this.__importDefault) || function (mod) {
return (mod && mod.__esModule) ? mod : { "default": mod };
};
Object.defineProperty(exports, "__esModule", { value: true });
const request_promise_1 = __importDefault(require("request-promise"));
const cheerio = __importStar(require("cheerio"));
const sqlite3_1 = __importDefault(require("sqlite3"));
const luxon_1 = require("luxon");
const db_1 = require("./db");
sqlite3_1.default.verbose();
const info_url = "https://www.kingborough.tas.gov.au/development/planning-notices/";
const comment_url = "mailto:[email protected]";
(async () => {
const $ = await (0, request_promise_1.default)({
uri: "https://www.kingborough.tas.gov.au/development/planning-notices/",
transform: (body) => cheerio.load(body),
});
/** Table rows parsed in to the database fields */
const data = $("#list tbody tr")
.toArray()
.map((el) => {
const cells = $(el).find("td");
/** The first 5 fields are just simple strings */
const strings = cells
.toArray()
.map((el) => $(el).text().trim())
.slice(0, 5);
/** The 6th field contains links to 1 or more PDFs, including:
* - Development Application
* - Bushfire Hazard Assessments
* - Environmental Impact Assessments
* - etc
*/
const documents = $(el)
.find("a")
.toArray()
.map((el) => $(el).attr("href"))
.filter((s) => !!s);
/** Assign the string fields to variables */
const [council_reference, address, on_notice_from, on_notice_to, description,] = strings;
return {
council_reference,
address: `${address}, Tasmania`,
description,
info_url,
comment_url,
date_scraped: luxon_1.DateTime.now().toISODate(),
on_notice_from:
/** Convert the date strings from localised version to ISO */
luxon_1.DateTime.fromFormat(on_notice_from, "d MMM yyyy").toISODate() || "",
on_notice_to: luxon_1.DateTime.fromFormat(on_notice_to, "d MMM yyyy").toISODate() || "",
/** Dump the additional PDF links in to this extra variable
* and figure out what to do with them later 🤷‍♂️.
* morph.io API could be used to access this and download files */
documents: JSON.stringify(documents),
};
});
console.log(data);
(0, db_1.insertData)(data);
})();
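
The notice dates are converted from the site's `d MMM yyyy` format to ISO dates, with an empty string as the fallback when parsing fails. To illustrate that mapping without any dependencies, here is a rough sketch using only built-in JavaScript. It is not luxon's implementation, the `toIsoDate` name is hypothetical, and unlike a full date library it only accepts English three-letter month abbreviations with an initial capital.

```javascript
// Dependency-free sketch of the "d MMM yyyy" -> ISO date conversion.
// This is NOT how luxon works internally; it only illustrates the mapping.
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Hypothetical helper: returns "yyyy-mm-dd", or "" when the text does not parse.
function toIsoDate(text) {
  const match = /^(\d{1,2}) ([A-Za-z]{3}) (\d{4})$/.exec(text.trim());
  if (!match) return "";
  const [, day, mon, year] = match;
  const monthIndex = MONTHS.indexOf(mon);
  if (monthIndex === -1) return "";
  const d = Number(day);
  const iso = `${year}-${String(monthIndex + 1).padStart(2, "0")}-${String(d).padStart(2, "0")}`;
  // Reject impossible days (for example 31 Feb) by round-tripping through Date.
  const check = new Date(`${iso}T00:00:00Z`);
  if (Number.isNaN(check.getTime()) || check.getUTCDate() !== d) return "";
  return iso;
}

console.log(toIsoDate("5 Jan 2023")); // prints "2023-01-05"
console.log(toIsoDate("not a date")); // prints ""
```

Returning an empty string mirrors the `|| ""` fallback in the scraper, so a malformed cell yields a blank date instead of aborting the run.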
17 changes: 0 additions & 17 deletions composer.json

This file was deleted.
