Merge pull request #12 from armand1m/fix/export-utilities
fix: export utilities and add managed jsdom example
armand1m authored Nov 15, 2021
2 parents c48dff8 + 132038b commit 1100792
Showing 4 changed files with 77 additions and 1 deletion.
5 changes: 4 additions & 1 deletion examples/typescript/src/hacker-news/scraper.ts
@@ -14,7 +14,10 @@ const main = async () => {
     baseUrl: "https://news.ycombinator.com/",
     target: ".athing",
     selectors: {
-      rank: ({ text }) => text('.rank'),
+      rank: (utils) => {
+        const value = utils.text('.rank').replace(/^\D+/g, '');
+        return Number(value);
+      },
       name: ({ text }) => text('.titlelink'),
       url: ({ href }) => href('.titlelink'),
       score: ({ element }) => {
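
The `rank` selector in this hunk now returns a number instead of the raw text. A minimal standalone sketch of that conversion may help when reviewing it. `parseRank` is a hypothetical name used only for illustration, and the sample inputs (such as `'1.'`) are assumptions about what `.rank` cells contain, not values taken from the commit.

```typescript
// Hypothetical helper mirroring the body of the new `rank` selector.
const parseRank = (rawText: string): number => {
  // Drop any leading non-digit characters, then let Number() parse the rest.
  const value = rawText.replace(/^\D+/g, '');
  return Number(value);
};

console.log(parseRank('1.'));    // 1
console.log(parseRank('#12'));   // 12
console.log(parseRank(''));      // 0, because Number('') is 0
console.log(parseRank('3 pts')); // NaN, trailing text is not stripped
```

Note that only leading non-digits are removed, so an empty string converts to `0` rather than to a missing value, and trailing non-numeric text (other than a bare trailing dot) yields `NaN`.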
55 changes: 55 additions & 0 deletions examples/typescript/src/managed-jsdom/scraper.ts
@@ -0,0 +1,55 @@
import pino from 'pino'
import { scrape, fetchPage, createWindow } from '@armand1m/papercut';

const main = async () => {
  const logger = pino({
    name: 'Hacker News',
    enabled: false
  });

  const rawHTML = await fetchPage('https://news.ycombinator.com/')
  const window = createWindow(rawHTML);

  const results = await scrape({
    strict: true,
    logger,
    document: window.document,
    target: ".athing",
    selectors: {
      rank: (utils) => {
        const value = utils.text('.rank').replace(/^\D+/g, '');
        return Number(value);
      },
      name: ({ text }) => text('.titlelink'),
      url: ({ href }) => href('.titlelink'),
      score: ({ element }) => {
        return element.nextElementSibling?.querySelector('.score')
          ?.textContent;
      },
      createdBy: ({ element }) => {
        return element.nextElementSibling?.querySelector('.hnuser')
          ?.textContent;
      },
      createdAt: ({ element }) => {
        return element.nextElementSibling
          ?.querySelector('.age')
          ?.getAttribute('title');
      },
    },
    options: {
      log: false,
      cache: true,
      concurrency: {
        page: 2,
        node: 2,
        selector: 2
      }
    }
  });

  window.close();

  console.log(JSON.stringify(results, null, 2));
};

main();
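
The `score`, `createdBy`, and `createdAt` selectors in the file above read from the row that follows each `.athing` element and rely on optional chaining, so a missing sibling or a missing child yields `undefined` instead of throwing. The sketch below illustrates that behaviour without a real DOM. `NodeLike`, `fakeNode`, and `readScore` are hypothetical names introduced for illustration only and are not part of papercut or JSDOM.

```typescript
// Hypothetical stand-in for the small part of the DOM Element API the selectors use.
interface NodeLike {
  textContent: string | null;
  nextElementSibling: NodeLike | null;
  querySelector(selector: string): NodeLike | null;
}

// Builds a fake node whose querySelector looks children up by selector string.
const fakeNode = (
  textContent: string | null,
  children: Record<string, NodeLike> = {},
  nextElementSibling: NodeLike | null = null
): NodeLike => ({
  textContent,
  nextElementSibling,
  querySelector: (selector) => children[selector] ?? null,
});

// Same shape as the `score` selector in the example above.
const readScore = (element: NodeLike): string | null | undefined =>
  element.nextElementSibling?.querySelector('.score')?.textContent;

const subtext = fakeNode(null, { '.score': fakeNode('42 points') });
const withScore = fakeNode('story', {}, subtext);
const withoutSibling = fakeNode('story');
const withoutScore = fakeNode('story', {}, fakeNode(null));

console.log(readScore(withScore));      // 42 points
console.log(readScore(withoutSibling)); // undefined
console.log(readScore(withoutScore));   // undefined
```

Because each step short-circuits, callers should expect `undefined` for rows that lack the element being queried.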
3 changes: 3 additions & 0 deletions src/index.ts
@@ -21,3 +21,6 @@ export type { SelectorUtilities } from './selectors/createSelectorUtilities';

export { geosearch } from './http/geosearch';
export type { GeosearchResult } from './http/geosearch';

export { fetchPage } from './http/fetchPage';
export { createWindow } from './utilities/createWindow';
15 changes: 15 additions & 0 deletions template.md
@@ -104,6 +104,21 @@ Then run it using `node` or `ts-node`:
npx ts-node ./paginated-scraper.ts
```

#### Managed JSDOM

If you want to use your own JSDOM and Pino instances and configure them however you prefer, you can use the `scrape` function instead.

In the example below, we use the exposed `createWindow` and `fetchPage` utilities for convenience. You can also use the JSDOM constructor directly, along with any other strategy for fetching your page HTML.

```ts file=./examples/typescript/src/managed-jsdom/scraper.ts
```

Then run it using `node` or `ts-node`:

```sh
npx ts-node ./managed-jsdom.ts
```

## API Reference

[Click here to open the API reference.](https://armand1m.github.io/papercut)