Commit 1a1f5ae

add readme and publish workflow

sbusso committed Mar 23, 2024
1 parent 1019917 commit 1a1f5ae
Showing 2 changed files with 117 additions and 1 deletion.
19 changes: 19 additions & 0 deletions .github/workflows/publish.yml
```yaml
name: Publish Python 🐍 distributions 📦 to PyPI

on:
  release:

jobs:
  pypi-publish:
    name: Upload release to PyPI
    runs-on: ubuntu-latest
    # environment:
    #   name: github
    #   url: https://pypi.org/p/scrapework
    permissions:
      id-token: write # IMPORTANT: this permission is mandatory for trusted publishing
    steps:
      # retrieve your distributions here

      - name: Publish package distributions to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
```
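
The `steps` block above leaves retrieving the distributions as a placeholder. A minimal sketch of one common way to fill it — the checkout, setup-python, and `python -m build` steps below are illustrative assumptions, not part of this commit:

```yaml
steps:
  - uses: actions/checkout@v4

  - uses: actions/setup-python@v5
    with:
      python-version: "3.x"

  # Build the sdist and wheel into dist/, where the publish action looks by default
  - run: |
      python -m pip install build
      python -m build

  - name: Publish package distributions to PyPI
    uses: pypa/gh-action-pypi-publish@release/v1
```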
99 changes: 98 additions & 1 deletion README.md
# Scrapework

Scrapework is a simple scraping framework inspired by Scrapy. It is designed for straightforward scraping tasks, letting you focus on the scraping logic rather than on boilerplate code.

- No CLI
- No twisted / async
- Respectful of the websites it scrapes: crawling is deliberately slow

## Getting Started

### Installation

First, clone the repository and install the dependencies:

```sh
git clone https://github.com/yourusername/scrapework.git
cd scrapework
poetry install
```
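
If a release has been published through the workflow above (which targets the `scrapework` project on PyPI), installing from PyPI should presumably also work:

```sh
pip install scrapework
```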

### Creating a Spider

A Spider is a class that defines how to navigate a website and extract data. Here's how you can create a Spider:

```python
from scrapework.spider import Spider

class MySpider(Spider):
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
            }
```

The `parse` method is where you define your scraping logic. It is called with the HTTP response for each of the start URLs.

### Creating an Extractor

An Extractor is a class that defines how to extract data from a webpage. Here's how you can create an Extractor:

```python
from scrapework.extractors import Extractor

class MyExtractor(Extractor):
    def extract(self, selector):
        return {
            'text': selector.css('span.text::text').get(),
            'author': selector.css('span small::text').get(),
        }
```

The `extract` method is where you define your extraction logic. It's called with a `parsel.Selector` object that you can use to extract data from the HTML.
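
Because `extract` takes a plain `parsel.Selector`, you can exercise an Extractor on its own. A minimal sketch, assuming `MyExtractor` can be instantiated with no arguments; the HTML snippet is hypothetical:

```python
from parsel import Selector

# Hypothetical markup mimicking one quote block on quotes.toscrape.com
html = '''
<div class="quote">
  <span class="text">"An example quote."</span>
  <span>by <small>Jane Doe</small></span>
</div>
'''

item = MyExtractor().extract(Selector(text=html))
print(item)  # {'text': '"An example quote."', 'author': 'Jane Doe'}
```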

### Creating a Pipeline

A Pipeline is a class that defines how to process and store the data. Here's how you can create a Pipeline:

```python
from scrapework.pipelines import ItemPipeline

class MyPipeline(ItemPipeline):
    def process_items(self, items, config):
        for item in items:
            print(f"Quote: {item['text']}, Author: {item['author']}")
```

The `process_items` method is where you define your processing logic. It's called with the items extracted by the Extractor and a `PipelineConfig` object.
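
As a rough sketch, you could feed a pipeline a list of dicts directly. `PipelineConfig` is not documented here, so `None` stands in for a real config object, and the no-argument constructor is an assumption:

```python
items = [
    {'text': 'An example quote.', 'author': 'Jane Doe'},
]

# None is a placeholder: a real run would pass a PipelineConfig instance.
MyPipeline().process_items(items, config=None)
# Output: Quote: An example quote., Author: Jane Doe
```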

### Running the Spider

To run the Spider, you need to create an instance of it and call the `start_requests` method:

```python
spider = MySpider()
spider.start_requests()
```

## Advanced Usage

For more advanced usage, you can override other methods in the Spider, Extractor, and Pipeline classes. Check the source code for more details.

## Testing

To run the tests, use the following command:

```sh
pytest tests/
```

## Contributing

Contributions are welcome! Please read the contributing guidelines first.

## License

Scrapework is licensed under the MIT License.
