
Scrapy #296

Open
uniquejava opened this issue Apr 12, 2020 · 0 comments
A Python environment = a Python interpreter + its installed packages.
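Both halves of that equation can be inspected from the standard library: the interpreter path and the directory where installed packages live.

```python
import sys
import sysconfig

# The interpreter half of the environment.
print(sys.executable)

# The installed-packages half: the site-packages directory
# that pip installs into for this interpreter.
print(sysconfig.get_paths()["purelib"])
```

Inside a virtualenv, both paths point under the env's directory rather than the system Python.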

pipenv in detail

Setup (pipenv — its performance turned out to be too poor; I abandoned it and went back to venv)

# Install globally, under pyenv
$ pip install pipenv

# Create an isolated Python env (just like virtualenv)

$ cd ~/gitibm/scrapy_tutorial
$ pipenv shell
✔ Successfully created virtual environment!
Virtualenv location: /Users/xxx/.local/share/virtualenvs/scrapy_tutorial-RsU7P9xB

# Configure pipenv to use the Aliyun mirror
$ vi Pipfile

[[source]]
name = "pypi"
url = "https://mirrors.aliyun.com/pypi/simple/"
verify_ssl = true
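For context, a complete Pipfile after installing scrapy might look like the sketch below; the scrapy version constraint and the `python_version` value are illustrative, not what pipenv will necessarily write for you.

```toml
[[source]]
name = "pypi"
url = "https://mirrors.aliyun.com/pypi/simple/"
verify_ssl = true

[packages]
scrapy = "*"

[dev-packages]

[requires]
python_version = "3.8"
```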

# List installed packages in the current environment
$ pipenv run pip freeze

# Install dependencies
$ pipenv install scrapy

Create a project

scrapy startproject book_crawler

book_crawler/
    scrapy.cfg           <-- Configuration file (DO NOT TOUCH!)
    book_crawler/
        __init__.py      <-- Empty file that marks this as a Python package
        items.py         <-- Model of the item to scrape
        middlewares.py   <-- Scrapy processing hooks (DO NOT TOUCH)
        pipelines.py     <-- What to do with the scraped items
        settings.py      <-- Project settings file
        spiders/         <-- Directory for our spiders (empty for now)
            __init__.py

Create a spider

scrapy genspider fiction books.toscrape.com

# -*- coding: utf-8 -*-
import scrapy


class FictionSpider(scrapy.Spider):
    name = 'fiction'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass

Run

scrapy crawl fiction

Save the results

scrapy crawl fiction -o books.json
scrapy crawl fiction -o books.csv
scrapy crawl fiction -o books.xml
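Note that in newer Scrapy versions `-o` appends to an existing file while `-O` overwrites it. The same exports can also be configured in `settings.py` via the `FEEDS` setting (Scrapy >= 2.1); the file names below are just examples:

```python
# settings.py (excerpt) -- equivalent of the -o / -O flags above
FEEDS = {
    'books.json': {'format': 'json', 'overwrite': True},
    'books.csv': {'format': 'csv'},
}
```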

Debug in the Scrapy shell

$ scrapy shell 'http://books.toscrape.com/'
 >>> response.css(...)
 >>> response.xpath(...)

extract() is like querySelectorAll: it returns a list of all matches.
extract_first() is like querySelector: it returns the first matching element.

References

  1. Creating your first spider - 01 - Python scrapy tutorial for beginners