    Repositories list

    • extruct
      Extract embedded metadata from HTML markup
      Python
      BSD 3-Clause "New" or "Revised" License
      1138493814 Updated Nov 8, 2024
    • Formasaurus tells you the type of an HTML form and its fields using machine learning
      HTML
      48700 Updated Nov 7, 2024
    • Extract price amount and currency symbol from a raw text string
      Python
      BSD 3-Clause "New" or "Revised" License
      50316179 Updated Nov 6, 2024
    • Page Object pattern for Scrapy
      Python
      BSD 3-Clause "New" or "Revised" License
      2811994 Updated Nov 4, 2024
    • spidermon
      Scrapy extension for monitoring spider execution.
      Python
      BSD 3-Clause "New" or "Revised" License
      97533396 Updated Oct 31, 2024
    • Python
      BSD 3-Clause "New" or "Revised" License
      141321 Updated Oct 26, 2024
    • Python parser for human-readable dates
      Python
      BSD 3-Clause "New" or "Revised" License
      4672.6k28750 Updated Oct 25, 2024
    • Parse numbers written in natural language
      Python
      BSD 3-Clause "New" or "Revised" License
      23109126 Updated Oct 23, 2024
    • Software stack with latest Scrapy and updated deps
      Dockerfile
      BSD 3-Clause "New" or "Revised" License
      206220 Updated Oct 22, 2024
    • web-poet
      Web scraping Page Objects core library
      Python
      BSD 3-Clause "New" or "Revised" License
      15951413 Updated Oct 16, 2024
    • andi
      Library for annotation-based dependency injection
      Python
      BSD 3-Clause "New" or "Revised" License
      52131 Updated Oct 16, 2024
    • A Python binding for CRFsuite
      Python
      MIT License
      221770453 Updated Oct 1, 2024
    • streamparse lets you run Python code against real-time streams of data. Integrates with Apache Storm.
      Python
      Apache License 2.0
      218201 Updated Sep 20, 2024
    • splash
      Lightweight, scriptable browser as a service with an HTTP API
      Python
      BSD 3-Clause "New" or "Revised" License
      5134.1k37726 Updated Aug 2, 2024
    • A Postgres-backed ContentsManager implementation for IPython
      Python
      Apache License 2.0
      83201 Updated Jul 18, 2024
    • Crawl Frontier HCF backend
      Python
      BSD 3-Clause "New" or "Revised" License
      5721 Updated Jul 17, 2024
    • shublang
      Pluggable DSL that uses pipes to perform a series of linear transformations to extract data
      Python
      BSD 3-Clause "New" or "Revised" License
      815236 Updated Jul 9, 2024
    • Scrapy entrypoint for Scrapinghub job runner
      Python
      BSD 3-Clause "New" or "Revised" License
      162570 Updated Jul 8, 2024
    • An opinionated fork of the Drone CI system
      Go
      Other
      369005 Updated Jul 7, 2024
    • varanus
      A command line spider monitoring tool
      Python
      7822 Updated Jul 6, 2024
    • scrapyrt
      HTTP API for Scrapy spiders
      Python
      BSD 3-Clause "New" or "Revised" License
      162833246 Updated Jun 28, 2024
    • portia
      Visual scraping for Scrapy
      Python
      BSD 3-Clause "New" or "Revised" License
      1.4k9.3k11119 Updated Jun 26, 2024
    • scikit-learn inspired API for CRFsuite
      Python
      215200 Updated Jun 18, 2024
    • Python
      MIT License
      2403 Updated Jun 17, 2024
    • autologin
      A project that attempts to log in to a website automatically, given a single seed
      Python
      Apache License 2.0
      431102 Updated Jun 17, 2024
    • Python wrapper for the Intercom API.
      Python
      Other
      143101 Updated Jun 17, 2024
    • luigi
      Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
      Python
      Apache License 2.0
      2.4k401 Updated Jun 7, 2024
    • mrjob
      Run MapReduce jobs on Hadoop or Amazon Web Services
      Python
      Other
      587001 Updated Jun 6, 2024
    • Keep docker hosts tidy
      Python
      Apache License 2.0
      50001 Updated May 21, 2024
    • aduana
      Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even when making big crawls (one billion pages).
      C
      BSD 3-Clause "New" or "Revised" License
      95592 Updated May 21, 2024