Add Steel Web Loader with Browser Automation Support #28758

Closed
rezapex wants to merge 6 commits

rezapex commented Dec 17, 2024

This PR adds a Steel document loader for web automation.

Changes

  • Add Steel loader with browser automation support:
    • Uses Steel's managed browser infrastructure
    • Includes session management for proper cleanup
    • Supports proxy network and CAPTCHA solving
    • Provides proper error handling and logging

The loader is implemented in:
libs/community/langchain_community/document_loaders/steel.py

Features

  • Multiple content extraction strategies (text, markdown, HTML)
  • Automatic session management and cleanup
  • Integration with Steel's proxy network
  • Automated CAPTCHA solving
  • Session viewer URLs for debugging

Dependencies

Required packages added to pyproject.toml:

  • steel-browser-python for Steel API integration
  • playwright for browser automation

Testing

The implementation includes:

  • Unit tests with mocked components
  • Integration tests for real-world usage
  • Session management tests
  • Error handling tests

Added GitHub Action workflow that:

  • Tests across Python versions 3.8-3.11
  • Installs required dependencies
  • Runs on PR updates and manual triggers
  • Requires STEEL_API_KEY secret for integration tests
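
One way to gate the integration tests on the `STEEL_API_KEY` secret is a skip condition, so that runs without the secret (for example, from forks) skip those tests instead of failing. This is a minimal sketch, assuming the tests read the key from the environment; the helper and test class names are illustrative and not taken from this PR:

```python
import os
import unittest


def steel_api_key_available(env=None):
    """Return True when a non-empty STEEL_API_KEY is present in the mapping."""
    env = os.environ if env is None else env
    return bool(env.get("STEEL_API_KEY", "").strip())


# Hypothetical test class: the real integration tests would open Steel sessions.
@unittest.skipUnless(steel_api_key_available(), "STEEL_API_KEY is not set")
class SteelIntegrationTests(unittest.TestCase):
    def test_key_is_configured(self):
        # Only runs when the secret is available in the environment.
        self.assertTrue(steel_api_key_available())
```

Passing the environment as an argument keeps the check itself testable without touching `os.environ`.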

…ctionality

**Description:**
- Refactor ChatLiteLLMRouter to address runtime issues and improve constructor validation.
- Introduce a new `use_span_tokenize` parameter in NLTKTextSplitter to handle sentence tokenization more effectively, resolving issues with `add_start_index=True`.

**Issues addressed:** #19356, #27455, #28077, #27781

**Co-authored-by:** Chester Curme, Erick Friis

**Description:**
- Introduced `steel-browser-python` as a new dependency in the `pyproject.toml` file, specifying version `^1.0.0`.

**Dependencies:** `steel-browser-python` (`^1.0.0`), added by this commit.
…scraping using Steel.dev's managed browser infrastructure. This implementation improves upon PR #28757 by using direct browser automation instead of REST APIs.

## Changes

- Adds `SteelWebLoader` class that uses Playwright with Steel's browser infrastructure
- Supports multiple content extraction strategies (text, markdown, HTML)
- Includes proxy network and CAPTCHA solving configuration
- Provides session management with proper cleanup
- Implements async support with proper error handling
- Includes comprehensive tests and documentation
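
The "session management with proper cleanup" item can be illustrated with a context manager that releases the session even when page loading fails. This is only a sketch of the pattern: `FakeSteelClient`, `create_session`, and `release_session` are hypothetical stand-ins, not the Steel SDK's API or this PR's implementation.

```python
from contextlib import contextmanager


class FakeSteelClient:
    """Hypothetical stand-in for a session API; not the real Steel SDK."""

    def __init__(self):
        self.released = []

    def create_session(self):
        return {"id": "session-1"}

    def release_session(self, session):
        self.released.append(session["id"])


@contextmanager
def managed_session(client):
    """Create a session and release it even if the body raises."""
    session = client.create_session()
    try:
        yield session
    finally:
        # Runs on both normal exit and exceptions, so sessions never leak.
        client.release_session(session)


client = FakeSteelClient()
try:
    with managed_session(client) as session:
        raise RuntimeError("page load failed")
except RuntimeError:
    pass

print(client.released)  # ['session-1']
```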

## Key Features

1. Browser Automation:
   - Direct browser control using Playwright
   - Support for complex web applications
   - Better JavaScript rendering support

2. Advanced Features:
   - Multiple content extraction strategies
   - Proxy network support
   - Automated CAPTCHA solving
   - Session debugging capabilities

3. Developer Experience:
   - Proper async/await support
   - Comprehensive error handling
   - Session viewer URLs for debugging
   - Detailed documentation and examples
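
As an illustration of the text, markdown, and HTML extraction strategies, the dispatch could look roughly like the sketch below. It assumes a Playwright-style page object exposing `inner_text(selector)` and `content()`; the `extract_content` function, the `html_to_markdown` parameter, and `FakePage` are hypothetical names for this example and may not match the loader in this PR.

```python
from typing import Callable, Optional

VALID_STRATEGIES = ("text", "markdown", "html")


def extract_content(
    page,
    strategy: str = "text",
    html_to_markdown: Optional[Callable[[str], str]] = None,
) -> str:
    """Return page content according to the chosen extraction strategy."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown strategy {strategy!r}; expected one of {VALID_STRATEGIES}")
    if strategy == "text":
        # Visible text of the document body.
        return page.inner_text("body")
    html = page.content()  # Full serialized HTML of the page.
    if strategy == "html":
        return html
    # Markdown conversion is delegated to a caller-supplied converter (assumption).
    if html_to_markdown is None:
        raise ValueError("The markdown strategy needs an html_to_markdown converter")
    return html_to_markdown(html)


class FakePage:
    """Minimal stand-in for a browser page, used only for this example."""

    def inner_text(self, selector: str) -> str:
        return "Hello"

    def content(self) -> str:
        return "<html><body><h1>Hello</h1></body></html>"


page = FakePage()
text = extract_content(page, "text")
html = extract_content(page, "html")
```

Validating the strategy up front means a typo such as `"txt"` fails before any browser work is done.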

## Testing

- Unit tests with mocked Playwright
- Integration tests with real Steel sessions
- Strategy-specific tests
- Error case coverage
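
A unit test with a mocked page might look like the following sketch, which uses `unittest.mock.MagicMock` in place of a real browser. The `load_page_text` helper is illustrative only and is not the code under test in this PR.

```python
from unittest.mock import MagicMock


def load_page_text(page, url: str) -> str:
    """Navigate to the URL and return the visible body text (illustrative helper)."""
    page.goto(url)
    return page.inner_text("body")


# The mock records calls, so no browser or network access is needed.
mock_page = MagicMock()
mock_page.inner_text.return_value = "Example Domain"

text = load_page_text(mock_page, "https://example.com")

mock_page.goto.assert_called_once_with("https://example.com")
mock_page.inner_text.assert_called_once_with("body")
```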

## Documentation

- Added detailed documentation notebook
- Includes basic and advanced usage examples
- Shows integration with LangChain agents
- Provides best practices and debugging tips

## Dependencies

Required packages:
- playwright
- langchain-core
- langchain-community

## Example Usage

```python
from langchain_community.document_loaders import SteelWebLoader

loader = SteelWebLoader(
    "https://example.com",
    steel_api_key="your-api-key",
    extract_strategy="text"
)
documents = loader.load()
```

- Demonstrates using Steel loader with LangChain agents
- Shows basic and advanced web automation tasks
- Includes multi-step automation example
- Provides best practices and debugging tips

vercel bot commented Dec 17, 2024

The latest updates on your projects.

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| langchain | ❌ Failed | | | Dec 17, 2024 6:35am |

dosubot bot added the size:XXL, community, Ɑ: doc loader, and 🤖:docs labels Dec 17, 2024

- Add Steel loader implementation with web page loading
- Follow Langchain's loader patterns and conventions
- Include proper error handling and logging
- Add workflow to test Steel loader implementation
- Run tests across Python versions 3.8-3.11
- Install Playwright dependencies
- Configure Steel API key secret