Add Steel Web Loader with Browser Automation Support #28758
Closed
…ctionality

**Description:**
- Refactor ChatLiteLLMRouter to address runtime issues and improve constructor validation.
- Introduce a new `use_span_tokenize` parameter in NLTKTextSplitter to handle sentence tokenization more effectively, resolving issues with `add_start_index=True`.

**Issues addressed:** #19356, #27455, #28077, #27781

**Co-authored-by:** Chester Curme <[email protected]>, Erick Friis <[email protected]>
**Description:**
- Introduced `steel-browser-python` as a new dependency in the `pyproject.toml` file, specifying version `^1.0.0`.

**Dependencies:** None.
…scraping using Steel.dev's managed browser infrastructure. This implementation improves upon PR #28757 by using direct browser automation instead of REST APIs.

## Changes
- Adds `SteelWebLoader` class that uses Playwright with Steel's browser infrastructure
- Supports multiple content extraction strategies (text, markdown, HTML)
- Includes proxy network and CAPTCHA solving configuration
- Provides session management with proper cleanup
- Implements async support with proper error handling
- Includes comprehensive tests and documentation

## Key Features
1. Browser Automation:
   - Direct browser control using Playwright
   - Support for complex web applications
   - Better JavaScript rendering support
2. Advanced Features:
   - Multiple content extraction strategies
   - Proxy network support
   - Automated CAPTCHA solving
   - Session debugging capabilities
3. Developer Experience:
   - Proper async/await support
   - Comprehensive error handling
   - Session viewer URLs for debugging
   - Detailed documentation and examples

## Testing
- Unit tests with mocked Playwright
- Integration tests with real Steel sessions
- Strategy-specific tests
- Error case coverage

## Documentation
- Added detailed documentation notebook
- Includes basic and advanced usage examples
- Shows integration with LangChain agents
- Provides best practices and debugging tips

## Dependencies
Required packages:
- playwright
- langchain-core
- langchain-community

## Example Usage
```python
from langchain_community.document_loaders import SteelWebLoader

loader = SteelWebLoader(
    "https://example.com",
    steel_api_key="your-api-key",
    extract_strategy="text"
)
documents = loader.load()
```
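For readers wondering what switching between extraction strategies might look like, here is a minimal sketch. It is not the code from this PR: `extract_content` and `_TextExtractor` are hypothetical names, the sketch uses only the standard library rather than Playwright or Steel, and the `markdown` strategy is deliberately left unimplemented because it would need an HTML-to-markdown converter.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content(html: str, strategy: str = "text") -> str:
    """Hypothetical dispatcher over the strategy names listed in the PR."""
    if strategy == "html":
        return html  # raw page source, unchanged
    if strategy == "text":
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()
        return "\n".join(parser.chunks)
    if strategy == "markdown":
        # Assumption: a real loader would call an HTML-to-markdown converter.
        raise NotImplementedError("markdown extraction is not part of this sketch")
    raise ValueError(f"Unsupported extract_strategy: {strategy!r}")
```

Rejecting unknown strategy names up front, rather than silently falling back to text, keeps configuration mistakes visible at load time.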
- Demonstrates using Steel loader with LangChain agents
- Shows basic and advanced web automation tasks
- Includes multi-step automation example
- Provides best practices and debugging tips
dosubot (bot) added the following labels on Dec 17, 2024:
- size:XXL: This PR changes 1000+ lines, ignoring generated files.
- community: Related to langchain-community
- Ɑ: doc loader: Related to document loader module (not documentation)
- 🤖:docs: Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder
- Add Steel loader implementation with web page loading
- Follow LangChain's loader patterns and conventions
- Include proper error handling and logging
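As an illustration of the loader pattern and error handling this commit message mentions, the sketch below shows one common shape: a `lazy_load` generator that yields one document per URL, logs failures, and optionally continues. It is an assumption-laden sketch, not the PR's code: `SketchWebLoader`, the injected `fetch` callable, and the `continue_on_failure` flag are hypothetical, and `Document` here is a local stand-in for `langchain_core.documents.Document` so the example runs without LangChain installed.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterator, List

logger = logging.getLogger(__name__)


@dataclass
class Document:
    """Local stand-in for langchain_core.documents.Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchWebLoader:
    """Hypothetical loader: `fetch` stands in for the browser session."""

    def __init__(
        self,
        urls: List[str],
        fetch: Callable[[str], str],
        continue_on_failure: bool = True,
    ) -> None:
        self.urls = urls
        self.fetch = fetch
        self.continue_on_failure = continue_on_failure

    def lazy_load(self) -> Iterator[Document]:
        for url in self.urls:
            try:
                content = self.fetch(url)
            except Exception as exc:
                if not self.continue_on_failure:
                    raise
                # Log and move on so one bad page does not abort the batch.
                logger.error("Failed to load %s: %s", url, exc)
                continue
            yield Document(page_content=content, metadata={"source": url})

    def load(self) -> List[Document]:
        # Eager variant built on top of the generator.
        return list(self.lazy_load())
```

Passing `fetch` in as a parameter is a choice made for this sketch; it lets the error path be exercised with a plain function instead of a real browser session.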
- Add workflow to test Steel loader implementation
- Run tests across Python versions 3.8-3.11
- Install Playwright dependencies
- Configure Steel API key secret
This PR adds a Steel document loader for web automation.
Changes
The loader is implemented in:
libs/community/langchain_community/document_loaders/steel.py
Features
Dependencies
Required packages added to pyproject.toml:
Testing
The implementation includes:
Added GitHub Action workflow that: