sqrDAO/twitter-scraper-finetune

 
 


Degen Scraper

Pipeline for generating AI character files and training datasets by scraping public figures' online presence across Twitter and blogs.

⚠️ IMPORTANT: Create a new Twitter account for this tool. DO NOT use your main account as it may trigger Twitter's automation detection and result in account restrictions.

Setup

  1. Install dependencies:

    npm install
  2. Copy .env.example to a new .env file and fill in the values:

    # (Required) Twitter Authentication
    TWITTER_USERNAME=     # your twitter username
    TWITTER_PASSWORD=     # your twitter password
    TWITTER_EMAIL=        # your twitter email
    
    # RapidAPI Configuration
    RAPIDAPI_URL=
    RAPIDAPI_KEY=
    
    # (Optional) Blog Configuration
    BLOG_URLS_FILE=      # path to file containing blog URLs
    
    # (Optional) Scraping Configuration
    MAX_TWEETS=          # max tweets to scrape
    MAX_RETRIES=         # max retries for scraping
    RETRY_DELAY=         # delay between retries
    MIN_DELAY=           # minimum delay between requests
    MAX_DELAY=           # maximum delay between requests
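The variables above might be read and validated roughly as follows. This is a minimal sketch, not the project's actual loader: `loadConfig`, `randomDelay`, and every default value are hypothetical, and the delay settings are assumed to be in milliseconds, which the source does not state.

```javascript
// Hypothetical helper: reads settings from an env-like object such as process.env.
// All default values below are illustrative assumptions, not the project's.
function loadConfig(env) {
  const required = ["TWITTER_USERNAME", "TWITTER_PASSWORD", "TWITTER_EMAIL"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required variables: ${missing.join(", ")}`);
  }

  // Parse an optional integer, falling back when unset or not a number.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };

  return {
    username: env.TWITTER_USERNAME,
    password: env.TWITTER_PASSWORD,
    email: env.TWITTER_EMAIL,
    maxTweets: toInt(env.MAX_TWEETS, 1000),
    maxRetries: toInt(env.MAX_RETRIES, 3),
    retryDelay: toInt(env.RETRY_DELAY, 5000), // assumed milliseconds
    minDelay: toInt(env.MIN_DELAY, 1000), // assumed milliseconds
    maxDelay: toInt(env.MAX_DELAY, 3000), // assumed milliseconds
  };
}

// Pick a random pause between minDelay and maxDelay (inclusive).
// The random source is injectable so the result can be checked deterministically.
function randomDelay(config, random = Math.random) {
  const span = config.maxDelay - config.minDelay + 1;
  return config.minDelay + Math.floor(random() * span);
}
```

Calling `loadConfig(process.env)` at startup would surface a missing credential immediately, instead of partway through a scrape.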

Update

RapidAPI support has been added to retrieve more data.

Get the full text of a tweet:

const twitterCrawlAPI = new TwitterCrawlAPI();
twitterCrawlAPI.getFullTextTweet();

Use Puppeteer as a fallback to get the full text of tweets posted before Sep 29, 2022:

twitterCrawlAPI.fallbackGetFullTextTweet();
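The note above implies that a tweet's date decides which path to use. The sketch below shows one way to make that check. `needsFallback` and `FULL_TEXT_CUTOFF` are hypothetical names, and the cutoff is taken as midnight UTC because the source gives no time zone. How `TwitterCrawlAPI` actually chooses between the two methods is not documented here.

```javascript
// Assumption: the cutoff from the note above, interpreted as midnight UTC.
const FULL_TEXT_CUTOFF = Date.UTC(2022, 8, 29); // months are 0-based, so 8 = September

// Hypothetical helper: true when a tweet predates the cutoff and would
// therefore need the Puppeteer-based fallback.
function needsFallback(createdAt) {
  const timestamp = new Date(createdAt).getTime();
  if (Number.isNaN(timestamp)) {
    throw new Error(`Invalid tweet date: ${createdAt}`);
  }
  return timestamp < FULL_TEXT_CUTOFF;
}
```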

Get message examples:

const messageExamplesCrawler = new MessageExamplesCrawler();
messageExamplesCrawler.addExample();

Usage

Run as Server

npm run start

This starts an Express server.

APIs:

  • GET /api/characters/:username - get character data by username
  • POST /api/characters - scrape tweets and blogs by username. Request body:
{
  "username": "pmarca", // Twitter username
  "date": "2024-12-23", // generate character from this date
  "is_crawl": true // set to true to scrape tweets and blogs
}
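A client for the two endpoints above could look like the sketch below. It assumes the server listens on `http://localhost:3000` and returns JSON; the source states neither. `buildScrapeBody`, `requestScrape`, and `getCharacter` are hypothetical names. Whether `date` may be omitted from the request is not documented, so this sketch requires it.

```javascript
// Assumption: the server address; the README does not state a host or port.
const BASE_URL = "http://localhost:3000";

// Build the POST /api/characters body in the shape shown above.
function buildScrapeBody(username, date, isCrawl = true) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error(`Expected date as YYYY-MM-DD, got: ${date}`);
  }
  return { username, date, is_crawl: isCrawl };
}

// Ask the server to scrape and generate a character.
// fetchImpl is injectable so the call can be exercised without a live server.
async function requestScrape(username, date, fetchImpl = fetch) {
  const response = await fetchImpl(`${BASE_URL}/api/characters`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildScrapeBody(username, date)),
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json(); // assumed to be JSON
}

// Fetch previously generated character data by username.
async function getCharacter(username, fetchImpl = fetch) {
  const url = `${BASE_URL}/api/characters/${encodeURIComponent(username)}`;
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json(); // assumed to be JSON
}
```

The global `fetch` used as the default is available in Node.js 18 and later; on older versions a compatible implementation would have to be passed in.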

Collect Tweets and Blogs Using the CLI

Twitter Collection

npm run twitter -- username

Example: npm run twitter -- pmarca

Blog Collection

npm run blog

Generate Character

npm run character -- username

Example: npm run character -- pmarca

Finetune

npm run finetune

Finetune (with test)

npm run finetune:test

Generate Virtuals Character Card

Character card and goal samples: https://whitepaper.virtuals.io/developer-documents/agent-contribution/contribute-to-cognitive-core#character-card-and-goal-samples

Run this after the Twitter Collection step:

npm run generate-virtuals -- username date

Example: npm run generate-virtuals -- pmarca 2024-11-29

Example without date: npm run generate-virtuals -- pmarca

The generated character file is written to characters/[username].json. Edit the clients and modelProvider fields to match your needs.

The generated tweet dataset file will be in pipeline/[username]/[date]/raw/tweets.json.
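The two output locations follow a fixed pattern, so a script that post-processes the results could compute them as in this sketch. `outputPaths` is a hypothetical helper, not part of the project. It requires a date because the source does not say which date directory is used when the date argument is omitted.

```javascript
// Hypothetical helper: builds the two output locations described above,
// relative to the repository root.
function outputPaths(username, date) {
  if (!username) {
    throw new Error("username is required");
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error(`Expected date as YYYY-MM-DD, got: ${date}`);
  }
  return {
    character: `characters/${username}.json`,
    tweets: `pipeline/${username}/${date}/raw/tweets.json`,
  };
}

const paths = outputPaths("pmarca", "2024-11-29");
console.log(paths.character); // characters/pmarca.json
console.log(paths.tweets); // pipeline/pmarca/2024-11-29/raw/tweets.json
```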
