Pipeline for generating AI character files and training datasets by scraping public figures' online presence across Twitter and blogs.
⚠️ IMPORTANT: Create a new Twitter account for this tool. DO NOT use your main account as it may trigger Twitter's automation detection and result in account restrictions.
- Install dependencies:
    npm install
- Copy the .env.example into a .env file:
    # (Required) Twitter Authentication
    TWITTER_USERNAME= # your twitter username
    TWITTER_PASSWORD= # your twitter password
    TWITTER_EMAIL= # your twitter email

    # RapidAPI Configuration
    RAPIDAPI_URL=
    RAPIDAPI_KEY=

    # (Optional) Blog Configuration
    BLOG_URLS_FILE= # path to file containing blog URLs

    # (Optional) Scraping Configuration
    MAX_TWEETS= # max tweets to scrape
    MAX_RETRIES= # max retries for scraping
    RETRY_DELAY= # delay between retries
    MIN_DELAY= # minimum delay between requests
    MAX_DELAY= # maximum delay between requests
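As a rough illustration of how the optional scraping settings might be consumed, the sketch below parses them from an environment object and picks a randomized delay between requests. The function names, the default values, and the idea that delays are in milliseconds are all assumptions for illustration; the README does not state the tool's real defaults or units.

```javascript
// Hypothetical defaults: the README does not document the real ones.
const DEFAULTS = {
  MAX_TWEETS: 1000,
  MAX_RETRIES: 3,
  RETRY_DELAY: 5000, // assumed to be milliseconds
  MIN_DELAY: 1000,   // assumed to be milliseconds
  MAX_DELAY: 3000,   // assumed to be milliseconds
};

// Read the optional scraping settings from an env-like object
// (for example process.env), falling back to DEFAULTS when a key is unset or blank.
function loadScrapeConfig(env) {
  const config = {};
  for (const [key, fallback] of Object.entries(DEFAULTS)) {
    const raw = env[key];
    const value = raw === undefined || raw.trim() === "" ? fallback : Number(raw);
    if (!Number.isInteger(value) || value < 0) {
      throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
    }
    config[key] = value;
  }
  if (config.MIN_DELAY > config.MAX_DELAY) {
    throw new Error("MIN_DELAY must not exceed MAX_DELAY");
  }
  return config;
}

// Pick a delay in the inclusive range [MIN_DELAY, MAX_DELAY].
// The random source is injectable so the function can be tested deterministically.
function randomDelay(config, random = Math.random) {
  const span = config.MAX_DELAY - config.MIN_DELAY + 1;
  return config.MIN_DELAY + Math.floor(random() * span);
}
```

A caller would typically run `loadScrapeConfig(process.env)` once at startup and wait `randomDelay(config)` between requests.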
Add RapidAPI to get more data (set RAPIDAPI_URL and RAPIDAPI_KEY in .env).
Get the full text of a tweet:
const twitterCrawlAPI = new TwitterCrawlAPI();
twitterCrawlAPI.getFullTextTweet();
Use Puppeteer as a fallback to get the full text of tweets posted before Sep 29, 2022:
twitterCrawlAPI.fallbackGetFullTextTweet();
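The choice between the two methods can be sketched as a date check against the Sep 29, 2022 cutoff. This is only an illustration: the helper names are hypothetical, treating the cutoff as midnight UTC is an assumption, and the real arguments of getFullTextTweet and fallbackGetFullTextTweet are not documented here.

```javascript
// Cutoff date from the README; interpreting it as midnight UTC is an assumption.
const FALLBACK_CUTOFF = Date.UTC(2022, 8, 29); // months are 0-based: 8 = September

// Return true when a tweet is older than the cutoff and therefore
// needs the Puppeteer-based fallback.
function needsPuppeteerFallback(createdAt) {
  const time = new Date(createdAt).getTime();
  if (Number.isNaN(time)) {
    throw new Error(`Invalid tweet date: ${createdAt}`);
  }
  return time < FALLBACK_CUTOFF;
}

// Hypothetical helper: return the name of the TwitterCrawlAPI method to call.
function pickFullTextMethod(createdAt) {
  return needsPuppeteerFallback(createdAt)
    ? "fallbackGetFullTextTweet"
    : "getFullTextTweet";
}
```

Returning the method name keeps the sketch independent of how TwitterCrawlAPI is constructed or what arguments its methods take.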
Get message examples:
const messageExamplesCrawler = new MessageExamplesCrawler();
messageExamplesCrawler.addExample();
npm run start
Add Express server
- GET /api/characters/:username - get character data by username
- POST /api/characters - scrape tweets and blogs by username
{
"username": "pmarca", // twitter username
"date": "2024-12-23", // generate character from this date
"is_crawl": true // scrape tweets and blogs
}
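A client call to the POST endpoint might look like the sketch below. The helper name is hypothetical, the host and port in the commented usage are placeholders (the README does not say where the server listens), and treating date as optional is an assumption.

```javascript
// Build fetch() options for POST /api/characters.
// Field names (username, date, is_crawl) come from the request body shown above.
function buildCharacterRequest(username, date, isCrawl = true) {
  if (typeof username !== "string" || username.trim() === "") {
    throw new Error("username is required");
  }
  // Assumption: date may be omitted; when present it must be YYYY-MM-DD.
  if (date !== undefined && !/^\d{4}-\d{2}-\d{2}$/.test(date)) {
    throw new Error("date must be formatted as YYYY-MM-DD");
  }
  const body = { username: username.trim(), is_crawl: isCrawl };
  if (date !== undefined) {
    body.date = date;
  }
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}

// Usage (Node 18+ has a global fetch; the base URL is a placeholder):
// const res = await fetch(
//   "http://localhost:3000/api/characters",
//   buildCharacterRequest("pmarca", "2024-12-23"),
// );
// console.log(res.status);
```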
npm run twitter -- username
Example: npm run twitter -- pmarca
npm run blog
npm run character -- username
Example: npm run character -- pmarca
npm run finetune
npm run finetune:test
Run this after the Twitter Collection step:
npm run generate-virtuals -- username date
Example: npm run generate-virtuals -- pmarca 2024-11-29
Example without date: npm run generate-virtuals -- pmarca
The generated character file is written to characters/[username].json. Edit the clients and modelProvider fields to match your needs.
The generated tweet dataset file is written to pipeline/[username]/[date]/raw/tweets.json.
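The output locations and the suggested edit to the character file can be sketched as follows. The helper names are hypothetical, and the example values ("twitter", "openai") and the assumption that clients is an array while modelProvider is a string are illustrative only; check the generated file for the actual shape.

```javascript
// Paths as described in this README.
function characterFilePath(username) {
  return `characters/${username}.json`;
}

function tweetsFilePath(username, date) {
  return `pipeline/${username}/${date}/raw/tweets.json`;
}

// Return a copy of a parsed character object with clients and modelProvider
// overridden; the input object is left unmodified.
function customizeCharacter(character, { clients, modelProvider }) {
  return {
    ...character,
    clients: [...clients], // assumed to be an array of client names
    modelProvider,         // assumed to be a provider name string
  };
}

// Usage with placeholder values:
// const fs = require("node:fs");
// const file = characterFilePath("pmarca");
// const character = JSON.parse(fs.readFileSync(file, "utf8"));
// const updated = customizeCharacter(character, {
//   clients: ["twitter"],
//   modelProvider: "openai",
// });
// fs.writeFileSync(file, JSON.stringify(updated, null, 2));
```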