Transform Twitter archives into ML datasets and generate voice-matched tweets with RAG-powered GPT-5 integration.
A comprehensive toolkit that:
- Cleans and extracts your Twitter archive data (removes DMs, saves GBs of space)
- Analyzes 53+ features from every tweet (structural, temporal, linguistic, content)
- Identifies engagement patterns through statistical correlation analysis
- Generates weighted datasets for fine-tuning LLMs, RAG systems, and few-shot prompting
- Powers AI tweet generation using RAG + GPT-5 to write in your voice
The analysis scripts identify patterns in your Twitter data, such as:
- Original tweets often get 5-10x more engagement than replies
- Media (images/videos) typically provide 2-4x engagement boost
- URLs can increase engagement 2-5x
- Optimal posting times vary by account (identified automatically)
- Replies and mentions may reduce engagement for some accounts
The scripts will analyze:
- Total tweet count and date range
- Engagement distribution (likes, retweets)
- Tweet type breakdown (originals vs replies)
- Media usage patterns
- Temporal posting patterns
Note: Results vary significantly by account. Run the analysis to discover your unique patterns!
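To make the kind of summary concrete, here is a minimal sketch of an originals-vs-replies engagement comparison. The records and field names are illustrative, not the archive's actual schema:

```python
import statistics

# Illustrative records; real data comes from twitter_archive_clean/data/tweets.json
# (field names here are assumptions, not the archive's exact schema)
tweets = [
    {"text": "shipping a new feature", "likes": 120, "retweets": 14, "is_reply": False},
    {"text": "@friend totally agree",   "likes": 3,   "retweets": 0,  "is_reply": True},
    {"text": "hot take on build tools", "likes": 45,  "retweets": 6,  "is_reply": False},
    {"text": "@dev thanks!",            "likes": 1,   "retweets": 0,  "is_reply": True},
]

originals = [t for t in tweets if not t["is_reply"]]
replies = [t for t in tweets if t["is_reply"]]

def mean_engagement(group):
    # Engagement = likes + retweets, averaged over the group
    return statistics.mean(t["likes"] + t["retweets"] for t in group)

print(f"originals: {len(originals)} tweets, mean engagement {mean_engagement(originals):.1f}")
print(f"replies:   {len(replies)} tweets, mean engagement {mean_engagement(replies):.1f}")
```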
```bash
# Requires Python 3.13+ and the uv package manager
pip install uv

# OpenAI API key (for the RAG tweet generator)
export OPENAI_API_KEY="sk-..."
```

```bash
# Extract just your public tweets from the Twitter export
python cleanup_twitter_archive.py
# Result: twitter_archive_clean/ with clean JSON data
# Saves GBs by removing DMs and tracking data
```

```bash
# Extract 53 features from all tweets
uv run python extract_features.py
# Result: corpus/features.json
# Features: length, media, timing, sentiment, engagement metrics, etc.
```

```bash
# Find what correlates with high engagement in YOUR data
uv run python analyze_engagement.py
# Results:
#   corpus/engagement_analysis.json - full correlation data
#   corpus/engagement_analysis.txt  - human-readable insights
```

```bash
# Create weighted datasets with multiple schemes
uv run python generate_corpus.py
# Results in corpus/:
#   finetune/  - 6 weighted JSONL datasets
#   rag/       - high-engagement tweets for retrieval
#   fewshot/   - top-performing examples for prompting
#   templates/ - style patterns from best performers
```

```bash
# Index your high-engagement tweets for semantic search
uv run setup_rag.py
# Result: chroma_db/ with embeddings (~$0.01 one-time cost)
```

```bash
# Generate tweets in your voice using RAG + GPT-5
uv run tweet_helper.py generate "morning coffee routine"

# Other commands:
uv run tweet_helper.py search "productivity tips"     # Find similar successful tweets
uv run tweet_helper.py analyze "my draft tweet text"  # Check voice match & predict engagement
uv run tweet_helper.py interactive                    # Interactive generation session
```

```
twitter_archive_clean/
├── data/
│   ├── tweets.json        # Your clean tweets
│   ├── likes.json         # Your likes
│   ├── followers.json     # Your followers
│   └── following.json     # Your following
└── media/
    └── tweets_media/      # Images/videos from tweets

corpus/
├── features.json              # All tweets with 53 extracted features
├── feature_summary.txt        # Feature documentation
├── engagement_analysis.json   # Correlation analysis
├── engagement_analysis.txt    # Insights report
├── voice_profiles.json        # Your writing-style analysis
├── finetune/
│   ├── tweets_linear.jsonl    # All tweets, linearly weighted
│   ├── tweets_log.jsonl       # Log-weighted (reduces outlier impact)
│   ├── tweets_top1pct.jsonl   # Only viral tweets
│   ├── tweets_top5pct.jsonl   # Top 5%
│   ├── tweets_top10pct.jsonl  # Top 10%
│   └── tweets_top25pct.jsonl  # Top 25%
├── rag/
│   ├── high_engagement.json   # Top 25% tweets for retrieval
│   └── metadata.json          # Corpus metadata
├── fewshot/
│   ├── viral_examples.json    # Top 1% viral hits
│   └── top_originals.json     # Best non-reply tweets
└── templates/
    └── winning_patterns.json  # Categorized style patterns

chroma_db/                     # RAG vector database (generated by setup_rag.py)
└── metadata.json
```
Generate tweets in your voice using GPT-5:

```bash
# Generate on a topic
uv run tweet_helper.py generate "weekend plans"
```

Features:
- RAG retrieval finds your similar successful tweets
- GPT-5 generates in your voice and style
- Voice-match scoring (based on your patterns)
- Predicted engagement ranges
- Interactive refinement

Use the weighted JSONL datasets to fine-tune models on your successful style:
```python
# Use tweets_top10pct.jsonl for training
# Each line has: text, weight, engagement, metadata
# Example with OpenAI fine-tuning:
#   convert to their format and upload
```

Use high-engagement tweets as retrieval examples:
```python
import json

# Load the RAG corpus
with open('corpus/rag/high_engagement.json') as f:
    tweets = json.load(f)

# Embed and index for semantic search
# Query: "funny observational humor"
# Retrieve: your best tweets in that style
```

Use top-performing examples to guide LLM generation:
```python
import json

# Load viral examples
with open('corpus/fewshot/viral_examples.json') as f:
    examples = json.load(f)['examples']

# Prompt engineering:
# "Write tweets like these examples: [your_viral_tweets]"
```

Use the insights to inform what you post based on YOUR data:
Example findings might show:
- ✅ Original tweets perform better than replies
- ✅ Media inclusion may boost engagement
- ✅ URLs can increase engagement
- ✅ Optimal posting times (unique to your audience)
- ❌ Excessive @mentions might reduce engagement
- ❌ Certain emoji patterns may not help
Run the analysis to discover your account's unique patterns!
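As a sketch of the few-shot approach described above, a prompt can be assembled directly from your top-performing tweets. The example records and prompt wording here are invented for illustration:

```python
# Build a few-shot prompt from top-performing tweets.
# The example records and prompt wording are illustrative assumptions;
# real examples come from corpus/fewshot/viral_examples.json.
examples = [
    {"text": "the best debugging tool is a good night's sleep", "engagement": 2100},
    {"text": "every side project starts as a weekend and ends as a year", "engagement": 1800},
]

prompt_lines = ["Write a tweet in the style of these examples:"]
for i, ex in enumerate(examples, 1):
    prompt_lines.append(f"{i}. {ex['text']}")
prompt_lines.append("Topic: weekend plans")
prompt = "\n".join(prompt_lines)

print(prompt)
```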
| Script | Purpose | Output |
|---|---|---|
| `cleanup_twitter_archive.py` | Extract public data, remove DMs | `twitter_archive_clean/` |
| `analyze_tweets.py` | Basic statistics and insights | Analytics report |
| `extract_features.py` | Feature engineering (53 features/tweet) | `corpus/features.json` |
| `analyze_engagement.py` | Statistical correlation analysis | Correlation reports |
| `generate_corpus.py` | Create weighted training datasets | Multiple JSONL files |
| `voice_analysis.py` | Extract your writing-style profile | `corpus/voice_profiles.json` |
| `setup_rag.py` | Index tweets for semantic search | `chroma_db/` vector database |
| `tweet_helper.py` | AI tweet generation CLI | Generated tweets |
Structural:
- Length (characters, words)
- Media presence and count
- URLs, hashtags, mentions
- Tweet type (original/reply/retweet)

Temporal:
- Hour, day of week, month, year
- Time-of-day categories (morning/afternoon/evening/night)
- Days since account start

Linguistic:
- Punctuation (questions, exclamations, periods)
- Emoji count and presence
- Capitalization patterns
- Basic sentiment (positive/negative/neutral)

Content:
- Opening words
- Hashtags (list)
- Mentions (list)
- Clean text (URLs/mentions removed)
- Clean text length

Engagement:
- Likes, retweets, total engagement
- Z-score (standard deviations from mean)
- Percentile ranking
- Engagement tier (viral/high/good/average/low)
- Like-to-retweet ratio
- Log-scaled engagement
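A minimal sketch of how a few of these engagement features could be derived from raw likes + retweets; the tier cutoffs below are illustrative, not the script's actual thresholds:

```python
import math
import statistics

engagements = [3, 5, 8, 12, 20, 450]  # likes + retweets per tweet (sample data)

mean = statistics.mean(engagements)
stdev = statistics.pstdev(engagements)

def features(e):
    z = (e - mean) / stdev                                    # standard deviations from mean
    pct = 100 * sum(1 for x in engagements if x <= e) / len(engagements)
    log_e = math.log1p(e)                                     # log-scaling tames viral outliers
    # Tier cutoffs are illustrative, not the script's actual thresholds
    tier = "viral" if pct >= 99 else "high" if pct >= 90 else "average"
    return {"z_score": round(z, 2), "percentile": pct,
            "log_engagement": round(log_e, 2), "tier": tier}

print(features(450))
```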
The `analyze_engagement.py` script finds statistically significant correlations (p < 0.05) between features and engagement.

Example patterns found in typical analyses:
| Feature Type | Common Findings |
|---|---|
| `is_original` | Often shows positive correlation (originals outperform replies) |
| `has_media` | Typically positive correlation (visual content engages) |
| `has_url` | Often positive (sharing content drives engagement) |
| `is_reply` | Frequently negative (replies get less visibility) |
| `has_mention` | May be negative (tagged tweets can underperform) |
| `posting_time` | Varies by account's audience timezone |
Your results will differ - run the analysis to find your unique patterns!
1. Indexing (`setup_rag.py`):
   - Embeds your top 25% tweets using OpenAI's text-embedding-3-small
   - Stores them in local ChromaDB for fast semantic search
   - One-time cost: ~$0.01-0.03 depending on archive size
2. Generation (`tweet_helper.py`):
   - Query: "coffee morning routine"
   - RAG retrieves: your 3-5 most similar successful tweets
   - GPT-5 generates: a new tweet matching your voice + those examples
   - Analysis: voice-match score & predicted engagement
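The retrieval step can be illustrated with a toy similarity search. The real pipeline uses OpenAI embeddings stored in ChromaDB; here simple bag-of-words vectors stand in:

```python
import math
from collections import Counter

# Toy stand-in for RAG retrieval: real pipeline uses OpenAI embeddings
# in ChromaDB; bag-of-words vectors substitute here. Corpus is invented.
corpus = [
    "my morning coffee ritual is sacred",
    "shipping code before coffee is a mistake",
    "weekend hike photos incoming",
]

def bow(text):
    # Bag-of-words vector as a word -> count mapping
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if dot else 0.0

query = bow("coffee morning routine")
ranked = sorted(corpus, key=lambda t: cosine(query, bow(t)), reverse=True)
print(ranked[0])  # the most similar tweet becomes a style example for generation
```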
Automatically extracted from your tweets:
- Tone distribution (positive/negative/neutral percentages)
- Formality score (casual vs professional)
- Personal vs observational ratio
- Average length, profanity usage, humor markers
- Optimal character count
- Search: Find similar successful tweets by topic
- Generate: Create new tweets on any topic in your voice
- Analyze: Check drafts for voice match and predict engagement
- Compare: A/B test draft variations
- Interactive: Iterative refinement session
Migrated to OpenAI's Responses API with:
- Structured XML prompting for better instruction following
- Conditional parameter handling (no temperature for reasoning models)
- Optimized for tweet generation (low verbosity, clear constraints)
- Python 3.13+
- uv (package manager)
- OpenAI API (for RAG tweet generator)
- ChromaDB (vector database)
- nltk (NLP)
- scikit-learn (ML/stats)
- vaderSentiment (sentiment analysis)
- numpy, scipy (numerical operations)
```bash
# Clone repository
git clone https://github.com/bwl/xtwact.git
cd xtwact

# Install dependencies
uv sync

# Setup environment
cp .env.example .env
# Add your OPENAI_API_KEY to .env
```

Edit .env for the RAG tweet generator:

```bash
OPENAI_API_KEY=sk-...
DEFAULT_MODEL=gpt-5                     # or gpt-4o, gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-small
DEFAULT_TEMPERATURE=0.7                 # Only for non-reasoning models
MAX_TOKENS=100                          # Ignored by GPT-5 (uses Responses API)
```

- All analysis is local
- RAG embeddings stored locally in ChromaDB
- OpenAI API used only for embeddings and generation
- Original archive preserved (never modified)
- Generated corpus is gitignored
```json
{"text": "tweet content", "weight": 0.95, "engagement": 1234, "metadata": {...}}
```

```json
{
  "id": "tweet_id",
  "text": "clean text",
  "engagement": 1234,
  "engagement_tier": "high",
  "metadata": {
    "has_media": true,
    "sentiment": "positive",
    "is_original": true,
    "hour": 14,
    "day_of_week": 4
  }
}
```

```json
{
  "description": "Top 1% viral tweets",
  "count": 152,
  "examples": [
    {
      "text": "...",
      "engagement": 1234,
      "has_media": true
    }
  ]
}
```

```json
{
  "solid": {
    "style": {
      "formality_score": 6.8,
      "personal_pct": 77.3,
      "observational_pct": 22.7
    },
    "sentiment": {
      "overall_tone": "Slightly Positive",
      "distribution": {"positive_pct": 47, "neutral_pct": 42, "negative_pct": 11}
    },
    "linguistic": {
      "avg_char_count": 62,
      "profanity_pct": 1.2,
      "humor_markers_pct": 8.5
    }
  }
}
```

After running the full pipeline:
- Cleaned archive with sensitive data removed
- Feature dataset with 53 features per tweet
- Correlation analysis identifying your engagement drivers
- 6 weighted training datasets for different use cases
- RAG corpus indexed for semantic search
- Voice profile capturing your writing style
- AI tweet generator that writes in your voice
Space savings: ~5-10 GB by removing DMs and tracking data
- Run on archives with 1000+ tweets for statistical significance
- More engagement history = better insights
- Viral tweets (outliers) can skew correlations - use log-weighted datasets
- Use the RAG tweet generator after indexing with `setup_rag.py`
- Test different models: gpt-5 (best quality), gpt-4o (balanced), gpt-4o-mini (fast/cheap)
- Iterate with `interactive` mode for refinement
- Save favorites to build a curated collection
- Start with `tweets_top10pct.jsonl` (good balance of quality and quantity)
- Use `tweets_top1pct.jsonl` only if you have 10,000+ tweets
- `tweets_log.jsonl` reduces impact of viral outliers
Built for analyzing personal Twitter/X archives and generating content in your voice.
Uses:
- OpenAI API (embeddings, GPT-5 generation)
- ChromaDB (vector database)
- Standard data science libraries (nltk, scikit-learn, scipy)
Start generating better tweets by learning from your own data! 🚀