
xtwact - Twitter Archive Intelligence & AI Tweet Generator

Transform Twitter archives into ML datasets and generate voice-matched tweets with RAG-powered GPT-5 integration.

🎯 What This Does

A comprehensive toolkit that:

  1. Cleans and extracts your Twitter archive data (removes DMs, saves GBs of space)
  2. Analyzes 53 features from every tweet (structural, temporal, linguistic, content)
  3. Identifies engagement patterns through statistical correlation analysis
  4. Generates weighted datasets for fine-tuning LLMs, RAG systems, and few-shot prompting
  5. Powers AI tweet generation using RAG + GPT-5 to write in your voice

📊 Example Insights You Can Extract

The analysis scripts identify patterns in your Twitter data, such as:

Typical Engagement Drivers Found

  • Original tweets often get 5-10x more engagement than replies
  • Media (images/videos) typically provide 2-4x engagement boost
  • URLs can increase engagement 2-5x
  • Optimal posting times vary by account (identified automatically)
  • Replies and mentions may reduce engagement for some accounts

Your Archive Breakdown

The scripts will analyze:

  • Total tweet count and date range
  • Engagement distribution (likes, retweets)
  • Tweet type breakdown (originals vs replies)
  • Media usage patterns
  • Temporal posting patterns

Note: Results vary significantly by account. Run the analysis to discover your unique patterns!

🚀 Quick Start

Prerequisites

# Requires Python 3.13+ and uv package manager
pip install uv

# OpenAI API key (for RAG tweet generator)
export OPENAI_API_KEY="sk-..."

1. Clean Your Archive

# Extract just your public tweets from the Twitter export
python cleanup_twitter_archive.py

# Result: twitter_archive_clean/ with clean JSON data
# Saves GBs by removing DMs and tracking data

2. Extract Features

# Extract 53 features from all tweets
uv run python extract_features.py

# Result: corpus/features.json
# Features: length, media, timing, sentiment, engagement metrics, etc.

3. Analyze Engagement

# Find what correlates with high engagement in YOUR data
uv run python analyze_engagement.py

# Results:
#   corpus/engagement_analysis.json - Full correlation data
#   corpus/engagement_analysis.txt  - Human-readable insights

4. Generate Training Datasets

# Create weighted datasets with multiple schemes
uv run python generate_corpus.py

# Results in corpus/:
#   finetune/   - 6 weighted JSONL datasets
#   rag/        - High-engagement tweets for retrieval
#   fewshot/    - Top-performing examples for prompting
#   templates/  - Style patterns from best performers

5. Setup RAG Tweet Generator

# Index your high-engagement tweets for semantic search
uv run setup_rag.py

# Result: chroma_db/ with embeddings (~$0.01 one-time cost)

6. Generate AI Tweets

# Generate tweets in your voice using RAG + GPT-5
uv run tweet_helper.py generate "morning coffee routine"

# Other commands:
uv run tweet_helper.py search "productivity tips"      # Find similar successful tweets
uv run tweet_helper.py analyze "my draft tweet text"  # Check voice match & predict engagement
uv run tweet_helper.py interactive                    # Interactive generation session

📁 Output Structure

twitter_archive_clean/
├── data/
│   ├── tweets.json              # Your clean tweets
│   ├── likes.json              # Your likes
│   ├── followers.json          # Your followers
│   └── following.json          # Your following
└── media/
    └── tweets_media/           # Images/videos from tweets

corpus/
├── features.json               # All tweets with 53 extracted features
├── feature_summary.txt         # Feature documentation
├── engagement_analysis.json    # Correlation analysis
├── engagement_analysis.txt     # Insights report
├── voice_profiles.json         # Your writing style analysis
├── finetune/
│   ├── tweets_linear.jsonl     # All tweets, linearly weighted
│   ├── tweets_log.jsonl        # Log-weighted (reduces outlier impact)
│   ├── tweets_top1pct.jsonl    # Only viral tweets
│   ├── tweets_top5pct.jsonl    # Top 5%
│   ├── tweets_top10pct.jsonl   # Top 10%
│   └── tweets_top25pct.jsonl   # Top 25%
├── rag/
│   ├── high_engagement.json    # Top 25% tweets for retrieval
│   └── metadata.json           # Corpus metadata
├── fewshot/
│   ├── viral_examples.json     # Top 1% viral hits
│   └── top_originals.json      # Best non-reply tweets
└── templates/
    └── winning_patterns.json   # Categorized style patterns

chroma_db/                      # RAG vector database (generated by setup_rag.py)
└── metadata.json

🎓 Use Cases

1. AI-Powered Tweet Generation (New!)

Generate tweets in your voice using GPT-5:

# Generate on a topic
uv run tweet_helper.py generate "weekend plans"

# Features:
# - RAG retrieval finds your similar successful tweets
# - GPT-5 generates in your voice and style
# - Voice match scoring (based on your patterns)
# - Predicted engagement ranges
# - Interactive refinement

2. Fine-Tune a Language Model

Use the weighted JSONL datasets to fine-tune models on your successful style:

# Use tweets_top10pct.jsonl for training
# Each line has: text, weight, engagement, metadata

# Example with OpenAI fine-tuning:
# Convert to their format and upload
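A rough sketch of that conversion (the target schema is OpenAI's chat fine-tuning JSONL; the system prompt, weight threshold, and output filename below are illustrative assumptions, not part of this repo):

```python
import json

def to_chat_example(record, system_prompt="Write a tweet in my voice."):
    # OpenAI chat fine-tuning expects one {"messages": [...]} object per line
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "assistant", "content": record["text"]},
        ]
    }

def convert(in_path="corpus/finetune/tweets_top10pct.jsonl",
            out_path="finetune_openai.jsonl", min_weight=0.0):
    # Keep records at or above a weight threshold, re-emit in chat format
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if record.get("weight", 0) >= min_weight:
                dst.write(json.dumps(to_chat_example(record)) + "\n")
```

Raising min_weight is a quick way to trade dataset size for quality without regenerating the corpus.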

3. Build a RAG System

Use high-engagement tweets as retrieval examples:

# Load RAG corpus
import json

with open('corpus/rag/high_engagement.json') as f:
    tweets = json.load(f)

# Embed and index for semantic search
# Query: "funny observational humor"
# Retrieve: Your best tweets in that style
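The embed-and-index step can be sketched with ChromaDB (the collection name and metadata layout are assumptions; setup_rag.py's actual implementation may differ):

```python
def to_index_batches(tweets):
    # ChromaDB takes parallel lists of ids, documents, and metadata dicts
    ids = [t["id"] for t in tweets]
    documents = [t["text"] for t in tweets]
    metadatas = [{**t["metadata"], "engagement": t["engagement"]} for t in tweets]
    return ids, documents, metadatas

def index_tweets(tweets, path="chroma_db", name="tweets"):
    import chromadb  # kept local so the pure helper above stays dependency-free
    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name)
    ids, documents, metadatas = to_index_batches(tweets)
    collection.add(ids=ids, documents=documents, metadatas=metadatas)
    return collection
```

Storing engagement in the metadata lets you filter retrieval results by tier later.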

4. Few-Shot Prompting

Use top-performing examples to guide LLM generation:

# Load viral examples
import json

with open('corpus/fewshot/viral_examples.json') as f:
    examples = json.load(f)['examples']

# Prompt engineering:
# "Write tweets like these examples: [your_viral_tweets]"

5. Content Strategy

Use the insights to inform what you post based on YOUR data:

Example findings might show:

  • Original tweets perform better than replies
  • Media inclusion may boost engagement
  • URLs can increase engagement
  • Optimal posting times (unique to your audience)
  • Excessive @mentions might reduce engagement
  • Certain emoji patterns may not help

Run the analysis to discover your account's unique patterns!

📊 Scripts Overview

Script                       Purpose                                   Output
cleanup_twitter_archive.py   Extract public data, remove DMs           twitter_archive_clean/
analyze_tweets.py            Basic statistics and insights             Analytics report
extract_features.py          Feature engineering (53 features/tweet)   corpus/features.json
analyze_engagement.py        Statistical correlation analysis          Correlation reports
generate_corpus.py           Create weighted training datasets         Multiple JSONL files
voice_analysis.py            Extract your writing style profile        corpus/voice_profiles.json
setup_rag.py                 Index tweets for semantic search          chroma_db/ vector database
tweet_helper.py              AI tweet generation CLI                   Generated tweets

🔬 Features Extracted

Structural (14 features)

  • Length (characters, words)
  • Media presence and count
  • URLs, hashtags, mentions
  • Tweet type (original/reply/retweet)
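As an illustration, several of these structural features can be recomputed from raw tweet text (the regexes below are simplified assumptions; extract_features.py's actual heuristics may differ):

```python
import re

def structural_features(text):
    # Count URLs, hashtags, and mentions with simple patterns
    urls = re.findall(r"https?://\S+", text)
    hashtags = re.findall(r"#\w+", text)
    mentions = re.findall(r"@\w+", text)
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "url_count": len(urls),
        "hashtag_count": len(hashtags),
        "mention_count": len(mentions),
        # A leading @mention is the usual signal of a reply
        "is_reply": text.startswith("@"),
    }
```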

Temporal (10 features)

  • Hour, day of week, month, year
  • Time of day categories (morning/afternoon/evening/night)
  • Days since account start

Linguistic (14 features)

  • Punctuation (questions, exclamations, periods)
  • Emoji count and presence
  • Capitalization patterns
  • Basic sentiment (positive/negative/neutral)
  • Opening words

Content (4 features)

  • Hashtags (list)
  • Mentions (list)
  • Clean text (URLs/mentions removed)
  • Clean text length

Engagement (11 features)

  • Likes, retweets, total engagement
  • Z-score (standard deviations from mean)
  • Percentile ranking
  • Engagement tier (viral/high/good/average/low)
  • Like-to-retweet ratio
  • Log-scaled engagement
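A minimal sketch of how these derived metrics could be computed for one tweet, given the archive's full list of engagement totals (the exact formulas, e.g. population vs sample standard deviation and log1p scaling, are assumptions):

```python
import math
import statistics

def engagement_features(likes, retweets, all_totals):
    # all_totals: total engagement of every tweet in the archive
    total = likes + retweets
    mean = statistics.mean(all_totals)
    stdev = statistics.pstdev(all_totals) or 1.0  # guard against zero variance
    return {
        "total_engagement": total,
        "z_score": (total - mean) / stdev,
        # Fraction of tweets at or below this total, as a percentage
        "percentile": 100 * sum(t <= total for t in all_totals) / len(all_totals),
        "log_engagement": math.log1p(total),  # log1p handles zero-engagement tweets
        "like_rt_ratio": likes / retweets if retweets else float("inf"),
    }
```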

📈 Understanding Correlation Analysis

The analyze_engagement.py script finds statistically significant correlations (p < 0.05) between features and engagement.

Example patterns found in typical analyses:

Feature         Common finding
is_original     Often positive correlation (originals outperform replies)
has_media       Typically positive (visual content engages)
has_url         Often positive (sharing content drives engagement)
is_reply        Frequently negative (replies get less visibility)
has_mention     May be negative (tagged tweets can underperform)
posting_time    Varies by account's audience timezone

Your results will differ - run the analysis to find your unique patterns!
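The core of that correlation pass can be sketched with scipy (feature rows as flat dicts of numbers; analyze_engagement.py's actual feature selection and filtering may differ):

```python
from scipy.stats import pearsonr

def significant_correlations(rows, target="total_engagement", alpha=0.05):
    # Pearson r between each numeric feature and engagement,
    # keeping only statistically significant results (p < alpha)
    features = [k for k in rows[0]
                if k != target and isinstance(rows[0][k], (int, float))]
    results = {}
    ys = [r[target] for r in rows]
    for feat in features:
        xs = [r[feat] for r in rows]
        r, p = pearsonr(xs, ys)
        if p < alpha:
            results[feat] = round(r, 3)
    return results
```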

🤖 RAG Tweet Generator Details

How It Works

  1. Indexing (setup_rag.py):

    • Embeds your top 25% tweets using OpenAI's text-embedding-3-small
    • Stores in local ChromaDB for fast semantic search
    • One-time cost: ~$0.01-0.03 depending on archive size
  2. Generation (tweet_helper.py):

    • Queries: "coffee morning routine"
    • RAG retrieves: Your 3-5 most similar successful tweets
    • GPT-5 generates: New tweet matching your voice + those examples
    • Analysis: Voice match score & predicted engagement
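A minimal sketch of the retrieval-and-prompt step (the collection name is assumed, and the XML prompt template is illustrative; tweet_helper.py's real template is not shown in this README):

```python
def retrieve_similar(topic, k=5, path="chroma_db", name="tweets"):
    # Semantic search over the indexed high-engagement tweets
    import chromadb  # assumed dependency
    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name)
    hits = collection.query(query_texts=[topic], n_results=k)
    return hits["documents"][0]  # documents for the first (only) query

def build_prompt(topic, examples):
    # Structured XML-style prompt in the spirit of the GPT-5 notes
    blocks = "\n".join(f"<example>{t}</example>" for t in examples)
    return (f"<task>Write one tweet about: {topic}</task>\n"
            f"<examples>\n{blocks}\n</examples>")
```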

Voice Profile

Automatically extracted from your tweets:

  • Tone distribution (positive/negative/neutral percentages)
  • Formality score (casual vs professional)
  • Personal vs observational ratio
  • Average length, profanity usage, humor markers
  • Optimal character count

Features

  • Search: Find similar successful tweets by topic
  • Generate: Create new tweets on any topic in your voice
  • Analyze: Check drafts for voice match and predict engagement
  • Compare: A/B test draft variations
  • Interactive: Iterative refinement session

GPT-5 Support

Migrated to OpenAI's Responses API with:

  • Structured XML prompting for better instruction following
  • Conditional parameter handling (no temperature for reasoning models)
  • Optimized for tweet generation (low verbosity, clear constraints)
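The conditional parameter handling can be sketched as follows (the model-prefix list and helper names are illustrative assumptions, not the actual tweet_helper.py code):

```python
REASONING_MODELS = ("gpt-5", "o1", "o3")  # illustrative prefix list

def build_request(prompt, model="gpt-5", temperature=0.7):
    # Reasoning models reject sampling parameters, so only attach
    # temperature when the target model accepts it
    kwargs = {"model": model, "input": prompt}
    if not model.startswith(REASONING_MODELS):
        kwargs["temperature"] = temperature
    return kwargs

def generate_tweet(prompt, model="gpt-5"):
    from openai import OpenAI  # needs OPENAI_API_KEY in the environment
    client = OpenAI()
    response = client.responses.create(**build_request(prompt, model))
    return response.output_text
```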

🛠️ Technical Details

Dependencies

  • Python 3.13+
  • uv (package manager)
  • OpenAI API (for RAG tweet generator)
  • ChromaDB (vector database)
  • nltk (NLP)
  • scikit-learn (ML/stats)
  • vaderSentiment (sentiment analysis)
  • numpy, scipy (numerical operations)

Installation

# Clone repository
git clone https://github.com/bwl/xtwact.git
cd xtwact

# Install dependencies
uv sync

# Setup environment
cp .env.example .env
# Add your OPENAI_API_KEY to .env

Configuration

Edit .env for RAG tweet generator:

OPENAI_API_KEY=sk-...
DEFAULT_MODEL=gpt-5              # or gpt-4o, gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-small
DEFAULT_TEMPERATURE=0.7          # Only for non-reasoning models
MAX_TOKENS=100                   # Ignored by GPT-5 (uses Responses API)

Data Privacy

  • All analysis is local
  • RAG embeddings stored locally in ChromaDB
  • OpenAI API used only for embeddings and generation
  • Original archive preserved (never modified)
  • Generated corpus is gitignored

📚 Dataset Formats

Fine-Tuning JSONL

{"text": "tweet content", "weight": 0.95, "engagement": 1234, "metadata": {...}}

RAG Corpus

{
  "id": "tweet_id",
  "text": "clean text",
  "engagement": 1234,
  "engagement_tier": "high",
  "metadata": {
    "has_media": true,
    "sentiment": "positive",
    "is_original": true,
    "hour": 14,
    "day_of_week": 4
  }
}

Few-Shot Examples

{
  "description": "Top 1% viral tweets",
  "count": 152,
  "examples": [
    {
      "text": "...",
      "engagement": 1234,
      "has_media": true
    }
  ]
}

Voice Profile

{
  "solid": {
    "style": {
      "formality_score": 6.8,
      "personal_pct": 77.3,
      "observational_pct": 22.7
    },
    "sentiment": {
      "overall_tone": "Slightly Positive",
      "distribution": {"positive_pct": 47, "neutral_pct": 42, "negative_pct": 11}
    },
    "linguistic": {
      "avg_char_count": 62,
      "profanity_pct": 1.2,
      "humor_markers_pct": 8.5
    }
  }
}

🎯 What You'll Get

After running the full pipeline:

  • Cleaned archive with sensitive data removed
  • Feature dataset with 53 features per tweet
  • Correlation analysis identifying your engagement drivers
  • 6 weighted training datasets for different use cases
  • RAG corpus indexed for semantic search
  • Voice profile capturing your writing style
  • AI tweet generator that writes in your voice

Space savings: ~5-10 GB by removing DMs and tracking data

💡 Tips & Best Practices

For Better Analysis

  • Run on archives with 1000+ tweets for statistical significance
  • More engagement history = better insights
  • Viral tweets (outliers) can skew correlations - use log-weighted datasets

For Better AI Generation

  • Use the RAG tweet generator after indexing with setup_rag.py
  • Test different models: gpt-5 (best quality), gpt-4o (balanced), gpt-4o-mini (fast/cheap)
  • Iterate with interactive mode for refinement
  • Save favorites to build a curated collection

For Fine-Tuning

  • Start with tweets_top10pct.jsonl (good balance of quality and quantity)
  • Use tweets_top1pct.jsonl only if you have 10,000+ tweets
  • tweets_log.jsonl reduces impact of viral outliers
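To see why log weighting helps, compare the two schemes on the same engagement values (the normalization below is an assumption; generate_corpus.py may scale differently):

```python
import math

def dataset_weights(engagements):
    # Normalize both schemes to [0, 1] so they are comparable
    max_e = max(engagements)
    linear = [e / max_e for e in engagements]
    max_log = math.log1p(max_e)
    logged = [math.log1p(e) / max_log for e in engagements]
    return linear, logged

# One viral outlier (1000) flattens linear weights for typical tweets,
# while log weighting keeps them meaningfully separated.
linear, logged = dataset_weights([1, 10, 1000])
```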

📄 License & Credits

Built for analyzing personal Twitter/X archives and generating content in your voice.

Uses:

  • OpenAI API (embeddings, GPT-5 generation)
  • ChromaDB (vector database)
  • Standard data science libraries (nltk, scikit-learn, scipy)

Start generating better tweets by learning from your own data! 🚀
