Transform Twitter archives into ML datasets and generate voice-matched tweets with RAG-powered GPT-5 integration.
A comprehensive toolkit that:
- Cleans and extracts your Twitter archive data (removes DMs, saves GBs of space)
- Analyzes 53+ features from every tweet (structural, temporal, linguistic, content)
- Identifies engagement patterns through statistical correlation analysis
- Generates weighted datasets for fine-tuning LLMs, RAG systems, and few-shot prompting
- Powers AI tweet generation using RAG + GPT-5 to write in your voice
The analysis scripts identify patterns in your Twitter data, such as:
- Original tweets often get 5-10x more engagement than replies
- Media (images/videos) typically provide 2-4x engagement boost
- URLs can increase engagement 2-5x
- Optimal posting times vary by account (identified automatically)
- Replies and mentions may reduce engagement for some accounts
The scripts will analyze:
- Total tweet count and date range
- Engagement distribution (likes, retweets)
- Tweet type breakdown (originals vs replies)
- Media usage patterns
- Temporal posting patterns
Note: Results vary significantly by account. Run the analysis to discover your unique patterns!
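To make the kind of summary concrete, here is a minimal sketch of an originals-vs-replies engagement comparison. The records and field names are illustrative, not the archive's actual schema:

```python
import statistics

# Illustrative records; real data comes from twitter_archive_clean/data/tweets.json
# (field names here are assumptions, not the archive's exact schema)
tweets = [
    {"text": "shipping a new feature", "likes": 120, "retweets": 14, "is_reply": False},
    {"text": "@friend totally agree",   "likes": 3,   "retweets": 0,  "is_reply": True},
    {"text": "hot take on build tools", "likes": 45,  "retweets": 6,  "is_reply": False},
    {"text": "@dev thanks!",            "likes": 1,   "retweets": 0,  "is_reply": True},
]

originals = [t for t in tweets if not t["is_reply"]]
replies = [t for t in tweets if t["is_reply"]]

def mean_engagement(group):
    # Engagement = likes + retweets, averaged over the group
    return statistics.mean(t["likes"] + t["retweets"] for t in group)

print(f"originals: {len(originals)} tweets, mean engagement {mean_engagement(originals):.1f}")
print(f"replies:   {len(replies)} tweets, mean engagement {mean_engagement(replies):.1f}")
```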
```bash
# Requires Python 3.13+ and the uv package manager
pip install uv

# OpenAI API key (for the RAG tweet generator)
export OPENAI_API_KEY="sk-..."
```

```bash
# Extract just your public tweets from the Twitter export
python cleanup_twitter_archive.py
# Result: twitter_archive_clean/ with clean JSON data
# Saves GBs by removing DMs and tracking data
```

```bash
# Extract 53 features from all tweets
uv run python extract_features.py
# Result: corpus/features.json
# Features: length, media, timing, sentiment, engagement metrics, etc.
```

```bash
# Find what correlates with high engagement in YOUR data
uv run python analyze_engagement.py
# Results:
#   corpus/engagement_analysis.json - full correlation data
#   corpus/engagement_analysis.txt  - human-readable insights
```

```bash
# Create weighted datasets with multiple schemes
uv run python generate_corpus.py
# Results in corpus/:
#   finetune/  - 6 weighted JSONL datasets
#   rag/       - high-engagement tweets for retrieval
#   fewshot/   - top-performing examples for prompting
#   templates/ - style patterns from best performers
```

```bash
# Index your high-engagement tweets for semantic search
uv run setup_rag.py
# Result: chroma_db/ with embeddings (~$0.01 one-time cost)
```

```bash
# Generate tweets in your voice using RAG + GPT-5
uv run tweet_helper.py generate "morning coffee routine"

# Other commands:
uv run tweet_helper.py search "productivity tips"     # Find similar successful tweets
uv run tweet_helper.py analyze "my draft tweet text"  # Check voice match & predict engagement
uv run tweet_helper.py interactive                    # Interactive generation session
```

```
twitter_archive_clean/
├── data/
│   ├── tweets.json        # Your clean tweets
│   ├── likes.json         # Your likes
│   ├── followers.json     # Your followers
│   └── following.json     # Your following
└── media/
    └── tweets_media/      # Images/videos from tweets

corpus/
├── features.json              # All tweets with 53 extracted features
├── feature_summary.txt        # Feature documentation
├── engagement_analysis.json   # Correlation analysis
├── engagement_analysis.txt    # Insights report
├── voice_profiles.json        # Your writing-style analysis
├── finetune/
│   ├── tweets_linear.jsonl    # All tweets, linearly weighted
│   ├── tweets_log.jsonl       # Log-weighted (reduces outlier impact)
│   ├── tweets_top1pct.jsonl   # Only viral tweets
│   ├── tweets_top5pct.jsonl   # Top 5%
│   ├── tweets_top10pct.jsonl  # Top 10%
│   └── tweets_top25pct.jsonl  # Top 25%
├── rag/
│   ├── high_engagement.json   # Top 25% tweets for retrieval
│   └── metadata.json          # Corpus metadata
├── fewshot/
│   ├── viral_examples.json    # Top 1% viral hits
│   └── top_originals.json     # Best non-reply tweets
└── templates/
    └── winning_patterns.json  # Categorized style patterns

chroma_db/                     # RAG vector database (generated by setup_rag.py)
└── metadata.json
```
Generate tweets in your voice using GPT-5:

```bash
# Generate on a topic
uv run tweet_helper.py generate "weekend plans"
```

Features:
- RAG retrieval finds your similar successful tweets
- GPT-5 generates in your voice and style
- Voice-match scoring (based on your patterns)
- Predicted engagement ranges
- Interactive refinement

Use the weighted JSONL datasets to fine-tune models on your successful style:
```python
# Use tweets_top10pct.jsonl for training
# Each line has: text, weight, engagement, metadata
# Example with OpenAI fine-tuning:
#   convert to their format and upload
```

Use high-engagement tweets as retrieval examples:
```python
import json

# Load the RAG corpus
with open('corpus/rag/high_engagement.json') as f:
    tweets = json.load(f)

# Embed and index for semantic search
# Query: "funny observational humor"
# Retrieve: your best tweets in that style
```

Use top-performing examples to guide LLM generation:
```python
import json

# Load viral examples
with open('corpus/fewshot/viral_examples.json') as f:
    examples = json.load(f)['examples']

# Prompt engineering:
# "Write tweets like these examples: [your_viral_tweets]"
```

Use the insights to inform what you post based on YOUR data:
Example findings might show:
- ✅ Original tweets perform better than replies
- ✅ Media inclusion may boost engagement
- ✅ URLs can increase engagement
- ✅ Optimal posting times (unique to your audience)
- ❌ Excessive @mentions might reduce engagement
- ❌ Certain emoji patterns may not help
Run the analysis to discover your account's unique patterns!
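As a sketch of the few-shot approach described above, a prompt can be assembled directly from your top-performing tweets. The example records and prompt wording here are invented for illustration:

```python
# Build a few-shot prompt from top-performing tweets.
# The example records and prompt wording are illustrative assumptions;
# real examples come from corpus/fewshot/viral_examples.json.
examples = [
    {"text": "the best debugging tool is a good night's sleep", "engagement": 2100},
    {"text": "every side project starts as a weekend and ends as a year", "engagement": 1800},
]

prompt_lines = ["Write a tweet in the style of these examples:"]
for i, ex in enumerate(examples, 1):
    prompt_lines.append(f"{i}. {ex['text']}")
prompt_lines.append("Topic: weekend plans")
prompt = "\n".join(prompt_lines)

print(prompt)
```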
| Script | Purpose | Output |
|---|---|---|
| `cleanup_twitter_archive.py` | Extract public data, remove DMs | `twitter_archive_clean/` |
| `analyze_tweets.py` | Basic statistics and insights | Analytics report |
| `extract_features.py` | Feature engineering (53 features/tweet) | `corpus/features.json` |
| `analyze_engagement.py` | Statistical correlation analysis | Correlation reports |
| `generate_corpus.py` | Create weighted training datasets | Multiple JSONL files |
| `voice_analysis.py` | Extract your writing-style profile | `corpus/voice_profiles.json` |
| `setup_rag.py` | Index tweets for semantic search | `chroma_db/` vector database |
| `tweet_helper.py` | AI tweet generation CLI | Generated tweets |
Structural:
- Length (characters, words)
- Media presence and count
- URLs, hashtags, mentions
- Tweet type (original/reply/retweet)

Temporal:
- Hour, day of week, month, year
- Time-of-day categories (morning/afternoon/evening/night)
- Days since account start

Linguistic:
- Punctuation (questions, exclamations, periods)
- Emoji count and presence
- Capitalization patterns
- Basic sentiment (positive/negative/neutral)

Content:
- Opening words
- Hashtags (list)
- Mentions (list)
- Clean text (URLs/mentions removed)
- Clean text length

Engagement:
- Likes, retweets, total engagement
- Z-score (standard deviations from mean)
- Percentile ranking
- Engagement tier (viral/high/good/average/low)
- Like-to-retweet ratio
- Log-scaled engagement
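A minimal sketch of how a few of these engagement features could be derived from raw likes + retweets; the tier cutoffs below are illustrative, not the script's actual thresholds:

```python
import math
import statistics

engagements = [3, 5, 8, 12, 20, 450]  # likes + retweets per tweet (sample data)

mean = statistics.mean(engagements)
stdev = statistics.pstdev(engagements)

def features(e):
    z = (e - mean) / stdev                                    # standard deviations from mean
    pct = 100 * sum(1 for x in engagements if x <= e) / len(engagements)
    log_e = math.log1p(e)                                     # log-scaling tames viral outliers
    # Tier cutoffs are illustrative, not the script's actual thresholds
    tier = "viral" if pct >= 99 else "high" if pct >= 90 else "average"
    return {"z_score": round(z, 2), "percentile": pct,
            "log_engagement": round(log_e, 2), "tier": tier}

print(features(450))
```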
The `analyze_engagement.py` script finds statistically significant correlations (p < 0.05) between features and engagement.

Example patterns found in typical analyses:
| Feature Type | Common Findings |
|---|---|
| `is_original` | Often shows positive correlation (originals outperform replies) |
| `has_media` | Typically positive correlation (visual content engages) |
| `has_url` | Often positive (sharing content drives engagement) |
| `is_reply` | Frequently negative (replies get less visibility) |
| `has_mention` | May be negative (tagged tweets can underperform) |
| `posting_time` | Varies by account's audience timezone |
Your results will differ - run the analysis to find your unique patterns!
1. Indexing (`setup_rag.py`):
   - Embeds your top 25% tweets using OpenAI's text-embedding-3-small
   - Stores them in local ChromaDB for fast semantic search
   - One-time cost: ~$0.01-0.03 depending on archive size
2. Generation (`tweet_helper.py`):
   - Query: "coffee morning routine"
   - RAG retrieves: your 3-5 most similar successful tweets
   - GPT-5 generates: a new tweet matching your voice + those examples
   - Analysis: voice-match score & predicted engagement
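The retrieval step can be illustrated with a toy similarity search. The real pipeline uses OpenAI embeddings stored in ChromaDB; here simple bag-of-words vectors stand in:

```python
import math
from collections import Counter

# Toy stand-in for RAG retrieval: real pipeline uses OpenAI embeddings
# in ChromaDB; bag-of-words vectors substitute here. Corpus is invented.
corpus = [
    "my morning coffee ritual is sacred",
    "shipping code before coffee is a mistake",
    "weekend hike photos incoming",
]

def bow(text):
    # Bag-of-words vector as a word -> count mapping
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if dot else 0.0

query = bow("coffee morning routine")
ranked = sorted(corpus, key=lambda t: cosine(query, bow(t)), reverse=True)
print(ranked[0])  # the most similar tweet becomes a style example for generation
```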
Automatically extracted from your tweets:
- Tone distribution (positive/negative/neutral percentages)
- Formality score (casual vs professional)
- Personal vs observational ratio
- Average length, profanity usage, humor markers
- Optimal character count
- Search: Find similar successful tweets by topic
- Generate: Create new tweets on any topic in your voice
- Analyze: Check drafts for voice match and predict engagement
- Compare: A/B test draft variations
- Interactive: Iterative refinement session
Migrated to OpenAI's Responses API with:
- Structured XML prompting for better instruction following
- Conditional parameter handling (no temperature for reasoning models)
- Optimized for tweet generation (low verbosity, clear constraints)
- Python 3.13+
- uv (package manager)
- OpenAI API (for RAG tweet generator)
- ChromaDB (vector database)
- nltk (NLP)
- scikit-learn (ML/stats)
- vaderSentiment (sentiment analysis)
- numpy, scipy (numerical operations)
```bash
# Clone repository
git clone https://github.com/bwl/xtwact.git
cd xtwact

# Install dependencies
uv sync

# Setup environment
cp .env.example .env
# Add your OPENAI_API_KEY to .env
```

Edit .env for the RAG tweet generator:

```bash
OPENAI_API_KEY=sk-...
DEFAULT_MODEL=gpt-5                     # or gpt-4o, gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-small
DEFAULT_TEMPERATURE=0.7                 # Only for non-reasoning models
MAX_TOKENS=100                          # Ignored by GPT-5 (uses Responses API)
```

- All analysis is local
- RAG embeddings stored locally in ChromaDB
- OpenAI API used only for embeddings and generation
- Original archive preserved (never modified)
- Generated corpus is gitignored
```json
{"text": "tweet content", "weight": 0.95, "engagement": 1234, "metadata": {...}}
```

```json
{
  "id": "tweet_id",
  "text": "clean text",
  "engagement": 1234,
  "engagement_tier": "high",
  "metadata": {
    "has_media": true,
    "sentiment": "positive",
    "is_original": true,
    "hour": 14,
    "day_of_week": 4
  }
}
```

```json
{
  "description": "Top 1% viral tweets",
  "count": 152,
  "examples": [
    {
      "text": "...",
      "engagement": 1234,
      "has_media": true
    }
  ]
}
```

```json
{
  "solid": {
    "style": {
      "formality_score": 6.8,
      "personal_pct": 77.3,
      "observational_pct": 22.7
    },
    "sentiment": {
      "overall_tone": "Slightly Positive",
      "distribution": {"positive_pct": 47, "neutral_pct": 42, "negative_pct": 11}
    },
    "linguistic": {
      "avg_char_count": 62,
      "profanity_pct": 1.2,
      "humor_markers_pct": 8.5
    }
  }
}
```

After running the full pipeline:
- Cleaned archive with sensitive data removed
- Feature dataset with 53 features per tweet
- Correlation analysis identifying your engagement drivers
- 6 weighted training datasets for different use cases
- RAG corpus indexed for semantic search
- Voice profile capturing your writing style
- AI tweet generator that writes in your voice
Space savings: ~5-10 GB by removing DMs and tracking data
- Run on archives with 1000+ tweets for statistical significance
- More engagement history = better insights
- Viral tweets (outliers) can skew correlations - use log-weighted datasets
- Use the RAG tweet generator after indexing with `setup_rag.py`
- Test different models: gpt-5 (best quality), gpt-4o (balanced), gpt-4o-mini (fast/cheap)
- Iterate with `interactive` mode for refinement
- Save favorites to build a curated collection
- Start with `tweets_top10pct.jsonl` (good balance of quality and quantity)
- Use `tweets_top1pct.jsonl` only if you have 10,000+ tweets
- `tweets_log.jsonl` reduces impact of viral outliers
Built for analyzing personal Twitter/X archives and generating content in your voice.
Uses:
- OpenAI API (embeddings, GPT-5 generation)
- ChromaDB (vector database)
- Standard data science libraries (nltk, scikit-learn, scipy)
Start generating better tweets by learning from your own data! 🚀