All code used in this study is provided step by step to ensure transparency, reproducibility, and ease of implementation.
The datasets used in this project cannot be publicly released due to privacy and licensing restrictions. However, the full codebase is provided so that all processing steps and results can be reproduced once the data are obtained.
The raw 10-K filings can be downloaded from the following source:
https://sraf.nd.edu/data/stage-one-10-x-parse-data/
After downloading the data, running the scripts in this repository will reproduce all preprocessing steps used in the paper.
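Before running the preprocessing scripts, it can help to confirm that the download actually landed where the scripts will look for it. The following is a minimal sketch, not part of the repository: the function name `summarize_downloads`, the example directory `data/10-X`, and the `.txt`/`.zip` suffixes are all assumptions for illustration, so adjust them to match your local layout and the format of the files you downloaded.

```python
from pathlib import Path


def summarize_downloads(root, suffixes=(".txt", ".zip")):
    """Count files under `root` (recursively), grouped by file suffix.

    `root` and `suffixes` are hypothetical defaults; they are not
    prescribed by the repository or by the data provider.
    """
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")

    # Normalize suffixes to lowercase so ".TXT" and ".txt" are counted together.
    counts = {suffix.lower(): 0 for suffix in suffixes}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in counts:
            counts[path.suffix.lower()] += 1
    return counts


# Example usage (the path is an assumption, not a repository convention):
# counts = summarize_downloads("data/10-X")
# print(counts)
```

If every count is zero, the archive was probably unpacked to a different directory or uses a different file extension than the one assumed here.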
The StockTwits dataset used in this project was introduced in the following paper:
Li, Xingji, Aaron R. Kaufman, and Nasser Alansari (2025). StockTwits: Comprehensive records of a financial social media platform from 2008 to 2022. Journal of Quantitative Description: Digital Media.
The dataset and download instructions are available at:
https://stocktwits-nyu.s3.us-west-2.amazonaws.com/dataset/README.md
For convenience, we also provide the trained checkpoints for the best-performing models reported in the paper. These checkpoints allow users to reproduce the reported results without retraining the models from scratch.