Currently, we only host HAR files for the most recent crawl (as discussed in #1011) and all older HAR files have been removed. I've used them extensively in a project and I know others in the community have also relied on them for research across different domains.
Since their removal, reproducing HAR-like data from BigQuery is difficult and expensive. Querying the raw request/response data across multiple tables at page-level granularity quickly becomes cost-prohibitive for many users.
One idea is to provide a UDF that reassembles HAR-like structures, but that could still be costly depending on the size of the crawl and the query.
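To make the reassembly idea concrete, here is a rough Python sketch of the shape such a step might take once the rows for a single page have been fetched. The input column names (`url`, `started`, `time_ms`, `method`, `status`, `body_size`, `mime_type`, and the header lists) are hypothetical placeholders, not the actual BigQuery schema, and the output only covers a subset of the HAR 1.2 fields.

```python
from typing import Any


def build_har(page: dict[str, Any], requests: list[dict[str, Any]]) -> dict[str, Any]:
    """Reassemble a HAR-like dict from one page row and its request rows.

    All input keys are hypothetical column names chosen for illustration;
    they do not reflect the real table schema.
    """
    page_id = "page_1"
    entries = []
    # ISO 8601 timestamps in the same format and timezone sort
    # chronologically when compared as strings.
    for row in sorted(requests, key=lambda r: r["started"]):
        entries.append({
            "pageref": page_id,
            "startedDateTime": row["started"],
            "time": row.get("time_ms", 0),
            "request": {
                "method": row.get("method", "GET"),
                "url": row["url"],
                "headers": row.get("request_headers", []),
            },
            "response": {
                "status": row.get("status", 0),
                "headers": row.get("response_headers", []),
                "content": {
                    "size": row.get("body_size", 0),
                    "mimeType": row.get("mime_type", ""),
                },
            },
        })
    return {
        "log": {
            "version": "1.2",
            # Placeholder creator info, not a real tool.
            "creator": {"name": "har-rebuild-sketch", "version": "0.0"},
            "pages": [{
                "id": page_id,
                "title": page["url"],
                "startedDateTime": page["started"],
            }],
            "entries": entries,
        }
    }
```

The reshaping itself is cheap. The cost concern above comes from the query that has to scan and join the request/response rows for each page before a function like this can run.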
We should make historical crawl data more accessible again, in a way that's sustainable and doesn't shift high costs to users. Ideally, the community should be able to query or download HAR-like data efficiently.
It would be good to discuss options for restoring this kind of access, whether through BigQuery optimizations or external exports.