Cost-efficient alternatives to HAR files #1092

Currently, we host HAR files only for the most recent crawl (as discussed in #1011); all older HAR files have been removed. I've used them extensively in a project, and I know others in the community have relied on them for research across different domains.

Since their removal, reproducing HAR-like data from BigQuery is difficult and expensive. Querying the raw request/response data across multiple tables at page-level granularity quickly becomes cost-prohibitive for many users.

One idea could be to provide a UDF that reassembles HAR-like structures, but that still risks being costly depending on the size of the crawl and query.
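
To make the reassembly idea concrete, here is a rough client-side sketch in Python of grouping flat request rows into HAR-like structures, one per page. It is only an illustration: the row keys (`page`, `url`, `method`, `status`) and the function name `rows_to_har` are hypothetical and not taken from the actual BigQuery schema, and the output keeps just a handful of the fields a full HAR 1.2 file would carry.

```python
from collections import defaultdict


def rows_to_har(rows):
    """Group flat request rows by page and build one HAR-like dict per page.

    Each row is assumed (hypothetically) to be a dict with at least
    'page' and 'url' keys; 'method' and 'status' are optional.
    """
    by_page = defaultdict(list)
    for row in rows:
        by_page[row["page"]].append(row)

    hars = {}
    for page_url, page_rows in by_page.items():
        entries = [
            {
                "pageref": page_url,
                "request": {"method": r.get("method", "GET"), "url": r["url"]},
                "response": {"status": r.get("status", 0)},
            }
            for r in page_rows
        ]
        hars[page_url] = {
            "log": {
                "version": "1.2",
                "creator": {"name": "rows_to_har sketch", "version": "0"},
                "pages": [{"id": page_url, "title": page_url}],
                "entries": entries,
            }
        }
    return hars


# Made-up sample rows, for illustration only.
sample_rows = [
    {"page": "https://example.com/", "url": "https://example.com/", "status": 200},
    {"page": "https://example.com/", "url": "https://example.com/app.js", "status": 200},
    {"page": "https://example.org/", "url": "https://example.org/", "method": "GET", "status": 301},
]
hars = rows_to_har(sample_rows)
```

The grouping itself is cheap; the cost concern is in fetching the rows in the first place, so a sketch like this only helps if the input is already narrowed to a small set of pages.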

We should make historical crawl data more accessible again, in a way that's sustainable and doesn't shift high costs to users. Ideally, the community should be able to query or download HAR-like data efficiently.

It would be good to discuss options for restoring this kind of access, either through BigQuery optimizations or external exports.
