Reducing complexity of implementation in order to be able to add Atlas text search token based pagination #1046
kbuma wants to merge 8 commits into materialsproject:main
Conversation
slice_size = num_params_min_chunk or 1
# If successful, continue with normal pagination
total_data = {"data": []}  # type: dict
total_data["data"].extend(data["data"])
should favor .append(...) w/ itertools.chain.from_iterable(...) at the end rather than repeated calls to .extend (especially since there is a loop later).
lines: 656, 701, 732, 806
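A minimal sketch of the suggested pattern; fetch_pages and the response shape are illustrative stand-ins, not the client's actual code:

from itertools import chain

def fetch_pages():
    # Hypothetical stand-in for the client's paginated requests.
    yield {"data": [1, 2]}
    yield {"data": [3]}

pages = []
for page in fetch_pages():
    pages.append(page["data"])  # collect one list per request

# Flatten once at the end instead of calling .extend on every iteration.
total_data = {"data": list(chain.from_iterable(pages))}
assert total_data == {"data": [1, 2, 3]}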
for i in range(0, len(split_values), batch_size):
    batch = split_values[i : i + batch_size]
Might be a ways off from being the minimum py version, but in 3.12 itertools introduced batched. I've used the approximate implementation from the docs before:

from itertools import islice

def batched(iterable, n, *, strict=False):
    # batched('ABCDEFG', 2) → AB CD EF G
    if n < 1:
        raise ValueError('n must be at least one')
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        if strict and len(batch) != n:
            raise ValueError('batched(): incomplete batch')
        yield batch
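With that helper, the slicing loop from the diff could be written as follows (split_values and batch_size are the diff's variables, given illustrative values here):

split_values = ["mp-1", "mp-2", "mp-3", "mp-4", "mp-5"]
batch_size = 2

for batch in batched(split_values, batch_size):
    print(batch)  # ('mp-1', 'mp-2'), then ('mp-3', 'mp-4'), then ('mp-5',)

Note that batched yields tuples rather than list slices, which shouldn't matter for joining values into a query string.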
tsmathis left a comment
Not really much to say on my end; I am curious, though, about the performance/execution time of this implementation vs. the parallel approach.
r -= 1
except MPRestError as e:
    # If we get 422 or 414 error, or 0 results for comma-separated params, split into batches
    if "422" in str(e) or "414" in str(e) or "Got 0 results" in str(e):
any(trace in str(e) for trace in ("422", "414", "Got 0 results"))
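A runnable illustration of the suggested condition; the exception class and message here are stand-ins, not mp-api's actual ones:

class MPRestError(Exception):
    pass  # stand-in for the client's MPRestError

try:
    raise MPRestError("fabricated 414 error for illustration")
except MPRestError as e:
    if any(trace in str(e) for trace in ("422", "414", "Got 0 results")):
        print("split comma-separated params into batches")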
]
# Batch the split values to reduce number of requests
# Use batches of up to 100 values to balance URL length and request count
batch_size = min(100, max(1, len(split_values) // 10))
Should the batch size be chosen according to the limits we (may) impose on a Query? Or alternatively, should there be a check on the length of a batch after fixing the batch size? That way excessively long queries get rejected (e.g., if I query for 1M task IDs, 100 batches would still give me an overly long list of task IDs).
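One way to realize that, sketched under an assumed MAX_VALUES_PER_QUERY limit (a hypothetical constant, not an actual mp-api setting):

MAX_VALUES_PER_QUERY = 1000  # assumed server-side limit, illustrative only

def safe_batch_size(split_values: list[str]) -> int:
    # Reject clearly unreasonable queries (e.g. 1M task IDs) up front
    # rather than silently issuing thousands of long requests.
    if len(split_values) > 100 * MAX_VALUES_PER_QUERY:
        raise ValueError(
            f"{len(split_values)} values is too many to batch; "
            "narrow the query criteria instead."
        )
    # Derive the batch size from the imposed limit rather than a fixed 100.
    return min(MAX_VALUES_PER_QUERY, max(1, len(split_values) // 10))

print(safe_batch_size([f"mp-{i}" for i in range(2500)]))  # 250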
Force-pushed from 77913b3 to 28e74e1
Summary
Major changes: