Slow storing of data for shefParser #62

@jbatterman

Description

@jbkolze @krowvin

We are utilizing shefParser to store data for a couple of our data acquisition processes as we prepare for cloud migration. Those processes store either shefit files or shef.e files to the database via CDA. shefParser seems to be significantly slower than processShefit or our old direct-store methods were.

Take our MRRPPCS data stream, for example, which uses a diode to send main stem data once an hour. For this process, shefParser ingests shefit files and stores them.

Each hour, the 6 main stem projects each send 14 values (approximately 84 values total). shefParser takes, on average, 8 minutes to store these values.

An example shefit file is attached. Our bash script cycles through 6 shefit files (one for each project) and stores the data. It does this twice since the diode can send data to one of two folders depending on which machine is sending the data. Total runtime is generally 8 minutes to store these 12 files.

Argument used in the bash script:
shefParser -i $FILE --loader cda[$CDA_API_ROOT][$CDA_API_KEY] --processed
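
For reference, the wrapper loop looks roughly like the sketch below (the directory variables and file glob are placeholders for illustration, not the real script):

    #!/bin/bash
    # Rough sketch of the wrapper loop described above; directory names and
    # file globs are placeholders only.
    for DIR in "$DIODE_DIR_A" "$DIODE_DIR_B"; do    # diode may write to either folder
        for FILE in "$DIR"/*.shefit.txt; do         # 6 shefit files, one per project
            shefParser -i "$FILE" --loader cda[$CDA_API_ROOT][$CDA_API_KEY] --processed
        done
    done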

Error we see in the bash script log (the "too many 500 error responses" and connection pool lines seem especially worrisome):

Traceback (most recent call last):
  File "/<python path>/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/<python path>/python3.9/site-packages/urllib3/connectionpool.py", line 876, in urlopen
    return self.urlopen(
  File "/<python path>/python3.9/site-packages/urllib3/connectionpool.py", line 876, in urlopen
    return self.urlopen(
  File "/<python path>/python3.9/site-packages/urllib3/connectionpool.py", line 876, in urlopen
    return self.urlopen(
  [Previous line repeated 3 more times]
  File "/<python path>/python3.9/site-packages/urllib3/connectionpool.py", line 866, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/<python path>/python3.9/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='<T7 CDA>', port=<port>): Max retries exceeded with url: /nwdm-data/timeseries?create-as-lrts=False&store-rule=REPLACE+WITH+NON+MISSING&override-protection=False (Caused by ResponseError('too many 500 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/<python path>/python3.9/site-packages/shef/loaders/cda_loader.py", line 291, in limited_task
    return await asyncio.to_thread(
  File "/<python path>/python3.9/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/<python path>/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/<python path>/python3.9/site-packages/cwms/timeseries/timeseries.py", line 354, in store_timeseries
    return api.post(endpoint, data, params)
  File "/<python path>/python3.9/site-packages/cwms/api.py", line 335, in post
    with SESSION.post(endpoint, params=params, headers=headers, data=data) as response:
  File "/<python path>/python3.9/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/<python path>/python3.9/site-packages/requests_toolbelt/sessions.py", line 76, in request
    return super(BaseUrlSession, self).request(
  File "/<python path>/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/<python path>/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/<python path>/python3.9/site-packages/requests/adapters.py", line 510, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='<T7 CDA>', port=<port>): Max retries exceeded with url: /nwdm-data/timeseries?create-as-lrts=False&store-rule=REPLACE+WITH+NON+MISSING&override-protection=False (Caused by ResponseError('too many 500 error responses'))
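
For context on the "too many 500 error responses" message: the cda loader posts through requests, and requests raises RetryError once the underlying urllib3 Retry policy exhausts its budget against repeated 500s. Below is a minimal sketch of that behavior; the retry counts and status list here are assumptions for illustration, not the actual policy configured in cwms.api:

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Hypothetical retry policy for illustration only; the real policy is
    # whatever cwms.api sets up on its SESSION object.
    retry = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))

    # Each POST that keeps returning HTTP 500 is retried with backoff until the
    # budget is exhausted; urllib3 then raises MaxRetryError and requests wraps
    # it in RetryError ("too many 500 error responses"), as in the log above.
    # session.post(f"{CDA_API_ROOT}/timeseries", data=payload)  # placeholder call

If the server keeps returning 500s, the backoff delay between retries on every failing POST may account for some of the slowness we are seeing.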

We can try to get some additional information from the catalina logs if that's necessary.

When we have had missing data, the slowness of shefParser makes it unusable; we have had to revert to processShefit when backfilling more than a few hours.

FTPK-data-20250819-12.txt.shefit.txt
