snakemake pipeline #34
Merged
17 commits
290b9b6
WIP - Make snakemake work
jeremyestein 58a48ed
More pipeline glue. Pseudonymise to correctly named files. Improvemen…
jeremyestein 1868e13
Merge branch 'dev' into jeremy/pipeline
jeremyestein 42789bd
Create log dir before using it
jeremyestein 75b04cd
Put snakemake preamble in a method to avoid polluting the global
jeremyestein a9ac6f3
Tidy up too new message and track running time of snakemake preamble
jeremyestein cbef594
Log to file for FTPS upload as well
jeremyestein a5b93bb
Change the way the do_upload method expects to receive paths and add a
jeremyestein 6dbed2c
Bring in the envs needed for FTPS upload from a docker mounted file,
jeremyestein f942e33
Give us the option to make the pipeline not run all the way.
jeremyestein be93bb3
Guide to debugging the pipeline
jeremyestein 2bfa635
Mark stub implementation as in need of removal
jeremyestein a9b7cf6
Fix linting errors.
jeremyestein e8d2f24
Fix indentation so pytest tests run in any PR type
jeremyestein 913eb6a
Install PIXL dependency
jeremyestein 5424a3e
Install ourselves correctly with venv
jeremyestein 32b2906
Note that Windows may differ
jeremyestein
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # About | ||
| This is a brief guide to inspecting the state of the waveform data pipeline, | ||
| e.g. looking for intermediate data or error messages if something | ||
| is not coming through as expected. | ||
|
|
||
| # Start here | ||
|
|
||
| It is helpful to refer to the pipeline diagram at | ||
| https://github.com/SAFEHR-data/emap/blob/develop/docs/technical_overview/waveforms/pipeline.md | ||
| to get an overview and find the right place to look. | ||
|
|
||
| First, let's see what data is present in the `waveform-export` directory. | ||
|
|
||
| Are there recent files in `ftps-logs`? | ||
|
|
||
| > [!NOTE] | ||
| > The timestamps in the file names reflect the time the data is from, not the time we processed it! | ||
|
|
||
| > [!TIP] | ||
| > Use `ls -latr` to see the latest files in a directory. | ||
|
|
||
| If the relevant `*.uploaded.json` marker file is present, this means the FTPS | ||
| upload to the DSH happened without an error. | ||
| The file contents are upload stats in JSON format. | ||
| Uploads will also generate email notifications from the DSH. | ||
|
|
||
| If the marker file is not present, let's check the | ||
| other end of our pipeline: are there recent files in `original-csv`? | ||
| If not, look at the `waveform-controller` logs or, | ||
| failing that, further upstream at the RabbitMQ server (see later section). | ||
|
|
||
| If files in `original-csv` are present, then the error is somewhere inside our pipeline, | ||
| and you should check the logs in `snakemake-logs` (see later section). | ||
|
|
||
| Parquets that are a direct translation from the CSV are found in `original-parquet`. | ||
|
|
||
| Parquets that have been pseudonymised are found in `pseudonymised`. | ||
|
|
||
|
|
||
| # Logging summary | ||
|
|
||
| Logs are found in: | ||
| * Docker container logs | ||
| * Snakemake top-level logs | ||
| * Snakemake job-level logs | ||
|
|
||
| > [!CAUTION] | ||
| > Always be aware that logs may contain sensitive information. The only | ||
| > files considered safe for upload to the DSH are those in the `pseudonymised` | ||
| > directory. | ||
|
|
||
| ## Docker logs | ||
|
|
||
| ### `waveform-controller` container | ||
| ```docker compose logs -t waveform-controller``` | ||
| Shows the `waveform-controller` service logs. Useful for: | ||
| - Emap connectivity | ||
| - RabbitMQ connectivity | ||
| - patient correlation query errors (search for "unmatched") | ||
| - CSV output failures | ||
|
|
||
| This log is not very chatty if everything is going well. | ||
|
|
||
| ### `waveform-exporter` container | ||
| ```docker compose logs -t waveform-exporter``` | ||
| Shows the output from the cron-triggered script `scheduled-script.sh`. | ||
| Useful for high-level pipeline failures before Snakemake starts, or | ||
| Snakemake startup failures (e.g. when Snakemake is already running). | ||
|
|
||
| ## Snakemake logs | ||
|
|
||
| Written to the mounted volume under `waveform-export/snakemake-logs/`. | ||
| These logs describe pipeline orchestration and per-rule execution. | ||
|
|
||
| ### `snakemake-outer-log*.log` | ||
| Top-level Snakemake run logs, including: | ||
| - recently written CSVs that were temporarily excluded from processing (search "File too new") | ||
| - job summaries and Snakemake DAG resolution | ||
| - more detailed errors when Snakemake itself fails | ||
|
|
||
| Unlike data files, the timestamps in these file names record when the Snakemake | ||
| pipeline was invoked. | ||
|
|
||
| ### `{date}.{hashed_csn}.{stream_id}.{units}.log` | ||
| Job-level log for the `csv_to_parquet` rule. Contains: | ||
| - CSV -> parquet info | ||
| - pseudonymisation steps | ||
|
|
||
| ## FTPS logs and marker files | ||
|
|
||
| Produced under `waveform-export/ftps-logs/`. | ||
|
|
||
| ### `{date}.{hashed_csn}.{stream_id}.{units}.ftps.log` | ||
| Job-level FTPS upload logs. Useful for: | ||
| - connection/authentication errors | ||
| - transfer failures | ||
|
|
||
| ### `{date}.{hashed_csn}.{stream_id}.{units}.ftps.uploaded.json` | ||
| Upload marker file (aka sentinel) written after a successful transfer. | ||
| It contains, in JSON format: | ||
| - `uploaded_file` (the uploaded file path) | ||
| - `upload_time_secs` (time to upload in seconds, measured with a monotonic clock) | ||
| - `start_timestamp` and `end_timestamp` (wall-clock UTC start and end timestamps) | ||
|
|
||
| Example paths: | ||
| - `waveform-export/snakemake-logs/snakemake-outer-log20260122T173201.log` | ||
| - `waveform-export/snakemake-logs/2025-06-04.acbc4701.52912.mL.log` | ||
| - `waveform-export/ftps-logs/2025-06-04.8bea0824.52912.mL.ftps.log` | ||
| - `waveform-export/ftps-logs/2025-06-04.8bea0824.52912.mL.ftps.uploaded.json` | ||
|
|
||
| # RabbitMQ (part of Emap) | ||
|
|
||
| If the `waveform-controller` service appears to be up and running but is | ||
| not generating data, you could check the `waveform_export` queue | ||
| in the RabbitMQ server, which is part of Emap. | ||
|
|
||
| If there are no messages present, it's possible that the waveform reader (also | ||
| part of Emap) is not generating them. | ||
|
|
||
| # Waveform reader (part of Emap) | ||
| This receives HL7 data on a TCP port from the Capsule server. | ||
| It writes received messages | ||
| to the docker host directory `waveform-saved-messages`, so look | ||
| there for recent messages. | ||
|
|
||
| Useful commands, to be run from the emap venv (see Emap repo for more details): | ||
| * `emap docker ps`: check that the `waveform-reader` container is up | ||
| * `emap docker logs waveform-reader`: see whether HL7 messages are being received and check for errors | ||
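
The checks in the "Start here" section of the guide can be sketched as a small shell helper. This is a minimal sketch and not part of the PR: the function name `triage_pipeline`, the one-day default window, and the three result labels are assumptions made for illustration. Only the directory names (`ftps-logs`, `original-csv`, `snakemake-logs`) and the `*.uploaded.json` suffix come from the guide. It compares file modification times rather than the timestamps in the file names, because the guide notes that those describe the data, not when it was processed.

```shell
#!/bin/bash
# Hypothetical triage helper following the "Start here" checks in order.
# Usage: triage_pipeline <path to waveform-export> [max age in minutes]
triage_pipeline() {
    local base="$1"
    local max_age_mins="${2:-1440}"   # assumed window: one day
    local recent

    # 1. A recent marker file means an FTPS upload finished without error.
    recent=$(find "$base/ftps-logs" -name '*.uploaded.json' \
                 -mmin "-$max_age_mins" -print -quit 2>/dev/null || true)
    if [ -n "$recent" ]; then
        echo "upload-ok"
        return 0
    fi

    # 2. No marker: are CSVs arriving at the start of our pipeline at all?
    recent=$(find "$base/original-csv" -type f \
                 -mmin "-$max_age_mins" -print -quit 2>/dev/null || true)
    if [ -z "$recent" ]; then
        # Nothing coming in: look at waveform-controller, then RabbitMQ.
        echo "check-upstream"
        return 0
    fi

    # 3. CSVs exist but no upload: the problem is inside the pipeline.
    echo "check-snakemake-logs"
}
```

Calling `triage_pipeline waveform-export 120` would restrict the check to the last two hours. The labels only point at the next place to look; they do not replace reading the logs described above.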
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,32 @@ | ||
| #!/bin/bash | ||
|
|
||
| # Run by the cron scheduler | ||
| # Probably want snakemake instead... | ||
| emap-csv-pseudon --help | ||
| emap-send-ftps --help | ||
| set -euo pipefail | ||
|
|
||
| # This script is to be run by the cron scheduler, and its | ||
| # output goes to the docker logs. | ||
| # The snakemake output goes to its own log file as defined here. | ||
| # These files will end up on Windows so be careful about disallowed characters in the names. | ||
| date_str=$(date --utc +"%Y%m%dT%H%M%S") | ||
| SNAKEMAKE_CORES="${SNAKEMAKE_CORES:-1}" | ||
| # for temporarily making the pipeline not go all the way | ||
| SNAKEMAKE_RULE_UNTIL="${SNAKEMAKE_RULE_UNTIL:-all}" | ||
|
|
||
| # log file for the overall snakemake run (as opposed to per-job logs, | ||
| # which are defined in the snakefile) | ||
| outer_log_file="/waveform-export/snakemake-logs/snakemake-outer-log${date_str}.log" | ||
| # snakemake has not run yet so will not create the log dir; do it manually | ||
| mkdir -p "$(dirname "$outer_log_file")" | ||
| echo "$0: invoking snakemake, cores=$SNAKEMAKE_CORES, logging to $outer_log_file" | ||
| touch "$outer_log_file" | ||
| # bring in envs from file because cron gives us a clean environment | ||
| set -a | ||
| source /config/exporter.env | ||
| set +a | ||
| set +e | ||
| snakemake --snakefile /app/src/pipeline/Snakefile \ | ||
| --cores "$SNAKEMAKE_CORES" \ | ||
| --until "$SNAKEMAKE_RULE_UNTIL" \ | ||
| >> "$outer_log_file" 2>&1 | ||
| ret_code=$? | ||
| set -e | ||
| echo "$0: snakemake exited with return code $ret_code" |
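
Two idioms carry most of the script above: the `${VAR:-default}` expansion that lets an environment variable override a default, and the `set +e` / `set -e` bracket that records a failing exit code instead of aborting under `set -e`. The sketch below isolates both so they can be tried without Snakemake. The names `run_step`, `capture_status`, and `resolve_until` are hypothetical and exist only for this illustration; `run_step` is a stand-in for the real `snakemake` invocation.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for the snakemake call: exits with the given code.
run_step() {
    echo "pretending to run, will exit with ${1:-0}"
    return "${1:-0}"
}

# Run a command that is allowed to fail, append its output to a log file,
# and print its exit code instead of letting `set -e` abort the script.
capture_status() {
    local log_file="$1"
    shift
    local ret_code
    set +e
    "$@" >> "$log_file" 2>&1
    ret_code=$?
    set -e
    echo "$ret_code"
}

# Same ":-" expansion as the script: the caller's value if set and
# non-empty, otherwise the default "all".
resolve_until() {
    echo "${SNAKEMAKE_RULE_UNTIL:-all}"
}
```

For example, `SNAKEMAKE_RULE_UNTIL=csv_to_parquet resolve_until` prints `csv_to_parquet`, using the rule name mentioned in the debugging guide; whether that rule is a sensible stopping point for a given investigation is an assumption to check against the Snakefile.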