Feature/794 canonical transcript #804

mohamed-laarej · 2025-06-26T16:06:56Z

feat: Add `canonical_transcript` method with supporting helpers

Addresses #794

This PR adds a canonical_transcript() function that identifies the most representative transcript for a gene, based on the total transcribed length (sum of exon lengths). It also introduces the following helper functions:

find_gene_feature() – Locate a gene by ID, name, or attributes
get_gene_transcripts() – Retrieve all transcript features for a gene
get_transcript_exons() – Get all exons for a given transcript
calculate_transcript_length() – Compute total length of a transcript by summing its exon lengths
get_gene_transcript_lengths() – Return lengths of all transcripts for a gene

These functions build on the existing genome_features and genome_feature_children methods. The implementation includes basic error handling for missing or incomplete features.

…`max()` key argument to use a lambda function, addressing mypy's type inference issue with `dict.get` in this context.

mohamed-laarej · 2025-06-26T16:11:47Z

Hi @jonbrenas,
I've pushed the initial implementation of the canonical_transcript method. I believe the core functionality is now in place and addresses the requirements of issue #794.
Once you've had a chance to review, if you think this implementation looks good, I can proceed with adding comprehensive pytest unit tests and developing a notebook example demonstrating its usage.
Please let me know your thoughts.
Thanks!

jonbrenas · 2025-06-27T08:38:18Z

Thanks @mohamed-laarej. This is great! I think you could make it slightly more modular but introducing new functions that do some of the more basic tasks (e.g., a function that lists all transcripts of a gene, a function that lists all exons of a transcript, ...).

jonbrenas · 2025-06-27T15:35:54Z

Another small nitpick: could you add the "type definition" for gene (transcript is already there) to base_params.py?

- Add find_gene_feature() to locate genes by various identifiers - Add get_gene_transcripts() to retrieve all transcripts for a gene - Add get_transcript_exons() to get exons for a transcript - Add calculate_transcript_length() to compute transcript length - Add get_gene_transcript_lengths() to get all transcript lengths for a gene - Refactor canonical_transcript() to use new helper functions - Improves code modularity, testability, and reusability

jonbrenas · 2025-07-04T16:23:56Z

That's great @mohamed-laarej.

I think the doc description might conflict with the way the doc is currently generated by default. Could you remove the example more specifically as the notebooks are used for that purpose?

mohamed-laarej · 2025-07-05T22:14:44Z

Hello @jonbrenas ,
I wanted to mention that due to a typo in the branch name during a pull, an unintended merge commit was created. I have since cleaned up the commit history and force-pushed the branch, so now only the intended docstring change is included. The old merge commit visible on GitHub is just an artifact from before the cleanup and does not affect the current PR state. Please let me know if you need any further clarifications.
Thank you!

leehart · 2025-08-08T08:33:48Z

Looks like this just needs a re-review.

jonbrenas · 2025-08-11T14:59:52Z

Thank you, @mohamed-laarej.

Would it be possible to add a few tests to test_genome_features.py to show that the functions work fine?

mohamed-laarej · 2025-08-11T18:24:38Z

Yes @jonbrenas, I can add them.

…ility - Added extensive tests covering gene/transcript lookup, exon retrieval, length calculations, and integration scenarios, with dynamic fixture handling and graceful skips. - Updated find_gene_feature to use configurable GFF gene name attribute for broader compatibility.

mohamed-laarej · 2025-08-14T17:35:59Z

Hi @jonbrenas ,

Just wanted to let you know that I've pushed the latest changes to this PR.

Regarding the tests, you'll notice that specifically the test_find_gene_feature for the af1_sim fixture is now SKIPPED rather than failing. This is due to an inconsistency in the af1_sim simulated dataset itself, where the find_gene_feature function cannot reliably locate or process genes as expected, even when they appear in a broader query of genome_features.

I've implemented pytest.skip within this specific test to gracefully handle these data limitations, ensuring that the test suite provides accurate feedback. Other tests for af1_sim (like test_get_gene_transcripts, test_canonical_transcript, etc.) are passing if their specific data requirements are met, and all tests for the ag3_sim fixture are passing as expected, confirming the functionality of the new methods.

I'm happy to discuss this further if you'd like to explore options for updating the af1_sim fixture or making the find_gene_feature method more flexible to accommodate such data variations.

Thanks,

jonbrenas · 2025-08-15T07:35:39Z

Thanks @mohamed-laarej. I think the test shouldn't be skipped but the issue isn't with your code so it is fine here. I'll create a new issue to deal with that.

mohamed-laarej added 2 commits June 26, 2025 16:46

feat: Add canonical_transcript function (resolves #794)

9b62e01

fix: Resolve mypy arg-type error in canonical_transcript\n\nRefactor …

745ea8e

…`max()` key argument to use a lambda function, addressing mypy's type inference issue with `dict.get` in this context.

mohamed-laarej added 2 commits June 29, 2025 16:23

add gene type definition and update function parameter types

4395d53

docs: remove usage examples from canonical_transcript docstring

6569103

mohamed-laarej force-pushed the feature/794-canonical-transcript branch from 3d0bf45 to 6569103 Compare July 5, 2025 21:59

Merge branch 'master' into feature/794-canonical-transcript

7410fb2

leehart requested a review from jonbrenas August 8, 2025 08:33

Merge branch 'master' into feature/794-canonical-transcript

4b91154

mohamed-laarej added 2 commits August 14, 2025 15:43

fix: return Python int instead of np.int64

099cff0

Merge branch 'master' into feature/794-canonical-transcript

bfa9d99

Merge branch 'master' into feature/794-canonical-transcript

5c17309

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature/794 canonical transcript #804

Feature/794 canonical transcript #804

Uh oh!

mohamed-laarej commented Jun 26, 2025 •

edited

Loading

Uh oh!

mohamed-laarej commented Jun 26, 2025

Uh oh!

jonbrenas commented Jun 27, 2025

Uh oh!

jonbrenas commented Jun 27, 2025

Uh oh!

jonbrenas commented Jul 4, 2025

Uh oh!

mohamed-laarej commented Jul 5, 2025

Uh oh!

leehart commented Aug 8, 2025

Uh oh!

jonbrenas commented Aug 11, 2025

Uh oh!

mohamed-laarej commented Aug 11, 2025

Uh oh!

mohamed-laarej commented Aug 14, 2025

Uh oh!

jonbrenas commented Aug 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Feature/794 canonical transcript #804

Are you sure you want to change the base?

Feature/794 canonical transcript #804

Uh oh!

Conversation

mohamed-laarej commented Jun 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

feat: Add canonical_transcript method with supporting helpers

Uh oh!

mohamed-laarej commented Jun 26, 2025

Uh oh!

jonbrenas commented Jun 27, 2025

Uh oh!

jonbrenas commented Jun 27, 2025

Uh oh!

jonbrenas commented Jul 4, 2025

Uh oh!

mohamed-laarej commented Jul 5, 2025

Uh oh!

leehart commented Aug 8, 2025

Uh oh!

jonbrenas commented Aug 11, 2025

Uh oh!

mohamed-laarej commented Aug 11, 2025

Uh oh!

mohamed-laarej commented Aug 14, 2025

Uh oh!

jonbrenas commented Aug 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mohamed-laarej commented Jun 26, 2025 •

edited

Loading

feat: Add `canonical_transcript` method with supporting helpers