Skip to content

Conversation

@mohamed-laarej
Copy link
Collaborator

@mohamed-laarej mohamed-laarej commented Jun 26, 2025

feat: Add canonical_transcript method with supporting helpers

Addresses #794

This PR adds a canonical_transcript() function that identifies the most representative transcript for a gene, based on the total transcribed length (sum of exon lengths). It also introduces the following helper functions:

  • find_gene_feature() – Locate a gene by ID, name, or attributes
  • get_gene_transcripts() – Retrieve all transcript features for a gene
  • get_transcript_exons() – Get all exons for a given transcript
  • calculate_transcript_length() – Compute total length of a transcript by summing its exon lengths
  • get_gene_transcript_lengths() – Return lengths of all transcripts for a gene

These functions build on the existing genome_features and genome_feature_children methods. The implementation includes basic error handling for missing or incomplete features.

…`max()` key argument to use a lambda function, addressing mypy's type inference issue with `dict.get` in this context.
@mohamed-laarej
Copy link
Collaborator Author

Hi @jonbrenas,
I've pushed the initial implementation of the canonical_transcript method. I believe the core functionality is now in place and addresses the requirements of issue #794.
Once you've had a chance to review, if you think this implementation looks good, I can proceed with adding comprehensive pytest unit tests and developing a notebook example demonstrating its usage.
Please let me know your thoughts.
Thanks!

@jonbrenas
Copy link
Collaborator

Thanks @mohamed-laarej. This is great! I think you could make it slightly more modular but introducing new functions that do some of the more basic tasks (e.g., a function that lists all transcripts of a gene, a function that lists all exons of a transcript, ...).

@jonbrenas
Copy link
Collaborator

Another small nitpick: could you add the "type definition" for gene (transcript is already there) to base_params.py?

- Add find_gene_feature() to locate genes by various identifiers
- Add get_gene_transcripts() to retrieve all transcripts for a gene
- Add get_transcript_exons() to get exons for a transcript
- Add calculate_transcript_length() to compute transcript length
- Add get_gene_transcript_lengths() to get all transcript lengths for a gene
- Refactor canonical_transcript() to use new helper functions
- Improves code modularity, testability, and reusability
@jonbrenas
Copy link
Collaborator

That's great @mohamed-laarej.

I think the doc description might conflict with the way the doc is currently generated by default. Could you remove the example more specifically as the notebooks are used for that purpose?

@mohamed-laarej mohamed-laarej force-pushed the feature/794-canonical-transcript branch from 3d0bf45 to 6569103 Compare July 5, 2025 21:59
@mohamed-laarej
Copy link
Collaborator Author

Hello @jonbrenas ,
I wanted to mention that due to a typo in the branch name during a pull, an unintended merge commit was created. I have since cleaned up the commit history and force-pushed the branch, so now only the intended docstring change is included. The old merge commit visible on GitHub is just an artifact from before the cleanup and does not affect the current PR state. Please let me know if you need any further clarifications.
Thank you!

@leehart
Copy link
Collaborator

leehart commented Aug 8, 2025

Looks like this just needs a re-review.

@leehart leehart requested a review from jonbrenas August 8, 2025 08:33
@jonbrenas
Copy link
Collaborator

Thank you, @mohamed-laarej.

Would it be possible to add a few tests to test_genome_features.py to show that the functions work fine?

@mohamed-laarej
Copy link
Collaborator Author

Yes @jonbrenas, I can add them.

…ility

- Added extensive tests covering gene/transcript lookup, exon retrieval, length calculations, and integration scenarios, with dynamic fixture handling and graceful skips.
- Updated find_gene_feature to use configurable GFF gene name attribute for broader compatibility.
@mohamed-laarej
Copy link
Collaborator Author

Hi @jonbrenas ,

Just wanted to let you know that I've pushed the latest changes to this PR.

Regarding the tests, you'll notice that specifically the test_find_gene_feature for the af1_sim fixture is now SKIPPED rather than failing. This is due to an inconsistency in the af1_sim simulated dataset itself, where the find_gene_feature function cannot reliably locate or process genes as expected, even when they appear in a broader query of genome_features.

I've implemented pytest.skip within this specific test to gracefully handle these data limitations, ensuring that the test suite provides accurate feedback. Other tests for af1_sim (like test_get_gene_transcripts, test_canonical_transcript, etc.) are passing if their specific data requirements are met, and all tests for the ag3_sim fixture are passing as expected, confirming the functionality of the new methods.

I'm happy to discuss this further if you'd like to explore options for updating the af1_sim fixture or making the find_gene_feature method more flexible to accommodate such data variations.

Thanks,

@jonbrenas
Copy link
Collaborator

Thanks @mohamed-laarej. I think the test shouldn't be skipped but the issue isn't with your code so it is fine here. I'll create a new issue to deal with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants