Conversation
Co-Authored-By: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
Co-Authored-By: Oliver Faulkner Anderson <112665860+oliverfanderson@users.noreply.github.com>
Co-Authored-By: Altaf Barelvi <altafayyubibarelvi@gmail.com>
datasets/synthetic-data/Snakefile (outdated)

    pathways = ["Apoptosis_signaling", "B_cell_activation",
Each of the files has this variable. I think we should define it only in the Snakefile and pass this list to each of the files that use it.
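A sketch of that refactor in the Snakefile (the rule, script, and output names here are hypothetical; only the `pathways` list is from the PR):

```
pathways = ["Apoptosis_signaling", "B_cell_activation"]  # single source of truth

rule prizes:
    output: "output/{pathway}_prizes.txt"
    params: pathways=pathways
    script: "scripts/make_prizes.py"  # reads snakemake.params.pathways instead of its own copy
```

Each script would then drop its local copy of the list and read `snakemake.params.pathways`.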
A question I'd appreciate feedback on: currently, we generate separate source, target, and prize files for each pathway, but we combine all pathways into each thresholded interactome. Should we also create a combined list of sources, targets, and prizes? Should we combine the gold standard as well? Or would it be better to keep separate interactomes for each individual pathway (i.e., keep it the way it is)?

We should have separate gold standards.

When this is reviewed (or before), we should run tests to see how connected the networks are after thresholding, adding back the pathway data, and removing proteins that don't have UniProt IDs.
Also, there is a chance we can use more PANTHER pathways; we should look to see what else we can use from Pathway Commons.

@oliverfanderson @ctrlaltaf For the gold standard nodes (and potentially the edges), should we exclude source, target, and prize nodes when defining it? Currently, it looks like we're including these nodes in the gold standard for each pathway. These nodes overlap with the gold standard, but that overlap should happen naturally, not by construction/being predefined. I'm concerned this could inflate our precision and recall metrics because of a form of data leakage.

Plan to keep all of them in the gold standard, but update the evaluation code to handle the sources/targets/prizes being in the gold standard, shown as a different baseline where those are all set to frequency 1.0.
Should we also consider how sparse an interactome becomes after applying a threshold to the STRING interactome? When we filter by size, we implicitly account for the decrease in graph density as well. Would it make more sense to treat size and density as separate variables when evaluating performance? Then again, does testing for density even matter in this context; are there any interactomes that aren't already highly connected? I'm thinking we should first threshold the interactomes, then select only those that are highly connected (e.g., density ≥ 0.85). From that subset, we could choose a few to represent different size scales.
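A sketch of what that selection step could look like with networkx (the 0.85 cutoff is just the example value above, and `passes_density_filter` is a hypothetical name):

```python
import networkx

def passes_density_filter(graph: networkx.Graph, min_density: float = 0.85) -> bool:
    # Treat density as its own acceptance criterion, separate from size,
    # so thresholded interactomes can be picked per size scale afterwards.
    return networkx.density(graph) >= min_density

print(passes_density_filter(networkx.complete_graph(5)))  # True: density 1.0
print(passes_density_filter(networkx.path_graph(10)))     # False: density 0.2
```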
I will be updating how we create interactomes for the PANTHER pathways dataset. For example, in the STRING interaction networks, when using only physical interactions and experimental edge scores, we could aim to keep 25% of all edges. We will now construct new interactomes by removing X% of edges and then adding all edges from all chosen PANTHER pathways. We will only keep downsampled interactomes that satisfy specified properties for a given set of sources and targets. Proposed brute-force method for PANTHER pathways interactomes:

1. Randomly remove X edges from the full STRING interactome.
2. Verify that the new network maintains the required properties.
3. If the properties are not satisfied, repeat the process with a different random sample.
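The retry loop above, sketched with networkx (the `satisfies_properties` predicate stands in for the property checks discussed in this thread, and all names are hypothetical):

```python
import random
import networkx

def downsample_interactome(interactome, pathway_edges, n_remove,
                           satisfies_properties, max_tries=100, seed=0):
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = interactome.copy()
        # 1. Randomly remove X edges from the full STRING interactome.
        candidate.remove_edges_from(rng.sample(list(candidate.edges), n_remove))
        # 2. Add back all edges from the chosen PANTHER pathways.
        candidate.add_edges_from(pathway_edges)
        # 3. Keep the sample only if the required properties hold;
        #    otherwise retry with a different random sample.
        if satisfies_properties(candidate):
            return candidate
    raise RuntimeError("no valid downsampled interactome found")
```

For example, `downsample_interactome(g, pathway_edges, 3, networkx.is_connected)` would only accept samples that stay in one connected component.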
For this dataset, we are planning on using it for all of the evaluations. I was deciding whether we need to use all of the pathways, and I don't think we do. I decided on a few that we can use: Balanced, Skewed, and Tiny. When making the interactomes, I want to add all of these pathways to the thresholded interactomes and uphold the properties above. I need to double-check whether using any of these will break the rules for pilot data/runs; but since we are making a new dataset that wasn't used for my thesis, I think we will be okay.
Made minor changes to fix the interactome fetching. These shouldn't cause any conflicts, nor are they the changes I wanted to make as mentioned in Slack. [If they do, feel free to force push.]
for biopax -> extended sif conversion
As a textual tl;dr: I still need to make that trim utility, split this PR to add those utilities, and merge the EGFR interactome updates and sampling using those utility scripts, then jump back to doing the BioPAX parsing.
Would you be able to add more information to the README on how we are downsampling the interactomes, and include the pseudocode?
    sources, targets = sources_and_targets(node_data_df)

    # TODO: isolate percentage constant (this currently builds up 10%, 20%, ..., 100%)
    for percentage in map(lambda x: (x + 1) / 10, range(10)):
What if a user wants to explicitly specify an X%? I was thinking of using 33% and 66% for the computational performance evaluation.
I might not be understanding what this percentage is. Is this for how large the interactome will be, or is this the percentage of sources/targets that are connected?
I'm currently using it for both. Originally I only planned to make X% configurable rather than hard-coded to 10% steps, but I like the idea of decoupling the two as well. I'll tack this on with the graph connectedness check.
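Decoupling the two knobs could look like this (all names are hypothetical; a `sample_fractions` list would also let a user pass 33% and 66% explicitly):

```python
# Fractions of the interactome to sample; user-specified rather than hard-coded 10% steps.
sample_fractions = [0.33, 0.66]

# Minimum fraction of source/target pairs that must stay connected; now its own knob.
min_connection_fraction = 0.5

def accept_sample(connection_fraction: float) -> bool:
    # The connectedness check no longer reuses the sampling fraction.
    return connection_fraction >= min_connection_fraction
```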
    # We ask that at least `percentage` of the sources and targets are connected with one another.
    connection_percentage = float(len(curr_connections)) / float(len(prev_connections))

    if percentage < connection_percentage:
Why is this percentage being used both for the fraction of sources and targets that are connected and for how much of the interactome to sample?

Am I right to assume that if percentage = 10, then 10% of the interactome is sampled, and, when checking the source/target connection_percentage, the downsampled interactome is valid if that value is greater than 10%?
    from typing import OrderedDict, Optional
    import os

    def count_weights(weights: dict[float, int]) -> OrderedDict[float, int]:
Could you also add a one-line explanation of why we are bucketing each weight into its own bucket?
    def count_weights(weights: dict[float, int]) -> OrderedDict[float, int]:
        """
        Returns an ordered map (lowest to highest weight) from the
        weight to the number of elements the weight has.
Is each element an edge in the interactome that corresponds to this weight?
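For reference, a commented sketch of the helper (the `return` line is the one quoted elsewhere in this thread; the comment wording is only a suggestion, assuming the values do count edges per weight):

```python
import collections
from typing import OrderedDict

def count_weights(weights: dict[float, int]) -> OrderedDict[float, int]:
    # Each distinct edge weight is its own bucket; the value is the number
    # of interactome edges that carry exactly that weight. Keeping the map
    # ordered (lowest to highest) lets later steps sample per weight level.
    return collections.OrderedDict(sorted({k: int(v) for k, v in weights.items()}.items()))

print(list(count_weights({0.9: 3, 0.15: 10, 0.4: 2}).items()))
# [(0.15, 10), (0.4, 2), (0.9, 3)]
```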
    return SourcesTargets(sources, targets)

    def main():
Are you planning on seeding this?
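If so, one way to seed it (names hypothetical; a dedicated `random.Random` instance keeps runs reproducible without touching the global RNG state):

```python
import random

def sample_edges(edges, fraction, seed=42):
    # Same seed in, same downsampled edge list out: reproducible runs.
    rng = random.Random(seed)
    k = int(len(edges) * fraction)
    return rng.sample(edges, k)

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
assert sample_edges(edges, 0.5) == sample_edges(edges, 0.5)  # deterministic
```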
| """ | ||
| return collections.OrderedDict(sorted({k: int(v) for k, v in weights.items()}.items())) | ||
|
|
||
| def find_connected_sources_targets( |
Are you planning on checking any other properties? I would like everything to be in one connected component, so that would be another property to check.
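That check is one line with networkx (a sketch; `is_single_component` is a hypothetical name):

```python
import networkx

def is_single_component(graph: networkx.Graph) -> bool:
    # True only when every node can reach every other node, i.e. the
    # downsampled interactome did not fragment into pieces.
    return networkx.is_connected(graph)

g = networkx.Graph([("A", "B"), ("B", "C")])
print(is_single_component(g))  # True
g.add_node("lonely")
print(is_single_component(g))  # False
```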
    print(f"Merging {pathway_name} with interactome...")
    # While we are merging this graph, we are preparing to compare the connectedness of the prev[ious] and curr[ent] (merged) graph.
    prev_graph = networkx.from_pandas_edgelist(pathway_df, source="Interactor1", target="Interactor2")
What are this previous and current graph? Is the previous graph the actual pathway, and the current graph the interactome merged with the actual pathway? If yes, can you document that somewhere as a comment?
    def read_pathway(pathway_name: str) -> pandas.DataFrame:
        """
        Returns the directed-only pathway from a pathway name,
Why are we assuming that the pathway is fully directed? I would need to look at each pathway for an example, but we can't assume that each pathway is fully directed.
    directed = [
        "controls-state-change-of",
        "controls-transport-of",
        "controls-phosphorylation-of",
        "controls-expression-of",
        "catalysis-precedes",
        "consumption-controlled-by",
        "controls-production-of",
        "controls-transport-of-chemical",
        "chemical-affects",
        "used-to-produce",
        "consumption-controled-by",
    ]
    undirected = ["in-complex-with", "interacts-with", "neighbor-of", "reacts-with"]

This is information from Pathway Commons about the directionality of pathways.
We aren't assuming that the pathway is directed: I should say 'directed-coerced' instead.
Why do you need to, though?
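For context, keeping only directed interactions is a row filter over the SIF edge list (a sketch; the `INTERACTION_TYPE` column name is an assumption about the extended SIF layout, and the set below is the directed list quoted in this thread):

```python
import pandas

DIRECTED_TYPES = {
    "controls-state-change-of", "controls-transport-of",
    "controls-phosphorylation-of", "controls-expression-of",
    "catalysis-precedes", "consumption-controlled-by",
    "controls-production-of", "controls-transport-of-chemical",
    "chemical-affects", "used-to-produce", "consumption-controled-by",
}

def directed_only(sif: pandas.DataFrame) -> pandas.DataFrame:
    # Drop rows whose interaction type Pathway Commons treats as undirected
    # (in-complex-with, interacts-with, neighbor-of, reacts-with).
    return sif[sif["INTERACTION_TYPE"].isin(DIRECTED_TYPES)]
```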
    Returns an ordered map (lowest to highest weight) from the
    weight to the number of elements the weight has.

    The full workflow for this function should be:
Why do you need this here?
| print("Creating item samples...") | ||
| full_list: list[int] = [] | ||
| curr_v = 0 | ||
| for k, v in weight_mapping.items(): |
There was a problem hiding this comment.
| for k, v in weight_mapping.items(): | |
| # Sampling percentage of edges from each weight bucket is equivalent to sampling percentage of the full interactome, since the buckets partition all edges once. | |
| for k, v in weight_mapping.items(): |
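The equivalence the suggested comment claims can be sketched directly (names hypothetical): sampling `fraction` of each weight bucket samples `fraction` of the whole edge set, because the buckets partition it.

```python
import random

def sample_per_bucket(buckets, fraction, seed=0):
    # Stratified sampling over the weight buckets; since every edge sits in
    # exactly one bucket, the overall kept fraction matches `fraction`
    # (up to rounding within each bucket).
    rng = random.Random(seed)
    sampled = []
    for weight in sorted(buckets):
        edges = buckets[weight]
        sampled.extend(rng.sample(edges, int(len(edges) * fraction)))
    return sampled

buckets = {0.2: [("a", str(i)) for i in range(10)],
           0.8: [("b", str(i)) for i in range(10)]}
assert len(sample_per_bucket(buckets, 0.5)) == 10  # 50% of 20 edges
```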
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
On this PR again. Got sidetracked by #43.
This workflow is currently completely broken for pathway fetching. It turns out that the associated OWL file is about 11 GB uncompressed, so I can't do the processing on GHA; I'll approach this similarly to how you suggested for DepMap instead.