dataset: synthetic from PANTHER by tristan-f-r · Pull Request #25 · Reed-CompBio/spras-benchmarking

tristan-f-r · 2025-07-01T18:44:17Z

Co-Authored-By: Neha Talluri 78840540+ntalluri@users.noreply.github.com
Co-Authored-By: Oliver Faulkner Anderson 112665860+oliverfanderson@users.noreply.github.com
Co-Authored-By: Altaf Barelvi altafayyubibarelvi@gmail.com

ntalluri · 2025-07-01T20:18:37Z

datasets/synthetic-data/Snakefile

@@ -0,0 +1,100 @@
+pathways = ["Apoptosis_signaling", "B_cell_activation",


Each of the files has this variable. I think we should define it only in the Snakefile and pass the list to each file that uses it.

datasets/synthetic-data/Snakefile

ntalluri · 2025-07-03T18:16:43Z

A question I’d appreciate feedback on: Currently, we generate separate source, target, and prize files for each pathway, but we combine all pathways into each thresholded interactome. Should we also create a combined list of sources, targets, and prizes? Should we also combine the gold standard as well? Or would it be better to keep separate interactomes for each individual pathway (keep it the way it is)?

tristan-f-r · 2025-07-28T18:11:41Z

We should have separate gold standards.

ntalluri · 2025-07-28T18:33:18Z

When this is reviewed (or before), we should run tests to see how connected the networks are after thresholding, adding back the pathway data, and removing proteins that don't have UniProt IDs.

ntalluri · 2025-07-28T18:52:38Z

Also, there is a chance we can use more PANTHER pathways; we should look to see what else we can use from Pathway Commons.

ntalluri · 2025-07-31T16:35:50Z

@oliverfanderson @ctrlaltaf For the gold standard nodes (and potentially the edges), should we exclude source, target, and prize nodes when defining it? Currently, it looks like we’re including these nodes in the gold standard for each pathway. These nodes overlap with the gold standard, but that overlap should happen naturally, not by construction/being predefined. I’m concerned this could inflate our precision and recall metrics, because of a form of data leakage.

ntalluri · 2025-08-11T21:25:55Z

@oliverfanderson @ctrlaltaf For the gold standard nodes (and potentially the edges), should we exclude source, target, and prize nodes when defining it? Currently, it looks like we’re including these nodes in the gold standard for each pathway. These nodes overlap with the gold standard, but that overlap should happen naturally, not by construction/being predefined. I’m concerned this could inflate our precision and recall metrics, because of a form of data leakage.

Plan: keep all of them in the gold standard, but update the evaluation code to handle sources/targets/prizes being in the gold standard, and report them as a separate baseline in which they are all set to frequency 1.0.
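To make the proposed baseline concrete, here is a minimal sketch of scoring an "inputs only" prediction against a gold standard. It is not the SPRAS evaluation code: the function names and the toy node sets are hypothetical, and it assumes node-level precision and recall.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Node-level precision and recall of a predicted node set against a gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def input_baseline(sources: set[str], targets: set[str], prizes: set[str]) -> set[str]:
    """Baseline prediction: every input node is treated as present with frequency 1.0."""
    return sources | targets | prizes


# Toy data, made up for illustration only.
gold = {"A", "B", "C", "D", "E"}
baseline = input_baseline(sources={"A"}, targets={"E"}, prizes={"A", "C"})
print(precision_recall(baseline, gold))  # (1.0, 0.6)
```

Because the inputs are in the gold standard by construction, the baseline's precision is 1.0 here; an algorithm only adds value to the extent that it beats this baseline's recall without losing precision.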

ntalluri · 2025-08-13T19:33:24Z

Should we also consider how sparse an interactome becomes after applying a threshold to the STRING interactome? When we filter by size, we are implicitly accounting for the decrease in graph density as well. Would it make more sense to treat size and density as separate variables when evaluating performance? However, does testing for density even matter in this context? Are there any interactomes that aren’t already highly connected?

I’m thinking we should first threshold the interactomes, then select only those that are highly connected (e.g., density ≥ 0.85). From that subset, we could choose a few to represent different size scales.
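A small sketch of what such a filter could measure. "Highly connected" can be read two ways, so both are shown: edge density, 2m / (n(n - 1)) for an undirected graph, and the share of nodes in the largest connected component. The function names are hypothetical and the code is plain Python to stay dependency-free; networkx provides `density` and `connected_components` for the same purpose.

```python
from collections import defaultdict, deque

Edge = tuple[str, str]


def density(edges: list[Edge]) -> float:
    """Undirected density 2m / (n (n - 1)); duplicate edges and self-loops are ignored."""
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    nodes = {node for edge in unique for node in edge}
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(unique) / (n * (n - 1))


def largest_component_fraction(edges: list[Edge]) -> float:
    """Share of nodes that sit in the largest connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen: set[str] = set()
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best / len(adj) if adj else 0.0


triangle = [("A", "B"), ("B", "C"), ("A", "C")]
path = [("A", "B"), ("B", "C"), ("C", "D")]
print(density(triangle), density(path))  # 1.0 0.5
print(largest_component_fraction([("A", "B"), ("C", "D"), ("D", "E")]))  # 0.6
```

Under the edge-density reading, a 0.85 cutoff requires a nearly complete graph, so it may be worth checking which of the two measures the threshold is meant for before filtering.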

ntalluri · 2025-11-06T20:13:46Z

I will be updating how we create interactomes for the Panther pathways dataset.

Current:
Our current thresholded interactomes are built by applying a hard cutoff on experimental scores (keeping edges with score ≥ x).
While this approach retains high-score edges and removes low-score ones, it distorts the original score distribution, which could be problematic for algorithms that rely on edge scores during optimization.

New:
Instead, we should take a downsampling approach to the interactomes: build smaller interactomes that preserve the score distribution of the original network while also reducing the total edge count.

For example, in the STRING interaction networks, when using only physical interactions and experimental edge scores, we could aim to keep 25% of all edges.
To achieve this, we could:

Sample 25% (or remove 75%) of edges uniformly at random, ignoring edge scores
Or use stratified sampling by edge score bins to preserve the distribution of scores

We will now construct new interactomes by removing X% of edges and then adding all edges from all chosen PANTHER pathways. We will only keep downsampled interactomes that satisfy specified properties for a given set of sources and targets.

Proposed brute-force method for Panther pathways interactomes:

edge removal

Randomly remove X edges from the full STRING interactome

Option A: Remove edges uniformly at random (ignoring scores)
Option B: Stratify edges by score bins and sample within each bin to maintain the overall score distribution
- We can stratify the edges into bins based on their score ranges (example: [0–300], [300–600], [600–900] ...).
- Then, when we randomly remove X edges, we can remove them proportionally from each bin.
- Example: suppose 20% of the original edges fall in the 0–300 bin, 50% in the 300–600 bin, and 30% in the 600–900 bin. If we're removing 1000 edges total, we'd randomly pick ~200 from the first bin, ~500 from the second, and ~300 from the third.
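Option B can be sketched as follows. This is an illustration, not the code in this PR: `stratified_keep`, the half-open bins of width 300, and the toy edge list are assumptions. The toy data reproduces the example above: 4000 edges split 20% / 50% / 30% across three bins, with 1000 edges removed (75% kept).

```python
import random
from collections import Counter, defaultdict


def stratified_keep(edges, keep_fraction, bin_width=300, seed=0):
    """Keep about keep_fraction of the (u, v, score) edges from every score bin.

    Bins are half-open: [0, 300), [300, 600), ... for the default width.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for edge in edges:
        bins[int(edge[2] // bin_width)].append(edge)
    kept = []
    for key in sorted(bins):
        members = bins[key]
        kept.extend(rng.sample(members, round(len(members) * keep_fraction)))
    return kept


# 800 edges in the first bin, 2000 in the second, 1200 in the third.
edges = (
    [(f"a{i}", f"b{i}", 100) for i in range(800)]
    + [(f"c{i}", f"d{i}", 400) for i in range(2000)]
    + [(f"e{i}", f"f{i}", 700) for i in range(1200)]
)
kept = stratified_keep(edges, keep_fraction=0.75, seed=1)
before = Counter(score for _, _, score in edges)
after = Counter(score for _, _, score in kept)
print({score: before[score] - after[score] for score in before})
# {100: 200, 400: 500, 700: 300}
```

Rounding per bin means the total kept can differ from `keep_fraction` of all edges by a few edges when bin sizes do not divide evenly.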

pathway integration

Add all edges from the selected Panther pathways to the new downsampled interactome

Property checks

Verify that the new network maintains the following properties:

All-in-one: all sources and targets lie in the same connected component
- V_{ST} = {all sources} U {all targets}, and after edge removal, all vertices in V_{ST} should belong to the same connected component of the graph
Reachability: every target is reachable from at least one source (can be checked via all-pairs shortest paths (APSP) or BFS)
We might want to soften these criteria so that X% of the sources and targets remain in a single component and (potentially) X% of the targets are reachable from at least one source.
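The two properties can be checked with a breadth-first search. The sketch below is an assumption-laden illustration: the function names are hypothetical, the all-in-one check treats edges as undirected, and the reachability check follows edge direction. A single multi-source BFS is enough for reachability, so APSP is not needed.

```python
from collections import defaultdict, deque


def adjacency(edges, directed):
    """Build an adjacency map; undirected edges are stored in both directions."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)
    return adj


def reachable(adj, starts):
    """Breadth-first search from every start node at once."""
    seen = set(starts)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def all_in_one(edges, sources, targets):
    """True if every source and target lies in one connected component."""
    terminals = set(sources) | set(targets)
    if not terminals:
        return True
    component = reachable(adjacency(edges, directed=False), [next(iter(terminals))])
    return terminals <= component


def reachable_target_fraction(edges, sources, targets):
    """Share of targets reachable from at least one source along directed edges."""
    targets = set(targets)
    if not targets:
        return 1.0
    reached = reachable(adjacency(edges, directed=True), sources)
    return len(targets & reached) / len(targets)


edges = [("S1", "A"), ("A", "T1"), ("T2", "A")]
print(all_in_one(edges, {"S1"}, {"T1", "T2"}))                 # True
print(reachable_target_fraction(edges, {"S1"}, {"T1", "T2"}))  # 0.5
```

The softened criteria amount to comparing `reachable_target_fraction` (and an analogous component fraction) against a chosen threshold instead of requiring 1.0.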

Restart if necessary

If the properties above are not satisfied, repeat the process with a different random sample.
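A minimal sketch of the restart loop, assuming uniform sampling (Option A) and a caller-supplied validity check. `sample_until_valid`, `is_valid`, and `max_attempts` are hypothetical names. In the proposed method the pathway edges would be merged in before the check, which here is left to `is_valid`. A seeded `random.Random` keeps the accepted sample reproducible.

```python
import random


def sample_until_valid(edges, keep_fraction, is_valid, seed=0, max_attempts=100):
    """Draw uniform random edge samples until one passes is_valid.

    Returns the accepted sample and the number of attempts it took.
    """
    rng = random.Random(seed)
    k = round(len(edges) * keep_fraction)
    for attempt in range(1, max_attempts + 1):
        candidate = rng.sample(edges, k)
        if is_valid(candidate):
            return candidate, attempt
    raise RuntimeError(f"no valid sample found in {max_attempts} attempts")


edges = [("S", "A"), ("A", "T"), ("S", "T"), ("B", "C")]


def keeps_direct_edge(sample):
    """Toy property: the direct S-T edge must survive the downsampling."""
    return ("S", "T") in sample


sample, attempts = sample_until_valid(edges, keep_fraction=0.5, is_valid=keeps_direct_edge, seed=42)
print(sample, attempts)
```

Capping the number of attempts avoids looping forever when the properties cannot be met at a given sampling percentage.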

ntalluri · 2025-11-06T20:13:52Z

We are planning to use this dataset for all of the evaluations. I was deciding whether we need to use all of the pathways, and I don't think we do. I picked a few that we can use:

Balanced
Interleukin_signaling - 86 GS Nodes, 811 GS Edges, 18 S / 16 T (ratios 0.209 / 0.186)
Apoptosis_signaling - 108 GS Nodes, 286 GS Edges, 6 S / 17 T (0.056 / 0.157)

Skewed
Cadherin_signaling - 150 GS Nodes, 2650 GS Edges, 17 S / 3 T (0.113 / 0.02)
PDGF_signaling - 125 GS Nodes, 764 GS Edges, 2 S / 28 T (0.016 / 0.224)
Toll_signaling - 44 GS Nodes, 84 GS Edges, 8 S / 3 T (0.182 / 0.068)

Tiny
Hedgehog_signaling - 19 GS Nodes, 61 GS Edges, 2 S / 2 T (0.105 / 0.105)
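The ratios listed above appear to be the number of sources and targets divided by the number of gold standard (GS) nodes. That reading is inferred from the numbers, not stated in the thread; the short check below reproduces every listed pair under it.

```python
# (GS nodes, sources, targets) copied from the list above.
pathways = {
    "Interleukin_signaling": (86, 18, 16),
    "Apoptosis_signaling": (108, 6, 17),
    "Cadherin_signaling": (150, 17, 3),
    "PDGF_signaling": (125, 2, 28),
    "Toll_signaling": (44, 8, 3),
    "Hedgehog_signaling": (19, 2, 2),
}


def terminal_ratios(gs_nodes: int, sources: int, targets: int) -> tuple[float, float]:
    """Source and target counts as a share of gold standard nodes, rounded to 3 places."""
    return round(sources / gs_nodes, 3), round(targets / gs_nodes, 3)


for name, counts in pathways.items():
    print(name, terminal_ratios(*counts))
```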

When making the interactomes, I want to add all of these pathways on the thresholded interactomes and uphold the properties above.

I need to double-check whether using any of these will break the rules for pilot data/runs, but since we are making a new dataset that wasn't used for my thesis, I think we will be okay.

tristan-f-r · 2026-01-24T05:10:53Z

Made minor changes to fix the interactome fetching - these shouldn't cause any conflicts, nor were the changes I wanted to make as mentioned in Slack. [If they do, feel free to force push.]

tristan-f-r · 2026-02-11T23:43:24Z

As a textual tl;dr, I need to still add trim.py, and I want to update the paxtools utility to [efficiently] do BioPAX extraction from the large .owl file provided by PC.

I'm going to make that trim utility, split this PR to add those utilities, and merge the EGFR interactome updates and sampling using those utility scripts, then jump back to doing the BioPAX parsing.

ntalluri · 2026-02-12T16:13:40Z

datasets/synthetic-data/scripts/sampling.py

Would you be able to add more information to the README on how we are downsampling the interactomes, and add the pseudocode?

datasets/synthetic-data/explore/README.md

yay!!

ntalluri · 2026-02-18T18:22:47Z

datasets/synthetic-data/scripts/sampling.py

+    sources, targets = sources_and_targets(node_data_df)
+
+    # TODO: isolate percentage constant (this currently builds up 0%, 10%, ..., 100%)
+    for percentage in map(lambda x: (x + 1) / 10, range(10)):


What if a user wants to explicitly specify an X%? I was thinking of using 33% and 66% for the computational performance evaluation.

I might not be understanding what this percentage is. Is it how large the interactome will be, or is it the percentage of sources/targets that are connected?

I'm currently using it for both. I planned originally to only make X% not hard-coded to be 10%, but I like the idea of decoupling the two as well. I'll tack this on with the graph connectedness check.

tools/sample.py

ntalluri · 2026-02-18T18:34:28Z

tools/sample.py

+    # We ask that at least `percentage` of the sources and targets are connected with one another.
+    connection_percentage = float(len(curr_connections)) / float(len(prev_connections))
+
+    if percentage < connection_percentage:


Why is this percentage being used for both the number of sources and targets that are connected and also how much of the interactome to sample?

Am I right to assume that if percentage = 10, then 10% of the interactome is sampled, and the downsampled interactome is valid when the sources/targets connection_percentage is greater than 10%?
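One way the decoupling discussed in this thread could look is to carry the two percentages as separate, validated parameters. This is a sketch only: `SamplingConfig`, `sample_fraction`, `min_connected_fraction`, and `connections_preserved` are hypothetical names, not identifiers from tools/sample.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    sample_fraction: float         # share of interactome edges to keep, e.g. 0.33
    min_connected_fraction: float  # share of source/target connections that must survive

    def __post_init__(self):
        for name in ("sample_fraction", "min_connected_fraction"):
            value = getattr(self, name)
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")


def connections_preserved(curr_connections: int, prev_connections: int, config: SamplingConfig) -> bool:
    """Compare surviving source/target connections against the connectivity threshold only."""
    if prev_connections == 0:
        return True  # nothing to preserve
    return curr_connections / prev_connections >= config.min_connected_fraction


config = SamplingConfig(sample_fraction=0.33, min_connected_fraction=0.9)
print(connections_preserved(9, 10, config))  # True
print(connections_preserved(8, 10, config))  # False
```

With this split, the 33% and 66% interactome sizes can be requested without also loosening or tightening the connectivity requirement.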

ntalluri · 2026-02-18T18:39:07Z

tools/sample.py

+from typing import OrderedDict, Optional
+import os
+
+def count_weights(weights: dict[float, int]) -> OrderedDict[float, int]:


Could you also add a one line explanation why we are bucketing each weight into its own bucket?

ntalluri · 2026-02-18T18:40:16Z

tools/sample.py

+def count_weights(weights: dict[float, int]) -> OrderedDict[float, int]:
+    """
+    Returns an ordered map (lowest to highest weight) from the
+    weight to the number of elements the weight has.


Are the elements the edges in the interactome that have this weight?
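If the "elements" are indeed edges (the question above, not confirmed in this thread), the mapping described by the docstring can be built directly from a list of edge weights. `count_edge_weights` is a hypothetical helper, not the `count_weights` function in the diff.

```python
from collections import Counter, OrderedDict


def count_edge_weights(weights: list[float]) -> "OrderedDict[float, int]":
    """Map each distinct weight to the number of edges carrying it, lowest weight first."""
    return OrderedDict(sorted(Counter(weights).items()))


buckets = count_edge_weights([0.4, 0.9, 0.4, 0.15])
print(list(buckets.items()))  # [(0.15, 1), (0.4, 2), (0.9, 1)]
```

Giving every distinct weight its own bucket means that sampling the same percentage from each bucket preserves the weight distribution exactly, up to rounding, without having to choose bin boundaries.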

ntalluri · 2026-02-18T18:41:13Z

datasets/synthetic-data/scripts/sampling.py

+    return SourcesTargets(sources, targets)
+
+
+def main():


Are you planning on seeding this?

ntalluri · 2026-02-18T18:42:12Z

tools/sample.py

+    """
+    return collections.OrderedDict(sorted({k: int(v) for k, v in weights.items()}.items()))
+
+def find_connected_sources_targets(


Are you planning to check any other properties? I would like everything to be in one connected component, so that would be another property to check.

ntalluri · 2026-02-18T18:45:05Z

tools/sample.py

+
+    print(f"Merging {pathway_name} with interactome...")
+    # While we are merging this graph, we are preparing to compare the connectedness of the prev[ious] and curr[ent] (merged) graph.
+    prev_graph = networkx.from_pandas_edgelist(pathway_df, source="Interactor1", target="Interactor2")


What are the previous and current graphs? Is the previous graph the actual pathway, and the current graph the interactome merged with the actual pathway? If so, can you document that somewhere as a comment?

ntalluri · 2026-02-18T18:47:20Z

datasets/synthetic-data/scripts/sampling.py

+
+def read_pathway(pathway_name: str) -> pandas.DataFrame:
+    """
+    Returns the directed-only pathway from a pathway name,


Why are we assuming that the pathway is fully directed? I would need to look at each pathway for an example, but we can't assume that each pathway is fully directed.

directed = [
    "controls-state-change-of",
    "controls-transport-of",
    "controls-phosphorylation-of",
    "controls-expression-of",
    "catalysis-precedes",
    "consumption-controlled-by",
    "controls-production-of",
    "controls-transport-of-chemical",
    "chemical-affects",
    "used-to-produce",
    "consumption-controled-by",
]

undirected = ["in-complex-with", "interacts-with", "neighbor-of", "reacts-with"]

This is information from pathwaycommons about the directionality of pathways.

We aren't assuming that the pathway is directed: I should say 'directed-coerced' instead.

Why do you need to though?
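For reference, a sketch of how the lists above could be used to keep directionality instead of coercing every edge to directed. It assumes SIF-style rows of (participant A, interaction type, participant B); `split_by_direction` is a hypothetical helper, and the type names are copied from the comment above, including the "consumption-controled-by" spelling.

```python
DIRECTED = frozenset({
    "controls-state-change-of", "controls-transport-of", "controls-phosphorylation-of",
    "controls-expression-of", "catalysis-precedes", "consumption-controlled-by",
    "controls-production-of", "controls-transport-of-chemical", "chemical-affects",
    "used-to-produce", "consumption-controled-by",
})
UNDIRECTED = frozenset({"in-complex-with", "interacts-with", "neighbor-of", "reacts-with"})


def split_by_direction(rows):
    """Split (a, interaction_type, b) rows into directed and undirected edge lists."""
    directed, undirected = [], []
    for a, interaction_type, b in rows:
        if interaction_type in DIRECTED:
            directed.append((a, b))
        elif interaction_type in UNDIRECTED:
            undirected.append((a, b))
        else:
            raise ValueError(f"unknown interaction type: {interaction_type}")
    return directed, undirected


# Toy rows, made up for illustration.
rows = [
    ("P1", "controls-phosphorylation-of", "P2"),
    ("P2", "in-complex-with", "P3"),
    ("P3", "catalysis-precedes", "P4"),
]
print(split_by_direction(rows))
# ([('P1', 'P2'), ('P3', 'P4')], [('P2', 'P3')])
```

Raising on unknown types makes it visible when a pathway file contains an interaction type that neither list covers.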

tools/sample.py

ntalluri · 2026-02-18T18:56:31Z

tools/sample.py

+    Returns an ordered map (lowest to highest weight) from the
+    weight to the number of elements the weight has.
+
+    The full workflow for this function should be:


Why do you need this here?

ntalluri · 2026-02-18T18:57:31Z

tools/sample.py

+    print("Creating item samples...")
+    full_list: list[int] = []
+    curr_v = 0
+    for k, v in weight_mapping.items():


Suggested change

for k, v in weight_mapping.items():

# Sampling percentage of edges from each weight bucket is equivalent to sampling percentage of the full interactome, since the buckets partition all edges once.

for k, v in weight_mapping.items():

Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

tristan-f-r · 2026-02-24T23:20:47Z

On this PR again. Got sidetracked by #43.

tristan-f-r · 2026-02-25T08:56:29Z

This workflow is currently completely broken for pathway fetching. It turns out that the associated OWL file is about 11 GB uncompressed, so I can't do the processing on GHA. I'll approach this similarly to how you suggested for DepMap instead.

feat: synthetic pathways

20b1580

tristan-f-r added the dataset Mutating datasets in any way. label Jul 1, 2025

ntalluri reviewed Jul 1, 2025

datasets/synthetic-data/Snakefile (outdated)

Merge branch 'main' into synthetic

fc12b4e

tristan-f-r mentioned this pull request Jul 30, 2025

dataset: DepMap #41

Merged

tristan-f-r and others added 5 commits January 5, 2026 19:03

Merge branch 'main' into synthetic

8ff381f

fix: use full protein links to unify synthetic with databases

f7c0c2d

Merge branch 'main' into synthetic

73b6d93

re-correct links

2ce621a

fix: interactome fetching

280b92a

tristan-f-r and others added 9 commits January 23, 2026 21:19

fix(diseases): fetch correct string links

db30556

chore: mv to scripts

0658528

chore: move to scripts, Pathify

e024e2c

style: fmt

7b09381

drop old thresholding

2a5feec

begin sampling

e389b32

chore: mv

af0ac30

rename

d1ade54

fix: compute weight counts normally

7483eea

tristan-f-r added 4 commits February 4, 2026 05:28

fix: file extensions and such

751a8f2

chore: explore and such

2fceaa9

feat: base thresholding workflow

ac5b93c

chore: add paxtools

5cb7352

for biopax -> extended sif conversion

tristan-f-r added 2 commits February 12, 2026 01:09

feat: trimming

9d3e194

style: fmt

81a4e4e

ntalluri reviewed Feb 12, 2026

datasets/synthetic-data/explore/README.md

tristan-f-r added 2 commits February 18, 2026 05:07

feat: full interactome parsing

d2cc7e4

yay!!

refactor: isolate argparse parser

38aef2c

ntalluri reviewed Feb 18, 2026

tools/sample.py

ntalluri reviewed Feb 18, 2026

tools/sample.py (outdated)

ntalluri reviewed Feb 18, 2026

tools/sample.py

ntalluri reviewed Feb 18, 2026

tristan-f-r and others added 2 commits February 18, 2026 14:05

docs: suggestion

a881afd

Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

Merge branch 'main' into synthetic

5e9653a

reorganize, begin using owl file

db5a09e

tristan-f-r mentioned this pull request Feb 25, 2026

docs(egfr): clarify string ensp nodes #61

Open
