
Optimize MapSet.symmetric_difference/2 when sizes mismatched #15471

Open
preciz wants to merge 1 commit into elixir-lang:main from preciz:optimize-mapset-symmetric-difference

preciz commented Jun 13, 2026

Contributor

By folding over the smaller set and using the larger set as the starting accumulator, the work is reduced from O(large) to O(small) fold iterations. This gives a speedup of over 100x when the set sizes are mismatched.

When the set sizes match, performance is the same as before.
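
For readers who want to see the idea in isolation, here is a minimal sketch written against the public MapSet API rather than the :sets calls the patch itself uses. The module name SymDiffSketch is hypothetical and is not part of the PR. It toggles each element of the smaller set in the larger one, so only the smaller set is traversed.

```elixir
defmodule SymDiffSketch do
  # Hypothetical illustration, not the PR's implementation.
  # The larger set is the starting accumulator; each element of the
  # smaller set is removed if already present and added otherwise.
  def symmetric_difference(%MapSet{} = a, %MapSet{} = b) do
    {small, large} = if MapSet.size(a) <= MapSet.size(b), do: {a, b}, else: {b, a}

    Enum.reduce(small, large, fn elem, acc ->
      if MapSet.member?(acc, elem) do
        MapSet.delete(acc, elem)
      else
        MapSet.put(acc, elem)
      end
    end)
  end
end

small = MapSet.new([1, 2, 200])
large = MapSet.new(1..100)
result = SymDiffSketch.symmetric_difference(small, large)

# Compare against the standard library result for the same inputs.
IO.inspect(MapSet.equal?(result, MapSet.symmetric_difference(small, large)))
```

Checking membership in the accumulator is equivalent to checking it in the larger set, because the elements of a set are unique and each one is therefore toggled at most once.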

Assisted-by: Antigravity CLI : Claude Opus 4.6 & Gemini Flash 3.5

@josevalim

Member

Hi @preciz, for documentation purposes, can you share the benchmarks you ran, along with the input sizes?

preciz commented Jun 13, 2026

Contributor Author

It's skewed towards the mismatched-size case.

Mix.install([{:benchee, "~> 1.0"}])

defmodule Bench do
  def old_sym_diff(map_set1 = %MapSet{map: set1}, _map_set2 = %MapSet{map: set2}) do
    {small, large} = if :sets.size(set1) <= :sets.size(set2), do: {set1, set2}, else: {set2, set1}

    disjointer_fun = fn elem, {small, acc} ->
      if :sets.is_element(elem, small) do
        {:sets.del_element(elem, small), acc}
      else
        {small, [elem | acc]}
      end
    end

    {new_small, list} = :sets.fold(disjointer_fun, {small, []}, large)
    %{map_set1 | map: :sets.union(new_small, :sets.from_list(list, version: 2))}
  end

  def new_sym_diff(map_set1 = %MapSet{map: set1}, _map_set2 = %MapSet{map: set2}) do
    {small, large} = if :sets.size(set1) <= :sets.size(set2), do: {set1, set2}, else: {set2, set1}

    map =
      :sets.fold(
        fn elem, acc ->
          if :sets.is_element(elem, acc) do
            :sets.del_element(elem, acc)
          else
            :sets.add_element(elem, acc)
          end
        end,
        large,
        small
      )

    %{map_set1 | map: map}
  end
end

equal_small = MapSet.new(1..100)
equal_large = MapSet.new(101..200)

diff_huge1 = MapSet.new(1..100000)
diff_huge2 = MapSet.new(50000..150000)

small_1 = MapSet.new(1..10)
large_1 = MapSet.new(1..100000)

Benchee.run(
  %{
    "old" => fn {set1, set2} -> Bench.old_sym_diff(set1, set2) end,
    "new" => fn {set1, set2} -> Bench.new_sym_diff(set1, set2) end
  },
  inputs: %{
    "Equal Small (100)" => {equal_small, equal_large},
    "Huge Overlapping (100,000)" => {diff_huge1, diff_huge2},
    "Mismatched Sizes (10 vs 100,000)" => {small_1, large_1}
  }
)

On my noisy, heat-throttling machine:

##### With input Equal Small (100) #####
Name           ips        average  deviation         median         99th %
new       217.85 K        4.59 μs    ±65.02%        4.37 μs        9.13 μs
old       161.03 K        6.21 μs   ±110.95%        5.95 μs        9.81 μs

Comparison:
new       217.85 K
old       161.03 K - 1.35x slower +1.62 μs

##### With input Huge Overlapping (100,000) #####
Name           ips        average  deviation         median         99th %
old          65.69       15.22 ms    ±11.68%       14.78 ms       23.03 ms
new          63.87       15.66 ms    ±12.26%       14.83 ms       23.21 ms

Comparison:
old          65.69
new          63.87 - 1.03x slower +0.43 ms

##### With input Mismatched Sizes (10 vs 100,000) #####
Name           ips        average  deviation         median         99th %
new       973.09 K     0.00103 ms   ±875.87%     0.00094 ms     0.00169 ms
old       0.0777 K       12.88 ms     ±7.94%       12.70 ms       15.89 ms

Comparison:
new       973.09 K
old       0.0777 K - 12530.18x slower +12.88 ms

@josevalim

Member

I see. For both scenarios (different sizes and similar sizes), we should probably test the cases where the sets have half of their elements in common, most in common, and nothing in common.
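
One way to build such inputs is sketched below. The OverlapInputs module, its pair/3 function, the labels, and the choice of 9/10 for "most" are all assumptions made for illustration; none of them come from the PR. The resulting map has the shape Benchee expects for its inputs option.

```elixir
defmodule OverlapInputs do
  # Hypothetical helper: returns two sets of the given sizes that share
  # exactly `common` elements (requires common <= min(size1, size2)).
  def pair(size1, size2, common) do
    set1 = MapSet.new(1..size1)

    # The first `common` integers are shared with set1; the remaining
    # elements start above size1 so they cannot collide with it.
    shared = 1..common//1
    unique = (size1 + 1)..(size1 + size2 - common)//1
    set2 = MapSet.new(Enum.concat(shared, unique))

    {set1, set2}
  end
end

# Two size scenarios crossed with three overlap levels, six inputs in total.
inputs =
  for {size_label, {s1, s2}} <- [{"similar", {100_000, 100_000}}, {"mismatched", {10, 100_000}}],
      {overlap_label, {num, den}} <- [{"nothing", {0, 1}}, {"half", {1, 2}}, {"most", {9, 10}}],
      into: %{} do
    common = div(min(s1, s2) * num, den)
    {"#{size_label} sizes, #{overlap_label} in common", OverlapInputs.pair(s1, s2, common)}
  end

IO.inspect(Map.keys(inputs))
```

The overlap is expressed relative to the smaller set, which is only one possible reading of "half in common" when the sizes differ.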

@sabiwara

Contributor

Yes, and we should also bench with memory_time.
In this case it seems we're mostly reducing memory usage, looks good.
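
As a rough sketch of what that could look like: Benchee skips memory measurement unless memory_time is set to a positive number of seconds. The example below benchmarks the standard library function on a single mismatched input purely for illustration; the two-second duration is an arbitrary choice, and fetching Benchee through Mix.install needs network access.

```elixir
Mix.install([{:benchee, "~> 1.0"}])

small = MapSet.new(1..10)
large = MapSet.new(1..100_000)

Benchee.run(
  %{
    "MapSet.symmetric_difference/2" => fn {a, b} -> MapSet.symmetric_difference(a, b) end
  },
  inputs: %{"Mismatched Sizes (10 vs 100,000)" => {small, large}},
  # Seconds spent measuring memory per scenario; a positive value enables
  # the memory statistics section in the report.
  memory_time: 2
)
```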
