Improvements to NonStationaryConvolve2D #466

This issue collects a few improvements for `NonStationaryConvolve2D`, as raised in https://github.com/PyLops/pylops/pull/465 (plus a couple of others):

* **Refactor bilinear interpolation**
   The [CUDA bilinear interp](https://github.com/PyLops/pylops/blob/ce8a0151cfb6b36a12b5a1f5f6f9e3a5d862bf4c/pylops/signalprocessing/_nonstatconvolve2d_cuda.py#L18-L45) is the same as the [CPU one](https://github.com/PyLops/pylops/blob/ce8a0151cfb6b36a12b5a1f5f6f9e3a5d862bf4c/pylops/signalprocessing/nonstatconvolve2d.py#L175-L197), so it makes sense to share the code in a single function. This can be done by writing a pure Python function and then defining a device function from it. For example:
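   ```python
   from numba import cuda

   # CPU: plain Python function
   def fun(a, b):
       return a + b

   # GPU: device function compiled from the same Python function
   dev_fun = cuda.jit(device=True)(fun)
   ```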
* **Move CPU to auxiliary file**
   Since the GPU functions already live in a separate file, and taking into account the refactor above, it makes sense to move the CPU functions to an auxiliary file of their own as well.

* **Improve thread launch parameters and iteration model**
   Currently, the size of the model is limited by the number of blocks and threads used in the kernel call. There are two approaches to fixing this. The first is to make the total number of threads depend on the model. For example, if the model is 1000 x 1000, launch 1024 x 1024 threads in total (say, 32 x 32 blocks of 32 x 32 threads each). The problem appears with very large models: the number of threads required exceeds the number that the GPU can provide.
   [Grid-stride looping](https://towardsdatascience.com/cuda-by-numba-examples-1-4-e0d06651612f) can help with this, as it allows arbitrarily sized models to be processed by a smaller, fixed number of threads. For example, always launch 1024 x 1024 threads and you can process a model of any size. The issue with naive grid-stride looping is that the number of threads may not be optimal. For example, a 32 x 32 model still launches 1024 x 1024 threads.
   A solution is to combine both techniques: use an input-dependent number of blocks and threads (each a power of 2), but cap the total number of threads launched. The cap could even be set per GPU, but it is probably OK to hardcode a reasonable value derived from the `MAX_GRID_DIM_X` attribute (see [here](https://gist.github.com/cako/6b89a4045918f9428864003a51ab6801#file-cuda-by-numba-examples-01-13-py)).
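
The combined strategy for the launch parameters could be sketched roughly as follows. This is only an illustration in plain Python (no GPU needed): the helper names `next_pow2`, `launch_config` and `grid_stride_indices`, as well as the default caps of 32 blocks and 32 threads per dimension, are assumptions made for this example and are not part of PyLops.

```python
import math


def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def launch_config(shape, max_threads_per_dim=32, max_blocks_per_dim=32):
    """Return (blocks_per_grid, threads_per_block) for a 2D model.

    Both caps are hypothetical defaults chosen for this sketch:
    32 x 32 threads per block gives 1024 threads, and 32 x 32 blocks
    caps the launch at 1024 x 1024 threads in total.
    """
    blocks, threads = [], []
    for n in shape:
        # Input-dependent, power-of-two thread count, capped per dimension
        t = min(next_pow2(n), max_threads_per_dim)
        # Enough blocks to cover the axis, also capped per dimension
        b = min(next_pow2(math.ceil(n / t)), max_blocks_per_dim)
        threads.append(t)
        blocks.append(b)
    return tuple(blocks), tuple(threads)


def grid_stride_indices(n, start, stride):
    """Indices visited along one axis by the thread whose global index is
    `start`, when the launch has `stride` threads along that axis."""
    return range(start, n, stride)


print(launch_config((32, 32)))
print(launch_config((1000, 1000)))
print(launch_config((5000, 10)))
```

With these assumed caps, a 32 x 32 model gets a single 32 x 32 block, while the first axis of a 5000 x 10 model gets only 1024 threads, so each thread has to loop over several samples. Inside an actual Numba kernel, the start and stride would presumably come from `cuda.grid(2)` and `cuda.gridsize(2)`.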

* **Create operator that takes filters as input**
   Currently, our operator takes a generic 2D signal as input and the filters as a parameter. However, an alternative operator that takes the 2D signal as a parameter and the filters as input is desirable, as it could be used to estimate the non-stationary filters that best match a given data set and input signal. This is roughly analogous to `pylops.avo.prestack.PrestackWaveletModelling` for the 1D case. I suggest making a completely new operator and calling it `NonStationaryWavelets2D` or `NonStationaryFilters2D`.
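
The reason such an operator is possible is that, for a fixed signal, the output of a non-stationary convolution is linear in the filters. Below is a minimal pure-Python sketch of that idea in 1D, assuming one filter per output sample. The functions `nonstat_convolve` and `make_filter_operator` are hypothetical names used only for illustration; they do not reflect the PyLops API or how the 2D operator stores and interpolates its filters.

```python
def nonstat_convolve(x, filters):
    """1D non-stationary convolution with one centred filter per output sample.

    y[i] = sum_k filters[i][k] * x[i - (k - half)], with half = nh // 2.
    Samples falling outside the signal are treated as zero.
    """
    n, nh = len(x), len(filters[0])
    half = nh // 2
    y = [0.0] * n
    for i in range(n):
        for k in range(nh):
            j = i - (k - half)
            if 0 <= j < n:
                y[i] += filters[i][k] * x[j]
    return y


def make_filter_operator(x):
    """Fix the signal x and return a function that takes the filters as input."""
    def apply(filters):
        return nonstat_convolve(x, filters)
    return apply


x = [1.0, 2.0, 3.0, 4.0]
op = make_filter_operator(x)

h_a = [[1.0, 2.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
h_b = [[0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
# Linear combination of the two filter sets: 2 * h_a + h_b
h_c = [[2.0 * a + b for a, b in zip(fa, fb)] for fa, fb in zip(h_a, h_b)]

y_a, y_b, y_c = op(h_a), op(h_b), op(h_c)
# Linearity in the filters: op(2 * h_a + h_b) equals 2 * op(h_a) + op(h_b)
print(y_c == [2.0 * a + b for a, b in zip(y_a, y_b)])
```

Because the map from filters to data is linear once the signal is fixed, it can be wrapped as a linear operator with its own adjoint and then inverted for the filters, which is what the proposed new operator would do for the 2D case.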
