
Release the GIL in core transformation code #386

Closed
jorisvandenbossche opened this issue Aug 5, 2019 · 8 comments · Fixed by #437
Labels
proposal Idea for a new feature.

Comments

@jorisvandenbossche
Contributor

See discussion in #380 (comment).

By using `nogil` in Cython we can release the GIL in the hot spots of the transformation code, so that other applications can run pyproj code in parallel (e.g. using dask).
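A rough sketch of what that could look like in pyproj's Cython layer, wrapping the `proj_trans_generic` call mentioned later in this thread (the variable names and strides here are illustrative, not the actual pyproj internals):

```cython
# Sketch only: release the GIL around the C-level PROJ call so other
# Python threads can run while PROJ does the number crunching.
# `transf_pj`, `xx`, `yy`, `stride` and `npts` are illustrative names.
with nogil:
    proj_trans_generic(
        transf_pj, PJ_FWD,
        xx, stride, npts,   # x buffer, stride in bytes, count
        yy, stride, npts,   # y buffer
        NULL, 0, 0,         # no z values
        NULL, 0, 0,         # no t values
    )
```

No Python objects may be touched inside the `nogil` block, which is why only the raw C buffers appear there.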

@jorisvandenbossche jorisvandenbossche added the proposal Idea for a new feature. label Aug 5, 2019
@snowman2
Member

snowman2 commented Aug 5, 2019

I think before we do this, we will need to check in with the PROJ devs to see if this is safe. Maybe do some testing with it as well.

Are you currently able to use dask with multithreading? Or is this blocking anything?

@jorisvandenbossche
Contributor Author

I think it would certainly be interesting to discuss thread safety of single PJ objects with the PROJ devs (to do parallel work inside pyproj, as you were trying), as I also mentioned on the PR.
But releasing the GIL would already allow us to 1) run other GIL-releasing calculations in parallel, or 2) run transformations in parallel, provided each thread creates its own Transformer object.

E.g. `to_crs` in geopandas creates a Transformer object within that method, so you could have a case where this method is called on different chunks of the dataframe in parallel.

@snowman2
Member

snowman2 commented Aug 6, 2019

threading related issue: OSGeo/PROJ#1047

@snowman2
Member

snowman2 commented Aug 6, 2019

Looks like you need a new context per thread. This conflicts with issue #374.

@snowman2
Member

Benchmark based on current master, for reference against future changes:

import concurrent.futures
from pyproj import Transformer

TRANSFORMER = Transformer.from_crs(4326, 3857)

EXPECTED = (
    (1447153.3803125564, 1335833.8895192828),
    (1345708.4084091093, 1232106.80189676),
)

def transform_point(_):
    # reuse the shared module-level Transformer
    assert TRANSFORMER.transform((12, 11), (13, 12)) == EXPECTED

def transform_point__recreate(_):
    # recreate the Transformer on every call
    assert Transformer.from_crs(4326, 3857).transform((12, 11), (13, 12)) == EXPECTED

def transform_threaded_recreate():
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(transform_point__recreate, range(20))

def transform_threaded():
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        executor.map(transform_point, range(20))

def transform():
    [transform_point(ii) for ii in range(20)]

def transform_recreate():
    [transform_point__recreate(ii) for ii in range(20)]

[screenshot: timing results for the four functions above]

Based on this, I think you would have to prevent re-starting threads and re-creating transformers each time a thread runs a job. Each thread would have to do a batch of transforms with a Transformer created at the start of the thread job.

@jorisvandenbossche
Contributor Author

> Based on this, I think you would have to prevent re-starting threads and re-creating transformers each time a thread runs a job. Each thread would have to do a batch of transforms with a Transformer created at the start of the thread job.

I don't fully follow what you are saying here.

The use case I have in mind is to_crs on a dask-based GeoDataFrame. You could transform each chunk in parallel, and indeed the function you apply in parallel needs to recreate the Transformer object (as otherwise it cannot be thread-safe).

I did a quick experiment: taking the code base from a few weeks ago (before the global PROJ_CTX was introduced), I added a `with nogil:` statement around the proj_trans_generic call in _Transformer._transform. With 4 cores, it gives me a 2x speed-up (so not really great).

@snowman2
Member

Interesting, so a 2x speedup from releasing the GIL? Does shapely release the GIL?

@jorisvandenbossche
Contributor Author

No, not with shapely. I should have been clearer: it was not a direct timing of to_crs (which indeed currently goes through shapely's transform), but just the pyproj transform part. So what I timed was running this function on 10 million points:

import pyproj

def _geopandas_to_crs(x, y, crs_in, crs_out):
    # specific optimized version for points
    proj_in = pyproj.Proj(crs_in, preserve_units=True)
    proj_out = pyproj.Proj(crs_out, preserve_units=True)
    transformer = pyproj.Transformer.from_proj(proj_in, proj_out, always_xy=True)
    x2, y2 = transformer.transform(x, y)
    return x2, y2

Which is certainly not the only part involved in to_crs. If we want to go this way for geopandas in general, we need functionality to first extract all coordinates of the geometries and then reconstruct the geometries afterwards. That has to happen anyway (currently shapely's transform does it), but it would be interesting to time whether doing all proj transformations in one go, by passing an array, is faster.
