Instantiating many proj objects much slower than 1.9.6 #661

Closed
dmahr1 opened this issue Jun 17, 2020 · 17 comments · Fixed by #675

@dmahr1
Contributor

dmahr1 commented Jun 17, 2020

I have a tool in pyproj 1.9.6 that I use for doing a "reverse lookup" of projections. Given an x/y in unknown CRS and a known longitude/latitude, this finds the projections which place that x/y closest to the known longitude/latitude. This is helpful when trying to track down an unknown coordinate system.

This requires instantiating thousands of proj/Proj objects, but it only takes a few seconds in pyproj 1.9.6. Recently I wanted to upgrade to more recent versions of PROJ and GDAL, but this tool now takes a few minutes in pyproj 2.x, about 50 times longer.

I know that a lot changed in the underlying PROJ C++ library between pyproj 1.9.6 and 2.x. But is there any way to restore the fast instantiation of the proj/Proj objects? The projections don't have to be exact, just close enough for this reverse lookup tool. Also, I am willing to serialize/pickle the proj objects if that would help, though my understanding was that pickling doesn't work with Python C extensions.

@dmahr1 dmahr1 changed the title from "Instantiating all proj objects much slower than 1.9.6" to "Instantiating many proj objects much slower than 1.9.6" on Jun 17, 2020
@snowman2
Member

Unfortunately, there isn't a way to do so that I am aware of. If you re-use them, you could use a dictionary to store the objects and use it to look them up.
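The dictionary idea could look roughly like the following minimal sketch. `ProjCache` is a hypothetical helper, not part of pyproj; the constructor to memoize is passed in (for example `pyproj.Proj`), so the cache itself only needs the standard library. It helps only when the same definition is requested more than once.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class ProjCache(Generic[T]):
    """Build each object once per definition string, then reuse it."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        # factory is whatever builds the expensive object,
        # e.g. pyproj.Proj (an assumption, supplied by the caller).
        self._factory = factory
        self._objects: Dict[str, T] = {}

    def get(self, definition: str) -> T:
        # Only the first request for a definition pays the construction cost.
        if definition not in self._objects:
            self._objects[definition] = self._factory(definition)
        return self._objects[definition]

    def __len__(self) -> int:
        return len(self._objects)
```

With pyproj installed, `cache = ProjCache(pyproj.Proj)` followed by `cache.get("EPSG:3857")` would construct the object on the first call and hand back the stored instance on later calls.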

@snowman2
Member

@dmahr1
Contributor Author

dmahr1 commented Jun 18, 2020

@snowman2 Thanks for the quick reply. To clarify, the repeated transformations case applies only if the same CRS is transformed to/from, correct? It won't help when I need to instantiate 30k different CRSs?

@snowman2
Member

> the repeated transformations case applies only if the same CRS is transformed to/from, correct? It won't help when I need to instantiate 30k different CRSs?

I shared that as another case where we have seen this problem. When you are creating them, it is slower with the new version of PROJ. However, if you are able to cache and re-use them, you will be able to shave off time upon re-use.

@snowman2
Member

@dmahr1 you could also try version 2.3.x and check the speed there, since it handled the context differently.

@dmahr1
Contributor Author

dmahr1 commented Jun 18, 2020

@snowman2 Thank you for the suggestion! In 2.3.1, it took 24 seconds to run the gist, which is still about 10x slower than 1.9.6 but 7x faster than 2.6.1. Even so, I think I found an approach where I can upgrade all the way to 2.6.x.

My plan is to instantiate all of the Proj objects and pickle them. Then, my tool can just unpickle, calculate lng, lat = proj(x, y, inverse=True), and then compute this point's distance to the known lng/lat. I understand that Proj does not account for datum shifts, but the Transformer lacks the __reduce__ needed for pickling. Fortunately, I only need approximate distances to generate a shortlist of viable projections.
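The reverse lookup described above can be sketched as follows. This is an illustration only: `haversine_km` and `rank_candidates` are hypothetical names, the spherical Earth radius of 6371 km is an assumption that suits the "approximate distances" goal, and each candidate is represented as a plain callable mapping `(x, y)` to `(lng, lat)`.

```python
import math
from typing import Callable, Dict, List, Tuple

# An inverse projection maps projected (x, y) to (lng, lat) in degrees.
Inverse = Callable[[float, float], Tuple[float, float]]


def haversine_km(lng1: float, lat1: float, lng2: float, lat2: float) -> float:
    """Great-circle distance on a sphere of radius 6371 km (approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def rank_candidates(
    x: float,
    y: float,
    known_lng: float,
    known_lat: float,
    inverses: Dict[str, Inverse],
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to `limit` (name, distance_km) pairs, closest first."""
    scored = []
    for name, inverse in inverses.items():
        try:
            lng, lat = inverse(x, y)
        except Exception:
            continue  # this candidate cannot invert the point at all
        if not (math.isfinite(lng) and math.isfinite(lat)):
            continue  # skip inf/nan results from out-of-range inputs
        scored.append((name, haversine_km(lng, lat, known_lng, known_lat)))
    scored.sort(key=lambda item: item[1])
    return scored[:limit]
```

With unpickled Proj objects, each entry of `inverses` could be something like `lambda x, y, p=proj: p(x, y, inverse=True)`, matching the call shown above.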

How complex would it be to add a __reduce__ method to make Transformer objects picklable?

@snowman2
Member

> How complex would it be to add a __reduce__ method to make Transformer objects picklable?

Probably possible now with the changes in 3.0. Could probably use the PROJ pipeline string to do that.
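For illustration, the general `__reduce__` pattern being discussed looks roughly like this. `DefinitionBacked` is a hypothetical wrapper, not pyproj code: it pickles only a definition string (such as a PROJ pipeline string) and rebuilds the real object through a caller-supplied factory. Note that this only avoids serializing the underlying object; rebuilding still pays the full construction cost on load.

```python
from typing import Any, Callable, Optional, Tuple


class DefinitionBacked:
    """Hypothetical wrapper that pickles a definition string, not the object."""

    def __init__(self, definition: str) -> None:
        self.definition = definition
        self._built: Optional[Any] = None  # cached object, never pickled

    def __reduce__(self) -> Tuple[type, Tuple[str]]:
        # pickle calls DefinitionBacked(definition) when loading, so the
        # C-level object itself never has to be serialized.
        return (type(self), (self.definition,))

    def build(self, factory: Callable[[str], Any]) -> Any:
        # Construct the expensive object on first use only.
        if self._built is None:
            self._built = factory(self.definition)
        return self._built
```

A round trip through `pickle.dumps` and `pickle.loads` then yields a fresh wrapper with the same `definition` and nothing built yet.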

> My plan is to instantiate all of the Proj objects and pickle them.

How fast can they be un-pickled?

@snowman2
Member

> it took 24 seconds to run the gist, which is still about 10x slower than 1.9.6 but 7x faster than 2.6.1.

I am wondering if it might be worthwhile to take another look at the implementation in 2.3 and see if there is a better way. It had two issues to overcome: (1) building on Windows and (2) threading.

@snowman2
Member

snowman2 commented Jun 19, 2020

Looks like the pickling of the transformer may take more thinking:

>>> from pyproj.transformer import Transformer
>>> tr = Transformer.from_crs("epsg:4326", "+proj=aea +lat_0=50 +lon_0=-154 +lat_1=55 +lat_2=65 +x_0=0 +y_0=0 +datum=NAD27 +no_defs +type=crs +units=m", always_xy=True)
>>> tr.definition
'unavailable until proj_trans is called'

@dmahr1
Contributor Author

dmahr1 commented Jun 19, 2020

@snowman2 In 2.6.1, instantiating and pickling about 7.5k CRSs took about 800 seconds. Unpickling them took 120 seconds. The "reverse lookup" search took 0.05 seconds 😂

I'll admit that this is a pretty esoteric use case of PROJ/PyProj, so I don't expect the library to be optimized around it. There are so many amazing improvements that you and other contributors have made in the last couple years with the new datum support and everything. It's just a bummer that there's been a bit of a performance cost. Perhaps I will just leave PyProj 1.x installed in a separate Python environment and shell out to it on-demand. Hacky and gross...but if it works?

@snowman2
Member

Yeah, sounds like pickling didn't help at all. Oh well. Keeping pyproj 1 in another environment doesn't sound like too bad an idea at the moment for your application.

If you turn it into a CLI/GUI application, it could load all of the transformers into dictionaries keyed by the input projection name and then wait for user input. The first lookup would be slow because everything needs to load, but later ones would be faster since the program stays loaded.

@snowman2
Member

snowman2 commented Jul 3, 2020

In #675 it seems like I have achieved speedups. This example is how I got the best speedup:

import datetime

import pyproj

# EPSG codes of the projected CRSs to instantiate.
test_codes = pyproj.get_codes("EPSG", pyproj.enums.PJType.PROJECTED_CRS, False)
start = datetime.datetime.now()
projs = []
for code in test_codes:
    try:
        projs.append(pyproj.Proj(f'EPSG:{code}'))
    except pyproj.exceptions.ProjError:
        # Skip codes that cannot be turned into a Proj object.
        pass
print(f'Instantiating {len(projs)} projs took {(datetime.datetime.now() - start).total_seconds()} seconds')

In this example, I am assuming that since you were using the Proj objects, only the Projected CRS objects would provide much value. But I'm not 100% sure.

Using pyproj 3.0.dev0 with PYPROJ_GLOBAL_CONTEXT=ON, the output was:

Instantiating 4664 projs took 1.592608 seconds

Using pyproj 2.6.1post1, the output was:

Instantiating 4659 projs took 116.082649 seconds

Using the second gist you linked above, it still took ~100 seconds to initialize everything using the global context. I am guessing it is due to some of the EPSG codes causing errors that slowed things down.
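One way to check the guess about failing codes would be to time successful and failed instantiations separately. The helper below is a hypothetical sketch using only the standard library; the constructor and the exception type to count as a failure are passed in by the caller.

```python
import time
from typing import Callable, Dict, Iterable, Type


def time_instantiation(
    factory: Callable[[str], object],
    definitions: Iterable[str],
    error_type: Type[BaseException] = Exception,
) -> Dict[str, float]:
    """Accumulate counts and wall-clock seconds for successes and failures."""
    stats = {"ok": 0, "failed": 0, "ok_seconds": 0.0, "failed_seconds": 0.0}
    for definition in definitions:
        start = time.perf_counter()
        try:
            factory(definition)
        except error_type:
            # Time spent on a definition that raised.
            stats["failed"] += 1
            stats["failed_seconds"] += time.perf_counter() - start
        else:
            # Time spent on a definition that was built successfully.
            stats["ok"] += 1
            stats["ok_seconds"] += time.perf_counter() - start
    return stats
```

For the benchmark above, the call might be `time_instantiation(pyproj.Proj, (f"EPSG:{code}" for code in test_codes), error_type=pyproj.exceptions.ProjError)`; a large `failed_seconds` relative to `ok_seconds` would support the guess.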

@dmahr1
Contributor Author

dmahr1 commented Jul 3, 2020

😲 @snowman2 That is amazing!! And yes, your assumption is correct, I am mostly focusing on projected CRSs rather than different GCSs.

I am curious what you did to achieve this incredible speedup. I've never written any Cython, so I am guessing a bit here. But it looks like the global context is leaving open a persistent connection to the database via the PROJ C API? In other words, it was the setup/teardown of that connection that was causing all of the latency before?

@snowman2
Member

snowman2 commented Jul 3, 2020

> But it looks like the global context is leaving open a persistent connection to the database via the PROJ C API?

That was one of the settings that needed to be tweaked to get this to work. Also, adding the settings to the context beforehand and not updating them each time shaved off time.

@dmahr1
Contributor Author

dmahr1 commented Oct 27, 2020

@snowman2 Thanks again for those speedups. I finally got around to putting this tool into a cloud function and added it in the little form here: https://ihatecoordinatesystems.com/#correct-crs

@snowman2
Member

Nice! Thanks for sharing. This was recently added: https://pyproj4.github.io/pyproj/latest/api/database.html#pyproj-database-query-crs-info. Since you start with a lat/lon, this could help subset the number of results you get:

    from pyproj.aoi import AreaOfInterest
    from pyproj.enums import PJType
    from pyproj.database import query_crs_info

    crs_info_list = query_crs_info(
        auth_name="EPSG",
        pj_types=PJType.PROJECTED_CRS,
        area_of_interest=AreaOfInterest(
            west_lon_degree=-10,
            south_lat_degree=-10,
            east_lon_degree=10,
            north_lat_degree=10,
        ),
    )

I thought it might be something worth trying out.
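Since the tool starts from a known lon/lat, the bounding box could be derived from that point rather than hard-coded. The helper below is a hypothetical sketch: `bbox_around` and its `pad_degrees` parameter are illustrative names, and the box is simply clamped at the ±180/±90 limits instead of wrapping across the antimeridian.

```python
from typing import Dict


def bbox_around(lng: float, lat: float, pad_degrees: float = 1.0) -> Dict[str, float]:
    """Return AreaOfInterest-style keyword arguments for a box around a point."""
    if pad_degrees < 0:
        raise ValueError("pad_degrees must be non-negative")
    return {
        # Clamp to valid longitude/latitude ranges (no antimeridian wrapping).
        "west_lon_degree": max(-180.0, lng - pad_degrees),
        "south_lat_degree": max(-90.0, lat - pad_degrees),
        "east_lon_degree": min(180.0, lng + pad_degrees),
        "north_lat_degree": min(90.0, lat + pad_degrees),
    }
```

The result could then be passed along as `AreaOfInterest(**bbox_around(lng, lat))` in the query above.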

@dmahr1
Contributor Author

dmahr1 commented Oct 27, 2020

@snowman2 I definitely thought about using area of interest to filter projections. And that's a really cool new helper function for querying them! But there's nothing stopping a novice GIS user from (wrongly) using a coordinate system for points outside of the area of interest, right? In that case I think it's better to be thorough and just check everything. Each request to the cloud function only takes about 500 ms :)
