
New OpenStreetMap Carto release, v4.25.0 #264

Closed · jeisenbe opened this issue Feb 1, 2020 · 17 comments
@jeisenbe commented Feb 1, 2020

A new version of OpenStreetMap Carto, v4.25.0, has been released.

I believe there are no major changes required for deployment.

@mmd-osm (Contributor) commented Feb 5, 2020

Just wanted to double-check whether there's something wrong with this release. I've seen a number of reports from people complaining about rendering issues and gray tiles.

https://munin.openstreetmap.org/openstreetmap.org/rhaegal.openstreetmap.org/mod_tile_fresh.html shows a fairly large amount of "old tile, attempted render". The other bits of the rendering infrastructure seem fairly busy as well: https://munin.openstreetmap.org/mod_tile-week.html
Is this normal for a new release?

This release adds a few new places where ST_PointOnSurface is used. Could that be causing performance issues through increased CPU load?
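
For context, the style's polygon label layers run queries roughly along these lines. This is only a minimal sketch: the table and column names follow the usual osm2pgsql conventions, and `!bbox!` is Mapnik's bounding-box substitution token, but it is not the project's actual layer SQL.

```sql
-- Minimal sketch of a polygon-label query, assuming the usual osm2pgsql
-- schema (planet_osm_polygon, geometry column "way"); not the real
-- openstreetmap-carto layer SQL.
SELECT
    ST_PointOnSurface(way) AS way,  -- label anchor guaranteed to lie inside the polygon
    name
FROM planet_osm_polygon
WHERE name IS NOT NULL
  AND way && !bbox!;                -- Mapnik substitutes the tile's bounding box here
```

Since ST_PointOnSurface() is evaluated per matching row on every render, a slow implementation would show up directly as render-server CPU load.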

@tomhughes (Member)

No idea - what do the graphs show?

@tomhughes (Member)

I mean obviously it's normal that tiles need rerendering after a change.

@mmd-osm (Contributor) commented Feb 5, 2020

In absolute numbers those figures are probably not that meaningful. I was trying to compare the current ones to how the systems behaved during the last update, and things like latency and the already-mentioned "old tile, attempted render" count look unusually high.

https://munin.openstreetmap.org/openstreetmap.org/odin.openstreetmap.org/mod_tile_latency.html

@tomhughes (Member)

Well, it's been so long since the last release that it's hard to compare.

That's just disk I/O time you're looking at, so I don't see how that would be impacted?

@mmd-osm (Contributor) commented Feb 5, 2020

odin has been maxed out on CPU since the new Carto release was deployed: https://munin.openstreetmap.org/openstreetmap.org/odin.openstreetmap.org/cpu.html

I don't think this box has been this busy at any point in the last year.

@Firefishy (Member)

odin is running postgresql-10-postgis-2.4; PostGIS 3.0+ has the massive ST_PointOnSurface performance optimisation.
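
For anyone checking what a given render server actually runs, the standard PostGIS reporting functions answer that directly:

```sql
SELECT version();               -- PostgreSQL version string
SELECT PostGIS_Full_Version();  -- PostGIS version plus its GEOS/PROJ dependencies
```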

@tomhughes (Member)

We will be moving to the latest PostgreSQL and PostGIS when we do the reload, which I believe is now expected with the next Carto release?

Should we roll back in the meantime?

@imagico commented Feb 5, 2020

I think a rollback is the safest and quickest option.

Based on the assumption that ST_PointOnSurface() performance issues are the cause (which is plausible - see gravitystorm/openstreetmap-carto#4009), this is not a bug that can be easily and fully fixed with a minor release. It requires either re-thinking the strategic decision to move to ST_PointOnSurface() for polygon labels/icons or a PostGIS update that solves the issue.

What seems a bit weird is that this release caused trouble while the previous ones did not, because the biggest uses of ST_PointOnSurface() were already present in the previous release.

As @pnorman says in #211 (comment), the next step would be for us in OSM-Carto to decide either to make a new major release for you to do a system upgrade with, or to roll back the move to ST_PointOnSurface() in a way that suits the current infrastructure.
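
To illustrate the trade-off behind that strategic decision: ST_Centroid() is cheap but can fall outside a concave polygon, while ST_PointOnSurface() guarantees a point inside the geometry. A self-contained example with a hypothetical L-shaped polygon:

```sql
-- An L-shaped polygon whose centroid lies in the concave notch,
-- i.e. outside the polygon itself.
WITH shape AS (
    SELECT 'POLYGON((0 0, 10 0, 10 2, 2 2, 2 10, 0 10, 0 0))'::geometry AS geom
)
SELECT
    ST_Contains(geom, ST_Centroid(geom))       AS centroid_inside,  -- false
    ST_Contains(geom, ST_PointOnSurface(geom)) AS label_inside      -- true
FROM shape;
```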

@pnorman (Collaborator) commented Feb 6, 2020

I don't see a rollback as necessary, because I see no evidence of a style problem. rhaegal is CPU-constrained and holding the same metatiles/second as before the new style. For odin, during the peak on the 2nd it was at 2.4 CPU/MT/s; pre-peak on the 30th it was at 2.2 CPU/MT/s.
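
For scale, and reading CPU/MT/s as CPU utilisation divided by metatiles rendered per second (an assumption about the units), those two figures imply a per-metatile cost ratio of

$$\frac{2.4}{2.2} \approx 1.09,$$

i.e. roughly 9% more CPU per metatile after the release.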

@mmd-osm (Contributor) commented Feb 6, 2020

I don't think it's a good idea to operate a service where we want fast response times at 95% CPU utilization. Even if throughput is comparable, dropped tiles and badly delayed tiles hurt the user experience.

@tomhughes (Member)

Maxed-out CPUs are totally normal when a new style is deployed.

Indeed, on the slower machines like rhaegal it's normal most days.

@mmd-osm (Contributor) commented Feb 14, 2020

Another quick update, collecting user feedback from different channels:

Now that the new style has been in production for about 13 days, users on the forum and now also on Telegram channels keep complaining about poor performance, gray tiles, and timeouts.

One user remarked that the bigger tile servers seem to have managed to reduce their backlog in the meantime. However, four of the smaller ones still seem to be struggling quite a bit and keep dropping tiles.

So whatever the Munin throughput stats say, user feedback suggests that performance has degraded quite a bit.

@tomhughes (Member)

Well, maybe, but we have no idea how much of that is the caches and how much is the render servers - many of the caches are fairly overloaded and cause those kinds of effects anyway.

@mmd-osm (Contributor) commented Feb 14, 2020

Many of these are longtime OSM users, and I assume they have a good sense of what to expect in terms of response times and gray tiles. Some of them even acknowledged that switching styles has caused issues in the past, but reportedly it has never been this bad in recent times, and the system recovered much more quickly before. The situation on the tile caches probably hasn't changed much in the last three weeks.

@tomhughes (Member)

We are still fighting significant issues with the squid 4 migration - just this morning I found two caches in a degraded state and have been fixing them, and there are likely others.

@mmd-osm (Contributor) commented Feb 14, 2020

Ah, that's good to know. It's hard to reason about this if all you get from users is, more or less, "it doesn't work, it's slow, there are timeouts". I'm wondering whether it might be worthwhile to add some diagnostics code to osm.org to get a breakdown of tile performance per cache/rendering server. We have all the relevant data in some HTTP X- headers, but no way for users to report them.
