This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Reflect availability and health of ODP service #789

Open
forman opened this issue Oct 1, 2018 · 12 comments
Comments

@forman
Member

forman commented Oct 1, 2018

Expected behavior

Cate Desktop should "know" whether the CCI ODP service is available and should clearly display its health status to users.

Actual behavior

Users receive error messages when downloading and accessing data (usually connection time-out errors). To users it appears as if Cate was not working correctly.

Steps to reproduce the problem

Download or access ODP data sources while ODP services are down.

Specifications

Cate 1.0 - 2.0.dev20

@forman
Member Author

forman commented Oct 1, 2018

Added label "external" because resolution requires new ODP web service.

@forman
Member Author

forman commented Oct 16, 2018

Here is some example JSON that could be returned by the health care service:

{
  "services": {
    "CSW": {
      "status": "OK"
    },
    "WCS": {
      "status": "OK"
    },
    "ESGF": {
      "status": "OK"
    },
    "OPENDAP": {
      "status": "SLOW",
      "reason": "..."
    },
    "HTTP": {
      "status": "OK"
    },
    "FTP": {
      "status": "DOWN",
      "reason": "..."
    }
  },
  "announcements": [
    {
      "published": "2018-12-06T10:20:13",
      "status": "DOWNTIME",
      "services":  ["CSW"],
      "period":  ["2019-01-01", "2019-01-03"],
      "title": "Catalogue Service Downtime",
      "description": "The ODP CSW will be down from 2019-01-01 to 2019-01-03 for maintenance reasons."
    },
    {
      "published": "2018-11-23T14:06:31",
      "period":  ["2019-02-10", "2019-02-12"],
      "services":  ["OPENDAP", "CSW", "WCS", "ESGF"],
      "status": "LOWBANDWIDTH",
      "title": "Service Migration",
      "description": "All ODP services will be moved to new infrastructure. From 2019-02-10 to 2019-02-12 you may observe low bandwidth."
    }
  ]
}
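
A minimal sketch of how a client might read the "services" section of such a response. This is only an illustration under assumptions: the function name `degraded_services` is hypothetical, the sample is a shortened copy of the JSON above, and no such endpoint exists yet.

```python
import json

# Shortened copy of the example response above; the "reason" values are
# placeholders, exactly as in the example.
SAMPLE = """
{
  "services": {
    "CSW": {"status": "OK"},
    "OPENDAP": {"status": "SLOW", "reason": "..."},
    "FTP": {"status": "DOWN", "reason": "..."}
  }
}
"""


def degraded_services(payload):
    """Return {service name: (status, reason)} for every service not reported as "OK".

    Hypothetical helper: a missing "status" is treated as "UNKNOWN",
    and a missing "reason" becomes None.
    """
    result = {}
    for name, info in payload.get("services", {}).items():
        status = info.get("status", "UNKNOWN")
        if status != "OK":
            result[name] = (status, info.get("reason"))
    return result


if __name__ == "__main__":
    for name, (status, reason) in degraded_services(json.loads(SAMPLE)).items():
        print(f"{name}: {status} ({reason})")
```

The GUI could use the returned mapping to decide which data access protocols to flag before a user starts a download.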

@cpaulcox

Is the services section meant to be populated as a result of polling the origin servers? If so, then:

  • Why not poll directly? It will be simpler and less reliant on the availability of the "health care service". Making this intermediary service highly available will incur further cost and complexity - "who monitors the monitor".
  • Polling alone will not be able to determine a reason for being "DOWN". If this were updated manually, it would be unlikely to be done in a timely fashion, and reasons for failure are often only known after detailed root cause analysis once the systems are back online. I'd also generally view this as bad security practice, as it could leak information that would aid potential attackers.
  • Polling is unlikely to identify probable connection timeouts unless the target is completely offline. Network paths differ, latency jitter is inherent to the Internet, and polling typically issues simple HEAD requests for a static page.
  • Slowness will also be hard to determine reliably, again due to latency issues, which are often route-specific and highly transient.
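
For reference, direct polling as described in the points above could look roughly like the following sketch. The function name `probe`, the thresholds, and the injectable `opener` parameter are assumptions made for illustration only. The result reflects just the path from the polling client to the server, which is the limitation raised above.

```python
import time
import urllib.request


def probe(url, timeout=5.0, slow_after=2.0, opener=urllib.request.urlopen):
    """Send one HEAD request and classify the result as "OK", "SLOW" or "DOWN".

    Hypothetical helper. Any OSError (which includes urllib's URLError and
    HTTPError as well as socket timeouts) is reported as "DOWN"; no reason
    can be derived from the probe itself.
    """
    request = urllib.request.Request(url, method="HEAD")
    start = time.monotonic()
    try:
        with opener(request, timeout=timeout):
            pass  # only reachability and timing matter, not the response body
    except OSError:
        return "DOWN"
    elapsed = time.monotonic() - start
    return "SLOW" if elapsed > slow_after else "OK"
```

The `opener` argument exists only so the function can be exercised without network access. A single measurement like this says little about what another client on a different route would observe.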

@forman
Member Author

forman commented Oct 18, 2018

Our aim is to have some RESTful meta-service API that we can use from the CCI Toolbox. Again, we don't care how this will be implemented on the server side. Timeouts on the clients may have various causes; we want to know what the status is on the server side.

@forman
Member Author

forman commented Oct 18, 2018

For example, we just received a mail from Alison saying:

Just to let you know that there was an issue with the ESGF update that we deployed yesterday, and to fix it, the OPeNDAP (and other ESGF access e.g. HTTP, WMS) will need to be taken offline this afternoon. I’ll let you know as soon as it’s back up and running, but it may be down all afternoon unfortunately. The portal front end and anonymous ftp download should be unaffected.

This is the stuff that we would like to pass over to our users in advance.

@cpaulcox

I still don't understand how you expect the services section to be updated? If you want to know the status on the server side it suggests a manual update, which as I've mentioned before won't be workable for unscheduled outages, or some integration on-site with the opendap servers.

@forman
Member Author

forman commented Oct 18, 2018

I still don't understand how you expect the services section to be updated?

I don't know. I expect some experts will find a solution.

E.g. using https://www.nagios.org/

@JanisGailis
Member

JanisGailis commented Oct 19, 2018

Just to chime in on this discussion a bit. Here are a few examples of how widely used and well-known services convey status information to their users:

http://status.gandi.net/timeline
https://status.twitterstat.us/#
https://status.status.io/

How exactly the status of a particular system of a particular service is determined and updated is of course specific to each system. From the users' perspective, however, a trusted, machine-readable channel is provided.

@forman
Member Author

forman commented Oct 19, 2018

Thanks @JanisGailis !

@cpaulcox

Interesting. The gandi.net one illustrates a couple of points that I'm trying to make above.

  • The incidents are in the past. I would have expected live incident reporting would be more useful.
  • Their reporting is leaking information about their services and tech stack. This feed should be secured IMHO.

@forman
Member Author

forman commented Nov 27, 2018

I'm going to address this now by separating network errors from others, so the GUI can show a different error dialog.
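
Roughly, the separation could be sketched as follows. This is not the actual Cate code: the function names and the choice of exception types are assumptions for illustration, using only Python standard library exceptions.

```python
import socket
import urllib.error


def is_network_error(exc):
    """Return True if exc looks like a connectivity or service-side problem.

    Hypothetical helper. An HTTPError means the server did answer, so only
    5xx responses are counted as service-side problems here.
    """
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code >= 500
    return isinstance(
        exc,
        (urllib.error.URLError, ConnectionError, TimeoutError,
         socket.timeout, socket.gaierror),
    )


def user_message(exc):
    """Pick the text for the error dialog depending on the kind of error."""
    if is_network_error(exc):
        return ("Network problem: the data service could not be reached. "
                "This is likely not a fault in the application itself.")
    return f"Error: {exc}"
```

With a split like this, the GUI can show a dedicated dialog for the network case and keep the generic error dialog for everything else.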

@forman
Member Author

forman commented Nov 29, 2018

Now showing the following error dialogs:

image
