High Availability - Research user expectations #1468

Closed
MarkAckert opened this issue Jun 25, 2020 · 7 comments

@MarkAckert
Member

MarkAckert commented Jun 25, 2020

As part of understanding how Zowe should achieve high availability, we should first understand what customers expect when they talk about it.

We should:

  • Gather more detailed user feedback and requirements for high availability
  • Engage with SMEs within our companies for expertise and guidance
  • Ask for Sysplex training from SME :-)

Checkpoints:

  • Can we schedule the SME training and guidance in Zowe Architecture calls?

Output:

  • A clearly defined scope for what high availability means for Zowe. This scope includes the qualities of High Availability to target (e.g., reliability, scalability, disaster recovery) as well as some deployment attributes (sysplex, multi-sysplex, container+sysplex, etc.).
  • This scope should be communicated to the community so that we have a shared understanding when working on HA.
  • Present results to the team on a Zowe Architecture call
@MarkAckert
Member Author

The education session for sysplex will be after sprint 1, likely in sprint 2.

@1000TurquoisePogs
Member

1000TurquoisePogs commented Jun 26, 2020

@John-A-Davies noted that people may be OK with a much easier level of uptime protection... not quite high availability but pretty good: have a monitor program that records the process IDs of each component of Zowe, and if a process stops in a way that is considered abnormal, the monitor restarts it by going through the standard configure.sh and start.sh component scripts.
This would come at effectively no cost in system resources, and would keep uptime better than without it.

I guess the question is: should we do this first, or should we do this instead?
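
A minimal sketch of the monitor idea above, in Python, under several assumptions: the `Component` class, the `check_once` and `monitor` functions, and the start command are all hypothetical names for illustration, not part of Zowe. It also assumes that "abnormal" means any non-zero exit code (including termination by a signal), and it keeps a process handle per component rather than a bare process ID.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Component:
    name: str                  # hypothetical component label
    start_cmd: List[str]       # hypothetical command, e.g. the component's start script
    proc: Optional[Any] = None  # handle returned by spawn (a subprocess.Popen by default)
    restarts: int = 0


def is_abnormal(returncode: Optional[int]) -> bool:
    # Assumption: exit code 0 is a deliberate stop; anything else is abnormal.
    # None means the process is still running.
    return returncode is not None and returncode != 0


def check_once(components: List[Component],
               spawn: Callable[[List[str]], Any] = subprocess.Popen) -> List[str]:
    """Start components that were never started and restart those that
    ended abnormally. Returns the names of restarted components."""
    restarted = []
    for comp in components:
        if comp.proc is None:
            comp.proc = spawn(comp.start_cmd)  # first start, not a restart
            continue
        if is_abnormal(comp.proc.poll()):
            comp.proc = spawn(comp.start_cmd)
            comp.restarts += 1
            restarted.append(comp.name)
    return restarted


def monitor(components: List[Component], interval: float = 5.0,
            rounds: Optional[int] = None) -> None:
    """Poll the components every `interval` seconds; loop forever if rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        for name in check_once(components):
            print(f"restarted {name}")
        time.sleep(interval)
        done += 1
```

Calling `monitor([Component("gateway", ["./start.sh"])])` would poll a single component every five seconds, where the name and script path are placeholders. Polling an exit code is cheap, which fits the point about low resource cost, but this sketch does not detect a hung process that is still alive.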

@jackjia-ibm
Member

@1000TurquoisePogs thanks for the information. Yes, this is in the scope of the research and is mainly covered by #1472. Looking forward to discussing it in the Architecture call to find out the best way to achieve this. Is Eureka in a better place to solve this issue, or the ZLaunch that Irek is working on, or some other solution like the nanny process mentioned here?

@jackjia-ibm
Member

The slide to kick off the discussion on Jul 7, 2020: 2020PI3 - high availability.pdf

Many comments were added to #1467. Thanks to all who shared their ideas.

Recordings can be found at https://github.com/zowe/community/blob/master/Project%20Management/Architecture%20Call/Archtecture_Call.md

@jackjia-ibm
Member

jackjia-ibm commented Jul 14, 2020

The first drop of Zowe-HA-Draft.docx

  • Check below for newer versions.

@jackjia-ibm
Member

jackjia-ibm commented Jul 23, 2020

The second version of Zowe-HA-Draft.docx.

Compared to the first version, we:

  • have a clear definition of the Caching API
  • added a Limitation section to explain how Zowe HA is limited, either by running it on a single LPAR or by component functionality.

@jackjia-ibm
Member

We have a pretty good understanding of what HA means for Zowe and have started to work on the draft. Publishing of the draft will continue in #1477.
