Automation test format #349

Open
zcorpan opened this issue Nov 26, 2020 · 29 comments
Labels
Agenda+Community Group To discuss in the next workstream summary meeting (usually the last teleconference of the month)

Comments

@zcorpan
Member

zcorpan commented Nov 26, 2020

Following discussion in https://docs.google.com/document/d/1jDf_gEQjRppLyEDEW0qPVbq1QaCT36bVDl9JkI-nvCU/edit# (also see #337)

cc @sinabahram @jscholes

We agreed that, to enable automation:

  • Tests need to provide full and exact instructions for which keys to press
  • Screen readers give some output in response to a keypress, and we generally want to assert multiple things. We can do that by having specialized assertions that look for how many times a specific substring appears (for example, for assert_accname: Lettuce, pass if the substring "Lettuce" appears exactly once). And also, we want to assert that the full string matches an expected string.
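
A minimal sketch of the two kinds of assertion described above, assuming the screen reader output for a keypress is available as a plain string. The function names mirror the commands in this proposal, but the implementation is only illustrative:

```python
def assert_accname(output: str, name: str) -> bool:
    """Pass only if the accessible name appears exactly once in the output."""
    return output.count(name) == 1


def assert_equals(output: str, expected: str) -> bool:
    """Pass only if the full output matches the expected string exactly."""
    return output == expected


spoken = "Sandwich Condiments grouping list with 4 items Lettuce check box not checked"
print(assert_accname(spoken, "Lettuce"))  # name appears once
print(assert_accname(spoken, "Tomato"))   # name does not appear
```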

For example, for the test "Navigate to an unchecked checkbox in reading mode" for NVDA, I envision a text file that acts as the test with the full sequence of where to navigate the browser to, which keys to press, and the assertions. It could look something like this:

nav: reference/two-state-checkbox.html
press: h
assert_contains: Checkbox Example (Two State)
press: x
assert_role: checkbox
assert_checked: false
assert_accname: Lettuce
assert_equals: Sandwich Condiments grouping list with 4 items Lettuce check box not checked
press: x
assert_accname: Tomato
press: Shift+x
assert_role: checkbox
assert_checked: false
assert_accname: Lettuce
assert_equals: Lettuce check box not checked
press: Shift+h
assert_contains: Sandwich Condiments
press: f
assert_role: checkbox
assert_checked: false
assert_accname: Lettuce
assert_equals: Sandwich Condiments grouping list with 4 items Lettuce check box not checked
press: f
assert_accname: Tomato
press: Shift+f
assert_role: checkbox
assert_checked: false
assert_accname: Lettuce
assert_equals: Lettuce check box not checked
press: Shift+h
assert_contains: Sandwich Condiments
press: Tab
assert_role: checkbox
assert_checked: false
assert_accname: Lettuce
assert_equals: Sandwich Condiments grouping list with 4 items Lettuce check box not checked
press: Tab
assert_accname: Tomato
press: Shift+Tab
assert_role: checkbox
assert_checked: false
assert_accname: Lettuce
assert_equals: Lettuce check box not checked

Note that this isn't the format I think we should write tests in. Tests can still be written in csv to avoid repetition, and the above could be generated from the csv source.
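
To make the `command: argument` shape of the file above concrete, here is a rough sketch of a parser for it. `Step` and `parse_test` are hypothetical names, and splitting on the first colon is an assumption about how arguments containing colons should be handled:

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    command: str   # e.g. "nav", "press", "assert_role"
    argument: str  # everything after the first colon


def parse_test(text: str) -> List[Step]:
    """Parse one 'command: argument' pair per non-empty line."""
    steps = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so arguments may contain colons.
        command, sep, argument = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line: {raw!r}")
        steps.append(Step(command.strip(), argument.strip()))
    return steps


for step in parse_test("nav: reference/two-state-checkbox.html\npress: h"):
    print(step.command, "->", step.argument)
```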

In addition, the NVDA project could choose to have "test expectations" files alongside these, that give the actual output after each interaction and/or each assertion.

nav: reference/two-state-checkbox.html
press: h # Sandwich Condiments heading level 3
assert_contains: Sandwich Condiments # [ PASS ]
press: x # Sandwich Condiments grouping list with 4 items Lettuce check box not checked
assert_role: checkbox # [ PASS ]
assert_checked: false # [ PASS ]
assert_accname: Lettuce # [ FAIL ] issue 12345
assert_equals: Sandwich Condiments grouping list with 4 items check box not checked # [ PASS FAIL ] issue 12346
...

Encoding test expectations for each downstream project using the tests has precedent in web-platform-tests (example). It allows saying that an individual assertion is expected to fail (represented as [ FAIL ] above), and maybe reference a known issue. It also allows saying that a particular result is known to be flaky, i.e. sometimes pass, sometimes fail (represented as [ PASS FAIL ] above).
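
As a sketch of how a runner might read the `[ PASS ]`, `[ FAIL ]` and `[ PASS FAIL ]` annotations shown above: the regular expression and the names `Expectation`, `parse_expectation` and `is_expected` are assumptions for illustration, and the sketch assumes the separator ` # ` never occurs inside an argument.

```python
import re
from typing import List, NamedTuple, Optional


class Expectation(NamedTuple):
    statuses: List[str]   # ["PASS"], ["FAIL"], or ["PASS", "FAIL"] for flaky
    issue: Optional[str]  # referenced issue number, if any


_ANNOTATION = re.compile(
    r"\[\s*((?:PASS|FAIL)(?:\s+(?:PASS|FAIL))*)\s*\](?:\s+issue\s+(\d+))?"
)


def parse_expectation(line: str) -> Optional[Expectation]:
    """Return the expectation on an expectations-file line, if it has one."""
    _, sep, comment = line.partition(" # ")
    if not sep:
        return None
    match = _ANNOTATION.search(comment)
    if match is None:
        return None  # the comment is recorded output, not a status
    return Expectation(match.group(1).split(), match.group(2))


def is_expected(actual: str, expectation: Expectation) -> bool:
    """A result is expected if it is one of the listed statuses."""
    return actual in expectation.statuses
```

With this shape, a flaky assertion is simply one whose list of statuses contains both values, so either outcome counts as expected.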

This way of representing tests would make it possible to programmatically update the expected output to match the actual output when fixing a bug in the screen reader implementation that might affect many tests. So the workflow for an NVDA developer would be:

  1. Fix bug for issue 12345. (This is not an actual bug, but something I made up to demonstrate the idea.)
  2. Run tests. See that the checkbox test now fails because the "Lettuce" accessible name is now spoken, which it was not previously.
  3. Run a separate command to update the test expectations for this test (useful if there are many affected tests), OR manually edit the expectations file(s).
  4. Run tests again, now they pass.
  5. Commit changes and submit a PR to nvda.
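
Step 3 could be sketched roughly as below. `update_line` is a hypothetical helper, and dropping the issue reference once an assertion passes is an assumption, not something we have agreed on:

```python
import re

_STATUS = re.compile(
    r"\[\s*(?:PASS|FAIL)(?:\s+(?:PASS|FAIL))*\s*\](\s+issue\s+\d+)?"
)


def update_line(line: str, actual: str) -> str:
    """Rewrite the status annotation on one expectations line to `actual`."""
    if actual not in ("PASS", "FAIL"):
        raise ValueError(f"unknown status: {actual!r}")

    def replace(match: "re.Match[str]") -> str:
        issue = match.group(1) or ""
        # Assumption: keep the issue reference only while still failing.
        return f"[ {actual} ]" + (issue if actual == "FAIL" else "")

    # Lines without a status annotation are returned unchanged.
    return _STATUS.sub(replace, line, count=1)


print(update_line("assert_accname: Lettuce # [ FAIL ] issue 12345", "PASS"))
```
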
@zcorpan
Member Author

zcorpan commented Nov 30, 2020

With the substring approach (e.g. assert_role passes if the substring "checkbox" appears exactly once), accessible names would need to be unique and not use strings that the screen reader uses for roles, states, properties, etc. For our tests, this is possible, though doesn't seem ideal. For a solution to be usable for web developers testing their apps, this seems like a non-starter (they would need to be able to use any string in their accessible names).
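
A small illustration of the collision, using a made-up accessible name that happens to contain the string NVDA speaks for the role:

```python
def substring_assert(output: str, needle: str) -> bool:
    """Substring approach: pass if the needle appears exactly once."""
    return output.count(needle) == 1


# Output for a checkbox named "Lettuce": the role string appears once.
ours = "Lettuce check box not checked"

# Hypothetical output for a checkbox whose name contains "check box":
# the role string now appears twice, so the same assertion fails even
# though the role is conveyed correctly.
theirs = "Show check box options check box not checked"

print(substring_assert(ours, "check box"))
print(substring_assert(theirs, "check box"))
```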

An alternative is to rely on the internal state of the screen reader, i.e. directly ask for role etc. In combination with assert_equals checking the entire output, we'd still have coverage of the actual output including the things we're interested in.

Ack @mfairchild365

@zcorpan
Member Author

zcorpan commented Dec 2, 2020

@zcorpan
Member Author

zcorpan commented Dec 3, 2020

Questions:

  • Does the list of commands above match how we think a tester ought to run the test (when running manually)? I've made the assumption that a tester would want to "reset" between different key commands to perform the task, and used "h" and "Shift+h" for that to navigate to a heading.
  • Substring approach vs. checking internal state when asserting role, state, accessible name, etc. Right now I use the substring approach, but I think it's not ideal at least for a wider set of use cases (e.g. web developers writing tests for their website or web app).
  • For NVDA developers: any feedback on the suggested approach? cc @feerrenrut @michaelDCurran
  • For the Community Group: how do we generate these files from the test source material?

@zcorpan zcorpan added the Agenda+Community Group To discuss in the next workstream summary meeting (usually the last teleconference of the month) label Dec 3, 2020
@jscholes
Contributor

jscholes commented Dec 3, 2020

@zcorpan

Does the list of commands above match how we think a tester ought to run the test (when running manually)? I've made the assumption that a tester would want to "reset" between different key commands to perform the task, and used "h" and "Shift+h" for that to navigate to a heading.

This currently isn't defined anywhere. I'm strongly of the opinion that a page should be entirely reloaded between test commands, because at present we are just hoping that testers will infer the setup to actuate command 2, command 3, etc. However, testers can't just do a refresh, because this doesn't seem to cause setup scripts for a test to re-run. Therefore, they have to close the window and hit "Open Test Page" again, which admittedly is annoying.

Having said that, regardless of how human testers carry out tests with multiple commands, the annoyance factor need not apply to automated ones. If a computer has to reload the page, not a problem. I think that should be explicitly encoded, rather than trying to exactly mirror what a person would do.

There are also other nuances here. For example, in some cases resetting isn't as straightforward as locating a heading or moving to the top/bottom of the page. Consider the tri-state checkbox example, where the modified test page explicitly includes a link before and after the checkbox because, unlike the two-state ones, there is only one tri-state control present. For a select-only combobox, asking users to test both Up and Down Arrow for navigating within the options relies on the combobox being expanded and focus being placed on a specific item.

@zcorpan
Member Author

zcorpan commented Dec 3, 2020

From today's telecon, I heard a few wants:

  • A command to repeatedly press a key until a certain output is found, or until the end of the page. (Not sure whether we want to ignore the number of keypresses it took, or whether we want to assert the number of presses.)
  • Ability to collect output across multiple presses. This is to simplify test writing and make it easier to reuse assertions across screen readers.
  • Want to discard some information, like "focus mode". (Not sure how this affects the test format?)

@jscholes did I miss something or misrepresent something?

@zcorpan
Member Author

zcorpan commented Dec 3, 2020

@jscholes

This currently isn't defined anywhere. I'm strongly of the opinion that a page should be entirely reloaded between test commands, because at present we are just hoping that testers will infer the setup to actuate command 2, command 3, etc. However, testers can't just do a refresh, because this doesn't seem to cause setup scripts for a test to re-run. Therefore, they have to close the window and hit "Open Test Page" again, which admittedly is annoying.

Maybe they should be separate tests?

Having said that, regardless of how human testers carry out tests with multiple commands, the annoyance factor need not apply to automated ones. If a computer has to reload the page, not a problem. I think that should be explicitly encoded, rather than trying to exactly mirror what a person would do.

There are also other nuances here. For example, in some cases resetting isn't as straightforward as locating a heading or moving to the top/bottom of the page. Consider the tri-state checkbox example, where the modified test page explicitly includes a link before and after the checkbox because, unlike the two-state ones, there is only one tri-state control present. For a select-only combobox, asking users to test both Up and Down Arrow for navigating within the options relies on the combobox being expanded and focus being placed on a specific item.

Again, maybe separate tests whenever there's a need to "reset"

@jscholes
Contributor

jscholes commented Dec 3, 2020

On today's call (3 Dec), we talked more about test automation. Specifically:

  • its impact on/crossover with test writing;
  • how we want to handle certain edge cases or differences between ATs; and
  • what data we want to take away from a test run.

Background

Different screen readers present information in their own unique way(s), creating disparities between the way a tester (human or automated) should interact with a component to cause the information to be conveyed. Key examples mentioned on the call were:

  • If a combobox is preceded by a label element in the DOM, and is then aria-labelledby that label, JAWS will present the label twice in the virtual buffer and require users to press Down Arrow three times to actually reach the combobox. In the same situation, NVDA only requires two presses of the same key.
  • When navigating into a table, again using Down Arrow in reading mode, NVDA will read the first column header inline with the information about the table: role, name, number of columns/rows, etc. This means that a single press of Down Arrow will suffice for reviewing the table info and text of the first cell. JAWS, on the other hand, places the table information in the virtual buffer as a distinct line of text, meaning that users must press Down Arrow twice.

This is a challenge for the ARIA-AT project. When humans are carrying out a test, we can instruct them that they may need to press a particular key more than once, and hope that they infer enough from that to complete the test successfully. There is potential here to discuss how these instructions can be made as clear as possible. But when we automate a test, a computer needs one or more defined stopping points. It's not enough to give it a statement of intent.

Proposed Solutions

The simplest way of addressing this would be to encode the number of required keypresses in a test, customised on a per-AT basis if necessary. We want to avoid going down that road for several reasons:

  • It makes tests overly brittle. If an update to a screen reader decreases the number of required keypresses, we will be unnecessarily testing unrelated speech output. If the number of required keypresses is increased, we risk information loss and false failing assertions. Either way, the test will need to be updated and we gain no valuable insights.
  • Likewise, the number of keypresses will need to be hard-coded for every variant of a pattern (e.g. a checkbox using aria-label versus an actual <label> element).
  • By hard-coding this data, the ARIA-AT project may be seen as implying certain opinions about the efficiency (or lack thereof) in a given screen reader. Our goal is not to dictate desired behaviour.

Similarly, we also don't want to enforce a one-to-one mapping between keypresses and assertions, for example by asserting that the name of a combobox is conveyed on the first press of Down Arrow, and that the role, state and value are conveyed on the second. This shares all of the problems outlined above, while creating more work for test writers and providing a confusing experience for human testers.

Working Solution

At the end of the call, we all seemed to agree that the following approach warranted further progress:

  1. Continue to write assertions on a per-command (or per-gesture) basis. For example, navigate forwards to a combobox using Down Arrow.
  2. Create test boundaries around a component under test. For example, focusable links before and after a combobox.
  3. Instruct an automated test runner to press the given key as many times as required, either until all assertions have been met or the hard-stop boundary is reached. (note: we may only follow one of these, as the former implies that assertions are actively being tested as speech output progresses).
  4. On each keypress, collect the resulting speech output rather than discarding it and only using the output from the last one, and increment a counter so that the results data can tell us how many keystrokes were eventually required.
  5. Depending on the approach taken from point #3, test the collected output to determine passing/failing assertions and how many keystrokes were ultimately required to meet those assertions (and optionally to move between the two boundaries).
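
The steps above can be sketched as a single loop. This takes the first option from point 3 (checking assertions as output accumulates); `press`, `is_boundary` and the `max_presses` safety cap are hypothetical hooks standing in for whatever the real runner and AT integration provide:

```python
from typing import Callable, List, NamedTuple


class RunResult(NamedTuple):
    passed: bool
    keystrokes: int    # how many presses were actually needed
    output: List[str]  # speech collected after each press


def press_until_met(
    press: Callable[[], str],
    assertions: List[Callable[[str], bool]],
    is_boundary: Callable[[str], bool],
    max_presses: int = 50,
) -> RunResult:
    """Press a key repeatedly, collecting speech, until all assertions
    pass on the collected output or the hard-stop boundary is reached."""
    collected: List[str] = []
    for count in range(1, max_presses + 1):
        speech = press()
        collected.append(speech)
        combined = " ".join(collected)
        if all(check(combined) for check in assertions):
            return RunResult(True, count, collected)
        if is_boundary(speech):
            return RunResult(False, count, collected)
    return RunResult(False, max_presses, collected)


# Made-up speech for an AT that needs three presses to reach the combobox.
speech = iter(["Favorite fruit", "Favorite fruit",
               "Favorite fruit combo box collapsed", "link after"])
result = press_until_met(
    press=lambda: next(speech),
    assertions=[lambda s: "combo box" in s, lambda s: "collapsed" in s],
    is_boundary=lambda s: "link after" in s,
)
print(result.passed, result.keystrokes)
```

The keystroke count falls out of the loop for free, which is why recording it costs nothing extra at this stage.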

There are several advantages to this approach:

  • It will have very little impact on test writing. There may be certain changes required, such as a boolean indication that a particular test will require this approach. Alternatively we could just assume that all tests will use it, which makes it even more future-proof because we can catch cases where a screen reader or browser update introduces the need for multiple keystrokes where they weren't previously required.
  • The data about how many keystrokes were needed will be incredibly useful in determining differences between screen readers and screen reader versions, browser and browser versions, and mark-up variants. E.g. people may wish to choose a particular labelling pattern based on number of keystrokes imposed on the end-user.
  • Tests won't need to be updated because of changes in screen reader representation of components or mark-up. The resulting data will reflect those changes instead of the other way round.
  • The keystroke count reported in the resulting data will be a statement of fact which people can interpret as they want, rather than the test itself including implications about behaviour.

@sinabahram, @mcking65, @zcorpan and others, let me know if you feel I've missed anything out here. Note that there are things we didn't reach conclusions on, e.g. command sequences like T followed by Down Arrow to reach the first cell of a table.

@jscholes
Contributor

jscholes commented Dec 3, 2020

@zcorpan

Maybe they should be separate tests?

Sure, if we're automating them (and have some automation in test generation as well). For a human tester, turning one test with four commands into four tests is not ideal.

@feerrenrut

Substring approach vs. checking internal state when asserting role, state, accessible name, etc. Right now I use the substring approach, but I think it's not ideal at least for a wider set of use cases (e.g. web developers writing tests for their website or web app).

I think there are different considerations here, those for the presentation from AT, and those of Web developers wanting to ensure they wrote the right thing. These should be two different test systems.

I'm worried that having a set of asserts (role, state, name, etc) will prove to be limiting; AT will want to be able to present information in new ways. AT (or at least NVDA) presents information differently based on the relationships between objects. For instance, an object with role X may be reported differently when it is the child of an object with role Y. Using asserts based on internal state seems Accessibility API oriented rather than end user oriented. In my opinion, it is more important to check that the user is actually presented with the information than merely that the AT collected it and has it internally. If web developers want to ensure that their HTML results in specific semantics exposed by the accessibility API, their test should be against the browser rather than the AT. The AT has to interpret that information and present it to the user in a friendly way according to user preferences.

However, there is certainly the risk of fragile tests, but at the end of the day if what is presented to the user changes, we need to be sure the change was correct.

@WestonThayer

@feerrenrut do you know if there's any internal NVDA abstraction just below what's sent to the vocalizer/braille? Something that would be closer to what the end user will experience than role/state/name, which would at least make the tests less brittle if whitespace changes, "check box" becomes "checkbox", or "list with 4 items" is shortened to "list 4 items".

@zcorpan
Member Author

zcorpan commented Dec 8, 2020

@jscholes thank you for that summary!

I'm not convinced of the utility of collecting and reporting on the number of keystrokes. Maybe that warrants a separate issue though if we want to make that part of our scope.

I can experiment with adding these:

  • command to press a key until a certain role (or substring) is found.
  • a way to collect output across multiple key presses.

@zcorpan
Member Author

zcorpan commented Dec 8, 2020

@feerrenrut thanks for sharing your thoughts. It's possible that the needs for web developer testing are different enough that we should have separate systems. We have proposed additions to WebDriver for roles and accessible names already (w3c/webdriver#1441, w3c/webdriver#1439), which would help for web developers wanting to check things based on the browser's accessibility tree.

@zcorpan
Member Author

zcorpan commented Dec 8, 2020

  • command to press a key until a certain role (or substring) is found.
  • a way to collect output across multiple key presses.

Here's a proposal:

  • A new set of commands, press_until_contains and press_until_role, taking two arguments (separated by a comma), to repeatedly press the first argument until the second argument is found, or until an AT-specific document boundary signal is reached (which would fail the test).
  • press and press_until_* accumulate output by default. A new clear_output command, which sets lastSpeech to the empty string, needs to be used where appropriate.
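
A rough sketch of how these commands could behave, not the actual implementation. `Runner`, `send_key`, `boundary_signal`, `BoundaryReached`, `split_args` and the `max_presses` guard are all hypothetical; `last_speech` stands in for lastSpeech:

```python
from typing import Callable, Tuple


class BoundaryReached(Exception):
    """Raised when the document boundary is hit before the text is found."""


def split_args(argument: str) -> Tuple[str, str]:
    """Split 'key, text' into its two comma-separated arguments."""
    key, sep, needle = argument.partition(",")
    if not sep:
        raise ValueError(f"expected two arguments: {argument!r}")
    return key.strip(), needle.strip()


class Runner:
    def __init__(self, send_key: Callable[[str], str],
                 boundary_signal: str, max_presses: int = 100) -> None:
        self.send_key = send_key                # presses a key, returns speech
        self.boundary_signal = boundary_signal  # AT-specific, e.g. end of page
        self.max_presses = max_presses          # safety cap (assumption)
        self.last_speech = ""

    def press(self, key: str) -> str:
        speech = self.send_key(key)
        # Accumulate output by default instead of overwriting it.
        self.last_speech = (self.last_speech + " " + speech).strip()
        return speech

    def press_until_contains(self, key: str, text: str) -> None:
        for _ in range(self.max_presses):
            speech = self.press(key)
            if text in self.last_speech:
                return
            if self.boundary_signal in speech:
                raise BoundaryReached(f"{text!r} not found before boundary")
        raise BoundaryReached(f"{text!r} not found in {self.max_presses} presses")

    def clear_output(self) -> None:
        self.last_speech = ""
```

Because output accumulates, a test that only cares about the speech from one interaction has to call clear_output before it.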

Edit: design doc updated.

@sinabahram

Thanks for this great stuff, @zcorpan .

I hear your concern when you say "I'm not convinced of the utility of collecting and reporting on the number of keystrokes. ".

I don't think we need a separate issue for discussing an integer to track the number of keystrokes used in these commands. Can we please increment a number and store it for these tests, even if we delay the reporting and surfacing discussions until later?

This comes down to equity for us. There are hundreds of millions of dollars spent each year on everything from eye-tracking to detailed mouse pointer analysis, time to reach touch targets, and a variety of other metrics for analyzing how nominally sighted users interact with interfaces. There's virtually none of that in the assistive technology space. This single number for how many keystrokes are required could begin making progress towards closing that huge information gap. It will absolutely shock most people when they learn how many keystrokes are required for blind users to interact with something as simple as a video player or menu system, not to mention complex data tables.

The utility is not small here; it's overwhelming, because there are so few of these metrics that any of them has the ability to move the entire field forward. I hope that context helps.

I'm only pushing back on this because it feels like a tiny ask RE an integer being incremented in your loop when performing commands.

Happy to discuss further.

@jesdaigle

@sinabahram It seems to make sense to me that just as too many clicks is deemed bad UX for sighted users, too many keystrokes should be a similar benchmark for AT users. Am I understanding your motivation correctly?

@sinabahram

@jesdaigle that is the impetus, yes. The takeaways may need to be more nuanced e.g. there's 4-finger keystrokes AT users have to perform sometimes, so even if there is only one of them, it can be thought of as worse than maybe two simpler keystrokes; however, all that nuance can come later. For now, just counting the keystrokes addresses the overall concern.

@zcorpan
Member Author

zcorpan commented Dec 8, 2020

@sinabahram I think it does warrant a separate issue. It's not currently in scope for aria-at. I would like the keystroke count idea to be properly considered for the entire project, rather than "sneaked in" to one part of it. Would you be interested in opening a new issue?

@feerrenrut

@WestonThayer

which would at least make the tests less brittle if whitespace changes, "check box" becomes "checkbox", or "list with 4 items" is shortened to "list 4 items".

Currently there is not. Historically the information has been processed as a text stream. It is a direction I'd like to head in, as it would enable new experiences and make new kinds of customisation possible, but there is a long way to go. It should be noted that this information is likely to be screen reader specific and would be at a different abstraction level from what is provided via A11Y APIs by the browser. If the goal is to test that an HTML sample results in the exposure of a specific role, name, description etc, there are tools to do so against a browser. But each browser would need to be tested independently.
That said, I hear your concern about minor non-semantic changes to the text. These changes are rare, however.

@zcorpan
I'm skeptical about using the same keyboard input / assertions for different AT. There currently happens to be some overlap, but this doesn't have to be the case. I think it would be a mistake to design-in a constraint like that. It isn't a goal for screen readers to present information in exactly the same way, it's much more opinionated. Are you planning that there would be a test file for each screen reader for each sample?

Rather than a custom file format, I think it would be wise to contain this information in a common pre-existing format with parsers already available, e.g. YAML, JSON, etc.

@zcorpan
Member Author

zcorpan commented Dec 9, 2020

@feerrenrut indeed, there are differences, and there isn't any intention to remove differences per se.

The manual tests are currently generated from a source that encodes tasks, and those translate to AT-specific commands. Assertions can be shared or different between ATs. See https://github.com/w3c/aria-at/wiki/How-to-contribute-tests#encode-each-interaction-into-a-test

In the end, a test will have AT-specific instructions and AT-specific assertions. As an optimization for writing tests, we're sharing things when they happen to be the same.

The change to accumulate output across multiple keypresses and have assertions for the output from all of them is intended to allow for differences to exist, and also be able to share more things when writing tests (in the csv format).

I'm happy to switch to JSON. I agree that would be wise. 🙂 (Edit: changed to JSON in the design doc.)

@zcorpan
Member Author

zcorpan commented Dec 10, 2020

I have now implemented these changes in bocoup/nvda#2 :

  • Switched to JSON format.
  • Accumulate output by default, clear_output to reset.
  • Added press_until_contains and press_until_role.
  • Fixed bug in assert_checked (to support the mixed state).

@zcorpan
Member Author

zcorpan commented Dec 10, 2020

We had a one-off meeting about this topic today, minutes here: https://www.w3.org/2020/12/10-aria-at-minutes.html

A couple of points came up

  • @jscholes asked that discussion happen in GitHub rather than in Google Docs to avoid fragmentation and so that people get notified of new comments. (I can copy over comments from the Google Doc to here.)
  • assert_role may not always work, because there isn't a 1:1 mapping from ARIA role to various a11y API roles or what an AT considers a role. For example, role=button and aria-pressed is a "toggle button". There may be more such cases.
  • assert_checked (+ all other not-yet-implemented state/property assertions) should be replaced by assert_state_or_property taking 2 arguments (state or property, value).
  • Maybe the automation "glue" between aria-at and nvda could be an NVDA addon instead of being part of NVDA itself.
  • How does the test runner know that the AT is "done" speaking after a command? What if the AT decides to send multiple "chunks"? In NVDA's System Tests, this appears to be solved, though I'm not sure how. @feerrenrut can you comment on this?
  • At some point, we need to figure out how to get from the csv source files to the automation test files.

I've left out some things for brevity, feel free to add a comment below if I've omitted something relevant. 🙂

@sinabahram

Thank you for these notes, @zcorpan . Sorry about my scheduling conflict in not being able to attend this conversation.

I have one question from this point in the notes:
Maybe the automation "glue" between aria-at and nvda could be an NVDA addon instead of being part of NVDA itself.

I would like to voice support for an addon or for the idea of a TTS driver, which I'm assuming did not come up as I don't see it in the minutes. My reasoning for supporting an addon is because I interpret the above to mean that the alternative is patching NVDA directly, which then of course makes this quite difficult to add onto other setups or configs, not necessarily for production but just for testing amongst ourselves. Please let me know if I have that right, or if something else is meant here.

The text to speech (TTS) driver idea is the one I brought up on our last call where the source of truth upstream is the TTS. If a SAPI TTS listener/handler can be written that in turn passes through to system SAPI for actual speech (that way blind users can still drive with it), then the driver can simply log all text it is being asked to say. This feels really elegant and decouples so much of the system from any specific screen reader. It also partly solves the problem of knowing when the AT is done speaking, as you can have a timeout to make absolutely sure.
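
To illustrate the shape of the idea only: the sketch below is plain Python, not a SAPI engine, and `LoggingSynth`, `real_speak` and `seems_done` are hypothetical names. A real driver would implement the SAPI interfaces and forward to the system voice.

```python
import time
from typing import Callable, List, Optional


class LoggingSynth:
    """Logs every string it is asked to speak, optionally passing it on
    to a real speech function so a person can still listen along."""

    def __init__(self,
                 real_speak: Optional[Callable[[str], None]] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._real_speak = real_speak
        self._clock = clock
        self.log: List[str] = []
        self._last_spoken_at: Optional[float] = None

    def speak(self, text: str) -> None:
        self.log.append(text)
        self._last_spoken_at = self._clock()
        if self._real_speak is not None:
            self._real_speak(text)  # pass-through for audible speech

    def seems_done(self, idle_timeout: float = 0.5) -> bool:
        """Guess that speech has finished once nothing new has arrived
        for idle_timeout seconds. This is a heuristic, not a guarantee."""
        if self._last_spoken_at is None:
            return False
        return self._clock() - self._last_spoken_at >= idle_timeout
```

Passing `real_speak=None` would give the silent mode for purely computer-driven runs.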

If this is done, then the exact same driver, with no changes whatsoever, can be used in JAWS, NVDA, and I think even Narrator since it is SAPI (we should confirm that).

Thanks for letting me contribute those thoughts to this discussion.

@WestonThayer

Sorry I missed this as well, as I think the below point might've been from a comment I put on the GDoc.

How does the test runner know that the AT is "done" speaking after a command? What if the AT decides to send multiple "chunks"? In NVDA's System Tests, this appears to be solved, though I'm not sure how

I believe NVDA's speechSpyGlobalPlugin.py has access to NVDA's internal command processing queue. After emulating a key press, it can precisely block until NVDA has finished sending speech to the synth.

speechSpySynthDriver.py instantly "speaks", so there's no delay after NVDA sends some text to the TTS. Nonetheless, speechSpyGlobalPlugin has a 500ms timeout to wait for the synth to finish speaking. I'm not sure why that's necessary, since speechSpyGlobalPlugin gets the post_speech callback. @sinabahram I agree it would be great if the custom TTS could still vocalize with a real TTS. If that feature could be optionally shut off during purely computer-driven tests, I think it would drastically reduce execution time for the suite.

I think speechSpyGlobalPlugin's access to NVDA's internal queue is essential (in addition to the custom TTS). If it didn't have that, the test runner could hit some pretty confusing bugs if it were only relying on a custom TTS. For example, consider a contrived test:

nav: reference/headings.html
press: h # Sandwich Condiments heading level 3
assert_equals: heading level 3 Sandwich Condiments # [ PASS ]
press: h # Favorites heading level 4
assert_equals: heading level 4 Favorites # [ PASS ]
  1. Test runner presses h
  2. NVDA calls TTS.speak("heading level 3 Sandwich Condiments"), but this is opaque to the test runner
  3. Our custom TTS notifies the test runner of the string received by the speak() method, which is then asserted
  4. TTS has some timeout (say 500ms), then decides it probably won't receive any more speech, and notifies the test runner
  5. Test runner presses h again

Suppose NVDA had a bug where, if given enough time, it would've announced some erroneous extra data, maybe start reading the HTML markup for the heading.

heading level 3 Sandwich Condiments
left caret h right caret class equals quote ...

If NVDA hit a processing delay (maybe Windows Defender decides to do a scan, or maybe computing the next speech string is computationally intensive) that lasted longer than the timeout in step (4), the test runner would proceed to step (5), which would cancel the queued "left caret h right caret..." speech, preventing it from ever reaching our custom TTS. Thus, the test would pass, failing to detect this class of bug.

I think there are several classes of bugs that timing issues like this can cause. I'm not sure this is the worst variety, but it makes me nervous. At a minimum, I think it could cause a very flaky test suite.
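
The scenario can be modelled with a small simulation. This is a toy model with made-up timings and a hypothetical function name, not how NVDA or any TTS actually schedules speech:

```python
from typing import List, Tuple


def heard_before_next_press(chunks: List[Tuple[float, str]],
                            idle_timeout: float) -> List[str]:
    """Return the speech a timeout-only runner would record.

    chunks: (seconds after the keypress at which the AT hands the text
    to the TTS, text), sorted by time. The runner presses the next key
    as soon as idle_timeout passes with no new chunk, which cancels
    anything still queued in the AT.
    """
    heard: List[str] = []
    last_at = None
    for at, text in chunks:
        if last_at is not None and at - last_at > idle_timeout:
            break  # the runner already moved on; this chunk is never spoken
        heard.append(text)
        last_at = at
    return heard


good = "heading level 3 Sandwich Condiments"
bad = "left caret h right caret"

# Bug present, no delay: the extra speech is recorded, so the test fails.
print(heard_before_next_press([(0.05, good), (0.10, bad)], 0.5))

# Bug present, but a processing delay outlasts the timeout: the extra
# speech is cancelled, so assert_equals passes and the bug goes unnoticed.
print(heard_before_next_press([(0.05, good), (0.90, bad)], 0.5))
```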

@feerrenrut

I'll give my thoughts on a couple of aspects being discussed here:

  • Why is there a time out for speech being "done"?
  • Can a synth only approach be used?

First I'd like to point to what it is we (at NV Access) are emulating: manual tests, i.e. verifying the end-to-end user experience, where the output "end" is where NVDA passes data to a synth driver. As a human screen reader user, the amount of time with no speech and prior experience of what the SR announces allow you to determine when the SR is done. As much as possible, I think we should try to emulate the real-world experience to prevent missing bugs that would be experienced by the user. However, I agree there is a trade-off against the time taken to run the tests. Personally, I'd prefer to optimise later, and try to ensure correctness first.

As for why a fixed time limit may be required: a screen reader such as NVDA depends on receiving events from the browser, and there is no guarantee about the order or timing of these. In practice, highly delayed events (e.g. more than a few tens of milliseconds) would likely be considered a bug and reported to browser developers.

We initially started with a purely time-based approach; the 'core asleep' approach came later. We found that the 'core asleep' approach wasn't entirely reliable, and so left the time-based approach in place. We have since fixed several other bugs that caused intermittent failures, which may have been blamed on the new approach. One difficult aspect of developing this is isolating and identifying bugs in the testing system itself. At this stage these mostly seem to show up as intermittent issues. Ideally we would have a mechanism to collect the system test results for every build, so we can use a stats-based approach to determine the likelihood that an intermittent issue is resolved.
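
One simple form such a stats-based check could take is sketched here. It assumes each build fails independently with a fixed historical failure rate, which real builds may not satisfy, and the function names are illustrative only.

```python
import math


def chance_of_clean_streak(failure_rate: float, builds: int) -> float:
    """Probability of seeing `builds` passes in a row if the bug is NOT fixed.

    Assumes each build independently hits the intermittent failure with
    probability `failure_rate` (a simplifying assumption).
    """
    return (1.0 - failure_rate) ** builds


def builds_needed(failure_rate: float, alpha: float) -> int:
    """Smallest clean streak whose chance of happening by luck is <= alpha."""
    if not (0.0 < failure_rate < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("failure_rate and alpha must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - failure_rate))


# A test that used to fail in 5% of builds needs 59 consecutive passes before
# "it was just luck" becomes less than 5% likely.
print(builds_needed(failure_rate=0.05, alpha=0.05))
```

The rarer the intermittent failure, the more builds are needed before a clean streak says much, which is one reason to collect results for every build.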

I think using a synth driver (and later a braille driver) as the interface between the test framework and the SR would be quite elegant, and essentially that is where I started with our system tests. As noted by @WestonThayer, there would need to be a few concessions made, which would impact the running time for the tests:

  • Knowing when NVDA startup is complete
  • A faster way to quit NVDA
  • Monitoring internal state during key press (because NVDA intercepts, prepares internal state, then forwards the keypress to the application)

@sinabahram Being able to listen along during a test would be a good idea. However, building that into the system test driver (e.g. finding available synths) is a lot to create and maintain, and it is not directly related to the goal. At NV Access, something that has been on my mind is a 'tee'-like driver interceptor allowing a primary synth and a set of secondary synths. This would solve a few issues in NVDA, the speech viewer, and NVDA Remote, and allow multiple synths at once. Another alternative is for the tests to output all speech commands so that they can be replayed after the tests. Another thing that would help debugging is being able to stop the tests, either manually (if able to listen along) or on assert failure, in order to investigate.
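
As a rough illustration of the 'tee' idea, the sketch below forwards every call to a primary synth and to any number of secondary synths. The `speak`/`cancel` interface and both class names are hypothetical and much simpler than NVDA's actual synth driver interface.

```python
from typing import List, Sequence


class RecordingSynth:
    """Hypothetical synth that records text instead of producing audio."""

    def __init__(self) -> None:
        self.spoken: List[str] = []
        self.cancel_count = 0

    def speak(self, text: str) -> None:
        self.spoken.append(text)

    def cancel(self) -> None:
        self.cancel_count += 1


class TeeSynth:
    """Forwards each call to a primary synth, then to every secondary synth."""

    def __init__(self, primary, secondaries: Sequence) -> None:
        self.primary = primary
        self.secondaries = list(secondaries)

    def speak(self, text: str) -> None:
        self.primary.speak(text)
        for synth in self.secondaries:
            try:
                synth.speak(text)
            except Exception:
                # A broken secondary (e.g. a log sink) must not silence the user.
                pass

    def cancel(self) -> None:
        self.primary.cancel()
        for synth in self.secondaries:
            try:
                synth.cancel()
            except Exception:
                pass


audible = RecordingSynth()   # stand-in for a real, audible synth
test_log = RecordingSynth()  # what a test runner would read from
tee = TeeSynth(audible, [test_log])
tee.speak("heading level 3 Sandwich Condiments")
print(test_log.spoken)
```

In this arrangement a person can listen to the primary synth while the test runner asserts against the recorded copy.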

@sinabahram

@feerrenrut, major +1 to everything you've said here. And I love the tee-command analog approach. That's a beautiful pattern that has been used not only within various command line shells but also with HTTP streams to test various web servers, etc.

I think that optimizing later for startup times is quite reasonable. I imagine there may be entire portions of NVDA that can be stripped out or jumped over if we know it is running within a test instance. Such optimizations may help any other automated testing on NVDA's side. I'm thinking of how kernels and other distributions are heavily optimized for running within application virtualization containers.

@WestonThayer

As a human screen reader user, the amount of time with no speech, and prior experience of what the SR announces allow you to determine when the SR is done.

Excellent point, thank you for changing my perspective. Much can be done to stabilize the host OS environment as well (disabling updates, virus scans, services, etc).

zcorpan added a commit to bocoup/aria-at that referenced this issue Dec 18, 2020
@zcorpan
Member Author

zcorpan commented Dec 18, 2020

I've made a few changes:

  • Removed assert_accname since it was equivalent to assert_contains
  • assert_contains takes an optional second argument, count, which specifies how many times the string is expected to be found. If not given, the string must be found at least once.
  • nav URL is relative to the repo root.
  • The automated tests are in the same folder as the manual tests (instead of a subfolder).
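
The assert_contains semantics described in this list could be checked roughly as follows. This is a sketch, not the project's implementation: it assumes case-sensitive matching and non-overlapping occurrences, neither of which is specified above.

```python
from typing import Optional


def assert_contains(output: str, expected: str, count: Optional[int] = None) -> bool:
    """Return True if `expected` occurs in `output` the required number of times.

    With no count, the substring must occur at least once. With a count, it
    must occur exactly that many times. Matching is case-sensitive and counts
    non-overlapping occurrences (both are assumptions of this sketch).
    """
    found = output.count(expected)
    if count is None:
        return found >= 1
    return found == count


speech = "Sandwich Condiments grouping list with 4 items Lettuce check box not checked"
print(assert_contains(speech, "Lettuce"))         # at least once
print(assert_contains(speech, "Lettuce", 1))      # exactly once
print(assert_contains(speech, "check box", 2))    # fails: appears only once
```

Passing a count of 1 gives the "appears exactly once" behavior that assert_accname previously provided.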

@zcorpan
Member Author

zcorpan commented Dec 18, 2020

Moving remaining open comments from the google doc to here:

Test format

Context: assert_role

@mfairchild365 commented:

This one is more difficult because context matters. The role might be implied by context, or might be mutated by context.

For example, the option role is not explicitly conveyed by most screen readers because it is implied by the listbox context.

Additionally, aria-pressed + the button role will mutate the role from button to a "toggle button".

Perhaps we need something like assert_does_not_contain for when a role is not expected to be explicitly conveyed, and use assert_contains for role mutations? We could also pass context arguments to assert_role, but that seems more complex. Thoughts?

@zcorpan replied:

Part of the idea is to allow each AT to maintain their own mappings from ARIA role/state/property to expected output. But maybe it's overly complicated. Having only contains / not contains / equals is simpler.
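
To make the per-AT mapping idea concrete, here is a minimal sketch in which specialized assertions expand into plain assert_contains checks. The table layout and the `expand` function are hypothetical. The phrases "check box" and "not checked" come from the NVDA example earlier in this issue, while the `option` entry is an assumption that illustrates a role implied by context.

```python
from typing import Dict, List, Optional, Tuple

# Per-AT table: (assertion kind, value) -> phrase expected in the speech.
# None means the AT is not expected to announce anything explicit.
AT_PHRASES: Dict[str, Dict[Tuple[str, str], Optional[str]]] = {
    "nvda": {
        ("role", "checkbox"): "check box",
        ("checked", "false"): "not checked",
        ("role", "option"): None,  # assumed: implied by the listbox context
    },
}


def expand(at: str, assertion: str, value: str) -> List[Tuple[str, str]]:
    """Translate e.g. ("assert_role", "checkbox") into generic assertions."""
    kind = assertion[len("assert_"):]  # "assert_role" -> "role"
    phrase = AT_PHRASES[at][(kind, value)]  # KeyError if the AT has no mapping
    if phrase is None:
        return []  # nothing to assert for context-implied roles
    return [("assert_contains", phrase)]


print(expand("nvda", "assert_role", "checkbox"))
print(expand("nvda", "assert_checked", "false"))
print(expand("nvda", "assert_role", "option"))
```

The trade-off is visible even at this size: the test file stays AT-neutral, but every AT needs a maintained table, whereas contains / not contains / equals needs none.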

Running tests

Context: The AT driver may also need to be able to change settings and maybe read internal state.

@WestonThayer commented:

I think this is a hard requirement to solve the following scenario:

  1. The test runner forwards a command to the AT driver
  2. How long does the test runner wait before forwarding the next command?

A simple timeout could result in timing issues:

  1. Test runner forwards a command
  2. Test runner does not detect speech output within 2 seconds, so it proceeds and forwards the next command
  3. AT finishes processing the command in step (1), but the test runner now attributes its output to the command in step (2)

Another issue could be if the custom speech vocalizer (SAPI or other) gets multiple "chunks" of output from a single command. For example, could the following sequence happen?

  1. Test runner forwards a command
  2. AT processes. This particular command will result in the AT outputting two "chunks" of speech to the vocalizer, but the test runner doesn't know that. Test runner receives first "chunk" and proceeds
  3. Test runner forwards second command
  4. Test runner now receives the second chunk resulting from step (1), eventually followed by a new chunk from the second command (step 3)

The test would probably fail in this case. Timeouts would probably result in very flaky tests.
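
The chunk scenario can be illustrated with a toy model of a runner that attributes each speech chunk to the most recently sent command. The timeline format, command names, and `attribute_chunks` helper are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# ("send", command) or ("speech", chunk), in the order the runner observes them
Event = Tuple[str, str]


def attribute_chunks(timeline: List[Event]) -> Dict[str, List[str]]:
    """Attribute every speech chunk to whichever command was sent last."""
    attributed: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for kind, value in timeline:
        if kind == "send":
            current = value
            attributed[current] = []
        elif kind == "speech" and current is not None:
            attributed[current].append(value)
    return attributed


timeline = [
    ("send", "command 1"),
    ("speech", "command 1, chunk A"),
    ("send", "command 2"),                # runner proceeds after the first chunk
    ("speech", "command 1, chunk B"),     # late chunk from command 1
    ("speech", "command 2, chunk A"),
]

for command, chunks in attribute_chunks(timeline).items():
    print(command, "->", chunks)
```

In this model the second command is credited with a chunk that belongs to the first, so an equality assertion on either command's output would fail.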

@WestonThayer
Copy link

I think my comments can be resolved.
