Automation test format #349
With the substring approach (e.g. …)
An alternative is to rely on the internal state of the screen reader, i.e. directly ask for role etc. In combination with …
Ack @mfairchild365
I now have a working implementation in our fork of NVDA with the above test format. See:
Questions:
This currently isn't defined anywhere. I'm strongly of the opinion that a page should be entirely reloaded between test commands, because at present we are just hoping that testers will infer the setup to actuate command 2, command 3, etc. However, testers can't just do a refresh, because this doesn't seem to cause setup scripts for a test to re-run. Therefore, they have to close the window and hit "Open Test Page" again, which admittedly is annoying.

Having said that, regardless of how human testers carry out tests with multiple commands, the annoyance factor need not apply to automated ones. If a computer has to reload the page, not a problem. I think that should be explicitly encoded, rather than trying to exactly mirror what a person would do.

There are also other nuances here. For example, in some cases resetting isn't as straightforward as locating a heading or moving to the top/bottom of the page. Consider the tri-state checkbox example, where the modified test page explicitly includes a link before and after the checkbox because, unlike the two-state ones, there is only one tri-state control present. For a select-only combobox, asking users to test both Up and Down Arrow for navigating within the options relies on the combobox being expanded and focus being placed on a specific item.
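A minimal sketch of what "explicitly encoded" could mean for an automated runner; the field names here are invented for illustration and are not part of any agreed format:

```python
import json

# Hypothetical illustration of "explicitly encoded" resets: a per-command flag
# (name invented) telling an automated runner to reload the test page and
# re-run the setup script before sending the command's keys.
command = json.loads('{"keys": ["Down Arrow"], "reload_page_first": true}')

if command.get("reload_page_first"):
    # An automated runner would reload the page and re-run the setup script
    # here; a human tester would close the window and hit "Open Test Page".
    pass
```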
From today's telecon, I heard a few wants:
@jscholes did I miss something or misrepresent something?
Maybe they should be separate tests?
Again, maybe separate tests whenever there's a need to "reset".
On today's call (3 Dec), we talked more about test automation. Specifically:
Background

Different screen readers present information in their own unique way(s), creating disparities between the way a tester (human or automated) should interact with a component to cause the information to be conveyed. Key examples mentioned on the call were:
This is a challenge for the ARIA-AT project. When humans are carrying out a test, we can instruct them that they may need to press a particular key more than once, and hope that they infer enough from that to complete the test successfully. There is potential here to discuss how these instructions can be made as clear as possible. But when we automate a test, a computer needs one or more defined stopping points. It's not enough to give it a statement of intent.

Proposed Solutions

The simplest way of addressing this would be to encode the number of required keypresses in a test, customised on a per-AT basis if necessary. We want to avoid going down that road for several reasons:
Similar to the above, we also don't want to enforce a one-to-one mapping between keypresses and assertions. For example, assert that the name of a combobox is conveyed on the first press of Down Arrow, and that the role, state and value are conveyed on the second. This shares all of the problems outlined above, while creating more work for test writers and providing a confusing experience for human testers.

Working Solution

At the end of the call, we all seemed to agree that the following approach warranted further progress:
There are several advantages to this approach:
@sinabahram, @mcking65, @zcorpan and others, let me know if you feel I've missed anything out here. Note that there are things we didn't reach conclusions on, e.g. command sequences like T followed by Down Arrow to reach the first cell of a table.
Sure, if we're automating them (and have some automation in test generation as well). For a human tester, turning one test with four commands into four tests is not ideal.
I think there are different considerations here: those for the presentation from AT, and those of web developers wanting to ensure they wrote the right thing. These should be two different test systems.

I'm worried that having a set of asserts (role, state, name, etc.) will prove to be limiting; AT will want to be able to present information in new ways. AT (or at least NVDA) presents information differently based on the relationships between objects. For instance, an object with role X may be reported differently when it is the child of an object with role Y. Using asserts based on internal state seems accessibility-API oriented rather than end-user oriented; in my opinion it is important to check that the user is actually presented with the information, rather than merely that the AT collected it and has it internally.

If web developers want to ensure that their HTML results in specific semantics exposed by the accessibility API, their test should be against the browser rather than the AT. The AT has to interpret that information and present it to the user in a friendly way according to user preferences. There is certainly the risk of fragile tests, but at the end of the day, if what is presented to the user changes, we need to be sure the change was correct.
@feerrenrut do you know if there's any internal NVDA abstraction just below what's sent to the vocalizer/braille? Something that would be closer to what the end user will experience than role/state/name, which would at least make the tests less brittle if whitespace changes, "check box" becomes "checkbox", or "list with 4 items" is shortened to "list 4 items".
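For illustration only, and not an existing NVDA abstraction: a test runner could smooth over some of that brittleness by normalizing captured speech before asserting. A rough sketch, with a hand-picked alias table as a stand-in:

```python
import re

# Purely illustrative mitigation, not an existing NVDA API: normalize captured
# speech so cosmetic differences ("check box" vs. "checkbox", extra whitespace)
# don't break assertions. The alias table is hand-picked for the example.
ALIASES = {
    r"check box": "checkbox",
    r"list with (\d+) items": r"list \1 items",
}

def normalize(speech: str) -> str:
    speech = re.sub(r"\s+", " ", speech.strip().lower())
    for pattern, replacement in ALIASES.items():
        speech = re.sub(pattern, replacement, speech)
    return speech

assert normalize("Check box  not checked") == normalize("checkbox not checked")
assert normalize("list with 4 items") == normalize("list 4 items")
```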
@jscholes thank you for that summary! I'm not convinced of the utility of collecting and reporting on the number of keystrokes. Maybe that warrants a separate issue though if we want to make that part of our scope. I can experiment with adding these:
@feerrenrut thanks for sharing your thoughts. It's possible that the needs for web developer testing are different enough that we should have separate systems. We have proposed additions to WebDriver for roles and accessible names already (w3c/webdriver#1441, w3c/webdriver#1439), which would help web developers wanting to check things based on the browser's accessibility tree.
Here's a proposal:
Edit: design doc updated. |
Thanks for this great stuff, @zcorpan. I hear your concern when you say "I'm not convinced of the utility of collecting and reporting on the number of keystrokes." I don't think we need a separate issue for discussing an integer to track the number of keystrokes used in these commands. Can we please increment a number and store it for these tests, even if we delay the reporting and surfacing discussions until later?

This comes down to equity for us. There are hundreds of millions of dollars spent each year on everything from eye-tracking to detailed mouse pointer analysis, time to reach touch targets, and a variety of other metrics for analyzing how nominally sighted users interact with interfaces. There is virtually none of that in the assistive technology space. This single number for how many keystrokes are required could begin closing that huge information gap. It will absolutely shock most people when they learn how many keystrokes are required for blind users to interact with something as simple as a video player or menu system, not to mention complex data tables. The utility is not small here; it's overwhelming, because there are so few metrics of this kind that any of them could move the entire field forward.

I hope that context helps. I'm only pushing back on this because it feels like a tiny ask: an integer being incremented in your loop when performing commands. Happy to discuss further.
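For what it's worth, a minimal sketch of that ask, assuming a simple runner loop; all names here are illustrative rather than part of the aria-at format:

```python
# A minimal sketch of the ask above (names are illustrative, not part of the
# aria-at format): count keystrokes while executing a test's commands and
# store the total with the result, even if reporting on it comes later.
def run_commands(commands, press_key):
    keystrokes = 0
    for command in commands:
        for key in command["keys"]:
            press_key(key)       # hand the key to the AT / test driver
            keystrokes += 1
    return {"keystrokes": keystrokes}

result = run_commands(
    [{"keys": ["Down Arrow", "Down Arrow"]}, {"keys": ["Tab"]}],
    press_key=lambda key: None,  # stand-in for the real key-press driver
)
assert result["keystrokes"] == 3
```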
@sinabahram It seems to make sense to me that, just as too many clicks is deemed bad UX for sighted users, too many keystrokes should be a similar benchmark for AT users. Am I understanding your motivation correctly?
@jesdaigle that is the impetus, yes. The takeaways may need to be more nuanced, e.g. there are four-finger keystrokes AT users sometimes have to perform, so even a single one of them can be thought of as worse than maybe two simpler keystrokes; however, all that nuance can come later. For now, just counting the keystrokes addresses the overall concern.
@sinabahram I think it does warrant a separate issue. It's not currently in scope for aria-at. I would like the keystroke count idea to be properly considered for the entire project, rather than "sneaked in" to one part of it. Would you be interested in opening a new issue?
Currently there is not. Historically the information has been processed as a text stream. It is a direction I'd like to head in, as it would enable new experiences and make new kinds of customisation possible, but there is a long way to go. It should be noted that this information is likely to be screen reader specific and would be at a different abstraction level from what is provided via accessibility APIs by the browser. If the goal is to test that an HTML sample results in the exposure of a specific role, name, description etc., there are tools to do so against a browser, but each browser would need to be tested independently.

@zcorpan Rather than a custom file format, I think it would be wise to contain this information in a common pre-existing format with parsers already available, e.g. YAML, JSON, etc.
@feerrenrut indeed, there are differences, and there isn't any intention to remove differences per se. The manual tests are currently generated from a source that encodes tasks, and those translate to AT-specific commands. Assertions can be shared or different between ATs. See https://github.com/w3c/aria-at/wiki/How-to-contribute-tests#encode-each-interaction-into-a-test

In the end, a test will have AT-specific instructions and AT-specific assertions. As an optimization for writing tests, we're sharing things when they happen to be the same. The change to accumulate output across multiple keypresses, and to have assertions for the output from all of them, is intended to allow for differences to exist, and also to let us share more things when writing tests (in the CSV format).

I'm happy to switch to JSON. I agree that would be wise. 🙂 (Edit: changed to JSON in the design doc.)
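As a rough illustration of the accumulation idea (a sketch only; the function and parameter names are assumptions, not the design doc's schema):

```python
# A rough sketch of the accumulation idea (function and parameter names are
# assumptions): press the command's key a bounded number of times, gather
# everything spoken, and run the assertions once against the combined output
# rather than tying each assertion to a specific keypress.
def run_command(press_and_capture, key, presses, assertions):
    spoken = []
    for _ in range(presses):
        spoken.append(press_and_capture(key))  # speech captured after this press
    accumulated = " ".join(spoken)
    return {assertion: assertion in accumulated for assertion in assertions}

results = run_command(
    press_and_capture=lambda key: "Lettuce check box not checked",  # fake capture
    key="x",
    presses=2,
    assertions=["Lettuce", "check box", "not checked"],
)
assert all(results.values())
```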
I have now implemented these changes in bocoup/nvda#2:
We had a one-off meeting about this topic today, minutes here: https://www.w3.org/2020/12/10-aria-at-minutes.html

A couple of points came up:
I've left out some things for brevity; feel free to add a comment below if I've omitted something relevant. 🙂
Thank you for these notes, @zcorpan. Sorry about my scheduling conflict in not being able to attend this conversation. I have one question from this point in the notes:

I would like to voice support for an addon, or for the idea of a TTS driver, which I'm assuming did not come up as I don't see it in the minutes. My reasoning for supporting an addon is that I interpret the above to mean that the alternative is patching NVDA directly, which of course makes this quite difficult to add onto other setups or configs, not necessarily for production but just for testing amongst ourselves. Please let me know if I have that right, or if something else is meant here.

The text-to-speech (TTS) driver idea is the one I brought up on our last call, where the source of truth upstream is the TTS. If a SAPI TTS listener/handler can be written that in turn passes through to system SAPI for actual speech (that way blind users can still drive with it), then the driver can simply log all the text it is being asked to say. This feels really elegant and decouples so much of the system from any specific screen reader. It also partly solves knowing when the screen reader is done speaking, as you can have a timeout to be absolutely sure. If this is done, then the same exact driver, with no changes whatsoever, can be used in JAWS, NVDA, and I think even Narrator since it is SAPI (we should confirm that). Thanks for letting me contribute those thoughts to this discussion.
Sorry I missed this as well, as I think the below point might've been from a comment I put on the GDoc.
I believe NVDA's speechSpyGlobalPlugin.py has access to NVDA's internal command processing queue. After emulating a key press, it can precisely block until NVDA has finished sending speech to the synth. speechSpySynthDriver.py instantly "speaks", so there's no delay after NVDA sends some text to the TTS. Nonetheless, speechSpyGlobalPlugin has a 500ms timeout to wait for the synth to finish speaking; I'm not sure why that's necessary, since speechSpyGlobalPlugin gets the post_speech callback.

@sinabahram I agree it would be great if the custom TTS could still vocalize with a real TTS, and if that feature could be optionally shut off during purely computer-driven tests, I think it would drastically reduce execution time for the suite.

I think speechSpyGlobalPlugin's access to NVDA's internal queue is essential (in addition to the custom TTS). If it didn't have that, the test runner could hit some pretty confusing bugs if it were only relying on a custom TTS. For example, consider a contrived test:
Suppose NVDA had a bug where, if given enough time, it would've announced some erroneous extra data, maybe start reading the HTML markup for the heading.
If NVDA hit a processing delay (maybe Windows Defender decides to do a scan, or maybe computing the next speech string is computationally intensive) that lasted longer than the timeout in step (4), the test runner would proceed to step (5), which would cancel the queued "left caret h right caret..." speech, preventing it from ever reaching our custom TTS. Thus, the test would pass, failing to detect this class of bug.

I think there are several classes of bugs that timing issues like this can cause; I'm not sure this is the worst variety, but it makes me nervous. At a minimum, I think it could cause a very flaky test suite.
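To make that failure mode concrete, here is a small, self-contained sketch in plain Python (not NVDA's actual plumbing) of why blocking on a done-speaking signal, with a timeout only as a backstop, is more robust than a fixed delay:

```python
import threading

# Illustrative only (plain Python, not NVDA's plumbing): if the runner waits
# on a "done speaking" signal, with the timeout purely as a backstop, late
# speech still reaches the assertions instead of being silently cancelled.
done_speaking = threading.Event()
captured = []

def on_speech(text):
    # Called by the (fake) synth for every chunk of speech it receives.
    captured.append(text)

def post_speech():
    # Called once the screen reader has flushed everything to the synth.
    done_speaking.set()

# Simulate speech that arrives 0.6s later, past a naive 0.5s fixed delay.
threading.Timer(
    0.6, lambda: (on_speech("left caret h right caret"), post_speech())
).start()

if not done_speaking.wait(timeout=5.0):   # backstop, not the primary mechanism
    raise AssertionError("screen reader never signalled post_speech")
assert "left caret" in " ".join(captured)  # the late output is still observed
```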
I'll give my thoughts on a couple of aspects being discussed here:
First I'd like to point to what it is we (at NV Access) are emulating: manual tests, i.e. verifying the end-to-end user experience, where the output "end" is the point at which NVDA passes data to a synth driver. As a human screen reader user, the amount of time with no speech, and prior experience of what the SR announces, allow you to determine when the SR is done. As much as possible, I think we should try to emulate the real-world experience to prevent missing bugs that would be experienced by the user. However, I agree there is a trade-off against the time taken to run the tests. Personally, I'd prefer to optimise later, and try to ensure correctness first.

As for why a fixed time limit may be required: a screen reader such as NVDA depends on receiving events from the browser, and there is no guarantee about the order or timing of these. In practice, highly delayed events (e.g. more than a few tens of milliseconds) would likely be considered a bug and reported to browser developers. We initially started with a purely time-based approach; the 'core asleep' approach came later. We found that this wasn't entirely reliable, and so left the time-based approach in place. We have since fixed several other bugs that caused intermittent failures, which may have been blamed on this new approach.

One difficult aspect of developing this is isolating and identifying bugs in the testing system itself. At this stage these mostly show up as intermittent issues. Ideally we would have a mechanism to collect the system test results for every build, so we can use a stats-based approach to determine the likelihood that an intermittent issue is resolved.

I think using a synth driver (and later a braille driver) as the interface between the test framework and the SR would be quite elegant, and essentially that is where I started with our system tests. As noted by @WestonThayer, there would need to be a few concessions made, which would impact the running time for the tests:
@sinabahram Being able to listen along during a test would be a good idea. However, building that into the system test driver (e.g. finding available synths, etc.) is a lot to create and maintain that is not directly related to the goal. At NV Access, something that has been on my mind is a 'tee'-like driver interceptor allowing a primary synth and a set of secondary synths. This would solve a few issues in NVDA: speech viewer, NVDA Remote, and allowing multiple synths at once. Another alternative is that the tests output all speech commands so that they can be replayed after the tests. Another thing that would help debugging is being able to stop the tests either manually (if able to listen along) or on assert failure, in order to investigate.
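A conceptual sketch of the 'tee' idea in plain Python; this is not NVDA's actual SynthDriver API, just the shape of the pattern:

```python
# Conceptual sketch of the 'tee' idea in plain Python (not NVDA's SynthDriver
# API): one primary synth stays audible while any number of secondary sinks
# (a test log, a speech viewer, NVDA Remote) receive the same text.
class TeeSpeech:
    def __init__(self, primary, secondaries=()):
        self.primary = primary
        self.secondaries = list(secondaries)

    def speak(self, text):
        for sink in self.secondaries:
            sink(text)       # e.g. append to the test harness log
        self.primary(text)   # still audible for anyone listening along

log = []
tee = TeeSpeech(primary=print, secondaries=[log.append])
tee.speak("Lettuce checkbox not checked")
assert log == ["Lettuce checkbox not checked"]
```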
@feerrenrut, major +1 to everything you've said here. And I love the tee-command analog approach. That's a beautiful pattern that has been used not only within various command-line shells but also with HTTP streams to test various web servers, etc. I think that optimizing for startup times later is quite reasonable. I imagine there may be entire portions of NVDA that can be stripped out or jumped over if we know it is running within a test instance. Such optimizations may help any other automated testing on NVDA's side. I'm thinking of how kernels and other distributions are heavily optimized for running within application virtualization containers.
Excellent point, thank you for changing my perspective. Much can be done to stabilize the host OS environment as well (disabling updates, virus scans, services, etc.).
I've made a few changes:
Moving remaining open comments from the Google Doc to here:

Test format

Context: @mfairchild365 commented:
@zcorpan replied:
Running tests

Context: The AT driver may also need to be able to change settings and maybe read internal state. @WestonThayer commented:
I think my comments can be resolved.
Following discussion in https://docs.google.com/document/d/1jDf_gEQjRppLyEDEW0qPVbq1QaCT36bVDl9JkI-nvCU/edit# (also see #337)
cc @sinabahram @jscholes
We agreed that, to enable automation:
assert_accname: Lettuce, pass if the substring "Lettuce" appears exactly once). And also, we want to assert that the full string matches an expected string.

For example, for the test "Navigate to an unchecked checkbox in reading mode" for NVDA, I envision a text file that acts as the test with the full sequence of where to navigate the browser to, which keys to press, and the assertions. It could look something like this:
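The example file from the original comment isn't reproduced in this copy; as a stand-in, here is a guessed sketch of the kind of content described (every field name is invented for illustration):

```python
import json

# A guessed stand-in for the missing example (every field name is invented):
# where to point the browser, how to set up, which keys to press, and what to
# assert about the captured output.
test_file = json.loads("""
{
  "title": "Navigate to an unchecked checkbox in reading mode",
  "at": "nvda",
  "mode": "reading",
  "navigate_to": "reference/two-state-checkbox/checkbox-1.html",
  "setup_script": "moveFocusBeforeCheckbox",
  "commands": [
    {
      "keys": ["x"],
      "assert_substrings": ["Lettuce", "check box", "not checked"],
      "assert_full_output": "Lettuce  check box  not checked"
    }
  ]
}
""")
assert test_file["commands"][0]["keys"] == ["x"]
```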
Note that this isn't the format I think we should write tests in; that can still be CSV and avoid repetition, but the above could be generated from the CSV source.
In addition, the NVDA project could choose to have "test expectations" files alongside these, that give the actual output after each interaction and/or each assertion.
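As a hedged illustration of what such an expectations file might contain (the syntax below is invented, loosely modeled on web-platform-tests metadata, and the issue number is a placeholder):

```python
# A guessed sketch of an NVDA-side expectations file (syntax invented, loosely
# modeled on web-platform-tests metadata; the issue number is a placeholder):
EXPECTATIONS = """
[navigate-to-unchecked-checkbox-reading.json]
  assert_role: [ PASS ]
  assert_accname: [ FAIL ]       # known issue, e.g. nvaccess/nvda#NNNN
  assert_state: [ PASS FAIL ]    # flaky: sometimes passes, sometimes fails
"""

def expected_statuses(expectations, assertion):
    # Tiny parser for the sketch above: return the statuses allowed for an
    # assertion; anything not listed is simply expected to pass.
    for line in expectations.splitlines():
        if line.strip().startswith(assertion + ":"):
            return line.split("[", 1)[1].split("]", 1)[0].split()
    return ["PASS"]

assert expected_statuses(EXPECTATIONS, "assert_state") == ["PASS", "FAIL"]
```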
Encoding test expectations for each downstream project using the tests has precedent in web-platform-tests (example). It allows saying that an individual assertion is expected to fail (represented as [ FAIL ] above), and maybe reference a known issue. It also allows saying that a particular result is known to be flaky, i.e. sometimes pass, sometimes fail (represented as [ PASS FAIL ] above).
This way of representing tests would make it possible to programmatically update the expected output to match the actual output when fixing a bug in the screen reader implementation that might affect many tests. So the workflow for an NVDA developer would be: