Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows directory watcher stops working after some time #37233

Closed
jakemac53 opened this issue Jun 11, 2019 · 51 comments
Closed

Windows directory watcher stops working after some time #37233

jakemac53 opened this issue Jun 11, 2019 · 51 comments
Assignees
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-io P1 A high priority bug; for example, a single project is unusable or has many test failures type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)

Comments

@jakemac53
Copy link
Contributor

jakemac53 commented Jun 11, 2019

Original issue on the watcher package dart-lang/tools#1713, related issue in build_runner dart-lang/webdev#436.

I created a simple repro below that uses vanilla Directory.watch. You can run each of these scripts at the same time to repro, which exposes a few issues:

  • If you start the watch script after the file modifying script is already running, then it won't see any events at all.
  • After a while (few thousand events, but not consistent), the directory watcher script simply exits.
  • The watcher script terminates in some weird way, I never see the Stream closed!! message on stdout, but the process does exit. There is no unhandled exception that surfaces either.

File modification script, edits a single file in a loop:

import 'dart:io';

void main() async {
  var i = 0;
  var file = File('out.txt');
  await file.create();
  while (true) {
    i++;
    await file.writeAsString('$i');
  }
}

Watcher script, logs watch events:

import 'dart:io';

main() async {
  var i = 0;
  await for (var event in Directory.current.watch(recursive: true)) {
    print('$i: $event');
    i++;
  }
  print('Stream closed!!');
}

As a result of this we have to use a polling watcher, which doesn't perform well.

@jakemac53 jakemac53 added area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-io type-bug Incorrect behavior (everything from a crash to more subtle misbehavior) labels Jun 11, 2019
@sortie sortie added the P2 A bug or feature request we're likely to work on label Jun 11, 2019
@jakemac53
Copy link
Contributor Author

cc @kevmoo @vsmenon

@jakemac53
Copy link
Contributor Author

Just to clarify the impact here, even a basic hello world web project (no angular) will currently consume an entire core indefinitely as long as webdev is running due to using the polling watcher.

@sortie
Copy link
Contributor

sortie commented Jun 13, 2019

How critical is this bug? I probably won't have time to address this quarter meaning it probably won't get fixed until after the summer vacation.

@mnordine
Copy link
Contributor

@sortie Can you take a look at the dart-lang/webdev#436 to see how critical it is?

@sortie
Copy link
Contributor

sortie commented Jun 13, 2019

Do we have an idea when this was introduced?

@jakemac53
Copy link
Contributor Author

Afaik it has been this way forever.

@natebosch
Copy link
Member

The problem has been around for a very long time. The impact has gone up since the new build system relies on caching to disk.

@mnordine
Copy link
Contributor

mnordine commented Jul 2, 2019

Any update on a fix for this?

@supermuka
Copy link

related comment: dart-lang/webdev#436 (comment)

@kevmoo
Copy link
Member

kevmoo commented Jul 23, 2019

CC @a-siva @dgrove – this is really gnarly issue for anyone doing (Flutter) web on Windows!

@mnordine
Copy link
Contributor

This issue is blocking our company from upgrading to Dart 2

@kevmoo kevmoo added P1 A high priority bug; for example, a single project is unusable or has many test failures and removed P2 A bug or feature request we're likely to work on labels Jul 23, 2019
@kevmoo
Copy link
Member

kevmoo commented Jul 23, 2019

FYI @aadilmaan

@zichangg
Copy link
Contributor

This is because of native ReadDirectoryChangesW calls. Once many changes are made within a short period of time, it didn't notify with events. Did some tests here.

void main() async {
  var i = 0;
  var file = File('out.txt');
  await file.create();
  while (true) {
    i++;
    await Future.delayed(new Duration(milliseconds : 1000));
    await file.writeAsString('$i');
  }
}

With around 100 milliseconds sleeping, it works well. Once the sleeping time is down to 10 milliseconds, watcher can't perform well. Not sure how to deal with it.

The other issue on silently shut down. I have no idea. I can't reproduce it locally.

@mnordine
Copy link
Contributor

This is because of native ReadDirectoryChangesW calls

Is that the most efficient way to be notified of file changes for Windows?

I'm no expert, but what about this: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstchangenotificationw

@zichangg
Copy link
Contributor

@bkonyi Any idea? Do we have a way to improve/fix it?

@bkonyi
Copy link
Contributor

bkonyi commented Jul 26, 2019

@bkonyi Any idea? Do we have a way to improve/fix it?

Hm... nope, I'm not terribly familiar with the file system watcher code. I'm wondering if maybe we're seeing a race that's causing events to be dropped or if that's just a limitation of the API itself. I'm not sure why the isolate would be exiting though... it's almost like it things it's done with everything on the event queue and there's nothing left to do so it shuts down. Unfortunately, it's hard to say if you can't reproduce it though.

@mnordine
Copy link
Contributor

I hope you don't mind, but I seem to recall reading on Bruce Dawson's blog him running into this sort of thing, so I asked him for his advice:

They should use the directory monitoring APIs. Copying VsChromium's code would be a good place to start. It does monitoring-and-scanning with throttling to avoid pathologies:
(link: https://chromium.github.io/vs-chromium/) chromium.github.io/vs-chromium/

If you don't know him, he's an expert on Windows performance: https://randomascii.wordpress.com/

(He also works at Google)

I was looking here: https://github.com/chromium/vs-chromium/blob/master/src/Core/Files/DirectoryChangeWatcher.State.cs

@mraleph
Copy link
Member

mraleph commented Jul 27, 2019

MSDN says the following:

When you first call ReadDirectoryChangesW, the system allocates a buffer to store change information. This buffer is associated with the directory handle until it is closed and its size does not change during its lifetime. Directory changes that occur between calls to this function are added to the buffer and then returned with the next call. If the buffer overflows, ReadDirectoryChangesW will still return true, but the entire contents of the buffer are discarded and the lpBytesReturned parameter will be zero, which indicates that your buffer was too small to hold all of the changes that occurred.

There is obviously an inherent reliability problem with the API - no matter what we do it always can loose events if somebody is doing a lot of file system changes. We can try to fight this issue, but I also think that build_runner should change the way it uses watchers to prevent events from build/ from overflowing the watch buffer. @jakemac53 could build_runner avoid watching directory that contains output of the build?

I think the only simple way we could try to cope with this is to try bumping kBufferSize we use for DirectoryWatchHandle higher (e.g. to several megabytes) - because there is an indication that the size of the internal buffer is controlled by that. Another possibility is to try issuing subsequentReadDirectoryChangesW as fast as possible to drain internal OS buffer - but I think that would require some major refactoring of the implementation.

@supermuka
Copy link

Just a simple thought while you lookging for a definitive solution: webdev serve and build_runner serve could have a --without-watcher parameter, for example. We would not have a new build when there were changes. But at least this way we would have a development server running without a high CPU usage.

@natebosch
Copy link
Member

could build_runner avoid watching directory that contains output of the build?

Here is at least one issue where we have discusses changing what we are watching to avoid seeing the updates we make in .dart_tool - dart-lang/build#1579

I think there may have been some other places where it came up.

There are some tricky things to work out if we try to do this - for instance if we start up watchers in the directories we care about initially, and a new directory we would care about gets added, we won't have a watcher in it.

I think we could solve this if there were some SDK and OS supported way of either:

  • Shallowly watching a directory so we can add new watchers as necessary or
  • Filtering out events from a given subdirectory

The other thing to consider would be to move the output somewhere other than .dart_tool - and that is a tricky thing to solve...

@zichangg
Copy link
Contributor

zichangg commented Aug 6, 2019

@jakemac53 Sorry for delayed reply. For all the issue you posted,

If you start the watch script after the file modifying script is already running, then it won't see any events at all.

I managed to find the root cause for this. The reason is that ReadDirectoryChangesW will be started immediately after it is set up. program has not come back to dart side to set up the stream, since modification happens in a very high frequency. Then the event sent by VM will not be received by dart side.

Can you try this cl locally to see whether it solve your problem? https://dart-review.googlesource.com/c/sdk/+/111342

After a while (few thousand events, but not consistent), the directory watcher script simply exits.
The watcher script terminates in some weird way, I never see the Stream closed!! message on stdout, but the process does exit. There is no unhandled exception that surfaces either.

I have never triggered any termination when I worked on the issue. Do you know how to reproduce it consistently?

@jakemac53
Copy link
Contributor Author

I have never triggered any termination when I worked on the issue. Do you know how to reproduce it consistently?

I don't seem to be able to reproduce this any more... 🤷‍♂️.

Can you try this cl locally to see whether it solve your problem? https://dart-review.googlesource.com/c/sdk/+/111342

I don't have the sdk set up to build on my windows box (its my home machine) so I can't easily test, trying to figure out how to just download the version the try bots built though. Or you could land the cl and I could get it from the build bots after they run.

@jakemac53
Copy link
Contributor Author

jakemac53 commented Sep 13, 2019

@zichangg I can now consistently repro the issue where the entire program simply exits - in 3 runs this happened between 100 and 200 events in so it doesn't take long (this is using the two original programs linked at the top). I saw the same behavior regardless of whether I started the watcher script before or after starting the file editing script.

The program exits without any sort of stack trace or anything so I unfortunatley can't help much with debugging here, I don't know if there are additional vm flags I could provide to get more logging?

@jakemac53 jakemac53 reopened this Sep 13, 2019
@zichangg
Copy link
Contributor

@jakemac53 I failed to reproduce it. No matter using powershell, cmd or git bash. Do you have a detailed reproduction? I'm testing with
Dart VM version: 2.6.0-edge.46f2603781be3fae545cdd9f9f2f1a07f0fde0f5 (Mon Sep 16 19:36:40 2019 +0000) on "windows_x64"

I don't know if there are additional vm flags I could provide to get more logging?

I don't think there is specific flags for file watchers. Maybe use observatory and set --pause-isolates-on-exit.

@zichangg
Copy link
Contributor

@jakemac53 You can probably pass --trace-isolates to VM to print more info.

@aam
Copy link
Contributor

aam commented Sep 17, 2019

Does the "program exiting" behavior reproduces more easily if you pause watcher(by clicking somewhere in powershell console which has watcher running) for 4-5 seconds, then resume it(hit Esc, for example).

@jakemac53
Copy link
Contributor Author

I will try this again tomorrow with --trace-isolates. Fwiw I was using whatever the default shell in vscode is.

@mraleph
Copy link
Member

mraleph commented Sep 23, 2019

@jakemac53 did you have a chance to try it out? I would like to make sure that this issue is not falling through.

@jakemac53
Copy link
Contributor Author

Thanks for the ping I had indeed forgotten to check this, I can check it when I get home tonight on my windows machine there.

@jakemac53
Copy link
Contributor Author

I tried with dart --trace-isolates watch.dart but it says it is an unrecognized flag:

Setting VM flags failed: Unrecognized flags: trace_isolates

@jakemac53
Copy link
Contributor Author

jakemac53 commented Sep 23, 2019

I did find --trace_shutdown and got this (but doesn't seem that useful):

PS C:\Users\Jake\Desktop\git\playground> dart --trace_shutdown watch.dart
0: FileSystemModifyEvent('C:\Users\Jake\Desktop\git\playground\ooouut.txt', contentChanged=true)
...
725: FileSystemModifyEvent('C:\Users\Jake\Desktop\git\playground\ooouut.txt', contentChanged=true)
[+2823ms] SHUTDOWN: Starting shutdown
[+2825ms] SHUTDOWN: Shutting down profiling
[+2845ms] SHUTDOWN: Disabling isolate creation
[+2847ms] SHUTDOWN: Killing all app isolates
[+2849ms] SHUTDOWN: Shutting down app isolates
[+2851ms] SHUTDOWN: Shutting down kernel isolate
[+2857ms] SHUTDOWN: Shutting down service isolate
[+2872ms] SHUTDOWN: Waiting for isolate shutdown
[+2880ms] SHUTDOWN: Entering vm isolate
[+2881ms] SHUTDOWN: Deleting thread pool
[+2883ms] SHUTDOWN: Disabling OS Thread creation
[+2885ms] SHUTDOWN: Cleaning up vm isolate
[+2887ms] SHUTDOWN: Deleted os_thread
[+2888ms] SHUTDOWN: Deleting code observers
[+2890ms] SHUTDOWN: Shutting down timeline
[+2893ms] SHUTDOWN: Done

@zichangg
Copy link
Contributor

@jakemac53 Just wanna double check. Is your reproduction the same as what @aam described?

you pause watcher(by clicking somewhere in powershell console which has watcher running) for 4-5 seconds, then resume it(hit Esc, for example).

@jakemac53
Copy link
Contributor Author

jakemac53 commented Sep 24, 2019

I don't need to pause it to get the reproduction - I run the watch script and then run the file modification script. The watch script dies typically within a couple seconds (after starting the file modification one).

@supermuka
Copy link

@zichangg and @jakemac53. A solution for this issue is very important to my company team to start a project with Dart on Windows enviromnent. Is there a deadline to close this issue?

PS: Just one thought, no thorough investigation: analyzer package in Dart I think it uses some kind of watcher to detect code changes, and although a slow one when booting the IDE or code change, apparently it works on Windows. Maybe the same package could be used in webdev.

@jakemac53
Copy link
Contributor Author

The analyzer uses the same watcher - but there is a specific bad interaction with the watcher and package:build due to the high number of file events we cause when doing builds.

@jakemac53
Copy link
Contributor Author

(also anecdotally I have many times seen analyzer stop working on windows and had to restart vscode to get it back, I would be willing to bet this is the reason)

@aam
Copy link
Contributor

aam commented Sep 27, 2019

@jakemac53 can you try the following variant of watcher please?

import 'dart:io';
import 'dart:async';

main() async {
  var i = 0;
  Directory.current.watch(recursive: true).listen((event) {
    i++;
  });
  Timer.periodic(Duration(seconds:1), (Timer t) {
  	print('at ${DateTime.now()} saw $i notifications');
  });
}

Does it still exits for you?

The explanation I have for why your version exits is that the only active receive port in that watcher is directory changes listener, but it just stops(perhaps due to internal constraints of buffer overflow handling), which leads to that single receive port being closed, leading to dart program exit.
If you add another active port - timer - then dart program doesn't exit even when watcher port is closed.

@zichangg
Copy link
Contributor

perhaps due to internal constraints of buffer overflow handling

I think that's root cause. I tested different buffer size with aam's reproduction. With extremely small buffer size, program will exit automatically. With 64k bytes buffer, pausing watcher's powershell for several seconds can reproduce the problem.

ReadDirectoryChangesW will still return true, but the entire contents of the buffer are discarded and the lpBytesReturned parameter will be zero, which indicates that your buffer was too small to hold all of the changes that occurred.

For buffer size,

ReadDirectoryChangesW fails with ERROR_INVALID_PARAMETER when the buffer length is greater than 64 KB and the application is monitoring a directory over the network. This is due to a packet size limitation with the underlying file sharing protocols.

64kb is what we are using now. I don't think we have anything to do here.

@jakemac53
Copy link
Contributor Author

OK - if this is an issue with the underlying OS then we can fall back to trying to mitigate the problem instead.

Can we get some sort of specific exception thrown when this happens? Then we can at least message to the user what has happened.

That would give us a relatively sane path towards enabling this behind a flag.

@aam
Copy link
Contributor

aam commented Oct 1, 2019

Yeah, something needs to get called when watcher exits(currently there is no indication that watcher stopped watching).

The patch below allows to get onDone callback of Directory.watch(...).listen(..., onDone: ...) called.

diff --git a/sdk/lib/_internal/vm/bin/file_patch.dart b/sdk/lib/_internal/vm/bin/file_patch.dart
index 338a260002..90fcf26b3a 100644
--- a/sdk/lib/_internal/vm/bin/file_patch.dart
+++ b/sdk/lib/_internal/vm/bin/file_patch.dart
@@ -171,7 +171,9 @@ class _FileSystemWatcher {
       assert(_watcherPath.count > 0);
       _watcherPath.count--;
       if (_watcherPath.count == 0) {
-        _unwatchPath(_id, _watcherPath.pathId);
+        if (_idMap.containsKey(_watcherPath.pathId)) {
+          _unwatchPath(_id, _watcherPath.pathId);
+        }
         _pathWatchedEnd();
         _idMap.remove(_watcherPath.pathId);
       }
@@ -293,6 +295,8 @@ class _FileSystemWatcher {
           }
         }
       } else if (event == RawSocketEvent.closed) {
+        _idMap.remove(socketId);
+        stops.add([socketId, null]);
       } else if (event == RawSocketEvent.readClosed) {
       } else {
         assert(false);

@supermuka
Copy link

Folks, I'm sorry for the insistence. But would we now have a "light at the end of the tunnel" for this issue?
If this really has low priority, I would like to know for me to try to look for other workarounds.

@aam
Copy link
Contributor

aam commented Oct 4, 2019

This is not a low priority.
There is https://dart-review.googlesource.com/c/sdk/+/119524 with the fix that hopefully will be landed soon. The fix being exception thrown by the watcher when OS closes it. That should give an app an opportunity to handle this in some reasonable fashion for example by restarting the listener.

dart-bot pushed a commit that referenced this issue Oct 9, 2019
According to https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-readdirectorychangesw,
When using ReadDirectoryChangesW, buffer overflows will still return true. It ends up closing the stream without any notification.
Throw an exception to notify users.

Bug: #37233
Change-Id: I9aebed8b1f30b5e843ad37a51b87d234aa1d8ce6
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/119524
Commit-Queue: Zichang Guo <[email protected]>
Reviewed-by: Siva Annamalai <[email protected]>
Reviewed-by: Alexander Aprelev <[email protected]>
@aam
Copy link
Contributor

aam commented Oct 10, 2019

https://dart-review.googlesource.com/c/sdk/+/119524 landed, it has an example of how one can catch those exceptions(watcher method used testWatchOverflow test https://dart-review.googlesource.com/c/sdk/+/119524/16/tests/standalone_2/io/file_system_watcher_test.dart#451). Please see if it works for you.

@aam aam closed this as completed Oct 10, 2019
@zichangg
Copy link
Contributor

To handle the exception, simply surround stream.listen() with runZoned() and provide onError() callback which will be invoked if exception occurs.

  runZoned(() {
    watcher.listen((data) async {
      // handle event
    });
  }, onError: (error) {
    // handle error
  });

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-io P1 A high priority bug; for example, a single project is unusable or has many test failures type-bug Incorrect behavior (everything from a crash to more subtle misbehavior)
Projects
None yet
Development

No branches or pull requests

10 participants