Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache write race condition across processes #4444

Open
bookman25 opened this issue Sep 7, 2017 · 123 comments · Fixed by microsoft/rushstack#3606
Open

cache write race condition across processes #4444

bookman25 opened this issue Sep 7, 2017 · 123 comments · Fixed by microsoft/rushstack#3606

Comments

@bookman25
Copy link
Contributor

Do you want to request a feature or report a bug?
bug

What is the current behavior?
When running tests in parallel with the new atomic cache writing, we're getting rename errors as multiple processes try to write to the same files. Even with --no-cache option set it's still hitting rename errors because it's still trying to write to the files.

What is the expected behavior?

  1. I think that --no-cache should not write cache files
  2. Caching across multiple processes should not collide, or should be able to restart the test.

Please provide your exact Jest configuration and mention your Jest, node, yarn/npm version and operating system.

{
	"clearMocks": true,
	"collectCoverageFrom": [
		"packages/**/src/**/*.{ts,tsx}",
		"!packages/sf-lint/**",
		"!**/*.d.ts"
	],
	"coverageReporters": [
		"text-summary"
	],
	"moduleFileExtensions": [
		"ts",
		"tsx",
		"js",
		"json"
	],
	"setupTestFrameworkScriptFile": "<rootDir>/jestEnvironment.js",
	"transform": {
		"\\.(ts|tsx)$": "<rootDir>/scripts/preprocessTypescript.js",
		"\\.(less|css|svg|gif|png|jpg|jpeg)$": "<rootDir>/scripts/preprocessText.js"
	},
	"testRegex": "(Spec|.spec).tsx?$"
}

jest 21.0.1
node 6.9.2
yarn 0.27.x/1.0.0
OS Windows

@thymikee
Copy link
Collaborator

Isn't this related? #4432

@bookman25
Copy link
Contributor Author

I don't believe so. I believe the case we see in our repo is the exact same file getting mocked for 2 different processes (while running in parallel) which causes the cache write operation to fail because one process has the file locked. That ticket looks like is more about different files with the same contents. We don't have any of those cases in the repositories we host that we ran into this issue.

@cristiango
Copy link

We basically run into the same issue with our tests. One easy way to reproduce was to remove jest cacheDirectory to force cache generation on the next run.


    jest: failed to cache transform results in: 
C:/myniceproject/src/jest-cache/jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b/3f/settingsProvider_3f1439e55275a95ecfdb7dcb432f7958
   Failure message: EPERM: operation not permitted, rename 
'C:\myniceproject\src\jest-cache\jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b\3f\settingsProvider_3f1439e55275a95ecfdb7dcb432f7958.1630729137' 
-> 
'C:\myniceproject\src\jest-cache\jest-transform-cache-b2e8f1f700b9bd266a0d27bb01b47a2b-34a7e3d71d38ff01f65fdb5abdf5126b\3f\settingsProvider_3f1439e55275a95ecfdb7dcb432f7958'`

@DavidKuennen
Copy link

Having the same issue and can't find a way around it. Jest is basically unusable for us like this.

@vkrol
Copy link
Contributor

vkrol commented Sep 27, 2017

We are trying to update to 21.2.0 from 20.0.4 and now we have the following error on our build servers:

Test suite failed to run
[13:46:50]
[13:46:50]    jest: failed to cache transform results in: C:/TeamCity/buildAgent/temp/buildTmp/jest/jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745/3b/fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770
[13:46:50]    Failure message: EPERM: operation not permitted, rename '...\jest\jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745\3b\fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770.1701848979' -> '...\jest\jest-transform-cache-c60bb8ad55f1dbc29115038800657f2f-4895fc34da3cd9ad1e120af622bca745\3b\fe-selectors_3b56db772e798e2e4d0c9fc7c9e4a770'
[13:46:50]      
[13:46:50]      at Object.fs.renameSync (fs.js:772:18)
[13:46:50]      at Function.writeFileSync [as sync] (node_modules/write-file-atomic/index.js:192:8)

@apphelpyulongs
Copy link

I'm now having the same issue tests are broken randomly.

@vkrol
Copy link
Contributor

vkrol commented Sep 29, 2017

If I run the tests with the --runInBand flag then as expected everything is OK.

@asapach
Copy link
Contributor

asapach commented Oct 7, 2017

I can see the same issue fairly consistently:

  ● Test suite failed to run

    jest: failed to cache transform results in: .../jest-transform-cache-...
    Failure message: EPERM: operation not permitted, rename '...' -> '...'
        at Error (native)

      at Object.fs.renameSync (fs.js:810:18)
      at Function.writeFileSync [as sync] (node_modules/write-file-atomic/index.js:192:8)

jest 21.2.1
node 6.11.1
OS Windows

--no-cache does not help and jest-transform-cache is still being written. The only thing that helps is --runInBand, which is hardly acceptable for large projects.

Anything we can do to help diagnose the issue? Should I create a repro case?

Is this error critical? Can it be treated as a warning rather than taking down the whole test suite? Is there a way to back off and retry?

@SimenB
Copy link
Member

SimenB commented Oct 7, 2017

Having a small repro would be great

@asapach
Copy link
Contributor

asapach commented Oct 7, 2017

Here's the repro: https://github.com/asapach/jest-cache-issue/
It effectively runs lodash-es through babel-jest to populate the transform cache.
This fails for me 80% of the time on two different machines (Win8.1 and Win10).
If you remove --no-cache it fails 100% of the time. Adding --runInBand brings it down to 0%.

(Out of curiosity tried running it in WSL on Win10 and the issue is not reproducible using Posix API)

@SimenB
Copy link
Member

SimenB commented Oct 8, 2017

Is this just happening on Windows? I don't have access to windows machines beyond virtual machines, so not the easiest to debug for me...

@jeanlauliac you added write-file-atomic in #4088, would you be able to help out?

@vkrol
Copy link
Contributor

vkrol commented Oct 8, 2017

This problem is very similar to npm/write-file-atomic#10 and npm/write-file-atomic#22.

@asapach
Copy link
Contributor

asapach commented Oct 8, 2017

Just ran a procmon trace, here's an example of the issue:

Time of Day Process Name PID Operation Path Result Detail
16:54:43.2304011 node.exe 7624 SetRenameInformationFile ...\constant_ee286bbcf367680ce61a90e506522f92.82986661 SUCCESS ReplaceIfExists: True, FileName: ...\constant_ee286bbcf367680ce61a90e506522f92
16:54:43.2305499 node.exe 8208 SetRenameInformationFile ...\constant_ee286bbcf367680ce61a90e506522f92.103872574 ACCESS DENIED ReplaceIfExists: True, FileName: ...\constant_ee286bbcf367680ce61a90e506522f92

As you can see two processes are trying rename the same file within 1ms of each other and the second one fails.

I think npm/write-file-atomic#22 addresses the async version of writeFile(), but writeFileSync() is still affected.

@SimenB
Copy link
Member

SimenB commented Oct 8, 2017

Would it be possible to create a repro showing that just using write-file-atomic in worker-farm against the same file fails somehow? Would be great to open an issue against that project, as I think that's where the fix should be.

Or if you could write a test within jest that shows the same error (we have appveyor CI) that could be a start as well?

I'm not even sure what behavior we want in case of this error. Retry the write? Rerun the test? The whole test file?

@asapach
Copy link
Contributor

asapach commented Oct 8, 2017

OK, I'll try to create another repro. Not sure it's possible to create a jest test, because it would require spawning multiple processes, disabling/cleaning the cache and keep running until it fails.

I'm not even sure what behavior we want in case of this error.

Well, firstly the issue should not even happen when --no-cache is on, since the cache should not be populated.
Secondly, I'm not sure it's possible to retry the sync operation properly - is it possible to use writeFile() instead of writeFileSync()? That way write-file-atomic should retry automatically (I'll create a test to confirm).

@SimenB
Copy link
Member

SimenB commented Oct 8, 2017

Well, firstly the issue should not even happen when --no-cache is on, since the cache should not be populated.

That's a good point, and should be fixed separately. That way --no-cache can at least be a workaround.

Secondly, I'm not sure it's possible to retry the sync operation properly - is it possible to use writeFile() instead of writeFileSync()?

@cpojer thoughts on making it not be sync? Not sure how that scales. Or if you have another idea on how to fix this

@cpojer
Copy link
Member

cpojer commented Oct 8, 2017

  • --no-cache is more like --reset-cache actually. It means it won't use the existing cache, but it will still write cache. I'd like to retain that.
  • These operations have to be sync, because they happen during require calls in user code, so we can't change that.

@asapach
Copy link
Contributor

asapach commented Oct 8, 2017

Here's the other repro with worker-farm and write-file-atomic: https://github.com/asapach/write-atomic-issue

Findings so far: the sync version fails as expected, but surprisingly the async version fails as well. This means that they probably implement a retry queue only when it runs in the same process, which doesn't help in our case.

@SimenB
Copy link
Member

SimenB commented Oct 8, 2017

I'd like to retain that.

New flag? It's a highly misleading name. And on e.g. CI you rarely want the cache anyways, so it's just wasted resources. Or is a cache generated within a single test run used during --no-cache, and it only ignores existent caches?

Here's the other repro with worker-farm and write-file-atomic

Awesome! Could you open up an issue against write-file-atomic? It feels like a fix should go there, and if not (they don't want to support multiple processes writing at once) we can revisit on our side. WDYT?

@jwbay
Copy link
Contributor

jwbay commented Oct 8, 2017

A patch I tried locally that seemed to work is ignoring the error if it comes from trying to rename to a file with the same content. Since it just means another process 'won' the race, we can ignore it and move on.

const cacheWriteErrorSafeToIgnore = (
  e: Error,
  cachePath: Path,
  fileData: string,
) => {
  if (
    e.message &&
    e.message.indexOf('EPERM: operation not permitted, rename') > -1
  ) {
    try {
      const currentContent = fs.readFileSync(cachePath, 'utf8');
      return fileData === currentContent;
    } catch (e) {
    }
  }
  return false;
};

@asapach
Copy link
Contributor

asapach commented Oct 8, 2017

@SimenB, sure, I'll file an issue.

@cpojer, can this error be swallowed/ignored and treated as a warning? It implies that the file has already been written and no data should be lost.

@asapach
Copy link
Contributor

asapach commented Oct 8, 2017

Upstream issue: npm/write-file-atomic#28

@jeanlauliac
Copy link
Contributor

I think this means "rename" is not an atomic operation on Windows, so it breaks the assumption made by write-file-atomic. Unless there's a flag that could be enabled at the Windows API level, this could mean it's impossible to have atomic writes/renames on Windows altogether.

@jwbay your solution looks reasonable to me! 👍 Instead of using indexOf however, I'd use e.code === 'EPERM' (more robust, doesn't depend on specific message). I don't think we should read the file again to check the value, because this could introduce additional concurrency issues (ex. if the file is being written by yet another process at the same time). Would you mind sending a PR, please?

@davidjgoss
Copy link
Contributor

I was about to start work on a PR for write-file-atomic along the lines of "if we're asked to write a file sync but it's already in the queue to be written async, bail out" (maybe with an option to switch the behaviour on). But if we're happy to handle this at the Jest level, I won't hurry. cc @jeanlauliac

@jeanlauliac
Copy link
Contributor

jeanlauliac commented Oct 9, 2017

I was about to start work on a PR for write-file-atomic along the lines of "if we're asked to write a file sync but it's already in the queue to be written async, bail out" (maybe with an option to switch the behaviour on).

I think adding this logic (local queue) wouldn't fix the issue, because it happens mostly when different processes try to write to (rename to) the same file.

To fix concurrency issues once and for all, we may have to reconsider how we do caching, for example have a single process that access the cache, with which we communicate over some kind of IPC. Existing key/value store systems may be handy, such as memcached.

@davidjgoss
Copy link
Contributor

I think adding this logic (local queue) wouldn't fix the issue, because it happens mostly when different processes try to write to (rename to) the same file.

Ah, I perhaps misunderstood the issue then. The way I read it, the library already has a queuing mechanism that works nicely for the async requests, but if you mix in sync requests as well you can get collisions.

@mekwall
Copy link

mekwall commented Oct 10, 2017

My above referenced pull request should solve this issue. At least it did it for me!

@chekrd
Copy link

chekrd commented Feb 15, 2023

@fluffynuts Thanks :) I have applied the beta version to our project.
Beta patch output seems promising:
image

But it fails:

2023-02-15T14:34:35.4219806Z stderr:   ● Test suite failed to run
2023-02-15T14:34:35.4220467Z     jest: failed to cache transform results in: C:/agent/_work/17/s/node_modules/.cache/jest/jest-transform-cache-f8308df0ef9e0e511375dd771d3d65ee-abc22ad5e77a7416ef9c18e131733e53/23/PercentSign_2304649ae86e72b79ce79325a2749608
2023-02-15T14:34:35.4222771Z     Failure message: EPERM: operation not permitted, rename 'C:\agent\_work\17\s\node_modules\.cache\jest\jest-transform-cache-f8308df0ef9e0e511375dd771d3d65ee-abc22ad5e77a7416ef9c18e131733e53\23\PercentSign_2304649ae86e72b79ce79325a2749608.3880454289' -> 'C:\agent\_work\17\s\node_modules\.cache\jest\jest-transform-cache-f8308df0ef9e0e511375dd771d3d65ee-abc22ad5e77a7416ef9c18e131733e53\23\PercentSign_2304649ae86e72b79ce79325a2749608'
2023-02-15T14:34:35.4224038Z       327 | import { default as Pc } from './Pc';
2023-02-15T14:34:35.4224354Z       328 | import { default as Pda } from './Pda';
2023-02-15T14:34:35.4224712Z     > 329 | import { default as PercentSign } from './PercentSign';
2023-02-15T14:34:35.4224990Z           | ^
2023-02-15T14:34:35.4225266Z       330 | import { default as Perfume } from './Perfume';
2023-02-15T14:34:35.4225645Z       331 | import { default as PermissionList } from './PermissionList';
2023-02-15T14:34:35.4226067Z       332 | import { default as Personalisation } from './Personalisation';
2023-02-15T14:34:35.4226551Z       at writeFileSync (node_modules/jest-runner/node_modules/write-file-atomic/lib/index.js:252:8)
2023-02-15T14:34:35.4227096Z       at Object.<anonymous> (packages/component-library/components/Icons/components/index.ts:329:1)
2023-02-15T14:34:35.4227637Z       at Object.<anonymous> (packages/component-library/components/Icons/AnimatedProgressIcon.tsx:6:1)
2023-02-15T14:34:35.4228161Z       at Object.<anonymous> (packages/component-library/components/Icons/index.ts:2:1)
2023-02-15T14:34:35.4228626Z       at Object.<anonymous> (packages/component-library/components/index.ts:3:1)
2023-02-15T14:34:35.4229068Z       at Object.<anonymous> (packages/component-library/index.ts:1:1)
2023-02-15T14:34:35.4229691Z       at Object.<anonymous> (client/app/applications/itemEditor/features/ContentItemEditing/components/elements/subComponents/limitInfoMessages/ItemLimitStatusMessage.tsx:1:1)
2023-02-15T14:34:35.4230422Z       at Object.<anonymous> (client/app/applications/richText/utils/internalApi/content/general/editorTextUtils.ts:15:1)
2023-02-15T14:34:35.4231038Z       at Object.<anonymous> (client/app/applications/richText/utils/internalApi/content/general/editorContentUtils.ts:88:1)
2023-02-15T14:34:35.4231660Z       at Object.<anonymous> (client/app/applications/richText/utils/internalApi/content/images/editorImageUtils.ts:26:1)
2023-02-15T14:34:35.4232207Z       at Object.<anonymous> (test/unit/richText/editorApiTestData.ts:11:1)
2023-02-15T14:34:35.4232747Z node:internal/process/promises:289
2023-02-15T14:34:35.4233090Z             triggerUncaughtException(err, true /* fromPromise */);
2023-02-15T14:34:35.4233370Z             ^
2023-02-15T14:34:35.4233453Z 
2023-02-15T14:34:35.4234050Z [UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "1".] {
2023-02-15T14:34:35.4235011Z   code: 'ERR_UNHANDLED_REJECTION'
2023-02-15T14:34:35.4235241Z }
2023-02-15T14:34:35.4235320Z 
2023-02-15T14:34:35.4235523Z Node.js v19.2.0

By the way, the --disable-write-cache flag has no effect. It's flaky and it took a while to see no difference with and without the flag.

@fluffynuts
Copy link

so I can try to repro, please provide the exact version of jest you're using?

@fluffynuts
Copy link

alternatively, if your project is opensource, please link me to a branch causing this, or create a minimal repro-repo - the more info I have, the more likely I am to squash it. We're not seeing this error on our Jenkins CI (windows) or any of our windows dev machines, so right now, I can't repro.

@rdeangelis83
Copy link

@fluffynuts strange the ScriptTransformer contains a cacheWriteErrorSafeToIgnore function that should fix this issue:

const writeCacheFile = (cachePath, fileData) => {
  try {
    (0, _writeFileAtomic().sync)(cachePath, fileData, {
      encoding: 'utf8',
      fsync: false
    });
  } catch (e) {
    if (cacheWriteErrorSafeToIgnore(e, cachePath)) {
      return;
    }

    e.message =
      'jest: failed to cache transform results in: ' +
      cachePath +
      '\nFailure message: ' +
      e.message;
    removeFile(cachePath);
    throw e;
  }
};
const cacheWriteErrorSafeToIgnore = (e, cachePath) =>
  process.platform === 'win32' &&
  e.code === 'EPERM' &&
  fs().existsSync(cachePath); 

@rdeangelis83
Copy link

Okay I made some progress: the problem seems to be the fs().existsSync(cachePath) check.
In some cases, under windows if you have no access to the file, also fs().existsSync(cachePath) will return false!

@lukeapage
Copy link
Contributor

The problem is half fixed in master - it is missing #11104

The half fix that is merged is
#11423

@rdeangelis83
Copy link

The problem is half fixed in master - it is missing #11104

The half fix that is merged is

#11423

Okay somehow I missed your PR. But exactly the same change I made in my PR 😊 So I can close my PR.

In nodejs/node#17921 (comment) it will be mentioned that fs.exists will return false on EPERM error so it is clear that the current code can not work.

But why has you PR not been merged?

@chekrd
Copy link

chekrd commented Feb 22, 2023

so I can try to repro, please provide the exact version of jest you're using?

We are always on the latest released version (29.4.3 at the time).
[email protected], [email protected].
Using npx -y patch-jest-cache@beta.

alternatively, if your project is opensource, please link me to a branch causing this, or create a minimal repro-repo - the more info I have, the more likely I am to squash it. We're not seeing this error on our Jenkins CI (windows) or any of our windows dev machines, so right now, I can't repro.

The project is not open source, I am sorry. We have over 5k Jest tests and large codebase. And I guess it is related to the huge amount of transpiled modules.

@acherkashin
Copy link

We are using forked CRA v5.0.1 that uses Jest v27.4.3.

Is the issue fixed in latest jest, and we should try to upgrade it?
Reading this long discussion, It seems, it is easier to try using WSL.

We are using

  • Windows 10
  • Jenkins
  • Node version - 16.15.1
  • pnpm version - 7.18.1
  • Jest version - 27.4.3

Issue happen pretty often.

@AngusMacrae
Copy link

I'm getting this issue too. Only a few hundred tests in my codebase.

mmichaelis added a commit to CoreMedia/ckeditor-plugins that referenced this issue Oct 26, 2023
Running Jest tests on windows may trigger some error like:

    EPERM: operation not permitted

This related to issue: jestjs/jest#4444

As one possible workaround, you may run tests with --runInBand.
An artificial script named `jest:windows:workaround:issue:4444` has
been added for this.
@fluffynuts
Copy link

fluffynuts commented Oct 26, 2023

@AngusMacrae have you tried my hack-around?

npx -y patch-jest-cache

obviously, run after npm i, otherwise the patch could be chucked away - it literally patches the two functions in question to be more tolerant of errors - a cache miss due to access errors should just mean that the cache item is regenerated (ie, lose some time, not die catastrophically, because a cache should never be the keystone that brings your process down when it's not available - cache should be an optimistic attempt to speed things up)

@acherkashin
Copy link

We started using Linux agent for Jenkins. It fixed the issue.

@fluffynuts
Copy link

We started using Linux agent for Jenkins. It fixed the issue.

well, it works around the issue because linux can do COW for files in use, but yes - this is why the issue has remained unfixed in Jest: I think most people are using linux or osx so they just don't care :|

@acherkashin
Copy link

We started using Linux agent for Jenkins. It fixed the issue.

well, it works around the issue because linux can do COW for files in use, but yes - this is why the issue has remained unfixed in Jest: I think most people are using linux or osx so they just don't care :|

Yes, it is a pity that nobody cares about this issue.

@lukeapage
Copy link
Contributor

@fluffynuts seems jest v30 will break jest-patch-cache

https://www.npmjs.com/package/@jest/transform/v/30.0.0-alpha.1?activeTab=code

Because now all the files are bundled into one index file

@fluffynuts
Copy link

@fluffynuts seems jest v30 will break jest-patch-cache

https://www.npmjs.com/package/@jest/transform/v/30.0.0-alpha.1?activeTab=code

Because now all the files are bundled into one index file

hooray! :D PRs welcome; I don't really have the time to look at this rn and we haven't upgraded to that yet, so I guess I can put it off for a bit :D

@lukeapage
Copy link
Contributor

My focus is going to be on getting this PR merged: #11104

which I believe will solve the issue for good and retire your tool.

as per previous comments the other change which was merged was: #11423

@fluffynuts
Copy link

My focus is going to be on getting this PR merged: #11104

which I believe will solve the issue for good and retire your tool.

as per previous comments the other change which was merged was: #11423

Good luck 🙂 I'd like nothing better than for my hack to be retired.

@sschneider-ihre-pvs
Copy link

Nothing is more permanent than a temporary solution.

@arthow4n
Copy link

Upgrading write-file-atomic to 6.0.0 seems to solve issue for me, had to use npm overrides (yarn resolutions) to force the installed package version though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet