Skip to content
This repository has been archived by the owner on Aug 2, 2020. It is now read-only.

Successful build error #206

Closed
snowleopard opened this issue Feb 10, 2016 · 55 comments
Closed

Successful build error #206

snowleopard opened this issue Feb 10, 2016 · 55 comments
Assignees
Milestone

Comments

@snowleopard
Copy link
Owner

After a build is successful we sometimes see a mysterious error:

shakeArgsWith                         0.001s    0%                           
Function shake                        0.075s    0%                           
Database read                         0.362s    0%                           
With database                         0.018s    0%                           
Running rules                      1789.519s   99%  =========================
Pool finished (18347 threads, 4 max)  0.000s    0%                           
Lint checking                         0.608s    0%                           
Total                              1790.584s  100%                           
build: .build/stage0/utils/genprimopcode/package-data.mk: openFile: does not exist (No such file or directory)

or

shakeArgsWith                        0.001s    0%
Function shake                       0.056s    0%
Database read                        0.855s    0%
With database                        0.028s    0%
Running rules                      362.496s   99%  =========================
Pool finished (7433 threads, 4 max)  0.000s    0%
Lint checking                        0.254s    0%
Total                              363.690s  100%
build: .build/stage1/gmp/gmp-lib-names: openFile: does not exist (No such file or directory)

I can see that the file does actually exist.

The current conjecture is that this happens during Shake's lint checking phase.

If I run a rebuild, nothing gets rebuilt, which confirms that the build was actually successful.

@snowleopard snowleopard added this to the tree-tremble milestone Feb 10, 2016
@ndmitchell
Copy link
Collaborator

It's very weird. If we had a little more call stack that would help - building with GHC 8.0 might give that to us. I've also seen this "error" some times - next time it happens can you immediately do echo $? and check that is really does exit with non-zero (e.g. failure).

@ndmitchell
Copy link
Collaborator

@snowleopard, you said something about opening this file, writing it, opening it again and modifying it? Can you point me at the code that does that? I can only assume that's at least tangentially related to the issue.

@snowleopard
Copy link
Owner Author

Sure, here is the line that does post-processing of package-data.mk files: https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Rules/Data.hs#L54.

Note, the second example above does not involve rewriting a file, however, what they both have in common is writeFileChanged: https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Rules/Gmp.hs#L101.

@ndmitchell
Copy link
Collaborator

On https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Rules/Data.hs#L138 I see:

Reason: Shake's built-in makefile parser doesn't recognise slashes

Can you give a bit more background on that? As far as I am aware, I support slashes in all the normal places. If there's a difference from GNU Make at parsing the dependency patterns I'd like to fix it.

@snowleopard
Copy link
Owner Author

Sure, here are some problematic lines:

libraries/array_dist-install_VERSION = 0.5.1.0
...
libraries/array_dist-install_CMM_SRCS  := $(addprefix cbits/,$(notdir $(wildcard libraries/array/cbits/*.cmm)))
...
$(eval $(libraries/array_PACKAGE_MAGIC))

Note the slashes in names of the variables, and some dollars here and there. I don't think you'll be able to do anything about dollars (apart from ignoring such lines instead of failing), but fixing the slash issue shouldn't be difficult.

@ndmitchell
Copy link
Collaborator

Do they both have writeFileChanged in common? I see fixFile in one instance. For fixFile I recommend using readFile' from the extra library - simply doing a readFile and seq (length ...) is not enough - it ensures the file handle has been marked "closable", but doesn't actually close it.

I've just reviewed writeFileChanged, and it looks safe to me...

@ndmitchell
Copy link
Collaborator

@snowleopard, ah, those are variable definitions, which is what Shake doesn't really like - the makefile parser is designed to slurp things from ghc -M and gcc -M, not variable definitions or macros.

@ndmitchell
Copy link
Collaborator

In fact, I think the optimal form is:

new <- withFile ReadMode file $ \h -> do
    new <- f <$> hGetContents h
    evaluate $ rnf new
    return new
writeFile file new

This ensures that f can be applied over a lazy string, but guarantees that file has been shut before writeFile. Sprinkle with liftIO to season.

@snowleopard
Copy link
Owner Author

Thanks, I'll fix fixFile using your suggestion.

Do they both have writeFileChanged in common?

Looks like I'm mistaken. I think I can't use writeFileChanged here, because that would need the source file leading to a cyclic dependency.

@ndmitchell
Copy link
Collaborator

Yes, not saying you should use writeFileChanged - having a separate fixFile function is perfectly reasonable. A little interesting that two similar functions have similar broken issues. I know GHC 7.6 and below had a race condition on file modification. I wonder if we could be tripping another similar issue in a similar place? Or maybe we both made some mistake? Anyway, something going on here.

@snowleopard
Copy link
Owner Author

Have a look at this line: https://github.com/snowleopard/shaking-up-ghc/blob/master/src/Settings/Builders/Ghc.hs#L31.

This is where I read the file created by that call to writeFileChanged. Anything wrong here? Apart from the fact that this should actually live inside an oracle.

@ndmitchell
Copy link
Collaborator

Nothing obvious... Do you use readFileLines before writing it? That might be a potential problem.

@snowleopard
Copy link
Owner Author

Well, readFileLines will need the file and it will be created by that rule with writeFileChanged. So should fine.

snowleopard added a commit that referenced this issue Feb 10, 2016
@snowleopard
Copy link
Owner Author

I realised that file gmp-lib-names is not needed at all. We can easily get what we used to keep there from getSetting GmpLibDir instead integer-gmp.buildinfo directly.

With this file gone, I suggest to close this issue.

@ndmitchell
Copy link
Collaborator

I note we've had the issue occur with a few different files, and your patch only removes one. I have a suspicion this one might recur - but we can always open another when it does.

@snowleopard
Copy link
Owner Author

@ndmitchell There are actually two patches:

  • 27317cf fixes the problem with package-data.mk files.
  • The subsequent commits remove gmp-lib-names.

So, I hope we won't come across the instances reported at the top again. But I agree, we might come across others in which case we can just reopen this issue.

@ndmitchell ndmitchell reopened this Feb 16, 2016
@snowleopard
Copy link
Owner Author

@ndmitchell Were you hit by this again? Which file this time?

@ndmitchell
Copy link
Collaborator

I just got:

shakeArgsWith                         0.000s    0%
Function shake                        0.294s    0%
Database read                         0.438s    0%
With database                         0.020s    0%
Running rules                       444.763s   99%  ========================
Pool finished (10984 threads, 8 max)  0.000s    0%
Lint checking                         0.206s    0%
Total                               445.722s  100%
build: .build/stage1/libraries/pretty/package-data.mk: openFile: does not exist (No such file or directory)

Neil@Neil MSYS /c/Neil/ghc/conf3/ghc
$ echo $?
1

So it remains an error. Another build causes more to build, so it seems it didn't finish. So I wonder what's going on... Maybe lint checking is failing? Maybe something else? I'd love a stack trace... Do we still do lint checking even if there is a failure in the build (answer: no)? Does timing print regardless (answer:yes)? Does something change directory, which would result in the file not existing (at least the relative file)? Why did it do lint checking happen if the build didn't finish?

@snowleopard
Copy link
Owner Author

Ouch, this seems more serious. I always thought that Shake starts to print build stats and do lint checking only after a successful build, but it appears not to be the case?

By the way, as part of the work on #207 I accidentally broke the build yesterday, which I have hopefully just fixed by 5fcb480. Not sure if this could be related, but in any case your error is interesting to investigate.

@ndmitchell
Copy link
Collaborator

No, build stats are always printed (if you request that using the flags). They are really intended for debugging (although showing them always is quite happy).

It's possible your recent fix was the root cause, but I remain deeply skeptical about what's going on...

@snowleopard
Copy link
Owner Author

I actually meant that I thought Shake does link checking only after a build completes successfully.

However, we have Lint checking 0.206s 0% line here for an unsuccessful build.

It seems though that we are still hitting the same problem: lint is trying to open a file that hasn't yet been properly closed. In case of package-data.mk there is only one place to look: Rules/Data.hs.

@ndmitchell
Copy link
Collaborator

Shake only does lint checking after a successful build. When I reran the build built more. Therefore it was not a complete build. I am just very confused...

And the file is not there. If it were merely open already that should be fine. But it is there. Again, confused...

@ndmitchell
Copy link
Collaborator

I've modified Shake with lots more try/catch around the relevant pieces, and logging saying its starting lint/finishing. I'm now running a loop which wipes/builds repeatedly until it hits an error. Hopefully that will flush out anything transient.

@snowleopard
Copy link
Owner Author

Hopefully this will help to solve the mystery 👍

@ndmitchell
Copy link
Collaborator

I can't reproduce it either. Lets assume it got fixed and reopen if it didn't.

@ndmitchell
Copy link
Collaborator

I keep getting:

Lint checking error - 2 values have changed since being depended upon:
  Key:  .build/stage2/utils/ghctags/package-data.mk
  Old:  File {mod=0x392EFD51,size=0x1C54,digest=0xA05C805}
  New:  File {mod=0xED047EDF,size=0x1C54,digest=0xA05C805}

  Key:  .build/stage2/utils/mkUserGuidePart/package-data.mk
  Old:  File {mod=0x3938CBD2,size=0x2055,digest=0xACAB3C21}
  New:  File {mod=0xED04CD10,size=0x2055,digest=0xACAB3C21}

Which looks like a Shake bug to me, and unrelated to this issue - I'll fix that.

@snowleopard
Copy link
Owner Author

@ndmitchell Can you clarify what's going on? I'm still lost.

@ndmitchell
Copy link
Collaborator

@snowleopard - not really sure. Are you getting random lint failures? I suddenly am this morning. I even wonder if it's my computer reporting dodgy modtimes at various times (it sometimes happens).

@snowleopard
Copy link
Owner Author

@ndmitchell I've never seen random lint errors as far a I remember. Well, apart from the strange behaviour we are investigating in this issue, but it's not yet certain that lint is to blame here.

Are you using the same version of Shake as I (the released one)?

@ndmitchell
Copy link
Collaborator

I'm using HEAD. But it worked fine a few weeks ago, and I don't think anything relevant has changed, so I'm still confused.

@snowleopard
Copy link
Owner Author

Perhaps, to provide another point of reference, you could follow the steps I listed in #211 (comment) on your VM installation?

Let me copy the steps here for convenience:

  • I run a successful build with ./build.bat -j4 --flavour=quick.
  • Opened system.config and set make =.
  • Restarted ./build.bat -j4 --flavour=quick.
  • This led to the following error, as intended:
Builder 'make' is not specified in system.config file. Cannot proceed without it.
  • Opened system.config and set make = make, as before.
  • Restarted ./build.bat -j4 --flavour=quick.
  • The build finished and reported the elusive error (echo $? said 1):
build: .build/stage2/utils/ghctags/package-data.mk: openFile: does not exist (No such file or directory)

I've just repeated the steps on a different Windows machine and the error is still reproducible, albeit with a different file:

build: .build/stage2/utils/mkUserGuidePart/package-data.mk: openFile: does not exist (No such file or directory)

So, things seem to be converging on ghctags and mkUserGuidePart, and I can now robustly reproduce the error. The only difference between my and your environments seems to be the version of Shake.

@ndmitchell
Copy link
Collaborator

And I'm getting a lint error exactly where you are getting an openFile error. Maybe this is really a lint error that comes out wrong? Or the lint is in some way important?

@ndmitchell ndmitchell reopened this Mar 4, 2016
@ndmitchell ndmitchell self-assigned this Mar 4, 2016
@ndmitchell
Copy link
Collaborator

OK, I'll debug this tonight, but sounds like we've caught something.

@snowleopard
Copy link
Owner Author

Thanks! We might want to re-examine the fixFile implementation:

-- Transform a given file by applying a function to its contents
fixFile :: FilePath -> (String -> String) -> Action ()
fixFile file f = do
    putBuild $ "| Fix " ++ file
    contents <- liftIO $ IO.withFile file IO.ReadMode $ \h -> do
        old <- IO.hGetContents h
        let new = f old
        IO.evaluate $ rnf new
        return new
    liftIO $ writeFile file contents

Is there a chance it could still let a rule (which called it) finish before the file is written to and closed? This is the only sane explanation of the above errors I can come up with, because it can cause both lint errors (rule finished, modtime/value locked, but the mod time is changed again when the file is closed), and openFile errors (lint trying to open the file for inspection, but it is still not closed).

@ndmitchell
Copy link
Collaborator

No, fixFile looks correct, it's pretty simple. I'll investigate it though...

@ndmitchell
Copy link
Collaborator

I note that you are still building utils/ghctags/stage2/package-data.mk - I thought you pushed all build products into .build? Could this have something to do with the upsetness?

@snowleopard
Copy link
Owner Author

Build products produced by ghc-cabal can't be moved unfortunately -- they are put into a fixed location in the package directory. I copy these files to .build, where the rest of the action happens, with hope that eventually we will be able to get rid of ghc-cabal and in-tree build artefacts altogether #18, #113.

The relevant lines are:

-- TODO: get rid of this, see #113
liftIO $ IO.copyFile inTreeMk dataFile
[...]
postProcessPackageData context dataFile

As far as I can see, this shouldn't be relevant to this issue, but I may be wrong...

@ndmitchell
Copy link
Collaborator

I'm sure it's relevant, I just don't know how :) - best theory is Shake gets confused with also files and lint checking, but still not sure just how yet.

@snowleopard
Copy link
Owner Author

Ah, could it be because two logically different rules are mixed in one &%> rule? I could easily split the rule in two, which could also make it more readable.

@ndmitchell
Copy link
Collaborator

It does seem that there are two separate processes which generate the separate files, so splitting it seems the simpler way to go. But I'd still like to track down this bug separately, since I suspect it is in Shake itself.

@ndmitchell
Copy link
Collaborator

I found the bug :) - ndmitchell/shake#427

My guess is that in the last release it comes up with the error we saw, and in the latest release I fixed something (absolutely no idea what!) so it now gives the error I see. Perhaps just disable lint checking until we've got this squashed?

@ndmitchell
Copy link
Collaborator

I've added a new bug-upstream label, to indicate things that are bugs in the build system, but that are the fault of something upstream (typically Shake) - if that tagging doesn't suit you just delete it.

@snowleopard
Copy link
Owner Author

Cool! Thanks for hunting the bug down :-) The new label is fine!

Perhaps just disable lint checking until we've got this squashed?

I'm very rarely affected by this, and CI never fails because of this. Lint checking on the other hand is very useful, so I think we better keep things as they are. The important thing is that we now know it's not a bug in the build system.

Do I understand correctly that if I split the rule in two (which seems to be a reasonable thing to do anyway) we will no longer be hit by the bug?

@ndmitchell
Copy link
Collaborator

Correct, splitting will fix the issue here - although it could equally happen in other also-file rules.

It sounds like splitting is also semantically the right thing to do anyway.

@snowleopard
Copy link
Owner Author

OK, I'll do the splitting and will close this issue.

snowleopard added a commit that referenced this issue Mar 5, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants