
Decouple error reporting from Stream #388

Closed
banacorn opened this issue Nov 26, 2019 · 11 comments

@banacorn

Many of the methods of Stream, like reachOffset and showTokens, are related to printing pretty error messages.

But implementing these methods for custom token streams is a pain.

Perhaps we should make error reporting a typeclass of its own:

class Stream s => ReportError s where

That way, people who have their own way of reporting errors wouldn't have to implement those methods.
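
Concretely, something like this (a rough sketch; the split is hypothetical and the signatures are modeled on the existing Stream methods, not exact):

import Data.List.NonEmpty (NonEmpty)
import Data.Proxy (Proxy)
import Text.Megaparsec (PosState, Stream (Token))

-- Hypothetical split: Stream keeps the consumption machinery, while
-- the pretty-printing methods live in a class of their own, so streams
-- that never render errors can skip them.
class Stream s => ReportError s where
  showTokens  :: Proxy s -> NonEmpty (Token s) -> String
  reachOffset :: Int -> PosState s -> (String, PosState s)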

@mrkkrp
Owner

mrkkrp commented Nov 27, 2019

It is like this because I wanted to calculate the source position and fetch the line to display in a single pass. I can imagine why you would want this, though.

Can you describe what exactly seems to be problematic about implementing reachOffset and showTokens in your case?

Also here is an example (granted, it's a very simple one): https://markkarpov.com/tutorial/megaparsec.html#working-with-custom-input-streams

@banacorn
Author

banacorn commented Nov 28, 2019

I'm working on a parser that consumes custom token streams.
The tokens vary in "width" (like #370), and some of them may even span several lines.
This complicates the calculation of source positions and of the offending line.

But the token streams I'm working with also happen to be equipped with source positions:

-- | Run a lexer on a string and produce a lazy stream of tokens
runLexer
  :: forall tok.
     Lexer tok -- ^ lexer specification
  -> String -- ^ source file name (used in locations)
  -> String -- ^ source text
  -> TokenStream (L tok)

(here runLexer is from lexer-applicative, and L is from srcloc, which wraps a value together with its source location)

In my case, I don't really need reachOffset or showTokens.
But it feels wrong not to implement these requirements, so I implemented them anyway.
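
To illustrate: since every token already carries its location, position lookup can be read straight off the stream, without the reachOffset machinery. A minimal sketch (toSourcePos and posAt are hypothetical helpers, not class methods; it assumes the lexer never produces NoLoc):

import qualified Text.Megaparsec as Mega
import Data.Loc (L (..), Loc (..), Pos (..))
import Language.Lexer.Applicative (TokenStream (..))

-- Convert a srcloc position to a megaparsec SourcePos.
toSourcePos :: Pos -> Mega.SourcePos
toSourcePos (Pos file line col _) =
  Mega.SourcePos file (Mega.mkPos line) (Mega.mkPos col)

-- Position of the token at a given offset, read off the stream itself.
posAt :: Int -> TokenStream (L tok) -> Maybe Mega.SourcePos
posAt n (TsToken (L (Loc start _) _) rest)
  | n <= 0    = Just (toSourcePos start)
  | otherwise = posAt (n - 1) rest
posAt _ _ = Nothing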


@ag-eitilt

I'm having the same issue; I've written a lexer with a lot of error recovery, and while the output from that stage is usually not changed much from the input, there are still a lot of long, complex sequences compressed into a single token, a few tokens inserted from nothing, and a lot of line breaks removed. With all that, even if "line" still had any meaning, I'd really rather just use the stage-1 tokens themselves in warning messages for the next parser pass (populating a data model), even if it means writing the printer myself.

All of that is underscored even more strongly by the fact that I'm (ab)using the parsers in such a way that they can't fail at the outermost level -- I'm basically required to write a lot of code and carry a lot of data around in order to generate strings I'll never use. I get its utility in the general case, but here I wind up fighting against something that's not even part of the core purpose of Megaparsec (i.e. parser combinators).

@banacorn
Author

Confession: I just read the lines I need to display directly from the file.
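
(Roughly this hypothetical helper, bypassing reachOffset entirely:)

-- Hypothetical workaround: fetch the offending line straight from the
-- source file, given its path and a 1-based line number.
readOffendingLine :: FilePath -> Int -> IO (Maybe String)
readOffendingLine path n = do
  contents <- readFile path
  let ls = lines contents
  pure $ if n >= 1 && n <= length ls
           then Just (ls !! (n - 1))
           else Nothing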

@hesiod
Contributor

hesiod commented Jan 22, 2020

In fall 2019 I tried to overhaul the Stream interface to solve this issue as well as #345/#295. I couldn't find a "perfect" solution, since there's always a tradeoff between visible API changes and internal overhauls. But since the branches are only going to diverge further, I might as well publish my experimental branch in the next few days.
Essentially, from my point of view, these are the limitations of the current Stream design:

  • The documentation for the Stream typeclass is a little lacking. The reachOffset machinery in particular is somewhat confusing at first.
  • The Stream typeclass assumes that your token stream type is chunky, i.e. that an isomorphism between a chunk and a list of your tokens makes sense. This is not necessarily the case, especially if you're using a complex, fancy token type for which no natural chunk type exists. Without going into much detail, this can actually be solved relatively simply by moving tokenToChunk, tokensToChunk, chunkToTokens, chunkLength, chunkEmpty, takeN_, and takeWhile_ into a new ChunkedStream typeclass (sketched after this list). The implications in the rest of the megaparsec codebase are surprisingly small, since the chunk functions are mostly used by the lexers.
  • You can similarly extract showTokens and tokensLength into a new typeclass without any intrusive changes.
  • The Stream typeclass assumes that you want to implement stream location lookup using the PosState machinery. It essentially imposes a certain style of location lookup that is not necessarily ideal for every kind of stream. However, it is not really possible to give the instance a lot of freedom in how it implements this part without sacrificing performance. The most useful change would be to change the reachOffset family not to explicitly reference PosState in the class definition. Doing this requires some more intrusive changes, however, since...
  • ...megaparsec currently assumes that the stream type (more precisely, the s in Stream s) is precisely equal to the underlying string-like type. It is currently not possible to include related data in the stream type, since the codebase relies on that assumption in a number of places. I've had some success trying to remove this restriction in my experimental branch, but this requires much more intrusive changes compared to the ChunkedStream/showTokens modifications mentioned above.
  • Lastly, the error reporting is a little bit rigid in that it couples two steps, namely 1) the resolution of abstract, unresolved locations (simple stream offsets) to concrete, resolved file locations (line and column numbers) and 2) pretty-printing the ParseErrorBundle, into a single function errorBundlePretty. I have been quite successful in decoupling errorBundlePretty into two functions resolveBundle and describeBundle in my experimental branch. The changes would not really be user visible, since all parser invocations would still return an error bundle, just a resolved error bundle that contains concrete error locations.
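
To make the ChunkedStream idea above concrete, here is a sketch of the split. It shows the direction of my experimental branch; the names and exact signatures are illustrative, not final:

{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import Data.Proxy (Proxy)

-- Sketch only: a pared-down Stream that demands nothing chunk-shaped...
class Stream s where
  type Token s :: Type
  take1_ :: s -> Maybe (Token s, s)

-- ...and a subclass for streams that do have a natural chunk type.
-- Streams with fancy token types can simply not implement this.
class Stream s => ChunkedStream s where
  type Tokens s :: Type
  tokenToChunk  :: Proxy s -> Token s -> Tokens s
  tokensToChunk :: Proxy s -> [Token s] -> Tokens s
  chunkToTokens :: Proxy s -> Tokens s -> [Token s]
  chunkLength   :: Proxy s -> Tokens s -> Int
  chunkEmpty    :: Proxy s -> Tokens s -> Bool
  takeN_        :: Int -> s -> Maybe (Tokens s, s)
  takeWhile_    :: (Token s -> Bool) -> s -> (Tokens s, s)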

@mrkkrp
Owner

mrkkrp commented Jan 27, 2020

@hesiod Where can I see your changes?

@chrisdone

FWIW, I'm also facing the same challenge. I'm trying to port over my lexer and my parser from Parsec to Megaparsec. The lexer port was easy. The parser of the lexed tokens, not so much.

In Parsec I simply had to define one combinator that would extract a token from the stream, along with a string rendering and a position update for each token, so Parsec was able to show what and where in its error messages:

-- | Consume the given predicate from the token stream.
consumeToken :: (Token -> Maybe a) -> TokenParser (a, Location)
consumeToken f = do
  u <- getState
  tokenPrim
    tokenString
    tokenPosition
    (\(tok, loc) ->
       if locationStartColumn loc > u
         then fmap (, loc) (f tok)
         else Nothing)

-- | Make a string out of the token, for error message purposes.
tokenString :: (Token, Location) -> [Char]
tokenString = tokenStr . fst

-- | Update the position by the token.
tokenPosition :: SourcePos -> (Token, Location) -> t -> SourcePos
tokenPosition pos (_, l) _ =
  setSourceColumn (setSourceLine pos line) col
  where (line,col) = (locationStartLine l, locationStartColumn l)
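
A parser built on top of it then stays one-liner simple (IdentTok here is a hypothetical constructor of my Token type, for illustration):

-- Hypothetical usage: accept an identifier token, keeping its location.
identifier :: TokenParser (String, Location)
identifier = consumeToken $ \tok ->
  case tok of
    IdentTok name -> Just name
    _             -> Nothing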

As for megaparsec, I attempted to implement Stream, which was fairly easy on an older megaparsec in a different project:

instance Ord a => Mega.Stream (Seq (Located a)) where
  type Token (Seq (Located a)) = Located a
  type Tokens (Seq (Located a)) = Seq (Located a)
  tokenToChunk Proxy = pure
  tokensToChunk Proxy = Seq.fromList
  chunkToTokens Proxy = toList
  chunkLength Proxy = length
  chunkEmpty Proxy = null
  positionAt1 Proxy _ (Located start _ _) = start
  positionAtN Proxy pos Seq.Empty = pos
  positionAtN Proxy _ (Located start _ _ :<| _) = start
  advance1 Proxy _ _ (Located _ end _) = end
  advanceN Proxy _ pos Seq.Empty = pos
  advanceN Proxy _ _ ts =
    let Located _ end _ = last (toList ts) in end
  take1_ Seq.Empty = Nothing
  take1_ (t :<| ts) = Just (t, ts)
  takeN_ n s
    | n <= 0   = Just (mempty, s)
    | null s   = Nothing
    | otherwise = Just (Seq.splitAt n s)
  takeWhile_ = Seq.spanl

instance Mega.ShowToken (Located Token) where
  showTokens = unwords . map (showToken . locatedThing) . toList

Whereas today, the position/advance-style functions have been replaced with reachOffset. Implementing that function is a lot more work, for a few reasons:

  1. I took a look at the library's own implementations, which use the internally defined reachOffset'. I cloned the repo so I could use a local copy that exposes that function, but...
  2. ...it expects tokens to have an Ord instance, and since Located Token carries line/position info, no two tokens would ever compare equal (one workaround is sketched below).
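
The workaround I can think of for point 2 is to make the Eq/Ord instances ignore the location. A sketch, assuming a simple Located wrapper like the one in the instance above:

import Text.Megaparsec (SourcePos)

-- Sketch: compare only the payload, so two occurrences of the same
-- token at different positions still compare equal.
data Located a = Located SourcePos SourcePos a

locatedThing :: Located a -> a
locatedThing (Located _ _ a) = a

instance Eq a => Eq (Located a) where
  a == b = locatedThing a == locatedThing b

instance Ord a => Ord (Located a) where
  compare a b = compare (locatedThing a) (locatedThing b)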

At this point I figured I would come and see if there was any plan to make this easier in the issue tracker.

For now I'll stick with megaparsec-6.4.1, because I already have a working parser infrastructure with it and I've used up my budgeted time on this instead of writing out the parser for my language. I don't have a speed requirement whatsoever; I'm just interested in megaparsec for its ability to use custom data structures in error messages, so I might not be in your target userbase.

@mitchellwrosen

I just attempted to port https://github.com/unisonweb/unison's parser to the latest version of megaparsec, and had to back out after 90 minutes. I had the same issue as others in this thread: it's really hard to implement reachOffset if your Stream is something like [Token].

First I thrashed around a bit, trying to figure out how things are supposed to work by reading the haddocks and the source code. Then I read a couple of threads on this issue tracker and found an example in the tutorial, which shows how to do this by carrying around the original non-lexed source, but ultimately calculating the right offsets and such was just too confusing for me to push through tonight. Unison has some lexemes that don't take any space, like virtual semicolons, which may complicate things (but maybe not).

Anyway, my task is to bump Unison to GHC 8.8, and since it's a stack project the simplest route seemed to be bumping the LTS resolver. That's how I ended up with my eyes on megaparsec-8.0. But now my plan is to leave megaparsec at 6.5 and see if I can find a set of dependencies that work together, with help from the cabal constraint solver.

@1Computer1
Contributor

I've attempted this with #415. Let me know what you think!

@mrkkrp
Owner

mrkkrp commented Aug 9, 2020

Please also see my comment #415 (comment).

@mrkkrp
Owner

mrkkrp commented Sep 2, 2020

Resolved in #415.
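
For reference, the class split that landed there moves the error-reporting methods out of Stream into two subclasses, roughly as below (paraphrased, not the exact source); instances that never pretty-print errors can skip them:

import Data.List.NonEmpty (NonEmpty)
import Data.Proxy (Proxy)
import Text.Megaparsec (PosState, Stream (Token))

-- Only needed to render offending tokens in error messages.
class Stream s => VisualStream s where
  showTokens   :: Proxy s -> NonEmpty (Token s) -> String
  tokensLength :: Proxy s -> NonEmpty (Token s) -> Int

-- Only needed to resolve offsets to source positions and lines.
class Stream s => TraversableStream s where
  reachOffset       :: Int -> PosState s -> (Maybe String, PosState s)
  reachOffsetNoLine :: Int -> PosState s -> PosState s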

@mrkkrp mrkkrp closed this as completed Sep 2, 2020