Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iron out API discrepancies #289

Open
Bodigrim opened this issue Sep 27, 2020 · 14 comments
Open

Iron out API discrepancies #289

Bodigrim opened this issue Sep 27, 2020 · 14 comments

Comments

@Bodigrim
Copy link
Contributor

Bodigrim commented Sep 27, 2020

Missing for lazy bytestrings:

Missing both for strict and lazy bytestrings:

There are minor but annoying differences between API provided by Data.ByteString, Data.ByteString.Char8, Data.ByteString.Lazy, Data.ByteString.Lazy.Char8. Plus we are missing certain encoding-agnostic functions available for Text, such as compareLength, replace, splitOn, etc.

A good starting point (even if a bit outdated) is https://github.com/haskell-backpack/backpack-str/tree/master/str-sig#feature-matrix

@sjakobi
Copy link
Member

sjakobi commented Sep 28, 2020

There are minor but annoying differences between API provided by Data.ByteString, Data.ByteString.Char8, Data.ByteString.Lazy, Data.ByteString.Lazy.Char8.

I agree that it would be nice if these APIs were more consistent.

Plus we are missing certain encoding-agnostic functions available for Text, such as compareLength, replace, splitOn, etc.

compareLength makes some sense for lazy bytestrings, but not so much for strict ones IMHO.

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Oct 3, 2020

Missing for lazy bytestrings:

  • compareLength :: ByteString -> Int -> Ordering
  • foldr' and foldr1' (not that strict folds make tremendously much sense for lazy structures, but we already have foldl' and foldl1' in place)
  • scanl1, scanr and scanr1
  • unfoldrN
  • isInfixOf, breakSubstring (these may be non-trivial to do right)
  • takeEnd, dropEnd, takeWhileEnd, dropWhileEnd, spanEnd and breakEnd (but these require design discussion first about a desired degree of laziness).

Missing both for strict and lazy bytestrings:

  • concatReplicate :: Int -> ByteString -> ByteString (which is called replicate in text)
  • replace :: ByteString -> ByteString -> ByteString -> ByteString
  • splitOn :: ByteString -> ByteString -> [ByteString]

@sjakobi
Copy link
Member

sjakobi commented Oct 4, 2020

  • concatReplicate :: Int -> ByteString -> ByteString (which is called replicate in text)

I think users could simply use stimes. No need for an API extension.

For an earlier attempt at a lazy breakSubstring see #129.

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Oct 4, 2020

I think users could simply use stimes. No need for an API extension.

Good idea. In this case it boils down to defining an efficient stimes.

@kindaro
Copy link
Contributor

kindaro commented Feb 18, 2021

I would like to dibs this. (Together with all the subordinate issues.) It looks like a bunch of fun little exercises, and I think I am able to prototype all of them easily. Then there is the low level optimization part, and at that I may need some advice. Would this sort of help be wanted?

My first goal would be to make a draft pull request with all functions doing roughly the right thing, so that we may discuss small discrepancies in behaviour to death and codify the final behaviour in automated checks.

@Bodigrim
Copy link
Contributor Author

@kindaro yes, sure. Please leave functions depending on lazy breakSubstring out of scope for a while, because we need to decide on a search algorithm first (see #307), but everything else on the list should be accessible and mergeable in a finite amount of time.

@kindaro
Copy link
Contributor

kindaro commented Mar 1, 2021

sort is missing from lazy variants of byte strings. Is it intentional? I understand that it would need to force the entire stream, but the same reasoning applies to Data.List and to strict folds that we are presently adding and no one is objecting to it. Should we add lazy sorting as well? It would likely mean mashing the chunks together, delegating to strict sort and then chunking up again.

@sjakobi
Copy link
Member

sjakobi commented Mar 1, 2021

Should we add lazy sorting as well? It would likely mean mashing the chunks together, delegating to strict sort and then chunking up again.

Yeah, let's address that discrepancy too. I think we can use a counting sort where we make a single pass over the input to determine the total length and the number of occurrences for each byte. Then we simply write out the bytes. Take a look at the strict sort which uses a counting sort for larger inputs too.

@kindaro
Copy link
Contributor

kindaro commented Mar 1, 2021

Alright, I added sort to the list in #364. A counting algorithm makes sense, I shall use it.

@sjakobi
Copy link
Member

sjakobi commented Mar 1, 2021

@kindaro My recommendation would be to split this work across multiple PRs. If you do it all in a single PR, it will be harder to review and you're more likely to run into merge conflicts.

@kindaro
Copy link
Contributor

kindaro commented Mar 1, 2021

Can you explain why it will be easier to review multiple small pull requests? Either way there is the same amount of code. But with smaller pull requests more bureaucracy is added. What is the gain?

@sjakobi
Copy link
Member

sjakobi commented Mar 1, 2021

I simply find large diffs harder to understand and navigate. For example, the question "What tests do we have for addition x?" is IMHO easier to figure out in a small diff.

@kindaro
Copy link
Contributor

kindaro commented Mar 1, 2021

This makes sense. But by this logic I should add functions one by one. This would result in there being a dozen of tiny pull requests. It would be hard to manage.

"What tests do we have for addition x?"

So I think we should find a cheaper way to answer this question. In most cases you can replace a function by an error "…" and see what breaks. (In the presence of fancy types, some values cannot be replaced by error "…", but we do not have any fancy types in this project.)

I agree that sort may be out of place in #364, since all other features there are essentially variations of foldr.

@kindaro
Copy link
Contributor

kindaro commented Mar 1, 2021

I narrowed the scope of #364 to folds and scans.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants