Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow using tokio's AsyncBufRead [Rebased] #314

Closed
wants to merge 9 commits into from

Conversation

itsgreggreg
Copy link

This is a rebase of this PR #233 on top of the current master.

I don't know if there was a better way to do this and my intention is not to step on anyone's work but rather keep it moving along. I'm happy to take this PR down if you want to update yours @endor or honestly whatever works for people. This PR is a proper rebase of the endor/async on top of tafia/master so line by line credit is still in tact except for the few places I had to fix rebase conflicts.

@endor, @tafia happy to help with whatever y'all need help with here, lemme know!

@itsgreggreg itsgreggreg marked this pull request as ready for review August 25, 2021 12:57
@endor
Copy link

endor commented Aug 25, 2021

Looks good to me. After we merge, it would probably make sense to extract the BufferedInput from the sync reader into the mod and have one implementation in sync and one in asynchronous.

Copy link
Owner

@tafia tafia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

I have a few small questions but I am willing to merge it so we can iterate on it.

I don't really like all the code duplication and I believe there should be a way to be more DRY.

EDIT: could you run a cargo fmt?

src/utils.rs Outdated
}
}
pub fn write_cow_string(f: &mut Formatter<'_>, s: &[u8]) -> Result {
write_byte_string(f, s)?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it intentional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope! good eyes, reverted.

loop {
let event = match reader.read_event(&mut buf)? {
Eof => break,
Start(elem) => {
let mut attrs = elem.attributes().collect::<Result<Vec<_>>>()?;
let mut attrs = elem.attributes().collect::<Result<Vec<_>>>().unwrap();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not keeping the ??

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted!

@itsgreggreg
Copy link
Author

Ran the formatter, couple small cleanups.

@endor
Copy link

endor commented Sep 12, 2021

What is this waiting for?

@tafia
Copy link
Owner

tafia commented Sep 19, 2021

There is a missing file, @itsgreggreg any idea?

@itsgreggreg
Copy link
Author

It's a path joining error, If you look at the error the first separator in the path is a windows style \ and the rest of them are unix style /.

https://github.com/tafia/quick-xml/runs/3644904506?check_suite_focus=true#step:7:5

I don't know what's causing that.

@itsgreggreg
Copy link
Author

itsgreggreg commented Sep 19, 2021

whelp, here's why https://doc.rust-lang.org/std/macro.include_bytes.html

The file is located relative to the current file (similarly to how modules are found). The provided path is interpreted in a platform-specific way at compile time. So, for instance, an invocation with a Windows path containing backslashes \ would not compile correctly on Unix.

@tafia
Copy link
Owner

tafia commented Sep 21, 2021

I'm a bit surprised, I have used forward slash in windows quite a lot at work and I never had this issue, I understand the other way may not work (escaping characters) but it doesn't seem to be the case here.

@itsgreggreg
Copy link
Author

itsgreggreg commented Sep 22, 2021

Upon closer inspection it seems that through the rebase I missed some changes to xmlrs_reader_tests.rs that removed that path but the file itself was still removed. I went through the file and hopefully made sure it was up to date. My bad for rushing to a conclusion when really I should have noticed it was working before.

Interesting that the rust docs say paths in include_str and friends have to be OS specific but unix paths work on windows. I wonder if windows paths work on *nix.

@999eagle
Copy link
Contributor

999eagle commented Jul 4, 2022

I'm currently building a crate for an XML-based protocol based on this PR's source branch as I need pull-based async XML deserialization for my crate. It'd be really helpful if this could be merged so I can publish my work to crates.io. Is there anything I could help with to get this merged?

@notgull
Copy link

notgull commented Jul 4, 2022

You could use tokio's spawn_blocking functionality until this is merged. Alternatively, since &[u8] implements Read, you could use the asynchronous file system functions to read into a buffer and then pass that buffer to this crate

@999eagle
Copy link
Contributor

999eagle commented Jul 5, 2022

@notgull sadly the specifics of the protocol I'm dealing with require me to read from a network socket until I have read one full XML element. As my network socket should be async, I need pull-based XML parsing from an AsyncRead. Currently I've implemented this by reading into a buffer until the last read bytes look like the end tag corresponding to the first start tag received, but this PR provides a much cleaner solution.

@notgull
Copy link

notgull commented Jul 5, 2022

What protocol is this?

@999eagle
Copy link
Contributor

999eagle commented Jul 6, 2022

It's the Greenbone Management Protocol. A vendor-specific protocol that, as far as I can tell, only has an official Python implementation, which also does nothing more than use a pull-based XML parser to read all received data until one full XML element has been read. Yes, I know it's a bit stupid but I can't really change it. The protocol really just sends XML data over a TLS-encrypted TCP stream without any kind of header or footer with the connection being kept open for further requests.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything I could help with to get this merged?

@999eagle, Yes, you can. I'd like to see the following changes:

  • investigate the possibility to depend on the crate with only async trait, without async runtime. Runtime is a quite big thing and I would prefer not to have it because we don't need it. It is fine to have it if there is no other way, but this should be proved
  • (required) eliminate code duplication either by introducing a trait and different implementations for the sync and async readers, or by using macros
  • I prefer to have a read_event_async (borrowed version) / read_event_into_async (copying into provided buffer version) instead of a separate reader, because I want to introduce a dedicated reader for namespace support and do not want to have 4 readers (non-ns/ns + sync/async). Async abilities seems does not needed a dedicated reader. I'm open for discussion, although
  • I prefer to put all async examples/tests to their own files
  • tests for internal functions similar to that should be added:

    quick-xml/src/reader.rs

    Lines 2591 to 2599 in 0febc2b

    /// Tests for reader that generates events that borrow from the provided buffer
    mod buffered {
    check!(&mut Vec::new());
    }
    /// Tests for reader that generates events that borrow from the input
    mod borrowed {
    check!(());
    }

    Preferably if it could be added to that group of tests

async support definitely won't be in 0.24.0 release, because I plan to release it when I finish with namespaces support which could be soon. 0.25.0 is an option.

Probably it is better to rewrite async support from scratch using this PR as inspiration

@@ -33,6 +36,7 @@ default = []
encoding = ["encoding_rs"]
serialize = ["serde"]
escape-html = []
asynchronous = ["tokio", "async-recursion"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to name it async

@999eagle
Copy link
Contributor

999eagle commented Jul 7, 2022

@Mingun a few thoughts on your points:

  • There's the option of using futures_io::AsyncBufRead instead of tokio's types, but the traits from futures aren't implemented for tokio's types and vice-versa. Also this PR doesn't include the full tokio runtime by default as it only depends on the feature flags io-util and fs, which is why I think the tokio-dependency is okay as it is right now.
  • I think sync and async reading are different enough to require two different readers as they require different traits for the underlying reader. One option for implementing this would be to separate reading from parsing. That is, there's an async reader and a sync reader, but both would pass the read bytes on to a parser. As parsing is only CPU-bound as far as I can tell and doesn't involve any reading or waiting on external resources, parsing would always be pureley sync. This also ties in with the next point:
  • The parser could be some implemented using a trait to keep the possibility of having a namespaced and a non-namespaced parser. The trait would be sync and both reader implementations could do whatever they need to do to read bytes and once they have a buffer, just call into the sync parser trait. That would allow for all combinations of sync/async and namespaced/non-namespaced while having minimal duplicate code.

What do you think about those ideas? And if I was to start that work, should it be as a new PR or as a change to this one?

I definitely agree with the last two points that async tests should be separate and that more tests are required.

@Mingun
Copy link
Collaborator

Mingun commented Jul 8, 2022

I've learned a bit about current async support in Rust and it seems that there is no a single crate with just the AsyncRead / AsyncWrite trait. There was some in the past (tokio_io), but it was then merged into tokio crate as io module. Different sources says that there are about three different implementations of such traits in tokio, futures-rs and async-std. Async Book also note, that there is a compatibility layer between some of them, but without performance analysis. So it should fine to have tokio


I think sync and async reading are different enough to require two different readers as they require different traits for the underlying reader.

Actually, BufRead bound is required only on impl block here:

quick-xml/src/reader.rs

Lines 464 to 465 in 0febc2b

/// Read methods
impl<R: BufRead> Reader<R> {

Actually, I had already planned to remove that bound from another places but doubted would that will provide a better ergonomic, but now I have arguments. Perhaps I will make a separate PR with this (and some other) change (#412).

Sync and async readers have the same set of fields, and it is safe to read one event via sync interface and other via async interface -- this would not break anything. So all that we need are a separate impl blocks:

// sync methods
impl<R: BufRead> Reader<R> { /*...*/ }

// async methods
impl<R: AsyncBufRead> Reader<R> { /*...*/ }

We even can have a native implementation for each async library, using different features: async-tokio, async-features, async-std.

So I think, the separate reader is not necessary. For comparison's sake, namespaced reader have additional fields, so a separate struct is required. It is also an error to mix calls to the non-namespace aware methods and a namespace-aware one (currently it is read_event and read_namespaced_event). For example, if you read a Start event using read_namespaced_event and read an End event using read_event, the namespaces, declared in the Start event would not be end their scope and further parsing could be with errors.


One option for implementing this would be to separate reading from parsing. That is, there's an async reader and a sync reader, but both would pass the read bytes on to a parser. As parsing is only CPU-bound as far as I can tell and doesn't involve any reading or waiting on external resources, parsing would always be pureley sync.

Yes, it makes sense to split reading from parsing where possible. Small code duplications are allowed or may be hidden behind macros.


What do you think about those ideas? And if I was to start that work, should it be as a new PR or as a change to this one?

I think it is better to start a new PR -- some parts of code was changed significantly since this PR was created; I also prefer to have a clean history without fixup commits

@999eagle 999eagle mentioned this pull request Jul 11, 2022
@999eagle
Copy link
Contributor

999eagle commented Jul 11, 2022

Okay, I've started working on this (see the linked PR #417 which I've opened to allow for discussion). For now, I had to use an async-specific version of the reader::XmlSource trait because the functions need to return futures instead of results. Some of the impl-blocks for Reader have to be adapted to this now as well.

Additionally, separating implementations based on trait bounds (that is, impl <R: BufRead> Reader<R> and impl<R: AsyncBufRead> Reader<R> won't work because there is no reason types can't implement both BufRead and AsyncBufRead (case in point: both traits are implemented for &[u8]). This requires either different function names between sync/async or different structs to be used.

@dralley dralley closed this Jul 16, 2022
@dralley
Copy link
Collaborator

dralley commented Jul 16, 2022

Closing the old PRs in favor of #417

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants