-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split Reader
into SliceReader
and BufferedReader
#425
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first commit (0fa2a8b) you've added a new XmlSource
implementation which then was removed in the next commit -- it seems this impl could have not been introduced at all? Anyway, it is completely wrong to create some implementation just to be able to write test code. Tests should test real code that would executed. I think, you could just squash first two commits.
The third commit does too much work -- it replaces usages of read_event_into()
with read_event()
and did some other stuff. I prefer to split that two actions to separate commits. Actually, I'll plan to make a PR with replacements tonight, so you will no need to do the first part of that.
Also, after introducing a helper functions (input_from_bytes
, etc.), comments that explaining positions in the input, no longer spot the correct place in the input. That should be fixed
@Mingun Yes I know. I added this implementation purely so the |
Codecov Report
@@ Coverage Diff @@
## master #425 +/- ##
==========================================
+ Coverage 49.51% 51.93% +2.42%
==========================================
Files 22 25 +3
Lines 13847 13449 -398
==========================================
+ Hits 6856 6985 +129
+ Misses 6991 6464 -527
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Yes, I see that now. It is really hard to follow refactorings where some code moved to another file & changed at the same time. I suggest to rebase this and follow the following pattern:
With such structure all actual changes would be visible in 2nd and next commits and will make review much more simpler. Anyway, try to not mix different kind of changes in one commit. If you need to change something that is not directly related to changes in your commit, stash your work, made necessary commit(s), Very frequently the main work is not to made changes, but split them into several commits in the correct order. I personally have many private branches (~20) in my local working copy, which contains some working code, but their just not yet ready for PR. To prepare them I need to spend several weeks to adapt them |
Code-wise, this all looks fine to me, I don't have any complaints that block the PR. I also checked out the branch, ran tests, went through the code for a few hours while experimenting with some changes on top of it. re: commit structure, I agree about trying to do it that way in the first place, but I don't feel so strongly that it needs to block the PR, especially since fixing it at this point would essentially require redoing the work of entire commit. It is probably easier to review, than to untangle at this point. |
So for the stuff I'm about to say, wait on an acknowledgement and 👍 from @Mingun before putting any effort into addressing it. I have a patch that you can start with or use as reference. I think the abstraction is (very slightly, in a fairly easy-to-fix way) incorrect for the architecture we might want to have going forwards [0]. Basically The idea is that in the near future all decoding will be done internally rather than forcing the user to handle it, which aside from being much easier to use and maintain, would have better performance in the majority of cases. That means, that if you want to parse a byte slice of an unknown encoding, you would need to either decode the data as UTF-8 before parsing it (and the Does that make sense? And on a related note, the [0] @Mingun you never actually gave a direct thumbs-up or thumbs-down |
It is anyway needed to fix explanation comments in tests that shifted in this PR, so anyway there should be a rebase. Also, I hope that it will be rebased on top of #426.
I think, it will be better to spend some time by polishing stuff rather than discovering in the future that some error was made, and we were unable to detect it in time. In the end, we are in no particular hurry and can spend time on this. In the long run, an understandable history is more important.
If you think so, it is another reason to wait some time before merge. The efforts we have been making recently to properly support encoding can affect a lot of things. I also suspended my work on the namespace because of this. I do not forget about your PRs, @dralley, just had no time yet to write my thoughts. |
I just meant the overall idea, rather than any specific PR. Anyway, #426 ought to address the concerns about commit #3, which just leaves #2. I really do appreciate how much attention you give to clean commits, it does make your PRs very easy to review, but in this specific instance I'm not sure if the benefits are worth the significant time investment to break up commit #2. Our test coverage for this functionality is very good (88% for slice reader, 78% for io reader). If you go through the codecov report line by line, the only un-covered code is:
Would it be an acceptable compromise if, instead of spending a bunch of time splitting apart commit #2, @999eagle added more testing for these internal functions and got coverage above 90%? I think this might have a more significant long-term benefit. |
I promised some patches:
With the second commit, you can see the error caused by |
The main advantage of the suggested approach is that is simplifies rebases a lot. Instead of trying to manually apply patches to the copied & changed code (error-prone!) you will need just copy upstream code to the first move-only commit and then solve any conflicts with your subsequent changes. Because of competing changes in switching to the UTF-8 I think that this PR could be rebased several times. So it makes sense to rewrite it in a way that makes this work easier. |
That's true - but it's partly why I'm OK with letting the issues I mentioned be addressed as followups. I'd rather not force @999eagle to go through a bunch of rebasing when we can just get this merged, build on top of it and avoid the effort entirely. So basically, IMO, these are the only 2 things that really ought to happen before merge
Once it's merged the breakages stop being so much of an issue, and the other issues can be fixed independently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rebased it by myself, applied my suggestions, introduce some intermediate commits with big changes that was in a one commit originally, reordered code slightly to minimize diff and here is the result: master...Mingun:split-reader
I do not like it much. There a several reasons:
- first, as I say, diff a very hard to read. This is fixable and I fixed some of the problems, but this was a quick fix. I did not set myself the task of ensuring the compilation of every commit, although I think this is a mandatory requirement. Tests allowed to fail if that will be fixed in the next commit (although this is still extremely undesirable!), but the code must be compiled. This makes it easier to do rebase and explore code
- current implementation contains commits that did too much work. Again, I tried to separate some of them into separate commits to simplify diff and understand what was happening, but not all
- the main concern is that after this change many code is copied instead of reused. That was main problem in the previous async PRs, but here it even worse, because here the same code duplicated for a different sync implementations
No, I cannot accept this in that state. I hold it for now, because I still think this is a good start, but it requires hard work on polishing to get my approval.
I want:
- maximum reuse
- readable diffs
- small commits (my preference is not more than 5 changed files with significant changes)
- try to minimize non-trivial changes in each commit (ideally not more than 100-300 changed lines)
- each commit should leave project in a compilable state
- (optionally, but very preferrable) CI checks should pass in each commit
This commit only moves code without significant changes (the only changes is: - corrected imports - add imports to the doc comments which have become inaccessible )
Main code moved from `read_namespaced_event_into` to `resolve_namespaced_event_inner`
51fd38f
to
768410c
Compare
Sorry for the silence, I'm also working on other things besides this. So if I understand correctly you want
Quite frankly though, I'm really not sure how this would look like. In your rebased branch even the first commit has a diff over over 700 lines added and removed (not even including the move of I also tried to reuse more code instead of copying but
If you have concrete suggestions on how to continue here I'm open for that but as of right now I honestly don't think your rebased branch could be improved to your standards for the reasons I've given above. |
This is why I say "non-trivial changes". Simple move code from one place to another with minimal changes (only to guarantee compilation) is acceptable of course. Frankly speaking, even 100 changed lines in a file is a very big change that usually is to hard to understand, if that changes not come in a big chunks. I personally try to stick to not more that 50 changed lines in such cases, but this is just a recommendation. All these requirements are aimed at maintaining high code quality. Even high coverage with tests does not guarantee against errors, so I prefer that the average code reader can clearly understand what exactly has changed in the code without relying on technical tools. For example, I just found an error in my own code that was covered by me by tests very well (as I thought!). And when I started refactoring, I accidently encountered an error, which was hided previously (#434).
I talked about the code that was already separated in a similar way, where all buffer-related methods was decoupled into
Unfortunately no, that is why I say that this problem requires investigation. I have not time to do that by myself right now, but I'll return to this problem in the future. I feel that there is a better way than in this PR. |
@Mingun I think if you were to compare the files side-by-side, manually, like this: it would quickly be apparent how little has actually changed. The new As for the duplication of code that remains, given that we are planning to rewrite the My request, if you have no fundamental objections to the architecture itself (which remains very much similar to what we have currently overall), is that we merge this, finish the decoding work, and then evaluate what the appropriate level of code sharing will be and perform the unification at that time - once it is more clear what can and cannot be shared. |
Although @999eagle, if you could please address the failing test, and rename |
This also changes the test cases in the `reader::test::check` macro to allow for reader-specific tests.
768410c
to
c972101
Compare
I've changed the struct name and fixed the failing test. |
Reader
into SliceReader
and IoReader
Reader
into SliceReader
and BufferedReader
Thanks! The last thing I think this needs is a changelog entry. Could you please apply diff --git a/Changelog.md b/Changelog.md
index c305325..e1cb297 100644
--- a/Changelog.md
+++ b/Changelog.md
@@ -137,6 +137,9 @@
- [#423]: All escaping functions now accepts and returns strings instead of byte slices
- [#423]: Removed `BytesText::from_plain` because it internally did escaping of a byte array,
but since now escaping works on strings. Use `BytesText::from_plain_str` instead
+- [#425]: Split the internal implementation of `Reader` into multiple files to better separate the
+ buffered and unbuffered implementations. The buffered methods, e.g. `read_event_into(&mut buf)`,
+ will no longer be available when reading from a slice.
### New Tests
@@ -167,6 +170,7 @@
[#418]: https://github.com/tafia/quick-xml/pull/418
[#421]: https://github.com/tafia/quick-xml/pull/421
[#423]: https://github.com/tafia/quick-xml/pull/423
+[#425]: https://github.com/tafia/quick-xml/pull/425
## 0.23.0 -- 2022-05-08
@Mingun I have given the entire PR a thorough review, including manual comparisons of individual functions to the previous implementations. Since these files are about to be partially rewritten anyway, and that this PR is effectively blocking that work, would you be willing to accept merging this? Deduplication is absolutely something that should be done if possible, but now is not the right time, it will only make the exploration and refactoring I want to do over the next week more difficult. We may even have new opportunities to do so afterwards that aren't available now. Specifically I think once we have the internal buffer, depending on how the implementation works out it might not make sense to continue using user-provided ones. If that's the case we can ditch another chunk of the API surface and potentially consolidate most if not all of the implementation |
And @999eagle, I plan to put up a new draft PR in short order with part of those changes, if you have any suggestions about how to adjust the |
@dralley I've added an entry to the changelog. All I'd change in the |
What I ended up doing, was just move those tests outside of the macro since they don't make sense for both implementations. |
First, let me apologize that this PR takes some long review, but I'm really think that all changes should have rational explanation behind them. For now it is unclear for me, why we should remove Sharing exactly that implementation with an async reader probably would a challenge and may be impossible using the same trait, but should be possibly by using a macro. Something like: macro_rules! impl_methods {
( $($async:ident, $await:ident)? ) => {
$(async)? fn read_bytes_until(&mut self, ...) {
match self.fill_buf() $(.$await)? {
...
}
}
};
}
// Sync-based reader
impl<R> XmlSource for Reader<R> {
impl_methods!();
}
// async in traits is impossible for now
impl<R> AsyncReader<R> {
impl_methods!(async, await);
}
// If even would required
impl<R> AsyncXmlSource for AsyncReader<R> {
fn read_bytes_until(&mut self, ...) -> Poll<...> {
// Trivial call non-trait method, implemented by a macro
self.read_bytes_until()
}
} So I want to investigate possible solutions next week. @dralley, you've said on other PR
Yes, unfortunately that is true. But in my justification, I must say, that I said from the very beginning that the process will not be fast, precisely because I already had other changes planned that intersect with any possible changes in async support. I ask not to take it to heart, it's just a matter of priorities. In my opinion, we can live without asynchronous support, but correct namespace support is more in demand. More error-prone working with different encodings are also more important task, because incorrect working with current API easely could lead to a wrong results and even to security vulnerabilities. And while it can be avoided, it's hard now. Also, precisely so that rebase is not a difficult task, I recommend breaking up commits into small parts and trying not to mix changes with code additions/moves. Precisely because conflicts with simple movement are solved elementary, and after that the conflicts of change are much easier to solve.
Unifying different codes is always more challenging than splitting. Therefore, I want to understand whether separation is so necessary at this stage. As a final note, I should say, that some my concerns still not addressed. I've talked about incorrect alignment in comments, that should explain in tests, where the position would be, for example, here: Lines 974 to 975 in c3a07b6
Position of that indicators should be adjusted in that commit that adds |
The issue is, at least regarding the decoding work, if there's no inner struct then there's nowhere to put the buffer(s?) and other related data for decoding - so it would have to go into the main Reader struct, and it would be there even when it's not needed. So having separate structs makes sense and makes that work a bit easier. To not do so might not be terrible, after all there's also the situation where if the |
@999eagle In practical terms, hold off on rebasing the PR further until there's a consensus on the architecture. I think what we may end up doing is pulling the commits one or two at a time - I've now merged 1 and 7 and 2 is no longer needed after some other changes @Mingun made, and the rest will have to wait until after the experimentation I suppose. |
Alright, I'll hold off on working on this until further notice from you. I fully agree with you @dralley in that multiple structs holding the underlying data source are pretty much a necessity. @Mingun that exact change was actually proposed by you in this comment, so I'm not entirely sure why you're opposed to that now. I also think that implementations between the (what's now called) |
I don't think async makes sense in situations where no IO is necessary to begin with, so we only need to worry about the buffered implementation. Code which doesn't block on IO is perfectly OK to call from async code without needing to be async itself. An idea that I haven't entirely thought through, and I'm not sure if it would work - it might be possible to invert the IO model and have the user perform the IO themselves, "feeding" data to the parser from external user code at the same time as they pull events from it, which would avoid the need to have any asynchronous IO inside the library at all. I'm not entirely sure what that API would look like, though. Just something to explore. Perhaps something like: let mut reader = Reader::new_manual();
reader.trim_text(true);
let mut buf = Vec::new();
loop {
match reader.read_event_into(&mut buf) {
// ...
Ok(Event::Eof) => break,
Err(XmlError::OutOfData) => {
let buffer = file.async_fill_buf().await?;
reader.feed(buffer);
// Somehow communicate back the # of bytes the reader has processed
}
Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
_ => (),
}
} |
I was just going to write the same thing :)
Yes, I known, I propose it as an alternative to the your very complicated first async implementation, that should be solve the conflict problem with the implementing traits. But the more I studied asynchronous wisdom, the more I realized that in fact, probably there is no conflict in the first place. As @dralley wrote, actually we do not need the
Also, I mentioned in the very beginning that I welcome experimentation in search of better design. That is just one iteration of that, I looked into it and it was seemed to me that it is not yet finished.
While I think that this is possible to implement, I also think that this lead to more ugly design. Actually, async code tries to move the user away from the need to do "feeding". So I think, we should focused on the traditional async design. |
This PR was split from #417.
This splits
Reader
into two new structs,SliceReader
andIoReader
to better separate which kind of byte source theReader
uses to read bytes. Changes are based on #417 (comment).A
Reader<SliceReader>
also explicitly doesn't have methods for buffered access anymore.