Fix #3314 duplicate code error only shows up with pylint jobs 1 #3458
Conversation
I approve of the changes proposed; this is great work, probably huge for performance on multi-core machines on top of fixing a bug. But I just can't review this much clever code in a reasonable time. Do you have an idea of something that could be separated out easily? If some refactor would help make the implementation less verbose, we could merge that before the main change. Surface remark: you don't have to update the copyright manually yourself; there is a script that does it.
Yes. Absolutely. It will be the changes to
Gotcha :)
Unfortunately performance doesn't improve with this, nor with batching files (another PR I have ready, but dependent upon this PR), at least not in real-world use cases, sadface. Performance is a huge topic for pylint; when I started looking at it I thought it might be an easier problem than it is. Issue #2525 has some good notes on this. For example, some rough benchmarking suggests batching files in -jN mode improves performance by ~13%... but that is only in the pylint wrapper system. This is because the astroid checkers seem to spend the same amount of time in each sub-process as in a non-threaded run, meaning wall-clock time remains constant for all -j[1..N]! (as #2525 states). Performance work will take some collaboration, it seems.
Thank you for the detailed explanation. If it is not too much work, could you create a small merge request with the changes to lint.py and the get_map_data/reduce_map_data functions in similar.py? Maybe even the typos. We could get that out of the way first.
After looking at the ticket you linked, it seems there is a clear path to making performance better. I would not call it a low-hanging fruit, but there's a fruit! Wouldn't it make sense to do that ticket first, so we can see the actual result of parallelization and batching? By the way, how are you testing the performance changes? I launch a local analysis on pylint and on Apache Spark, which is probably a mediocre test... I'm looking for a fast way to see if the perfs are getting better.
I've updated this PR with the minimal changes. We can get the rest in afterwards as separate commits. Edit: ~100 lines of the PR are code changes; two input files make up the majority of the remaining lines.
Yes. Although I'm not 100% sure there's a clear path, yet :)
Have a look at WIP PR #3473. I've tried to clarify the full range of considerations and problems there - it may still not be very clear what is going on, but look at the description and the individual commits behind that PR; it should tell a story (I try to tell a story with each commit, especially on complex work). Where I say ~13% gain with batching, I determined that via pytest-benchmark on a couple of machines, and with a carefully designed perf-test, but it only shows part of the picture. For the other part, if you've not seen it yet, I generated a performance heat-map for a pylint run against numpy under -j1 (heat-map image attached in the original comment). That perf-chart is pretty much what I see irrespective of how I run performance analysis, but I'm lacking a good understanding of some of what pylint is doing under the hood (ast-walking and so on).
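For reference, a pytest-benchmark perf-test has roughly the following shape. This is a minimal sketch, not the actual test behind the ~13% figure: the helper name and target path are illustrative, and it assumes pylint's `Run` entry point accepts `exit=False` (as recent versions do) plus an installed pytest-benchmark.

```python
# Minimal pytest-benchmark sketch (illustrative; not the real perf-test).
from pylint.lint import Run


def _lint_corpus(jobs):
    # "tests/input/" is a placeholder for a fixed corpus of files to lint.
    Run(["--jobs", str(jobs), "tests/input/"], exit=False)


def test_wall_clock_jobs2(benchmark):
    # pytest-benchmark's `benchmark` fixture repeats the call and reports
    # min/mean/stddev wall-clock timings.
    benchmark(_lint_corpus, 2)
```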
Yes. In general it is really simple thanks to
You can see how simple it is from this commit on PR #3473 :) As for the numpy bit, that's best explained by this commit on a branch in my pylint fork (it needs rebasing and tidying and is dependent on #3473). That branch:
My explanation is almost longer than the code to do it, which is why I love Python so much :D Having written that, because of its value, it might be worth doing a numpy-profile PR first, before discussing the profiling-regression work. It's easy enough to do, but PyCQA would need to have an S3 account and so on.
Thank you for the very detailed explanation. I understand now; I missed the pytest option that generates the diagram! I don't think pylint has an S3 account (@PCManticore probably knows better than me). But it would also be useful to be able to do it locally. Also, maybe it's possible to curl Travis's artifacts, logs, or something similar?
Yep, being able to do that with logs and so on will be useful. I have a PR that I'm getting ready that just does performance runs against external repos. Until the S3 thing is sorted we can run it locally.
See PR #3498
Sure thing, check this branch.
@Pierre-Sassoulas Thank you. I've pulled that, rebased it against the latest 2.5, and am testing.
Hopefully this can finally go in.
Before adding a new mixin, this proves the concept works, adding tests as examples of how this would work in the main linter. The idea here is that, because `check_parallel()` uses a multiprocess `map` function, the natural follow-on is to use a `reduce` paradigm. This should demonstrate that.
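To make the paradigm concrete, here is a self-contained toy - assumed names, not pylint code - showing the same shape: a multiprocess `map` over files, followed by a single-process `reduce` over the per-worker results.

```python
# Toy map/reduce over files (illustrative names; not pylint's code).
import functools
import multiprocessing


def count_lines(path):
    # The "map" half: runs in a worker process and returns per-file data.
    with open(path) as stream:
        return sum(1 for _ in stream)


def merge(total, per_file):
    # The "reduce" half: runs in the parent and combines the results.
    return total + per_file


if __name__ == "__main__":
    files = ["setup.py", "README.rst"]  # stand-ins for the modules being linted
    with multiprocessing.Pool(processes=2) as pool:
        per_file_counts = pool.map(count_lines, files)
    print(functools.reduce(merge, per_file_counts, 0))
```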
This integrates the map/reduce functionality into `lint.check_parallel()`. We previously had `map` being invoked; here we add `reduce` support. We do this by collecting the map data by worker and then passing it to a reducer function on the Checker object, if available - determined by whether the checker conforms to the `mapreduce_checker.MapReduceMixin` mixin interface or not. This allows Checker objects to function across file-streams when using multiprocessing/-j2+. For example, SimilarChecker needs to be able to compare data across all files. The tests, which we also add here, check that a Checker instance returns and reports the expected data and errors, such as error messages and stats - at least in an exit-ok (0) situation. On a personal note, as we are copying more data across process boundaries, I suspect that the memory implications of this might cause issues for large projects already running with -jN and duplicate-code detection on. That said, given that it takes a long time to lint large code bases, that is an issue for the [near?] future and likely to be part of the performance work. Either way, let's get it working first and deal with memory and performance considerations later - I say this as there are many quick wins we can make here, e.g. file-batching, hashing lines, data compression and so on.
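A rough sketch of that collect-then-reduce step - the helper name and data shapes here are assumptions for illustration; the PR's actual code may differ in detail:

```python
# Hypothetical sketch: regroup per-worker map data by checker name, then hand
# it to each checker that implements reduce_map_data.
import collections


def _reduce_map_data(linter, all_mapreduce_data):
    """`all_mapreduce_data` is one {checker_name: data} dict per worker."""
    collated = collections.defaultdict(list)
    for worker_data in all_mapreduce_data:
        for checker_name, data in worker_data.items():
            collated[checker_name].append(data)
    for checker in linter.get_checkers():
        if checker.name in collated and hasattr(checker, "reduce_map_data"):
            # One list entry per worker; the checker merges them itself.
            checker.reduce_map_data(linter, collated[checker.name])
```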
@doublethefish sorry for the delay, busy year! :) And thanks again for the merge request. I fixed the current conflict on https://github.com/Pierre-Sassoulas/pylint/tree/doublethefish-bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 - could you check that the last commit, which fixes the error that appeared, makes sense to you, and either update the MR with it or authorize edits on your branch by maintainers? :) Except for this conflict, I think this merge request is now ready; what do you think?
There's a rebase of this PR upon master at https://github.com/FPtje/pylint/tree/bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1. I'm not familiar with the code, but I tried my best to fix the things that changed due to the rebase:
Feel free to use the above fork for this PR. I would love to see this merged into master 👍
Merged in #4007, thank you for the great work @doublethefish!
Description
This PR allows recombining checker-data across checker instances, such as SimilarChecker, fixing #3314.
This means checkers that need to do analysis across all files can do so, even in a multiprocessing environment. We achieve this through a map/reduce mixin solution.
There are two key changes (see the sketch after this list):
- `check_parallel` collates map data from each checker that supports it, and passes the collated data to a reduce class-member function on the associated checker.
- `MapReduceMixin` is an opt-in object that specifies the interface that is looked for by `check_parallel`.
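A sketch of that opt-in interface - the method names `get_map_data`/`reduce_map_data` are the ones discussed above for similar.py, but the mixin body and the toy checker here are illustrative, not the PR's exact code:

```python
# Sketch of the opt-in interface; the toy checker is illustrative only.
class MapReduceMixin:
    """Checkers inherit this to opt in to cross-process data merging."""

    def get_map_data(self):
        """Return this worker's accumulated data (the map half)."""
        raise NotImplementedError

    def reduce_map_data(self, linter, data):
        """Merge the data gathered by all the workers (the reduce half)."""
        raise NotImplementedError


class ToyLineCountChecker(MapReduceMixin):
    """Counts lines across *all* files, even when work is split by -jN."""

    name = "toy-line-count"

    def __init__(self):
        self._lines = 0

    def process_module(self, node):
        self._lines += len(node.stream().readlines())

    def get_map_data(self):
        return self._lines

    def reduce_map_data(self, linter, data):
        self._lines = sum(data)  # one entry per worker process
```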
One other set of changes minimises the changes to SimilarChecker and its tests. The goal here was to reduce the size of the patch and reuse as much existing code as possible - that said, the code could do with modernising before further changes.
The other mentionable change is a new set of tests for `check_parallel`. Looking at most of the other issues for job-related problems, they seem to be configuration- and command-line-argument related, e.g. #374 and #2674; it's not clear to me how to put some of those issues under test yet - any help appreciated here.
To summarise: we previously had `map` being invoked; here we add `reduce` support by collecting the map data and then passing it to a reducer function, if available on the checker object - determined by whether the checker conforms to the `MapReduceMixin` mixin interface or not.
On a personal note, I suspect that the memory implications of this might cause issues for large projects, perhaps even having implications for performance (see #2525), but let's get multiprocess/--jobs working properly first and deal with memory and performance considerations later - I say this as there are many quick wins we can make here, e.g. hashing lines, splitting work by available procs rather than by file, and so on.
Related Issue
Closes #3314