Fix #3314 duplicate code error only shows up with pylint jobs 1 #3458
Conversation
I approve of the changes proposed; this is great work, probably huge for performance on multi-core machines on top of fixing a bug. But I just can't review this much clever code in a reasonable time. Do you have an idea of something that could be separated out easily? If some refactor would help make the implementation less verbose, we could merge that before the main change. Surface remark: you don't have to update the copyright manually yourself; there is a script that does it.
Yes. Absolutely. It will be the changes to
Gotcha :)
Unfortunately performance doesn't improve with this, nor with batching files (another PR I have ready, but dependent upon this PR), at least not in real-world use cases, sadface. Performance is a huge topic for pylint; when I started looking at it I thought it might be an easier problem than it is. Issue #2525 has some good notes on this. For example, some rough benchmarking suggests batching files in -jN mode improves performance by ~13%... but that is only in the pylint wrapper system. This is because the astroid checkers seem to spend the same amount of time in each sub-process as in a non-threaded run, meaning wall-clock time remains constant for all -j[1..N]! (as #2525 states). Performance work will take some collaboration, it seems.
Thank you for the detailed explanation. If it is not too much work, could you create a small merge request with the changes to lint.py and the get_map_data/reduce_map_data functions in similar.py? Maybe even the typos. We could get that out of the way first.
After looking at the ticket you linked, it seems there is a clear path to making performance better. I would not call it a low-hanging fruit, but there's a fruit! Wouldn't it make sense to do that ticket first, so we can see the actual result of parallelization and batching? By the way, how are you testing the performance changes? I launch a local analysis on pylint and on Apache Spark, which is probably a mediocre test... I'm looking for a fast way to see if the perfs are getting better.
I've updated this PR with the minimal changes. We can get the rest in afterwards as separate commits. Edit: ~100 lines of the PR are code changes; two input files make up the majority of the remaining lines.
Yes. Although I'm not 100% sure there's a clear path, yet :)
Have a look at WIP PR #3473. I've tried to clarify the full range of considerations and problems there - it may still not be very clear what is going on, but look at the description and the individual commits behind that PR; it should tell a story (I try to tell a story with each commit, especially on complex work). Where I say ~13% gain with batching, I determined that via pytest-benchmark on a couple of machines, and with a carefully designed perf-test, but it only shows part of the picture. For the other part, if you've not seen it yet, I generated a performance heat-map for a pylint run against numpy under -j1 (heat-map image attached in the original comment). That perf-chart is pretty much what I see irrespective of how I run performance analysis, but I'm lacking a good understanding of some of what pylint is doing under the hood (ast-walking and so on).
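For reference, a pytest-benchmark perf-test has roughly the following shape. This is a minimal sketch, not the actual test behind the ~13% figure: the helper name and target path are illustrative, and it assumes pylint's `Run` entry point accepts `exit=False` (as recent versions do) plus an installed pytest-benchmark.

```python
# Minimal pytest-benchmark sketch (illustrative; not the real perf-test).
from pylint.lint import Run


def _lint_corpus(jobs):
    # "tests/input/" is a placeholder for a fixed corpus of files to lint.
    Run(["--jobs", str(jobs), "tests/input/"], exit=False)


def test_wall_clock_jobs2(benchmark):
    # pytest-benchmark's `benchmark` fixture repeats the call and reports
    # min/mean/stddev wall-clock timings.
    benchmark(_lint_corpus, 2)
```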
Yes. In general it is really simple thanks to
You can see how simple it is from this commit on PR #3473 :) As for the numpy bit, that's best explained by this commit on a branch in my pylint fork (it needs rebasing and tidying and is dependent on #3473). That branch:
My explanation is almost longer than the code to do it, which is why I love Python so much :D Having written that, because of its value, it might be worth doing a numpy-profile PR first, before discussing the profiling-regression work. It's easy enough to do, but PyCQA would need to have an S3 account and so on.
Thank you for the very detailed explanation. I understand now; I missed the pytest option that generates the diagram! I don't think pylint has an S3 account (@PCManticore probably knows better than me). But it would also be useful to be able to do it locally. Also, maybe it's possible to curl Travis's artifacts, logs, or something similar?
Yep, being able to do that with logs and so on will be useful. I have a PR that I'm getting ready that just does performance runs against external repos. Until the S3 thing is sorted we can run it locally.
See PR #3498
Sure thing, check this branch.
@Pierre-Sassoulas Thank you. I've pulled that, rebased it against the latest 2.5, and am testing.
Hopefully this can finally go in.
Before adding a new mixin, this proves the concept works, adding tests as examples of how this would work in the main linter. The idea here is that, because `check_parallel()` uses a multiprocess `map` function, the natural follow-on is to use a `reduce` paradigm. This should demonstrate that.
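To make the paradigm concrete, here is a self-contained toy - assumed names, not pylint code - showing the same shape: a multiprocess `map` over files, followed by a single-process `reduce` over the per-worker results.

```python
# Toy map/reduce over files (illustrative names; not pylint's code).
import functools
import multiprocessing


def count_lines(path):
    # The "map" half: runs in a worker process and returns per-file data.
    with open(path) as stream:
        return sum(1 for _ in stream)


def merge(total, per_file):
    # The "reduce" half: runs in the parent and combines the results.
    return total + per_file


if __name__ == "__main__":
    files = ["setup.py", "README.rst"]  # stand-ins for the modules being linted
    with multiprocessing.Pool(processes=2) as pool:
        per_file_counts = pool.map(count_lines, files)
    print(functools.reduce(merge, per_file_counts, 0))
```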
This integrates the map/reduce functionality into `lint.check_parallel()`. We previously had `map` being invoked; here we add `reduce` support. We do this by collecting the map data by worker and then passing it to a reducer function on the Checker object, if available - determined by whether the checker conforms to the `mapreduce_checker.MapReduceMixin` mixin interface or not. This allows Checker objects to function across file-streams when using multiprocessing/-j2+. For example, SimilarChecker needs to be able to compare data across all files. The tests, which we also add here, check that a Checker instance returns and reports the expected data and errors, such as error messages and stats - at least in an exit-ok (0) situation. On a personal note, as we are copying more data across process boundaries, I suspect that the memory implications of this might cause issues for large projects already running with -jN and duplicate-code detection on. That said, given that it takes a long time to lint large code bases, that is an issue for the [near?] future and likely to be part of the performance work. Either way, let's get it working first and deal with memory and performance considerations later - I say this as there are many quick wins we can make here, e.g. file-batching, hashing lines, data compression and so on.
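A rough sketch of that collect-then-reduce step - the helper name and data shapes here are assumptions for illustration; the PR's actual code may differ in detail:

```python
# Hypothetical sketch: regroup per-worker map data by checker name, then hand
# it to each checker that implements reduce_map_data.
import collections


def _reduce_map_data(linter, all_mapreduce_data):
    """`all_mapreduce_data` is one {checker_name: data} dict per worker."""
    collated = collections.defaultdict(list)
    for worker_data in all_mapreduce_data:
        for checker_name, data in worker_data.items():
            collated[checker_name].append(data)
    for checker in linter.get_checkers():
        if checker.name in collated and hasattr(checker, "reduce_map_data"):
            # One list entry per worker; the checker merges them itself.
            checker.reduce_map_data(linter, collated[checker.name])
```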
@doublethefish sorry for the delay, busy year! :) And thanks again for the merge request. I fixed the current conflict on https://github.com/Pierre-Sassoulas/pylint/tree/doublethefish-bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1 - could you check that the last commit, which fixes the error that appeared, makes sense to you, and either update the MR with it or authorize edits on your branch by maintainers? :) Except for this conflict, I think this merge request is now ready; what do you think?
There's a rebase of this PR upon master at https://github.com/FPtje/pylint/tree/bug/3314_duplicate_code_error_only_shows_up_with_pylint_jobs_1. I'm not familiar with the code, but I tried my best to fix the things that changed due to the rebase:
Feel free to use the above fork for this PR. I would love to see this merged into master 👍
Merged in #4007, thank you for the great work @doublethefish!
Description
This PR allows recombining checker-data across checker instances, such as SimilarChecker, fixing #3314.
This means checkers that need to do analysis across all files can do so, even in a multiprocessing environment. We achieve this through a map/reduce mixin solution.
There are two key changes (see the sketch after this list):
- `check_parallel` collates map data from each checker that supports it, and passes the collated data to a reduce class-member function on the associated checker.
- `MapReduceMixin` is an opt-in object that specifies the interface that is looked for by `check_parallel`.
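A sketch of that opt-in interface - the method names `get_map_data`/`reduce_map_data` are the ones discussed above for similar.py, but the mixin body and the toy checker here are illustrative, not the PR's exact code:

```python
# Sketch of the opt-in interface; the toy checker is illustrative only.
class MapReduceMixin:
    """Checkers inherit this to opt in to cross-process data merging."""

    def get_map_data(self):
        """Return this worker's accumulated data (the map half)."""
        raise NotImplementedError

    def reduce_map_data(self, linter, data):
        """Merge the data gathered by all the workers (the reduce half)."""
        raise NotImplementedError


class ToyLineCountChecker(MapReduceMixin):
    """Counts lines across *all* files, even when work is split by -jN."""

    name = "toy-line-count"

    def __init__(self):
        self._lines = 0

    def process_module(self, node):
        self._lines += len(node.stream().readlines())

    def get_map_data(self):
        return self._lines

    def reduce_map_data(self, linter, data):
        self._lines = sum(data)  # one entry per worker process
```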
One other set of changes minimises the changes to SimilarChecker and its tests. The goal here was to reduce the size of the patch and reuse as much existing code as possible - that said, the code could do with modernising before further changes.
The other mentionable change is a new set of tests for `check_parallel`. Looking at most of the other issues for job-related problems, they seem to be configuration- and command-line-argument related, e.g. #374 and #2674; it's not clear to me how to put some of those issues under test yet - any help appreciated here.
To summarise: we previously had `map` being invoked; here we add `reduce` support by collecting the map data and then passing it to a reducer function, if available on the checker object - determined by whether the checker conforms to the `MapReduceMixin` mixin interface or not.
On a personal note, I suspect that the memory implications of this might cause issues for large projects, perhaps even having implications for performance (see #2525), but let's get multiprocess/--jobs working properly first and deal with memory and performance considerations later - I say this as there are many quick wins we can make here, e.g. hashing lines, splitting work by available procs rather than by file, and so on.
Related Issue
Closes #3314