Merge develop into cascading3 #1776
base: cascading3
Conversation
This reverts commit f4b77ce.
* remove a few warnings * fix/suppress more warnings * fix some more warnings
* use the latest scala versions * fix repl for 2.12.3
The `getCounter` method of the `Reporter` returned from `HadoopFlowProcess` was returning null in some cases for a few jobs that we run in production. (It is unclear why these jobs were seeing null counters.) From looking at the Hadoop source code, getCounter does return null in some instances; in particular, the Reporter.NULL implementation unconditionally returns null from its getCounter implementation. Hadoop does this despite not documenting that null is a valid return value. Solution: null check the return value of `Reporter.getCounter` to work around the issue. Fixes #1716
protect against null counters
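A minimal sketch of the guard described above, using the Hadoop mapred `Reporter` API; this is an illustration of the null check, not the actual scalding patch:

```scala
import org.apache.hadoop.mapred.Reporter

// Hadoop's Reporter.getCounter may return null (Reporter.NULL always does),
// so check for null before incrementing instead of assuming a counter exists.
def incrementSafely(reporter: Reporter, group: String, name: String, amount: Long): Unit = {
  val counter = reporter.getCounter(group, name)
  if (counter != null) counter.increment(amount)
  // else: skip the update, i.e. "protect against null counters"
}
```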
* Implement Dagon.toLiteral * reduce stack depth * rename LitPipe to LiteralPipe * respond to review comments
Use a null check rather than foreach
* Implement Dagon.toLiteral * reduce stack depth * Add generic TypedPipe optimization rules * fix compilation error, add a few more rules * fix serialization issue with 2.12 * Add tests of correctness to optimization rules * add comments, improve some rules * fix bug with outerjoin * fix match error
* Implement Dagon.toLiteral * reduce stack depth * Add generic TypedPipe optimization rules * fix compilation error, add a few more rules * fix serialization issue with 2.12 * Add tests of correctness to optimization rules * add comments, improve some rules * checkpoint * fix bug with outerjoin * Cut over to the compiler approach * add a comment * Use optimization rules to get the tests to pass * fixes to make the tests pass * update comment about dagon post 0.2.2 * fix a bug in the filter composition rule, go tests!
* Make TypedPipe Optimizations configurable * remove the unsafe cache-cleaning prematurely added * add more test coverage
* create new function to extend a DateRange into the past * unit tests for prepend/extend/embiggen
* Extend TextLine with TypedSink * Add test for TextLine * Remove need for implicit * Add run so scalding-core tests pass * TextLine requires an offset. Need a separate test name for run and runhadoop
We have observed NPEs for null counters at at least two companies using scalding. We have not root-caused this issue or found a good fix, but previously scalding was set to ignore all null counters. Since some people rely on counters, this is not a great plan, so instead this patch makes ignoring an opt-in behavior.
Add a setting to skip null counters
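A hedged sketch of the opt-in behavior, assuming a Hadoop `Configuration`; the key name `scalding.counters.skipnull` is hypothetical, not the actual setting:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.Reporter

// Hypothetical key; the real setting name in scalding may differ.
val SkipNullCountersKey = "scalding.counters.skipnull"

def increment(conf: Configuration, reporter: Reporter,
              group: String, name: String, amount: Long): Unit = {
  val counter = reporter.getCounter(group, name)
  if (counter != null) counter.increment(amount)
  else if (!conf.getBoolean(SkipNullCountersKey, false))
    // ignoring null counters is opt-in: by default, fail loudly instead of losing counts
    sys.error(s"counter $group.$name is null; set $SkipNullCountersKey=true to ignore")
}
```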
use https for twttr maven
In current versions of Scalding, we have observed that increasing graph size massively increases planning time. We believe this is due to Cascading code that is cubic (or worse) in the number of vertices. This test currently passes for size=64 (with size=128 commented out), but it still takes 18s to plan at size=64, versus <1s for size=32. We ran this test on the cascading3 branch and observed basically linear behavior (e.g. size=128 ran in <3s).
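For illustration only (this is not the test added in this PR), a sketch of how a TypedPipe graph whose vertex count is controlled by a `size` parameter can be built; the timing claims above are about how long Cascading takes to plan graphs like this:

```scala
import com.twitter.scalding.TypedPipe

// Each iteration adds a map, a filter and a merge node, so the graph the
// planner must handle grows with `size`.
def bigGraph(size: Int): TypedPipe[Int] =
  (1 to size).foldLeft(TypedPipe.from(List(1, 2, 3))) { (pipe, i) =>
    pipe.map(_ + i) ++ pipe.filter(_ % 2 == 0)
  }
```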
Make TypedPipe abstract class for better binary compatibility
I'm going to merge when this goes green since all the PRs have already been merged to develop.
This seems like a real failure of some of the random planning:
cc @cwensel
Here is another one: "Message: registry: MapReduceHadoopRuleRegistry, phase: PostPipelines, failed on rule: RemoveMalformedHashJoinPipelineTransformer, see attached source element-graph". I guess this PR is trying to merge the scalacheck code that randomly generates graphs and tries to plan them, and we are seeing it fail. I think this probably means cascading has some rules that can't plan some of these graphs. It would be nice to know what the limitations are in cascading so we can transform them in our rules. @piyushnarang @rubanm any ideas? You both have added some workarounds.
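For context, a rough sketch of what such a randomized generator can look like; this is hypothetical and much simpler than the actual scalacheck code being merged:

```scala
import org.scalacheck.Gen
import com.twitter.scalding.TypedPipe

// Recursively builds small TypedPipe graphs out of maps, filters, merges and
// crosses, which a test can then hand to the Cascading planner.
def genPipe(depth: Int): Gen[TypedPipe[Int]] =
  if (depth <= 0) Gen.listOf(Gen.choose(0, 100)).map(ints => TypedPipe.from(ints))
  else {
    val sub = genPipe(depth - 1)
    Gen.oneOf(
      sub.map(_.map(_ + 1)),
      sub.map(_.filter(_ % 2 == 0)),
      Gen.zip(sub, sub).map { case (a, b) => a ++ b },
      Gen.zip(sub, sub).map { case (a, b) => a.cross(b).map(_._1) }
    )
  }
```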
@johnynek - do you know what graph it fails on? We had to work through a few job failures in the planning phase as the cascading3 planner didn't like the graph (even though the same graph worked fine on cascading2). This was also one of our concerns about merging the branch at Twitter: we were afraid of internal users who hadn't tried cascading3 on their jobs running into these issues.
@piyushnarang these were randomly generated. The toString is in the message, but it is pretty huge, I don't want to paste it here. The error seems like it is a self hashjoin. I'll try to write a more direct test to trigger it. If that's it, we can fix it with an optimizer rule.
Improve HashJoin related rules
…more-planning-tests Default to forcing before hashJoin on cascading 3
okay, here is another failure in the current state:
with the cascading stack of:
@piyushnarang okay, here is a test that is still currently failing even with the hashjoin checkpoints we have added. Here is the cascading debug output: You can see the scala code in the PR. I will try to minimize it down a bit and see if it still fails. cc @cwensel
Okay, I simplified the failing case and it indeed looks related to merges and hashjoins. We had some code to try to protect against this but it looks like it is not working here. Here is the cascading log output: https://www.dropbox.com/s/7qyc4a9pxtstwio/E552D2.tgz?dl=0

As you can see, it is actually not a very complex job that causes the failure:

    val p1 =
      TypedPipe.from(List(1, 2))
        .cross(TypedPipe.from(List(3, 4)))
    val p2 =
      TypedPipe.from(List(5, 6))
        .cross(TypedPipe.from(List(8, 9)))
    val p3 = (p1 ++ p2) // this can be planned
    val p4 = (TypedPipe.from(List((8, 1), (10, 2))) ++ p3) // this cannot be planned

We can possibly add complexity to our rule about HashJoins and merges. Not clear what the recipe is though... @cwensel in a world where people have functions that return pipes, how can you be sure you are not merging HashJoins? Should we be avoiding this, or is this minimal case enough to see about fixing a bug?
I tried out a few variants of the failing job and it seems like the planner failure triggers due to both p1 and p2 performing a cross (hashJoin). If I drop the cross from either p1 or p2, the planner seems to proceed. A potential workaround is to add a forceToDisk on p3: `val p3 = (p1 ++ p2).forceToDisk` seems to work. It's not great as it adds an extra step, but it might be a workaround till we can fix the planner? Not sure if it can be as simple as: if both pipes feeding into a merge have a hashJoin, then force to disk.
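In code, the workaround being described looks roughly like this (a sketch reusing p1/p2 from the failing example above; the forceToDisk adds an extra disk step but lets the plan succeed):

```scala
import com.twitter.scalding.TypedPipe

val p1 = TypedPipe.from(List(1, 2)).cross(TypedPipe.from(List(3, 4)))
val p2 = TypedPipe.from(List(5, 6)).cross(TypedPipe.from(List(8, 9)))

// Checkpoint the merge of the two cross (hashJoin) outputs before merging again.
val p3 = (p1 ++ p2).forceToDisk
val p4 = TypedPipe.from(List((8, 1), (10, 2))) ++ p3
```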
I’m pretty concerned we are going to basically render hashJoin worthless or even harmful by adding too many checkpoints. I wonder if we should just translate all hashJoins to cogroups at the start in cascading 3, unless the user specifically opts in to hashJoins in a config. Absent a fix in cascading, I don’t see a great option yet.
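A very rough sketch of that opt-in idea (hypothetical helper; the real change would be an optimization rule inside scalding, and the join syntax may differ slightly across scalding versions):

```scala
import com.twitter.scalding.TypedPipe

// If hashJoin is not explicitly allowed, fall back to an ordinary reduce-side join.
def joinMaybeHashed[K: Ordering, V, W](
    big: TypedPipe[(K, V)],
    small: TypedPipe[(K, W)],
    allowHashJoin: Boolean): TypedPipe[(K, (V, W))] =
  if (allowHashJoin) big.hashJoin(small.group) // map-side join, small side replicated
  else big.group.join(small.group).toTypedPipe // reduce-side cogroup
```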
I wonder if the rule is: you can have at most one hashJoin input to a merge.
It seems like the behavior matches that: I tried out a run where p1 didn't have the cross. I thought flipping the order of the final merge would work, but I realized I had a bug on my end (I was writing p3 instead of p4). It seems like if p4's merge lhs/rhs is a pipe which gets two hashJoins, it fails.
Filed cwensel/cascading#61 to help track.
Looks like Chris has a repro and a partial fix (does not yet work for Tez): cwensel/cascading#61 (comment)
Well, cascading 3.3.0-wip-18 fixes the hashjoin merge cases we saw (and so far passes the randomly generated tests), but it seems that it has a regression related to partitioned schemes.
make sure the cascading branch is up to date with develop
cc @rubanm @fwbrasil