Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock #1081

Open
gcamp opened this issue Dec 20, 2021 · 5 comments
Open

Deadlock #1081

gcamp opened this issue Dec 20, 2021 · 5 comments
Labels
bug Something isn't working (crash, a rule has a problem)

Comments

@gcamp
Copy link

gcamp commented Dec 20, 2021

Bug report

Describe the bug

The validator seems to deadlock sometimes. It seems to be related to out of memory errors.

How we reproduce the bug
We were not able to reproduce it in a reliable way.

Screenshots:

If I run strace -s 99 on the process while it's deadlock this is what I get

[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=843564252}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=843581123}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=894077086}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=894187471}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=944341980}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] <... futex resumed>)      = 0
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=944429322}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=994717560}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456231, tv_nsec=994734541}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=45116513}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=45146218}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066749] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066749] futex(0x7f3d340cec28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1066749] getrusage(RUSAGE_THREAD, {ru_utime={tv_sec=4, tv_usec=485045}, ru_stime={tv_sec=1, tv_usec=888009}, ...}) = 0
[pid 1066749] futex(0x7f3d340cec78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=322073722}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=95507682}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=95522138}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066956] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066759] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 1066956] futex(0x7f3c40673928, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066759] futex(0x7f3d341ae828, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1066956] <... futex resumed>)      = 0
[pid 1066759] <... futex resumed>)      = 0
[pid 1066956] futex(0x7f3c40673978, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=145870548}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1066759] futex(0x7f3d341ae87c, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=1456232, tv_nsec=145884474}, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 1066743 detached

The error log is :

{"notices":[{"code":"runtime_exception_in_validator_error","severity":"ERROR","totalNotices":8,"sampleNotices":[{"validator":"org.mobilitydata.gtfsvalidator.validator.ShapeToStopMatchingValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.GtfsStopTimeTripIdForeignKeyValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.LocationHasStopTimesValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.TripUsabilityValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.GtfsStopTimeStopIdForeignKeyValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.TripUsageValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.StopTimeTravelSpeedValidator","exception":"java.lang.NullPointerException","message":""},{"validator":"org.mobilitydata.gtfsvalidator.validator.BlockTripsWithOverlappingStopTimesValidator","exception":"java.lang.NullPointerException","message":""}]},{"code":"thread_execution_error","severity":"ERROR","totalNotices":1,"sampleNotices":[{"exception":"java.lang.OutOfMemoryError","message":"Java heap space"}]}]}

Environment versions

  • validator version: v3.0.0
  • Java version:
    openjdk 11.0.13 2021-10-19
    OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
    OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
  • OS versions: Ubuntu 20.04.3 LTS
@gcamp gcamp added the bug Something isn't working (crash, a rule has a problem) label Dec 20, 2021
@github-actions
Copy link
Contributor

Thank you for your reporting a bug. The issue has been placed in triage, the MobilityData team will follow-up on it.

@gcamp
Copy link
Author

gcamp commented Dec 20, 2021

Update : we were able to reproduce every time by forcing a memory limit in a cgroup in linux. Using a limit of 500MB and this feed reproduces the bug every time. Using 100MB, the problem is not reproduced and the process is killed by the kernel.

Using cgroups and a memory limit that's big enough to start the process but big enough to make it run all the rules is enough to reproduce the problem.

@asvechnikov2
Copy link
Collaborator

Hi Guillaume,

Thanks for reporting this issue! From a brief look, it seems the validator is failing with OutOfMemoryError, I can reproduce it by adding -Xmx500m java argument.

java -Xmx500m -jar gtfs-validator-v3.0.0_cli.jar -u https://data.bus-data.dft.gov.uk/timetable/download/gtfs-file/south_west/ -o output

I can see that the feed you provided has ~600MB in size. We didn't do heavy memory optimizations and on some bigger feeds 3-6GB should be enough. We would recommend to reserve at least 3GB of memory, this could be done by passing -Xms3g -Xmx3g arguments.

Could you please provide a bit more context to make sure that we understand the problem correctly?

  • What command line do you use to start the validator?
  • What do you mean by deadlock? Does the validator get stuck and it must be terminated manually?
  • What's the expected behaviour?

@gcamp
Copy link
Author

gcamp commented Jan 25, 2022

@asvechnikov2 yes, the validator deadlocks. It rans out of memory but the problem is actually the deadlock : it hits a point where it's not running anything but never returns or crash. An OutOfMemory crash is actually the expected behaviour.

java -Xmx500m -jar gtfs-validator-v3.0.0_cli.jar -i path_to_gtfs -o output is what we use.

More context : we use cgroups in our data pipeline to make sure a process doesn't use more memory than it's expected to use. We use historical running data to know how much memory to allocate, so if a feed X used 100MB the last time it ran, we'll use a limit of 100MB + a buffer to avoid slowdown on the server.

To reproduce the bug you really need to use cgroups memory limits of the same size of the -Xmx argument.

@maximearmstrong
Copy link
Contributor

maximearmstrong commented Feb 21, 2022

Hi @gcamp !

Thanks for the details. Unfortunately, we were not able to reproduce the exact same behavior using cgroup + -Xmx, where the validator deadlocks. In every test run, the process is killed when a memory problem occurs. Could you provide a code snippet to see if your use of cgroup differs from ours?

That being said, like you, we noticed that using cgroup causes problems, so we ran a script to compare cgroup + -Xmx against -Xmx alone. We ran our tests on a DigitalOcean VM with the following specifications:

  • OpenJDK version “11.0.13” 2021-10-19
  • OpenJDK Runtime Environment (build 11.0.13+8-Ubuntu-0ubuntu1.20.04)
  • OpenJDK 64-Bit Server VM (build 11.0.13+8-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)

Here are our observations. Our scripts and results can be found in this archive.

  • For cgroup + -Xmx: once the memory limit set for the cgroup is reached, the process is killed and the validator cannot even generate its validator and system reports.
  • For -Xmx: the validator fails due to an OutOfMemoryError when the memory limit is reached, but is able to generate its reports.
    deadlock

It seems pretty clear that using cgroup with -Xmx causes a problem at this point. We define as uncompleted processes all processes where the validator was unable to generate its reports before being killed, including errors other than memory errors.

As the figure above suggests, I suggest you increase the memory limit of your cgroup to at least 3.5gb when using the validator, or drop cgroup and use only -Xmx. We realize that this may not be the best solution for your current use case. Would this be an option for you?

Our understanding is that the problem may be caused by cgroup itself rather than the validator, but this is still uncertain. Optimizing the validator memory usage is in our backlog, although it is not planned in the short term. If there is a way for you to provide us with more details on how to reproduce the specific deadlock bug, please let us know so we can dig deeper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working (crash, a rule has a problem)
Projects
None yet
Development

No branches or pull requests

3 participants