JVM hanging when using G1GC on JDK8 b78 or b79 (Linux 32 bit)

Discussion:

Krystal Mok

2013-03-06 06:07:43 UTC

Hi Uwe,

If you can attach gdb to it, then jstack -m and jstack -F should also
work; that'll get you the Java stack trace.
(But it probably doesn't matter in this case, because the hang is
probably a bug in the VM.)

- Kris

Hi,
For a few months we have been extensively testing various preview builds of JDK 8 for compatibility with Apache Lucene and Solr, so that we can find bugs early and prevent the problems we had with the release of Java 7 two years ago. Currently we have a Linux (Ubuntu 64bit) Jenkins machine that has various JDKs (JDK 6, JDK 7, JDK 8 snapshot, IBM J9, older JRockit) installed, choosing a different one with different HotSpot and garbage collector settings on every run of the test suite (which takes approx. 30-45 minutes).
JDK 8 b79 has worked very well on Linux so far; we found some strange behavior in early versions (maybe compiler errors), but none at the moment. There is one configuration that consistently and reproducibly hangs in one of the tested modules: it uses JDK 8 b79 (same for b78), 32 bit, and G1GC (server or client does not matter). The JVM running the tests becomes unresponsive (jstack and kill -3 have no effect or cannot connect, a standard kill does not stop it, only kill -9 actually kills it). It can be reproduced in this Lucene module 100% of the time (it always hangs).
I was able to connect to the JVM with GDB and get a stack trace of all threads (see attachment, dump.txt). As you can see, all G1GC threads seem to hang in a syscall (os::park(), a condition wait in the pthread library). Unfortunately that's all I can give you. A Java stack trace is not possible because the JVM reacts to neither kill -3 nor jstack. With all other garbage collectors it passes the test without hangs in a few seconds; with 32 bit G1GC it can stand still for hours. The 64 bit JVM passes with G1GC, so only the 32 bit variant is affected. Client or Server VM makes no difference.
- Use a 32 bit JDK 8 b78 or b79 (tested on Linux 64 bit, but this should not matter)
- Download Lucene Source code (e.g. the snapshot version we were testing with: https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/)
ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
After a while the test framework prints "stalled" messages (because the child VM actually running the test no longer responds). The PID is also printed. Try to get a stack trace or kill it: there is no response. Only kill -9 helps. Choosing another garbage collector in the above command line makes the test finish after a few seconds, e.g. -Dargs="-server -XX:+UseConcMarkSweepGC"
I posted this bug report directly to the mailing list because, with earlier bug reports, there seems to be a problem with bugs.sun.com - there is no response from any reviewer after several weeks, and we were able to help find and fix javadoc and javac compiler bugs early. So I hope you can help with this bug, too.
Uwe
-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/

David Holmes

2013-03-06 07:52:26 UTC

If the VM is completely unresponsive then it suggests we are at a safepoint.

The GC threads are not "hung" in os::park, they are parked - waiting to
be notified of something.

The thing is to find out why they are not being woken up.

Can the gdb log be posted somewhere? I don't know if the attachment made
it to the original posting on hotspot-gc but it's no longer available on
hotspot-dev.

Thanks,
David
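
The distinction between "hung" and "parked" can be illustrated with a small Java analogy. This is only a sketch and not HotSpot code: the class and method names are made up for illustration. A thread that waits on a monitor nobody notifies is not spinning or crashed; it simply sits in the WAITING state until something wakes it up.

```java
public class ParkedThreadDemo {

    /** Starts a thread that waits on a monitor nobody notifies and reports its state. */
    static Thread.State observeWaiter() throws InterruptedException {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try {
                    // Loop guards against spurious wakeups; no one ever calls notify().
                    while (true) {
                        lock.wait();
                    }
                } catch (InterruptedException e) {
                    // The demo interrupts the waiter so that it can exit.
                }
            }
        }, "waiter");
        waiter.setDaemon(true);
        waiter.start();

        // Poll (bounded to 5 seconds) until the waiter has actually blocked.
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (waiter.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = waiter.getState();

        // Clean up: without this the waiter would stay blocked indefinitely.
        waiter.interrupt();
        waiter.join(1000);
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiter state: " + observeWaiter());
    }
}
```

The interesting question in such a state is never the waiting thread itself but which other thread was supposed to issue the wake-up.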

Dawid Weiss

2013-03-06 07:55:28 UTC

Here you go:
http://pastebin.com/raw.php?i=b2PHLm1e

Dawid

David Holmes

2013-03-06 10:23:10 UTC

Post by Dawid Weiss
http://pastebin.com/raw.php?i=b2PHLm1e

Thanks. I would have to say this seems to be the suspicious part:

Thread 22 (Thread 0xf20ffb40 (LWP 22939)):
#0 0xf7743430 in __kernel_vsyscall ()
#1 0xf771e96b in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i386-linux-gnu/libpthread.so.0
#2 0xf6ec849c in os::PlatformEvent::park() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#3 0xf6e98b82 in Monitor::IWait(Thread*, long long) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#4 0xf6e99370 in Monitor::wait(bool, long, bool) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#5 0xf6b5fb16 in SuspendibleThreadSet::join() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#6 0xf6b5ea41 in ConcurrentG1RefineThread::run_young_rs_sampling() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#7 0xf6b5ef91 in ConcurrentG1RefineThread::run() ()

The suspendible thread set logic looks "tricky". Time for the G1 experts
to take over. :)

David

Thomas Schatzl

2013-03-06 11:18:08 UTC

Hi,

Post by David Holmes

Post by Dawid Weiss
http://pastebin.com/raw.php?i=b2PHLm1e

[...]

Post by David Holmes
#6 0xf6b5ea41 in ConcurrentG1RefineThread::run_young_rs_sampling() ()
from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
The suspendible thread set logic looks "tricky". Time for the G1 experts
to take over. :)

The young gen rs sampling thread is a thread that does some statistical
updates while the application is running, so that less work needs to be
done in the STW pause.

At a safepoint it is always suspended; this is normal.
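
Why a thread blocked in join() is expected at a safepoint can be sketched in Java. This is a heavily simplified analogy, not the HotSpot SuspendibleThreadSet implementation: the class SuspendibleSet and its methods suspendAll() and resumeAll() are hypothetical names chosen for illustration. A concurrent helper thread that tries to join while a suspend request is active simply blocks until the request is lifted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SuspendibleSetDemo {

    /** Hypothetical, simplified stand-in for a suspendible thread set. */
    static final class SuspendibleSet {
        private boolean suspendAll = false;

        /** Called by a concurrent helper thread; blocks while a suspend request is active. */
        synchronized void join() throws InterruptedException {
            while (suspendAll) {
                wait();
            }
        }

        /** Called when a pause begins. */
        synchronized void suspendAll() {
            suspendAll = true;
        }

        /** Called when the pause ends; wakes up all blocked helpers. */
        synchronized void resumeAll() {
            suspendAll = false;
            notifyAll();
        }
    }

    /** Returns {blocked while suspended, ran after resume}. */
    static boolean[] demo() throws InterruptedException {
        SuspendibleSet set = new SuspendibleSet();
        CountDownLatch joined = new CountDownLatch(1);

        set.suspendAll(); // the "pause" starts before the helper tries to join

        Thread sampler = new Thread(() -> {
            try {
                set.join();         // blocks here for the duration of the pause
                joined.countDown(); // reached only after resumeAll()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sampler");
        sampler.setDaemon(true);
        sampler.start();

        boolean blockedWhileSuspended = !joined.await(200, TimeUnit.MILLISECONDS);
        set.resumeAll();
        boolean ranAfterResume = joined.await(5, TimeUnit.SECONDS);
        return new boolean[] { blockedWhileSuspended, ranAfterResume };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = demo();
        System.out.println("blocked while suspended: " + result[0]);
        System.out.println("ran after resume: " + result[1]);
    }
}
```

In this model the blocked sampler is a symptom of the pause never ending, not its cause.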

As Bengt mentioned, the problem seems to be thread 10, which is the VM
thread (the one responsible for bringing everything to a safepoint and
then distributing work).

According to the stack trace, this thread seems to be waiting for
synchronization with the marking threads because of a mark stack
overflow during weak reference processing.

However all marking threads are already waiting due to the safepointing
operation, and so it waits endlessly.

As Bengt mentioned, this thread shouldn't be waiting, and shouldn't need
to because it seems to be the only thread working on weak references
anyway (i.e. this phase is single threaded).

(All imo)

Thomas

Bengt Rutisson

2013-03-06 12:08:17 UTC

Hi all,

I sent this email earlier, but I did "reply list" instead of "reply
all". Sorry about that.

The hang is due to the fact that we are using single threaded reference
processing but end up in the multi threaded code path and get stuck in a
loop that waits for the other processing threads to terminate.
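
The shape of this failure can be sketched in Java. This is an illustrative analogy under stated assumptions, not the G1 reference processing code: the class Terminator and the method offerTermination are hypothetical stand-ins for a termination protocol that expects a fixed number of workers. The demo uses a timeout so that it finishes; the scenario described above would correspond to waiting without one.

```java
public class TerminationMismatchDemo {

    /** Hypothetical stand-in for a termination protocol shared by several workers. */
    static final class Terminator {
        private final int expectedWorkers;
        private int arrived = 0;

        Terminator(int expectedWorkers) {
            this.expectedWorkers = expectedWorkers;
        }

        /**
         * Registers the caller as finished and waits for all expected workers.
         * Returns true if everyone arrived, false if the timeout elapsed first.
         */
        synchronized boolean offerTermination(long timeoutMillis) throws InterruptedException {
            arrived++;
            notifyAll();
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (arrived < expectedWorkers) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // without a timeout this would wait forever
                }
                wait(remaining);
            }
            return true;
        }
    }

    /** Runs actualWorkers threads (including the caller) against a protocol sized for expectedWorkers. */
    static boolean run(int expectedWorkers, int actualWorkers, long timeoutMillis)
            throws InterruptedException {
        Terminator terminator = new Terminator(expectedWorkers);
        Thread[] helpers = new Thread[actualWorkers - 1];
        for (int i = 0; i < helpers.length; i++) {
            helpers[i] = new Thread(() -> {
                try {
                    terminator.offerTermination(timeoutMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            helpers[i].start();
        }
        boolean terminated = terminator.offerTermination(timeoutMillis);
        for (Thread helper : helpers) {
            helper.join();
        }
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("4 expected, 4 running: " + run(4, 4, 2000));
        System.out.println("4 expected, 1 running: " + run(4, 1, 200));
    }
}
```

When the number of running workers matches the number the protocol expects, every caller returns promptly. When a single thread enters a protocol sized for several workers, the count it waits for can never be reached.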

John Cuthbertson is working on a fix for this. I think we have all the
information we need to solve this.

Bengt

David,
I think this is a VM bug and the thread dumps that Uwe produced are
enough to start tracking down the root cause.

Post by David Holmes
If the VM is completely unresponsive then it suggests we are at a safepoint.

Yes, we are hanging during a stop-the-world GC, so we are at a safepoint.

Post by David Holmes
The GC threads are not "hung" in os::park, they are parked - waiting
to be notified of something.

It looks like the reference processing thread is stuck in a loop where
it does wait(). So, the VM is hanging even if that stack trace also
ends up in os::park().

Post by David Holmes
The thing is to find out why they are not being woken up.

Actually, in this case we should probably not even be calling wait...

Post by David Holmes
Can the gdb log be posted somewhere? I don't know if the attachment
made it to the original posting on hotspot-gc but it's no longer
available on hotspot-dev.

I received the attachment with the original email. I've attached it to
the bug report that I created: 8009536. You can find it there if you
want to. But I think we have a fairly good idea of what change caused
the hang.
Bengt

Uwe Schindler

2013-03-06 12:49:07 UTC

Hi Bengt,

That was fast! We are happy that you were able to analyze the bug and will fix it soon. To keep our Jenkins server from getting stuck in the tests, I will disable G1GC until a new update is installed. Until then we will only test the other garbage collectors with Lucene.

Do you have any idea why this bug does not appear on 64 bit? It might be caused by different GC behavior, as the word size is different (the Lucene tests use -Xmx512M, so the heap size is currently the same on 32 and 64 bit). I just want to understand this! I can run the test suite with a 64 bit JDK over and over and it never hangs, but with 32 bit it hangs every time.

Uwe

-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
