Discussion:
JVM hanging when using G1GC on JDK8 b78 or b79 (Linux 32 bit)
Krystal Mok
2013-03-06 06:07:43 UTC
Permalink
Hi Uwe,

If you can attach gdb to it, then jstack -m and jstack -F should also
work; that'll get you the Java stack trace.
(But it probably doesn't matter in this case, because the hang is
probably a bug in the VM.)

- Kris
Hi,
For a few months now we have been extensively testing various preview builds of JDK 8 for compatibility with Apache Lucene and Solr, so that we can find bugs early and avoid the problems we had with the release of Java 7 two years ago. Currently we have a Linux (Ubuntu 64 bit) Jenkins machine with various JDKs installed (JDK 6, JDK 7, JDK 8 snapshot, IBM J9, older JRockit); every run of the test suite (which takes approx. 30-45 minutes) picks a different one with different HotSpot and garbage collector settings.
JDK 8 b79 works very well on Linux so far. We found some strange behavior in early versions (maybe compiler errors), but no longer at the moment. There is one configuration that consistently and reproducibly hangs in one of the tested modules: JDK 8 b79 (same for b78), 32 bit, with G1GC (server or client does not matter). The JVM running the tests hangs and becomes unresponsive: jstack and kill -3 have no effect or cannot connect, a standard kill does not stop it, and only kill -9 actually kills it. It can be reproduced in this Lucene module 100% of the time.
I was able to connect to the JVM with GDB and get a stack trace of all threads (see attachment, dump.txt). As you can see, all G1GC threads seem to hang in a syscall (os::park(), a condition wait in the pthread library). Unfortunately that's all I can give you. A Java stack trace is not possible because the JVM responds to neither kill -3 nor jstack. With all other garbage collectors the test passes without hangs in a few seconds; with 32 bit G1GC it can stand still for hours. The 64 bit JVM passes with G1GC, so only the 32 bit variant is affected. Client or Server VM makes no difference.
- Use a 32 bit JDK 8 b78 or b79 (tested on Linux 64 bit, but this should not matter)
- Download Lucene Source code (e.g. the snapshot version we were testing with: https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/)
ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
After a while the test framework prints "stalled" messages (because the child VM actually running the test no longer responds). The PID is also printed. If you try to get a stack trace or kill it, there is no response; only kill -9 helps. Choosing another garbage collector in the above command line makes the test finish after a few seconds, e.g. -Dargs="-server -XX:+UseConcMarkSweepGC".
I posted this bug report directly to the mailing list because, with earlier bug reports, there seems to be a problem with bugs.sun.com: there is no response from any reviewer after several weeks, and we were able to help find and fix javadoc and javac compiler bugs early. So I hope you can help with this bug, too.
Uwe
-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
David Holmes
2013-03-06 07:52:26 UTC
Permalink
If the VM is completely unresponsive then it suggests we are at a safepoint.

The GC threads are not "hung" in os::park, they are parked - waiting to
be notified of something.

The thing is to find out why they are not being woken up.

Can the gdb log be posted somewhere? I don't know if the attachment made
it to the original posting on hotspot-gc but it's no longer available on
hotspot-dev.

Thanks,
David
Dawid Weiss
2013-03-06 07:55:28 UTC
Permalink
Here you go:
http://pastebin.com/raw.php?i=b2PHLm1e

Dawid
David Holmes
2013-03-06 10:23:10 UTC
Permalink
Post by Dawid Weiss
http://pastebin.com/raw.php?i=b2PHLm1e
Thanks. I would have to say this seems to be the suspicious part:

Thread 22 (Thread 0xf20ffb40 (LWP 22939)):
#0 0xf7743430 in __kernel_vsyscall ()
#1 0xf771e96b in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i386-linux-gnu/libpthread.so.0
#2 0xf6ec849c in os::PlatformEvent::park() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#3 0xf6e98b82 in Monitor::IWait(Thread*, long long) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#4 0xf6e99370 in Monitor::wait(bool, long, bool) () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#5 0xf6b5fb16 in SuspendibleThreadSet::join() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#6 0xf6b5ea41 in ConcurrentG1RefineThread::run_young_rs_sampling() () from /var/lib/jenkins/tools/java/32bit/jdk1.8.0-ea-b79/jre/lib/i386/server/libjvm.so
#7 0xf6b5ef91 in ConcurrentG1RefineThread::run() ()

The suspendible thread set logic looks "tricky". Time for the G1 experts
to take over. :)

David
Thomas Schatzl
2013-03-06 11:18:08 UTC
Permalink
Hi,
Post by David Holmes
Post by Dawid Weiss
http://pastebin.com/raw.php?i=b2PHLm1e
[...]
#6 0xf6b5ea41 in ConcurrentG1RefineThread::run_young_rs_sampling() ()
[...]
The suspendible thread set logic looks "tricky". Time for the G1 experts
to take over. :)
The young gen RS (remembered set) sampling thread does some statistical
updates while the application is running, so that less work needs to be
done in the STW pause.

At a safepoint it is always suspended; this is normal.

As Bengt mentioned, the problem seems to be thread 10, which is the VM
thread (the one responsible for bringing everything to a safepoint and
then distributing work).

According to the stack trace, this thread seems to be waiting for
synchronization with the marking threads because of a mark stack
overflow during weak reference processing.

However all marking threads are already waiting due to the safepointing
operation, and so it waits endlessly.

As Bengt mentioned, this thread shouldn't be waiting, and shouldn't need
to because it seems to be the only thread working on weak references
anyway (i.e. this phase is single threaded).

(All imo)

Thomas
Bengt Rutisson
2013-03-06 12:08:17 UTC
Permalink
Hi all,

I sent this email earlier, but I did "reply list" instead of "reply
all". Sorry about that.

The hang is due to the fact that we are using single threaded reference
processing but end up in the multi threaded code path and get stuck in a
loop that waits for the other processing threads to terminate.

John Cuthbertson is working on a fix for this. I think we have all the
information we need to solve this.

Bengt
David,
I think this is a VM bug and the thread dumps that Uwe produced are
enough to start tracking down the root cause.
Post by David Holmes
If the VM is completely unresponsive then it suggests we are at a safepoint.
Yes, we are hanging during a stop-the-world GC, so we are at a safepoint.
Post by David Holmes
The GC threads are not "hung" in os::park, they are parked - waiting
to be notified of something.
It looks like the reference processing thread is stuck in a loop where
it does wait(). So, the VM is hanging even if that stack trace also
ends up in os::park().
Post by David Holmes
The thing is to find out why they are not being woken up.
Actually, in this case we should probably not even be calling wait...
Post by David Holmes
Can the gdb log be posted somewhere? I don't know if the attachment
made it to the original posting on hotspot-gc but it's no longer
available on hotspot-dev.
I received the attachment with the original email. I've attached it to
the bug report that I created: 8009536. You can find it there if you
want to. But I think we have a fairly good idea of what change caused
the hang.
Bengt
Uwe Schindler
2013-03-06 12:49:07 UTC
Permalink
Hi Bengt,

That was fast! We are happy that you were able to analyze the bug and will fix it soon. To keep our Jenkins server from getting stuck in the tests, I will disable G1GC until a new update is installed. Until then we will only test the other garbage collectors with Lucene.

Do you have any idea why this bug does not appear on 64 bit? It might be caused by different GC behavior because the word size is different (the Lucene tests use -Xmx512M, so the heap size is currently the same on 32 and 64 bit). I just want to understand this! I can run the test suite with the 64 bit JDK over and over and it never hangs, but with 32 bit it hangs every time.

Uwe

-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
Thomas Schatzl
2013-03-06 13:43:04 UTC
Permalink
Hi,
Post by Uwe Schindler
Do you have any idea why this bug does not appear on 64 bit? It might be caused by different GC behavior because the word size is different (the Lucene tests use -Xmx512M, so the heap size is currently the same on 32 and 64 bit). I just want to understand this! I can run the test suite with the 64 bit JDK over and over and it never hangs, but with 32 bit it hangs every time.
One possible reason is that the default mark stack size is much larger
on 64 bit, so no mark stack overflow occurs.

E.g. in globals.hpp:

product(uintx, MarkStackSizeMax, NOT_LP64(4*M) LP64_ONLY(512*M), \

You may want to try to set MarkStackSizeMax to 4M on 64 bit too to test
this.

This is just a hunch though.

Thomas
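To compare what a given 32 bit and 64 bit VM actually use for the two flags mentioned above, one option is to read them at runtime. The following is only a minimal sketch, assuming a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name MarkStackFlags and the helper flagValue are made up for illustration and are not part of the thread or of HotSpot.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not from the thread): prints the mark stack flags of the running VM.
public class MarkStackFlags {

    // Returns the current value of a VM option, or "unavailable" if this VM
    // does not know the option or does not provide the diagnostic bean.
    public static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            // getVMOption throws IllegalArgumentException for unknown option names.
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        // sun.arch.data.model is typically "32" or "64" on HotSpot; may be null elsewhere.
        System.out.println("data model       = " + System.getProperty("sun.arch.data.model"));
        System.out.println("MarkStackSize    = " + flagValue("MarkStackSize"));
        System.out.println("MarkStackSizeMax = " + flagValue("MarkStackSizeMax"));
    }
}
```

Running this once with the 32 bit and once with the 64 bit JDK should show whether the defaults really differ on the builds in question. For the experiment itself, adding something like -XX:MarkStackSizeMax=4M to the -Dargs string of the ant command is the obvious thing to try, although that combination has not been verified here.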
John Cuthbertson
2013-03-06 18:04:16 UTC
Permalink
Hi Everyone,

All:
I've looked at the bug report (haven't tried to reproduce it yet) and
Bengt's analysis is correct. The concurrent mark thread is entering the
synchronization protocol in a marking step call. That code is waiting
for some non-existent workers to terminate before proceeding. Normally
we shouldn't be entering that code but I think we overflowed the global
marking stack (I updated the CR at ~1am my time with that conjecture). I
think I missed a set_phase() call to tell the parallel terminator that
we only have one thread and it's picking up the number of workers that
executed the remark parallel task.

Thomas: you were on the right track with your comment about the marking
stack size.

David:
Thanks for helping out here. The stack trace you mentioned was for one
of the refinement threads - a concurrent GC thread. When a concurrent GC
thread "joins" the suspendible thread set, it means that it will observe
and participate in safepoint operations, i.e. the thread will notice
that it should reach a safepoint and the safepoint synchronizer code
will wait for it to block. When we wish a concurrent GC thread to not
observe safepoints, that thread leaves the suspendible thread set. I
think the name could be a bit better and Tony, before he left, had a
change that used a scoped object to join and leave the STS that hasn't
been integrated yet. IIRC Tony wasn't happy with the name he chose for
that also.

Uwe:
Thanks for bringing this up and my apologies for not replying sooner. I
will have a fix fairly soon. If I'm correct about it being caused by
overflowing the marking stack you can work around the issue by
increasing the MarkStackSize. You could try increasing it to 2M or 4M
entries (which is the current max size).

Cheers,

JohnC
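The kind of mismatch described above (one thread entering a synchronization protocol that still expects the worker count of an earlier parallel task) can be illustrated outside the VM. The sketch below is not HotSpot code and does not use the real parallel terminator; it is an analogy built on java.util.concurrent.CyclicBarrier, and the class name TerminationMismatch and the method loneThreadPasses are invented for this example. A timeout stands in for the endless wait so that the program terminates.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Analogy only: a barrier sized for N parties never trips if just one thread arrives.
public class TerminationMismatch {

    // Returns true if a single thread gets past a barrier that expects
    // `expectedParties` arrivals, false if it is still waiting after the timeout.
    public static boolean loneThreadPasses(int expectedParties, long timeoutMillis) {
        CyclicBarrier barrier = new CyclicBarrier(expectedParties);
        try {
            barrier.await(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | BrokenBarrierException e) {
            // Nobody else arrived in time; without the timeout this would wait forever.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Barrier told about one thread: the lone thread proceeds immediately.
        System.out.println("expected=1 -> passes: " + loneThreadPasses(1, 200));
        // Barrier still sized for four workers: the lone thread is stuck.
        System.out.println("expected=4 -> passes: " + loneThreadPasses(4, 200));
    }
}
```

In this analogy, the missing set_phase() call corresponds to constructing the barrier with the stale party count instead of 1.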
Uwe Schindler
2013-03-06 18:50:35 UTC
Permalink
Hi John,

Thanks for the response and the analysis, very informative!
Post by John Cuthbertson
Thanks for bringing this up and my apologies for not replying sooner. I will
have a fix fairly soon. If I'm correct about it being caused by overflowing the
marking stack you can work around the issue by increasing the
MarkStackSize. You could try increasing it to 2M or 4M entries (which is the
current max size).
Is there a setting on the command line to raise this size? It would also be great to check whether one can do the opposite (lower the size on a 64 bit JVM to make that one hang as well). Unfortunately, as a Java programmer I am not familiar with building the JVM on Ubuntu machines (including the needed IcedTea), so it's hard for me to try this out - I would not even know how to start, or how to end up with something like a standard JDK directory that could be used as JAVA_HOME.

If you need verification that your patch works, it would be good to get an i586 Linux tgz file with a binary, so I can do a quick check on the Jenkins server that found the bug. Otherwise we would need to wait until a new build appears on jdk8.java.net (including the fix + other fixes in the javadoc/javac tools and the class library that we reported earlier).

I could also assist in setting up a Lucene build directory (as described in the first email) to reproduce the problem with the Lucene source code (which is very easy). As said before, I have no isolated test case :(

Thanks in any case,
Uwe
John Cuthbertson
2013-03-06 18:56:06 UTC
Permalink
Hi Uwe,

You must have been reading my mind. See inline....
Post by Uwe Schindler
Hi John,
Thanks for the response and the analysis, very informative!
Post by John Cuthbertson
Thanks for bringing this up and my apologies for not replying sooner. I will
have a fix fairly soon. If I'm correct about it being caused by overflowing the
marking stack you can work around the issue by increasing the
MarkStackSize. You could try increasing it to 2M or 4M entries (which is the
current max size).
Is there a setting on the command line to raise this size? It would also be great to check whether one can do the opposite (lower the size on a 64 bit JVM to make that one hang as well). Unfortunately, as a Java programmer I am not familiar with building the JVM on Ubuntu machines (including the needed IcedTea), so it's hard for me to try this out - I would not even know how to start, or how to end up with something like a standard JDK directory that could be used as JAVA_HOME.
Use: -XX:MarkStackSize=4M to increase the marking stack size in a 32
bit run.
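One way to confirm that such a flag actually took effect is to ask the running JVM for its effective value. The following is only a minimal sketch and not something posted in this thread: the class name MarkStackFlagCheck is made up for illustration, and it assumes a HotSpot JVM that exposes com.sun.management.HotSpotDiagnosticMXBean. Flags the VM does not know are reported as "n/a".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints the effective values of the flags discussed above.
class MarkStackFlagCheck {

    // Returns the effective value of a HotSpot flag, or "n/a" if the flag
    // (or the diagnostic MXBean itself) is not available in this JVM.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "n/a";
            }
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the VM has no option with this name.
            return "n/a";
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UseG1GC", "MarkStackSize", "MarkStackSizeMax"}) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

Started with the same -XX options as the failing run, it prints the values the VM settled on after option processing.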
Post by Uwe Schindler
If you need a verification that your patch is working, it would be good to get a i586 Linux tgz file with a binary, so I can do a quick check on the Jenkins server that found the bug. Otherwise we would need to wait until a new build appears on jdk8.java.net (including the fix + other fixes in javadoc/javac tool and the class library that we reported earlier).
I could also assist in setting up a Lucene build directory (as reported on the first email), to reproduce the problem with the Lucene source code (which is very easy). As said before, I have no isolated test case :(
I just sent you email. I downloaded a zip file that contains all the jar
files. I don't have ant on my system so ideally I'm looking for a java
command line to tickle the crash. Can you help?

Thanks,

JohnC
Uwe Schindler
2013-03-06 19:17:50 UTC
Permalink
Hi,
Post by John Cuthbertson
Use: -XX:MarkStackSize=4M to increase the marking stack size in a 32 bit run.
I will give it a quick try!
Post by John Cuthbertson
I just sent you email. I downloaded a zip file that contains all the jar files. I
don't have ant on my system so ideally I'm looking for a java command line to
tickle the crash. Can you help?
I responded. Unfortunately, the binary Lucene distribution does not contain the tests.... I will try to set something up and share via a download link from my dropbox.
-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
Uwe Schindler
2013-03-06 19:31:56 UTC
Permalink
Hi,
Post by John Cuthbertson
Use: -XX:MarkStackSize=4M to increase the marking stack size in a 32 bit
run.
Post by Uwe Schindler
I will give it a quick try!
4M was too much for a 32 bit JVM (it complained about it), but 2M was fine. With that setting the tests went through as they should (in 22 secs on this server). With the default setting the run stalled endlessly.

To test the inverse (make 64 bit hang): What's the default mark stack size of 32 bit JVMs, so I can set it on 64 bit to make it hang?

Uwe
Post by Uwe Schindler
-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
Thomas Schatzl
2013-03-06 19:44:27 UTC
Permalink
Hi,
Post by Uwe Schindler
Post by John Cuthbertson
Use: -XX:MarkStackSize=4M to increase the marking stack size in a 32 bit
run.
I will give it a quick try!
4M was too much for a 32 bit JVM (it complained about it), but 2M was fine. With that setting the tests went through as they should (in 22 secs on this server). With the default setting the run stalled endlessly.
Maybe you have to set -XX:MarkStackSizeMax as well, but it does not
matter now I guess.
Post by Uwe Schindler
To test the inverse (make 64 bit hang): What's the default stack size of 32 bit JVMs, so I can set it on 64 bit to make it hang?
Use -XX:+PrintFlagsFinal on the 32 bit VM to get this value. The flag
prints a list of all effective flag values after option processing.
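Since -XX:+PrintFlagsFinal prints several hundred lines, it can help to filter the output for the mark stack flags. The sketch below is an illustration only: the class PrintFlagsFilter and its method names are made up, the line format it expects (type, name, "=" or ":=", value, then a "{...}" category) is an assumption based on typical HotSpot output, and readAllBytes needs Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the *MarkStack* flags from PrintFlagsFinal output.
class PrintFlagsFilter {

    // Assumed line shape: "<type> <name> = <value> {<category>}" (":=" marks a modified flag).
    private static final Pattern FLAG_LINE =
        Pattern.compile("^\\s*(\\w+)\\s+(\\w+)\\s*:?=\\s*(\\S*)\\s*\\{.*$");

    // Returns flag name -> value for every flag whose name contains "MarkStack".
    static Map<String, String> markStackFlags(String output) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            Matcher m = FLAG_LINE.matcher(line);
            if (m.matches() && m.group(2).contains("MarkStack")) {
                result.put(m.group(2), m.group(3));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Launch the same JVM that runs this program and capture its flag listing.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        Process p = new ProcessBuilder(javaBin, "-XX:+PrintFlagsFinal", "-version")
            .redirectErrorStream(true)
            .start();
        String output = new String(p.getInputStream().readAllBytes());
        p.waitFor();
        markStackFlags(output).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

To inspect the 32 bit VM, the program has to be run with that VM, because it launches the JVM it is running on.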

Hth,
Thomas
Uwe Schindler
2013-03-06 20:10:34 UTC
Permalink
Post by Thomas Schatzl
Post by Uwe Schindler
To test the inverse (make 64 bit hang): What's the default stack size of 32
bit JVMs, so I can set it on 64 bit to make it hang?
Use -XX:+PrintFlagsFinal on the 32 bit VM to get this value. The flag prints a
list of all effective flag values after option processing.
Thanks. Unfortunately I could not get the 64 bit JVM to hang, not even with a 1K mark stack size. It could be because Lucene or the UIMA library used in this test picks other defaults and behaves differently GC-wise, or - as John mentioned - because the task queue size is different.

I was able to check some values in 32 bit: with the default mark stack size of 32K it hangs, with 64K it also hangs, and with 92K it hangs as well. But with 128K the test passes.
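Checking several sizes by hand means starting the test, waiting, and using kill -9 when it stalls. That loop could be automated with a small probe. This is a sketch under assumptions, not something used in this thread: HangProbe and finishesWithin are hypothetical names, the timeout is whatever the caller picks, the child command is passed in by the caller (the full test command line is not given here), and Redirect.DISCARD needs Java 9 or later.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a command and force-kills it if it does not finish in time.
class HangProbe {

    // Returns true if the command exited within the timeout,
    // false if it was still running and had to be killed.
    static boolean finishesWithin(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process child = new ProcessBuilder(command)
            .redirectErrorStream(true)
            .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
            .start();
        if (child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true;
        }
        // A hung VM ignores a normal kill, so terminate it forcibly.
        child.destroyForcibly();
        child.waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: HangProbe <timeoutSeconds> <command> [args...]");
            return;
        }
        long timeout = Long.parseLong(args[0]);
        List<String> command = Arrays.asList(args).subList(1, args.length);
        System.out.println(finishesWithin(command, timeout) ? "finished" : "stalled, killed");
    }
}
```

Calling it once per candidate value of -XX:MarkStackSize in the child command gives a pass/stall result for each size.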

Uwe
John Cuthbertson
2013-03-06 20:02:36 UTC
Permalink
Hi Uwe,

The default mark stack size for 32 bit is 32K entries. So try
-XX:MarkStackSize=32K. This might not overflow because the local marking
task queues are larger in the 64 bit JVM so there will be less pressure
on the global mark stack. Unfortunately the task queue size is a
constant (actually a template constant: 16K for 32 bit and 128K
for 64 bit). You might have to go lower.
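The interplay between the local task queue and the global mark stack can be illustrated with a toy model. This is emphatically not HotSpot's code: the class MarkStackModel is invented for this sketch, it uses a single marking task that never drains anything, and the workload of 100K pending entries is an arbitrary number picked for the example. Only the capacities (16K and 128K local, 32K and 128K global) come from this thread.

```java
import java.util.ArrayDeque;

// Toy model: entries go to the bounded local queue first, spill to the
// bounded global stack, and set an overflow flag when both are full.
class MarkStackModel {
    private final int localCapacity;
    private final int globalCapacity;
    private final ArrayDeque<Integer> localQueue = new ArrayDeque<>();
    private final ArrayDeque<Integer> globalStack = new ArrayDeque<>();
    private boolean overflowed;

    MarkStackModel(int localCapacity, int globalCapacity) {
        this.localCapacity = localCapacity;
        this.globalCapacity = globalCapacity;
    }

    void push(int entry) {
        if (localQueue.size() < localCapacity) {
            localQueue.push(entry);
        } else if (globalStack.size() < globalCapacity) {
            globalStack.push(entry);
        } else {
            overflowed = true; // nowhere left to put the entry
        }
    }

    // True if pushing the given number of entries overflows both structures.
    static boolean overflows(int localCapacity, int globalCapacity, int pendingEntries) {
        MarkStackModel model = new MarkStackModel(localCapacity, globalCapacity);
        for (int i = 0; i < pendingEntries; i++) {
            model.push(i);
        }
        return model.overflowed;
    }

    public static void main(String[] args) {
        final int K = 1024;
        int pending = 100 * K; // made-up workload, for illustration only
        System.out.println("32 bit queue, 32K global stack:  " + overflows(16 * K, 32 * K, pending));
        System.out.println("32 bit queue, 128K global stack: " + overflows(16 * K, 128 * K, pending));
        System.out.println("64 bit queue, 32K global stack:  " + overflows(128 * K, 32 * K, pending));
    }
}
```

In this model a larger local queue absorbs work that would otherwise land on the global stack, which is why the same small global size can overflow with one queue size and not with the other.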

JohnC
Uwe Schindler
2013-03-06 20:16:56 UTC
Permalink
Hi again,

If you don't get it running, I can do the following:

I can set it up locally in /tmp, then build and test the whole Lucene library including the tests. I could then TAR it up (it might be large, approx. 100 MB) and send it to you via Dropbox or any other HTTP download. The JVM command line uses absolute paths for all JARs and other settings, but if you unpack the whole thing to /tmp, you can reuse the command line.

Just tell me whether you were able to set it up; otherwise I can quickly TAR up the whole compiled directory for you and give you the command line from the debugging output. If you unpack to another directory you might need to edit the absolute paths in the command line (which are generated by ANT).

Uwe

-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
-----Original Message-----
From: John Cuthbertson [mailto:john.cuthbertson at oracle.com]
Sent: Wednesday, March 06, 2013 8:21 PM
To: Uwe Schindler
Cc: 'Bengt Rutisson'; hotspot-gc-dev at openjdk.java.net;
dev at lucene.apache.org
Subject: Re: JVM hanging when using G1GC on JDK8 b78 or b79 (Linux 32 bit)
Hi Uwe,
Let me try with your detailed instructions below before you go to all of that
trouble. I will let you know how I get on.
Thanks,
JohnC
Hi,
That's unfortunately not so easy, because of project dependencies. To run
the test you have to compile Lucene Core then the specific module + the test
framework (which is special for Lucene) and download some JARs from
Maven central (JAR hell, as usual).
If you give me some time, I would collect all needed JAR files from my local
checkout and provide you the correct cmd line + a ZIP file with maybe a shell
script to startup. It should be doable, but needs some work to collect all
dependencies for the classpath.
- Download ANT 1.8.2 binary zip (unfortunately ANT 1.8.4 has a bug making
http://archive.apache.org/dist/ant/binaries/apache-ant-1.8.2-bin.tar.gz - I
just wonder about the fact: isn't ANT needed to build the JDK classlib by
itself? I remember that the FreeBSD OpenJDK build downloads ANT and does
a large part of the compilation using ANT...
- put the ANT bin/ dir into your PATH
https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/lucene-5.0-2013-03-05_15-37-06-src.tgz
- go to extracted lucene source dir, call "ant ivy-bootstrap" (this
will download Apache IVY, so all dependencies can be downloaded from
Maven Central)
- change to the module that fails: # cd analysis/uima
- execute: # ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
- In a parallel console you might be able to attach to the process, the build
in the main console using ANT runs inside ANT and the test framework
spawns separate worker instances of the JVM to execute the tests. This
makes it hard to reproduce in standalone (the command line passed to the
child JVM is veeeeery long).
I will work on putting together a precompiled ZIP file with all needed JARs +
the command line. Just tell me if you managed with the above howto,
then I don't need to do this.
Uwe
-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
-----Original Message-----
From: John Cuthbertson [mailto:john.cuthbertson at oracle.com]
Sent: Wednesday, March 06, 2013 7:51 PM
To: Uwe Schindler
Cc: 'Bengt Rutisson'; hotspot-gc-dev at openjdk.java.net;
dev at lucene.apache.org
Subject: Re: JVM hanging when using G1GC on JDK8 b78 or b79 (Linux 32
bit)
Hi Uwe,
I've downloaded lucene-5.0-2013-03-05_15-37-06.zip from
https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/
I don't have ant on my workstation so do you have a java command line
to run the test(s) that generate the error?
Thanks,
JohnC
Hi,
I think this is a VM bug and the thread dumps that Uwe produced are
enough to start tracking down the root cause.
I hope it is enough! If I can help with more details, tell me what I
should do
to track this down. Unfortunately, we have no isolated test case
(like a small java class that triggers this bug) - you have to run
the test cases of this Lucene's module. It only happens there, not in
any other Lucene test suite. It may be caused by a lot of GC activity in this
"UIMA" module or a specific test.
Post by David Holmes
If the VM is completely unresponsive then it suggests we are at a safepoint.
Yes, we are hanging during a stop-the-world GC, so we are at a
safepoint.
Post by David Holmes
The GC threads are not "hung" in os::park, they are parked -
waiting to be notified of something.
It looks like the reference processing thread is stuck in a loop
where it does wait(). So, the VM is hanging even if that stack
trace also ends up in os::park().
Post by David Holmes
The thing is to find out why they are not being woken up.
Actually, in this case we should probably not even be calling wait...
Post by David Holmes
Can the gdb log be posted somewhere? I don't know if the
attachment made it to the original posting on hotspot-gc but it's
no longer available on hotspot-dev.
I received the attachment with the original email. I've attached it
to the bug report that I created: 8009536. You can find it there if
you want to. But I think we have a fairly good idea of what change
caused the hang.
If it helps: Unfortunately, we had some problems with recent JDK
builds,
because javac and javadoc tools were not working correctly, failing
to build our source code. Since b78 this was fixed. Until this was
fixed, we used build
b65 (which was the last one working) and the G1GC hangs did not
appear on this version. So it must have happened by a change after b65 till
b78.
Uwe
Bengt
Post by David Holmes
Thanks,
David
Uwe Schindler
2013-03-05 21:48:40 UTC
Permalink
Hi,

For a few months we have been extensively testing various preview builds of JDK 8 for compatibility with Apache Lucene and Solr, so that we can find bugs early and prevent the problems we had with the release of Java 7 two years ago. Currently we have a Linux (Ubuntu 64 bit) Jenkins machine with various JDKs installed (JDK 6, JDK 7, JDK 8 snapshot, IBM J9, older JRockit); every run of the test suite (which takes approx. 30-45 minutes) picks a different one, with different HotSpot and garbage collector settings.

JDK 8 b79 works very well on Linux so far; we found some strange behavior in earlier versions (maybe compiler errors), but none at the moment. One configuration hangs consistently and reproducibly in one of the tested modules: JDK 8 b79 (same for b78), 32 bit, with G1GC (server or client does not matter). The JVM running the tests becomes unresponsive (jstack and kill -3 have no effect or cannot connect, a standard kill does not stop it, only kill -9 actually kills it). It can be reproduced in this Lucene module 100% of the time (it always hangs).

I was able to attach GDB to the JVM and get a stack trace of all threads (see attachment, dump.txt). As you can see, all G1GC threads seem to hang in a syscall (os::park(), a condition wait in the pthread library). Unfortunately that's all I can give you. A Java stack trace is not possible because the JVM reacts to neither kill -3 nor jstack. With all other garbage collectors the test passes without hangs in a few seconds; with 32 bit G1GC it can stand still for hours. The 64 bit JVM passes with G1GC, so only the 32 bit variant is affected. Client or Server VM makes no difference.

To reproduce:
- Use a 32 bit JDK 8 b78 or b79 (tested on Linux 64 bit, but this should not matter)
- Download Lucene Source code (e.g. the snapshot version we were testing with: https://builds.apache.org/job/Lucene-Artifacts-trunk/2212/artifact/lucene/dist/)
- change to directory lucene/analysis/uima and run:
ant -Dargs="-server -XX:+UseG1GC" -Dtests.multiplier=3 -Dtests.jvms=1 test
After a while the test framework prints "stalled" messages (because the child VM actually running the test no longer responds). The PID is also printed. Trying to get a stack trace or to kill the process gets no response; only kill -9 helps. Choosing another garbage collector in the above command line makes the test finish after a few seconds, e.g. -Dargs="-server -XX:+UseConcMarkSweepGC"

I posted this bug report directly to the mailing list because, with earlier bug reports, there seemed to be a problem with bugs.sun.com - there was no response from any reviewer for several weeks - whereas here we were able to help find and fix javadoc and javac compiler bugs early. So I hope you can help with this bug, too.

Uwe

-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dump.txt
Url: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20130305/d1680675/dump-0001.txt
Uwe Schindler
2013-03-06 10:12:09 UTC
Permalink
Thanks, I'll try to use jstack with -F or -m!

Uwe

-----
Uwe Schindler
uschindler at apache.org
Apache Lucene PMC Member / Committer
Bremen, Germany
http://lucene.apache.org/
-----Original Message-----
From: Krystal Mok [mailto:rednaxelafx at gmail.com]
Sent: Wednesday, March 06, 2013 7:08 AM
To: Uwe Schindler
Cc: hotspot-gc-dev at openjdk.java.net; hotspot-dev at openjdk.java.net;
Dawid Weiss; dev at lucene.apache.org
Subject: Re: JVM hanging when using G1GC on JDK8 b78 or b79 (Linux 32 bit)
Hi Uwe,
If you can attach gdb onto it, and jstack -m and jstack -F should also work;
that'll get you the Java stack trace.
(But it probably doesn't matter in this case, because the hang is probably bug
in the VM).
- Kris