Life Of Navin

Random Musings, Random Bullshit.

kc

Debugging Running JVM Without Restart

Debugging a process actively running on a JVM without restarting the application and setting up flags and attaching a debugger is a a bit of a pain. However, there's quite a bit you can do to debug a running process without needing a JVM restart.

This. But with less enthusiasm. Much less enthusiasm.
Here's a quick walkthrough, with the associated commands. I'm simply compiling the steps to allow a single lookup for everyone else who has to walk down this road.
  • Ensure that it is the JVM which is consuming too many resources. A combination of free and sorting ps output should do the trick.  free tells us current memory consumption, while ps gives us process level statistics
$ free -m
                            total        used   free   shared   buffers   cached
  Mem:               15040    13956  1083      0          84         890
  -/+ buffers/cache: 12980 2059
  Swap:                  0            0        0

$ ps -eo pmem,pcpu,vsize,pid,cmd | sort -k 1 -nr | head -5 

69.8  67.5 8180544 24204 java -Duser.dir=[...]
  7.5   5.6 1550956 4866 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
  7.4   6.3 1542356 4848 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
  7.4   6.0 1548132 4863 python /mnt/manage.py run_gunicorn -c /mnt/conf.py
  7.4   5.8 1537104 4869 python /mnt/manage.py run_gunicorn -c /mnt/conf.py

Wow, so we seem to have barely a GB of memory free and a resource hog java process and a bunch of gunicorn processes. Let's take a look at the Java process shall we? The pid is 24204 (highlighted in previous command)
  • Let's find thread utilization of resources. ps to the rescue again!
$ ps -mo 'pid lwp stime time pcpu' -p 24204
        PID   LWP   STIME   TIME   %CPU
    24204       -      16:12   00:32:41   17.4
          -   24204   16:12   00:00:00   0.0
          -   24255   16:12   00:00:03   0.0
          -   24256   16:12   00:01:11   0.6
          -   24257   16:12   00:01:11   0.6
          -   24258   16:12   00:01:11   0.6
          -   24259   16:12   00:01:11   0.6
          -   24260   16:12   00:06:34  33.5
          -   24261   16:12   00:00:11   0.0
          -   24262   16:12   00:00:00   0.0
          -   24263   16:12   00:00:00   0.0
          -   24264   16:12   00:00:00   0.0
          -   24265   16:12   00:00:00   0.0
          -   24266   16:12   00:00:43   0.3
Interesting. It looks like a single issue here. How do we trace this? Why, we take a couple of thread dumps of course! We use jstack for this. jstack simply produces a thread dump of the given process.

jstack 24204 > 1.dump
jstack 24204 > 2.dump
jstack 24204 > 3.dump

I usually take multiple jstacks just for further analysis. You may also need to run jstack as sudo as sudo -u user jstack <pid>

  • Now we have the thread dump and the misbehaving thread, so we can check what that thread is doing. Since thread id in the previous command (24260) is a decimal, we first convert it into hex because thread dumps use hex ids. To do this we simply use bc to convert the number, and then tr to transform the uppercase into lowercase. 
$ echo "obase=16; 24260" | bc | tr '[:upper:]' '[:lower:]'
5ec4  # this is the thread id in hex

$ vim 1.dump # Now search for the thread hex in the dump file
[...]
"Concurrent Mark-Sweep GC Thread" os_prio=0 tid=0x00007fea2c083000 nid=0x5ec4 runnable
[...]
So it turns out in this specific case that the GC is taking up a massive number of CPU cycles. The solutions for this are numerous, and may vary depending on what your constraints are. The other issues may be related to database access, I/O access, or, quite frankly, a million other reasons relating to badly written code.

To be honest, this method of debugging does have it's constraints, but can be a godsend when you need to quickly validate your thinking when it comes to resource hog bugs. And of course, you can celebrate by posting memes on slack!

At Bloomreach, we prefer :patre: #insidejoke
Hope this is helpful! Until the next b̶u̶g̶ feature...  :D

PS: Yes, I know the first couple of steps can be done using htop as well, using htop in thread view. I just find ps so much more convenient


0 comments :

Twitterverse

Prologue

Finally after all these years, here's to the beginning of what was there, what is there and hopefully what will remain!! So here are my thoughts & words -Online!!