Avoid memory leaks when caching instance methods in Python classes

While working on a customer's project last week I stumbled across the problem that caching instance method results in Python classes easily creates memory leaks, especially in long-running processes.

Although Python comes with built-in caching functionality (see the functools library) and there are third-party caching libraries such as cachetools, I couldn't find a proper solution that satisfied my needs. So I wrote my own, using a very nice function from the cachetools library and building a proper decorator around it.

Before using the code shown below, create a virtualenv and install the cachetools package upfront so that you can try things out yourself:

$ virtualenv pyenv
$ . pyenv/bin/activate
$ pip install cachetools

Now let's start implementing. A naive approach to adding caching to a very simple class could be:

from cachetools import cached

mycache = {}

class MyClass:
    @cached(mycache)
    def myfunc(self, a):
        print('myfunc called with', a)
        return a * 2

It uses the cached() decorator from the cachetools library and initializes it with a dictionary instance that serves as the actual container for the cached values.

Now let's create an instance of MyClass and call myfunc() twice to see that the cache works:

>>> my = MyClass()
>>> my.myfunc(2)
myfunc called with 2
4
>>> my.myfunc(2)
4

Here we can see that the second call doesn't actually call into the myfunc() function but receives the result of the previous call from the cache. So far so good.

But: What happens if the my-instance gets deleted (and hence garbage collected)?
Does the cache lose its cached results from those calls above?

Well, let's see. In order to prove that an object has really been removed from the Python interpreter, I create a weak reference to the object I want to delete. If calling the weak reference returns None, the previously referenced object has indeed been deleted.

Note: In CPython an object is normally freed as soon as its last reference disappears. Only objects that are part of reference cycles have to wait for the cyclic garbage collector, which the interpreter runs at certain trigger points, so it can take a while until such objects are actually removed. However, the garbage collector can be called manually to enforce the cleanup, which is what we do in the following example.

Let's demonstrate this:

>>> import weakref, gc
>>> myinst = MyClass()
>>> myinstref = weakref.ref(myinst)
>>> myinstref()
<__main__.MyClass object at 0x7f6565b11a30>
>>> del myinst
>>> gc.collect()   # enforce garbage collection
>>> myinstref() is None
True

So, this worked. Now let's go back to the my-instance created above, and also try to remove it:

>>> myref = weakref.ref(my)
>>> del my
>>> gc.collect()
>>> myref()
<__main__.MyClass object at 0x7f6565a01c40>

Hmm, this didn't work quite as expected: the my instance is still alive. The reason is the cache. When we called my.myfunc(2) earlier, not only the parameter 2 and the result 4 were stored in the cache, but also a reference to self, simply because self is part of the method signature. This can be made visible by inspecting the cache object, which is a plain dictionary:

>>> mycache
{(<__main__.MyClass object at 0x7f6565a01c40>, 2): 4}

As you can see, the cache itself keeps a reference to the my instance, which prevents it from being garbage collected. Now imagine you create thousands or millions of MyClass instances and only ever make the same call myfunc(2) on each of them. The result would be a million entries in the cache for parameter 2 and result 4, each with a reference to a different, supposedly short-lived MyClass instance. Voilà, here is our memory leak.

The solution is to clear the cache, and then our my instance will eventually also be garbage collected:

>>> mycache.clear()
>>> gc.collect()
>>> myref() is None
True
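
The same effect can be reproduced with the standard library alone, since functools.lru_cache also builds its cache key from all arguments, including self. The following is only a small sketch to make the leak measurable; the class Leaky and the helper functions are made up for this illustration.

```python
import functools
import gc
import weakref


class Leaky:
    # lru_cache builds its key from all arguments, including self
    @functools.lru_cache(maxsize=None)
    def double(self, a):
        return a * 2


def make_short_lived(n):
    """Create n instances, call the cached method once on each, and
    return only weak references to them."""
    refs = []
    for _ in range(n):
        obj = Leaky()
        obj.double(2)
        refs.append(weakref.ref(obj))
    return refs


def count_alive(refs):
    """Run the garbage collector and count the instances that survived."""
    gc.collect()
    return sum(1 for ref in refs if ref() is not None)


refs = make_short_lived(1000)
alive_with_cache = count_alive(refs)          # instances pinned by the cache
entries = Leaky.double.cache_info().currsize  # one cache entry per instance

Leaky.double.cache_clear()
alive_after_clear = count_alive(refs)         # nothing holds them any longer
print(alive_with_cache, entries, alive_after_clear)
```

As long as the cache is filled, none of the instances can be collected; only after cache_clear() do they go away.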

Constructing a cache that frees instances when they are garbage collected

As we have seen before, the weakref module provides a nice way to keep references to objects without holding a firm grip on them. This approach can be used inside the cache as well. Here is an implementation that I came up with on this nice Sunday afternoon.

In order to understand this implementation please make sure you are familiar with how decorators work in general. I will only explain details which are specific to the implementation, not the entire principle of decorators.

from cachetools import cachedmethod
from weakref import WeakKeyDictionary

def methodcache(cache_factory):
    def wrapped_methodcache(method):
        def get_instance_cache(self):
            try:
                instance_cache = weak_cache[self]
            except KeyError:
                instance_cache = weak_cache[self] = cache_factory()
            return instance_cache

        weak_cache = WeakKeyDictionary()
        cached_method = cachedmethod(get_instance_cache)(method)
        # Attach the weak dictionary to the decorator so we can access it from outside:
        cached_method.__cache__ = weak_cache
        return cached_method

    return wrapped_methodcache


# Application:
class MyNewClass:
    @methodcache(dict)
    def myfunc(self, a):
        return a * 2

The core of this decorator is the cachetools.cachedmethod() function (which itself is a decorator). It is initialized with a callback function (get_instance_cache() in this case) which is called each time the decorated method is called. The callback receives a reference to the class instance (an instance of MyNewClass in the demo above) as argument self.

The approach is that each decorated method gets its own WeakKeyDictionary instance. Since get_instance_cache() is defined in the same scope in which weak_cache is assigned, it has access to this weak dictionary through its closure. When it is called, it looks up whether the weak dict already holds a cache for the given instance. If not, a new cache is created by calling cache_factory(). The cache is then returned to the cachedmethod decorator, which is responsible for storing and looking up the cached values. Because the instance is only a weak key in this dictionary, its cache disappears together with the instance.
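
To check the idea without any third-party package, here is a stripped-down sketch of the same principle using only the standard library. It is not the decorator from above: weak_methodcache and Demo are hypothetical names, the per-instance cache is always a plain dict, and only positional, hashable arguments are supported.

```python
import functools
import gc
import weakref


def weak_methodcache(method):
    """Cache results per instance; the instance is only weakly referenced."""
    caches = weakref.WeakKeyDictionary()  # instance -> {args: result}

    @functools.wraps(method)
    def wrapper(self, *args):
        instance_cache = caches.setdefault(self, {})
        if args not in instance_cache:
            instance_cache[args] = method(self, *args)
        return instance_cache[args]

    wrapper.caches = caches  # exposed for inspection only
    return wrapper


class Demo:
    calls = 0  # counts how often the real method body runs

    @weak_methodcache
    def double(self, a):
        Demo.calls += 1
        return a * 2


d = Demo()
first = d.double(2)
second = d.double(2)                     # served from the per-instance cache
entries_alive = len(Demo.double.caches)  # one cache, for the one instance

ref = weakref.ref(d)
del d
gc.collect()
freed = ref() is None                    # the cache did not pin the instance
entries_after = len(Demo.double.caches)
```

One caveat applies to this sketch and presumably to any weak-key approach: if a cached result itself holds a reference to the instance, the instance stays alive after all.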

I have to admit that this solution is not super easy to understand, but with some experience with decorators in general and a little digging into the code I hope everyone can profit from it.

My website is online again

After some downtime while I was working on it, my website ralph-heinkel.com is finally online again.

It comes with a fresh design, based on bootstrap and the 'superhero' theme from bootswatch.com, and is assembled by the static site and blog generator Nikola. It was quite an experience to work with Nikola; the learning curve was a bit steeper than I had hoped, but the result is more than rewarding. I'm really grateful for this project, many thanks to all its contributors.

juraforum.de provided the (free) privacy declaration (Datenschutzerklärung).

Last but not least my old blog postings got recovered, and now hopefully new ones will follow.

Stay tuned ;-)

Repair grub2 boot after OpenSuse 12.3 update messed it up

After my OpenSuse 12.3 installation ran an automatic update on a few packages I wasn't able to boot the system anymore. Instead I ended up in a minimal grub2 shell without any clue what to do next in order to reboot my system again.

Browsing around the internet from another computer I found that things were more difficult in my case because I had my root partition encrypted in a LUKS container. But there is always hope, and so I eventually collected enough information from various blogs and news groups to get my system up and running again.

Here are the instructions that have worked for me.

Boot with OpenSuse 12.3 installation DVD

In order to be able to do anything I started up the installation DVD. Since I needed access to my encrypted root file system, the smartest way is to choose "Installation" when the DVD shows the initial menu. Don't worry, nothing is installed yet, and nothing will be, because we will jump out in time.

After choosing "Installation" the system asks you to confirm the EULA. Click 'Accept' to continue. Next it will find your LUKS partition and ask whether you want to provide the passphrase for decrypting it. So yes, choose decryption and enter your password.

Wait until the decryption is done and the installer is waiting for new input from you. At this point the magic stuff starts. Press Ctrl-Alt-F2 to switch to a text console. You will already be logged in as root (the #-prompt is your friend!).

Mount your system partitions

Type the following command to see your partition setup:

# fdisk -l
[... some output omitted ...]
Device Boot         Start         End      Blocks   Id  System
/dev/sda1            2048    87472127    43735040    7  HPFS/NTFS/exFAT
/dev/sda2   *    87472128    87795711      161792   83  Linux
/dev/sda3        87795712   488396799   200300544   8e  Linux LVM

In my case partition 2 contains my boot system, partition 3 my encrypted root and home partitions.

Type another command to look into the encrypted container (only accessible because we did the decryption step above):

# lvscan
ACTIVE            '/dev/system/home' [100.00 GiB] inherit
ACTIVE            '/dev/system/root' [25.00 GiB] inherit
ACTIVE            '/dev/system/swap' [4.00 GiB] inherit

Now let's glue together the original system with some mount commands, using the information above:

# mount /dev/system/root /mnt
# mount /dev/sda2 /mnt/boot
# mount --bind /dev /mnt/dev
# mount -o bind /sys /mnt/sys
# mount -t proc /proc /mnt/proc
# cp /proc/mounts /mnt/etc/mtab

Change the root directory into the mounted filesystem and run grub

# sudo chroot /mnt /bin/bash
# grub2-install /dev/sda

The update-grub command mentioned in some other blogs does not exist any longer, so use grub2-mkconfig instead to finally generate a new grub.cfg file:

# grub2-mkconfig -o /boot/grub2/grub.cfg

This should print a list of added partitions to your screen.

Quit the chroot environment with ctrl-d and reboot your system (reboot or ctrl-alt-del). Hopefully it boots up again as before.

References

http://wiki.ubuntuusers.de/GRUB_2/Reparatur
https://fedoraproject.org/wiki/GRUB_2?rd=Grub2
http://www.gargi.org/showthread.php?4215-openSUSE-12-2-Grub2-Bootloader-wiederherstellen

pyRserve 0.8.1 released

About pyRserve

pyRserve is a (pure Python) client for connecting Python to an R process on a remote server via TCP/IP (using Rserve). R is one of the most important and most widely used open source statistics packages available.

Through such a pyRserve connection the full power of R is available on the Python side without programming in R. From Python, variables can be set in and retrieved from R, and R functions can be created and called remotely. No R-related libraries need to be installed on the client side; pip install pyRserve is all that needs to be done.

Sample code

This code assumes that Rserve (the part that connects the R engine to the network) is already running. Details can be found in the pyRserve docs.

>>> import pyRserve
>>> conn = pyRserve.connect('Rserve.domain.com')
>>> conn.eval('sum( c(1.5, 4) )') # direct evaluation of a statement in R
5.5
>>> conn.r.myList = [1, 2, 3] # bind a Python list in R to variable 'myList'

>>> conn.voidEval('func1 <- function(v) { v*2 }')  # create a function in R
>>> conn.r.func1(4)                                # call the function in R
8.0

Most important changes in V 0.8.x

  • Support for out-of-band messages (allows for callbacks from R to Python) (contrib. by Philip. A.)
  • Rserve can now be shut down remotely (contrib. by Uwe Schmitt)
  • Fixed bug when passing R functions as parameters to R functions
  • Documentation errors have been fixed

Documentation and Support

The documentation for pyRserve is available at http://packages.python.org/pyRserve
The corresponding Google news group can be found at http://groups.google.com/group/pyrserve

Changing selenium's default tmp directory

Whenever Selenium fires up Firefox (we are still running Selenium in RC mode) a new Firefox profile directory is created. Usually this directory is created in /tmp, which for various reasons is not always the desired location.

Selenium RC itself has no configuration option to change this location as it relies on the default value provided by Java. Fortunately Java provides a command line option -Djava.io.tmpdir which allows you to specify a different tmp directory.

So change your startup call of Selenium RC to

java -Djava.io.tmpdir=/your/tmp/dir -jar selenium-server.jar

and you're all set.
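
If Selenium RC is started from a Python script (for instance from a test runner), the same option can be passed along. The sketch below is only an illustration; the function names are made up, and it assumes that java is on the PATH and that the jar file lies in the current directory.

```python
import os
import subprocess


def selenium_rc_command(tmp_dir, jar="selenium-server.jar"):
    """Build the java command line with a custom temp directory."""
    return ["java", "-Djava.io.tmpdir=" + tmp_dir, "-jar", jar]


def start_selenium_rc(tmp_dir, jar="selenium-server.jar"):
    """Start Selenium RC as a child process and return the Popen object."""
    # Create the directory up front so that it exists before Java uses it.
    os.makedirs(tmp_dir, exist_ok=True)
    return subprocess.Popen(selenium_rc_command(tmp_dir, jar))


print(" ".join(selenium_rc_command("/your/tmp/dir")))
```

The command is built as a list rather than a single string, so a tmp directory containing spaces needs no extra quoting.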

Publish-Subscribe with web sockets in Python and Firefox

WebSockets provide a way to communicate through a bi-directional channel on a single TCP connection. This technology is especially interesting since it allows a web server to push data to a browser (client) without the client having to poll for it constantly. In contrast to normal HTTP requests, where a new TCP connection gets opened and closed for each request, web socket connections are kept open until one party closes them. This allows for communication in both directions, and calls can be made multiple times on the same connection.

In this little article I basically combine what I found in Sylvain Hellegouarch's documentation for ws4py (a WebSocket client and server library for Python) and the article HTML5 Web Socket in Essence by Wayne Ye.

More specifically, the examples below show how multiple clients subscribe to a cherrypy server through a web socket connection. The first of the two clients is a very lightweight client based solely on the ws4py package; the other (JavaScript) implementation is supposed to run in Firefox.

The server

This example provides a minimal publishing engine implemented with cherrypy. An instance of class WebSocketTool is hooked up into cherrypy as a so-called cherrypy tool, and a web socket handler (the Publisher-class) is bound to this tool as a handler for calls to the path /ws:

import cherrypy
from ws4py.server.cherrypyserver import WebSocketPlugin, WebSocketTool
from ws4py.websocket import WebSocket

cherrypy.config.update({'server.socket_port': 9000})
WebSocketPlugin(cherrypy.engine).subscribe()
cherrypy.tools.websocket = WebSocketTool()

SUBSCRIBERS = set()

class Publisher(WebSocket):
    def __init__(self, *args, **kw):
        WebSocket.__init__(self, *args, **kw)
        SUBSCRIBERS.add(self)

    def closed(self, code, reason=None):
        SUBSCRIBERS.remove(self)

class Root(object):
    @cherrypy.expose
    def index(self):
        return open('ws_browser.html').read()

    @cherrypy.expose
    def ws(self):
        "Method must exist to serve as an exposed hook for the websocket"

    @cherrypy.expose
    def notify(self, msg):
        for conn in SUBSCRIBERS:
            conn.send(msg)

cherrypy.quickstart(Root(), '/', 
    config={'/ws': {'tools.websocket.on': True,
                    'tools.websocket.handler_cls': Publisher}})

The only purpose of the Root.ws()-method is to make this method available under /ws in the web server through the cherrypy.expose decorator. Whenever a websocket client makes a request to /ws an instance of class Publisher is created, which registers itself to the global SUBSCRIBERS set on __init__(). When the server goes down, or the client disconnects, its closed() method is called.
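
The bookkeeping behind this publish-subscribe pattern can be tried out without any network involved. The following sketch uses a hypothetical FakeConnection class as a stand-in for Publisher. It also iterates over a copy of the set and drops connections whose send() fails, on the assumption that the set could otherwise change while notify() is still looping over it.

```python
SUBSCRIBERS = set()


class FakeConnection:
    """Stand-in for a Publisher instance (hypothetical, for illustration)."""

    def __init__(self, name, broken=False):
        self.name = name
        self.broken = broken
        self.received = []
        SUBSCRIBERS.add(self)        # register on creation, like Publisher

    def send(self, msg):
        if self.broken:
            raise IOError("connection lost")
        self.received.append(msg)

    def closed(self):
        SUBSCRIBERS.discard(self)    # discard() tolerates a repeated close


def notify(msg):
    """Send msg to every subscriber; drop subscribers whose send fails."""
    delivered = 0
    for conn in list(SUBSCRIBERS):   # copy, because closed() mutates the set
        try:
            conn.send(msg)
            delivered += 1
        except IOError:
            conn.closed()
    return delivered


a = FakeConnection("a")
b = FakeConnection("b", broken=True)
count = notify("HelloWorld")
```

After the call only the healthy connection is left in SUBSCRIBERS, and it has received the message once.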

The only packages needed for this example are cherrypy and ws4py. Both can be easily installed via easy_install or pip. Save the code above as ws_server.py and start it with

python ws_server.py

Now the server is ready to accept client connections through the web socket protocol. As soon as one of the clients described below has subscribed to this server, messages can be published by calling the Root.notify() method. Since it is exposed, it is possible to call it from the command line with

curl 'localhost:9000/notify?msg=HelloWorld'

Of course wget works as well.
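
The same request can also be sent from Python. This is a small sketch for Python 3 using only the standard library; notify_url() and publish() are names chosen for this example, and publish() assumes that the server from above is running on localhost:9000.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def notify_url(msg, base="http://localhost:9000"):
    """Build the URL of the notify() method with a properly encoded message."""
    return base + "/notify?" + urlencode({"msg": msg})


def publish(msg):
    """Trigger Root.notify() on the running server, return the HTTP status."""
    with urlopen(notify_url(msg)) as response:
        return response.status


print(notify_url("Hello World"))
```

Using urlencode() has the advantage that messages containing spaces or special characters are transmitted correctly.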

A pure Python client

The Python client's code is quite short. ws4py provides three sample client implementations; the threaded one has been chosen for this example. The others use gevent or Tornado.

from ws4py.client.threadedclient import WebSocketClient

class Subscriber(WebSocketClient):
    def handshake_ok(self):
        self._th.start()
        self._th.join()

    def received_message(self, m):
        print("=> %d %s" % (len(m), str(m)))

if __name__ == '__main__':
    ws = Subscriber('ws://localhost:9000/ws')
    ws.connect()

The method handshake_ok() has been overridden to keep the thread stored in self._th running continuously (the original implementation quits after one second). After the Subscriber-class has been instantiated it connects to the cherrypy server. Whenever the server sends a message it will be delegated to the method received_message() where it gets printed to stdout.

Just store this code in a file, e.g. ws_subscriber.py, and start it from a new shell. The cherrypy server should print a message to the console that it received a web socket connection.

Now again call the notify method in the server:

curl 'localhost:9000/notify?msg=HelloWorld'

and the Python client should print your message to the screen.

A web socket client in Firefox

This browser client uses the web socket support built into Firefox. The example below works for me in FF14 but failed in FF8; I'm not sure which version of Firefox started to support it. Safari 5.0 also fails. IE has not been tested.

<html>
  <head>
    <script>
      var websocket = new WebSocket('ws://localhost:9000/ws');
      websocket.onopen    = function (evt) { console.log("Connected to WebSocket server."); };
      websocket.onclose   = function (evt) { console.log("Disconnected"); };
      websocket.onmessage = function (evt) { document.getElementById('msg').innerHTML = evt.data; };
      websocket.onerror   = function (evt) { console.log('Error occurred: ' + evt.data); };
    </script>
  </head>
  <body>
    <h1>Websocket demo</h1>
    Message: <span id="msg"></span>
  </body>
</html>

At load time a WebSocket instance is created and event handlers are installed. The interesting one is the onmessage handler: it is called for each message received and copies the message into the SPAN element, thus making it visible.

Make sure to store this HTML page in the same directory as the ws_server.py above since we are going to open it through cherrypy's index method. For this to work it has to be named ws_browser.html. Now open Firefox and direct it to http://localhost:9000. You should immediately see this page. The SPAN element should be empty.

Repeat the curl or wget command in your shell, and both the Python client (if it is still running) and the SPAN element should display your "HelloWorld" message.

pyRserve 0.6.0 released

While I was at EuroPython in Florence the latest version of pyRserve was finished, and it is now available via pypi (easy_install -U pyRserve). If you are at EuroPython, too, and want to talk about it, just come and see me.

About pyRserve

pyRserve is a (pure Python) client for connecting Python to an R process on a remote server via TCP/IP (using Rserve). R is one of the most important and most widely used open source statistics packages available.

Through such a pyRserve connection the full power of R is available on the Python side without programming in R. From Python, variables can be set in and retrieved from R, and R functions can be called remotely. No R-related libraries need to be installed on the client side.

Sample code

>>> import pyRserve
>>> conn = pyRserve.connect('servername.domain.com')
>>> conn.r('1+1')                # direct evaluation of a statement in R
2.0
>>> conn.r.myList = [1, 2, 3]    # bind a Python list in R to variable 'myList'

>>> conn.r('func1 <- function(v) { v*2 }') # create a function in R
>>> conn.r.func1(4)                        # call the function in R
8.0

Most important changes in V 0.6.0

  • Support for Python 3.x (therefore dropped support for Python <= 2.5)
  • Support for unicode strings
  • Support for Fortran-style ordering of numpy arrays
  • Elements of single-item arrays are now translated to native Python data types
  • Full support for complex numbers, partial support for 64bit integers and arrays

Documentation and Support

The documentation for pyRserve is available at http://packages.python.org/pyRserve
The corresponding Google news group can be found at http://groups.google.com/group/pyrserve

pyRserve 0.5.2 released

The latest version is now available via pypi (easy_install -U pyRserve).

About pyRserve

pyRserve is a (pure Python) client for connecting Python to an R process on a remote server via TCP/IP (using Rserve). Through such a connection variables in R can be read and set from Python, and R functions can be called remotely. No R-related libraries need to be installed on the client side.

Sample code

>>> from pyRserve import connect
>>> conn = connect('your R server')
>>> conn.r('1+1')                # direct evaluation of a statement
2.0
>>> conn.r.myList = [1, 2, 3]    # bind a list within R to variable 'myList'

>>> conn.r('func1 <- function(v) { v*2 }') # create a function in R
>>> conn.r.func1(4)                        # call the function in R
8.0

Changes in V 0.5.2

  • Fixed bug with 32bit integer conversion on 64bit machines. Upgrade highly recommended!

Documentation and Support

The documentation for pyRserve is available at http://packages.python.org/pyRserve
The corresponding Google news group can be found at http://groups.google.com/group/pyrserve

pyRserve 0.5.0 released

The latest version is now available via pypi (easy_install -U pyRserve).

About pyRserve

pyRserve is a library for connecting Python to an R process running under Rserve. Through such a connection variables in R can be read and set from Python, and R functions can be called remotely.

Changes in V 0.5.0

  • Renamed pyRserve.rconnect() to pyRserve.connect(). The former still works but shows a DeprecationWarning.
  • String evaluation should now only be executed on the namespace directly, not on the connection object anymore. The latter still works but shows a DeprecationWarning.
  • New keyword argument atomicArray=True added to pyRserve.connect() for preventing single-valued arrays from being converted into atomic Python data types.

Documentation and Support

The documentation for pyRserve is available at http://packages.python.org/pyRserve
The corresponding Google news group can be found at http://groups.google.com/group/pyrserve