Entries tagged with “python”


Page: 1

Python Generic Dispatch?

2010-06-25 22:30:17-0400 - python (2) - 7 comments

My friend Kyle and I were discussing the nice generic dispatch system in Haskell, and how it makes function definitions nice. I mentioned it would be nice to pull that syntax into Python, and an hour later here I am. I have implemented a very rough solution on github (VIEW SOURCE).

Solution Looking for a Problem

So I know this may be a solution looking for a problem, but I’m really happy that by abusing the inspect system of python and using decorators you can write such a system in a concise way. The basic idea is that I’ve implemented a decorator called dispatch. This decorator takes arguments which defined a kind of signature for the function. This system is actually very close to method overloading in Java or function overloading in C. However, you can take it further (much further). The dispatcher can actually take types, predicate functions, values, or lists! We can take the simplest case, values and implement Fibonacci with it:

# inside returns predicate that's predefined
@dispatch(inside(0, 1))
def fib(n):
    return n

@dispatch(int)
def fib(n):
    return fib(n - 1) + fib(n - 2)

>>> fib(6)
13

Now I would never, ever suggest writing a Fibonacci like this, but this just provides an example as to how this may work. Now, here comes the part that’s interesting. Because of the dynamic dispatch nature of this, we can add to this if we need to for later programs. As an example, suppose I want to come up with a fast version of fib for longs. We can just add this:

from mymodule import fib

@dispatch(long)
def fib(n):
    # Faster fib implementation
    a, b = 1, 0
    c, d = 0, 1
    while i > 0:
        if i % 2:
            c, d =  a * (c + d) + b * c, a * c + b * d
            i -= 1
        i >>= 1
        a, b =  a ** 2 + ((a * b) << 1), a ** 2 + b ** 2
    return c

>>> fib(26L) # much faster!
196418

Or let’s say that we want to take strings as an input, we can also add:

@dispatch(basestring)
def fib(n):
    return fib(long(n))

>>> fib('26')
196418

And now each time we do this we update the local scope to have the new function. There is no magic monkey patching done, so if you want to do this you have to create your own global registry. I like this because it means that by default you will branch off and create different versions of the function if you have more than one module. I believe this is what you’d expect if you understood scoping etc.

Deeper Down the Rabbit Hole

We aren’t quite finished, are we? Because we also want to be able to take lists of objects (or I do). Let’s say that I have a function that dumps data but specifically based on what is in the data. Here’s a quick example:

# Generic default
@dispatch()
def dumper(data):
    import pprint
    pprint.pprint(data)

# Now special if the list contains a single dict and another value.
@dispatch([anything, dict]) # anything is a predicate function predefined
def dumper(data):
    print "Data for %s" % data[0]
    for key, value in data[1]:
        print "%s: %s" % (key, value)

Viola! The dispatcher will actually recurse down any list or iterable you have, so you can have @dispatch([anything, [anything], [[int]]]) if you wanted.

There’s one more trick not mentioned here, and that is that if you write a default case (@dispatch() with no specifications), then any further conditions defined later are actually inserted before this case. This allows me to write what I have above, since I have defined the default before the specific case. This would allow me to add other specialties to the function as the function warrants it.

Conclusion

I’m sure there’s plenty more to play with in this arena, and I’m actually somewhat surprised I was not able to find something like this already prepackaged for python. After playing with it for a little while I may put it up on pypi. Of course, the drawback is performance. For a recursive function like the naive fib, the dispatch version is 10x slower. However, for a lot of real-world cases, you aren’t wrapping a recursive function with this, so it may actually be useful.



Fast Bloom Filters in Python

2010-03-20 00:15:03-0400 - development (4),python (2),c (1) - 2 comments

Introduction

It’s been about two years now since I first learned about how useful Bloom filters are. Since they have a niche forte they aren’t really standard on many languages. Since I learned about them, I’ve been wanting a fast, scalable bloom filter implementation for Python.

When I say scalable, I mean that multiprocessed web servers should use them without worrying. This can be a hassle, because if you can imagine 50 web processes each containing having a 50MB bloom filter, that’s almost 2.5GB on just this! The obvious solution is shared memory, but luckily we don’t quite have to go that route. There’s a better way to approach this problem, and that is just by assuming that everything is always on disk. If everything were just constantly read and written to a hard disk, we’d be scalable again! Indeed, there would only be one copy of the file, and everyone would be happy. But alas, this method is too slow. Fortunately, there’s something called mmap, which allows us to open a page file of memory as a sort of virtual representation of a file. This allows us to delay writes to the file but the cache is constantly updated at the kernel level.

Now I’ve been knowing about this solution for a while now, but it’s not until recently that I discovered how fun coding up Cython projects are. And thus, a project is born:

Python Bloom Filter

Python-bloom-filter is now my new project, with a ton of C code and cython code putting it all together. The documentation is available on this server. My initial tests show that it is extremely fast, and the mmap file support makes it extremely handy.

But I like pybloom better!

Jay Baird’s pybloom is a great module–the code is nice, it has great abstractions, but it is very different from my module. I think whether you use mine or his comes down to what features you want. I don’t want mine to support Auto resizing or even capacity limits. This is because I happen to like being able to OR and AND my filters, and when you do this the idea of keeping track of capacities gets fuzzy. Maybe somebody smarter than me will figure this out, but I won’t.

But why is this useful?

I think Bloom filters really shine when your input data set is extremely large and your matching data set is larger than can reasonably fit in memory (or causes too much chaining with hash tables). The archetypal example is a spell checker. I can scan every word in this text box easily and check membership in my Bloom filter. If it passes then I am reasonably confident that the word is spelled correctly. With the binary arithematic you can also use it on clever joins and other cool stuff. At some point in the future I’ll write an article with a cool example. For now, please have fun with the documentation and the downloads!

As always, contact me if you have questions.

Page: 1