Entries tagged with “c”


Page: 1

Fast Bloom Filters in Python

2010-03-20 00:15:03-0400 - development (4),python (2),c (1) - 2 comments

Introduction

It’s been about two years now since I first learned about how useful Bloom filters are. Since they have a niche forte they aren’t really standard on many languages. Since I learned about them, I’ve been wanting a fast, scalable bloom filter implementation for Python.

When I say scalable, I mean that multiprocessed web servers should use them without worrying. This can be a hassle, because if you can imagine 50 web processes each containing having a 50MB bloom filter, that’s almost 2.5GB on just this! The obvious solution is shared memory, but luckily we don’t quite have to go that route. There’s a better way to approach this problem, and that is just by assuming that everything is always on disk. If everything were just constantly read and written to a hard disk, we’d be scalable again! Indeed, there would only be one copy of the file, and everyone would be happy. But alas, this method is too slow. Fortunately, there’s something called mmap, which allows us to open a page file of memory as a sort of virtual representation of a file. This allows us to delay writes to the file but the cache is constantly updated at the kernel level.

Now I’ve been knowing about this solution for a while now, but it’s not until recently that I discovered how fun coding up Cython projects are. And thus, a project is born:

Python Bloom Filter

Python-bloom-filter is now my new project, with a ton of C code and cython code putting it all together. The documentation is available on this server. My initial tests show that it is extremely fast, and the mmap file support makes it extremely handy.

But I like pybloom better!

Jay Baird’s pybloom is a great module–the code is nice, it has great abstractions, but it is very different from my module. I think whether you use mine or his comes down to what features you want. I don’t want mine to support Auto resizing or even capacity limits. This is because I happen to like being able to OR and AND my filters, and when you do this the idea of keeping track of capacities gets fuzzy. Maybe somebody smarter than me will figure this out, but I won’t.

But why is this useful?

I think Bloom filters really shine when your input data set is extremely large and your matching data set is larger than can reasonably fit in memory (or causes too much chaining with hash tables). The archetypal example is a spell checker. I can scan every word in this text box easily and check membership in my Bloom filter. If it passes then I am reasonably confident that the word is spelled correctly. With the binary arithematic you can also use it on clever joins and other cool stuff. At some point in the future I’ll write an article with a cool example. For now, please have fun with the documentation and the downloads!

As always, contact me if you have questions.

Page: 1