Monday, August 21, 2006

Starfish: not yet MapReduce

Lucas Carlson writes about using his Starfish library to do MapReduce. He's not really doing MapReduce, but I like his Starfish library.

I mention this because I wrote a basic MapReduce library in Ruby at Sharpcast. Perhaps I can use Starfish as a good distributed processing abstraction.

Here's the ubiquitous word-frequency program using my MapReduce:

worker :word_count do
  # [filename,line] => list([word,1])
  map :reads => :text, :writes => :yaml do |filename,string|
    string.split.each {|w| emit_intermediate(w, 1)}
  end

  # [word,counts] => list([word,freq])
  reduce :reads => :yaml, :writes => :yaml do |word,counts|
    sum = counts.inject(0) {|acc,v| acc + v}
    emit word, sum
  end
end

I'll see if I can open-source it.
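
The three phases behind a program like this (map, group by key, reduce) can be sketched in plain Ruby in a single process. This is only an illustration under assumptions: `map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical names, not the API of the library described here or of Starfish, and nothing in it is distributed.

```ruby
# Hypothetical single-process runner, for illustration only.
def map_reduce(inputs, mapper, reducer)
  # Map phase: each [key, value] input yields a list of intermediate pairs.
  intermediate = inputs.flat_map { |key, value| mapper.call(key, value) }

  # Shuffle phase: collect all intermediate values under their key.
  grouped = Hash.new { |hash, key| hash[key] = [] }
  intermediate.each { |key, value| grouped[key] << value }

  # Reduce phase: collapse each key's values into a single result.
  grouped.map { |key, values| [key, reducer.call(key, values)] }.to_h
end

# [filename, text] => list([word, 1])
word_mapper = ->(_filename, text) { text.split.map { |word| [word, 1] } }

# [word, counts] => freq
sum_reducer = ->(_word, counts) { counts.inject(0) { |acc, v| acc + v } }

inputs = [["a.txt", "the quick fox"], ["b.txt", "the lazy dog the end"]]
counts = map_reduce(inputs, word_mapper, sum_reducer)
```

A real implementation would spread the map and reduce calls across workers and move the intermediate pairs between machines; the grouping step in the middle is what separates this from simply farming out independent jobs.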


Anonymous Anonymous said...

I'm not Lucas, but I just wanted to observe that your credibility is diminished by criticizing his open-source offering while your offering is not open. You offer sample code that your readers cannot verify. I'd suggest being more constructive in the future, especially before you have accomplished anything yourself.

7:16 PM  
Blogger Adam Rosien said...

I'm not really after credibility. I'm actually being quite direct: here's what a sample program would look like; however, my employer owns my code and I need to get permission to release it.

What is offered is a look at how I structured a MapReduce program, which people can verify by looking at it and seeing if it makes sense. I think it's straightforward; other people may or may not. I hope that any comments take the form "Your DSL for MapReduce is [good,bad,ugly], because [x,y,z]".

I just think that Lucas is mis-advertising his offering. Heck, I like Starfish! It's better than plain DRb. But let's just all agree it's not MapReduce. It may be like MapReduce in that it simplifies the distributed computation paradigm, but it's not the same thing as MapReduce.

7:42 PM  
Anonymous sebnem said...

Can you tell how much code it is? How long did you need to code the library? Did you start the concept from zero, or did you know of Hadoop (MR in Java)? ...and anything else you consider important to tell someone who needs to code MR in Ruby themselves?

(When) Will it be open source?

Good luck!

6:19 AM  
Blogger Adam Rosien said...

It's about 200 lines of Ruby which I wrote in about two days, but note that it doesn't run on a distributed file system, and a distributed file system is what gives you real scalability. I read the original Google paper and went from there. I'm pretty hosed at work, so I don't know when I'll have time to see if I can open source it.

11:36 AM  
