Cloud9 is a MapReduce library for Hadoop designed both to serve as a teaching tool and to support research in data-intensive text processing. MapReduce is a programming model for expressing distributed computations on massive datasets, as well as an execution framework for large-scale data processing on clusters of commodity servers. Hadoop provides an open-source implementation of the programming model. The library is available on GitHub and distributed under the Apache License.

For additional details on MapReduce algorithm design, Data-Intensive Text Processing with MapReduce by Lin and Dyer is a good resource. This library also serves as a repository of many examples discussed in the book.
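To make the programming model described above concrete, here is a minimal single-process sketch of word count written as a map step, a grouping (shuffle) step, and a reduce step. It is an illustration only: it uses neither the Hadoop API nor any Cloud9 class, and the names `map_word_count`, `reduce_word_count`, and `run_mapreduce` are hypothetical helpers invented for this example. A real job would run the mapper and reducer across many machines, with the framework handling the grouping.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for token in text.lower().split():
        yield token, 1


def reduce_word_count(word, counts):
    """Reducer: sum all partial counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory driver (hypothetical, not part of Hadoop or Cloud9).

    Applies the mapper to each input record, groups the intermediate
    values by key, then applies the reducer to each key in sorted order.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output


# Two tiny "documents" keyed by a made-up document id.
docs = [
    ("d1", "a rose is a rose"),
    ("d2", "a daisy is not a rose"),
]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(counts)
```

The point of the split is that the mapper sees one record at a time and the reducer sees one key at a time, so neither needs shared state, which is what lets an execution framework distribute them over a cluster.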

Getting It

Starting Points

Note: many of these guides are out of date and have not been maintained.

Next Steps

Working with Specific Document Collections

This work is or has been supported by the following sources: NSF under awards IIS-0836560 and IIS-0705832; Google and IBM under the Academic Cloud Computing Initiative (ACCI); the Intramural Research Program of the NIH, National Library of Medicine; DARPA/IPTO Contract No. HR0011-06-2-0001 under the GALE program; and Amazon Web Services. Any opinions, findings, conclusions, or recommendations expressed here do not necessarily reflect those of the sponsors.