Great talk on data scrubbing: Have Data? What now?!
Open Calais (www.opencalais.com) and Freebase as data analysis tools.
Entity disambiguation - the ability to discern which goes with what, i.e. which real-world entity a name refers to (Hillary shows Cuil's search results on herself)
Company disambiguation - often handled by humans
At Path 101 - humans, data APIs (e.g. mTurk) or auto-classification
Shows her Google spam box: how does Google figure out what belongs there?
eScienceNews uses vector analysis and a hierarchical clustering model (to figure out what is interesting), then uses a Bayesian document classification model. (www.esciencenews.com/about.html)
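To make the Bayesian document classification idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not eScienceNews's actual implementation; the class name, the whitespace tokenizer, and the tiny training set are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real systems do more cleanup).
    return text.lower().split()


class NaiveBayes:
    """Hypothetical minimal multinomial naive Bayes text classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: fraction of training documents with this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, for illustration only.
nb = NaiveBayes()
nb.train("new telescope images distant galaxy", "space")
nb.train("galaxy cluster survey finds dark matter", "space")
nb.train("gene therapy trial shows promise", "health")
nb.train("vaccine trial results published", "health")

label = nb.classify("dark galaxy survey")
print(label)
```

The same word-counting approach is the classic starting point for spam filtering, though production filters use far more signals than this.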
Discussing clustering and hierarchical clustering and how it is applied to Path 101.
Depending on your data, you need to choose your algorithm - there are lots of rules of thumb, but the artistry is in knowing how to tune the groups/clusters for the algorithm.
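A rough sketch of what hierarchical clustering and the tuning step can look like, assuming bag-of-words vectors, cosine similarity, and average linkage. None of this comes from the talk or from Path 101's code; the function names, the job titles, and the threshold values are invented for illustration. The `threshold` parameter is the knob: it decides when merging stops and therefore how coarse the groups are.

```python
import math
from collections import Counter


def vectorize(text):
    # Bag-of-words vector: word -> count.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    # Average pairwise similarity between the members of two clusters.
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def agglomerate(texts, threshold):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    vectors = [vectorize(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with one cluster per text
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], vectors)
                if best is None or sim > best[0]:
                    best = (sim, x, y)
        if best[0] < threshold:
            break  # nothing left that is similar enough to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Made-up job titles, for illustration only.
titles = [
    "senior software engineer",
    "software engineer",
    "java software developer",
    "registered nurse",
    "pediatric nurse",
]

loose = agglomerate(titles, 0.3)    # lower threshold: fewer, broader groups
strict = agglomerate(titles, 0.45)  # higher threshold: more, tighter groups
print(loose)
print(strict)
```

With the lower threshold the three software titles end up in one group; with the higher one "java software developer" stays on its own. Picking that cutoff for a given data set is the part that takes judgment.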