Saturday, May 30

Hilary Mason at BarCampNYC4

Great talk on data scrubbing: Have Data? What now?!

Open Calais and Freebase as data analysis tools.

Entity disambiguation - the ability to discern which mention goes with which entity (Hilary shows Cuil's search results for her own name)
Company disambiguation - often handled by humans
At Path101 - handled by humans, data APIs (e.g. mTurk) or auto-classification

Shows her Google spam box: how does Google decide what ends up there?
eScienceNews uses a vector analysis and hierarchical clustering model (to figure out what is interesting), then a Bayesian document classification model.
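
To make the Bayesian document classification idea concrete, here is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is not the eScienceNews or Google implementation (the talk did not show code); the class name, the whitespace tokenizer, and the tiny spam/ham training set are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class NaiveBayes:
    """Illustrative naive Bayes classifier over bag-of-words counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: how common this label is in the training data.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Add-one (Laplace) smoothing so unseen words do not zero out the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples, for illustration only.
nb = NaiveBayes()
nb.train("cheap pills buy now", "spam")
nb.train("win money now", "spam")
nb.train("meeting notes for the barcamp talk", "ham")
nb.train("clustering talk slides attached", "ham")

print(nb.classify("buy cheap money"))
print(nb.classify("talk slides"))
```

A real spam filter would train on far more data and use a better tokenizer, but the scoring step (prior times per-word likelihoods, in log space) is the same shape.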

Discussing clustering and hierarchical clustering and how they are applied at Path101.
Depending on your data, you need to choose your algorithm - lots of rules of thumb, but the artistry is in knowing how to tune the algorithm so the groups/clusters come out meaningful.
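
A rough sketch of what the tuning looks like in practice, assuming bottom-up (agglomerative) clustering with average linkage on 2D points. This is not Path101's code; the function names, the sample points, and the two threshold values are invented for illustration. The same data gives different groupings depending on where you stop merging.

```python
import math


def average_linkage(a, b):
    """Mean pairwise Euclidean distance between two clusters of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def agglomerate(points, threshold):
    """Repeatedly merge the two closest clusters until the closest
    remaining pair is farther apart than `threshold`."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        # Merge cluster j into cluster i (j > i, so deleting j keeps i valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up sample data: two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]

tight = agglomerate(points, threshold=2.0)    # small threshold: more, smaller clusters
loose = agglomerate(points, threshold=15.0)   # large threshold: fewer, bigger clusters

print(len(tight), len(loose))
```

The threshold here plays the role of the "rule of thumb" knob: there is no single right value, and picking it usually means looking at the resulting clusters and judging whether they make sense for the data.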

