Data Mining
Why Data Mining?
- Available data grows very fast
- Parkinson's law
- Work fills the amount of time available for its completions
- Parkinson's law of data
- Data fills the space available for its storage
- Moore's law
- The number of transistors on a microprocessor doubles every 18 months
- Equivalent of Moore's law for disk storage
- Hard disk capacity doubles every 9 months
What is Data Mining?
- Prediction
- Predicting the value of a continuous variable (e.g., temperature)
- Or different "classes": True <=> false, sick <=> healthy,
etc.
- Clustering
- Discovering interesting patterns
- Something common
- E.g. customer shopping pattern
- (frequent pattern discovery)
- Something uncommon
- E.g. suspicious of terrorism
- (outlier analysis)
- "I know that it's interesting when I see it"
Prediction
Linear Regression
- Simplest case: Predict one value as a linear function of another
- E.g. If I set the heater to "high" what will the temperature be?
http://www.stattucino.com/berrie/dsl/regression/regression.html
Classification
- Predicting a "class":
- Red square or green circle? (see applet below)
- E.g., Flu or not flu?
- Assume axes are body temperature and age
- High age and high body temperature => flu
- Young age and low body temperature => no flu
- Young age and high body temperature may not mean flu
http://www.cs.technion.ac.il/~rani/LocBoost/
- Left click on grid to get set of blue points
- Right click somewhere else to get set of red points
- (possibly repeat a couple of times)
- Select classifier "Naive Bayes" from drop-down box below
- Hit "Classify" button
- To try again, use "Reload" button of browser
Clustering
- Finding groups in data
- Example: Typical student behavior
- There are some students who
- Contribute to class discussions
- Get good grades in projects
- Students in another group
- Do homework early
- May be quiet or contribute to class discussion
- Get good grades in exams
- Finally there are students who
- Don't show up for class
- Don't do their homework
- Get poor grades overall
- Applets again assume only two attributes
- e.g., grade and number of contributions to class discussions
Hierarchical Clustering
- Find pairs of similar objects / clusters
- Step by step go from many small clusters to few large ones
http://www.cs.mcgill.ca/~papou/helppage.htm
- Click on "Run"
- To repeat click "Default"
- You can also use "Clear" and create points by clicking on the canvas
Hierarchical Clustering in a Practical Bioinformatics
Example
- Hierarchical clustering is extensively used for genes
- Genes hold the blueprint for life
- Consist of sequences of 4 types of "letters", A, C, G, and T
- AACGTCA is closer to AACCTCA than to CGTTTAC
- Genes in evolutionarily close organisms are similar
- Build a "tree of life" based on similarity of gene sequences
http://www.genetics.wustl.edu/eddy/atv/atvapplets/ATV_applets.html
- Click on any of the buttons
- You may have to click "zoom in" once to see something
Visualization of Trees and Graphs
Trees
- No loops
- May or may not have root
Graphs
Examples of Graphs
- Web pages
- Social networks
- Airline networks
- Proteins that interact in a body
Information Visualization on a Graph
- Typical graphs are large:
- Navigation important ("moving around" in graph)
http://touchgraph.sourceforge.net/
- Scroll down and click on the image under GraphLayout V1.21
- Check instructions in separate window
http://www.inxight.com/map
- Click on nodes to move around
Compare both visualizations!
Information Visualization on a Tree
- Any of the above can be used
- Alternative tree representation:
http://www.smartmoney.com/marketmap
- Click on rectangles
- Note amount of information available for nodes
Current Research
- Visualize data on nodes
- Each pie slice has different meaning
- Same functionality as "touchgraph" software above