Data Mining and Visualization
Class: M/W 3:30–4:45pm, LSB 312
Instructor: Henry Feild, Ph.D.
Office Hours: MTWRF 10–10:50am in LSB 113.
Schedule
Unless otherwise noted, all homework and projects are due by the beginning of class on the day listed.
This page remembers which checkboxes you have clicked, provided you return to it from the same browser and computer (and don't clear your browser cache). Check off assignments as you complete them to keep track of what you've done!
Wk. | Dates | Topic | Readings | Projects due/Exams |
---|---|---|---|---|
1 | Jan. 23 & 25 | Overview | Han Ch. 1 & 2; McKinney Ch. 1; Codecademy: Python | |
2 | Jan. 30 & Feb. 1 | NumPy, Pandas | McKinney Ch. 4 & 5 | |
3 | Feb. 6 & 8 | Reading/wrangling data | Han Ch. 2 & 3; McKinney Ch. 6 & 7 | |
4 | Feb. 13 & 15 | Wrangling/visualizing data | McKinney Ch. 8 | (Meeting in LSB 230 both days) |
– | Feb. 20 | No class: Presidents' Day | | |
5 | Feb. 22 | Exam 1 on Wed. 2/22 | | |
6 | Feb. 27 & Mar. 1 | Data aggregation | McKinney Ch. 9 | |
7 | Mar. 6 & 8 | Advanced data processing | DOT graph description language; Install nxpd | |
– | Mar. 11–19 | No classes: Spring break | | |
8 | Mar. 20 & 22 | Pattern mining | Han Ch. 6 | |
9 | Mar. 27 & 29 | Cluster analysis, scikit-learn | Han Ch. 10; Intro to scikit-learn; scikit-learn clustering | Exam 2 on Wed. 3/29 |
10 | Apr. 3 & 5 | Intro to machine learning | Han Ch. 8; What is ML? (notes); Iris dataset (notes); Training models (notes) | Project 1 (Wed. 4/5) |
11 | Apr. 10 & 12 | Data splits, regression | Train/test splits (notes); Cross-validation (notes); Linear regression (notes) | |
– | Apr. 17 | No class: Patriots' Day | | |
12 | Apr. 19 | Evaluation, feature/parameter selection | Grid search (notes); Metrics (notes) | |
13 | Apr. 24 & 26 | More classification | Han Ch. 10 (you should have already read this chapter) | |
14 | May 1 & 3 | Review | | Exam 3 on Wed. 5/3 |
– | May 10 | Project presentations in LSB 312, 10:15am–12:15pm | | Project 2 due |
Assignments
Quizzes
Most Mondays will begin with a short quiz based on the homework and material from the previous week. Paper notes may be used, but no electronic resources. Not all quizzes will necessarily be graded.
Homeworks
Each week there will be a homework based on the readings and class material from the previous week. Homeworks are meant to help you check that you understand the material. Multiple-choice homework questions on Canvas will be graded automatically; other submitted material (usually extensions of what we worked on in class) will be graded based on apparent effort. I will provide solutions when possible. Homeworks are usually due before the start of class each Monday (see the schedule for due dates).
Projects
There will be two large projects due during the semester. The first project focuses on data exploration, which roughly corresponds to the first half of the semester. The second project is a continuation of the first and involves training and evaluating predictive models on the same data set. More information can be found on the project assignment pages, listed below.
Late policy
As outlined in the syllabus, no late work will be accepted. Homework and projects must be submitted on time. This gives us the chance to discuss assignments in class right after their deadlines.
Resources
- Style guidelines
- Pandas Cheat Sheet (PDF)
- k-Nearest Neighbors scikit-learn documentation
- Decision trees scikit-learn documentation
- Logistic regression scikit-learn documentation (Sec. 1.1.11)
- Naive Bayes scikit-learn documentation
- SVM scikit-learn documentation