Outline of a 4th year honours course in Data Mining
---------------------------------------------------

Markus Hegland, SMS
Steve Roberts, SMS

February 2002

The aim of this course is to provide an introduction to data mining techniques through the underlying mathematical concepts and algorithms. At the end of the course, the student will be able to understand current data mining literature, have a good idea of the major techniques and algorithms, and know how to apply the techniques in data mining studies. While we will focus on the algorithmic side, data analysis issues will be covered as well.

The lectures will discuss the techniques and their underlying principles. The topics covered in the course are listed below. The major focus will be on sections 2 and 3. Hand-outs will be provided.

Prerequisites for this course are some basic understanding of search algorithms, approximation, linear algebra, analysis, probability, statistics and general programming (e.g., in MATLAB, C/C++, Python or Java). Most of the prerequisites will have been covered in earlier maths courses.

1. General issues
   Large and complex data, curse of dimensionality, concentration of measure, scalable and parallel algorithms, computational issues, data issues -- noise, outliers, missing data, features, preprocessing, application, interestingness, relational data model vs flat files, complexity, data model, data aggregation.

2. Association Rules
   Definition, Apriori property, hierarchical association rules, quantitative association rules, approximation and discretisation, partitioning.

3. Classification and Regression
   Misclassification costs, decision trees, Bayesian classifiers, classification and rule induction, regression trees, MARS, additive models, ANOVA decomposition and approximations, smoothing, radial basis functions.

4. Clustering
   Metrics and impurity measures, partitioning methods, k-means, k-medoids, hierarchical methods, density-based techniques.

Main textbook:

* J. Han and M. Kamber: Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001
* Special references will be provided for each section.

Assignments:

- Coding of a simple data mining technique (Apriori, recursive decision tree or k-means) and/or application to a data set of the student's own choice or from a web repository. Interestingness of results, efficiency, functionality and accuracy of code, understanding of data mining principles. Combined assignments with application areas in business and science are encouraged. The software suggested for this assignment is WEKA; see http://www.cs.waikato.ac.nz/~ml/weka/index.html. This assignment is to be completed during the first half of the course.
- Seminar presentation (20 min.) of a data mining paper, which will be provided. Understanding and formalisation of mathematical ideas and computational issues. The student talks will be given towards the end of the course.

Further Seminars:

Students wishing to learn more and further test their understanding may attend regular seminars given by active data mining researchers on application areas, algorithms and theory of data mining.

Honours Research Topics:

Several possible research topics in data mining are listed on the web; see http://datamining.anu.edu.au. Interdisciplinary projects are encouraged.