Parallel Data Mining of Large Datasets
The project's main aim is to study how effectively data mining techniques such as CT-ITL (Compact Tree – Item Transaction List) can be parallelized. Although the data mining community has proposed and tested several techniques, very few are suitable for parallel processing. Our project demonstrates both the speedup and the scale-up achieved by mining association rules in parallel on medium and very large transactional datasets. Because the volume of data to be mined grows by some factor every year, parallelization is an effective and viable way to provide the speedup that such processor-intensive tasks require.
Principal Investigator: Amit Rudra, Information Systems, Curtin University of Technology
Project: e42
Co-Investigators: Yanrong Li and Yudho Sucahyo, Computing, Curtin University of Technology
RFCD Codes: 280108
Significant Achievements, Anticipated Outcomes and Future Work
Data mining, especially association rules mining (ARM), is becoming a popular and much sought-after method in both business (e.g. customer profiling, buying patterns, program viewing) and scientific applications. While mining techniques have improved in performance, the amount of data to be mined has also increased dramatically. As a result, researchers are always seeking better ways to speed up data mining. Our main goal was to show the speedup achieved by parallelizing our innovative CT-ITL algorithm. CT-ITL had already been shown to outperform some existing ARM techniques in a single-processor environment (see publications [1, 3]), but its ability to execute in parallel was not known. This project showed that our data-partitioning scheme works well for mining large datasets in parallel on a number of nodes of a cluster. Our experiments on the APAC National Facility Linux cluster showed that a significant processing speedup is possible across a number of nodes, but that beyond a certain threshold no further gain in processing speed is achieved. This is a significant finding, and we are investigating what the inhibiting factors, if any, are. In future, understanding these factors will help the data mining community achieve a better balance of processing and data load on each node. The project also showed experimentally that our parallel techniques scale up to very large datasets, which was quite difficult, if not outright impossible, to show on a single-CPU computer.
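For readers less familiar with the terms, the usual definitions of speedup and parallel efficiency can be sketched in a few lines of C++. This is a generic illustration rather than code from the project: the function names are hypothetical, and any timings passed in are placeholders, not measurements from our experiments.

```cpp
#include <cassert>

// Speedup S(p) = T(1) / T(p): how many times faster the run on p nodes is
// compared with the single-node run. (Hypothetical helper, not project code.)
double speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

// Parallel efficiency E(p) = S(p) / p. A value of 1.0 corresponds to ideal
// linear speedup; values well below 1.0 mean extra nodes add little benefit.
double efficiency(double t_serial, double t_parallel, int nodes) {
    assert(nodes > 0);
    return speedup(t_serial, t_parallel) / nodes;
}
```

Under these definitions, a run whose time stops decreasing as nodes are added keeps the same speedup while its efficiency falls, which is one way to recognise the kind of threshold described above.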
Computational Techniques Used
Two main C++ programs, makepart and mining, were used. The makepart program prepared partitions from the original dataset, readying them for mining. The mining program was parallelized using the MPI library and mined, in parallel, the partitioned data distributed across the nodes of the cluster. Both parallel and non-parallel versions were extensively tested on model problems. The mining technique was tested on various node configurations (4, 8, 12, ...). Near-linear speedup was achieved on up to 12 nodes. Beyond this limit, however, the distribution of data and processing load possibly nullifies any further speedup gains. This problem is under further investigation.
The details of the algorithms and parallelization strategies are reported in publications [1, 2].
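The partition-then-mine workflow described in this section can be illustrated with a small, sequential C++ sketch. It is not the makepart or mining source and does not implement CT-ITL: it only counts the support of single items, it assumes a round-robin partitioning scheme, and it assumes no item is repeated within a transaction. All names are hypothetical, and the MPI calls are left out so that the example stays self-contained; the merge step stands in for the role a reduction across nodes would play.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Transaction = std::vector<int>;       // item ids in one transaction
using Dataset = std::vector<Transaction>;
using Counts = std::map<int, std::size_t>;  // item id -> support count

// Split the dataset into k partitions by dealing transactions out
// round-robin (an assumed scheme that keeps partition sizes balanced).
std::vector<Dataset> make_partitions(const Dataset& data, std::size_t k) {
    assert(k > 0);
    std::vector<Dataset> parts(k);
    for (std::size_t i = 0; i < data.size(); ++i)
        parts[i % k].push_back(data[i]);
    return parts;
}

// Work one node would do on its own partition: count item occurrences.
Counts count_items(const Dataset& part) {
    Counts local;
    for (const Transaction& t : part)
        for (int item : t)
            ++local[item];
    return local;
}

// Combine the per-partition counts into global support counts.
Counts merge_counts(const std::vector<Counts>& locals) {
    Counts total;
    for (const Counts& c : locals)
        for (const auto& kv : c)
            total[kv.first] += kv.second;
    return total;
}

// Run the pipeline sequentially, with a loop standing in for k nodes.
Counts partitioned_support(const Dataset& data, std::size_t k) {
    std::vector<Counts> locals;
    for (const Dataset& part : make_partitions(data, k))
        locals.push_back(count_items(part));
    return merge_counts(locals);
}
```

Because each partition is counted independently, the result does not depend on the number of partitions. That independence is what allows the per-partition work to be placed on separate nodes.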
Publications, Awards and External Funding
External Funding and Awards
None.
Publications
1. Y.G. Sucahyo, R.P. Gopalan, A. Rudra. Efficiently Mining Frequent Patterns from Dense Datasets using a Cluster of Computers. Proc. of 16th Australian Conf. on AI (AI2003), pp. 233-244, Perth, Australia, 2003.
2. A. Rudra, R.P. Gopalan, Y.G. Sucahyo. Scalable Parallel Mining for Frequent Patterns from Dense Datasets Using a Cluster of PCs. Proc. of 6th Int. Conf. on Information Technology (CIT03), Bhubaneswar, India, 2003.
3. R.P. Gopalan, Y.G. Sucahyo. Improving the Efficiency of Frequent Pattern Mining by Compact Data Structure Design. Proc. of 4th Int. Conf. on Intelligent Data Engineering and Automated Learning (IDEAL), Hong Kong, 2003.