Parallel Data Mining and Warehousing on Large Transactional Dataset


As data warehouses store large amount of data, one of the efficient ways of manipulating such huge information is to use parallelized algorithms for data mining and data warehousing applications. This project will check the feasibility of using a large number of nodes to house such information and manipulate them using various data mining and OLAP (online analytical processing) operations. Finally, the results of such studies will be published and presented at suitable international fora.


Principal Investigator

Amit Rudra
Information Systems
Curtin University of Technology

Project

g18

Co-Investigators

Yanrong Li
Yudho Sucahyo
Raj Rajgopalan
Computing
Curtin University of Technology


Narasimaha Achuthan
Mathematics
Curtin University of Technology

RFCD Codes

280108


Significant Achievements, Anticipated Outcomes and Future Work

The project this year looked into the feasibility of parallel data mining techniques using MILP (mixed integer linear programming).  To be more specific it used real-life transactional data from a Belgian supermarket retailer.  It proved that these techniques are feasible to be run on the super-cluster for small to medium problems.  For large problems i.e. problems in integer programming that are likely to traverse many nodes it may still be infeasible to solve using the existing configuration and resources and will need further investigation.

We have also published a chapter in a Springer-Verlag publication on data mining using MILP techniques.

 

Data Sources, Curation Techniques, Data Access Policy and Method

None

 

Computational Techniques Used

To solve the problem of data mining optimal item packages using MILP techniques CPLEX package from ILOG was used. This was installed on the APAC-AC for evaluation purpose.

Method:

Following is an integer linear programming model for the OIPP. Let yi denote the 0-1 decision variable that assumes value 1 when ever item i is chosen. Let zj denote the 0-1 decision variable that assumes value 1 whenever the frequent item set Xj is covered by the set of selected items, that is, by the set of items { i: yi = 1}.

(6) Lower and upper bound constraints: NL ≤ NU.

(7) Occurrence constraint of Xj: -nj zj ≥ 0, 1≤j≤k

(8) Storage space constraint: ≤S.

(9) Lower bound constraint on profit: ≥Minprofit

(10) Restrictions on variables: yi = 0 or 1, zt = 0 or 1.

(11) Objective function: Maximize - .

In this value based frequent item set problem the input information regarding X1,..., Xk , pj, fj , si and ci must be extracted through data mining of frequent item sets.

For the above model (6)-(11), the constraints and the objective function may be validated as follows:

Let the set of selected items to cover all the selected frequent item sets be denoted by = { i: yi = 1}. It is easy to see that | | = and the constraint (6) provides the lower and upper bound restrictions on this number. The number of items common to the set and the frequent item set Xj is given by . Whenever = |Xj| = nj , the set covers the frequent item set Xj . The constraint (7) ensures that the decision variable zj is 1 if and only if the frequent item set Xj covered by . In this case note that F = { Xj : zj = 1} is the collection of frequent item sets selected. The storage space required by the items of the selected set is . The constraint (8) expresses the upper bound restriction on the available storage space. The contribution made by the frequent item set Xj to the profit may be expressed as pjfj zj where zj =1 if and only if Xj is covered by the set . The constraint (9) ensures a minimum revenue contribution from the set of all covered frequent item sets. The constraints of (10) express the 0-1 restrictions of the decision variables yi and zj. The objective function in (11) maximizes the total profit contribution expressed as the total net revenue.

 

Publications, Awards and External Funding

External Funding and Awards

None

Publications

N. R. Achuthan, R. Gopalan, & A. Rudra, Mining Value-Based Item Packages - An Integer Programming Approach in Data Mining, G.J. Williams and S.J. Simoff (Eds.): Data Mining, LNAI 3755, 2006, pp. 78 – 89. © Springer-Verlag Berlin Heidelberg.