Parallel Data Mining and Warehousing on Large Transactional Dataset
As data warehouses store large amount of data, one of the efficient ways of manipulating such huge information is to use parallelized algorithms for data mining and data warehousing applications. This project will check the feasibility of using a large number of nodes to house such information and manipulate them using various data mining and OLAP (online analytical processing) operations. Finally, the results of such studies will be published and presented at suitable international fora.
|
Principal Investigator Amit RudraInformation Systems Curtin University of Technology |
Project g18 |
|
Co-Investigators Yanrong LiYudho Sucahyo Raj Rajgopalan Computing Curtin University of Technology Narasimaha Achuthan Mathematics Curtin University of Technology |
RFCD Codes 280108 |
Significant Achievements, Anticipated Outcomes and Future Work
The project this year looked into the feasibility of parallel data mining techniques using MILP (mixed integer linear programming). To be more specific it used real-life transactional data from a Belgian supermarket retailer. It proved that these techniques are feasible to be run on the super-cluster for small to medium problems. For large problems i.e. problems in integer programming that are likely to traverse many nodes it may still be infeasible to solve using the existing configuration and resources and will need further investigation.
We have also published a chapter in a Springer-Verlag publication on data mining using MILP techniques.
Data Sources, Curation Techniques, Data Access Policy and Method
None
Computational Techniques Used
To solve the problem of data mining optimal item packages using MILP techniques CPLEX package from ILOG was used. This was installed on the APAC-AC for evaluation purpose.
Method:
Following is an integer linear programming model for the OIPP. Let yi denote the 0-1 decision variable that assumes value 1 when ever item i is chosen. Let zj denote the 0-1 decision variable that assumes value 1 whenever the frequent item set Xj is covered by the set of selected items, that is, by the set of items { i: yi = 1}.
(6) Lower and upper bound constraints: NL ≤
≤ NU.
(7) Occurrence constraint of Xj:
-nj zj ≥ 0, 1≤j≤k
(8) Storage space constraint:
≤S.
(9) Lower bound constraint on profit:
≥Minprofit
(10) Restrictions on variables: yi = 0 or 1, zt = 0 or 1.
(11) Objective function: Maximize
-
.
In this value based frequent item set problem the input information regarding X1,..., Xk , pj, fj , si and ci must be extracted through data mining of frequent item sets.
For the above model (6)-(11), the constraints and the objective function may be validated as follows:
Let the set of selected items to cover all the selected frequent item sets be denoted by
= { i: yi = 1}. It is easy
to see that |
| =
and the constraint (6) provides the lower and upper bound restrictions on this number. The number
of items common to the set
and the frequent item set Xj is given by
. Whenever
= |Xj| = nj , the set
covers the
frequent item set Xj . The constraint (7) ensures that the decision variable zj is 1 if and only if the frequent item
set Xj covered by
. In this case note that F = { Xj : zj = 1} is the collection of frequent item sets selected. The
storage space required by the items of the selected set
is
. The constraint (8) expresses the upper bound restriction
on the available storage space. The contribution made by the frequent item set Xj to the profit may be expressed as pjfj
zj where zj =1 if and only if Xj is covered by the set
. The constraint (9) ensures a minimum revenue contribution from
the set of all covered frequent item sets. The constraints of (10) express the 0-1 restrictions of the decision
variables yi and zj. The objective function in (11) maximizes the total profit contribution expressed as the total net
revenue.
Publications, Awards and External Funding
External Funding and Awards
None
Publications
N. R. Achuthan, R. Gopalan, & A. Rudra, Mining Value-Based Item Packages - An Integer Programming Approach in Data Mining, G.J. Williams and S.J. Simoff (Eds.): Data Mining, LNAI 3755, 2006, pp. 78 – 89. © Springer-Verlag Berlin Heidelberg.