Data Mining: Concepts and Techniques (2nd edition). Jiawei Han and Micheline Kamber. Morgan Kaufmann Publishers.
Spectators may be students, adults, or seniors, with each category having its own charge rate. Taking this cube as an example, briefly discuss advantages and problems of using a bitmap index structure. Bitmap indexing is advantageous for low-cardinality domains.
For example, in this cube, if the location dimension is bitmap indexed, then comparison, join, and aggregation operations over location are reduced to bit arithmetic, which substantially reduces the processing time. For dimensions with high cardinality, such as date in this example, the vector used to represent the bitmap index could be very long. For example, a multi-year collection of daily data contains one distinct date value per day, and every tuple in the fact table would need one bit per distinct date to hold the bitmap index.
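The reduction of selection and counting to bit arithmetic can be illustrated with a minimal sketch. The column values (city names and spectator categories) and the helper name `build_bitmap_index` are assumptions made for illustration; Python integers stand in for bit vectors.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

# Hypothetical fact-table columns (one entry per tuple).
location = ["Vancouver", "Toronto", "Vancouver", "Chicago", "Toronto"]
spectator = ["student", "adult", "adult", "senior", "student"]

loc_idx = build_bitmap_index(location)
spec_idx = build_bitmap_index(spectator)

# A conjunctive selection is a bitwise AND of two bit vectors.
mask = loc_idx["Vancouver"] & spec_idx["adult"]
matching_rows = [i for i in range(len(location)) if (mask >> i) & 1]

# Counting the selected tuples is a population count on the result.
count = bin(mask).count("1")
```

Each distinct value costs one bit per tuple, which is why the index stays compact for a low-cardinality dimension such as location and grows quickly for a high-cardinality dimension such as date.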
Briefly describe the similarities and the differences of the two models (the star schema and the snowflake schema), and then analyze their advantages and disadvantages with regard to one another. Give your opinion of which might be more empirically useful and state the reasons behind your answer.

The two schemas are similar in that both have a fact table as well as some dimension tables. The major difference is that some dimension tables in the snowflake schema are normalized, thereby further splitting the data into additional tables. The advantage of the star schema is its simplicity, which enables efficiency, but it requires more space. The snowflake schema reduces some redundancy by sharing common tables. However, it is less efficient, and the saving of space is negligible in comparison with the typical magnitude of the fact table. Therefore, empirically, the star schema is better simply because efficiency typically has higher priority than space, as long as the space requirement is not too large. Another option is to use a snowflake schema to maintain dimensions, and then present users with the same data collapsed into a star.

References for the answer to this question include "Understand the difference between star and snowflake schemas in OLAP" and "Snowflake Schemas".
Design a data warehouse for a regional weather bureau. The weather bureau has about 1, probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour.
All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and on-line analytical processing, and derive general weather patterns in multidimensional space. Since the weather bureau has about 1, probes scattered throughout various land and ocean locations, we need to construct a spatial data warehouse so that a user can view weather patterns on a map by month, by region, and by different combinations of temperature and precipitation, and can dynamically drill down or roll up along any dimension to explore desired patterns.
The star schema of this weather spatial data warehouse can be constructed as shown in Figure 3.

Figure 3. A star schema for a weather spatial data warehouse of Exercise 3.

To construct this spatial data warehouse, we may need to integrate spatial data from heterogeneous sources and systems. Fast and flexible on-line analytical processing in spatial data warehouses is an important factor.
There are three types of dimensions in a spatial data cube, and we distinguish two types of measures: numerical and spatial. A nonspatial data cube contains only nonspatial dimensions and numerical measures. If a spatial data cube contains spatial dimensions but no spatial measures, then its OLAP operations, such as drilling or pivoting, can be implemented in a manner similar to that of nonspatial data cubes. If a user needs to use spatial measures in a spatial data cube, we can selectively precompute some spatial measures in the spatial data cube.
Which portion of the cube should be selected for materialization depends on the utility (such as access frequency or access priority), sharability of merged regions, and the balanced overall cost of space and on-line computation.

A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse multidimensional matrix.
Present an example illustrating such a huge and sparse data cube. For the telephone company, it would be very expensive to keep detailed call records for every customer for longer than three months.
Therefore, it would be beneficial to remove that information from the database, keeping only the total number of calls made, the total minutes billed, and the amount billed, for example. The resulting computed data cube for the billing database would have large amounts of missing or removed data, resulting in a huge and sparse data cube.

Regarding the computation of measures in a data cube: Describe how to compute it if the cube is partitioned into many chunks.

The three categories of measures are distributive, algebraic, and holistic.

The variance function is algebraic. If the cube is partitioned into many chunks, the variance can be computed as follows: read in the chunks one by one, keeping track of the accumulated (1) number of tuples $N$, (2) sum of $x_i^2$, and (3) sum of $x_i$. After the last chunk has been read, the variance is obtained as $\frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2$.

For each cuboid, use 10 units to register the top 10 sales found so far. Read the data in each cuboid once. If the sales amount in a tuple is greater than an existing one in the top-10 list, insert the new sales amount from the new tuple into the list, and discard the smallest one in the list. The computation of a higher-level cuboid can be performed similarly by propagation of the top-10 cells of its corresponding lower-level cuboids.
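Both chunk-at-a-time computations described above can be sketched in a few lines. This is a minimal illustration, assuming each chunk is simply a list of numeric measure values; the function names `chunked_variance` and `top_k_sales` are hypothetical, and a min-heap is one possible way to maintain the top-10 list.

```python
import heapq

def chunked_variance(chunks):
    """Population variance from three running totals, reading each chunk once."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            total += x
            total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean

def top_k_sales(chunks, k=10):
    """Keep the k largest values seen so far in a min-heap of size k."""
    heap = []
    for chunk in chunks:
        for amount in chunk:
            if len(heap) < k:
                heapq.heappush(heap, amount)
            elif amount > heap[0]:
                # The new amount beats the smallest kept value, so swap it in.
                heapq.heapreplace(heap, amount)
    return sorted(heap, reverse=True)

chunks = [[2.0, 4.0], [4.0, 4.0], [5.0, 5.0, 7.0, 9.0]]
variance = chunked_variance(chunks)
top3 = top_k_sales(chunks, k=3)
```

The sum-of-squares form can lose precision when the values are large relative to their spread, so a production implementation might prefer a numerically stabler update; the sketch follows the three accumulators named in the text.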
Suppose that we need to record three measures in a data cube: min, average, and median. Design an efficient computation and storage method for each measure given that the cube allows data to be deleted incrementally.

For min, keep the ⟨min_val, count⟩ pair for each cuboid to register the smallest value and its count. For each deleted tuple, if its value is greater than min_val, do nothing. Otherwise, decrement the count of the corresponding node. If a count goes down to zero, recalculate the structure.

For average, keep a ⟨sum, count⟩ pair for each cuboid. For each deleted node N, decrement the count and subtract value(N) from the sum.

For median, keep a small number, p, of centered values. Each removal may change the count or remove a centered value. If the median no longer falls among these centered values, recalculate the set. Otherwise, the median can easily be calculated from the above set.

i. The generation of a data warehouse (including aggregation)
ii. Roll-up
iii. Drill-down
iv. Incremental updating
Which implementation techniques do you prefer, and why?

A ROLAP technique for implementing a multidimensional view consists of intermediate servers that stand between a relational back-end server and client front-end tools, thereby using a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
A MOLAP implementation technique consists of servers that support multidimensional views of data through array-based multidimensional storage engines that map multidimensional views directly to data cube array structures. In generating a data warehouse with ROLAP, the summary fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube. In generating a data warehouse, the MOLAP technique uses multidimensional array structures to store data and multiway array aggregation to compute the data cubes.
Roll-up. ROLAP: To roll up on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to roll up the date dimension from day to month, select the record for which the day field contains the special value all. The value of the measure field (dollars sold, for example) given in this record will contain the subtotal for the desired roll-up. MOLAP: To perform a roll-up in a data cube, simply climb up the concept hierarchy for the desired dimension. For example, one could roll up on the location dimension from city to country, which is more general.

Drill-down. ROLAP: To drill down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to drill down on the location dimension from country to province or state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field (dollars sold, for example) given in this record will contain the subtotal for the desired drill-down. MOLAP: To perform a drill-down in a data cube, simply step down the concept hierarchy for the desired dimension. For example, one could drill down on the date dimension from month to day in order to group the data by day rather than by month.
Incremental updating. ROLAP: To perform incremental updating, check whether the corresponding tuple is in the summary fact table. If not, insert it into the summary table and propagate the result up. Otherwise, update the value and propagate the result up. MOLAP: To perform incremental updating, check whether the corresponding cell is in the cuboid. If not, insert it into the cuboid and propagate the result up. Otherwise, update the value and propagate the result up.
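The summary-fact-table operations above can be sketched with a dictionary standing in for the relational table. This is an illustration under simplifying assumptions: the key layout `(month, day, city)`, the sample sales, and the function names are hypothetical, and only the day and city fields are generalized to the special value all.

```python
ALL = "all"  # special value marking a generalized (aggregated) field

def add_sale(summary, month, day, city, dollars_sold):
    """Incremental update: insert or update the base record, then propagate upward."""
    for d in (day, ALL):
        for c in (city, ALL):
            key = (month, d, c)
            summary[key] = summary.get(key, 0.0) + dollars_sold

def roll_up_day_to_month(summary):
    """Roll-up: select the records whose day field holds the special value all."""
    return {(m, c): v for (m, d, c), v in summary.items() if d == ALL and c != ALL}

summary = {}
add_sale(summary, "Jan", 5, "Vancouver", 100.0)
add_sale(summary, "Jan", 6, "Vancouver", 50.0)
add_sale(summary, "Jan", 5, "Toronto", 70.0)

monthly = roll_up_day_to_month(summary)
```

Because every insertion also updates the generalized records, a roll-up becomes a lookup rather than a recomputation, at the price of storing the extra rows.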
If the data are sparse and the dimensionality is high, there will be too many cells due to exponential growth and, in this case, it is often desirable to compute iceberg cubes instead of materializing the complete cubes. Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity.
How would you design a data cube structure to efficiently support this preference?
How would you support this feature? An efficient data cube structure to support this preference would be to use partial materialization, or selected computation of cuboids. By computing only the proper subset of the whole set of possible cuboids, the total amount of storage space required would be minimized while maintaining a fast response time and avoiding redundant computation. Since the user may want to drill through the cube for only one or two dimensions, this feature could be supported by computing the required cuboids on the fly.
Since the user may only need this feature infrequently, the time required for computing aggregates on those one or two dimensions on the fly should be acceptable.

A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid. Assume that there are no concept hierarchies associated with the dimensions.

The maximum number of cells possible in the base cuboid is $p^n$. This is the maximum number of distinct tuples that you can form with p distinct values per dimension.

The minimum number of cells possible in the base cuboid is p. You need at least p tuples to contain p distinct values per dimension. In this case no tuple shares any value on any dimension.

The minimum number of cells is when each cuboid contains only p cells, except for the apex, which contains a single cell. Since there are $2^n - 1$ cuboids other than the apex, this gives $(2^n - 1) \times p + 1$ cells.

What are the differences between the three main types of data warehouse usage: information processing, analytical processing, and data mining? Information processing involves using queries to find and report useful information using crosstabs, tables, charts, or graphs.
Analytical processing uses basic OLAP operations such as slice-and-dice, drill-down, roll-up, and pivoting on historical data in order to provide multidimensional analysis of data warehouse data.
Data mining uses knowledge discovery to find hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. The motivations behind OLAP mining are the following: The high quality of data i.
The available information processing infrastructure surrounding data warehouses means that comprehensive information processing and data analysis infrastructures will not need to be constructed from scratch.
On-line selection of data mining functions allows users who may not know what kinds of knowledge they would like to mine the flexibility to select desired data mining functions and dynamically swap data mining tasks. Assume a base cuboid of 10 dimensions contains only three base cells: The measure of the cube is count. A closed cube is a data cube consisting of only closed cells.
How many closed cells are in the full cube? Briefly describe these three methods i. Note that the textbook adopts the application worldview of a data cube as a lattice of cuboids, where a drill-down moves from the apex all cuboid, downward in the lattice. Star-Cubing works better than BUC for highly skewed data sets. The closed-cube and shell-fragment approaches should be explored.
Here, we have two cases, which represent two possible extremes, 1. The k tuples are organized like the following: However, this scheme is not effective if we keep dimension A and instead drop B, because obviously there would still be k tuples remaining, which is not desirable.
It seems that case 2 is always better. A heuristic way to think this over is as follows: Obviously, this can generate the most number of cells: We assume that we can always do placement as proposed, disregarding the fact that dimensionality D and the cardinality ci of each dimension i may place some constraints.
The same assumption is kept throughout for this question. If we fail to do so e. The question does not mention how cardinalities of dimensions are set. To answer this question, we have a core observation: Minimum case: The distinct condition no longer holds here, since c tuples have to be in one identical base cell now. Thus, we can put all k tuples in one base cell, which results in $2^D$ cells in all. Maximum case: We will replace k with $\lfloor k/c \rfloor$ and follow the procedure in part (b), since we can get at most that many base cells in all.
From the analysis in part c , we will not consider the threshold, c, as long as k can be replaced by a new value. Considering the number of closed cells, 1 is the minimum if we put all k tuples together in one base cell.
How can we reach this bound? We assume that this is the case. We also assume that cardinalities cannot be increased as in part (b) to satisfy the condition. Suppose that a base cuboid has three dimensions A, B, C, with the following number of cells: Suppose that each dimension is evenly partitioned into 10 portions for chunking.
The complete lattice is shown in Figure 4. A complete lattice for the cube of Exercise 4. The total size of the computed cube is as follows. The total amount of main memory space required for computing the 2-D planes is: Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet sparse, multidimensional matrix. Note that you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve data from your structures.
Give the reasoning behind your new design. A way to overcome the sparse matrix problem is to use multiway array aggregation.
The first step consists of partitioning the array-based cube into chunks or subcubes that are small enough to fit into the memory available for cube computation. Each of these chunks is first compressed to remove cells that do not contain any valid data, and is then stored as an object on disk.
The second step involves computing the aggregates by visiting cube cells in an order that minimizes the number of times that each cell must be revisited, thereby reducing memory access and storage costs.
By first sorting and computing the planes of the data cube according to their size in ascending order, a smaller plane can be kept in main memory while fetching and computing only one chunk at a time for a larger plane.
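The two steps above can be sketched as follows. This is a simplified illustration, not the full multiway array aggregation algorithm: it stores only nonempty cells per chunk and updates all three 2-D planes during a single pass over the chunks, but it does not model on-disk storage or the memory-minimizing visiting order. The cell coordinates, chunk size, and function names are assumptions.

```python
from collections import defaultdict

def to_chunks(cells, chunk_size):
    """Group nonempty cells {(a, b, c): value} by chunk; empty cells are never stored."""
    chunks = defaultdict(dict)
    for (a, b, c), value in cells.items():
        chunk_id = (a // chunk_size, b // chunk_size, c // chunk_size)
        chunks[chunk_id][(a, b, c)] = value
    return dict(chunks)

def aggregate_2d_planes(chunks):
    """Visit each chunk once and update the AB, AC, and BC planes simultaneously."""
    ab = defaultdict(float)
    ac = defaultdict(float)
    bc = defaultdict(float)
    for chunk_id in sorted(chunks):
        for (a, b, c), value in chunks[chunk_id].items():
            ab[(a, b)] += value
            ac[(a, c)] += value
            bc[(b, c)] += value
    return dict(ab), dict(ac), dict(bc)

# Hypothetical sparse base cuboid: only three of many possible cells hold data.
cells = {(0, 0, 0): 1.0, (0, 0, 5): 2.0, (3, 1, 5): 4.0}
chunks = to_chunks(cells, chunk_size=2)
ab, ac, bc = aggregate_2d_planes(chunks)
```

Retrieving a cell amounts to computing its chunk identifier and looking it up in that chunk's dictionary; a missing key means the cell is empty.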
In order to handle incremental data updates, the data cube is first computed as described in a. Subsequently, only the chunk that contains the cells with the new data is recomputed, without needing to recompute the entire cube. This is because, with incremental updates, only one chunk at a time can be affected. The recomputed value needs to be propagated to its corresponding higher-level cuboids.
Thus, incremental data updates can be performed efficiently. When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: Compute the number of nonempty aggregate cells. Comment on the storage space and time required to compute these cells. If the minimum support count in the iceberg condition is two, how many aggregate cells will there be in the iceberg cube?
Show the cells. However, even with iceberg cubes, we could still end up having to compute a large number of trivial uninteresting cells i. Suppose that a database has 20 tuples that map to or cover the two following base cells in a dimensional base cuboid, each with a cell count of Let the minimum support be How many distinct aggregate cells will there be like the following: What are the cells?
We subtract 1 because, for example, a1 , a2 , a3 ,. These four cells are: They are 4: There are only three distinct cells left: Propose an algorithm that computes closed iceberg cubes efficiently. We base our answer on the algorithm presented in the paper: Let the cover of a cell be the set of base tuples that are aggregated in the cell. Cells with the same cover can be grouped in the same class if they share the same measure. Each class will have an upper bound, which consists of the most specific cells in the class, and a lower bound, which consists of the most general cells in the class.
The set of closed cells correspond to the upper bounds of all of the distinct classes that compose the cube. We can compute the classes by following a depth-first search strategy: Let the cells making up this bound be u1 , u2 , Finding the upper bounds would depend on the measure.
Incorporating iceberg conditions is not difficult. Show the BUC processing tree which shows the order in which the BUC algorithm explores the lattice of a data cube, starting from all for the construction of the above iceberg cube. We know that dimensions should be processed in the order of decreasing cardinality, that is, use the most discriminating dimensions first in the hope that we can prune the search space as quickly as possible. In this case we should then compute the cube in the order D, C, B, A.
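The bottom-up exploration with pruning can be sketched as a short recursive procedure. This is a minimal illustration of the BUC idea with a count-based iceberg condition, assuming a nonempty list of equal-length tuples; the function name `buc_iceberg` and the sample rows are hypothetical, and the sketch omits the sorting and counting optimizations of the published algorithm.

```python
from collections import defaultdict

def buc_iceberg(rows, min_sup):
    """Return {cell: count} for cells with count >= min_sup ("*" marks an aggregated dimension)."""
    num_dims = len(rows[0])
    # Process the most discriminating (highest-cardinality) dimensions first.
    order = sorted(range(num_dims), key=lambda d: -len({r[d] for r in rows}))
    result = {}

    def recurse(partition, cell, start):
        result[tuple(cell)] = len(partition)
        for i in range(start, num_dims):
            d = order[i]
            groups = defaultdict(list)
            for r in partition:
                groups[r[d]].append(r)
            for value, group in groups.items():
                # Prune: a partition below min_sup cannot yield a qualifying descendant.
                if len(group) >= min_sup:
                    cell[d] = value
                    recurse(group, cell, i + 1)
            cell[d] = "*"

    if len(rows) >= min_sup:
        recurse(rows, ["*"] * num_dims, 0)
    return result

rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
iceberg = buc_iceberg(rows, min_sup=2)
```

Since count is antimonotonic, a partition that fails the threshold is never expanded, which is where the early use of high-cardinality dimensions pays off.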
The order in which the lattice is traversed is presented in Figure 4. BUC processing order for Exercise 4. Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes where the iceberg condition tests for avg that is no bigger than some value, v.
Instead of using average, we can use the bottom-k average of each cell, which is antimonotonic. To reduce the amount of space required to check the bottom-k average condition, we can store a few statistics such as count and sum for the base tuples that fall between a certain range of v e. This is analogous to the optimization presented in Section 4.

A flight data warehouse for a travel agent consists of six dimensions: Starting with the base cuboid [traveller, departure, departure time, arrival, arrival time, flight], what specific OLAP operations e.
Outline an efficient cube computation method based on common sense about flight data distribution. The OLAP operations are: There are two constraints: Use an iceberg cubing algorithm, such as BUC. Use binning plus min sup to prune the computation of the cube. Implementation project There are four typical data cube computation methods: Find another student who has implemented a different algorithm on the same platform e.
An iceberg condition: Output i. The set of computed cuboids that satisfy the iceberg condition, in the order of your output generation; ii. This is used to quickly check the correctness of your results. What challenging computation problems are encountered as the number of dimensions grows large? How can iceberg cubing solve the problems of part a for some data sets and characterize such data sets? Give one simple example to show that sometimes iceberg cubes cannot provide a good solution.
For example, for a dimensional data cube, we may only compute the 5-dimensional cuboids for every possible 5-dimensional combination. The resulting cuboids form a shell cube. Discuss how easy or hard it is to modify your cube computation algorithm to facilitate such computation.
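A shell cube of this kind can be sketched by enumerating only the group-bys of a fixed dimensionality. The sketch below is an illustration with count as the measure; the function name `shell_cuboids`, the parameter `k`, and the sample rows are assumptions.

```python
from collections import Counter
from itertools import combinations

def shell_cuboids(rows, k):
    """Compute a count cuboid for every combination of exactly k dimensions."""
    num_dims = len(rows[0])
    shell = {}
    for dims in combinations(range(num_dims), k):
        # Project each tuple onto the chosen dimensions and count identical cells.
        shell[dims] = Counter(tuple(r[d] for d in dims) for r in rows)
    return shell

rows = [("a1", "b1", "c1"), ("a1", "b2", "c1"), ("a2", "b1", "c1")]
shell = shell_cuboids(rows, k=2)
```

The number of cuboids computed this way is the binomial coefficient of the dimension count and `k`, rather than the full power set of dimensions.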
This is to be evaluated on an individual basis. The number of cuboids for a cube grows exponentially with the number of dimensions. If the number of dimensions grows large, then huge amounts of memory and time are required to compute all of the cuboids. Iceberg cubes, by eliminating statistically insignificant aggregated cells, can substantially reduce the number of aggregate cells and therefore greatly reduce the computation.
Benefits from iceberg cubing can be maximized if the data sets are sparse but not skewed. This is because in these data sets, there is a relatively low chance that cells will collapse into the same aggregated cell, except for cuboids consisting of a small number of dimensions. Thus, many cells may have values that are less than the threshold and therefore will be pruned. Consider, for example, an OLAP database consisting of dimensions. Let ai,j be the jth value of dimension i.
Assume that there are 10 cells in the base cuboid, all of which aggregate to the cell a1,1 , a2,1 , Let the support threshold be Then, all descendent cells of this cell satisfy the threshold.
In this case, iceberg cubing cannot benefit from the pruning effect. It is easy to modify the algorithms if they adopt a top-down approach. Consider BUC as an example. We can modify the algorithm to generate a shell cube of a specific number of dimension combinations because it proceeds from the apex all cuboid, downward.
The process can be stopped when it reaches the maximum number of dimensions. H-cubing and Star-Cubing can be modified in a similar manner. Consider the following multifeature cube query: Why or why not?
R1 such that R1. For class characterization, what are the major differences between a data cube-based implementation and a relational implementation such as attribute-oriented induction? Discuss which method is most efficient and under what conditions this is so. For class characterization, the major differences between a data cube-based implementation and a relational based implementation such as attribute-oriented induction include the following: Under a data cube-based approach, the process is user-controlled at every step.
This includes the selection of the relevant dimensions to be used as well as the application of OLAP operations such as roll-up, roll-down, slicing and dicing. A relational approach does not require user interaction at every step, however, as attribute relevance and ranking is performed automatically.
The relational approach supports complex data types and measures, which restrictions in current OLAP technology do not allow. Thus, OLAP implementations are limited to a more simplified model for data analysis.
An OLAP-based implementation allows for the precomputation of measures at different levels of aggregation (materialization of subdata cubes), which is not supported under a relational approach.

Based upon these differences, it is clear that a relational approach is more efficient when complex data types and measures are being used, as well as when there is a very large number of attributes to be considered.
This is due to the advantage that automation provides over the efforts that would be required by a user to perform the same tasks. However, when the data set being mined consists of regular data types and measures that are well supported by OLAP technology, then the OLAP-based implementation provides an advantage in efficiency. This results from the time saved by using precomputed measures, as well as the flexibility in investigating mining results provided by OLAP functions.
Suppose that the following table is derived by attribute-oriented induction. See Table 4. A crosstab for birth place of Programmers and DBAs. Discuss why relevance analysis is beneficial and how it can be performed and integrated into the characterization process.
Compare the result of two induction methods: Incremental class comparison. Data-cube-based incremental algorithm for mining class comparisons with dimension relevance analysis. P, a Prime generalized relation used to build the data cube. The method is outlined as follows.
To build the initial data cube for mining: The incremental part of the data is identified to produce a target class and contrasting class es from the set of task relevant data to generate the initial working relations. This is performed on the initial working relation for the target class in order to determine which attributes should be retained attribute relevance.
An attribute will have to be added that indicates the class of the data entry. The desired level of generalization is determined to form prime target class and prime contrasting class cuboid(s). This generalization will be synchronous between all of the classes, as the contrasting class relation(s) will be generalized to the same level.
To process revisions to the relevant data set and thus make the algorithm incremental, perform the following. The data cube is not rebuilt from scratch for all of the relevant data. Instead, only the changes to the relevant data are processed and added to the prime relation held in the data cube.
Figure 4. A data-cube-based algorithm for incremental class comparison. Outline an incremental updating procedure for applying the necessary deletions to R. Outline a data cube-based incremental algorithm for mining class comparisons. A data-cube-based algorithm for incremental class comparison is given in Figure 4.
The Apriori algorithm uses prior knowledge of subset support properties. Prove that any itemset that is frequent in D must be frequent in at least one partition of D. Let s be a frequent itemset.
Let min_sup be the minimum support. Let D be the task-relevant data, a set of database transactions. Let |D| be the number of transactions in D. Let s' be any nonempty subset of s.
Any transaction containing itemset s will also contain itemset s'. Therefore, support_count(s') ≥ support_count(s) ≥ min_sup × |D|. Thus, s' is also a frequent itemset. This also proves that the support of any nonempty subset s' of itemset s must be at least as great as the support of s.
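The subset argument above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical four-transaction database and a helper named `support_count` that are not part of the original text.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)

# Hypothetical toy database D (illustrative only, not from the text).
D = [frozenset(t) for t in (
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "jam"},
    {"milk"},
)]

s = frozenset({"bread", "milk"})
count_s = support_count(s, D)

# Every transaction that contains s also contains each nonempty subset of s,
# so no subset can have a smaller support count than s itself.
subset_counts = {
    frozenset(c): support_count(c, D)
    for r in range(1, len(s) + 1)
    for c in combinations(s, r)
}
assert all(n >= count_s for n in subset_counts.values())
```

If `s` meets the minimum support count, the assertion shows that each of its nonempty subsets meets it as well.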
Any itemset that is frequent in D must be frequent in at least one partition of D. Proof by contradiction: Assume that the itemset is not frequent in any of the partitions of D. Let F be any frequent itemset. Let C be the total number of transactions in D. Let A be the total number of transactions in D containing the itemset F. Let us partition D into n nonoverlapping partitions, d1, d2, ..., dn. Let ci be the number of transactions in partition di and ai the number of those transactions that contain F, so that C = c1 + c2 + ... + cn and A = a1 + a2 + ... + an. Because of the assumption at the start of the proof, we know that F is not frequent in any of the partitions d1, d2, ..., dn, that is, ai < min_sup × ci for every i. Summing these inequalities over all partitions gives A < min_sup × C, so F is not frequent in D.
But this is a contradiction since F was defined as a frequent itemset at the beginning of the proof. This proves that any itemset that is frequent in D must be frequent in at least one partition of D.
Section 5. Propose a more efficient method. Explain why it is more efficient than the one proposed in Section 5. Consider incorporating the properties of Exercise 5. An algorithm for generating strong rules from frequent itemsets is given in Figure 5. It is more efficient than the method proposed in Section 5. If a subset x of length k does not meet the minimum confidence, then there is no point in generating any of its nonempty subsets, because their respective confidences will never be greater than the confidence of x; see Exercise 5.
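The pruning idea described above can be sketched as follows. This is an illustrative sketch rather than the algorithm from the figure: the function name `generate_rules`, the `support` dictionary of support counts, and the toy numbers are all assumptions made for the example.

```python
from itertools import combinations

def generate_rules(itemset, support, min_conf):
    """Return (antecedent, consequent, confidence) for the strong rules of one
    frequent itemset, skipping subsets of antecedents that already failed."""
    itemset = frozenset(itemset)
    rules = []
    # Start with the largest proper subsets as candidate antecedents.
    level = {frozenset(c) for c in combinations(itemset, len(itemset) - 1)}
    while level:
        next_level = set()
        for antecedent in level:
            if not antecedent:
                continue  # the empty antecedent is not a rule
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, itemset - antecedent, conf))
                # Only a passing antecedent has its subsets explored further.
                if len(antecedent) > 1:
                    next_level.update(
                        frozenset(c)
                        for c in combinations(antecedent, len(antecedent) - 1)
                    )
        level = next_level
    return rules

# Hypothetical support counts (illustrative only).
support = {
    frozenset("ABC"): 2,
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("A"): 4, frozenset("B"): 6, frozenset("C"): 8,
}
rules = generate_rules("ABC", support, min_conf=0.7)
```

In this toy run the antecedents {A, C} and {B, C} fail the confidence test, so their subsets are reached only through {A, B}, and the antecedent {C} is never tested at all.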
The method in Section 5. This is inefficient because it may generate and test many unnecessary subsets i. Consider the following worst-case scenario: The method of Section 5. A database has five transactions. Rule Generator. Given a set of frequent itemsets, output all of its strong rules. Strong rules of itemsets in l. An algorithm for generating strong rules from frequent itemsets. Compare the efficiency of the two mining processes.
See Figure 5.
FP-tree for Exercise 5. Apriori has to do multiple scans of the database, while FP-growth needs only two: one to count item frequencies and one to build the FP-tree. Candidate generation in Apriori is expensive (owing to the self-join), while FP-growth does not generate any candidates. Implementation project: Implement three frequent itemset mining algorithms introduced in this chapter: Compare the performance of each algorithm with various kinds of large data sets.
Write a report analyzing the situations (such as data size, data distribution, minimal support threshold setting, and pattern density) in which one algorithm may perform better than the others, and state why. This is to be evaluated on an individual basis, as there is no standard answer. A database has four transactions. Suppose that a large store has a transaction database that is distributed among four locations. Transactions in each component database have the same format, namely Tj: Propose an efficient algorithm to mine global association rules without considering multilevel associations.
You may present your algorithm in the form of an outline. Your algorithm should not require shipping all of the data to one site and should not cause excessive network communication overhead.
An algorithm to mine global association rules is as follows. First, find the local frequent itemsets in each store. Let CF be the union of all of the local frequent itemsets in the four stores. Next, determine the global support of each itemset in CF. This can be done by summing up, for each itemset, the local support of that itemset in the four stores. Doing this for each itemset in CF will give us their global supports.
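The local-then-global counting just described can be sketched in a few lines. This is a minimal sketch under stated assumptions: `count_itemsets` and `mine_global` are hypothetical names, the brute-force local miner stands in for Apriori or FP-growth, and the four toy stores are invented for illustration. Only itemsets and counts cross site boundaries, never the transactions themselves.

```python
from collections import Counter
from itertools import combinations

def count_itemsets(transactions, candidates=None):
    """Count support for `candidates`, or for every itemset that occurs in the
    data when `candidates` is None (brute force, suitable for toy data only)."""
    counts = Counter()
    for t in transactions:
        if candidates is None:
            items = sorted(t)
            for r in range(1, len(items) + 1):
                counts.update(frozenset(c) for c in combinations(items, r))
        else:
            counts.update(c for c in candidates if c <= t)
    return counts

def mine_global(sites, min_sup):
    """`sites` is a list of local databases; `min_sup` is a relative threshold."""
    # Each site mines its own local frequent itemsets.
    local_frequent = [
        {s for s, n in count_itemsets(db).items() if n >= min_sup * len(db)}
        for db in sites
    ]
    # CF is the union of all local frequent itemsets.
    cf = set().union(*local_frequent)
    # Each site reports its local counts for CF only; the counts are summed.
    global_counts = Counter()
    for db in sites:
        global_counts.update(count_itemsets(db, cf))
    total = sum(len(db) for db in sites)
    return {s: n for s, n in global_counts.items() if n >= min_sup * total}

# Hypothetical data for four stores (illustrative only).
stores = [
    [frozenset("ab"), frozenset("ab"), frozenset("c")],
    [frozenset("a"), frozenset("b"), frozenset("ac")],
    [frozenset("ab"), frozenset("c")],
    [frozenset("b"), frozenset("bc")],
]
result = mine_global(stores, min_sup=0.5)
```

Restricting the second counting pass to CF loses nothing, because an itemset that is frequent in the whole database must be frequent in at least one of its partitions.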
Itemsets whose global supports pass the support threshold are global frequent itemsets. Suppose that frequent itemsets are saved for a large transaction database, DB. However, multiple occurrences of an item in the same shopping basket, such as four cakes and three jugs of milk, can be important in transaction data analysis. How can one mine frequent itemsets efficiently considering multiple occurrences of items?
Propose modifications to the well-known algorithms, such as Apriori and FP-growth, to adapt to such a situation. Consider an item and its occurrence count as a combined item in a transaction. For example, we can consider jug, 3 as one item. For instance, jug, 3 may be a frequent item. For i, max count , try to find k-itemsets for count from 1 to max count. This can be done either by Apriori or FP-growth.
In FP-growth, one can create a node for each (i, count) combination; however, for efficient implementation, such nodes can be combined into one using combined counters i. Compare their performance with various kinds of large data sets.
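The combined-item encoding described above can be illustrated with a short sketch. It assumes one particular reading that the text does not spell out: a basket holding n units of item i is taken to support the combined item (i, c) for every c from 1 to n. The function names and the toy baskets are likewise hypothetical.

```python
from collections import Counter

def expand(basket):
    """Turn {item: quantity} into combined items (item, c) for c = 1..quantity.
    Assumption: n units of an item also count as 'at least c' for every c <= n."""
    return frozenset(
        (item, c) for item, qty in basket.items() for c in range(1, qty + 1)
    )

def frequent_combined_items(baskets, min_count):
    """First scan: keep the combined items whose support count reaches min_count."""
    counts = Counter()
    for basket in baskets:
        counts.update(expand(basket))
    return {ci: n for ci, n in counts.items() if n >= min_count}

def max_counts(frequent):
    """Highest count per item for which (item, count) is still frequent."""
    best = {}
    for item, c in frequent:
        best[item] = max(best.get(item, 0), c)
    return best

# Hypothetical baskets (illustrative only).
baskets = [
    {"cake": 4, "milk": 3},
    {"cake": 2, "milk": 3},
    {"milk": 1},
    {"cake": 4},
]
frequent = frequent_combined_items(baskets, min_count=3)
limits = max_counts(frequent)
```

The expanded baskets can then be passed to an ordinary Apriori or FP-growth implementation, with each item's count range bounded by the limit found in the first scan.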