Handbook of Big Data Analytics
Wolfgang Karl Härdle, Henry Horng-Shing Lu, Xiaotong Shen
Publisher: Springer-Verlag, 2018
ISBN 9783319182841, 532 pages
Format: PDF, OL
Copy protection: watermark
Preface
Contents
Part I Overview
1 Statistics, Statisticians, and the Internet of Things
1.1 Introduction
1.1.1 The Internet of Things
1.1.2 What Is Big Data in an Internet of Things?
1.1.3 Building Blocks
1.1.4 Ubiquity
1.1.5 Consumer Applications
1.1.6 The Internets of [Infrastructure] Things
1.1.7 Industrial Scenarios
1.2 What Kinds of Statistics Are Needed for Big IoT Data?
1.2.1 Coping with Complexity
1.2.2 Privacy
1.2.3 Traditional Statistics Versus the IoT
1.2.4 A View of the Future of Statistics in an IoT World
1.3 Big Data in the Real World
1.3.1 Skills
1.3.2 Politics
1.3.3 Technique
1.3.4 Traditional Databases
1.3.5 Cognition
1.4 Conclusion
2 Cognitive Data Analysis for Big Data
2.1 Introduction
2.1.1 Big Data
2.1.2 Defining Cognitive Data Analysis
2.1.3 Stages of CDA
2.2 Data Preparation
2.2.1 Natural Language Query
2.2.2 Data Integration
2.2.3 Metadata Discovery
2.2.4 Data Quality Verification
2.2.5 Data Type Detection
2.2.6 Data Lineage
2.3 Automated Modeling
2.3.1 Descriptive Analytics
2.3.2 Predictive Analytics
2.3.3 Starting Points
2.3.4 System Recommendations
2.4 Application of Results
2.4.1 Gaining Insights
2.4.2 Sharing and Collaborating
2.4.3 Deployment
2.5 Use Case
2.6 Conclusion
References
Part II Methodology
3 Statistical Leveraging Methods in Big Data
3.1 Background
3.2 Leveraging Approximation for Least Squares Estimator
3.2.1 Leveraging for Least Squares Approximation
3.2.2 A Matrix Approximation Perspective
3.2.3 The Computation of Leveraging Scores
3.2.4 An Innovative Proposal: Predictor-Length Method
3.2.5 More on Modeling
3.2.6 Statistical Leveraging Algorithms in the Literature: A Summary
3.3 Statistical Properties of Leveraging Estimator
3.3.1 Weighted Leveraging Estimator
3.3.2 Unweighted Leveraging Estimator
3.4 Simulation Study
3.4.1 UNIF and BLEV
3.4.2 BLEV and LEVUNW
3.4.3 BLEV and SLEV
3.4.4 BLEV and PL
3.4.5 SLEV and PL
3.5 Real Data Analysis
3.6 Beyond Linear Regression
3.6.1 Logistic Regression
3.6.2 Time Series Analysis
3.7 Discussion and Conclusion
References
4 Scattered Data and Aggregated Inference
4.1 Introduction
4.2 Problem Formulation
4.2.1 Notations
4.2.2 Review on M-Estimators
4.2.3 Simple Averaging Estimator
4.2.4 One-Step Estimator
4.3 Main Results
4.3.1 Assumptions
4.3.2 Asymptotic Properties and Mean Squared Errors (MSE) Bounds
4.3.3 Under the Presence of Communication Failure
4.4 Numerical Examples
4.4.1 Logistic Regression
4.4.2 Beta Distribution
4.4.3 Beta Distribution with Possibility of Losing Information
4.4.4 Gaussian Distribution with Unknown Mean and Variance
4.5 Discussion on Distributed Statistical Inference
4.6 Other Problems
4.7 Conclusion
References
5 Nonparametric Methods for Big Data Analytics
5.1 Introduction
5.2 Classical Methods for Nonparametric Regression
5.2.1 Additive Models
5.2.2 Generalized Additive Models (GAM)
5.2.3 Smoothing Spline ANOVA (SS-ANOVA)
5.3 High Dimensional Additive Models
5.3.1 COSSO Method
5.3.2 Adaptive COSSO
5.3.3 Linear and Nonlinear Discoverer (LAND)
5.3.4 Adaptive Group LASSO
5.3.5 Sparse Additive Models (SpAM)
5.3.6 Sparsity-Smoothness Penalty
5.4 Nonparametric Independence Screening (NIS)
References
6 Finding Patterns in Time Series
6.1 Introduction
6.1.1 Regime Descriptors: Local Models
6.1.2 Changepoints
6.1.3 Patterns
6.1.4 Clustering, Classification, and Prediction
6.1.5 Measures of Similarity/Dissimilarity
6.1.6 Outline
6.2 Data Reduction and Changepoints
6.2.1 Piecewise Constant Models
6.2.2 Models with Changing Scales
6.2.3 Trends
6.3 Model Building
6.3.1 Batch Methods
6.3.2 Online Methods
6.4 Model Building: Alternating Trends Smoothing
6.4.1 The Tuning Parameter
6.4.2 Modifications and Extensions
6.5 Bounding Lines
6.6 Patterns
6.6.1 Time Scaling and Junk
6.6.2 Further Data Reduction: Symbolic Representation
6.6.3 Symbolic Trend Patterns (STP)
6.6.4 Patterns in Bounding Lines
6.6.5 Clustering and Classification of Time Series
References
7 Variational Bayes for Hierarchical Mixture Models
7.1 Introduction
7.2 Variational Bayes
7.2.1 Overview of the VB Method
7.2.2 Practicality
7.2.3 Over-Confidence
7.2.4 Simple Two-Component Mixture Model
7.2.5 Marginal Posterior Approximation
7.3 VB for a General Finite Mixture Model
7.3.1 Motivation
7.3.2 The B-LIMMA Model
7.4 Numerical Illustrations
7.4.1 Simulation
7.4.1.1 The B-LIMMA Model
7.4.1.2 A Mixture Model Extended from the LIMMA Model
7.4.1.3 A Mixture Model for Count Data
7.4.2 Real Data Examples
7.4.2.1 APOA1 Data
7.4.2.2 Colon Cancer Data
7.5 Discussion
Appendix: The VB-LEMMA Algorithm
The B-LEMMA Model
Algorithm
The VB-Proteomics Algorithm
The Proteomics Model
Algorithm
References
8 Hypothesis Testing for High-Dimensional Data
8.1 Introduction
8.2 Applications
8.2.1 Testing of Covariance Matrices
8.2.2 Testing of Independence
8.2.3 Analysis of Variance
8.3 Tests Based on L∞ Norms
8.4 Tests Based on L2 Norms
8.5 Asymptotic Theory
8.5.1 Preamble: i.i.d. Gaussian Data
8.5.2 Rademacher Weighted Differencing
8.5.3 Calculating the Power
8.5.4 An Algorithm with General Testing Functionals
8.6 Numerical Experiments
8.6.1 Test of Mean Vectors
8.6.2 Test of Covariance Matrices
8.6.2.1 Sizes Accuracy
8.6.2.2 Power Curve
8.6.3 A Real Data Application
References
9 High-Dimensional Classification
9.1 Introduction
9.2 LDA, Logistic Regression, and SVMs
9.2.1 LDA
9.2.2 Logistic Regression
9.2.3 The Support Vector Machine
9.3 Lasso and Elastic-Net Penalized SVMs
9.3.1 The ℓ1 SVM
9.3.2 The DrSVM
9.4 Lasso and Elastic-Net Penalized Logistic Regression
9.5 Huberized SVMs
9.6 Concave Penalized Margin-Based Classifiers
9.7 Sparse Discriminant Analysis
9.7.1 Independent Rules
9.7.2 Linear Programming Discriminant Analysis
9.7.3 Direct Sparse Discriminant Analysis
9.8 Sparse Semiparametric Discriminant Analysis
9.9 Sparse Penalized Additive Models for Classification
References
10 Analysis of High-Dimensional Regression Models Using Orthogonal Greedy Algorithms
10.1 Introduction
10.2 Convergence Rates of OGA
10.2.1 Random Regressors
10.2.2 The Fixed Design Case
10.3 The Performance of OGA Under General Sparse Conditions
10.3.1 Rates of Convergence
10.3.2 Comparative Studies
10.4 The Performance of OGA in High-Dimensional Time Series Models
References
11 Semi-supervised Smoothing for Large Data Problems
11.1 Introduction
11.2 Semi-supervised Local Kernel Regression
11.2.1 Supervised Kernel Regression
11.2.2 Semi-supervised Kernel Regression with a Latent Response
11.2.3 Adaptive Semi-supervised Kernel Regression
11.2.4 Computational Issues for Large Data
11.3 Optimization Frameworks for Semi-supervised Learning
References
12 Inverse Modeling: A Strategy to Cope with Non-linearity
12.1 Introduction
12.2 SDR and Inverse Modeling
12.2.1 From SIR to PFC
12.2.2 Revisit SDR from an Inverse Modeling Perspective
12.3 Variable Selection
12.3.1 Beyond Sufficient Dimension Reduction: The Necessity of Variable Selection
12.3.2 SIR as a Transformation-Projection Pursuit Problem
12.3.3 COP: Correlation Pursuit
12.3.4 From COP to SIRI
12.3.5 Simulation Study for Variable Selection and SDR Estimation
12.4 Nonparametric Dependence Screening
12.5 Conclusion
References
13 Sufficient Dimension Reduction for Tensor Data
13.1 Curse of Dimensionality
13.2 Sufficient Dimension Reduction
13.3 Tensor Sufficient Dimension Reduction
13.3.1 Tensor Sufficient Dimension Reduction Model
13.3.2 Estimate a Single Direction
13.4 Simulation Studies
13.5 Example
13.6 Discussion
References
14 Compressive Sensing and Sparse Coding
14.1 Leveraging the Sparsity Assumption for Signal Recovery
14.2 From Combinatorial to Convex Optimization
14.3 Dealing with Noisy Measurements
14.4 Other Common Forms and Variations
14.5 The Theory Behind
14.5.1 The Restricted Isometry Property
14.5.2 Guaranteed Signal Recovery
14.5.3 Random Matrix is Good Enough
14.6 Compressive Sensing in Practice
14.6.1 Solving the Compressive Sensing Problem
14.6.2 Sparsifying Basis
14.6.3 Sensing Matrix
14.7 Sparse Coding Overview
14.7.1 Compressive Sensing and Sparse Coding
14.7.1.1 Compressed Domain Feature Extraction
14.7.1.2 Compressed Domain Classification
14.8 Compressive Sensing Extensions
14.8.1 Reconstruction with Additional Information
14.8.2 Compressive Sensing with Distorted Measurements
References
15 Bridging Density Functional Theory and Big Data Analytics with Applications
15.1 Introduction
15.2 Structure of Data Functionals Defined in the DFT Perspectives
15.3 Determinations of Number of Data Groups and the Corresponding Data Boundaries
15.4 Physical Phenomena of the Mixed Data Groups
15.4.1 Physical Structure of the DFT-Based Algorithm
15.4.2 Typical Problem of the Data Clustering: The Fisher's Iris
15.4.3 Tentative Experiments on Dataset of MRI with Brain Tumors
15.5 Conclusion
References
Part III Software
16 Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing
16.1 Introduction: From Data to Information
16.1.1 Transparency, Collaboration, and Reproducibility
16.2 Related Work
16.3 Q3-D3 Genesis
16.4 Vector Space Representations
16.4.1 Text to Vector
16.4.2 Weighting Scheme, Similarity, Distance
16.4.3 Shakespeare's Tragedies
16.4.4 Generalized VSM (GVSM)
16.4.4.1 Basic VSM (BVSM)
16.4.4.2 GVSM: Term–Term Correlations
16.4.4.3 GVSM: Latent Semantic Analysis (LSA)
16.4.4.4 Closer Look at the LSA Implementation
16.4.4.5 GVSM Applicability for Big Data
16.5 Methods
16.5.1 Cluster Analysis
16.5.1.1 Partitional Clustering
16.5.1.2 Hierarchical Clustering
16.5.2 Cluster Validation Measures
16.5.2.1 Connectivity
16.5.2.2 Silhouette
16.5.2.3 Dunn Index
16.5.3 Visual Cluster Validation
16.6 Results
16.6.1 Text Preprocessing Results
16.6.2 Sparsity Results
16.6.3 Three Models, Three Methods, Three Measures
16.6.4 LSA Anatomy
16.7 Application
16.8 Outlook
16.8.1 GitHub Mining Infrastructure in R
16.8.2 Future Developments
Appendix
References
17 A Tutorial on Libra: R Package for the Linearized Bregman Algorithm in High-Dimensional Statistics
17.1 Introduction to Libra
17.2 Linear Model
17.2.1 Example: Simulation Data
17.2.2 Example: Diabetes Data
17.3 Logistic Model
17.3.1 Binomial Logistic Model
17.3.1.1 Example: Publications of COPSS Award Winners
17.3.1.2 Example: Journey to the West
17.3.2 Multinomial Logistic Model
17.4 Graphical Model
17.4.1 Gaussian Graphical Model
17.4.1.1 Example: Journey to the West
17.4.2 Ising Model
17.4.2.1 Example: Simulation Data
17.4.2.2 Example: Journey to the West
17.4.2.3 Example: Dream of the Red Chamber
17.4.3 Potts Model
17.5 Discussion
References
Part IV Application
18 Functional Data Analysis for Big Data: A Case Study on California Temperature Trends
18.1 Introduction
18.2 Basic Statistics for Functional Data
18.3 Dimension Reduction for Functional Data
18.4 Functional Principal Component Analysis
18.4.1 Smoothing and Interpolation
18.4.2 Sample Size Considerations
18.5 Functional Variance Process
18.6 Functional Data Analysis for Temperature Trends
18.7 Conclusions
References
19 Bayesian Spatiotemporal Modeling for Detecting Neuronal Activation via Functional Magnetic Resonance Imaging
19.1 Introduction
19.1.1 Emotion Processing Data
19.2 Variable Selection in Bayesian Spatiotemporal Models
19.2.1 Bezener et al.'s (2015) Areal Model
19.2.1.1 Posterior Distribution and MCMC Algorithm
19.2.1.2 Starting Values
19.2.1.3 Emotion Processing Data
19.2.2 Musgrove et al.'s (2015) Areal Model
19.2.2.1 Partitioning the Image
19.2.2.2 Spatial Bayesian Variable Selection with Temporal Correlation
19.2.2.3 Sparse SGLMM Prior
19.2.2.4 Posterior Computation and Inference
19.2.2.5 Emotion Processing Data
19.2.3 Activation Maps for Emotion Processing Data
19.3 Discussion
References
20 Construction of Tight Frames on Graphs and Application to Denoising
20.1 Introduction
20.1.1 Motivation
20.1.2 Relation to Previous Work
20.2 Notation and Basics
20.2.1 Setting
20.2.2 Frames
20.2.3 Neighborhood Graphs
20.2.4 Spectral Graph Theory
20.3 Construction and Properties
20.3.1 Construction of a Tight Graph Frame
20.3.2 Spatial Localization
20.4 Denoising
20.5 Numerical Experiments
20.6 Outlook
Appendix
Proof of Theorem 3
References
21 Beta-Boosted Ensemble for Big Credit Scoring Data
21.1 Introduction
21.2 Method Description
21.2.1 Beta Binomial Distribution
21.2.2 Beta-Boosted Ensemble Model
21.2.3 Toy Example
21.2.4 Relation to Existing Solutions
21.3 Experiments
21.4 Conclusion and Future Work
References