Handbook of Big Data Analytics

Wolfgang Karl Härdle, Henry Horng-Shing Lu, Xiaotong Shen

Publisher: Springer-Verlag, 2018
ISBN: 9783319182841, 532 pages
Format: PDF, OL
Copy protection: watermark
Price: 341.33 EUR

Table of Contents

Preface
Contents

Part I Overview

1 Statistics, Statisticians, and the Internet of Things
1.1 Introduction
1.1.1 The Internet of Things
1.1.2 What Is Big Data in an Internet of Things?
1.1.3 Building Blocks
1.1.4 Ubiquity
1.1.5 Consumer Applications
1.1.6 The Internets of [Infrastructure] Things
1.1.7 Industrial Scenarios
1.2 What Kinds of Statistics Are Needed for Big IoT Data?
1.2.1 Coping with Complexity
1.2.2 Privacy
1.2.3 Traditional Statistics Versus the IoT
1.2.4 A View of the Future of Statistics in an IoT World
1.3 Big Data in the Real World
1.3.1 Skills
1.3.2 Politics
1.3.3 Technique
1.3.4 Traditional Databases
1.3.5 Cognition
1.4 Conclusion

2 Cognitive Data Analysis for Big Data
2.1 Introduction
2.1.1 Big Data
2.1.2 Defining Cognitive Data Analysis
2.1.3 Stages of CDA
2.2 Data Preparation
2.2.1 Natural Language Query
2.2.2 Data Integration
2.2.3 Metadata Discovery
2.2.4 Data Quality Verification
2.2.5 Data Type Detection
2.2.6 Data Lineage
2.3 Automated Modeling
2.3.1 Descriptive Analytics
2.3.2 Predictive Analytics
2.3.3 Starting Points
2.3.4 System Recommendations
2.4 Application of Results
2.4.1 Gaining Insights
2.4.2 Sharing and Collaborating
2.4.3 Deployment
2.5 Use Case
2.6 Conclusion
References

Part II Methodology

3 Statistical Leveraging Methods in Big Data
3.1 Background
3.2 Leveraging Approximation for Least Squares Estimator
3.2.1 Leveraging for Least Squares Approximation
3.2.2 A Matrix Approximation Perspective
3.2.3 The Computation of Leveraging Scores
3.2.4 An Innovative Proposal: Predictor-Length Method
3.2.5 More on Modeling
3.2.6 Statistical Leveraging Algorithms in the Literature: A Summary
3.3 Statistical Properties of Leveraging Estimator
3.3.1 Weighted Leveraging Estimator
3.3.2 Unweighted Leveraging Estimator
3.4 Simulation Study
3.4.1 UNIF and BLEV
3.4.2 BLEV and LEVUNW
3.4.3 BLEV and SLEV
3.4.4 BLEV and PL
3.4.5 SLEV and PL
3.5 Real Data Analysis
3.6 Beyond Linear Regression
3.6.1 Logistic Regression
3.6.2 Time Series Analysis
3.7 Discussion and Conclusion
References

4 Scattered Data and Aggregated Inference
4.1 Introduction
4.2 Problem Formulation
4.2.1 Notations
4.2.2 Review on M-Estimators
4.2.3 Simple Averaging Estimator
4.2.4 One-Step Estimator
4.3 Main Results
4.3.1 Assumptions
4.3.2 Asymptotic Properties and Mean Squared Errors (MSE) Bounds
4.3.3 Under the Presence of Communication Failure
4.4 Numerical Examples
4.4.1 Logistic Regression
4.4.2 Beta Distribution
4.4.3 Beta Distribution with Possibility of Losing Information
4.4.4 Gaussian Distribution with Unknown Mean and Variance
4.5 Discussion on Distributed Statistical Inference
4.6 Other Problems
4.7 Conclusion
References

5 Nonparametric Methods for Big Data Analytics
5.1 Introduction
5.2 Classical Methods for Nonparametric Regression
5.2.1 Additive Models
5.2.2 Generalized Additive Models (GAM)
5.2.3 Smoothing Spline ANOVA (SS-ANOVA)
5.3 High Dimensional Additive Models
5.3.1 COSSO Method
5.3.2 Adaptive COSSO
5.3.3 Linear and Nonlinear Discoverer (LAND)
5.3.4 Adaptive Group LASSO
5.3.5 Sparse Additive Models (SpAM)
5.3.6 Sparsity-Smoothness Penalty
5.4 Nonparametric Independence Screening (NIS)
References

6 Finding Patterns in Time Series
6.1 Introduction
6.1.1 Regime Descriptors: Local Models
6.1.2 Changepoints
6.1.3 Patterns
6.1.4 Clustering, Classification, and Prediction
6.1.5 Measures of Similarity/Dissimilarity
6.1.6 Outline
6.2 Data Reduction and Changepoints
6.2.1 Piecewise Constant Models
6.2.2 Models with Changing Scales
6.2.3 Trends
6.3 Model Building
6.3.1 Batch Methods
6.3.2 Online Methods
6.4 Model Building: Alternating Trends Smoothing
6.4.1 The Tuning Parameter
6.4.2 Modifications and Extensions
6.5 Bounding Lines
6.6 Patterns
6.6.1 Time Scaling and Junk
6.6.2 Further Data Reduction: Symbolic Representation
6.6.3 Symbolic Trend Patterns (STP)
6.6.4 Patterns in Bounding Lines
6.6.5 Clustering and Classification of Time Series
References

7 Variational Bayes for Hierarchical Mixture Models
7.1 Introduction
7.2 Variational Bayes
7.2.1 Overview of the VB Method
7.2.2 Practicality
7.2.3 Over-Confidence
7.2.4 Simple Two-Component Mixture Model
7.2.5 Marginal Posterior Approximation
7.3 VB for a General Finite Mixture Model
7.3.1 Motivation
7.3.2 The B-LIMMA Model
7.4 Numerical Illustrations
7.4.1 Simulation
7.4.1.1 The B-LIMMA Model
7.4.1.2 A Mixture Model Extended from the LIMMA Model
7.4.1.3 A Mixture Model for Count Data
7.4.2 Real Data Examples
7.4.2.1 APOA1 Data
7.4.2.2 Colon Cancer Data
7.5 Discussion
Appendix: The VB-LEMMA Algorithm
The B-LEMMA Model
Algorithm
The VB-Proteomics Algorithm
The Proteomics Model
Algorithm
References

8 Hypothesis Testing for High-Dimensional Data
8.1 Introduction
8.2 Applications
8.2.1 Testing of Covariance Matrices
8.2.2 Testing of Independence
8.2.3 Analysis of Variance
8.3 Tests Based on L∞ Norms
8.4 Tests Based on L2 Norms
8.5 Asymptotic Theory
8.5.1 Preamble: i.i.d. Gaussian Data
8.5.2 Rademacher Weighted Differencing
8.5.3 Calculating the Power
8.5.4 An Algorithm with General Testing Functionals
8.6 Numerical Experiments
8.6.1 Test of Mean Vectors
8.6.2 Test of Covariance Matrices
8.6.2.1 Sizes Accuracy
8.6.2.2 Power Curve
8.6.3 A Real Data Application
References

9 High-Dimensional Classification
9.1 Introduction
9.2 LDA, Logistic Regression, and SVMs
9.2.1 LDA
9.2.2 Logistic Regression
9.2.3 The Support Vector Machine
9.3 Lasso and Elastic-Net Penalized SVMs
9.3.1 The ℓ1 SVM
9.3.2 The DrSVM
9.4 Lasso and Elastic-Net Penalized Logistic Regression
9.5 Huberized SVMs
9.6 Concave Penalized Margin-Based Classifiers
9.7 Sparse Discriminant Analysis
9.7.1 Independent Rules
9.7.2 Linear Programming Discriminant Analysis
9.7.3 Direct Sparse Discriminant Analysis
9.8 Sparse Semiparametric Discriminant Analysis
9.9 Sparse Penalized Additive Models for Classification
References

10 Analysis of High-Dimensional Regression Models Using Orthogonal Greedy Algorithms
10.1 Introduction
10.2 Convergence Rates of OGA
10.2.1 Random Regressors
10.2.2 The Fixed Design Case
10.3 The Performance of OGA Under General Sparse Conditions
10.3.1 Rates of Convergence
10.3.2 Comparative Studies
10.4 The Performance of OGA in High-Dimensional Time Series Models
References

11 Semi-supervised Smoothing for Large Data Problems
11.1 Introduction
11.2 Semi-supervised Local Kernel Regression
11.2.1 Supervised Kernel Regression
11.2.2 Semi-supervised Kernel Regression with a Latent Response
11.2.3 Adaptive Semi-supervised Kernel Regression
11.2.4 Computational Issues for Large Data
11.3 Optimization Frameworks for Semi-supervised Learning
References

12 Inverse Modeling: A Strategy to Cope with Non-linearity
12.1 Introduction
12.2 SDR and Inverse Modeling
12.2.1 From SIR to PFC
12.2.2 Revisit SDR from an Inverse Modeling Perspective
12.3 Variable Selection
12.3.1 Beyond Sufficient Dimension Reduction: The Necessity of Variable Selection
12.3.2 SIR as a Transformation-Projection Pursuit Problem
12.3.3 COP: Correlation Pursuit
12.3.4 From COP to SIRI
12.3.5 Simulation Study for Variable Selection and SDR Estimation
12.4 Nonparametric Dependence Screening
12.5 Conclusion
References

13 Sufficient Dimension Reduction for Tensor Data
13.1 Curse of Dimensionality
13.2 Sufficient Dimension Reduction
13.3 Tensor Sufficient Dimension Reduction
13.3.1 Tensor Sufficient Dimension Reduction Model
13.3.2 Estimate a Single Direction
13.4 Simulation Studies
13.5 Example
13.6 Discussion
References

14 Compressive Sensing and Sparse Coding
14.1 Leveraging the Sparsity Assumption for Signal Recovery
14.2 From Combinatorial to Convex Optimization
14.3 Dealing with Noisy Measurements
14.4 Other Common Forms and Variations
14.5 The Theory Behind
14.5.1 The Restricted Isometry Property
14.5.2 Guaranteed Signal Recovery
14.5.3 Random Matrix is Good Enough
14.6 Compressive Sensing in Practice
14.6.1 Solving the Compressive Sensing Problem
14.6.2 Sparsifying Basis
14.6.3 Sensing Matrix
14.7 Sparse Coding Overview
14.7.1 Compressive Sensing and Sparse Coding
14.7.1.1 Compressed Domain Feature Extraction
14.7.1.2 Compressed Domain Classification
14.8 Compressive Sensing Extensions
14.8.1 Reconstruction with Additional Information
14.8.2 Compressive Sensing with Distorted Measurements
References

15 Bridging Density Functional Theory and Big Data Analytics with Applications
15.1 Introduction
15.2 Structure of Data Functionals Defined in the DFT Perspectives
15.3 Determinations of Number of Data Groups and the Corresponding Data Boundaries
15.4 Physical Phenomena of the Mixed Data Groups
15.4.1 Physical Structure of the DFT-Based Algorithm
15.4.2 Typical Problem of the Data Clustering: The Fisher's Iris
15.4.3 Tentative Experiments on Dataset of MRI with Brain Tumors
15.5 Conclusion
References

Part III Software

16 Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing
16.1 Introduction: From Data to Information
16.1.1 Transparency, Collaboration, and Reproducibility
16.2 Related Work
16.3 Q3-D3 Genesis
16.4 Vector Space Representations
16.4.1 Text to Vector
16.4.2 Weighting Scheme, Similarity, Distance
16.4.3 Shakespeare's Tragedies
16.4.4 Generalized VSM (GVSM)
16.4.4.1 Basic VSM (BVSM)
16.4.4.2 GVSM: Term–Term Correlations
16.4.4.3 GVSM: Latent Semantic Analysis (LSA)
16.4.4.4 Closer Look at the LSA Implementation
16.4.4.5 GVSM Applicability for Big Data
16.5 Methods
16.5.1 Cluster Analysis
16.5.1.1 Partitional Clustering
16.5.1.2 Hierarchical Clustering
16.5.2 Cluster Validation Measures
16.5.2.1 Connectivity
16.5.2.2 Silhouette
16.5.2.3 Dunn Index
16.5.3 Visual Cluster Validation
16.6 Results
16.6.1 Text Preprocessing Results
16.6.2 Sparsity Results
16.6.3 Three Models, Three Methods, Three Measures
16.6.4 LSA Anatomy
16.7 Application
16.8 Outlook
16.8.1 GitHub Mining Infrastructure in R
16.8.2 Future Developments
Appendix
References

17 A Tutorial on Libra: R Package for the Linearized Bregman Algorithm in High-Dimensional Statistics
17.1 Introduction to Libra
17.2 Linear Model
17.2.1 Example: Simulation Data
17.2.2 Example: Diabetes Data
17.3 Logistic Model
17.3.1 Binomial Logistic Model
17.3.1.1 Example: Publications of COPSS Award Winners
17.3.1.2 Example: Journey to the West
17.3.2 Multinomial Logistic Model
17.4 Graphical Model
17.4.1 Gaussian Graphical Model
17.4.1.1 Example: Journey to the West
17.4.2 Ising Model
17.4.2.1 Example: Simulation Data
17.4.2.2 Example: Journey to the West
17.4.2.3 Example: Dream of the Red Chamber
17.4.3 Potts Model
17.5 Discussion
References

Part IV Application

18 Functional Data Analysis for Big Data: A Case Study on California Temperature Trends
18.1 Introduction
18.2 Basic Statistics for Functional Data
18.3 Dimension Reduction for Functional Data
18.4 Functional Principal Component Analysis
18.4.1 Smoothing and Interpolation
18.4.2 Sample Size Considerations
18.5 Functional Variance Process
18.6 Functional Data Analysis for Temperature Trends
18.7 Conclusions
References

19 Bayesian Spatiotemporal Modeling for Detecting Neuronal Activation via Functional Magnetic Resonance Imaging
19.1 Introduction
19.1.1 Emotion Processing Data
19.2 Variable Selection in Bayesian Spatiotemporal Models
19.2.1 Bezener et al.'s (2015) Areal Model
19.2.1.1 Posterior Distribution and MCMC Algorithm
19.2.1.2 Starting Values
19.2.1.3 Emotion Processing Data
19.2.2 Musgrove et al.'s (2015) Areal Model
19.2.2.1 Partitioning the Image
19.2.2.2 Spatial Bayesian Variable Selection with Temporal Correlation
19.2.2.3 Sparse SGLMM Prior
19.2.2.4 Posterior Computation and Inference
19.2.2.5 Emotion Processing Data
19.2.3 Activation Maps for Emotion Processing Data
19.3 Discussion
References

20 Construction of Tight Frames on Graphs and Application to Denoising
20.1 Introduction
20.1.1 Motivation
20.1.2 Relation to Previous Work
20.2 Notation and Basics
20.2.1 Setting
20.2.2 Frames
20.2.3 Neighborhood Graphs
20.2.4 Spectral Graph Theory
20.3 Construction and Properties
20.3.1 Construction of a Tight Graph Frame
20.3.2 Spatial Localization
20.4 Denoising
20.5 Numerical Experiments
20.6 Outlook
Appendix
Proof of Theorem 3
References

21 Beta-Boosted Ensemble for Big Credit Scoring Data
21.1 Introduction
21.2 Method Description
21.2.1 Beta Binomial Distribution
21.2.2 Beta-Boosted Ensemble Model
21.2.3 Toy Example
21.2.4 Relation to Existing Solutions
21.3 Experiments
21.4 Conclusion and Future Work
References