Parallel Computing for Data Science: With Examples in R, C++ and CUDA, 2nd Edition by Norman Matloff – Ebook PDF
Product details:
ISBN 10: 0367738198
ISBN 13: 9780367738198
Author: Norman Matloff
This is one of the first parallel computing books to focus exclusively on parallel data structures, algorithms, software tools, and applications in data science. It prepares readers to write effective parallel code in several languages and introduces a range of R packages and other tools. The book covers the classic "n observations, p variables" matrix format along with other common data structures, and many examples illustrate the issues that arise in parallel programming.
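For a feel for the style of code the book teaches, here is a minimal illustrative sketch (not taken from the book) of the snow-style API that Chapter 1 introduces; it uses makeCluster() and parLapply() from R's built-in parallel package to run a toy computation on two worker processes:

library(parallel)

cl <- makeCluster(2)                              # launch 2 worker processes
squares <- parLapply(cl, 1:10, function(i) i^2)   # apply the function across the workers
stopCluster(cl)                                   # shut the workers down

unlist(squares)   # 1 4 9 16 25 36 49 64 81 100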
Parallel Computing for Data Science: With Examples in R, C++ and CUDA, 2nd Edition – Table of contents:
1 Introduction to Parallel Processing in R
1.1 Recurring Theme: The Principle of Pretty Good Parallelism
1.1.1 Fast Enough
1.1.2 “R+X”
1.2 A Note on Machines
1.3 Recurring Theme: Hedging One’s Bets
1.4 Extended Example: Mutual Web Outlinks
1.4.1 Serial Code
1.4.2 Choice of Parallel Tool
1.4.3 Meaning of “snow” in This Book
1.4.4 Introduction to snow
1.4.5 Mutual Outlinks Problem, Solution 1
1.4.5.1 Code
1.4.5.2 Timings
1.4.5.3 Analysis of the Code
1.5 Further Reading
2 “Why Is My Program So Slow?”: Obstacles to Speed
2.1 Obstacles to Speed
2.2 Performance and Hardware Structures
2.3 Memory Basics
2.3.1 Caches
2.3.2 Virtual Memory
2.3.3 Monitoring Cache Misses and Page Faults
2.3.4 Locality of Reference
2.4 Network Basics
2.5 Latency and Bandwidth
2.5.1 Two Representative Hardware Platforms: Multicore Machines and Clusters
2.5.1.1 Multicore
2.5.1.2 Clusters
2.5.2 The Principle of “Just Leave It There”
2.6 Thread Scheduling
2.7 How Many Processes/Threads?
2.8 Example: Mutual Outlink Problem
2.9 “Big O” Notation
2.10 Data Serialization
2.11 “Embarrassingly Parallel” Applications
2.11.1 What People Mean by “Embarrassingly Parallel”
2.11.2 Suitable Platforms for Non-Embarrassingly Parallel Applications
2.12 Further Reading
3 Principles of Parallel Loop Scheduling
3.1 General Notions of Loop Scheduling
3.2 Chunking in snow
3.2.1 Example: Mutual Outlinks Problem
3.3 A Note on Code Complexity
3.4 Example: All Possible Regressions
3.4.1 Parallelization Strategies
3.4.2 The Code
3.4.3 Sample Run
3.4.4 Code Analysis
3.4.4.1 Our Task List
3.4.4.2 Chunking
3.4.4.3 Task Scheduling
3.4.4.4 The Actual Dispatching of Work
3.4.4.5 Wrapping Up
3.4.5 Timing Experiments
3.5 The partools Package
3.6 Example: All Possible Regressions, Improved Version
3.6.1 Code
3.6.2 Code Analysis
3.6.3 Timings
3.7 Introducing Another Tool: multicore
3.7.1 Source of the Performance Advantage
3.7.2 Example: All Possible Regressions, Using multicore
3.8 Issues with Chunk Size
3.9 Example: Parallel Distance Computation
3.9.1 The Code
3.9.2 Timings
3.10 The foreach Package
3.10.1 Example: Mutual Outlinks Problem
3.10.2 A Caution When Using foreach
3.11 Stride
3.12 Another Scheduling Approach: Random Task Permutation
3.12.1 The Math
3.12.2 The Random Method vs. Others, in Practice
3.13 Debugging snow and multicore Code
3.13.1 Debugging in snow
3.13.2 Debugging in multicore
4 The Shared-Memory Paradigm: A Gentle Introduction via R
4.1 So, What Is Actually Shared?
4.1.1 Global Variables
4.1.2 Local Variables: Stack Structures
4.1.3 Non-Shared Memory Systems
4.2 Clarity of Shared-Memory Code
4.3 High-Level Introduction to Shared-Memory Programming: Rdsm Package
4.3.1 Use of Shared Memory
4.4 Example: Matrix Multiplication
4.4.1 The Code
4.4.2 Analysis
4.4.3 The Code
4.4.4 A Closer Look at the Shared Nature of Our Data
4.4.5 Timing Comparison
4.4.6 Leveraging R
4.5 Shared Memory Can Bring A Performance Advantage
4.6 Locks and Barriers
4.6.1 Race Conditions and Critical Sections
4.6.2 Locks
4.6.3 Barriers
4.7 Example: Maximal Burst in a Time Series
4.7.1 The Code
4.8 Example: Transforming an Adjacency Matrix
4.8.1 The Code
4.8.2 Overallocation of Memory
4.8.3 Timing Experiment
4.9 Example: k-Means Clustering
4.9.1 The Code
4.9.2 Timing Experiment
4.10 Further Reading
5 The Shared-Memory Paradigm: C Level
5.1 OpenMP
5.2 Example: Finding the Maximal Burst in a Time Series
5.2.1 The Code
5.2.2 Compiling and Running
5.2.3 Analysis
5.2.4 A Cautionary Note About Thread Scheduling
5.2.5 Setting the Number of Threads
5.2.6 Timings
5.3 OpenMP Loop Scheduling Options
5.3.1 OpenMP Scheduling Options
5.3.2 Scheduling through Work Stealing
5.4 Example: Transforming an Adjacency Matrix
5.4.1 The Code
5.4.2 Analysis of the Code
5.5 Example: Adjacency Matrix, R-Callable Code
5.5.1 The Code, for .C()
5.5.2 Compiling and Running
5.5.3 Analysis
5.5.4 The Code, for Rcpp
5.5.5 Compiling and Running
5.5.6 Code Analysis
5.5.7 Advanced Rcpp
5.6 Speedup in C
5.7 Run Time vs. Development Time
5.8 Further Cache/Virtual Memory Issues
5.9 Reduction Operations in OpenMP
5.9.1 Example: Mutual In-Links
5.9.1.1 The Code
5.9.1.2 Sample Run
5.9.1.3 Analysis
5.9.2 Cache Issues
5.9.3 Rows vs. Columns
5.9.4 Processor Affinity
5.10 Debugging
5.10.1 Threads Commands in GDB
5.10.2 Using GDB on C/C++ Code Called from R
5.11 Intel Thread Building Blocks (TBB)
5.12 Lockfree Synchronization
5.13 Further Reading
6 The Shared-Memory Paradigm: GPUs
6.1 Overview
6.2 Another Note on Code Complexity
6.3 Goal of This Chapter
6.4 Introduction to NVIDIA GPUs and CUDA
6.4.1 Example: Calculate Row Sums
6.4.2 NVIDIA GPU Hardware Structure
6.4.2.1 Cores
6.4.2.2 Threads
6.4.2.3 The Problem of Thread Divergence
6.4.2.4 “OS in Hardware”
6.4.2.5 Grid Configuration Choices
6.4.2.6 Latency Hiding in GPUs
6.4.2.7 Shared Memory
6.4.2.8 More Hardware Details
6.4.2.9 Resource Limitations
6.5 Example: Mutual Inlinks Problem
6.5.1 The Code
6.5.2 Timing Experiments
6.6 Synchronization on GPUs
6.6.1 Data in Global Memory Is Persistent
6.7 R and GPUs
6.7.1 Example: Parallel Distance Computation
6.8 The Intel Xeon Phi Chip
6.9 Further Reading
7 Thrust and Rth
7.1 Hedging One’s Bets
7.2 Thrust Overview
7.3 Rth
7.4 Skipping the C++
7.5 Example: Finding Quantiles
7.5.1 The Code
7.5.2 Compilation and Timings
7.5.3 Code Analysis
7.6 Introduction to Rth
8 The Message Passing Paradigm
8.1 Message Passing Overview
8.2 The Cluster Model
8.3 Performance Issues
8.4 Rmpi
8.4.1 Installation and Execution
8.5 Example: Pipelined Method for Finding Primes
8.5.1 Algorithm
8.5.2 The Code
8.5.3 Timing Example
8.5.4 Latency, Bandwidth and Parallelism
8.5.5 Possible Improvements
8.5.6 Analysis of the Code
8.6 Memory Allocation Issues
8.7 Message-Passing Performance Subtleties
8.7.1 Blocking vs. Nonblocking I/O
8.7.2 The Dreaded Deadlock Problem
8.8 Further Reading
9 MapReduce Computation
9.1 Apache Hadoop
9.1.1 Hadoop Streaming
9.1.2 Example: Word Count
9.1.3 Running the Code
9.1.4 Analysis of the Code
9.1.5 Role of Disk Files
9.2 Other MapReduce Systems
9.3 R Interfaces to MapReduce Systems
9.4 An Alternative: “Snowdoop”
9.4.1 Example: Snowdoop Word Count
9.4.2 Example: Snowdoop k-Means Clustering
9.5 Further Reading
10 Parallel Sorting and Merging
10.1 The Elusive Goal of Optimality
10.2 Sorting Algorithms
10.2.1 Compare-and-Exchange Operations
10.2.2 Some “Representative” Sorting Algorithms
10.3 Example: Bucket Sort in R
10.4 Example: Quicksort in OpenMP
10.5 Sorting in Rth
10.6 Some Timing Comparisons
10.7 Sorting on Distributed Data
10.7.1 Hyperquicksort
10.8 Further Reading
11 Parallel Prefix Scan
11.1 General Formulation
11.2 Applications
11.3 General Strategies
11.3.1 A Log-Based Method
11.3.2 Another Way
11.4 Implementations of Parallel Prefix Scan
11.5 Parallel cumsum() with OpenMP
11.5.1 Stack Size Limitations
11.5.2 Let’s Try It Out
11.6 Example: Moving Average
11.6.1 Rth Code
11.6.2 Algorithm
11.6.3 Performance
11.6.4 Use of Lambda Functions
12 Parallel Matrix Operations
12.1 Tiled Matrices
12.2 Example: Snowdoop Approach
12.3 Parallel Matrix Multiplication
12.3.1 Multiplication on Message-Passing Systems
12.3.1.1 Distributed Storage
12.3.1.2 Fox’s Algorithm
12.3.1.3 Overhead Issues
12.3.2 Multiplication on Multicore Machines
12.3.2.1 Overhead Issues
12.3.3 Matrix Multiplication on GPUs
12.3.3.1 Overhead Issues
12.4 BLAS Libraries
12.4.1 Overview
12.5 Example: Performance of OpenBLAS
12.6 Example: Graph Connectedness
12.6.1 Analysis
12.6.2 The “Log Trick”
12.6.3 Parallel Computation
12.6.4 The matpow Package
12.6.4.1 Features
12.7 Solving Systems of Linear Equations
12.7.1 The Classical Approach: Gaussian Elimination and the LU Decomposition
12.7.2 The Jacobi Algorithm
12.7.2.1 Parallelization
12.7.3 Example: R/gputools Implementation of Jacobi
12.7.4 QR Decomposition
12.7.5 Some Timing Results
12.8 Sparse Matrices
12.9 Further Reading
13 Inherently Statistical Approaches: Subset Methods
13.1 Chunk Averaging
13.1.1 Asymptotic Equivalence
13.1.2 O(·) Analysis
13.1.3 Code
13.1.4 Timing Experiments
13.1.4.1 Example: Quantile Regression
13.1.4.2 Example: Logistic Model
13.1.4.3 Example: Estimating Hazard Functions
13.1.5 Non-i.i.d. Settings
13.2 Bag of Little Bootstraps
13.3 Subsetting Variables
13.4 Further Reading
A Review of Matrix Algebra
A.1 Terminology and Notation
A.1.1 Matrix Addition and Multiplication
A.2 Matrix Transpose
A.3 Linear Independence
A.4 Determinants
A.5 Matrix Inverse
A.6 Eigenvalues and Eigenvectors
A.7 Matrix Algebra in R
B R Quick Start
B.1 Correspondences
B.2 Starting R
B.3 First Sample Programming Session
B.4 Second Sample Programming Session
B.5 Third Sample Programming Session
B.6 The R List Type
B.6.1 The Basics
B.6.2 The Reduce() Function
B.6.3 S3 Classes
B.6.4 Handy Utilities
B.7 Debugging in R
C Introduction to C for R Programmers
C.0.1 Sample Program
C.0.2 Analysis
C.1 C++