Introduction to High Performance Computing for Scientists and Engineers, 1st Edition, by Georg Hager and Gerhard Wellein
Product details:
ISBN 10: 1138470899
ISBN 13: 9781138470897
Authors: Georg Hager, Gerhard Wellein
Written by high performance computing (HPC) experts, Introduction to High Performance Computing for Scientists and Engineers provides a solid introduction to current mainstream computer architecture, the dominant parallel programming models, and useful optimization strategies for scientific HPC. From their work in a scientific computing center, the authors gained a practical perspective on the requirements of both users and manufacturers of parallel computers.
Introduction to High Performance Computing for Scientists and Engineers, 1st Edition: Table of contents
1 Modern processors
1.1 Stored-program computer architecture
1.2 General-purpose cache-based microprocessor architecture
1.2.1 Performance metrics and benchmarks
1.2.2 Transistors galore: Moore’s Law
1.2.3 Pipelining
1.2.4 Superscalarity
1.2.5 SIMD
1.3 Memory hierarchies
1.3.1 Cache
1.3.2 Cache mapping
1.3.3 Prefetch
1.4 Multicore processors
1.5 Multithreaded processors
1.6 Vector processors
1.6.1 Design principles
1.6.2 Maximum performance estimates
1.6.3 Programming for vector architectures
Problems
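As a concrete illustration of the single-core features covered in Chapter 1 (pipelining, superscalar execution, and SIMD), here is a minimal C sketch of a streaming "vector triad" kernel whose independent iterations a compiler can pipeline and vectorize. The array length and all names are assumptions for this example, not code from the book.

    /* Illustrative SIMD-friendly streaming kernel (vector triad a = b + c*d).
       Array size and the trivial check are assumptions, not the book's code. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 10000000

    int main(void) {
        double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c), *d = malloc(N * sizeof *d);
        for (long i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; d[i] = 0.5; }

        /* Independent iterations: the compiler can pipeline and SIMD-vectorize this loop. */
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + c[i] * d[i];

        printf("a[0] = %f\n", a[0]);   /* prevent dead-code elimination */
        free(a); free(b); free(c); free(d);
        return 0;
    }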
2 Basic optimization techniques for serial code
2.1 Scalar profiling
2.1.1 Function- and line-based runtime profiling
Function profiling
Line-based profiling
2.1.2 Hardware performance counters
2.1.3 Manual instrumentation
2.2 Common sense optimizations
2.2.1 Do less work!
2.2.2 Avoid expensive operations!
2.2.3 Shrink the working set!
2.3 Simple measures, large impact
2.3.1 Elimination of common subexpressions
2.3.2 Avoiding branches
2.3.3 Using SIMD instruction sets
2.4 The role of compilers
2.4.1 General optimization options
2.4.2 Inlining
2.4.3 Aliasing
2.4.4 Computational accuracy
2.4.5 Register optimizations
2.4.6 Using compiler logs
2.5 C++ optimizations
2.5.1 Temporaries
2.5.2 Dynamic memory management
Lazy construction
Static construction
2.5.3 Loop kernels and iterators
Problems
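The "common sense" and "simple measures" items of Chapter 2 can be illustrated with a small before/after sketch: a loop-invariant divide is computed once (eliminating a common subexpression and an expensive operation), and a branch is hoisted out of the inner loop. Function and variable names are made up for this illustration.

    /* Two serial optimizations named in Chapter 2: reuse of 1.0/scale and
       branch hoisting. Names are illustrative only. */
    #include <stddef.h>

    /* Before: divide and branch inside the loop */
    void scale_before(double *x, size_t n, double scale, int add_offset, double offset) {
        for (size_t i = 0; i < n; ++i) {
            if (add_offset)
                x[i] = x[i] / scale + offset;
            else
                x[i] = x[i] / scale;
        }
    }

    /* After: compute the reciprocal once; evaluate the branch once outside the loop */
    void scale_after(double *x, size_t n, double scale, int add_offset, double offset) {
        const double rscale = 1.0 / scale;   /* expensive divide done only once */
        if (add_offset) {
            for (size_t i = 0; i < n; ++i) x[i] = x[i] * rscale + offset;
        } else {
            for (size_t i = 0; i < n; ++i) x[i] = x[i] * rscale;
        }
    }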
3 Data access optimization
3.1 Balance analysis and lightspeed estimates
3.1.1 Bandwidth-based performance modeling
3.1.2 The STREAM benchmarks
3.2 Storage order
3.3 Case study: The Jacobi algorithm
3.4 Case study: Dense matrix transpose
3.5 Algorithm classification and access optimizations
3.5.1 O(N)/O(N)
3.5.2 O(N²)/O(N²)
3.5.3 O(N³)/O(N²)
3.6 Case study: Sparse matrix-vector multiply
3.6.1 Sparse matrix storage schemes
3.6.2 Optimizing JDS sparse MVM
Problems
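As an illustration of the sparse matrix storage schemes treated in Section 3.6, here is a minimal CRS (compressed row storage) matrix-vector multiply in C; the struct layout and names are assumptions of this sketch, not the book's code.

    /* Illustrative CRS sparse matrix-vector multiply, y = A*x. */
    #include <stddef.h>

    typedef struct {
        size_t nrows;
        const size_t *row_ptr;  /* nrows+1 entries; row i occupies [row_ptr[i], row_ptr[i+1]) */
        const size_t *col_idx;  /* column index of each nonzero */
        const double *val;      /* value of each nonzero */
    } crs_matrix;

    void crs_spmv(const crs_matrix *A, const double *x, double *y) {
        for (size_t i = 0; i < A->nrows; ++i) {
            double sum = 0.0;
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
                sum += A->val[k] * x[A->col_idx[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }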
4 Parallel computers
4.1 Taxonomy of parallel computing paradigms
4.2 Shared-memory computers
4.2.1 Cache coherence
4.2.2 UMA
4.2.3 ccNUMA
4.3 Distributed-memory computers
4.4 Hierarchical (hybrid) systems
4.5 Networks
4.5.1 Basic performance characteristics of networks
Point-to-point connections
Bisection bandwidth
4.5.2 Buses
4.5.3 Switched and fat-tree networks
4.5.4 Mesh networks
4.5.5 Hybrids
Problems
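To make the point-to-point characteristics of Section 4.5.1 concrete, a simple latency/bandwidth model describes the transfer time of an N-byte message as T(N) = T_l + N/B, where T_l is the latency and B the asymptotic bandwidth, so the effective bandwidth is B_eff(N) = N / (T_l + N/B). With illustrative (assumed) numbers T_l = 2 µs and B = 1 GB/s, a 1 kB message takes about 3 µs and achieves B_eff ≈ 330 MB/s, only a third of the asymptotic bandwidth.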
5 Basics of parallelization
5.1 Why parallelize?
5.2 Parallelism
5.2.1 Data parallelism
Example: Medium-grained loop parallelism
Example: Coarse-grained parallelism by domain decomposition
5.2.2 Functional parallelism
Example: Master-worker scheme
Example: Functional decomposition
5.3 Parallel scalability
5.3.1 Factors that limit parallel execution
5.3.2 Scalability metrics
5.3.3 Simple scalability laws
5.3.4 Parallel efficiency
5.3.5 Serial performance versus strong scalability
5.3.6 Refined performance models
5.3.7 Choosing the right scaling baseline
5.3.8 Case study: Can slower processors compute faster?
Modified weak scaling
5.3.9 Load imbalance
OS jitter
Problems
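To make the "simple scalability laws" of Section 5.3.3 concrete: if a fraction s of the runtime is purely serial and the remaining fraction 1 - s parallelizes perfectly across N workers, the strong-scaling speedup is bounded by Amdahl's Law, S(N) = 1 / (s + (1 - s)/N). With illustrative numbers s = 0.1 and N = 16 this gives S ≈ 6.4, and even N → ∞ cannot exceed 1/s = 10.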
6 Shared-memory parallel programming with OpenMP
6.1 Short introduction to OpenMP
6.1.1 Parallel execution
6.1.2 Data scoping
6.1.3 OpenMP worksharing for loops
6.1.4 Synchronization
Critical regions
6.1.5 Reductions
6.1.6 Loop scheduling
6.1.7 Tasking
6.1.8 Miscellaneous
Conditional compilation
Memory consistency
Thread safety
Affinity
Environment variables
6.2 Case study: OpenMP-parallel Jacobi algorithm
6.3 Advanced OpenMP: Wavefront parallelization
Problems
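A minimal OpenMP sketch in C, combining a worksharing loop (Section 6.1.3) with a reduction (Section 6.1.5); the array, its length, and the printed output are assumptions for this illustration, not the book's code.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;

    #pragma omp parallel
        {
    #pragma omp for
            for (int i = 0; i < N; ++i)          /* worksharing loop */
                a[i] = 1.0 / (i + 1.0);

    #pragma omp for reduction(+:sum)
            for (int i = 0; i < N; ++i)          /* reduction clause avoids a data race */
                sum += a[i];
        }

        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }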
7 Efficient OpenMP programming
7.1 Profiling OpenMP programs
7.2 Performance pitfalls
7.2.1 Ameliorating the impact of OpenMP worksharing constructs
Run serial code if parallelism does not pay off
Avoid implicit barriers
Try to minimize the number of parallel regions
Avoid dynamic/guided loop scheduling or tasking unless necessary
7.2.2 Determining OpenMP overhead for short loops
7.2.3 Serialization
7.2.4 False sharing
7.3 Case study: Parallel sparse matrix-vector multiply
Problems
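Section 7.2.4 names false sharing as a performance pitfall: when per-thread data packed into one cache line is written by different threads, the line bounces between cores. Below is a hedged sketch of the problem and one common remedy, padding to an assumed 64-byte line; the struct and names are illustrative only.

    #include <omp.h>

    #define MAX_THREADS 64

    /* Problematic: adjacent doubles share a cache line and cause false sharing */
    double partial_bad[MAX_THREADS];

    /* One possible fix: pad each counter to a full (assumed 64-byte) cache line */
    typedef struct { double val; char pad[64 - sizeof(double)]; } padded_double;
    padded_double partial_good[MAX_THREADS];

    void accumulate(const double *x, long n) {
    #pragma omp parallel
        {
            int t = omp_get_thread_num();
            partial_good[t].val = 0.0;
    #pragma omp for
            for (long i = 0; i < n; ++i)
                partial_good[t].val += x[i];   /* no false sharing thanks to padding */
        }
    }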
8 Locality optimizations on ccNUMA architectures
8.1 Locality of access on ccNUMA
8.1.1 Page placement by first touch
8.1.2 Access locality by other means
8.2 Case study: ccNUMA optimization of sparse MVM
8.3 Placement pitfalls
8.3.1 NUMA-unfriendly OpenMP scheduling
8.3.2 File system cache
8.4 ccNUMA issues with C++
8.4.1 Arrays of objects
8.4.2 Standard Template Library
Problems
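The "first touch" policy of Section 8.1.1 places a memory page in the locality domain of the thread that writes it first, so initialization should use the same parallel loop schedule as the later computation. The following sketch illustrates the idea; the function names and the static schedule are assumptions for this example.

    #include <stdlib.h>

    double *alloc_and_first_touch(long n) {
        double *a = malloc(n * sizeof *a);     /* malloc maps pages, but does not place them */
    #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            a[i] = 0.0;                        /* first write places the page near this thread */
        return a;
    }

    void triad(double *a, const double *b, const double *c, const double *d, long n) {
    #pragma omp parallel for schedule(static)  /* same static schedule -> local memory access */
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + c[i] * d[i];
    }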
9 Distributed-memory parallel programming with MPI
9.1 Message passing
9.2 A short introduction to MPI
9.2.1 A simple example
9.2.2 Messages and point-to-point communication
9.2.3 Collective communication
9.2.4 Nonblocking point-to-point communication
9.2.5 Virtual topologies
9.3 Example: MPI parallelization of a Jacobi solver
9.3.1 MPI implementation
9.3.2 Performance properties
Weak scaling
Strong scaling
Problems
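A minimal MPI point-to-point sketch in the spirit of Section 9.2: every rank sends its rank number to rank 0, which receives and prints them. The tag value and output format are arbitrary choices for this illustration.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (int src = 1; src < size; ++src) {
                int msg;
                MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank %d\n", msg, src);
            }
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }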
10 Efficient MPI programming
10.1 MPI performance tools
10.2 Communication parameters
10.3 Synchronization, serialization, contention
10.3.1 Implicit serialization and synchronization
10.3.2 Contention
10.4 Reducing communication overhead
10.4.1 Optimal domain decomposition
Minimizing interdomain surface area
Mapping issues
10.4.2 Aggregating messages
Message aggregation and derived datatypes
10.4.3 Nonblocking vs. asynchronous communication
10.4.4 Collective communication
10.5 Understanding intranode point-to-point communication
Problems
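Sections 9.2.4 and 10.4.3 distinguish blocking from nonblocking point-to-point communication; the sketch below posts MPI_Irecv/MPI_Isend for both neighbors of a ring before waiting, which avoids the ordering deadlock of a naive blocking exchange and leaves room to overlap independent computation. The ring topology, buffer sizes, and tags are assumptions for this example.

    #include <mpi.h>

    void ring_exchange(double *sendbuf, double *recv_left, double *recv_right,
                       int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        MPI_Request req[4];
        MPI_Irecv(recv_left,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recv_right, n, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(sendbuf,    n, MPI_DOUBLE, right, 0, comm, &req[2]);
        MPI_Isend(sendbuf,    n, MPI_DOUBLE, left,  1, comm, &req[3]);

        /* ...independent computation could overlap with communication here... */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }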
11 Hybrid parallelization with MPI and OpenMP
11.1 Basic MPI/OpenMP programming models
11.1.1 Vector mode implementation
11.1.2 Task mode implementation
11.1.3 Case study: Hybrid Jacobi solver
11.2 MPI taxonomy of thread interoperability
11.3 Hybrid decomposition and mapping
One MPI process per node
One MPI process per socket
Multiple MPI processes per socket
11.4 Potential benefits and drawbacks of hybrid programming
Improved rate of convergence
Re-use of data in shared caches
Exploiting additional levels of parallelism
Overlapping MPI communication and computation
Reducing MPI overhead
Multiple levels of overhead
Bulk-synchronous communication in vector mode
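A minimal sketch of hybrid MPI/OpenMP startup matching the thread-interoperability discussion of Section 11.2: the program requests MPI_THREAD_FUNNELED (only the master thread calls MPI, as in "vector mode") and checks what the library actually provides. The printed messages are assumptions for this illustration.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_FUNNELED && rank == 0)
            printf("warning: requested thread support level not available\n");

    #pragma omp parallel
        {
    #pragma omp single
            printf("rank %d runs %d OpenMP threads\n", rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }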
Appendix A Topology and affinity in multicore environments
A.1 Topology
A.2 Thread and process placement
A.2.1 External affinity control
A.2.2 Affinity under program control
A.3 Page placement beyond first touch
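As an illustration of "affinity under program control" (Section A.2.2), the sketch below pins each OpenMP thread to one core on Linux via the glibc sched_setaffinity call; the simple "thread t to core t" mapping is an assumption and ignores real machine topology.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
    #pragma omp parallel
        {
            int t = omp_get_thread_num();
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(t, &mask);                 /* pin thread t to core t (illustrative mapping) */
            if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
                perror("sched_setaffinity");
            printf("thread %d pinned to core %d (now on core %d)\n",
                   t, t, sched_getcpu());
        }
        return 0;
    }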
Appendix B Solutions to the problems
Solution 1.1 (page 34): How fast is a divide?
Solution 1.2 (page 34): Dependencies revisited.
Solution 1.3 (page 35): Hardware prefetching.
Solution 1.4 (page 35): Dot product and prefetching.
Solution 2.1 (page 62): The perils of branching.
Solution 2.2 (page 62): SIMD despite recursion?
Solution 2.3 (page 62): Lazy construction on the stack.
Solution 2.4 (page 62): Fast assignment.
Solution 3.1 (page 91): Strided access.
Solution 3.2 (page 92): Balance fun.
Solution 3.3 (page 92): Performance projection.
Solution 3.5 (page 93): Inner loop unrolling revisited.
Solution 3.6 (page 93): Not unrollable?
Solution 3.7 (page 93): Application optimization.
Solution 3.8 (page 93): TLB impact.
Solution 4.1 (page 114): Building fat-tree network hierarchies.
Solution 5.1 (page 140): Overlapping communication and computation.
Solution 5.2 (page 141): Choosing an optimal number of workers.
Solution 5.3 (page 141): The impact of synchronization.
Solution 5.4 (page 141): Accelerator devices.
Solution 6.1 (page 162): OpenMP correctness.
Solution 6.2 (page 162): π by Monte Carlo.
Solution 6.3 (page 163): Disentangling critical regions.
Solution 6.4 (page 163): Synchronization perils.
Solution 6.5 (page 163): Unparallelizable?
Solution 6.6 (page 163): Gauss-Seidel pipelined.
Solution 7.1 (page 184): Privatization gymnastics.
Solution 7.2 (page 184): Superlinear speedup.
Solution 7.3 (page 184): Reductions and initial values.
Solution 7.4 (page 184): Optimal thread count.
Solution 8.1 (page 201): Dynamic scheduling and ccNUMA.
Solution 8.2 (page 202): Unfortunate chunksizes.
Solution 8.3 (page 202): Speeding up “small” jobs.
Solution 8.4 (page 202): Triangular matrix-vector multiplication.
Solution 8.5 (page 202): NUMA placement by overloading.
Solution 9.1 (page 233): Shifts and deadlocks.
Solution 9.2 (page 233): Deadlocks and nonblocking MPI.
Solution 9.3 (page 234): Open boundary conditions.
Solution 9.4 (page 234): A performance model for strong scaling of the parallel Jacobi code.
Solution 9.5 (page 234): MPI correctness.
Solution 10.1 (page 260): Reductions and contention.
Solution 10.2 (page 260): Allreduce, optimized.
Solution 10.3 (page 260): Eager vs. rendezvous.
Solution 10.4 (page 260): Is cubic always optimal?
Solution 10.5 (page 260): Riding the PingPong curve.
Solution 10.6 (page 261): Nonblocking Jacobi revisited.
Solution 10.7 (page 261): Send and receive combined.
Solution 10.8 (page 261): Load balancing and domain decomposition.
Bibliography
Standard works
Parallel programming
Tools
Computer architecture and design
Performance modeling
Numerical techniques and libraries
Optimization techniques
Large-scale parallelism
Applications
C++ references
Vendor-specific information and documentation
Web sites and online resources
Computer history
Miscellaneous