Introduction to High Performance Computing for Scientists and Engineers, 1st Edition, by Georg Hager and Gerhard Wellein
Product details:
ISBN 10: 1138470899
ISBN 13: 9781138470897
Authors: Georg Hager, Gerhard Wellein
Written by high performance computing (HPC) experts, Introduction to High Performance Computing for Scientists and Engineers provides a solid introduction to current mainstream computer architecture, the dominant parallel programming models, and useful optimization strategies for scientific HPC. From their work in a scientific computing center, the authors gained a practical perspective on the requirements of both users and manufacturers of parallel computers.
Introduction to High Performance Computing for Scientists and Engineers, 1st Edition: Table of contents
1 Modern processors
1.1 Stored-program computer architecture
1.2 General-purpose cache-based microprocessor architecture
1.2.1 Performance metrics and benchmarks
1.2.2 Transistors galore: Moore’s Law
1.2.3 Pipelining
1.2.4 Superscalarity
1.2.5 SIMD
1.3 Memory hierarchies
1.3.1 Cache
1.3.2 Cache mapping
1.3.3 Prefetch
1.4 Multicore processors
1.5 Multithreaded processors
1.6 Vector processors
1.6.1 Design principles
1.6.2 Maximum performance estimates
1.6.3 Programming for vector architectures
Problems
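As a concrete illustration of the single-core features covered in Chapter 1 (pipelining, superscalar execution, and SIMD), here is a minimal C sketch of a streaming "vector triad" kernel whose independent iterations a compiler can pipeline and vectorize. The array length and all names are assumptions for this example, not code from the book.

    /* Illustrative SIMD-friendly streaming kernel (vector triad a = b + c*d).
       Array size and the trivial check are assumptions, not the book's code. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 10000000

    int main(void) {
        double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c), *d = malloc(N * sizeof *d);
        for (long i = 0; i < N; ++i) { b[i] = 1.0; c[i] = 2.0; d[i] = 0.5; }

        /* Independent iterations: the compiler can pipeline and SIMD-vectorize this loop. */
        for (long i = 0; i < N; ++i)
            a[i] = b[i] + c[i] * d[i];

        printf("a[0] = %f\n", a[0]);   /* prevent dead-code elimination */
        free(a); free(b); free(c); free(d);
        return 0;
    }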
2 Basic optimization techniques for serial code
2.1 Scalar profiling
2.1.1 Function- and line-based runtime profiling
Function profiling
Line-based profiling
2.1.2 Hardware performance counters
2.1.3 Manual instrumentation
2.2 Common sense optimizations
2.2.1 Do less work!
2.2.2 Avoid expensive operations!
2.2.3 Shrink the working set!
2.3 Simple measures, large impact
2.3.1 Elimination of common subexpressions
2.3.2 Avoiding branches
2.3.3 Using SIMD instruction sets
2.4 The role of compilers
2.4.1 General optimization options
2.4.2 Inlining
2.4.3 Aliasing
2.4.4 Computational accuracy
2.4.5 Register optimizations
2.4.6 Using compiler logs
2.5 C++ optimizations
2.5.1 Temporaries
2.5.2 Dynamic memory management
Lazy construction
Static construction
2.5.3 Loop kernels and iterators
Problems
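The "common sense" and "simple measures" items of Chapter 2 can be illustrated with a small before/after sketch: a loop-invariant divide is computed once (eliminating a common subexpression and an expensive operation), and a branch is hoisted out of the inner loop. Function and variable names are made up for this illustration.

    /* Two serial optimizations named in Chapter 2: reuse of 1.0/scale and
       branch hoisting. Names are illustrative only. */
    #include <stddef.h>

    /* Before: divide and branch inside the loop */
    void scale_before(double *x, size_t n, double scale, int add_offset, double offset) {
        for (size_t i = 0; i < n; ++i) {
            if (add_offset)
                x[i] = x[i] / scale + offset;
            else
                x[i] = x[i] / scale;
        }
    }

    /* After: compute the reciprocal once; evaluate the branch once outside the loop */
    void scale_after(double *x, size_t n, double scale, int add_offset, double offset) {
        const double rscale = 1.0 / scale;   /* expensive divide done only once */
        if (add_offset) {
            for (size_t i = 0; i < n; ++i) x[i] = x[i] * rscale + offset;
        } else {
            for (size_t i = 0; i < n; ++i) x[i] = x[i] * rscale;
        }
    }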
3 Data access optimization
3.1 Balance analysis and lightspeed estimates
3.1.1 Bandwidth-based performance modeling
3.1.2 The STREAM benchmarks
3.2 Storage order
3.3 Case study: The Jacobi algorithm
3.4 Case study: Dense matrix transpose
3.5 Algorithm classification and access optimizations
3.5.1 O(N)/O(N)
3.5.2 O(N²)/O(N²)
3.5.3 O(N³)/O(N²)
3.6 Case study: Sparse matrix-vector multiply
3.6.1 Sparse matrix storage schemes
3.6.2 Optimizing JDS sparse MVM
Problems
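As an illustration of the sparse matrix storage schemes treated in Section 3.6, here is a minimal CRS (compressed row storage) matrix-vector multiply in C; the struct layout and names are assumptions of this sketch, not the book's code.

    /* Illustrative CRS sparse matrix-vector multiply, y = A*x. */
    #include <stddef.h>

    typedef struct {
        size_t nrows;
        const size_t *row_ptr;  /* nrows+1 entries; row i occupies [row_ptr[i], row_ptr[i+1]) */
        const size_t *col_idx;  /* column index of each nonzero */
        const double *val;      /* value of each nonzero */
    } crs_matrix;

    void crs_spmv(const crs_matrix *A, const double *x, double *y) {
        for (size_t i = 0; i < A->nrows; ++i) {
            double sum = 0.0;
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
                sum += A->val[k] * x[A->col_idx[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }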
4 Parallel computers
4.1 Taxonomy of parallel computing paradigms
4.2 Shared-memory computers
4.2.1 Cache coherence
4.2.2 UMA
4.2.3 ccNUMA
4.3 Distributed-memory computers
4.4 Hierarchical (hybrid) systems
4.5 Networks
4.5.1 Basic performance characteristics of networks
Point-to-point connections
Bisection bandwidth
4.5.2 Buses
4.5.3 Switched and fat-tree networks
4.5.4 Mesh networks
4.5.5 Hybrids
Problems
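To make the point-to-point characteristics of Section 4.5.1 concrete, a simple latency/bandwidth model describes the transfer time of an N-byte message as T(N) = T_l + N/B, where T_l is the latency and B the asymptotic bandwidth, so the effective bandwidth is B_eff(N) = N / (T_l + N/B). With illustrative (assumed) numbers T_l = 2 µs and B = 1 GB/s, a 1 kB message takes about 3 µs and achieves B_eff ≈ 330 MB/s, only a third of the asymptotic bandwidth.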
5 Basics of parallelization
5.1 Why parallelize?
5.2 Parallelism
5.2.1 Data parallelism
Example: Medium-grained loop parallelism
Example: Coarse-grained parallelism by domain decomposition
5.2.2 Functional parallelism
Example: Master-worker scheme
Example: Functional decomposition
5.3 Parallel scalability
5.3.1 Factors that limit parallel execution
5.3.2 Scalability metrics
5.3.3 Simple scalability laws
5.3.4 Parallel efficiency
5.3.5 Serial performance versus strong scalability
5.3.6 Refined performance models
5.3.7 Choosing the right scaling baseline
5.3.8 Case study: Can slower processors compute faster?
Modified weak scaling
5.3.9 Load imbalance
OS jitter
Problems
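To make the "simple scalability laws" of Section 5.3.3 concrete: if a fraction s of the runtime is purely serial and the remaining fraction 1 - s parallelizes perfectly across N workers, the strong-scaling speedup is bounded by Amdahl's Law, S(N) = 1 / (s + (1 - s)/N). With illustrative numbers s = 0.1 and N = 16 this gives S ≈ 6.4, and even N → ∞ cannot exceed 1/s = 10.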
6 Shared-memory parallel programming with OpenMP
6.1 Short introduction to OpenMP
6.1.1 Parallel execution
6.1.2 Data scoping
6.1.3 OpenMP worksharing for loops
6.1.4 Synchronization
Critical regions
6.1.5 Reductions
6.1.6 Loop scheduling
6.1.7 Tasking
6.1.8 Miscellaneous
Conditional compilation
Memory consistency
Thread safety
Affinity
Environment variables
6.2 Case study: OpenMP-parallel Jacobi algorithm
6.3 Advanced OpenMP: Wavefront parallelization
Problems
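A minimal OpenMP sketch in C, combining a worksharing loop (Section 6.1.3) with a reduction (Section 6.1.5); the array, its length, and the printed output are assumptions for this illustration, not the book's code.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;

    #pragma omp parallel
        {
    #pragma omp for
            for (int i = 0; i < N; ++i)          /* worksharing loop */
                a[i] = 1.0 / (i + 1.0);

    #pragma omp for reduction(+:sum)
            for (int i = 0; i < N; ++i)          /* reduction clause avoids a data race */
                sum += a[i];
        }

        printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }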
7 Efficient OpenMP programming
7.1 Profiling OpenMP programs
7.2 Performance pitfalls
7.2.1 Ameliorating the impact of OpenMP worksharing constructs
Run serial code if parallelism does not pay off
Avoid implicit barriers
Try to minimize the number of parallel regions
Avoid dynamic/guided loop scheduling or tasking unless necessary
7.2.2 Determining OpenMP overhead for short loops
7.2.3 Serialization
7.2.4 False sharing
7.3 Case study: Parallel sparse matrix-vector multiply
Problems
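Section 7.2.4 names false sharing as a performance pitfall: when per-thread data packed into one cache line is written by different threads, the line bounces between cores. Below is a hedged sketch of the problem and one common remedy, padding to an assumed 64-byte line; the struct and names are illustrative only.

    #include <omp.h>

    #define MAX_THREADS 64

    /* Problematic: adjacent doubles share a cache line and cause false sharing */
    double partial_bad[MAX_THREADS];

    /* One possible fix: pad each counter to a full (assumed 64-byte) cache line */
    typedef struct { double val; char pad[64 - sizeof(double)]; } padded_double;
    padded_double partial_good[MAX_THREADS];

    void accumulate(const double *x, long n) {
    #pragma omp parallel
        {
            int t = omp_get_thread_num();
            partial_good[t].val = 0.0;
    #pragma omp for
            for (long i = 0; i < n; ++i)
                partial_good[t].val += x[i];   /* no false sharing thanks to padding */
        }
    }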
8 Locality optimizations on ccNUMA architectures
8.1 Locality of access on ccNUMA
8.1.1 Page placement by first touch
8.1.2 Access locality by other means
8.2 Case study: ccNUMA optimization of sparse MVM
8.3 Placement pitfalls
8.3.1 NUMA-unfriendly OpenMP scheduling
8.3.2 File system cache
8.4 ccNUMA issues with C++
8.4.1 Arrays of objects
8.4.2 Standard Template Library
Problems
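The "first touch" policy of Section 8.1.1 places a memory page in the locality domain of the thread that writes it first, so initialization should use the same parallel loop schedule as the later computation. The following sketch illustrates the idea; the function names and the static schedule are assumptions for this example.

    #include <stdlib.h>

    double *alloc_and_first_touch(long n) {
        double *a = malloc(n * sizeof *a);     /* malloc maps pages, but does not place them */
    #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            a[i] = 0.0;                        /* first write places the page near this thread */
        return a;
    }

    void triad(double *a, const double *b, const double *c, const double *d, long n) {
    #pragma omp parallel for schedule(static)  /* same static schedule -> local memory access */
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + c[i] * d[i];
    }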
9 Distributed-memory parallel programming with MPI
9.1 Message passing
9.2 A short introduction to MPI
9.2.1 A simple example
9.2.2 Messages and point-to-point communication
9.2.3 Collective communication
9.2.4 Nonblocking point-to-point communication
9.2.5 Virtual topologies
9.3 Example: MPI parallelization of a Jacobi solver
9.3.1 MPI implementation
9.3.2 Performance properties
Weak scaling
Strong scaling
Problems
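A minimal MPI point-to-point sketch in the spirit of Section 9.2: every rank sends its rank number to rank 0, which receives and prints them. The tag value and output format are arbitrary choices for this illustration.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (int src = 1; src < size; ++src) {
                int msg;
                MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 0 received %d from rank %d\n", msg, src);
            }
        } else {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }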
10 Efficient MPI programming
10.1 MPI performance tools
10.2 Communication parameters
10.3 Synchronization, serialization, contention
10.3.1 Implicit serialization and synchronization
10.3.2 Contention
10.4 Reducing communication overhead
10.4.1 Optimal domain decomposition
Minimizing interdomain surface area
Mapping issues
10.4.2 Aggregating messages
Message aggregation and derived datatypes
10.4.3 Nonblocking vs. asynchronous communication
10.4.4 Collective communication
10.5 Understanding intranode point-to-point communication
Problems
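Sections 9.2.4 and 10.4.3 distinguish blocking from nonblocking point-to-point communication; the sketch below posts MPI_Irecv/MPI_Isend for both neighbors of a ring before waiting, which avoids the ordering deadlock of a naive blocking exchange and leaves room to overlap independent computation. The ring topology, buffer sizes, and tags are assumptions for this example.

    #include <mpi.h>

    void ring_exchange(double *sendbuf, double *recv_left, double *recv_right,
                       int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        MPI_Request req[4];
        MPI_Irecv(recv_left,  n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recv_right, n, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(sendbuf,    n, MPI_DOUBLE, right, 0, comm, &req[2]);
        MPI_Isend(sendbuf,    n, MPI_DOUBLE, left,  1, comm, &req[3]);

        /* ...independent computation could overlap with communication here... */

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }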
11 Hybrid parallelization with MPI and OpenMP
11.1 Basic MPI/OpenMP programming models
11.1.1 Vector mode implementation
11.1.2 Task mode implementation
11.1.3 Case study: Hybrid Jacobi solver
11.2 MPI taxonomy of thread interoperability
11.3 Hybrid decomposition and mapping
One MPI process per node
One MPI process per socket
Multiple MPI processes per socket
11.4 Potential benefits and drawbacks of hybrid programming
Improved rate of convergence
Re-use of data in shared caches
Exploiting additional levels of parallelism
Overlapping MPI communication and computation
Reducing MPI overhead
Multiple levels of overhead
Bulk-synchronous communication in vector mode
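A minimal sketch of hybrid MPI/OpenMP startup matching the thread-interoperability discussion of Section 11.2: the program requests MPI_THREAD_FUNNELED (only the master thread calls MPI, as in "vector mode") and checks what the library actually provides. The printed messages are assumptions for this illustration.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (provided < MPI_THREAD_FUNNELED && rank == 0)
            printf("warning: requested thread support level not available\n");

    #pragma omp parallel
        {
    #pragma omp single
            printf("rank %d runs %d OpenMP threads\n", rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }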
Appendix A Topology and affinity in multicore environments
A.1 Topology
A.2 Thread and process placement
A.2.1 External affinity control
A.2.2 Affinity under program control
A.3 Page placement beyond first touch
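As an illustration of "affinity under program control" (Section A.2.2), the sketch below pins each OpenMP thread to one core on Linux via the glibc sched_setaffinity call; the simple "thread t to core t" mapping is an assumption and ignores real machine topology.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
    #pragma omp parallel
        {
            int t = omp_get_thread_num();
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(t, &mask);                 /* pin thread t to core t (illustrative mapping) */
            if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
                perror("sched_setaffinity");
            printf("thread %d pinned to core %d (now on core %d)\n",
                   t, t, sched_getcpu());
        }
        return 0;
    }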
Appendix B Solutions to the problems
Solution 1.1 (page 34): How fast is a divide?
Solution 1.2 (page 34): Dependencies revisited.
Solution 1.3 (page 35): Hardware prefetching.
Solution 1.4 (page 35): Dot product and prefetching.
Solution 2.1 (page 62): The perils of branching.
Solution 2.2 (page 62): SIMD despite recursion?
Solution 2.3 (page 62): Lazy construction on the stack.
Solution 2.4 (page 62): Fast assignment.
Solution 3.1 (page 91): Strided access.
Solution 3.2 (page 92): Balance fun.
Solution 3.3 (page 92): Performance projection.
Solution 3.5 (page 93): Inner loop unrolling revisited.
Solution 3.6 (page 93): Not unrollable?
Solution 3.7 (page 93): Application optimization.
Solution 3.8 (page 93): TLB impact.
Solution 4.1 (page 114): Building fat-tree network hierarchies.
Solution 5.1 (page 140): Overlapping communication and computation.
Solution 5.2 (page 141): Choosing an optimal number of workers.
Solution 5.3 (page 141): The impact of synchronization.
Solution 5.4 (page 141): Accelerator devices.
Solution 6.1 (page 162): OpenMP correctness.
Solution 6.2 (page 162): π by Monte Carlo.
Solution 6.3 (page 163): Disentangling critical regions.
Solution 6.4 (page 163): Synchronization perils.
Solution 6.5 (page 163): Unparallelizable?
Solution 6.6 (page 163): Gauss-Seidel pipelined.
Solution 7.1 (page 184): Privatization gymnastics.
Solution 7.2 (page 184): Superlinear speedup.
Solution 7.3 (page 184): Reductions and initial values.
Solution 7.4 (page 184): Optimal thread count.
Solution 8.1 (page 201): Dynamic scheduling and ccNUMA.
Solution 8.2 (page 202): Unfortunate chunksizes.
Solution 8.3 (page 202): Speeding up “small” jobs.
Solution 8.4 (page 202): Triangular matrix-vector multiplication.
Solution 8.5 (page 202): NUMA placement by overloading.
Solution 9.1 (page 233): Shifts and deadlocks.
Solution 9.2 (page 233): Deadlocks and nonblocking MPI.
Solution 9.3 (page 234): Open boundary conditions.
Solution 9.4 (page 234): A performance model for strong scaling of the parallel Jacobi code.
Solution 9.5 (page 234): MPI correctness.
Solution 10.1 (page 260): Reductions and contention.
Solution 10.2 (page 260): Allreduce, optimized.
Solution 10.3 (page 260): Eager vs. rendezvous.
Solution 10.4 (page 260): Is cubic always optimal?
Solution 10.5 (page 260): Riding the PingPong curve.
Solution 10.6 (page 261): Nonblocking Jacobi revisited.
Solution 10.7 (page 261): Send and receive combined.
Solution 10.8 (page 261): Load balancing and domain decomposition.
Bibliography
Standard works
Parallel programming
Tools
Computer architecture and design
Performance modeling
Numerical techniques and libraries
Optimization techniques
Large-scale parallelism
Applications
C++ references
Vendor-specific information and documentation
Web sites and online resources
Computer history
Miscellaneous