In distributed memory parallelism, every task has its own private space in memory. Tasks therefore cannot directly access other tasks' data; instead, the data needs to be communicated between tasks, and this communication must be coded explicitly.
The Message Passing Interface (MPI) is the standard typically used for this kind of computing.
The explicit coding of the communication makes developing such MPI-parallel code more difficult; however, for parallelism beyond a single compute node it is essential. Consequently, it is usually not worthwhile to MPI-parallelize code where development time is an important aspect, such as analysis scripts. A notable exception is an analysis script that processes large amounts of data (beyond 100 GB or so) where the individual chunks are independent of each other. In general, the more communication a given algorithm needs, the harder it is to MPI-parallelize.
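Such a chunk-wise independent analysis can be sketched as follows, assuming an MPI installation and compilation with `mpicc` (the chunk count and the placeholder computation are illustrative only):

```c
/* chunked_analysis.c -- sketch of analyzing independent data chunks
 * with MPI; each rank processes its own share of the chunks, and the
 * per-rank results are combined at the end with a single reduction. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_chunks = 64;     /* hypothetical number of chunks */
    double local_result = 0.0;

    /* Static round-robin distribution: chunk i goes to rank i % size.
     * Because the chunks are independent, no communication at all is
     * needed while they are being processed. */
    for (int i = rank; i < n_chunks; i += size) {
        /* ...read chunk i and analyze it; placeholder computation... */
        local_result += (double)i;
    }

    /* One collective operation at the end combines the results. */
    double total = 0.0;
    MPI_Reduce(&local_result, &total, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("combined result: %g\n", total);

    MPI_Finalize();
    return 0;
}
```

Run with e.g. `mpirun -n 4 ./chunked_analysis`; the only communication is the final `MPI_Reduce`, which is why this pattern parallelizes so easily.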
It is possible to combine shared memory parallelism and distributed memory parallelism. This is usually done by having an outer layer of MPI (distributed-memory) parallel code with an inner layer that is OpenMP (shared-memory) parallel. Consequently, the granularity of the distributed memory parallel tasks is quite coarse, while the OpenMP layer has a finer granularity.
For high-performance computing, this is a way of naturally adopting the general architecture of modern supercomputers: a large number of interconnected nodes, each with its own memory and several tens of compute cores. It is important to note, however, that setting the number of OpenMP threads equal to the number of compute cores per node is the maximal, but not necessarily the optimal, choice. Hybrid parallelism also reduces the number of MPI ranks needed compared to a pure MPI-parallel code by the corresponding factor, which is important for very large simulations where MPI (all-to-all) communication can become an important contributor to the total runtime.
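A minimal sketch of this hybrid layout, assuming an MPI installation and compilation with e.g. `mpicc -fopenmp` (the loop body is a placeholder):

```c
/* hybrid.c -- outer MPI layer (coarse-grained, one rank per node),
 * inner OpenMP layer (fine-grained, threads within each rank). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Request thread support; MPI_THREAD_FUNNELED suffices when only
     * the main thread of each rank makes MPI calls. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inner shared-memory layer: the threads of this rank split the
     * loop among themselves; the reduction avoids a race condition. */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1e-6;          /* placeholder work */

    /* Outer distributed-memory layer: combine across ranks. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum: %g\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

The number of threads per rank is then typically controlled via the `OMP_NUM_THREADS` environment variable, e.g. `OMP_NUM_THREADS=8 mpirun -n 16 ./hybrid`.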
A form of parallel computing where every thread has access to the complete memory. On a cluster this is only usable within a single node. Shared memory parallelism has become increasingly important with the establishment of multi-core processors also in desktop computers and laptops. It is in general relatively easy to implement and therefore also suitable for code where development time is an important aspect, such as analysis scripts.
Using shared memory parallelism can speed up code by up to a factor of a few, bounded by the number of available cores. Optimizing the algorithm itself should usually be done before thinking about this.
In C/C++ the standard for shared-memory parallelism is OpenMP. A typical example is a for-loop that performs operations and stores the results in an array (with one element per loop iteration). One needs to take care that variables used only within one thread are declared private to it, and that writing data does not result in race conditions.
For Python, it is possible to use Cython (I don't have practical experience with this; it seems quite complicated to program, though). There are also a number of library functions that are internally shared-memory parallel.