Linear algebra can be parallelised within Magma for operations
with large matrices over finite fields of small characteristic.
There are two main parallelisation modes available: multi-threaded
and GPU. A combination of these two modes (called `hybrid')
is also available but is still in an experimental phase.
Once parallelisation in one of the below modes is selected, the most
critical base algorithm which is parallelised is matrix multiplication
for sufficiently large matrices over the above fields (directly called by
A*B). The most important other matrix algorithms which
directly benefit from this are powering (A^e) and the computation
of the echelon form (EchelonForm(A)) and the many operations which
are closely derived from that, including the inverse (A^-1), and the
functions Rank, Determinant, Nullspace, Rowspace,
Solution, IsConsistent and MinimalPolynomial(A: Al :=
"KellerGehrig") (which selects the asymptotically fast Keller-Gehrig
algorithm for minimal polynomial computation).
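As an illustrative sketch, such a parallel minimal polynomial
computation might be invoked as follows (the field and dimension are
arbitrary choices, and no timings are shown since these depend on
hardware):
> SetNthreads(4);
> X := Random(MatrixRing(GF(7), 5000));
> time f := MinimalPolynomial(X: Al := "KellerGehrig");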
Any other algorithm of Magma which depends on large matrix computations
over small finite fields will also be automatically parallelised when
any of the above modes of parallelisation is selected and any of the
above functions are implicitly called. For example, most of
the important functions to compute with KG-modules, where
K is a small finite field and G is a finite group,
depend heavily on multiplication and echelonisation of matrices, and so
automatically benefit from linear algebra parallelisation in large examples;
in particular the Meataxe (called by CompositionFactors, etc.) and
the computation of sub- and quotient modules are greatly sped up in
large examples when linear algebra parallelisation is enabled.
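For example, one might enable threads before running the Meataxe on a
permutation module; this is an illustrative sketch only (the group and
field here are arbitrary choices, and no timings are shown since these
depend on hardware):
> SetNthreads(8);
> G := Sym(12);
> M := PermutationModule(G, GF(2));
> time CF := CompositionFactors(M);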
Multi-threaded linear algebra is supported in these cases:
- (i) For matrices over GF(p), where p = 2, 3, 5, 7 (POSIX threads
with split-word representation).
- (ii) For matrices over GF(p), where p is prime and 10 < p < 2^23.5
(POSIX threads with exact floating-point doubles and the parallel
OpenBLAS library or Intel Math Kernel Library).
- (iii) For matrices over GF(p^k), where p < 2^23.5 and k > 1 (parallel
Karatsuba combined with one of the above, depending on the size of p).
This POSIX threads
mode will be selected for the above types of finite field when
the global number of threads is first set to a number greater than 1
before calling the relevant functions; as usual, one simply first calls
the procedure:
SetNthreads(NT);
where NT is the number of threads desired. Note that since matrix
operations are memory intensive, the resulting speedups compared with the
serial algorithms depend strongly on the memory speed of the computer,
and it is unlikely that the speedup obtained will be close to NT.
Also, hyperthreading (setting NT to more than the number of physical
cores) is usually not beneficial because it increases the contention
between threads on memory access.
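The current setting can be confirmed with the function GetNthreads
(assuming the version in use provides it):
> SetNthreads(4);
> GetNthreads();
4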
GPU-based linear algebra (with NVIDIA CUDA) is supported in these cases:
- (i) For matrices over GF(p), where p = 2, 3, 5, 7 (CUDA code
developed within Magma).
- (ii) For matrices over GF(p), where p is prime and 10 < p < 2^23.5
(CUBLAS library or Intel Math Kernel Library).
- (iii) For matrices over GF(p^k), where p < 2^23.5 and k > 1 (parallel
Karatsuba combined with one of the above, depending on the size of p).
Since V2.26, multiple GPUs may also be selected for
matrix multiplication over GF(p), where p = 2, 3, 5, 7.
This can give a significant speedup for very large dimensions.
See Section GPUs for information on how to select multiple GPUs.
The hybrid mode combines multi-threaded and GPU
operations at the same time (for matrices defined over the same types
of finite fields as above). This mode is selected by calling
SetNthreads(NT) with NT greater than 1 while running the CUDA
executable (and with the GPU mode enabled). This mode may not be
beneficial compared with the simple POSIX threads mode or the simple
GPU mode (the outcome depends strongly on the number of threads and
the type of GPU(s)), so when running the CUDA executable it may be
better to keep the number of threads at 1 by default.
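A sketch of selecting the hybrid mode (within the CUDA executable; the
field and dimension here are arbitrary illustrative choices):
> SetGPU(true);   // GPU mode (enabled by default in the CUDA executable)
> SetNthreads(8); // NT > 1 additionally enables threads, giving hybrid mode
> X := Random(MatrixRing(GF(3), 20000));
> time P := X*X;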
The following input demonstrates simple use of POSIX threads when
multiplying matrices (on an Intel 4-core i7-7700 CPU @ 3.60GHz):
> SetNthreads(1);
> X := Random(MatrixRing(GF(5), 10000));
> time P := X*X; // single thread, normal CPU time
Time: 5.040
> time E,T := EchelonForm(X);
Time: 6.940
> SetNthreads(4);
> time P2 := X*X; // multi-threads, so real time shown by [r]
Time: 1.760[r]
> time E,T := EchelonForm(X);
Time: 2.660[r]
> assert P2 eq P;
The following input demonstrates simple use of one GPU when
multiplying matrices (NVIDIA V100 GPU; available with CUDA executable only):
> SetGPU(false);
> X := Random(MatrixRing(GF(5), 10000));
> time P := X*X; // no GPU
Time: 5.040
> time E,T := EchelonForm(X);
Time: 6.940
> SetGPU(true);
> time P2 := X*X; // 1 GPU
Time: 1.320
> time E,T := EchelonForm(X);
Time: 2.920
> assert P2 eq P;
The following input tests very large matrix multiplication over
GF(2) to see speedups when two GPUs are available (CUDA executable
only with GPU selected by default; hardware used here: dual NVIDIA V100 GPUs).
> K := GF(2);
> for i in [16 .. 19] do
> n := 2^i;
> printf "Dim: 2^%o = %o\n", i, n;
> A := Random(MatrixRing(K, n));
> B := Random(MatrixRing(K, n));
> IndentPush();
> "1 GPU:";
> t0 := Cputime(); SetNGPUs(1); time P1 := A*B; t0 := Cputime(t0);
> "2 GPUs:";
> t1 := Cputime(); SetNGPUs(2); time P2 := A*B; t1 := Cputime(t1);
> printf "Speedup: %.3o\n", t0/t1;
> assert P1 eq P2;
> IndentPop();
> end for;
Dim: 2^16 = 65536
1 GPU:
Time: 2.680
2 GPUs:
Time: 1.550
Speedup: 1.729
Dim: 2^17 = 131072
1 GPU:
Time: 18.420
2 GPUs:
Time: 11.170
Speedup: 1.649
Dim: 2^18 = 262144
1 GPU:
Time: 134.380
2 GPUs:
Time: 75.840
Speedup: 1.772
Dim: 2^19 = 524288
1 GPU:
Time: 1085.930
2 GPUs:
Time: 571.670
Speedup: 1.900
In this example we compute absolutely irreducible
modular representations of a finite simple group over the finite
field GF(17). The server used has dual Intel
3.2GHz Gold 6146 CPUs with 24 cores total, and an NVIDIA V100 GPU,
showing speedups when multiple cores or a GPU is used.
> G := PermutationGroup(ATLASGroup("O8m2"));
> K := GF(17);
> L := AbsolutelyIrreducibleModules(G, K);
> [Dimension(M): M in L];
[ 1, 34, 51, 83, 204, 204, 357, 476, 476, 595, 714, 714, 1020,
1071, 1071, 1071, 1071, 1190, 1261, 1428, 2142, 2142, 2142, 2142,
2176, 2295, 2295, 2295, 2835, 2856, 2856, 3264, 4284, 4760, 5355 ]
With one thread only:
> SetNthreads(1); SetGPU(false);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 6503.600
With 24 threads:
> SetNthreads(24); SetGPU(false);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 2432.150[r]
With one thread and a GPU:
> SetNthreads(1); SetGPU(true);
> time L := AbsolutelyIrreducibleModules(G, K);
Time: 2584.590