Here are my results. I tested master against PR #177/a1 on a synthetic sequential geometry-detection task.
Environment:
Linux x86_64
FPC 3.3.1
Lazarus trunk
11th Gen Intel® Core™ i5-11600K @ 3.90GHz
15.4 GiB of RAM
The task is binary classification over sequential numeric windows. Each sample is a 300-step window with 5 numeric features per step. The positive class represents a generated geometric shape/pattern inside the sequence; the negative class is background/no-pattern data.
Dataset:
Total windows: 9,700
Train/Validation/Test: 5,820 / 1,940 / 1,940
Input shape: 300 x 5 x 1
Classes: 2
Batch size: 64
Epochs: 10
Training time:
master, no switches: 41.68 sec
master + -dAVX2: 27.72 sec
master + -dAVX2 -O3: 27.31 sec
PR #177/a1, no switches: 34.03 sec
PR #177/a1 + -dAVX2: 19.30 sec
PR #177/a1 + -dAVX2 -O3: 13.09 sec
Evaluation time:
master, no switches: ~7.30 sec
master + -O3: 5.95 sec
master + -dAVX2: ~4.62 sec
master + -dAVX2 -O3: 4.25 sec
PR #177/a1, no switches: 5.42 sec
PR #177/a1 + -O3: 3.23 sec
PR #177/a1 + -dAVX2: 1.74 sec
PR #177/a1 + -dAVX2 -O3: 0.95 sec
Quality:
master + -dAVX2 -O3:
Accuracy: 99.02%
Precision: 98.97%
Recall: 100.00%
F1: 99.48%
FP/FN: 19 / 0
PR #177/a1 + -dAVX2 -O3:
Accuracy: 99.28%
Precision: 100.00%
Recall: 99.23%
F1: 99.62%
FP/FN: 0 / 14
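As a quick cross-check, the accuracy figures above can be recomputed from the FP/FN counts and the 1,940-window test split. This is only a minimal Python sketch; the function name `accuracy_from_errors` is made up for illustration and is not part of the benchmark code.

```python
def accuracy_from_errors(total: int, fp: int, fn: int) -> float:
    """Accuracy = correctly classified samples / all samples."""
    return (total - fp - fn) / total


TEST_WINDOWS = 1940  # test split size reported in the Dataset section

# FP/FN counts as reported for the -dAVX2 -O3 builds
master_acc = accuracy_from_errors(TEST_WINDOWS, fp=19, fn=0)
pr_acc = accuracy_from_errors(TEST_WINDOWS, fp=0, fn=14)

print(f"master:     {master_acc:.2%}")  # master:     99.02%
print(f"PR #177/a1: {pr_acc:.2%}")      # PR #177/a1: 99.28%
```

Both values match the reported accuracy. Precision and recall cannot be recomputed the same way, because the true-positive count is not listed.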
Main result:
With -dAVX2 -O3, PR #177/a1 trained about 2.1x faster and evaluated about 4.5x faster than master on this test, with similar or slightly better F1.
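For reference, the speedup factors come straight from the -dAVX2 -O3 timings listed above. A minimal Python sketch of the arithmetic (the `timings` dict layout and the `speedup` helper are assumptions made for illustration, not part of the test harness):

```python
# Timings in seconds for the "-dAVX2 -O3" builds, copied from the lists above
timings = {
    "training": {"master": 27.31, "pr": 13.09},
    "evaluation": {"master": 4.25, "pr": 0.95},
}


def speedup(baseline_sec: float, candidate_sec: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_sec / candidate_sec


for phase, t in timings.items():
    print(f"{phase}: {speedup(t['master'], t['pr']):.1f}x")
# training: 2.1x
# evaluation: 4.5x
```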