https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429 https://www.programmersought.com/article/38835154723/ http://pepijndevos.nl/2018/07/04/loefflers-discrete-cosine-transform-algorithm-in-futhark.html https://github.com/gk7huki/pydct/blob/master/dct.py https://www.kaggle.com/thesherpafromalabama/2d-dct2-3-implementation https://cs.stanford.edu/people/eroberts/courses/soco/projects/data-compression/lossy/jpeg/dct.htm http://www-personal.umich.edu/~mejn/computational-physics/dcst.py https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft.py https://www.spiral.net/doc/papers/ffte-spiral.pdf https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft-complex.c https://www.nayuki.io/res/free-small-fft-in-multiple-languages/fft-real-pair.c https://www.nayuki.io/page/fast-discrete-cosine-transform-algorithms https://par.nsf.gov/servlets/purl/10137134 https://github.com/HazyResearch/learning-circuits https://www.codeproject.com/Articles/151043/Iterative-Fast-1D-Forvard-DCT https://fgiesen.wordpress.com/2013/11/04/bink-2-2-integer-dct-design-part-1/ https://www.researchgate.net/publication/309704590_Early_Determination_of_Zero-Quantized_8_8_DCT_Coefficients https://www.ijaiem.org/volume3issue5/IJAIEM-2014-05-31-117.pdf https://www.codeproject.com/KB/recipes/Fast_DCT/Lee_s_algorithm.gif reed solomon uses DFT in GF https://github.com/tomerfiliba/reedsolomon/blob/master/reedsolo.py https://github.com/Bulat-Ziganshin/FastECC/blob/master/ntt.cpp https://github.com/lrq3000/pyFileFixity/blob/master/TODO.md https://tmo.jpl.nasa.gov/progress_report2/42-35/35K.PDF fast large multiply https://arxiv.org/pdf/1503.04955

discussion and notes http://lists.libre-soc.org/pipermail/libre-soc-dev/2021-June/003198.html

see v3.0C p47 current idea being explored, a new Form (SVD-Form?) byte-reversed mode SVP64 a new form would be used: OP RT RA RC immediate 0 6 11 16 21 31 this gives *both* an immediate *and* a (dynamic) shift amount, to be able to perform the following Effective Address calculation: EA = RA + (bitrev(srcstep) * imm) << RC * the bitrev is at width log2(VL) * imm will typically be width of the data

on the svp64 ldst imm mode this line needs changing: 0-1 2 3 4 description 00 els dz sz normal mode one option: when sz=dz=1, this enables bitreverse mode. algorithm is: # Returns the integer whose value is the reverse of the lowest 'width' bits of the integer 'val'. def reverse_bits(val, width): result = 0 for _ in range(width): result = (result << 1) | (val & 1) val >>= 1 return result where width = log2(VL).

* updated fields to add SVD and SVDS form * added svfixedload pseudocode * partly updated sv ld/st page

argh, the decode for LDST is... not very clean. for the +/- twin arithemtic there is no change to the Forms, it is a matter of "if SVP64 Mode and OE=1 equals +/- twin arithmetic". LD/ST is waay more involved, because the detection of whether SVP64 RM *itself* is in LD/ST Mode involves first identifying if the *v3.0B suffix* is a LD/ST operation. once that is done, *then* bitreverse mode can be detected (in SVP64 RM) once that is done, *then* the decoder can be flipped over to the new SVP64 LDST instruction forms. argh. that's awful. the only saving grace is that the detection only needs to be on MAJOR opcodes, because only the major opcodes are big enough to fit LD/ST with immediate.

* added CSV files for major ld/st * added CSV minor 58 both use new SVD and SVDS Form respectively

finally. working unit test for LD bit-reverse https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=8f224196e18dcea037ec6c8ea1b2e38e251fe401

https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/remap_fft_yield.py;hb=HEAD FFT REMAP schedule sorted, started looking for butterfly variants of DCT. these are not very common, most Lee implementations are recursive. found one with a sloghtly dibious license, says "work shall not be used for immoral purposes"

https://en.wikipedia.org/wiki/Discrete_cosine_transform#/media/File:Single_butterfly_of_the_3-D_DCT-II_VR_DIF_algorithm.jpg

butterfly schedule done and added to ISACaller, a "hack" instruction added which hardcodes a schedule, not ideal but it works.

(In reply to Luke Kenneth Casson Leighton from comment #9) > https://en.wikipedia.org/wiki/Discrete_cosine_transform#/media/File: > Single_butterfly_of_the_3-D_DCT-II_VR_DIF_algorithm.jpg Better url: https://en.wikipedia.org/wiki/File:Single_butterfly_of_the_3-D_DCT-II_VR_DIF_algorithm.jpg

progress made adding "Vertical-First" mode which hilariously is conceptually identical to Mitch Alsup's 66000 ISA "Vector Loop" system.

TODO, create partial recursive partial iterative FFT import sys from cmath import exp, pi def fft(x): N = len(x) if N <= 1: return x if N % 2 > 0: raise ValueError("size of x must be a power of 2") even = fft(x[::2]) odd = fft(x[1::2]) r = range(N//2) T = [exp(-2j * pi * k / N) * odd[k] for k in r] [even[k] for k in r] res = ([even[k] + T[k] for k in r] + [even[k] - T[k] for k in r]) return res input_data = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0] print(' '.join("%5.3f" % abs(f) for f in fft(input_data))) clearer: def fft(x): N = len(x) T = exp(-2*pi*1j/N) if N > 1: x = fft(x[::2]) + fft(x[1::2]) for k in xrange(N/2): xk = x[k] x[k] = xk + T**k*x[k+N/2] x[k+N/2] = xk - T**k*x[k+N/2] return x

(In reply to Luke Kenneth Casson Leighton from comment #13) > TODO, create partial recursive partial iterative FFT leaving this one alone for now, the simpler RADIX8 or RADIX16 is good enough. moving on to DCT, the bink diagram and nayuki code shows that DCT can be broken down into phases: 1) a butterfly with inversion on the top half indices, 0 1 2 3 7 6 5 4 where for the lower half the multiply constants can be set to zero 2) a bitreversed iterative sum on the top half of the output, and bitreversed copy on the lower half. it is quite complex, but can be done in two halves. it *might* be possible to do the 2nd part in-place, if WaR hazards can be tolerated for the iterative sum.

result[0 : : 2] = alpha result[1 : : 2] = beta result[1 : n - 1 : 2] += beta[1 : ] in theory this can be in-place if the bitreversing has already been done. this requires DOUBLE bitreversing: the data to be LD-bitreversed AND the cos schedule to be bitreversed, result is that the cos multiplies end up in the correct indices. for i in range(half): t1, t2 = vector[i], vector[n-i-1] k = (math.cos((i + 0.5) * math.pi / n) * 2.0) alpha[i] = t1 + t2 beta[i] = (t1 - t2) * (1/k) alpha = transform(alpha) beta = transform(beta ) result = [0] * n for i in range(half): result[i*2] = alpha[i] result[i*2+1] = beta[i] for i in range(half - 1): result[i*2+1] += result[i*2+3]

have a demo DCT which is in-place, but as discussed on list has an annoying swap/copy which should be possible to get rid of but not straight away. temporary workaround is to use VerticalFirst mode which needs detecting the inner loop end.

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/remap_dct_yield.py;hb=HEAD done, got rid of the in-place swap, can now be done with a first outer butterfly and an inner one, each with their own SVP64 remap and a single instruction.

(In reply to Luke Kenneth Casson Leighton from comment #17) > https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/ > decoder/isa/remap_dct_yield.py;hb=HEAD > > done, got rid of the in-place swap, can now be done with a first > outer butterfly and an inner one, each with their own SVP64 remap > and a single instruction. combined them both into another unit test, show the inner and outer butterfly back-to-back

the next stages here are: * calculate the COS table to be stored in regs and used horizontally * calculate the COS coefficient on-demand i.e. Vertical-First the only thing is, the coefficients need to be computed from an integer which is part of the inner loop counter. to get at it, an "iota" instruction is needed. or, more specifically, a "get srcstep / dststep" instruction. the Vectorised version of this would then get internal REMAP schedule loop indices from there, those integers can be converted to floats, and the calculation "cos (ci + 0.5) * pi / size" can be done. quite a lot of instructions are needed which are not yet implemented, including the new FP into INT ones.

notes on progress: * cos table precalculation schedule added, this is a power-2 pascal triangle 4 + 2 + 1 rather than 4 + 2+2 + 1+1+1+1 * LD-BITREV needed fixing to work with REMAP. it is now the *memory offset* (the immediate) that is BOTH bitrev'd AND REMAPed where previously it could only be the RA register and of course that is a scalar only so both brev and REMAP are meaningless for it. there are a LOT of modes needed here. it may be necessary to reorganise the bits around for the instructions and submodes

LD-bitreverse does not work for iDCT due to needing to be bitreversed *and* half-swapped... those operations being applied in the *opposite* order from DCT. therefore LD-bitreverse needs to be downgraded to LD-shift and it is REMAP that needs the bitrev-halfswap and halfswap-bitrev modes. this makes LD-shift less of a priority

LDST downgraded, REMAP upgraded to compensate. initial exploration almost complete, NTT will have to wait until GF ops are available. next is SVG autogeneration.

https://git.libre-soc.org/?p=openpower-isa.git;a=commitdiff;h=413c083e4dd8452fae7c4e426851f1c19a7a9b8c DCT SVG image generator