c++ - Faster abs-max of float array
I need to draw peak meters for audio in real time: at least 44100 samples per second, times a minimum of 40 streams. Each buffer is between 64 and 1024 samples. I need to grab the abs-max from each buffer. (These values are then fed through a filter and drawn at intervals of about 20 ms.)
for (int i = 0; i < numsamples; i++) { absMaxOfBuffer = MAX(fabs(buffer[i]), absMaxOfBuffer); }
That's how I do it now, and I'd like to do it much faster. The buffers hold floats in the range -1 to 1, hence the fabs.
Question: is there some clever algorithmic trick for doing this faster?
Failing that, do branchless ABS and MAX functions for floats exist?
EDIT: The primary platform is Linux/GCC, but a Windows port is planned (probably with MinGW).
Edit, the second:
I gave the accept to onebyone because of the part about the actual algorithm structure, which was central to the question.
I'll try unrolling the loop to four samples at a time, zeroing the sign bits and then taking the max with SSE (the maxps instruction), and see if that doesn't peel the banana. Thanks for the suggestions; I've up-voted a few of you as runners-up.

fabs and comparison are both really fast for IEEE floats (as in single-integer-op fast, in principle).
If the compiler isn't inlining both operations, then either poke it until it does, or find the implementation for your architecture and inline it yourself.
You can maybe get something out of the fact that positive IEEE floats sort in the same order as the integers with the same bit patterns. That is,
f > g iff *(int*)&f > *(int*)&g
So once you've fabs'ed, I think that a branch-free max for int will also work for float (assuming they're the same size, of course). An explanation of why this works was linked here. But your compiler already knows all this, as does your CPU, so it may not make any difference.
There is no complexity-faster way of doing it. Your algorithm is already O(n), and you can't beat that and still look at every sample.
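As a rough illustration of the bit-pattern ordering described above, here is a minimal sketch. It assumes IEEE 754 single-precision floats and a buffer free of NaNs; the names absBits and absMaxViaBits are hypothetical and not from the original post. It uses memcpy rather than the *(int*)&f cast, since the cast runs into aliasing rules in C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Bit pattern of |x|: copy the bits out and clear the sign bit (bit 31).
inline std::uint32_t absBits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits & 0x7FFFFFFFu;
}

// Hypothetical helper: abs-max of a buffer using integer compares only.
// Assumption: IEEE 754 floats and no NaNs in the buffer.
inline float absMaxViaBits(const float* buffer, std::size_t numSamples) {
    assert(buffer != nullptr || numSamples == 0);
    std::uint32_t maxBits = 0;  // bit pattern of +0.0f
    for (std::size_t i = 0; i < numSamples; ++i) {
        const std::uint32_t b = absBits(buffer[i]);
        maxBits = b > maxBits ? b : maxBits;  // non-negative floats order like their bit patterns
    }
    float result;
    std::memcpy(&result, &maxBits, sizeof result);
    return result;
}
```

Whether this beats plain fabs plus a float max depends on the compiler and CPU, so it is worth measuring rather than assuming.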
I guess there is probably something in your processor's SIMD instructions (that is, SSE2 on Intel) that would help, by processing more data per clock cycle than your code does. But I don't know what. If there is, it could quite possibly be several times faster.
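One way the SIMD idea might look, as a sketch only: clear the sign bits of four floats at once and keep a running per-lane maximum with maxps. This assumes an x86 target with SSE; the function name absMaxSse is hypothetical, and on targets without SSE the code falls back to the plain scalar loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define ABSMAX_HAVE_SSE 1
#endif

// Hypothetical helper: abs-max of a buffer, four samples per iteration where SSE is available.
inline float absMaxSse(const float* buffer, std::size_t numSamples) {
    assert(buffer != nullptr || numSamples == 0);
    float result = 0.0f;
    std::size_t i = 0;
#ifdef ABSMAX_HAVE_SSE
    const __m128 signMask = _mm_set1_ps(-0.0f);  // only the sign bit set in each lane
    __m128 vmax = _mm_setzero_ps();
    for (; i + 4 <= numSamples; i += 4) {
        const __m128 v = _mm_loadu_ps(buffer + i);            // unaligned load of 4 floats
        vmax = _mm_max_ps(vmax, _mm_andnot_ps(signMask, v));  // (~mask & v) is |v|, then maxps
    }
    float lanes[4];
    _mm_storeu_ps(lanes, vmax);
    for (float lane : lanes) result = lane > result ? lane : result;  // horizontal max
#endif
    // Scalar tail (or the whole buffer when SSE is not available).
    for (; i < numSamples; ++i) {
        const float a = std::fabs(buffer[i]);
        result = a > result ? a : result;
    }
    return result;
}
```

If the buffers are known to be 16-byte aligned, _mm_load_ps could replace the unaligned load; that alignment is an assumption the caller would have to guarantee.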
You could probably parallelize on a multi-core CPU, especially since you're dealing with 40 independent streams anyway. That will be at best a few times faster. "Just" launch the appropriate number of extra threads, split the work between them, and use the lightest-weight primitive you can to indicate when they are all complete (maybe a thread barrier). I'm not quite clear whether you're plotting the max of all 40 streams or the max of each one separately, so maybe you don't actually need to synchronise the worker threads at all, other than to ensure the results are delivered to the next stage without data corruption.
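A minimal sketch of splitting independent streams across threads, assuming each stream's peak is wanted separately. The function name absMaxPerStream and the vector-of-vectors layout are hypothetical. Each worker writes only to its own result slots, so joining the threads is the only synchronisation needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: peaks[s] receives the abs-max of streams[s].
inline std::vector<float> absMaxPerStream(const std::vector<std::vector<float>>& streams,
                                          unsigned numThreads) {
    assert(numThreads > 0);
    std::vector<float> peaks(streams.size(), 0.0f);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&streams, &peaks, t, numThreads] {
            // Thread t handles streams t, t + numThreads, t + 2 * numThreads, ...
            for (std::size_t s = t; s < streams.size(); s += numThreads) {
                float peak = 0.0f;
                for (float x : streams[s]) {
                    const float a = std::fabs(x);
                    if (a > peak) peak = a;
                }
                peaks[s] = peak;  // no other thread touches this slot
            }
        });
    }
    for (auto& w : workers) w.join();  // results are safe to read after join
    return peaks;
}
```

Creating threads for every batch of buffers is likely too expensive for real-time audio; a long-lived pool of workers woken per batch would be the more realistic design, and this sketch only shows how the work divides. With GCC it may need the -pthread flag.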
It's probably worth looking at the disassembly to see how much the compiler has unrolled the loop. Try unrolling it a bit more and see if that makes any difference.
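To experiment with manual unrolling, something like the following sketch could serve as a starting point. It unrolls by four and keeps four independent running maxima so the max operations do not form one long dependency chain. The name absMaxUnrolled is hypothetical, and whether it helps at all depends on what the compiler already does; compiling with g++ -O2 -S shows the generated assembly for comparison.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: abs-max with the loop unrolled by four.
inline float absMaxUnrolled(const float* buffer, std::size_t numSamples) {
    assert(buffer != nullptr || numSamples == 0);
    float m0 = 0.0f, m1 = 0.0f, m2 = 0.0f, m3 = 0.0f;  // independent accumulators
    std::size_t i = 0;
    for (; i + 4 <= numSamples; i += 4) {
        m0 = std::fmax(m0, std::fabs(buffer[i]));
        m1 = std::fmax(m1, std::fabs(buffer[i + 1]));
        m2 = std::fmax(m2, std::fabs(buffer[i + 2]));
        m3 = std::fmax(m3, std::fabs(buffer[i + 3]));
    }
    // Leftover samples when numSamples is not a multiple of four.
    for (; i < numSamples; ++i) {
        m0 = std::fmax(m0, std::fabs(buffer[i]));
    }
    return std::fmax(std::fmax(m0, m1), std::fmax(m2, m3));
}
```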
Another thing to think about is how many cache misses you're getting, and whether it's possible to reduce the number by giving the cache a few hints so that it can load the right pages ahead of time. But I have no experience with this, and I wouldn't hold out much hope. __builtin_prefetch is the magic incantation on GCC, and I guess the first experiment would be something like "prefetch the beginning of the next block before entering the loop for this block".
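That first prefetch experiment might look roughly like this. It is a sketch only: the function name absMaxWithPrefetch and the nextBuffer parameter are hypothetical, and the prefetch is only a hint that the CPU is free to ignore. On compilers without the GCC builtin, the hint is simply skipped.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: abs-max of the current block, hinting the cache about the next one.
// nextBuffer may be nullptr when there is no following block.
inline float absMaxWithPrefetch(const float* buffer, std::size_t numSamples,
                                const float* nextBuffer) {
    assert(buffer != nullptr || numSamples == 0);
#if defined(__GNUC__)
    if (nextBuffer != nullptr) {
        // Arguments: address, 0 = prefetch for reading, 3 = keep in all cache levels.
        __builtin_prefetch(nextBuffer, 0, 3);
    }
#else
    (void)nextBuffer;  // no prefetch hint available on this compiler
#endif
    float peak = 0.0f;
    for (std::size_t i = 0; i < numSamples; ++i) {
        const float a = std::fabs(buffer[i]);
        if (a > peak) peak = a;
    }
    return peak;
}
```

A single prefetch only covers one cache line, so for larger blocks the experiment would probably need several hints spaced a cache line apart.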
What percentage of the required speed are you at currently? Or is it a case of "as fast as possible"?