FFmpeg Devs Boast of Up To 94x Performance Boost After Implementing Handwritten AVX-512 Assembly Code (tomshardware.com) 42

Posted by BeauHD on Monday November 04, 2024 @05:50PM from the would-you-look-at-that dept.

Anton Shilov reports via Tom's Hardware: FFmpeg is an open-source video decoding project developed by volunteers who contribute to its codebase, fix bugs, and add new features. The project is led by a small group of core developers and maintainers who oversee its direction and ensure that contributions meet certain standards. They coordinate the project's development and release cycles, merging contributions from other developers. This group of developers tried to implement a handwritten AVX512 assembly code path, something that has rarely been done before, at least not in the video industry.

The developers have created an optimized code path using the AVX-512 instruction set to accelerate specific functions within the FFmpeg multimedia processing library. By leveraging AVX-512, they were able to achieve significant performance improvements -- from three to 94 times faster -- compared to standard implementations. AVX-512 enables processing large chunks of data in parallel using 512-bit registers, which can hold up to 16 single-precision or 8 double-precision floating-point values per operation. This optimization is well suited to compute-heavy tasks in general, and to video and image processing in particular.
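
For a sense of where those lane counts come from, here is a rough sketch in Python. It is plain arithmetic, not FFmpeg code, and the helper name `lanes` is made up for illustration. It divides the register width by the element width and also prints the 8-bit case, since pixel data is often stored as bytes.

```python
# Illustrative arithmetic only: how many elements fit in one SIMD register.
# `lanes` is a hypothetical helper, not part of FFmpeg or any library.

def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that one register holds."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits

# Register widths for the instruction sets named in the summary.
REGISTER_BITS = {"SSSE3 (xmm)": 128, "AVX2 (ymm)": 256, "AVX-512 (zmm)": 512}

for name, bits in REGISTER_BITS.items():
    print(f"{name}: {lanes(bits, 32)} x float32, "
          f"{lanes(bits, 64)} x float64, {lanes(bits, 8)} x 8-bit")
```

With byte-sized elements a 512-bit register covers 64 values per instruction, which may be part of why gains over scalar C can look so large in pixel-oriented functions.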

The benchmarking results show that the new handwritten AVX-512 code path performs considerably faster than other implementations, including baseline C code and lower SIMD instruction sets like AVX2 and SSSE3. In some cases, the revamped AVX-512 codepath achieves a speedup of nearly 94 times over the baseline, highlighting the efficiency of hand-optimized assembly code for AVX-512.

Neat... But, (Score:4, Insightful)

by Valgrus Thunderaxe ( 8769977 ) writes: on Monday November 04, 2024 @05:55PM (#64919749)

With a 94x improvement, someone needs to fix their compiler.

- Re: (Score:2)
  
  by phantomfive ( 622387 ) writes:
  
  They're too busy, the compiler writers are working on breaking compatibility with doubtful optimizations. [cr.yp.to]
Pretty fucking cool (Score:2)

by backslashdot ( 95548 ) writes:

I hope it's a real world benchmark not some contrived situation and propaganda from Intel.
- Re:Pretty fucking cool (Score:5, Informative)
  
  by test321 ( 8891681 ) writes: on Monday November 04, 2024 @06:28PM (#64919821)
  
  propaganda from Intel.
  If anything, the propaganda would be from AMD.
  Intel disabled AVX-512 for its 12th, 13th, and 14th Generation Core processors, leaving owners of these CPUs without it. On the other hand, AMD's Ryzen 9000-series CPUs feature a fully-enabled AVX-512 FPU so the owners of these processors can take advantage of the FFmpeg achievement.
  
where is AI? (Score:3, Funny)

by dfghjk ( 711126 ) writes: on Monday November 04, 2024 @06:07PM (#64919771)

Why didn't they just ask ChatGPT to rewrite it? Handwritten? Haven't we been told there's no reason for that?

- Spit you out the same garbage, answer 94x faster. (Score:2)
  
  by stooo ( 2202012 ) writes:
  
  If you ask ChatGPT to write you 94x faster code, it will spit you out the same garbage, just answer 94x faster.
- Re: (Score:2)
  
  by Darinbob ( 1142669 ) writes:
  
  They asked ChatGPT, but it said "Ain't no one got time for hand coding!"
Every boomer programmer just shrugged (Score:5, Insightful)

by ip_freely_2000 ( 577249 ) writes: on Monday November 04, 2024 @06:09PM (#64919777)

I've had 90%+ optimizations on certain data processing functions by hand coding and tuning instead of depending on libraries and other 'productivity' tools. In my coding life (which is long but fortunately very nearly over) we've added layer upon layer of complexity which is sometimes not necessary.

- Re: (Score:3)
  
  by backslashdot ( 95548 ) writes:
  
  I stopped coding/optimizing in assembly over 20 years ago, and then only utilized knowledge of it for debugging, cybersecurity, or fun purposes for a few years. Nowadays I have zero use for it other than getting super annoyed that people don't know it.
  - Re: (Score:2)
    
    by dsgrntlxmply ( 610492 ) writes:
    
    Hey punk, get off my ROM!
    My last significant use of assembly was late 80s / early 90s. Working on embedded systems with 8-bit microprocessors, tiny boot ROM capacity, and rudimentary or no compiler, there was no choice. Around 2017, I had to take a deep dive into GCC compiled ARM code to characterize an obscure but dramatic failure in a specific embedded situation. This turned out to be incorrect code generated by GCC. Once characterized, it was not difficult to work around, but it required machine level
- Re:Every boomer programmer just shrugged (Score:4, Insightful)
  
  by Tony Isaac ( 1301187 ) writes: on Monday November 04, 2024 @06:21PM (#64919801) Homepage
  
  That, and programmers often use boneheaded algorithms because they don't know any better.
  Remember Bubble Sort? If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort. And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing, and isn't even intuitive. Students normally react with "How does that even work???" But you know that algorithm made its way into more than a few production systems.
  Software optimization employs some very specific techniques. Notably, using some kind of profiler to identify where your bottlenecks are, and looking for ways to reduce execution or loop counts, or ways to reduce the time spent in each iteration. There's a whole lot of software, including decoding algorithms, that never went through any kind of proper optimization analysis.
  I agree, it's not surprising to find ways to increase performance by 90+%, regardless of the language chosen.
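
To make the "use a profiler first" step concrete, here is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The functions `workload` and `slow_sum_of_squares` are invented stand-ins for a real program, not anything from FFmpeg.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n: int) -> int:
    """Deliberately naive hot spot: a Python-level loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload() -> int:
    # Hypothetical program: a little cheap work plus one expensive call.
    cheap = sum(range(1000))
    return cheap + slow_sum_of_squares(200_000)

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Sort by cumulative time so the costliest call chains appear first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report should point at `slow_sum_of_squares` as the place to spend tuning effort, which is the kind of evidence worth having before anyone reaches for assembly.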
  
  - Re: (Score:2)
    
    by Entrope ( 68843 ) writes:
    
    If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort.
    Challenge accepted [wikipedia.org].
    - Re: (Score:2)
      
      by Tony Isaac ( 1301187 ) writes:
      
      Funny! Well, since this sort compares its slowness to Bubble Sort, it would seem that Bubble Sort might still get 2nd place for slowest!
    - Re: (Score:2)
      
      by vux984 ( 928602 ) writes:
      
      Bah - I like random sort, which is essentially:
      swap two elements at random
      check if the list is sorted now
      repeat if the list is not sorted
      Given enough time, it will sort the list, quite by accident. ;)
  - Re: (Score:3)
    
    by ShanghaiBill ( 739463 ) writes:
    
    If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort.
    Then you lack imagination. Bubble sort is O(n^2). There are O(n^3) sorting algorithms. Here's an O(n!) sort:
    1. Shuffle data randomly
    2. Test if it is sorted. If yes, you're done, else go to 1.
    And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing
    Bubble sort is useful for very small datasets, like 10 or so, and constrained memory or cache capacity.
    But bubble sort is most taught as an example of a naive implementation leading to poor performance.
    isn't even intuitive. Students normally react with "How does that even work???"
    It's obvious why Bubble sort works. It is way easier to understand than Quicksort.
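
The two-step shuffle sort above can be sketched as follows. This is only an illustration: the names `bogosort` and `is_sorted` are my own, and the fixed seed is there just to make runs repeatable.

```python
import random

def is_sorted(items: list) -> bool:
    """True if every element is <= its successor."""
    return all(a <= b for a, b in zip(items, items[1:]))

def bogosort(items: list, rng: random.Random) -> tuple[list, int]:
    """Shuffle until sorted; returns (sorted copy, number of shuffles)."""
    data = list(items)
    shuffles = 0
    while not is_sorted(data):
        rng.shuffle(data)  # step 1: shuffle data randomly
        shuffles += 1      # step 2 is the loop condition: test, else repeat
    return data, shuffles

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
result, shuffles = bogosort([3, 1, 4, 1, 5, 9, 2], rng)
print(result, "after", shuffles, "shuffles")
```

For n distinct items a random shuffle is sorted with probability 1/n!, so the expected number of shuffles is n!, and there is no upper bound on the worst case. Keep the input tiny if you actually run it.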
    - Re: (Score:3)
      
      by 93 Escort Wagon ( 326346 ) writes:
      
      1. Shuffle data randomly
      2. Test if it is sorted. If yes, you're done, else go to 1.
      Look, I've told you before - I get really tired of people reposting my code without attribution.
  - Re: (Score:2)
    
    by cheesybagel ( 670288 ) writes:
    
    Remember Bubble Sort? If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort. And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing, and isn't even intuitive. Students normally react with "How does that even work???" But you know that algorithm made its way into more than a few production systems.
    Bubble Sort sorts an already sorted list in O(n) time. Try doing the same thing with Merge Sor
    - Re: (Score:2)
      
      by Tony Isaac ( 1301187 ) writes:
      
      Shell sort and insertion sort are both simple and both do better than bubble sort, even in your "ideal" scenario.
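
One way to check the best-case claims in this sub-thread is to count comparisons. The sketch below is a textbook-style bubble sort with early exit next to a plain insertion sort, both instrumented with a counter. The function names and the three test inputs are my own choices.

```python
def bubble_sort(items: list) -> tuple[list, int]:
    """Bubble sort with early exit; returns (sorted copy, comparisons)."""
    data = list(items)
    comparisons = 0
    n = len(data)
    while n > 1:
        swapped = False
        for i in range(n - 1):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            break  # a full pass with no swaps means the list is sorted
        n -= 1  # the largest remaining element is now in place
    return data, comparisons

def insertion_sort(items: list) -> tuple[list, int]:
    """Insertion sort; returns (sorted copy, comparisons)."""
    data = list(items)
    comparisons = 0
    for i in range(1, len(data)):
        value = data[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if data[j] <= value:
                break
            data[j + 1] = data[j]  # shift larger element one slot right
            j -= 1
        data[j + 1] = value
    return data, comparisons

already_sorted = list(range(10))
reversed_input = already_sorted[::-1]
smallest_last = list(range(1, 10)) + [0]  # sorted except for one element

for name, sort in (("bubble", bubble_sort), ("insertion", insertion_sort)):
    counts = [sort(x)[1] for x in (already_sorted, reversed_input, smallest_last)]
    print(name, counts)
```

On the already-sorted input both versions make n-1 comparisons, so the O(n) best case is not unique to bubble sort. The gap opens on the third input, where one small element sits at the far end: bubble sort moves it only one slot per pass, while insertion sort handles it in a single inner loop.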
  - Re: (Score:2)
    
    by phantomfive ( 622387 ) writes:
    
    Remember Bubble Sort? If you tried to build the most inefficient algorithm possible, it's hard to imagine one that would beat Bubble Sort
    Bubble sort is faster than Quicksort for less than 8 items. That sounds like nothing, but then you realize many if not most sorts done probably have fewer than 8 items.
    - Re: (Score:2)
      
      by Tony Isaac ( 1301187 ) writes:
      
      And shell sort and insertion sort would be even faster.
  - Re: (Score:2)
    
    by thegarbz ( 1787294 ) writes:
    
    And yet for years, every computer programming textbook taught this algorithm that's useful for basically nothing, and isn't even intuitive.
    You didn't pay any attention in class. Bubble sort is held up in virtually every textbook as an example of something that does the base line job in an inefficient way. It is literally taught as an example of not being useful.
    That said I think your assertion that it isn't intuitive is quite silly. It's probably the most intuitive algorithm that is. Is current value bigger than next value in array? If so, switch, repeat, done. There is literally nothing more intuitive than comparing two numbers and just movin
- Re: Every boomer programmer just shrugged (Score:1)
  
  by io333 ( 574963 ) writes:
  
  Optimizing in assembly used to be a routine thing. I guess it went away because: 1. Takes forever. 2. Only very smart people can do it. 3. Idiot managers only want shipped product, today!
  I think #2 is the real bottleneck. Geniuses have lots of options, working for idiots is a shitty option.
  - Re: Every boomer programmer just shrugged (Score:1)
    
    by dowhileor ( 7796472 ) writes:
    
    I wonder how much faster it will be when it's written in rust?
    - Re: (Score:2)
      
      by olmsfam ( 1399493 ) writes:
      
      Haha I came here to comment almost the same thing. Can't tell if you are being sarcastic though. But I sure was going to be.
      The masturbating security monkeys sure seem to think Rust is hot shit but I want REAL WORLD examples dammit. If I was a billionaire I would be paying top programmers to battle head to head, and benchmark both the code, and TIME TO WRITE the code, of C vs Rust
      - Re: Every boomer programmer just shrugged (Score:2)
        
        by viperidaenz ( 2515578 ) writes:
        
        When does the competition end? You should be counting vulnerabilities in the initial product and also over time as it gets updated with new features and existing bugs get fixed by developers who didn't have anything to do with the original code.
  - Re: (Score:3)
    
    by mukundajohnson ( 10427278 ) writes:
    
    Well, maintenance nightmare aside, it's actually pretty hard to get performance out of handwritten assembly on x86. It's not worth it. I've seen the compiler spit out utter garbage but it still runs at roughly the same speed of handwritten assembly, just because of the amazing pipeline process the CPU has. The main benefit you'd get is a smaller binary file. The SIMD instructions are the outlier, where compiler support might not be good enough, where the instructions are difficult to generate for.
- Re: (Score:2)
  
  by Darinbob ( 1142669 ) writes:
  
  Mostly I have used assembler to do stuff needed in a system, stuff that a general purpose library does for you on a full operating system. But in an embedded system you are the full operating system, and the RTOSs out there don't give you system startup code and the like. For instance, cache invalidation instructions, interrupt/exception handlers, memory barriers, context switching, etc. Other times you _know_ the code is very slow and can be sped up, and can't easily be sped up with pure standard C code.
  In
Doesn't this depend on chip model? (Score:3)

by Xylantiel ( 177496 ) writes: on Monday November 04, 2024 @06:19PM (#64919795)

I fiddled with this a bit once and it seemed that not all chips implemented "actual" AVX-512, i.e. some chips just support the instructions but don't actually have the hardware to do all those operations in parallel. Maybe that is discussed more in the article.

Well, that's one architecture. (Score:1)

by jddj ( 1085169 ) writes:

Due to the nature of assembler, that code will be bound to one CPU architecture.
Won't help those of us on ARM or Apple silicon.
I suppose we could offload video over the network to an x86 architecture box, but that'll eat up some of that 94x.
Wonder what Windows and MacOS emulators make of raw machine code?
- Re: (Score:2)
  
  by aBlueMe ( 7317380 ) writes:
  
  I have an acquaintance that works for one of the FAANG companies. The focus of their team is to hand write assembly for performance critical operations across the company. They do this for multiple chip architectures.
  Your phone or PC might well have some of that code in it.
- Re: (Score:3)
  
  by divide overflow ( 599608 ) writes:
  
  AVX512 instructions are specific to x86. If you want the same sort of accelerations in ARM (including Apple M series processors) you need to use something like Scalable Vector Extension (SVE) or Scalable Vector Extension 2 (SVE2) which is written for the ARM architecture family.
- Re: (Score:2)
  
  by phantomfive ( 622387 ) writes:
  
  Architecturally, it's not a problem. You just encapsulate the assembly in a function, and use polymorphism, based on whether the CPU feature is available. For the platforms that don't have the ability, they will just run slower. Apparently 94x slower or something.
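
The encapsulate-and-dispatch idea can be sketched roughly like this. It is a Python stand-in, not FFmpeg's actual mechanism: `add_wide` only imitates a 16-wide path, the `KERNELS` table and `select_kernel` are hypothetical, and the feature set is passed in by hand rather than detected from the CPU.

```python
from typing import Callable, Iterable, Optional

Kernel = Callable[[list, list], list]

def add_scalar(a: list, b: list) -> list:
    """Baseline path: one element at a time, works everywhere."""
    return [x + y for x, y in zip(a, b)]

def add_wide(a: list, b: list) -> list:
    """Stand-in for an accelerated path: handles 16 elements per step."""
    out: list = []
    for start in range(0, len(a), 16):
        chunk_a = a[start:start + 16]
        chunk_b = b[start:start + 16]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return out

# Ordered from most to least specialised; the last entry needs no feature.
# "avx512f" is used here purely as a label for the required capability.
KERNELS: list[tuple[Optional[str], Kernel]] = [
    ("avx512f", add_wide),
    (None, add_scalar),
]

def select_kernel(cpu_features: Iterable[str]) -> Kernel:
    """Pick the first kernel whose required feature is available."""
    features = set(cpu_features)
    for required, kernel in KERNELS:
        if required is None or required in features:
            return kernel
    raise RuntimeError("no usable kernel")

# Assumed feature set for a machine without AVX-512.
add = select_kernel({"sse2", "avx2"})
print(add.__name__, add([1.0, 2.0], [3.0, 4.0]))
```

Callers only ever see `add`, so machines without the feature still get correct results from the slower path, and a new architecture means adding a table entry rather than touching call sites.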
ffmpeg is just... awesome. (Score:1)

by Anonymous Coward writes:

A while back, I was working at a place that did video production. They had these expensive, barely working video appliances whose license fees were just plain extortion, and the support was often, "buy our newer model, and we might fix that". I took the physical appliance, removed the disk with the vendor OS and set it aside in case it was needed, installed Linux and used ffmpeg for everything that appliance did. It worked perfectly, and did what we needed it to do, and might as well use the Supermicro hardware that
Crappy summary. (Score:2)

by msauve ( 701917 ) writes:

Since the summary couldn't be bothered, AVX-512 is an instruction set extension for X86 processors.
Not surprised (Score:1)

by spaglia ( 1163639 ) writes:

Hand coding can drastically improve performance if you know what you're doing and the compiler is doing a poor job. That being said, Intel is deprecating its AVX-512 support since it wasn't worth the silicon. AMD on the other hand did a much better job of it.
