RFE: opt-in AI based file classification #3760

pmatilai · 2025-05-19T08:13:05Z

Many of the file formats we deal with, such as ELF, are quickly and cleanly detectable by their magic bytes. But then there are all the different text-based formats, such as scripting languages and documentation formats, where libmagic really has no chance to get it right. So this might be an actually useful application for AI, the machine learning kind.

So basically we could have libmagic do what it does best, and add an option to flip MAGIC_NO_CHECK_TEXT and use an AI-based classifier for that part only. There could be other similar cases; text files are just the most obvious one.
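A minimal sketch of that split, in Python for brevity. Every name here is hypothetical: `magic_classify` stands in for a libmagic call with text checks disabled and is assumed to return `None` when it has no answer, and `text_classify` stands in for whatever AI-based classifier gets plugged in. How libmagic actually reports "no match" in that mode is not settled here.

```python
from typing import Callable, Optional


def classify(path: str,
             magic_classify: Callable[[str], Optional[str]],
             text_classify: Callable[[str], str]) -> str:
    """Prefer the magic-byte result; fall back to the text classifier.

    Both callables are placeholders (assumptions), not real library APIs.
    """
    result = magic_classify(path)
    if result is not None:
        # Magic bytes gave a definite answer (e.g. ELF): keep it.
        return result
    # No magic match: treat the file as text and ask the second classifier.
    return text_classify(path)


if __name__ == "__main__":
    # Toy stand-ins, only to show the control flow.
    fake_magic = lambda p: "ELF" if p.endswith(".so") else None
    fake_text = lambda p: "python script"
    print(classify("libfoo.so", fake_magic, fake_text))
    print(classify("setup.py", fake_magic, fake_text))
```

The point of the shape is that the expensive classifier only ever sees files the cheap one could not identify.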

ffesti · 2025-05-21T09:35:22Z

Magika

Magika is written in Rust and currently has bindings for Go, JavaScript and Python. It is one possible candidate for an AI-based file classifier. It is reasonably sized, with a model of only 3MB that can easily be run locally.

In theory Rust libraries can be used from C/C++ code, but Magika currently doesn't offer the necessary wrappers and headers. These are probably not too difficult to add, but using it as a library in rpmbuild requires that extra step.

There are multiple command line tools bundled with the bindings including a native Rust implementation. So we could just use them - at least as a first step.

Magika is also not yet packaged for any RPM based distribution. We would need to push for its inclusion before even being able to make it a dependency.

While adding a Rust based dependency to RPM is a bit of a hurdle - especially for bootstrapping - we already have another Rust based dependency with Sequoia.

Fallback to libmagic is ofc possible. Using it as a fallback for bootstrapping only may create compatibility issues between the early and later versions of some packages. This may not be a big deal for the few early packages, but it is a possible additional complication in an already complicated process.

Runtime Performance

Ran some tests on my local machine processing /usr/share. The number of files differs a bit depending on exactly how the file list is gathered.

time find /usr/share/ | wc
 323352  323470 17998480
 real   0m0.162s
 user   0m0.078s
 sys    0m0.117s

magika --no-dereference -r /usr/share | wc
 307223 1643876 27824734

magika -r /usr/share | wc
 423090 2425635 41077620

find -P /usr/share/ -type f | wc
 229770  229843 12871733

Magika does not support reading the filenames from stdin - only the content to classify. So we have to work around that with xargs.

time find /usr/share/ | file -f - > /dev/null
 real   1m9.576s
 user   1m7.727s
 sys    0m1.828s

time find /usr/share/ | xargs magika  > /dev/null
 real   7m58.372s
 user   55m39.400s
 sys    0m30.339s
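For reference, the batching that xargs does in the run above can be sketched in Python as well. This is only an illustration under assumptions: the helper names and the batch size of 1000 are made up, and the only thing taken from the thread is that the `magika` command accepts file paths as arguments. Passing each path as its own argument also avoids the word splitting that a plain `find | xargs` applies to names containing whitespace.

```python
import subprocess
from typing import Callable, Iterable, Iterator, List


def batches(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` paths."""
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def run_magika(paths: Iterable[str], size: int = 1000,
               run: Callable[..., object] = subprocess.run) -> int:
    """Invoke `magika` once per batch and return the number of invocations.

    `run` is injectable so the batching can be exercised without the tool.
    """
    calls = 0
    for batch in batches(paths, size):
        # Each path is a separate argv entry, so spaces in names are safe.
        run(["magika", *batch], check=True)
        calls += 1
    return calls
```

A larger batch size means fewer process startups, which matters if loading the model dominates short runs; that trade-off was not measured here.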

time magika -r /usr/share  > /dev/null
 real   11m28.996s
 user   80m44.985s
 sys    0m41.994s

time magika --no-dereference -r /usr/share  > /dev/null
 real   5m54.493s
 user   41m35.664s
 sys    0m20.302s

Regular files only

time find -P /usr/share/ -type f | file -f - > /dev/null
real    1m7.653s
user    1m5.738s
sys     0m1.826s

time find -P /usr/share/ -type f | xargs magika  > /dev/null
real    5m48.706s
user    40m34.093s
sys     0m22.222s

On my machine Magika is about 7 times slower than file/libmagic in real time. But it also uses about 7 cores instead of just one - so overall about 50 times the amount of CPU.

The CPU usage per file is still only about a hundredth of a second. So for a selected number of files this may be an acceptable price. Otoh we already optimized the call to libmagic away for files like C headers and sources, as even that was prohibitively expensive for packages like the kernel.
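The ratios above can be re-derived from the quoted `time` output. The sketch below uses the `find /usr/share/ | file -f -` and `find /usr/share/ | xargs magika` runs, and divides by the 323352 entries reported by `find /usr/share/ | wc`; the variable names are just for illustration.

```python
import re


def seconds(stamp: str) -> float:
    """Convert a bash `time` value such as '7m58.372s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Figures copied from the runs over all of /usr/share.
file_real, file_user, file_sys = map(
    seconds, ("1m9.576s", "1m7.727s", "0m1.828s"))
magika_real, magika_user, magika_sys = map(
    seconds, ("7m58.372s", "55m39.400s", "0m30.339s"))
entries = 323352  # lines reported by `find /usr/share/ | wc`

file_cpu = file_user + file_sys
magika_cpu = magika_user + magika_sys

wall_ratio = magika_real / file_real     # how much longer Magika takes
busy_cores = magika_cpu / magika_real    # average cores kept busy
cpu_ratio = magika_cpu / file_cpu        # total CPU relative to file(1)
cpu_per_entry = magika_cpu / entries     # CPU seconds per path

print(f"wall-clock ratio: {wall_ratio:.1f}x")
print(f"busy cores:       {busy_cores:.1f}")
print(f"CPU ratio:        {cpu_ratio:.1f}x")
print(f"CPU per entry:    {cpu_per_entry:.4f} s")
```

The same calculation on the regular-files-only runs gives a smaller wall-clock gap, so the "about 7 times" figure depends on which pair of runs is compared.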

The multi-threaded nature of Magika will also further mess with our (currently very insufficient) resource management during parallel builds. We should probably improve that before putting more load on it.

In a world where all hardware - including build systems - comes with an integrated AI accelerator, performance would probably not be an issue. But we are still far from that.

Text based files

Quick check on my system. Ofc this is not really representative of the distribution as a whole.

$ rpm -qa --qf "[%{filenames} %{fileclass} \n]" | wc
 608599 2865111 53582717
$ rpm -qa --qf "[%{filenames} %{fileclass} \n]" | grep -v text | wc
 434295 1659582 36954396
$ rpm -qa --qf "[%{filenames} %{fileclass} \n]" | grep text | wc
 174304 1205529 16628321

So here about 28% of the files are text. Restricting the AI classification to those will save a bit of time, but not offset the huge difference in performance.
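A rough back-of-the-envelope check of that conclusion, using the counts from the queries above. It rests on loud assumptions: libmagic still looks at every file, the AI classifier only sees the text share, per-file cost is uniform, and the "about 50 times the CPU" factor measured on /usr/share carries over to packaged files.

```python
total = 608599      # all files listed by rpm -qa
non_text = 434295   # lines left after `grep -v text`
text = 174304       # lines matched by `grep text`

# The two greps partition the full list.
assert text + non_text == total

text_share = text / total

# Assumption: an AI classifier costs ~50x the CPU of libmagic per file.
ai_cpu_factor = 50

# CPU cost relative to libmagic alone (= 1.0) when libmagic handles
# every file and the AI classifier re-checks only the text files.
relative_cost = 1 + text_share * ai_cpu_factor

print(f"text share:    {text_share:.1%}")
print(f"relative cost: {relative_cost:.1f}x libmagic alone")
```

Even with the classifier limited to text files, the estimate stays at well over ten times the CPU of libmagic alone.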

pmatilai added RFE handsfree Packaging automation and convenience labels May 19, 2025

pmatilai added this to RPM May 19, 2025

github-project-automation bot moved this to Backlog in RPM May 19, 2025
