RFE: opt-in AI based file classification #3760


Open
pmatilai opened this issue May 19, 2025 · 1 comment
Labels
handsfree Packaging automation and convenience RFE

Comments

@pmatilai
Member

pmatilai commented May 19, 2025

Many of the file formats we deal with, such as ELF, are quickly and cleanly detectable by their magic bytes. But then there are all the different text-based formats, such as scripting languages and documentation formats, where libmagic really has no chance of getting it right. So this might be an actually useful application for AI, the machine learning kind.

So basically we could have libmagic do what it does best, and add an option to flip MAGIC_NO_CHECK_TEXT and use an AI based classifier for that part only. There could be other similar cases, the text files are just the most obvious one.
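A minimal sketch of that split, to make the control flow concrete. Both classifiers are passed in as plain callables, so nothing here is the real libmagic or any real AI classifier API. The `UNDECIDED` labels are an assumption about what libmagic reports once its text checks are disabled, and `fake_magic` and `fake_ai` are hypothetical stand-ins.

```python
from typing import Callable

# Assumption: labels that mean "libmagic had no specific answer" once
# its text checks are switched off. The real strings may differ.
UNDECIDED = {"data", ""}


def classify(path: str,
             magic_classify: Callable[[str], str],
             ai_classify: Callable[[str], str]) -> str:
    """Ask libmagic first; consult the AI classifier only when undecided."""
    result = magic_classify(path)
    if result.strip() in UNDECIDED:
        return ai_classify(path)
    return result


# Hypothetical stand-ins for illustration only.
def fake_magic(path: str) -> str:
    return "ELF 64-bit LSB shared object" if path.endswith(".so") else "data"


def fake_ai(path: str) -> str:
    return "python"


print(classify("libfoo.so", fake_magic, fake_ai))
print(classify("setup.py", fake_magic, fake_ai))
```

With this shape the expensive classifier only ever sees the files that libmagic could not pin down.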

@pmatilai pmatilai added RFE handsfree Packaging automation and convenience labels May 19, 2025
@pmatilai pmatilai added this to RPM May 19, 2025
@github-project-automation github-project-automation bot moved this to Backlog in RPM May 19, 2025
@ffesti
Contributor

ffesti commented May 21, 2025

Magika

Magika is written in Rust and currently has bindings for Go, JavaScript and Python. It is one possible candidate for an AI based file classifier. It is of reasonable size, with a model of only 3MB that can easily be run locally.

In theory Rust libraries can be used from C/C++ code but Magika currently doesn't offer the necessary wrappers and headers. This is probably not too difficult to add but requires extra steps to use it as a library in rpmbuild.

There are multiple command line tools bundled with the bindings including a native Rust implementation. So we could just use them - at least as a first step.

Magika is also not yet packaged for any RPM based distribution. We would need to push for its inclusion before even being able to make this a dependency.

While adding a Rust based dependency to RPM is a bit of a hurdle - especially for bootstrapping - we already have another Rust based dependency with Sequoia.

Falling back to libmagic is ofc possible. Using it as a fallback for bootstrapping only may create compatibility issues between the early and later versions of some packages. This may not be a big deal for the few early packages, but it is a possible additional complication in an already complicated process.

Runtime Performance

Ran some tests on my local machine processing /usr/share. The number of files differs a bit depending on exactly how the file list is gathered.

time find /usr/share/ | wc
 323352  323470 17998480
 real   0m0.162s
 user   0m0.078s
 sys    0m0.117s

magika --no-dereference -r /usr/share | wc
 307223 1643876 27824734

magika -r /usr/share | wc
 423090 2425635 41077620

find -P /usr/share/ -type f | wc
 229770  229843 12871733

Magika does not support reading the filenames from stdin - only the content to classify. So we have to work around that with xargs.

time find /usr/share/ | file -f - > /dev/null
 real   1m9.576s
 user   1m7.727s
 sys    0m1.828s

time find /usr/share/ | xargs magika  > /dev/null
 real   7m58.372s
 user   55m39.400s
 sys    0m30.339s

time magika -r /usr/share  > /dev/null
 real   11m28.996s
 user   80m44.985s
 sys    0m41.994s

time magika --no-dereference -r /usr/share  > /dev/null
 real   5m54.493s
 user   41m35.664s
 sys    0m20.302s
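For reference, the xargs workaround can also be sketched in Python, which avoids splitting filenames on whitespace the way a plain `find | xargs` does. This is only a sketch: `batches` and `run_magika` are hypothetical helper names, the 100000 byte budget is an arbitrary assumption and not a measured limit, and the only thing taken from the runs above is that the `magika` command accepts several paths per invocation.

```python
import os
import subprocess
from typing import Iterable, Iterator, List


def batches(paths: Iterable[str], max_bytes: int = 100_000) -> Iterator[List[str]]:
    """Group paths so each command line stays under a rough size budget."""
    batch: List[str] = []
    size = 0
    for p in paths:
        cost = len(os.fsencode(p)) + 1  # +1 for the argument separator
        if batch and size + cost > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(p)
        size += cost
    if batch:
        yield batch


def run_magika(paths: Iterable[str]) -> None:
    """Invoke the magika CLI once per batch, like xargs would."""
    for batch in batches(paths):
        subprocess.run(["magika", *batch], check=False)


# Tiny demonstration of the batching alone, with a 5 byte budget.
print(list(batches(["a", "bb", "ccc"], max_bytes=5)))
```

Each batch still pays the start-up cost of a fresh process, so this does not change the timings in any fundamental way.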

Regular files only

time find -P /usr/share/ -type f | file -f - > /dev/null
real    1m7.653s
user    1m5.738s
sys     0m1.826s

time find -P /usr/share/ -type f | xargs magika  > /dev/null
real    5m48.706s
user    40m34.093s
sys     0m22.222s

On my machine Magika is about 7 times slower than file/libmagic wrt real time. But additionally it also uses about 7 cores instead of just one - so overall about 50 times the amount of CPU.

The CPU usage per file is still only about a hundredth of a second. So for a selected number of files this may be an acceptable price. Otoh we already optimized the call to libmagic away for files like C headers and sources as even that was prohibitively expensive for packages like the kernel.

The multi-threaded nature of Magika will also further mess with our (currently very insufficient) resource management during parallel builds. We should probably improve it before putting on more load.

In a world where all hardware - including build systems - comes with an integrated AI accelerator, performance would probably not be an issue. But we are still far from that.
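The rough factors quoted here can be re-derived from the `time` output of the `find /usr/share/ | file -f -` and `find /usr/share/ | xargs magika` runs. The sketch below only redoes that arithmetic; `seconds` is a hypothetical helper, and the path count is the line count of `find /usr/share/`, so it includes directories as well as files.

```python
import re


def seconds(t: str) -> float:
    """Convert the shell's 'XmY.YYYs' time format to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    if m is None:
        raise ValueError(f"unexpected time format: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# Figures copied from the two unfiltered runs.
file_real, file_user = seconds("1m9.576s"), seconds("1m7.727s")
magika_real, magika_user = seconds("7m58.372s"), seconds("55m39.400s")
n_paths = 323352  # lines printed by `find /usr/share/`

real_ratio = magika_real / file_real   # wall-clock slowdown
cores = magika_user / magika_real      # average number of busy cores
cpu_ratio = magika_user / file_user    # factor in user CPU time
per_path = magika_user / n_paths       # user CPU seconds per path

print(f"{real_ratio:.1f}x real, {cores:.1f} cores, "
      f"{cpu_ratio:.0f}x CPU, {per_path:.4f}s per path")
```

The regular-files-only runs give somewhat smaller factors, so these numbers are best read as orders of magnitude.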

Text based files

Quick check on my system. Ofc this is not really representative of the distribution as a whole.

$ rpm -qa --qf "[%{filenames} %{fileclass} \n]" | wc
 608599 2865111 53582717
$ rpm -qa --qf "[%{filenames} %{fileclass} \n]" | grep -v text | wc
 434295 1659582 36954396
$ rpm -qa --qf "[%{filenames} %{fileclass} \n]" | grep text | wc
 174304 1205529 16628321

So here we have about 29% text files (174304 of 608599). Restricting the AI classification to those will save a bit of time but not offset the huge difference in performance.
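A back-of-the-envelope model of what that restriction buys. The counts come from the rpm queries above; everything else is an assumption: that per-file cost is uniform, that libmagic still runs once on every file, and that the AI classifier costs about 50 times the CPU of libmagic per file (the rough /usr/share figure). `AI_COST_FACTOR` is a hypothetical name for that factor.

```python
total_files = 608599  # all entries reported by rpm -qa
text_files = 174304   # entries whose output line matches "text"

text_share = text_files / total_files

# Assumption: rough CPU factor of the AI classifier relative to libmagic.
AI_COST_FACTOR = 50

# libmagic runs on every file (cost 1); text files are additionally
# classified by the AI classifier.
relative_cost = 1 + text_share * AI_COST_FACTOR

print(f"text share: {text_share:.1%}, relative CPU cost: {relative_cost:.1f}x")
```

Under these assumptions classification would still cost on the order of 15 times the CPU of libmagic alone, compared to about 50 times when every file goes through the AI classifier.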
