RFE: opt-in AI based file classification #3760
Magika
Magika is written in Rust and currently has bindings for Go, JavaScript and Python. It is one possible candidate for an AI based file classifier. Its size is reasonable, with a model of only 3MB that can easily be run locally.

In theory Rust libraries can be used from C/C++ code, but Magika currently doesn't offer the necessary wrappers and headers. This is probably not too difficult to add but requires extra steps to use it as a library in rpmbuild. There are multiple command line tools bundled with the bindings, including a native Rust implementation. So we could just use them, at least as a first step.

Magika is also not yet packaged for any RPM based distribution. We would need to push for its inclusion before even being able to make this a dependency. While adding a Rust based dependency to RPM is a bit of a hurdle, especially for bootstrapping, we already have another Rust based dependency with Sequoia.

Fallback to libmagic is ofc possible. Using it as a fallback for bootstrapping only may create compatibility issues between the early and later versions of some packages. This may not be a big deal for the few early packages, but it is a possible additional complication in an already complicated process.

Runtime Performance
Ran some tests on my local machine processing
Magika does not support reading the filenames for
Regular files only
On my machine Magika is about 7 times slower than file/libmagic wrt real time. In addition it uses about 7 cores instead of just one, so overall it needs about 50 times the amount of CPU. The CPU usage per file is still only about a hundredth of a second, so for a selected number of files this may be an acceptable price. Otoh we already optimized away the call to libmagic for files like C headers and sources, as even that was prohibitively expensive for packages like the kernel.

The multi-threaded nature of Magika will also further mess with our (currently very insufficient) resource management during parallel builds. We should probably improve that before putting more load on it.

In a world where all hardware, including build systems, comes with an integrated AI accelerator, performance would probably not be an issue. But we are still far from that.

Text based files
Quick check on my system. Ofc this is not really representative of the distribution as a whole.
So here we have about 28% text files. Restricting the AI classification to those will save a bit of time but not offset the huge difference in performance.
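For reference, the rough numbers from this comment can be combined into a back-of-the-envelope estimate. This is only a sketch: the function names are made up for illustration, and it assumes the per-file cost is the same for text and non-text files, which the measurements above do not establish.

```python
# Back-of-the-envelope estimate using the rough figures quoted above.
# All inputs are approximate observations from one machine, not benchmarks.

WALL_TIME_FACTOR = 7   # Magika was ~7x slower in real time than file/libmagic
CORES_USED = 7         # Magika kept ~7 cores busy, libmagic just one
TEXT_FRACTION = 0.28   # ~28% of the sampled files were text


def cpu_cost_factor(wall_time_factor: float, cores_used: float) -> float:
    """CPU time of Magika relative to libmagic for the same set of files."""
    return wall_time_factor * cores_used


def hybrid_cost_factor(text_fraction: float, ai_factor: float) -> float:
    """Relative CPU cost when libmagic scans every file and the AI
    classifier additionally runs on the text fraction only.

    Assumption (not measured): per-file cost is uniform across file types.
    """
    return 1.0 + text_fraction * ai_factor


ai_factor = cpu_cost_factor(WALL_TIME_FACTOR, CORES_USED)
hybrid = hybrid_cost_factor(TEXT_FRACTION, ai_factor)
print(f"Magika on all files: ~{ai_factor:.0f}x the CPU time of libmagic")
print(f"Magika on text only: ~{hybrid:.1f}x the CPU time of libmagic")
```

Under these assumptions the text-only variant still costs roughly 15 times the CPU of libmagic alone (1 + 0.28 × 49 ≈ 14.7), compared to roughly 49 times when everything goes through Magika.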
Many of the file formats we deal with, such as ELF, are quickly and cleanly detectable by their magic bytes. But then there are all these different text-based formats, such as scripting languages and documentation formats, where libmagic really has no chance to get it right. So this might be an actually useful application for AI, the machine learning kind.
So basically we could have libmagic do what it does best, and add an option to flip MAGIC_NO_CHECK_TEXT and use an AI based classifier for that part only. There could be other similar cases, the text files are just the most obvious one.
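A minimal sketch of that split, to make the idea concrete. Everything here is hypothetical: the `Classifier` signature, the convention that a classifier returns `None` when it has no match, and the toy stand-ins are illustration only, not rpm, libmagic or Magika API. Real code would call libmagic with MAGIC_NO_CHECK_TEXT set and pass only the leftovers to the AI based classifier.

```python
from typing import Callable, Optional

# Hypothetical classifier signature: path in, label out (None = no match).
Classifier = Callable[[str], Optional[str]]


def classify_file(path: str, magic_classify: Classifier,
                  ai_classify: Optional[Classifier] = None) -> str:
    """Ask the magic-byte classifier first; consult the opt-in AI
    classifier only when magic bytes give no answer."""
    label = magic_classify(path)
    if label is not None:
        return label
    if ai_classify is not None:  # opt-in: may be left disabled
        ai_label = ai_classify(path)
        if ai_label is not None:
            return ai_label
    return "data"  # nothing recognised the file


# Toy stand-ins for demonstration only. They look at the file name,
# which a real classifier would not do.
def fake_magic(path: str) -> Optional[str]:
    return "ELF" if path.endswith(".so") else None


def fake_ai(path: str) -> Optional[str]:
    return "python" if path.endswith(".py") else None


print(classify_file("libfoo.so", fake_magic, fake_ai))  # ELF
print(classify_file("setup.py", fake_magic, fake_ai))   # python
print(classify_file("setup.py", fake_magic))            # data
```

Keeping the AI classifier an optional argument mirrors the opt-in nature of the request: with it left out, behaviour stays at whatever the magic-byte pass alone reports.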