TL;DR
Lately I have been analyzing many malware samples from different families written in C#, C++ and other languages that target the .NET framework (.NET assemblies).
This exposed a problem when correlating samples with hashing techniques: in a high percentage of cases the imphash was identical, even across different malware families, while fuzzy hashing techniques showed no similarity at all.
The cause is that .NET languages compile source code into Microsoft Intermediate Language (MSIL), so in many cases the only native import left is the _CorExeMain function from mscoree.dll, and the imphash always comes out the same.
I have solved this problem by using another hashing tool called TypeRef Hasher, developed by the folks at G Data CyberDefense. It offers an alternative to imphash that applies only to .NET samples.
Building on the CLI they have available on GitHub, I have developed a small tool that implements and complements it.
Problem
Over the last few months, while analyzing malware samples compiled from .NET languages, I kept running into the same imphash, which more than once made me think I was going crazy.
I was seeing correlations where there were none, simply because the imphash was always the same. So I started to investigate whether imphash really had some kind of bug that I could fix, since imphash is one of the indicators I use (from a long list) to correlate malware and to help me later in attribution.
The top part of the above image is an AsyncRAT sample which, as you can see, has the same imphash as the AgentTesla sample at the bottom (f34d5f2d4577ed6d9ceec516c1f5a744).
Recall how imphash works: the hash is computed from the imported DLLs, their functions, and the order in which they appear in the PE. That was what finally convinced me I was not crazy.
Both malware samples import the same DLL (mscoree.dll) and the same function (_CorExeMain), so there had to be a reason for this behavior.
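To make the order-sensitivity concrete, here is a minimal sketch of an imphash-style computation. It assumes the commonly described algorithm (lowercase "dll.function" pairs with the library extension dropped, joined with commas in import order, then MD5), as implemented for real in pefile's get_imphash. The function name imphash_sketch and the import lists are hypothetical, and the exact output should be treated as unverified against real tooling.

```python
import hashlib


def imphash_sketch(imports):
    """Approximate an imphash from an ordered list of (dll, function) pairs."""
    parts = []
    for dll, func in imports:
        dll = dll.lower()
        # Drop common library extensions so "mscoree.dll" becomes "mscoree".
        for ext in (".dll", ".ocx", ".sys"):
            if dll.endswith(ext):
                dll = dll[: -len(ext)]
                break
        parts.append(f"{dll}.{func.lower()}")
    # The order is kept on purpose: imphash is sensitive to import order.
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()


# Hypothetical import lists standing in for two unrelated .NET samples.
asyncrat_imports = [("mscoree.dll", "_CorExeMain")]
agenttesla_imports = [("mscoree.dll", "_CorExeMain")]

# Hypothetical native import lists: same functions, different order.
native_a = [("kernel32.dll", "CreateFileW"), ("kernel32.dll", "WriteFile")]
native_b = [("kernel32.dll", "WriteFile"), ("kernel32.dll", "CreateFileW")]

print(imphash_sketch(asyncrat_imports) == imphash_sketch(agenttesla_imports))  # True
print(imphash_sketch(native_a) == imphash_sketch(native_b))  # False
```

Two families with a single identical import cannot be told apart this way, while a native binary changes its hash just by reordering imports.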
Microsoft's documentation for the _CorExeMain function describes its role as follows:
- Initializes the CLR (Common Language Runtime)
- Locates the managed entry point in the CLR header of the executable
- Begins execution
To summarize, a PE compiled with .NET has its entry point at the _CorExeMain() function of mscoree.dll, which ensures that the CLR is loaded to execute the rest of the code.
The remaining DLLs are loaded on demand at runtime, which is why they are not listed in the IAT in the first place.
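A small sketch of how a triage script might flag this situation before trusting an imphash. The helper name imphash_is_uninformative is hypothetical, and the import list is assumed to have been extracted elsewhere (for example with a PE parser). The _CorDllMain entry is an addition not discussed in this post: it is the equivalent bootstrap import for managed DLLs.

```python
# CLR bootstrap imports, lowercased. _CorDllMain (managed DLLs) is an
# assumption added here; the post itself only discusses _CorExeMain.
CLR_BOOTSTRAP = {
    ("mscoree.dll", "_corexemain"),
    ("mscoree.dll", "_cordllmain"),
}


def imphash_is_uninformative(imports):
    """Return True when the import table holds nothing but the CLR bootstrap."""
    normalized = {(dll.lower(), func.lower()) for dll, func in imports}
    # An empty import table is not treated as a .NET bootstrap case.
    return bool(normalized) and normalized <= CLR_BOOTSTRAP


print(imphash_is_uninformative([("mscoree.dll", "_CorExeMain")]))  # True
print(imphash_is_uninformative([("kernel32.dll", "WriteFile")]))   # False
```

When the check returns True, the imphash says nothing about the family and another feature should be used for correlation.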
Solution
The folks at G Data CyberDefense ran into the same problem I recently did and came up with a very interesting way to solve it.
They have created a hash called TypeRefHash that is based on the TypeRef table of .NET PEs.
This table stores references to imported types and their namespaces, and plays a role very similar to that of imported DLLs and their functions. For example, where a native PE imports Kernel32.dll to use the WriteFile function, a .NET PE can use the FileStream class of the System.IO namespace.
In the previous image you can see part of the TypeRef table of the same AsyncRAT sample above, showing as an example how it references the FileStream class of the System.IO namespace.
The whole table is then used to build a hash (TypeRefHash) from the type names and namespaces imported by the PE, much as imphash does with DLLs.
The hash is a SHA256 computed in the three steps described in their post:
- Sort the entries by TypeNamespace and then by TypeName. This is very important, because a new variant of a malware family that changes the order of its imports will still produce the same hash. Entries with an empty TypeNamespace sort first.
- Concatenate the TypeNamespace and TypeName of each entry with a dash. If the TypeNamespace is empty, the concatenated string begins with the dash.
- Join all the resulting strings with commas and compute the SHA256 of the resulting UTF-8 byte string.
An example could be the following, which I wanted to illustrate in an image as a summary.
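The three steps can also be sketched in Python. This is not G Data's implementation: the function names and the sample rows are hypothetical, the TypeRef rows are assumed to have been extracted already as (namespace, name) pairs, and the sort uses Python's plain code-point ordering, which may differ from the reference tool for mixed-case or non-ASCII names.

```python
import hashlib


def canonical_string(typerefs):
    """Build the string that gets hashed from (namespace, name) pairs."""
    # Step 1: sort by namespace, then by name; "" sorts before any namespace.
    ordered = sorted(typerefs, key=lambda row: (row[0], row[1]))
    # Step 2: join namespace and name with a dash ("-Name" if no namespace).
    entries = [f"{namespace}-{name}" for namespace, name in ordered]
    # Step 3 (first half): join all entries with commas.
    return ",".join(entries)


def typeref_hash(typerefs):
    """Step 3 (second half): SHA256 of the UTF-8 encoded canonical string."""
    return hashlib.sha256(canonical_string(typerefs).encode("utf-8")).hexdigest()


# Hypothetical TypeRef rows, deliberately out of order.
rows = [
    ("System.IO", "FileStream"),
    ("System", "Object"),
    ("", "SomeType"),
]

print(canonical_string(rows))
# -SomeType,System-Object,System.IO-FileStream

# Reordering the rows does not change the hash.
print(typeref_hash(rows) == typeref_hash(list(reversed(rows))))  # True
```

Because the rows are sorted before hashing, a variant that only shuffles its type references keeps the same TypeRefHash, unlike imphash.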
Neossins development
Based on the G Data CyberDefense tool, I have developed a small application called Neossins, currently in its 0.1 Alpha version, which I will keep improving and extending with new features over time.
The first objective of Neossins is to have a structured and comprehensive dataset of TypeRef Hashes for the main .NET malware families. So far it covers three:
- AsyncRAT
- Blackshades
- AgentTesla
Once the information is structured and stored in a database, the next step is to compare malware samples received from different sources against the dataset to see whether any TypeRef Hash matches and, if so, to correlate them with the other samples previously stored under the same hash.
Ideally, new matching samples are then stored in the dataset as well, since the goal is to grow it until it covers as many TypeRef Hashes as possible for each malware family.
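A minimal in-memory sketch of that workflow, assuming a simple mapping from TypeRef Hash to family to sample identifiers. The class TypeRefDataset, its methods, and the placeholder hashes are hypothetical and do not reflect how Neossins actually stores its data.

```python
from collections import defaultdict


class TypeRefDataset:
    """In-memory stand-in for the database: hash -> family -> sample ids."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, trh, family, sample_id):
        self._index[trh][family].add(sample_id)

    def lookup(self, trh):
        """Return {family: sorted sample ids} for a hash, or {} if unknown."""
        if trh not in self._index:
            return {}
        return {fam: sorted(ids) for fam, ids in self._index[trh].items()}

    def correlate(self, trh, sample_id):
        """Look up a new sample and store it when exactly one family matches."""
        matches = self.lookup(trh)
        if len(matches) == 1:
            family = next(iter(matches))
            self.add(trh, family, sample_id)
        return matches


dataset = TypeRefDataset()
dataset.add("trh-placeholder-1", "AsyncRAT", "stored-sample-1")

print(dataset.correlate("trh-placeholder-1", "incoming-sample"))
# {'AsyncRAT': ['stored-sample-1']}
print(dataset.correlate("trh-unknown", "other-sample"))
# {}
```

Storing a sample only when a single family matches is a design choice of this sketch: an ambiguous hash is left for manual review rather than growing the dataset with a guess.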
The tool is written in Python 3 and has some dependencies, such as Cytoscape for generating the graph, which is displayed in a web application served by Dash.
In a series of tests with downloaded malware samples related to AsyncRAT and AgentTesla, the result was the following.
The red rectangular nodes are all the samples that I placed in the input directory (the starting point), which must be set in the application’s configuration file.
Starting from the left side (blue zone), you can see that there are five nodes that all share the same TypeRef Hash, which could mean that they are variants belonging to the same family.
In the central part (red zone), there are three red rectangular nodes, the samples selected for comparison with the AsyncRAT family; one of them is the sample shown at the beginning of this post. Two of the three have a large number of matches already stored in the dataset, represented by pink circular nodes. Every relationship between nodes means that they share the same TypeRef Hash.
Finally, on the right side of the screen there are seven orphan nodes with no match in the dataset. These can be examined with techniques other than TypeRef Hash to check whether they belong to the families mentioned and, if so, it would make sense to store them in the dataset.
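The layout described above can be approximated with a short sketch that groups samples by TypeRef Hash and emits node and edge dictionaries in the shape Cytoscape-style components generally expect ("data" with "id" for nodes, "source" and "target" for edges). The function build_elements, the class names "new" and "stored", and the sample identifiers are hypothetical, not Neossins' actual code.

```python
from collections import defaultdict


def build_elements(new_samples, dataset):
    """new_samples: {sample_id: hash}; dataset: {hash: [stored sample ids]}."""
    groups = defaultdict(list)
    for sample_id, trh in new_samples.items():
        groups[trh].append(sample_id)

    elements, orphans = [], []
    for trh, members in groups.items():
        stored = list(dataset.get(trh, []))
        for sample_id in members:
            elements.append({"data": {"id": sample_id, "label": sample_id},
                             "classes": "new"})
        for stored_id in stored:
            elements.append({"data": {"id": stored_id, "label": stored_id},
                             "classes": "stored"})
        # Link every other node of the group to the first new sample (a star),
        # which avoids drawing an edge between every pair.
        hub = members[0]
        for other in members[1:] + stored:
            elements.append({"data": {"source": hub, "target": other, "trh": trh}})
        # A lone new sample with no stored match is an orphan.
        if len(members) == 1 and not stored:
            orphans.append(hub)
    return elements, orphans


new_samples = {"a": "h1", "b": "h1", "c": "h2", "d": "h3"}
dataset = {"h2": ["x", "y"]}
elements, orphans = build_elements(new_samples, dataset)
print(orphans)  # ['d']
```

Here "a" and "b" form a cluster of new samples sharing a hash, "c" connects to its stored matches, and "d" is left as an orphan to be studied by other means.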
Conclusion
The TypeRef Hash solution is one more feature that can be used to identify malware variants of known families.
To reiterate, it is not a definitive or exclusive solution for this task; it can perfectly well be used in parallel with other techniques such as ssdeep, VT vhash, imphash for non-.NET files, etc.
Imphash has no value for .NET samples whose only import is the _CorExeMain function from mscoree.dll, since in most cases it produces false-positive relationships between malware samples.
Little by little I will keep updating Neossins until it reaches a stable version, adding the new features I have in mind.
Twitter: @Joseliyo_Jstnk
Github: https://github.com/jstnk9/neossins
LinkedIn: https://linkedin.com/in/joseluissm/
References
https://github.com/GDATASoftwareAG/TypeRefHasher
https://www.gdatasoftware.com/blog/2020/06/36164-introducing-the-typerefhash-trh
https://gist.github.com/secana/42974cf3536590def7bf09582efca7a8