****************************************************************************************************
Introduction to the MicrosoftDmang project:

This code base is both an artifact and an outcome of the Microsoft Demangler effort. It is an outcome in that we drove ourselves to create our own Microsoft Demangler not constrained by others' intellectual property. However, part of coming up with our own demangler is first to understand Microsoft objects, the mangling process, and the demangling process. We are trying to reflect this understanding in our end product, but we need a tool or framework to tease out this understanding by:
1) Performing hand-tweaked fuzzing of the input to undname,
2) Performing forward programming and checking/analyzing what is produced in terms of code and symbols,
3) Utilizing the limited information that we can find--we started with "Visual C++ name mangling," which is now on wikibooks (moved from wikipedia).

This code base serves as a reflection of our continual understanding of mangled symbols and how to process them as we adapt our understanding over time. This code base is not necessarily "the truth" in terms of a representation, but it is our current codification of "our truth," which can still morph over time.

****************************************************************************************************
Precepts and Goals:

Some precepts and goals of this project follow:
1) Create a Microsoft symbol demangler not tied to others' intellectual property.
2) Try to make this code separable from the rest of the code base, so that it could be a stand-alone deliverable. Currently, some of the "test" software is tied to the rest of the code base: the tests extend "genericTestCase" and use the public variable BATCH_MODE.
3) Knowing that other utilities (e.g., undname) are far from perfect in demangling symbols, we created a demangler that can present differing processing/output rules.
This allows us to create a better set of processing/output rules than currently exist while still creating a set of rules that mimic existing utilities, which allows us to process bulk sets of symbols into bulk sets of test data (mangled/demangled pairs). Because of this, we have more confidence in our improved demangler: we can see how well the rule set that mimics existing utilities reproduces their results for many of the 2.7 million Windows 7 symbols and 6.8 million Windows 10 symbols in our bulk test sets.
4) In order to mimic the Microsoft rules, we often have to adhere to odd white space rules, which include cases where there are no spaces after commas and cases where there are dangling white spaces. We put much effort into mimicking these odd spacing rules, both to allow for the bulk testing AND because it sheds light on what we believe could be the internal software architecture.

****************************************************************************************************
MicrosoftDmang Development and Testing Overview:

The software architecture of this project is continually in flux, with some classes being better defined and "cleaner" than others. Because we have no real software requirements but are test driven, there have been many grand-scale software changes that rely upon a nimble environment. For example, I could make a change that causes 50 more tests to pass but 200 others to fail, yet it might have been the correct change, one that requires 13 other changes, with numerous tests moving back and forth between pass and fail states until I finally settle at a state where all previously passing tests pass again and one additional failing test now passes. This is not an unrealistic description. We have come a very long way, which has allowed us to focus more now on cleaning up the code, but there are some areas where the code looks like spaghetti.
This is primarily in the section of "modified" types. Tests have also been continually added, either to provide the data from a new fuzzing run or to create additional bounds on a new test case.

Individual tests are found in MDMangBaseTest. There are also tests in MDMangListTest, which has various mechanisms for pulling test data from a file. One of these has mangled/demangled pairs; others might just have mangled symbols only, where we are looking for cases in which the demangler could "blow up." These file tests often provide a data record for creating a new individual test. There is also MDMangBaseTest, which uses MDMangBaseTestSuite as the junit-4 testSuite, but which also uses the runWith(Categories) junit runner to keep tests from running if they have been given the correct MDMangFailingTests annotation in the MDMangBaseTest file. The MDMangBaseTestSuite, which excludes failing tests, is geared toward being the nightly test to be run, as no errors are expected to be seen.

****************************************************************************************************
MicrosoftDmang Architecture:

MDMang is the basic interface and driver of the demangler action (perhaps these should be separated). It takes a symbol and returns an MDParsableItem (*in most cases), from which we can retrieve a demangled string or of which we can ask questions. There are additional demanglers that derive from MDMang, which produce the results of other processing/output rules. These include MDMangVS2015 and MDMangVS2013.
*The Ghidra-specific demangler does not directly return an MDParsableItem (it can be requested post-processing), but instead returns an output specific to the needs of Ghidra.
Also within MDMang lie various public methods intended for use by the driver side of the project (again, another reason to break this class into pieces).

MDException: The exception class for internal exception handling.
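
As a rough illustration of the idea above that demanglers derived from MDMang produce the results of other processing/output rules, the following sketch shows one parsed item being emitted under two rule sets that differ only in comma spacing. Every name in it (OutputRulesSketch, ParsedTemplate, BaseRules, MimicRules, separator, emit) is hypothetical and is not part of this project's API; the "parsed item" is built by hand rather than parsed from a mangled symbol.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from the project. */
final class OutputRulesSketch {

    /** Stand-in for a parsed item: a template name plus its argument type names. */
    static final class ParsedTemplate {
        final String name;
        final List<String> args;

        ParsedTemplate(String name, List<String> args) {
            this.name = name;
            this.args = args;
        }
    }

    /** Base rule set: conventional spacing after commas. */
    static class BaseRules {
        String separator() {
            return ", ";
        }

        /** Emits the item as text using this rule set's separator. */
        final String emit(ParsedTemplate item) {
            return item.name + "<" + String.join(separator(), item.args) + ">";
        }
    }

    /** Derived rule set that mimics a tool that omits the space after a comma. */
    static class MimicRules extends BaseRules {
        @Override
        String separator() {
            return ",";
        }
    }

    public static void main(String[] args) {
        ParsedTemplate item = new ParsedTemplate("Pair", Arrays.asList("int", "char"));
        System.out.println(new BaseRules().emit(item));  // Pair<int, char>
        System.out.println(new MimicRules().emit(item)); // Pair<int,char>
    }
}
```

Keeping the parse result shared and overriding only the output rule is one way to compare a "better" rule set against a mimicking rule set on the same bulk data; the real classes may divide this work differently.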

MDContext: A class containing a single context that is pushed or popped to/from a context stack in MDMang. A context contains "backreferences" (as we currently understand backreferences and a context of them; this is simplified from more complicated contexts, and we are trying to whittle it away toward non-existence). There are backrefNames, backrefParameters, and backrefTemplateParameters. A context is created from a previous context using particular rules that are dictated by an enumerated MDMcontext annotation. These, too, might go away, but we have boiled them down to what currently exists. In the future, the MDContext class might go away and backreferences could be part of the class for which the context has been created, but we started with this current model while we were trying to understand when there was a context change and what required the change. In fact, there are still questions that arise in my mind, yet I have not created tests that might tease the answers out. I do know, however, that one or more tests in the MDMangBaseTest class helped define what we have; I no longer have a record of which tests were solely responsible for revealing some of the special context/backreference cases (e.g., it could have been that a backreference to an internal template argument got used in a certain way).

MDFuzzyFit: Not currently used (only in an @Ignore case in one of the tests at this time). The current goal is to potentially make this into an MDMang extension and then create a utility that exposes the functionality.

MDParsableItem: This is the base class for any internal object that has a mangled/demangled representation. All parsable items derive from MDParsableItem.

MDObject: This class represents a fully presentable symbol as would be expected to be found in a list of symbols for a binary. It contains a name and an MDTypeInfo. An MDObject could probably be an abstract class (it is not one at the moment).
The MDObjectCPP (below) is the primary derived object of interest for us. Other than MDObjectC, the others (MDObjectBracket and MDObjectCodeView) may or may not be true representations of the MSFT architecture. (The "object" itself might not be representative of the MSFT architecture, but it is what works for me at the moment.) An MDObject is composed of a name (either MDFragmentName or MDQualifiedName) and an MDTypeInfo, which can be a derived class.

MDFragmentName: This class represents a single string part of a name; it is nearly as simple as a C-language name.

MDQualifiedName: This represents a complicated C++-style name that has an MDBasicName and an MDQualification.

MDBasicName: This can be a template name with arguments (MDTemplateNameAndArguments); the name of an embedded object; a simple, reusable name fragment; or a special operator name.

MDQualification: This represents the scope of a name or other construct. A qualification is an ordered list of qualifiers (MDQual, an internal class of MDQualification), which can be further parsed from other complicated constructs.

MDTypeInfo: This represents something about the type of the object (the MDObject). Recently, we created derived types from MDTypeInfo to represent a "Variable" type versus a "Function" type versus one of many other C++ types, such as virtual function calls and virtual function tables. In most cases MDTypeInfo contains an MDType, which is the base type of all "types," whether data types or function types. I'm not necessarily happy with the separate constructs of MDTypeInfo and MDType, but they made the code much easier to work with in terms of getting the correct parsing and output order in place. While they are seemingly at opposite ends of the details, there's a chance that they are one and the same, and this will take more study.

MDType: This is the base type of all "types" (see documentation for MDTypeInfo), whether data types or function types.
There is currently a large set of commented-out code in MDType, which might eventually get deleted, but I'm still trying to find the commonality of types, trying to push it into as low a base class as possible, and also to see where MDType and MDTypeInfo overlap.

MDDataType: This is the "data" type derived from MDType. There are many leaf-level derived types of MDDataType, such as "int," but there are also a good number of intermediate type classes derived from MDDataType. Currently, as for MDType, there is a large set of commented-out code in MDDataType, which is being worked as a possible way of consolidating information from higher classes into lower base classes.

MDFunctionType: This is the "function" type derived from MDType. There are instances of MDFunctionType as well as derived classes.

There are many more details and derived types not specified here, and there are a host of other miscellaneous MDParsableItem-derived classes that include: MDEncodedNumber, MDSignedEncodedNumber, and MDString.

There are a number of parsers that parse parts of a mangled string and create various MDParsableItems (those documented above, as well as many others). These parsers tend to be large switch statements. At times, some cases of the switch make calls out to other methods that further refine the parsing.

****************************************************************************************************
****************************************************************************************************
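
To make the "switch statement that calls out to other methods" parser shape described above concrete, here is a minimal sketch that decodes a number in the encoding commonly described in public notes on Visual C++ name mangling: a single digit '0'..'9' stands for 1..10, otherwise a run of letters 'A'..'P' gives hexadecimal digits terminated by '@', and a leading '?' negates the value. This is an assumption-laden illustration, not this project's MDEncodedNumber or MDSignedEncodedNumber code; the class and method names (EncodedNumberSketch, decode, decodeHexRun) are hypothetical.

```java
/** Hypothetical sketch of a switch-based parser; not the project's implementation. */
final class EncodedNumberSketch {

    /** Decodes one encoded number that starts at the beginning of the text. */
    static long decode(String text) {
        int pos = 0;
        boolean negative = false;
        if (pos < text.length() && text.charAt(pos) == '?') {
            negative = true; // leading '?' marks a negative value
            pos++;
        }
        if (pos >= text.length()) {
            throw new IllegalArgumentException("missing number: " + text);
        }
        char first = text.charAt(pos);
        long value;
        switch (first) {
            case '0': case '1': case '2': case '3': case '4':
            case '5': case '6': case '7': case '8': case '9':
                // A single digit encodes the values 1 through 10.
                value = first - '0' + 1;
                break;
            default:
                // Anything else is handed to a helper that refines the parse.
                value = decodeHexRun(text, pos);
                break;
        }
        return negative ? -value : value;
    }

    /** Reads 'A'..'P' as hex digits 0..15 until the '@' terminator. */
    private static long decodeHexRun(String text, int pos) {
        long value = 0;
        for (int i = pos; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '@') {
                return value;
            }
            if (c < 'A' || c > 'P') {
                throw new IllegalArgumentException("bad digit '" + c + "' in: " + text);
            }
            value = (value << 4) + (c - 'A');
        }
        throw new IllegalArgumentException("unterminated number: " + text);
    }

    public static void main(String[] args) {
        System.out.println(decode("9"));   // 10
        System.out.println(decode("BA@")); // 16
        System.out.println(decode("?0"));  // -1
    }
}
```

A real parser would also track how many characters were consumed so that the caller can continue with the rest of the mangled string; that bookkeeping is omitted here to keep the sketch short.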