Fakultät Medien (M) (ab 22.04.2021)
Refine
Document Type
- Conference Proceeding (3)
- Master's Thesis (1)
- Article (unreviewed) (1)
- Working Paper (1)
Conference Type
- Konferenzartikel (3)
Language
- English (6)
Keywords
- Binary Executable (1)
- Cryptography (1)
- Deep Learning (1)
- Deep learning (1)
- Deepfake (1)
- Feature extraction (1)
- Machine Learning (1)
- Real-time (1)
- Vulnerability identification (1)
Institute
Open Access
- Closed (2)
- Closed Access (2)
- Open Access (2)
- Diamond (1)
The identification of vulnerabilities is an important element in the software development life cycle to ensure the security of software. While vulnerability identification based on the source code is a well studied field, the identification of vulnerabilities on basis of a binary executable without the corresponding source code is more challenging. Recent research [1] has shown how such detection can generally be enabled by deep learning methods, but appears to be very limited regarding the overall amount of detected vulnerabilities. We analyse to what extent we could cover the identification of a larger variety of vulnerabilities. Therefore, a supervised deep learning approach using recurrent neural networks for the application of vulnerability detection based on binary executables is used. The underlying basis is a dataset with 50,651 samples of vulnerable code in the form of a standardised LLVM Intermediate Representation. Te vectorised features of a Word2Vec model are used to train different variations of three basic architectures of recurrent neural networks (GRU, LSTM, SRNN). A binary classification was established for detecting the presence of an arbitrary vulnerability, and a multi-class model was trained for the identification of the exact vulnerability, which achieved an out-of-sample accuracy of 88% and 77%, respectively. Differences in the detection of different vulnerabilities were also observed, with non-vulnerable samples being detected with a particularly high precision of over 98%. Thus, our proposed technical approach and methodology enables an accurate detection of 23 (compared to 4 [1]) vulnerabilities.
The identification of vulnerabilities is an important element in the software development life cycle to ensure the security of software. While vulnerability identification based on the source code is a well studied field, the identification of vulnerabilities on basis of a binary executable without the corresponding source code is more challenging. Recent research has shown, how such detection can be achieved by deep learning methods. However, that particular approach is limited to the identification of only 4 types of vulnerabilities. Subsequently, we analyze to what extent we could cover the identification of a larger variety of vulnerabilities. Therefore, a supervised deep learning approach using recurrent neural networks for the application of vulnerability detection based on binary executables is used. The underlying basis is a dataset with 50,651 samples of vulnerable code in the form of a standardized LLVM Intermediate Representation. The vectorised features of a Word2Vec model are used to train different variations of three basic architectures of recurrent neural networks (GRU, LSTM, SRNN). A binary classification was established for detecting the presence of an arbitrary vulnerability, and a multi-class model was trained for the identification of the exact vulnerability, which achieved an out-of-sample accuracy of 88% and 77%, respectively. Differences in the detection of different vulnerabilities were also observed, with non-vulnerable samples being detected with a particularly high precision of over 98%. Thus, the methodology presented allows an accurate detection of 23 (compared to 4) vulnerabilities.
Synthesizing voice with the help of machine learning techniques has made rapid progress over the last years. Given the current increase in using conferencing tools for online teaching, we question just how easy (i.e. needed data, hardware, skill set) it would be to create a convincing voice fake. We analyse how much training data a participant (e.g. a student) would actually need to fake another participants voice (e.g. a professor). We provide an analysis of the existing state of the art in creating voice deep fakes and align the identified as well as our own optimization techniques in the context of two different voice data sets. A user study with more than 100 participants shows how difficult it is to identify real and fake voice (on avg. only 37% can recognize a professor’s fake voice). From a longer-term societal perspective such voice deep fakes may lead to a disbelief by default.
Synthesizing voice with the help of machine learning techniques has made rapid progress over the last years [1]. Given the current increase in using conferencing tools for online teaching, we question just how easy (i.e. needed data, hardware, skill set) it would be to create a convincing voice fake. We analyse how much training data a participant (e.g. a student) would actually need to fake another participants voice (e.g. a professor). We provide an analysis of the existing state of the art in creating voice deep fakes and align the identified as well as our own optimization techniques in the context of two different voice data sets. A user study with more than 100 participants shows how difficult it is to identify real and fake voice (on avg. only 37 percent can recognize a professor’s fake voice). From a longer-term societal perspective such voice deep fakes may lead to a disbelief by default.
The identification of vulnerabilities is an important element of the software development process to ensure the security of software. Vulnerability identification based on the source code is a well studied field. To find vulnerabilities on the basis of a binary executable without the corresponding source code is more challenging. Recent research has shown how such detection can be performed statically and thus runtime efficiently by using deep learning methods for certain types of vulnerabilities.
This thesis aims to examine to what extent this identification can be applied sufficiently for a variety of vulnerabilities. Therefore, a supervised deep learning approach using recurrent neural networks for the application of vulnerability detection based on binary executables is used. For this purpose, a dataset with 50,651 samples of 23 different vulnerabilities in the form of a standardised LLVM Intermediate Representation was prepared. The vectorised features of a Word2Vec model were then used to train different variations of three basic architectures of recurrent neural networks (GRU, LSTM, SRNN). For this purpose, a binary classification was trained for the presence of an arbitrary vulnerability, and a multi-class model was trained for the identification of the exact vulnerability, which achieved an out-of-sample accuracy of 88% and 77%, respectively. Differences in the detection of different vulnerabilities were also observed, with non-vulnerable samples being detected with a particularly high precision of over 98%. Thus, the methodology presented allows an accurate detection of vulnerabilities, as well as a strong limitation of the analysis scope for further analysis steps.
In the field of network security, the detection of possible intrusions is an important task to prevent and analyse attacks. Machine learning has been adopted as a particular supporting technique over the last years. However, the majority of related published work uses post mortem log files and fails to address the required real-time capabilities of network data feature extraction and machine learning based analysis [1-5]. We introduce the network feature extractor library FEX, which is designed to allow real-time feature extraction of network data. This library incorporates 83 statistical features based on reassembled data flows. The introduced Cython implementation allows processing individual packets within 4.58 microseconds. Based on the features extracted by FEX, existing intrusion detection machine learning models were examined with respect to their real-time capabilities. An identified Decision-Tree Classifier model was thus further optimised by transpiling it into C Code. This reduced the prediction time of a single sample to 3.96 microseconds on average. Based on the feature extractor and the improved machine learning model an IDS system was implemented which supports a data throughput between 63.7 Mbit/s and 2.5 Gbit/s making it a suitable candidate for a real-time, machine-learning based IDS.