Anomaly Detection of Domain Name System (DNS) Query Traffic at Top Level Domain Servers

Major network events can be reflected in domain name system (DNS) traffic at the top level servers of the DNS hierarchical structure. This paper pursues a novel approach, based on covariance analysis, to detect the DNS traffic anomaly of the 5.19 event in China at the CN top level domain servers. We normalize, expand and average the covariance changes for different lengths of time slices to enhance the robustness of detection. Attribute anomalies are detected based on clustering analysis of the covariance change anomaly. To improve the accuracy and reduce the complexity of the k-means algorithm, an initial cluster selection technique is proposed and its performance is analyzed. Instant anomaly and time period anomaly are defined and an efficient real time approximating algorithm is derived. We use an incremental computational method for the covariance matrix. The computation and transmission scheme of attribute values is analyzed and the process of the detection algorithm is given. The traffic detection results of the 5.19 event show that the approach can accurately detect the network anomaly.


INTRODUCTION
The domain name system (DNS) is one of the most fundamental infrastructures of today's Internet (Albitz and Liu, 1998; Mockapetris, 1987). It is the most commonly used system for translating human-readable domain names into internet protocol (IP) addresses, and thus it is relied on by many internet applications such as web browsers and email.
With the rapid increase of network traffic diversity and network topology complexity, it is ever more difficult to monitor internet traffic as a whole. Since DNS is heavily used by an overwhelmingly large number of internet applications, DNS query traffic has the potential to reveal the characteristics of the whole internet traffic. Moreover, monitoring and analysis of DNS query traffic is more viable in comparison with the use of other measurements or data sources; for example, the query information can be recorded in the log file of DNS software and is thus easy to capture.

*Corresponding author. E-mail: zhengwang09@126.com.

The network anomalies observed in DNS traffic have many sources. DNS servers themselves are frequently targeted by network attacks and malicious use. The last decade has witnessed many reports of distributed denial-of-service (DDoS) attacks or cache poisoning attacks launched against DNS servers (Klein, 2010). The DNS message, based on the user datagram protocol (UDP), is also subject to source address forging attacks. In this way, DNS servers can be leveraged by hackers to mount distributed reflection DDoS attacks (http://www.us-cert.gov/readingroom/DNS-recursion033006.pdf, current December 2010).
In recent years, there have been some efforts to detect network anomalies by monitoring DNS traffic. For example, Jung and Emil (2004) proposed an approach to perceive simple mail transfer protocol (SMTP) client misbehavior using DNS traffic. Whyte et al. (2005a, b) showed the effectiveness of analyzing DNS traffic for anomaly detection.
They detected the dissemination of network worms through covariance analysis of DNS traffic and the traffic of other protocols in the enterprise network. Musashi et al. (2009) proposed an anomaly detection approach relying only on the incoming DNS traffic of DNS servers. Ishibashi et al. (2005) monitored the heavily loaded mail servers among an internet service provider's (ISP) DNS servers and detected anomalies by traffic analysis. Other works include statistical studies of domain name distribution aimed at finding worms or hijacked domains controlled by botnets (Weimer et al., 2005; Kristoff, 2005; Schonewille and Helmond, 2006; Bojan et al., 2007; Wang et al., 2006). These approaches are mostly tailored to feature analysis and anomaly detection of specific host types, attacks or worms, and the affected network scope is limited to DNS recursive servers and local networks (Yasuo et al., 2006; Nikolaos, 2007). But for the internet as a whole, few measurement studies or approaches for anomaly detection of DNS traffic have been presented.
Thanks to the hierarchical structure of DNS, top level DNS servers are more likely to be the entry point of the DNS query behavior of clients, and thus are more appropriate as indicators of network anomalies. For a macroscopic view of the internet, few measurements are available for anomaly detection and there are hardly any techniques for real time detection. In this paper, we introduce an anomaly detection scheme that utilizes covariance analysis. Through its application to the DNS query traffic of the 5.19 event in China, we show that the scheme can effectively and accurately detect network anomalies.

Covariance analysis
The covariance of attributes can be used to describe the characteristics of an information system (Shuyuan and Daniel, 2004). If we select some known parameters as the attributes, they can help to identify the pattern of domain name system (DNS) traffic, because the covariance of the attributes provides us with additional information. We expect that this kind of covariance is sensitive to traffic anomalies and is therefore useful for detecting changes of the traffic pattern. As for the covariance, we assume that the anomaly pattern is distinguishable from the normal pattern. In this sense, the covariance can be taken as an indicator of status change. In other words, any anomalous behavior should change the covariance coefficients obtained from the normal status, so a traffic pattern change can be detected by identifying the covariance change. In the theory of statistics, the effectiveness of this detection approach is obvious, and its efficiency depends merely on an appropriate data volume within the limited time window.

Wang and Tseng 3859
We assume n attributes f_1, ..., f_n. Let x_{i,l} be the value of the ith attribute f_i observed in the lth time slice T_l, and collect the observations of one time slice into the vector y_l = (x_{1,l}, ..., x_{n,l})'. The covariance of two attributes is

\[ \mathrm{Cov}(f_i, f_j) = E\big[(f_i - E[f_i])(f_j - E[f_j])\big] \tag{1} \]

and the covariance matrix M of y is

\[ M_y = \big[\mathrm{Cov}(f_i, f_j)\big]_{n \times n} \tag{2} \]

We define the distance between the covariance matrix M_{y,l} of the lth time slice and a reference matrix \bar{M}_y learned from normal traffic as

\[ z_l = d\big(M_{y,l}, \bar{M}_y\big) = \sum_{i=1}^{n}\sum_{j=1}^{n} \big| M_{y,l}(i,j) - \bar{M}_y(i,j) \big| \tag{3} \]

where z_l represents the covariance change, or anomaly. To simplify, the distance function between the two matrices is taken as the sum of the absolute element-wise differences.
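As a minimal sketch of Equations 1 to 3, the per-slice covariance matrix and the matrix distance z_l can be computed as follows (illustrative Python/NumPy; the element-wise absolute-difference distance is one plausible choice of matrix distance, and all function names are ours):

```python
import numpy as np

def slice_covariance(x):
    """Covariance matrix M_y of one time slice.

    x: array of shape (samples, n_attributes); each row holds the
    per-attribute query counts at one sampling instant.
    """
    return np.cov(x, rowvar=False, bias=True)

def covariance_distance(m, m_ref):
    """Distance z_l between a slice's covariance matrix and the
    reference matrix learned from normal traffic, taken here as the
    sum of absolute element-wise differences."""
    return float(np.abs(np.asarray(m) - np.asarray(m_ref)).sum())
```

Comparing each slice's matrix with a long-run average matrix then reduces anomaly detection to thresholding z_l.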

Anomaly detection of DNS query traffic at top level domain servers
There are two categories of top level domains maintained by the Internet Assigned Numbers Authority (IANA) for use in the domain name system (DNS): generic top level domains (gTLD) and country code top level domains (ccTLD). gTLD domains include the com, info, net, and org domains, etc., which do not have a geographic or country designation. A ccTLD domain is an internet top level domain generally used or reserved for a country, a sovereign state, or a dependent territory. ccTLD domains are usually managed by the corresponding countries or regions. For example, China's domain registry, CNNIC (China Internet Network Information Center), is in charge of the operation of CN domains.
Sci. Res. Essays

There are currently ten nodes distributed around the world for the resolution service of CN domains. The DNS query data of each node are transmitted to the central node, which then analyzes and processes the data. We utilize the aggregated DNS query log of all nodes collected by the central node around the 5.19 event to perform the anomaly detection. Netizens from the provinces of Jiangsu, Hebei, Shanxi, Guangxi, Zhejiang, Tianjin, Anhui, Heilongjiang, Guangdong, etc., were reported to experience unusual network failures such as trouble opening website pages. According to the announcement issued by the Ministry of Industry and Information of China, the event resulted from cyber attacks on the domain resolution system of the Baofeng website and its thereby overloaded servers. The attacks, in turn, had a ripple effect on the recursive servers for domain resolution across Chinese telecommunication operating enterprises. These servers were heavily loaded by the usual queries, and thus congestion was created.
We take three days' log files of the CN top level domain servers from 2009-5-18 00:00 to 2009-5-20 24:00, which cover the duration of the 5.19 event. The source addresses recorded in the log files represent the internet protocol (IP) addresses sending DNS requests (mostly recursive servers).
We map these IPs into the provinces they belong to using the MaxMind geographical information database (MaxMind GeoIP database: http://www.maxmind.com/). The query traffic from all source IPs in one province sums up to the aggregated traffic of that province. We select the DNS query traffic volumes of the six provinces expected to be most affected by the 5.19 event: Anhui, Guangxi, Jiangsu, Shanxi, Hebei and Zhejiang. The DNS query traffic volumes are taken as six attributes for covariance analysis in our study.

Covariance anomaly detection
We split the time series of DNS query traffic into time slices at regular intervals, and then calculate the covariance matrix of each time slice. Let the lth time interval be T_l, so the worst detection lag is T_l. We compare the DNS traffic covariance matrix of each time slice with the average covariance matrix learned from long normal traffic. We assume that in the normal case, the DNS query traffic of a region satisfies some stable statistical properties. An important aspect of these properties should be a small random covariance variation across regions. Under network anomaly, the covariance may go out of the normal variation range. This kind of deviation can be taken as an indicator of network anomaly. So network anomalies can be detected by calculating the variation of the covariance matrix.
Here we specify the regions whose traffic is to be investigated on the unit of province. Thus six provinces' DNS traffic volumes are obtained, and the number of attributes is n = 6. Let the intervals of the time slices for covariance analysis be equal, that is, T_l = T for every l. Let the total time length of the DNS query traffic be T_q. If the traffic is divided according to the time sequence without omission or overlapping, the slices sum up to L = T_q / T. For the lth (l = 1, 2, ..., L) time slice with interval T, we can calculate the covariance matrix M_{y,l} according to Equations 1 and 2, and the mean covariance matrix over all slices can be written as

\[ \bar{M}_y = \frac{1}{L} \sum_{l=1}^{L} M_{y,l} \]

The covariance variation z_l is obtained by Equation 3. If z_l goes beyond a set threshold value, the slice can be determined to be abnormal. We consider the covariance variation under different lengths of time intervals and investigate how different values of T impact the detection results. To effectively compare the detection results, we normalize z_l as follows:

\[ z'_l = \frac{z_l - \min_k z_k}{\max_k z_k - \min_k z_k} \]

The detection results are shown in Figure 1. We can see that the detection results vary with the increase of T. For example, on 18th May at 03:00, there is hardly any anomaly in Figure 1(a) (T = 3), while apparent covariance instability is located at that time spot in Figure 1(b). Furthermore, the anomaly tends to be less visible when T is set to a larger value, until it is quite negligible in Figure 1(f) (T = 30). In a similar way, the covariance variations on 20th May at 15:00 differ for different values of T, and their visibility diminishes gradually for larger T.
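The slicing and min-max normalization can be sketched as follows (illustrative Python/NumPy under our reading of the procedure; the function and variable names are ours):

```python
import numpy as np

def normalized_changes(traffic, T, m_ref):
    """Split a (time, n_attributes) traffic series into consecutive
    slices of length T, compute each slice's covariance distance z_l
    (sum of absolute element-wise differences) to the reference matrix
    m_ref, and min-max normalize the distances to [0, 1]."""
    L = traffic.shape[0] // T
    z = np.empty(L)
    for l in range(L):
        s = traffic[l * T:(l + 1) * T]
        z[l] = np.abs(np.cov(s, rowvar=False, bias=True) - m_ref).sum()
    span = z.max() - z.min()
    return (z - z.min()) / span if span > 0 else np.zeros(L)
```

Slices whose normalized change exceeds a threshold are then flagged as abnormal.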
To enhance the detection robustness, we average the results over different values of T. Let the normalized covariance change for a certain T be z'_{T,l}, and let \mathcal{T} denote the set of T values considered. Expanding each series back to per-unit-time resolution (every time point t inside the lth slice of length T takes the value z'_{T, \lceil t/T \rceil}), the averaged detection result at time t is

\[ \bar{z}'_t = \frac{1}{|\mathcal{T}|} \sum_{T \in \mathcal{T}} z'_{T, \lceil t/T \rceil} \]

which can be written in the form of a time series.

The averaged and normalized detection results are shown in Figures 2(a) and (b) respectively. Figure 2(a) clearly displays a DNS query traffic anomaly on 19th May at about 21:00, which is consistent with the time of the 5.19 event announced by the Ministry of Industry and Information of China. This verifies the effectiveness of the detection approach. The results also suggest that breakdowns or faults of recursive servers (especially the heavily and widely requested ones) can be revealed by the DNS query traffic; specifically, anomalies aggregated over a set of busy recursive servers can also be observed from the covariance analysis executed on some major top level domain servers.
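The averaging over different values of T can be sketched by expanding each per-T series back to per-sample resolution and averaging point-wise (an illustrative sketch of our reading of the procedure; names are ours):

```python
import numpy as np

def averaged_changes(traffic, slice_lengths, m_ref):
    """For each slice length T, compute the normalized covariance-change
    series, repeat each slice value T times so every series has
    per-sample resolution, then average the expanded series point-wise."""
    n = traffic.shape[0]
    acc = np.zeros(n)
    for T in slice_lengths:
        L = n // T
        z = np.empty(L)
        for l in range(L):
            s = traffic[l * T:(l + 1) * T]
            z[l] = np.abs(np.cov(s, rowvar=False, bias=True) - m_ref).sum()
        span = z.max() - z.min()
        zp = (z - z.min()) / span if span > 0 else np.zeros(L)
        expanded = np.repeat(zp, T)
        # pad the tail (when T does not divide n) with the last slice value
        acc[:expanded.size] += expanded
        acc[expanded.size:] += zp[-1]
    return acc / len(slice_lengths)
```

Averaging suppresses artifacts that appear only at one particular slice length, which is the robustness effect described above.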

Attribute anomaly detection
Let the covariance change matrix in the lth (l = 1, 2, ..., L) time slice be \Delta M_{y,l} = M_{y,l} - \bar{M}_y. When an anomaly is detected by monitoring z_l in T_l, the covariance anomaly is usually attributable to a few attributes rather than all of them. To locate the attributes involved, we let the covariance change for attribute i be c_{i,l} (i = 1, 2, ..., n), defined as the total absolute covariance change involving attribute i:

\[ c_{i,l} = \sum_{j=1}^{n} \big| \Delta M_{y,l}(i,j) \big| \]

To distinguish the abnormal attributes from the normal ones in the covariance anomaly, we rely on clustering analysis. k-means clustering minimizes the summation of squared distances between each point x_i and its clustering-center m_v (the summation is also called the deviation E). The selection of the initial clustering-centers has a crucial impact on the clustering result, but it is often hard to determine or optimize for a general clustering problem. k-means is a simple algorithm that has shown its applicability to many problem domains. However, it does have some weaknesses: the way to initialize the means is not specified; on one hand, the clustering-centers are hard to choose, and on the other hand, the number of clusters is also not easy to estimate. We make reasonable assumptions for the initial condition of clustering according to observation and analysis of the traffic covariance matrix. Thus, the convergence speed and accuracy of k-means are greatly improved. Figure 5 shows the ratio of the maximum to the total mean covariance change under different normal threshold values. It can be seen that the maximum mean covariance change accounts for 68 to 80% of the total, and this provides good separability for the maximum mean covariance change. Figures 6 to 8 demonstrate the number of maximum, sub-maximum and maximum-two mean covariance change attributes, respectively, under different normal threshold values. It is evident that most covariance anomalies are caused by the traffic from Guangxi, Jiangsu and Anhui. To conclude, we can come to the assumption that for each time point with an abnormal covariance change, the number of abnormal attributes is no more than two, and can be one in some cases. Therefore the key problem remaining for k-means clustering is how to identify the cluster of the sub-maximum attribute.
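One plausible reading of the per-attribute change c_i is the total absolute change in row i of the covariance change matrix ΔM; under that reading (our assumption), the dominant attribute is located as follows (illustrative Python/NumPy; names are ours):

```python
import numpy as np

def attribute_changes(delta_m):
    """Per-attribute covariance change c_i: the sum of the absolute
    entries of row i of the covariance change matrix delta_m."""
    return np.abs(np.asarray(delta_m)).sum(axis=1)

def max_attribute(delta_m):
    """Index of the attribute contributing most to the covariance change."""
    return int(np.argmax(attribute_changes(delta_m)))
```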
We take the attribute with the maximum c_{i,l} as the abnormal attribute, that is, i* = \arg\max_i c_{i,l}, and it serves as the initial clustering-center of the abnormal cluster. Here E is the square error of all objects in the data set, x denotes the points in the space representing the given objects, and m_i is the mean point of cluster C_i.
To simplify the problem, first assume that the sub-maximum attribute is identified as belonging to the abnormal cluster, and write the resulting summation of square errors accordingly; if it instead falls into the normal cluster, the summation of square errors takes the other form. The attribute is assigned to whichever cluster yields the smaller summation. The clustering result for different numbers of attributes is shown in Figure 9. If the sub-maximum covariance change lies above the line, the corresponding attribute is identified as normal; otherwise, it is classed as abnormal.
For a large number of attributes, the clustering boundary in Figure 9 is approached asymptotically. As with the covariance change z, the attribute covariance change depends on the slice length T, the traffic of total length T_q being divided into slices of length T. Let the normalized covariance change of the ith attribute under slice length T be c'_{i,T,l}. The averaged results are

\[ \bar{c}'_{i,t} = \frac{1}{|\mathcal{T}|} \sum_{T \in \mathcal{T}} c'_{i,T,\lceil t/T \rceil} \]

and Equation 22 can be written in the form of a time series in the same way as the averaged covariance change.

Definitions of covariance anomaly
We define anomalies from two perspectives of time. One perspective concerns the instant state, and the corresponding covariance anomaly is the instant anomaly. If the covariance fluctuates so significantly at some isolated time points as to go beyond the instant threshold value threshold_1, an instant anomaly appears. Instant anomalies cover those anomalies that burst within a short time interval. The other perspective describes the state of a time period, and the corresponding covariance anomaly is the time period anomaly. When covariance changes above the threshold value threshold_2 happen successively over a time period, we express the degree of covariance anomaly via the averaged covariance change in that time period. If the covariance change averaged over the time period exceeds the period threshold value threshold_mean, a time period anomaly is detected.
Generally, we have threshold_2 < threshold_1. We define the set of time points of instant anomaly as T_A = {t : z_t > threshold_1}, and the set of time points above threshold_2 as T_B = {t : z_t > threshold_2}. Obviously, we have T_A ⊆ T_B. Although we discuss the two categories of anomalies separately, they do have correlations. A large instant anomaly often has a significant impact on the adjacent time period average and so brings about a time period anomaly. This is not the desired result, because the impact hides the time averaging effect which we expect for the time period anomaly. To overcome this problem, we modify the covariance change values at the time points of instant anomaly to threshold_1, so that the large instant anomalies are smoothed out. The process can be written as z_t ← min(z_t, threshold_1). All t ∈ T_A are sorted by time sequence as t_{A,1}, t_{A,2}, ...
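The two anomaly definitions can be sketched as follows (illustrative Python; a fixed-length sliding window stands in for the general time intervals, which is our simplification, and all names are ours):

```python
import numpy as np

def detect(z, th1, th2, th_mean, window):
    """Return (instant, period) anomaly indices.

    Instant anomalies are points where z exceeds threshold_1 (th1).
    For time period anomalies, values above th1 are first clipped to
    th1 (smoothing out bursts), then a window starting at a point above
    threshold_2 (th2) is flagged when its mean exceeds threshold_mean.
    """
    z = np.asarray(z, dtype=float)
    instant = np.flatnonzero(z > th1).tolist()
    smoothed = np.minimum(z, th1)
    period = [int(s) for s in np.flatnonzero(smoothed > th2)
              if s + window <= len(smoothed)
              and smoothed[s:s + window].mean() > th_mean]
    return instant, period
```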
The corresponding time intervals between them are t_{A,i+1} − t_{A,i}. The time period anomaly is defined to appear in a time interval whose averaged covariance change exceeds threshold_mean. For a candidate ending time point t_j, the averaged covariance changes over the intervals starting at each earlier candidate point should be calculated, and the total number amounts to j − 1. Moreover, the averaged covariance changes have to be maintained for all time points obtained before. When the detection lasts long enough, the computational load becomes heavier until the server is overloaded. To decrease the amount of computing, we only maintain a list of starting time points which are thought to be most likely to surpass the threshold value.
The covariance of attributes f_i and f_j can be expressed as

\[ \mathrm{Cov}(f_i, f_j) = E[f_i f_j] - E[f_i]\,E[f_j] \]

From the N known samples x_{i,1}, ..., x_{i,N} within a time slice, the covariance can be estimated as

\[ \widehat{\mathrm{Cov}}(f_i, f_j) = \frac{1}{N}\sum_{k=1}^{N} x_{i,k} x_{j,k} - \left(\frac{1}{N}\sum_{k=1}^{N} x_{i,k}\right)\left(\frac{1}{N}\sum_{k=1}^{N} x_{j,k}\right) \]

The number of independent items of the covariance matrix is n(n+1)/2. The incremental computation maintains three sums for each cross-correlation coefficient,

\[ S_{ij} = \sum_{k} x_{i,k} x_{j,k}, \qquad S_i = \sum_{k} x_{i,k}, \qquad S_j = \sum_{k} x_{j,k} \]

and two sums for each self-correlation coefficient,

\[ S_{ii} = \sum_{k} x_{i,k}^2, \qquad S_i = \sum_{k} x_{i,k} \]

For every traffic data point arriving within the time slice, each sum S requires only one addition. At the end of the time slice, the cross-correlation coefficient is calculated as

\[ \widehat{\mathrm{Cov}}(f_i, f_j) = \frac{S_{ij}}{N} - \frac{S_i}{N} \cdot \frac{S_j}{N} \]

and the self-correlation coefficient as

\[ \widehat{\mathrm{Var}}(f_i) = \frac{S_{ii}}{N} - \left(\frac{S_i}{N}\right)^2 \]
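A minimal sketch of the incremental scheme (the class and method names are ours; it keeps the sums S_i and S_ij with one addition each per arriving sample, and derives the biased covariance estimate at the end of the slice):

```python
class IncrementalCovariance:
    """Maintain S_i = sum(x_i) and S_ij = sum(x_i * x_j) incrementally;
    the covariance estimate follows at the end of the time slice."""

    def __init__(self, n):
        self.n, self.N = n, 0
        self.s = [0.0] * n                        # S_i
        self.sij = [[0.0] * n for _ in range(n)]  # S_ij (upper triangle)

    def update(self, x):
        """Fold one sample (a length-n sequence) into the running sums."""
        self.N += 1
        for i in range(self.n):
            self.s[i] += x[i]
            for j in range(i, self.n):
                self.sij[i][j] += x[i] * x[j]

    def cov(self, i, j):
        """Biased covariance estimate S_ij/N - (S_i/N)(S_j/N)."""
        i, j = (i, j) if i <= j else (j, i)
        return self.sij[i][j] / self.N - (self.s[i] / self.N) * (self.s[j] / self.N)
```

Because only the running sums are stored, the raw per-sample data never needs to be retained, which is what allows the computation to be synchronized with the incremental traffic upload.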

Anomaly detection process
The anomaly detection using covariance analysis is based on the computation of the covariance matrix. However, traffic statistics arrive at the central node incrementally, which calls for synchronizing the computation of the covariance matrix with the traffic upload. The synchronization helps to decrease the computation load at each step.
To ensure that the anomaly detection in the central node can be performed in real time or quasi-real time, the query traffic recorded by all of the other nodes should be transmitted to the central node in time. However, the solution of transmitting raw data is not viable in this scenario, for two main reasons. First, the enormous domain name system (DNS) query data place high demands on the transmission system. The daily request volume for CN domains amounts to over 1 billion queries and the overall query log file is as large as 200 GB. For a single node, the size of the query log file exceeds 60 GB and the data volume of the peak query traffic surges to 5 GB per hour.
Second, the bandwidth of the transmission network is limited and unreliable. Not only can the transfer of the peak query traffic hardly be satisfied by the network capacity, but the average data transmission rate also cannot be guaranteed by the public network, considering possible network congestion and fluctuations.
Due to the above constraints, raw data should not be the practical load on the transmission network. Instead, an appropriate way is to complete the necessary computation in the local node before data transmission, so that the computation load is distributed among all nodes. In our scheme, the attributes (the query traffic of each province) are obtained in the local node, and then only the attributes rather than the raw data are transferred to the central node. This method can greatly relieve the load on the transmission network, and the computation load of the central node is also reduced.
The processing of the distributed nodes is relatively simple. The source addresses in the DNS request messages are analyzed and then mapped into their provinces according to the geographical information database. The queries from each province are accumulated over a time interval (for example, one minute) to obtain the attributes. At the end of each time interval, the attributes are uploaded to the central node. The processing of the central node is shown in Figure 10: report time period anomalies, give the instant anomaly attributes, and go to the next processing cycle.
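The per-interval processing of a distributed node can be sketched as follows (illustrative Python; the lookup dict stands in for the MaxMind GeoIP database, and all names are ours):

```python
from collections import Counter

def interval_attributes(source_ips, ip_to_province, provinces):
    """Count the queries of one time interval per province and return
    the counts in a fixed attribute order, ready for upload to the
    central node. IPs whose province is unknown are skipped."""
    counts = Counter()
    for ip in source_ips:
        province = ip_to_province.get(ip)
        if province is not None:
            counts[province] += 1
    return [counts[p] for p in provinces]
```

Only this short vector of counts, rather than the raw query log, is sent over the network at the end of each interval.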

CONCLUSIONS
DNS traffic observed at top level domain servers is suggestive of network anomalies. This paper proposes a new approach to anomaly detection based on covariance analysis of DNS traffic. Geographical covariance is exploited to locate the anomaly points. The approach was applied to the CN top level domain servers to detect the network anomalies of the 5.19 event in China, and the results demonstrate its effectiveness.

Figure 1. Covariance change under different lengths of time interval.

Figure 2. Covariance change averaged over different lengths of time interval.

Figure 3. Number of anomaly time points.
First, the purpose of clustering the attribute covariance changes is explicit: to separate abnormal attributes from normal ones. With this purpose, the number of clusters is necessarily k = 2 and the problem of determining k does not exist. For the normalized mean covariance change in Figure 2(b), the number of anomaly points for varied normal threshold values is shown in Figure 3. Statistical analysis of the attribute covariance change c_{i,l} tells us that the covariance change at anomaly points mainly results from a minority of attributes. An equivalent conclusion is that the impacts of different attributes on the covariance change are far from homogeneous. The ratio of the maximum to the sub-maximum mean covariance change under different normal threshold values is shown in Figure 4. In most cases, the maximum and sub-maximum anomaly attributes are quite distinguishable, with their ratio above 6 in Figure 4.
The attribute with the maximum covariance change serves as the initial clustering-center of the abnormal cluster, while the centroid of the remaining attributes acts as the initial clustering-center for the other cluster. We quantitatively analyze the relationship based on the minimal square error criterion of k-means clustering. The minimal square error (MSE) is defined as

\[ E = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - m_i \|^2 \]

Figure 9. Max/sub-max covariance change under different numbers of attributes.

It follows that the sub-maximum covariance change is put into the normal cluster if p > 2.41. Similar to the covariance change z'_{T,l}, the covariance matrix change is also dependent on T, so it can be written as \Delta M_{y,T,l}.