\documentclass{report} \usepackage{epsfig} \usepackage{path} \usepackage{fancyvrb} \def\dsc{{\sc dsc}} \DefineVerbatimEnvironment% {MyVerbatim}{Verbatim} {frame=lines,framerule=0.8mm,fontsize=\small} \renewcommand{\abstractname}{} \begin{document} \begin{titlepage} \title{DSC Manual} \author{Duane Wessels, Measurement Factory\\ Ken Keys, CAIDA\\ \\ http://dns.measurement-factory.com/tools/dsc/} \date{\today} \end{titlepage} \maketitle \begin{abstract} \setlength{\parskip}{1ex} \section{Copyright} The DNS Statistics Collector (dsc) Copyright 2003-2007 by The Measurement Factory, Inc., 2007-2008 by Internet Systems Consortium, Inc., 2008-2019 by OARC, Inc. {\em info@measurement-factory.com\/}, {\em info@isc.org\/} \section{License} {\dsc} is licensed under the terms of the BSD license: Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. Neither the name of The Measurement Factory nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. \section{Contributors} \begin{itemize} \item Duane Wessels, Measurement Factory \item Ken Keys, Cooperative Association for Internet Data Analysis \item Sebastian Castro, New Zealand Registry Services \end{itemize} \end{abstract} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Introduction} {\dsc} is a system for collecting and presenting statistics from a busy DNS server. \section{Components} {\dsc} consists of the following components: \begin{itemize} \item A data collector \item A data presenter, where data is archived and rendered \item A method for securely transferring data from the collector to the presenter \item Utilities and scripts that parse XML and archive files from the collector \item Utilities and scripts that generate graphs and HTML pages \end{itemize} \subsection{The Collector} The collector is a binary program, named {\tt dsc\/}, which snoops on DNS messages. It is written in C and uses {\em libpcap\/} for packet capture. {\tt dsc\/} uses a relatively simple configuration file called {\em dsc.conf\/} to define certain parameters and options. The configuration file also determines the {\em datasets\/} that {\tt dsc\/} collects. A Dataset is a 2-D array of counters of IP/DNS message properties. You can define each dimension of the array independently. For example you might define a dataset categorized by DNS query type along one dimension and TLD along the other. 
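As a concrete illustration, the query-type-by-TLD dataset just described corresponds to this definition in {\em dsc.conf\/} (this same line appears in the \path|dsc.conf.sample| file shipped with {\dsc}):
\begin{MyVerbatim}
dataset qtype_vs_tld dns Qtype:qtype TLD:tld queries-only,popular-qtypes max-cells=200;
\end{MyVerbatim}
Here {\tt Qtype:qtype\/} and {\tt TLD:tld\/} name the two dimensions of the array and the indexers that populate them.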
{\tt dsc\/} dumps the datasets from memory to XML files every 60 seconds.

\subsection{XML Data Transfer}

You may run the {\dsc} collector on a remote machine. That is, the collector may run on a different machine than the one where the data is archived and displayed. {\dsc} includes some Perl and {\tt /bin/sh} scripts to move XML files from collector to presenter. One technique uses X.509 certificates and a secure HTTP server. The other uses {\em rsync\/}, presumably over {\em ssh\/}.

\subsubsection{X.509/SSL}

To make this work, Apache/mod\_ssl should run on the machine where data is archived and presented. Data transfer is authenticated via SSL X.509 certificates. A Perl CGI script handles all PUT requests on the server. If the client certificate is allowed, XML files are stored in the appropriate directory.

A shell script runs on the collector to upload the XML files. It uses {\tt curl\/}\footnote{http://curl.haxx.se} to establish an HTTPS connection. XML files are bundled together with {\tt tar\/} before transfer to eliminate per-connection delays. You could use {\tt scp\/} or {\tt rsync\/} instead of {\tt curl\/} if you like.

\path|put-file.pl| is the script that accepts PUT requests on the HTTP server. The HTTP server validates the client's X.509 certificate. If the certificate is invalid, the PUT request is denied. This script reads environment variables to get X.509 parameters. The uploaded data is stored in a directory based on the X.509 Organizational Unit (server) and Common Name (node) fields.

\subsubsection{rsync/ssh}

This technique uses the {\em rsync\/} utility to transfer files. You'll probably want to use {\em ssh\/} as the underlying transport, although you can still use the less-secure {\em rsh\/} or native rsync server transports if you like. If you use {\em ssh\/} then you'll need to create passphrase-less SSH keys so that the transfer can occur automatically. You may want to create special {\em dsc\/} userids on both ends as well.
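To make the rsync/ssh idea concrete, here is a sketch of the kind of command involved; the hostname, userid, and paths are hypothetical, and in practice the transfer is performed by a cron script described later in this manual:
\begin{MyVerbatim}
% rsync -av -e ssh /usr/local/dsc/run/ns0/upload/archive/ \
      dsc@presenter.example.com:/usr/local/dsc/data/ns0/incoming/
\end{MyVerbatim}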
\subsection{The Extractor}

The XML extractor is a Perl script that reads the XML files from {\tt dsc\/}. The extractor essentially converts the XML-structured data to a format that is easier (faster) for the graphing tools to parse. Currently the extracted data files are line-based ASCII text files. Support for SQL databases is planned for the future.

\subsection{The Grapher}

{\dsc} uses {\em Ploticus\/}\footnote{http://ploticus.sourceforge.net/} as the graphing engine. A Perl module and CGI script read the extracted data files and write Ploticus scriptfiles that render the plots. Plots are always generated on demand via the CGI application. \path|dsc-grapher.pl| is the script that displays graphs from the archived data.

\section{Architecture}

Figure~\ref{fig-architecture} shows the {\dsc} architecture.
\begin{figure}
\centerline{\psfig{figure=dsc-arch.eps,width=3.5in}}
\caption{\label{fig-architecture}The {\dsc} architecture.}
\end{figure}

Note that {\dsc} utilizes the concept of {\em servers\/} and {\em nodes\/}. A server is generally a logical service, which may actually consist of multiple nodes. Figure~\ref{fig-architecture} shows six collectors (the circles) and two servers (the rounded rectangles). For a real-world example, consider a DNS root server. IP Anycast allows a DNS root server to have geographically distributed nodes that share a single IP address. We call each instance a {\em node\/} and all nodes sharing the single IP address belong to the same {\em server\/}.

The {\dsc} collector program runs on or near\footnote{By ``near'' we mean that packets may be sniffed remotely via Ethernet taps, switch port mirroring, or a SPAN port.} the remote nodes. Its XML output is transferred to the presentation machine via HTTPS PUTs (or something simpler if you prefer). The presentation machine includes an HTTP(S) server. The extractor looks for XML files PUT there by the collectors. A CGI script also runs on the HTTP server to display graphs and other information.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Installing the Presenter}

You'll probably want to get the Presenter working before the Collector. If you're using the secure XML data transfer, you'll need to generate both client- and server-side X.509 certificates. Installing the Presenter involves the following steps:
\begin{itemize}
\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex}
\item Install Perl dependencies
\item Install {\dsc} software
\item Create X.509 certificates (optional)
\item Set up a secure HTTP server (e.g., Apache and mod\_ssl)
\item Add some cron jobs
\end{itemize}

\section{Install Perl Dependencies}

{\dsc} uses Perl for the extractor and grapher components. You'll probably need Perl-5.8, although Perl-5.6 may be sufficient. You'll also need these readily available third-party Perl modules, which you can find via CPAN:
\begin{itemize}
\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex}
\item CGI-Untaint (CGI::Untaint)
\item CGI.pm (CGI)
\item Digest-MD5 (Digest::MD5)
\item File-Flock (File::Flock)
\item File-Spec (File::Spec)
\item File-Temp (File::Temp)
\item Geography-Countries (Geography::Countries)
\item Hash-Merge (Hash::Merge)
\item IP-Country (IP::Country)
\item MIME-Base64 (MIME::Base64)
\item Math-Calc-Units (Math::Calc::Units)
\item Scalar-List-Utils (List::Util)
\item Text-Template (Text::Template)
\item URI (URI::Escape)
\item XML-Simple (XML::Simple)
\item Net-DNS-Resolver (Net::DNS::Resolver)
\end{itemize}
\noindent Also note that XML::Simple requires XML::Parser, which in turn requires the {\em expat\/} package.

\section{Install Ploticus}

{\dsc} uses Ploticus to generate plots and graphs. You can find this software at \verb|http://ploticus.sourceforge.net|. The {\em Download\/} page has links to some pre-compiled binaries and packages. FreeBSD and NetBSD users can find Ploticus in the ports/packages collection.
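Before continuing, you may want to verify that the Ploticus binary is on your PATH. Note that, depending on how it was packaged, the binary may be named {\tt pl\/} or {\tt ploticus\/}; for example:
\begin{MyVerbatim}
% pl -version
\end{MyVerbatim}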
\section{Install {\dsc} Software} All of the extractor and grapher tools are Perl or {\tt /bin/sh} scripts, so there is no need to compile anything. Still, you should run {\tt make} first: \begin{MyVerbatim} % cd presenter % make \end{MyVerbatim} If you see errors about missing Perl prerequisites, you may want to correct those before continuing. The next step is to install the files. Recall that \path|/usr/local/dsc| is the hard-coded installation prefix. You must create it manually: \begin{MyVerbatim} % mkdir /usr/local/dsc % make install \end{MyVerbatim} Note that {\dsc}'s Perl modules are installed in the ``site\_perl'' directory. You'll probably need {\em root\/} privileges to install files there. \section{CGI Symbolic Links} {\dsc} has a couple of CGI scripts that are installed into \path|/usr/local/dsc/libexec|. You should add symbolic links from your HTTP server's \path|cgi-bin| directory to these scripts. Both of these scripts have been designed to be mod\_perl-friendly. \begin{MyVerbatim} % cd /usr/local/apache/cgi-bin % ln -s /usr/local/dsc/libexec/put-file.pl % ln -s /usr/local/dsc/libexec/dsc-grapher.pl \end{MyVerbatim} You can skip the \path|put-file.pl| link if you plan to use {\em rsync\/} to transfer XML files. If you cannot create symbolic links, you'll need to manually copy the scripts to the appropriate directory. \section{/usr/local/dsc/data} \subsection{X.509 method} This directory is where \path|put-file.pl| writes incoming XML files. It should have been created when you ran {\em make install\/} earlier. XML files are actually placed in {\em server\/} and {\em node\/} subdirectories based on the authorized client X.509 certificate parameters. 
If you want \path|put-file.pl| to automatically create the subdirectories, the \path|data| directory must be writable by the process owner: \begin{MyVerbatim} % chgrp nobody /usr/local/dsc/data/ % chmod 2775 /usr/local/dsc/data/ \end{MyVerbatim} Alternatively, you can create {\em server\/} and {\em node\/} directories in advance and make those writable. \begin{MyVerbatim} % mkdir /usr/local/dsc/data/x-root/ % mkdir /usr/local/dsc/data/x-root/blah/ % mkdir /usr/local/dsc/data/x-root/blah/incoming/ % chgrp nobody /usr/local/dsc/data/x-root/blah/ % chmod 2775 /usr/local/dsc/data/x-root/blah/incoming/ \end{MyVerbatim} Make sure that \path|/usr/local/dsc/data/| is on a large partition with plenty of free space. You can make it a symbolic link to another partition if necessary. Note that a typical {\dsc} installation for a large DNS root server requires about 4GB to hold a year's worth of data. \subsection{rsync Method} The directory structure is the same as above (for X.509). The only differences are that: \begin{itemize} \item The {\em server\/}, {\em node\/}, and {\em incoming\/} directories must be made in advance. \item The directories should be writable by the userid associated with the {\em rsync}/{\em ssh\/} connection. You may want to create a dedicated {\em dsc\/} userid for this. \end{itemize} \section{/usr/local/dsc/var/log} The \path|put-file.pl| script logs its activity to \path|put-file.log| in this directory. It should have been created when you ran {\em make install\/} earlier. The directory should be writable by the HTTP server userid (usually {\em nobody\/} or {\em www\/}). Unfortunately the installation isn't fancy enough to determine that userid yet, so you must change the ownership manually: \begin{MyVerbatim} % chgrp nobody /usr/local/dsc/var/log/ \end{MyVerbatim} Furthermore, you probably want to make sure the log file does not grow indefinitely. 
For example, on FreeBSD we add this line to \path|/etc/newsyslog.conf|:
\begin{MyVerbatim}
/usr/local/dsc/var/log/put-file.log nobody:wheel 644 10 * @T00 BN
\end{MyVerbatim}
You need not worry about this directory if you are using the {\em rsync\/} upload method.

\section{/usr/local/dsc/cache}

This directory, also created by {\em make install\/} above, holds cached plot images. It also must be writable by the HTTP userid:
\begin{MyVerbatim}
% chgrp nobody /usr/local/dsc/cache/
\end{MyVerbatim}

\section{Cron Jobs}

{\dsc} requires two cron jobs on the Presenter. The first processes incoming XML files. It is called \path|refile-and-grok.sh|. We recommend running it every minute. You may also want to run the jobs at a lower priority with {\tt nice\/}. Here is the cron job that we use:
\begin{MyVerbatim}
* * * * * /usr/bin/nice -10 /usr/local/dsc/libexec/refile-and-grok.sh
\end{MyVerbatim}
The other useful cron script is \path|remove-xmls.pl|. It removes XML files older than a specified number of days. Since most of the information in the XML files is archived into easier-to-parse data files, you can remove the XML files after a few days. This is the job that we use:
\begin{MyVerbatim}
@midnight find /usr/local/dsc/data/ | /usr/local/dsc/libexec/remove-xmls.pl 7
\end{MyVerbatim}

\section{Data URIs}

{\dsc} uses ``Data URIs'' by default. This is a URI where the content is base-64 encoded into the URI string. It allows us to include images directly in HTML output, so that the browser does not have to make additional HTTP requests for the images. Data URIs may not work with some browsers.
To disable Data URIs, edit {\em presenter/perllib/DSC/grapher.pm\/} and change this line:
\begin{verbatim}
$use_data_uri = 1;
\end{verbatim}
to
\begin{verbatim}
$use_data_uri = 0;
\end{verbatim}
Also make this symbolic link from your HTTP server's ``htdocs'' directory:
\begin{verbatim}
# cd htdocs
# ln -s /usr/local/dsc/share/html dsc
\end{verbatim}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Configuring the {\dsc} Presenter}

This chapter describes how to create X.509 certificates and configure Apache/mod\_ssl. If you plan on using a different upload technique (such as scp or rsync) you can skip these instructions.

\section{Generating X.509 Certificates}

We use X.509 certificates to authenticate both sides of an SSL connection when uploading XML data files from the collector to the presenter. Certificate generation is a tricky thing. We use three different types of certificates:
\begin{enumerate}
\item A self-signed root CA certificate
\item A server certificate
\item Client certificates for each collector node
\end{enumerate}
In the client certificates we use X.509 fields to store the collector's server and node name. The Organizational Unit Name (OU) becomes the server name and the Common Name (CN) becomes the node name.

The {\dsc} source code distribution includes some shell scripts that we have used to create X.509 certificates. You can find them in the \path|presenter/certs| directory. Note these are not installed into \path|/usr/local/dsc|. You should edit \path|openssl.conf| and enter the relevant information for your organization.

\subsection{Certificate Authority}

You may need to create a self-signed certificate authority if you don't already have one. The CA signs client and server certificates. You will need to distribute the CA and client certificates to collector sites.
Here is how to use our \path|create-ca-cert.sh| script: \begin{MyVerbatim} % sh create-ca-cert.sh CREATING CA CERT Generating a 2048 bit RSA private key .............................................................................. ............+++ ......+++ writing new private key to './private/cakey.pem' Enter PEM pass phrase: Verifying - Enter PEM pass phrase: ----- \end{MyVerbatim} \subsection{Server Certificate} The server certificate is used by the HTTP server (Apache/mod\_ssl). The clients will have a copy of the CA certificate so they can validate the server's certificate when uploading XML files. Use the \path|create-srv-cert.sh| script to create a server certificate: \begin{MyVerbatim} % sh create-srv-cert.sh CREATING SERVER REQUEST Generating a 1024 bit RSA private key ..........................++++++ .....................................++++++ writing new private key to 'server/server.key' Enter PEM pass phrase: Verifying - Enter PEM pass phrase: ----- You are about to be asked to enter information that will be incorporated into your certificate request. What you are about to enter is what is called a Distinguished Name or a DN. There are quite a few fields but you can leave some blank For some fields there will be a default value, If you enter '.', the field will be left blank. 
----- Country Name (2 letter code) [AU]:US State or Province Name (full name) [Some-State]:Colorado Locality Name (eg, city) []:Boulder Organization Name (eg, company) [Internet Widgits Pty Ltd]:The Measurement Factory, Inc Organizational Unit Name (eg, section) []:DNS Common Name (eg, YOUR name) []:dns.measurement-factory.com Email Address []:wessels@measurement-factory.com Please enter the following 'extra' attributes to be sent with your certificate request A challenge password []: An optional company name []: Enter pass phrase for server/server.key: writing RSA key CREATING SERVER CERT Using configuration from ./openssl.conf Enter pass phrase for ./private/cakey.pem: Check that the request matches the signature Signature ok The Subject's Distinguished Name is as follows countryName :PRINTABLE:'US' stateOrProvinceName :PRINTABLE:'Colorado' localityName :PRINTABLE:'Boulder' organizationName :PRINTABLE:'The Measurement Factory, Inc' organizationalUnitName:PRINTABLE:'DNS' commonName :PRINTABLE:'dns.measurement-factory.com' emailAddress :IA5STRING:'wessels@measurement-factory.com' Certificate is to be certified until Jun 3 20:06:17 2013 GMT (3000 days) Sign the certificate? [y/n]:y 1 out of 1 certificate requests certified, commit? [y/n]y Write out database with 1 new entries Data Base Updated \end{MyVerbatim} Note that the Common Name must match the hostname of the HTTP server that receives XML files. Note that the \path|create-srv-cert.sh| script rewrites the server key file without the RSA password. This allows your HTTP server to start automatically without prompting for the password. The script leaves the server certificate and key in the \path|server| directory. You'll need to copy these over to the HTTP server config directory as described later in this chapter. \section{Client Certificates} Generating client certificates is similar. Remember that the Organizational Unit Name and Common Name correspond to the collector's {\em server\/} and {\em node\/} names. 
For example: \begin{MyVerbatim} % sh create-clt-cert.sh CREATING CLIENT REQUEST Generating a 1024 bit RSA private key ................................++++++ ..............++++++ writing new private key to 'client/client.key' Enter PEM pass phrase: Verifying - Enter PEM pass phrase: ----- You are about to be asked to enter information that will be incorporated into your certificate request. What you are about to enter is what is called a Distinguished Name or a DN. There are quite a few fields but you can leave some blank For some fields there will be a default value, If you enter '.', the field will be left blank. ----- Country Name (2 letter code) [AU]:US State or Province Name (full name) [Some-State]:California Locality Name (eg, city) []:Los Angeles Organization Name (eg, company) [Internet Widgits Pty Ltd]:Some DNS Server Organizational Unit Name (eg, section) []:x-root Common Name (eg, YOUR name) []:LAX Email Address []:noc@example.com Please enter the following 'extra' attributes to be sent with your certificate request A challenge password []: An optional company name []: CREATING CLIENT CERT Using configuration from ./openssl.conf Enter pass phrase for ./private/cakey.pem: Check that the request matches the signature Signature ok The Subject's Distinguished Name is as follows countryName :PRINTABLE:'US' stateOrProvinceName :PRINTABLE:'California' localityName :PRINTABLE:'Los Angeles' organizationName :PRINTABLE:'Some DNS Server' organizationalUnitName:PRINTABLE:'x-root ' commonName :PRINTABLE:'LAX' emailAddress :IA5STRING:'noc@example.com' Certificate is to be certified until Jun 3 20:17:24 2013 GMT (3000 days) Sign the certificate? [y/n]:y 1 out of 1 certificate requests certified, commit? [y/n]y Write out database with 1 new entries Data Base Updated Enter pass phrase for client/client.key: writing RSA key writing RSA key \end{MyVerbatim} The client's key and certificate will be placed in a directory based on the server and node names. 
For example: \begin{MyVerbatim} % ls -l client/x-root/LAX total 10 -rw-r--r-- 1 wessels wessels 3311 Mar 17 13:17 client.crt -rw-r--r-- 1 wessels wessels 712 Mar 17 13:17 client.csr -r-------- 1 wessels wessels 887 Mar 17 13:17 client.key -rw-r--r-- 1 wessels wessels 1953 Mar 17 13:17 client.pem \end{MyVerbatim} The \path|client.pem| (and \path|cacert.pem|) files should be copied to the collector machine. \section{Apache Configuration} \noindent You need to configure Apache for SSL. Here is what our configuration looks like: \begin{MyVerbatim} SSLRandomSeed startup builtin SSLRandomSeed startup file:/dev/random SSLRandomSeed startup file:/dev/urandom 1024 SSLRandomSeed connect builtin SSLRandomSeed connect file:/dev/random SSLRandomSeed connect file:/dev/urandom 1024 DocumentRoot "/httpd/htdocs-ssl" SSLEngine on SSLCertificateFile /httpd/conf/SSL/server/server.crt SSLCertificateKeyFile /httpd/conf/SSL/server/server.key SSLCertificateChainFile /httpd/conf/SSL/cacert.pem # For client-validation SSLCACertificateFile /httpd/conf/SSL/cacert.pem SSLVerifyClient require SSLOptions +CompatEnvVars Script PUT /cgi-bin/put-file.pl \end{MyVerbatim} \noindent Note the last line of the configuration specifies the CGI script that accepts PUT requests. The {\em SSLOptions\/} line is necessary so that the CGI script receives certain HTTP headers as environment variables. Those headers/variables convey the X.509 information to the script so it knows where to store received XML files. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Collector Installation} A collector machine needs only the {\em dsc\/} binary, a configuration file, and a couple of cron job scripts. At this point, {\dsc} lacks certain niceties such as a \path|./configure| script. The installation prefix, \path|/usr/local/dsc| is currently hard-coded. \section{Prerequisites} You'll need a C/C++ compiler to compile the {\tt dsc\/} source code. 
If the collector and archiver are different systems, you'll need a way to transfer data files. We recommend the {\tt curl\/} HTTP/SSL client. You may use another technique, such as {\tt scp\/} or {\tt rsync\/}, if you prefer.

\section{Installation}

You can compile {\tt dsc\/} from the {\tt collector\/} directory:
\begin{MyVerbatim}
% cd collector
% make
\end{MyVerbatim}
Assuming there are no errors or problems during compilation, install the {\tt dsc\/} binary and other scripts with:
\begin{MyVerbatim}
% make install
\end{MyVerbatim}
This installs five files:
\begin{Verbatim}
/usr/local/dsc/bin/dsc
/usr/local/dsc/etc/dsc.conf.sample
/usr/local/dsc/libexec/upload-prep.pl
/usr/local/dsc/libexec/upload-rsync.sh
/usr/local/dsc/libexec/upload-x509.sh
\end{Verbatim}
Of course, if you don't want to use the default installation prefix, you can manually copy these files to a location of your choosing. If you do that, you'll also need to edit the cron scripts to match your choice of pathnames.

\section{Uploading XML Files}
\label{sec-install-collector-cron}

This section describes how XML files are transferred from the collector to one or more Presenter systems. As we'll see in the next chapter, each {\tt dsc} process has its own {\em run directory\/}. This is the directory where {\tt dsc} leaves its XML files. It usually has a name like \path|/usr/local/dsc/run/NODENAME|\@. XML files are removed after they are successfully transferred. If the Presenter is unreachable, XML files accumulate here until they can be transferred. Make sure that you have enough disk space to queue a lot of XML files in the event of an outage.

In general we want to be able to upload XML files to multiple presenters. This is the reason behind the {\tt upload-prep.pl} script.
This script runs every 60 seconds from cron: \begin{MyVerbatim} * * * * * /usr/local/dsc/libexec/upload-prep.pl \end{MyVerbatim} {\tt upload-prep.pl} looks for \path|dsc.conf| files in \path|/usr/local/dsc/etc| by default. For each config file found, it cd's to the {\em run\_dir\/} and links\footnote{as in ``hard link'' made with \path|/bin/ln|.} XML files to one or more upload directories. The upload directories are named \path|upload/dest1|, \path|upload/dest2|, and so on. In order for all this to work, you must create the directories in advance. For example, if you are collecting stats on your nameserver named {\em ns0\/}, and want to send the XML files to two presenters (named oarc and archive), the directory structure might look like: \begin{MyVerbatim} % set prefix=/usr/local/dsc % mkdir $prefix/run % mkdir $prefix/run/ns0 % mkdir $prefix/run/ns0/upload % mkdir $prefix/run/ns0/upload/oarc % mkdir $prefix/run/ns0/upload/archive \end{MyVerbatim} With that directory structure, the {\tt upload-prep.pl} script moves XML files from the \path|ns0| directory to the two upload directories, \path|oarc| and \path|archive|. To actually transfer files to the presenter, use either \path|upload-x509.sh| or \path|upload-rsync.sh|. \subsection{upload-x509.sh} This cron script is responsible for actually transferring XML files from the upload directories to the remote server. It creates a {\em tar\/} archive of XML files and then uploads it to the remote server with {\tt curl}. The script takes three commandline arguments: \begin{MyVerbatim} % upload-x509.sh NODE DEST URI \end{MyVerbatim} {\em NODE\/} must match the name of a directory under \path|/usr/local/dsc/run|. Similarly, {\em DEST\/} must match the name of a directory under \path|/usr/local/dsc/run/NODE/upload|. {\em URI\/} is the URL/URI that the data is uploaded to. Usually it is just an HTTPS URL with the name of the destination server. We also recommend running this from cron every 60 seconds. 
For example:
\begin{MyVerbatim}
* * * * * /usr/local/dsc/libexec/upload-x509.sh ns0 oarc \
    https://collect.oarc.isc.org/
* * * * * /usr/local/dsc/libexec/upload-x509.sh ns0 archive \
    https://archive.example.com/
\end{MyVerbatim}
\path|upload-x509.sh| looks for X.509 certificates in \path|/usr/local/dsc/certs|. The client certificate should be named \path|/usr/local/dsc/certs/DEST/NODE.pem| and the CA certificate should be named \path|/usr/local/dsc/certs/DEST/cacert.pem|. Note that {\em DEST\/} and {\em NODE\/} must match the \path|upload-x509.sh| command line arguments.

\subsection{upload-rsync.sh}

This script can be used to transfer XML files from the upload directories to the remote server. It uses {\em rsync\/} and assumes that {\em rsync\/} will use {\em ssh\/} for transport. This script also takes three arguments:
\begin{MyVerbatim}
% upload-rsync.sh NODE DEST RSYNC-DEST
\end{MyVerbatim}
Note that {\em DEST\/} is the name of the local ``upload'' directory and {\em RSYNC-DEST\/} is an {\em rsync\/} destination (i.e., hostname and remote directory). Here is how you might use it in a crontab:
\begin{MyVerbatim}
* * * * * /usr/local/dsc/libexec/upload-rsync.sh ns0 oarc \
    dsc@collect.oarc.isc.org:/usr/local/dsc/data/Server/ns0
* * * * * /usr/local/dsc/libexec/upload-rsync.sh ns0 archive \
    dsc@archive.oarc.isc.org:/usr/local/dsc/data/Server/ns0
\end{MyVerbatim}
Also note that \path|upload-rsync.sh| will actually store the remote XML files in \path|incoming/YYYY-MM-DD| subdirectories. That is, if your {\em RSYNC-DEST\/} is \path|host:/usr/local/dsc/data/Server/ns0| then files will actually be written to \path|/usr/local/dsc/data/Server/ns0/incoming/YYYY-MM-DD| on {\em host}, where \path|YYYY-MM-DD| is replaced by the year, month, and date of the XML files. These subdirectories reduce filesystem pressure in the event of backlogs.

{\em rsync\/} over {\em ssh\/} requires you to use RSA or DSA public keys that do not have a passphrase.
If you do not want to use one of {\em ssh\/}'s default identity files, you can create one specifically for this script. It should be named \path|dsc_uploader_id| (and \path|dsc_uploader_id.pub|) in the \$HOME/.ssh directory of the user that will be running the script. For example, you can create it with this command:
\begin{MyVerbatim}
% ssh-keygen -t dsa -C dsc-uploader -f $HOME/.ssh/dsc_uploader_id
\end{MyVerbatim}
Then add \path|dsc_uploader_id.pub| to the \path|authorized_keys| file of the receiving userid on the presenter system.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Configuring and Running the {\dsc} Collector}

\section{dsc.conf}

Before running {\tt dsc\/}, you need to create a configuration file. Note that configuration directive lines are terminated with a semicolon. The configuration file currently understands the following directives:
\begin{description}

\item[local\_address] Specifies the DNS server's local IP address. It is used to determine the ``direction'' of an IP packet: sending, receiving, or other. You may specify multiple local addresses by repeating the {\em local\_address} line any number of times.

Example: {\tt local\_address 172.16.0.1;\/}

Example: {\tt local\_address 2001:4f8:0:2::13;\/}

\item[run\_dir] A directory that should become {\tt dsc\/}'s current directory after it starts. XML files will be written here, as will any core dumps.

Example: {\tt run\_dir "/var/run/dsc";\/}

\item[minfree\_bytes] If the filesystem where {\tt dsc\/} writes its XML files does not have at least this much free space, then {\tt dsc\/} will not write the XML files. This prevents {\tt dsc\/} from filling up the filesystem. The XML files that would have been written are simply lost and cannot be recovered. {\tt dsc\/} will begin writing XML files again when the filesystem has the necessary free space.

\item[bpf\_program] A Berkeley Packet Filter program string. Normally you should leave this unset.
You may use this to further restrict the traffic seen by {\tt dsc\/}. Note that {\tt dsc\/} currently has one indexer that looks at all IP packets. If you specify something like {\em udp port 53\/} that indexer will not work. However, if you want to monitor multiple DNS servers with separate {\dsc} instances on one collector box, then you may need to use {\em bpf\_program} to make sure that each {\tt dsc} process sees only the traffic it should see. Note that this directive must go before the {\em interface\/} directive because {\tt dsc\/} makes only one pass through the configuration file and the BPF filter is set when the interface is initialized. Example: {\tt bpf\_program "dst host 192.168.1.1";\/} \item[interface] The interface name to sniff packets from or a pcap file to read packets from. You may specify multiple interfaces. Example: {\tt interface fxp0;\/} {\tt interface /path/to/dump.pcap;\/} \item[bpf\_vlan\_tag\_byte\_order] {\tt dsc\/} knows about VLAN tags. Some operating systems (FreeBSD-4.x) have a bug whereby the VLAN tag id is byte-swapped. Valid values for this directive are {\tt host\/} and {\tt net\/} (the default). Set this to {\tt host\/} if you suspect your operating system has the VLAN tag byte order bug. Example: {\tt bpf\_vlan\_tag\_byte\_order host;\/} \item[match\_vlan] A list of VLAN identifiers (integers). If set, only the packets belonging to these VLANs are counted. Example: {\tt match\_vlan 101 102;\/} \item[qname\_filter] This directive allows you to define custom filters to match query names in DNS messages. Please see Section~\ref{sec-qname-filter} for more information. \item[dataset] This directive is the heart of {\dsc}. However, it is also the most complex. To save time we recommend that you copy interesting-looking dataset definitions from \path|dsc.conf.sample|. Comment out any that you feel are irrelevant or uninteresting. 
Later, as you become more familiar with {\dsc}, you may want to read the next chapter and add your own custom datasets.
\item[output\_format] Specifies the output format. This directive can be given multiple times to output in more than one format. The default output format is XML. The available formats are XML and JSON. Example: {\tt output\_format JSON;\/}
\end{description}
\section{A Complete Sample dsc.conf}
Here's how your entire {\em dsc.conf\/} file might look:
\begin{MyVerbatim}
#bpf_program
interface em0;
local_address 192.5.5.241;
run_dir "/usr/local/dsc/run/foo";
dataset qtype dns All:null Qtype:qtype queries-only;
dataset rcode dns All:null Rcode:rcode replies-only;
dataset opcode dns All:null Opcode:opcode queries-only;
dataset rcode_vs_replylen dns Rcode:rcode ReplyLen:msglen replies-only;
dataset client_subnet dns All:null ClientSubnet:client_subnet
    queries-only max-cells=200;
dataset qtype_vs_qnamelen dns Qtype:qtype QnameLen:qnamelen queries-only;
dataset qtype_vs_tld dns Qtype:qtype TLD:tld
    queries-only,popular-qtypes max-cells=200;
dataset certain_qnames_vs_qtype dns CertainQnames:certain_qnames
    Qtype:qtype queries-only;
dataset client_subnet2 dns Class:query_classification
    ClientSubnet:client_subnet queries-only max-cells=200;
dataset client_addr_vs_rcode dns Rcode:rcode ClientAddr:client
    replies-only max-cells=50;
dataset chaos_types_and_names dns Qtype:qtype Qname:qname
    chaos-class,queries-only;
dataset idn_qname dns All:null IDNQname:idn_qname queries-only;
dataset edns_version dns All:null EDNSVersion:edns_version queries-only;
dataset do_bit dns All:null D0:do_bit queries-only;
dataset rd_bit dns All:null RD:rd_bit queries-only;
dataset tc_bit dns All:null TC:tc_bit replies-only;
dataset idn_vs_tld dns All:null TLD:tld queries-only,idn-only;
dataset ipv6_rsn_abusers dns All:null ClientAddr:client
    queries-only,aaaa-or-a6-only,root-servers-net-only max-cells=50;
dataset transport_vs_qtype dns Transport:transport Qtype:qtype queries-only;
dataset direction_vs_ipproto ip
    Direction:ip_direction IPProto:ip_proto any;
\end{MyVerbatim}
\section{Running {\tt dsc}}
{\tt dsc\/} accepts a single command-line argument, which is the name of the configuration file. For example:
\begin{MyVerbatim}
% cd /usr/local/dsc
% bin/dsc etc/foo.conf
\end{MyVerbatim}
If you run {\tt ps} when {\tt dsc} is running, you'll see two processes:
\begin{MyVerbatim}
60494  ??  S      0:00.36 bin/dsc etc/foo.conf
69453  ??  Ss     0:10.65 bin/dsc etc/foo.conf
\end{MyVerbatim}
The first process simply forks off child processes every 60 seconds. The child processes do the work of analyzing and tabulating DNS messages. Please use NTP or another technique to keep the collector's clock synchronized to the correct time.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Viewing {\dsc} Graphs}
To view {\dsc} data in a web browser, simply enter the URL to the \path|dsc-grapher.pl| CGI. But before you do that, you'll need to create a grapher configuration file. \path|dsc-grapher.pl| uses a simple configuration file to set certain menu options. This configuration file is \path|/usr/local/dsc/etc/dsc-grapher.cfg|. You should find a sample version in the same directory. For example:
\begin{MyVerbatim}
server f-root pao1 sfo2
server isc senna+piquet
server tmf hq sc lgh
trace_windows 1hour 4hour 1day 1week 1month
accum_windows 1day 2days 3days 1week
timezone Asia/Tokyo
domain_list isc_tlds br nl ca cz il pt cl
domain_list isc_tlds sk ph hr ae bg is si za
valid_domains isc isc_tlds
\end{MyVerbatim}
\begin{figure}
\centerline{\psfig{figure=screenshot1.eps,width=6.5in}}
\caption{\label{fig-screenshot1}A sample graph}
\end{figure}
Refer to Figure~\ref{fig-screenshot1} to see how the directives affect the visual display.
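As a starting point, a minimal \path|dsc-grapher.cfg| for a single server with two nodes might look like this (the server and node names here are only placeholders):
\begin{MyVerbatim}
server example-server node1 node2
trace_windows 1hour 1day 1week
accum_windows 1day 1week
\end{MyVerbatim}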
The following three directives should always be set in the configuration file:
\begin{description}
\item[server] This directive tells \path|dsc-grapher.pl| to list the given server and its associated nodes in the ``Servers/Nodes'' section of its navigation menu. You can repeat this directive for each server that the Presenter has.
\item[trace\_windows] Specifies the ``Time Scale'' menu options for trace-based plots.
\item[accum\_windows] Specifies the ``Time Scale'' menu options for ``cumulative'' plots, such as the Classification plot.
\end{description}
Note that \path|dsc-grapher.cfg| only affects what may appear in the navigation window. It does NOT prevent users from entering other values in the URL parameters. For example, if you have data for a server/node in your \path|/usr/local/dsc/data/| directory that is not listed in \path|dsc-grapher.cfg|, a user may still be able to view that data by manually setting the URL query parameters.

The configuration file accepts a number of optional directives as well. You may set these if you like, but they are not required:
\begin{description}
\item[timezone] Sets the time zone for dates and times displayed in the graphs. You can use this if you want to override the system time zone. The value for this directive should be the name of a timezone entry in your system database (usually found in \path|/usr/share/zoneinfo|). For example, if your system time zone is set to UTC but you want the times displayed for the London timezone, you can set this directive to {\tt Europe/London\/}.
\item[domain\_list] This directive, along with {\em valid\_domains\/}, tells the presenter which domains a nameserver is authoritative for. That information is used in the TLDs subgraphs to differentiate requests for ``valid'' and ``invalid'' domains. The {\em domain\_list\/} directive creates a named list of domains. The first token is a name for the list, and the remaining tokens are domain names.
The directive may be repeated with the same list name, as shown in the above example.
\item[valid\_domains] This directive glues servers and domain\_lists together. The first token is the name of a {\em server\/} and the second token is the name of a {\em domain\_list\/}.
\item[embargo] The {\em embargo\/} directive may be used to delay the availability of data via the presenter. For example, you may have one instance of {\em dsc-grapher.pl\/} for internal use only (password protected, etc.). You may also have a second instance for third parties where data is delayed by some amount of time, such as hours, days, or weeks. The value of the {\em embargo\/} directive is the number of seconds by which data availability should be delayed. For example, if you set it to 604800, then viewers will not be able to see any data less than one week old.
\item[anonymize\_ip] When the {\em anonymize\_ip\/} directive is given, IP addresses in the display will be anonymized. The anonymization algorithm is currently hard-coded and designed only for IPv4 addresses. It masks off the lower 24 bits and leaves only the first octet in place.
\item[hide\_nodes] When the {\em hide\_nodes\/} directive is given, the presenter will not display the list of node names underneath the current server. This might be useful if you have a number of nodes but only want viewers to see the server as a whole, without exposing the particular nodes in the cluster. Note, however, that if someone already knows the name of a node they can hand-craft query terms in the URL to display the data for only that node. In other words, {\em hide\_nodes\/} only provides ``security through obscurity.''
\end{description}
The first few times you try \path|dsc-grapher.pl|, be sure to run {\tt tail -f} on the HTTP server's error.log file.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{{\dsc} Datasets}
A {\em dataset\/} is a 2-D array of counters.
For example, you might have a dataset with ``Query Type'' along one dimension and ``Query Name Length'' on the other. The result is a table that shows the distribution of query name lengths for each query type. For example:
\vspace{1ex}
\begin{center}
\begin{tabular}{l|rrrrrr}
Len & A & AAAA & A6 & PTR & NS & SOA \\
\hline
$\cdots$ & & & & & & \\
11 & 14 & 8 & 7 & 11 & 2 & 0 \\
12 & 19 & 2 & 3 & 19 & 4 & 1 \\
$\cdots$ & & & & & & \\
255 & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\end{tabular}
\end{center}
\vspace{1ex}
\noindent
A dataset is defined by the following parameters:
\begin{itemize}
\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex}
\item A name
\item A protocol layer (IP or DNS)
\item An indexer for the first dimension
\item An indexer for the second dimension
\item One or more filters
\item Zero or more options and parameters
\end{itemize}
\noindent
The {\em dataset\/} definition syntax in \path|dsc.conf| is:

{\tt dataset\/} {\em name\/} {\em protocol\/} {\em Label1:Indexer1\/} {\em Label2:Indexer2\/} {\em filter\/} {\em [parameters]\/};

\vspace{2ex}
\section{Dataset Name}
The dataset name is used in the filename for {\tt dsc\/}'s XML files. Although this is an opaque string in theory, the Presenter's XML extractor routines must recognize the dataset name to properly parse it. The source code file \path|presenter/perllib/DSC/extractor/config.pm| contains an entry for each known dataset name.
\section{Protocol}
{\dsc} currently knows about two protocol layers: IP and DNS. On the {\tt dataset\/} line they are written as {\tt ip\/} and {\tt dns\/}.
\section{Indexers}
An {\em indexer\/} is simply a function that transforms the attributes of an IP/DNS message into an array index. For some attributes the transformation is straightforward. For example, the ``Query Type'' indexer simply extracts the query type value from a DNS message and uses this 16-bit value as the array index. Other attributes are slightly more complicated.
For example, the ``TLD'' indexer extracts the TLD of the QNAME field of a DNS message and maps it to an integer. The indexer maintains a simple internal table of TLD-to-integer mappings. The actual integer values are unimportant because the TLD strings, not the integers, appear in the resulting XML data. When you specify an indexer on a {\tt dataset\/} line, you must provide both the name of the indexer and a label. The Label appears as an attribute in the XML output. For example, Figure~\ref{fig-sample-xml} shows the XML corresponding to this {\em dataset\/} line:
\begin{MyVerbatim}
dataset the_dataset dns Foo:foo Bar:bar queries-only;
\end{MyVerbatim}
\begin{figure}
\begin{MyVerbatim}
<array name="the_dataset" dimensions="2"
    start_time="..." stop_time="...">
  <dimension number="1" type="Foo"/>
  <dimension number="2" type="Bar"/>
  <data>
    <Foo val="...">
      <Bar val="..." count="..."/>
      ...
    </Foo>
    ...
  </data>
</array>
\end{MyVerbatim}
\caption{\label{fig-sample-xml}Sample XML output}
\end{figure}
In theory you are free to choose any label that you like; however, the XML extractors look for specific labels. Please use the labels given for the indexers in Tables~\ref{tbl-dns-indexers} and~\ref{tbl-ip-indexers}.
\subsection{IP Indexers}
\begin{table}
\begin{center}
\begin{tabular}{|lll|}
\hline
Indexer & Label & Description \\
\hline
ip\_direction & Direction & one of sent, recv, or other \\
ip\_proto & IPProto & IP protocol (icmp, tcp, udp) \\
ip\_version & - & IP version number (4, 6) \\
\hline
\end{tabular}
\caption{\label{tbl-ip-indexers}IP packet indexers}
\end{center}
\end{table}
{\dsc} includes only minimal support for collecting IP-layer stats. Mostly we are interested in finding out the mix of IP protocols received by the DNS server. It can also show us if/when the DNS server is the subject of a denial-of-service attack. Table~\ref{tbl-ip-indexers} shows the indexers for IP packets. Here are their longer descriptions:
\begin{description}
\item[ip\_direction] One of three values: sent, recv, or other. Direction is determined based on the setting for {\em local\_address\/} in the configuration file.
\item[ip\_proto] The IP protocol type, e.g.: tcp, udp, icmp.
Note that the {\em bpf\_program\/} setting affects all traffic seen by {\dsc}. If the program contains the word ``udp'' then you won't see any counts for non-UDP traffic.
\item[ip\_version] The IP version number, e.g.: 4 or 6. Can be used to compare how much traffic comes in via IPv6 compared to IPv4.
\end{description}
\subsection{IP Filters}
Currently there is only one IP protocol filter: {\tt any\/}. It includes all received packets.
\subsection{DNS Indexers}
\begin{table}
\begin{center}
\begin{tabular}{|lll|}
\hline
Indexer & Label & Description \\
\hline
certain\_qnames & CertainQnames & Popular query names seen at roots \\
client\_subnet & ClientSubnet & The client's IP subnet (/24 for IPv4, /96 for IPv6) \\
client & ClientAddr & The client's IP address \\
do\_bit & DO & Whether the DO bit is on \\
edns\_version & EDNSVersion & The EDNS version number \\
idn\_qname & IDNQname & If the QNAME is in IDN format \\
msglen & MsgLen & The DNS message length \\
null & All & A ``no-op'' indexer \\
opcode & Opcode & DNS message opcode \\
qclass & - & Query class \\
qname & Qname & Full query name \\
qnamelen & QnameLen & Length of the query name \\
qtype & Qtype & DNS query type \\
query\_classification & Class & A classification for bogus queries \\
rcode & Rcode & DNS response code \\
rd\_bit & RD & Whether the Recursion Desired bit is set \\
tc\_bit & TC & Whether the Truncated bit is set \\
tld & TLD & TLD of the query name \\
transport & Transport & Transport protocol for the DNS message (UDP or TCP) \\
dns\_ip\_version & IPVersion & IP version of the packet carrying the DNS message \\
\hline
\end{tabular}
\caption{\label{tbl-dns-indexers}DNS message indexers}
\end{center}
\end{table}
Table~\ref{tbl-dns-indexers} shows the currently-defined indexers for DNS messages, and here are their descriptions:
\begin{description}
\item[certain\_qnames] This indexer isolates the two most popular query names seen by DNS root servers: {\em localhost\/} and {\em [a--m].root-servers.net\/}.
\item[client\_subnet] Groups DNS messages together by the subnet of the client's IP address. The subnet is masked by /24 for IPv4 and by /96 for IPv6. We use this to make datasets with large, diverse client populations more manageable and to provide a small amount of privacy and anonymization.
\item[client] The IP (v4 and v6) address of the DNS client.
\item[do\_bit] This indexer has only two values: 0 or 1. It indicates whether or not the ``DO'' bit is set in a DNS query. According to RFC 3225: {\em Setting the DO bit to one in a query indicates to the server that the resolver is able to accept DNSSEC security RRs.}
\item[edns\_version] The EDNS version number, if any, in a DNS query. EDNS Version 0 is documented in RFC 2671.
\item[idn\_qname] This indexer has only two values: 0 or 1. It returns 1 when the first QNAME in the DNS message question section is an internationalized domain name (i.e., containing non-ASCII characters). Such QNAMEs begin with the string {\tt xn--\/}. This convention is documented in RFC 3490.
\item[msglen] The overall length (size) of the DNS message.
\item[null] A ``no-op'' indexer that always returns the same value. This can be used to effectively turn the 2-D table into a 1-D array.
\item[opcode] The DNS message opcode is a four-bit field. QUERY is the most common opcode. Additional currently defined opcodes include: IQUERY, STATUS, NOTIFY, and UPDATE.
\item[qclass] The DNS message query class (QCLASS) is a 16-bit value. IN is the most common query class. Additional currently defined query class values include: CHAOS, HS, NONE, and ANY.
\item[qname] The full QNAME string from the first (and usually only) QNAME in the question section of a DNS message.
\item[qnamelen] The length of the first (and usually only) QNAME in a DNS message question section.
Note this is the ``expanded'' length if the message happens to take advantage of DNS message ``compression.''
\item[qtype] The query type (QTYPE) for the first QNAME in the DNS message question section. Well-known query types include: A, AAAA, A6, CNAME, PTR, MX, NS, SOA, and ANY.
\item[query\_classification] A stateless classification of ``bogus'' queries:
\begin{itemize}
\setlength{\itemsep}{0ex plus 0.5ex minus 0.0ex}
\item non-auth-tld: when the TLD is not one of the IANA-approved TLDs.
\item root-servers.net: a query for a root server IP address.
\item localhost: a query for the localhost IP address.
\item a-for-root: an A query for the DNS root (.).
\item a-for-a: an A query for an IPv4 address.
\item rfc1918-ptr: a PTR query for an RFC 1918 address.
\item funny-class: a query with an unknown/undefined query class.
\item funny-qtype: a query with an unknown/undefined query type.
\item src-port-zero: when the UDP message's source port equals zero.
\item malformed: a malformed DNS message that could not be entirely parsed.
\end{itemize}
\item[rcode] The RCODE value in a DNS response. The most common response codes are 0 (NO ERROR) and 3 (NXDOMAIN).
\item[rd\_bit] This indexer returns 1 if the RD (recursion desired) bit is set in the query. Usually only stub resolvers set the RD bit. Usually authoritative servers do not offer recursion to their clients.
\item[tc\_bit] This indexer returns 1 if the TC (truncated) bit is set (in a response). An authoritative server sets the TC bit when the entire response won't fit into a UDP message.
\item[tld] The TLD of the first QNAME in a DNS message's question section.
\item[transport] Indicates whether the DNS message is carried via UDP or TCP\@.
\item[dns\_ip\_version] The IP version number that carried the DNS message.
\end{description}
\subsection{DNS Filters}
You must specify one or more of the following filters (separated by commas) on the {\tt dataset\/} line:
\begin{description}
\item[any] The no-op filter, counts all messages.
\item[queries-only] Count only DNS query messages. A query is a DNS message where the QR bit is set to 0.
\item[replies-only] Count only DNS response messages. A response is a DNS message where the QR bit is set to 1.
\item[popular-qtypes] Count only DNS messages where the query type is one of: A, NS, CNAME, SOA, PTR, MX, AAAA, A6, ANY.
\item[idn-only] Count only DNS messages where the query name is in the internationalized domain name format.
\item[aaaa-or-a6-only] Count only DNS messages where the query type is AAAA or A6.
\item[root-servers-net-only] Count only DNS messages where the query name is within the {\em root-servers.net\/} domain.
\item[chaos-class] Count only DNS messages where QCLASS is equal to CHAOS (3). The CHAOS class is generally used for only the special {\em hostname.bind\/} and {\em version.bind\/} queries.
\end{description}
\noindent
Note that multiple filters are ANDed together. That is, they narrow the input stream, rather than broaden it. In addition to these pre-defined filters, you can add your own custom filters.
\subsubsection{qname\_filter}
\label{sec-qname-filter}
The {\em qname\_filter} directive defines a new filter that uses regular expression matching on the QNAME field of a DNS message. This may be useful if you have a server that is authoritative for a number of zones, but you want to limit your measurements to a small subset. The {\em qname\_filter} directive takes two arguments: a name for the filter and a regular expression. For example:
\begin{MyVerbatim}
qname_filter MyFilterName example\.(com|net|org)$ ;
\end{MyVerbatim}
This filter matches queries (and responses) for names ending with {\em example.com\/}, {\em example.net\/}, and {\em example.org\/}.
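Because the second argument is an ordinary regular expression, it can be handy to check a candidate pattern against a few query names before putting it in {\em dsc.conf\/}. The sketch below does this in Python; note that {\tt dsc\/} compiles the expression with its own regex library, so edge-case semantics may differ slightly from Python's:

```python
import re

# The same pattern used in the qname_filter example above.
# This only illustrates which QNAMEs the pattern is intended
# to match; dsc itself applies its own regex engine.
pattern = re.compile(r'example\.(com|net|org)$')

for qname in ('www.example.com', 'example.org', 'example.edu'):
    # search() looks for the pattern anywhere in the name;
    # the trailing $ anchors the match to the end of the QNAME.
    print(qname, 'matches' if pattern.search(qname) else 'does not match')
```

The trailing {\tt \$} anchor is what restricts the filter to names {\em ending\/} in one of the three domains.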
You can reference the named filter in the filters part of a {\em dataset\/} line. For example:
\begin{MyVerbatim}
dataset qtype dns All:null Qtype:qtype queries-only,MyFilterName;
\end{MyVerbatim}
\subsection{Parameters}
\label{sec-dataset-params}
\noindent
{\tt dsc\/} currently supports the following optional parameters:
\begin{description}
\item[min-count={\em NN\/}] Cells with counts less than {\em NN\/} are not included in the output. Instead, they are aggregated into the special values {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/}. This helps reduce the size of datasets with a large number of small counts.
\item[max-cells={\em NN\/}] A different, perhaps better, way of limiting the size of a dataset. Instead of trying to determine an appropriate {\em min-count\/} value in advance, {\em max-cells\/} allows you to put a limit on the number of cells to include for the second dataset dimension. If the dataset has 9 possible first-dimension values, and you specify a {\em max-cells\/} value of 100, then the dataset will not have more than 900 total values. The cell values are sorted and the top {\em max-cells\/} values are output. Values that fall below the limit are aggregated into the special {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/} entries.
\end{description}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Data Storage}
\section{XML Structure}
A dataset XML file has the following structure:
\begin{MyVerbatim}
<array name="dataset-name" dimensions="2"
    start_time="unix-seconds" stop_time="unix-seconds">
  <dimension number="1" type="Label1"/>
  <dimension number="2" type="Label2"/>
  <data>
    <Label1 val="D1-V1">
      <Label2 val="D2-V1" count="N1"/>
      <Label2 val="D2-V2" count="N2"/>
      <Label2 val="D2-V3" count="N3"/>
    </Label1>
  </data>
</array>
\end{MyVerbatim}
\noindent
{\em dataset-name\/}, {\em Label1\/}, and {\em Label2\/} come from the dataset definition in {\em dsc.conf\/}. The {\em start\_time\/} and {\em stop\_time\/} attributes are given in Unix seconds. They are normally 60 seconds apart. {\tt dsc} usually starts a new measurement interval on 60-second boundaries. That is:
\begin{equation}
stop\_time \bmod 60 = 0
\end{equation}
The LABEL1 VAL attributes ({\em D1-V1\/}, {\em D1-V2\/}, etc.) are values for the first dimension indexer.
Similarly, the LABEL2 VAL attributes ({\em D2-V1\/}, {\em D2-V2\/}, {\em D2-V3\/}) are values for the second dimension indexer. For some indexers these values are numeric, for others they are strings. If the value contains certain non-printable characters, the string is base64-encoded and the optional BASE64 attribute is set to 1. There are two special VALs that help keep large datasets down to a reasonable size: {\tt -:SKIPPED:-\/} and {\tt -:SKIPPED\_SUM:-\/}. These may be present on datasets that use the {\em min-count\/} and {\em max-cells\/} parameters (see Section~\ref{sec-dataset-params}). {\tt -:SKIPPED:-\/} is the number of cells that were not included in the XML output. {\tt -:SKIPPED\_SUM:-\/}, on the other hand, is the sum of the counts for all the skipped cells. Note that ``one-dimensional datasets'' still use two dimensions in the XML file. The first dimension type and value will be ``All'', as shown in the example below. The {\em count\/} values are always integers. If the count for a particular tuple is zero, it should not be included in the XML file. Note that the contents of the XML file do not indicate where it came from. In particular, the server and node that it came from are not present. Instead, DSC relies on the presenter to store XML files in a directory hierarchy with the server and node as directory names. \noindent Here is a short sample XML file with real content: \begin{MyVerbatim} \end{MyVerbatim} \noindent Please see \path|http://dns.measurement-factory.com/tools/dsc/sample-xml/| for more sample XML files. The XML is not very strict and might cause XML purists to cringe. {\tt dsc} writes the XML files the old-fashioned way (with printf()) and reads them with Perl's XML::Simple module. Here is a possibly-valid DTD for the dataset XML format. 
Note, however, that the {\em LABEL1\/} and {\em LABEL2\/} strings are different for each dataset:
\begin{MyVerbatim}
<!DOCTYPE array [
  <!ELEMENT array (dimension, dimension, data)>
  <!ATTLIST array
      name       CDATA #REQUIRED
      dimensions CDATA #REQUIRED
      start_time CDATA #REQUIRED
      stop_time  CDATA #REQUIRED>
  <!ELEMENT dimension EMPTY>
  <!ATTLIST dimension
      number CDATA #REQUIRED
      type   CDATA #REQUIRED>
  <!ELEMENT data (LABEL1*)>
  <!ELEMENT LABEL1 (LABEL2*)>
  <!ATTLIST LABEL1
      val    CDATA #REQUIRED
      base64 CDATA #IMPLIED>
  <!ELEMENT LABEL2 EMPTY>
  <!ATTLIST LABEL2
      val    CDATA #REQUIRED
      count  CDATA #REQUIRED
      base64 CDATA #IMPLIED>
]>
\end{MyVerbatim}
\subsection{XML File Naming Conventions}
{\tt dsc\/} relies on certain file naming conventions for XML files. The file name should be of the format:
\begin{quote}
{\em timestamp\/}.dscdata.xml
\end{quote}
\noindent
For example:
\begin{quote}
1154649660.dscdata.xml
\end{quote}
NOTE: Versions of DSC prior to 2008-01-30 used a different naming convention. Instead of ``dscdata'' the XML file was named after the dataset that generated the data. The current XML extraction code still supports the older naming convention for backward compatibility. If the second component of the XML file name is not ``dscdata'' then the extractor assumes it is a dataset name.
\noindent
Dataset names come from {\em dsc.conf\/}, and should match the NAME attribute of the ARRAY tag inside the XML file. The timestamp is in Unix epoch seconds and is usually the same as the {\em stop\_time\/} value.
\section{JSON Structure}
The JSON structure mirrors the XML structure, using the same names and values.
\begin{MyVerbatim}
{
  "name": "dataset-name",
  "start_time": unix-seconds,
  "stop_time": unix-seconds,
  "dimensions": [ "Label1", "Label2" ],
  "data": [
    {
      "Label1": "D1-V1",
      "Label2": [
        { "val": "D2-V1", "count": N1 },
        { "val": "D2-V2", "count": N2 },
        { "val": "D2-V3", "count": N3 }
      ]
    },
    {
      "Label1": "D1-V1-base64",
      "base64": true,
      "Label2": [
        { "val": "D2-V1", "count": N1 },
        { "val": "D2-V2-base64", "base64": true, "count": N2 },
        { "val": "D2-V3", "count": N3 }
      ]
    }
  ]
}
\end{MyVerbatim}
\section{Archived Data Format}
{\dsc} actually uses four different file formats for archived datasets. These are all text-based and designed to be quickly read from, and written to, by Perl scripts.
\subsection{Format 1}
\noindent
\begin{tt}time $k1$ $N_{k1}$ $k2$ $N_{k2}$ $k3$ $N_{k3}$ ...
\end{tt} \vspace{1ex}\noindent This is a one-dimensional time-series format.\footnote{Which means it can only be used for datasets where one of the indexers is set to the Null indexer.} The first column is a timestamp (unix seconds). The remaining space-separated fields are key-value pairs. For example: \begin{MyVerbatim} 1093219980 root-servers.net 122 rfc1918-ptr 112 a-for-a 926 funny-qclass 16 1093220040 root-servers.net 121 rfc1918-ptr 104 a-for-a 905 funny-qclass 15 1093220100 root-servers.net 137 rfc1918-ptr 116 a-for-a 871 funny-qclass 12 \end{MyVerbatim} \subsection{Format 2} \noindent \begin{tt}time $j1$ $k1$:$N_{j1,k1}$:$k2$:$N_{j1,k2}$:... $j2$ $k1$:$N_{j2,k1}$:$k2$:$N_{j2,k2}$:... ... \end{tt} \vspace{1ex}\noindent This is a two-dimensional time-series format. In the above, $j$ represents the first dimension indexer and $k$ represents the second. Key-value pairs for the second dimension are separated by colons, rather than space. For example: \begin{MyVerbatim} 1093220160 recv icmp:2397:udp:136712:tcp:428 sent icmp:819:udp:119191:tcp:323 1093220220 recv icmp:2229:udp:124708:tcp:495 sent icmp:716:udp:107652:tcp:350 1093220280 recv udp:138212:icmp:2342:tcp:499 sent udp:120788:icmp:819:tcp:364 1093220340 recv icmp:2285:udp:137107:tcp:468 sent icmp:733:udp:118522:tcp:341 \end{MyVerbatim} \subsection{Format 3} \noindent \begin{tt}$k$ $N_{k}$ \end{tt} \vspace{1ex}\noindent This format is used for one-dimensional datasets where the key space is (potentially) very large. That is, putting all the key-value pairs on a single line would result in a very long line in the datafile. Furthermore, for these larger datasets, it is prohibitive to store the data as a time series. Instead the counters are incremented over time. 
For example: \begin{MyVerbatim} 10.0.160.0 3024 10.0.20.0 92 10.0.244.0 5934 \end{MyVerbatim} \subsection{Format 4} \noindent \begin{tt}$j$ $k$ $N_{j,k}$ \end{tt} \vspace{1ex}\noindent This format is used for two-dimensional datasets where one or both key spaces are very large. Again, counters are incremented over time, rather than storing the data as a time series. For example: \begin{MyVerbatim} 10.0.0.0 non-auth-tld 105 10.0.0.0 ok 37383 10.0.0.0 rfc1918-ptr 5941 10.0.0.0 root-servers.net 1872 10.0.1.0 a-for-a 6 10.0.1.0 non-auth-tld 363 10.0.1.0 ok 144 \end{MyVerbatim} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Bugs} \begin{itemize} \item Seems too confusing to have an opaque name for indexers in dsc.conf dataset line. The names are pre-determined anyway since they must match what the XML extractors look for. \item Also stupid to have indexer names and a separate ``Label'' for the XML file. \item {\dsc} perl modules are installed in the ``site\_perl'' directory but they should probably be installed under /usr/local/dsc. \item {\dsc} collector silently drops UDP frags \end{itemize} \end{document}