Are some feature values in the KDD99 data set wrong?

Question

In the KDD99 data set, a huge number of connections have a value greater than 100 for the 32nd and 33rd features.

I can’t understand how a connection window of 100 connections can produce a value greater than 100. I have consulted a lot of material but found nothing.

Answer 1:

The dataset contains 41 features for each connection.

These features were obtained by preprocessing TCP dump files.

To do so, packet information in the TCP dump file was summarized into connections. Specifically (http://kdd.ics.uci.edu/databases/kddcup99/task.html):

a connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows from a source IP address to a target IP address under some well defined protocol.

Some of the features (the so-called Time-based Traffic Features) were calculated over a 2-second time window.

Other features (the Host-based Traffic Features) were calculated over a historical window measured in number of connections (in this case 100).

Host-based features are useful for attacks which span intervals longer than 2 seconds.

The 2-second and 100-connection values are somewhat arbitrary.

The values of these two classes of features have no upper limit (e.g. the number of connections to the same host within the 2-second interval can be greater than 100).
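To illustrate why a time-based count is not bounded by 100, here is a minimal sketch of a same-host counter over a sliding 2-second window. It is not the original KDD extraction code: the function name `same_host_counts`, the input format (time-sorted `(timestamp, dst_host)` pairs) and the choice to include the current connection in its own count are all assumptions made for the example.

```python
from collections import deque

def same_host_counts(connections, window_seconds=2.0):
    """For each (timestamp, dst_host) pair, count connections to the same
    destination host within the preceding window_seconds.
    Assumes the input is sorted by timestamp and counts the current connection."""
    recent = deque()   # (timestamp, dst_host) pairs still inside the window
    per_host = {}      # dst_host -> number of entries currently in `recent`
    counts = []
    for ts, host in connections:
        # Drop connections that have fallen out of the time window.
        while recent and ts - recent[0][0] > window_seconds:
            _, old_host = recent.popleft()
            per_host[old_host] -= 1
        recent.append((ts, host))
        per_host[host] = per_host.get(host, 0) + 1
        counts.append(per_host[host])
    return counts

# 150 connections to one host within 1.5 seconds: the count passes 100.
burst = [(i * 0.01, "10.0.0.1") for i in range(150)]
print(max(same_host_counts(burst)))
```

Because the window is defined in seconds, the count depends only on how many connections arrive in that interval, so a fast burst can push it well past 100.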

The same "should be" true for:

32. | dst host count     | count of connections having the same destination host
33. | dst host srv count | count of connections having the same destination host and using the same service
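For comparison, the following is a sketch of the two host-based features under the most literal reading of "a window of 100 connections". How the original extractor actually defined the window is not documented, so the function name `dst_host_features`, the `(dst_host, service)` input format and the decision to count only previous connections are assumptions for illustration.

```python
from collections import deque

def dst_host_features(connections, window_size=100):
    """Return a list of (dst_host_count, dst_host_srv_count) tuples, one per
    (dst_host, service) pair, counted over the previous window_size connections.
    Literal reading of the feature descriptions, not the original KDD code."""
    window = deque(maxlen=window_size)  # oldest entries are dropped automatically
    features = []
    for host, service in connections:
        host_count = sum(1 for h, _ in window if h == host)
        srv_count = sum(1 for h, s in window if h == host and s == service)
        features.append((host_count, srv_count))
        window.append((host, service))
    return features

# 300 identical connections: the counts saturate at the window size.
same = [("10.0.0.1", "http")] * 300
print(dst_host_features(same)[-1])
print(dst_host_features(same, window_size=255)[-1])
```

Under this reading both counts are bounded by `window_size`, which is why values above 100 in the published data cannot be explained by a strict 100-connection window alone.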

The problem is that there is no documentation explaining the details of the KDD feature extraction. The main reference is:

A Framework for Constructing Features and Models for Intrusion Detection Systems - WENKE LEE / SALVATORE J. STOLFO

from which it's clear that the bro-ids tool was used:

used Bro as the packet filtering and reassembling engine. We extended Bro to handle ICMP packets, and made changes to its packet fragment inspection modules since it crashed when processing data that contains Teardrop or Ping-of-Death attacks. We used a Bro “connection finished” event handler to output a summarized record for each connection.

and

In the Bro event handlers, we added functions that inspect data exchanges of interactive TCP connections (e.g., telnet, ftp, smtp, etc.). These functions assign values to a set of “content” features to indicate whether the data contents suggest suspicious behavior.

but this is not enough.

In the data set itself, both dst host count and dst host srv count take values in the [0, 255] range.
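The observed range can be checked directly against a copy of the data. The sketch below assumes the usual comma-separated layout with the 41 features in order followed by the label, so features 32 and 33 sit at 0-based columns 31 and 32. The function name `feature_ranges` and the file name in the usage comment are assumptions.

```python
import csv

def feature_ranges(lines, features=(32, 33)):
    """Return {feature_number: (min, max)} for the given 1-based feature numbers.
    `lines` is any iterable of CSV lines, e.g. an open file object."""
    ranges = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        for n in features:
            value = float(row[n - 1])  # convert 1-based feature number to index
            lo, hi = ranges.get(n, (value, value))
            ranges[n] = (min(lo, value), max(hi, value))
    return ranges

# Usage (the file name is an assumption; point it at your local copy):
# with open("kddcup.data_10_percent", newline="") as f:
#     print(feature_ranges(f))
```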

The AI-IDS/kdd99_feature_extractor project on GitHub can extract the 32nd and 33rd features from raw data (take a look at the stats*.cpp files), but with a caveat:

Some feature might not be calculated exactly same way as in KDD