NSL KDD Features from Raw Live Packets?

Posted by 南楼画角 on 2019-11-29 16:53:42

If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor to extract a subset of the KDD features from live traffic or a .pcap file.

This tool was created as part of a university project. I haven't found detailed documentation of the KDD '99 features, so the resulting values may be a bit different from the original KDD. Some of the sources used are mentioned in the README. Also, the implementation is not complete. For example, the content features dealing with payload are not implemented.

It is available in my GitHub repository.
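To give a rough idea of what this kind of extraction involves, here is a minimal sketch of two time-window features, `count` and `srv_count`, as they are commonly described for KDD '99: connections to the same destination host, and to the same service, in the two seconds before the current connection. This is not the code from kdd99extractor. The `Conn` record, the function name, and the choice to include the current connection in its own window are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical connection record: start time (seconds), destination host, service.
Conn = namedtuple("Conn", ["time", "dst_host", "service"])

def window_counts(conns, window=2.0):
    """Return (count, srv_count) per connection over the preceding `window` seconds.

    Assumption: the current connection is counted as part of its own window.
    """
    conns = sorted(conns, key=lambda c: c.time)
    results = []
    start = 0  # index of the oldest connection still inside the window
    for i, cur in enumerate(conns):
        while cur.time - conns[start].time > window:
            start += 1
        recent = conns[start:i + 1]
        count = sum(1 for c in recent if c.dst_host == cur.dst_host)
        srv_count = sum(1 for c in recent if c.service == cur.service)
        results.append((count, srv_count))
    return results

# Made-up example connections.
conns = [
    Conn(0.0, "10.0.0.5", "http"),
    Conn(0.5, "10.0.0.5", "http"),
    Conn(1.0, "10.0.0.9", "http"),
    Conn(3.2, "10.0.0.5", "ssh"),
]
print(window_counts(conns))
```

A real extractor first has to reassemble packets into connections, which is where most of the work (and most of the ambiguity in the feature definitions) lies.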

The 1999 KDD Cup Data is flawed and should not be used anymore

Even this "cleaned up" version (NSL KDD) is not realistic.

Furthermore, many of the "cleanups" they did are not sensible. Real data has duplicates, and the frequency of such records is important. By removing duplicates, you bias your data towards the rarer observations. You must not do this blindly "just because", or even worse, to reduce the data set size.
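A small illustration of the duplicate problem, using made-up toy records (not taken from KDD or NSL-KDD): once duplicates are dropped, the share of each label no longer reflects how often it actually occurred.

```python
from collections import Counter

# Made-up toy records: (feature tuple, label).
records = (
    [(("tcp", "http", "SF"), "normal")] * 90
    + [(("tcp", "private", "S0"), "attack")] * 8
    + [(("icmp", "ecr_i", "SF"), "attack")] * 2
)

def label_share(rows):
    """Fraction of rows carrying each label."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

before = label_share(records)       # with duplicates kept
after = label_share(set(records))   # duplicates removed

print(before)
print(after)
```

In this toy case the attack records go from a tenth of the data to two thirds of it, purely as a side effect of deduplication.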

The biggest issue however remains:

KDD99 is not realistic in any way

It wasn't realistic even in 1999, and the internet has changed a lot since then.

It's not reasonable to use this data set for machine learning. The attacks in it are best detected by simple packet-inspection firewall rules. The attacks are well understood, and appropriate detectors (highly efficient, with a 100% detection rate and 0% false positives) should be available in many cases on modern routers. Such detectors are so omnipresent that these attacks have virtually ceased to exist since 1998 or so.
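To show how simple such a rule can be, take the land attack, one of the attack types labeled in KDD '99: a packet whose source address and port are the same as its destination address and port. The sketch below uses a hypothetical `Packet` record, not any real firewall or capture API.

```python
from collections import namedtuple

# Hypothetical, simplified packet header record.
Packet = namedtuple("Packet", ["src_ip", "src_port", "dst_ip", "dst_port"])

def is_land(pkt):
    """True if source and destination address and port are identical."""
    return pkt.src_ip == pkt.dst_ip and pkt.src_port == pkt.dst_port

print(is_land(Packet("10.0.0.5", 139, "10.0.0.5", 139)))
print(is_land(Packet("10.0.0.7", 51000, "10.0.0.5", 80)))
```

A rule like this needs no training data at all, which is the point.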

If you want real attacks, look for SQL injections and similar. But these won't show up in pcap files, yet the largely undocumented way the KDDCup'99 features were extracted from this...

Stop using this data set.

Seriously, it's useless data. Labeled, large, often used, but useless.

Jhordany

It seems that I am late to reply. But, as other people already answered, the KDD99 data-set is outdated.

I don't know about the usefulness of the NSL-KDD dataset. However, there are a couple of things to consider:

  • When getting information from network traffic, the best you can do is to get statistical information (content-based information is usually encrypted). What you can do is to create your own data-set to describe the behaviors you want to consider as "normal". Then, train the neural network to detect deviations from that "normal" behavior.
  • Keep in mind that even the definition of "normal" behavior changes from network to network and over time.
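The two points above can be sketched as follows. Note that this swaps the neural network for a much simpler per-feature z-score baseline, just to show the idea of learning "normal" from your own statistics and flagging deviations. The feature choice (connections per second, mean bytes per connection), the numbers, and the threshold of 3 are all made-up assumptions.

```python
import statistics

def fit_baseline(normal_rows):
    """Per-feature (mean, standard deviation) from traffic considered normal."""
    columns = list(zip(*normal_rows))
    return [(statistics.mean(col), statistics.pstdev(col)) for col in columns]

def max_z_score(row, baseline):
    """Largest absolute z-score over all features of one observation."""
    scores = []
    for value, (mean, std) in zip(row, baseline):
        if std > 0:
            scores.append(abs(value - mean) / std)
        else:
            # Feature never varied in the baseline: any change is a deviation.
            scores.append(0.0 if value == mean else float("inf"))
    return max(scores)

# Made-up "normal" observations: (connections per second, mean bytes per connection).
normal = [(10, 500), (12, 520), (11, 480), (9, 510), (13, 490)]
baseline = fit_baseline(normal)

THRESHOLD = 3.0  # assumed cut-off, to be tuned per network
for row in [(11, 505), (80, 500)]:
    score = max_z_score(row, baseline)
    print(row, "anomalous" if score > THRESHOLD else "normal")
```

Because "normal" drifts, the baseline would have to be refit for each network and again over time.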

You can have a look at this work, which I was involved in. Besides taking the statistical features of the original KDD, it takes additional features from a real network environment.

The software is available on request and is free for academic purposes. Here are two links to publications:

  1. http://link.springer.com/chapter/10.1007/978-94-007-6818-5_30
  2. http://www.iaeng.org/publication/WCECS2012/WCECS2012_pp30-35.pdf

Thanks!
