First I'd use a CountVectorizer to look at the vocabulary it generates. There'd be words like 'from', 'laptop', 'fast', 'silver', etc. You can use stop words to discard words like these that give us no information. I'd also go ahead and discard 'hard', 'drive', 'hard drive', etc., because I know this is a list of hard drives, so they provide no information. Then we'd have a list of entries like
- Seagate 500Go
- Seagate 120Go
- Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s
- 500Go Seagate etc.
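As a rough sketch of that first step, here is how the vocabulary could be inspected with generic and domain-specific stop words removed. The sample listings, the `domain_stop_words` set, and the whitespace-based `token_pattern` are my own assumptions for illustration. The default tokenizer would split '3.0Gb/s' into pieces, which is why I override it here.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Made-up listings in the spirit of the examples above (assumption).
listings = [
    "Seagate 500Go hard drive for laptop",
    "Fast silver Seagate 120Go hard drive",
    "Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s",
    "500Go Seagate hard drive from laptop",
]

# Words that carry no information for this particular dataset (assumption).
domain_stop_words = {"hard", "drive", "laptop", "fast", "silver"}
stop_words = list(ENGLISH_STOP_WORDS.union(domain_stop_words))

# token_pattern=r"\S+" splits on whitespace only, so tokens such as
# "3.0gb/s" and "7200.12" stay in one piece (text is lowercased by default).
vectorizer = CountVectorizer(stop_words=stop_words, token_pattern=r"\S+")
counts = vectorizer.fit_transform(listings)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Printing the vocabulary like this is mostly a way to spot more uninformative words to add to the stop list before moving on.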
You can then build a list of features from simple patterns: things that end with RPM are likely to give the rotation speed, and the same goes for stuff ending with mb/s or Gb/s (transfer rate). Then I'd discard long alphanumeric strings like '1234FBA5235', which are most likely model numbers and won't give us much information. Now, if you already know the brands of hard drives that appear in your list, like 'Seagate' or 'Kingston', you can use string similarity or simply check whether they are present in the given sentence.

Once that's done you can use clustering to group similar objects together. Objects with similar RPM, capacity (GB), transfer rate (Gb/s) and brand name will be clustered together. Again, if you use something like KMeans you'd have to figure out the best value of K, so you'll have to do some manual work. What you could do is use a scatter plot and eyeball which value of K separates the data best.
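Here is a minimal sketch of those two steps, pattern-based feature extraction followed by KMeans. Everything specific in it is an assumption on my part: the regexes, the `KNOWN_BRANDS` list, the 0.8 similarity cutoff, filling missing values with 0, the made-up Kingston listing, and `n_clusters=2`. Model numbers are dropped implicitly because no pattern matches them.

```python
import re
from difflib import get_close_matches

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

KNOWN_BRANDS = ["seagate", "kingston"]  # assumed to be known in advance

RPM_RE = re.compile(r"(\d+)\s*rpm\b", re.IGNORECASE)
SPEED_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(gb|mb)/s", re.IGNORECASE)
# Capacity units ("Go"/"To" are the French spellings); the lookahead
# stops "3.0Gb/s" from being read as a capacity.
CAPACITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(tb|to|gb|go)\b(?!/s)", re.IGNORECASE)


def find_brand(listing):
    """Return the first known brand that a token closely resembles, else None."""
    for token in listing.lower().split():
        close = get_close_matches(token, KNOWN_BRANDS, n=1, cutoff=0.8)
        if close:
            return close[0]
    return None


def extract_features(listing):
    """Pull brand, RPM, transfer rate and capacity out of one listing."""
    features = {"brand": find_brand(listing), "rpm": 0.0,
                "speed_gbps": 0.0, "capacity_gb": 0.0}
    rpm = RPM_RE.search(listing)
    speed = SPEED_RE.search(listing)
    capacity = CAPACITY_RE.search(listing)
    if rpm:
        features["rpm"] = float(rpm.group(1))
    if speed:
        value = float(speed.group(1))
        is_giga = speed.group(2).lower() == "gb"
        features["speed_gbps"] = value if is_giga else value / 1000
    if capacity:
        value = float(capacity.group(1))
        is_tera = capacity.group(2).lower() in ("tb", "to")
        features["capacity_gb"] = value * 1000 if is_tera else value
    return features


listings = [
    "Seagate 500Go",
    "Seagate 120Go",
    "Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s",
    "500Go Seagate",
    "Kingston 120GB",  # made-up example
]
rows = [extract_features(text) for text in listings]

# Numeric columns plus one indicator column per known brand.
X = np.array([
    [r["capacity_gb"], r["rpm"], r["speed_gbps"]]
    + [float(r["brand"] == brand) for brand in KNOWN_BRANDS]
    for r in rows
])
X_scaled = StandardScaler().fit_transform(X)

# K is a guess here; in practice you would try several values.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
for text, label in zip(listings, labels):
    print(label, text)
```

Scaling matters here because RPM values in the thousands would otherwise swamp the brand indicators. Rerunning the last step with different values of `n_clusters` and plotting two of the columns is one way to do the eyeballing.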
But the problem with the above approach is that if you don't know the list of brands beforehand, you'd be in trouble. In that case I'd use a Bayesian classifier to go through every sentence and get the probability that it is about a hard drive brand. I'd look at two things:
- Look at the data: most of the time the sentence will explicitly mention 'hard drive', and then I'd know it's definitely talking about a hard drive. The chances of something like 'Mercedes Benz hard drive' are slim.
- This is a bit laborious, but I'd write a Python web scraper for Amazon (or, if you can't write one, just Google the most-used hard drive brands and create a list). It would give me a list of entries like 'Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s'. Then for every sentence I'd use something like Naive Bayes to give me the probability that it's a brand. sklearn comes in pretty handy for this kind of thing.
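To make the Naive Bayes idea a bit more concrete, here is a small sketch with sklearn. The training titles are invented stand-ins for what a scraper would return. The `other_titles` negative class is my own assumption, since the classifier needs examples of what is *not* a hard drive listing in order to produce a probability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-ins for scraped hard drive titles (assumption).
hard_drive_titles = [
    "Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s",
    "Seagate 500GB internal hard drive 7200 RPM",
    "Kingston 120GB SATA hard drive",
]
# Unrelated product titles used as the negative class (assumption).
other_titles = [
    "Mercedes Benz floor mats black",
    "Stainless steel kitchen knife set",
    "Cotton t-shirt blue size M",
]

texts = hard_drive_titles + other_titles
labels = [1] * len(hard_drive_titles) + [0] * len(other_titles)

# Bag-of-words counts fed into a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

queries = ["Seagate 120Go hard drive", "Blue cotton floor mats"]
# Classes are sorted, so column 1 holds the probability of label 1.
probabilities = model.predict_proba(queries)[:, 1]
for query, p in zip(queries, probabilities):
    print(f"{p:.2f}  {query}")
```

With only a handful of training titles the probabilities are not meaningful in themselves. The point is just the shape of the pipeline, and a real run would need far more scraped examples for both classes.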