This is a discussion on Top 100 most popular words on the site within the Puff Banter forums, part of the Everything But Cigars category.
Skip past this first paragraph if you want the layman's description.
Hey guys, I was doing a little bit of high-performance computing and had written a parsing program to convert HTML to markup text and count the frequency of words in a given HTML file. This was run on x number of files, and I output the overall top 100 most frequent words that are 8 characters or longer.
From a random sample of 10,000 pages on puff.com I have compiled a list of the top 100 words that are 8 or more characters in length. These are sorted in descending order of frequency, and each word is listed with the number of times it was found in the 10,000 files. Hope this is interesting, because the execution took about 2.5 hours on my computing cluster.
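For anyone curious what a count like this involves, here is a minimal Python sketch of the general idea, not the actual program that was run on the cluster. It assumes the pages are already available as HTML strings; the names `TextExtractor`, `html_to_text`, `count_words`, `top_words`, and the two sample pages are all made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from one HTML string and return its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_words(html, min_len=8):
    """Count lowercase alphabetic words of at least min_len characters."""
    words = re.findall(r"[a-z]+", html_to_text(html).lower())
    return Counter(w for w in words if len(w) >= min_len)


def top_words(pages, min_len=8, n=100):
    """Sum the per-page counts and return the n most frequent words."""
    total = Counter()
    for html in pages:
        total.update(count_words(html, min_len))
    return total.most_common(n)


# Made-up sample pages, only to show the shape of the output.
sample_pages = [
    "<html><body><p>Maduro wrappers and more wrappers</p></body></html>",
    "<html><body><script>var wrappers = 1;</script>"
    "<p>Wrappers humidor humidors</p></body></html>",
]

for word, count in top_words(sample_pages):
    print(word, count)
```

Because each page is counted independently and the counts are simply added together, this kind of job splits naturally across the nodes of a cluster, with one final merge of the per-node counters.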
Would it be hard to lower the parameters to 5- or 6-letter words or greater?
No, it would not. I can rerun it tonight. We have other people on the cluster right now, so I'd like to wait until they get off. I'll set up a cron job and run it at like 1 a.m. So we will go with a 5-character minimum, since "cigar" is five characters? I was trying to avoid words like "has, a, go, forums" without making a gigantic list of words to avoid. I will post the results when they finish computing.
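To illustrate the trade-off between a length cutoff and a list of words to avoid, here is a small Python sketch. It is only an assumption about how such a filter could look: `STOP_WORDS`, `filter_counts`, and the counts themselves are invented for the example and are not results from the site.

```python
from collections import Counter

# Hypothetical stop list; a real one would need tuning for the site.
STOP_WORDS = {"forums", "would", "there", "their", "about"}


def filter_counts(counts, min_len=5, stop_words=STOP_WORDS):
    """Keep words that meet the length cutoff and are not on the stop list."""
    return Counter({
        word: n
        for word, n in counts.items()
        if len(word) >= min_len and word not in stop_words
    })


# Made-up counts, only to show the effect of the filter.
counts = Counter({"cigar": 40, "forums": 35, "has": 90, "go": 50, "padron": 12})

print(filter_counts(counts).most_common())
```

With a 5-character minimum the short words drop out on length alone, so the stop list only has to cover longer common words such as "forums".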
Very cool work, Chase. I don't see a reason to run it at a five-character minimum just to see how many times "cigar" shows up. Now, there might be a good reason if we know of important words that are five characters. I.e., I'd like to see which brands are mentioned the most, not really bombs, pass, cigars.