C++ Help Part Two



Warsaw
March 13th, 2010, 10:55 PM
Ok, so I've been taking Object Oriented Programming in C++ at the local community university and they haven't really done a good job teaching said course. :v:

So, I have to write a program that takes an HTML file, parses it into a text only document, and then lists the key words in order of frequency. I can do the parsing just fine, it's the listing key words in order of frequency that gets me. If anybody can lend me a hand with a piece of sample coding showing me how to sort words in order of frequency then output them to another file, there is +rep and some virtual cookies in it for you.

hry
March 13th, 2010, 11:02 PM
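#include <stdio.h>
#include <conio.h> // Non-standard header (DOS/Windows compilers); provides getch.
int main()
{
printf("I don't know dude");
getch(); // Wait for a key press.
return 0;
}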

Warsaw
March 13th, 2010, 11:21 PM
What I have so far:



#include<cstdlib> // Required for exit.
#include<iostream> // Required for cin, cout, and cerr.
#include<fstream> // Required for ifstream and ofstream.
#include<string> // Required for string.
using namespace std;

int main()
{
// Declare Objects.
char character;
bool text_state(true);
string infile;
string outfile;
ifstream html;
ofstream htmltext;

// Prompt user for name of input file.
cout << "Enter the name of the input file:";
cin >> infile;

// Prompt user for name of output file.
cout << "Enter name of the output file:";
cin >> outfile;

// Open files.
html.open(infile.c_str());
if(html.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
htmltext.open(outfile.c_str());

//Read first character from html file.
html.get(character);

while(!html.eof()) // Loop until the end of the file is reached.
{
//Check state
if(text_state)
{
if(character =='<') // Beginning of a tag.
text_state=false; // Change states.
else
htmltext << character; // Still text, write to file.
}

else

{
// Command state, no output required.
if(character =='>') // End of a tag.
text_state=true; // Change states.
}

// Read next character from html file.
html.get(character);
}
html.close();
htmltext.close();
return 0;
}


I need the function prototypes to sort through the text that was just parsed above and list the key words in order of frequency. I can handle the exclusion of things like "to", "the", etc. I do realise that what's up there will end up being a module and not the main function, so disregard the "int main()" for now. Also, I'm using Dev-C++ because the instructor wants us to, so no Visual Studio shortcuts/vernacular please.

E: I also know that this will end up in an array. First I need to make a list of all the key words (column a, if you will), then it needs to list how many times each of those appears (column b). I then need to take column b and apply a sorting algorithm to it to order it from most common to least common. I know the methodology. What I don't know are the commands, operators, syntax, etc. to do those steps.
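
For what it's worth, the column a / column b idea above could be sketched roughly like this. It is only a minimal sketch: the names WordEntry, addWord, moreFrequent, sortByFrequency and writeTable are made up for illustration, and it leans on std::sort from <algorithm> instead of a hand-written sorting algorithm, which the assignment may or may not allow.

```cpp
#include <algorithm>
#include <cstddef>
#include <ostream>
#include <string>
#include <vector>

// One row of the table: the word (column a) and its count (column b).
struct WordEntry
{
    std::string word;
    int count;
};

// Add one occurrence of word: bump its count if listed, else append a row.
void addWord(std::vector<WordEntry>& table, const std::string& word)
{
    for (std::size_t i = 0; i < table.size(); i++)
    {
        if (table[i].word == word)
        {
            table[i].count++;
            return;
        }
    }
    WordEntry entry;
    entry.word = word;
    entry.count = 1;
    table.push_back(entry);
}

// Ordering for std::sort: higher counts first, ties in alphabetical order.
bool moreFrequent(const WordEntry& a, const WordEntry& b)
{
    if (a.count != b.count)
        return a.count > b.count;
    return a.word < b.word;
}

// Reorder the table from most common to least common.
void sortByFrequency(std::vector<WordEntry>& table)
{
    std::sort(table.begin(), table.end(), moreFrequent);
}

// Write one "word count" row per line to any output stream.
void writeTable(const std::vector<WordEntry>& table, std::ostream& out)
{
    for (std::size_t i = 0; i < table.size(); i++)
        out << table[i].word << " " << table[i].count << "\n";
}
```

Because writeTable takes a std::ostream, the same call should work with cout or with an ofstream opened on the output file. Keeping the word and its count in one struct means a swap during sorting can never separate them.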

Kornman00
March 14th, 2010, 05:30 AM
If this specific assignment is supposed to be for an OOP subject in your class then I want you to go into class on Monday and smack your teacher

I'm assuming you're allowed to use anything in the STL. All you would have to do is use a <string,int> dictionary (or map, as it is called in the STL (http://www.sgi.com/tech/stl/Map.html)). Then when you encounter a keyword (found by the first character after a '<', then terminated either by the first whitespace or a matching '>') you'd either add it to the map or, if it already exists, just increment the "int" data of the dictionary (which represents the count of how many times you found it)

You said in the first post that this is only for HTML keywords so it shouldn't be too difficult to modify your parsing code for those conditions
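
A rough sketch of the <string,int> map idea, in case it helps. It assumes the words to count are simply the whitespace-separated words of text that has already been stripped of tags (not tag names), and countWords is a made-up name, not anything from the STL.

```cpp
#include <map>
#include <sstream>
#include <string>

// Count how often each whitespace-separated word appears in text.
std::map<std::string, int> countWords(const std::string& text)
{
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word)   // >> skips whitespace and reads one word at a time
        ++counts[word];  // operator[] creates a zero count for a new word
    return counts;
}
```

The same while loop should work unchanged on an ifstream, so the text does not have to be loaded into a string first. Note that a map keeps its entries ordered by word, not by count, so the counts still have to be copied out and sorted before they can be listed by frequency.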

CrAsHOvErRide
March 14th, 2010, 11:49 AM
Your teacher is missing the OO in OOP :P

For me: HTML Parser class with Keyword list.

Warsaw
March 14th, 2010, 01:00 PM
^ That's too easy, and the teacher would probably give me a D for that. I'd do it too, but I have to use the same program with at least two different HTML files, which means an extra long (versus an already long) keyword list. I think the point of this exercise (which is my mid-term) is to get us to use arrays and strings. The only problem is that she didn't teach us jack all about this shit.

@Korn: Yeah, she didn't impose any limitations on how we achieve the end result, just that we get there with one program. Also, I got that down, I just don't know how to structure it. The teacher basically named out classes and things, but never showed us how to implement any of it. You could say that I'm a total nub at C++, since the most I know about structure is what a block is and where statements go. I've been reading my text book for the past 12 hours and I still can't figure out how to manage this. That link you posted was somewhat helpful though, thanks.

Limited
March 14th, 2010, 02:23 PM
Could you not just add each word to a vector of strings and then loop back and compare them? The map container Kornman linked to would do it well. Hopefully the teacher will think "oh, they've actually done some research into it to use a pre-built container", rather than "oh, they are lazy using a pre-built container" :D

CrAsHOvErRide
March 14th, 2010, 02:31 PM
Quote from Warsaw: ^ That's too easy, and the teacher would probably give me a D for that. I'd do it too, but I have to use the same program with at least two different HTML files, which means an extra long (versus an already long) keyword list. I think the point of this exercise (which is my mid-term) is to get us to use arrays and strings. The only problem is that she didn't teach us jack all about this shit.


Huh? Why would that be any easier? Your whole code and more would just be moved into its own class with proper OOP. You would still have to write everything on your own.

Warsaw
March 14th, 2010, 03:35 PM
Maybe I'm misunderstanding what you mean (very likely), but I took that to mean that I manually copy down all the words once into an array then have the program go back and increment the count of each word in that array as they appear in the file, then list them in descending count order.

My problem is that I haven't been taught how to code. I'm trying to learn on the fly without actually knowing how to do much to begin with. I know conceptually what I'm supposed to be doing, but I have no idea how to implement those concepts in code. Yeah, teacher did miss the OO part of OOP. She also managed to miss the P part of OOP. :saddowns:

tl;dr kids: don't go to community college hoping to learn anything.

Limited
March 14th, 2010, 04:01 PM
You should use a vector. It's like a dynamic array and is really useful as a container for information.

You'll need to include the vector header though.


#include <string>
#include <vector>

// Defining the vector
std::vector<std::string> strContainer;

// To loop through contents
std::string tmpStr;
for (std::size_t i = 0; i < strContainer.size(); i++)
{
tmpStr = strContainer.at(i);
}

// To add words to the vector
strContainer.push_back(tmpStr);


To remove values from the container you can use pop_back, which removes the most recently added item.

http://www.cplusplus.com/reference/stl/vector/

Warsaw
March 14th, 2010, 06:42 PM
Small update. It's not pretty, but it gets the job done as far as parsing the HTML for text goes. Eliminated anything not whitespace or alphabetic characters, and changed all the upper case ones to lower case.



//
//
//
// Mid-Term Exam Program
//
// The purpose of this program is to parse an HTML file and count the number of
// unique key words that appear in the file.

#include<cstdio> // Required for putchar.
#include<cstdlib> // Required for exit.
#include<iostream> // Required for cin, cout, and cerr.
#include<fstream> // Required for ifstream and ofstream.
#include<string> // Required for string.
#include<cctype> // Required for tolower, isalpha, isupper.
using namespace std;

// Define constants and declare function prototypes.

int main()
{
// Declare Objects.
char character;
bool text_state = true;
string infile;
string outfile1;
string outfile2;
ifstream html;
ofstream htmltext;
ifstream itext;
ofstream otext;

// Prompt user for name of input file.
cout << "Enter the name of the input file:";
cin >> infile;

// Prompt user for name of output file.
cout << "Enter name of the output file:";
cin >> outfile1;

// Open files.
html.open(infile.c_str());
if(html.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
htmltext.open(outfile1.c_str());

//Read first character from html file.
html.get(character);

while(!html.eof())
{
//Check state
if(text_state)
{
if(character == '<') // Beginning of a tag.
{
text_state=false; // Change states.
}
else
{
htmltext << character; // Still text, write to file.
}
}

else

{
// Command state, no output required.
if(character == '>') // End of a tag.
text_state=true; // Change states.
}

// Read next character from html file.
html.get(character);
}
html.close();
htmltext.close();

//Input file for refinements same as last output file.
//cin >> outfile1;

// Get name of output file.
cout << "Enter name of final output file: ";
cin >> outfile2;

// Open files.
itext.open(outfile1.c_str());
if(itext.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
otext.open(outfile2.c_str());

// Read first character.
itext.get(character);
cout << "Hi!\n"; // Execution stage indicator.

while(!itext.eof())
{
if (isupper(character))
{
character=tolower(character);
putchar(character);
text_state=true;
otext << character;
}
else
{
if (isalpha(character)||isspace(character))
{
text_state=true;
otext << character;
}
else
{
text_state=false;
}
}
// Get next character.
itext.get(character);
}

itext.close();
otext.close();
return 0;
}


So, someone told me I should use a multiset for this...good idea or bad? If good, how would I take what I have in my text file and put it into said multiset?
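
On the multiset question, here is a minimal sketch of how words could be loaded into one, assuming the cleaned-up text contains only lowercase words separated by whitespace. The names loadWords and tally are invented for this example.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Put every whitespace-separated word of text into a multiset.
std::multiset<std::string> loadWords(const std::string& text)
{
    std::multiset<std::string> words;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        words.insert(word);  // duplicates are kept, one entry per occurrence
    return words;
}

// Collapse the multiset into (word, count) pairs, one pair per distinct word.
std::vector<std::pair<std::string, int> > tally(const std::multiset<std::string>& words)
{
    std::vector<std::pair<std::string, int> > result;
    std::multiset<std::string>::const_iterator it = words.begin();
    while (it != words.end())
    {
        int n = static_cast<int>(words.count(*it));
        result.push_back(std::make_pair(*it, n));
        it = words.upper_bound(*it);  // jump past all copies of this word
    }
    return result;
}
```

The same while loop should work directly on an ifstream opened on the text file. A multiset does the job, but it stores every copy of every word, whereas a map from word to count stores each word once, so which one is the better idea probably depends on what the assignment is meant to exercise.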

Warsaw
May 8th, 2010, 07:12 PM
I hate to bump an old thread, but it's not worth creating a new one just for this. I can't get it to output to the very last file at the end of the program. Halp.



/*----------------------------------------------------------------------------*/
//
// Mid-Term Exam Program
//
// The purpose of this program is to parse an HTML file and count the number of
// unique key words that appear in the file.
//
/*----------------------------------------------------------------------------*/

#include <cstdio> // Required for putchar.
#include <cstdlib> // Required for exit.
#include <iostream> // Required for cin, cout, and cerr.
#include <fstream> // Required for ifstream and ofstream.
#include <string> // Required for string.
#include <cctype> // Required for tolower, isalpha, isupper.
#include <vector> // Required for vector <>.
using namespace std;

int main()
{
// Declare Objects.
char character;

bool text_state = true;

string infile;
string storage;

ifstream input;
ofstream store;

// Prompt user for name of the input file.
cout << "Enter the name of the input file:";
cin >> infile;

// Prompt user for name of the storage file.
cout << "Enter the name of the storage file:";
cin >> storage;

// Open files.
input.open(infile.c_str());
if(input.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
store.open(storage.c_str());

//Read first character from html file.
input.get(character);

while(!input.eof())
{
//Check state
if(text_state)
{
if(character == '<') // Beginning of a tag.
{
text_state=false; // Change states.
}
else
{
store << character; // Still text, write to file.
}
}

else

{
// Command state, no output required.
if(character == '>') // End of a tag.
text_state=true; // Change states.
}

// Read next character from html file.
input.get(character);
}
input.close();
store.close();
/*----------------------------------------------------------------------------*/
//
// This section removes non-alphabetic characters and converts uppercase letters
// to lowercase form. It also preserves whitespace.
//
/*----------------------------------------------------------------------------*/
// Declare objects.
string outfile;

ifstream input2;
ofstream output;

// Prompt users for the name of the final output file.
cout << "Enter the name of the final output file:";
cin >> outfile;

// Open the storage file.
input2.open(storage.c_str());
if(input2.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
output.open(outfile.c_str());

// Read first character.
input2.get(character);
// cout << "Hi!\n" << endl; // Execution stage indicator.

while(!input2.eof())
{
if (isupper(character))
{
character=tolower(character);
putchar(character);
text_state=true;
output << character;
}
else
{
if (isalpha(character)||isspace(character))
{
text_state=true;
output << character;
}
else
{
text_state=false;
}
} // End of "else".

// Get next character.
input2.get(character);

} // End of "while".

input2.close(); // store was already closed above; input2 is the stream still open.
output.close();


/*----------------------------------------------------------------------------*/
//
// This next segment will parse words from the output file and list them in
// descending order of frequency in a new file.
//
/*----------------------------------------------------------------------------*/
// Declare objects.
char wordchar; // characters that will make up each word

bool word_state;

string keywords;
string tempWord;
string testWord;

vector<string> WordList;
vector<int> WordCount;

ifstream input3;
ofstream outfinal;

// Prompt for filenames and open the files.
cout << "Enter name of file for the final list of keywords: ";
cin >> keywords;

// This next line is the previous output file.
input3.open(outfile.c_str());
if(input3.fail())
{
cout << "Error opening input file";
exit(1);
}

outfinal.open(keywords.c_str());

// Get the first character from the input file.
input3.get(wordchar);

// Parse for unique words.
do
{
if(isspace(wordchar))
{
for(int i=0; i<WordList.size(); i++)
{
if(WordList[i].compare(testWord) == 0)
{
WordCount[i] += 1;
break;
}
else
{
WordList.push_back(testWord);
break;
}
}
}
else
{
testWord += wordchar; // continues building current word.
}
}
while (!input3.eof());

// Close input file.
input3.close();

//Declarations for bubble sort algorithm.
bool flag = true;
int tempCount;

// Bubble sorter.
//for(int i=1; (i<=WordList.size()) && flag; i++)
while(flag = false)
{
//flag = false;
for(int j=0; j<((WordList.size())-2); j++)
{
if(WordCount[j+1] < WordCount[j])
{
tempCount = WordCount[j];
WordCount[j] = WordCount[j+1];
WordCount[j+1] = tempCount;

tempWord = WordList[j];
WordList[j] = WordList[j+1];
WordList[j+1] = tempWord;
flag = true;
}
}
}

for(int i=0; i<WordList.size(); i++)
{
outfinal << WordList[i] << " occurs " << WordCount[i] << "\n" << endl;
}

outfinal.close();
return 0;
}

Limited
May 8th, 2010, 07:21 PM
I take it you mean this loop:
for(int i=0; i<WordList.size(); i++)
{
outfinal << WordList[i] << " occurs " << WordCount[i] << "\n" << endl;
}

Personally I access my vectors using WordList.at(i).

Warsaw
May 8th, 2010, 07:23 PM
That could actually possibly be the problem. I feel like a moron now, I had just read about that function too. Let me replace every instance with that function where appropriate, I'll update with results.

Limited
May 8th, 2010, 07:25 PM
It might, but I doubt it. Vectors are basically dynamic arrays, so they can be accessed in the same way.
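
To show the one difference between the two forms: both reach the same element for a valid index, but at() checks the index and throws std::out_of_range when it is too large, while [] with a bad index is undefined behaviour. A small sketch, with checkedWord being a made-up helper name:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Return words.at(i), or a placeholder string if i is not a valid index.
std::string checkedWord(const std::vector<std::string>& words, std::size_t i)
{
    try
    {
        return words.at(i);  // checked access: throws on a bad index
    }
    catch (const std::out_of_range&)
    {
        return "(no such index)";
    }
}
```

So switching from [] to at() would not change the behaviour of a loop whose indexes are already valid; it only turns an out-of-range access into an exception that can be caught or seen in the debugger.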

Warsaw
May 8th, 2010, 07:29 PM
That's what I figured, but it never hurts to try. It didn't work, though. It creates the keywords file that I told it to, but it doesn't actually put anything in it. The program also never actually finishes, it just sits there in the console. I'm not getting any debugging errors through the IDE and I'm not getting any compiler or linker errors, so I have no idea what's going on.

Limited
May 8th, 2010, 07:33 PM
Have you stepped through it to see what it's doing?

Warsaw
May 8th, 2010, 07:35 PM
Define "stepped through it." You mean put breaks in, run it to that point, and if it works move the break up? If so, then yes.

Limited
May 8th, 2010, 08:37 PM
Well yeah, debugged it and gone through and checked the variables along the way.

Warsaw
May 8th, 2010, 09:32 PM
Ok, I've got it fixed and outputting now. Just a few formatting errors needing fixed in the output. Thanks for the tips.
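
For anyone hitting the same symptoms, two things in the listing posted earlier would be enough to explain them, though this is only a reading of the code and not necessarily the fix that was applied. First, the do/while that parses words calls input3.get(wordchar) once before the loop and never again inside it, so wordchar never changes and eof() never becomes true. Second, while(flag = false) assigns false instead of comparing, so the sort body never runs. A minimal sketch of a read loop that always advances, using a made-up function keepLetters that mirrors the lowercase-and-filter stage:

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Keep letters (lowercased) and whitespace from text; drop everything else.
std::string keepLetters(const std::string& text)
{
    std::istringstream in(text);
    std::string cleaned;
    char c;
    while (in.get(c))  // the read is the loop condition, so every pass advances
    {
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isalpha(uc))
            cleaned += static_cast<char>(std::tolower(uc));
        else if (std::isspace(uc))
            cleaned += c;
    }
    return cleaned;
}
```

Putting the get() call in the while condition means the loop cannot forget to read the next character, and it stops as soon as a read fails. The same loop shape should work on an ifstream.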

Kornman00
May 8th, 2010, 10:25 PM
If they're going to make you use the STL, they should at least teach you guys about iterators. Then you don't have to worry about indexing and such as the iterator handles providing a reference to the current element.
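
A minimal sketch of what that looks like, assuming a vector of strings like the WordList in the listing. The function joinWords is invented for this example.

```cpp
#include <string>
#include <vector>

// Join all words with single spaces, walking the vector with an iterator.
std::string joinWords(const std::vector<std::string>& words)
{
    std::string joined;
    for (std::vector<std::string>::const_iterator it = words.begin();
         it != words.end(); ++it)
    {
        if (it != words.begin())
            joined += " ";
        joined += *it;  // *it is the current element; no index needed
    }
    return joined;
}
```

The loop runs from begin() up to, but not including, end(), so there is no size arithmetic to get wrong, and the same pattern works for map, set and multiset, which cannot be indexed with [] by position at all.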

Warsaw
May 8th, 2010, 10:29 PM
Yeah...=|

Ok, so my problem now is that it's not perfectly parsing each word. This means I sometimes get two or more words together counting as one even though in the input source file they have a space or line between them. It's also counting spaces as words.

Here's some updated code:



#include <cstdlib> // Required for exit
#include <iostream> // Required for cin, cout, and cerr.
#include <fstream> // Required for ifstream and ofstream.
#include <string> // Required for string.
#include <cctype> // Required for tolower, isalpha, isupper.
#include <vector> // Required for vector <>.
using namespace std;

// Function Prototypes
int main();

int main()
{
// Declare Objects.
char character;

bool text_state = true;

string infile;
string storage;

ifstream input;
ofstream store;

// Prompt user for name of the input file.
cout << "Enter the name of the input file:";
cin >> infile;

// Prompt user for name of the storage file.
cout << "Enter the name of the storage file:";
cin >> storage;

// Open files.
input.open(infile.c_str());
if(input.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
store.open(storage.c_str());

//Read first character from html file.
input.get(character);

while(!input.eof())
{
//Check state
if(text_state)
{
if(character == '<') // Beginning of a tag.
{
text_state=false; // Change states.
}
else
{
store << character; // Still text, write to file.
}
}

else

{
// Command state, no output required.
if(character == '>') // End of a tag.
{
text_state=true; // Change states.
}
}

// Read next character from html file.
input.get(character);
}
input.close();
store.close();
/*----------------------------------------------------------------------------*/
//
// This section removes non-alphabetic characters and converts uppercase letters
// to lowercase form. It also preserves whitespace.
//
/*----------------------------------------------------------------------------*/
// Declare objects.
string outfile;

ifstream input2;
ofstream output;

// Prompt users for the name of the final output file.
cout << "Enter the name of the final output file:";
cin >> outfile;

// Open the storage file.
input2.open(storage.c_str());
if(input2.fail())
{
cerr << "Error opening the input file\n";
exit(1);
}
output.open(outfile.c_str());

// Read first character.
input2.get(character);
// cout << "Hi!\n" << endl; // Execution stage indicator.

while(!input2.eof())
{
character = tolower(character);
if (isalpha(character)||isspace(character))
{
text_state=true;
output << character;
}
else
{
// Any other character (control characters included) is dropped.
text_state=false;
}

// Get next character.
input2.get(character);

} // End of "while".

input2.close(); // store was already closed above; input2 is the stream still open.
output.close();


/*----------------------------------------------------------------------------*/
//
// This next segment will parse words from the output file and list them in
// descending order of frequency in a new file.
//
/*----------------------------------------------------------------------------*/
// Declare objects.
char wordchar; // characters that will make up each word

string keywords;
string tempWord;
string testWord;

vector<string> WordList;
vector<int> WordCount;

ifstream input3;
ofstream outfinal;

// Prompt for filenames and open the files.
cout << "Enter name of file for the final list of keywords: ";
cin >> keywords;

// This next line is the previous output file.
input3.open(outfile.c_str());
if(input3.fail())
{
cout << "Error opening input file";
exit(1);
}

outfinal.open(keywords.c_str());
if(outfinal.fail())
{
cout << "Error opening keywords file";
exit(1);
}

// Get the first character from the input file.
input3.get(wordchar);

// Parse for unique words.
cout << "Parse for unique words.\n";
do
{
if(isspace(wordchar))
{
int i;
text_state = false;
for(i=0; i<WordList.size(); i++)
{
// Check to see if word is already in the list.
if(WordList.at(i).compare(testWord) == 0)
{
WordCount.at(i) += 1;
break;
}
} // end for

// No match so add the word to the array.
if (i == WordList.size())
{
// Add word to array if it is not already included.
WordList.push_back(testWord);
WordCount.push_back(1);
testWord = ""; //empty temp word variable for next word
}
}
else
{
text_state=true;
// Otherwise, continue building the current word.
testWord += wordchar;
}

input3.get(wordchar);
}
while (!input3.eof());
cout << "end parse for words\n";

// Close input file.
input3.close();

//Declarations for bubble sort algorithm.
bool sorted = false;
int tempCount;

// Bubble sorter.
//for(int i=1; (i<=WordList.size()) && flag; i++)
while(!sorted)
{
// Assume no sorts will take place.
sorted = true;

// Compare and sort if necessary.
for(size_t j=0; j+1<WordList.size(); j++) // j+1 avoids unsigned wrap-around when the list is empty
{
// Sort by WordCount in descending order
if(WordCount.at(j+1) > WordCount.at(j))
{
tempCount = WordCount.at(j);
WordCount.at(j) = WordCount.at(j+1);
WordCount.at(j+1) = tempCount;

// Need to keep the association between the word and its count
tempWord = WordList.at(j);
WordList.at(j) = WordList.at(j+1);
WordList.at(j+1) = tempWord;

sorted = false; // a sort took place
}

} // end of 'for'.
} // end of 'while'.

// Write the sorted list.
for(int i=0; i<WordList.size(); i++)
{
outfinal << WordList.at(i) << " occurs " << WordCount.at(i)
<< " times\n";
}

outfinal.close();
return 0;
}


Problem area is in the last third, I think.
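
Reading the listing, two things look like they would produce exactly those symptoms. The line testWord = "" sits inside the "no match" branch, so when a word is already in the list the old letters stay in testWord and the next word gets glued onto them. And when two whitespace characters come in a row, testWord is empty at the second one and the empty string gets added as a word. A sketch of one possible way around both, keeping the parallel vectors and the character-by-character loop (recordWord and tallyWords are made-up names, and a string stands in for the input file):

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Record one finished word in the parallel vectors; empty words are ignored.
void recordWord(const std::string& word,
                std::vector<std::string>& wordList,
                std::vector<int>& wordCount)
{
    if (word.empty())
        return;  // two spaces in a row produce an empty word; skip it
    for (std::size_t i = 0; i < wordList.size(); i++)
    {
        if (wordList[i] == word)
        {
            wordCount[i] += 1;  // already listed: bump its count
            return;
        }
    }
    wordList.push_back(word);   // first sighting: add it with a count of 1
    wordCount.push_back(1);
}

// Walk text one character at a time, as the listing does with input3.
void tallyWords(const std::string& text,
                std::vector<std::string>& wordList,
                std::vector<int>& wordCount)
{
    std::istringstream in(text);
    std::string testWord;
    char wordchar;
    while (in.get(wordchar))
    {
        if (std::isspace(static_cast<unsigned char>(wordchar)))
        {
            recordWord(testWord, wordList, wordCount);
            testWord = "";  // reset after every word, matched or new
        }
        else
        {
            testWord += wordchar;
        }
    }
    recordWord(testWord, wordList, wordCount);  // word left over at end of input
}
```

The final recordWord call covers input that does not end in whitespace, where the last word would otherwise never be counted.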