PHP, strings & word count - is it possible?
vosgien
08-25-2004, 08:55 PM
Hi,
I have been learning all about strings in PHP and what can be done with the functions available. However, I want to be able to count the number of words in a file that has been uploaded by a user.
I know how to use strlen to count the number of characters in the file, and how to remove spaces to get the exact number of letters; however, I want to return the number of words.
The files can be in any format and all words will be separated by a space. I have looked everywhere for a function or some help on the code. I'm guessing that I will have to explode the entire file into an array using the spaces as the separator and count the elements.
This however could be time consuming as I expect the docs to be several pages of text - any ideas or help from the forum will be more than useful.
Cheers
Vosgien
Curly Brace
08-25-2004, 09:42 PM
I don't think any other way will be faster than using the explode function.
...
Yeah, I've written 2 funcs. Check them out:
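// Counts words by counting the spaces between them.
function countWords($string){
    $words = 0;
    // Use < rather than <= so $i never runs past the last character
    for($i = 0; $i < strlen($string); $i++){
        if($string[$i] == " "){
            $words++;
        }
    }
    return $words + 1;
}
// Counts words by splitting the string on spaces.
function countWords2($string){
    $a = explode(" ", $string);
    return count($a);
}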
I tested them with a 7000-word string and they both executed in less than a second. Hmm, which do you like more? ;)
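A note for later readers: both functions above count separators, so two spaces in a row or a trailing newline will inflate the result. Here is a minimal sketch that splits on any run of whitespace instead. The name countWordsByWhitespace is made up for this example and is not from the thread.

```php
<?php
// Hypothetical helper: counts whitespace-separated tokens in a string.
function countWordsByWhitespace(string $text): int
{
    // \s+ matches runs of spaces, tabs and newlines.
    // PREG_SPLIT_NO_EMPTY drops the empty pieces that leading or
    // trailing whitespace would otherwise produce.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($tokens);
}
```

Stand-alone punctuation such as "?" still counts as a token here, so this only addresses the spacing problem.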
vosgien
08-25-2004, 10:43 PM
Hi,
Like them both, I had already settled on the second, tho' I hadn't gotten around to tying it up in a function - looks much neater than my effort. Thanks
Next problem is I need to strip out all punctuation marks so that I capture the number of actual words ( this is for a translation site - owners want the words counted in uploaded files - can't see why they cannot open the docs in Word or something and do a word count !!)
Anyways I tried this ( and variations of)
/*first of all I created my pretend uploaded file
as you can see I put punctuation marks in odd places*/
$message = "this is. the, message; to ? count the & length";
/*including the ? and the & this string has 10 words - so
I added this*/
$no = array(",","?","&",".",";"," ");
$newMessage = str_replace($no,"",$message);
$newMessageX = explode(' ',$newMessage);
$numWords = count($newMessageX);
/*this still returns a count of 10 ?*/
Your first function has given me an idea in that I could do this
if($string[$i] == " "||$string[$i] == "?"){
/*and so on to take care of all punc. marks - or could I do this*/
if($string[$i] == $no ){
Not sure if that will work quite the way I want it to.
Anyway, it is pretty late here so I am done for today. I will test that out tomorrow; any more suggestions are welcome, and thanks for the functions
Cheers
Vosgien
Curly Brace
08-25-2004, 11:06 PM
Here's an optimised version of the second function. It returns 8 for your string.
function countWords2($string){
    $a = explode(" ", $string);
    $trash_count = 0;
    $trash = array(",","?","&",".",";");
    for($i=0; $i<count($a); $i++){
        // Loop over the punctuation list, so the bound is count($trash)
        for($j=0; $j<count($trash); $j++){
            if($a[$i] == $trash[$j]){
                $trash_count++;
            }
        }
    }
    return count($a)-$trash_count;
}
Have a nice day :)
freddycodes
08-26-2004, 02:37 AM
Why don't you use str_word_count?
print str_word_count($message);
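For anyone checking what str_word_count actually counts, here is a short sketch using the test string from earlier in this thread. The second argument selects the return format: 0 (the default) gives the count and 1 gives the list of words.

```php
<?php
$message = "this is. the, message; to ? count the & length";

// Format 0 (the default) returns just the number of words.
$count = str_word_count($message);

// Format 1 returns the words themselves, handy for seeing what was counted.
$words = str_word_count($message, 1);
```

By default the function treats only letters, apostrophes and hyphens as word characters, so digits split words and how non-ASCII letters are handled depends on the locale. That may matter for a translation site.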
vosgien
08-26-2004, 07:29 AM
Hi,
Thanks for your efforts Curly, but freddycodes' answer is exactly what I had been looking for. As I said in my original post, I searched high and low for that function; I somehow knew it existed but blowed if I could find it - probably staring me in the face !!!
Thank you again, and thank you freddycodes (once more)
Cheers
Vosgien
vosgien
08-26-2004, 04:51 PM
Hi,
I'm having problems getting either of these two to work, if I run the str_word_count($message) as in the code above, it works perfectly and the same with your function Curly, but when I run them in a script that uploads files from the user, both lots of code return 5 ??? (there are 142 words in the doc I am testing with - it is an unzipped word doc)
So this is how my code looks :
<?php
$userfile = $HTTP_POST_FILES['userfile']['tmp_name'];
$userfile_name = $HTTP_POST_FILES['userfile']['name'];
$userfile_size = $HTTP_POST_FILES['userfile']['size'];
$userfile_type = $HTTP_POST_FILES['userfile']['type'];
$userfile_error = $HTTP_POST_FILES['userfile']['error'];
/*above code receives the file, I then run a series of checks to make sure that the form has been completed correctly and that a file has been uploaded
I then use $user_error to make sure the file has loaded fully etc etc. Before moving the file I have run*/
str_word_count($userfile );
/* returns 5
and*/
function countWords2($userfile){
$a = explode(" ", $userfile);
$trash_count=0;
$trash = array(",","?","&",".",";");
for($i=0; $i<count($a); $i++){
for($j=0; $j<count($a); $j++){
if($a[$i] == $trash[$j]){
$trash_count++;
}
}
}
$words = count($a)-$trash_count;
}
function countWords2($userfile);
/* also returns 5*/
So, there must be something adrift in the way that I am using the functions and I cannot figure out what to do about it. Also, how can I make the variable $words available to me outside the function? Should I declare it as part of the function or declare it before I call the function?
Thanks again
Vosgien
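On the question of getting $words out of the function: the usual approach is to return the value and assign it at the call site, not to declare a variable beforehand. Here is a minimal sketch along the lines of countWords2 from this thread; the name countWordsReturned is invented for the example.

```php
<?php
// Hypothetical variant of countWords2 that returns its result.
function countWordsReturned(string $text): int
{
    $trash  = array(",", "?", "&", ".", ";");
    $tokens = explode(" ", $text);
    $trashCount = 0;
    foreach ($tokens as $token) {
        // Count tokens that consist of a single punctuation mark.
        if (in_array($token, $trash, true)) {
            $trashCount++;
        }
    }
    return count($tokens) - $trashCount;
}

// The caller receives the value; $words now exists outside the function.
$words = countWordsReturned("this is. the, message; to ? count the & length");
```

A function is called by name alone, without the `function` keyword in front; that keyword is only used when defining it.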
freddycodes
08-26-2004, 05:11 PM
Well, all you are doing is counting the words in the tmp_name of the uploaded file (the path string, not the file's contents); you must do something with the file first.
Like copy it to a tmp directory that is writable by the web server, open it, read it into a string, count the words, then delete it.
I didn't test this code, and just wrote it right now before heading into work, so hopefully there are no parse errors, and it will get you going in the right direction.
//See here for more info on handling uploads
//http://us4.php.net/manual/en/features.file-upload.php
//$_FILES is the superglobal that replaced the older $HTTP_POST_FILES array
$userfile = $_FILES['userfile']['tmp_name'];
$userfile_name = $_FILES['userfile']['name'];
$userfile_size = $_FILES['userfile']['size'];
$userfile_type = $_FILES['userfile']['type'];
$userfile_error = $_FILES['userfile']['error'];
//Path to a folder that is writable by the web server
$uploads = "/some/path/to/some/folder/onthe/server/that/iswritable/";
//Move the tmp file to a new tmp location
if (move_uploaded_file($userfile, $uploads."somenewtempfilename.txt"))
{
    //Start with an empty string so $data exists even if fopen fails
    $data = "";
    //Open it for read
    $fp = fopen($uploads."somenewtempfilename.txt", "r");
    if($fp)
    {
        //Read in the contents to a string
        while(!feof($fp))
        {
            $data .= fgets($fp, 1024);
        }
        fclose($fp);
    }
    //Delete the file
    unlink($uploads."somenewtempfilename.txt");
    print "Word count = ". str_word_count($data);
}
else
{
    print "Error copying file to tmp location";
}
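As a shorter alternative to the fopen/fgets loop above, file_get_contents reads a whole file into a string in one call. Here is a minimal sketch for plain-text files only; the helper name countWordsInTextFile is an assumption for illustration, and for a real upload you would pass it the path of the moved file.

```php
<?php
// Hypothetical helper: returns the word count of a plain-text file,
// or null if the file cannot be read.
function countWordsInTextFile(string $path): ?int
{
    if (!is_readable($path)) {
        return null;
    }
    // Read the whole file into one string.
    $data = file_get_contents($path);
    if ($data === false) {
        return null;
    }
    return str_word_count($data);
}
```

Returning null instead of 0 on failure lets the caller tell "unreadable file" apart from "empty file".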
vosgien
08-27-2004, 04:19 PM
Hi,
Sorry it has taken me a little time to get back. Thanks Freddycodes. As you know I have been enmeshed in PHP for a little while now, feel I am getting on with it really well, and then I do something that underlines the fact that I am a newbie......like try to read the contents of a file without opening it. Pretty dumb really. So thanks again and cheers for the links. Also Curly Brace, thank you for your input
Cheers
Vosgien
vosgien
08-29-2004, 01:44 PM
Hi,
Two questions, the code posted above works very well except that the word count is not correct, it is returning about 5 times the number of words in a file. I echo'd $data to run a secondary check and this is what is returned :
$h === =" = :Q , p_"} &0V . >B,n$ V====|
That is just a short sample of what is showing at the beginning and end of the file and it is obviously affecting the count. So how do I get rid of all those characters? Is this what's called "trailing white space", and can I remove it?
Finally, Freddy if you are still picking up on this thread, can you explain for me what the 1024 does in this line
$data .= fgets($fp, 1024);
I have run the code with and without 1024 and it works the same way, I have searched at php.net, but not found anything yet !
***edit*** so the number in the above code represents the number of bytes to be read, so I replaced it with $userfile_size like this:
$data .= fgets($fp, $userfile_size);
However, this still does not eliminate the characters as shown above which in this file are before and after the text.
In this particular case the file being uploaded and read is a word.doc
I guess that the info I need to remove all of those characters is probably in front of me, so I will have to keep looking !!!!
Cheers
Vosgien
p.s. absolutely no parse errors in that code - thanks again FreddyCodes
freddycodes
08-29-2004, 06:30 PM
Okay, here we go, as I will attempt to explain.
Firstly, what kind of files are these? Are they text files? I think maybe they are MS Word documents, which would explain the funny characters. MS Word creates a binary file format, so this is not going to work with them. Do you have control over what kind of files get uploaded?
The 1024 is the maximum length fgets reads in one call (it stops sooner at the end of a line), so I am just looping over each line and reading it in.
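On the 1024 question: in PHP, fgets($fp, $length) reads at most $length - 1 bytes and stops earlier at a newline or at the end of the file. Here is a small sketch of that behaviour, using an in-memory stream in place of the uploaded file (the sample text is made up):

```php
<?php
// An in-memory stream stands in for the uploaded file.
$fp = fopen('php://memory', 'w+');
fwrite($fp, "abcdefghij\nxyz\n");
rewind($fp);

$first  = fgets($fp, 5);    // stops after length - 1 = 4 bytes: "abcd"
$second = fgets($fp, 1024); // stops at the newline: "efghij\n"
$third  = fgets($fp);       // no length given: reads to the end of the line
fclose($fp);
```

So inside a while(!feof($fp)) loop the exact number hardly matters for text files; a long line just takes more than one call to read.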
vosgien
08-29-2004, 07:27 PM
Hi,
If the 1024 is the number of bytes per line, then using $userfile_size is incorrect ......yes ?
So is 1024 a regular/standard number of bytes per line ?
Would it be better to convert the file type to something like a .txt file ?
However, I have tried this and it still picks up the unwanted characters.
I have also tried using str_replace, however, there is text in with the binary characters and of course I will not know what that text is !!!
I have been looking thru' my copy of O'Reilly's PHP Cookbook, and I am wondering if I can use something like pack() or unpack() to convert the binary code to a string and then remove it before the word count happens, but I am not sure if I can do this, and if I can how !
As far as the files being loaded in are concerned, I have no control over them whatsoever, they could be word.docs, pdf or any other type of text file you can think of (there will never be images of any description, just text).
This is for a site offering language translation services. Translations are priced per word, i.e. 0.085 centimes ( European, not US) per word.
I have suggested to the web site owner that they can (in Word) find the number of words in a file, but I am not sure if the same facility exists in other apps (like PDF for example), and they are insisting that the word count is done at the upload stage so that a preliminary quote can be given in the return/confirmation email that the code sends to the user after the file has been uploaded.
Anyway, back to my books and php.net, I am certain there is a solution to this, equally certain that I/we will find it.
Thanks for your time Freddy
Vosgien
freddycodes
08-29-2004, 10:28 PM
Yeah, 1024 is just a commonly used buffer size. Does filesize return a number in bytes?
Anyways, I am not sure you are going to find an easy solution to this problem, and PHP might not be the right tool for the job.
It's one thing uploading text files and counting words, but binary files are another issue; PDFs will be tough, Word docs will be tough.
Both Word docs and PDFs are compiled to proprietary formats. Your best bet is to do like a resume service might do and require that uploaded text be submitted in RTF or ASCII format.
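If uploads are restricted to a couple of formats as suggested above, one simple first check is the file extension of the original name. Here is a minimal sketch, assuming a whitelist of txt and rtf; the function name and the default list are illustrative only.

```php
<?php
// Hypothetical helper: true if the file name ends in an allowed extension.
function hasAllowedExtension(string $filename, array $allowed = array('txt', 'rtf')): bool
{
    // pathinfo() returns an empty string when there is no extension.
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    return in_array($ext, $allowed, true);
}
```

An extension can be renamed by the user, so this is a convenience check and not a guarantee of the contents. Raw RTF also contains control words such as \rtf1, which would need stripping before any count is trusted.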
vosgien
08-30-2004, 09:09 AM
Hi,
Yes, $userfile_size returns the number of bytes in a file, so should I use that or 1024 ?
I played around with pack() and unpack(). For those that are following this thread and don't know what these two funcs do: the first packs values into a binary string according to a format code, and the second extracts values from a binary string, so
$notWanted = unpack('s4',$data);
foreach($notWanted as $val)
{
print "$val<br />";
}
/*in my case this returns
-12336
-8175
-20063
-7910*/
So, I will take your advice Freddy. Even if I can work out how to remove the binary from a Word file, there will also be PDF docs and others. As I have no idea what files are being loaded in, I think it is possible that there will be some obscure file type that I haven't allowed for; better I think to restrict the user to one or two specific file types.
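A side note on those four numbers: read as signed 16-bit values on a little-endian machine, they correspond to the bytes D0 CF 11 E0 A1 B1 1A E1, the signature that begins OLE compound files such as legacy Word .doc documents. Here is a minimal sketch that uses this to spot such an upload; the constant and function names are made up for the example.

```php
<?php
// The 8-byte signature at the start of an OLE compound file.
const OLE_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

// Hypothetical helper: true if the data starts with the OLE signature.
function looksLikeOleDocument(string $data): bool
{
    return substr($data, 0, 8) === OLE_SIGNATURE;
}

// 'v4' reads four unsigned little-endian 16-bit values; keys start at 1.
// Subtracting 65536 from each gives the signed values shown in the post.
$header = unpack('v4', OLE_SIGNATURE);
```

A check like this only identifies the container format; it does not extract the text inside it.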
So, one more Q that is not related to this thread really, how can I restrict user input to numbers only or letters only ?
I know how to do this in flash, but on this project, I have an html front end and PHP working away in the background, code for each input box is
<td> <strong>Telephone:</strong></td><td><input name="telephone" type="text" id="telephone" size="25" style="background-color: #FFFFCC; border: 1px solid #FF0000; font-family: verdana; font-size:12px; color:#FF0000" >
Anyways, thanks for your help Freddy, this has been quite an instructive thread
Cheers
Vosgien
freddycodes
08-30-2004, 11:08 AM
Yes since you don't have the looping to worry about you can use the filesize.
//Open it for read
//If it's binary, open it for binary read = "rb"
// Else regular "r"
$fp = fopen($uploads."somenewtempfilename.txt", "rb");
if($fp)
{
    //If it's binary, use fread
    $data = fread($fp, $userfile_size);
    // Else use fgets (note that fgets stops at the first newline)
    //$data = fgets($fp, $userfile_size);
    fclose($fp);
}
As for the other thing, you can use javascript to handle this. Or if you want to use PHP use regular expressions. Here are two methods.
Javascript. The trick is deciding when you want to check to see that its only numbers.
This would be the most similar to actionscript.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Untitled</title>
<script language="Javascript">
function StoreOldValue(obj)
{
obj.setAttribute('oldval', obj.value);
}
function CheckNewValue(obj, type)
{
var error = "";
switch(type)
{
case 'int':
var reg = new RegExp(/^[\d-]+$/);
if(!reg.test(obj.value))
error = "This is a numerical value.\nPlease use only numbers.";
break;
}
if(error != "")
{
obj.value = obj.getAttribute('oldval');
alert(error);
}
}
</script>
</head>
<body>
<form>
<strong>Telephone:</strong><input onBlur="CheckNewValue(this, 'int');" onFocus="StoreOldValue(this);" name="telephone" type="text" id="telephone" size="25" style="background-color: #FFFFCC; border: 1px solid #FF0000; font-family: verdana; font-size:12px; color:#FF0000" value="dddd">
</form>
</body>
</html>
But people can and do disable JavaScript. So here is the PHP version.
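<?php
// Start with an empty list so the form can check it on the first page load
$error = array();
if(isset($_POST['submit']) && $_POST['submit'] != "")
{
    // Fall back to an empty string if the field was not sent at all
    $telephone = isset($_POST['telephone']) ? $_POST['telephone'] : "";
    if(!preg_match("/^[\d-]+$/", $telephone))
    {
        $error['Telephone'] = "<span style=\"color:red;\">Telephone can only contain numbers and dashes, ie. 833-333-3333</span><br>";
    }
}
?>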
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Untitled</title>
</head>
<body>
<form action="<?php print htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="POST">
<?php if(!empty($error['Telephone'])) print $error['Telephone']; ?>
<strong>Telephone:</strong><input name="telephone" type="text" id="telephone" size="25" style="background-color: #FFFFCC; border: 1px solid #FF0000; font-family: verdana; font-size:12px; color:#FF0000" value="dddd"><br />
<input type="submit" name="submit" value="Submit">
</form>
</body>
</html>
Hope that helps.
vosgien
08-30-2004, 12:04 PM
Hi,
Thanks v.much. I already use javascript to check that the text fields are completed. However, because people do disable js and because I check email authenticity in PHP, I think I will resort to doing all the checking in PHP. Next I have to work out how to leave the form completed should the user make an error in only one or two fields. I believe this is called 'saving state' and involves session cookies (am I going in the right direction ?). I think I have enough info at hand to proceed with this, but if I have any probs, I will start another thread
Thanks again
Vosgien
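On keeping the form filled in after an error: when the form posts back to the same script, as in the PHP example earlier in this thread, the submitted values are already in $_POST, so sessions are not strictly required. Here is a minimal sketch of a helper that echoes a previous value back safely; stickyValue is a made-up name and not a PHP built-in.

```php
<?php
// Hypothetical helper: returns the previously submitted value for a field,
// escaped so it is safe to place inside an HTML attribute.
function stickyValue(array $post, string $field): string
{
    $value = isset($post[$field]) ? (string) $post[$field] : '';
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}
```

In the form it would be used as value="<?php echo stickyValue($_POST, 'telephone'); ?>" on the input. Sessions become useful once the data has to survive a redirect or span several pages.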