There should be a better word count function, since the current one doesn't count non-ASCII words, or at least it doesn't count Russian words :-/
By the way, the built-in PHP function str_word_count sucks too: even with plain ASCII text it counts standalone - and ' characters as words!
I used the inc/items/model/_item.class.php file from b2evo 2.4.2 as the test text.
str_word_count - 12319
my_word_count - 11220
I was wondering why these results differ by about 10%... until I looked at the actual words returned by
str_word_count( $string, 1 )
All the extra "words" were standalone ' and - characters. I then removed these fake words with this code:
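function array_fix( $array )
{
	$arr = array();
	foreach( $array as $str )
	{
		// Keep only tokens that contain at least one ASCII letter.
		// preg_match() is used because eregi() was removed in PHP 7.
		if( preg_match('/[A-Za-z]/', $str) )
			$arr[] = $str;
	}
	return $arr;
}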
echo str_word_count( implode( "\n", array_fix( str_word_count( $string, 1 ) ) ) );
And it finally displayed the same number as my function:
echo 'str_word_count - '.str_word_count($string);
echo '<br />str_word_count fixed - '.str_word_count( implode( "\n", array_fix( str_word_count( $string, 1 ) ) ) );
echo '<br /> my_word_count - '.my_word_count($string);
str_word_count - 12319
str_word_count fixed - 11220
my_word_count - 11220
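For what it's worth, the same filtering can probably be done without a loop. This is only a sketch: keep_real_words is a made-up name, not something from b2evo or PHP, and it uses preg_grep to keep the tokens that contain at least one ASCII letter.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of b2evo or PHP).
// Keeps only the tokens that contain at least one ASCII letter.
function keep_real_words( array $tokens )
{
	// preg_grep() preserves the original keys, so reindex with array_values().
	return array_values( preg_grep( '/[A-Za-z]/', $tokens ) );
}

// Usage: filter the token list returned by str_word_count( $string, 1 ).
$string = " one and two - ' -- '' ";
echo count( keep_real_words( str_word_count( $string, 1 ) ) );
```

Passing PREG_GREP_INVERT as the third argument to preg_grep returns the rejected tokens instead, which is handy for seeing what the fake words actually are.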
Here's the my_word_count function. It counts words containing non-ASCII letters (tested on UTF-8 only) and doesn't count standalone ' and - as words.
function my_word_count( $str, $format = 0, $strip_tags = false )
{
	if( $strip_tags )
		$str = trim(strip_tags($str));
	$words = 0;
	$array = array();
	// Replace digits, punctuation and whitespace with a space, keeping only letters, ' and -
	$pattern = "/[\d\"^!#$%&()*+,.\/:;<=>?@\]\[\\\_`{|}~ \t\r\n\v\f]+/";
	$str = preg_replace($pattern, " ", $str);
	$str_array = explode(' ', $str);
	foreach( $str_array as $word )
	{
		// The u modifier makes \pL match whole UTF-8 letters (assumes valid UTF-8 input)
		if( preg_match('/\pL/u', $word) )
		{ // The $word has at least one letter
			$array[] = $word;
			$words++;
		}
	}
	if( $format == 1 )
		return $array; // list of words
	return $words; // number of words
}
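A shorter variant might be to match the words directly instead of stripping everything else first. This is only a sketch under a few assumptions: unicode_word_count is a made-up name, the input is valid UTF-8, and a word is any run of letters, ' and - that contains at least one letter. It is not an exact equivalent of my_word_count, since it also splits on symbols that the pattern above leaves alone.

```php
<?php
// Hypothetical alternative (the name is an assumption, not part of b2evo or PHP).
// Assumes $str is valid UTF-8; with the u modifier, invalid input makes the match fail.
function unicode_word_count( $str, $format = 0 )
{
	// A word: a run of letters, ' and - containing at least one letter.
	$pattern = '/[\p{L}\'-]*\p{L}[\p{L}\'-]*/u';
	if( preg_match_all( $pattern, $str, $matches ) === false )
	{ // Matching failed (e.g. invalid UTF-8): report no words
		return $format == 1 ? array() : 0;
	}
	if( $format == 1 )
		return $matches[0]; // list of words
	return count( $matches[0] ); // number of words
}
```

Like my_word_count, it returns the list of words when $format is 1 and the number of words otherwise.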
Example #2
$string = " one and two - ' -- '' ";
echo 'str_word_count - '.str_word_count($string);
echo '<br />str_word_count fixed - '.str_word_count( implode( "\n", array_fix( str_word_count( $string, 1 ) ) ) );
echo '<br /> my_word_count - '.my_word_count($string);
str_word_count - 7
str_word_count fixed - 3
my_word_count - 3
great work.