Counting Syllables in the English Language Using Python

Even though it seems like an easy task, counting syllables in English is very hard. After hours of googling I’ve realized that the non-corpus-based algorithms are not perfect, and that it’s impossible for them to be. So I wanted to make a better one by combining some that I’ve read and working around the errors I’ve encountered. The goal is to create a step-by-step algorithm that uses as little dictionary help as possible. This is the first time I’m sharing it publicly, hoping that it will help someone out.

Even though this is Python based, the important thing is the algorithm. I know it’s not the best way to handle it, but the outcome’s not so bad for a first attempt.

Here are some discussions or algorithms I’ve found on the web.

  1.  simple algorithm
  2.  some pseudo-algorithm
  3.  a little bit more algorithm
  4. this uses the nltk library, which is not what I’m looking for, since that’s not the challenge. (yet, it may be the most clever approach for you)

After reading all of these and experimenting, I’ve developed my own set of rules. Here they go.

The Algorithm

  1. If number of letters <= 3 : return 1
  2. If the word ends with “es” or “ed”, but not with “ted”, “tes”, “ses”, “ied” or “ies”, discard that ending. Skip this when the word has only 1 vowel or 1 set of consecutive vowels. (like “speed”, “fled” etc.)
  3. Discard trailing “e”, except where ending is “le” and isn’t in the le_except array
  4. Check if consecutive vowels exist, triplets or pairs, and count them as one.
  5. Count remaining vowels in the word.
  6. Add one if begins with “mc”
  7. Add one if ends with “y” but is not surrounded by a vowel. (ex. “mickey”)
  8. Add one if “y” is surrounded by non-vowels and is not the last letter of the word. (ex. “python”)
  9. If begins with “tri-” or “bi-” and is followed by a vowel, add one. (so that “ia” at “triangle” won’t be mistreated by step 4)
  10. If ends with “-ian”, should be counted as two syllables, except for “-tian” and “-cian”. (ex. “indian” and “politician” should be handled differently and shouldn’t be mistreated by step 4)
  11. If begins with “co-” and is followed by a vowel, check if it exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly. (co_one and co_two dictionaries handle it. Ex. “coach” and “coapt” shouldn’t be treated equally by step 4)
  12. If starts with “pre-” and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly. (similar to step 11, but very weak dictionary for the moment)
  13. Check for “-n’t” and cross match with dictionary to add syllable. (ex. “doesn’t”, “couldn’t”)
  14. Handling the exceptional words. (ex. “serious”, “fortunately”)

Like I said earlier, this isn’t perfect, so there are some steps to add or modify, but it works just “fine”. Some exceptions should be added, such as “evacuate”, “ambulances”, “shuttled”, “anyone” etc. It also can’t handle some compound words like “facebook”. Counting only “face” correctly gives 1, and “book” also comes out correct, but because the “e” is no longer at the end of the word it isn’t detected as a silent “e”, so “facebook” returns 3 syllables.
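
To see where the extra syllable comes from, here is a minimal sketch of steps 1, 3, 4 and 5 only. The name core_count is made up for this illustration, and the le_except list and every later step are left out, so it is not the full algorithm.

```python
import re

def core_count(word):
    """Steps 1, 3, 4 and 5 only; a simplified illustration, not the full sylco."""
    word = word.lower()

    # step 1: very short words count as one syllable
    if len(word) <= 3:
        return 1

    discarded = 0

    # step 3: a trailing "e" is treated as silent
    # (simplification: the le_except list is ignored here)
    if word.endswith("e") and not word.endswith("le"):
        discarded += 1

    # step 4: each vowel pair, and each vowel triplet on top of that,
    # removes one from the count
    discarded += len(re.findall(r"[aeiou][aeiou]", word))
    discarded += len(re.findall(r"[aeiou][aeiou][aeiou]", word))

    # step 5: count every vowel, then subtract what was discarded
    return len(re.findall(r"[aeiou]", word)) - discarded

for w in ("face", "book", "facebook"):
    print(w, core_count(w))
```

“face” and “book” each give 1, but in “facebook” the “e” is not the last letter, so step 3 never fires and the result is 3.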

Anyway, here’s the Python (2.x) code. I’ll try to improve it sometime.

import re

def sylco(word) :

    word = word.lower()

    # exception_add are words that need extra syllables
    # exception_del are words that need less syllables

    exception_add = ['serious','crucial']
    exception_del = ['fortunately','unfortunately']

    co_one = ['cool','coach','coat','coal','count','coin','coarse','coup','coif','cook','coign','coiffe','coof','court']
    co_two = ['coapt','coed','coinci']

    pre_one = ['preach']

    syls = 0 #added syllable number
    disc = 0 #discarded syllable number

    #1) if letters <= 3 : return 1
    if len(word) <= 3 :
        syls = 1
        return syls

    #2) if doesn't end with "ted" or "tes" or "ses" or "ied" or "ies", discard "es" and "ed" at the end.
    # if it has only 1 vowel or 1 set of consecutive vowels, discard. (like "speed", "fled" etc.)

    if word[-2:] == "es" or word[-2:] == "ed" :
        doubleAndtripple_1 = len(re.findall(r'[eaoui][eaoui]',word))
        if doubleAndtripple_1 > 1 or len(re.findall(r'[eaoui][^eaoui]',word)) > 1 :
            if word[-3:] == "ted" or word[-3:] == "tes" or word[-3:] == "ses" or word[-3:] == "ied" or word[-3:] == "ies" :
                pass
            else :
                disc+=1

    #3) discard trailing "e", except where ending is "le"  

    le_except = ['whole','mobile','pole','male','female','hale','pale','tale','sale','aisle','whale','while']

    if word[-1:] == "e" :
        if word[-2:] == "le" and word not in le_except :
            pass
        else :
            disc+=1

    #4) check if consecutive vowels exists, triplets or pairs, count them as one.

    doubleAndtripple = len(re.findall(r'[eaoui][eaoui]',word))
    tripple = len(re.findall(r'[eaoui][eaoui][eaoui]',word))
    disc+=doubleAndtripple + tripple

    #5) count remaining vowels in word.
    numVowels = len(re.findall(r'[eaoui]',word))

    #6) add one if starts with "mc"
    if word[:2] == "mc" :
        syls+=1

    #7) add one if ends with "y" but is not surrounded by vowel
    if word[-1:] == "y" and word[-2] not in "aeoui" :
        syls +=1

    #8) add one if "y" is surrounded by non-vowels and is not the last letter.

    for i,j in enumerate(word) :
        if j == "y" :
            if (i != 0) and (i != len(word)-1) :
                if word[i-1] not in "aeoui" and word[i+1] not in "aeoui" :
                    syls+=1

    #9) if starts with "tri-" or "bi-" and is followed by a vowel, add one.

    if word[:3] == "tri" and word[3] in "aeoui" :
        syls+=1

    if word[:2] == "bi" and word[2] in "aeoui" :
        syls+=1

    #10) if ends with "-ian", should be counted as two syllables, except for "-tian" and "-cian"

    if word[-3:] == "ian" : 
    #and (word[-4:] != "cian" or word[-4:] != "tian") :
        if word[-4:] == "cian" or word[-4:] == "tian" :
            pass
        else :
            syls+=1

    #11) if starts with "co-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.

    if word[:2] == "co" and word[2] in 'eaoui' :

        if word[:4] in co_two or word[:5] in co_two or word[:6] in co_two :
            syls+=1
        elif word[:4] in co_one or word[:5] in co_one or word[:6] in co_one :
            pass
        else :
            syls+=1

    #12) if starts with "pre-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.

    if word[:3] == "pre" and word[3] in 'eaoui' :
        if word[:6] in pre_one :
            pass
        else :
            syls+=1

    #13) check for "-n't" and cross match with dictionary to add syllable.

    negative = ["doesn't", "isn't", "shouldn't", "couldn't","wouldn't"]

    if word[-3:] == "n't" :
        if word in negative :
            syls+=1
        else :
            pass

    #14) Handling the exceptional words.

    if word in exception_del :
        disc+=1

    if word in exception_add :
        syls+=1

    # calculate the output
    return numVowels - disc + syls

smartd Settings on a CentOS Server

smartd is a great tool to keep track of the health status of your server disks. It checks the S.M.A.R.T. records at specified intervals and warns you in case anything goes wrong. Even though it is quite simple, people can get lost while setting up their configuration. Here I’ll explain how my generic settings go. Keep in mind that this is for CentOS servers.

To install the service, simply get the smartmontools package via yum. This will also install mailx if it isn’t already installed.

yum install smartmontools -y

Now a file named /etc/smartd.conf will be created. This is where we tell smartd what to do. First, learn the names of your devices using fdisk.

root@eaVT:~# fdisk -l

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders, total 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0006f1aa

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048   943237119   471617536   83  Linux
/dev/sda2       943239166   976771071    16765953    5  Extended
/dev/sda5       943239168   976771071    16765952   82  Linux swap / Solaris

This output tells us that I have one physical disk (/dev/sda) with three partitions (/dev/sda1, /dev/sda2, /dev/sda5). But we are only interested in the physical devices, which means smartd will only deal with /dev/sda.

Open /etc/smartd.conf using your favourite (vi?) text editor. Find the line that says
DEVICESCAN -H -m root
and comment it out. Then add this line
DEVICESCAN -S on -o on -a -m -s (S/../.././02|L/../../6/03) -M test
The result should look like this:

# The word DEVICESCAN will cause any remaining lines in this
# configuration file to be ignored: it tells smartd to scan for all
# ATA and SCSI devices.  DEVICESCAN may be followed by any of the
# Directives listed below, which will be applied to all devices that
# are found.  Most users should comment out DEVICESCAN and explicitly
# list the devices that they wish to monitor.
#DEVICESCAN -H -m root
DEVICESCAN -S on -o on -a -m -s (S/../.././02|L/../../6/03) -M test

Of course, -m has to be followed by your own email address. After this, simply restart the smartd service.

service smartd restart

Now wait for a while and check your email. In my experience, it takes around 5-10 minutes to arrive. You will get a TEST email that says your disks have errors. Now that we’ve established that you get an email when an error occurs, let’s set it up for real use.

Go back to /etc/smartd.conf and comment out the line starting with DEVICESCAN. Don’t forget that there shouldn’t be any active line starting with DEVICESCAN in this file, otherwise smartd will stop reading the conf file after it.
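
A quick way to catch a leftover directive is to scan the file for an active DEVICESCAN line and list whatever follows it. The sketch below is only an illustration: ignored_after_devicescan is a hypothetical helper, it works on the text of the file rather than on smartd itself, and it does not handle backslash line continuations.

```python
def ignored_after_devicescan(conf_text):
    """Return the active lines that follow an uncommented DEVICESCAN line."""
    ignored = []
    seen_devicescan = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if seen_devicescan:
            ignored.append(line)
        elif line.split()[0] == "DEVICESCAN":
            seen_devicescan = True
    return ignored

# a made-up sample: one commented directive, one active directive, one device line
sample = """\
#DEVICESCAN -H -m root
DEVICESCAN -a
/dev/sda -H -C 0 -U 0
"""
print(ignored_after_devicescan(sample))
```

For the real file, pass in the contents of /etc/smartd.conf; an empty list means no device line is being skipped.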

Now add the following lines to the /etc/smartd.conf

/dev/sda -H -C 0 -U 0 -m
/dev/sda -d scsi -s L/../../1/01 -m

Of course, replace /dev/sda and the email address with your own.

The first line tells smartd to run a quiet health check (-H) on the /dev/sda disk and email us on any error.
The second line schedules a long self-test every Monday at 1 a.m. and mails us on any error. If we wanted to run the test every Sunday at 6 p.m., the setting would have been L/../../7/18 -m
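
The -s pattern is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with Monday as 1, and hour). The sketch below mimics that comparison in Python so a pattern can be checked before it goes into the conf file. schedule_string and is_due are names made up here, and the sketch assumes the pattern has to match the whole string.

```python
import re
from datetime import datetime

def schedule_string(test_type, when):
    """Build the T/MM/DD/d/HH string for a test type and a moment in time."""
    # isoweekday() gives Monday=1 ... Sunday=7, the numbering used above
    return "%s/%02d/%02d/%d/%02d" % (
        test_type, when.month, when.day, when.isoweekday(), when.hour)

def is_due(pattern, test_type, when):
    """True if the pattern matches the whole schedule string (assumed behaviour)."""
    return re.fullmatch(pattern, schedule_string(test_type, when)) is not None

monday_1am = datetime(2013, 3, 18, 1, 0)    # a Monday
sunday_6pm = datetime(2013, 3, 24, 18, 0)   # a Sunday

print(is_due("L/../../1/01", "L", monday_1am))   # True
print(is_due("L/../../1/01", "L", sunday_6pm))   # False
print(is_due("L/../../7/18", "L", sunday_6pm))   # True
```

Each “..” stands for any two characters, which is why the month and the day of month do not restrict the schedule.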

If you’d like to add a new disk, (for example /dev/sdb) simply add it as a new line.

/dev/sda -H -C 0 -U 0 -m
/dev/sda -d scsi -s L/../../1/01 -m
/dev/sdb -H -C 0 -U 0 -m
/dev/sdb -d scsi -s L/../../1/01 -m

Now save the file and restart the service again.

service smartd restart

By default, the service may not start on reboot. You must enable it with chkconfig to have it start automatically on a CentOS box. To check and enable it:

[root@emre ~]# chkconfig --list |grep smartd
smartd         	0:off	1:off	2:off	3:off	4:off	5:off	6:off
[root@emre ~]# chkconfig smartd on
[root@emre ~]# chkconfig --list |grep smartd
smartd             0:off    1:off    2:on    3:on    4:on    5:on    6:off

This means that it will run on runlevels 2, 3, 4 and 5. What runlevels are is a different story.

So that’s it for now.

Importing Large MySQL Files With phpMyAdmin

It may not be possible to import large SQL files using phpMyAdmin due to its upload limits.
Sometimes this is related to your php.ini settings, but not always.

You can always use the old reliable method of importing SQL files from the command line. Or upload the file to the server and tell phpMyAdmin to look for that file specifically.

MySQL Command Line Option

Of course, we need to upload the file first. The first line below does this. After that we connect to our server via ssh, and then (on the second line) import the SQL file.

scp the_sql_file.sql
mysql -u username -p -h localhost DATABASE-NAME < the_sql_file.sql

Here, the MySQL connection is established using a username and password. If you simply create a .my.cnf file to connect automatically, you won’t need all of these. Just create the file at /root/.my.cnf (or in any other user’s home directory).
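
An option file like this normally has a [client] section holding user and password keys. As a hedged sketch, the Python below writes one with placeholder credentials and owner-only permissions. write_my_cnf, example.my.cnf, dbuser and secret are illustrative names, not values from any real setup.

```python
import configparser
import os

def write_my_cnf(path, user, password):
    """Write a minimal MySQL client option file readable only by its owner."""
    # interpolation=None so that a "%" in the password is stored literally
    config = configparser.ConfigParser(interpolation=None)
    config["client"] = {"user": user, "password": password}

    # create the file with 0600 permissions so the password is not world-readable
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        config.write(handle, space_around_delimiters=False)
    os.chmod(path, 0o600)  # also tighten a file that already existed

# placeholder credentials written to a scratch file; replace both for real use
write_my_cnf("example.my.cnf", "dbuser", "secret")
print(open("example.my.cnf").read())
```

Pointing path at /root/.my.cnf produces the file described above.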


From now on you don’t need the -u and -p arguments when using mysql: if this file exists in the home directory of the user you’re logged in as, the client picks them up automatically. It may seem insecure, and keeping passwords in clear text is never a good idea, but the /root folder is only accessible by root anyway, and if your /root folder is compromised you’re in big trouble regardless. Keep in mind, too, that we usually have to keep MySQL passwords in clear text in scripts all the time.

Using phpMyAdmin to Import the SQL File

You can set a default folder for phpMyAdmin to check for uploaded files. This way, if you place a file into this folder, you can easily choose to import it. Find where your config.inc.php file is. If you don’t know where, try the locate command (if it is installed). Open the file and find the $cfg['UploadDir'] section. Update it as below

$cfg['UploadDir'] = 'imports';

Now, phpMyAdmin will check the “imports” folder directly. So we need a folder like that. On your terminal, create the folder.

mkdir /path/to/phpmyadmin/imports

Now upload your sql files here.

scp /path/to/the_sql_file.sql

Change the owner and group of the imports folder. First check who owns the phpmyadmin folder. Let’s say the owner is called webapp:

chown -R webapp:webapp /path/to/phpmyadmin/imports

OK. Now open phpMyAdmin in your browser; in the import section you’ll see a dropdown menu that wasn’t there before, listing the files inside the “imports” folder. Even if they are larger than the phpMyAdmin upload limits, they’ll get imported.

This time, let it be one from the poet Eşref.

The European cannot imitate the craft we have here.
Do not think this cabinet will be swept along by the general harmony;
To put a lock on the nation’s mouth whenever it opens,
What artful keys are fashioned at the Bâb-ı Âli!

On the Sözcü Newspaper and Atatürk’s Foresight

I originally wrote about this as a Facebook post, but I think that is the wrong way to share pieces this long, which may need to be read again, so I decided to move it here. In the meantime I updated a few references and fixed some typos. Information pollution has grown exponentially, especially since the Gezi events broke out, and I’m writing because I think we need to get ahead of it a little and become more aware.

There is a Sözcü newspaper headline from last year that has been shared a lot recently: “Atatürk saw these days coming 89 years ago”. (if you haven’t seen it, here you go) It is a quote from an interview, and in the quote Atatürk says “the first world war” (“birinci cihan harbi”). Since the interview is dated 1923, I thought, “Whoa, did the name of the war have ‘first’ in it back then?” Assuming it was of course a translation error, or something the newspaper had added, I wanted to look at the original text. And what do I see: others had beaten me to it.

The term “World War” was used by a German author (August Wilhelm Otto Niemann) in a novel in 1904, before the First World War had even broken out. Calling that war the “First World War” first happened in a book published in 1933 (The First World War: A Photographic History). Later, Time magazine used the term “World War I” in 1939. (These come from Wikipedia; let’s say we trust it.)

Given that the Second World War is accepted to have officially begun in 1939, calling the first one “first” before the second evidently did occur, so one is tempted to think that Atatürk could have said it that way too. Yet I don’t recall such wording in the Nutuk either (1927, four years after the interview). (I’m not in a position to open it and check right now.) Of course, it should be kept in mind that the Nutuk has also been revised a million times.

The term “Middle East” (“Ortadoğu”) in the quote is also a bit suspicious. “Middle East” was first used by the British around the end of the 1800s and the start of the 1900s, but in that period the region meant by “Middle East” was around Afghanistan, while the “Near East” referred to Syria and Iran. After the Ottoman Empire had fully collapsed, the term began to shift toward the place it denotes today, yet academic circles did not change their usage much, and it took until the 1950s for even the USA to officially use “Middle East” for the region we mean today. Until then this Europe-centered term was reportedly rarely preferred outside the “West”, and was even met with resentment. So could Atatürk really have said “Middle East”? (There is a very detailed explanation of the term “Middle East” here.)

So, accepting that the prediction had been summarized, I look at the original text.

The English version is available here. (7 pages)
The Turkish translation (by Prof. Dr. Ergun Özbudun) is available here. (single page)

I didn’t read the whole thing, but searches that would catch something like this turned up nothing, in either language.

Then I saw that I wasn’t alone after all: the topic had been discussed on sözlük too, and they couldn’t find it either :–3527741

So others must have noticed as well, right? Searching on Google, I see that other newspapers have also called it “asparagas” (a fabricated story).

Until I noticed today (20 March 2013), I thought it was true as well, I won’t lie; one doesn’t know what to trust. Perhaps we involuntarily filter the things we want to believe and find them right. While there is a whole body of work on the man’s foresight, we hand others a trump card; we give second-rate newspapers the chance to say “asparagas”.

Anyway, I hope someone who has read the whole interview tells me I’m wrong.

And on the information pollution in today’s journalism, let us heed the words of Carl Bernstein.

The lowest form of popular culture – lack of information, misinformation, disinformation and a contempt for the truth or the reality of most people’s lives – has overrun real journalism. – Carl Bernstein