Tag Archives: Python

Counting Syllables in the English Language Using Python

Even though it seems like an easy task, counting syllables is very hard in English. After hours of googling I’ve realized that the non-corpus-based algorithms are not perfect, and it’s impossible for them to be. So I wanted to make a better one by combining some I’ve read and overcoming the errors I’ve encountered. The goal is to create a step-by-step algorithm using the least amount of dictionary help. This is the first time I’m sharing it publicly, hoping that it will help someone out.

Even though this is Python based, the important thing is the algorithm. I know it’s not the best way to handle it, but the outcome’s not so bad for a first attempt.

Here are some discussions or algorithms I’ve found on the web.

  1. http://allenporter.tumblr.com/post/9776954743/syllables  simple algorithm
  2. http://www.howmanysyllables.com/howtocountsyllables.html  some pseudo-algorithm
  3. http://www.modulus.com.au/blog/?p=8  a little bit more algorithm
  4. http://www.onebloke.com/2011/06/counting-syllables-accurately-in-python-on-google-app-engine/ this uses the nltk library, which is not what I’m looking for, since that’s not the challenge. (yet, it may be the most clever approach for you)

Reading all these and experimenting, I’ve developed my own set of rules. Here they go.

The Algorithm

  1. If number of letters <= 3 : return 1
  2. If the word ends with “es” or “ed” but not with “ted”, “tes”, “ses”, “ied” or “ies”, discard that ending. Skip this if the word has only 1 vowel or 1 set of consecutive vowels. (like “speed”, “fled” etc.)
  3. Discard trailing “e”, except where ending is “le” and isn’t in the le_except array
  4. Check if consecutive vowels exists, triplets or pairs, count them as one.
  5. Count remaining vowels in the word.
  6. Add one if begins with “mc”
  7. Add one if ends with “y” and the letter before it is not a vowel. (ex. “mickey” gets no extra syllable, since “e” comes before the “y”)
  8. Add one if “y” is surrounded by non-vowels and is not the last letter of the word. (ex. “python”)
  9. If begins with “tri-” or “bi-” and is followed by a vowel, add one. (so that “ia” at “triangle” won’t be mistreated by step 4)
  10. If ends with “-ian”, should be counted as two syllables, except for “-tian” and “-cian”. (ex. “indian” and “politician” should be handled differently and shouldn’t be mistreated by step 4)
  11. If begins with “co-” and is followed by a vowel, check if it exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly. (co_one and co_two dictionaries handle it. Ex. “coach” and “coapt” shouldn’t be treated equally by step 4)
  12. If starts with “pre-” and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly. (similar to step 11, but very weak dictionary for the moment)
  13. Check for “-n’t” and cross match with dictionary to add syllable. (ex. “doesn’t”, “couldn’t”)
  14. Handling the exceptional words. (ex. “serious”, “fortunately”)
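
Steps 7 and 8 can be tried on their own. The snippet below is a minimal sketch of just those two rules, written for Python 3; the function name `extra_y_syllables` is made up for illustration and is not part of the code further down.

```python
VOWELS = "aeiou"

def extra_y_syllables(word):
    """Count the syllables that steps 7 and 8 add for the letter "y"."""
    word = word.lower()
    extra = 0
    # Step 7: a final "y" counts when the letter before it is not a vowel.
    if len(word) > 1 and word[-1] == "y" and word[-2] not in VOWELS:
        extra += 1
    # Step 8: a "y" inside the word counts when both neighbours are non-vowels.
    for i in range(1, len(word) - 1):
        if word[i] == "y" and word[i - 1] not in VOWELS and word[i + 1] not in VOWELS:
            extra += 1
    return extra

for w in ("python", "mickey", "happy"):
    print(w, extra_y_syllables(w))
```

Here “python” and “happy” each gain one syllable, while “mickey” gains none because its “y” follows a vowel.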

Like I said earlier, this isn’t perfect, so there are some steps to add or modify, but it works just “fine”. Some exceptions should be added, such as “evacuate”, “ambulances”, “shuttled”, “anyone” etc. Also, it can’t handle some compound words like “facebook”. Counting only “face” correctly gives 1, and “book” also comes out correct, but since the “e” in the middle isn’t detected as a silent “e”, “facebook” returns 3 syllables.
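
The “facebook” problem is easy to reproduce with a stripped-down version of steps 3 to 5. This is only a sketch for Python 3: `rough_count` is a hypothetical helper, it ignores the le_except list, and it counts each run of vowels as one group, which is a simplification of the pair and triplet handling in the real code.

```python
import re

def rough_count(word):
    """Steps 3 to 5 only: count vowel groups, minus a trailing silent "e"."""
    word = word.lower()
    # Steps 4 and 5 (simplified): each run of consecutive vowels is one group.
    groups = len(re.findall(r"[aeiou]+", word))
    # Step 3 (simplified): a trailing "e" is silent unless the word ends in "le".
    if word.endswith("e") and not word.endswith("le"):
        groups -= 1
    return groups

for w in ("face", "book", "facebook"):
    print(w, rough_count(w))
```

The silent “e” is only recognised at the very end of a word, so “face” gives 1 and “book” gives 1, but “facebook” gives 3.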

Anyway, here’s the Python (2.x) code. I’ll try and improve it sometime.

import re

def sylco(word) :

    word = word.lower()

    # exception_add are words that need extra syllables
    # exception_del are words that need fewer syllables

    exception_add = ['serious','crucial']
    exception_del = ['fortunately','unfortunately']

    co_one = ['cool','coach','coat','coal','count','coin','coarse','coup','coif','cook','coign','coiffe','coof','court']
    co_two = ['coapt','coed','coinci']

    pre_one = ['preach']

    syls = 0 #added syllable number
    disc = 0 #discarded syllable number

    #1) if letters <= 3 : return 1
    if len(word) <= 3 :
        syls = 1
        return syls

    #2) if ends with "es" or "ed" but not with "ted", "tes", "ses", "ied" or "ies", discard that ending.
    # skip this if the word has only 1 vowel or 1 set of consecutive vowels. (like "speed", "fled" etc.)

    if word[-2:] == "es" or word[-2:] == "ed" :
        doubleAndtripple_1 = len(re.findall(r'[eaoui][eaoui]',word))
        if doubleAndtripple_1 > 1 or len(re.findall(r'[eaoui][^eaoui]',word)) > 1 :
            if word[-3:] == "ted" or word[-3:] == "tes" or word[-3:] == "ses" or word[-3:] == "ied" or word[-3:] == "ies" :
                pass
            else :
                disc+=1

    #3) discard trailing "e", except where ending is "le" and the word isn't in le_except

    le_except = ['whole','mobile','pole','male','female','hale','pale','tale','sale','aisle','whale','while']

    if word[-1:] == "e" :
        if word[-2:] == "le" and word not in le_except :
            pass

        else :
            disc+=1

    #4) check if consecutive vowels exists, triplets or pairs, count them as one.

    doubleAndtripple = len(re.findall(r'[eaoui][eaoui]',word))
    tripple = len(re.findall(r'[eaoui][eaoui][eaoui]',word))
    disc+=doubleAndtripple + tripple

    #5) count remaining vowels in word.
    numVowels = len(re.findall(r'[eaoui]',word))

    #6) add one if starts with "mc"
    if word[:2] == "mc" :
        syls+=1

    #7) add one if ends with "y" but is not preceded by a vowel
    if word[-1:] == "y" and word[-2] not in "aeoui" :
        syls +=1

    #8) add one if "y" is surrounded by non-vowels and is not the last letter.

    for i,j in enumerate(word) :
        if j == "y" :
            if (i != 0) and (i != len(word)-1) :
                if word[i-1] not in "aeoui" and word[i+1] not in "aeoui" :
                    syls+=1

    #9) if starts with "tri-" or "bi-" and is followed by a vowel, add one.

    if word[:3] == "tri" and word[3] in "aeoui" :
        syls+=1

    if word[:2] == "bi" and word[2] in "aeoui" :
        syls+=1

    #10) if ends with "-ian", should be counted as two syllables, except for "-tian" and "-cian"

    if word[-3:] == "ian" : 
        if word[-4:] == "cian" or word[-4:] == "tian" :
            pass
        else :
            syls+=1

    #11) if starts with "co-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.

    if word[:2] == "co" and word[2] in 'eaoui' :

        if word[:4] in co_two or word[:5] in co_two or word[:6] in co_two :
            syls+=1
        elif word[:4] in co_one or word[:5] in co_one or word[:6] in co_one :
            pass
        else :
            syls+=1

    #12) if starts with "pre-" and is followed by a vowel, check if exists in the double syllable dictionary, if not, check if in single dictionary and act accordingly.

    if word[:3] == "pre" and word[3] in 'eaoui' :
        if word[:6] in pre_one :
            pass
        else :
            syls+=1

    #13) check for "-n't" and cross match with dictionary to add syllable.

    negative = ["doesn't", "isn't", "shouldn't", "couldn't","wouldn't"]

    if word[-3:] == "n't" :
        if word in negative :
            syls+=1
        else :
            pass   

    #14) Handling the exceptional words.

    if word in exception_del :
        disc+=1

    if word in exception_add :
        syls+=1     

    # calculate the output
    return numVowels - disc + syls

UNIX Process Time Bomb

Here’s a simple script that kills a process if it lives longer than the time specified. It’s written in Python, and is available on github.

Usage : timebomb.py <process-name> <minutes>

Example : $ timebomb.py firefox-bin 20

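Outcome : This will kill the process named firefox-bin if it has been running longer than 20 minutes. The script reads the bsdtime column of ps, which is accumulated CPU time, so a mostly idle process reaches the limit later than the wall clock suggests.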

Crontab : You should probably add this to your crontab!

Dependency : Standard UNIX tools : Python 2.4.x, pgrep, ps, kill etc.

#!/usr/bin/python
# A Pythonic Time Bomb
# Kills Processes Living Longer than the specified time.
# Don't Forget to add it to your crontab!
# http://github.com/eaydin

import subprocess, sys
if len(sys.argv) != 3 :
    print "Usage : timebomb.py <process-name> <time-in-minutes>"
    print "Takes only and exactly 2 arguments."
    raise SystemExit
    
try : int(sys.argv[2])
except :
    print "%s is not an integer." % sys.argv[2]
    raise SystemExit
    
# Look up the matching PIDs; exit quietly if the process isn't running.
try :
    a=subprocess.Popen(["pgrep",sys.argv[1]],stdout=subprocess.PIPE).communicate()[0]
    if a == '' :
        raise SystemExit
    else :
        # List each PID together with its bsdtime column (MMM:SS).
        procc = subprocess.Popen(["ps -o pid,bsdtime -p $(pgrep %s)"%sys.argv[1]],shell=True,stdout=subprocess.PIPE).communicate()[0]
        procc=procc.strip()
except : raise SystemExit 
# Skip the header row, then kill every process whose minutes field reaches the limit.
for lines in procc.split('\n') :
    if lines != '' :
        l=lines.split()
        if l[0] == 'PID' : pass
        else :
            if int(l[1].split(':')[0]) >= int(sys.argv[2]) :
                try : killer = subprocess.Popen(["kill","-9",l[0]],stdout=subprocess.PIPE).communicate()[0]
                except : pass
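
The decision of which processes to kill can be tested without killing anything. Below is a minimal sketch of that parsing step in Python 3; the helper name `pids_over_limit` and the sample ps output are made up for illustration.

```python
def pids_over_limit(ps_output, limit_minutes):
    """Return the PIDs whose minutes field (MMM:SS) is >= limit_minutes."""
    pids = []
    for line in ps_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "PID":
            continue  # skip blank lines and the header row
        minutes = int(fields[1].split(":")[0])
        if minutes >= limit_minutes:
            pids.append(int(fields[0]))
    return pids

# Hypothetical output of "ps -o pid,bsdtime", not taken from a real system.
sample = """
  PID   TIME
 1234  25:10
 5678   3:42
"""
print(pids_over_limit(sample, 20))
```

With a limit of 20 minutes only the first process in the sample is selected.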

CR2 Files To FITS

Even though CR2 and FITS files are both very common, unfortunately you can’t simply google how to convert one into the other. So after a lot of googling, here’s my solution to the problem using Python with NumPy, PyFITS, netpbmfile.py, and dcraw.

Actually if you take a look at dcraw’s homepage, you’ll see that it says you can use the following code to convert cr2 files to fits :

$ dcraw -c crw_0001.crw | pnmtofits > crw_0001.fits

But this wasn’t the case for me. Since I insisted on going for 16 bits, the pnmtofits tool gave me buffer overflows and other crap.

Here’s my solution.

$ dcraw -6 -c RAWDATA.cr2 > ThePPM.ppm

Ok this was easy. The -6 option says that we insist on our output to be 16 bits. The -c tells the program to write the output to stdout. Well, since the default output of dcraw is ppm, we redirect it to a ppm file.
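
If you’d rather run this step from a script, something like the following should do. It is a sketch for Python 3 that assumes dcraw is installed and on the PATH; the function names are made up, and only the -6 and -c options mentioned above are used.

```python
import subprocess

def dcraw_command(raw_path):
    """Build the dcraw call: 16 bit output (-6), written to stdout (-c)."""
    return ["dcraw", "-6", "-c", raw_path]

def raw_to_ppm(raw_path, ppm_path):
    """Run dcraw and store its stdout in ppm_path (needs dcraw on the PATH)."""
    with open(ppm_path, "wb") as out:
        subprocess.run(dcraw_command(raw_path), stdout=out, check=True)

# Only print the command here; call raw_to_ppm() to actually convert a file.
print(dcraw_command("RAWDATA.cr2"))
```

Calling `raw_to_ppm("RAWDATA.cr2", "ThePPM.ppm")` would then do the same as the shell redirect.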

After this, I needed a way to handle 16 bit images with Python. Unfortunately the Python Imaging Library doesn’t support 16 bit files. There’s the PythonMagick library, a wrapper around Magick++ (the C++ bindings of ImageMagick), but unfortunately it is a pain in the ass to get documentation for it. So it seems both PIL and PythonMagick are out of the way here.

Other than that, I’ve found a library called GDAL which they say also handles 16 bit images (but in TIFF format, which is not an issue since dcraw can create 16 bit TIFF outputs with the -6 -T options) but using GDAL didn’t seem to be that clever since it comes with a lot of side effects. (GDAL stands for geospatial data abstraction library)

So, I started looking for ways of reading 16 bit ppm data with Python, and luckily Christoph Gohlke has written a script for that: netpbmfile.py.

So here’s a little snippet for you :

from netpbmfile import *
im = NetpbmFile("ThePPM.ppm").asarray()

Now we have the ppm file as a numpy array. The rest is easy to handle with numpy. Let’s say we only want one channel. (for me, that would be the Green channel, which is the second value in the pixel values)

import numpy
green = numpy.zeros((im.shape[0],im.shape[1]),dtype=numpy.uint16)
for row in xrange(0,im.shape[0]) :
    for col in xrange(0,im.shape[1]) :
        green[row,col] = im[row,col][1]
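
The same channel can probably be pulled out without a loop by slicing. The sketch below (Python 3) uses a small made-up array in place of the real image and assumes the array is laid out as rows x columns x 3, as the loop above implies.

```python
import numpy

# Stand-in for the array returned by netpbmfile: 2 rows, 3 columns, 3 channels.
im = numpy.arange(2 * 3 * 3, dtype=numpy.uint16).reshape(2, 3, 3)

# Index 1 on the last axis is the green value of every pixel.
green = im[:, :, 1].copy()

print(green.shape, green.dtype)
print(green.tolist())
```

The `.copy()` makes green an independent array instead of a view into im.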

Cool. Now we have the 16 bit data of the Green channel in the numpy array called green. Using the PyFITS library we can easily write the data to a new fits file.

import pyfits
hdu = pyfits.PrimaryHDU(green)
hdu.writeto('GreenChannel.fits')

That’s it!
Well, of course the header information is not copied from the cr2 to the fits here, but one can easily get the basic EXIF data out of the cr2 with dcraw like this :

marvin@marvin:/media/galileo/dcraw$ dcraw -i -v gor.cr2
Filename: gor.cr2
Timestamp: Thu Dec 1 17:42:51 2011
Camera: Canon EOS 550D
ISO speed: 800
Shutter: 24.7 sec
Aperture: f/4.6
Focal length: 37.0 mm
Embedded ICC profile: no
Number of raw images: 1
Thumb size: 5184 x 3456
Full size: 5344 x 3516
Image size: 5202 x 3465
Output size: 5202 x 3465
Raw colors: 3
Filter pattern: RGGBRGGBRGGBRGGB
Daylight multipliers: 2.222196 0.932800 1.295405
Camera multipliers: 2194.000000 1024.000000 1702.000000 1024.000000

So you can call this through a pipe in your Python script, catch the necessary information (for example with the re library), and add it to the header of the FITS image with PyFITS.
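
Catching the information could look roughly like this. It is a sketch for Python 3 that works on a pasted copy of the first lines of the output above instead of a live pipe; `parse_dcraw_info` is a made-up name, and writing the values into the FITS header is left out.

```python
import re

def parse_dcraw_info(text):
    """Turn "Key: value" lines into a dictionary of strings."""
    info = {}
    for line in text.splitlines():
        # Everything before the first colon is the key, the rest is the value.
        match = re.match(r"^([^:]+):\s*(.*)$", line)
        if match:
            info[match.group(1).strip()] = match.group(2).strip()
    return info

sample = """Filename: gor.cr2
Timestamp: Thu Dec 1 17:42:51 2011
Camera: Canon EOS 550D
ISO speed: 800
Shutter: 24.7 sec
Aperture: f/4.6
"""

info = parse_dcraw_info(sample)
print(info["Camera"], info["ISO speed"], info["Shutter"])
```

Only the first colon splits a line, so the time in the Timestamp value stays intact.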

Soon I’ll add a sample Python script that does all of it.

Here’s a sample output. Don’t forget that the FITS is only the green channel and is shown in grayscale. Also, ds9 normally uses the lower left pixel as origin, so the FITS is displayed with an inverted Y axis according to ds9’s painting settings.

Handling Images With Python

Here’s a quick reminder of how to handle image files with Python.
First of all, it’s always good to have the numpy module.
Other than that, the Image and ImageOps modules of the Python Imaging Library (PIL) are also very handy.

Here’s an easy way to convert an image into grayscale and then save it back.

import Image, ImageOps
mona_lisa = Image.open("monalisa.jpeg")
mona_lisa = ImageOps.grayscale(mona_lisa)
mona_lisa.save("mona_lisa_BW.jpeg")

Ok that was simple. Here’s how you can manipulate it easily with numpy.

import Image, numpy
mona_lisa = Image.open("monalisa.jpeg")
theArray = numpy.array(mona_lisa) # Now we have a writable copy of the image as a numpy array
for x in xrange(0,theArray.shape[0]) :
    for y in xrange(0,theArray.shape[1]) : theArray[x][y] = theArray[x][y]-128
theoutput = Image.fromarray(numpy.uint8(theArray))
theoutput.save("monalisa8bit.jpeg")

The code above opens the monalisa.jpeg image and subtracts 128 from each value. It then converts the new matrix (or numpy array) into an 8 bit unsigned integer image and saves it as monalisa8bit.jpeg.
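
One thing to watch out for with this kind of arithmetic: assuming the image loads as an unsigned 8 bit array, values below 128 can’t go negative and wrap around instead. The sketch below (Python 3, with three made-up pixel values instead of a real image) shows the wrap-around next to a clipped alternative.

```python
import numpy

pixels = numpy.array([[200, 128, 100]], dtype=numpy.uint8)

# Same subtraction as the loop, applied to the whole array at once.
# uint8 has no negative values, so 100 - 128 wraps around to 228.
wrapped = pixels - numpy.uint8(128)

# Subtracting in a wider type and clipping keeps dark pixels at 0 instead.
clipped = numpy.clip(pixels.astype(numpy.int16) - 128, 0, 255).astype(numpy.uint8)

print(wrapped.tolist())
print(clipped.tolist())
```

Which of the two is wanted depends on the effect you’re after; the clipped version simply darkens the image.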