I was looking at Scala again and decided to implement Peter Norvig's algorithm for suggesting spelling correction suggestions. I suggest you go read How to Write a Spelling Corrector for the clever stuff.
This implementation is limited to the English alphabet. You'll need the big.txt file or a similar set of training data.
Spell check in Java
Since I don't really know Python and am only getting to grips with Scala, I decided to implement the code in Java first.
I started by defining an interface to write to:
//Spell.java |
The implementation of this interface was written for clarity and efficiency rather than brevity.
//JSpell.java |
This code works by pushing correction candidates into a Receiver
type which determines which of the candidates it has been passed is the
best (that is, has the highest frequency in the dictionary). The
correction sequence goes like this:
- If the term is in the dictionary, return the term.
- Else if a candidate one character different to the term is in the dictionary, return the candidate with the highest weight.
- Else if a candidate two characters different to the term is in the dictionary, return the candidate with the highest weight.
- Else return the term.
Spell check in Scala
Rather than do a straight port of the Java code, I tried to implement this version in idiomatic Scala. Still, I imagine I've made some less than stellar implementation choices to the eyes of seasoned Scala developers.
//SSpell.scala
class SSpell(data: Map[String,Int]) {
private val dictionary = new FrequencyMap(data)
private val alphabet = 'a' to 'z'
/**Map that will return a default value of zero for no entry*/
private class FrequencyMap[K](m: Map[K, Int]) extends
scala.collection.mutable.HashMap[K, Int] {
override def default(key: K) = 0
this ++ m
}
private def edit1(word: String): Seq[String] = {
// "hello" becomes ("", "hello"), ("h", "ello"), etc.
val splits = (0 to word.length).map(i => (word.take(i), word.drop(i)))
val deleted = splits.filter(_._2.length > 0)
.map(tuple => tuple._1 + tuple._2.drop(1))
val transposed = splits.filter(_._2.length > 1)
.map(tuple => tuple._1 + tuple._2(1) + tuple._2(0) + tuple._2.drop(2))
val replaced = splits.filter(_._2.length > 0)
.flatMap(tuple => alphabet.map(tuple._1 + _ + tuple._2.drop(1)))
val inserted = splits
.flatMap(tuple => alphabet.map(tuple._1 + _ + tuple._2))
deleted ++ transposed ++ replaced ++ inserted
}
private def edit2(word: String, edits: Seq[String]) = {
val edit2 = for(edit <- edits;
e2 <- edit1(edit);
known = (e2, dictionary(e2));
if(known._2>0)) yield known
edit2.foldLeft((word, 0)){best(_,_)}
}
private def best(a: (String,Int), b: (String,Int)) = if(b._2 > a._2) b else a
private def fix(word: String, edits: Seq[String]) = {
val terms = edits.map(term => (term, dictionary(term)))
val correction = terms.foldLeft((word, 0)){best(_,_)}
if(correction._2 > 0) correction._1 else edit2(word, edits)._1
}
def correct(word: String) =
if(dictionary(word) > 0) word else fix(word, edit1(word))
}
If you aren't familiar with Scala, some of this can look quite daunting. Let's break down this method:
private def fix(word: String, edits: Seq[String]) = {
val terms = edits.map(term => (term, dictionary(term)))
val correction = terms.foldLeft((word, 0)){best(_,_)}
if(correction._2 > 0) correction._1 else edit2(word, edits)._1
}
The underscore _ used by itself as a variable is
known as place-holder notation and in this context roughly
means "the next argument." (foo,bar) is a shortcut for
creating a tuple type. map and foldLeft
iterate over the contents of the class and apply a function to the
elements. map returns another sequence and foldLeft
returns a single value. If you are a Java developer, you may grok
this rough translation more easily:
private String fix(String word, Seq<String> edits) {
|
Note: this isn't a literal translation of what Scala does - I just want to give a better sense of the logic involved.
Did you mean ..?
The application utilising the spell checker was written in Java and I wanted to just swap in functionality written in Scala. One of the attractions of Scala is the way you can integrate it with existing code and I wanted to try this out.
//DidYouMean.java |
There are two obstacles to switching implementations. One: SSpell
doesn't implement the Spell interface. Two: Scala
collection classes don't inherit from the Java collection classes type
hierarchy. A simple adapter class fixes both these problems.
//SSpellAdapter.scala
class SSpellAdapter(javaMap: java.util.Map[String, Int]) extends Spell {
import scala.collection.jcl.Conversions
val scalaMap = Map.empty ++ Conversions.convertMap(javaMap);
val corrector = new SSpell(scalaMap)
def correct(word: String): String = corrector.correct(word)
}
Here is the application as run from the command line:
X:\test>java -cp C:\Scala\lib\scala-library.jar;. DidYouMean spellin spellin: did you mean spelling?
Benchmarking
Introducing SSpellAdapter makes it really easy to
check the Scala implementation against the Java version. As a bonus, it
creates a fairly level playing field for benchmarking.
These are the results for running the checkers against the tests1 data set when run on a 1.8GHz Centrino:
class SSpellAdapter Time: 12.605887 seconds Total: 265; Right: 197; Wrong: 68; Unknown: 15; Pct: 74.339623 class JSpell Time: 3.845939 seconds Total: 265; Right: 197; Wrong: 68; Unknown: 15; Pct: 74.339623
Don't read too much into the timings on this micro-benchmark. After all, it would be possible to write a procedural Scala implementation just like the Java one. All this is really measuring is my proficiency with functional Scala.
Remarks
Java version 6; Scala version 2.7.7.
Windows build script:
@REM COMPILE.BAT @ECHO OFF javac Spell.java CALL scalac.bat -classpath . *.scala javac -cp %SCALA_HOME%\lib\scala-library.jar;. *.java
No comments:
Post a Comment
All comments are moderated