I was looking at Scala again and decided to implement Peter Norvig's algorithm for suggesting spelling correction suggestions. I suggest you go read How to Write a Spelling Corrector for the clever stuff.
This implementation is limited to the English alphabet. You'll need the big.txt file or a similar set of training data.
Spell check in Java
Since I don't really know Python and am only getting to grips with Scala, I decided to implement the code in Java first.
I started by defining an interface to write to:
//Spell.java |
The implementation of this interface was written for clarity and efficiency rather than brevity.
//JSpell.java |
This code works by pushing correction candidates into a Receiver
type which determines which of the candidates it has been passed is the
best (that is, has the highest frequency in the dictionary). The
correction sequence goes like this:
- If the term is in the dictionary, return the term.
- Else if a candidate one character different to the term is in the dictionary, return the candidate with the highest weight.
- Else if a candidate two characters different to the term is in the dictionary, return the candidate with the highest weight.
- Else return the term.
Spell check in Scala
Rather than do a straight port of the Java code, I tried to implement this version in idiomatic Scala. Still, I imagine I've made some less than stellar implementation choices to the eyes of seasoned Scala developers.
//SSpell.scala class SSpell(data: Map[String,Int]) { private val dictionary = new FrequencyMap(data) private val alphabet = 'a' to 'z' /**Map that will return a default value of zero for no entry*/ private class FrequencyMap[K](m: Map[K, Int]) extends scala.collection.mutable.HashMap[K, Int] { override def default(key: K) = 0 this ++ m } private def edit1(word: String): Seq[String] = { // "hello" becomes ("", "hello"), ("h", "ello"), etc. val splits = (0 to word.length).map(i => (word.take(i), word.drop(i))) val deleted = splits.filter(_._2.length > 0) .map(tuple => tuple._1 + tuple._2.drop(1)) val transposed = splits.filter(_._2.length > 1) .map(tuple => tuple._1 + tuple._2(1) + tuple._2(0) + tuple._2.drop(2)) val replaced = splits.filter(_._2.length > 0) .flatMap(tuple => alphabet.map(tuple._1 + _ + tuple._2.drop(1))) val inserted = splits .flatMap(tuple => alphabet.map(tuple._1 + _ + tuple._2)) deleted ++ transposed ++ replaced ++ inserted } private def edit2(word: String, edits: Seq[String]) = { val edit2 = for(edit <- edits; e2 <- edit1(edit); known = (e2, dictionary(e2)); if(known._2>0)) yield known edit2.foldLeft((word, 0)){best(_,_)} } private def best(a: (String,Int), b: (String,Int)) = if(b._2 > a._2) b else a private def fix(word: String, edits: Seq[String]) = { val terms = edits.map(term => (term, dictionary(term))) val correction = terms.foldLeft((word, 0)){best(_,_)} if(correction._2 > 0) correction._1 else edit2(word, edits)._1 } def correct(word: String) = if(dictionary(word) > 0) word else fix(word, edit1(word)) }
If you aren't familiar with Scala, some of this can look quite daunting. Let's break down this method:
private def fix(word: String, edits: Seq[String]) = { val terms = edits.map(term => (term, dictionary(term))) val correction = terms.foldLeft((word, 0)){best(_,_)} if(correction._2 > 0) correction._1 else edit2(word, edits)._1 }
The underscore _
used by itself as a variable is
known as place-holder notation and in this context roughly
means "the next argument." (foo,bar)
is a shortcut for
creating a tuple type. map
and foldLeft
iterate over the contents of the class and apply a function to the
elements. map
returns another sequence and foldLeft
returns a single value. If you are a Java developer, you may grok
this rough translation more easily:
private String fix(String word, Seq<String> edits) {
|
Note: this isn't a literal translation of what Scala does - I just want to give a better sense of the logic involved.
Did you mean ..?
The application utilising the spell checker was written in Java and I wanted to just swap in functionality written in Scala. One of the attractions of Scala is the way you can integrate it with existing code and I wanted to try this out.
//DidYouMean.java |
There are two obstacles to switching implementations. One: SSpell
doesn't implement the Spell
interface. Two: Scala
collection classes don't inherit from the Java collection classes type
hierarchy. A simple adapter class fixes both these problems.
//SSpellAdapter.scala class SSpellAdapter(javaMap: java.util.Map[String, Int]) extends Spell { import scala.collection.jcl.Conversions val scalaMap = Map.empty ++ Conversions.convertMap(javaMap); val corrector = new SSpell(scalaMap) def correct(word: String): String = corrector.correct(word) }
Here is the application as run from the command line:
X:\test>java -cp C:\Scala\lib\scala-library.jar;. DidYouMean spellin spellin: did you mean spelling?
Benchmarking
Introducing SSpellAdapter
makes it really easy to
check the Scala implementation against the Java version. As a bonus, it
creates a fairly level playing field for benchmarking.
These are the results for running the checkers against the tests1 data set when run on a 1.8GHz Centrino:
class SSpellAdapter Time: 12.605887 seconds Total: 265; Right: 197; Wrong: 68; Unknown: 15; Pct: 74.339623 class JSpell Time: 3.845939 seconds Total: 265; Right: 197; Wrong: 68; Unknown: 15; Pct: 74.339623
Don't read too much into the timings on this micro-benchmark. After all, it would be possible to write a procedural Scala implementation just like the Java one. All this is really measuring is my proficiency with functional Scala.
Remarks
Java version 6; Scala version 2.7.7.
Windows build script:
@REM COMPILE.BAT @ECHO OFF javac Spell.java CALL scalac.bat -classpath . *.scala javac -cp %SCALA_HOME%\lib\scala-library.jar;. *.java
No comments:
Post a Comment
All comments are moderated