java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.ja.JapaneseTokenizer
public final class JapaneseTokenizer
Tokenizer for Japanese that uses morphological analysis.
This tokenizer sets a number of additional attributes:
- BaseFormAttribute, containing the base form for inflected adjectives and verbs.
- PartOfSpeechAttribute, containing the part of speech.
- ReadingAttribute, containing the reading and pronunciation.
- InflectionAttribute, containing additional part-of-speech information for inflected forms.
This tokenizer uses a rolling Viterbi search to find the
least-cost segmentation (path) of the incoming characters.
For tokens that appear to be compound (longer than 2 characters
if all Kanji, or longer than 7 characters otherwise), it checks
whether a second-best segmentation of that token exists after
applying penalties to the long tokens. If so, and the Mode is
JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is
output as well.
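The least-cost search described above can be illustrated with a small, self-contained sketch. This is not the tokenizer's actual implementation: the class name `ToyViterbiSegmenter`, the word costs, and the single-character fallback for unknown text are all assumptions made for illustration, and the sketch ignores connection costs between adjacent tokens, the rolling (streaming) behavior, and the search-mode penalties. The input is romanized only to keep the source ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy least-cost segmenter (hypothetical; not part of Lucene). */
public class ToyViterbiSegmenter {
    private final Map<String, Integer> wordCosts; // assumed costs; lower is preferred
    private final int unknownCharCost;            // cost of emitting one unknown character
    private final int maxWordLength;

    public ToyViterbiSegmenter(Map<String, Integer> wordCosts, int unknownCharCost) {
        this.wordCosts = wordCosts;
        this.unknownCharCost = unknownCharCost;
        int max = 1;
        for (String word : wordCosts.keySet()) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Returns the segmentation of text with the lowest total cost. */
    public List<String> segment(String text) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i]: lowest cost of segmenting text[0, i)
        int[] back = new int[n + 1]; // back[i]: start offset of the last token on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - maxWordLength); start < end; start++) {
                if (best[start] == Integer.MAX_VALUE) {
                    continue; // no path reaches this start offset
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown text is only allowed one character at a time
                    }
                    cost = unknownCharCost;
                }
                int total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the text to recover the best path.
        List<String> tokens = new ArrayList<>();
        for (int pos = n; pos > 0; pos = back[pos]) {
            tokens.add(text.substring(back[pos], pos));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = new HashMap<>();
        costs.put("kansai", 3);
        costs.put("kokusai", 3);
        costs.put("kuukou", 3);
        costs.put("kansaikokusaikuukou", 12); // the compound as a single entry
        ToyViterbiSegmenter segmenter = new ToyViterbiSegmenter(costs, 10);
        // Three parts cost 9, the single compound costs 12, so the parts win.
        System.out.println(segmenter.segment("kansaikokusaikuukou"));
        // prints [kansai, kokusai, kuukou]
    }
}
```

With these assumed costs the three-part path is cheaper than the single compound entry. Lowering the compound's cost below 9 would make the single token win instead, which is the kind of trade-off the penalties on long tokens are meant to influence.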
| Nested Class Summary | |
|---|---|
| static class | JapaneseTokenizer.Mode - Tokenization mode: this determines how the tokenizer handles compound and unknown words. |
| static class | JapaneseTokenizer.Type - Token type reflecting the original source of this token. |
| Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource |
|---|
| AttributeSource.AttributeFactory, AttributeSource.State |
| Field Summary | |
|---|---|
| static JapaneseTokenizer.Mode | DEFAULT_MODE - Default tokenization mode. |
| Fields inherited from class org.apache.lucene.analysis.Tokenizer |
|---|
| input |
| Constructor Summary | |
|---|---|
| JapaneseTokenizer(AttributeSource.AttributeFactory factory, Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) | Create a new JapaneseTokenizer. |
| JapaneseTokenizer(Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode) | Create a new JapaneseTokenizer. |
| Method Summary | |
|---|---|
| void | end() |
| boolean | incrementToken() |
| void | reset() |
| void | setGraphvizFormatter(GraphvizFormatter dotOut) - Expert: set this to produce graphviz (dot) output of the Viterbi lattice. |
| Methods inherited from class org.apache.lucene.analysis.Tokenizer |
|---|
| close, correctOffset, setReader |
| Methods inherited from class org.apache.lucene.util.AttributeSource |
|---|
| addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState |
| Methods inherited from class java.lang.Object |
|---|
| clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final JapaneseTokenizer.Mode DEFAULT_MODE
Default tokenization mode: JapaneseTokenizer.Mode.SEARCH.
| Constructor Detail |
|---|
public JapaneseTokenizer(Reader input,
UserDictionary userDictionary,
boolean discardPunctuation,
JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer. Uses the default AttributeFactory.
Parameters:
input - Reader containing text.
userDictionary - Optional: if non-null, user dictionary.
discardPunctuation - true if punctuation tokens should be dropped from the output.
mode - tokenization mode.
public JapaneseTokenizer(AttributeSource.AttributeFactory factory,
Reader input,
UserDictionary userDictionary,
boolean discardPunctuation,
JapaneseTokenizer.Mode mode)
Create a new JapaneseTokenizer.
Parameters:
factory - the AttributeFactory to use.
input - Reader containing text.
userDictionary - Optional: if non-null, user dictionary.
discardPunctuation - true if punctuation tokens should be dropped from the output.
mode - tokenization mode.
| Method Detail |
|---|
public void setGraphvizFormatter(GraphvizFormatter dotOut)
public void reset()
throws IOException
Overrides: reset in class TokenStream
Throws: IOException
public void end()
Overrides: end in class TokenStream
public boolean incrementToken()
throws IOException
Specified by: incrementToken in class TokenStream
Throws: IOException