Class JapaneseIterationMarkCharFilter

java.lang.Object
java.io.Reader
org.apache.lucene.analysis.CharFilter
org.apache.lucene.analysis.ja.JapaneseIterationMarkCharFilter
All Implemented Interfaces:
Closeable, AutoCloseable, Readable

public class JapaneseIterationMarkCharFilter extends CharFilter
Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.

Sequences of iteration marks are supported. If an illegal sequence of iteration marks is encountered, the implementation emits the illegal source character as-is, without considering its script. For example, the input "?ゝ" yields "??" even though the question mark is not hiragana.

Note that a full stop punctuation character "。" (U+3002) cannot be iterated (see below). Iteration marks themselves are emitted unchanged when they are illegal, i.e. when they would refer back past the beginning of the character stream.

The implementation buffers input only until a full stop punctuation character (U+3002) or EOF is reached, so that it does not have to keep a copy of the whole character stream in memory. Vertical iteration marks, which are even rarer than horizontal iteration marks in contemporary Japanese, are unsupported.
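The expansion rules described above can be pictured with a small standalone sketch. It is not the Lucene implementation: it works on a whole String rather than a buffered Reader, ignores surrogate pairs, and the class name IterationMarkSketch and its methods are hypothetical. It also assumes that an unvoiced mark copies its source character unchanged and that a voiced mark maps an unvoiced kana to the next code point, which is how the dakuten forms of these kana are laid out in Unicode.

```java
/** Hypothetical standalone sketch of iteration mark expansion; not the Lucene code. */
final class IterationMarkSketch {
    private static final char KANJI_MARK = '\u3005';           // 々
    private static final char HIRAGANA_MARK = '\u309D';        // ゝ
    private static final char HIRAGANA_VOICED_MARK = '\u309E'; // ゞ
    private static final char KATAKANA_MARK = '\u30FD';        // ヽ
    private static final char KATAKANA_VOICED_MARK = '\u30FE'; // ヾ
    private static final char FULL_STOP = '\u3002';            // 。

    // Kana whose voiced (dakuten) form is the next code point in Unicode.
    private static final String UNVOICED_HIRAGANA = "かきくけこさしすせそたちつてとはひふへほ";
    private static final String UNVOICED_KATAKANA = "カキクケコサシスセソタチツテトハヒフヘホ";

    static boolean isMark(char c) {
        return c == KANJI_MARK || c == HIRAGANA_MARK || c == HIRAGANA_VOICED_MARK
            || c == KATAKANA_MARK || c == KATAKANA_VOICED_MARK;
    }

    /** Resolves one mark against the source character it refers to. */
    static char resolve(char source, char mark) {
        if (mark == HIRAGANA_VOICED_MARK && UNVOICED_HIRAGANA.indexOf(source) >= 0) {
            return (char) (source + 1);
        }
        if (mark == KATAKANA_VOICED_MARK && UNVOICED_KATAKANA.indexOf(source) >= 0) {
            return (char) (source + 1);
        }
        return source; // plain copy, whatever the script of the source character
    }

    /** Expands every run of iteration marks in the given text. */
    static String expand(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int floor = 0; // marks may not refer to positions before this index
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (!isMark(c)) {
                out.append(c);
                if (c == FULL_STOP) {
                    floor = i + 1; // a full stop cannot be iterated
                }
                i++;
                continue;
            }
            // Count the run of consecutive marks starting at i.
            int run = 0;
            while (i + run < in.length() && isMark(in.charAt(i + run))) {
                run++;
            }
            // A run of n marks repeats the n characters before it, clamped to the floor.
            int span = Math.min(run, i - floor);
            if (span == 0) {
                out.append(c); // illegal mark: nothing to refer back to, emit as-is
                i++;
            } else {
                for (int k = 0; k < span; k++) {
                    out.append(resolve(in.charAt(i - span + k), in.charAt(i + k)));
                }
                i += span;
            }
            floor = i; // the next span may not reach back into this one
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("時々"));
        System.out.println(expand("馬鹿々々しい"));
        System.out.println(expand("?ゝ"));
    }
}
```

Under these assumptions, "時々" expands to "時時", "馬鹿々々しい" to "馬鹿馬鹿しい", and "?ゝ" to "??". A mark with nothing legal to refer back to, such as the first character of the input or a mark directly after "。", is passed through unchanged.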

  • Field Details

    • NORMALIZE_KANJI_DEFAULT

      public static final boolean NORMALIZE_KANJI_DEFAULT
      Normalize kanji iteration marks by default
    • NORMALIZE_KANA_DEFAULT

      public static final boolean NORMALIZE_KANA_DEFAULT
      Normalize kana iteration marks by default
    • KANJI_ITERATION_MARK

      private static final char KANJI_ITERATION_MARK
    • HIRAGANA_ITERATION_MARK

      private static final char HIRAGANA_ITERATION_MARK
    • HIRAGANA_VOICED_ITERATION_MARK

      private static final char HIRAGANA_VOICED_ITERATION_MARK
    • KATAKANA_ITERATION_MARK

      private static final char KATAKANA_ITERATION_MARK
    • KATAKANA_VOICED_ITERATION_MARK

      private static final char KATAKANA_VOICED_ITERATION_MARK
    • FULL_STOP_PUNCTUATION

      private static final char FULL_STOP_PUNCTUATION
    • h2d

      private static char[] h2d
    • k2d

      private static char[] k2d
    • buffer

      private final RollingCharBuffer buffer
    • bufferPosition

      private int bufferPosition
    • iterationMarksSpanSize

      private int iterationMarksSpanSize
    • iterationMarkSpanEndPosition

      private int iterationMarkSpanEndPosition
    • normalizeKanji

      private boolean normalizeKanji
    • normalizeKana

      private boolean normalizeKana
  • Constructor Details

    • JapaneseIterationMarkCharFilter

      public JapaneseIterationMarkCharFilter(Reader input)
      Constructor. Normalizes both kanji and kana iteration marks by default.
      Parameters:
      input - char stream
    • JapaneseIterationMarkCharFilter

      public JapaneseIterationMarkCharFilter(Reader input, boolean normalizeKanji, boolean normalizeKana)
      Constructor
      Parameters:
      input - char stream
      normalizeKanji - indicates whether kanji iteration marks should be normalized
      normalizeKana - indicates whether kana iteration marks should be normalized
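To make the two flags concrete, the sketch below shows one plausible reading of them: a disabled flag means the corresponding marks are no longer treated as iteration marks and therefore pass through untouched. The class IterationMarkFlagsSketch and its methods are hypothetical and only mirror the parameter names of the constructor; the code points are the Unicode values of the five horizontal iteration marks.

```java
/** Hypothetical sketch of how the two constructor flags could gate mark detection. */
final class IterationMarkFlagsSketch {
    static final char KANJI_MARK = '\u3005';           // 々
    static final char HIRAGANA_MARK = '\u309D';        // ゝ
    static final char HIRAGANA_VOICED_MARK = '\u309E'; // ゞ
    static final char KATAKANA_MARK = '\u30FD';        // ヽ
    static final char KATAKANA_VOICED_MARK = '\u30FE'; // ヾ

    /** Kanji marks only count when kanji normalization is enabled. */
    static boolean isKanjiMark(char c, boolean normalizeKanji) {
        return normalizeKanji && c == KANJI_MARK;
    }

    /** Kana marks (hiragana and katakana) only count when kana normalization is enabled. */
    static boolean isKanaMark(char c, boolean normalizeKana) {
        return normalizeKana
            && (c == HIRAGANA_MARK || c == HIRAGANA_VOICED_MARK
                || c == KATAKANA_MARK || c == KATAKANA_VOICED_MARK);
    }

    static boolean isIterationMark(char c, boolean normalizeKanji, boolean normalizeKana) {
        return isKanjiMark(c, normalizeKanji) || isKanaMark(c, normalizeKana);
    }

    public static void main(String[] args) {
        // Kanji-only configuration: 々 is a mark, ゝ is treated as ordinary text.
        System.out.println(isIterationMark('々', true, false));
        System.out.println(isIterationMark('ゝ', true, false));
    }
}
```

With this reading, passing (input, true, false) to the three-argument constructor would expand "々" but leave kana marks such as "ゝ" in the text.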
  • Method Details

    • read

      public int read(char[] buffer, int offset, int length) throws IOException
      Specified by:
      read in class Reader
      Throws:
      IOException
    • read

      public int read() throws IOException
      Overrides:
      read in class Reader
      Throws:
      IOException
    • normalizeIterationMark

      private char normalizeIterationMark(char c) throws IOException
      Normalizes the iteration mark character c
      Parameters:
      c - iteration mark character to normalize
      Returns:
      normalized iteration mark
      Throws:
      IOException - If there is a low-level I/O error.
    • nextIterationMarkSpanSize

      private int nextIterationMarkSpanSize() throws IOException
      Finds the number of consecutive iteration marks starting at the current buffer position
      Returns:
      number of iteration marks starting at the current buffer position
      Throws:
      IOException - If there is a low-level I/O error.
    • sourceCharacter

      private char sourceCharacter(int position, int spanSize) throws IOException
      Returns the source character for a given position and iteration mark span size
      Parameters:
      position - buffer position (should not exceed bufferPosition)
      spanSize - iteration mark span size
      Returns:
      source character
      Throws:
      IOException - If there is a low-level I/O error.
    • normalize

      private char normalize(char c, char m)
      Normalize a character
      Parameters:
      c - character to normalize
      m - repetition mark referring to c
      Returns:
      normalized character - return c on illegal iteration marks
    • normalizedHiragana

      private char normalizedHiragana(char c, char m)
      Normalize hiragana character
      Parameters:
      c - hiragana character
      m - repetition mark referring to c
      Returns:
      normalized character - return c on illegal iteration marks
    • normalizedKatakana

      private char normalizedKatakana(char c, char m)
      Normalize katakana character
      Parameters:
      c - katakana character
      m - repetition mark referring to c
      Returns:
      normalized character - return c on illegal iteration marks
    • isIterationMark

      private boolean isIterationMark(char c)
      Iteration mark character predicate
      Parameters:
      c - character to test
      Returns:
      true if c is an iteration mark character. Otherwise false.
    • isHiraganaIterationMark

      private boolean isHiraganaIterationMark(char c)
      Hiragana iteration mark character predicate
      Parameters:
      c - character to test
      Returns:
      true if c is a hiragana iteration mark character. Otherwise false.
    • isKatakanaIterationMark

      private boolean isKatakanaIterationMark(char c)
      Katakana iteration mark character predicate
      Parameters:
      c - character to test
      Returns:
      true if c is a katakana iteration mark character. Otherwise false.
    • isKanjiIterationMark

      private boolean isKanjiIterationMark(char c)
      Kanji iteration mark character predicate
      Parameters:
      c - character to test
      Returns:
      true if c is a kanji iteration mark character. Otherwise false.
    • lookupHiraganaDakuten

      private char lookupHiraganaDakuten(char c)
      Look up hiragana dakuten
      Parameters:
      c - character to look up
      Returns:
      hiragana dakuten variant of c or c itself if no dakuten variant exists
    • lookupKatakanaDakuten

      private char lookupKatakanaDakuten(char c)
      Look up katakana dakuten. Only full-width katakana are supported.
      Parameters:
      c - character to look up
      Returns:
      katakana dakuten variant of c or c itself if no dakuten variant exists
    • isHiraganaDakuten

      private boolean isHiraganaDakuten(char c)
      Hiragana dakuten predicate
      Parameters:
      c - character to check
      Returns:
      true if c is a hiragana dakuten and otherwise false
    • isKatakanaDakuten

      private boolean isKatakanaDakuten(char c)
      Katakana dakuten predicate
      Parameters:
      c - character to check
      Returns:
      true if c is a katakana dakuten and otherwise false
    • lookup

      private char lookup(char c, char[] map, char offset)
      Looks up a character in the dakuten map and returns its dakuten variant if one exists; otherwise returns the character itself
      Parameters:
      c - character to look up
      map - dakuten map
      offset - code point offset from c
      Returns:
      mapped character or c if no mapping exists
    • inside

      private boolean inside(char c, char[] map, char offset)
      Predicate indicating whether the lookup character is within the range of the dakuten map
      Parameters:
      c - character to look up
      map - dakuten map
      offset - code point offset from c
      Returns:
      true if c is mapped by map and otherwise false
    • correct

      protected int correct(int currentOff)
      Description copied from class: CharFilter
      Subclasses override to correct the current offset.
      Specified by:
      correct in class CharFilter
      Parameters:
      currentOff - current offset
      Returns:
      corrected offset
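The lookup and inside helpers listed under Method Details take a character, a dakuten map and an offset. A minimal sketch of that table-plus-offset technique follows. The class DakutenLookupSketch, the ten-entry KA_ROW table and the isDakuten check are all assumptions made for illustration; the sketch covers only か through ご and reads offset as the code point of the first character covered by the map.

```java
/** Hypothetical sketch of an offset-indexed dakuten map; covers only か..ご. */
final class DakutenLookupSketch {
    static final char KA = '\u304B'; // か, first character covered by the table

    // Entry i holds the voiced form of the character with code point KA + i.
    // Voiced characters map to themselves.
    static final char[] KA_ROW = {
        '\u304C', '\u304C', // か, が -> が
        '\u304E', '\u304E', // き, ぎ -> ぎ
        '\u3050', '\u3050', // く, ぐ -> ぐ
        '\u3052', '\u3052', // け, げ -> げ
        '\u3054', '\u3054', // こ, ご -> ご
    };

    /** True if c falls inside the code point range covered by the map. */
    static boolean inside(char c, char[] map, char offset) {
        return c >= offset && c < offset + map.length;
    }

    /** Returns the mapped character, or c itself if c is outside the map. */
    static char lookup(char c, char[] map, char offset) {
        return inside(c, map, offset) ? map[c - offset] : c;
    }

    /** Assumed definition: a character is a dakuten form if it maps to itself. */
    static boolean isDakuten(char c, char[] map, char offset) {
        return inside(c, map, offset) && lookup(c, map, offset) == c;
    }

    public static void main(String[] args) {
        System.out.println(lookup('か', KA_ROW, KA));    // voiced form of か
        System.out.println(lookup('さ', KA_ROW, KA));    // outside this small table
        System.out.println(isDakuten('ぎ', KA_ROW, KA));
    }
}
```

Indexing by c - offset keeps the lookup to a bounds check and an array read, and letting voiced characters map to themselves allows the same table to back both the lookup and the dakuten predicate.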