SeCo Language Analysis Services

We provide the following JSON-returning Web Services:

Language Recognition
Lemmatization
Morphological Analysis
Inflected Form Generation
Hyphenation

The tools backing these services are mostly not originally our own, but we've wrapped them for your convenience. For specifics, see the details of each service. For general questions about this service, contact eetu.makela@aalto.fi.

Language Recognition

Tries to recognize the language of the input text. Call with e.g. /las/identify?text=The+quick+brown+fox+jumps+over+the+lazy+dog
or with a list of candidate locales, e.g. /las/identify?text=The+quick+brown+fox+jumps+over+the+lazy+dog&locales=fi&locales=en&locales=sv
Also available via HTTP POST, with parameters given either form-urlencoded or as JSON. For intensive use, there is also a WebSocket version that accepts JSON at /las/identifyWS. All methods are CORS-enabled.
Returns results as JSON, e.g.:

{"locale":"en","certainty":0.6803500000000001,"details":{"languageRecognizerResults":{"en":0.1973},"languageDetectorResults":[{"en":1.0}],"hfstAcceptorResults":[{"en":0.84375},{"fi":0.09375},{"sme":0.010416666666666666},{"sv":0.010416666666666666},{"la":0.010416666666666666},{"tr":0.010416666666666666},{"de":0.010416666666666666},{"it":0.010416666666666666}]}}
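As a rough illustration of the calls above, the following Python sketch builds the identify URL with one locales parameter per candidate and picks the recognized locale out of a response shaped like the sample. The base URL and the function names are placeholders of ours, not part of the service; substitute the host where the service is actually deployed.

```python
import json
from urllib.parse import urlencode

# Assumption: placeholder host; replace with the real deployment.
BASE_URL = "http://example.org/las"


def identify_url(text, locales=()):
    """Build a GET URL for /las/identify, repeating `locales` once per candidate."""
    params = [("text", text)] + [("locales", loc) for loc in locales]
    return BASE_URL + "/identify?" + urlencode(params)


def parse_identify(body):
    """Return (locale, certainty) from a JSON response shaped like the sample above."""
    result = json.loads(body)
    return result["locale"], result["certainty"]


url = identify_url("The quick brown fox jumps over the lazy dog", ["fi", "en", "sv"])
print(url)

# Abbreviated copy of the sample response shown above.
sample = '{"locale":"en","certainty":0.6803500000000001,"details":{}}'
print(parse_identify(sample))
```

urlencode encodes spaces as '+', which matches the example URLs in this section.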

When called without parameters but with an Accept header other than text/html, returns the supported locales as JSON, e.g.:

{"acceptedLocales":["af","an","ar","ast","be","bg","bn","br","ca","cs","cy","da","de","el","en","es","et","eu","fa","fi","fr","ga","gl","gu","he","hi","hr","ht","hu","id","is","it","ja","km","kn","ko","la","liv","lt","lv","mdf","mhr","mk","ml","mr","mrj","ms","mt","myv","ne","nl","no","oc","pa","pl","pt","ro","ru","sk","sl","sme","so","sq","sr","sv","sw","ta","te","th","tl","tr","udm","uk","ur","vi","yi","zh-CN","zh-TW"]}

Pretty printing is enabled with the boolean parameter pretty.
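The supported locales could be requested as in the sketch below, which prepares a parameterless GET with an Accept header other than text/html and reads the acceptedLocales array from the reply. The host is a placeholder and the helper names are ours; the request is only prepared here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://example.org/las"  # Assumption: placeholder host.


def locales_request(service):
    """Prepare a parameterless GET whose Accept header is not text/html."""
    return Request(f"{BASE_URL}/{service}", headers={"Accept": "application/json"})


def accepted_locales(body):
    """Extract the list of locales from a response shaped like the sample above."""
    return json.loads(body)["acceptedLocales"]


req = locales_request("identify")
print(req.get_method(), req.full_url, req.get_header("Accept"))

# Abbreviated copy of the sample response shown above.
sample = '{"acceptedLocales":["af","an","ar"]}'
print(accepted_locales(sample))
```

To actually send the request, pass req to urllib.request.urlopen and feed the decoded body to accepted_locales.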

In total, the service supports 78 locales, combining results from three sources:

The language-detector library (locales af, an, ar, ast, be, bg, bn, br, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fr, ga, gl, gu, he, hi, hr, ht, hu, id, is, it, ja, km, kn, ko, lt, lv, mk, ml, mr, ms, mt, ne, nl, no, oc, pa, pl, pt, ro, ru, sk, sl, so, sq, sr, sv, sw, ta, te, th, tl, tr, uk, ur, vi, yi, zh-CN, zh-TW),
custom code based on the list of cues at the Wikipedia language recognition chart (locales cs, de, en, es, et, fi, fr, hu, it, pl, pt, ro, ru, sk, sv), and
finite state transducers provided by the HFST, Omorfi and Giellatekno projects (locales de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm)

Where multiple sources are available for a language, they seem to complement each other well. The probabilistic language-detector library generally gives good results, but can misidentify short strings. The finite state transducers, on the other hand, reliably recognize even single words as belonging to a language, but may have trouble distinguishing closely related languages when sentences contain many compound, loan or slang words. Finally, the custom cue-based code sits methodologically between the other two and helps break ties between them.
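In the sample response shown earlier, the reported certainty (0.68035) happens to equal the mean of the three per-source scores for "en" (0.1973, 1.0 and 0.84375). The sketch below recomputes that mean from the details object. This is only an observation about that one sample, not a documented formula, and the helper names are ours.

```python
import json

# Abbreviated copy of the sample identify response.
SAMPLE = """{"locale":"en","certainty":0.6803500000000001,"details":{
"languageRecognizerResults":{"en":0.1973},
"languageDetectorResults":[{"en":1.0}],
"hfstAcceptorResults":[{"en":0.84375},{"fi":0.09375}]}}"""


def source_score(results, locale):
    """Look up `locale` in one source's results (a dict, or a list of one-entry dicts)."""
    if isinstance(results, dict):
        return results.get(locale, 0.0)
    for entry in results:
        if locale in entry:
            return entry[locale]
    # Assumption: a source that does not mention the locale counts as 0.
    return 0.0


def mean_source_score(response):
    """Average the per-source scores for the locale the service reported."""
    locale = response["locale"]
    scores = [source_score(r, locale) for r in response["details"].values()]
    return sum(scores) / len(scores)


response = json.loads(SAMPLE)
print(response["certainty"], mean_source_score(response))
```

Whether the service combines sources this way in general, for example when one source does not support a locale, cannot be told from a single sample.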

Lemmatization

Lemmatizes the input, reducing each word to its base form.
Call with e.g. /las/baseform?text=Albert+osti+fagotin+ja+t%C3%B6r%C3%A4ytti+puhkuvan+melodian.&locale=fi
or just /las/baseform?text=The+quick+brown+fox+jumps+over+the+lazy+dog to have the locale guessed.
Also available via HTTP POST, with parameters given either form-urlencoded or as JSON. For intensive use, there is also a WebSocket version that accepts JSON at /las/baseformWS. All methods are CORS-enabled.
Returns results as JSON (e.g. "Albert ostaa fagotti ja töräyttää puhkua melodia." when the locale is given, or {"locale":"en","baseform":"the quick brown fox jump over the lazy dog"} when it is guessed).
When called without parameters but with an Accept header other than text/html, returns the 21 supported locales as JSON. The boolean parameter segment can be set to segment compound words with a '#'. The boolean parameter guess controls whether baseforms are guessed for unknown words. The optional depth parameter takes either 0 or 1 for a shallower or deeper analysis (default=1). Pretty printing is enabled with the boolean parameter pretty.

Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects where available (locales de, en, fi, fr, it, la, liv, mdf, mhr, mrj, myv, sme, sv, tr, udm). Note that the quality and scope of the lemmatization varies wildly between languages.
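Snowball stemmers are used for locales dk, es, nl, no, pt and ru. (Snowball stemmers also exist for de, en, fi, fr, it and sv, but they are not used, as transducers are available for those locales.)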
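A POST call with a JSON body might be prepared as in the following sketch. It assumes that the JSON field names mirror the query parameter names and that boolean parameters can be sent as JSON booleans; neither is spelled out above. The host is a placeholder and the function name is ours.

```python
import json
from urllib.request import Request

BASE_URL = "http://example.org/las"  # Assumption: placeholder host.


def baseform_request(text, **options):
    """Prepare a JSON POST to /las/baseform.

    Assumption: JSON field names match the query parameters
    (locale, segment, guess, depth, pretty).
    """
    payload = {"text": text, **options}
    return Request(
        f"{BASE_URL}/baseform",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


req = baseform_request(
    "Albert osti fagotin ja töräytti puhkuvan melodian.", locale="fi", segment=True
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Leaving out locale lets the service guess it, as in the second GET example above.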

Morphological Analysis

Gives a morphological analysis of the text. Call with e.g. /las/analyze?text=Albert+osti&locale=fi&forms=V+N+Nom+Sg&forms=N+Nom+Pl
or just /las/analyze?text=Bier+bitte to have the locale guessed.
Also available via HTTP POST, with parameters given either form-urlencoded or as JSON. For intensive use, there is also a WebSocket version that accepts JSON at /las/analyzeWS. All methods are CORS-enabled.
Returns results as JSON. Of the two examples below, the first is for the call with an explicit locale and the second for the call where the locale is guessed:

[ {
  "word" : "Albert",
  "analysis" : [ {
    "weight" : 0.099609375,
    "wordParts" : [ {
      "lemma" : "Albert",
      "tags" : {
        "SEGMENT" : [ "Albert" ],
        "KTN" : [ "5" ],
        "UPOS" : [ "PROPN" ],
        "NUM" : [ "SG" ],
        "PROPER" : [ "LAST" ],
        "BASEFORM_FREQUENCY" : [ "2712" ],
        "CASE" : [ "NOM" ]
      }
    } ],
    "globalTags" : {
      "HEAD" : [ "3" ],
      "FIRST_IN_SENTENCE" : [ "TRUE" ],
      "DEPREL" : [ "nsubj" ],
      "POS_MATCH" : [ "TRUE" ],
      "BEST_MATCH" : [ "TRUE" ],
      "BASEFORM_FREQUENCY" : [ "2712" ]
    }
  }, {
    "weight" : 0.099609375,
    "wordParts" : [ {
      "lemma" : "Albert",
      "tags" : {
        "SEGMENT" : [ "Albert" ],
        "KTN" : [ "5" ],
        "UPOS" : [ "PROPN" ],
        "NUM" : [ "SG" ],
        "SEM" : [ "MALE" ],
        "PROPER" : [ "FIRST" ],
        "BASEFORM_FREQUENCY" : [ "2712" ],
        "CASE" : [ "NOM" ]
      }
    } ],
    "globalTags" : {
      "HEAD" : [ "3" ],
      "FIRST_IN_SENTENCE" : [ "TRUE" ],
      "DEPREL" : [ "nsubj" ],
      "POS_MATCH" : [ "TRUE" ],
      "BEST_MATCH" : [ "TRUE" ],
      "BASEFORM_FREQUENCY" : [ "2712" ]
    }
  } ]
}, {
  "word" : " ",
  "analysis" : [ {
    "weight" : 1.0,
    "wordParts" : [ {
      "lemma" : " ",
      "tags" : { }
    } ],
    "globalTags" : {
      "WHITESPACE" : [ "TRUE" ],
      "BEST_MATCH" : [ "TRUE" ]
    }
  } ]
}, {
  "word" : "osti",
  "analysis" : [ {
    "weight" : 0.099609375,
    "wordParts" : [ {
      "lemma" : "ostaa",
      "tags" : {
        "TENSE" : [ "PAST" ],
        "SEGMENT" : [ "ost", "{MB}i" ],
        "KTN" : [ "53" ],
        "UPOS" : [ "VERB" ],
        "MOOD" : [ "INDV" ],
        "PERS" : [ "SG3" ],
        "INFLECTED_FORM" : [ "V N Nom Sg" ],
        "VOICE" : [ "ACT" ],
        "INFLECTED" : [ "ostaminen" ],
        "BASEFORM_FREQUENCY" : [ "4034" ]
      }
    } ],
    "globalTags" : {
      "HEAD" : [ "0" ],
      "DEPREL" : [ "ROOT" ],
      "POS_MATCH" : [ "TRUE" ],
      "BEST_MATCH" : [ "TRUE" ],
      "BASEFORM_FREQUENCY" : [ "4034" ]
    }
  } ]
} ]

{
  "locale" : "de",
  "analysis" : [ {
    "word" : "Bier",
    "analysis" : [ {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "Bier",
        "tags" : {
          "Neut" : [ "Neut" ],
          "Sg" : [ "Sg" ],
          "+NN" : [ "+NN" ],
          "Nom" : [ "Nom" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "Bier",
        "tags" : {
          "Neut" : [ "Neut" ],
          "Sg" : [ "Sg" ],
          "Dat" : [ "Dat" ],
          "+NN" : [ "+NN" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "Bier",
        "tags" : {
          "Akk" : [ "Akk" ],
          "Neut" : [ "Neut" ],
          "Sg" : [ "Sg" ],
          "+NN" : [ "+NN" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    } ]
  }, {
    "word" : " ",
    "analysis" : [ {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : " ",
        "tags" : { }
      } ],
      "globalTags" : {
        "WHITESPACE" : [ "TRUE" ]
      }
    } ]
  }, {
    "word" : "bitte",
    "analysis" : [ {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "bitten",
        "tags" : {
          "Sg" : [ "Sg" ],
          "+V" : [ "+V" ],
          "1" : [ "1" ],
          "Konj" : [ "Konj" ],
          "Pres" : [ "Pres" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "bitten",
        "tags" : {
          "Sg" : [ "Sg" ],
          "Ind" : [ "Ind" ],
          "+V" : [ "+V" ],
          "1" : [ "1" ],
          "Pres" : [ "Pres" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "bitten",
        "tags" : {
          "Sg" : [ "Sg" ],
          "+V" : [ "+V" ],
          "Konj" : [ "Konj" ],
          "3" : [ "3" ],
          "Pres" : [ "Pres" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "bitten",
        "tags" : {
          "Sg" : [ "Sg" ],
          "+V" : [ "+V" ],
          "Imp" : [ "Imp" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "bitte",
        "tags" : {
          "+PTKL" : [ "+PTKL" ],
          "Ant" : [ "Ant" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    }, {
      "weight" : 1.0,
      "wordParts" : [ {
        "lemma" : "bitte",
        "tags" : {
          "+ADV" : [ "+ADV" ]
        }
      } ],
      "globalTags" : {
        "BEST_MATCH" : [ "TRUE" ]
      }
    } ]
  } ]
}
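The two response shapes above (a bare list when the locale is given, an object with locale and analysis keys when it is guessed) can be handled uniformly. The sketch below collects, for each non-whitespace token, the lemma of the first analysis flagged BEST_MATCH. The function names are ours, and joining the lemmas of several wordParts without a separator is an assumption about how compound parts should be combined.

```python
def best_lemmas(response):
    """Return (word, lemma) pairs for the first BEST_MATCH analysis of each token."""
    # Explicit locale: bare list. Guessed locale: {"locale": ..., "analysis": [...]}.
    tokens = response["analysis"] if isinstance(response, dict) else response
    pairs = []
    for token in tokens:
        for analysis in token["analysis"]:
            tags = analysis.get("globalTags", {})
            if "WHITESPACE" in tags:
                break  # skip whitespace tokens entirely
            if tags.get("BEST_MATCH") == ["TRUE"]:
                # Assumption: compound parts are joined without a separator.
                lemma = "".join(part["lemma"] for part in analysis["wordParts"])
                pairs.append((token["word"], lemma))
                break
    return pairs


def make_token(word, lemma, *flags):
    """Build a heavily trimmed token in the shape of the samples above."""
    return {
        "word": word,
        "analysis": [{
            "weight": 1.0,
            "wordParts": [{"lemma": lemma, "tags": {}}],
            "globalTags": {flag: ["TRUE"] for flag in flags},
        }],
    }


tokens = [
    make_token("Albert", "Albert", "BEST_MATCH"),
    make_token(" ", " ", "WHITESPACE", "BEST_MATCH"),
    make_token("osti", "ostaa", "BEST_MATCH"),
]
print(best_lemmas(tokens))
print(best_lemmas({"locale": "fi", "analysis": tokens}))
```

As the German sample shows, several analyses of one word can all carry BEST_MATCH, so taking the first one is a simplification.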

When called without parameters but with an Accept header other than text/html, returns the 15 supported locales as JSON (e.g. {"acceptedLocales":["de","en","fi","fr","it","la","liv","mdf","mhr","mrj","myv","sme","sv","tr","udm"]}). The boolean parameter segment can be set to segment compound words with a '#'. The boolean parameter guess controls whether baseforms are guessed for unknown words. The optional depth parameter takes a value from 0 to 2 for a shallower or deeper analysis (default=2). Pretty printing is enabled with the boolean parameter pretty. The analysis web service also supports inflection, with the same parameters as the inflection service.

Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Note that the quality and scope of analysis as well as tags returned vary wildly between languages.

Inflected Form Generation

Transforms the text according to a set of inflection forms, by default also converting words that match none of the forms to their base form. Call with e.g. /las/inflect?text=Albert+osti+fagotin&forms=V+N+Nom+Sg&forms=N+Nom+Pl&segment=true
or /las/inflect?text=Albert+osti+fagotin&forms=V+N+Nom+Sg&forms=N+Nom+Pl
Also available via HTTP POST, with parameters given either form-urlencoded or as JSON. For intensive use, there is also a WebSocket version that accepts JSON at /las/inflectWS. All methods are CORS-enabled.
Returns results as JSON (e.g. "Albert ostaminen fagotit")
When called without parameters but with an Accept header other than text/html, returns the 14 supported locales as JSON (e.g. {"acceptedLocales":["de","en","fi","fr","it","liv","mdf","mhr","mrj","myv","sme","sv","tr","udm"]}). The boolean parameter segment can be set to segment compound words with a '#'. The boolean parameter guess controls whether baseforms are guessed for unknown words. The boolean parameter baseform controls whether words that are not inflected are returned in their baseform or in their original form. Pretty printing is enabled with the boolean parameter pretty.

Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Note that the inflection form syntaxes differ wildly between languages.
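For the form-urlencoded POST variant, a request might be prepared as in this sketch, which repeats the forms field once per inflection form just as the GET examples do. The host is a placeholder, the function name is ours, and sending false as the string "false" is an assumption (only segment=true appears in the examples above).

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.org/las"  # Assumption: placeholder host.


def inflect_request(text, forms, **options):
    """Prepare a form-urlencoded POST to /las/inflect, repeating `forms` per form."""
    fields = [("text", text)] + [("forms", form) for form in forms]
    for name, value in options.items():
        # Assumption: booleans are written as "true"/"false", as in segment=true.
        fields.append((name, str(value).lower() if isinstance(value, bool) else value))
    return Request(
        f"{BASE_URL}/inflect",
        data=urlencode(fields).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = inflect_request("Albert osti fagotin", ["V N Nom Sg", "N Nom Pl"], segment=True)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

The body produced here is the same string as the query part of the first GET example above.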

Hyphenation

Hyphenates the given text. Call with e.g. /las/hyphenate?text=Albert+osti+fagotin+ja+t%C3%B6r%C3%A4ytti+puhkuvan+melodian.&locale=fi
or just /las/hyphenate?text=ein+Bier+bitte to have the locale guessed.
Also available via HTTP POST, with parameters given either form-urlencoded or as JSON. For intensive use, there is also a WebSocket version that accepts JSON at /las/hyphenateWS. All methods are CORS-enabled.
Returns results as JSON (e.g. "al-bert os-ti fa-go-tin ja tö-räyt-ti puh-ku-van me-lo-dian ." or {"locale":"fi","hyphenation":"ein bier bit-te"})
When called without parameters but with an Accept header other than text/html, returns the 46 supported locales as JSON, e.g.:

{"acceptedLocales":["bg","ca","cop","cs","cy","da","el","es","et","eu","fi","fr","ga","gl","hr","hsb","hu","ia","in","is","it","la","liv","mdf","mhr","mn","mrj","myv","nb","nl","nn","pl","pt","ro","ru","sa","sh","sk","sl","sme","sr","sv","tr","udm","uk","zh"]}

Pretty printing is enabled with the boolean parameter pretty.

Uses finite state transducers provided by the HFST, Omorfi and Giellatekno projects. Those provided by HFST have been automatically translated from the TeX CTAN distribution's hyphenation rulesets.
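Since the hyphenation result is either a bare JSON string or an object that also carries the locale, client code may want to normalize the two shapes. The sketch below does that and counts the break points per token. The function names are ours, and the count treats every '-' as a break point, so hyphens already present in the input would be miscounted.

```python
import json


def parse_hyphenation(body):
    """Return (locale, hyphenated_text); locale is None for a bare JSON string."""
    result = json.loads(body)
    if isinstance(result, str):
        return None, result
    return result["locale"], result["hyphenation"]


def break_points(hyphenated):
    """Return (token, number of break points) per whitespace-separated token."""
    # Caveat: every '-' is treated as a break point inserted by the service.
    return [(tok.replace("-", ""), tok.count("-")) for tok in hyphenated.split()]


locale, text = parse_hyphenation('"al-bert os-ti fa-go-tin"')
print(locale, break_points(text))

locale2, text2 = parse_hyphenation('{"locale":"fi","hyphenation":"ein bier bit-te"}')
print(locale2, break_points(text2))
```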
