Afrikaanse rekenaarlinguïstiek

(Afrikaans computational linguistics)

Here are some resources that I have collected and written for free (as in free software) computational linguistics in Afrikaans. Most of them are used, or will be used, with the Apertium machine translation system for the English-Afrikaans language pair.

Wordlists

Corpora

Aligned

Morphological analyser

This is part of the Apertium English-Afrikaans language pair.

The morphological analyser is specified in an XML format and compiled into a finite-state machine which can be read by the lt-proc utility. For any given input surface form, it outputs all of the possible analyses (lexical forms) which can be found in the dictionary. For example:

$ echo "aan die deur" | lt-proc af-en.automorf.bin
^aan/aan<pr>$ ^die/die<det><def><sg>/die<det><def><sp>/die<det><def><pl>$ ^deur/deur<n><sg>/deur<pr>$

The analyser currently has around 11,000 lemmata and can analyse around 20,000 surface forms. You can also try it online.
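
If you want to post-process the analyser output in a script, the format shown above is easy to pick apart. The following is only a minimal sketch, not part of Apertium: the function name parse_analyses is made up for illustration, and it assumes that every unit has the shape ^surface/analysis/...$ as in the example, with no escaped characters inside the forms.

```python
import re

# One unit in the analyser output looks like ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def parse_analyses(stream):
    """Return a list of (surface form, [analyses]) pairs.

    Hypothetical helper: assumes no escaped '/', '^' or '$' in the forms.
    """
    units = []
    for match in LEXICAL_UNIT.finditer(stream):
        # The first field is the surface form, the rest are its analyses.
        surface, *analyses = match.group(1).split("/")
        units.append((surface, analyses))
    return units


output = ("^aan/aan<pr>$ "
          "^die/die<det><def><sg>/die<det><def><sp>/die<det><def><pl>$ "
          "^deur/deur<n><sg>/deur<pr>$")

for surface, analyses in parse_analyses(output):
    # Print each surface form with its number of analyses.
    print(surface, len(analyses), analyses)
```

Run on the example output, this shows that "deur" is ambiguous between a noun and a preposition, which is the kind of ambiguity that a part of speech tagger has to resolve.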

Part of speech tagger

This is part of the Apertium English-Afrikaans language pair.

The part of speech tagger was trained in an unsupervised manner on the database dump of the Afrikaans Wikipedia. As input it takes the output of the morphological analyser (see above). For any given set of input analyses, it outputs the most likely analysis for the word in context. For example:

$ echo "Die man klop aan die deur." | lt-proc af-en.automorf.bin | apertium-tagger -g af-en.prob
^Die<det><def><sg>$ ^man<n><sg>$ ^klop<vblex><pres>$ ^aan<pr>$ ^die<det><def><sg>$ ^deur<n><sg>$^.<sent>$

$ echo "Die man stap deur die gang." | lt-proc af-en.automorf.bin | apertium-tagger -g af-en.prob
^Die<det><def><sg>$ ^man<n><sg>$ ^stap<n><sg>$ ^deur<pr>$ ^die<det><def><sg>$ ^gang<n><sg>$^.<sent>$

$ echo "Die man loop deur die deur." | lt-proc af-en.automorf.bin | apertium-tagger -g af-en.prob
^Die<det><def><sg>$ ^man<n><sg>$ ^loop<vblex><pres>$ ^deur<pr>$ ^die<det><def><sg>$ ^deur<n><sg>$^.<sent>$
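
The tagger output above can be read back into (lemma, tags) pairs with a few lines of Python. Again, this is a rough sketch under assumptions: parse_tagged is a hypothetical helper, not an Apertium tool, and it assumes every unit looks like ^lemma<tag><tag>$ with exactly one analysis left per word.

```python
import re

# A disambiguated unit looks like ^lemma<tag1><tag2>$
TAGGED_UNIT = re.compile(r"\^([^<$]*)((?:<[^>]+>)*)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_tagged(stream):
    """Return a list of (lemma, [tags]) pairs from tagger output.

    Hypothetical helper: assumes one analysis per unit and no escapes.
    """
    return [(m.group(1), TAG.findall(m.group(2)))
            for m in TAGGED_UNIT.finditer(stream)]


tagged = ("^Die<det><def><sg>$ ^man<n><sg>$ ^loop<vblex><pres>$ "
          "^deur<pr>$ ^die<det><def><sg>$ ^deur<n><sg>$^.<sent>$")

for lemma, tags in parse_tagged(tagged):
    # The first tag is the part of speech in the examples above.
    print(lemma, tags[0] if tags else "unknown")
```

In the third example sentence, the two occurrences of "deur" come out with different first tags (pr and n), so the choice made by the tagger in context is directly visible.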

Compound resolver

The compound resolver uses the morphological dictionary to attempt to split Afrikaans compound words into their constituent parts in a left-to-right, longest-match (LRLM) manner. For example:

$ python compound-resolver.py wordlist-af nasionaleverdedigingsoorwegings
['nasionale', 'verdedigings', 'oorwegings']

This module is currently under development.
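
To give an idea of what left-to-right, longest-match splitting means, here is a small self-contained sketch. It is not the code in compound-resolver.py: the function split_compound and the tiny lexicon are invented for illustration, and the sketch assumes that forms with a linking "s" (such as "verdedigings") are listed in the lexicon as they are.

```python
def split_compound(word, lexicon):
    """Split word into lexicon entries, longest match first, left to right.

    Returns a list of parts, or None if no complete split exists.
    Illustrative sketch only; backs off to shorter prefixes on failure.
    """
    if not word:
        return []
    # Try the longest prefix first, then progressively shorter ones.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


# Hypothetical miniature lexicon, standing in for a real wordlist.
lexicon = {"nasionale", "verdediging", "verdedigings", "oor", "oorwegings"}

print(split_compound("nasionaleverdedigingsoorwegings", lexicon))
```

A word that is itself in the lexicon comes back as a single part, and a word that cannot be covered completely gives None. The plain recursion can get slow on long words with many candidate prefixes, so a real implementation would probably want memoisation.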

Contact

If you would like more information on any of this, or instructions for use, I invite you to check out the Apertium Wiki. If you want to email me, try Googling for "Francis Tyers" (my name) and "apertium-stuff" (our mailing list).

Interesting publications

Licensing

The content of this site that is Copyright © Francis Tyers is dual-licensed under the GNU General Public Licence and the Creative Commons Attribution Share-Alike 3.0 Licence. The important thing for me is Copyleft (Kopielinks), which basically means that we all contribute our changes and work to a common pool that everyone can use. So if you need this stuff under a licence different from the ones above, but still Copyleft, email me.
