ConLex

I mentioned in my first post that I enjoy the activity of constructing languages. This may seem like a peculiar exercise, and that could certainly be argued, but I for one (as well as a bunch of other people) really enjoy it. There are lots of ways to go about creating a new language, and there are several reasons to do it. Normally, a constructed language falls into one of these three categories:

  • Engineered Languages are typically made in order to fulfill a certain role very well. Lojban is an example of this.
  • Auxiliary Languages are typically made with the intention of global adoption. They are sometimes called “universal languages.” This is usually a futile exercise, although it has had varying degrees of success. Esperanto is a relatively successful example of this (you can even read that page in Esperanto).
  • Artistic Languages are languages that people make for the sake of making cool, interesting, perhaps beautiful-sounding languages. Frequently, their creators try to make them follow rules and patterns similar to those of natural languages. Dothraki seems to be pretty well known right now due to its appearance in the popular book series and TV show Game of Thrones.

While Engineered Languages can be an interesting exploration of the capacity of language, and Auxiliary Languages can be an interesting exploration of one’s ego, I certainly prefer the creation of artistic languages, or “artlangs” for short. I think I can say with certainty that no language (constructed or natural) is ever “complete.” Languages are always in a state of flux, being changed either by their speakers or by their creators. Hopefully, then, it is a little less unimpressive when I say I have never completed a language. I have never come close. I have a gajillion billion sketches and ideas for languages, and a million starts besides, but only a few ever get fleshed out enough to really look like something meaningful. And most of these eventually stall. There are a few, though, that I keep coming back to.

One of the hardest parts of making a language is creating a dictionary. Making the actual words, connecting them, and making everything fit together in a nice, natural way is pretty difficult in and of itself. Add to that the simultaneous effort of creating a dictionary format that is readable, easily navigable, and appropriate for the type of lexicon that this language has, and it can be quite a chore. There are several options for putting together dictionaries right now that various people use, but most of them have some gaping holes in their applicability to conlanging.

  • Spreadsheets can be good for organization, but they look like crap and are terribly difficult to read; not to mention, they are totally unyielding to multiple definitions, example sentences, or other such things that are absolutely necessary for a good dictionary entry.
  • Microsoft Word, or some similar word processor, can be used, and it can look pretty good, but there is the monotonous, redundant typing of the same skeleton for every entry that eventually leads to either tendonitis or a severe lack of enthusiasm. It also does not provide a whole lot of control; it may look okay, but it likely won’t look exactly as you imagined it.
  • LaTeX is a markup language that some use. If used properly, it can make some beautiful documents. Along with various libraries, it provides a huge amount of control, and with the use of macros, it can reduce a lot of redundant typing; however it is another language that has to be learned, and while there is a great deal of good documentation, it can be rather difficult to find all of the more useful things in a timely manner.
  • Various other tools and applications also exist that are designed for field linguists who need to record lexicons of languages “in the wild.” Some of these can be useful (SIL has a lot of things useful to linguists in general), but most of them are either platform-dependent, very costly, or extremely buggy.

This is where I come in. I, an enthusiastic conlanger and amateur coder, have decided to tackle this issue with the full force of my C++ and Qt skills. At this point, my only inspiration is my own desire for a useful tool, and a snippet of conversation from a wonderful podcast on conlanging called The Conlangery Podcast. Conlangery #56: 45:30 -> 47:15 describes a very fine lexicon tool. My development has brought me nowhere near that point yet, but it has begun, and more importantly, I have a lot of ideas, both on paper and still floating in my head, which should be implementable with time. I will likely highlight many of these in later posts, but right now, I am going to talk about the one that is currently most prominent in my mind.

There are so many languages out there, spoken in the world. Then there are all of the dead languages that are no longer spoken. These alone provide an enormously diverse sample. Add to that the languages that people make as a hobby, which are technically not tied to the rules governing natural languages, and only more variation and diversity appears. If I am to make a tool that works for every language out there, I can think of two options: give the tool everything it needs for any language, or make it extremely customizable. Each has pros and cons; however, I prefer the second option and have already, for the most part, implemented it.

  1. Provide Everything. I could (if I really wanted to, and I had a lot of time on my hands) go through and look at every prominent language, or language group/family, and say, “A lot of languages need that in a dictionary, and a lot of other languages need this in a dictionary.” I would then likely need to look at a lot of less prominent languages, and a bunch of conlangs, to get further examples of what types of dictionary formats might be needed. This would certainly be a very interesting and educational exercise (one I might do at some point just for fun); however, it would take forever and inevitably be futile. Even if I found every dictionary format in use and added it to the system, it would not be enough, because the best conlangs are the ones that break the rules and don’t do what all the other languages have done. They would need a whole other format, one honed just for their unique conlang.
  2. Customization. [I’m sorry if the above description sounded exaggerated (because it was), but I couldn’t/didn’t-want-to find a way to make it work since I found another, better way.] Instead of pre-creating every imaginable dictionary format in advance, what if I gave the user a couple of generic puzzle pieces that they could put together however they saw fit? This can be dangerous, though, because if the so-called puzzle pieces are generic and abstract, the user will likely have trouble knowing what exactly to do with them. I have to provide an intuitive design and interface. I think this should be true of all UI design, but I want to keep it in focus here, as a lot of customizability can make a design less intuitive.

Okay, customization; that’s all well and good, but… how to do it? Well, let’s look at an example dictionary entry and see what is needed.

Arbitrary, Fictional, Dictionary Entry
si·na·ti – /si-na’-ti/ 1. Noun a. Horse or pony b. A coach or buggy pulled by a horse; 2. Verb a. Run like a horse; fesinati /fe’-si-na,-ti/ Noun, sintai /sin’-tay/ Verb

(This isn’t really a very good entry for a word; maybe the origin, or more detailed relation to other words would be nice, and it is definitely lacking those example sentences, but it will certainly suffice for this exercise.) Let’s now analyze this dictionary entry and see what it requires, and abstract that to something useful to any dictionary format. I see a lot of words, some in English, some not. I see a lot of filler (“1.”, “a.”, “/…/”, “·…·”, ” – “, etc.). I see some arbitrary senses (“Horse or pony”, “Run like a horse”). I see identifiers, likely of a small/finite set (“Noun”, “Verb”). I see recurring patterns (multiple senses, multiple related words, multiple pronunciations).

  • Filler. Yes, I made this entry in a couple seconds just for this blog, but I made everything in it for a reason (admittedly, the reason for the “·” was purely aesthetic and totally irrelevant; I guess it’s filler, but whatever). Filler is a very important aspect of a dictionary entry. If each element of a dictionary entry were laid out in front of you without context, it would likely make very little sense. The filler provides the context, creates greater separation where it is needed, and brings things closer where that is needed. It makes the entry readable and user-friendly, two very important things for a good dictionary.
  • Arbitrary Senses. Senses is less important, but Arbitrary is everything. I think it’s obvious that there has to be a way to put very arbitrary text in a dictionary entry. It’s necessary for everything from pronunciations to senses and example sentences. This is almost certainly going to be the biggest, most used aspect of every dictionary format.
  • Identifiers of a Finite Set. This one is a little more interesting. I have included this one in the set of puzzle pieces for constructing formats because of sorting. That is one really important thing that ConLex has to do. But not everyone is going to want to sort their dictionary the same way. Some may just want to sort it alphabetically by the name of each word, but someone else might want to sort it by what sort of inflections each word has (conjugations, if you will); someone else may want to sort it by the type of each word because that just makes sense for their language. These last two are where the Finite Set comes in. It doesn’t make a lot of sense to sort word-types alphabetically; that just doesn’t apply here. More likely, some arbitrary order would be desired: maybe Nouns, then Verbs, then Adjectives, then Adverbs, then Conjunctions, then this, then that. To accomplish this, I keep a set of strings in a list, in the order in which they should be sorted. Each identifier element in an entry stores not a string, but an index into this list of strings. When sorting, it is the indices that are compared, not the strings: the element with the smaller index comes first. There can be multiple sets of strings, so there can be different sets of identifiers (not just word-types, or just conjugations).
  • Recurrences. Last but not least are recurrences, or lists. There are actually two different types of lists utilised in ConLex. You might think of one of these lists as a package with a hammer and a nail in it. The other list might then be a shelf with only hammers on it, or only nails on it, or only packages of a hammer and some nails on it. To differentiate these two types of lists and avoid confusion, I will call the first a “package” (in the code I call it “FSList” for “Fixed Size List”), and I will call the second a “list” (which is exactly what I call it in the code). Even though they are both “lists,” and they both hold more than one item, I assume they will be used for very different things, as they provide very different functionality.
    • Packages always have the same size, and each slot can have a different type (though they certainly don’t have to). The order of the elements is observed as well as maintained. Once created, the length should not (and cannot) be altered, hence the name “Fixed Size List.” The related words section of the dictionary entry above (“sintai /sin’-tay/ Verb”) would likely have used a package, as each related word had exactly three elements: a name (string), a pronunciation (string), and a word-type (index of set).
    • Lists have a dynamic size, and elements of a single type. When I say dynamic size, I mean that every entry’s instance of this list can have any length depending only on the individual needs of each word. The fixed size list/package has the same length in every dictionary entry (the contents of each element will surely be different, but the size will always be the same). The senses in the example dictionary entry would likely have used a list because each word may have a different number of senses.
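
The index-based sorting described under “Identifiers of a Finite Set” above is easier to see in code. Here is a minimal sketch in plain standard C++ (no Qt). The names IdentifierSet, Entry, and sortByWordType are made up for this illustration and are not taken from the actual ConLex code; it also assumes that entries of the same word-type simply keep whatever order they already had.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// All names here are hypothetical; this is not the actual ConLex code.

// An ordered set of identifier strings. A string's position in `names`
// is its sort rank, so reordering the list changes the sort order.
struct IdentifierSet {
    std::vector<std::string> names;

    // Looks up the display text for an index (throws if out of range).
    const std::string& nameOf(std::size_t index) const {
        return names.at(index);
    }
};

// A simplified entry: it stores an index into an IdentifierSet
// rather than the word-type string itself.
struct Entry {
    std::string name;
    std::size_t wordType;
};

// Orders entries by word-type index; the smaller index comes first.
// std::stable_sort keeps entries of the same type in their prior order.
inline void sortByWordType(std::vector<Entry>& entries) {
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry& a, const Entry& b) {
                         return a.wordType < b.wordType;
                     });
}
```

With a set of Noun, Verb, Adjective (in that order), an entry tagged Verb (index 1) sorts after every entry tagged Noun (index 0), no matter how the words are spelled. Changing the desired order only means rearranging the strings in the set.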

Formats are made up of a hierarchy of elements. Each element is one of four different types: a string (arbitrary sense), a number or index (identifiers of a finite set), a fixed-size list or package (recurrences, packages), or a list (recurrences, lists). Lists, of both kinds, can hold other elements of any of the four types, leading to the hierarchical structure. Each of these hierarchies is called a “component.” A word is made up of a list of components and a name (the name is kept separate for safety and simplicity reasons, although, technically, it could be its very own component). I think next post, I will go into even greater detail about components (maybe with some real, live code), and we’ll see how they really work below the surface. If you are curious and actually want to see where the code is now, I have it on a public Mercurial Repository. It is all C++ with Qt. That’s all folks. Goodnight!
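
For anyone who wants something concrete before then, here is a rough sketch of how the four element types described above might nest. It is plain standard C++ rather than Qt, and it is only an illustration under my own assumptions: Kind, Element, Word, the make... helpers, and countElements are hypothetical names, not the real ConLex classes, and the fixed size of a package is treated as a convention rather than enforced.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; an illustration only, not the actual ConLex classes.
enum class Kind { String, Index, Package, List };

// One node in a format hierarchy. Which fields matter depends on `kind`.
struct Element {
    Kind kind = Kind::String;
    std::string text;               // used when kind == Kind::String
    std::size_t index = 0;          // used when kind == Kind::Index
    std::vector<Element> children;  // used by Package and List
};

inline Element makeString(std::string text) {
    Element e;
    e.kind = Kind::String;
    e.text = std::move(text);
    return e;
}

inline Element makeIndex(std::size_t index) {
    Element e;
    e.kind = Kind::Index;
    e.index = index;
    return e;
}

// Package: same number of slots in every entry, slots may differ in type.
// (The fixed size is not enforced in this sketch.)
inline Element makePackage(std::vector<Element> slots) {
    Element e;
    e.kind = Kind::Package;
    e.children = std::move(slots);
    return e;
}

// List: any length per entry, meant to hold elements of a single type.
inline Element makeList(std::vector<Element> items) {
    Element e;
    e.kind = Kind::List;
    e.children = std::move(items);
    return e;
}

// A word is a name plus a list of components (each a hierarchy root).
struct Word {
    std::string name;
    std::vector<Element> components;
};

// Counts every element in a hierarchy, including the root itself.
inline std::size_t countElements(const Element& e) {
    std::size_t total = 1;
    for (const Element& child : e.children) {
        total += countElements(child);
    }
    return total;
}
```

In this sketch, the related word “sintai /sin’-tay/ Verb” from the example entry would be a package of two strings and one index, while the senses of a word would be a list of strings whose length varies from entry to entry.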
