Notes on syntax


For the last few years, the expressiveness of programming languages has been on my mind. There are many things that come into consideration for expressiveness, no matter what definition you actually end up using. However, what I’ve been thinking about lately is syntax. There’s a lot of talk about syntax and many opinions. What made me start thinking more about it lately was a few blog posts I read that annoyed me a bit. So I thought it was time to put out some of my thoughts on syntax here.

I guess the first question to answer is whether syntax matters for a programming language. The traditional computer science view is largely that syntax doesn’t matter. And in a reductionist, system-level view of the world this is understandable. However, there is also the opposite view, which comes strongly into play especially when talking about learning a new language, but also when reading existing code. At that point many people are of the opinion that syntax is extremely important.

The way I approach the question is based on programming language design. What can I do when designing a language to make it more expressive for as many users as possible? To me, syntax plays a big part in this. I am not saying that a language should be designed with a focus on syntax, or even with syntax first. But the language syntax is the user interface for a programmer, and as such there are many aspects of the syntax that should help a programmer. Help them with what? Well, understanding, for one. Reading. Communicating. I suspect that writing is not something we’re very much interested in optimizing for in syntax, but that’s OK. Typing fewer characters doesn’t actually optimize for writing either – the intuition behind that statement is quite simple: imagine you had to write a book, but instead of writing it in English, you wrote the gzipped version of the book directly. You would definitely have to type much less – but would that in any way help you write the book? No, it would probably make it harder. So typing is definitely not what I want to optimize. However, I would like to make it easy for a programmer to express an idea as concisely as they can. To me, this is about mentioning all the things that are relevant, without mentioning irrelevant things. Incidentally, a syntax with that property is probably also going to be easier to communicate with, and easier to read, so I don’t think focusing on writing at all is the right thing to do.

Fundamentally, programming is about building abstractions. We are putting together extremely intricate mind castles and then trying to express them in such a way that our computers will realize them. Concepts, abstractions – and manipulating and communicating them – are the pieces underlying programming languages, and that is really what all languages must do in some way. A syntax that makes it easier to think about hard abstractions is a syntax that will make it easier to write good and robust programs. If we talk about the Sapir-Whorf hypothesis and linguistic relativity, I suspect that programmers have an easier time reasoning about a problem if their language choice makes those abstractions clearer. And syntax is one way of making that process easier. Simply put, the things we manipulate with programming languages are hard to think about, and good syntax can improve that.

Seeing as we are talking about reading – who is this person reading? It makes a huge difference whether we’re trying to design something that should be easy to read for a novice, or trying to design a syntax that makes it easier for an expert to understand what’s going on. Optimally we would like to have both, I guess, but that doesn’t seem very realistic. The things that make syntax useful to an expert are different from what makes it easy to read for a novice.

At this point I need to make a request – Rich Hickey gave a talk at Strange Loop a few months ago. It’s called Simple Made Easy and you can watch it here: http://www.infoq.com/presentations/Simple-Made-Easy – you should watch it now.

Simply put, if you had never learnt any German, should you really expect to be able to read it? Is it such a huge problem that someone who has never studied Prolog will have no idea what’s going on until they study it a bit? Doesn’t it make sense that people who understand German can express all the things they need to say in that language? Even worse, when it comes to programming languages, people expect them to be readable to people who have never programmed before! Why in the world would that ever be a useful goal? It would be like saying German is not readable (and is thus a bad language) because dolphins can’t read it.

A tangential aspect of the simple versus easy of programming languages is how our current syntactic choices echo what’s been done earlier. It’s quite uncommon for a syntax design to become wildly successful while looking completely different from previous languages. This seems to have more to do with how easy a language is to learn than with how good the syntax actually is by itself. As such, it’s suspect. Historical accidents seem to contribute much more to syntax design than I am comfortable with.

Summarizing: when we talk about reading programming languages, it doesn’t make much sense to optimize for someone who doesn’t know the language. In fact, we need to take as a given that a person knows a programming language. Then we can start talking about what aspects reduce complexity and improve communication for a programmer.

When we talk about reading languages, one thing that sometimes comes up is the need for redundancy. Specifically, one of the blogs that inspired these thoughts basically claimed that the redundancy in the design of Java was a good thing, because it improved readability. Now, I find this quite interesting – I have never seen any research that explains why this would be the case. In fact, the only argument I’ve heard in support of the idea is that natural languages have highly redundant elements, and thus programming languages should too. First, that’s not actually true for all natural languages – but we must also consider _why_ natural languages have so much redundancy built in. Natural languages are not designed (with a few exceptions) – they grow to have the features they have because those features are useful. But reading, writing, speaking and listening put such different evolutionary pressures on natural languages that they should be treated differently. The reason we need redundancy is simply that it’s very hard to speak and listen without it. For all intents and purposes, what is considered good and idiomatic in spoken language is very different from written language. I just don’t buy this argument for redundancy. Redundancy in programming language syntax might turn out to be good, but so far I remain unconvinced.

It is sometimes educational to look at mathematical notation. However, mathematical notation is just that – notation. I’m not convinced we can have one single notation for programming languages, and I don’t think it’s something to aspire to. The useful lesson from math notation is how terse it is. Even so, you still need to spend a long time digesting what it means. That’s because the ideas are deep. The thinking that went into them is deep. If we ever come to a point where programming languages can embody ideas that deep in notation that terse, I suspect we will have figured out how to design programming language syntax far better than what we have right now.

I think this covers most of the things I wanted to cover. At some point I would like to talk about why I think Smalltalk, Ruby, Lisp and some others have quite good syntax, and how that syntax is intimately related to why those languages are powerful and expressive. Some other random thoughts I wanted to cover were the evolvability of language syntax, whether a syntax should be designed to be easy to parse, and possibly also how much English specifically has impacted the design of programming languages. But those are thoughts for another time. Suffice to say, syntax matters.



A new parser for Ioke


Last week I finally bit the bullet and rewrote the Ioke parser. I’m pretty happy with the end result, but it does involve moving away from Antlr as the parser generator. In fact, the new parser is handwritten – and as such goes against my general opinion that everything possible should be generated. I would like to quickly take a look at the reasons for doing this, and also at what the new parser will give Ioke.

For reference, the way the parser used to work was that the Antlr-generated lexer and parser gave the Ioke runtime an Antlr tree structure. This tree structure was then walked and transformed into chained Messages, which is the AST that Ioke uses internally. Several other things were also done at this stage, including separating message chains on comma borders. Most significantly, the processing to put together interpolated strings and regular expressions happened at this stage. Sadly, the code to handle all that was complex, ugly, slow and frail. After this stage, operator shuffling happened. That part is still the same.

There were several problems I wanted to solve, but the main one was the ugliness of the algorithm. It wasn’t clear from the parser how an interpolated expression mapped into the AST, and the generated code added several complications that frankly weren’t necessary.

Ioke is a language with an extremely simple base syntax. It is only slightly more complicated than the typical Lisp parser, and almost no parser-level productions are needed. So the new parser does away with the lexer/parser distinction and does everything in one pass. There is no need for lookahead at the token level, so this turns out to be a clear win. The code is actually much simpler now, and the Message AST is created inline in the new parser. When it comes to interpolation, instead of the semantic predicates and global stacks I had to use in the Antlr parser, I just do the obvious recursive interpolation. The code is simple to understand and quite efficient too.

At the end of the day, I did expect to see some performance improvements too. They turned out to be substantial. Parsing is about 2.5 times faster, and startup speed has improved by about 30%. The distribution size will be substantially smaller since I don’t need to ship the Antlr runtime libraries. And building the project is also much faster.

But the real gain is actually in maintainability of the code. It will be much easier for me to extend the parser now. I can do nice things to make the syntax more open ended and more powerful in ways that would be very inconvenient in Antlr. The error messages are much better since I have control over all the error states. In fact, there are only 13 distinct error messages in the new parser, and they are all very clear on what has gone wrong – I never did the work in the old parser to support that, but I get that almost for free in the new one.

Another thing I’ve been considering is adding reader macros to Ioke – and that would also have been quite painful with the Antlr parser generator. So all in all I’m very happy with the new parser, and I think it will definitely make things easier for the project going forward.

This blog post is in no way saying that Antlr is bad. I like Antlr a lot – it’s a great tool. But it just wasn’t the right tool for Ioke’s syntax.



Message chains and quoting in Ioke


One of the more advanced features in Ioke is the ability to work with first class messages. At the end of the day, you are manipulating the AST directly by doing this, which means that you can do pretty much anything you want. The manipulation of message chains is the main way of working with macros in Ioke, so understanding what you can do with them is pretty important.

The documentation surrounding these pieces is spread all over the place, so I thought I’d take a look at messages and the way you construct and modify them.

Messages

The first step in working with message chains is to actually understand the Message. Message is the core data structure in Ioke, and it has some native properties that define the full structure of the Ioke AST. There are four pieces of the structure that are central to messages, and a few more that are less interesting. So let us look at the core structure. It is actually extremely simple. These are the things that make up a Message:

  • Name – all messages have a name. From the perspective of Ioke, this is a symbol. It will never be nil, but it can be empty.
  • Arguments – a list of messages, zero or more.
  • Prev – a pointer to the previous message in the chain, or nil if there is no previous.
  • Next – a pointer to the next message in the chain, or nil if there is no next message.

A message can also wrap a value. In that case the message will always return that value, and no real evaluation will happen. This can be used to insert any kind of value into a message chain that will later be evaluated. This is called wrapping.

A message chain is just a collection of messages, linked through their Prev and Next pointers.

The arguments to a message are represented as a list of messages. This makes sense if you think about it for a few seconds.

OK, now you know what the Ioke AST looks like. It isn’t harder than that. Now, if you actually want to start working with messages, there are several messages that Message can receive that allow you to do so. The simpler ones (which I won’t explain further) are “name”, “name=”, “arguments”, “arguments=”, “next”, “next=”, “prev” and “prev=”.

There are a few more interesting ones that merit some explanation. First, “last”. This message will just return the last message in the message chain. It is the equivalent of following the next pointer until you come to the end.

It’s important to keep in mind that Message is a mutable structure, which means you need to be careful not to make changes that have unexpected effects. For example, if someone sends in a message, you generally shouldn’t modify it without copying it first. Now, if you only want to copy a message without recursively copying its next pointer, you can just mimic it. Otherwise you use the method “deepCopy”, which will copy both the next pointer and the arguments recursively.

Now, if you want to add new arguments to a message, you can use “appendArgument”. This method is aliased as “<<”. It also returns the receiver, so you can add several arguments by chaining calls to appendArgument/<<. If you want to add a message at the beginning of the argument list, you instead use “>>”.

One of the more annoying things is that once you set the next pointer, you generally need to make sure to set the previous pointer of the next value too, unless you are setting it to nil. The same thing is true when setting the prev pointer. So, in the cases where you want to link two messages, you shouldn’t set these directly, but instead use the “->” method. This allows you to link two Ioke messages. For example, “msg1 -> msg2” will set the next pointer on msg1 and the prev pointer on msg2. If you do “msg1 -> nil” it will just set the next pointer to nil.
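To make the linking concrete, here is a small sketch using Message fromText (described below); the results in the comments are what I would expect from the rules above, not verified output:

msg1 = Message fromText("foo")
msg2 = Message fromText("bar")
msg1 -> msg2   ; sets msg1 next to msg2, and msg2 prev to msg1
msg1 next name ; => :bar
msg2 prev name ; => :foo
msg1 -> nil    ; sets msg1 next to nil, unlinking the chain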

And that’s basically it. If you need to actually evaluate the messages, you can use either “sendTo” or “evaluateOn”. The main difference is that sendTo will not evaluate the message chain – it will only evaluate the message that is the receiver of the call. The evaluateOn method will follow the message chain and evaluate it fully, based on the context arguments given to it.
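As a sketch of the difference, assuming Ground as the evaluation context (as in the evaluateOn calls elsewhere in these posts):

chain = Message fromText("asText length")
chain evaluateOn(Ground, 42) ; follows the whole chain: 42 asText length => 2

With sendTo, only the first message (asText) would be evaluated on the receiver; the rest of the chain would be ignored.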

Oh, one last thing. To create new messages from scratch, there are a few different ways. First of all, you can wrap a value like this: “Message wrap(42)”. That will return a new message that wraps the number 42.

You can create a message chain from a piece of text by doing “Message fromText("one two three")”. This will return a message chain with three messages, linked together.
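Combining this with the accessors from earlier, a small sketch of inspecting such a chain (the expected results are in the comments; I haven’t verified the exact output):

msg = Message fromText("one two three")
msg name             ; => :one
msg next name        ; => :two
msg last name        ; => :three
msg next prev == msg ; => true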

Finally, you can create a new message chain by using the from-method. You use it like this: “Message from(one two(three) four)”. What is returned is the message chain that is the argument. If you think about it for a few seconds, you can probably guess how to implement this using an Ioke macro.
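In case you want the answer spelled out: a minimal sketch could look like this. The name myFrom is made up for the example, and it relies on the fact that a macro receives its arguments unevaluated:

myFrom = macro(call arguments[0])

myFrom(one two(three) four) ; returns the argument message chain, unevaluated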

Quoting

Now that we understand messages and message chains, let us take a look at how to create new chains in a flexible way.

First of all, all of the above methods are very useful and nice, but they tend to be a bit verbose. Coming from a Lisp background, I felt inclined to put the quoting characters to good use here. So, first of all, the single quote (') does the same thing as “Message from”. The back quote (`) does the same thing as “Message wrap”. So, to wrap the number 42, you can just do `42. In this case you don’t need parentheses, since the back quote is an operator. To create a new message chain, use the single quote: '(foo bar(x) baz).
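Side by side, the shorthands and their verbose equivalents:

x = '(foo bar(q) baz) ; the same as Message from(foo bar(q) baz)
y = `42               ; the same as Message wrap(42)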

We almost have everything we need, except that we need some convenient ways of actually putting things into these message chains without having to put them together by hand.

Say for example we have a variable “blah” that contains an unknown message. We want to create a message “one” that is followed by the message in the variable “blah”. And then finally we want to add two messages “bax” and “baz” after it. We could do it like this: x = 'one. x -> blah. x last -> '(bax baz). All in all, that is not too bad, but we can do better. This is done using the splice-quote operator, which is just two single quotes after each other. Using that, it would look like this: ''(one `blah bax baz). In this case, the back quote inside of the splice-quote call will be evaluated in the current context, and the result will be spliced into the message chain being created. Now, only use the back quote if you are sure you can modify the spliced message. If you want to copy blah before inserting it, use the single quote instead of the back quote: ''(one 'blah bax baz)
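Laid out as code, the two versions from this paragraph look like this (assuming blah holds a message chain):

; by hand
x = 'one
x -> blah
x last -> '(bax baz)

; with splice-quote; `blah splices the message itself,
; while 'blah would splice a copy of it
x = ''(one `blah bax baz)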

All in all, this is really all you need, and you can take a look at the core libraries and see how they are used. A typical example is the comprehensions library, and also the destructuring macros. In general, creating these message chains on the fly is the most useful inside of syntax macros.

I am planning to add a new feature to Ioke that allows you to do tree rewriting for manipulating chains in different ways. This will be a feature built on top of the primitives described here, and those primitives will continue to be the main way of working with message chains for a long time.



Hijacking Ioke syntax


One of the nicer things about Ioke is that the syntax is highly flexible. This means that features that are generally considered part of the parsing step are not so in Ioke. There are still limitations on what you can take over, of course.

Another reason you can do so much with Ioke’s syntax is that all operators are just method calls. This means you can override them in subclasses, and thus these syntax changes can be totally localized.

A third technique is to change things at a global level, but only temporarily. This is possible since any cell in Ioke can act as a dynamic cell (or special) by using the “let” method. This means you can make some very interesting changes to the syntax.

Understanding these three techniques makes it possible to very easily create internal DSLs in Ioke that feel like they are external. A fourth way of achieving this is to massage message chains after the fact, to transform them into a different structure. You can do this directly, by working with the first class message structures – or, if your needs are not too extravagant, you can interject specific behavior into the operator tables.

This post will give a quick introduction to these techniques, but it can’t really cover them all in full.

Flexible and polymorphic syntax elements

The things that in Ioke look like syntax elements but are mostly just message sends allow you to change the behavior of some things locally. Some examples of where this is possible are the creation of literals (such as numbers, texts and regexps), regular cell assignment (with =), and the creation of literal lists and dicts.

OK, to make this concrete, let me show a few examples. To start with, take the literals. The way literals work is that they are actually translated into message sends. So when the parser sees a number, it will generate a message send to “internal:createNumber” and insert that into the message chain. This means you can override and change this behavior, which is something I do in my parser combinators example. As an extremely small example, take this code:

Parser(
  "foo" | "bar" | 42
)

This example creates a new parser that will parse either the literal string foo, the literal string bar, or the number 42. But how can we implement this? The method “|” is not even implemented for Texts in Ioke, and Texts definitely don’t return anything useful for the parser. We don’t want to override it to return the right thing either – it wouldn’t be a general solution – and what if someone wanted a parser that only matched one literal string? It’s clear that we need to hook into the handling of literals. (There is an alternative; we will talk about that in the final piece.)

At this stage it might help to take a look at the canonical AST structure of the above code. It would look like this: internal:createText("foo") |(internal:createText("bar")) |(internal:createNumber("42")). With this structure, it should be more obvious how we can implement everything. The code for the above would look like this:

BaseParser = Origin mimic do(
  | = method(other,
    OrParser with(context: context, first: self, second: other)
  )
)

TextParser = BaseParser mimic
NumberParser = BaseParser mimic

ParserContext = Origin mimic do(
  internal:createText   = method(raw,
    TextParser with(context: self, text:   super(raw)))
  internal:createNumber = method(raw,
    NumberParser with(context: self, number: super(raw)))
)

Parser = dmacro(
  [code]
  context = ParserContext mimic
  code evaluateOn(context, context)
  context
)

The interesting pieces are in ParserContext. Inside it we override internal:createText and internal:createNumber to return parsers for their corresponding type. Notice how we call out to “super” to get the actual literal result. We then evaluate the argument to the Parser-method in the context of a newly created parser context. The only thing missing in the above code is the OrParser, and the actual matching pieces.

The other ways of hijacking syntax generally depend on executing code in a specific context, like the above.

I mentioned that “=”, [] and {} are overridable. Say that you, for example, like the syntax of blocks in Smalltalk and want to use it within an internal DSL. That is actually extremely easy:

Smalltalk = Origin mimic do(
  [] = cell(:fn)
)

Smalltalk do(
  x = ["hello world" println]
  x call
  x call
)

Here we just assign [] to be the same as the method “fn” within the Smalltalk object. The same thing can be done for other operators, if wanted.

Using let to override syntax

As mentioned above, you can use the let method to override syntax (or any method, really) for a specific bounded time. Let’s say we want to do the above operation (Smalltalk blocks) for the duration of some execution. We can do it like this:

let(DefaultBehavior Literals cell("[]"), cell(:fn),
  x = ["hello world" println]
  x call
  x call
)

This will override the cell specified in the first argument with the value in the second argument – and then restore it at the end of the let-method. This is really not recommended for something like the [] method – as it will cause all kinds of problems in the internal implementations of methods. But you can definitely do it.

Transforming message chains

There are basically two techniques here. The first one is simply to add to, remove from or update the operator table, so that you can add operators that weren’t there before, or change the way they behave. This is not restricted to things with weird characters in them – indeed, in Ioke anything that appears in the operator tables counts as an operator. A specific example of this is “return”. It can act as a unary operator, meaning it is possible to give return an argument without parentheses surrounding that argument.

To find out more information you can take a look at the documentation for Message OperatorTable here: http://ioke.org/dok/kinds/Message/OperatorTable.html.
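To sketch the first technique: I am assuming here that the operator table exposes a dict of operator names mapped to precedence levels, so check the documentation linked above for the actual cell names before relying on this:

; ASSUMPTION: "operators" is a dict from operator name to precedence
; level that can be updated in place
Message OperatorTable operators[:"=>"] = 10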

The other way of transforming message chains is to actually take them apart and put them together again. I am planning a blog post dedicated to how to work with this, but I’ll take a quick peek at it now. Let’s take the earlier Parser example and see an alternative way of creating the text and number parsers without overriding literal creation.

reformatCodeForParsing = method(code,
  ourCode = 'nil
  head = ourCode
  code each(msg,
    case(msg name,
      :internal:createText, ourCode -> ''createTextParser('msg),
      :internal:createNumber, ourCode -> ''createNumberParser('msg),
      :"|", ourCode -> ''withOrParser('(reformatCodeForParsing(msg arguments[0]))),
      else, ourCode -> msg mimic
    )
    ourCode = ourCode last
  )
  head
)

Parser = dmacro(
  [code]
  reformatCodeForParsing(code) evaluateOn(Ground, Ground)
)

This code is actually a bit annoying, but what it does is quite useful: it takes the code above ("foo" | "bar" | 42) and restructures it into (createTextParser("foo") withOrParser(createTextParser("bar")) withOrParser(createNumberParser(42))).

The only thing that is a bit inscrutable is the way new message chains are put together using the different kinds of quoting. I’m going to go into this in more depth in the next blog post. I am also working on a tree rewriting approach for making these kinds of transformations much more idiomatic and readable.

So, this post has been a small introduction to several things you can do with Ioke to tweak its syntax. There is much more behind all these features, of course, and they all come from the fact that Ioke tries to unify and simplify all concepts as much as possible.



Macro types in Ioke – or: what is a dmacro?


With the release of Ioke 0, things regarding types of code were pretty simple. At that point Ioke had DefaultMethod, LexicalBlock and DefaultMacro. (That’s not counting the raw message chains, of course.) But since then I’ve seen fit to add several new types of macros to Ioke. All of these have their reasons for existing, and I thought I would try to explain those reasons a bit here.

But first I need to explain what DefaultMacro is. Generally speaking, when you send the message “macro” in Ioke, you will get back an instance of DefaultMacro. A DefaultMacro is executed at runtime, just like regular methods, and in the same namespace. So a macro has a receiver, just as a method does. In fact, the main difference between macros and methods is that you can’t define arguments for a macro. And when a message activates a macro, the arguments sent with that message will not be evaluated. Instead, the macro gets access to a cell called “call”. This cell is a mimic of the kind Call.

What can you do with a Call then? Well, you can get access to the unevaluated arguments. The easiest way to do this is by doing “call arguments”. That returns a list of messages. A Call also contains the message sent to activate it. This can be accessed with “call message”. Call contains a reference to the ground in which the message was sent. This is accessed with “call ground”, and is necessary to be able to evaluate arguments correctly. Finally, there are some convenience methods that allow the macro to evaluate arguments. Doing “call argAt(2)” will evaluate the third argument and return it. This is a short form for the equivalent “call arguments[2] evaluateOn(call ground, call ground)”.
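To illustrate, a minimal sketch of a raw macro using these pieces (the name firstOf is made up for the example):

firstOf = macro(
  call argAt(0) ; evaluates the first argument and returns the result
)

firstOf(40 + 2, this is never evaluated) ; => 42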

This is all well and good. Macros allow you to do most things you would want to do, really. But they are quite rough to work with in their raw form. There is also plumbing that is a bit inconvenient. One common thing you might want to do is to transform the argument messages without evaluating them, return those messages, and have them inserted in place of the current macro. You can do this directly, but as mentioned above, it is a bit inconvenient. So I added DefaultSyntax. You define a DefaultSyntax with a message called “syntax”. The first time a syntax is activated, it will run, replace itself with its own result, and then execute that result. The next time that piece of code is reached, the syntax will not execute; instead, the result of the first invocation will already be in place. This is the feature that lies behind for comprehensions. To make this a bit more concrete, let’s create a very simplified version of one. This version is fixed to take three arguments: an argument name, an enumerable to iterate over, and an expression for how to map the output value. Basically, a different way of calling “map”. A case like this is good, because we have all the information necessary to transform it, instead of evaluating it directly.

An example use case could look like this:

myfor(x, 1..5, x*2) ; returns [2,4,6,8,10]

Here myfor will return the code to double the elements in the range, and then execute that code.

The syntax definition to make this possible looks like this:

myfor = syntax(
  "takes a name, an enumerable, and a transforming expression
and returns the result of transforming each entry in the
expression, with the current value of the enumerable
bound to the name given as the first argument",

  argName = call arguments[0]
  enumerable = call arguments[1]
  argCode = call arguments[2]
  ''(`enumerable map(`argName, `argCode))
)

As you can see, I’ve provided a documentation text. This is available at runtime.

Syntactic macros also have access to “call”, just like regular macros. Here we use it to assign three variables. These variables get the messages themselves, not the results of evaluating them. Finally, a metaquote is used. A metaquote takes its content and returns the message chain inside of it, except that wherever a ` is encountered, the message at that point will be evaluated and the result spliced into the message chain. The result is to transform “myfor(x, 1..5, x*2)” into “1..5 map(x, x*2)”.

As might be visible, the handling of arguments is kind of impractical here. There are really two problems with it. First, it’s quite verbose. Second, it doesn’t check for too many or too few arguments. Adding those checks by hand would complicate the code at the expense of readability. And regular macros have exactly the same problem. That’s why I implemented the d-family of destructuring macros. The current versions are dmacro, dsyntax, dlecro and dlecrox. They all work the same, except that they generate macros, syntaxes, lecros or lecroxes, depending on which version is used.

Let’s take the previous example and show how it would look with dsyntax:

myfor = dsyntax(
  "takes a name, an enumerable, and a transforming expression
and returns the result of transforming each entry in the
expression, with the current value of the enumerable
bound to the name given as the first argument",

  [argName, enumerable, argCode]

  ''(`enumerable map(`argName, `argCode))
)

The only difference here is that we use dsyntax instead of syntax. The usage of “call arguments[n]” is gone, replaced with a list of names. Under the covers, dsyntax will make sure the right number of arguments is sent, and provide an error message otherwise. After it has ensured the right number of arguments, it will also assign the names in the list to their corresponding arguments. This process is highly flexible, and you can choose to evaluate some messages and not others. You can also collect several messages into a list of messages.

But the really nice thing about dsyntax is that it allows several alternative argument lists. Say we wanted to provide the option of giving either 3 or 4 arguments, where the expansion looks the same for 3 arguments, but if 4 arguments are provided, the third one will be interpreted as a condition. In other words, to be able to do this:

myfor(x, 1..5, x*2) ; returns [2,4,6,8,10]
myfor(x, 1..5, x<4, x*2) ; returns [2,4,6]

Here a condition is used in the comprehension to filter out some elements. Just as with the original, this code transforms into an obvious application of “filter” followed by “map”. The updated version of the syntax looks like this:

myfor = dsyntax(
  "takes a name, an enumerable, and a transforming expression
and returns the result of transforming each entry in the
expression, with the current value of the enumerable
bound to the name given as the first argument",

  [argName, enumerable, argCode]

  ''(`enumerable map(`argName, `argCode)),

  [argName, enumerable, condition, argCode]

  ''(`enumerable filter(`argName, `condition) map(`argName, `argCode))
)

The only thing added is a new destructuring pattern that matches the new case and in that situation returns code that includes a call to filter.

The destructuring macros have more features than these, but this is the skinny on why they are useful. In fact, I’ve used a combination of syntax and dmacro to remove a lot of repetition from the Enumerable core methods, for example. Things like this make it possible to provide abstractions where you only need to specify what’s necessary, and nothing more.

And remember, the destructuring I’ve shown with dsyntax can be done in exactly the same way for macros and lecros. Regular methods don’t need it as much, since the rules for DefaultMethod arguments are so flexible anyway. But for macros this has really made a large difference.



Ioke syntax


Or: How using white space for application changes the syntax of a language.

I have spent most of the weekend working with different syntax elements of Ioke. Several of them are actually based on one simple decision I made quite early, and I thought it would be interesting to take a look at some of the syntax elements I’ve worked on, from the angle of how they are based on that one decision.

What is this decision, then? In the manner of Smalltalk, Self and Io, I decided that periods are not the way to apply methods. Instead, space makes sense for this. So where in Java you would write “foo().bar(1).quux(2,3)”, in Ioke this would be written “foo bar(1) quux(2, 3)”. Everything is an expression, and sending a message to something is done by putting the message adjacent to the thing receiving it, separated by whitespace. This turns out to have some consequences I really didn’t expect, and several parts of the syntax have changed a lot because of this decision. I’ll take a look at the things that changed most recently because of it.

Terminators

Most languages without explicit expression nesting (as in Lisp) need some way to decide when a chain of message passing should stop. Most scripting languages today try to use newlines, and then use semicolons when newlines don’t quite work. That’s what I started out doing with Ioke too (since Io does it). But once I started thinking about it, I realized that Smalltalk got this right too. Since I don’t use dots for message application, I’m free to use them for termination. You still don’t need to terminate things that are obviously terminated by newlines, but when you need a terminator, the dot reads very well. I’ve always disliked the intrusiveness of semicolons – they seem to take up too much visual space for me. Dots feel like the right size, and there is also a more pleasing symmetry with commas.
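A tiny sketch of how that reads in practice (println as used in the other examples):

x = 10. y = 20
(x + y) println ; prints 30; the dot terminated the first expression above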

Comments

Once you don’t use semicolons for termination, you can use them for other things. I am quite fond of the Lisp tradition of using semicolons for comments, so I decided not to use hashes for that anymore. One of the ways Lisp systems use semicolons for comments is that different numbers of them introduce different kinds of documentation. The Common Lisp standard is to use four semicolons for headlines, three for left-justified comments, two for a comment line that should be indented with the program text, and one for comments on the same line as program text. These conventions work because semicolons don’t take up much visual space when stacked. A hash would never work for this.
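Carried over to Ioke, those conventions might look like this (a sketch of style only):

;;;; Counting examples

;;; A left-justified comment
;; A comment indented with the program text
x = 42 ; a comment on the same line as the code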

The obvious question from anyone with Unix experience will be how I handle shebangs if hash isn’t a comment character anymore. The short answer is that I will provide a general reader macro syntax based on hash. Since the shebang always starts with “#!”, that is a perfect application for a reader macro. That also opens up the possibility of other interesting reader macros, but I’ll get to that question later.

Operator precedence

This one was totally unexpected. I had planned to add regular operator precedence parsing, and it ended up being quite painful. I should probably have guessed the problem, but I didn’t – two grammar files later, I’m now hopefully a bit wiser. The problem ended up being whitespace. Since I use whitespace to separate application, but whitespace also matters for judging operator precedence, the parsers I got working had exponential amounts of backtracking. Two lines of regular code without operators still backtracked enough to take a minute or two to parse. Ouch. So what’s the solution? Two passes of parsing. Or not exactly, but almost. I’m currently implementing something like Io’s operator shuffling, which is a general solution that rearranges operators into a canonical form based on precedence rules. What’s fun about it is that the rules can be changed dynamically. If you want Smalltalk-style left-to-right precedence, that should be possible by just setting the precedence to 1 for all operators. You can also turn off operator shuffling completely, which means you can’t use infix operators at all.

I’m also planning a way to scope these things, so you can actually change quite a lot of the syntax without switching the parser.

At some point I’m planning to explore how it would work to use an Antlr tree parser to do the shuffling. My intuition is that it would work well, but I’ll have to find the time to do it.

Syntactic flexibility

All is not perfect, but the current scheme seems to work well. I’ve been able to get a real amount of flexibility into the syntax, with loads of operators free for anyone to use and subclass. The result will be the possibility of creating internal DSLs that Ruby could only dream of. Some things get harder too, though. Regular expression syntax, for example. If you can write a statement like “[10,12,14] map(/2 * 2/a)”, it’s kind of obvious that there is no easy way to know whether the thing inside the mapping call is a regular expression or an expression fragment. In Ioke the decision is simple: the above is an expression fragment. I’ve still decided to make it really easy to work with regular expression syntax. Interestingly, this was one of the reasons I wanted reader macros, and it turns out that using #/ will work well. So a regular expression looks just like in a Perl-like language, except that you add a hash before the first slash: #/foo/ =~ "str". It seems that hash will end up being my syntax sin bin for those cases where I want syntax without touching the parser too much.

It’s funny to see how many things in classic syntax change when you change how message passing works. I like Ioke more and more for each of these things I find, and it currently looks very pleasant to work with. Dots are such an improvement for one-liners.