Talk:S-expressions: Difference between revisions

move the syntax sections to the top
(move the syntax sections to the top)
Line 7:
 
the OCaml solution is for example on the border. it doesn't handle numbers and it does not even encode the difference between quoted and unquoted strings.--[[User:EMBee|eMBee]] 01:59, 18 October 2011 (UTC)
 
=== practical solutions preferred ===
this is not supposed to be an academic exercise in writing a perfect s-expression parser, but it should provide a practical demonstration that serves as a starting point for anyone who really do needs to use an s-expression parser in their project.
 
you should ask yourself: would i use this approach in a real project? if a solution that you would really use is different from the task, please post it anyways and explain why.--[[User:EMBee|eMBee]] 04:01, 19 October 2011 (UTC)
 
::I used S-exps in a C++ project about eight years ago. More or less real S-exps with interned symbols, cons cells, etc. No garbage collection was used, but rather smart pointers with reference counting. This was used for inter-process communication over sockets, as well as for storing and retrieving complex configurations easily. The project was a mail transport. Later when an IMAP4 interface had to be developed, one of the other programmers realized that IMAP4 uses S-exps. We easily adapted the library to parse and generate IMAP4 which simplified the handling of that protocol greatly. Its commands and responses could be handled on a higher level as objects, using Lisp-like functions (in C++). None of us had known that IMAP4 uses S-exps, so we were amazed and the programmer coding the IMAP4 interface was pretty happy. (Some conventions had to be accomodated in our reader: IMAP4 has backslashes on symbols like \SEEN (flag indicating read). This is not an escape sequence but a constituent of the symbol name.) I had a package concept implemented in symbols: symbols belonged to packages (namespaces). This was became really useful for the IMAP4 module, because it was able to read forms in a private package to prevent symbols from polluting the regular namespace. Later in the project, we were given a requirement to develop some test tools that could be run on the mobile device to do some automated tests. I developed an evaluator over the S-expressions, creating a multi-threaded Lisp dialect with dynamic closures. I added a GUI where you could type expressions right on the device and hit EVAL. This saved them time because they didn't have to re-flash the whole image just to fix a bug in a script: upload the script and (load ...). Only if they needed a new intrinsic function in the interpreter did they have to rebuild the image and re-flash. Developing this dialect and the GUI window, and a concise, clear reference manual, took about three days (having the Lispy library in place). QA people developed multi-threaded tests which exercised various platform functions on the device (via intrinsic function bindings exposed in the dialect). Nobody in QA came to ask any major questions about the language, even though they had no Lisp experience, since the reference manual was thorough, and explained everything. What is a form, what is a variable, what is a binding, what is an environment, what is scoping, what does it mean to evaluate, what is a function, what is a closure, etc. '''Would I use S-exps as defined by this task?''' Not likely. You really need the correspondence to a proper Lisp-like structure which distinguishes symbols (interned objects) from strings. Interned symbols allow for identification using simple equality <code>if (car(form) == foo_s) ... // if the form begins with the foo symbol ... </code>. Strings and symbols really have to be a distinct data type, otherwise you're wasting cycles on string processing, and you have hygiene issues. If you need a unique symbol, just construct a symbol object and you got it (since it has a unique address). If symbols are strings, then you cannot make a unique symbol, only a unique-ish one based on giving the string contents that are unlikely to coincide with any other symbol. A lot of the power came from the simplicity of the API, not from what the notation looks like. So whether I would use a given S-exps solution in a project would depend heavily on the quality of the API used to manipulate the data structure corresponding to the S-exp. I would not use a solution that required me to manage memory with malloc and free, for instance, or which had some awfully encapsulated data structures with clumsy accessors. In this regard, some particular '''solutions''' to this task may be more useful than others. I would definitely not use the C code for instance. It is just an example. For a real task, you need a well-designed, ideally mature library for this type of thing, with some kind of garbage collection. [[Special:Contributions/192.139.122.42|192.139.122.42]] 23:06, 19 October 2011 (UTC)
 
::''I cannot imagine any case where I would use this for practical purposes. I already have more robust serialization and deserialization available with language primitives, [CONTINUED ...] --[[User:Rdm|Rdm]] 10:39, 19 October 2011 (UTC)''
:::which languages are those? and what kind of serialization is it?--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
::::In the context of rosetta code, I mostly work in J. But C#, for example, also has deep support for serialization (allowing developers to define how their classes get serialized, or whether that is not possible for classes which should not be serialized). And both languages support multiple kinds of serialization (some more human readable -- J can serialize as executable content, C# can serialize as xml, and some not so readable -- J can serialize as array structure, C# has binary serialization). Alternatively, if I was working in javascript, I would use json (which does not have primitive support but which does have lots of available implementations). --[[User:Rdm|Rdm]] 13:51, 19 October 2011 (UTC)
:::::thanks. i am mostly interested in human readable serializations, i loathe programs who save config data in a format that i can not edit by hand. among the alternatives including json and xml i find s-expressions conceptually the easiest to parse and also the easiest to edit manually. (i don't know about json, but xml breaks if it is not nested properly making hand editing risky). whereas i believe s-expressions can not break the parser in a way that would prevent it from reading all the data (even if it is reading it the wrong way), which means there is a higher chance to be able to fix broken data programmatically after the parser is done with it. therefore i am considering to use s-expressions for my applications and one reason for this task was to find out how easy it really is to write a parser for s-expressions.--[[User:EMBee|eMBee]] 14:51, 19 October 2011 (UTC)
::::::Here are error cases to consider for human edited sexpr: too many right parenthesis, too many left parenthesis, both ("(like this,)))(((for example)"), no top-level parenthesis (or content outside the top level parenthesis), unbalanced quotes, misplaced but balanced parenthesis, misplaced or omitted spaces. If none of those bother you, you might consider csv as a widely supported alternative (though that might require thinking about your data differently which may or may not make it unsuitable). --16:54, 19 October 2011 (UTC)
:::::::they are bothering, but i can for example work around unbalanced parenthesis by adding a few extra at the beginning or end, to make the parser happy, and then later analyze where the real error is. try that with xml.
::::::::Decent editors have a Lisp mode for automatically indenting S-expressions, matching parentheses, or selecting subexpressions and moving them around, etc.[[Special:Contributions/192.139.122.42|192.139.122.42]] 22:03, 19 October 2011 (UTC)
:::::::of course this has nothing to do with the task or the definition of s-expressions, it just shows that s-expressions are easier to manipulate programmatically which probably is one of the reasons why lisp is as powerful as it is.--[[User:EMBee|eMBee]] 17:09, 19 October 2011 (UTC)
::''[... CONTINUATION] so this mechanism only makes sense for interoperability. And interoperability only makes sense when it's defined in such a way that both my implementation and that of another language are doing the same thing. And that's just not happening here, because that is not how the task is written. --[[User:Rdm|Rdm]] 10:39, 19 October 2011 (UTC)''
::::That said, javascript does actually have some primitive support for json (eval can be used to parse json from a trusted source). --[[User:Rdm|Rdm]] 23:20, 19 October 2011 (UTC)
:::well the task is not written that way because there is no standard. however in this discussion we may be able to flesh out some kind of standard that we can agree on.--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
: Will I use what's in this task in a real project? Probably not. Firstly, the task doesn't appear to want symbols (the wording rather actively goes against keeping symbols as symbols).
::there is nothing stopping you from implementing symbols. my own solution implements symbols too.
::how would you change the wording to encourage symbols, but not require them?
:S-expression without "S" is pretty uninteresting, it's just another old tree structure carried in a string: imagine using an XML format with only two tags &lt;list> and &lt;string>. Without the "S", it can't represent things like hashtable or metadata, can't cross-reference data, can't annotate structure, can't define actions, can't do anything other than being a tree of strings looking pretty. And secondly, even that is not adequately done: even if you only want the parser to be able to exchange pretty looking trees of strings, the escape required by the task is not going to be enough: no escape for unprintable chars,
::why would that be needed? as far as i can tell the task definition has no restriction on what bytes appear in the data. it should handle unprintable characters just as well as any other. they would only need escaping if you want to output the data on a terminal, but that is not part of the task.
:no well defined whitespace behavior (consider sending it by email for example),
::huh? what kind of whitespace behavior is needed for email? i am not aware of any whitespace rules inside an email body except for >From escaping. if there is any trouble with that, email already has plenty of facilities to encode and decode data. you are free to use those to process the data before parsing it. implementing these things is beyond the scope of this task.
:no encoding spec, some implementations can't even have a double quote inside the strings.
::uhm, some implementations don't support double quote inside strings doesn't mean that you can't make your implementation handle quotes. if i find the time, i will add it to mine too...
:::The "double quote" inside of strings thing is a distraction, in my opinion. You can easily transform the data in the sexpr so if you want to add a module that re-interprets strings with an escape language, you can do that.
::::well unless the escape syntax works in a way that it transforms quotes into something completely different, the parser needs to know about the escape, otherwise how do you escape quotes? distraction or not. \" happens to be a widely accepted way to do escapes. an alternative could be [[wp:percent encoding|%hex]], [[wp:quoted printable|=hex]] or [[wp:Numeric character reference|&#hex;]]. other suggestions?--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
:::::Yes, as you point out there are escape mechanisms which do not require an updated parser. Backslash escapes also have a history of being able to identify characters by number. --[[User:Rdm|Rdm]] 13:54, 19 October 2011 (UTC)
:::There's nothing wrong with having that be an independent module -- or, rather, there's nothing wrong with that being an independent module unless you have reasons for disliking the modules. But you have to draw a line somewhere, otherwise the task becomes "write a lisp interpreter". --[[User:Rdm|Rdm]] 10:39, 19 October 2011 (UTC)
::::i agree. modules are good. or at least having the code written in a way that it is easy to see where it needs to be modified to add functionality.--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
: I'm very much not going to use this as a general data exchange format.
:: i wasn't asking if you want to use the data format, but if you want to use your own implementation which is free to fulfill the requirements that you have. the purpose of this task is not to define the data format. however, a result of this discussion may well lead to a definition. if that is the case then the task will most likely adopt the definition.
: The task is ok for demonstrating simple parsers, but if you want to use it for real work, maybe you need something more rigorous, say, [http://people.csail.mit.edu/rivest/Sexp.txt the almost RFC]? --[[User:Ledrug|Ledrug]] 04:54, 19 October 2011 (UTC)
::i have been searching for a definition for s-expressions, and i could not find anything. this one raises some interesting points and has things i disagree with.--[[User:EMBee|eMBee]] 06:02, 19 October 2011 (UTC)
:::Common Lisp (ANSI standard), ISLisp (ISO) and Scheme are programing languages which are based on S-expressions. There are differences between them but they have a lot in common. Rivest's RFC is awful.[[Special:Contributions/192.139.122.42|192.139.122.42]] 22:07, 19 October 2011 (UTC)
 
== Existing Standards ==
Line 88 ⟶ 50:
 
very likely this definition is incomplete, possibly even has errors. please add and criticize.--[[User:EMBee|eMBee]] 06:02, 19 October 2011 (UTC)
 
=== practical solutions preferred ===
this is not supposed to be an academic exercise in writing a perfect s-expression parser, but it should provide a practical demonstration that serves as a starting point for anyone who really do needs to use an s-expression parser in their project.
 
you should ask yourself: would i use this approach in a real project? if a solution that you would really use is different from the task, please post it anyways and explain why.--[[User:EMBee|eMBee]] 04:01, 19 October 2011 (UTC)
 
::I used S-exps in a C++ project about eight years ago. More or less real S-exps with interned symbols, cons cells, etc. No garbage collection was used, but rather smart pointers with reference counting. This was used for inter-process communication over sockets, as well as for storing and retrieving complex configurations easily. The project was a mail transport. Later when an IMAP4 interface had to be developed, one of the other programmers realized that IMAP4 uses S-exps. We easily adapted the library to parse and generate IMAP4 which simplified the handling of that protocol greatly. Its commands and responses could be handled on a higher level as objects, using Lisp-like functions (in C++). None of us had known that IMAP4 uses S-exps, so we were amazed and the programmer coding the IMAP4 interface was pretty happy. (Some conventions had to be accomodated in our reader: IMAP4 has backslashes on symbols like \SEEN (flag indicating read). This is not an escape sequence but a constituent of the symbol name.) I had a package concept implemented in symbols: symbols belonged to packages (namespaces). This was became really useful for the IMAP4 module, because it was able to read forms in a private package to prevent symbols from polluting the regular namespace. Later in the project, we were given a requirement to develop some test tools that could be run on the mobile device to do some automated tests. I developed an evaluator over the S-expressions, creating a multi-threaded Lisp dialect with dynamic closures. I added a GUI where you could type expressions right on the device and hit EVAL. This saved them time because they didn't have to re-flash the whole image just to fix a bug in a script: upload the script and (load ...). Only if they needed a new intrinsic function in the interpreter did they have to rebuild the image and re-flash. Developing this dialect and the GUI window, and a concise, clear reference manual, took about three days (having the Lispy library in place). QA people developed multi-threaded tests which exercised various platform functions on the device (via intrinsic function bindings exposed in the dialect). Nobody in QA came to ask any major questions about the language, even though they had no Lisp experience, since the reference manual was thorough, and explained everything. What is a form, what is a variable, what is a binding, what is an environment, what is scoping, what does it mean to evaluate, what is a function, what is a closure, etc. '''Would I use S-exps as defined by this task?''' Not likely. You really need the correspondence to a proper Lisp-like structure which distinguishes symbols (interned objects) from strings. Interned symbols allow for identification using simple equality <code>if (car(form) == foo_s) ... // if the form begins with the foo symbol ... </code>. Strings and symbols really have to be a distinct data type, otherwise you're wasting cycles on string processing, and you have hygiene issues. If you need a unique symbol, just construct a symbol object and you got it (since it has a unique address). If symbols are strings, then you cannot make a unique symbol, only a unique-ish one based on giving the string contents that are unlikely to coincide with any other symbol. A lot of the power came from the simplicity of the API, not from what the notation looks like. So whether I would use a given S-exps solution in a project would depend heavily on the quality of the API used to manipulate the data structure corresponding to the S-exp. I would not use a solution that required me to manage memory with malloc and free, for instance, or which had some awfully encapsulated data structures with clumsy accessors. In this regard, some particular '''solutions''' to this task may be more useful than others. I would definitely not use the C code for instance. It is just an example. For a real task, you need a well-designed, ideally mature library for this type of thing, with some kind of garbage collection. [[Special:Contributions/192.139.122.42|192.139.122.42]] 23:06, 19 October 2011 (UTC)
 
::''I cannot imagine any case where I would use this for practical purposes. I already have more robust serialization and deserialization available with language primitives, [CONTINUED ...] --[[User:Rdm|Rdm]] 10:39, 19 October 2011 (UTC)''
:::which languages are those? and what kind of serialization is it?--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
::::In the context of rosetta code, I mostly work in J. But C#, for example, also has deep support for serialization (allowing developers to define how their classes get serialized, or whether that is not possible for classes which should not be serialized). And both languages support multiple kinds of serialization (some more human readable -- J can serialize as executable content, C# can serialize as xml, and some not so readable -- J can serialize as array structure, C# has binary serialization). Alternatively, if I was working in javascript, I would use json (which does not have primitive support but which does have lots of available implementations). --[[User:Rdm|Rdm]] 13:51, 19 October 2011 (UTC)
:::::thanks. i am mostly interested in human readable serializations, i loathe programs who save config data in a format that i can not edit by hand. among the alternatives including json and xml i find s-expressions conceptually the easiest to parse and also the easiest to edit manually. (i don't know about json, but xml breaks if it is not nested properly making hand editing risky). whereas i believe s-expressions can not break the parser in a way that would prevent it from reading all the data (even if it is reading it the wrong way), which means there is a higher chance to be able to fix broken data programmatically after the parser is done with it. therefore i am considering to use s-expressions for my applications and one reason for this task was to find out how easy it really is to write a parser for s-expressions.--[[User:EMBee|eMBee]] 14:51, 19 October 2011 (UTC)
::::::Here are error cases to consider for human edited sexpr: too many right parenthesis, too many left parenthesis, both ("(like this,)))(((for example)"), no top-level parenthesis (or content outside the top level parenthesis), unbalanced quotes, misplaced but balanced parenthesis, misplaced or omitted spaces. If none of those bother you, you might consider csv as a widely supported alternative (though that might require thinking about your data differently which may or may not make it unsuitable). --16:54, 19 October 2011 (UTC)
:::::::they are bothering, but i can for example work around unbalanced parenthesis by adding a few extra at the beginning or end, to make the parser happy, and then later analyze where the real error is. try that with xml.
::::::::Decent editors have a Lisp mode for automatically indenting S-expressions, matching parentheses, or selecting subexpressions and moving them around, etc.[[Special:Contributions/192.139.122.42|192.139.122.42]] 22:03, 19 October 2011 (UTC)
:::::::of course this has nothing to do with the task or the definition of s-expressions, it just shows that s-expressions are easier to manipulate programmatically which probably is one of the reasons why lisp is as powerful as it is.--[[User:EMBee|eMBee]] 17:09, 19 October 2011 (UTC)
::''[... CONTINUATION] so this mechanism only makes sense for interoperability. And interoperability only makes sense when it's defined in such a way that both my implementation and that of another language are doing the same thing. And that's just not happening here, because that is not how the task is written. --[[User:Rdm|Rdm]] 10:39, 19 October 2011 (UTC)''
::::That said, javascript does actually have some primitive support for json (eval can be used to parse json from a trusted source). --[[User:Rdm|Rdm]] 23:20, 19 October 2011 (UTC)
:::well the task is not written that way because there is no standard. however in this discussion we may be able to flesh out some kind of standard that we can agree on.--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
: Will I use what's in this task in a real project? Probably not. Firstly, the task doesn't appear to want symbols (the wording rather actively goes against keeping symbols as symbols).
::there is nothing stopping you from implementing symbols. my own solution implements symbols too.
::how would you change the wording to encourage symbols, but not require them?
:S-expression without "S" is pretty uninteresting, it's just another old tree structure carried in a string: imagine using an XML format with only two tags &lt;list> and &lt;string>. Without the "S", it can't represent things like hashtable or metadata, can't cross-reference data, can't annotate structure, can't define actions, can't do anything other than being a tree of strings looking pretty. And secondly, even that is not adequately done: even if you only want the parser to be able to exchange pretty looking trees of strings, the escape required by the task is not going to be enough: no escape for unprintable chars,
::why would that be needed? as far as i can tell the task definition has no restriction on what bytes appear in the data. it should handle unprintable characters just as well as any other. they would only need escaping if you want to output the data on a terminal, but that is not part of the task.
:no well defined whitespace behavior (consider sending it by email for example),
::huh? what kind of whitespace behavior is needed for email? i am not aware of any whitespace rules inside an email body except for >From escaping. if there is any trouble with that, email already has plenty of facilities to encode and decode data. you are free to use those to process the data before parsing it. implementing these things is beyond the scope of this task.
:no encoding spec, some implementations can't even have a double quote inside the strings.
::uhm, some implementations don't support double quote inside strings doesn't mean that you can't make your implementation handle quotes. if i find the time, i will add it to mine too...
:::The "double quote" inside of strings thing is a distraction, in my opinion. You can easily transform the data in the sexpr so if you want to add a module that re-interprets strings with an escape language, you can do that.
::::well unless the escape syntax works in a way that it transforms quotes into something completely different, the parser needs to know about the escape, otherwise how do you escape quotes? distraction or not. \" happens to be a widely accepted way to do escapes. an alternative could be [[wp:percent encoding|%hex]], [[wp:quoted printable|=hex]] or [[wp:Numeric character reference|&#hex;]]. other suggestions?--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
:::::Yes, as you point out there are escape mechanisms which do not require an updated parser. Backslash escapes also have a history of being able to identify characters by number. --[[User:Rdm|Rdm]] 13:54, 19 October 2011 (UTC)
:::There's nothing wrong with having that be an independent module -- or, rather, there's nothing wrong with that being an independent module unless you have reasons for disliking the modules. But you have to draw a line somewhere, otherwise the task becomes "write a lisp interpreter". --[[User:Rdm|Rdm]] 10:39, 19 October 2011 (UTC)
::::i agree. modules are good. or at least having the code written in a way that it is easy to see where it needs to be modified to add functionality.--[[User:EMBee|eMBee]] 12:04, 19 October 2011 (UTC)
: I'm very much not going to use this as a general data exchange format.
:: i wasn't asking if you want to use the data format, but if you want to use your own implementation which is free to fulfill the requirements that you have. the purpose of this task is not to define the data format. however, a result of this discussion may well lead to a definition. if that is the case then the task will most likely adopt the definition.
: The task is ok for demonstrating simple parsers, but if you want to use it for real work, maybe you need something more rigorous, say, [http://people.csail.mit.edu/rivest/Sexp.txt the almost RFC]? --[[User:Ledrug|Ledrug]] 04:54, 19 October 2011 (UTC)
::i have been searching for a definition for s-expressions, and i could not find anything. this one raises some interesting points and has things i disagree with.--[[User:EMBee|eMBee]] 06:02, 19 October 2011 (UTC)
:::Common Lisp (ANSI standard), ISLisp (ISO) and Scheme are programing languages which are based on S-expressions. There are differences between them but they have a lot in common. Rivest's RFC is awful.[[Special:Contributions/192.139.122.42|192.139.122.42]] 22:07, 19 October 2011 (UTC)
 
 
== Symbols and strings ==
Anonymous user