14

Strings and Chars

This chapter discusses the verbs which perform utility operations on strings and chars. For information on the nature of strings and chars, the rules of string and char literal construction and display, and several key verbs and operators that work on strings, see Chapter 10, Datatypes, and Chapter 11, Arrays. On the importance of strings in coercion, see Chapter 10 (under "Coercion") and Chapter 8, Addresses.

Strings are surely the most important datatype in UserTalk, both because they are the most universally coercible and because of Frontier's aptitude for batch processing and construction of text files. Considering this, the repertoire of string verbs seems none too excessive. Many utility functions need to be built by the user - several are suggested or provided in the discussion that follows - and it has taken the addition of supplements such as the regex suite and the conversion of some script-based functions to a kernel implementation to render UserTalk's string manipulation truly suitable for Web- and network-related functionalities.

All the verbs described here create a new entity as their result. Thus, when we describe the purpose of a verb as "to remove a stretch of a string," we really mean, "to return a new string equivalent to the original string with a certain stretch removed." Position indices are 1-based. A char can be fed to a verb that expects a string and be implicitly coerced; the reverse is true only if the string is one character long. String matching is case-sensitive: see Chapter 45, Punctuation.

Substrings

To insert one string into another:

string.insert (substring, originalString, startIndex)

To remove a stretch of a string:

string.delete (string, startIndex, count)

To remove the end of a string, starting at the last occurrence of a given character value:

toys.popStringSuffix (string, character)

Intended for removing delimited suffixes, such as ".html". See also the field verbs, later in this chapter.

To obtain a substring of a string:

string.mid (string, startIndex, count)

A nice feature of string.delete() and string.mid() is that count or startIndex can be too large, or 0, without generating an error. With string.mid() (but, oddly, not with string.delete()), count can even be negative. See Chapter 46, Verbs, for details.

UserTalk has no string.left() or string.right(), to return the leftmost or rightmost n characters of a string. No string.left() is really needed, since string.mid() with a startIndex of 1 will do; but it is useful to have rightN() as a utility script (you can't call it right because that's a reserved word).

*Example 14-1* *rightN( )*
on rightN (s, n)
    on pos (n)
        return n * (n >= 0)
    return string.delete (s, 1, pos (sizeof (s) - n))

Also, it is often useful to remove the rightmost n characters of a string. The following is a utility script, rDelete(), which does it.

*Example 14-2* *rDelete( )*
on rDelete (s, n)
    return string.mid (s, 1, sizeof (s) - n)

To obtain the nth character of a string:

string.nthChar (string, index)

Unlike array notation, it is not an error to supply to string.nthChar() an index which is 0 or too large.

To obtain a string's length:

string.length (string)

Otiose, since sizeOf() is completely equivalent.

Case and Character Type

To change a string's case to uppercase or lowercase:

string.upper (string)

string.lower (string)

The system is consulted, so diacritics are correctly handled.

It is useful to supplement these with boolean tests isUpper() and isLower(); here is an example.

*Example 14-3* *isUpper( )*
on isUpper (s)
    return s == string.upper (s)

To test whether a character is standard alphanumeric (such as can appear in a standard variable name):

toys.alphaChar (character)

To test whether a character is alphabetic:

string.isAlpha (character)

Consults the system for its notion of an alphabetic character, so includes diacritics.

To test whether a character is a digit:

string.isNumeric (character)

The routine toys.isNumeric() is otiose.

To test whether a character is punctuation:

string.isPunctuation (character)

toys.puncChar (character)

The toys version takes a narrow view of what constitutes punctuation; the string version consults the system. I don't know what the purpose is of the toys version not seeing '@' as punctuation.

To strip out non-alphanumeric characters from a string:

toys.dropNonAlphas (string)

Chiefly intended for generating universally legal filenames. Oddly, does not use quite the same definition of an "alpha" as toys.alphaChar() (underscores are stripped).

To turn multiple words into one word with capitalized elements:

toys.innerCaseName (string)

Chiefly intended for generating universally legal filenames. For example:

toys.innerCaseName ("inner case name")

    « "innerCaseName"

It is easy to write a generalized verb that tests whether every character in a string meets a given test, thanks to UserTalk's ability to use a script address as a parameter (see Chapter 8).

*Example 14-4* *testEachChar( )*
on testEachChar (s, addrTest)
    local (x)
    for x = 1 to sizeOf (s)
        if not addrTest^ (s[x])
            return false
    return true

You can write your own test or use one of those already included. For example, to test whether every character in a string meets the string.isAlpha() test:

testEachChar ("thisHasNoSpaces", @string.isAlpha)

    « true

Similarly, the following is a generalized verb that strips a string of all characters failing to meet a given test.

*Example 14-5* *stripTestFailers( )*
on stripTestFailers (s, addrTest)
    local (x, t = "")
    for x = 1 to sizeof (s)
        if addrTest^ (s[x])
            t = t + s[x]
    return t

For example:

stripTestFailers ("The rain, in Spain", @string.isAlpha)

    « "TheraininSpain"

Find and Replace

To determine the index of the first location of a substring in a string:

string.patternMatch (substring, string)

Returns 0 if substring doesn't occur.

Recall that 0 is coerced to false and all other numbers to true if used where a boolean is expected; thus, the result of string.patternMatch() can be used as a boolean test to see whether substring occurs at all.

Not infrequently, it is desired to determine a substring's first location in a string while ignoring some portion of the beginning of the string.

*Example 14-6* *patternMatchAfter( )*
on patternMatchAfter (pattern, s, index)
    local
        where = string.patternMatch (pattern, \
            string.mid (s, index, infinity))
    return (index + where - 1) * boolean (where)

Based on this, it is easy to write a utility to obtain the nth location in a string of a substring.

*Example 14-7* *findNth( )*
on findNth (pattern, s, count)
    local (index = 1, theLength = sizeOf (pattern))
    loop
        index = patternMatchAfter (pattern, s, index)
        if index and --count
            index = index + theLength
        else
            return index
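
For readers who want to check the index arithmetic of Examples 14-6 and 14-7 outside Frontier, here is a minimal sketch of the same logic in Python. It is a model, not UserTalk: the names pattern_match_after and find_nth are hypothetical, and Python's str.find() stands in for string.patternMatch(), with 1 added so that positions stay 1-based and a miss stays 0. It assumes index is at least 1 and, like the original, resumes each search just past the previous match.

```python
def pattern_match_after(pattern, s, index):
    """1-based position of the first match at or after the 1-based index; 0 if none."""
    # str.find() is 0-based and returns -1 on a miss, so adding 1 yields
    # a 1-based position, or 0 for "not found".
    return s.find(pattern, index - 1) + 1


def find_nth(pattern, s, count):
    """1-based position of the count-th occurrence of pattern in s; 0 if none."""
    index = 1
    while True:
        index = pattern_match_after(pattern, s, index)
        if index == 0:
            return 0  # fewer than count occurrences
        count -= 1
        if count == 0:
            return index
        # Resume the search just past the occurrence we found.
        index += len(pattern)
```

In "xxabyyabab" the substring "ab" starts at positions 3, 7, and 9, so asking for the second occurrence gives 7 and asking for a fourth gives 0.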

See also the discussion of the string parsing verbs, later in this chapter.

To replace the first or all occurrences of one substring in a string with another substring:

string.replace (string, oldSubstring, newSubstring)

string.replaceAll (string, oldSubstring, newSubstring)

To delete the first or all occurrences of a substring in a string, call string.replace() or string.replaceAll() with the empty string as the third parameter. We can use this device to count the occurrences of a substring.

*Example 14-8* *countInString( )*
on countInString (pattern, s)
    local (t = string.replaceAll (s, pattern, ""))
    return (sizeof (s) - sizeof (t)) / sizeof (pattern)
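
The arithmetic behind this trick is easy to verify in any language. The following Python sketch (the name count_in_string is hypothetical, and str.replace() merely stands in for string.replaceAll()) deletes every occurrence and divides the shrinkage by the pattern's length.

```python
def count_in_string(pattern, s):
    """Count non-overlapping occurrences of pattern in s by deleting them."""
    # Each deleted occurrence shortens the string by len(pattern) characters.
    # An empty pattern is not handled here: it would divide by zero.
    removed = s.replace(pattern, "")
    return (len(s) - len(removed)) // len(pattern)
```

For instance, deleting "ab" from "xxabyyabab" leaves "xxyy", a loss of six characters, or three occurrences of a two-character pattern.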

To perform grep search and replace, the regex suite is provided. This is a port of the GNU regex library to a UCMD. There is undeniable value in being able to use grep, especially in parsing HTML and other structured text.

This is not the place for a full discussion of grep.[1] The suite includes a ReadMe wptext which explains both grep and the regex verbs. Admittedly, grep is probably not everyone's cup of tea; and writing a grep expression in UserTalk is not made any easier by the rules of string literal construction. Here, for instance, is a grep pattern that finds stretches of quoted material containing an escaped double quote (\"):

"\"[^\"]*(\\\\(\"[^\"]*\\\\)*\"[^\"]*)?\""

This looks more formidable than it is; it boils down to:

"[^"]*(\\("[^"]*\\)*"[^"]*)?"

surrounded by quotes and rendered with escape characters.[2] Even so, it's fairly opaque. Such is the price of being able to take advantage of the power of grep. This power is well demonstrated by the included regex.toys examples, such as the utility which parses a browser bookmark file from HTML into an outline of title-URL pairs. Tasks of this class would be very difficult in UserTalk without regex.[3]

Here, we content ourselves with sketching the basic behavior of the three most important regex verbs. Consult the documentation in the suite for full information. Notice that the target string must be passed by reference (an address).

To perform a grep search and replace:

regex.subst (searchFor, replaceWith, addrString, caseSensitive?, maxSubstitutions)

To obtain a list of all substrings matching a grep pattern:

regex.extract (searchFor, addrString, addrList, groups, caseSensitive?)

To parse a string according to a grep pattern:

regex.split (searchFor, addrString, caseSensitive?)

The result is a list of strings; these are what is left of the string at addrString when all substrings matching searchFor are removed, interspersed with the matches on any group expressions in searchFor.

Suppose, for example, we wish to put into lowercase everything not enclosed in double quotes. We can use regex.split() to break the string into pieces, run through the list lowering the case of those items that don't start with a double quote, and reassemble the string. (An extra verb, regex.join(), is supplied for just this purpose.) The following code is an illustration:

local
    myString = "This Is An \"Amazing\" Demonstration"
    theList = regex.split ("(\".*\")", @myString)
for x = 1 to sizeOf (theList)
    if not (theList[x] beginsWith "\"")
        theList[x] = string.lower (theList[x])
return regex.join ("", @theList)

    « "this is an \"Amazing\" demonstration"

The group-making parentheses in the first parameter to regex.split() are crucial; otherwise, theList will be:

{"This Is An ", " Demonstration"}
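
The same split-with-groups behavior can be observed outside Frontier. As a rough parallel only (Python's re module is not the regex suite, and the name lower_outside_quotes is hypothetical), re.split() likewise keeps the delimiters in the result list when the pattern contains a capturing group and drops them when it does not.

```python
import re


def lower_outside_quotes(s):
    """Lowercase everything in s except the stretch enclosed in double quotes."""
    # With a capturing group, re.split() intersperses the matched text
    # among the pieces; the pattern is greedy, as in the original example.
    pieces = re.split(r'(".*")', s)
    # Leave the quoted pieces alone and lowercase the rest.
    pieces = [p if p.startswith('"') else p.lower() for p in pieces]
    return "".join(pieces)
```

Calling re.split() with the ungrouped pattern `".*"` on the same input yields only the two unquoted pieces, which is exactly the loss described above.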

Pad and Trim

To generate a string made up of one character, repeated:

string.filledString (character, length)

To attach leading zeros to a number:

toys.padWithZeros (theNumber, length)

If we are making ten files, for example, and we call them file1, file2, ... file10, then the system will show their names sorted in this order: file1, file10, file2, .... To prevent this, we call the files file01, file02, ... file10; toys.padWithZeros() lets us create the suffixes using a counting variable.
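
The sorting problem, and the cure, can be demonstrated with a small Python sketch. The function pad_with_zeros is a hypothetical stand-in for toys.padWithZeros(); in particular, what the real verb does with a number already longer than length is not something this sketch claims to reproduce.

```python
def pad_with_zeros(number, length):
    """Return number as a string, left-padded with zeros to at least length digits."""
    # str.zfill() leaves the string alone if it is already long enough.
    return str(number).zfill(length)


# File names built with and without padding, using a counting variable.
unpadded = ["file" + str(n) for n in range(1, 11)]
padded = ["file" + pad_with_zeros(n, 2) for n in range(1, 11)]
```

Sorting unpadded as strings puts file10 between file1 and file2, whereas padded is already in sorted order.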

To remove all leading or trailing instances of a character:

string.popLeading (string, character)

string.popTrailing (string, character)

It is useful to write a combined utility verb that removes a character from both ends of a string; trim() might be a good name for it. And one of the first UserTalk scripts I ever wrote was a utility verb popAllTrailing(), which accepts a string of characters, any of which are to be removed from the end of a string.

*Example 14-9* *popAllTrailing*
on popAllTrailing (theString, theChars)
    while theChars contains theString[sizeof (theString)]
        delete (theString[sizeof (theString)])
    return theString

Parsing

To "parse" here means to recognize divisions of a string based on some delimiter character. UserTalk has various ways of doing this.

Fields

The fields of a string are the substrings before, between, and after all occurrences of the delimiter character, except that if the last character of a string is the delimiter, there is no field after it.

For example, suppose . is the delimiter. Then in "root.system", there are two fields, root and system. In ".root.system." there are three fields: the empty string before the opening delimiter; then root; then system - there is no field after the final delimiter.

To count the number of fields:

string.countFields (string, delimiterChar)

To obtain a particular field:

string.nthField (string, delimiterChar, n)

Often what you really want to know is where the nth field is, rather than its value; the utility findNth() in Example 14-7 will help with this. Another useful utility to write is one that deletes the nth field.

Note that to deal with paragraphs (or lines) you just use the field verbs with cr as the delimiter character.

To remove a UserTalk trailing comment:

toys.commentDelete (string)

string.commentDelete (string)

toys.commentDelete() is implemented by using « as the delimiter character and returning the first field. It works around the fact that string.commentDelete() is limited to 255 characters. But so is a line of a script, and string.commentDelete() is smarter in a different way: it ignores « in a string literal.

Words

The term "words" is a little misleading; any character can serve as the delimiter, just as with fields. The real difference between fields and words is this: All runs of word-delimiters are counted as a single word-delimiter, and both leading and trailing word-delimiters are ignored.

For example, suppose . is the delimiter. Then "..root..system.." has six fields: empty, empty, root, empty, system, and empty. But it has only two words: "root" and "system".
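
To make the two definitions concrete, here is a minimal Python model of them. It is a sketch of the rules as stated in this chapter, not of Frontier's implementation: the names fields and words are hypothetical, the delimiter is assumed to be a single character, and the treatment of the empty string is an assumption.

```python
def fields(s, delimiter):
    """Substrings before, between, and after every delimiter."""
    parts = s.split(delimiter)
    # If the string ends with the delimiter, there is no field after it.
    if s.endswith(delimiter):
        parts.pop()
    return parts


def words(s, delimiter=" "):
    """Like fields, but runs of delimiters count as one, and leading and
    trailing delimiters are ignored, so no word is ever empty."""
    return [part for part in s.split(delimiter) if part]
```

Applied to "..root..system.." with . as the delimiter, fields() returns six items, four of them empty, and words() returns just "root" and "system".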

Unlike the field verbs, the word verbs do not take the delimiter as a parameter; instead, the word-delimiter is a "hidden" global. Once changed it retains its value until changed again or until Frontier is shut down, at which point it reverts to the space character. Typically, if you want the word-delimiter to be something other than a space, you'll set it, do some word operations, and restore it.[4]

To set or obtain the word-delimiter:

string.setWordChar (character)

string.getWordChar ()

To count the number of words:

string.countWords (string)

To obtain a particular word:

string.nthWord (string, n)

string.firstWord (string)

string.lastWord (string)

Obviously string.firstWord() and string.lastWord() could easily be written as utility scripts using string.nthWord() and string.countWords(). Given that they do exist, though, it's a pity there are no string.firstField() and string.lastField() to parallel them.

As with fields, often what one wants to know is where the nth word is. This is slightly harder to write as a utility than Example 14-7 was, because by the definition of a word, it is not sufficient to know where the nth word-delimiter is. It is implemented for you as wordInfo.getNthWordOffset() (in system.extensions). For example, here is a routine to capitalize the first letter of every word.

*Example 14-10* *titleCase( )*
on titlecase (s)
    local (x, n)
    for x = 1 to string.countwords (s)
        n = wordInfo.getNthWordOffset (s, x)
        s[n] = string.upper (s[n])
    return s

To obtain the first sentence:

string.firstSentence (string)

This one's an oddity. There are no other "sentence" verbs, and the definition of a sentence is primitive. If you really need to work with genuine sentences from within Frontier, you're probably better off writing a grep pattern and using regex.

Pathnames

Pathnames specify a file, folder, or volume. Some character (on Mac OS it is a colon) delimits elements of pathnames.

To obtain just the name of a file:

file.fileFromPath (pathname)

To obtain the pathname of the enclosing folder:

file.folderFromPath (pathname)

To obtain the pathname of the enclosing volume:

file.volumeFromPath (pathname)

If a volume or folder is returned, the string includes a final colon. For example, if pathname is "HD:someFolder:someFile", these verbs return, respectively, "someFile", "HD:someFolder:", and "HD:". The idea is that you can confidently concatenate strings returned from these verbs to form new pathnames.
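
The point about concatenation can be illustrated with a small Python model of the three verbs. The function names are hypothetical, and the sketch covers only the case shown here, a full colon-delimited pathname to a file; how the real verbs treat other inputs is not modeled.

```python
def file_from_path(path):
    """Everything after the last colon."""
    return path.rsplit(":", 1)[-1]


def folder_from_path(path):
    """Everything up to and including the last colon."""
    head, sep, _ = path.rpartition(":")
    return head + sep


def volume_from_path(path):
    """Everything up to and including the first colon."""
    head, sep, _ = path.partition(":")
    return head + sep
```

Because the folder keeps its final colon, joining the folder and the filename with plain string concatenation reproduces the original pathname.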

Formatted Numbers

To insert commas into the string representation of a number:

string.addCommas (number)

For example, string.addCommas(1234) is "1,234".

HTML-Related Conversions

The incorporation of Web site management features into Frontier has brought with it a number of utilities for performing HTML-related munging of strings. Some of these started life as scripts, were subsequently reimplemented as UCMDs, and finally were built into the kernel, for speed. Utilities optimized in this way remain in the database so that old scripts calling them don't break; for instance, toys.iso8859filter(), the original version, now simply calls string.iso8859encode(). Only the optimized version is listed here.

To pass a string's high-ASCII characters through a translate table:

string.iso8859encode (string)

string is typically the text of a Web page, and the idea is to treat all high-ASCII characters so that this text becomes universally legible over the Web, by way of ISO 8859-1 (Latin-1) encoding. The default translate table used is an internal copy of one that sits at suites.html.utilities.iso8859.table. As you can see, entry names are numeric ASCII values; these are paired with the string to which they are to be translated.

The default table tries to deal with the fact that the Mac character set does not completely translate to the ISO 8859 standard. So, the ampersand named-entity reference is used when it exists (for example, ASCII 138, ä, becomes "&auml;"), and otherwise a description in square brackets is employed (for example, ASCII 170, the trademark symbol, is rendered as "[trademark]"). A few characters are translated directly to low-ASCII approximations: curly quotes become straight, en- and em-dashes become hyphens, bullets become o, and the like.

The verb is more flexible than its name suggests: you are allowed to substitute your own translate table. The rule is that if there is a table at user.html.prefs.iso8859map, it will be used instead of the default. The best way to create this table is to copy suites.html.utilities.iso8859.table into user.html.prefs, rename it iso8859map, and alter the values to suit. What I do, actually, is to call the table something else, such as iso8859map1; that way, I can keep it on hand without its interfering with the default operation of string.iso8859encode(). Then, when I want a script to use my table, I have my script rename the table to iso8859map, call string.iso8859encode(), and then restore the name.

You may omit high-ASCII values from the table and the omitted values will simply translate to themselves. The routine does just one pass, so what a high-ASCII character is translated to can involve high-ASCII characters, including itself. You cannot use this verb to translate low-ASCII characters; low-ASCII numbers used as names in the translate table will be ignored.

To translate between a normal string and %-encoding:

string.URLencode (string)

string.URLdecode (string)

URLs can contain only a limited character set (as suggested, for instance, by RFC 1738); other characters are escaped using % followed by their hex ASCII value (octet), except that a space may be represented by +. These verbs perform the translation respectively to and from this encoding.

To parse CGI arguments:

string.parseHTTPargs (string)

This verb handles form data such as a browser generates as part of a POST method when the Submit button is pressed in an HTML form. Form data typically consists of name=value pairs delimited by ampersand and URL-encoded; this verb URL-decodes and parses the input into a list of strings, name followed by value. For example:

string.parseHTTPArgs ("name=Martin+M%9Fller&address=456+Main")

    « {"name", "Martin Müller", "address", "456 Main"}
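
The decoding steps can be reproduced with Python's standard library, which makes it easy to see what each one contributes. This is only a sketch: parse_http_args is a hypothetical name, and the choice of the Mac Roman codec is an assumption made so that %9F decodes to ü, as it does in the example above.

```python
from urllib.parse import parse_qsl


def parse_http_args(query):
    """URL-decode form data and flatten it into [name, value, name, value, ...]."""
    flat = []
    # parse_qsl() splits on "&" and "=", turns "+" into a space, and decodes
    # %XX escapes; the encoding says how to interpret octets above 127.
    for name, value in parse_qsl(query, keep_blank_values=True, encoding="mac_roman"):
        flat.extend([name, value])
    return flat
```

Given the form data shown above, the function returns the same four strings, name followed by value.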

The chief function of string.parseHTTPargs() is to be called by suites.webserver.parseArgs(), so you probably won't call it directly; it is more likely that you will call suites.webserver.parseArgs(), particularly if you pass search arguments to a CGI with a GET method. See Chapter 40, CGIs.

To split an absolute URL into its components:

toys.URLsplit (string)

Components are scheme, domain, and path; the domain does not end with a slash, so your use of this verb may need to supply one. For example:

toys.URLsplit ("http://www.tidbits.com/matt")

    « {"http://", "www.tidbits.com", "matt"}

To convert a relative URL to an absolute URL with respect to a given base:

pbs.utilities.parseRelativeURL (baseURL, relativeURL)
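
What "with respect to a given base" means is perhaps best shown by example. The sketch below uses Python's urljoin() purely as an illustration of relative-URL resolution; the wrapper name resolve and the paths are hypothetical, and it is an assumption that pbs.utilities.parseRelativeURL() resolves every case the same way.

```python
from urllib.parse import urljoin


def resolve(base_url, relative_url):
    """Resolve relative_url against base_url, giving an absolute URL."""
    return urljoin(base_url, relative_url)


base = "http://www.tidbits.com/matt/index.html"
sibling = resolve(base, "other.html")          # same folder as the base page
upward = resolve(base, "../images/logo.gif")   # one folder up, then down
rooted = resolve(base, "/top.html")            # relative to the site root
```

The three results are, respectively, a page in the same folder as the base, a file reached by climbing one folder, and a page at the top level of the site.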

To extract a list of URLs:

pbs.getLinks (string)

To remove all HTML tags:

pbs.stripHTML (string)

The functionality of both pbs.getLinks() and pbs.stripHTML() might also be duplicated using regex, but the pbs versions are simpler and faster.

1. See Jeffrey E. F. Friedl, Mastering Regular Expressions (O'Reilly & Associates, Inc., 1997).

2. Since a backslash is already an escape character in grep, it must be escaped in a grep expression to be understood as a literal: to look for \ one must say \\. But now each of those must be escaped to be understood literally in a UserTalk string literal; whence the horrid \\\\ just to look for one backslash.

3. A commercial application, TextMachine, performs grep search and replace through English-like codes. For example, the grep pattern in the main text could be rendered as:

"[quote][zeroOrMore not quote][backslash]" + \
"[zeroOrMore (quote, (zeroOrMore not quote), backslash)]" + \
"[quote][zeroOrMore not quote][quote]"

which is pretty easy to code, to debug, and to read. See http://www.prefab.com/textmachine.html.

4. The word-delimiter is truly global, not confined to the current thread only; this means that you may need to take care with a script running in a multithreaded context, lest it clash with some other thread over the word-delimiter's behavior. See Chapter 21, Yielding, Pausing, Threads, and Semaphores.