Talk:Suffix tree: Difference between revisions

From Rosetta Code
Content added Content deleted
Line 36: Line 36:


::::: Yes, I misread your example. But then, your example is just like the one I provided earlier: "aa$", only with more "a"s. If you can work out how to find the substring "a$" and "aa$" in that example, extending it to arbitrary length is a no brainer. --[[User:Ledrug|Ledrug]] ([[User talk:Ledrug|talk]]) 01:54, 26 May 2013 (UTC)
::::: Yes, I misread your example. But then, your example is just like the one I provided earlier: "aa$", only with more "a"s. If you can work out how to find the substring "a$" and "aa$" in that example, extending it to arbitrary length is a no brainer. --[[User:Ledrug|Ledrug]] ([[User talk:Ledrug|talk]]) 01:54, 26 May 2013 (UTC)

:::::: I am not sure I can work that out. Let's take aaa$. The root node can have only one edge leading away from it, which has the label a. I would expect that this node is similar to the tree for 'aa$' but the wikipedia page claims that except for the root node all nodes must have at least two children, and you have stated that only one edge leading from the aa$ node is allowed. I cannot think of any implementation which can satisfy both of these constraints. Perhaps because I am stuck in my thinking about this structure, I also cannot think of any useful algorithms that would use it. --[[User:Rdm|Rdm]] ([[User talk:Rdm|talk]]) 05:09, 26 May 2013 (UTC)

Revision as of 05:09, 26 May 2013

Different test case?

Can we use maybe "banana" from the WP page instead? "rosettacode" simply doesn't have enough interesting repetitions. --Ledrug (talk) 17:58, 17 May 2013 (UTC)

Ok --Grondilu (talk) 21:11, 25 May 2013 (UTC)

definition?

The wikipedia definition for a suffix tree currently looks like this:

The suffix tree for the string of length is defined as a tree such that:[1]
  • the paths from the root to the leaves have a one-to-one relationship with the suffixes of ,
  • edges spell non-empty strings,
  • and all internal nodes (except perhaps the root) have at least two children.
  1. Template:Harvtxt, p.90.
  2. But this can be satisfied by a tree with only a root where each node is a unique suffix. Something is missing from this definition, and that something seems to have something to do with substrings which appear in multiple locations in the string.

    What is the missing part of this definition? --Rdm (talk) 16:33, 23 May 2013 (UTC)

    On every node, each edge leading away from it must start with a unique letter. So the suffix tree of string "aa$" can't have two edges "a$" and "aa$" leading away from the root node. Instead it must have one edge "a" pointing to a second node, which in turn has two outgoing edges "a$" and "$". The reason for unique leading letters is that, otherwise when matching substrings, one can't decide which edge to follow at each node in O(1) time. --Ledrug (talk) 13:02, 24 May 2013 (UTC)
    I'm still not seeing how to make this work in O(1) time (interpreting the big-O notation as representing worst case behavior). Let's say our string is an arbitrary length sequence of a single letter (followed by the terminating character). Now we have a single node and the number of edges we have to pick between is O(n) and we need something like an oracle (or luck or a complete scan) to tell us which of them we need to follow. --Rdm (talk) 19:39, 24 May 2013 (UTC)
    Eh, either an oracle, or a hash table maybe? If you have a single node with n edges going out, but each edge begins with a unique letter, then map these letters to the edges. During a string match, just see what next letter is, and do an O(1) lookup to get the corresponding edge. --Ledrug (talk) 15:27, 25 May 2013 (UTC)
    If by "each edge begins with a unique letter" you mean "each edge begins with a uniquely different letter" then I agree. However, the example I proposed had each edge begin with a letter which is the same as every other edge (except for the final edge). This letter is in a sense unique (it's the only letter in the example before we decorate it with the final character) but it's probably better to say that we cannot be guaranteed that each edge begins with a different unique letter. Or, more concisely: did you read what I wrote? --Rdm (talk) 15:33, 25 May 2013 (UTC)
    Yes, I misread your example. But then, your example is just like the one I provided earlier: "aa$", only with more "a"s. If you can work out how to find the substring "a$" and "aa$" in that example, extending it to arbitrary length is a no brainer. --Ledrug (talk) 01:54, 26 May 2013 (UTC)
    I am not sure I can work that out. Let's take aaa$. The root node can have only one edge leading away from it, which has the label a. I would expect that this node is similar to the tree for 'aa$' but the wikipedia page claims that except for the root node all nodes must have at least two children, and you have stated that only one edge leading from the aa$ node is allowed. I cannot think of any implementation which can satisfy both of these constraints. Perhaps because I am stuck in my thinking about this structure, I also cannot think of any useful algorithms that would use it. --Rdm (talk) 05:09, 26 May 2013 (UTC)