This page is no longer maintained — Please continue to the home page at www.scala-lang.org

More elegant way of reading HTML from a URL than this?

16 replies
Kenneth McDonald
Joined: 2009-01-11,

Here's a bit of code I wrote to read the HTML from a URL, and return
it as a string. I was wondering if a Scala guru could show me the
"right" way to do this. I'm sure there's a more elegant solution.

-----------------------------
class URLLineReader(url:String) extends Iterator[String] {
   val reader = new java.io.BufferedReader(new java.io.InputStreamReader(new java.net.URL(url).openStream()))
   var line:String = null;

   def hasNext = {
       line = reader.readLine()
       line != null
   }

   def next = line
}

object Main {
   def main(args: Array[String])  {
       val reader = new URLLineReader("http://www.yahoo.com/")
       val html = (for (line <- reader) yield line).mkString("")
       println(html)
   }
}
------------------------------

ounos
Joined: 2008-12-29,
Re: More elegant way of reading HTML from a URL than this?

Scala apart, it's quite bad style for hasNext() not to be idempotent.


Tako Schotanus
Joined: 2008-12-22,
Re: More elegant way of reading HTML from a URL than this?
Besides that, the code doesn't honor its contract: next() doesn't return the next line, as it's supposed to, unless you have called hasNext() first.

So not only is it bad style, it's just plain wrong :)
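
A minimal sketch of an iterator that keeps the contract: it reads one line ahead, so hasNext only inspects state and next() works without a prior hasNext call. The class name LineIterator is made up for illustration, and it wraps any BufferedReader instead of opening a URL so that it can be tried without a network connection:

```scala
import java.io.{BufferedReader, StringReader}

// Hypothetical class name, for illustration only.
class LineIterator(reader: BufferedReader) extends Iterator[String] {
  // Look-ahead buffer: the line the next call to next() will return,
  // or null once the end of the input has been reached.
  private var lookahead: String = reader.readLine()

  // Only inspects the buffer, so calling it repeatedly changes nothing.
  def hasNext: Boolean = lookahead != null

  // Returns the buffered line and reads the following one.
  def next(): String = {
    if (lookahead == null) throw new NoSuchElementException("end of input")
    val current = lookahead
    lookahead = reader.readLine()
    current
  }
}
```

For the use case in this thread, one would pass in new BufferedReader(new InputStreamReader(new java.net.URL(url).openStream())).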



Ricky Clarkson
Joined: 2008-12-19,
Re: More elegant way of reading HTML from a URL than this?
class URLLineReader(url: String) {
  val reader = new java.io.BufferedReader(new java.io.InputStreamReader(new java.net.URL(url).openStream(), "US-ASCII"));
  def foldLeft[T](init: T)(f: (T, String) => T): T = reader.readLine match {
    case null => init
    case line => foldLeft(f(init, line))(f)
  }
}

object Main {
  def main(args: Array[String]) = println(new URLLineReader("http://www.yahoo.com/").foldLeft("")(_ + _))
}
ounos
Joined: 2008-12-29,
Re: More elegant way of reading HTML from a URL than this?

At least the original not-so-precise version was almost linear. This is
quadratic. And pages tend to be quite lengthy these days, so beware.


loverdos
Joined: 2008-11-18,
Re: More elegant way of reading HTML from a URL than this?
TRUE, TRUE, TRUE (!)

> Scala apart, it's quite bad style for hasNext() not to be idempotent.

--
 __~O
-\ <,       Christos KK Loverdos
(*)/ (*)      http://ckkloverdos.com
Stepan Koltsov
Joined: 2008-12-20,
Re: More elegant way of reading HTML from a URL than this?

InputStreamResource.url("http://...").readString

InputStreamResource.url("http://...").readLines

InputStreamResource.url("http://...").lines.foreach(println(_))

InputStreamResource is part of scalax.

BTW, I think InputStreamResource-like classes should be included in
the Scala standard library.

S.


Ricky Clarkson
Joined: 2008-12-19,
Re: More elegant way of reading HTML from a URL than this?
How is what I showed quadratic?

2009/1/21 Dimitris Andreou <jim [dot] andreou [at] gmail [dot] com>:
> At least the original not-so-precise version was almost linear. This is quadratic. And pages tend to be quite lengthy these days, so beware.


Derek Chen-Becker
Joined: 2008-12-16,
Re: More elegant way of reading HTML from a URL than this?

Stepan Koltsov wrote:
> InputStreamResource.url("http://...").readString
>
> InputStreamResource.url("http://...").readLines
>
> InputStreamResource.url("http://...").lines.foreach(println(_))
>
> InputStreamResource is part of scalax.
>
> BTW, I think InputStreamResource-like classes must be included into
> the scala standard library.

I second that. Scala's included IO package is pretty anemic. At the very
least, it would be nice to have some wrappers similar to JCL to add some
nice scala-ish functionality to existing Java IO classes. Of course, I
don't personally have time to work on it so I can't complain too loudly ;)

Derek

ounos
Joined: 2008-12-29,
Re: More elegant way of reading HTML from a URL than this?

Maybe my scala-code-parsing brain neurons are still too weak, but I
think you wrote the equivalent of:

val lines: Seq[String] = ...
var output = ""
for (line <- lines) output += line

No?
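
To make the cost concrete, here is a rough sketch with an in-memory list standing in for the lines of the page. The object name ConcatCost and the 1000 lines of 80 characters are made-up illustration values. Each + builds a new String and copies everything accumulated so far, while a StringBuilder appends into a growable buffer:

```scala
object ConcatCost {
  // Hypothetical stand-in for the lines read from the URL.
  val lines: Seq[String] = Seq.fill(1000)("x" * 80)

  // Quadratic: every step allocates a new String and copies the whole accumulator.
  val slow: String = lines.foldLeft("")((acc, line) => acc + line)

  // Roughly linear: appends go into one growable buffer.
  val fast: String =
    lines.foldLeft(new StringBuilder)((sb, line) => sb.append(line)).toString

  // Total characters written by the concatenating fold: 80 + 160 + ... + 80000.
  val copied: Long = lines.scanLeft(0L)((acc, line) => acc + line.length).tail.sum
}
```

With these numbers the concatenating fold writes 40,040,000 characters in total to build an 80,000-character string, and the gap widens as the page gets longer.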


Bryan
Joined: 2008-12-19,
Re: More elegant way of reading HTML from a URL than this?
If performance is such an issue, couldn't you first get the Content-Length from the HTTP headers and then allocate a StringBuilder with that length as its initial capacity? StringBuilder's append should be faster than String concatenation.
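
A rough sketch of that idea, assuming the length comes from java.net.URLConnection.getContentLength, which returns -1 when the server sends no Content-Length header. The object name SizedRead, the readAll helper and the 8192 fallback capacity are made up for illustration. Content-Length counts bytes rather than characters, so it is used here only as a capacity hint:

```scala
import java.io.{Reader, StringReader}

object SizedRead {
  // contentLength: value of URLConnection.getContentLength, or -1 if unknown.
  def readAll(reader: Reader, contentLength: Int): String = {
    // Fall back to an arbitrary default when the header is missing.
    val capacity = if (contentLength > 0) contentLength else 8192
    val out = new java.lang.StringBuilder(capacity)
    val buf = new Array[Char](4096)
    var n = reader.read(buf)
    while (n != -1) { // read returns -1 at end of stream
      out.append(buf, 0, n)
      n = reader.read(buf)
    }
    out.toString
  }
}

// Possible use against a real URL (not run here):
//   val conn = new java.net.URL("http://www.yahoo.com/").openConnection()
//   val in   = new java.io.InputStreamReader(conn.getInputStream, "US-ASCII")
//   val html = SizedRead.readAll(in, conn.getContentLength)
```

Because the header only sizes the buffer, the same read loop serves responses with and without a Content-Length.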


Viktor Klang
Joined: 2008-12-17,
Re: More elegant way of reading HTML from a URL than this?
But then you'd have to have two branches in the code, one for responses _with_ Content-Length, and one for terminated-at-end-of-transmission logic.

--
Viktor Klang
Senior Systems Analyst
Ricky Clarkson
Joined: 2008-12-19,
Re: More elegant way of reading HTML from a URL than this?
Indeed.  I was looking in the URLLineReader class for, um, quadraticity.  Here's a fixed up main:

object Main {
  def main(args: Array[String]) = println(new URLLineReader("http://www.yahoo.com/").foldLeft(new StringBuilder)(_ append _))
}


ounos
Joined: 2008-12-29,
Re: More elegant way of reading HTML from a URL than this?

Surely. It would be much faster even with the typically modest default
initial size.

I wanted to make the (obvious, in my opinion) point that making an
algorithm so much slower is inexcusable, for whatever kind of elegance's
sake. (I had thought that Ricky had consciously chosen this kind of
'elegance' over performance, but it was probably by mistake, so it's OK.)

Bryan wrote:
> If performance is such an issue, couldn't you first get the
> content-length from the HTTP headers and then allocate the initial
> capacity of a StringBuilder with that content-length. StringBuilder's
> append should be faster than String concatenation.

Ricky Clarkson
Joined: 2008-12-19,
Re: More elegant way of reading HTML from a URL than this?
Actually, I was choosing readability and referential transparency over performance. Computer programs are primarily for humans to read, and only incidentally for machines to execute (probably a paraphrase, rather than a quote, from SICP).


Frank Teubler
Joined: 2009-01-22,
Re: More elegant way of reading HTML from a URL than this?

Here is the URLLineReader using java.util.Scanner:

--------------------
class URLLineReader(urlstring: String) extends Iterator[String] {
  val url = new java.net.URL(urlstring)
  val scan = new java.util.Scanner(url.openStream)

  // Scanner does not advance on hasNextLine, so this can be called repeatedly
  def hasNext: Boolean = scan.hasNextLine
  def next(): String = scan.nextLine()
}
--------------------

And if you'd like to read the text in one piece:

--------------------
def text(urlstring: String): String = {
  val url = new java.net.URL(urlstring)
  val scan = new java.util.Scanner(url.openStream)
  scan.useDelimiter("\\Z") /* End Of File */
  scan.next()
}
--------------------
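
For comparison, a sketch of the same one-piece read using scala.io.Source from the standard library, which also closes the stream when done. The object name FetchText is made up for illustration, and UTF-8 is an assumption about the page's encoding:

```scala
import scala.io.{Codec, Source}

object FetchText {
  // Reads the whole resource behind the URL into one String.
  def text(url: java.net.URL): String = {
    val source = Source.fromURL(url)(Codec.UTF8) // assumed encoding
    try source.mkString
    finally source.close() // release the underlying stream
  }
}
```

It would be called as FetchText.text(new java.net.URL("http://www.yahoo.com/")).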

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland