A URL, or Uniform Resource Locator, is the standard address format used for most HTTP interaction; however, it is not limited to HTTP.
A URL is composed of two main parts, the protocol identifier and the resource name.
http://www.starcraft2.com

Protocol - http
Resource name - www.starcraft2.com
There are several protocols other than HTTP, like FTP (File Transfer Protocol) or one we all see quite often, file. The file protocol (file://) is the one used to address files locally on your own computer.
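To show that a URL really isn't tied to HTTP, here is a minimal Java sketch that writes a scratch file and reads it back through its file: URL. The class and method names (FileUrlDemo, readAll, roundTrip) are made up for illustration, and it assumes Java 9 or newer for readAllBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileUrlDemo {
    // Reads everything a URL points to, whatever its protocol is.
    static String readAll(URL url) {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to a temporary file, then reads it back via a file: URL.
    // Returns "<protocol>:<content>" so the protocol in use is visible.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("url-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                URL fileUrl = tmp.toUri().toURL();
                return fileUrl.getProtocol() + ":" + readAll(fileUrl);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello from disk"));
    }
}
```

The same readAll method would work unchanged on an http:// URL, which is the whole point of having one address format for many protocols.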
The resource name has 4 different components, 2 of which are optional and often implied.
There is the:
Host Name - the name of the machine where the resource lives - for the starcraft2 URL this is the full www.starcraft2.com
File Name - path name to the file on the machine - if you went to www.starcraft2.com/features, the /features portion is both the file name, and what is called a 'relative URL' (meaning it is a locator relative to the base URL)
Port Number - this is typically implied, but can also be specified. For plain HTTP the default port is 80 (HTTPS uses 443)
Reference - a reference to an anchor at a specific location in a file (the part after a #). This is optional, and you typically won't notice it.
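As a rough illustration of the 'relative URL' idea from the File Name entry, java.net.URL has a two-argument constructor that resolves a relative locator against a base URL. This is only a sketch: the RelativeUrl class and resolve helper are hypothetical names, and the units.html path is invented for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeUrl {
    // Resolves a relative URL against a base URL and returns the absolute form.
    static String resolve(String base, String relative) {
        try {
            return new URL(new URL(base), relative).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A leading slash replaces the whole path of the base URL.
        System.out.println(resolve("http://www.starcraft2.com", "/features"));
        // Without a leading slash, only the last path segment is replaced.
        System.out.println(resolve("http://www.starcraft2.com/features/index.html", "units.html"));
    }
}
```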
I will expand the resource name to its full form so you can see the components:
www.yahoo.com:80/index.html

Host name - www.yahoo.com
Port - 80
File name - /index.html
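If you would rather let code pull the pieces apart, java.net.URL exposes a getter for each component. The sketch below is just one way to look at them; the UrlParts class, the parse helper, and the #top anchor are assumptions added for illustration and do not refer to a real anchor on that site.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {
    // Wraps the checked exception so the calling code stays short.
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        // "#top" is a made-up anchor, added only to show the reference part.
        URL url = parse("http://www.yahoo.com:80/index.html#top");
        System.out.println("protocol  = " + url.getProtocol());
        System.out.println("host      = " + url.getHost());
        System.out.println("port      = " + url.getPort());
        System.out.println("file      = " + url.getPath());
        System.out.println("reference = " + url.getRef());

        // With no explicit port, getPort() returns -1 and the protocol's
        // default (80 for http) is what actually gets used.
        URL implied = parse("http://www.starcraft2.com/features");
        System.out.println(implied.getPort() + " -> " + implied.getDefaultPort());
    }
}
```

Note that parsing a URL this way does not touch the network; it only splits the string.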
For most websites, when the URL ends with .com/ the server hands back a default file, usually /index.html. This isn't universally true, because it is the server (not your browser) that decides what a bare / maps to. For example, asking for /index.html explicitly doesn't work for www.starcraft2.com because, I believe, their site is so pro and Flash-based that they find these norms irrelevant (try typing it, the error page is funny).
Connecting to a page without a browser is actually pretty easy. One easy way is to go to your terminal and type:
curl www.starcraft2.com
This won't actually give you a lasting connection; it will just print out the source of the page. If you want to save the source using curl, you can type:
curl www.starcraft2.com >> ~/Desktop/starcraftSource.txt
(The >> appends to the file if it already exists; use a single > to overwrite it instead.)
Most languages have built-in methods to connect to a URL, and I know you are all gonna hate it, but I will show you the Java way since I am making a Java Socket Server.
import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        // The protocol is required; without "http://" the URL constructor
        // throws a MalformedURLException.
        URL sc2 = new URL("http://www.starcraft2.com");
        URLConnection sc2connection = sc2.openConnection();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(sc2connection.getInputStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}
This will open a connection, grab all the HTML, and print it out. There are other things you can do with the connection, like write to the server.
Most HTML pages have 'forms' - GUI objects that allow interaction with the site. When a form uses the GET method, your browser writes what you entered into the URL as a query string (the part after the ?) and sends it to the server. Once the server receives the new URL, it processes it, builds a response, and sends back a bunch more HTML for the browser to view.
Lots of HTML forms use the HTTP POST method to send data to the server instead. In that case the data travels in the body of the request rather than in the URL. This is called 'posting to a URL'. The server recognizes a post, and then responds.
If you know which fields are meant for interaction, or the form the URL takes after an interaction, then you can request that URL directly, or write the form data to your connection yourself, and see the response.
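Here is a rough sketch of what writing to the connection could look like in Java. Everything specific in it is an assumption for illustration: the FormPoster class, the field names race and strategy, and whatever URL you pass on the command line. It assumes Java 10 or newer for the URLEncoder.encode overload that takes a Charset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormPoster {
    // Encodes fields the way a browser encodes a form: name=value pairs joined by '&'.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Sends the encoded form as the body of a POST request and returns the response text.
    static String post(URL url, String formBody) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true); // we intend to write a request body
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(formBody.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("race", "zerg");                 // hypothetical field names
        fields.put("strategy", "6 pool & drones");
        String body = encodeForm(fields);
        System.out.println(body);
        // Only post if a form URL was supplied as the first argument.
        if (args.length > 0) {
            System.out.println(post(new URL(args[0]), body));
        }
    }
}
```

For a GET form, the same encoded string would simply be appended to the URL after a ? rather than written to the output stream.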
Later today I will go over a lot of the interesting Threading issues involved with making a Socket Server.