PARAM VIR SINGH

 

Carnegie Bosch Institute Junior Chair and Associate Professor of Business Technologies
David A. Tepper School of Business
Carnegie Mellon University
Pittsburgh, PA 15213
Phone: 412-268-3585

Resources for PhD Students

How to download data from the Web?
The ability to download data from the web is very helpful in business research. I personally benefitted from the code that Professor Ravi Bapna (University of Minnesota) posted on his website in the past to help PhD students learn how to download data. Since that webpage is no longer available, I wanted to provide this resource to PhD students who may benefit from such code.
Here are some basic steps you will need to follow.
1. Download and install XAMPP from http://www.apachefriends.org/en/xampp.html
To download data and store it in a database you need three packages: Apache, PHP, and MySQL. Installing an Apache web server by hand is not easy, and it gets harder if you want to add MySQL, PHP, and Perl. XAMPP is a free, pre-packaged bundle that installs and configures these packages together.

2. To download data from the web you will need to write a PHP script. Save the script in “c:\xampp\htdocs” and make sure the file extension is “.php”.

3. Download a PHP editor such as EditPlus. Get the community edition, which is free.
Here is an example script that downloads the webpage http://www.amazon.com/dp/0691010188 and saves it as file1.html in “C:/”. Copy the following code and save it as example1.php in htdocs. To execute this file, start XAMPP, type localhost/example1.php in your browser, and press Enter.

<?php
// Downloads one web page and saves it to a local file.
function project()
{
    // $urltoget is the URL we want to download.
    $urltoget = "http://www.amazon.com/dp/0691010188";
    $uhandle = fopen($urltoget, 'r') or die("Could not open $urltoget");

    // $filetosave is the name of the file which will be saved.
    // You can choose the directory where you want to save it.
    $filetosave = "c:/file1.html";
    $prhandle = fopen($filetosave, 'w') or die("Could not open $filetosave");

    // The while loop below reads from $urltoget line by line
    // and writes each line to $filetosave.
    while (($gread = fgets($uhandle)) !== false) {
        fwrite($prhandle, $gread);
    }
    fclose($uhandle);
    fclose($prhandle);
}
project();
?>
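
When you want the whole page at once, PHP's built-in file_get_contents and file_put_contents can stand in for the read/write loop. The sketch below is one possible variant, not a replacement for the script above. The function name download_page and the 30-second timeout are assumptions chosen only for illustration.

```php
<?php
// Hypothetical helper: copies the contents of $source (a URL or a local path)
// into the file $destination. Returns the number of bytes written,
// or false if either the read or the write fails.
function download_page($source, $destination)
{
    // A stream context lets us set a timeout for HTTP requests.
    // The 30-second value is an arbitrary choice for this sketch.
    $context = stream_context_create(array('http' => array('timeout' => 30)));

    // The @ suppresses PHP's warning so that the failure can be handled here.
    $contents = @file_get_contents($source, false, $context);
    if ($contents === false) {
        return false;
    }
    return file_put_contents($destination, $contents);
}
```

For example, download_page("http://www.amazon.com/dp/0691010188", "c:/file1.html") would save the same page as example1.php, provided the site allows the request.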

 

4. Getting information from file1.html.
Here is an example script that gets the title of the book and prints it out. Copy the following code and save it as example2.php in htdocs. To execute this file, start XAMPP, type localhost/example2.php in your browser, and press Enter.

 

<?php
// Reads file1.html and prints the title of the book.
function project()
{
    $file = "c:/file1.html";
    $prhandle = fopen($file, 'r') or die("Could not open $file");
    while (($gread = fgets($prhandle)) !== false) {
        // By looking at the HTML file I find that the title of the book is
        // preceded by a unique identifier, <title>. Below I use this
        // information to identify the line which includes this identifier.
        if (preg_match('/<title>/', $gread)) {
            echo strip_tags($gread);
            // If you wish to split the title further, use preg_split.
        }
    }
    fclose($prhandle);
}
project();
?>
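
The line-by-line search assumes that the opening and closing title tags sit on one line. A minimal sketch of an alternative is shown below: it searches the whole page at once and then splits the title with preg_split. The function names extract_title and split_title are hypothetical, and splitting on colons is only an assumed convention; check the actual title text before relying on it.

```php
<?php
// Hypothetical helper: returns the text inside the first <title>...</title>
// pair in $html, or null if no title element is found.
function extract_title($html)
{
    // [^>]* allows attributes on the tag. The "i" flag ignores case and the
    // "s" flag lets "." match newlines, so a multi-line title still matches.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches)) {
        return trim(strip_tags($matches[1]));
    }
    return null;
}

// Hypothetical helper: splits a title on colons and drops the whitespace
// around each piece.
function split_title($title)
{
    return preg_split('/\s*:\s*/', trim($title), -1, PREG_SPLIT_NO_EMPTY);
}
```

To apply it to the saved page, you could call extract_title(file_get_contents("c:/file1.html")) and pass the result to split_title.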

 

5. If you want to save the title in a database, you can do the following. Because we will use MySQL, it is useful to have a GUI for MySQL. I use SQLyog (https://code.google.com/p/sqlyog/downloads/list) as the GUI. Use the community edition, which is free. The GUI is very straightforward.

Create a database “mydata” in MySQL. Then create a table “booktitle”. The table will have only one column, called “title”. I will set the field type to text. Once you have created the table, you can run the following script to insert the title into the table.
<?php
// Uses the mysqli extension (the older mysql_* functions are not available
// in PHP 7 and later).
$link = mysqli_connect("localhost", "root", "", "mydata") or die("Could not connect to MySQL");
function project($link)
{
    $file = "c:/file1.html";
    $prhandle = fopen($file, 'r') or die("Could not open $file");
    while (($gread = fgets($prhandle)) !== false) {
        if (preg_match('/<title>/', $gread)) {
            $title = trim(strip_tags($gread));
            echo $title;
            // Escape the title so that quotes in it do not break the query.
            $safetitle = mysqli_real_escape_string($link, $title);
            mysqli_query($link, "INSERT INTO booktitle VALUES ('$safetitle')");
        }
    }
    fclose($prhandle);
}
project($link);
?>
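
An alternative to building the SQL string by hand is a prepared statement, which passes the title to the database separately from the SQL text. The sketch below uses PDO, a different PHP database extension from the one in the script above, so treat it as an illustration under that assumption. The function name save_title is hypothetical.

```php
<?php
// Hypothetical helper: inserts one title into the booktitle table through
// a prepared statement and returns true on success.
function save_title(PDO $db, $title)
{
    // The ? placeholder keeps the title out of the SQL text, so quotes in
    // the title are stored as data instead of being parsed as SQL.
    $statement = $db->prepare("INSERT INTO booktitle (title) VALUES (?)");
    return $statement->execute(array($title));
}
```

With the database from step 5, the connection could be opened as new PDO("mysql:host=localhost;dbname=mydata", "root", "") and passed to save_title together with the title.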

 

Additional Resources.
If you want to grab some information from the database and use it in your PHP file, you can do the following. Suppose we have a database called movies, which has a table called movielist with columns called imdbid and name.

// $link is a connection opened with mysqli_connect("localhost", "root", "").
mysqli_select_db($link, 'movies');
$query = mysqli_query($link, "select distinct imdbid, name from movielist order by imdbid");
while ($cr = mysqli_fetch_assoc($query)) {
    // In the while loop we are reading the data from the table row by row.
    $moviename = trim($cr['name']);
    $movieid = trim($cr['imdbid']);
    // You can write the code which uses $moviename or $movieid within the while loop.
}
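
One common use of such a loop is to turn each row into a page to download. The sketch below shows that idea under stated assumptions: the function name build_download_list is hypothetical, and the URL prefix and target directory are placeholders that the caller supplies, not addresses taken from this page.

```php
<?php
// Hypothetical helper: turns rows fetched from movielist into a list of
// (url, file) pairs, one per movie.
// $urlprefix and $directory are placeholders supplied by the caller.
function build_download_list($rows, $urlprefix, $directory)
{
    $list = array();
    foreach ($rows as $row) {
        $movieid = trim($row['imdbid']);
        if ($movieid === '') {
            continue; // skip rows without an id
        }
        $list[] = array(
            'url'  => $urlprefix . rawurlencode($movieid),
            'file' => rtrim($directory, '/') . '/' . $movieid . '.html',
        );
    }
    return $list;
}
```

Each entry in the returned list can then be handed to a download routine such as the one in example1.php, with 'url' as the page to fetch and 'file' as the name to save it under.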