
Is Java a good direction for web crawlers?

Java is a solid choice for building web crawlers. The walkthrough below crawls mit.edu as an example.

A typical crawler works in these steps:

1. Parse the root page ("mit.edu") and extract all links from it. To fetch each URL and parse its HTML, we use JSoup, a convenient and simple Java library similar to Python's BeautifulSoup.

2. Take the URLs retrieved in step 1 and parse those pages in turn.

3. While doing the above, keep track of the pages already processed so that each page is handled only once. This is why we need a database.
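The three steps above can be sketched in plain Java before any database or HTTP code is involved. In this sketch a hypothetical in-memory link graph stands in for real page fetches, and a HashSet plays the deduplication role the database plays later:

```java
import java.util.*;

public class CrawlSketch {
    // Hypothetical link graph standing in for real HTTP fetches (an assumption
    // for illustration; the real crawler extracts links with JSoup).
    static Map<String, List<String>> graph = Map.of(
        "http://www.mit.edu",
            List.of("http://www.mit.edu/research", "http://www.mit.edu/admissions"),
        "http://www.mit.edu/research",
            List.of("http://www.mit.edu"),                 // cycle back to the root
        "http://www.mit.edu/admissions",
            List.of("http://www.mit.edu/research")
    );

    public static List<String> crawl(String root) {
        Set<String> visited = new LinkedHashSet<>();       // step 3: remember processed pages
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;               // already processed once, skip
            // steps 1-2: "parse" the page and follow its links
            for (String link : graph.getOrDefault(url, List.of())) {
                if (!visited.contains(link)) queue.add(link);
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // each page is visited exactly once despite the cycle
        System.out.println(crawl("http://www.mit.edu").size());
    }
}
```

Without the visited set, the cycle between the root and /research would loop forever; the database in the real crawler serves exactly this purpose, but persists across runs.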

開(kāi)始用Java爬蟲(chóng)

1. Download the JSoup core library from http://jsoup.org/download and the MySQL Connector/J JAR from http://dev.mysql.com/downloads/connector/j/.

2現(xiàn)在在eclipse中創(chuàng)建一個(gè)名為Crawler的工程,并把JSoup和mysqljar包放到j(luò)avabuild目錄下

3創(chuàng)建一個(gè)名為DB的類,它被用來(lái)進(jìn)行數(shù)據(jù)庫(kù)的相關(guān)操作

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DB {

    public Connection conn = null;

    public DB() {
        try {
            Class.forName("com.mysql.jdbc.Driver");
            String url = "jdbc:mysql://localhost:3306/Crawler";
            conn = DriverManager.getConnection(url, "root", "admin213");
            System.out.println("conn built");
        } catch (SQLException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    public ResultSet runSql(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.executeQuery(sql);
    }

    public boolean runSql2(String sql) throws SQLException {
        Statement sta = conn.createStatement();
        return sta.execute(sql);
    }

    @Override
    protected void finalize() throws Throwable {
        // must be &&, not ||: with || this dereferences conn even when it is null
        if (conn != null && !conn.isClosed()) {
            conn.close();
        }
    }
}
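The DB constructor above connects to a MySQL database named Crawler, and the crawler reads and writes a Record table whose schema is never shown. A minimal schema that satisfies the TRUNCATE, SELECT, and INSERT statements used in this walkthrough (the RecordID column and the VARCHAR size are assumptions, not from the original article):

```sql
CREATE DATABASE IF NOT EXISTS Crawler;
USE Crawler;

CREATE TABLE IF NOT EXISTS Record (
    RecordID INT NOT NULL AUTO_INCREMENT,  -- assumed surrogate key
    URL VARCHAR(2083) NOT NULL,            -- 2083 is a common practical URL length cap
    PRIMARY KEY (RecordID)
);
```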

4創(chuàng)建一個(gè)名為Main的類,這將是我們的爬蟲(chóng)類

import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static DB db = new DB();

    public static void main(String[] args) throws SQLException, IOException {
        db.runSql2("TRUNCATE Record;");
        processPage("http://www.mit.edu");
    }

    public static void processPage(String URL) throws SQLException, IOException {
        // check if the given URL is already in the database
        // (a PreparedStatement avoids SQL injection through the URL string)
        PreparedStatement check = db.conn.prepareStatement("SELECT * FROM Record WHERE URL = ?");
        check.setString(1, URL);
        ResultSet rs = check.executeQuery();
        if (!rs.next()) {
            // store the URL in the database to avoid parsing it again
            String sql = "INSERT INTO `Crawler`.`Record` (`URL`) VALUES (?);";
            PreparedStatement stmt = db.conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
            stmt.setString(1, URL);
            stmt.execute();

            // get useful information: fetch the page that was passed in
            // (not a hard-coded root URL, which would refetch the same page forever)
            Document doc = Jsoup.connect(URL).get();
            if (doc.text().contains("research")) {
                System.out.println(URL);
            }

            // get all links and recursively call the processPage method
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                if (link.attr("href").contains("mit.edu"))
                    processPage(link.attr("abs:href"));
            }
        }
    }
}
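One caveat about the crawl above: URLs such as http://www.mit.edu/research/ and http://www.mit.edu/research/#top reach the same page but would be stored as distinct rows, so the page is fetched more than once. A small stdlib-only helper can canonicalize URLs before they are checked against the database (a sketch; the normalize name and its exact rules are assumptions, not part of the original code):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlNormalizer {
    // Strip the fragment and any trailing slash so equivalent URLs
    // map to a single database row.
    public static String normalize(String url) throws URISyntaxException {
        URI u = new URI(url);
        String path = (u.getPath() == null) ? "" : u.getPath();
        if (path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        // rebuild without the fragment; pass null for an empty path
        return new URI(u.getScheme(), u.getHost(),
                       path.isEmpty() ? null : path, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(normalize("http://www.mit.edu/research/#top"));
    }
}
```

Calling normalize on each URL before the SELECT/INSERT in processPage keeps the Record table free of duplicates. A deep site can also overflow the stack through the recursive processPage calls; an explicit queue, as in the earlier step sketch, avoids that.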