Spider class
package Spider;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public class Spider {
    String urlPath = null;   // URL of the page to crawl
    String htmlPath = null;  // local file path where the page is saved
    String msg = null;

    public void spider() throws Exception {
        URL url = new URL(urlPath);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        // Send a browser-like User-Agent so the server treats us as a normal client
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36");
        // Read and write with GBK, matching the target page's encoding
        BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "GBK"));
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(htmlPath), "GBK"));
        while (null != (msg = br.readLine())) {
            bw.write(msg);
            bw.newLine(); // readLine() strips the line terminator, so write it back
            bw.flush();
        }
        br.close();
        bw.close();
        System.out.println("Download complete");
    }
}
Main class
package Spider;

public class Main {
    public static void main(String[] args) throws Exception {
        String urlPath = "https://www.cnblogs.com/zhangyinhua/p/8037599.html";
        String htmlPath = "C:\\Users\\xianyu\\Desktop\\Jsoup(一)Jsoup详解.html";
        Spider spider = new Spider();
        spider.urlPath = urlPath;
        spider.htmlPath = htmlPath;
        spider.spider();
    }
}
In the Main class, urlPath is the URL of the site to crawl,
and htmlPath is the local path where the page is saved.
Problems that came up while writing this:
The output contained mojibake. The spider downloaded the page source correctly, and the file opened fine as plain text, but the page displayed incorrectly in a browser.
Looking at the output, every place that should show Chinese showed ? instead, which pointed to the streams in the Spider class.
The cause was that the charset used for reading did not match the page's actual encoding.
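The mismatch can be reproduced in isolation. A minimal sketch (the class name and sample string are illustrative, not from the original post):

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) throws Exception {
        String text = "中文测试"; // sample Chinese text
        // Bytes as a GBK-encoded page would serve them
        byte[] gbkBytes = text.getBytes("GBK");

        // Decoding those bytes as UTF-8 mangles them: the GBK byte sequences
        // are invalid UTF-8, so they turn into U+FFFD replacement characters
        String wrong = new String(gbkBytes, StandardCharsets.UTF_8);
        // Decoding with the matching charset restores the original text
        String right = new String(gbkBytes, "GBK");

        System.out.println(wrong.equals(text)); // prints "false"
        System.out.println(right.equals(text)); // prints "true"
        // If the mangled string is later written through an OutputStreamWriter,
        // characters the target charset cannot represent come out as '?'
    }
}
```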
The offending line in Spider:

BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
Changing utf-8 to GBK, the target site's encoding, solved the problem.
Before crawling a page, check the site's encoding first; reading with the same charset avoids mojibake.
For details, see this post: http://www.cnblogs.com/agileblog/p/3615250.html
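Instead of checking by hand, the charset can often be read from the Content-Type response header before choosing how to decode. A sketch (the class name and the UTF-8 fallback are my own choices; note that many older pages declare the charset only in an HTML meta tag, in which case the header alone is not enough):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetSniffer {
    // Parse the charset out of a Content-Type header value such as
    // "text/html; charset=GBK"; fall back to UTF-8 when it is missing.
    public static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.toLowerCase().startsWith("charset=")) {
                    try {
                        return Charset.forName(part.substring("charset=".length()));
                    } catch (Exception ignored) {
                        // unrecognized charset name; use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=GBK")); // prints "GBK"
        System.out.println(fromContentType("text/html"));              // prints "UTF-8"
    }
}
```

In the Spider class this would look like calling fromContentType(conn.getContentType()) and passing the result to the InputStreamReader, instead of hard-coding GBK.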
Some other helpful posts:
Writing content to a file with BufferedWriter: https://www.cnblogs.com/sunada2005/p/4824566.html
File classes (5): Reader and Writer, OutputStreamWriter, BufferedWriter, and the difference between byte streams and character streams:
https://blog.csdn.net/u013225534/article/details/45727863