Question
I asked a similar question before, but I need something different from what I got there.
I have some HTML, shown below (only the part I need).
I already have the rcpNo value, but eleId varies from 1 to 33, and offset and length follow no regular pattern. All three are numeric, with a varying number of digits.
I need to read rcpNo, eleId, offset, length and dtd.
(dtd is 'dart3.xsd' here, but I have only tried this one page, so other pages may have a different dtd value. That is why I want to read it from the HTML.)
# This is the relevant part of the HTML
# viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)
treeNode1.appendChild(treeNode2);
treeNode2 = new Tree.TreeNode({
text: "4. The number of stocks",
id: "7",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
}
});
cnt++;
Similar blocks are repeated, so here is another part of the HTML:
treeNode2 = new Tree.TreeNode({
text: "1. Summary information",
id: "12",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
}
});
cnt++;
treeNode1.appendChild(treeNode2);
treeNode2 = new Tree.TreeNode({
text: "2. Linked finance state",
id: "13",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
}
});
cnt++;
treeNode1.appendChild(treeNode2);
treeNode2 = new Tree.TreeNode({
text: "3. Comment for linked finance state",
id: "14",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '14', '284697', '372938', 'dart3.xsd');}
}
});
cnt++;
As you can see above, text and id change regularly. I want to read all of the dcmNo, eleId, offset, length and dtd values, especially for a specific id and text.
I tried the following:
>>> string = "{viewDoc('20180515000480', '6177478', '6', '58846', '899', 'dart3.xsd');}"
>>> pattern = re.compile(r'viewDoc\(\'\d+\', \'(\d+)\', \'(\d+)\', \'(\d+)\', \'(\d+)\', \'(\d+)\' .+\)', re.MULTILINE | re.DOTALL)
and with BeautifulSoup:
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all(string = pattern)
but this returns the whole HTML, so I cannot isolate the data. It does not work: it finds the first text in the HTML, which is not what I need to read.
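For reference, a minimal sketch of how the viewDoc arguments could be pulled out with plain `re` on the page text, rather than through `soup.find_all`. It assumes every call has the six-argument, single-quoted form shown above. The name `VIEWDOC_RE` and the short `html` sample are made up for illustration. One visible difference from the pattern above: the last group here accepts any non-quote characters, since dtd is 'dart3.xsd' and not a run of digits.

```python
import re

# Illustrative sample built from the lines shown in the question;
# in practice this would be the full page text.
html = """
click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
"""

# Five numeric arguments followed by one free-form argument (dtd).
VIEWDOC_RE = re.compile(
    r"viewDoc\(\s*'(\d+)',\s*'(\d+)',\s*'(\d+)',\s*'(\d+)',\s*'(\d+)',\s*'([^']+)'\s*\)"
)

# findall returns one tuple of captured groups per viewDoc call.
calls = VIEWDOC_RE.findall(html)

for rcp_no, dcm_no, ele_id, offset, length, dtd in calls:
    print(dcm_no, ele_id, offset, length, dtd)
```

Because `findall` returns only the captured groups, each result is a tuple of the six arguments and not the surrounding text.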
Edit
This is how I get the HTML from the URL:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
# API_KEY and company_code are defined elsewhere.
url = "http://dart.fss.or.kr/api/search.json?auth="+API_KEY \
+"&crp_cd="+company_code + "&page_set=100" \
+"&start_dt=19990101&bsn_tp=A001&bsn_tp=A002&bsn_tp=A003"
json_data = requests.get(url).json()
# Renamed from `list` to avoid shadowing the built-in.
items = json_data['list']
data = pd.DataFrame.from_dict(items)
print(data['rcp_no'][0])
# Fetch the document page for the first receipt number.
url2 = "http://dart.fss.or.kr/dsaf001/main.do?rcpNo="+data['rcp_no'][0]
temp = requests.get(url2)
html = temp.text
soup = BeautifulSoup(html, "html.parser")
The HTML example above is part of the output of print(soup). As I said, the same format is repeated many times in the HTML, and I want to read specific lines. For example, if I can find the lines below, I want to extract their data:
# viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)
viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd')
viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd')
as lists such as ['6177478', '7', '59749', '7130', 'dart3.xsd'] and ['6177478', '13', '207823', '76870', 'dart3.xsd'], i.e. the number and text data (dcmNo, eleId, offset, length and dtd).
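To make the goal concrete, here is a rough sketch of how each node's text and id might be paired with its viewDoc arguments, so that a specific entry can be looked up by id. It assumes every `Tree.TreeNode` block lists `text`, then `id`, then a `viewDoc(...)` call, as in the excerpts above. `NODE_RE` and `parse_nodes` are hypothetical names, and the `html` string is a shortened sample, not the real page.

```python
import re

# Shortened sample in the same layout as the excerpts in the question.
html = '''
treeNode2 = new Tree.TreeNode({
text: "4. The number of stocks",
id: "7",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
}
});
cnt++;
treeNode1.appendChild(treeNode2);
treeNode2 = new Tree.TreeNode({
text: "2. Linked finance state",
id: "13",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
}
});
cnt++;
'''

# Match text, id, then the nearest following viewDoc(...) argument list.
NODE_RE = re.compile(
    r'text:\s*"(?P<text>[^"]*)",\s*'
    r'id:\s*"(?P<id>\d+)",'
    r'.*?viewDoc\(\s*(?P<args>[^)]*)\)',
    re.DOTALL,
)

def parse_nodes(page):
    """Return {id: {"text": ..., "args": [dcmNo, eleId, offset, length, dtd]}}."""
    nodes = {}
    for m in NODE_RE.finditer(page):
        # Split the argument list into its quoted values.
        args = re.findall(r"'([^']*)'", m.group("args"))
        # Drop the first value (rcpNo), which is already known.
        nodes[m.group("id")] = {"text": m.group("text"), "args": args[1:]}
    return nodes

nodes = parse_nodes(html)
print(nodes["7"]["args"])
print(nodes["13"]["text"], nodes["13"]["args"])
```

With the real page, `parse_nodes(temp.text)` would presumably be the call. Note that the lazy `.*?` takes the nearest following viewDoc call, so a node without one would be paired with the next node's arguments.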
Source: https://stackoverflow.com/questions/51493468/read-html-with-beautifulsoup-and-find-typical-data