Data Extraction from onionscan

Requirement

After onionscan crawls dark web data, it stores the results in tiedot, a NoSQL document database. We need to extract the dark web URLs (.onion links) from tiedot.

tiedot

Overview

tiedot is a document-oriented database engine that uses JSON as its document representation. It has a powerful query processor that supports advanced set operations, and it can either be embedded into a program or run as a standalone service exposing an HTTP API.

GitHub: https://github.com/HouzuoGuo/tiedot

Starting tiedot

# GOPATH is /root/gopath
go get github.com/HouzuoGuo/tiedot
# change into the directory containing the built tiedot binary; skip this step if $GOPATH/bin is already on your PATH
cd /root/gopath/bin
# start tiedot's httpd service
tiedot -mode=httpd -dir=/opt/onionscandb -port 9090
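Once the service is listening, a quick sanity check is to request the /all endpoint (covered in the query syntax section below). A minimal Python sketch, assuming the service runs on 127.0.0.1:9090 as started above:

import urllib.request

# /all lists the collections; a successful response means the httpd service is up
resp = urllib.request.urlopen("http://127.0.0.1:9090/all")
print(resp.status, resp.read().decode())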

tiedot file structure

Inside the /opt/onionscandb directory you can see the data directory structure saved by onionscan; crawls and relationships are tiedot's two collections.

Within the relationships directory, From, Identifier, Onion, and Type are four indexes; only indexed fields can be queried.
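As a quick way to see those collections and indexes without browsing the directory tree, the /all and /indexes endpoints (shown below) can be combined. A short sketch, assuming the httpd service from the previous step is on 127.0.0.1:9090 and that both endpoints return JSON arrays:

import json
import urllib.request

BASE = "http://127.0.0.1:9090"

# list every collection, then the index paths defined on each of them
collections = json.loads(urllib.request.urlopen(BASE + "/all").read())
for col in collections:
    indexes = json.loads(urllib.request.urlopen(BASE + "/indexes?col=" + col).read())
    print(col, indexes)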


Query syntax

1. List all collections

curl '127.0.0.1:9090/all'

2. List all indexes in a collection

curl '127.0.0.1:9090/indexes?col=relationships'

3. Query documents where the Type index equals uri OR the From index equals links

# un-encoded form: 127.0.0.1:9090/query?col=relationships&q=[{"eq": "uri", "in": ["Type"]},{"eq":"links","in":["From"]}]
curl -g '127.0.0.1:9090/query?col=relationships&q=[{"eq":%20"uri",%20"in":%20["Type"]},{"eq":"links","in":["From"]}]'

4. Query documents where the Type index equals uri AND the From index equals links

# un-encoded form: 127.0.0.1:9090/query?col=relationships&q={"n":[{"eq": "uri", "in": ["Type"]},{"eq":"links","in":["From"]}]}
curl -g '127.0.0.1:9090/query?col=relationships&q={%22n%22:[{%22eq%22:%20%22uri%22,%20%22in%22:%20[%22Type%22]},{%22eq%22:%22links%22,%22in%22:[%22From%22]}]}'
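Rather than percent-encoding the query JSON by hand, the same AND query can be built in Python and its raw response saved for the extraction step below. A sketch, assuming the service is on 127.0.0.1:9090 and writing to result.txt (the file name used later):

import json
import urllib.parse
import urllib.request

# AND query: Type == "uri" and From == "links"
query = {"n": [{"eq": "uri", "in": ["Type"]},
               {"eq": "links", "in": ["From"]}]}

# urlencode takes care of the %22 / %20 escaping shown in the curl example above
params = urllib.parse.urlencode({"col": "relationships", "q": json.dumps(query)})

with urllib.request.urlopen("http://127.0.0.1:9090/query?" + params) as resp:
    body = resp.read().decode()

# save the raw response so it can be post-processed in the next section
with open("./result.txt", "w") as fw:
    fw.write(body)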

For the rest of the query syntax, see the official documentation:

  1. tiedot in 10 minutes
  2. API reference and embedded usage
  3. Query processor and index

Data extraction

A tiedot query returns a data structure like the one below; the Identifier field is what we want to extract, i.e. the dark web URL.

{
    "1015433753384374302": {
        "FirstSeen": "2019-06-18T02:47:34.116827431Z",
        "From": "links",
        "Identifier": "booksubt62eeiyrb.onion",
        "LastSeen": "2019-06-18T06:56:14.955689298Z",
        "Onion": "zqktlwi4i34kbat3.onion",
        "Type": "uri"
    },
    "1016550069890011655": {
        "FirstSeen": "2019-06-18T02:47:36.792353387Z",
        "From": "links",
        "Identifier": "xqz3u5drneuzhaeo.onion",
        "LastSeen": "2019-06-18T06:56:16.248697618Z",
        "Onion": "zqktlwi4i34kbat3.onion",
        "Type": "uri"
    },
    "1032780360597881096": {
        "FirstSeen": "2019-06-18T02:47:35.40095027Z",
        "From": "links",
        "Identifier": "answerstedhctbek.onion",
        "LastSeen": "2019-06-18T06:56:16.847924196Z",
        "Onion": "zqktlwi4i34kbat3.onion",
        "Type": "uri"
    }
}

This extraction can be implemented with a short piece of Python code:

import re


def read():
    # read the saved query result from disk
    with open("./result.txt", "r") as fr:
        return fr.read()


if __name__ == '__main__':
    res = read()
    # capture the value of every "Identifier" field, e.g. booksubt62eeiyrb.onion
    pattern = re.compile(r'"Identifier":\s*"([A-Za-z0-9]+\.onion)"')
    result = pattern.findall(res)
    for url in result:
        print(url)
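
Since the query response is itself valid JSON, an alternative sketch (using the same result.txt) parses it directly instead of relying on a regular expression:

import json

with open("./result.txt", "r") as fr:
    docs = json.load(fr)

# each value is a document like the ones shown above; keep the Identifier field
for doc in docs.values():
    if doc.get("Type") == "uri":
        print(doc["Identifier"])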