Data Extraction from onionscan

Requirement

After onionscan crawls dark web data, it stores the results in tiedot, a NoSQL document database. We need to extract the dark web URLs (.onion links) from tiedot.

tiedot

Overview

tiedot is a document-oriented database engine that uses JSON as its document representation. It has a powerful query processor that supports advanced set operations, and it can either be embedded into a program or run as a standalone service exposing an HTTP API.

GitHub: https://github.com/HouzuoGuo/tiedot

Starting tiedot

# GOPATH is /root/gopath
go get github.com/HouzuoGuo/tiedot
# change into the directory containing the built tiedot binary; skip this step if $GOPATH/bin is already on your PATH
cd /root/gopath/bin
# start tiedot's httpd service
tiedot -mode=httpd -dir=/opt/onionscandb -port 9090
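Once the service is listening, a quick sanity check is to request the /all endpoint (covered in the query syntax section below). A minimal Python sketch, assuming the service runs on 127.0.0.1:9090 as started above:

import urllib.request

# /all lists the collections; a successful response means the httpd service is up
resp = urllib.request.urlopen("http://127.0.0.1:9090/all")
print(resp.status, resp.read().decode())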

tiedot file structure

Inside the /opt/onionscandb directory you can see the data directory structure saved by onionscan; crawls and relationships are tiedot's two collections.

Within the relationships directory, From, Identifier, Onion, and Type are four indexes; only indexed fields can be queried.
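As a quick way to see those collections and indexes without browsing the directory tree, the /all and /indexes endpoints (shown below) can be combined. A short sketch, assuming the httpd service from the previous step is on 127.0.0.1:9090 and that both endpoints return JSON arrays:

import json
import urllib.request

BASE = "http://127.0.0.1:9090"

# list every collection, then the index paths defined on each of them
collections = json.loads(urllib.request.urlopen(BASE + "/all").read())
for col in collections:
    indexes = json.loads(urllib.request.urlopen(BASE + "/indexes?col=" + col).read())
    print(col, indexes)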


Query syntax

1. List all collections

curl '127.0.0.1:9090/all'

2. List all indexes in a collection

curl '127.0.0.1:9090/indexes?col=relationships'

3. Query documents where the Type index equals uri OR the From index equals links

# un-encoded form: 127.0.0.1:9090/query?col=relationships&q=[{"eq": "uri", "in": ["Type"]},{"eq":"links","in":["From"]}]
curl -g '127.0.0.1:9090/query?col=relationships&q=[{"eq":%20"uri",%20"in":%20["Type"]},{"eq":"links","in":["From"]}]'

4. Query documents where the Type index equals uri AND the From index equals links

# un-encoded form: 127.0.0.1:9090/query?col=relationships&q={"n":[{"eq": "uri", "in": ["Type"]},{"eq":"links","in":["From"]}]}
curl -g '127.0.0.1:9090/query?col=relationships&q={%22n%22:[{%22eq%22:%20%22uri%22,%20%22in%22:%20[%22Type%22]},{%22eq%22:%22links%22,%22in%22:[%22From%22]}]}'
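Rather than percent-encoding the query JSON by hand, the same AND query can be built in Python and its raw response saved for the extraction step below. A sketch, assuming the service is on 127.0.0.1:9090 and writing to result.txt (the file name used later):

import json
import urllib.parse
import urllib.request

# AND query: Type == "uri" and From == "links"
query = {"n": [{"eq": "uri", "in": ["Type"]},
               {"eq": "links", "in": ["From"]}]}

# urlencode takes care of the %22 / %20 escaping shown in the curl example above
params = urllib.parse.urlencode({"col": "relationships", "q": json.dumps(query)})

with urllib.request.urlopen("http://127.0.0.1:9090/query?" + params) as resp:
    body = resp.read().decode()

# save the raw response so it can be post-processed in the next section
with open("./result.txt", "w") as fw:
    fw.write(body)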

For the rest of the query syntax, see the official documentation:

  1. tiedot in 10 minutes
  2. API reference and embedded usage
  3. Query processor and index

Data extraction

A tiedot query returns a data structure like the one below; the Identifier field is what we want to extract, i.e. the dark web URL.

{
    "1015433753384374302": {
        "FirstSeen": "2019-06-18T02:47:34.116827431Z",
        "From": "links",
        "Identifier": "booksubt62eeiyrb.onion",
        "LastSeen": "2019-06-18T06:56:14.955689298Z",
        "Onion": "zqktlwi4i34kbat3.onion",
        "Type": "uri"
    },
    "1016550069890011655": {
        "FirstSeen": "2019-06-18T02:47:36.792353387Z",
        "From": "links",
        "Identifier": "xqz3u5drneuzhaeo.onion",
        "LastSeen": "2019-06-18T06:56:16.248697618Z",
        "Onion": "zqktlwi4i34kbat3.onion",
        "Type": "uri"
    },
    "1032780360597881096": {
        "FirstSeen": "2019-06-18T02:47:35.40095027Z",
        "From": "links",
        "Identifier": "answerstedhctbek.onion",
        "LastSeen": "2019-06-18T06:56:16.847924196Z",
        "Onion": "zqktlwi4i34kbat3.onion",
        "Type": "uri"
    }
}

This extraction can be implemented with a short piece of Python code:

import re


def read():
    # read the saved query result from disk
    with open("./result.txt", "r") as fr:
        return fr.read()


if __name__ == '__main__':
    res = read()
    # capture the value of every "Identifier" field, e.g. booksubt62eeiyrb.onion
    pattern = re.compile(r'"Identifier":\s*"([A-Za-z0-9]+\.onion)"')
    result = pattern.findall(res)
    for url in result:
        print(url)
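
Since the query response is itself valid JSON, an alternative sketch (using the same result.txt) parses it directly instead of relying on a regular expression:

import json

with open("./result.txt", "r") as fr:
    docs = json.load(fr)

# each value is a document like the ones shown above; keep the Identifier field
for doc in docs.values():
    if doc.get("Type") == "uri":
        print(doc["Identifier"])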