Digitization of the Australian Parliamentary Debates, 1998–2022

Scientific Data volume 10, Article number: 567 (2023)

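Public knowledge of what is said in parliament is a tenet of democracy and a critical resource for political science research. In Australia, following the British tradition, the written record of what is said in parliament is known as Hansard. While the Australian Hansard has always been publicly available, it has been difficult to use for large-scale macro- and micro-level text analysis because it has only been available as PDFs or XMLs. Following the lead of the Linked Parliamentary Data project, which achieved this for Canada, we provide a new, comprehensive, high-quality, rectangular database that captures proceedings of the Australian parliamentary debates from 1998 to 2022. The database is publicly available and can be linked to other datasets, such as election results. The creation and accessibility of this database enables the exploration of new questions and serves as a valuable resource for both researchers and policymakers.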

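The official written record of parliamentary debates, formally known as Hansard [1], plays a fundamental role in capturing the history of political proceedings and facilitating the exploration of valuable research questions. Originating in the British parliament, the production of Hansard became a tradition in many other Commonwealth countries, such as Canada and Australia [2]. Given the content and magnitude of these records, they hold great significance, particularly in the context of political science research. In the case of Canada, Hansard has been digitized for 1901 to 2019 [3]. Having a digitized version of Hansard enables researchers to conduct text analysis and statistical modeling. Following the lead of that project, in this paper we introduce a similar database for Australia. It consists of individual datasets for each sitting day of the House of Representatives from March 1998 to September 2022, containing everything said in parliament in a form that researchers can readily use. With the development of tools for large-scale text analysis, this database will serve as a resource for understanding political behavior in Australia over time.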

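This database has a variety of potential applications. For instance, within Australia there is considerable concern that the 'quality' of public policy debate has declined (however that might be defined). Our dataset can be used to examine whether things have in fact deteriorated in particular ways and, if so, why. We may also be interested in whether certain sub-populations are adequately represented in what is discussed in parliament. For instance, there is often concern that regional areas are overlooked compared with metropolitan areas. Again, our database can be used to examine whether this has changed over time. We developed the database so that it can be linked with similar databases for other countries, which enables comparative analysis. For instance, we may be interested in how the policy focus of parliaments changes in response to global events such as pandemics or wars. International linkage provides comparative cases in which domestic circumstances differ but international circumstances are shared. As an example of enabling this linkage, we include PartyFacts IDs (https://partyfacts.herokuapp.com) in the database. This allows our database to be linked with other large-scale parliamentary speech collection projects such as ParlaMint [4], ParlSpeech [5], ParlEE [6] and MAPLE [7].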

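The Australian House of Representatives, often referred to as 'the House', performs a number of crucial governmental functions, such as creating new laws and overseeing government expenditure [8, ch. 1]. The politicians in the House are called Members of Parliament (MPs). The House operates under a parallel chamber setup, meaning that proceedings take place in two debate venues: the Chamber and the Federation Chamber. Sittings of the House follow a predefined order of business, regulated by procedural rules called standing orders [8, ch. 8]. A typical sitting day in the Chamber includes debates on government business, 90-second members' statements, and Question Time [8, ch. 8]. The Federation Chamber was created in 1994 as a subordinate debate venue of the Chamber. It allows for better time management of House business because its proceedings occur simultaneously with those of the Chamber [8, ch. 21]. Sittings in the Federation Chamber differ from those in the Chamber in their order of business and scope of discussion. Business matters discussed in the Federation Chamber are largely limited to intermediate stages of bill development and private members' business [8, ch. 21]. It is the recording and compilation of these proceedings on which Hansard is based, and it is essentially, but not entirely, verbatim.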

The parent node of each XML file serves as a container for the entire document. This parent node may have up to four child nodes, where the first child node contains details on the specific sitting day. Of the remaining child nodes, one contains all proceedings of the Chamber (chamber.xscript), one contains all proceedings of the Federation Chamber, and one contains Question Time proceedings. The Federation Chamber does not meet on every sitting day, so this child element is not present in every XML file. The use of separate child nodes allows proceedings of the Chamber and the Federation Chamber to be distinguished. The structure of the Chamber and Federation Chamber nodes is generally the same: the proceeding begins with a business start node, which is followed by a series of debates. Debate nodes can contain a sub-debate child node, which may in turn have a second-level sub-debate node nested within it. That said, sometimes the second-level sub-debate node is not nested within a first-level one. Each of these three elements (i.e., the debate node and the two levels of sub-debate node), as well as their respective sub-elements, contains important information on the topic of discussion, who is speaking, and what is being said. The speech node within each one contains the bulk of the text associated with that debate or sub-debate. A typical speech node begins with a talk.start sub-node, providing information on the MP whose turn it is to speak and the time of their first statement. Unsurprisingly, speeches rarely go uninterrupted in parliamentary debate settings; they are often composed of a series of interjections and continuations. These statements are categorized under different sub-nodes depending on their nature, such as interjections or continuations. The final key component of Hansard is Question Time, in which questions and answers are classified as unique elements. More detail on the purpose and processing of Question Time will follow.

Figure 1 illustrates this structure. The parent element (highlighted in blue) is followed by the session header child element (highlighted in yellow), whose sub-child elements, such as the date and parliament number, are all highlighted in pink. Next comes the child element containing everything that takes place in the Chamber, chamber.xscript, which is also highlighted in yellow in Fig. 1. As previously mentioned, the first sub-node of chamber.xscript is the business start. Its structure can be seen between the nodes highlighted in green in Fig. 1, where the content we parse from the business start is highlighted in orange.

A further key task stemmed from the fact that the raw text data were not separated by statement when parsed. In other words, any interjections, comments made by the Speaker or Deputy Speaker, and continuations within an individual speech were all parsed together as a single string. As such, the name, name ID, electorate and party details were only provided for the person whose turn it was to speak. There were many intricacies in splitting these speeches in a way that would be generalizable across sitting days. Details on these are provided later.

The files also varied in which elements were present: some sitting days lacked particular content, and some days did not have a Federation Chamber proceeding. To improve the generalizability of the scripts, if-else statements were embedded within the code wherever an error might arise due to a missing element. For example, the entire Federation Chamber block of code is wrapped in an if-else statement in each script, so that it only executes if what the code attempts to parse exists in the file.

The Federation Chamber node has a different element name in all XML files prior to 14 August 2012. Because we developed our first script based on Hansard from recent years, all of its XPath expressions for parsing Federation Chamber proceedings contain the more recent element name. To avoid causing issues in our first script, which successfully parses about 10 years of Hansard, we created a second script in which we replaced all occurrences of the recent element name with the earlier one. After making this modification and accounting for other small changes such as timestamp formatting, this second script successfully parses all Hansard sitting days from 10 May 2011 to 28 June 2012 (inclusive).

In these more recent files, the child nodes of a speech node typically begin with talk.start and talk.text. The first child node contains data on the person whose turn it is to speak, and the second contains the entire contents of that speech, including all interjections, comments, and continuations. After the talk.text element closes, there is typically a series of other child nodes that provide a skeleton structure for how the speech proceedings went in chronological order. For example, if the speech began, was interrupted by an MP, and then continued uninterrupted until the end, there would be one interjection node and one continuation node following the talk.text node. These would contain details on the MP who made each statement, such as their party and electorate.

Transcripts preceding 10 May 2011 do not have the talk.text node. Rather than this single child node that contains all speech content, statements are categorized in individual child nodes. This means that, unlike our code for parsing more current Hansards, we cannot specify a single XPath expression such as “chamber.xscript//debate//speech/talk.text” to extract all speeches, in their entirety, at once. This difference in nesting structure made many components of our second script unusable for processing transcripts preceding 10 May 2011, and required us to change our data processing approach considerably.

Without a talk.text node, we found that the most straightforward way to preserve the ordering of statements and to parse all speech contents at once was to parse from the debate element directly. We did not parse from its speech child nodes because every speech has a unique structure of child nodes, which makes it difficult to write data-cleaning code that is generalizable across all speeches and sitting days. The challenge with parsing through the debate element is that every piece of data stored in that element is parsed as a single string, including all talk.start data and all nested sub-debate data. For example, the data shown in Fig. 2 would be parsed as a single string preceding the speech content.
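
To make the parsing approach for the more recent files concrete, the following is a minimal Python sketch of a guarded XPath extraction. It is an illustration rather than the authors' scripts: it uses Python's standard xml.etree.ElementTree module and a toy document, and the root element name "hansard", the Federation Chamber element name "fedchamb.xscript" and the "name" field are assumptions that the text does not confirm. Only chamber.xscript, debate, speech, talk.start, talker and talk.text are taken from the XPath expressions quoted in this section.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for one sitting day. The root name "hansard", the
# Federation Chamber name "fedchamb.xscript" and the "name" field are
# assumptions made for illustration only.
SAMPLE = """
<hansard>
  <chamber.xscript>
    <debate>
      <speech>
        <talk.start><talker><name>Smith, John</name></talker></talk.start>
        <talk.text>I rise to speak on this bill. Order! I will now continue.</talk.text>
      </speech>
    </debate>
  </chamber.xscript>
</hansard>
"""

# Path from a proceedings container down to the full text of each speech.
SPEECH_PATH = "//debate//speech/talk.text"


def speech_texts(root, container):
    """Return one string per speech under `container`, or [] if it is absent."""
    # Guard against a missing element, e.g. a day with no Federation Chamber.
    if root.find(container) is None:
        return []
    nodes = root.findall(container + SPEECH_PATH)
    return ["".join(node.itertext()).strip() for node in nodes]


root = ET.fromstring(SAMPLE.strip())
chamber = speech_texts(root, "chamber.xscript")
federation = speech_texts(root, "fedchamb.xscript")
```

In this toy case, chamber holds a single string in which the speech and the interruption are still mixed together, which is the splitting problem described above, and federation is an empty list because the file has no Federation Chamber node.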

We extracted the patterns of data that precede statement text from the talk.start nodes, and used them to split statements wherever one of these patterns was found. After separating the statements, we were able to remove these patterns from the body of text. We also used this method of extracting and later removing unwanted patterns for other pieces of data that did not belong to the debate proceedings, such as sub-debate titles.

Questions without notice and their answers are stored within the chamber.xscript child node, with separate sub-child nodes for questions and for answers to differentiate the two. Questions in writing, however, are embedded in their own child node at the end of the XML file.

The way we parsed speeches in all four scripts meant that all questions-without-notice content was already parsed in order. For the first two scripts, questions and answers were already separated onto their own rows. For the third and fourth scripts, just as we did with the rest of the speech content, we used the patterns of data preceding the text to separate questions and answers. Finally, since questions in writing exist in their own child node, we were able to use the same parsing method for all scripts, which was to extract all question and answer elements from that child node.

In the third and fourth scripts, we used the patterns of data extracted from talk.start nodes to separate speeches. As evident in Fig. 3, talk.start nodes are nested within interjection nodes, meaning that the patterns of data from interjection statements were separated out in the process. This meant that we did not need to create lists of names and titles for which to search in the text as we did before. However, we used the same list of general interjection statements on which to separate as was used in the first two scripts. We then did an additional check for statements that may not have been separated because of how they were embedded in the XML, and separated those out where needed. In particular, while most statements were categorized in their own child node and hence captured through pattern-based separation, some were not individually categorized and had to be split manually in this step.

The talk.start nodes contain important data on the MP making each statement. As such, we could extract the data associated with each pattern by parsing one element inward, using the XPath expression “talk.start/talker”. We created a pattern lookup table with these data, and merged it with the main Hansard dataframe by the first pattern detected in each statement. Figure 6 provides an example of that lookup table. This approach enabled us to fill in missing data on each MP speaking using data extracted directly from the XML. Finally, we used the AustralianPoliticians dataset to fill in other missing data, and flagged interjections in the same manner as before.

Divisions are recorded within the Chamber content in their own division nodes, which contain the voting data and division result. Since we focus primarily on the spoken Hansard content, our parsing scripts do not necessarily capture all divisions data from House proceedings. Our approach to parsing Hansard in the third and fourth scripts, described in the Script Differences section, naturally allowed much of the divisions data to be added to our resulting files for 1998 to March 2011; the parsing scripts used for May 2011 to September 2022 Hansard did not. To supplement our database and to fill this divisions data gap, we created an additional file containing all divisions data nested under the XPath “//chamber.xscript//division” from the Hansard files in our time frame. To produce this data file, for each Hansard XML we parsed the child nodes of each division node where they existed, extracted any timestamps where available, and did any additional data cleaning as necessary. We used a series of if-else statements in this script to account for variation in the structure of the division node over time. Finally, we added a date variable to distinguish between sitting days.

As a validation test, we checked that the date in each file name matches the date recorded in the session header of the corresponding XML file. Every file passed this test except one: we detected a discrepancy in the XML file from 03 June 2009, whose session header contained the wrong date. We confirmed that the date in our file name was correct by checking the official PDF release from that sitting day.
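
A check of this kind, comparing the date in each file name with the date in the session header, can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the authors' test: the file naming scheme (YYYY-MM-DD.xml), the element path "session.header/date", the ISO date format and the root element name "hansard" are all hypothetical, and the two toy files are invented for the example.

```python
import xml.etree.ElementTree as ET
from datetime import date
from pathlib import Path


def header_date(xml_text):
    """Read the sitting date from the (assumed) session.header/date node."""
    root = ET.fromstring(xml_text)
    raw = root.findtext("session.header/date")
    return date.fromisoformat(raw.strip()) if raw else None


def date_mismatches(files):
    """Return the names of files whose header date differs from the name date."""
    bad = []
    for name, xml_text in files.items():
        # Assumed naming scheme: the file stem is the sitting date.
        expected = date.fromisoformat(Path(name).stem)
        if header_date(xml_text) != expected:
            bad.append(name)
    return sorted(bad)


def make_file(day):
    """Build a toy XML document whose session header carries `day`."""
    return (
        "<hansard><session.header><date>"
        + day
        + "</date></session.header></hansard>"
    )


# Invented example data: the second header date is deliberately wrong.
files = {
    "2020-02-04.xml": make_file("2020-02-04"),
    "2020-02-05.xml": make_file("2020-02-06"),
}
mismatched = date_mismatches(files)
```

Any file name returned in mismatched would then be checked by hand against an independent source, as was done with the official PDF release for the one discrepancy reported above.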